# HW 9 SP22

# Natural Language Processing and Recommender systems

In [61]:
# Importing required libraries
import pandas as pd
import numpy as np
import re
import string

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('movie_reviews')

from nltk.corpus import stopwords
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk import NaiveBayesClassifier
!pip install rake-nltk
from rake_nltk import Rake

from tqdm import tqdm
from gensim.models.word2vec import Word2Vec


from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

import ast
from ast import literal_eval

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

In [62]:
# Mounting the Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Explain natural language processing in your own words

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written -- referred to as natural language.<br> It is a component of artificial intelligence (AI).

NLP enables computers to understand natural language as humans do.<br> Whether the language is spoken or written, natural language processing uses artificial intelligence to take real-world input, process it, and make sense of it in a way a computer can understand.<br> Just as humans have different sensors -- such as ears to hear and eyes to see -- computers have programs to read and microphones to collect audio.<br> And just as humans have a brain to process that input, computers have a program to process their respective inputs.<br> At some point in processing, the input is converted to code that the computer can understand.<br>
<br>
Some NLP-based solutions include:

* Translation
* Speech recognition
* Sentiment Analysis
* Chatbots
* Question-Answer Systems
* Text summarization
* Market Intelligence
* Text classification
* Grammar checking

## 2. Discuss word embedding, lemmatization, and stemming

### Word Embedding
Word embedding is one of the most widely used techniques in natural language processing (NLP). <br>
Word embeddings represent words, and by extension whole sentences, as numerical vectors.<br> They are a form of word representation that bridges the human understanding of language and that of a machine.<br> They are learned representations of text in an n-dimensional space in which words with similar meanings have similar representations.<br> In other words, two similar words are represented by nearly identical vectors that lie close together in the vector space.
<br>
<div>
<img src="https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/one-hot-word-embedding-vectors.png" width=500>
</div><br>

[Source](https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/)
<br>Thus, when using word embeddings, each individual word is represented as a real-valued vector in a predefined vector space.<br> Each word is mapped to one vector, and the vector values are learned in a way that resembles training a neural network.
<br>
**Word2Vec is one of the most popular techniques for learning word embeddings.**
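
To make the idea of "similar words have nearby vectors" concrete, here is a minimal sketch. The three-dimensional vectors are hypothetical and hand-picked purely for illustration; real embeddings are learned from a corpus and usually have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real embeddings (e.g. from Word2Vec) are learned from text, not written by hand.
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_king_queen = cosine(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine(toy_embeddings["king"], toy_embeddings["apple"])

# The related pair should score much higher than the unrelated pair
print(round(sim_king_queen, 3), round(sim_king_apple, 3))
```

With these made-up vectors, the pair "king"/"queen" scores close to 1, while "king"/"apple" scores much lower.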

### Lemmatization
Lemmatization reduces inflected words to their root form while ensuring that the root word belongs to the language.<br> In lemmatization, the root word is called the lemma.<br> A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

For example, runs, running, and ran are all forms of the word run, so run is the lemma of all these words.<br> Because lemmatization returns an actual word of the language, it is used where valid words are required.

Lemmatization lets a word like “studies” undergo a morphological analysis based on a dictionary that the algorithm can consult to produce the correct root word.<br> As such, a lemmatizer would know that “studies” is the third-person singular present-tense form of the verb “study”.

### Stemming
"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."

A stem (root) is the part of the word to which inflectional or derivational affixes such as -ed, -ize, -s, de-, and mis- are added. <br>So stemming a word or sentence may result in words that are not actual words.<br> Stems are created by removing the suffixes or prefixes used with a word.

In stemming, a computer algorithm cuts off the ending or beginning of the word being analyzed.<br> The cut takes out prefixes and suffixes, which can lead to errors. Take the word “studies” as an example.<br> A stemming algorithm would drop the suffix “es,” arriving at the stem “studi,” which is not a valid English word.

<div>
<img src = 'https://miro.medium.com/max/1400/1*ES5bt7IoInIq2YioQp2zcQ.png' width=700>
</div>

[Source](https://medium.com/geekculture/introduction-to-stemming-and-lemmatization-nlp-3b7617d84e65)
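
The contrast between the two approaches can be sketched in a few lines. This is a toy illustration only: `toy_stem` is a hypothetical suffix stripper (not the Porter algorithm), and `LEMMAS` is a hypothetical four-entry dictionary standing in for the full lexicon a real lemmatizer would consult.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "es", "ed", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary: a real lemmatizer consults a full lexicon instead.
LEMMAS = {"studies": "study", "running": "run", "ran": "run", "runs": "run"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "ran"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer blindly chops characters, so it can produce non-words such as "studi", while the dictionary lookup always returns a valid word but only for entries it knows about.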

## 3. What is TF-IDF?

TF-IDF means **term frequency-inverse document frequency**.<br>
It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.<br>
This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).<br>
TF-IDF was invented for document search and information retrieval.<br> The score increases proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word.<br> So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.<br><br>
TF-IDF can be broken down into two parts:
* TF (term frequency) and 
* IDF (inverse document frequency).

TF (term frequency): Term frequency measures how often a particular term occurs in a document, usually relative to the length of that document.

IDF (inverse document frequency): Inverse document frequency measures how common (or uncommon) a word is across the corpus; the rarer the word, the higher its IDF.

<div>
<img src='https://miro.medium.com/max/1200/1*qQgnyPLDIkUmeZKN2_ZWbQ.png' width=600>
</div>

[Source](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)
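
The two parts can be computed by hand on a tiny corpus. The sketch below uses one common textbook variant (relative term frequency and an unsmoothed natural-log IDF); this choice is an assumption, and libraries such as scikit-learn use smoothed variants that give different numbers. The three example documents are made up.

```python
import math
from collections import Counter

# Made-up toy corpus: each document is a list of tokens
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency (unsmoothed); assumes the term occurs somewhere."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", docs[0], docs)  # "the" appears in every document
score_mat = tf_idf("mat", docs[0], docs)  # "mat" appears in only one document
print(score_the, score_mat)
```

Because "the" occurs in every document, its IDF is log(3/3) = 0 and its score vanishes, whereas the rarer word "mat" receives a positive score.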

## 4. What do you mean by recommender systems?

Recommender systems are an advantageous alternative to search algorithms.<br> They help users discover items they might not have found otherwise and offer products personalized to each user's taste.<br> For this reason, large platforms typically rely on a recommendation algorithm to make the user’s shopping more enjoyable by automating the search process, offering personalized items, and saving time.<br><br>
There are 2 major types of recommender systems: 
* Collaborative  
  Collaborative methods are based solely on the past interactions recorded between users and items in order to produce new recommendations.
* Content based 
  Unlike collaborative methods, which rely only on user-item interactions, content-based approaches use additional information about users and/or items.

## 5. Compare and Contrast content based vs collaborative recommender systems.

Below are the differences between Content-based and Collaborative recommender systems:

CONTENT-BASED  | COLLABORATIVE 
-------------------|------------------
In content-based filtering, we use the properties of the objects<br> to link similar ones and show them.       | In collaborative filtering, we use data about which items were linked together<br> by an outside entity (e.g. bought together by an online shopper)<br> and show them in an ordered list.
A content-based recommendation engine emphasizes the content features.     | A collaborative recommendation engine emphasizes user preferences.
In content-based filtering, a recommendation system uses<br> content profiles, which include the content features.     | In collaborative filtering, a recommendation engine requires user profiles to suggest relevant content.
Content-based recommendation systems are oriented toward product features<br> and hence largely avoid the cold start problem for new items.    |  Collaborative recommendation systems feed on user ratings, reviews, thumbs up & down,<br> and other feedback on various products or services.<br> So, products with no ratings or feedback can't be recommended to any user.<br> Likewise, a new user who hasn't given any reviews or ratings can't get recommendations from a collaborative recommendation engine.<br>**This is called the cold start problem.**
A content-based recommendation engine can provide more accurate<br> recommendations, as it focuses on the features of the content the user likes.       | A collaborative recommendation engine doesn't always ensure precise recommendations<br> because users with similar taste may not like the same products.
Example of content-based filtering: we show all the books that have the<br> same author, same publisher, same genre, and the most similar number<br> of pages as book A.      | Example of collaborative filtering: we analyse which books have been read by people who have read book A,<br> and the ones with the highest count come at the top of the list.
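
The book example in the last row of the table can be sketched as follows. All data here (`item_genres`, `user_reads`) is hypothetical and invented for illustration; a real system would use much richer features and interaction histories.

```python
# Hypothetical toy data, invented for illustration.
item_genres = {
    "Book A": {"fantasy", "adventure"},
    "Book B": {"fantasy", "romance"},
    "Book C": {"cooking"},
}
user_reads = {
    "u1": {"Book A", "Book C"},
    "u2": {"Book A", "Book C"},
    "u3": {"Book A", "Book B"},
}

def content_based(target):
    """Rank other items by the number of genres they share with the target."""
    scores = {item: len(item_genres[target] & genres)
              for item, genres in item_genres.items() if item != target}
    return sorted(scores, key=scores.get, reverse=True)

def collaborative(target):
    """Rank other items by how many readers of the target also read them."""
    counts = {}
    for items in user_reads.values():
        if target in items:
            for other in items - {target}:
                counts[other] = counts.get(other, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

print(content_based("Book A"))   # driven by item properties
print(collaborative("Book A"))   # driven by user behaviour
```

On this toy data the two approaches disagree: the content-based ranking prefers Book B because it shares a genre with Book A, while the collaborative ranking prefers Book C because more readers of Book A also read it.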

## 6. Discuss any 3 similarity metrics.

A similarity metric is a real-valued function that quantifies the similarity between two objects. <br>
Similarity is the inverse concept of distance.<br> A similarity matrix collects the pairwise similarities of a set of objects: the more similar two objects are, the greater the value of the measure.<br>

For example, a correlation matrix can often be regarded as a similarity matrix of variables, because it is natural to consider pairs of variables with higher correlation coefficients as more similar to each other than pairs with lower correlation coefficients.<br>

Below are three similarity metrics:

1. Cosine Similarity: <br>
  Cosine similarity is a metric used to measure how similar documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.<br>
  Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.<br>
  Here, the two vectors are arrays containing the word counts of two documents.<br>
  The equation for the cosine similarity between two non-zero vectors is:
  <div>
  <img src='https://www.tyrrell4innovation.ca/wp-content/uploads/2021/06/rsz_jenny_du_miword.png' width=300>
  </div>
2. Jaccard Similarity: <br>
  The Jaccard Similarity Index is a measure of the similarity between two sets of data.<br>
  It compares the members of two sets to see which members are shared and which are distinct: it is the size of the intersection divided by the size of the union. The index ranges from 0 (0%) to 1 (100%); the higher the value, the more similar the two sets.<br> If two sets share exactly the same members, their Jaccard Similarity Index is 1. Conversely, if they have no members in common, it is 0.
  <div>
  <img src='https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg' width=400>
  </div>
3. Euclidean Distance: <br>
  The Euclidean distance between two points in Euclidean space is the length of the line segment between the two points.<br> It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is therefore occasionally called the Pythagorean distance.<br>
  It is the square root of the sum of squared differences between corresponding elements of the two vectors. Because it is a distance rather than a similarity, smaller values indicate more similar objects.<br>
  The formula for the pairwise Euclidean distance is given below:
  <div>
  <img src = 'https://miro.medium.com/max/1400/1*9LeaMTcOXxeTPN-VCbKloQ.png' width=300>
  </div>
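
The three measures above can be written directly from their definitions. The following is a minimal pure-Python sketch; the function names and the example vectors are chosen here for illustration and are not part of any library.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (needs a non-empty union)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def euclidean_distance(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Word-count vectors of two made-up documents: doc2 repeats doc1 twice
doc1 = [1, 2, 0]
doc2 = [2, 4, 0]

print(cosine_sim(doc1, doc2))          # same direction, so close to 1
print(euclidean_distance(doc1, doc2))  # non-zero, because the lengths differ
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

The example shows why cosine similarity is described as size-independent: doubling a document's word counts leaves the angle unchanged but moves the vector away in Euclidean terms.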

## 7. What are sparse matrices and how do you create them in python?

**A sparse matrix is a matrix that is comprised of mostly zero values.**<br>
Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices.<br>
An example of a 4 × 4 sparse matrix is:
<div>
<img src='https://quescol.com/wp-content/uploads/2020/12/Sparse-Matrix.jpg' width=350>
</div>

[Source](https://quescol.com/data-structure/sparse-matrix-in-data-structure)

The advantage of a sparse matrix is that it contains far fewer non-zero elements than zeros, so only the non-zero elements need to be stored, which uses less memory. Computations can also skip the zero elements, which makes them faster.<br><br>

Python’s SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix.<br>The printed sparse matrix representation lists the (row, column) tuple of each position where the matrix contains a non-zero value, along with that value.

Below is a code snippet that creates a sparse matrix:

```
import numpy as np
from scipy.sparse import csr_matrix

# create a 2-D representation of the matrix
A = np.array([[1, 0, 0, 0, 0, 0],
              [0, 0, 2, 0, 0, 1],
              [0, 0, 0, 2, 0, 0]])
print("Dense matrix representation: \n", A)

# convert to sparse matrix representation 
S = csr_matrix(A)
print("Sparse matrix: \n",S)

# convert back to 2-D representation of the matrix
B = S.todense()
print("Dense matrix: \n", B)
```

Output:
```
Dense matrix representation: 
 [[1 0 0 0 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
Sparse matrix: 
   (0, 0)	1
  (1, 2)	2
  (1, 5)	1
  (2, 3)	2
Dense matrix: 
 [[1 0 0 0 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
```
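
The (row, column) value listing in the output above can be reproduced without any library. The sketch below is a conceptual illustration of coordinate-style storage only; the helper names are hypothetical, and it is not how SciPy's `csr_matrix` stores data internally.

```python
def to_coordinate_list(dense):
    """Keep only the non-zero entries as (row, column, value) triplets."""
    return [(i, j, value)
            for i, row in enumerate(dense)
            for j, value in enumerate(row)
            if value != 0]

def to_dense(triplets, shape):
    """Rebuild the full matrix from the triplets and the original shape."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triplets:
        dense[i][j] = value
    return dense

A = [[1, 0, 0, 0, 0, 0],
     [0, 0, 2, 0, 0, 1],
     [0, 0, 0, 2, 0, 0]]

triplets = to_coordinate_list(A)
print(triplets)                      # 4 stored entries instead of 18 cells
print(to_dense(triplets, (3, 6)) == A)
```

Only the four non-zero entries are stored, which is where the memory saving of a sparse representation comes from.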

## 8. Perform negative and positive text classification on nltk movie recommendation dataset, explain each step performed.

In [63]:
# Getting the movie_reviews from nltk
from nltk.corpus import movie_reviews

In [64]:
# Looking at the list of all the words in 'movie_reviews'
print(movie_reviews.words())

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]


In [65]:
# Printing the total number of reviews
print (len(movie_reviews.fileids()))

2000


In [66]:
# Printing the total number of words in 'movie_reviews'
len(movie_reviews.words())

1583820

Total number of words is 1583820.

In [67]:
# Printing the target label
movie_reviews.categories()

['neg', 'pos']

In [68]:
# Looking at the number of positive reviews
print (len(movie_reviews.fileids('pos')))

1000


In [69]:
# Looking at the number of negative reviews
print (len(movie_reviews.fileids('neg')))

1000


In [70]:
# Displaying frequency of words in ‘movie_reviews’
nltk.FreqDist(movie_reviews.words())

FreqDist({'plot': 1513,
          ':': 3042,
          'two': 1911,
          'teen': 151,
          'couples': 27,
          'go': 1113,
          'to': 31937,
          'a': 38106,
          'church': 69,
          'party': 183,
          ',': 77717,
          'drink': 32,
          'and': 35576,
          'then': 1424,
          'drive': 105,
          '.': 65876,
          'they': 4825,
          'get': 1949,
          'into': 2623,
          'an': 5744,
          'accident': 104,
          'one': 5852,
          'of': 34123,
          'the': 76529,
          'guys': 268,
          'dies': 104,
          'but': 8634,
          'his': 9587,
          'girlfriend': 218,
          'continues': 88,
          'see': 1749,
          'him': 2633,
          'in': 21822,
          'her': 4522,
          'life': 1586,
          'has': 4719,
          'nightmares': 26,
          'what': 3322,
          "'": 30585,
          's': 18513,
          'deal': 219,
          '?': 3771,
          'wa

In [71]:
# Printing the file ID of the first positive review
positive_review_file = movie_reviews.fileids('pos')[0] 
print (positive_review_file)

pos/cv000_29590.txt


#### Creating the list of movie review documents
This list contains tuples, each pairing the words of one movie review with its category (pos or neg).

In [72]:
from random import shuffle

documents = []

for category in movie_reviews.categories():
	for fileid in movie_reviews.fileids(category):
		documents.append((movie_reviews.words(fileid), category))

# Printing the number of documents
print (len(documents))

# Printing first tuple
print (documents[0])

# Shuffling the document list so that positive and negative reviews are mixed
# before the list is later split into training and test sets
shuffle(documents)

2000
(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg')


#### Feature Extraction
To classify text into a category, we need to define some criteria.<br> On the basis of those criteria, our classifier learns that a particular kind of text falls into a particular category.<br> Such a criterion is known as a `feature`.<br> We can define one or more features to train our classifier.
<br>
In this example, we will use the `top-N words feature`.



##### Fetching all words from the movie reviews corpus


In [73]:
# We first fetch all the words from all the movie reviews and create a list.
all_words = [word.lower() for word in movie_reviews.words()]

# Printing first 10 words
print (all_words[:10])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']


#### Creating Frequency Distribution of all words
The frequency distribution counts the number of occurrences of each word in the entire list of words.

In [74]:
all_words_frequency = FreqDist(all_words)
print (all_words_frequency)

<FreqDist with 39768 samples and 1583820 outcomes>


In [75]:
# Printing 10 most frequently occurring words
print (all_words_frequency.most_common(10))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]


#### Removing Punctuation and Stopwords
From the above frequency distribution of words, we can see that the most frequently occurring words are either punctuation marks or stopwords.

Stop words are frequently occurring words that do not carry any significant meaning in text analysis. For example, I, me, my, the, a, and, is, are, he, she, we, etc.

Punctuation marks like the comma, full stop, inverted comma, etc. occur very often in any text data.

We will do `data cleaning` by removing stop words and punctuation.

##### Removing Stopwords

In [76]:
stopwords_english = stopwords.words('english')
print (stopwords_english)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [77]:
# Creating a new list of words by removing stopwords from all_words
all_words_without_stopwords = [word for word in all_words if word not in stopwords_english]

In [78]:
# Printing the first 10 words
print (all_words_without_stopwords[:10])

['plot', ':', 'two', 'teen', 'couples', 'go', 'church', 'party', ',', 'drink']


Here, after removing stopwords, the words `to` and `a` have been removed from the first 10 words of the result.

##### Removing Punctuation

In [79]:
print (string.punctuation)

# Creating a new list of words by removing punctuation from all_words
all_words_without_punctuation = [word for word in all_words if word not in string.punctuation]

# Printing the first 10 words
print (all_words_without_punctuation[:10])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink']


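Here, after removing punctuation from the list, single-character punctuation marks like the colon `:` and the comma `,` are removed. Because `word not in string.punctuation` is a substring test against the punctuation string, multi-character tokens such as `--` that do not occur as a substring of `string.punctuation` are kept.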

#### Removing both Stopwords & Punctuation
Finally, we remove all the stopwords and punctuation from the entire word list.

In [80]:
all_words_clean = []
for word in all_words:
	if word not in stopwords_english and word not in string.punctuation:
		all_words_clean.append(word)

print (all_words_clean[:10])

['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive', 'get']


#### Creating Frequency Distribution of cleaned words list
Below is the frequency distribution of the new list after removing stopwords and punctuation.

In [81]:
all_words_frequency = FreqDist(all_words_clean)
print (all_words_frequency)

# Printing 10 most frequently occurring words
print (all_words_frequency.most_common(10))

<FreqDist with 39586 samples and 710578 outcomes>
[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]


Previously, before removing stopwords and punctuation, the frequency distribution was:

`FreqDist with 39768 samples and 1583820 outcomes`

Now, the frequency distribution is:

`FreqDist with 39586 samples and 710578 outcomes`

This shows that removing stop words and punctuation eliminated 182 distinct tokens (39768 - 39586) and reduced the total number of word occurrences from 1583820 to 710578, which is about 45% of the original size.

The list of `most common words` now contains meaningful words. Before, the 10 most frequently occurring words were only stop words and punctuation.

#### Create Word Feature using 2000 most frequently occurring words

Here, we take the 2000 most frequently occurring words as our features.

In [82]:
print (len(all_words_frequency))

39586


In [83]:
# Getting the 2000 most frequently occurring words
most_common_words = all_words_frequency.most_common(2000)
print (most_common_words[:10])

[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]


In [84]:
print (most_common_words[1990:])

[('remain', 64), ('anna', 64), ('moved', 64), ('asking', 64), ('genuinely', 64), ('rain', 64), ('path', 64), ('aware', 64), ('causes', 64), ('international', 64)]


In [85]:
# The most common words list's elements are in the form of tuple
# Getting only the first element of each tuple of the word list
word_features = [item[0] for item in most_common_words]
print (word_features[:10])

['film', 'one', 'movie', 'like', 'even', 'good', 'time', 'story', 'would', 'much']


#### Creating Feature Set

Now, we write a function that will be used to create the feature set. The feature set is used to train the classifier.

In [86]:
# We define a feature extractor function that checks if the words in a given document are present in the word_features list or not.
def document_features(document):
	# "set" function will remove repeated/duplicate tokens in the given list
	document_words = set(document)
	features = {}
	for word in word_features:
		features['contains(%s)' % word] = (word in document_words)
	return features

In [87]:
# Getting the first negative movie review file
movie_review_file = movie_reviews.fileids('neg')[0] 
print (movie_review_file)

neg/cv000_29416.txt


In [88]:
print (document_features(movie_reviews.words(movie_review_file)))

{'contains(film)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(like)': True, 'contains(even)': True, 'contains(good)': True, 'contains(time)': False, 'contains(story)': False, 'contains(would)': True, 'contains(much)': False, 'contains(character)': True, 'contains(also)': True, 'contains(get)': True, 'contains(two)': True, 'contains(well)': True, 'contains(characters)': True, 'contains(first)': False, 'contains(--)': False, 'contains(see)': True, 'contains(way)': True, 'contains(make)': True, 'contains(life)': True, 'contains(really)': True, 'contains(films)': True, 'contains(plot)': True, 'contains(little)': True, 'contains(people)': True, 'contains(could)': False, 'contains(scene)': False, 'contains(man)': False, 'contains(bad)': True, 'contains(never)': False, 'contains(best)': False, 'contains(new)': True, 'contains(scenes)': True, 'contains(many)': False, 'contains(director)': True, 'contains(know)': True, 'contains(movies)': True, 'contains(action)': False, 'c

In [89]:
# Earlier, we created the `documents` list, which contains the data of all the movie reviews.
# Its elements are tuples with the word list as the first item and the review category as the second item.
# Printing the first tuple of the (now shuffled) documents list
print (documents[0])

(['i', 'have', 'nothing', 'against', 'unabashedly', ...], 'neg')


We now loop through the `documents` list and create a feature set list using the `document_features` function defined above.

* Each item of the `feature_set` list is a tuple.
* The first item of the tuple is the dictionary returned from the `document_features` function.
* The second item of the tuple is the category (pos or neg) of the movie review.

In [90]:
feature_set = [(document_features(doc), category) for (doc, category) in documents]
print (feature_set[0])

({'contains(film)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(like)': True, 'contains(even)': True, 'contains(good)': False, 'contains(time)': False, 'contains(story)': False, 'contains(would)': True, 'contains(much)': False, 'contains(character)': False, 'contains(also)': False, 'contains(get)': True, 'contains(two)': True, 'contains(well)': False, 'contains(characters)': True, 'contains(first)': True, 'contains(--)': False, 'contains(see)': False, 'contains(way)': False, 'contains(make)': True, 'contains(life)': True, 'contains(really)': False, 'contains(films)': True, 'contains(plot)': False, 'contains(little)': True, 'contains(people)': True, 'contains(could)': True, 'contains(scene)': False, 'contains(man)': True, 'contains(bad)': False, 'contains(never)': True, 'contains(best)': False, 'contains(new)': False, 'contains(scenes)': False, 'contains(many)': False, 'contains(director)': False, 'contains(know)': False, 'contains(movies)': False, 'contains(action)'

#### Creating Training and Test Sets
From the feature set we created above, we now create a separate training set and a separate testing/validation set. The training set is used to train the classifier, and the test set is used to check how accurately it classifies text it has not seen.


In [91]:
# Creating Train and Test Dataset
# Here, we take first 400 elements as test set and the rest as train set.
print (f'Total records: {len(feature_set)}')

test_set = feature_set[:400]
train_set = feature_set[400:]

print (f'Training set: {len(train_set)}') 
print (f'Testing set: {len(test_set)}')

Total records: 2000
Training set: 1600
Testing set: 400


#### Training a Classifier

In [92]:
# Here, we are using the Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

In [93]:
from nltk import classify 

# Evaluating on training set
train_accuracy = classify.accuracy(classifier, train_set)
print (train_accuracy)

0.865625


In [94]:
# Evaluating on testing set
test_accuracy = classify.accuracy(classifier, test_set)
print (test_accuracy)

0.8075


In [95]:
# Looking at the output of the classifier by providing some custom reviews
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)
print (classifier.classify(custom_review_set))


neg


**Here, the negative review is correctly classified as negative.**

In [96]:
# Looking at the Probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) 
print (prob_result.max()) 
print (prob_result.prob("neg")) 
print (prob_result.prob("pos")) 

<ProbDist with 2 samples>
neg
0.9999991042600123
8.957399688626409e-07


In [97]:
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = document_features(custom_review_tokens)

print (classifier.classify(custom_review_set))

neg


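Here, the positive review is incorrectly classified as negative.<br>
A likely contributing factor is that such a short review contains very few of the 2000 feature words, so almost every `contains(...)` feature is `False`. Note also that the tokens from `word_tokenize` are not lowercased, so a token such as `Best` does not match the lowercase feature `best`.<br>
We need to improve our feature set for more accurate prediction.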
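
One small adjustment worth trying, offered as an assumption rather than a confirmed fix: the feature words were built from lowercased corpus words, whereas the tokens of a custom review keep their original casing. The sketch below lowercases tokens before the lookup. `sample_word_features` is a hypothetical, tiny stand-in for the 2000-word `word_features` list, and `document_features_lowercased` is a hypothetical variant of `document_features`.

```python
# Hypothetical, tiny stand-in for the 2000-word `word_features` list.
sample_word_features = ["best", "good", "bad", "wonderful"]

def document_features_lowercased(document, word_features):
    """Like document_features, but lowercases each token before the lookup."""
    document_words = {token.lower() for token in document}
    return {f"contains({word})": (word in document_words)
            for word in word_features}

# Tokens as a tokenizer might return them, with original casing preserved
tokens = ["Best", "direction", ",", "good", "acting", "."]
features = document_features_lowercased(tokens, sample_word_features)
print(features)
```

Whether this changes the prediction for the review above has not been verified here; it only ensures that capitalized tokens such as `Best` are matched against the lowercase feature words.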


In [98]:
# Looking at result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) 
print (prob_result.max())
print (prob_result.prob("neg"))
print (prob_result.prob("pos"))

<ProbDist with 2 samples>
neg
0.999980531191118
1.9468808872496097e-05


#### Looking at the most informative features among the entire features in the feature set.

In [99]:
# Showing the 10 most informative features
# (show_most_informative_features prints its table itself and returns None, so no print() is needed)
classifier.show_most_informative_features(10)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     12.0 : 1.0
          contains(anna) = True              pos : neg    =      9.6 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.9 : 1.0
        contains(seagal) = True              neg : pos    =      6.2 : 1.0
         contains(waste) = True              neg : pos    =      6.1 : 1.0
     contains(pointless) = True              neg : pos    =      5.9 : 1.0
        contains(wasted) = True              neg : pos    =      5.4 : 1.0
          contains(lame) = True              neg : pos    =      5.2 : 1.0
         contains(damon) = True              pos : neg    =      5.2 : 1.0
         contains(awful) = True              neg : pos    =      5.1 : 1.0


The result shows that the word `outstanding` appears in positive reviews about 12.0 times as often as in negative reviews, while the word `waste` appears in negative reviews about 6.1 times as often as in positive reviews. The same reading applies to the other words. These ratios are called `likelihood ratios`.

Therefore, a review has a high chance of being classified as positive if it contains words like `outstanding` and `wonderfully`. Similarly, a review has a high chance of being classified as negative if it contains words like `waste`, `pointless`, `awful`, etc.
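
To show where such a ratio comes from, here is a minimal sketch that computes a likelihood ratio from document counts. The counts are hypothetical and not taken from the corpus, and the add-0.5 smoothing is an assumption meant to resemble expected likelihood estimation; the exact estimator NLTK applies may differ.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature is True | label).

    `bins` is the number of values the feature can take (True or False).
    """
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under label A than label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts, invented for illustration:
# the word occurs in 60 of 800 positive and in 5 of 800 negative training reviews
ratio = likelihood_ratio(count_a=60, total_a=800, count_b=5, total_b=800)
print(round(ratio, 1))
```

A word that is frequent in one class and rare in the other yields a large ratio, which is why it ranks high in the list of most informative features.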

## 9. Perform content-based movie recommendation on the given dataset and explain each step in detail.

In [100]:
# Loading the movie dataset

# Using converter for 'genres' field to get the data in list format
movies_df = pd.read_csv('/content/drive/My Drive/movies_metadata.csv', converters={'genres': literal_eval})
movies_df.head()



Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [101]:
# Looking at the shape of the dataset
movies_df.shape

(45466, 24)

In [102]:
# Checking the information about each column
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [103]:
# Checking the percentage null for each column
round(100*(movies_df.isnull().sum()/len(movies_df.index)), 2)

adult                     0.00
belongs_to_collection    90.12
budget                    0.00
genres                    0.00
homepage                 82.88
id                        0.00
imdb_id                   0.04
original_language         0.02
original_title            0.00
overview                  2.10
popularity                0.01
poster_path               0.85
production_companies      0.01
production_countries      0.01
release_date              0.19
revenue                   0.01
runtime                   0.58
spoken_languages          0.01
status                    0.19
tagline                  55.10
title                     0.01
video                     0.01
vote_average              0.01
vote_count                0.01
dtype: float64

In [104]:
# Converting the budget column from object to float
movies_df['budget'] = pd.to_numeric(movies_df['budget'], errors = 'coerce')
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45463 non-null  float64
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [105]:
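# Now, creating a budget-minus-revenue column, stored as 'profit'
# Note: budget - revenue is positive when a movie loses money, so large values mean large losses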
movies_df['profit'] = movies_df['budget'] - movies_df['revenue']

In [106]:
# Sorting by 'profit' (budget - revenue) in descending order;
# the top rows are the movies whose budget most exceeded their revenue
movies_df = movies_df.sort_values('profit', ascending = False)
movies_df.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,profit
21175,False,,255000000.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://disney.go.com/the-lone-ranger/,57201,tt1210819,en,The Lone Ranger,The Texas Rangers chase down a gang of outlaws...,...,89289910.0,149.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Never Take Off the Mask,The Lone Ranger,False,5.9,2361.0,165710090.0
14823,False,,150000000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",http://www.thewolfmanmovie.com/,7978,tt0780653,en,The Wolfman,"Lawrence Talbot, an American man on a visit to...",...,0.0,102.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,When the moon is full the legend comes to life,The Wolfman,False,5.5,562.0,150000000.0
32849,False,"{'id': 34055, 'name': 'Pokémon Collection', 'p...",150000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",http://www.pokemon-movie.jp/,350499,tt4503906,ja,ポケモン・ザ・ムービーXY 光輪の超魔神 フーパ,"Ash, Pikachu, and their friends come to a dese...",...,0.0,73.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,A Power Unbound. A Battle of Legends.,Pokémon the Movie: Hoopa and the Clash of Ages,False,6.2,39.0,150000000.0
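
Note that `budget - revenue` is positive when a movie loses money, so the rows above are the largest losses in the data rather than the largest profits. A minimal sketch of the conventional calculation, `revenue - budget`, on a toy dataframe (the titles and figures are hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'budget': [100.0, 50.0, 20.0],
    'revenue': [40.0, 200.0, 20.0],
})

# Profit is what the movie earned minus what it cost
toy_df['profit'] = toy_df['revenue'] - toy_df['budget']

# Most profitable movie first
toy_df = toy_df.sort_values('profit', ascending=False)
print(toy_df[['title', 'profit']])
```

The sort order only affects which rows come first in the dataframe. Since a later step keeps the first 2000 rows, it also decides which movies the recommender sees.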


### For the recommender system, we are not going to use all the columns available in the dataset.<br>We will use only the columns below:
* title
* genres
* original_language
* overview
* production_countries
* release_date


In [107]:
# Creating a dataframe for our selected columns
df = movies_df[['title', 'genres','original_language','overview','production_countries', 'release_date']]
# df = movies_df[['title', 'genres','original_language']]
df.head()

Unnamed: 0,title,genres,original_language,overview,production_countries,release_date
21175,The Lone Ranger,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,The Texas Rangers chase down a gang of outlaws...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2013-07-03
14823,The Wolfman,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",en,"Lawrence Talbot, an American man on a visit to...","[{'iso_3166_1': 'US', 'name': 'United States o...",2010-02-11
32849,Pokémon the Movie: Hoopa and the Clash of Ages,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",ja,"Ash, Pikachu, and their friends come to a dese...","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",2015-07-18
43190,Band of Brothers,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",en,Drawn from interviews with survivors of Easy C...,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2001-09-09
27656,The Pacific,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,"A 10-part mini-series from the creators of ""Ba...","[{'iso_3166_1': 'US', 'name': 'United States o...",2010-03-15


In [108]:
# Looking at the shape
df.shape

(45466, 6)

### Pre-processing steps

In [109]:
# Dropping all the rows with null value
df = df.dropna()

In [110]:
df.shape

(44425, 6)

In total, we dropped 1041 records that contained NaN values.

In [111]:
# Now, if a movie 'title' is repeated more than once, we are going to drop the rows except the first one
df = df.drop_duplicates(subset=['title'], keep='first')
df.shape

(41293, 6)

In [112]:
# Resetting the index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,title,genres,original_language,overview,production_countries,release_date
0,The Lone Ranger,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,The Texas Rangers chase down a gang of outlaws...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2013-07-03
1,The Wolfman,"[{'id': 18, 'name': 'Drama'}, {'id': 27, 'name...",en,"Lawrence Talbot, an American man on a visit to...","[{'iso_3166_1': 'US', 'name': 'United States o...",2010-02-11
2,Pokémon the Movie: Hoopa and the Clash of Ages,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",ja,"Ash, Pikachu, and their friends come to a dese...","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",2015-07-18
3,Band of Brothers,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",en,Drawn from interviews with survivors of Easy C...,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2001-09-09
4,The Pacific,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,"A 10-part mini-series from the creators of ""Ba...","[{'iso_3166_1': 'US', 'name': 'United States o...",2010-03-15


In [113]:
''' Pre-processing for 'genres'
Each 'genres' value is a list of dictionaries, one per genre.
In the step below, we collect the lower-cased genre names of each movie in a list.
'''
df['genres'] = df['genres'].apply(lambda genres: [genre['name'].lower() for genre in genres])

# By the pre-processing step, we obtained the list of genres associated with the movie
df['genres'].head(5)

0                [action, adventure, western]
1                   [drama, horror, thriller]
2              [adventure, animation, action]
3                        [action, drama, war]
4    [action, adventure, drama, history, war]
Name: genres, dtype: object

In [114]:
''' Pre-processing for 'production_countries'
Like 'genres', 'production_countries' holds a list of dictionaries, each with a country code and the full name of the country.
In the steps below, we collect the lower-cased country codes of each movie in a list.
'''

# Parsing the string representation of the field into Python objects
df['production_countries'] = df['production_countries'].apply(ast.literal_eval)

# Collecting the indexes of rows whose production_countries value is not a list of dicts
index_with_float_data = []

for index, row in df.iterrows():
  try:
    row['production_countries'] = [li['iso_3166_1'].lower() for li in row['production_countries']]
    df.at[index, 'production_countries'] = row['production_countries']
  except (TypeError, KeyError):
    index_with_float_data.append(index)

# By the pre-processing step, we obtained the list of production_countries associated with the movie
df['production_countries'].head(5)

0        [us]
1        [us]
2        [jp]
3    [gb, us]
4        [us]
Name: production_countries, dtype: object
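
The raw field is a string that looks like a Python list of dictionaries, which is why it has to be parsed before the codes can be read. Here is a small sketch of that parsing step on its own. The `extract_country_codes` helper is hypothetical, and returning an empty list for unparseable values is an assumption of this sketch (the cell above records the row index instead).

```python
import ast

def extract_country_codes(raw_value):
    """Parse a stringified list of dicts and return lower-cased ISO codes."""
    try:
        parsed = ast.literal_eval(raw_value)
        return [entry['iso_3166_1'].lower() for entry in parsed]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed strings and values that are not lists of dicts end up here
        return []

sample = ("[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, "
          "{'iso_3166_1': 'US', 'name': 'United States of America'}]")
print(extract_country_codes(sample))
print(extract_country_codes("6.0"))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.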

In [115]:
# Extracting only the year from the 'release_date' field into a new column 'year'
df['year'] = pd.DatetimeIndex(df['release_date']).year

# And dropping the 'release_date' column
df.drop(columns = ['release_date'], inplace = True)

In [116]:
df.head()

Unnamed: 0,title,genres,original_language,overview,production_countries,year
0,The Lone Ranger,"[action, adventure, western]",en,The Texas Rangers chase down a gang of outlaws...,[us],2013
1,The Wolfman,"[drama, horror, thriller]",en,"Lawrence Talbot, an American man on a visit to...",[us],2010
2,Pokémon the Movie: Hoopa and the Clash of Ages,"[adventure, animation, action]",ja,"Ash, Pikachu, and their friends come to a dese...",[jp],2015
3,Band of Brothers,"[action, drama, war]",en,Drawn from interviews with survivors of Easy C...,"[gb, us]",2001
4,The Pacific,"[action, adventure, drama, history, war]",en,"A 10-part mini-series from the creators of ""Ba...",[us],2010


In [117]:
# Converting the 'original_language' to lower case
df['original_language'] = df['original_language'].apply(lambda x: x.lower())

In [118]:
df.head()

Unnamed: 0,title,genres,original_language,overview,production_countries,year
0,The Lone Ranger,"[action, adventure, western]",en,The Texas Rangers chase down a gang of outlaws...,[us],2013
1,The Wolfman,"[drama, horror, thriller]",en,"Lawrence Talbot, an American man on a visit to...",[us],2010
2,Pokémon the Movie: Hoopa and the Clash of Ages,"[adventure, animation, action]",ja,"Ash, Pikachu, and their friends come to a dese...",[jp],2015
3,Band of Brothers,"[action, drama, war]",en,Drawn from interviews with survivors of Easy C...,"[gb, us]",2001
4,The Pacific,"[action, adventure, drama, history, war]",en,"A 10-part mini-series from the creators of ""Ba...",[us],2010


### Extracting the key words from the 'overview' column

In [119]:
# Initializing a new column as 'Key_words'
df['Key_words'] = ""

for index, row in df.iterrows():

    overview = row['overview'].lower()

    # instantiating Rake; by default it uses English stopwords from NLTK
    # and discards all punctuation characters
    r = Rake()

    # extracting the key words by passing the text
    r.extract_keywords_from_text(overview)
    # getting the dictionary with key words and their degree scores
    key_words_dict_scores = r.get_word_degrees()
    # assigning the first 10 key words to the new column
    df.at[index, 'Key_words'] = list(key_words_dict_scores.keys())[:10]

# dropping the overview column
df.drop(columns = ['overview'], inplace = True)
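
To show roughly what the keyword step does, here is a simplified pure-Python sketch of the idea behind RAKE word degrees. It is not rake-nltk's implementation: the stopword list is a tiny hypothetical one, and the helper names `candidate_phrases` and `word_degrees` are made up for this example. Text is split into runs of non-stopwords, and each word's degree counts how many words it co-occurs with inside its phrases, itself included.

```python
import re
from collections import defaultdict

# A tiny hypothetical stopword list; rake-nltk uses NLTK's full English list
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'to', 'by', 'down', 'is'}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at stopwords and punctuation."""
    phrases = []
    # Punctuation ends a phrase, so split on it first
    for chunk in re.split(r"[^\w\s]", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def word_degrees(phrases):
    """Each occurrence of a word adds the length of the phrase it appears in."""
    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

text = "The Texas Rangers chase down a gang of outlaws."
phrases = candidate_phrases(text)
degrees = word_degrees(phrases)
print(phrases)
print(list(degrees.keys())[:10])
```

Because the dictionary keeps insertion order, slicing its keys returns the first words encountered in the text, not the highest-scoring ones.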

Because of memory constraints, we have kept only the first 10 keywords of each overview.<br>
Also, in the statement below, we work with only 2000 records instead of using all of them.

In [120]:
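# Taking an explicit copy so that later column assignments modify df itself rather than a slice
df = df.head(2000).copy()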
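
The notebook does not say which step runs out of memory. One likely candidate is the pairwise similarity matrix built later, which has one entry per pair of movies. A back-of-the-envelope sketch, assuming a dense matrix of 8-byte float64 values (the `dense_matrix_bytes` helper is hypothetical):

```python
def dense_matrix_bytes(n_rows, bytes_per_value=8):
    """Size of a dense n x n matrix, assuming 8-byte float64 entries."""
    return n_rows * n_rows * bytes_per_value

for n_movies in (41293, 2000):
    size_gb = dense_matrix_bytes(n_movies) / 1e9
    print(f"{n_movies} movies -> {size_gb:.2f} GB")
```

Under that assumption, all 41293 movies would need about 13.64 GB for the matrix alone, while 2000 movies need about 0.03 GB.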

### Now, setting the movie title column as the index

In [121]:
df.set_index('title', inplace = True)
df.head()

Unnamed: 0_level_0,genres,original_language,production_countries,year,Key_words
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
The Lone Ranger,"[action, adventure, western]",en,[us],2013,"[texas, rangers, chase, gang, outlaws, led, bu..."
The Wolfman,"[drama, horror, thriller]",en,[us],2010,"[lawrence, talbot, american, man, visit, victo..."
Pokémon the Movie: Hoopa and the Clash of Ages,"[adventure, animation, action]",ja,[jp],2015,"[ash, pikachu, friends, come, desert, city, se..."
Band of Brothers,"[action, drama, war]",en,"[gb, us]",2001,"[drawn, interviews, survivors, easy, company, ..."
The Pacific,"[action, adventure, drama, history, war]",en,[us],2010,"[10, part, mini, series, creators, band, broth..."


### The dataframe is now ready for vectorization

In [122]:
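# Combining every feature column into one space-separated string per movie
df['bag_of_words'] = ''
columns = df.columns
for index, row in df.iterrows():
    words = ""
    for col in columns:
        if col == 'original_language':
            # plain string column
            words = words + row[col] + ' '
        elif col == 'year':
            # numeric column, converted to text
            words = words + str(row[col]) + ' '
        else:
            # list-valued columns: genres, production_countries, Key_words
            words = words + ' '.join(row[col]) + ' '
    df.at[index, 'bag_of_words'] = words

# Keeping only the combined 'bag_of_words' column
df.drop(columns = [col for col in df.columns if col != 'bag_of_words'], inplace = True)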



In [131]:
df.head()

Unnamed: 0_level_0,bag_of_words
title,Unnamed: 1_level_1
The Lone Ranger,action adventure western en us 2013 texas rang...
The Wolfman,drama horror thriller en us 2010 lawrence talb...
Pokémon the Movie: Hoopa and the Clash of Ages,adventure animation action ja jp 2015 ash pika...
Band of Brothers,action drama war en gb us 2001 drawn interview...
The Pacific,action adventure drama history war en us 2010 ...


In [124]:
# Instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# creating a Series of movie titles so that each title is tied to its integer position,
# which is used later to look up rows of the similarity matrix
indices = pd.Series(df.index)
indices[:5]

0                                   The Lone Ranger
1                                       The Wolfman
2    Pokémon the Movie: Hoopa and the Clash of Ages
3                                  Band of Brothers
4                                       The Pacific
Name: title, dtype: object

In [125]:
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.125     , 0.125     , ..., 0.13867505, 0.05735393,
        0.06681531],
       [0.125     , 1.        , 0.        , ..., 0.2773501 , 0.11470787,
        0.13363062],
       [0.125     , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.13867505, 0.2773501 , 0.        , ..., 1.        , 0.06362848,
        0.07412493],
       [0.05735393, 0.11470787, 0.        , ..., 0.06362848, 1.        ,
        0.12262787],
       [0.06681531, 0.13363062, 0.        , ..., 0.07412493, 0.12262787,
        1.        ]])
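
As a worked example of what each entry of this matrix means, here is a minimal standard-library sketch of cosine similarity between two word-count vectors. The two strings are hypothetical, shortened bags of words, and `cosine_from_text` is a made-up helper. Unlike `CountVectorizer`, it simply splits on whitespace.

```python
import math
from collections import Counter

def cosine_from_text(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Dot product over the words the two texts share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical, shortened bags of words
movie_a = "action adventure western en us 2013"
movie_b = "drama horror thriller en us 2010"
print(round(cosine_from_text(movie_a, movie_b), 3))
```

The two texts share 2 of their 6 words each, so the similarity is 2 / (√6 · √6) = 1/3. Identical texts score 1, and texts with no words in common score 0.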

In [126]:
# function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
        
    return recommended_movies
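
The function above skips the first entry of the sorted scores on the assumption that it is the queried movie itself. That holds as long as no other movie also has a similarity of exactly 1.0. A small alternative sketch that excludes the queried movie by position instead (the `top_k_similar` helper and the toy data are hypothetical):

```python
def top_k_similar(titles, similarity_matrix, title, k=10):
    """Return the k titles most similar to `title`, never including `title` itself."""
    idx = titles.index(title)  # raises ValueError if the title is unknown
    scores = similarity_matrix[idx]
    # Rank every other movie by its score, highest first
    candidates = [i for i in range(len(titles)) if i != idx]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in candidates[:k]]

toy_titles = ['A', 'B', 'C', 'D']
toy_sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
print(top_k_similar(toy_titles, toy_sim, 'A', k=2))
```

For movie `A`, the sketch returns `C` and then `D`, the two other movies with the highest scores in its row.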

## Looking at the content-based recommendation if our movie was 'The Monkey King'

In [134]:
recommendations('The Monkey King')

["Dragon Nest: Warriors' Dawn",
 'League of Gods',
 'Flying Swords of Dragon Gate',
 'Tai Chi Hero',
 'Sacrifice',
 'Saving General Yang',
 'Little Big Soldier',
 'Bodyguards and Assassins',
 'Call of Heroes',
 'The Accidental Spy']

The result above shows that if a person watches 'The Monkey King', which is made in China and has genres like action and fantasy, our recommendation system recommends other movies that share features with 'The Monkey King'.