# Movie Recommender with NLP

<img src="Netflix.jpg" style="width:80%;">

We are planning to develop a movie recommendation system that, when given a movie, will generate a list of similar movies. To accomplish this, we will proceed with the following steps:

1. **Load the dataset, which contains a large number of movies.**

2. **Text preprocessing: Preprocess the movie descriptions.**

3. **Generate TF-IDF vectors for the movie descriptions.**

4. **Generate the cosine similarity matrix: This matrix contains pairwise similarity scores for each movie with all others.**


## 1. Load the dataset from Wikipedia

We are going to import the dataset using read_csv() from pandas.

In [1]:
import pandas as pd
df = pd.read_csv("wiki_movie_plots_deduped.csv")
df.head()

| Unnamed: 0 | Release Year | Title | Origin/Ethnicity | Director | Cast | Genre | Wiki Page | Plot |
|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Kansas Saloon Smashers | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Kansas_Saloon_Sm... | A bartender is working at a saloon, serving dr... |
| 1 | 1901 | Love by the Light of the Moon | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Love_by_the_Ligh... | The moon, painted with a smiling face hangs ov... |
| 2 | 1901 | The Martyred Presidents | American | Unknown | | unknown | https://en.wikipedia.org/wiki/The_Martyred_Pre... | The film, just over a minute long, is composed... |
| 3 | 1901 | Terrible Teddy, the Grizzly King | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Terrible_Teddy,_... | Lasting just 61 seconds and consisting of two ... |
| 4 | 1902 | Jack and the Beanstalk | American | George S. Fleming, Edwin S. Porter | | unknown | https://en.wikipedia.org/wiki/Jack_and_the_Bea... | The earliest known adaptation of the classic f... |


We will only use two features: **'Plot'**, which contains a brief description of the movie, and **'Title'**, which identifies the movie and is what the recommender returns.

As mentioned at the beginning, we will use **TF-IDF** to turn the movie plots into numerical vectors. **TF-IDF** (term frequency-inverse document frequency) is a numerical statistic that reflects how important a term is to a document relative to a collection of documents: a term scores high when it occurs often in one document but rarely in the others. It is commonly used in information retrieval and text mining.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df["Plot"])
#print(tfidf_matrix.toarray())

**stop_words='english'**: This is an optional parameter of the TfidfVectorizer constructor. Passing `'english'` selects scikit-learn's built-in list of common English stopwords, which are then ignored during vectorization. Stopwords are words like "the," "and," "in," "of," etc., which carry little significance in many NLP tasks because they appear frequently in most texts and add little distinguishing information.
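The effect of the parameter can be seen by comparing the learned vocabularies with and without it. This is a small sketch on a made-up sentence; `sample`, `with_stops` and `without_stops` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample sentence, used only to inspect the vocabulary.
sample = ["The hero and the villain meet in the city of shadows"]

with_stops = TfidfVectorizer().fit(sample)
without_stops = TfidfVectorizer(stop_words="english").fit(sample)

# vocabulary_ maps each kept term to its column index.
print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

Words such as "the", "and", "in" and "of" appear only in the first vocabulary, so they no longer contribute columns to the TF-IDF matrix.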

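Because `TfidfVectorizer` L2-normalizes each vector by default, the dot product of two TF-IDF vectors equals their cosine similarity. We can therefore use the **linear_kernel** function, which computes the pairwise dot product of each vector with every other vector, to build the cosine similarity matrix.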
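The equivalence between the dot product and cosine similarity can be checked on a small example. The sketch below uses a hypothetical toy corpus and relies on the default `norm="l2"` setting of `TfidfVectorizer`; it would not hold if normalization were disabled.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus (not taken from the dataset).
toy_plots = [
    "a detective hunts a killer",
    "a detective falls in love",
    "a robot falls in love",
]
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_plots)

# With the default norm="l2", every row has unit length.
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))

# For unit-length rows, the plain dot product equals cosine similarity.
dot = linear_kernel(toy_tfidf, toy_tfidf)
cos = cosine_similarity(toy_tfidf, toy_tfidf)
print(np.allclose(dot, cos))
```

`linear_kernel` skips the normalization step that `cosine_similarity` performs, which is why it is a reasonable choice when the vectors are already normalized.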

In [3]:
from sklearn.metrics.pairwise import linear_kernel
import time
# Record start time
start = time.time()
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Print time taken
print("Time taken: %s seconds" %round((time.time() - start)))

Time taken: 44 seconds


In [4]:
print(cosine_sim)
print(cosine_sim.shape)

[[1.         0.03075243 0.00770344 ... 0.         0.0079289  0.        ]
 [0.03075243 1.         0.00809798 ... 0.         0.00998054 0.0178333 ]
 [0.00770344 0.00809798 1.         ... 0.00679746 0.00662916 0.        ]
 ...
 [0.         0.         0.00679746 ... 1.         0.0111069  0.00546489]
 [0.0079289  0.00998054 0.00662916 ... 0.0111069  1.         0.00338896]
 [0.         0.0178333  0.         ... 0.00546489 0.00338896 1.        ]]
(34886, 34886)


The calculation of the cosine similarity matrix took 44 seconds and produced a 34,886 × 34,886 matrix, with one row and one column per movie.
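A dense matrix of that shape holds about 1.2 billion values, or roughly 9.7 GB if stored as 64-bit floats. If that is too much for the machine at hand, one possible alternative is to compute a single row of similarities on demand. The sketch below illustrates the idea on a hypothetical toy corpus; `similarity_row` is an illustrative helper, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus (not taken from the dataset).
toy_plots = [
    "a detective hunts a killer in the city",
    "a detective falls in love with a killer",
    "a robot falls in love",
    "a robot hunts a detective",
]
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_plots)

def similarity_row(tfidf, idx):
    """Illustrative helper: similarity of document idx to every document."""
    # Slicing keeps the query as a 1 x n_features sparse matrix.
    return linear_kernel(tfidf[idx:idx + 1], tfidf).ravel()

# The on-demand row matches the corresponding row of the full matrix.
full = linear_kernel(toy_tfidf, toy_tfidf)
row = similarity_row(toy_tfidf, 2)
print(np.allclose(row, full[2]))
```

This trades the one-off 44-second computation for a small cost per query, and keeps only one row of scores in memory at a time.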

**The recommender function**
1. The function takes a movie title, the **cosine similarity matrix**, and an indices series as arguments. The indices series is a reverse mapping of movie titles to their indices in the original data frame.

2. The function extracts pairwise cosine similarity scores of the given movie with all other movies.

3. It then sorts the similarity scores in descending order.

4. The function disregards the highest similarity score (which is 1), because it belongs to the given movie itself.

5. Finally, it returns the titles of the ten movies with the next-highest similarity scores.

In [5]:
# Map each title to its row index. Keep the first row for titles that occur
# more than once, so that indices[title] always returns a single integer.
indices = pd.Series(df.index, index=df['Title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Pair each movie index with its similarity to the given movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the top 10 most similar movies (uses the global df)
    return df['Title'].iloc[movie_indices]
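To try the same logic without loading the full dataset, here is a minimal sketch on a hypothetical four-movie frame. The names `toy_df`, `recommend` and `top_n` are illustrative, and the titles and plots are made up. This variant differs from the function above in that it drops the query movie by index instead of assuming it sorts first, and it receives the titles as an argument rather than reading a global data frame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data: invented titles and plots.
toy_df = pd.DataFrame({
    "Title": ["Bat Hero", "Bat Hero Returns", "Love in Paris", "Robot Wars"],
    "Plot": [
        "a masked vigilante protects the city from a criminal clown",
        "the masked vigilante returns to save the city from a new criminal",
        "two strangers fall in love during a summer in paris",
        "giant robots fight a war across a ruined planet",
    ],
})

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Plot"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)
toy_indices = pd.Series(toy_df.index, index=toy_df["Title"])

def recommend(title, sim, indices, titles, top_n=2):
    """Illustrative variant of the recommender for the toy frame."""
    idx = indices[title]
    order = np.argsort(-sim[idx])        # movie indices, most similar first
    order = order[order != idx][:top_n]  # drop the query movie itself
    return titles.iloc[order]

result = recommend("Bat Hero", toy_sim, toy_indices, toy_df["Title"], top_n=1)
print(result.tolist())  # ['Bat Hero Returns']
```

Only the second plot shares non-stopword terms with the first one, so it is the single recommendation returned.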

## Examples

In [6]:
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

21247           The Dark Knight Rises
14600                   Batman Begins
15411                 The Dark Knight
20979                 The Dark Knight
12917                Batman and Robin
11121                          Batman
8060                           Batman
12371                  Batman Forever
17182           The Lego Batman Movie
11948    Batman: Mask of the Phantasm
Name: Title, dtype: object


In [11]:
print(get_recommendations('The Lord of the Rings: The Fellowship of the Ring', cosine_sim, indices))

9513                                 The Lord of the Rings
14303        The Lord of the Rings: The Return of the King
9759                                The Return of the King
14093                The Lord of the Rings: The Two Towers
16549    Hobbit: The Desolation of Smaug, TheThe Hobbit...
23814                           Four Sisters and a Wedding
25362                                            Maa Kasam
6848                                       My Gun Is Quick
25114                                          Guest House
23630                                                Coweb
Name: Title, dtype: object
