## MOVIE RECOMMENDATION ENGINE using NLP

## Introduction

we will develop a Content-Based Movie Recommendation System with the IMDB  top 250 English Movies dataset.

Here we will be covering the following:

1. Importing the Libraries/Dependencies and Loading the Data

2. Text Preprocessing with NLP

3. Generating Word Representations using Bag Of Words

4. Vectorizing Words and Creating the Similarity Matrix

5. Training and Testing Our Recommendation Engine

In [2]:
pip install rake-nltk

Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Collecting nltk<4.0.0,>=3.6.2
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2022.7.9-cp38-cp38-win_amd64.whl (262 kB)
Installing collected packages: regex, nltk, rake-nltk
  Attempting uninstall: regex
    Found existing installation: regex 2021.4.4
    Uninstalling regex-2021.4.4:
      Successfully uninstalled regex-2021.4.4
Note: you may need to restart the kernel to use updated packages.
  Attempting uninstall: nltk
    Found existing installation: nltk 3.6.1
    Uninstalling nltk-3.6.1:
      Successfully uninstalled nltk-3.6.1
Successfully installed nltk-3.7 rake-nltk-1.0.6 regex-2022.7.9


### Importing the Libraries/Dependencies and Loading the Data

In [3]:
#Import the required libraries/dependencies
#ignore warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
#imports
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
#Load the dataset
df = pd.read_csv('IMDB_Top250EngMovies.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,...,tomatoConsensus,tomatoUserMeter,tomatoUserRating,tomatoUserReviews,tomatoURL,DVD,BoxOffice,Production,Website,Response
0,1,The Shawshank Redemption,1994,R,14 Oct 1994,142 min,"Crime, Drama",Frank Darabont,"Stephen King (short story ""Rita Hayworth and S...","Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",...,,,,,http://www.rottentomatoes.com/m/shawshank_rede...,27 Jan 1998,,Columbia Pictures,,True
1,2,The Godfather,1972,R,24 Mar 1972,175 min,"Crime, Drama",Francis Ford Coppola,"Mario Puzo (screenplay), Francis Ford Coppola ...","Marlon Brando, Al Pacino, James Caan, Richard ...",...,,,,,http://www.rottentomatoes.com/m/godfather/,09 Oct 2001,,Paramount Pictures,http://www.thegodfather.com,True
2,3,The Godfather: Part II,1974,R,20 Dec 1974,202 min,"Crime, Drama",Francis Ford Coppola,"Francis Ford Coppola (screenplay), Mario Puzo ...","Al Pacino, Robert Duvall, Diane Keaton, Robert...",...,,,,,http://www.rottentomatoes.com/m/godfather_part...,24 May 2005,,Paramount Pictures,http://www.thegodfather.com/,True
3,4,The Dark Knight,2008,PG-13,18 Jul 2008,152 min,"Action, Crime, Drama",Christopher Nolan,"Jonathan Nolan (screenplay), Christopher Nolan...","Christian Bale, Heath Ledger, Aaron Eckhart, M...",...,,,,,http://www.rottentomatoes.com/m/the_dark_knight/,09 Dec 2008,"$533,316,061",Warner Bros. Pictures/Legendary,http://thedarkknight.warnerbros.com/,True
4,5,12 Angry Men,1957,APPROVED,01 Apr 1957,96 min,"Crime, Drama",Sidney Lumet,"Reginald Rose (story), Reginald Rose (screenplay)","Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",...,,,,,http://www.rottentomatoes.com/m/1000013-12_ang...,06 Mar 2001,,Criterion Collection,http://www.criterion.com/films/27871-12-angry-men,True


In [10]:
df.shape

(250, 38)

### Dropping columns with NaN values

In [21]:
df = df.dropna(axis=1)

In [22]:
df.nunique()

Unnamed: 0        250
Title             250
Year               85
Rated              10
Runtime           103
Genre             110
Director          155
Actors            248
Plot              250
Language           68
Country            39
Poster            250
Ratings.Source      1
Ratings.Value      13
imdbRating         13
imdbVotes         250
imdbID            250
Type                1
tomatoURL         250
Production         89
Response            1
dtype: int64

In [23]:
# data overview
print('Rows x Columns : ', df.shape[0], 'x', df.shape[1])
print('Features: ', df.columns.tolist())
print('nUnique values:')
print(df.nunique())

Rows x Columns :  250 x 21
Features:  ['Unnamed: 0', 'Title', 'Year', 'Rated', 'Runtime', 'Genre', 'Director', 'Actors', 'Plot', 'Language', 'Country', 'Poster', 'Ratings.Source', 'Ratings.Value', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'tomatoURL', 'Production', 'Response']
nUnique values:
Unnamed: 0        250
Title             250
Year               85
Rated              10
Runtime           103
Genre             110
Director          155
Actors            248
Plot              250
Language           68
Country            39
Poster            250
Ratings.Source      1
Ratings.Value      13
imdbRating         13
imdbVotes         250
imdbID            250
Type                1
tomatoURL         250
Production         89
Response            1
dtype: int64


In [24]:
# type of entries, how many missing values/null fields
df.info()
print('nMissing values:  ', df.isnull().sum().values.sum())
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      250 non-null    int64  
 1   Title           250 non-null    object 
 2   Year            250 non-null    int64  
 3   Rated           250 non-null    object 
 4   Runtime         250 non-null    object 
 5   Genre           250 non-null    object 
 6   Director        250 non-null    object 
 7   Actors          250 non-null    object 
 8   Plot            250 non-null    object 
 9   Language        250 non-null    object 
 10  Country         250 non-null    object 
 11  Poster          250 non-null    object 
 12  Ratings.Source  250 non-null    object 
 13  Ratings.Value   250 non-null    object 
 14  imdbRating      250 non-null    float64
 15  imdbVotes       250 non-null    object 
 16  imdbID          250 non-null    object 
 17  Type            250 non-null    obj

Unnamed: 0        0
Title             0
Year              0
Rated             0
Runtime           0
Genre             0
Director          0
Actors            0
Plot              0
Language          0
Country           0
Poster            0
Ratings.Source    0
Ratings.Value     0
imdbRating        0
imdbVotes         0
imdbID            0
Type              0
tomatoURL         0
Production        0
Response          0
dtype: int64

In [25]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,250.0,125.5,72.312977,1.0,63.25,125.5,187.75,250.0
Year,250.0,1982.676,24.809212,1921.0,1961.25,1988.0,2003.0,2017.0
imdbRating,250.0,8.244,0.245735,8.0,8.1,8.2,8.375,9.3


In [26]:
df.shape

(250, 21)

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Year,Rated,Runtime,Genre,Director,Actors,Plot,Language,...,Poster,Ratings.Source,Ratings.Value,imdbRating,imdbVotes,imdbID,Type,tomatoURL,Production,Response
0,1,The Shawshank Redemption,1994,R,142 min,"Crime, Drama",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...,English,...,https://images-na.ssl-images-amazon.com/images...,Internet Movie Database,9.3/10,9.3,1825626,tt0111161,movie,http://www.rottentomatoes.com/m/shawshank_rede...,Columbia Pictures,True
1,2,The Godfather,1972,R,175 min,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...,"English, Italian, Latin",...,https://images-na.ssl-images-amazon.com/images...,Internet Movie Database,9.2/10,9.2,1243444,tt0068646,movie,http://www.rottentomatoes.com/m/godfather/,Paramount Pictures,True
2,3,The Godfather: Part II,1974,R,202 min,"Crime, Drama",Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...,"English, Italian, Spanish, Latin, Sicilian",...,https://images-na.ssl-images-amazon.com/images...,Internet Movie Database,9.0/10,9.0,856870,tt0071562,movie,http://www.rottentomatoes.com/m/godfather_part...,Paramount Pictures,True
3,4,The Dark Knight,2008,PG-13,152 min,"Action, Crime, Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...,"English, Mandarin",...,https://images-na.ssl-images-amazon.com/images...,Internet Movie Database,9.0/10,9.0,1802351,tt0468569,movie,http://www.rottentomatoes.com/m/the_dark_knight/,Warner Bros. Pictures/Legendary,True
4,5,12 Angry Men,1957,APPROVED,96 min,"Crime, Drama",Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...,English,...,https://images-na.ssl-images-amazon.com/images...,Internet Movie Database,8.9/10,8.9,494215,tt0050083,movie,http://www.rottentomatoes.com/m/1000013-12_ang...,Criterion Collection,True


In [29]:
df.dtypes

Unnamed: 0          int64
Title              object
Year                int64
Rated              object
Runtime            object
Genre              object
Director           object
Actors             object
Plot               object
Language           object
Country            object
Poster             object
Ratings.Source     object
Ratings.Value      object
imdbRating        float64
imdbVotes          object
imdbID             object
Type               object
tomatoURL          object
Production         object
Response             bool
dtype: object

In [35]:
df.drop(['Unnamed: 0', 'Year', 'Rated','Runtime','Language','Country', 'Poster','Ratings.Source','Ratings.Value',
                'imdbID','Type','tomatoURL','Production','Response'], axis=1, inplace=True)

In [36]:
df.shape

(250, 7)

In [43]:
df.head()

Unnamed: 0,Title,Genre,Director,Actors,Plot,imdbVotes
0,The Shawshank Redemption,"Crime, Drama",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...,1825626
1,The Godfather,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...,1243444
2,The Godfather: Part II,"Crime, Drama",Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...,856870
3,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...,1802351
4,12 Angry Men,"Crime, Drama",Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...,494215


In [44]:
df.drop(['imdbVotes'], axis=1, inplace=True)

In [45]:
df.head()

Unnamed: 0,Title,Genre,Director,Actors,Plot
0,The Shawshank Redemption,"Crime, Drama",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...
1,The Godfather,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...
2,The Godfather: Part II,"Crime, Drama",Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...
3,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...
4,12 Angry Men,"Crime, Drama",Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...


In [46]:
print('Rows x Columns : ', df.shape[0], 'x', df.shape[1])
print('Features: ', df.columns.tolist())
print('nUnique values:')
print(df.nunique())

Rows x Columns :  250 x 5
Features:  ['Title', 'Genre', 'Director', 'Actors', 'Plot']
nUnique values:
Title       250
Genre       110
Director    155
Actors      248
Plot        250
dtype: int64


In [47]:
df.describe().T

Unnamed: 0,count,unique,top,freq
Title,250,250,Schindler's List,1
Genre,250,110,Drama,19
Director,250,155,Alfred Hitchcock,9
Actors,250,248,"Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...",2
Plot,250,250,The Tramp struggles to live in modern industri...,1


In [48]:
print('nMissing values:  ', df.isnull().sum().values.sum())
df.isnull().sum()

nMissing values:   0


Title       0
Genre       0
Director    0
Actors      0
Plot        0
dtype: int64

### Text Preprocessing with NLP

Creating a new column in our dataframe that will hold all necessary keywords required for the model. We name it **‘Key_words’**. Further, we use a very special NLP library known as **RAKE** (short for Rapid Automatic Keyword Extraction algorithm). RAKE is a keyword extraction algorithm that extracts those key phrases in a text corpus by determining the frequency of words and their relative occurrence with other words in the corpus.

In [49]:
# Text Pre-processing with NLP

# to remove punctuations from Plot
df['Plot'] = df['Plot'].str.replace('[^ws]','')

In [51]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\joshm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [53]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\joshm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [54]:
# to extract key words from Plot to a list
df['Key_words'] = ''   # initializing a new column
r = Rake()   # using Rake to remove stop words

for index, row in df.iterrows():
    r.extract_keywords_from_text(row['Plot'])   # to extract key words 
    key_words_dict_scores = r.get_word_degrees()    # to get dictionary with key words and their similarity scores
    row['Key_words'] = list(key_words_dict_scores.keys())   # to assign it to new column

df

Unnamed: 0,Title,Genre,Director,Actors,Plot,Key_words
0,The Shawshank Redemption,"Crime, Drama",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",wssss,[wssss]
1,The Godfather,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",sssssss,[sssssss]
2,The Godfather: Part II,"Crime, Drama",Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",swswssssss,[swswssssss]
3,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",wsssssswssssssssss,[wsssssswssssssssss]
4,12 Angry Men,"Crime, Drama",Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",ssssss,[ssssss]
...,...,...,...,...,...,...
245,The Lost Weekend,"Drama, Film-Noir",Billy Wilder,"Ray Milland, Jane Wyman, Phillip Terry, Howard...",ssw,[ssw]
246,Short Term 12,Drama,Destin Daniel Cretton,"Brie Larson, John Gallagher Jr., Stephanie Bea...",sssssswswsw,[sssssswswsw]
247,His Girl Friday,"Comedy, Drama, Romance",Howard Hawks,"Cary Grant, Rosalind Russell, Ralph Bellamy, G...",wssssw,[wssssw]
248,The Straight Story,"Biography, Drama",David Lynch,"Sissy Spacek, Jane Galloway Heitz, Joseph A. C...",swssw,[swssw]


In [55]:
# to extract all genre into a list, only the first three actors into a list, and all directors into a list
df['Genre'] = df['Genre'].map(lambda x: x.split(','))
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])
df['Director'] = df['Director'].map(lambda x: x.split(','))

# create unique names by merging firstname & surname into one word, & convert to lowercase 
for index, row in df.iterrows():
    row['Genre'] = [x.lower().replace(' ','') for x in row['Genre']]
    row['Actors'] = [x.lower().replace(' ','') for x in row['Actors']]
    row['Director'] = [x.lower().replace(' ','') for x in row['Director']]

### Generating Word Representations using Bag of Words
BoW or Bag of Words is an Information Retrieval(IR) model which is useful for creating a representation of text, which describes the occurrence of words in a document or simply implies to us whether a particular word is frequent in the text corpus or not.

It is very useful for creating vector representations of frequent words in a corpus and then computing the similarity scores (‘similarity matrix’.)

In [56]:
# Generating Word Representations using Bag of Words

# to combine 4 lists (4 columns) of key words into 1 sentence under Bag_of_words column
df['Bag_of_words'] = ''
columns = ['Genre', 'Director', 'Actors', 'Key_words']

for index, row in df.iterrows():
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    row['Bag_of_words'] = words
    
# strip white spaces infront and behind, replace multiple whitespaces (if any)
df['Bag_of_words'] = df['Bag_of_words'].str.strip().str.replace('   ', ' ').str.replace('  ', ' ')

df = df[['Title','Bag_of_words']]

In [57]:
df.head()

Unnamed: 0,Title,Bag_of_words
0,The Shawshank Redemption,crime drama frankdarabont timrobbins morganfre...
1,The Godfather,crime drama francisfordcoppola marlonbrando al...
2,The Godfather: Part II,crime drama francisfordcoppola alpacino robert...
3,The Dark Knight,action crime drama christophernolan christianb...
4,12 Angry Men,crime drama sidneylumet martinbalsam johnfiedl...


### Vectorizing BoW and Creating the Similarity Matrix
Here, we will convert our BoW into vector representations with the help of much of a popular tool named Count Vectorizer. Count Vectorizer, gifted by our very own Scikit Learn library, converts words into their respective vector forms, on the basis of the frequency count of each word. Hence, the name, Count Vectorizer! Once the count vectorizer has done its job, and we have a matrix of all word counts, we will generate the similarity matrix with the help of cosine similarity.

Cosine similarity measures the similarity between two vectors, by the cosine of the angle between them, and based on the value it gets, it decides whether the two vectors are moving in the same direction.

In [58]:
#Vectorizing BoW and Creating the Similarity Matrix

# to generate the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['Bag_of_words'])
count_matrix

cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)

[[1.         0.28571429 0.28571429 ... 0.13363062 0.13363062 0.14285714]
 [0.28571429 1.         0.57142857 ... 0.13363062 0.13363062 0.14285714]
 [0.28571429 0.57142857 1.         ... 0.13363062 0.13363062 0.14285714]
 ...
 [0.13363062 0.13363062 0.13363062 ... 1.         0.125      0.13363062]
 [0.13363062 0.13363062 0.13363062 ... 0.125      1.         0.13363062]
 [0.14285714 0.14285714 0.14285714 ... 0.13363062 0.13363062 1.        ]]


So, we have successfully computed our similarity matrix of 250 rows x 250 columns!

In [60]:
#  make sure that our Title column is well inclined with the row and column index of the similarity matrix.

indices = pd.Series(df['Title'])

### Training an Testing the Recommendation Engine

Steps involved in this method-

1. First, it takes in a movie title as user input.

2. Matches the input title with the respective index of the similarity matrix

3. Extracts the similarity values in the top to bottom or descending fashion

4. Extract (N+1) movies and remove the 1st one as it’s the user input itself

5. Give the Top N recommendations to the user

In [66]:
# Training and Testing Our Recommendation Engine

# this function takes in a movie title as input and returns the top 5 recommended (similar) movies

def recommend(title, cosine_sim = cosine_sim):
    recommended_movies = []
    idx = indices[indices == title].index[0]   # to get the index of the movie title matching the input movie
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)   # similarity scores in descending order
    top_5_indices = list(score_series.iloc[1:6].index)   # to get the indices of top 6 most similar movies
    # [1:6] to exclude 0 (index 0 is the input movie itself)
    
    for i in top_5_indices:   # to append the titles of top 10 similar movies to the recommended_movies list
        recommended_movies.append(list(df['Title'])[i])
        
    return recommended_movies


In [67]:
recommend('The Godfather')

['The Godfather: Part II',
 'Scarface',
 'Apocalypse Now',
 'Heat',
 'Dog Day Afternoon']

In [68]:
recommend('The Avengers')

['Spider-Man: Homecoming',
 'The Terminator',
 'Guardians of the Galaxy Vol. 2',
 'Terminator 2: Judgment Day',
 'Aliens']

In [69]:
recommend('Slumdog Millionaire')

['Trainspotting',
 'A Streetcar Named Desire',
 'Paris, Texas',
 'The Last Picture Show',
 'Room']

In [70]:
recommend('His Girl Friday')

['The Philadelphia Story',
 'The Apartment',
 'City Lights',
 'Forrest Gump',
 'Notorious']