<a href="https://colab.research.google.com/github/AyushiKashyapp/NLP/blob/main/MovieRecommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation with NLP

A project to create a recommendation system based on the [IMDB Top 250 Lists and 5000 plus IMDB records](https://data.world/studentoflife/imdb-top-250-lists-and-5000-or-so-data-records) dataset.
A recommendation system predicts and filters user preferences after learning about the user's past choices.

There are two types of recommendation systems:

- Content-Based Recommendation System: This system uses content-based filtering to generate recommendations. It recommends products whose content is similar to the products in the user's history.

- Collaborative Recommendation System: This system does not consider one user at a time but a cluster of similar users and, based on those users' ratings, recommends similar products to that cluster.

Steps followed:

1. Importing the dependencies and loading the data.
2. Text Preprocessing with NLP.
3. Generating word representations using Bag Of Words.
4. Vectorising words and creating the similarity matrix.
5. Training and testing our recommendation engine.

In [1]:
!pip install rake-nltk

Collecting rake-nltk
  Downloading rake_nltk-1.0.6-py3-none-any.whl (9.1 kB)
Installing collected packages: rake-nltk
Successfully installed rake-nltk-1.0.6


In [25]:
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [66]:
df = pd.read_csv('/content/IMDB_Top250Engmovies2_OMDB_Detailed.csv')
df.head()

,Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,...,tomatoConsensus,tomatoUserMeter,tomatoUserRating,tomatoUserReviews,tomatoURL,DVD,BoxOffice,Production,Website,Response
0,1,The Shawshank Redemption,1994,R,14 Oct 1994,142 min,"Crime, Drama",Frank Darabont,"Stephen King (short story ""Rita Hayworth and S...","Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",...,,,,,http://www.rottentomatoes.com/m/shawshank_rede...,27 Jan 1998,,Columbia Pictures,,True
1,2,The Godfather,1972,R,24 Mar 1972,175 min,"Crime, Drama",Francis Ford Coppola,"Mario Puzo (screenplay), Francis Ford Coppola ...","Marlon Brando, Al Pacino, James Caan, Richard ...",...,,,,,http://www.rottentomatoes.com/m/godfather/,09 Oct 2001,,Paramount Pictures,http://www.thegodfather.com,True
2,3,The Godfather: Part II,1974,R,20 Dec 1974,202 min,"Crime, Drama",Francis Ford Coppola,"Francis Ford Coppola (screenplay), Mario Puzo ...","Al Pacino, Robert Duvall, Diane Keaton, Robert...",...,,,,,http://www.rottentomatoes.com/m/godfather_part...,24 May 2005,,Paramount Pictures,http://www.thegodfather.com/,True
3,4,The Dark Knight,2008,PG-13,18 Jul 2008,152 min,"Action, Crime, Drama",Christopher Nolan,"Jonathan Nolan (screenplay), Christopher Nolan...","Christian Bale, Heath Ledger, Aaron Eckhart, M...",...,,,,,http://www.rottentomatoes.com/m/the_dark_knight/,09 Dec 2008,"$533,316,061",Warner Bros. Pictures/Legendary,http://thedarkknight.warnerbros.com/,True
4,5,12 Angry Men,1957,APPROVED,01 Apr 1957,96 min,"Crime, Drama",Sidney Lumet,"Reginald Rose (story), Reginald Rose (screenplay)","Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",...,,,,,http://www.rottentomatoes.com/m/1000013-12_ang...,06 Mar 2001,,Criterion Collection,http://www.criterion.com/films/27871-12-angry-men,True


**Data Overview**

In [52]:
print('Rows x Columns : ', df.shape[0], 'x', df.shape[1])
print('Features: ', df.columns.tolist())
print('\nUnique values:')
print(df.nunique())

Rows x Columns :  250 x 38
Features:  ['Unnamed: 0', 'Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings.Source', 'Ratings.Value', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'tomatoMeter', 'tomatoImage', 'tomatoRating', 'tomatoReviews', 'tomatoFresh', 'tomatoRotten', 'tomatoConsensus', 'tomatoUserMeter', 'tomatoUserRating', 'tomatoUserReviews', 'tomatoURL', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']

Unique values:
Unnamed: 0           250
Title                250
Year                  85
Rated                 10
Released             244
Runtime              103
Genre                110
Director             155
Writer               238
Actors               248
Plot                 250
Language              68
Country               39
Awards               235
Poster               250
Ratings.Source         1
Ratings.Value         13
Metascore             44
imdb

In [34]:
df.info()
print('\nMissing values: ', df.isnull().sum().values.sum())
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 38 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         250 non-null    int64  
 1   Title              250 non-null    object 
 2   Year               250 non-null    int64  
 3   Rated              250 non-null    object 
 4   Released           248 non-null    object 
 5   Runtime            250 non-null    object 
 6   Genre              250 non-null    object 
 7   Director           250 non-null    object 
 8   Writer             249 non-null    object 
 9   Actors             250 non-null    object 
 10  Plot               250 non-null    object 
 11  Language           250 non-null    object 
 12  Country            250 non-null    object 
 13  Awards             245 non-null    object 
 14  Poster             250 non-null    object 
 15  Ratings.Source     250 non-null    object 
 16  Ratings.Value      250 non

Unnamed: 0             0
Title                  0
Year                   0
Rated                  0
Released               2
Runtime                0
Genre                  0
Director               0
Writer                 1
Actors                 0
Plot                   0
Language               0
Country                0
Awards                 5
Poster                 0
Ratings.Source         0
Ratings.Value          0
Metascore             73
imdbRating             0
imdbVotes              0
imdbID                 0
Type                   0
tomatoMeter          250
tomatoImage          250
tomatoRating         250
tomatoReviews        250
tomatoFresh          250
tomatoRotten         250
tomatoConsensus      250
tomatoUserMeter      250
tomatoUserRating     250
tomatoUserReviews    250
tomatoURL              0
DVD                    3
BoxOffice            175
Production             0
Website              119
Response               0
dtype: int64

**Text Preprocessing with NLP**

We add a new column to the dataframe to hold the keywords the model needs. The keywords are extracted with an NLP library that implements RAKE (Rapid Automatic Keyword Extraction). RAKE extracts key phrases from a text by looking at how frequently each word appears and how often it co-occurs with other words.
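
As a rough illustration of the scoring idea behind RAKE, the sketch below splits a text into candidate phrases at punctuation and stop words, then computes each word's frequency and degree (the total length of the phrases it appears in). It is a simplified pure-Python sketch, not the `rake_nltk` implementation; the tiny `STOP_WORDS` set and the names `candidate_phrases` and `word_degrees` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list for illustration only;
# a real run would use a full English stop-word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "over", "in", "to", "is"}

def candidate_phrases(text):
    """Split text into phrases (lists of words) at punctuation and stop words."""
    phrases = []
    # Break the text into chunks at punctuation first.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def word_degrees(text):
    """Return (frequency, degree) dictionaries for every non-stop word."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidate_phrases(text):
        for word in phrase:
            freq[word] += 1
            # A word co-occurs with every word in its phrase, itself included.
            degree[word] += len(phrase)
    return dict(freq), dict(degree)

freq, degree = word_degrees("Two imprisoned men bond over a number of years.")
print(list(degree))
```

In the notebook only the dictionary keys are kept, so the scores serve to identify the non-stop words rather than to rank them.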

In [67]:
# to remove punctuation from Plot (raw string plus regex=True so the pattern is treated as a regex)
df['Plot'] = df['Plot'].str.replace(r'[^\w\s]', '', regex=True)
# to extract key words from Plot to a list
df['Key_words'] = ''   # initializing a new column
r = Rake()   # using Rake to remove stop words

for index, row in df.iterrows():
    r.extract_keywords_from_text(row['Plot'])
    key_words_dict_scores = r.get_word_degrees()
    df.at[index, 'Key_words'] = list(key_words_dict_scores.keys())


In [68]:
#df['Plot'].head()
df['Key_words'].head()

0    [two, imprisoned, men, bond, number, years, fi...
1    [aging, patriarch, organized, crime, dynasty, ...
2    [early, life, career, vito, corleone, 1920s, n...
3    [menace, known, joker, emerges, mysterious, pa...
4    [jury, holdout, attempts, prevent, miscarriage...
Name: Key_words, dtype: object

**Feature Engineering Actor and Director names**

To avoid any confusion between actors and directors who share a first name, we feature engineer the names by merging the first name and surname into a single word. This ensures that the recommender detects a similarity only if the person associated with different movies is exactly the same.
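
A minimal sketch of why the merge matters, using two made-up cast strings (not rows from the dataset) and a hypothetical helper `merge_names`: compared word by word, the casts appear to share "christian", while the merged tokens share nothing.

```python
def merge_names(people):
    """Lower-case each comma-separated name and drop its internal spaces."""
    return [name.strip().lower().replace(" ", "") for name in people.split(",")]

# Illustrative strings: two casts that share only a first name.
raw_a = "Christian Bale, Heath Ledger"
raw_b = "Christian Slater, Morgan Freeman"

# Word-level comparison without merging: the first name looks like a match.
words_a = set(raw_a.lower().replace(",", "").split())
words_b = set(raw_b.lower().replace(",", "").split())
shared_raw = words_a & words_b

# Comparison after merging: each person is a single token, so nothing matches.
cast_a = merge_names(raw_a)
cast_b = merge_names(raw_b)
shared_merged = set(cast_a) & set(cast_b)

print(shared_raw, shared_merged)
```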

In [69]:
df['Genre'] = df['Genre'].map(lambda x: x.split(','))
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])
df['Director'] = df['Director'].map(lambda x: x.split(','))

for index, row in df.iterrows():
    df.at[index, 'Genre'] = [x.lower().replace(' ', '') for x in row['Genre']]
    df.at[index, 'Actors'] = [x.lower().replace(' ', '') for x in row['Actors']]
    df.at[index, 'Director'] = [x.lower().replace(' ', '') for x in row['Director']]

In [63]:
df['Genre'].head()

0            [crime, drama]
1            [crime, drama]
2            [crime, drama]
3    [action, crime, drama]
4            [crime, drama]
Name: Genre, dtype: object

In [70]:
df['Actors'].head()

0        [timrobbins, morganfreeman, bobgunton]
1           [marlonbrando, alpacino, jamescaan]
2         [alpacino, robertduvall, dianekeaton]
3    [christianbale, heathledger, aaroneckhart]
4        [martinbalsam, johnfiedler, leej.cobb]
Name: Actors, dtype: object

**Generating word representation using Bag of Words (BoW)**

Bag of Words is an information retrieval model that represents a text by the words it contains and how often each one occurs, ignoring word order.

These counts give every document a vector representation, from which similarity scores can then be computed.
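
As a small illustration of the Bag of Words idea, the sketch below turns two made-up documents into count vectors over a shared vocabulary. It uses only the standard library; the documents and the helper name `to_count_vector` are assumptions for illustration, and the notebook itself relies on `CountVectorizer` for this step.

```python
from collections import Counter

# Two toy documents in the style of the Bag_of_words column (made up).
docs = [
    "crime drama alpacino crime",
    "action crime drama christianbale",
]

# Vocabulary: every distinct word across the documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Map a document to a list of word counts aligned with the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so two documents can be compared position by position.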

In [71]:
# Combine columns into Bag_of_words
df['Bag_of_words'] = ''

columns = ['Genre', 'Director', 'Actors', 'Key_words']

for index, row in df.iterrows():
    words = ''
    for col in columns:
        words += ' '.join(row[col]) + ' '
    df.at[index, 'Bag_of_words'] = words

# Strip white spaces in front and behind, replace multiple whitespaces (if any)
df['Bag_of_words'] = df['Bag_of_words'].str.strip().str.replace(r'\s+', ' ', regex=True)

# Select only Title and Bag_of_words
df = df[['Title', 'Bag_of_words']]
df.head()

,Title,Bag_of_words
0,The Shawshank Redemption,crime drama frankdarabont timrobbins morganfre...
1,The Godfather,crime drama francisfordcoppola marlonbrando al...
2,The Godfather: Part II,crime drama francisfordcoppola alpacino robert...
3,The Dark Knight,action crime drama christophernolan christianb...
4,12 Angry Men,crime drama sidneylumet martinbalsam johnfiedl...


**Vectorising BoW and creating a similarity matrix**

Converting the BoW into vector representations using CountVectorizer, which converts the words into their respective vector forms, on the basis of the frequency count of each word.

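To generate the similarity matrix, cosine similarity is used. It measures the similarity between two vectors as the cosine of the angle between them. For count vectors, which have no negative entries, the value ranges from 0 (no words in common) to 1 (vectors pointing in the same direction), so a higher value means two movies are more alike.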
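
The cosine measure can be sketched in plain Python as the dot product divided by the product of the vector lengths. The function name `cosine` and the toy vectors are assumptions for illustration; the notebook uses scikit-learn's `cosine_similarity` on the full count matrix.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen in this sketch for an all-zero vector
    return dot / (norm_u * norm_v)

a = [1, 1, 0, 2]
b = [1, 1, 0, 2]   # identical to a
c = [0, 0, 3, 0]   # shares no non-zero position with a
d = [1, 0, 0, 2]   # overlaps with a in two positions

print(cosine(a, b), cosine(a, c), cosine(a, d))
```

Identical vectors score (up to floating-point rounding) 1, vectors with no shared words score 0, and partial overlap falls in between.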

In [72]:
count = CountVectorizer()
count_matrix = count.fit_transform(df['Bag_of_words'])
count_matrix

cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)

[[1.         0.15789474 0.13764944 ... 0.05263158 0.05263158 0.05564149]
 [0.15789474 1.         0.36706517 ... 0.05263158 0.05263158 0.05564149]
 [0.13764944 0.36706517 1.         ... 0.04588315 0.04588315 0.04850713]
 ...
 [0.05263158 0.05263158 0.04588315 ... 1.         0.05263158 0.05564149]
 [0.05263158 0.05263158 0.04588315 ... 0.05263158 1.         0.05564149]
 [0.05564149 0.05564149 0.04850713 ... 0.05564149 0.05564149 1.        ]]


The 'Title' column must stay aligned with the row and column indices of the similarity matrix, so the titles are stored in a Series that shares the dataframe's index.

In [73]:
indices = pd.Series(df['Title'])

**Training and Testing the recommendation system**

A function to take the movie title as input, and return the top 5 similar movies.

Steps:

1. Takes in a movie title as user input.
2. Matches the input title with the respective index of the similarity matrix.
3. Sorts the similarity values for that movie in descending order.
4. Extracts (N+1) movies and removes the first one, as it is the input movie itself.
5. Gives the top N recommendations to the user.

In [74]:
def recommend(title, cosine_sim = cosine_sim):
    recommended_movies = []
    idx = indices[indices == title].index[0]   # to get the index of the movie title matching the input movie
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)   # similarity scores in descending order
    top_5_indices = list(score_series.iloc[1:6].index)   # to get the indices of the top 5 most similar movies
    # [1:6] skips position 0, which is the input movie itself

    for i in top_5_indices:   # to append the titles of the top 5 similar movies to the recommended_movies list
        recommended_movies.append(df['Title'].iloc[i])

    return recommended_movies

In [75]:
recommend('The Avengers')

['Guardians of the Galaxy Vol. 2',
 'Guardians of the Galaxy',
 'Aliens',
 'The Martian',
 'Interstellar']
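
The function above fixes N at 5. As a hedged sketch of the same steps with a configurable N, the example below works on a toy similarity matrix in plain Python. The name `top_n_similar`, the titles, and the scores are all made up for illustration.

```python
def top_n_similar(title, titles, sim_matrix, n=5):
    """Return the n titles most similar to `title`, excluding the title itself."""
    if title not in titles:
        raise KeyError(f"Unknown title: {title!r}")
    idx = titles.index(title)
    # Pair every other row position with its similarity score.
    scored = [(score, i) for i, score in enumerate(sim_matrix[idx]) if i != idx]
    # Highest score first; ties fall back to the original row order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [titles[i] for _, i in scored[:n]]

# Toy data, made up for illustration.
titles = ["A", "B", "C", "D"]
sim_matrix = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

top_two = top_n_similar("A", titles, sim_matrix, n=2)
print(top_two)
```

This variant excludes the input movie by its row position instead of dropping the first sorted entry, which stays correct even if another movie happens to have a similarity of exactly 1. It also raises a clear error for a title that is not in the list.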