# Movie Recommendation System

Whether it is OTT platforms like Netflix and Hotstar or e-commerce websites like Flipkart and Amazon, they all give us suggestions: new movies based on our watch history, or new products based on our order or search history.

These suggestions are produced by a recommendation system. This engine learns patterns from our previous activity and then applies those patterns to make new suggestions.

In [2]:
# Importing the Modules

import pandas as pd
import seaborn as sns
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Loading the dataset

df = pd.read_csv('movies.csv')
df.head(3)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes


In [4]:
# Getting the columns

df.columns

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')

In [5]:
# Getting the shape of the dataset

df.shape

(4803, 24)

In [6]:
# Selecting the features based on which the recommendation is made
# You can also change the features for better modelling

sel_fet = ['genres', 'keywords', 'tagline', 'cast', 'director']

In [7]:
# Checking for null values in the features selected

df[sel_fet].isnull().sum()

genres       28
keywords    412
tagline     844
cast         43
director     30
dtype: int64

In [8]:
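# Filling the null values with empty strings and concatenating the selected features into one text Series for the vectorizer
# Note: the columns are joined without a separator, so the last word of one feature runs into the first word of the next

comb_fet = ""
for i in sel_fet:
    df[i] = df[i].fillna('')  # assign back instead of chained inplace=True
    comb_fet += df[i]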

In [9]:
# Printing the first 5 combined features and the length of the data

print(comb_fet[:5])
print(len(comb_fet))

0    Action Adventure Fantasy Science Fictioncultur...
1    Adventure Fantasy Actionocean drug abuse exoti...
2    Action Adventure Crimespy based on novel secre...
3    Action Crime Drama Thrillerdc comics crime fig...
4    Action Adventure Science Fictionbased on novel...
Name: genres, dtype: object
4803
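
Because the features are concatenated with no separator, the last word of one column fuses with the first word of the next, as in `Fictioncultur...` in the output above. One possible variant, sketched here on a hypothetical two-row frame (`toy_df` and `toy_features` are illustrative names, not part of the notebook), joins the columns with a space. Applying it to the real data would change the similarity scores shown later.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
toy_df = pd.DataFrame({
    "genres": ["Action Adventure", None],
    "keywords": ["space war", "spy sequel"],
    "director": ["James Cameron", "Sam Mendes"],
})
toy_features = ["genres", "keywords", "director"]

# Replace missing values with empty strings, then join each row with spaces
combined = (
    toy_df[toy_features]
    .fillna("")
    .apply(lambda row: " ".join(row), axis=1)
)

print(combined.tolist())
```

With a separator in place, the vectorizer sees `Adventure` and `space` as separate tokens rather than one fused word.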


# TF-IDF Vectorizer

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.
It builds an in-memory vocabulary (a Python dict) that maps each term to a feature index, and then computes a sparse matrix of TF-IDF weights, in which terms that appear in many documents are weighted down.
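
To make the weighting concrete, here is a minimal sketch on a hypothetical three-document corpus (the strings and the names `docs`, `toy_vectorizer`, and `toy_matrix` are illustrative, not taken from movies.csv). It uses the default `TfidfVectorizer` settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from movies.csv
docs = [
    "action adventure space",
    "action crime thriller",
    "romance drama",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(toy_matrix.shape)

# "action" occurs in two documents and "space" in only one,
# so within the first document "action" receives the lower weight
row0 = toy_matrix[0].toarray().ravel()
print(row0[vocab["action"]] < row0[vocab["space"]])
```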

In [10]:
# Creating a TfidfVectorizer object

vector = TfidfVectorizer()

In [11]:
# Fitting the TfidfVectorizer on the combined features and transforming them into vectors

comb_vec = vector.fit_transform(comb_fet)
# print(comb_vec)

# Cosine Similarity

Cosine similarity measures how similar two vectors are. It is the cosine of the angle between them, so it indicates whether the two vectors point in roughly the same direction, regardless of their lengths. It is often used in text analysis to measure the similarity between documents, or between a document and a vector of query words.
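
As a small illustration, the sketch below computes the cosine by hand for three hypothetical vectors (`a`, `b`, and `c` are made-up values) and compares it with scikit-learn's `cosine_similarity`. The helper `cosine` is an illustrative name, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors chosen to show the two extremes
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # parallel vectors
print(cosine(a, c))  # orthogonal vectors

# cosine_similarity computes the same quantity for every pair of rows
pairwise = cosine_similarity(np.vstack([a, b, c]))
print(pairwise.shape)
```

Parallel vectors score 1 and orthogonal vectors score 0, which is why every movie has a similarity of 1 with itself on the diagonal of the matrix.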

In [12]:
# Calculating the Similarity Matrix

similarity = cosine_similarity(comb_vec)
print(similarity)
print(similarity.shape)

[[1.         0.06865296 0.01492221 ... 0.         0.         0.        ]
 [0.06865296 1.         0.02799128 ... 0.01243107 0.         0.        ]
 [0.01492221 0.02799128 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.01243107 0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
(4803, 4803)


In [13]:
# Printing the first 5 titles from the dataset

all_movies = list(df['original_title'])
all_movies[:5]

['Avatar',
 "Pirates of the Caribbean: At World's End",
 'Spectre',
 'The Dark Knight Rises',
 'John Carter']

In [14]:
# Input a Movie name and get the suggestions

movie = input("Enter Movie Name: ")

Enter Movie Name: spider-man


# difflib.get_close_matches()

The difflib module provides a simple yet powerful utility, the get_close_matches function. It accepts a target string and a list of candidate strings, and returns the candidates that most closely match the target.
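
`get_close_matches` also takes two optional parameters: `n`, the maximum number of matches to return (default 3), and `cutoff`, the minimum similarity ratio in the range 0 to 1 (default 0.6). The comparison is case-sensitive. The sketch below uses a hypothetical title list and shows one possible way to match case-insensitively; `titles`, `strict`, `lowered`, and `case_insensitive` are illustrative names.

```python
import difflib

# Hypothetical title list standing in for all_movies
titles = ["Spider-Man", "Spider-Man 2", "Spider-Man 3", "Superman", "Avatar"]

# Default call: at most 3 matches with a similarity ratio of at least 0.6
default_matches = difflib.get_close_matches("spider-man", titles)
print(default_matches)

# Allow up to 5 matches but require a higher similarity ratio
strict = difflib.get_close_matches("spider-man", titles, n=5, cutoff=0.7)
print(strict)

# Case-insensitive variant: compare lowercased titles, then map back to the originals
lowered = {t.lower(): t for t in titles}
case_insensitive = [
    lowered[m]
    for m in difflib.get_close_matches("spider-man", list(lowered), n=1, cutoff=0.8)
]
print(case_insensitive)
```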

In [15]:
# Printing all the closest matches with respect to the input

close_match = difflib.get_close_matches(movie, all_movies)
print(close_match)

['Spider-Man', 'Spider-Man 3', 'Spider-Man 2']


In [22]:
# This cell does the following:
# 1. Gets the index of each matched movie.
# 2. Gets the similarity scores for the matched movie (i.e. its entire row of the similarity matrix), paired with their indices.
# 3. Sorts the pairs in descending order of score.
# 4. Keeps only the first 5 matches for each matched movie and removes duplicates.
# 5. Finally, prints the list of all recommended movies.

movie_recommend = []

for movie in close_match:

    movie_idx = all_movies.index(movie)
    sim_score = list(enumerate(similarity[movie_idx]))
    simScore_sort = sorted(sim_score, key=lambda x: x[1], reverse=True)

    # Keep the 5 highest-scoring titles for this match
    for idx, _ in simScore_sort[:5]:
        movie_recommend.append(df['original_title'][idx])

movie_recommend = list(set(movie_recommend))

for i in range(len(movie_recommend)):
    print(f"{i+1}. {movie_recommend[i]}")

1. The Good German
2. The Notebook
3. Spider-Man
4. Spider-Man 3
5. Frida
6. The Specials
7. Deadpool
8. Seabiscuit
9. Spider-Man 2
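
The list above includes the matched titles themselves, because every movie has a similarity of 1 with itself, and `set()` discards the ranking order. One possible refinement, sketched here on a hypothetical 4x4 similarity matrix, skips the query movie and keeps the results ranked. The function `recommend` and the names `toy_titles` and `toy_sim` are illustrative and not part of the notebook.

```python
import numpy as np

def recommend(title, titles, sim_matrix, top_n=5):
    """Return up to top_n titles most similar to `title`, best first."""
    idx = titles.index(title)
    # Indices sorted from highest to lowest similarity
    ranked = np.argsort(sim_matrix[idx])[::-1]
    # Skip the query movie itself, then keep the first top_n
    return [titles[j] for j in ranked if j != idx][:top_n]

# Hypothetical titles and similarity scores
toy_titles = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

top_for_a = recommend("A", toy_titles, toy_sim, top_n=2)
print(top_for_a)

# Merging results from several matches while preserving first-seen order
merged = list(dict.fromkeys(top_for_a + recommend("C", toy_titles, toy_sim, top_n=2)))
print(merged)
```

`dict.fromkeys` removes duplicates like `set()` does, but it keeps the order in which titles were first seen.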
