In [3]:
#Content-based nearest neighbor recommendation system
#Nearest-neighbor methods, like cluster analysis, depend on a measure of how alike two objects are.
# Cluster analysis is a common technique in data mining and statistical data analysis.
# Its main goal is to partition a set of objects into clusters such that objects
# in the same cluster are more similar to each other than to objects in other clusters.
# Closeness is measured with a distance, such as Euclidean, Manhattan, or Minkowski distance,
# or with a similarity, such as Pearson correlation or cosine similarity.
# The three distances are sensitive to the scale of the data (Euclidean and Manhattan are the p=2 and p=1 cases of Minkowski).
# The two similarity measures are not: multiplying a vector by a positive constant leaves both unchanged.
# Pearson correlation is cosine similarity computed after subtracting each vector's mean, so it also ignores constant offsets.
#This notebook compares movies by the TF-IDF vectors of their text features, so we use cosine similarity,
# which depends only on the direction of the vectors and not on their length.

#User-based collaborative filtering (background)
#User-based collaborative filtering predicts the items a user might like from the ratings given by similar users.
#The basic idea is that users who rated items similarly in the past share interests and will rate other items similarly.
#Similarity between users (or between items) is commonly computed with cosine similarity or Pearson correlation over rating vectors.
#The data used below contains no per-user ratings, so this notebook instead compares movies by their content using cosine similarity.
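The measures named above can be compared on a few hand-made vectors. The following is a minimal sketch, not part of the original notebook; the helper names `cosine` and `pearson` and the vectors `a`, `b`, `c` are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    return cosine(u - u.mean(), v - v.mean())

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10 * a   # same direction as a, ten times the length
c = a + 5    # same pattern as a, shifted by a constant

print("cosine(a, b)   =", cosine(a, b))           # rescaling keeps the direction
print("pearson(a, c)  =", pearson(a, c))          # a constant shift keeps the correlation
print("cosine(a, c)   =", cosine(a, c))           # the shift does change the angle
print("euclidean(a,b) =", np.linalg.norm(a - b))  # the distance grows with the scale
```

Both similarity measures ignore a positive rescaling, while only Pearson correlation also ignores a constant offset.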



In [4]:
#Data Source: https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Movies%20Recommendation.csv
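#Data Set Information:
#This data set consists of:
#* 4,760 movies, one row per movie, described by 21 columns.
#* Movie metadata only (genre, keywords, tagline, cast, director, budget, revenue, votes, ...); there are no per-user ratings.

#Attribute Information (columns used for the recommendations):
#* Movie_Title
#* Movie_Genre
#* Movie_Keywords
#* Movie_Tagline
#* Movie_Cast
#* Movie_Director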



In [5]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [6]:
#read the data
df = pd.read_csv("https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Movies%20Recommendation.csv")
df.head()

Unnamed: 0,Movie_ID,Movie_Title,Movie_Genre,Movie_Language,Movie_Budget,Movie_Popularity,Movie_Release_Date,Movie_Revenue,Movie_Runtime,Movie_Vote,...,Movie_Homepage,Movie_Keywords,Movie_Overview,Movie_Production_House,Movie_Production_Country,Movie_Spoken_Language,Movie_Tagline,Movie_Cast,Movie_Crew,Movie_Director
0,1,Four Rooms,Crime Comedy,en,4000000,22.87623,09-12-1995,4300000,98.0,6.5,...,,hotel new year's eve witch bet hotel room,It's Ted the Bellhop's first night on the job....,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",Twelve outrageous guests. Four scandalous requ...,Tim Roth Antonio Banderas Jennifer Beals Madon...,"[{'name': 'Allison Anders', 'gender': 1, 'depa...",Allison Anders
1,2,Star Wars,Adventure Action Science Fiction,en,11000000,126.393695,25-05-1977,775398007,121.0,8.1,...,http://www.starwars.com/films/star-wars-episod...,android galaxy hermit death star lightsaber,Princess Leia is captured and held hostage by ...,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","A long time ago in a galaxy far, far away...",Mark Hamill Harrison Ford Carrie Fisher Peter ...,"[{'name': 'George Lucas', 'gender': 2, 'depart...",George Lucas
2,3,Finding Nemo,Animation Family,en,94000000,85.688789,30-05-2003,940335536,100.0,7.6,...,http://movies.disney.com/finding-nemo,father son relationship harbor underwater fish...,"Nemo, an adventurous young clownfish, is unexp...","[{""name"": ""Pixar Animation Studios"", ""id"": 3}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","There are 3.7 trillion fish in the ocean, they...",Albert Brooks Ellen DeGeneres Alexander Gould ...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton
3,4,Forrest Gump,Comedy Drama Romance,en,55000000,138.133331,06-07-1994,677945399,142.0,8.2,...,,vietnam veteran hippie mentally disabled runni...,A man with a low IQ has accomplished great thi...,"[{""name"": ""Paramount Pictures"", ""id"": 4}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]","The world will never be the same, once you've ...",Tom Hanks Robin Wright Gary Sinise Mykelti Wil...,"[{'name': 'Alan Silvestri', 'gender': 2, 'depa...",Robert Zemeckis
4,5,American Beauty,Drama,en,15000000,80.878605,15-09-1999,356296601,122.0,7.9,...,http://www.dreamworks.com/ab/,male nudity female nudity adultery midlife cri...,"Lester Burnham, a depressed suburban father in...","[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...","[{""iso_639_1"": ""en"", ""name"": ""English""}]",Look closer.,Kevin Spacey Annette Bening Thora Birch Wes Be...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes


In [7]:
#describe the data
df.describe()

Unnamed: 0,Movie_ID,Movie_Budget,Movie_Popularity,Movie_Revenue,Movie_Runtime,Movie_Vote,Movie_Vote_Count
count,4760.0,4760.0,4760.0,4760.0,4758.0,4760.0,4760.0
mean,2382.566387,29201290.0,21.59951,82637430.0,107.184111,6.113866,692.508403
std,1377.270159,40756200.0,31.887919,163055400.0,21.960332,1.141294,1235.007337
min,1.0,0.0,0.000372,0.0,0.0,0.0,0.0
25%,1190.75,925750.0,4.807074,0.0,94.0,5.6,55.0
50%,2380.5,15000000.0,13.119058,19447160.0,104.0,6.2,238.0
75%,3572.25,40000000.0,28.411929,93412760.0,118.0,6.8,740.25
max,4788.0,380000000.0,875.581305,2787965000.0,338.0,10.0,13752.0


In [8]:
#data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4760 entries, 0 to 4759
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Movie_ID                  4760 non-null   int64  
 1   Movie_Title               4760 non-null   object 
 2   Movie_Genre               4760 non-null   object 
 3   Movie_Language            4760 non-null   object 
 4   Movie_Budget              4760 non-null   int64  
 5   Movie_Popularity          4760 non-null   float64
 6   Movie_Release_Date        4760 non-null   object 
 7   Movie_Revenue             4760 non-null   int64  
 8   Movie_Runtime             4758 non-null   float64
 9   Movie_Vote                4760 non-null   float64
 10  Movie_Vote_Count          4760 non-null   int64  
 11  Movie_Homepage            1699 non-null   object 
 12  Movie_Keywords            4373 non-null   object 
 13  Movie_Overview            4757 non-null   object 
 14  Movie_Pr

In [9]:
#check data shape
df.shape


(4760, 21)

In [10]:
df.columns

Index(['Movie_ID', 'Movie_Title', 'Movie_Genre', 'Movie_Language',
       'Movie_Budget', 'Movie_Popularity', 'Movie_Release_Date',
       'Movie_Revenue', 'Movie_Runtime', 'Movie_Vote', 'Movie_Vote_Count',
       'Movie_Homepage', 'Movie_Keywords', 'Movie_Overview',
       'Movie_Production_House', 'Movie_Production_Country',
       'Movie_Spoken_Language', 'Movie_Tagline', 'Movie_Cast', 'Movie_Crew',
       'Movie_Director'],
      dtype='object')

In [11]:
features= df[['Movie_Genre','Movie_Keywords','Movie_Tagline','Movie_Cast', 'Movie_Director']]

In [12]:
features

Unnamed: 0,Movie_Genre,Movie_Keywords,Movie_Tagline,Movie_Cast,Movie_Director
0,Crime Comedy,hotel new year's eve witch bet hotel room,Twelve outrageous guests. Four scandalous requ...,Tim Roth Antonio Banderas Jennifer Beals Madon...,Allison Anders
1,Adventure Action Science Fiction,android galaxy hermit death star lightsaber,"A long time ago in a galaxy far, far away...",Mark Hamill Harrison Ford Carrie Fisher Peter ...,George Lucas
2,Animation Family,father son relationship harbor underwater fish...,"There are 3.7 trillion fish in the ocean, they...",Albert Brooks Ellen DeGeneres Alexander Gould ...,Andrew Stanton
3,Comedy Drama Romance,vietnam veteran hippie mentally disabled runni...,"The world will never be the same, once you've ...",Tom Hanks Robin Wright Gary Sinise Mykelti Wil...,Robert Zemeckis
4,Drama,male nudity female nudity adultery midlife cri...,Look closer.,Kevin Spacey Annette Bening Thora Birch Wes Be...,Sam Mendes
...,...,...,...,...,...
4755,Horror,,The hot spot where Satan's waitin'.,Lisa Hart Carroll Michael Des Barres Paul Drak...,Pece Dingo
4756,Comedy Family Drama,,It’s better to stand out than to fit in.,Roni Akurati Brighton Sharbino Jason Lee Anjul...,Frank Lotito
4757,Thriller Drama,christian film sex trafficking,She never knew it could happen to her...,Nicole Smolen Kim Baldwin Ariana Stephens Brys...,Jaco Booyens
4758,Family,,,,


In [13]:
#NaN values
features.isnull().sum()

Movie_Genre         0
Movie_Keywords    387
Movie_Tagline     818
Movie_Cast         27
Movie_Director     22
dtype: int64

In [14]:
features.shape

(4760, 5)

In [17]:
#concatenate the five text features into one document per movie
X = features['Movie_Genre'] + ' '  + features['Movie_Keywords'] + ' ' + features['Movie_Tagline'] + ' ' + features['Movie_Cast'] + ' ' + features['Movie_Director']

In [18]:
X

0       Crime Comedy hotel new year's eve witch bet ho...
1       Adventure Action Science Fiction android galax...
2       Animation Family father son relationship harbo...
3       Comedy Drama Romance vietnam veteran hippie me...
4       Drama male nudity female nudity adultery midli...
                              ...                        
4755                                                  NaN
4756                                                  NaN
4757    Thriller Drama christian film sex trafficking ...
4758                                                  NaN
4759                                                  NaN
Length: 4760, dtype: object

In [19]:
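#handle NaN values: string concatenation propagates NaN, so a row is NaN here
#if any one of its five features was missing, and every such row gets the same placeholder text
X = X.fillna('Unknown')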

In [20]:
X

0       Crime Comedy hotel new year's eve witch bet ho...
1       Adventure Action Science Fiction android galax...
2       Animation Family father son relationship harbo...
3       Comedy Drama Romance vietnam veteran hippie me...
4       Drama male nudity female nudity adultery midli...
                              ...                        
4755                                              Unknown
4756                                              Unknown
4757    Thriller Drama christian film sex trafficking ...
4758                                              Unknown
4759                                              Unknown
Length: 4760, dtype: object
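Rows 4755 and 4756 above do have a genre, tagline, cast and director, yet they end up as 'Unknown' because their keywords are missing. One possible alternative, sketched here on a hypothetical three-row frame (`toy` and its values are made up), is to fill each column with an empty string before concatenating so that the remaining text survives. This is an assumption about what might work better, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that mimics three of the notebook's text columns.
toy = pd.DataFrame({
    "Movie_Genre": ["Drama", "Horror", "Family"],
    "Movie_Keywords": ["midlife crisis", np.nan, np.nan],
    "Movie_Director": ["Sam Mendes", "Pece Dingo", np.nan],
})

# Adding strings propagates NaN: one missing field blanks the whole row.
naive = toy["Movie_Genre"] + " " + toy["Movie_Keywords"] + " " + toy["Movie_Director"]

# Filling each column first keeps whatever text the row does have.
filled = toy.fillna("")
combined = filled["Movie_Genre"] + " " + filled["Movie_Keywords"] + " " + filled["Movie_Director"]
combined = combined.str.split().str.join(" ")  # collapse the doubled spaces

print(naive.isna().tolist())
print(combined.tolist())
```

With this variant, only rows with no usable text at all would produce an empty document.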

In [21]:
#convert the combined text into TF-IDF feature vectors (one row per movie, one column per token)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)


In [23]:
X.shape

(4760, 14759)
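The shape above means 4760 documents over a vocabulary of 14759 tokens. The same mechanics can be seen on a toy corpus. This is a small sketch with made-up documents (`docs` is hypothetical), using `TfidfVectorizer` with its default settings as in the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: two similar documents and one unrelated one.
docs = [
    "space alien battle",
    "space alien rescue",
    "romantic comedy wedding",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

print(matrix.shape)                        # (documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))      # the learned tokens

# Within a document, a token that is rare in the corpus gets more weight
# than a token shared with other documents.
row0 = matrix[0].toarray()[0]
vocab = vectorizer.vocabulary_
print(row0[vocab["battle"]], row0[vocab["space"]])

# With the default settings each row is scaled to unit Euclidean length.
print(np.sqrt((row0 ** 2).sum()))
```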

In [24]:
print(X)

  (0, 533)	0.16400730347966605
  (0, 428)	0.14480710249054038
  (0, 13193)	0.1460072234045736
  (0, 8247)	0.14258556698453415
  (0, 8077)	0.1669712770888808
  (0, 1178)	0.1669712770888808
  (0, 6720)	0.09939253039334757
  (0, 1034)	0.14056628189744372
  (0, 633)	0.13700718336702736
  (0, 11181)	0.14366855395969502
  (0, 13141)	0.10816721996836318
  (0, 7722)	0.0874285167465866
  (0, 9512)	0.06394158650289854
  (0, 14351)	0.1744398940932959
  (0, 4823)	0.08589709371345547
  (0, 14317)	0.10686114739776267
  (0, 6783)	0.13255293007270985
  (0, 13017)	0.09959538243705136
  (0, 9578)	0.0736878087323703
  (0, 3325)	0.11965590860182923
  (0, 4727)	0.11432976440217778
  (0, 6061)	0.19730018458027812
  (0, 6401)	0.14921553081605893
  (0, 1273)	0.19477864361326694
  (0, 7859)	0.15005418398894924
  :	:
  (4757, 6612)	0.282967046407734
  (4757, 1886)	0.282967046407734
  (4757, 702)	0.282967046407734
  (4757, 12126)	0.282967046407734
  (4757, 4987)	0.24256955606784908
  (4757, 13280)	0.242569556067

In [27]:
#compute the pairwise cosine similarity between all movies (a 4760 x 4760 matrix)
from sklearn.metrics.pairwise import cosine_similarity
similarity_score = cosine_similarity(X)


In [28]:
similarity_score

array([[1.        , 0.01383423, 0.03601212, ..., 0.        , 0.        ,
        0.        ],
       [0.01383423, 1.        , 0.00822535, ..., 0.        , 0.        ,
        0.        ],
       [0.03601212, 0.00822535, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        1.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        1.        ]])
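In the matrix above, the last two rows have similarity 1.0 with each other, which is what one would expect for two movies whose text was both replaced by 'Unknown'. The sketch below reproduces that effect on a made-up corpus (`docs` is hypothetical). It also checks that, for TF-IDF rows with the default L2 normalisation, cosine similarity equals the plain dot product computed by `linear_kernel`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical corpus: two related documents and two identical placeholders.
docs = ["space alien battle", "space alien rescue", "Unknown", "Unknown"]
tfidf = TfidfVectorizer().fit_transform(docs)

sim = cosine_similarity(tfidf)   # normalises the rows, then takes dot products
dot = linear_kernel(tfidf)       # dot products only

print(np.round(sim, 3))
print(np.allclose(sim, dot))     # rows are already unit length, so both agree
```

The matrix is symmetric with ones on the diagonal; documents that share no tokens score exactly 0.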

In [31]:
#get movie title as input from the user and validate for closest spelling
movie_title = input("Enter the movie title: ")



In [32]:
All_movie_titles = df['Movie_Title'].tolist()

In [33]:
import difflib
# The difflib module contains tools for computing and working with differences between sequences.

In [38]:
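#find the title closest to the user's input; cutoff=0.0 disables the similarity threshold,
#so even an empty or unrelated input returns some title
movie_recommendation = difflib.get_close_matches(movie_title, All_movie_titles, n=1, cutoff=0.0)
movie_recommendation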

['Æon Flux']

In [39]:
closest_movie_title = movie_recommendation[0]
closest_movie_title

'Æon Flux'
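Because `cutoff=0.0` accepts any candidate, a misspelled or empty input still yields a title. A possible variant, sketched below, keeps a non-zero cutoff and compares in lower case. The helper name `find_title`, the 0.6 threshold (the `difflib` default) and the sample titles are assumptions for illustration, not part of the notebook.

```python
import difflib

def find_title(query, titles, cutoff=0.6):
    """Return the best fuzzy match for query, or None if nothing is close enough."""
    # Compare in lower case so that capitalisation does not count as a difference.
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

sample_titles = ["Avatar", "Aliens", "Gravity"]   # hypothetical subset of titles
print(find_title("avtaar", sample_titles))        # a misspelling of one of the titles
print(find_title("zzzzqq", sample_titles))        # nothing similar in the list
```

Returning `None` lets the caller ask the user to retype the title instead of silently recommending for an unrelated movie.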

In [41]:
index_of_closest_movie_title = All_movie_titles.index(closest_movie_title)
index_of_closest_movie_title

1066

In [42]:
#get the similarity scores between the matched movie and every movie in the dataset
similarity_score_of_closest_movie_title = similarity_score[index_of_closest_movie_title]
similarity_score_of_closest_movie_title

array([0.01957126, 0.02238028, 0.00751991, ..., 0.        , 0.        ,
       0.        ])
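Only one row of the similarity matrix is needed for a single query, while the full 4760 x 4760 matrix of 64-bit floats takes roughly 180 MB. As a sketch under that observation, the row can be computed directly by passing the query's TF-IDF row as the first argument of `cosine_similarity`. The toy corpus `docs` and `query_index` below are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the combined movie text.
docs = ["space alien battle", "space alien rescue", "romantic comedy wedding"]
tfidf = TfidfVectorizer().fit_transform(docs)

query_index = 0
full = cosine_similarity(tfidf)                          # all pairs, shape (3, 3)
row = cosine_similarity(tfidf[query_index], tfidf)[0]    # one query, shape (3,)

print(row)
print(np.allclose(row, full[query_index]))               # same values as the matrix row
```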

In [43]:
len(similarity_score_of_closest_movie_title)

4760

In [47]:
#rank all movies by similarity to the matched movie, highest first; each entry is an (index, score) pair
sorted_similar_movies = sorted(enumerate(similarity_score_of_closest_movie_title), key=lambda x: x[1], reverse=True)

In [49]:
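#print the top 30 similar movies
#the matched movie itself is included (it has similarity 1.0 with itself), so 31 titles are printed
print("Top 30 similar movies to " + closest_movie_title + " are:\n")
for index, score in sorted_similar_movies[:31]:
    print(All_movie_titles[index])


Top 30 similar movies to Æon Flux are: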

Æon Flux
Scoop
Killers
Mad Max: Fury Road
After Earth
The Bourne Identity
Resident Evil: Apocalypse
All Good Things
North Country
The Equalizer
The Great Raid
Superman II
Elektra
In the Valley of Elah
Trainspotting
The Road
Kingdom of Heaven
The Man with the Golden Gun
Jennifer's Body
Serenity
A Guy Thing
The Host
Wonder Boys
Bloodsport
The Mechanic
Minority Report
Transformers: Dark of the Moon
The Green Hornet
Excessive Force
Dream House
Reindeer Games
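The list above starts with the query movie itself. If that entry is not wanted, the ranking can be taken with `numpy.argsort` and the query index dropped before slicing. This is a minimal sketch; the function name `top_n_similar` and the example scores are made up for illustration.

```python
import numpy as np

def top_n_similar(similarity_row, query_index, n=10):
    """Return the indices of the n most similar items, excluding the query itself."""
    order = np.argsort(similarity_row)[::-1]   # indices sorted by score, highest first
    order = order[order != query_index]        # drop the query movie
    return order[:n].tolist()

# Hypothetical similarity row for item 0 (its own score is 1.0).
scores = np.array([1.0, 0.2, 0.7, 0.0, 0.5])
print(top_n_similar(scores, query_index=0, n=3))   # [2, 4, 1]
```

The returned indices can then be mapped to titles in the same way as in the loop above.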


In [51]:
#top 10 recommended movies, all steps in a single cell
movie_name = input("Enter the movie title: ")
list_of_movies = df['Movie_Title'].tolist()
#cutoff=0.0 means the closest title is returned even when the input matches nothing well
find_closest_movie_title = difflib.get_close_matches(movie_name, list_of_movies, n=1, cutoff=0.0)
closest_movie_title = find_closest_movie_title[0]
index_of_closest_movie_title = list_of_movies.index(closest_movie_title)
similarity_score_of_closest_movie_title = similarity_score[index_of_closest_movie_title]
sorted_similar_movies = sorted(enumerate(similarity_score_of_closest_movie_title), key=lambda x: x[1], reverse=True)
print("Top 10 similar movies to "+movie_name+" are:\n")
#the slice keeps 11 entries because the matched movie itself is included in the ranking
for index, score in sorted_similar_movies[:11]:
    print(list_of_movies[index])


Top 10 similar movies to avtaar are:

Avatar
Aliens
Guardians of the Galaxy
Alien
Star Trek Into Darkness
Galaxy Quest
Alien³
Gravity
Moonraker
Jason X
Pocahontas
