Movie Recommendation System:

The dataset for the Movie Recommendation System is obtained from the following link:
https://drive.google.com/file/d/1cCkwiVv4mgfl20ntgY3n4yApcWqqZQe6/view

A movie recommendation system suggests films to a user based on signals such as the content they have previously watched or a film's popularity in their region. There are three main types of recommendation system; this notebook builds the first, a content-based one:

1. Content-Based Recommendation System - Recommends movies based on the content of the films the user has previously watched.
2. Popularity-Based Recommendation System - Recommends movies based on their popularity.
3. Collaborative Recommendation System - Groups users by their viewing patterns and recommends movies to an individual based on what other users with similar patterns have watched.


Workflow:
1. Obtain the data from the link given above
2. Preprocess the data
3. Feature extraction (convert the textual data into feature vectors)
4. Obtain the movie name from the user
5. Cosine similarity - score how similar the movie named by the user is to every other movie in the dataset
6. Extract the list of movies ranked by their similarity

Note (difflib, cosine similarity and TfidfVectorizer):
The difflib library is used to find the closest match between the movie name entered by the user and the titles in the dataset. Users often make typing mistakes when entering a movie name, and difflib makes the lookup tolerant of them.
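
As a minimal sketch of how this matching behaves, the snippet below runs `difflib.get_close_matches` on a short, hand-picked list of titles (the list and the misspelled queries are illustrative assumptions, not the full dataset). The `n` and `cutoff` arguments are part of the standard library function; the values chosen here are only examples.

```python
import difflib

# A small illustrative subset of titles (assumption: not the full dataset).
titles = ["Avatar", "Titanic", "Spectre", "Skyfall"]

# A misspelled query still resolves to the intended title.
default_matches = difflib.get_close_matches("Avatr", titles)
print(default_matches)

# n caps how many matches are returned; cutoff (0 to 1, default 0.6)
# is the minimum similarity ratio a title needs to be returned.
strict_matches = difflib.get_close_matches("Avatr", titles, n=1, cutoff=0.8)
print(strict_matches)

# Matching is case-sensitive, so a lowercase query scores lower
# against a capitalised title than a correctly capitalised one does.
ratio_same_case = difflib.SequenceMatcher(None, "Avatr", "Avatar").ratio()
ratio_lower_case = difflib.SequenceMatcher(None, "avatr", "Avatar").ratio()
print(ratio_same_case > ratio_lower_case)
```

Because the comparison is case-sensitive, a query typed in lowercase can match a different title than the user intended.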

TfidfVectorizer converts the text into numeric feature vectors.
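
The following sketch shows what that conversion looks like on three made-up text snippets (the `docs` list is an assumption for illustration only). It relies on scikit-learn's default settings, under which each row is L2-normalised and a term that appears in more documents receives a lower weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three tiny made-up "combined feature" strings (illustrative assumption).
docs = [
    "action adventure space",
    "action crime spy",
    "romance comedy",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

# vocabulary_ maps each term to its column in the matrix.
vocabulary = vectorizer.vocabulary_
print(sorted(vocabulary))
print(matrix.shape)

# "action" occurs in two documents and "space" in one, so within the first
# document "action" gets the smaller TF-IDF weight.
first_row = matrix[0].toarray()[0]
action_weight = first_row[vocabulary["action"]]
space_weight = first_row[vocabulary["space"]]
print(action_weight < space_weight)
```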

Cosine similarity measures how similar two movies are by comparing their feature vectors.

# Import the dependencies:

In [1]:
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Data collection and preprocessing:

In [2]:
df = pd.read_csv(r"C:\Users\17347\Sapna's Projects\Movie Recommendation System\movies.csv")

In [3]:
print(df.head())

   index     budget                                    genres  \
0      0  237000000  Action Adventure Fantasy Science Fiction   
1      1  300000000                  Adventure Fantasy Action   
2      2  245000000                    Action Adventure Crime   
3      3  250000000               Action Crime Drama Thriller   
4      4  260000000          Action Adventure Science Fiction   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  culture clash future space war space colony so...                en   
1  ocean drug abuse exotic island east india trad...                en   
2         spy based on novel sec

In [4]:
df.shape

(4803, 24)

# Selecting the relevant features for recommendation:

In [5]:
# To build the content-based recommendation system, use the following columns
# (you can also choose your own): genres, keywords, tagline, cast and director
selected_features = ['genres','keywords','tagline','cast','director']
print(selected_features)

['genres', 'keywords', 'tagline', 'cast', 'director']


In [6]:
# replace null values with an empty string so the columns can be concatenated
for feature in selected_features:
    df[feature] = df[feature].fillna('')

In [7]:
# combine the 5 selected features into a single string per movie
combined_features = df['genres'] + ' ' + df['keywords'] + ' ' + df['tagline'] + ' ' + df['cast'] + ' ' + df['director']
print(combined_features)

0       Action Adventure Fantasy Science Fiction cultu...
1       Adventure Fantasy Action ocean drug abuse exot...
2       Action Adventure Crime spy based on novel secr...
3       Action Crime Drama Thriller dc comics crime fi...
4       Action Adventure Science Fiction based on nove...
                              ...                        
4798    Action Crime Thriller united states\u2013mexic...
4799    Comedy Romance  A newlywed couple's honeymoon ...
4800    Comedy Drama Romance TV Movie date love at fir...
4801      A New Yorker in Shanghai Daniel Henney Eliza...
4802    Documentary obsession camcorder crush dream gi...
Length: 4803, dtype: object
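
The same fill-and-combine step can also be written without naming each column by hand. The sketch below is one possible alternative, shown on a two-row toy frame (the frame `toy` and its contents are assumptions for illustration); it uses only `fillna` and `agg`, both standard pandas methods.

```python
import pandas as pd

# Toy frame with one missing value (illustrative assumption).
toy = pd.DataFrame({
    "genres": ["Action", None],
    "tagline": ["Enter the world", "Yo ho"],
})
cols = ["genres", "tagline"]

# Fill missing values with '' and join the selected columns row by row.
combined = toy[cols].fillna("").agg(" ".join, axis=1)
print(combined.tolist())
```

A missing value leaves a stray space in the combined string, as in the notebook's own output. This is harmless here because TfidfVectorizer tokenises on word boundaries.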


In [8]:
# convert the text data into numeric feature vectors
vectorizer = TfidfVectorizer()
feature_vectors = vectorizer.fit_transform(combined_features)
print(feature_vectors)

  (0, 2432)	0.17272411194153
  (0, 7755)	0.1128035714854756
  (0, 13024)	0.1942362060108871
  (0, 10229)	0.16058685400095302
  (0, 8756)	0.22709015857011816
  (0, 14608)	0.15150672398763912
  (0, 16668)	0.19843263965100372
  (0, 14064)	0.20596090415084142
  (0, 13319)	0.2177470539412484
  (0, 17290)	0.20197912553916567
  (0, 17007)	0.23643326319898797
  (0, 13349)	0.15021264094167086
  (0, 11503)	0.27211310056983656
  (0, 11192)	0.09049319826481456
  (0, 16998)	0.1282126322850579
  (0, 15261)	0.07095833561276566
  (0, 4945)	0.24025852494110758
  (0, 14271)	0.21392179219912877
  (0, 3225)	0.24960162956997736
  (0, 16587)	0.12549432354918996
  (0, 14378)	0.33962752210959823
  (0, 5836)	0.1646750903586285
  (0, 3065)	0.22208377802661425
  (0, 3678)	0.21392179219912877
  (0, 5437)	0.1036413987316636
  :	:
  (4801, 17266)	0.2886098184932947
  (4801, 4835)	0.24713765026963996
  (4801, 403)	0.17727585190343226
  (4801, 6935)	0.2886098184932947
  (4801, 11663)	0.21557500762727902
  (4801, 1672

# Cosine Similarity Algorithm:


The similarity score for each pair of movies is obtained using cosine similarity. It compares the feature vectors obtained in the previous step and returns one score for every pair of movies. The closer the score is to 1, the more alike the two movies' text features are; a score of 0 means the two movies share no terms.
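
For two vectors a and b, cosine similarity is the dot product divided by the product of their lengths: cos(a, b) = (a · b) / (|a| |b|). The sketch below checks this by hand on two made-up vectors (the values are assumptions chosen so the arithmetic is easy to follow) and compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up feature vectors that share exactly one non-zero component.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

# By hand: a . b = 1 and |a| = |b| = sqrt(2), so the score is 1 / 2.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity takes a 2-D array and returns all pairwise scores.
pairwise = cosine_similarity(np.vstack([a, b]))

print(manual)
print(pairwise)
```

The returned matrix is symmetric with ones on the diagonal, which is the same structure as the 4803 x 4803 matrix computed in the notebook.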

In [9]:
similarity = cosine_similarity(feature_vectors)

In [10]:
print(similarity)

[[1.         0.07219487 0.037733   ... 0.         0.         0.        ]
 [0.07219487 1.         0.03281499 ... 0.03575545 0.         0.        ]
 [0.037733   0.03281499 1.         ... 0.         0.05389661 0.        ]
 ...
 [0.         0.03575545 0.         ... 1.         0.         0.02651502]
 [0.         0.         0.05389661 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.02651502 0.         1.        ]]


In [11]:
print(similarity.shape)

(4803, 4803)


# Obtain the movie name from the user:

In [12]:
# getting the movie name from the user:
movie_name = input('Enter your favourite movie name:')

Enter your favourite movie name:alone


In [13]:
# creating a list with all the movie name (title) in the dataset:
 
title_list = df['title'].tolist()
print(title_list)

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre', 'The Dark Knight Rises', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz: The Great and Powerful', 'The Amazing Spider-Man 2', 'TRON: Legacy', 'Cars 2', 'Green Lant

# Find the close match for the movie name given by the user using difflib:

In [14]:
find_close_match = difflib.get_close_matches(movie_name, title_list)
print(find_close_match)

['Malone', 'Salton Sea', 'Coraline']


In [15]:
close_match = find_close_match[0]
print(close_match)

Malone
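
Taking element `[0]` raises an `IndexError` when `get_close_matches` returns an empty list, which happens when no title is similar enough to the query. The helper below is a hypothetical sketch (the name `best_title_match` and the choice to ignore case are assumptions, not part of the notebook) that returns `None` in that case.

```python
import difflib

def best_title_match(query, titles, cutoff=0.6):
    """Return the title closest to query, ignoring case, or None if nothing is close."""
    # Map lowercase titles back to their original spelling.
    # Titles that differ only by case collapse to a single entry.
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

sample_titles = ["Avatar", "Titanic", "Spectre"]   # illustrative subset
print(best_title_match("avatr", sample_titles))
print(best_title_match("zzzz", sample_titles))
```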


In [16]:
#finding the index of the movie based on the title
index_of_the_movie = df[df.title == close_match]['index'].values[0]
print(index_of_the_movie)

3077


In [17]:
# pair every movie's position in the dataset with its similarity score against the chosen movie
similarity_score = list(enumerate(similarity[index_of_the_movie]))
# each tuple is (index of the movie, similarity score against the chosen movie)
print(similarity_score)


[(0, 0.00608642604026032), (1, 0.005312245200578919), (2, 0.006227192698954829), (3, 0.013135759084254022), (4, 0.005926642213598285), (5, 0.0063050608631907154), (6, 0.0), (7, 0.006143457642902958), (8, 0.0), (9, 0.006002160476303311), (10, 0.006845828085335102), (11, 0.010450846816113878), (12, 0.006214551770679437), (13, 0.005610876926577836), (14, 0.006046766033891668), (15, 0.0), (16, 0.00582530375993212), (17, 0.00594494022315706), (18, 0.005815099040511541), (19, 0.005278142353664474), (20, 0.023394698279387102), (21, 0.019715717296076997), (22, 0.0), (23, 0.0), (24, 0.005741959745335327), (25, 0.005716767957722395), (26, 0.0063405825613225475), (27, 0.010716756271241027), (28, 0.01176573370462956), (29, 0.01230870938569693), (30, 0.006134201618449729), (31, 0.005981421343218528), (32, 0.0), (33, 0.011800825612193755), (34, 0.0), (35, 0.0307349883323347), (36, 0.0057750414792079865), (37, 0.0), (38, 0.023975523604396566), (39, 0.005731971391446183), (40, 0.024477060135299718), (

In [18]:
len(similarity_score)
#4803 is the actual number of rows in the dataset

4803

# Sorting the movies based on their similarity score:

In [19]:
# x[1] in the following line indicates the similarity score
sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True)
print(sorted_similar_movies)

[(3077, 1.0), (1797, 0.18433980357205407), (1626, 0.17327353918869892), (737, 0.15388477707627635), (1104, 0.14306956897491618), (2043, 0.13737535219564528), (4194, 0.13221826614875076), (4378, 0.1236017830886415), (1429, 0.12315032341168729), (893, 0.12238078710763897), (895, 0.12231876296997705), (1528, 0.12107390368848016), (2001, 0.118786271472509), (1164, 0.11624552243845156), (976, 0.11393675158928494), (1848, 0.11385962898537333), (594, 0.11160370522890074), (455, 0.10867021859034458), (392, 0.1080814299109485), (1193, 0.10666830043932315), (213, 0.10480469252963724), (1249, 0.10362817577988048), (261, 0.10341402364219485), (3764, 0.10200792625165081), (974, 0.10156572810326463), (2870, 0.0991020296673514), (553, 0.09903052186734543), (3065, 0.09894811482670646), (1417, 0.09829277526902895), (888, 0.09749417151836967), (2083, 0.09591087170915036), (159, 0.09545910274922952), (1245, 0.09380663140124088), (1701, 0.09354387723416992), (724, 0.09320072632669284), (2398, 0.0906501044
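
The top entry of the sorted list is always the chosen movie itself, with a score of 1.0. As an alternative sketch, the ranking can be done with `numpy.argsort` while skipping the chosen movie. The `scores` array below is a made-up similarity row, and `query_index` is an assumed position, both for illustration only.

```python
import numpy as np

# Made-up similarity row for the movie at position 1 (illustrative assumption).
scores = np.array([0.10, 1.00, 0.35, 0.00, 0.80])
query_index = 1

# argsort returns positions in ascending score order; reverse for descending.
order = np.argsort(scores)[::-1]

# Drop the chosen movie itself and keep the three best remaining positions.
top_positions = [int(i) for i in order if i != query_index][:3]
print(top_positions)  # [4, 2, 0]
```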

# Print the title of the movie based on their Index value:

In [20]:
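print("Movies Recommended for you:")
# print only the 40 highest-scoring movies; the first one is the chosen movie itself
for i, (index, score) in enumerate(sorted_similar_movies[:40], start=1):
    # df.index is the DataFrame's row label, which matches the enumerate() position here
    title_from_index = df[df.index == index]['title'].values[0]
    print(i, '.', title_from_index)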
        

Movies Recommended for you:
1 . Malone
2 . Agent Cody Banks 2: Destination London
3 . My Super Ex-Girlfriend
4 . Jack Ryan: Shadow Recruit
5 . The Bounty Hunter
6 . Homefront
7 . In the Bedroom
8 . Wristcutters: A Love Story
9 . Broken City
10 . From Paris with Love
11 . Me, Myself & Irene
12 . Criminal
13 . The Crew
14 . Scream 3
15 . Escape from L.A.
16 . Agent Cody Banks
17 . Bad Company
18 . Conspiracy Theory
19 . Safe House
20 . The Count of Monte Cristo
21 . Mission: Impossible II
22 . Torque
23 . Live Free or Die Hard
24 . Topaz
25 . Blood Work
26 . Celebrity
27 . The Kingdom
28 . Heartbeeps
29 . Alex Cross
30 . The Dukes of Hazzard
31 . The Best Little Whorehouse in Texas
32 . Spider-Man
33 . Colombiana
34 . Once Upon a Time in Mexico
35 . Man on Fire
36 . Hitman
37 . Spy Game
38 . In the Name of the King: A Dungeon Siege Tale
39 . Meet the Parents
40 . Charly


# Movie Recommendation System:


In [21]:
movie_name = input(' Enter your favourite movie name : ')
title_list = df['title'].tolist()
# closest titles to the user's input, best match first
find_close_match = difflib.get_close_matches(movie_name, title_list)
close_match = find_close_match[0]
index_of_the_movie = df[df.title == close_match]['index'].values[0]
# (index, score) pairs for every movie, sorted from most to least similar
similarity_score = list(enumerate(similarity[index_of_the_movie]))
sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True)
print('Movies Recommended for you: \n')
# print only the 40 highest-scoring movies; the first one is the matched movie itself
for i, (index, score) in enumerate(sorted_similar_movies[:40], start=1):
    title_from_index = df[df.index == index]['title'].values[0]
    print(i, '.', title_from_index)

 Enter your favourite movie name : lonely
Movies Recommended for you: 

1 . Honey
2 . 8 Mile
3 . London
4 . B-Girl
5 . Footloose
6 . The Skeleton Key
7 . Once Upon a Time in the West
8 . Glitter
9 . Beauty Shop
10 . Hustle & Flow
11 . Rize
12 . Tupac: Resurrection
13 . You Got Served
14 . Dance Flick
15 . Becoming Jane
16 . Soul Food
17 . The Perfect Match
18 . The Good Girl
19 . O
20 . Flashdance
21 . Four Christmases
22 . Slow Burn
23 . The Nutcracker
24 . Straight Outta Compton
25 . Antwone Fisher
26 . In Her Line of Fire
27 . Hairspray
28 . That Thing You Do!
29 . Addicted
30 . Standing Ovation
31 . Mad Hot Ballroom
32 . Impostor
33 . The Work and The Story
34 . In Dreams
35 . Step Up
36 . Raise Your Voice
37 . Top Hat
38 . I Still Know What You Did Last Summer
39 . Get Rich or Die Tryin'
40 . Cat on a Hot Tin Roof
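
The cell above can also be wrapped in a reusable function. The sketch below is one possible shape for it and is not the notebook's own code: the name `recommend`, the `top_n` parameter, the decision to leave the matched movie out of the results, and the three-row toy frame are all assumptions made for illustration.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(movie_name, df, similarity, top_n=10):
    """Return up to top_n titles most similar to the closest match for movie_name."""
    titles = df["title"].tolist()
    matches = difflib.get_close_matches(movie_name, titles, n=1)
    if not matches:
        return []                      # nothing close enough to the input
    position = titles.index(matches[0])   # row position of the matched title
    ranked = sorted(enumerate(similarity[position]), key=lambda x: x[1], reverse=True)
    # Skip the matched movie itself, then keep the best top_n titles.
    return [titles[i] for i, _ in ranked if i != position][:top_n]

# Toy data standing in for movies.csv (hypothetical titles and text).
toy = pd.DataFrame({
    "title": ["Space War", "Star Raiders", "Paris Romance"],
    "text": ["space war alien", "space alien pilot", "love paris romance"],
})
toy_similarity = cosine_similarity(TfidfVectorizer().fit_transform(toy["text"]))

recommendations = recommend("Space Warr", toy, toy_similarity, top_n=2)
print(recommendations)
```

With the notebook's own objects, the call would be `recommend(movie_name, df, similarity, top_n=40)`, assuming `df` keeps its default row order so that row positions line up with the rows of `similarity`.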
