# Movie Recommender System

The system generates movie predictions for its users; the items being recommended are the movies themselves. The primary goal of a movie recommendation system is to filter the catalogue down to the movies a given user is most likely to want to watch. The ML algorithms behind such systems use the data stored about that user in the system's database, predicting the user's future behavior from their past behavior.

## Filtration Strategies for Movie Recommendation Systems

- Popularity-Based Filtering

- Collaborative Filtering

The machine learning algorithm aims to discover patterns in user preferences that can be used to make recommendations. One common approach is matrix factorization. It starts from a large table, much like a spreadsheet, with users listed along one side and movies along the other. Each cell shows how a user rated a particular movie, if at all. This notebook builds such a table and compares movies by the cosine similarity of their rating vectors rather than factorizing the table.
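The spreadsheet picture can be sketched on a toy example. The ratings below are made up for illustration and are not taken from the dataset; only the column names follow the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical toy ratings in long format: one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "title":   ["A", "B", "A", "C", "B"],
    "rating":  [5, 3, 4, 2, 1],
})

# Reshape to a user x movie grid; cells with no rating become NaN
user_movie = toy_ratings.pivot_table(index="user_id", columns="title", values="rating")
print(user_movie)
```

Most cells of a real user-by-movie grid are empty, which is why the later steps restrict the data to frequent raters and frequently rated movies.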

In [46]:
# Import all necessary libraries

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [47]:
# Import dataset

data_set = pd.read_csv("Dataset.csv")
movie_title = pd.read_csv("Movie_Id_Titles.csv")

In [48]:
# Display the first five rows of Dataset.csv

data_set.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [49]:
# Display the first five rows of the Movie_Id_Titles.csv

movie_title.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


## Data Pre-processing

In [50]:
print(data_set.shape)
print(movie_title.shape)

(100003, 4)
(1682, 2)


In [51]:
# Check for missing values in data_set
data_set.isnull().sum()

user_id      0
item_id      0
rating       0
timestamp    0
dtype: int64

In [52]:
# duplicate values in data_set
data_set.duplicated().sum()

0

In [53]:
# Check for missing values in movie_title
movie_title.isnull().sum()

item_id    0
title      0
dtype: int64

In [54]:
# duplicate values in movie_title
movie_title.duplicated().sum()

0

This system attempts to produce recommendations tailored to each user, addressing the generic results that come from ignoring user-specific data.
Systems of this kind may collect the user's psychological profile, their watching history and movie scores from other websites.
Recommendations are then based on an aggregate similarity calculation.
In this dataset, item_id identifies a specific movie and user_id identifies the user who rated it.
Each rating is an integer from 1 to 5. The timestamp column holds integers such as 881250949, which look like Unix epoch seconds (seconds elapsed since 1970-01-01 00:00:00 UTC) rather than a formatted date and time with an offset from Greenwich Mean Time or a trailing Z.
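The integer timestamps in this dataset can be turned into readable dates. The sketch below assumes the values are Unix epoch seconds, which is consistent with the numbers shown in `data_set.head()` but is not stated by the source.

```python
import pandas as pd

# First timestamp value shown in data_set.head()
raw = 881250949

# Assumption: the integer counts seconds since 1970-01-01 00:00:00 UTC
ts = pd.to_datetime(raw, unit="s", utc=True)
print(ts)
```

The same call works on a whole column, for example `pd.to_datetime(data_set['timestamp'], unit='s')`, if the dates are ever needed before the column is dropped.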

In [55]:
# dropna() returns a copy without the rows that contain NaN (missing) values, so assign the result back
data_set = data_set.dropna()
data_set

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742
...,...,...,...,...
99998,880,476,3,880175444
99999,716,204,5,879795543
100000,276,1090,1,874795795
100001,13,225,2,882399156


The dataset still has 100003 rows and 4 columns to preprocess.

In [56]:
# Count of non-null values per column after dropping NA values
print(data_set.count())

user_id      100003
item_id      100003
rating       100003
timestamp    100003
dtype: int64


In [57]:
# we don't need 'timestamp' column for our system so drop it
data_set = data_set.drop('timestamp', axis=1)

## Popularity Based Recommender System

In [58]:
# Merging Movie_Id_Titles.csv to Dataset.csv on the basis of item_id
data_with_title = data_set.merge(movie_title, on = 'item_id')
data_with_title

Unnamed: 0,user_id,item_id,rating,title
0,0,50,5,Star Wars (1977)
1,290,50,5,Star Wars (1977)
2,79,50,4,Star Wars (1977)
3,2,50,5,Star Wars (1977)
4,8,50,5,Star Wars (1977)
...,...,...,...,...
99998,840,1674,4,Mamma Roma (1962)
99999,655,1640,3,"Eighth Day, The (1996)"
100000,655,1637,3,Girls Town (1996)
100001,655,1630,3,"Silence of the Palace, The (Saimt el Qusur) (1..."


In [59]:
# Count how many ratings each movie title received, using groupby() and count()
movie_title_rating = data_with_title.groupby('title')['rating'].count().reset_index()
movie_title_rating.rename(columns={'rating':'num_ratings'},inplace=True)
movie_title_rating

Unnamed: 0,title,num_ratings
0,'Til There Was You (1997),9
1,1-900 (1994),5
2,101 Dalmatians (1996),109
3,12 Angry Men (1957),125
4,187 (1997),41
...,...,...
1659,Young Guns II (1990),44
1660,"Young Poisoner's Handbook, The (1995)",41
1661,Zeus and Roxanne (1997),6
1662,unknown,9


In [60]:
# Average rating for each movie title (select 'rating' before mean() so only that column is averaged)
avg_movie_rating = data_with_title.groupby('title')['rating'].mean().reset_index()
avg_movie_rating.rename(columns={'rating':'avg_rating'},inplace=True)
avg_movie_rating

Unnamed: 0,title,avg_rating
0,'Til There Was You (1997),2.333333
1,1-900 (1994),2.600000
2,101 Dalmatians (1996),2.908257
3,12 Angry Men (1957),4.344000
4,187 (1997),3.024390
...,...,...
1659,Young Guns II (1990),2.772727
1660,"Young Poisoner's Handbook, The (1995)",3.341463
1661,Zeus and Roxanne (1997),2.166667
1662,unknown,3.444444


In [61]:
# merging number of ratings and average rating
popularity_dataset = movie_title_rating.merge(avg_movie_rating,on='title')
popularity_dataset

Unnamed: 0,title,num_ratings,avg_rating
0,'Til There Was You (1997),9,2.333333
1,1-900 (1994),5,2.600000
2,101 Dalmatians (1996),109,2.908257
3,12 Angry Men (1957),125,4.344000
4,187 (1997),41,3.024390
...,...,...,...
1659,Young Guns II (1990),44,2.772727
1660,"Young Poisoner's Handbook, The (1995)",41,3.341463
1661,Zeus and Roxanne (1997),6,2.166667
1662,unknown,9,3.444444


In [62]:
# keep movies with at least 250 ratings, sort popularity_dataset by average rating in descending order and take the top 20
popular = popularity_dataset[popularity_dataset['num_ratings']>=250].sort_values('avg_rating',ascending=False).head(20)
popular

Unnamed: 0,title,num_ratings,avg_rating
1281,Schindler's List (1993),298,4.466443
1317,"Shawshank Redemption, The (1994)",283,4.44523
1572,"Usual Suspects, The (1995)",267,4.385768
1398,Star Wars (1977),584,4.359589
1102,One Flew Over the Cuckoo's Nest (1975),264,4.291667
1329,"Silence of the Lambs, The (1991)",390,4.289744
612,"Godfather, The (1972)",413,4.283293
1205,Raiders of the Lost Ark (1981),420,4.252381
1500,Titanic (1997),350,4.245714
456,"Empire Strikes Back, The (1980)",368,4.206522
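The count, average, threshold and sort steps of this section can also be wrapped in a single helper. This is a minimal sketch, not the notebook's own code: the function name `popularity_table` and the toy data are hypothetical, and it uses pandas named aggregation to compute both statistics in one pass.

```python
import pandas as pd

def popularity_table(ratings, min_ratings):
    """Return titles with at least min_ratings ratings, best average first."""
    stats = (
        ratings.groupby("title")["rating"]
        .agg(num_ratings="count", avg_rating="mean")  # one pass, two statistics
        .reset_index()
    )
    enough = stats[stats["num_ratings"] >= min_ratings]
    return enough.sort_values("avg_rating", ascending=False)

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 3, 5, 5, 1],
})
top = popularity_table(toy, min_ratings=2)
print(top)
```

On the real data the equivalent call would be `popularity_table(data_with_title, min_ratings=250).head(20)`. The threshold matters: without it, a movie with a single 5-star rating would outrank every widely rated title.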


## Collaborative Filtering Based Recommender System


In [63]:
# get the user_ids that gave more than 200 ratings for movies
max_users = data_with_title.groupby('user_id').count()['rating']>200
expert_users = max_users[max_users].index

In [64]:
# we get the filtered data of ratings according to user_id column showing experienced users
filtered_rating = data_with_title[data_with_title['user_id'].isin(expert_users)]
filtered_rating

Unnamed: 0,user_id,item_id,rating,title
8,305,50,5,Star Wars (1977)
11,234,50,4,Star Wars (1977)
16,145,50,5,Star Wars (1977)
19,271,50,5,Star Wars (1977)
27,130,50,5,Star Wars (1977)
...,...,...,...,...
99997,916,1682,3,Scream of Stone (Schrei aus Stein) (1991)
99999,655,1640,3,"Eighth Day, The (1996)"
100000,655,1637,3,Girls Town (1996)
100001,655,1630,3,"Silence of the Palace, The (Saimt el Qusur) (1..."


In [65]:
# get the movies titles having minimum 50 ratings given for them
max_rates = filtered_rating.groupby('title').count()['rating']>=50
popular_movies = max_rates[max_rates].index

In [66]:
# we get the final filtered rating according to title column showing popular and most-frequently rated movies
final_rating = filtered_rating[filtered_rating['title'].isin(popular_movies)]
final_rating.drop_duplicates()

Unnamed: 0,user_id,item_id,rating,title
8,305,50,5,Star Wars (1977)
11,234,50,4,Star Wars (1977)
16,145,50,5,Star Wars (1977)
19,271,50,5,Star Wars (1977)
27,130,50,5,Star Wars (1977)
...,...,...,...,...
94261,758,1074,1,Reality Bites (1994)
94263,429,1074,3,Reality Bites (1994)
94264,870,1074,2,Reality Bites (1994)
94265,916,1074,3,Reality Bites (1994)
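The two filtering passes (active users first, then frequently rated movies) can be sketched as one function. The name `filter_ratings`, its parameters and the toy data are hypothetical; the comparison operators mirror the notebook, which uses a strict `>` for users and `>=` for movies.

```python
import pandas as pd

def filter_ratings(ratings, min_user_ratings, min_movie_ratings):
    """Keep active users first, then movies still rated often enough."""
    # users with strictly more than min_user_ratings ratings
    user_counts = ratings.groupby("user_id")["rating"].count()
    active_users = user_counts[user_counts > min_user_ratings].index
    by_active = ratings[ratings["user_id"].isin(active_users)]

    # movies with at least min_movie_ratings ratings among those users
    movie_counts = by_active.groupby("title")["rating"].count()
    popular = movie_counts[movie_counts >= min_movie_ratings].index
    return by_active[by_active["title"].isin(popular)]

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "title":   ["A", "B", "C", "A", "B", "D", "A"],
    "rating":  [5, 4, 3, 4, 2, 1, 5],
})
kept = filter_ratings(toy, min_user_ratings=2, min_movie_ratings=2)
print(kept)
```

The order of the two passes matters: movie counts are taken after inactive users are removed, so a movie must be popular among the frequent raters specifically.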


In [67]:
# build a pivot table with movie titles as rows, user_ids as columns and ratings as values
pivot = final_rating.pivot_table(index = 'title', columns = 'user_id', values = 'rating')

In [68]:
# replace the NaN values in dataset with 0
pivot.fillna(0, inplace = True)
pivot

title \ user_id,1,6,7,13,18,43,49,59,60,62,...,881,883,886,889,892,894,896,916,919,932
12 Angry Men (1957),5.0,4.0,4.0,4.0,3.0,0.0,0.0,0.0,5.0,0.0,...,3.0,0.0,5.0,5.0,5.0,0.0,0.0,0.0,0.0,5.0
2001: A Space Odyssey (1968),4.0,5.0,5.0,5.0,3.0,0.0,0.0,5.0,5.0,4.0,...,4.0,4.0,0.0,2.0,5.0,0.0,3.0,4.0,0.0,5.0
Absolute Power (1997),0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,3.0,0.0,0.0,3.0,4.0,0.0,3.0,0.0,0.0,0.0
"Abyss, The (1989)",3.0,0.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,4.0,4.0,0.0,0.0,4.0,4.0,0.0,0.0
Ace Ventura: Pet Detective (1994),3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,2.0,4.0,0.0,2.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Willy Wonka and the Chocolate Factory (1971),4.0,3.0,4.0,0.0,3.0,4.0,5.0,5.0,5.0,5.0,...,2.0,5.0,0.0,3.0,4.0,0.0,0.0,3.0,4.0,3.0
"Wizard of Oz, The (1939)",4.0,5.0,5.0,4.0,5.0,0.0,0.0,5.0,4.0,5.0,...,3.0,0.0,3.0,4.0,5.0,0.0,3.0,3.0,0.0,0.0
"Wrong Trousers, The (1993)",5.0,4.0,0.0,0.0,5.0,5.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0
Young Frankenstein (1974),5.0,4.0,5.0,5.0,4.0,5.0,4.0,5.0,5.0,0.0,...,3.0,4.0,3.0,4.0,4.0,0.0,0.0,0.0,0.0,5.0


In [69]:
# cosine_similarity compares every movie's rating vector (a row of pivot) with itself and with every other movie's vector
similarity_scores = cosine_similarity(pivot)
similarity_scores

array([[1.        , 0.57201464, 0.30440939, ..., 0.44045842, 0.5805019 ,
        0.44589749],
       [0.57201464, 1.        , 0.40511415, ..., 0.51583958, 0.74116004,
        0.49172168],
       [0.30440939, 0.40511415, 1.        , ..., 0.31155058, 0.41627534,
        0.38022979],
       ...,
       [0.44045842, 0.51583958, 0.31155058, ..., 1.        , 0.58668426,
        0.32139804],
       [0.5805019 , 0.74116004, 0.41627534, ..., 0.58668426, 1.        ,
        0.50865449],
       [0.44589749, 0.49172168, 0.38022979, ..., 0.32139804, 0.50865449,
        1.        ]])

In [70]:
# get the dimension of the above 2d numpy array
similarity_scores.shape

(322, 322)
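To see what each entry of this matrix means, the cosine similarity of two rating vectors a and b is their dot product divided by the product of their lengths, a·b / (‖a‖‖b‖). The sketch below computes the same quantity with NumPy alone on made-up data; `cosine_matrix` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors

# Hypothetical ratings: 3 movies (rows) x 4 users (columns), 0 = not rated
toy = np.array([
    [5, 4, 0, 0],
    [5, 4, 0, 0],
    [0, 0, 3, 0],
])
sim = cosine_matrix(toy)
print(sim)
```

Movies rated identically by the same users score 1, and movies with no raters in common score 0. Because unrated cells were filled with 0, a missing rating counts as a zero in the vector rather than being skipped.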

In [71]:
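# define a recommend function that returns the five movies most similar to movie_name
def recommend(movie_name):
    # row position of the requested movie in the pivot table
    index = np.where(pivot.index == movie_name)[0][0]
    distances = similarity_scores[index]
    # sort by similarity, highest first, and skip the first entry (the movie itself)
    similar_items = sorted(list(enumerate(distances)), key = lambda x:x[1], reverse = True)[1:6]
    suggestions = [pivot.index[i[0]] for i in similar_items]
    return suggestions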

In [72]:
# fetch the index of the movie name
np.where(pivot.index == '2001: A Space Odyssey (1968)')[0][0]

1

In [73]:
# we get the index for last movie name
np.where(pivot.index == 'Young Guns (1988)')[0][0]

321

In [74]:
# list() and enumerate() pair each movie index with the similarity score between the 0th movie and that movie (including itself)
list(enumerate(similarity_scores[0]))

[(0, 0.999999999999999),
 (1, 0.5720146376705479),
 (2, 0.30440939091174773),
 (3, 0.39587227587755486),
 (4, 0.3779518465809805),
 (5, 0.322770880344456),
 (6, 0.44024027773598184),
 (7, 0.595275893008731),
 (8, 0.4054281882784444),
 (9, 0.5286636144221072),
 (10, 0.3076104419711055),
 (11, 0.6032327415712364),
 (12, 0.3067275484099103),
 (13, 0.5680463236583655),
 (14, 0.6300416126912085),
 (15, 0.4544020868992733),
 (16, 0.39393288270588794),
 (17, 0.5444842588370484),
 (18, 0.6392262886886917),
 (19, 0.6017731182712491),
 (20, 0.4721330898277884),
 (21, 0.5743031206561253),
 (22, 0.2872859147168937),
 (23, 0.5001631000743151),
 (24, 0.6398808242968682),
 (25, 0.33190954957283636),
 (26, 0.4823760999753519),
 (27, 0.33090958708624896),
 (28, 0.4266328699252418),
 (29, 0.46974537115112797),
 (30, 0.22219232503985556),
 (31, 0.5624198427371025),
 (32, 0.5656859782146835),
 (33, 0.3921055944009202),
 (34, 0.3521552030427022),
 (35, 0.4979937342047609),
 (36, 0.48190048528165996),
 (37,

In [75]:
# sort this list according to the similarity scores in descending order of similarity
similar_items = sorted(list(enumerate(similarity_scores[0])), key = lambda x:x[1], reverse = True)[1:6]

In [76]:
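def recommend(movie_name):
    # look up the row of the requested movie instead of a hard-coded title
    index = np.where(pivot.index == movie_name)[0][0]
    # rank all movies by similarity to that row and skip the first entry (the movie itself)
    similar_items = sorted(list(enumerate(similarity_scores[index])), key = lambda x:x[1], reverse = True)[1:6]

    for i in similar_items:
      print(pivot.index[i[0]])

In [77]:
# '12 Angry Men (1957)' is row 0 of pivot, the row the output below was computed from
recommend('12 Angry Men (1957)')

Sting, The (1973)
North by Northwest (1959)
Bridge on the River Kwai, The (1957)
Raiders of the Lost Ark (1981)
Mr. Smith Goes to Washington (1939)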


In [78]:
pivot.index[315]

'While You Were Sleeping (1995)'
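The lookup-and-rank step behind `recommend` can also be written so that it returns a list and copes with titles that are not in the table. This is a sketch under stated assumptions: `recommend_titles`, its `k` parameter and the toy matrix are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def recommend_titles(movie_name, pivot, similarity_scores, k=5):
    """Return up to k titles most similar to movie_name, or [] if unknown."""
    if movie_name not in pivot.index:
        return []
    idx = pivot.index.get_loc(movie_name)
    order = np.argsort(similarity_scores[idx])[::-1]  # most similar first
    order = [i for i in order if i != idx][:k]        # drop the query movie
    return [pivot.index[i] for i in order]

# Hypothetical 3-movie table (rows = movies, columns = users)
toy_pivot = pd.DataFrame(
    [[5, 4, 0], [4, 5, 0], [0, 1, 5]],
    index=["A", "B", "C"],
)
unit = toy_pivot.values / np.linalg.norm(toy_pivot.values, axis=1, keepdims=True)
toy_scores = unit @ unit.T  # cosine similarity between rows

result = recommend_titles("A", toy_pivot, toy_scores, k=2)
print(result)
```

Excluding the query by index rather than slicing off the first sorted entry avoids dropping the wrong movie when another title ties with a similarity of 1.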