# Movie Recommender System

The system generates movie predictions for its users; the items being recommended are the movies themselves. The primary goal of a movie recommendation system is to filter the catalog and surface only the movies a given user is most likely to want to watch. To do this, the ML algorithms use data about the user from the system’s database, predicting the user's future behavior from their past behavior.

## Filtering Strategies for Movie Recommendation Systems

- Content-Based Filtering

- Collaborative Filtering

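The machine learning algorithm aims to discover patterns in user preferences and use them to make recommendations. One common approach is matrix factorization. It starts from a large user-item matrix, which can be pictured as a spreadsheet with users listed along one side and movies along the other. Each cell records how a user rated a particular movie, and most cells are empty because each user rates only a small fraction of the movies. Matrix factorization approximates this matrix as the product of two smaller matrices of latent factors, one for users and one for movies, so that the empty cells can be estimated.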
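As a rough illustration of the idea, the sketch below builds a tiny user-by-movie rating matrix and factorizes it with a truncated SVD from NumPy. The toy ratings, the rank `k = 2`, and the choice to treat unrated cells as 0 are all assumptions made for illustration; this is not necessarily the method used elsewhere in this notebook.

```python
import numpy as np

# Toy user-by-movie rating matrix (rows: users, columns: movies).
# 0 marks "not rated"; treating it as a real value is a simplification.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Thin SVD: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2  # number of latent factors kept (assumed value)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
movie_factors = Vt[:k, :]         # shape (k, n_movies)

# Rank-k reconstruction: an estimated score for every user/movie pair
predicted = user_factors @ movie_factors
print(np.round(predicted, 2))
```

In practice, recommender libraries usually fit the factors only on the observed ratings instead of filling the gaps with zeros, which is why this should be read as a sketch of the shapes involved and not as a tuned model.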

In [107]:
# Import all necessary libraries

import numpy as np
import pandas as pd

In [108]:
# Import dataset

data_set = pd.read_csv("Dataset.csv")
movie_title = pd.read_csv("Movie_Id_Titles.csv")

In [109]:
# Display the first five rows of Dataset.csv

data_set.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [110]:
# Display the first five rows of the Movie_Id_Titles.csv

movie_title.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


## Data Pre-processing

In [111]:
print(data_set.shape)
print(movie_title.shape)

(100003, 4)
(1682, 2)


In [112]:
# Check for missing values in data_set
data_set.isnull().sum()

user_id      0
item_id      0
rating       0
timestamp    0
dtype: int64

In [113]:
# Count fully duplicated rows in data_set
data_set.duplicated().sum()

0

In [114]:
# Check for missing values in movie_title
movie_title.isnull().sum()

item_id    0
title      0
dtype: int64

In [115]:
# Count fully duplicated rows in movie_title
movie_title.duplicated().sum()

0
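Note that `duplicated()` with no arguments only flags rows that match on every column. The sketch below, which uses a small hypothetical sample and not the real file, shows how the same check could be narrowed to the `(user_id, item_id)` pair to catch a user rating the same movie twice with different ratings or timestamps.

```python
import pandas as pd

# Small hypothetical sample with the same columns as Dataset.csv
sample = pd.DataFrame({
    "user_id":   [0, 0, 196, 196],
    "item_id":   [50, 172, 242, 242],
    "rating":    [5, 5, 3, 4],
    "timestamp": [881250949, 881250949, 881250949, 881251000],
})

# Whole-row duplicates: none, because rating and timestamp differ
full_row_dupes = sample.duplicated().sum()

# Duplicates on the (user_id, item_id) key: the last row repeats a pair
pair_dupes = sample.duplicated(subset=["user_id", "item_id"]).sum()

print(full_row_dupes, pair_dupes)
```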

This system attempts to avoid the generic recommendations that result from ignoring user-specific data. Systems of this kind may collect the user's psychological profile, their watching history, and movie scores from other websites, and base their recommendations on an aggregate similarity calculation.

In this dataset, item_id identifies a specific movie and user_id identifies the user who rated it. rating is an integer from 1 to 5 given by that user to that movie. timestamp is stored as an integer (for example 881250949), which is consistent with Unix time, that is, the number of seconds since 1970-01-01 00:00:00 UTC, and not a formatted date-time string with a UTC offset or a trailing Z.
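If the timestamp column is read as seconds since the Unix epoch (an assumption, since the notebook does not state the unit), it can be converted to readable dates with `pd.to_datetime`. The two values below are copied from the `data_set.head()` output.

```python
import pandas as pd

# Two timestamp values taken from data_set.head()
raw = pd.Series([881250949, 891717742], name="timestamp")

# Interpret the integers as seconds since 1970-01-01 UTC (assumed unit)
as_datetime = pd.to_datetime(raw, unit="s")
print(as_datetime)
```

Under that assumption both values fall in 1997 and 1998, which is plausible for ratings of movies released up to 1997.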

In [116]:
# dropna() returns a new DataFrame without rows that contain NaN (missing) values;
# assign the result back, otherwise data_set itself is left unchanged
data_set = data_set.dropna()
data_set

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742
...,...,...,...,...
99998,880,476,3,880175444
99999,716,204,5,879795543
100000,276,1090,1,874795795
100001,13,225,2,882399156


Because there were no missing values, dropna() removes nothing: the dataset still has 100003 rows and 4 columns for further preprocessing.

In [117]:
# Count of non-null values per column after dropping NA values
# (count must be called; printing data_set.count shows only the bound method)
print(data_set.count())

user_id      100003
item_id      100003
rating       100003
timestamp    100003
dtype: int64


## Popularity Based Recommender System

In [118]:
# Merging Movie_Id_Titles.csv to Dataset.csv on the basis of item_id
data_with_title = data_set.merge(movie_title, on = 'item_id')
data_with_title

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)
...,...,...,...,...,...
99998,840,1674,4,891211682,Mamma Roma (1962)
99999,655,1640,3,888474646,"Eighth Day, The (1996)"
100000,655,1637,3,888984255,Girls Town (1996)
100001,655,1630,3,887428735,"Silence of the Palace, The (Saimt el Qusur) (1..."
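A merge like this silently multiplies rows if `item_id` is repeated in the titles table, and the default inner join drops ratings whose `item_id` has no title. The sketch below, on hypothetical miniature tables, shows one way to guard against both with the `validate` and `how` parameters of `DataFrame.merge`; using `how="left"` is a deliberate departure from the default inner join used in the cell above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
ratings = pd.DataFrame({
    "user_id": [0, 290, 79, 840],
    "item_id": [50, 50, 50, 1674],
    "rating":  [5, 5, 4, 4],
})
titles = pd.DataFrame({
    "item_id": [50, 1674],
    "title":   ["Star Wars (1977)", "Mamma Roma (1962)"],
})

# validate="many_to_one" raises MergeError if item_id is not unique in titles;
# how="left" keeps every rating, leaving title as NaN when no match exists.
merged = ratings.merge(titles, on="item_id", how="left", validate="many_to_one")

print(len(merged) == len(ratings))      # row count is preserved
print(merged["title"].isna().sum())     # number of ratings without a title
```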


In [119]:
# Count the number of ratings for each movie title using groupby() and count()
movie_title_rating = data_with_title.groupby('title')['rating'].count().reset_index()
movie_title_rating.rename(columns={'rating': 'num_ratings'}, inplace=True)
movie_title_rating

Unnamed: 0,title,num_ratings
0,'Til There Was You (1997),9
1,1-900 (1994),5
2,101 Dalmatians (1996),109
3,12 Angry Men (1957),125
4,187 (1997),41
...,...,...
1659,Young Guns II (1990),44
1660,"Young Poisoner's Handbook, The (1995)",41
1661,Zeus and Roxanne (1997),6
1662,unknown,9


In [120]:
# Average movie rating for each movie title
avg_movie_rating = data_with_title.groupby('title')['rating'].mean().reset_index()
avg_movie_rating.rename(columns={'rating': 'avg_rating'}, inplace=True)
avg_movie_rating

Unnamed: 0,title,avg_rating
0,'Til There Was You (1997),2.333333
1,1-900 (1994),2.600000
2,101 Dalmatians (1996),2.908257
3,12 Angry Men (1957),4.344000
4,187 (1997),3.024390
...,...,...
1659,Young Guns II (1990),2.772727
1660,"Young Poisoner's Handbook, The (1995)",3.341463
1661,Zeus and Roxanne (1997),2.166667
1662,unknown,3.444444


In [121]:
# merging number of ratings and average rating
popularity_dataset = movie_title_rating.merge(avg_movie_rating,on='title')
popularity_dataset

Unnamed: 0,title,num_ratings,avg_rating
0,'Til There Was You (1997),9,2.333333
1,1-900 (1994),5,2.600000
2,101 Dalmatians (1996),109,2.908257
3,12 Angry Men (1957),125,4.344000
4,187 (1997),41,3.024390
...,...,...,...
1659,Young Guns II (1990),44,2.772727
1660,"Young Poisoner's Handbook, The (1995)",41,3.341463
1661,Zeus and Roxanne (1997),6,2.166667
1662,unknown,9,3.444444
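The count, the mean, and the merge can also be collapsed into a single `groupby` using pandas named aggregation. The sketch below shows this on a hypothetical stand-in table (`toy_ratings`, with made-up titles "A" and "B"); the output column names match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for data_with_title
toy_ratings = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B"],
    "rating": [5, 4, 3, 2, 4],
})

# One pass over the groups: count and mean of rating per title
popularity_dataset = (
    toy_ratings
    .groupby("title")["rating"]
    .agg(num_ratings="count", avg_rating="mean")
    .reset_index()
)
print(popularity_dataset)
```

This avoids building two intermediate frames and joining them on `title`, at the cost of being slightly less step-by-step to read.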


In [122]:
# Keep movies with at least 250 ratings, sort them by average rating in descending order,
# and take the 20 highest-rated titles
popular = popularity_dataset[popularity_dataset['num_ratings']>=250].sort_values('avg_rating',ascending=False).head(20)

In [123]:
# Merge with movie_title to attach item_id, keeping one row per title
popular.merge(movie_title,on='title').drop_duplicates('title')

Unnamed: 0,title,num_ratings,avg_rating,item_id
0,Schindler's List (1993),298,4.466443,318
1,"Shawshank Redemption, The (1994)",283,4.44523,64
2,"Usual Suspects, The (1995)",267,4.385768,12
3,Star Wars (1977),584,4.359589,50
4,One Flew Over the Cuckoo's Nest (1975),264,4.291667,357
5,"Silence of the Lambs, The (1991)",390,4.289744,98
6,"Godfather, The (1972)",413,4.283293,127
7,Raiders of the Lost Ark (1981),420,4.252381,174
8,Titanic (1997),350,4.245714,313
9,"Empire Strikes Back, The (1980)",368,4.206522,172
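The filter, sort, and head steps can be wrapped in a small helper so that the rating threshold and list length are easy to vary. `top_popular` and its parameter names are hypothetical, introduced only for this sketch; the four sample rows are copied from the `popularity_dataset` output earlier in this section.

```python
import pandas as pd

def top_popular(popularity, min_ratings=250, n=20):
    """Return the n best-rated titles that have at least min_ratings ratings."""
    eligible = popularity[popularity["num_ratings"] >= min_ratings]
    return eligible.sort_values("avg_rating", ascending=False).head(n)

# Sample rows taken from the popularity_dataset output
popularity = pd.DataFrame({
    "title": ["12 Angry Men (1957)", "101 Dalmatians (1996)",
              "187 (1997)", "1-900 (1994)"],
    "num_ratings": [125, 109, 41, 5],
    "avg_rating": [4.344000, 2.908257, 3.024390, 2.600000],
})

# A lower threshold is used here only because the sample is tiny
result = top_popular(popularity, min_ratings=100, n=2)
print(result)
```

The threshold is a trade-off: a high value such as 250 keeps averages reliable but excludes less-watched movies, while a low value lets titles with only a handful of ratings reach the top.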


## Collaborative Filtering Based Recommender System