# EDSA-Movie-Recommendation 2022
© Explore Data Science Academy

Team 3

<h2><center>Movie Recommender System</center></h2>
<figure>
<center>
    <img src="./assets/movie.gif" width="350" height="150"/>
</center>
</figure>


### Table Of Contents

1. Import our dependencies
2. Load dataset
3. Exploratory Data Analysis (EDA): understand the dataset
4. Recommender Systems
   - Collaborative Filtering
     - Memory-based collaborative filtering
       - User-Item Filtering
       - Item-Item Filtering
     - Model-based collaborative filtering
       - Singular Value Decomposition (SVD)
       - SVD++
   - Evaluating Collaborative Filtering using SVD
   - Hybrid Model

## Introduction
In today's online world, recommender systems have become a natural part of the user experience. A large body of research focuses on recommender systems built from historical ratings and review text, and many such systems have been introduced in recent years. These methods work with two types of data:
- user-item interactions, such as ratings or buying behavior, and
- attribute information about the items and users, such as keywords or textual profiles.

Recommender systems based on the former data type are called Collaborative Filtering (CF) methods, whereas recommender methods based on the latter type of data are called Content-Based (CB) methods. 

We explore both in this project.
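To make the two data types concrete, here is a minimal sketch using made-up toy data (the values below are hypothetical and are not taken from the project's CSV files). It shows an interaction table, an item attribute table, and the user-by-item rating matrix that collaborative filtering methods usually start from.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
# (1) User-item interactions: the input to collaborative filtering.
interactions = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# (2) Item attributes: the input to content-based filtering.
item_attributes = pd.DataFrame({
    "movieId": [10, 20, 30],
    "genres": ["Comedy|Romance", "Drama", "Adventure|Fantasy"],
})

# Collaborative filtering typically works on a user x item rating matrix;
# pairs without a rating are left as NaN.
user_item = interactions.pivot(index="userId", columns="movieId", values="rating")
print(user_item)
```

Most cells of such a matrix are empty in practice, which is why the methods later in this project work with sparse representations or factorisations rather than the dense matrix.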

## 1. Import libraries

Import all required libraries

In [1]:
from matplotlib import pyplot as plt
from math import sqrt
import seaborn as sns
import pandas as pd
import numpy as np
import ast 
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
#Panda Profiling for EDA
from pandas_profiling import ProfileReport

import warnings; warnings.simplefilter('ignore')

## 2. Load Data Set

About the data set goes here ...

In [2]:
# load the data
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_movies = pd.read_csv('./data/movies.csv')

In [3]:
# Preview train dataset
print('Shape of the data: ', df_train.shape)
df_train.head()


Shape of the data:  (10000038, 4)


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |


## 3. Understanding the Data Set

In [4]:
#Uncomment to install pandas profiling and ipywidgets for the one-liner EDA

#pip install pandas-profiling
#conda install ipywidgets

In [5]:
# EDA using Pandas Profiling Report

#ppr_profile = ProfileReport(df_train, title="Pandas Profiling Report")
#ppr_profile

In [6]:
df_movies.head(5)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [7]:
# Most popular genres of movie released

plt.figure(figsize=(20,7))
genrelist = df_movies['genres'].apply(lambda genrelist_movie : str(genrelist_movie).split("|"))
genres_count = {}

for genrelist_movie in genrelist:
    for genre in genrelist_movie:
        # Increment the count, starting from 0 the first time a genre is seen
        genres_count[genre] = genres_count.get(genre, 0) + 1
# Drop the placeholder label; the default avoids a KeyError if it is absent
genres_count.pop("(no genres listed)", None)
plt.bar(genres_count.keys(),genres_count.values(),color='0.2')

<BarContainer object of 19 artists>
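The same per-genre count can also be obtained with pandas string methods instead of an explicit loop. The sketch below uses a small hypothetical frame named `movies` that mimics the `genres` format of `df_movies`; it is an alternative formulation, not the code used to produce the plot above.

```python
import pandas as pd

# Hypothetical stand-in for df_movies, using the same pipe-separated format.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Comedy", "(no genres listed)"],
})

genre_counts = (
    movies["genres"]
    .str.split("|")       # list of genres per movie
    .explode()            # one row per (movie, genre) pair
    .value_counts()       # number of movies per genre
    .drop("(no genres listed)", errors="ignore")  # skip the placeholder label
)
print(genre_counts)
```

Calling `genre_counts.plot(kind="bar")` on the result would give a bar chart comparable to the one above.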

In [8]:
df_train.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |


In [9]:
df_movies.head(5)

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [10]:
# Users ratings distribution

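# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(df_train["rating"], stat="density", kde=True)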

<AxesSubplot:xlabel='rating', ylabel='Density'>

In [11]:
# View the shape of the train and movie data sets
print("Shape of frames: \n"+ " Train DataFrame"+ str(df_train.shape)+"\n Movies DataFrame"+ str(df_movies.shape))

Shape of frames: 
 Train DataFrame(10000038, 4)
 Movies DataFrame(62423, 3)


In [12]:
# Join the movie and train data sets
merge_train_movies = pd.merge(df_movies, df_train, on='movieId', how='inner')
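An inner join keeps only ratings whose `movieId` appears in the movies table, and it silently multiplies rows if `movieId` is duplicated on the movies side. The sketch below, on hypothetical toy frames, shows how the `validate` argument of `pd.merge` can guard against that. It assumes each movie appears once in the movies table, which is an assumption about the data rather than something stated here.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_movies and df_train.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({
    "userId": [7, 8, 7],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
})

# validate="one_to_many" raises MergeError if movieId is duplicated in `movies`.
merged = pd.merge(movies, ratings, on="movieId", how="inner", validate="one_to_many")

# Movie 3 has no ratings, so the inner join drops it.
print(merged)
print(len(merged) == len(ratings))
```

When every rating matches a movie, the merged frame has exactly as many rows as the ratings table, which is consistent with the shapes reported in this section.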

In [13]:
merge_train_movies.head(5)

| | movieId | title | genres | userId | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 158849 | 5.0 | 994716786 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 97203 | 5.0 | 942683155 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 161871 | 3.0 | 833104576 |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 45117 | 4.0 | 1442256969 |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 27431 | 5.0 | 849667827 |


In [14]:
# Let's drop the time stamp column, we won't be needing it
merge_train_movies = merge_train_movies.drop('timestamp', axis=1)

In [15]:
merge_train_movies.head(5); 
merge_train_movies.shape

(10000038, 5)

In [16]:
# Let's group the ratings by users

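# Aggregate only the numeric columns; taking the mean of the text columns (title, genres) fails in recent pandas
ratings_grouped_by_users = merge_train_movies.groupby('userId')[['movieId', 'rating']].agg(['size', 'mean'])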

In [17]:
ratings_grouped_by_users.head(5)

| userId | movieId (size) | movieId (mean) | rating (size) | rating (mean) |
|---|---|---|---|---|
| 1 | 28 | 6031.892857 | 28 | 3.910714 |
| 2 | 72 | 4658.722222 | 72 | 3.416667 |
| 3 | 251 | 49949.884462 | 251 | 3.691235 |
| 4 | 89 | 68441.966292 | 89 | 3.308989 |
| 5 | 35 | 606.942857 | 35 | 3.885714 |


In [18]:
# Drop movie id column
ratings_grouped_by_users = ratings_grouped_by_users.drop('movieId', axis = 1)

In [19]:
ratings_grouped_by_users.head(5)

| userId | rating (size) | rating (mean) |
|---|---|---|
| 1 | 28 | 3.910714 |
| 2 | 72 | 3.416667 |
| 3 | 251 | 3.691235 |
| 4 | 89 | 3.308989 |
| 5 | 35 | 3.885714 |
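The group-then-drop steps above can be collapsed into one call with pandas named aggregation, which also avoids the two-level column header. This is a sketch on hypothetical toy data; the output column names `n_ratings` and `mean_rating` are illustrative choices, not names used elsewhere in this project.

```python
import pandas as pd

# Hypothetical toy ratings, for illustration only.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 30, 40],
    "rating": [4.0, 3.0, 5.0, 2.0, 5.0],
})

# Named aggregation: each keyword becomes a flat output column.
per_user = ratings.groupby("userId").agg(
    n_ratings=("rating", "size"),
    mean_rating=("rating", "mean"),
)
print(per_user)

# Most active users first, analogous to the "top 15 users" plot.
print(per_user.sort_values("n_ratings", ascending=False).head(15))
```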


In [20]:
#Top 15 users who have rated movies the most 
ratings_grouped_by_users['rating']['size'].sort_values(ascending=False).head(15).plot(kind = 'bar', color = 'orange', figsize = (10,5))

<AxesSubplot:xlabel='userId', ylabel='Density'>

In [21]:
# Let's group the ratings by movies
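# Aggregate only the numeric columns; np.size was previously passed outside the list, so it was never applied as an aggregation
ratings_grouped_by_movies = merge_train_movies.groupby('movieId')[['userId', 'rating']].agg(['mean'])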

In [22]:
ratings_grouped_by_movies.head(5)

| movieId | userId (mean) | rating (mean) |
|---|---|---|
| 1 | 81254.820137 | 3.889971 |
| 2 | 81263.808266 | 3.263414 |
| 3 | 80285.365983 | 3.132325 |
| 4 | 80994.333678 | 2.878099 |
| 5 | 81837.870912 | 3.059165 |


In [23]:
# Let's drop the userId column
ratings_grouped_by_movies = ratings_grouped_by_movies.drop('userId', axis=1)

In [24]:
ratings_grouped_by_movies.head(5)

| movieId | rating (mean) |
|---|---|
| 1 | 3.889971 |
| 2 | 3.263414 |
| 3 | 3.132325 |
| 4 | 2.878099 |
| 5 | 3.059165 |


#### Movies with low average rating (first 10 with a mean below 1.5)

In [25]:
movies_low_rated_filter = ratings_grouped_by_movies['rating']['mean']< 1.5

In [26]:
movies_low_rated = ratings_grouped_by_movies[movies_low_rated_filter]

In [27]:
movies_low_rated.head(10).plot(kind='barh', figsize=(6,4), color='darkblue')

<AxesSubplot:ylabel='movieId'>

In [28]:
movies_low_rated.head(10)

| movieId | rating (mean) |
|---|---|
| 109 | 1.2 |
| 1495 | 1.398917 |
| 1789 | 1.125 |
| 1826 | 1.163522 |
| 1990 | 1.384615 |
| 3558 | 1.0 |
| 3561 | 1.0 |
| 3574 | 1.323529 |
| 3962 | 1.444444 |
| 4051 | 1.375 |


#### Movies with high average rating (Top 10)

In [29]:
ratings_grouped_by_movies['rating']['mean'].sort_values(ascending=False).head(10).plot(kind='barh', figsize=(6,4), color='orange');

## 4. Recommender Systems

About recommender systems

### 4.a Content Based Filtering
Content-based filtering is a machine learning technique that recommends items by comparing their features. It is often used in recommender systems, which are algorithms designed to suggest items to users based on what is known about the user.

The concepts of Term Frequency (TF) and Inverse Document Frequency (IDF) are used in information retrieval systems and in content-based filtering mechanisms (such as a content-based recommender). They are used to determine the relative importance of a document, article, news item, movie, etc.


**Term Frequency (TF) and Inverse Document Frequency (IDF)**

TF-IDF stands for term frequency-inverse document frequency. It is a measure, used in information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document relative to a collection of documents (also known as a corpus).

The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they appear frequently in an English corpus. Thus by taking inverse document frequency, we can minimize the weighting of frequent terms while making infrequent terms have a higher impact.
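The effect of IDF can be seen on a tiny, made-up corpus. The sketch below uses scikit-learn's `TfidfVectorizer` with default settings, under which the smoothed weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The three sentences are hypothetical and only serve to show that a word found in every document gets the lowest weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus: "the" appears in every document,
# "sat" in two of them, "cat" in only one.
docs = ["the cat sat", "the dog sat", "the bird flew"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Map each term to its learned IDF weight.
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

for term in ("the", "sat", "cat"):
    print(term, round(float(idf[term]), 4))
```

The rarer a term is across documents, the larger its IDF, so "cat" outweighs "sat", which in turn outweighs "the".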

In this section, genres are the key feature: movies are recommended to a user based on the genres of the movies that user has already watched.

<img src="./assets/tf.png" width="450" height="250"/>
 
For calculating distances, there are several similarity coefficients to be used - Euclidean, Cosine, Pearson Correlation and more.

**Cosine similarity** is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them. Given two vectors of attributes, $A$ and $B$, the cosine similarity, $\cos(\theta)$, is expressed using the dot product and the magnitudes as

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
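As a quick numeric check of the cosine similarity definition, the sketch below computes it with NumPy for two small hypothetical vectors. The helper name `cosine_sim` is illustrative and is not part of this project's code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical attribute vectors that share one of their two active features.
A = np.array([1.0, 0.0, 1.0])
B = np.array([1.0, 1.0, 0.0])

cos_theta = cosine_sim(A, B)
print(cos_theta)          # dot product 1 divided by (sqrt(2) * sqrt(2)), i.e. about 0.5
print(cosine_sim(A, A))   # a vector compared with itself gives about 1.0
```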

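We use cosine similarity here: the higher the value, the more similar two movies are. A function that returns cosine distance would have to be subtracted from 1 to obtain similarity. The `linear_kernel` call used below returns plain dot products, and because `TfidfVectorizer` L2-normalises each row by default, those dot products are already cosine similarities, so no subtraction is needed.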
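The relationship between dot products and cosine similarity on TF-IDF rows can be checked directly. The sketch below takes three genre strings from the `df_movies` preview and compares scikit-learn's `linear_kernel` with `cosine_similarity`. It relies on `TfidfVectorizer` using its default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Genre strings in the same pipe-separated format as df_movies['genres'].
genres = [
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
]

tfidf = TfidfVectorizer(token_pattern=r"[a-zA-Z0-9\-]+")
matrix = tfidf.fit_transform(genres)  # rows are L2-normalised by default

sim_linear = linear_kernel(matrix, matrix)      # plain dot products
sim_cosine = cosine_similarity(matrix, matrix)  # normalises, then dot products

print(np.allclose(sim_linear, sim_cosine))
print(sim_linear.round(3))
```

The first and second movies share no genre, so their similarity is 0, while the second and third share Comedy and Romance and score above 0. `linear_kernel` skips the extra normalisation step, which makes it the cheaper choice when rows are already unit length.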

In [31]:
merge_train_movies.head(5)

| | movieId | title | genres | userId | rating |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 158849 | 5.0 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 97203 | 5.0 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 161871 | 3.0 |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 45117 | 4.0 |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 27431 | 5.0 |




In [35]:
# Define a TF-IDF Vectorizer object; the raw-string pattern keeps hyphenated genre names as single tokens
tfidf_movies_genres = TfidfVectorizer(token_pattern=r'[a-zA-Z0-9\-]+')

# Replace the "(no genres listed)" placeholder with an empty string
merge_train_movies['genres'] = merge_train_movies['genres'].replace(to_replace="(no genres listed)", value="")

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_movies_genres_matrix = tfidf_movies_genres.fit_transform(merge_train_movies['genres'])
# print(tfidf_movies_genres.get_feature_names_out())
# Inspect the shape and dtype of the TF-IDF matrix
print(tfidf_movies_genres_matrix.shape)
print(tfidf_movies_genres_matrix.dtype)


(10000038, 19)
float64


In [None]:
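# The TF-IDF matrix has one row per rating (about 10 million), so a dense pairwise matrix would not fit in memory; keep one row per movie
first_rows = merge_train_movies['movieId'].drop_duplicates().index.to_numpy()
cosine_sim_movies = linear_kernel(tfidf_movies_genres_matrix[first_rows], tfidf_movies_genres_matrix[first_rows])
print(cosine_sim_movies)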
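A full pairwise similarity matrix is not required to produce recommendations for a single title. The sketch below computes one row of similarities on demand, using the five movies shown in the `df_movies` preview. The helper `similar_movies` and its `top_n` parameter are hypothetical names introduced for illustration; this is one possible approach under those assumptions, not the project's final recommender.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# The five rows shown in the df_movies preview.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Waiting to Exhale (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})

# One TF-IDF row per movie (not per rating).
tfidf = TfidfVectorizer(token_pattern=r"[a-zA-Z0-9\-]+")
genre_matrix = tfidf.fit_transform(movies["genres"])

def similar_movies(title: str, top_n: int = 3) -> pd.DataFrame:
    """Hypothetical helper: the top_n titles closest to `title` by genre TF-IDF."""
    idx = int(np.flatnonzero((movies["title"] == title).to_numpy())[0])
    # Similarity of one movie against all movies: a single row, not an N x N matrix.
    scores = linear_kernel(genre_matrix[idx], genre_matrix).ravel()
    scores[idx] = -1.0  # exclude the query movie itself
    top = np.argsort(scores)[::-1][:top_n]
    return movies.iloc[top][["title"]].assign(score=scores[top])

print(similar_movies("Jumanji (1995)", top_n=2))
```

Because each query touches only one row, memory use grows linearly with the number of movies rather than quadratically.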