<!-- ![IMDB.jpg](attachment:bb74f207-9245-457e-8ac7-fb36ab2057ef.jpg) -->

Dataset Link 
* https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset  

Reference: 
* https://medium.com/@sr7037/implementing-a-recommendation-system-on-imdb-dataset-through-machine-learning-techniques-47d0a86da9df 
* https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/ 
* https://www.geeksforgeeks.org/python-implementation-of-movie-recommender-system/
* https://www.relataly.com/content-based-movie-recommender-using-python/4294/ 
* https://www.aravi.me/blog/how-actually-movies-are-recommended-to-you-and-build-one-yourself

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


<!-- ![f3f5aca7-162b-49f0-bce7-8e59266888a8.png](attachment:d1a39177-328a-45af-b45e-39416c29d097.png) -->

<!-- <a id='content_based'></a>
![d14a6a7d-2801-4bd5-ae7b-a04b6d0e7130.png](attachment:57c4d09d-d401-4e33-80b3-0c727b80e725.png) -->

In content-based filtering, we **recommend items** to a **user** that are **similar to items the user already likes**, where similarity is computed from the **properties/attributes of the items** (here: genres, overview, title, keywords and cast).

<!-- ![image.png](attachment:a590f78f-56e1-42fc-9524-fa53785df2ce.png) -->

#### **Import Libraries**

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

#### **Read the data**

In [None]:
# Read the movies metadata (we will use the genres, overview and title features from it)

movie_md = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/movie recommendation system/the-movies-dataset/movies_metadata.csv")

# Read the keywords
movie_keywords = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/movie recommendation system/the-movies-dataset/keywords.csv")

# Read the credits
movie_credits = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/movie recommendation system/the-movies-dataset/credits.csv")

#### **Check the first 5 rows**

In [None]:
movie_md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


**We are going to select movies that have a vote count of at least 55**

In [None]:
movie_md = movie_md[movie_md['vote_count']>=55]

In [None]:
movie_md = movie_md[['id','original_title','overview','genres']]

In [None]:
# Create a duplicate column for the title, so that one can be used to search later and the other to build features
movie_md['title'] = movie_md['original_title'].copy()

In [None]:
movie_md.reset_index(inplace=True, drop=True)
movie_md.head()

Unnamed: 0,id,original_title,overview,genres,title
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",Toy Story
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Jumanji
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",Grumpier Old Men
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",Father of the Bride Part II
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",Heat


* From the movies metadata dataframe, we are going to work with the following columns:

1. `genres`

2. `original_title`

3. `overview`

4. `id`

In [None]:
movie_keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


* From the keywords dataframe, we are going to work with the following columns:

1. `keywords` (to fetch the keywords)

2. `id` (to merge dataframe)

In [None]:
movie_credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


* From the credits dataframe, we are going to work with the following columns:

1. `cast` - To get the name of the actors

2. `id` - To merge dataframe

In [None]:
movie_credits = movie_credits[['id','cast']]

### **Data Cleaning & Preprocessing**

In [None]:
# Keep only the records whose id is a valid numeric string
movie_md = movie_md[movie_md['id'].str.isnumeric()]

#### Merge dataframes into one single entity

In [None]:
# Merge all dataframe as a single entity
# To merge the ids must be of same datatype
movie_md['id'] = movie_md['id'].astype(int)

# Merge
df = pd.merge(movie_md, movie_keywords, on='id', how='left')

# Reset the index
df.reset_index(inplace=True, drop=True)

In [None]:
# Merge with movie credits
df = pd.merge(df, movie_credits, on='id', how='left')

# Reset the index
df.reset_index(inplace=True, drop=True)

In [None]:
#final dataframe
df.head()

Unnamed: 0,id,original_title,overview,genres,title,keywords,cast
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",Toy Story,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'cast_id': 14, 'character': 'Woody (voice)',..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Jumanji,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'cast_id': 1, 'character': 'Alan Parrish', '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",Grumpier Old Men,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'cast_id': 2, 'character': 'Max Goldman', 'c..."
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",Father of the Bride Part II,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'cast_id': 1, 'character': 'George Banks', '..."
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",Heat,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","[{'cast_id': 25, 'character': 'Lt. Vincent Han..."


### Let's fetch the genres, keywords, cast to vectorize them later

In [None]:
# Let's start by cleaning the movies metadata
# Fetch the list of genre names from the column
df['genres'] = df['genres'].apply(lambda x: [i['name'] for i in eval(x)])

# Remove spaces within each genre (e.g. "Science Fiction" becomes "ScienceFiction") and join them into one string
df['genres'] = df['genres'].apply(lambda x: ' '.join([i.replace(" ","") for i in x]))
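The `genres`, `keywords` and `cast` columns store a Python-style list of dicts as a string. The cells in this section parse them with `eval`; a safer alternative for literal data like this is `ast.literal_eval`, which refuses anything that is not a plain literal. Below is a minimal sketch of the parse-and-join step on a made-up sample string; the helper name `names_to_tokens` is illustrative and not part of the notebook.

```python
import ast

def names_to_tokens(raw):
    """Parse a stringified list of dicts and return one space-free token per name."""
    # literal_eval only accepts Python literals, so it cannot run arbitrary code
    items = ast.literal_eval(raw)
    # Remove inner spaces so that multi-word names stay a single token
    return ' '.join(item['name'].replace(' ', '') for item in items)

# Made-up sample in the same shape as the dataset's columns
sample = "[{'id': 878, 'name': 'Science Fiction'}, {'id': 35, 'name': 'Comedy'}]"
tokens = names_to_tokens(sample)
print(tokens)  # ScienceFiction Comedy
```

Joining multi-word names into one token keeps, for example, an actor's first and last name together when the text is later split into terms.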

In [None]:
# Fill the null values with '[]' (assigning back avoids relying on in-place modification of a column)
df['keywords'] = df['keywords'].fillna('[]')

In [None]:
# Let's clean the keywords column to extract the keywords
# Fetch the keyword list from the column
df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in eval(x)])

# Remove the spaces within each keyword and join all the keywords with spaces
df['keywords'] = df['keywords'].apply(lambda x: ' '.join([i.replace(" ",'') for i in x]))

In [None]:
# Fill the null values with '[]' (assigning back avoids relying on in-place modification of a column)
df['cast'] = df['cast'].fillna('[]')

In [None]:
# Let's clean the cast column to extract the names of the actors
# Fetch the cast list from the column
df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in eval(x)])

# Remove the spaces within each name and join all the cast members with spaces
df['cast'] = df['cast'].apply(lambda x: ' '.join([i.replace(" ",'') for i in x]))

In [None]:
df.head()

Unnamed: 0,id,original_title,overview,genres,title,keywords,cast
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",Animation Comedy Family,Toy Story,jealousy toy boy friendship friends rivalry bo...,TomHanks TimAllen DonRickles JimVarney Wallace...
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Adventure Fantasy Family,Jumanji,boardgame disappearance basedonchildren'sbook ...,RobinWilliams JonathanHyde KirstenDunst Bradle...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Romance Comedy,Grumpier Old Men,fishing bestfriend duringcreditsstinger oldmen,WalterMatthau JackLemmon Ann-Margret SophiaLor...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Comedy,Father of the Bride Part II,baby midlifecrisis confidence aging daughter m...,SteveMartin DianeKeaton MartinShort KimberlyWi...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",Action Crime Drama Thriller,Heat,robbery detective bank obsession chase shootin...,AlPacino RobertDeNiro ValKilmer JonVoight TomS...


### **Let's merge all content/description of movies as a single feature**

In [None]:
df['tags'] = df['overview'] + ' ' + df['genres'] +  ' ' + df['original_title'] + ' ' + df['keywords'] + ' ' + df['cast']

In [None]:
# Delete useless columns
df.drop(columns=['genres','overview','original_title','keywords','cast'], inplace=True)

In [None]:
df.head()

Unnamed: 0,id,title,tags
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ..."


In [None]:
df.isnull().sum()

id        0
title     0
tags     35
dtype: int64

* **`tags` is null wherever one of the concatenated columns is missing (for example, a missing `overview`), because concatenating a string with a null value yields null. We have to remove these 35 records in order to proceed.**

In [None]:
df.drop(df[df['tags'].isnull()].index, inplace=True)

In [None]:
df.head()

Unnamed: 0,id,title,tags
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ..."


In [None]:
df.shape

(8735, 3)

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

(8595, 3)

## **Convert the contents to vectors**

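Our model cannot work with raw text, so we have to vectorize the tags, that is, convert them into a numeric, machine-readable format. Here we use TF-IDF, which weights each term by how often it appears in a movie's tags and how rare it is across all movies.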
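To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up tag strings. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how scikit-learn documents its default behaviour; the whitespace tokenizer and the function name `tfidf_vectors` are simplifications for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

# Made-up tag strings
docs = ["toy friendship animation",
        "boardgame adventure animation",
        "robbery crime thriller"]
vocab, vectors = tfidf_vectors(docs)

# 'animation' occurs in two documents, so it weighs less than 'toy' in the first row
print(vectors[0][vocab.index('animation')] < vectors[0][vocab.index('toy')])
```

Terms shared by many movies contribute less to the vector, so rarer terms such as a specific keyword or actor name dominate the similarity.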

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Initialize a tfidf object
tfidf = TfidfVectorizer(max_features=5000)

# Transform the data
vectorized_data = tfidf.fit_transform(df['tags'].values)

In [None]:
vectorized_data

<8595x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 375635 stored elements in Compressed Sparse Row format>

In [None]:
vectorized_dataframe = pd.DataFrame(vectorized_data.toarray(), index=df['tags'].index.tolist())
print("vectorized_dataframe.shape :",vectorized_dataframe.shape )

vectorized_dataframe.shape : (8595, 5000)


In [None]:
vectorized_dataframe

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8768,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Perform Dimension Reduction**

We perform dimensionality reduction because computing similarities across 5,000 dimensions for every pair of movies is computationally expensive.

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
# Initialize a TruncatedSVD object (unlike PCA, it does not center the data, so it also works on sparse input)
svd = TruncatedSVD(n_components=3000)

# Fit transform the data
reduced_data = svd.fit_transform(vectorized_dataframe)

# Print the shape
reduced_data.shape

(8595, 3000)

In [None]:
type(reduced_data)
pd.DataFrame(reduced_data)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,0.139003,-0.019699,0.011627,0.064827,-0.054754,0.004788,0.042422,-0.061905,-0.006149,-0.023124,...,0.004485,-0.002769,0.006878,-0.000445,-0.006775,-0.009387,0.008006,-0.011329,0.002563,0.010269
1,0.212309,-0.035660,-0.045956,0.097111,0.033862,-0.069007,0.032509,-0.024120,0.043341,0.057915,...,0.004870,-0.007362,-0.000772,0.003006,0.000336,0.007195,0.001831,-0.002990,-0.015898,-0.010755
2,0.195794,0.094501,-0.039028,0.011741,-0.059289,-0.017042,-0.022946,0.028915,-0.049783,0.042210,...,0.000193,-0.009350,-0.004503,0.000242,-0.002474,0.007971,0.000255,-0.005531,0.013274,-0.007506
3,0.256830,0.039630,0.090064,0.043781,-0.033856,-0.058046,-0.070625,-0.014066,0.026313,-0.056321,...,-0.004486,-0.000609,0.001676,-0.022946,0.016937,0.002894,0.001619,-0.004462,0.003329,-0.001846
4,0.153441,-0.050863,-0.003514,0.017283,0.122651,0.110975,0.026069,0.024661,-0.088494,0.008259,...,0.003003,0.008125,-0.006692,0.006824,0.003251,0.005095,-0.000968,0.006032,0.005569,0.002953
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8590,0.339936,-0.064954,0.132725,-0.029665,-0.086280,0.010889,-0.046212,-0.014159,0.022919,-0.023377,...,0.004403,0.000696,-0.006007,-0.015039,0.003338,-0.002537,-0.005660,-0.006246,-0.009026,0.002036
8591,0.259158,-0.085939,-0.079302,0.040288,-0.087677,0.069818,0.006047,0.012006,0.017070,-0.032988,...,-0.001895,-0.007181,0.010350,-0.009055,-0.004717,0.008832,0.000604,-0.001581,-0.002419,-0.004678
8592,0.216518,-0.099889,-0.080659,-0.059704,0.030262,-0.055619,0.048568,-0.002919,-0.036502,0.031121,...,0.009133,-0.007847,-0.001032,0.016705,-0.010539,0.000455,-0.006548,0.015010,-0.003623,-0.005334
8593,0.278910,-0.021182,0.112876,-0.015884,-0.145934,0.009911,0.060736,-0.135193,-0.002653,0.001545,...,0.003989,-0.008305,0.004426,0.000217,0.001080,-0.000213,-0.003039,-0.009739,-0.002406,0.002459


In [None]:
svd.explained_variance_ratio_.cumsum()

array([0.00470896, 0.01167374, 0.01736822, ..., 0.92189056, 0.92196334,
       0.92203598])
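The cumulative sum printed above ends at about 0.922, meaning the 3,000 retained components account for roughly 92% of the variance. The same cumulative values can be used to pick the number of components for a target variance. A minimal sketch with made-up ratios follows; the function name `components_for_variance` is hypothetical.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, target):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`, or None if it never does."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= target:
            return k
    return None

# Made-up per-component ratios, in decreasing order as produced by an SVD
ratios = [0.40, 0.25, 0.15, 0.10, 0.05]
k = components_for_variance(ratios, 0.75)
print(k)  # 3
```

With the notebook's model, the list passed in would be `svd.explained_variance_ratio_`.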

## **Compute a similarity metric on vectors for recommendation**
To make recommendations, we need a similarity measure between the vectors, e.g. cosine similarity, Euclidean distance or Jaccard distance. Here we are going to use cosine similarity.
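Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. A minimal pure-Python sketch on made-up vectors; returning 0.0 for a zero vector is a convention chosen here for illustration.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1, twice the length
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_sim(v1, v2))  # close to 1: same direction
print(cosine_sim(v1, v3))  # 0.0: nothing in common
```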

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity = cosine_similarity(reduced_data)

## **Making recommendations for a given movie**

In [None]:
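def recommendation(movie_title):
    # Positional row of the movie: df's index labels have gaps after the row drops above,
    # so labels cannot be used to index the similarity matrix
    matches = np.flatnonzero((df['title'] == movie_title).to_numpy())
    if len(matches) == 0:
        print(f"'{movie_title}' not found")
        return
    distances = similarity[matches[0]]
    # Sort by similarity (highest first) and skip the first hit, which is the movie itself
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:10]

    for i in movie_list:
        print(df.iloc[i[0]].title)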
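The ranking step can also be written without sorting the whole row. The sketch below uses `heapq.nlargest` and excludes the query by position instead of assuming it sorts first, which stays correct even if another row has a similarity of 1. The titles, similarity values and the name `top_similar` are made up for illustration.

```python
import heapq

def top_similar(similarity_row, query_pos, n):
    """Return the positions of the n most similar rows, excluding the query itself."""
    candidates = ((score, pos) for pos, score in enumerate(similarity_row)
                  if pos != query_pos)
    return [pos for score, pos in heapq.nlargest(n, candidates)]

titles = ['A', 'B', 'C', 'D']
row = [1.0, 0.2, 0.9, 0.5]   # made-up similarities of 'A' to every title
best = top_similar(row, query_pos=0, n=2)
print([titles[p] for p in best])  # ['C', 'D']
```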

In [None]:
recommendation('The Matrix')

The Matrix Revisited
The Matrix Revolutions
The Matrix Reloaded
The Animatrix
Commando
Terminator 3: Rise of the Machines
GHOST IN THE SHELL
Hackers
Who Am I - Kein System ist sicher


In [None]:
recommendation('Jumanji')

Brainscan
Wreck-It Ralph
Stay Alive
Geri's Game
Alan Partridge: Alpha Papa
Dungeons & Dragons
Nirvana
Indie Game: The Movie
Jack the Giant Slayer


In [None]:
recommendation('Casino')

Lucky You
Last Vegas
Vegas Vacation
Fear and Loathing in Las Vegas
The Godfather: Part II
La mafia uccide solo d'estate
Mississippi Grind
The Cincinnati Kid
Wild Card


In [None]:
recommendation('Heat')

Kiss Kiss Bang Bang
No Good Deed
The Grifters
The Long Goodbye
Le Cercle Rouge
Inside Man
Insomnia
신세계
Arsène Lupin


<a id='visualize'></a>
## **Let's try to visualize the vectors in 2-D space using t-SNE**

In [None]:
from sklearn.manifold import TSNE

In [None]:
# Initialize TSNE object
tsne = TSNE(n_components=2,init="random")

# Fit transform the data
tsne_data = tsne.fit_transform(vectorized_data)

# Convert to dataframe
tsne_data = pd.DataFrame(tsne_data, columns=['x','y'])

In [None]:
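# Assign by position: tsne_data has a fresh 0..n-1 index, while df's index has gaps after the row drops
tsne_data['title'] = df['title'].to_numpy()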

In [None]:
data = go.Scatter(x=tsne_data['x'],y=tsne_data['y'],text=tsne_data['title'],mode='markers+text',)

fig = go.Figure(data=data)

fig.show()

<a id='model_based'></a>
## **Model Based Recommender Systems**

Model-based recommendation systems involve building a model based on the dataset of ratings. In other words, we extract some information from the dataset, and use that as a "model" to make recommendations without having to use the complete dataset every time.

For model based recommender systems we are going to use a library called Surprise and we are going to use SVD as a matrix factorization method.

### **Singular Value Decomposition (SVD)**
Singular Value Decomposition (SVD) is a matrix factorization method used in machine learning. It decomposes a matrix into three other matrices and extracts latent features from the factorization of the user-item rating matrix.

<!-- ![svd%20example.png](attachment:5815b044-a296-42d2-afcd-28ca93ea9607.png) -->

The SVD of a matrix $A$ can be written as:

$$A = U S V^T$$

Where,

`Matrix U:` Latent features of Users

`Matrix S:` Diagonal matrix representing the strength of each feature

`Matrix V:` Latent features of Items
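As far as the Surprise documentation describes it, the `SVD` class used below does not compute this three-matrix decomposition explicitly. It learns a user factor vector $p_u$ and an item factor vector $q_i$ (playing the roles of the user and item latent features), plus bias terms, by stochastic gradient descent on the known ratings. A sketch of the prediction rule and the regularized objective, with $\mu$ the global mean rating, $b_u$ and $b_i$ the user and item biases, and $\lambda$ the regularization strength:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u
\qquad
\min \sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2
  + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
```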

#### **Import Libraries**

In [None]:
!pip install scikit-surprise

from surprise import Dataset, Reader

from surprise.prediction_algorithms.matrix_factorization import SVD

from surprise import accuracy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### **Read the data**

In [None]:
ratings = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/movie recommendation system/the-movies-dataset/ratings_small.csv")

movie_md = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/movie recommendation system/the-movies-dataset/movies_metadata.csv")

ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


In [None]:
movie_md

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


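* We will consider ratings only for movies with a vote count above 55. Note that `movieId` in `ratings_small.csv` is a MovieLens ID, whereas `id` in `movies_metadata.csv` is a TMDB ID; the dataset ships a `links_small.csv` file that maps between the two, so matching them directly (as done below) may pair ratings with the wrong titles.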

In [None]:
# movie dataframe with votes more than 55
movie_md = movie_md[movie_md['vote_count']>55][['id','title']]

# IDs of movies with count more than 55
movie_ids = [int(x) for x in movie_md['id'].values]

# Select ratings of movies with more than 55 counts
ratings = ratings[ratings['movieId'].isin(movie_ids)]

# Reset Index
ratings.reset_index(inplace=True, drop=True)

# Print first 5 rows
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1371,2.5,1260759135
1,1,2105,4.0,1260759139
2,1,2294,2.0,1260759108
3,2,17,5.0,835355681
4,2,62,3.0,835355749


In [None]:
ratings.shape

(29965, 4)

In [None]:
# Initialize a surprise reader object
# (line_format, sep and skip_lines only apply when reading from a file; load_from_df only needs rating_scale)
reader = Reader(rating_scale=(0,5))

# Load the data
data = Dataset.load_from_df(ratings[['userId','movieId','rating']], reader=reader)

# Build trainset object(perform this only when you are using whole dataset to train)
trainset = data.build_full_trainset()

In [None]:
data

<surprise.dataset.DatasetAutoFolds at 0x7ff497f0bbe0>

In [None]:
trainset

<surprise.trainset.Trainset at 0x7ff497f0b580>

In [None]:
# Initialize model
svd = SVD()

# Fit on the full trainset (no cross-validation is performed here)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ff4bd891c90>

**We have fit the data successfully, now let's check some predictions**

In [None]:
svd.predict(uid=3,iid=2959,r_ui=5.0)

Prediction(uid=3, iid=2959, r_ui=5.0, est=4.216121667626948, details={'was_impossible': False})

In [None]:
svd.predict(uid=15,iid=2678,r_ui=1.0)

Prediction(uid=15, iid=2678, r_ui=1.0, est=2.8301711824586704, details={'was_impossible': False})

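Here we call the `.predict()` method with three arguments: `uid` (the user ID), `iid` (the item ID) and `r_ui` (the true rating, which is optional and only stored for reference).

The output of each prediction is a named tuple in which `est` is the estimated rating.

In the two examples above, the estimates lean in the right direction (4.22 for a true rating of 5.0, and 2.83 for a true rating of 1.0) but are off by roughly 0.8 and 1.8 rating points. Because the model was fit on the full dataset, these spot checks are not a held-out evaluation. The model can be improved further with hyperparameter optimization.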
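Spot checks like the two above can be summarized with an error metric. Here is a minimal sketch in plain Python using the two (true, estimated) pairs printed above; two data points are far too few for a meaningful score, so this only illustrates the arithmetic. The helper names `rmse` and `mae` are illustrative. Surprise's own `accuracy.rmse` (imported earlier) works on a list of predictions, typically from a held-out test set.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# (r_ui, est) taken from the two predictions shown above
pairs = [(5.0, 4.216121667626948), (1.0, 2.8301711824586704)]
print(f"RMSE: {rmse(pairs):.3f}, MAE: {mae(pairs):.3f}")
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same pairs.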
 
Now that our model is ready, we will fill in the user-item interaction matrix and make recommendations.

In [None]:
def get_recommendations(data, movie_md, user_id, top_n, algo):

    # list to store (movie title, predicted rating) pairs
    recommendations = []

    # creating a user-item interactions matrix
    user_movie_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')

    # extracting the movie ids which user_id has not rated yet
    non_interacted_movies = user_movie_interactions_matrix.loc[user_id][user_movie_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # looping through each movie id which user_id has not rated yet
    for item_id in non_interacted_movies:

        # predicting this user's rating for the unrated movie
        est = algo.predict(user_id, item_id).est

        # looking up the title and storing it with the predicted rating
        movie_name = movie_md[movie_md['id']==str(item_id)]['title'].values[0]
        recommendations.append((movie_name, est))

    # sorting by predicted rating in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning the top n movies with the highest predicted ratings for this user
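Stripped of pandas and Surprise, the function above does three things: find the items the user has not rated, score each with the model, and keep the highest scores. A minimal sketch of that flow with made-up data; `recommend_unseen` and the stand-in predictor `fake_scores.get` are hypothetical and replace `algo.predict` only for illustration.

```python
def recommend_unseen(user_ratings, all_items, predict, top_n):
    """Rank the items the user has not rated by their predicted rating."""
    unseen = [item for item in all_items if item not in user_ratings]
    scored = [(item, predict(item)) for item in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up data: one user's known ratings and a stand-in for the trained model
user_ratings = {'Heat': 4.5, 'Casino': 4.0}
all_items = ['Heat', 'Casino', 'Jumanji', 'Toy Story', 'The Matrix']
fake_scores = {'Jumanji': 3.1, 'Toy Story': 4.2, 'The Matrix': 4.8}

top = recommend_unseen(user_ratings, all_items, fake_scores.get, top_n=2)
print(top)  # [('The Matrix', 4.8), ('Toy Story', 4.2)]
```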

In [None]:
get_recommendations(data=ratings,movie_md=movie_md, user_id=654, top_n=10, algo=svd)

[('The Sixth Sense', 4.9662174966408505),
 ('Nell', 4.95561126067868),
 ('Galaxy Quest', 4.93651718612747),
 ('Dead Man', 4.8382409921593466),
 ('Hard Target', 4.8347157948267006),
 ("We're No Angels", 4.813398701756862),
 ('While You Were Sleeping', 4.8116974925306835),
 ('Terminator Salvation', 4.800847142389158),
 ('The Thomas Crown Affair', 4.7952075011132),
 ('Crank', 4.793300113924036)]

<a id='memory_based'></a>
## **Memory Based Recommender System**

Memory-based methods use historical user ratings to compute the similarity between users or between items. The idea is to define a similarity measure and then recommend unseen items drawn from the most similar users or items.

Memory-based recommender systems are of two types:

1. User-Based

2. Item-Based

<a id='user_based'></a>
### **User-Based**

In the user-based method, we compute similarities between users, fetch the most similar users with a k-nearest-neighbours (KNN) algorithm, and recommend to each user the movies that similar users liked.
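The idea can be sketched in plain Python: compute cosine similarity between users over the movies they have both rated, then predict a rating as the similarity-weighted average of the nearest neighbours' ratings. This follows the weighted-average form that the Surprise documentation gives for `KNNBasic`, but it is a simplified illustration on made-up ratings, not the library's implementation; all names below are hypothetical.

```python
import math

def cosine_on_common(ratings_a, ratings_b):
    """Cosine similarity restricted to the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users who rated `item`."""
    neighbours = [(cosine_on_common(ratings[user], r), r[item])
                  for other, r in ratings.items()
                  if other != user and item in r]
    top = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in top) / total_sim

# Made-up ratings: 'ann' has not rated 'Jumanji'
ratings = {
    'ann': {'Heat': 5.0, 'Casino': 4.0},
    'bob': {'Heat': 5.0, 'Casino': 4.0, 'Jumanji': 2.0},
    'cat': {'Heat': 1.0, 'Casino': 5.0, 'Jumanji': 5.0},
}
pred = predict_user_based(ratings, 'ann', 'Jumanji')
print(round(pred, 2))
```

Because `bob` agrees with `ann` on every shared movie, his rating of 2.0 pulls the estimate further than `cat`'s 5.0 does.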

<!-- ![1_x8gTiprhLs7zflmEn1UjAQ.png](attachment:49758429-7335-40e4-9621-a31f0e5c4613.png) -->

#### **Import Libraries**

In [None]:
from surprise.prediction_algorithms.knns import KNNBasic

In [None]:
#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': True}

# KNN algorithm is used to find similar users
sim_user = KNNBasic(sim_options=sim_options, verbose=False)

# Train the algorithm on the trainset
sim_user.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x7ff4c2266380>

In [None]:
# Predict the rating for a sample user and movie, passing the known rating for comparison
sim_user.predict(uid=2,iid=17,r_ui=5.0)

Prediction(uid=2, iid=17, r_ui=5.0, est=4.166335018545322, details={'actual_k': 40, 'was_impossible': False})

In [None]:
# Predict the rating for a sample user and movie, passing the known rating for comparison
sim_user.predict(uid=671,iid=4011,r_ui=4.0)

Prediction(uid=671, iid=4011, r_ui=4.0, est=4.262454431125302, details={'actual_k': 40, 'was_impossible': False})

In [None]:
get_recommendations(ratings, movie_md, 671,10,sim_user)

[('The Wizard', 5),
 ('Rio Bravo', 5),
 ('The Celebration', 5),
 ('Spider-Man 3', 5),
 ('A Streetcar Named Desire', 5),
 ('Gentlemen Prefer Blondes', 5),
 ('The Evil Dead', 5),
 ('JFK', 5),
 ('Strangers on a Train', 5),
 ("Singin' in the Rain", 5)]

<a id='item_based'></a>
### **Item-Based**

In the item-based method, we compute similarities between items (movies), fetch the most similar items with a KNN algorithm, and recommend to a user the movies that are most similar to the ones that user has already rated highly.


<!-- ![1_BME1JjIlBEAI9BV5pOO5Mg.png](attachment:2b66cfef-9db1-4bc4-a09c-4b46206114bf.png) -->

In the modelling part there is just one difference: we set `user_based` to `False` in the `sim_options` parameter when initializing the model.
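For reference, the Surprise documentation gives the item-based `KNNBasic` estimate as a similarity-weighted average over $N^k_u(i)$, the (at most) $k$ items most similar to item $i$ that user $u$ has rated. A sketch of that rule, where $\text{sim}(i, j)$ is the cosine similarity chosen in `sim_options`:

```latex
\hat{r}_{ui} =
  \frac{\sum_{j \in N^k_u(i)} \text{sim}(i, j) \cdot r_{uj}}
       {\sum_{j \in N^k_u(i)} \text{sim}(i, j)}
```

The `actual_k` field in the prediction output reports how many neighbours were actually used, which can be fewer than $k$ (40 by default) when the user has rated few similar items.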

In [None]:
#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': False}

# KNN algorithm is used to find similar items
sim_item = KNNBasic(sim_options=sim_options, verbose=False)

# Train the algorithm on the trainset
sim_item.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x7ff4bd8ae5c0>

In [None]:
# Predict the rating for a sample user and movie, passing the known rating for comparison
sim_item.predict(uid=2,iid=17,r_ui=5.0)

Prediction(uid=2, iid=17, r_ui=5.0, est=3.650476877827318, details={'actual_k': 40, 'was_impossible': False})

In [None]:
# Predict the rating for a sample user and movie, passing the known rating for comparison
sim_item.predict(uid=671,iid=4011,r_ui=4.0)

Prediction(uid=671, iid=4011, r_ui=4.0, est=4.164142698155605, details={'actual_k': 31, 'was_impossible': False})

In [None]:
get_recommendations(ratings, movie_md, 671,10,sim_item)

[('Hard Candy', 5),
 ('Visitor Q', 5),
 ('The Protector', 4.666666666666667),
 ('Shaun of the Dead', 4.571428571428571),
 ('The Silence of the Lambs', 4.503228000162119),
 ("Singin' in the Rain", 4.5),
 ("Hearts of Darkness: A Filmmaker's Apocalypse", 4.5),
 ('Sense and Sensibility', 4.5),
 ("The Hitchhiker's Guide to the Galaxy", 4.5),
 ('Fantasia', 4.428571428571429)]