<h1 class="alert alert-info" style="text-align: center;">Recommending Movies</h1>

In [106]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm
import requests
import os
import zipfile
%matplotlib inline

#suppress warnings
import warnings
warnings.filterwarnings('ignore')

## Movielens - Dataset

This dataset (ml-latest) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 33832162 ratings and 2328315 tag applications across 86537 movies. These data were created by 330975 users between January 09, 1995 and July 20, 2023. This dataset was generated on July 20, 2023.

Users were selected at random for inclusion. All selected users had rated at least 1 movie. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.

This is a development dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available benchmark datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.


I am using this specific data in this notebook: https://files.grouplens.org/datasets/movielens/ml-latest-small.zip

## Data Description

We will use the following two files:


**Ratings (ratings.csv)**    -- Ratings given by users; contains the columns userId | movieId | rating | timestamp


**Movie Information (movies.csv)**   -- Information about the items (movies); contains the columns movieId | title | genres

## Table of Contents

[1. Download and Reading Dataset](#1-download-and-reading-dataset)

[2. Explore Data](#2-explore-data)

[3. Merging Movie information with ratings](#3-merging-movie-information-with-ratings)

[4. Preparing Data](#4-preparing-data)

[5. Setting evaluation metric](#5-setting-evaluation-metric)

[6. Simple Baseline Model](#6-simple-baseline-model)

[7. User based collaborative filtering with simple user mean](#7-user-based-collaborative-filtering-with-simple-user-mean)

[8. User based collaborative filtering with similarity weighted mean](#8-user-based-collaborative-filtering-with-similarity-weighted-mean)

[9. Item based collaborative filtering with simple item mean](#9-item-based-collaborative-filtering-with-simple-item-mean)

[10. Item based collaborative filtering with similarity weighted mean](#10-item-based-collaborative-filtering-with-similarity-weighted-mean)

[11. Matrix Factorization](#11-matrix-factorization)

[12. Content Based Filtering](#12-content-based-filtering)

## 1. Download and Reading Dataset

In [3]:
# Download the data and extract it into the current directory

url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
file_name = 'ml-latest-small.zip'

# Stream the download
with requests.get(url, stream=True) as r:
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

# Unzip the data into the current directory, then delete the downloaded archive
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall('./')
os.remove(file_name)

In [4]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')

## 2. Explore Data

In [5]:
ratings.shape, movies.shape

((100836, 4), (9742, 3))

- We have a total of 9742 movies and 100836 ratings in these datasets.

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## 3. Merging Movie Information with Ratings

In [5]:
ratings = ratings.merge(movies[['movieId','title']], how='left', on='movieId', validate='many_to_one')

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,1,4.0,964982703,Toy Story (1995)
1,1,3,4.0,964981247,Grumpier Old Men (1995)
2,1,6,4.0,964982224,Heat (1995)
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,964982931,"Usual Suspects, The (1995)"


Let's also combine the movie id and movie title, separated by ': ', and store the result in a new column named movie

In [7]:
ratings['movie'] = ratings['movieId'].map(str) + ': ' + ratings['title'].map(str)

In [8]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp', 'title', 'movie'], dtype='object')

Keep the columns userId, movie and rating in the ratings dataframe and drop all others

In [9]:
ratings = ratings[['userId', 'movie', 'rating']]

In [10]:
ratings.head()

Unnamed: 0,userId,movie,rating
0,1,1: Toy Story (1995),4.0
1,1,3: Grumpier Old Men (1995),4.0
2,1,6: Heat (1995),4.0
3,1,47: Seven (a.k.a. Se7en) (1995),5.0
4,1,"50: Usual Suspects, The (1995)",5.0


## 4. Preparing Data

In [107]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ratings, test_size=0.2, random_state=42)

In [12]:
train.shape, test.shape

((80668, 3), (20168, 3))

## 5. Setting Evaluation Metric

In [13]:
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Evaluate the model using RMSE
def rmse_score(model, data):
    id_pairs = zip(data['userId'], data['movie'])
    y_pred = np.array([model(user, movie) for (user, movie) in id_pairs])
    y_true = np.array(data['rating'])
    if np.isnan(y_pred).any():
        print("NaN values found in predictions")
    if np.isnan(y_true).any():
        print("NaN values found in test")
    return rmse(y_true, y_pred)
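
As a quick sanity check of the metric, the sketch below applies the same RMSE definition to a handful of made-up ratings. The toy arrays and the name `rmse_np` are hypothetical and not part of the notebook; plain NumPy is used so that the block runs on its own.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings, made up for illustration only
toy_true = [4.0, 3.0, 5.0, 2.0]
toy_pred = [3.5, 3.0, 4.0, 3.0]

# Errors are 0.5, 0, 1, -1, so the mean squared error is 2.25 / 4 = 0.5625
toy_rmse = rmse_np(toy_true, toy_pred)
print(toy_rmse)  # 0.75
```

Because the errors are squared before averaging, a single prediction that is off by one star contributes four times as much as one that is off by half a star.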

## 6. Simple Baseline Model

In [14]:
avg_rating = train['rating'].mean()

def baseline(user, movie):
    return avg_rating

In [15]:
# Score the baseline model
rmse_score(baseline, test)

1.0488405992661316

## 7. User based collaborative filtering with Simple User Mean

User-based CF predicts a rating as a weighted mean of similar users' ratings. As a first step, let's ignore similarity and simply predict the average of all ratings given to a particular movie by the other users.

To do that, we first create the ratings matrix using the pandas pivot_table function.

In [17]:
rmatrix = train.pivot_table(values='rating', index='userId', columns='movie')
rmatrix.head()

movie,100044: Human Planet (2011),100068: Comme un chef (2012),100083: Movie 43 (2013),"100106: Pervert's Guide to Ideology, The (2012)",100159: Sightseers (2012),100163: Hansel & Gretel: Witch Hunters (2013),100194: Jim Jefferies: Fully Functional (EPIX) (2012),100226: Why Stop Now (2012),100277: Tabu (2012),100302: Upside Down (2012),...,99764: It's Such a Beautiful Day (2012),"99813: Batman: The Dark Knight Returns, Part 2 (2013)",99846: Everything or Nothing: The Untold Story of 007 (2012),99853: Codependent Lesbian Space Alien Seeks Same (2011),998: Set It Off (1996),"99910: Last Stand, The (2013)",99917: Upstream Color (2013),999: 2 Days in the Valley (1996),99: Heidi Fleiss: Hollywood Madam (1995),9: Sudden Death (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
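
To make the shape of the ratings matrix easier to see, here is a minimal sketch on a made-up four-row table. The `toy` data and variable names are hypothetical; only the `pivot_table` call mirrors the one used above.

```python
import pandas as pd

# Hypothetical long-format ratings, made up for illustration only
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movie": ["A", "B", "A", "B"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# One row per user, one column per movie; unrated cells become NaN
toy_matrix = toy.pivot_table(values="rating", index="userId", columns="movie")

# Column-wise mean skips NaN, giving the average rating of each movie
toy_movie_mean = toy_matrix.mean(axis=0)

print(toy_matrix)
print(toy_movie_mean)
```

User 2 never rated movie B, so that cell is NaN and does not enter the mean of column B. This is why the real matrix above is almost entirely NaN: each user rates only a small fraction of the movies.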


In [34]:
all_user_mean = rmatrix.mean(axis=0)

def cf_user_mean(user, movie):
    if movie in rmatrix:
        mean_rating = all_user_mean[movie]
    else:
        mean_rating = avg_rating
        
    return mean_rating

In [35]:
rmse_score(cf_user_mean, test)

0.9827389937822489

- We have improved already from the baseline model, but we can do better by considering only the top k similar users to the target user.

## 8. User based collaborative filtering with Similarity Weighted Mean

For this we will use pearson correlation coefficient as the similarity metric.

In [37]:
pearson_corr = rmatrix.T.corr()

In [38]:
rmatrix.head()

movie,100044: Human Planet (2011),100068: Comme un chef (2012),100083: Movie 43 (2013),"100106: Pervert's Guide to Ideology, The (2012)",100159: Sightseers (2012),100163: Hansel & Gretel: Witch Hunters (2013),100194: Jim Jefferies: Fully Functional (EPIX) (2012),100226: Why Stop Now (2012),100277: Tabu (2012),100302: Upside Down (2012),...,99764: It's Such a Beautiful Day (2012),"99813: Batman: The Dark Knight Returns, Part 2 (2013)",99846: Everything or Nothing: The Untold Story of 007 (2012),99853: Codependent Lesbian Space Alien Seeks Same (2011),998: Set It Off (1996),"99910: Last Stand, The (2013)",99917: Upstream Color (2013),999: 2 Days in the Valley (1996),99: Heidi Fleiss: Hollywood Madam (1995),9: Sudden Death (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [42]:
pearson_corr.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,,0.085749,0.207172,0.18503,-0.24132,-0.064002,0.4502873,0.904534,-0.356348,...,-0.166667,0.15097,-0.093537,-0.2886751,-0.221146,-0.059354,0.012277,0.420441,-0.175412,-0.084961
2,,1.0,,,,,-1.0,,,0.695701,...,-0.52915,,-1.0,,,0.755929,,-0.125,,0.767185
3,0.085749,,1.0,,,,,,,,...,,,0.927173,,,-0.471405,,-0.29277,,0.92008
4,0.207172,,,1.0,-0.273483,0.274795,0.647345,-1.306396e-16,,0.593666,...,-0.604552,0.298922,0.020806,-1.110223e-16,0.403198,0.076878,0.03735,-0.213223,,-0.135612
5,0.18503,,,-0.273483,1.0,0.123476,0.437237,-0.05387166,,,...,,0.050937,0.195628,0.4055536,-0.209165,0.333123,0.472377,-0.016052,0.433013,-0.048877


- There are a lot of missing values here. This can happen, for example, when two users have no rated movies in common. We replace all these missing values with 0, which amounts to assuming no correlation between the two users given the available data.

In [44]:
pearson_corr = pearson_corr.fillna(0)
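
The sketch below reproduces the missing-value behaviour on a tiny made-up matrix: users 1 and 2 share three movies, while user 3 shares none with either of them. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; user 3 has no movie in common with users 1 and 2
toy_r = pd.DataFrame(
    {
        "A": [5.0, 4.0, np.nan],
        "B": [3.0, 1.0, np.nan],
        "C": [1.0, 2.0, np.nan],
        "D": [np.nan, np.nan, 4.0],
        "E": [np.nan, np.nan, 2.0],
    },
    index=[1, 2, 3],
)

# corr() works column-wise, so transpose to correlate users with each other
toy_corr = toy_r.T.corr()
print(toy_corr)

# Treat "no evidence" as zero correlation, as done in the notebook
toy_corr_filled = toy_corr.fillna(0)
```

Filling with 0 means such a pair is never selected as a positively correlated neighbour, so it simply drops out of the weighted mean.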

Now we have the user-user similarities stored in the matrix **pearson_corr**. We will define a function that predicts the unknown ratings in the test set using user-based collaborative filtering, with Pearson correlation as the similarity and all neighbours with positive correlation. For each user-movie pair:

1. Check whether the movie is in the train set; if it is not, predict the overall mean rating

2. Calculate the mean rating for the active user

3. Extract the correlation values from the matrix pearson_corr and sort them in decreasing order

4. Keep only the similarity scores for users with positive correlation with the active user

5. Drop all users who are similar to the active user but haven't rated the target movie

6. Check whether any similar users have rated the target movie; if none have, predict the movie's mean rating

7. Otherwise, use the prediction formula to make the rating prediction

$$pred(a,p) = \bar{r_a} + \frac{\sum_{b} sim(a,b) * (r_{b,p} - \bar{r_b})}{\sum_{b} sim(a,b)}$$

where:

- $pred(a,p)$ is the predicted rating for user a and movie p
- $\bar{r_a}$ is the mean rating of user a
- $sim(a,b)$ is the pearson correlation between user a and user b
- $r_{b,p}$ is the rating of user b for movie p
- $\bar{r_b}$ is the mean rating of user b
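
Before looking at the full function, it may help to evaluate the formula by hand for one prediction. The numbers below are made up: an active user with mean rating 3.5 and three positively correlated neighbours who rated the target movie.

```python
import numpy as np

# Hypothetical values, made up for illustration only
r_a_mean = 3.5                           # mean rating of active user a
sims = np.array([0.9, 0.5, 0.1])         # sim(a, b) for three neighbours b
r_bp = np.array([5.0, 3.0, 4.0])         # r_{b,p}: neighbours' ratings of movie p
r_b_mean = np.array([4.0, 3.5, 2.0])     # mean rating of each neighbour

# How far each neighbour's rating sits above or below their own average
deviations = r_bp - r_b_mean             # [1.0, -0.5, 2.0]

# Similarity-weighted mean of the deviations, added to the active user's mean
pred = r_a_mean + np.dot(sims, deviations) / sims.sum()
print(round(pred, 4))                    # 4.0667
```

The third neighbour liked the movie far more than usual (+2.0), but with a similarity of only 0.1 that opinion barely moves the prediction. Working with deviations rather than raw ratings corrects for users who rate everything high or everything low.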

In [45]:
r_avg_user_rating = rmatrix.mean(axis = 1)
r_avg_movie_rating = rmatrix.mean(axis = 0)

def cf_user_wmean(user, movie):
    if movie in rmatrix:
        ra = r_avg_user_rating[user]

        #Get the similarity scores for the user in question with every other user
        sim_scores = pearson_corr[user].sort_values(ascending = False)
        
        # Keep similarity scores for users with positive correlation with active user
        sim_scores_pos = sim_scores[sim_scores > 0]
        
        #Get the user ratings for the movie in question
        m_ratings = rmatrix[movie][sim_scores_pos.index]
        
        #Extract the indices containing NaN in the m_ratings series (Users who have not rated the target movie)
        idx = m_ratings[m_ratings.isnull()].index
        
        #Drop the NaN values from the m_ratings Series
        m_ratings = m_ratings.dropna()
        
        # If there are no ratings from similar users we cannot use this method so we predict just 
        # the average rating of the movie else we use the prediction formula
        if len(m_ratings) == 0:
            wmean_rating = r_avg_movie_rating[movie]
        else:   
            #Drop the corresponding correlation scores from the sim_scores series
            sim_scores_pos = sim_scores_pos.drop(idx)
            
            #Subtract average rating of each user from the rating (rbp - mean(rb))
            m_ratings = m_ratings - r_avg_user_rating[m_ratings.index]
            
            #Compute the final weighted mean using np.dot which is nothing but the product divided by sum of weights
            wmean_rating = ra + (np.dot(sim_scores_pos, m_ratings)/ sim_scores_pos.sum())
   
    else:
        wmean_rating = avg_rating
    
    return wmean_rating

In [46]:
rmse_score(cf_user_wmean, test)

0.9102944927663451

- !!BAM!! We have improved our model by considering only the similar users to the target user from ~0.98 to ~0.91

> Now, to avoid all the manual calculation and to perform parameter tuning to select top k similar users, we will use **surprise library** which provides a lot of inbuilt functions to perform these tasks.

### Surprise Library

Surprise is an easy-to-use, scikit-learn-like Python library for recommender systems. It provides essential tools to build and experiment with various collaborative filtering methods, including support for:

1. Cross Validation

2. Grid Search

3. Built-in Datasets

4. Various Collaborative filtering methods


### Installation

`$ pip install numpy`

`$ pip install scikit-surprise`

In [16]:
from surprise import Dataset, Reader
from surprise.model_selection import GridSearchCV
from surprise.prediction_algorithms import KNNWithMeans

To load a dataset from a pandas dataframe within Surprise, you will need the load_from_df() method. 
1. You will also need a `Reader` object and the `rating_scale` parameter must be specified. 
2. The dataframe here must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings in this order. 
3. Each row thus corresponds to a given rating

In [17]:
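# Note: MovieLens ratings run from 0.5 to 5.0 in half-star steps; the results
# reported in this notebook were obtained with rating_scale=(1, 5) as written.
reader = Reader(rating_scale=(1, 5))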
data = Dataset.load_from_df(train[['userId','movie','rating']], reader)

#### Grid Search for neighbourhood size and similarity measure

The `cross_validate()` function reports accuracy metric over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the `GridSearchCV` class comes to the rescue. 

Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn’s GridSearchCV.
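
To get a feel for what "exhaustively tries all the combinations" costs, the sketch below enumerates a grid with the standard library only. It is not how Surprise is implemented internally; `toy_grid` is a hypothetical flat grid that mirrors the neighbourhood sizes and similarity names searched in this notebook.

```python
from itertools import product

# Hypothetical flat grid: 10 neighbourhood sizes x 2 similarity measures
toy_grid = {"k": list(range(1, 50, 5)), "sim_name": ["cosine", "pearson"]}

# Cartesian product of all value lists, turned back into one dict per combination
keys = list(toy_grid)
combos = [dict(zip(keys, values)) for values in product(*(toy_grid[key] for key in keys))]

n_folds = 5
print(len(combos))            # 20 parameter combinations
print(len(combos) * n_folds)  # 100 model fits with 5-fold cross-validation
print(combos[0], combos[-1])
```

Every extra parameter multiplies the number of fits, which is why the grids in this notebook step through `k` in increments of 5 rather than trying every value.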

In [21]:
param_grid = {"k":list(range(1,50,5)),
              "sim_options":{"name":["cosine","pearson"]}}

#KNNWithMeans by default does user based collaborative filtering
gs = GridSearchCV(KNNWithMeans, 
                  param_grid, 
                  measures=['rmse'], 
                  cv=5, 
                  n_jobs = -1)

gs.fit(data)

Computing the cosine similarity matrix...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
...

In [22]:
print(f'Best RMSE score: {gs.best_score["rmse"]}')
print(f'Best parameters: {gs.best_params["rmse"]}')

Best RMSE score: 0.9067101350619954
Best parameters: {'k': 46, 'sim_options': {'name': 'cosine', 'user_based': True}}


#### Fitting model on complete train data and checking the performance on test data

In [51]:
sim_options = {'name': 'cosine', 'user_based': True}

model = KNNWithMeans(k = 46, sim_options = sim_options)

model.fit(data.build_full_trainset())

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x782f6541dee0>

In [53]:
id_pairs = zip(test['userId'], test['movie'])

y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

y_true = test['rating']

In [54]:
rmse(y_true, y_pred)

0.9087259019005215

- !!SMALL BAM!! Restricting the prediction to the top 46 most similar users improves the test RMSE slightly, from ~0.910 to ~0.909

## 9. Item based collaborative filtering with Simple Item Mean

Item-based CF predicts a rating as a weighted mean of the ratings the user gave to similar items. As a first step, let's ignore similarity and simply predict the average of all ratings given by a particular user to the other movies.

In [57]:
# Mean of all ratings given by each user (row-wise mean of the ratings matrix)
avg_item_rating = rmatrix.mean(axis=1)

def cf_item_mean(user, movie):
    mean_rating = avg_item_rating[user]

    return mean_rating

In [58]:
rmse_score(cf_item_mean, test)

0.9497467465720074

## 10. Item based collaborative filtering with Similarity Weighted Mean

For this we will use cosine similarity as the similarity metric.

In [61]:
rmatrix_dummy = rmatrix.copy().fillna(0)
rmatrix_dummy.head()

movie,100044: Human Planet (2011),100068: Comme un chef (2012),100083: Movie 43 (2013),"100106: Pervert's Guide to Ideology, The (2012)",100159: Sightseers (2012),100163: Hansel & Gretel: Witch Hunters (2013),100194: Jim Jefferies: Fully Functional (EPIX) (2012),100226: Why Stop Now (2012),100277: Tabu (2012),100302: Upside Down (2012),...,99764: It's Such a Beautiful Day (2012),"99813: Batman: The Dark Knight Returns, Part 2 (2013)",99846: Everything or Nothing: The Untold Story of 007 (2012),99853: Codependent Lesbian Space Alien Seeks Same (2011),998: Set It Off (1996),"99910: Last Stand, The (2013)",99917: Upstream Color (2013),999: 2 Days in the Valley (1996),99: Heidi Fleiss: Hollywood Madam (1995),9: Sudden Death (1995)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(rmatrix_dummy.T, rmatrix_dummy.T)

cosine_sim = pd.DataFrame(cosine_sim, index = rmatrix.columns, columns = rmatrix.columns)

In [66]:
cosine_sim.head()

movie,100044: Human Planet (2011),100068: Comme un chef (2012),100083: Movie 43 (2013),"100106: Pervert's Guide to Ideology, The (2012)",100159: Sightseers (2012),100163: Hansel & Gretel: Witch Hunters (2013),100194: Jim Jefferies: Fully Functional (EPIX) (2012),100226: Why Stop Now (2012),100277: Tabu (2012),100302: Upside Down (2012),...,99764: It's Such a Beautiful Day (2012),"99813: Batman: The Dark Knight Returns, Part 2 (2013)",99846: Everything or Nothing: The Untold Story of 007 (2012),99853: Codependent Lesbian Space Alien Seeks Same (2011),998: Set It Off (1996),"99910: Last Stand, The (2013)",99917: Upstream Color (2013),999: 2 Days in the Valley (1996),99: Heidi Fleiss: Hollywood Madam (1995),9: Sudden Death (1995)
movie,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100044: Human Planet (2011),1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100068: Comme un chef (2012),0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100083: Movie 43 (2013),0.0,0.0,1.0,0.0,0.544949,0.229488,0.3114,0.0,0.0,0.0,...,0.0,0.325971,0.0,0.0,0.0,0.0,0.297245,0.0,0.0,0.0
"100106: Pervert's Guide to Ideology, The (2012)",1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100159: Sightseers (2012),0.0,0.0,0.544949,0.0,1.0,0.421117,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.545455,0.0,0.0,0.0


In [67]:
# Check the similarity of some movies
cosine_sim['100044: Human Planet (2011)'].sort_values(ascending=False).head(4)

movie
100044: Human Planet (2011)                      1.0
188675: Dogman (2018)                            1.0
127180: Story of Film: An Odyssey, The (2011)    1.0
127114: The End of the Tour (2015)               1.0
Name: 100044: Human Planet (2011), dtype: float64

Now we have the item-item similarities stored in the matrix **cosine_sim**. We will define a function that predicts the unknown ratings in the test set using item-based collaborative filtering, with cosine similarity and all the ratings the user gave to other items. For each user-movie pair:

1. Check whether the movie is in the train set; if it is not, predict the overall mean rating
2. Extract the cosine similarity values for the movie from the matrix cosine_sim
3. Drop all items the user has not rated from both the similarity scores and the ratings, as they cannot contribute to the prediction
4. Use the prediction formula to make the rating prediction:

    $$pred(u,p) = \frac{\sum_{i} sim(i,p) * r_{ui}}{\sum_{i} sim(i,p)}$$

where:

- $pred(u,p)$ is the predicted rating for user u and movie p
- $sim(i,p)$ is the similarity between items i and p
- $r_{ui}$ is the rating given by user u to item i
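
As a small worked instance of this weighted mean, the sketch below uses made-up numbers for a user who has rated three items, each with a known similarity to the target movie. The names and values are hypothetical.

```python
import numpy as np

# Hypothetical values, made up for illustration only
sim_ip = np.array([0.8, 0.4, 0.0])   # sim(i, p) for the three items i rated by user u
r_ui = np.array([5.0, 3.0, 1.0])     # r_{ui}: the user's ratings of those items

# Similarity-weighted mean of the user's own ratings
pred_up = np.dot(sim_ip, r_ui) / sim_ip.sum()
print(round(pred_up, 4))             # 4.3333
```

The item with similarity 0 has no influence at all, so the 1-star rating is ignored. If every rated item had similarity 0, the denominator would be 0 and the result would be NaN, which is why the function in this section falls back to the overall mean rating in that case.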

In [96]:
def cf_item_wmean(user, movie):
    if movie in rmatrix:
        #The similarity scores for the item in question with every other item
        sim_scores = cosine_sim[movie]
        
        # sim_scores = sim_scores[sim_scores > 0]
        
        #The movie ratings for the user in question
        m_ratings = rmatrix.loc[user]
        
        #Extract the indices containing NaN in the m_ratings series
        idx = m_ratings[m_ratings.isnull()].index
        
        #Drop the NaN values from the m_ratings Series (removing unrated items)
        m_ratings = m_ratings.dropna()
        
        #Drop the corresponding cosine scores from the sim_scores series
        sim_scores = sim_scores.drop(idx)
        
        #Compute the final weighted mean
        wmean_rating = np.dot(sim_scores, m_ratings)/ sim_scores.sum()
    
    else:
        wmean_rating = avg_rating
    
    return wmean_rating if not np.isnan(wmean_rating) else avg_rating

In [97]:
rmse_score(cf_item_wmean, test)

0.9299153704678671

- !!BAM!! Weighting the user's ratings by item similarity improves the test RMSE from ~0.95 to ~0.93

#### Using Surprise library for item based collaborative filtering

In [18]:
param_grid = {"k":list(range(1,50,5)),
              "sim_options":{"name":["cosine","pearson"],'user_based': [False]}}

gs = GridSearchCV(KNNWithMeans, 
                  param_grid, 
                  measures=['rmse'], 
                  cv=5, 
                  n_jobs = 2)

gs.fit(data)

Computing the cosine similarity matrix...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
...

In [19]:
print(f'Best RMSE score: {gs.best_score["rmse"]}')
print(f'Best parameters: {gs.best_params["rmse"]}')

Best RMSE score: 0.9093759977738891
Best parameters: {'k': 46, 'sim_options': {'name': 'cosine', 'user_based': False}}


In [20]:
# Let's fit the model with the best parameters
sim_options = {'name': 'cosine', 'user_based': False}

model = KNNWithMeans(k = 46, sim_options = sim_options)

model.fit(data.build_full_trainset())

id_pairs = zip(test['userId'], test['movie'])

y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

y_true = test['rating']

rmse(y_true, y_pred)

Computing the cosine similarity matrix...
Done computing similarity matrix.


0.9103073035144981

- !!SMALL BAM!! Restricting the prediction to the top 46 most similar items improves the test RMSE from ~0.93 to ~0.91

## 11. Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices.

- We will use Singular Value Decomposition (SVD) for matrix factorization

- We will use the Surprise library to perform matrix factorization
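
To illustrate the idea of factorizing the ratings into two low-rank matrices, here is a minimal sketch trained with stochastic gradient descent on made-up data. It is not Surprise's implementation: it omits the user and item bias terms, and the learning rate, regularization and data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (user, item, rating) triples, made up for illustration only
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

P = rng.normal(0, 0.1, (n_users, n_factors))   # user factor matrix
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factor matrix
mu = np.mean([r for _, _, r in triples])       # global mean rating

def train_rmse():
    """RMSE of mu + P[u] . Q[i] over the observed ratings."""
    errs = [r - (mu + P[u] @ Q[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

rmse_before = train_rmse()

lr, reg = 0.05, 0.02                           # assumed learning rate and L2 penalty
for _ in range(200):
    for u, i, r in triples:
        err = r - (mu + P[u] @ Q[i])
        p_old = P[u].copy()                    # keep old user factors for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])

rmse_after = train_rmse()
print(rmse_before, rmse_after)
```

Only the observed ratings are visited during training, so missing entries never need to be filled in. A prediction for an unseen user-item pair is then just `mu + P[u] @ Q[i]`.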



In [21]:
from surprise import SVD

In [22]:
model = SVD(n_factors=100)

model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fed83113be0>

In [23]:
id_pairs = zip(test['userId'], test['movie'])

y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

y_true = test['rating']

rmse(y_true, y_pred)

0.8860800569214757

- !!BAM!! Matrix factorization with SVD improves the test RMSE from ~0.91 to ~0.89

#### Tuning the SVD

In [25]:
param_grid = {'n_factors':list(range(1,50,5)), 'n_epochs': [5, 10, 20], 'random_state': [42]}

gs = GridSearchCV(SVD, 
                  param_grid, 
                  measures=['rmse'], 
                  cv=5, 
                  n_jobs = 2)

gs.fit(data)

In [27]:
print(f'Best RMSE score: {gs.best_score["rmse"]}')
print(f'Best parameters: {gs.best_params["rmse"]}')

Best RMSE score: 0.8720242043564241
Best parameters: {'n_factors': 1, 'n_epochs': 20, 'random_state': 42}


In [28]:
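# Fitting the model on the full training set. Note: n_factors=46 is used here,
# although the grid search above reported n_factors=1 as the best value.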
model = SVD(n_factors=46, n_epochs=20, random_state=42)

model.fit(data.build_full_trainset())

id_pairs = zip(test['userId'], test['movie'])

y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

y_true = test['rating']

rmse(y_true, y_pred)

0.8819810245452766

- !!TINY BAM!! Using n_factors=46 with 20 epochs improves the test RMSE slightly, from ~0.886 to ~0.882

## 12. Content Based Filtering

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user.

- We will use a movies dataset with detailed information about movies for content-based filtering; you can find the dataset under the Datasets/recommender/ folder.

In [2]:
movies = pd.read_csv('../Datasets/recommender/movie_ratings_with_info.zip')

In [3]:
movies.shape

(41865, 7)

In [4]:
movies.head().T

Unnamed: 0,0,1,2,3,4
user_id,1,1,1,1,1
movie_id,5: Four Rooms,11: Star Wars,12: Finding Nemo,13: Forrest Gump,14: American Beauty
rating,3,2,5,5,5
keywords,hotel new year's eve witch bet hotel room,android galaxy hermit death star lightsaber,father son relationship harbor underwater fish...,vietnam veteran hippie mentally disabled runni...,male nudity female nudity adultery midlife cri...
cast,Tim Roth Antonio Banderas Jennifer Beals Madon...,Mark Hamill Harrison Ford Carrie Fisher Peter ...,Albert Brooks Ellen DeGeneres Alexander Gould ...,Tom Hanks Robin Wright Gary Sinise Mykelti Wil...,Kevin Spacey Annette Bening Thora Birch Wes Be...
genres,Crime Comedy,Adventure Action Science Fiction,Animation Family,Comedy Drama Romance,Drama
director,Allison Anders,George Lucas,Andrew Stanton,Robert Zemeckis,Sam Mendes


In [75]:
movies.movie_id.nunique()

555

- We have 555 movies in the dataset

### Creating item profile using various features

In [12]:
features = ['keywords','cast','genres','director']

In [13]:
for feature in features:
    movies[feature] = movies[feature].fillna('')

In [14]:
#Concatenating strings from each of the content features to get the entire information in 1 column
movies['combined'] = movies['keywords'] + ' ' + movies['cast'] + ' ' + movies['genres'] + ' ' + movies['director']

In [15]:
movies['combined'].sample(1).values

array(['based on novel psychopath horror suspense serial killer Jodie Foster Anthony Hopkins Scott Glenn Ted Levine Anthony Heald Crime Drama Thriller Jonathan Demme'],
      dtype=object)

In [88]:
# Count the 500 most frequent terms (unigrams and bigrams) across all movies using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2),max_features=500)

# Let's use only unique movie ids
df = movies.drop_duplicates(subset='movie_id').reset_index(drop=True)

count_matrix = cv.fit_transform(df["combined"])

In [89]:
count_matrix.todense().shape

(555, 500)

In [90]:
#Finding cosine similarity between frequency or count vectors for each movie
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix)

In [91]:
cosine_sim.shape

(555, 555)
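To make the similarity step concrete, here is a minimal from-scratch sketch of cosine similarity on term counts. The three strings and the helper `toy_cosine` are made up for illustration and are not part of the dataset or of scikit-learn. The sketch splits on whitespace only, so it ignores the bigrams and tokenisation rules that `CountVectorizer` applies.

```python
import math
from collections import Counter

def toy_cosine(text_a, text_b):
    """Cosine similarity between the term-count vectors of two strings."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the terms the two texts share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up "combined" strings (hypothetical, not from the dataset)
doc_a = "batman gotham crime action"
doc_b = "batman gotham joker action"
doc_c = "fish ocean family animation"

print(toy_cosine(doc_a, doc_b))  # 0.75: three of four terms shared
print(toy_cosine(doc_a, doc_c))  # 0.0: no terms shared
```

Each row of `cosine_sim` holds these pairwise scores for one movie against all 555 movies, which is why the matrix is square.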

Finding most similar movies based on content

In [92]:
# Pick a movie title that the user likes
movie_user_likes = "The Dark Knight"

In [93]:
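def get_title_from_index(index):
    # Look up the "id: title" label stored at a row of df
    return df.loc[index, "movie_id"]

def get_index_from_title(title):
    # Return the first row whose "id: title" label contains the given title;
    # regex=False treats the title as literal text rather than a pattern
    return df[df.movie_id.str.contains(title, regex=False)].index.values[0]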

In [94]:
# Get the row index of "The Dark Knight"
movie_index = get_index_from_title(movie_user_likes)

In [95]:
similar_movies =  list(enumerate(cosine_sim[movie_index]))

sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)

In [96]:
# The query movie itself appears first, so print 11 entries to see its top 10 neighbours
for element in sorted_similar_movies[:11]:
    print(get_title_from_index(element[0]))

155: The Dark Knight
272: Batman Begins
415: Batman & Robin
1620: Hitman
364: Batman Returns
268: Batman
314: Catwoman
1619: The Way of the Gun
629: The Usual Suspects
414: Batman Forever
393: Kill Bill: Vol. 2


- !!!BAM!!! Most of the movies recommended by content-based filtering (the other Batman films, Catwoman) are closely related to the target movie

In [98]:
def get_recommendations(movie_name):
    movie_index = get_index_from_title(movie_name)

    similar_movies = list(enumerate(cosine_sim[movie_index]))
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)

    # Return 11 entries: the query movie itself plus its 10 most similar movies
    return [get_title_from_index(element[0]) for element in sorted_similar_movies[:11]]

In [101]:
get_recommendations('Finding Nemo')

['12: Finding Nemo',
 '35: The Simpsons Movie',
 '321: Mambo Italiano',
 '1439: Anna and the King',
 "1268: Mr. Bean's Holiday",
 '319: True Romance',
 '587: Big Fish',
 '1656: The Legend of Zorro',
 '309: The Celebration',
 '468: My Own Private Idaho',
 '118: Charlie and the Chocolate Factory']

### Using TF-IDF for content based filtering

In [118]:
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, normalize

In [119]:
train_data, test_data = train_test_split(df, test_size = 0.2, random_state = 42)

In [120]:
train_data.shape, test_data.shape

((444, 8), (111, 8))

#### Steps for building the content based recommender system

1. Calculate Item Profile from tfidf values of each movie

2. Calculate the user profile from the item profiles of the movies the user rated. For each user, I look up the movies they rated (1-5) in the train set, weight each movie's item profile by its rating, and take the rating-weighted average of those TF-IDF vectors

3. Next step is to find out the most similar movies to each user profile

4. Finally top n recommendations are created from these most similar movies
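The four steps above can be sketched end to end on a tiny example. Everything here is an assumption made for illustration: the item ids, the three "terms", the profile values and the ratings are invented, and dense NumPy arrays stand in for the sparse TF-IDF matrix used in the notebook.

```python
import numpy as np

# Step 1: made-up item profiles, one row per movie over three hypothetical terms
# (columns: "action", "animation", "romance")
toy_item_ids = ["A", "B", "C", "D"]
toy_item_profiles = np.array([
    [1.0, 0.0, 0.0],   # A: pure action
    [0.8, 0.0, 0.6],   # B: action with some romance
    [0.0, 1.0, 0.0],   # C: pure animation
    [0.0, 0.6, 0.8],   # D: animation with some romance
])

# Step 2: the user rated A with 5 and C with 1, so the rating-weighted
# average of their item profiles leans towards action
rated_rows = [0, 2]
ratings = np.array([5.0, 1.0])
profile = (toy_item_profiles[rated_rows] * ratings[:, None]).sum(axis=0) / ratings.sum()
profile /= np.linalg.norm(profile)   # L2-normalise the user profile

# Step 3: cosine similarity between the user profile and every item profile
item_norms = np.linalg.norm(toy_item_profiles, axis=1)
similarities = toy_item_profiles @ profile / item_norms

# Step 4: rank the items, skip the ones already rated, keep the top n
ranked = [toy_item_ids[i] for i in np.argsort(-similarities) if i not in rated_rows]
print(ranked[:2])   # ['B', 'D']
```

Movie B ranks above D because the highly rated action movie dominates the profile, which is the effect the rating weights are meant to produce.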

In [283]:
# Extracting TF-IDF vectors for combined features
vectorizer = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 2),
                     min_df=0.003,
                     max_df=0.5,
                     max_features=500)

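item_ids = train_data['movie_id'].unique().tolist()
# train_data has one row per movie, so fitting on the column directly keeps row i of
# tfidf_matrix aligned with item_ids[i] (calling .unique() on the text could drop rows)
tfidf_matrix = vectorizer.fit_transform(train_data['combined'])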
tfidf_feature_names = vectorizer.get_feature_names_out()

In [284]:
tfidf_feature_names[:5]

array(['1970s', 'abuse', 'action', 'action adventure', 'action comedy'],
      dtype=object)

In [285]:
train_data.head()

Unnamed: 0,user_id,movie_id,rating,keywords,cast,genres,director,combined
140,3,326: Snakes on a Plane,2,snake suspense fbi agent death animal attack,Samuel L. Jackson Julianna Margulies Kenan Tho...,Action Crime Horror Thriller,David R. Ellis,snake suspense fbi agent death animal attack S...
89,1,199: Star Trek: First Contact,4,federation starfleet borg enterprise-e cyborg,Patrick Stewart Jonathan Frakes Brent Spiner L...,Science Fiction Action Adventure Thriller,Jonathan Frakes,federation starfleet borg enterprise-e cyborg ...
310,10,707: A View to a Kill,5,paris london england france england san francisco,Roger Moore Christopher Walken Tanya Roberts G...,Adventure Action Thriller,John Glen,paris london england france england san franci...
499,279,1495: Kingdom of Heaven,4,crusade epic knight swordsman order of the tem...,Orlando Bloom Eva Green Jeremy Irons Marton Cs...,Drama Action Adventure History War,Ridley Scott,crusade epic knight swordsman order of the tem...
155,5,388: Inside Man,2,bank manager kidnapping nazi background docume...,Denzel Washington Clive Owen Jodie Foster Chri...,Crime Drama Thriller,Spike Lee,bank manager kidnapping nazi background docume...


In [286]:
# Function to extract movie profile or tfidf for a given movie
def get_item_profile(item_id):
    idx = item_ids.index(item_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile


# Function to extract movie profile or item profile for items with given ids
def get_item_profiles(ids):
    if isinstance(ids, str):
        ids = [ids]
    item_profiles_list = [get_item_profile(x) for x in ids]
    item_profiles = vstack(item_profiles_list)
    return item_profiles


# Function to create user profile using interaction strength or rating
def build_users_profile(person_id, interactions_indexed_df):
    
    interactions_person_df = interactions_indexed_df.loc[person_id]

    # Get item profiles for all items for the given user
    user_item_profiles = get_item_profiles(interactions_person_df['movie_id'])
    
    # Storing ratings or interaction strength for each user or person
    user_item_strengths = np.array(interactions_person_df['rating']).reshape(-1,1)
    
    # Weighted average of the item profiles, using the ratings as weights, gives the raw user profile
    user_item_strengths_weighted_avg = np.sum(user_item_profiles.multiply(user_item_strengths), axis=0) / np.sum(user_item_strengths)

    # Convert np.matrix to a plain NumPy array
    user_item_strengths_weighted_avg_array = np.asarray(user_item_strengths_weighted_avg)

    # Normalize the user profile
    user_profile_norm = normalize(user_item_strengths_weighted_avg_array.reshape(1, -1))
    return user_profile_norm

# Function to build user profiles for all users using the train data
def build_users_profiles(): 
    interactions_indexed_df = train_data.set_index('user_id')
    user_profiles = {}

    # Loop over each user to build their profile
    for person_id in interactions_indexed_df.index.unique():
        user_profiles[person_id] = build_users_profile(person_id, interactions_indexed_df)
    return user_profiles

In [288]:
user_profiles = build_users_profiles()

In [289]:
user_profiles.keys()

dict_keys([3, 1, 10, 279, 5, 405, 13, 7, 206, 6, 11, 883, 15, 201, 2, 54, 207, 110, 550, 254, 8, 12, 409, 119, 181, 18, 21, 489, 655, 56, 325, 42, 58, 433, 113, 92, 83, 130, 34, 551, 60, 141, 14, 59, 195, 43, 109, 125, 222, 854, 116, 234, 16, 587, 782, 82, 543, 399, 387, 121, 49])

In [290]:
# Highest-weighted tokens in the profile of an example user (201)
myprofile = user_profiles[201]
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        user_profiles[201].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

Unnamed: 0,token,relevance
0,antonio,0.269646
1,cia,0.261813
2,corruption,0.234423
3,depp,0.234423
4,johnny depp,0.234423
5,steve,0.229549
6,johnny,0.22659
7,dafoe,0.218352
8,emma,0.218352
9,willem,0.218352


In [291]:
# Function to extract all movies that the user has already rated in the train set
def get_items_interacted(person_id, train_data):
    
    # Select the movie ids of every row belonging to this user
    interacted_items = train_data[train_data['user_id'] == person_id]['movie_id']
    return set(interacted_items if isinstance(interacted_items, pd.Series) else [interacted_items])

#Extract 100 most similar items to the user profile
def get_similar_items_to_user_profile(person_id, topn=100):
    #Computes the cosine similarity between the user profile and all item profiles
    cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)
    
    #Gets the top similar items
    similar_indices = cosine_similarities.argsort().flatten()[-topn:]
    
    #Sort the similar items by similarity
    similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
    return similar_items
    
# Generate the top-n recommendations from the movies that the user has not rated yet
def cb_recommend_items(user_id, items_to_ignore=None, topn=10):
    # Avoid a mutable default argument; a set also makes the membership test fast
    items_to_ignore = set(items_to_ignore or [])
    similar_items = get_similar_items_to_user_profile(user_id)
        
    # Ignore items the user has already interacted with
    similar_items_filtered = [item for item in similar_items if item[0] not in items_to_ignore]
        
    # The second column holds the cosine similarity to the user profile, not a rating
    recommendations_df = pd.DataFrame(similar_items_filtered, columns=['movie_id', 'similarity']) \
                                    .head(topn)
    return list(recommendations_df['movie_id'])

In [292]:
#Extracting recommended items for user 201
userr = 201
iti = list(train_data[train_data['user_id'] == userr]['movie_id'])
cb_recommend_items(userr, items_to_ignore=iti, topn=10)

['1639: Speed 2: Cruise Control',
 '388: Inside Man',
 '522: Ed Wood',
 '792: Platoon',
 '1586: Secret Window',
 '622: The Ninth Gate',
 '162: Edward Scissorhands',
 '768: From Hell',
 '421: The Life Aquatic with Steve Zissou',
 '557: Spider-Man']

In [293]:
# Implementation of average precision@k
def apk(actual, predicted, k=3):
    actual = list(actual)
    predicted = list(predicted)[:k]

    # Nothing relevant to find, so the score is zero
    if not actual:
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i, p in enumerate(predicted):
        # Count each relevant item once, at the first rank where it appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    # Note: this variant averages over the hits found; a common alternative
    # divides by min(len(actual), k) instead
    if num_hits == 0:
        return 0.0
    return score / num_hits
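A small worked example may help when reading the metric. The sketch below re-implements the same hit-normalised variant under the hypothetical name `toy_apk` so that it runs on its own; the item ids are made up.

```python
def toy_apk(actual, predicted, k):
    """Average precision@k, normalised by the number of hits found."""
    hits, score = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each relevant item once, at the first rank where it appears
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank   # precision at this rank
    return score / hits if hits else 0.0

actual = ["B", "D"]                  # hypothetical held-out movies
predicted = ["A", "B", "C", "D"]     # hypothetical ranked recommendations

# Hits at rank 2 (precision 1/2) and rank 4 (precision 2/4): (0.5 + 0.5) / 2
print(toy_apk(actual, predicted, k=4))   # 0.5
# With k=1 only "A" is considered, and it is not relevant
print(toy_apk(actual, predicted, k=1))   # 0.0
```

The score rewards relevant items that appear early in the list: moving "B" to rank 1 would raise the first precision term from 1/2 to 1.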

#### Evaluation using Mean Average Precision at 10

In [294]:
#List of unique users from test set
user_ids = list(test_data['user_id'].unique())
sum_ap = 0
ap_user = []
missing = 0
for user in user_ids:
    if user not in user_profiles.keys():
        missing += 1
        continue
    #Ignoring movies that the user has already rated
    iti = list(train_data[train_data['user_id'] == user]['movie_id'])
    
    #Creating recommendations for each user
    rec = cb_recommend_items(user, items_to_ignore=iti, topn=10)
    
    #Actual movies rated by the user
    act = list(test_data[test_data['user_id'] == user]['movie_id'])
    
    #Calculating Average Precision@K for each user
    ap = apk(act, rec, k=10)
    ap_user.append(ap)
    
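    #Running sum of the per-user average precisions
    sum_ap += ap

#Mean average precision@10, averaged over all test users
#(users without a profile are skipped above and therefore count as 0)
map_at_10 = sum_ap/len(user_ids)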
print(map_at_10)


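0.0

- A MAP@10 of exactly 0.0 is an artifact of the split rather than a verdict on the model: `df` holds one row per movie (`drop_duplicates(subset='movie_id')`), so `train_test_split` places every movie in either the train set or the test set. Recommendations are drawn only from `item_ids` (train movies), while the actual items come from the test movies, so the two lists can never overlap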
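One way to avoid a train/test split with no movies in common is to split the rating rows rather than the deduplicated movie table. The sketch below contrasts the two approaches on made-up ratings; `toy_ratings`, its values and the split sizes are assumptions chosen for illustration. It only shows the splitting idea, and the item and user profiles would still need to be rebuilt on such a split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up ratings (hypothetical): 3 users x 4 movies, one row per (user, movie) pair
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "movie_id": ["A", "B", "C", "D"] * 3,
    "rating":   [5, 3, 4, 1, 2, 5, 4, 3, 4, 4, 1, 5],
})

# Split A: deduplicate by movie first, then split the movies
by_movie = toy_ratings.drop_duplicates(subset="movie_id")
train_m, test_m = train_test_split(by_movie, test_size=2, random_state=42)
overlap_m = set(train_m["movie_id"]) & set(test_m["movie_id"])

# Split B: split the rating rows themselves (4 test rows, 8 train rows)
train_r, test_r = train_test_split(toy_ratings, test_size=4, random_state=42)
overlap_r = set(train_r["movie_id"]) & set(test_r["movie_id"])

# Each movie lands on exactly one side, so nothing in test can be recommended
print(len(overlap_m))        # 0
# With 8 train rows and 3 rows per movie, at least one movie must straddle the split
print(len(overlap_r) > 0)    # True
```

With a row-level split, a movie held out for one user can still be learned from other users' ratings, so recommendations and held-out items can overlap.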
