# Movie Recommendation System

## 0. Setup

To run this notebook, you only need [pipenv](https://github.com/pypa/pipenv) installed. Then run these three commands:

1. Install the dependencies with: `pipenv install`
2. Launch the virtual env: `pipenv shell`
3. Start jupyter and open the notebook: `jupyter-lab`

In [1]:
%load_ext autotime

import os
import numpy as np
import pandas as pd
import random
import requests

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

from surprise import NormalPredictor, SVD, KNNBasic, NMF
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, KFold

from tempfile import NamedTemporaryFile
from tqdm import tqdm
from zipfile import ZipFile

time: 689 ms (started: 2022-11-12 01:25:23 +05:30)


## 1. Introduction

The goal of a recommender system is to push relevant items to a given user, which requires understanding and modelling that user's preferences. In this project we build two different recommender systems. The first is a pure collaborative filtering approach built with the [Surprise library](http://surpriselib.com/); the second relies on item attributes in a content-based way.

## 2. Downloading and Loading Data

We use the [MovieLens dataset](https://grouplens.org/datasets/movielens/), which contains about 25 million user ratings. The archive is downloaded and extracted into the `ml-25m` folder. We could load the .csv file directly with a built-in Surprise function, but loading it through a Pandas dataframe gives more flexibility later.

In [2]:
# Download and extract the dataset

zip_file_name = "ml-25m.zip"
zip_url = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"


response = requests.get(zip_url, stream=True)
response.raise_for_status()  # stop early on an HTTP error
total_size_in_bytes = int(response.headers.get('content-length', 0))
block_size = 1024  # 1 kibibyte
progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)

with open(zip_file_name, 'wb') as file:
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))
        file.write(data)
    progress_bar.close()
if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
    print("ERROR, something went wrong")
    
with ZipFile(zip_file_name, "r") as zip_ref:
    for file in tqdm(iterable=zip_ref.namelist(), total=len(zip_ref.namelist())):
        zip_ref.extract(member=file)
        
os.remove(zip_file_name)

100%|████████████████████████████████████████████████████████████████████████████████████████| 262M/262M [00:28<00:00, 9.05MiB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  2.97it/s]

time: 32.8 s (started: 2022-11-12 01:25:24 +05:30)





In [3]:
RATINGS_DATA_FILE = f"{zip_file_name[:-4]}/ratings.csv"
MOVIES_DATA_FILE = f"{zip_file_name[:-4]}/movies.csv"

time: 223 µs (started: 2022-11-12 01:25:57 +05:30)


In [4]:
# Load the raw csv into a dataframe
df_ratings = pd.read_csv(RATINGS_DATA_FILE)

# Drop the timestamp column since we don't need it here
df_ratings = df_ratings.drop(columns="timestamp")

# Movies dataframe
df_movies = pd.read_csv(MOVIES_DATA_FILE)

time: 3.78 s (started: 2022-11-12 01:25:57 +05:30)


In [5]:
# Check that we have about 25M user ratings
df_ratings.userId.count()

25000095

time: 31.8 ms (started: 2022-11-12 01:26:00 +05:30)


In [6]:
def get_subset(df, number):
    """
    Get a random subset of a large dataset for debugging purposes.
    """
    rids = np.arange(df.shape[0])
    np.random.shuffle(rids)
    df_subset = df.iloc[rids[:number], :].copy()
    return df_subset

df_ratings_1k = get_subset(df_ratings, 1000)
df_movies_100 = get_subset(df_movies, 100)

time: 617 ms (started: 2022-11-12 01:26:00 +05:30)
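
The shuffle in `get_subset` is unseeded, so every run draws a different subset and the scores further down change from run to run. A minimal sketch of a reproducible variant, assuming pandas' `DataFrame.sample` is acceptable here. The name `get_subset_seeded`, the `seed` parameter and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd


def get_subset_seeded(df, number, seed=42):
    """Return `number` rows sampled without replacement, reproducibly."""
    # Hypothetical helper: never request more rows than the frame holds.
    number = min(number, len(df))
    return df.sample(n=number, random_state=seed).copy()


# Toy frame standing in for df_ratings (values are made up for illustration).
toy = pd.DataFrame(
    {"userId": range(10), "movieId": range(100, 110), "rating": [3.0] * 10}
)
subset_a = get_subset_seeded(toy, 4)
subset_b = get_subset_seeded(toy, 4)
print(subset_a.equals(subset_b))  # same seed, same rows
```

With a fixed seed, the 1,000-rating and 100-movie subsets would stay the same across kernel restarts, which makes the later outputs easier to compare.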


In [7]:
# Surprise reader (MovieLens ratings range from 0.5 to 5 in half-star steps)
reader = Reader(rating_scale=(0.5, 5))

# Load the 1,000-rating debug subset into a Surprise dataset
ratings = Dataset.load_from_df(df_ratings_1k, reader)

time: 841 µs (started: 2022-11-12 01:26:01 +05:30)


In [8]:
df_ratings_1k.head()

|          | userId | movieId | rating |
|----------|--------|---------|--------|
| 19890974 | 129330 | 8798    | 3.5    |
| 1513320  | 10115  | 3552    | 3.0    |
| 9285614  | 60497  | 1291    | 2.0    |
| 7382990  | 47868  | 1265    | 3.5    |
| 7136723  | 46264  | 1320    | 3.0    |


time: 4.31 ms (started: 2022-11-12 01:26:01 +05:30)


## 3. Collaborative Filtering

We first compare three of the [Surprise algorithms](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html) (SVD, NMF and KNNBasic) using 3-fold cross-validation and the RMSE metric.

In [9]:
# Define a cross-validation iterator

kf = KFold(n_splits=3)

algos = [SVD(), NMF(), KNNBasic()]  

time: 491 µs (started: 2022-11-12 01:26:01 +05:30)


In [10]:
def get_rmse(algo, testset):
    """Compute and print the RMSE of a fitted algorithm on a test set."""
    predictions = algo.test(testset)
    return accuracy.rmse(predictions, verbose=True)

# Evaluate each algorithm with cross-validation: fit on the train split, score on the test split
for trainset, testset in tqdm(kf.split(ratings)):
    for algo in algos:
        algo.fit(trainset)
        get_rmse(algo, testset)

3it [00:00, 25.10it/s]

RMSE: 1.1176
RMSE: 1.1298
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1201
RMSE: 1.1436
RMSE: 1.1449
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1435
RMSE: 1.1038
RMSE: 1.1083
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1081
time: 121 ms (started: 2022-11-12 01:26:01 +05:30)
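
The RMSE values above are interleaved by fold, which makes the algorithms hard to compare at a glance. The sketch below first shows how RMSE is defined, on made-up toy ratings, then averages the per-fold values copied from the output above. The grouping assumes they were printed in the order of `algos` (SVD, NMF, KNNBasic). The helper name `rmse` is hypothetical:

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equal-length rating sequences."""
    if len(true_ratings) != len(predicted_ratings) or not true_ratings:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example (made-up ratings): the errors are 0.5, -0.5 and 1.0.
toy_rmse = rmse([4.0, 3.0, 5.0], [3.5, 3.5, 4.0])

# Per-fold RMSE values copied from the notebook output, grouped by algorithm.
fold_rmse = {
    "SVD": [1.1176, 1.1436, 1.1038],
    "NMF": [1.1298, 1.1449, 1.1083],
    "KNNBasic": [1.1201, 1.1435, 1.1081],
}
mean_rmse = {name: sum(scores) / len(scores) for name, scores in fold_rmse.items()}
for name, value in mean_rmse.items():
    print(f"{name}: mean RMSE = {value:.4f}")
```

On this 1,000-rating subset the three means differ by well under 0.01, so the run is best read as a smoke test rather than as a ranking of the algorithms.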





## 4. Content-based Filtering

Here we rely directly on item attributes. First we describe each movie with an attribute vector (in this notebook, a TF-IDF vector of its title). Then we use these vectors to recommend the movies most similar to a given one.

In [11]:
# Computing similarities on the whole dataset requires too many resources, so we use the 100-movie subset

df_movies_100 = df_movies_100.reset_index(drop=True)

time: 276 µs (started: 2022-11-12 01:26:01 +05:30)


In [12]:
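# Compute a TF-IDF matrix on the movie titles (word 1- to 3-grams, English stop words removed).
# min_df=1 keeps every term; recent scikit-learn releases reject the integer min_df=0.

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_movies_100['title'])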

time: 5.14 ms (started: 2022-11-12 01:26:01 +05:30)
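
To make the `ngram_range=(1, 3)` setting concrete, here is a rough pure-Python sketch of the features extracted from a single title. It only approximates the vectorizer's analyzer: the function `title_ngrams` is hypothetical, and `TOY_STOP_WORDS` is a tiny stand-in for scikit-learn's much longer English stop-word list:

```python
import re

# Assumption: a small stand-in for scikit-learn's English stop-word list.
TOY_STOP_WORDS = {"the", "of", "a", "in", "from"}


def title_ngrams(title, max_n=3):
    """Return word n-grams (n = 1..max_n) roughly as the vectorizer would."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the vectorizer's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    # Stop words are dropped before the n-grams are built.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


features = title_ngrams("The Miles Davis Story (2001)")
print(features)
```

The release year survives as a token of its own, so two titles that share nothing but their year still get a non-zero similarity.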


In [13]:
# Calculate cosine similarities (TF-IDF rows are L2-normalised by default, so the linear kernel
# equals the cosine similarity); this takes a lot of time on the real dataset

cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

time: 1.08 ms (started: 2022-11-12 01:26:01 +05:30)
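
`linear_kernel` computes plain dot products, which coincide with cosine similarities only because `TfidfVectorizer` L2-normalises each row by default. A small NumPy sketch of that equivalence, on a made-up toy matrix; the helper `cosine_similarity_matrix` is hypothetical:

```python
import numpy as np


def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against all-zero rows (e.g. a title made only of stop words).
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit_rows = matrix / safe_norms
    return unit_rows @ unit_rows.T


# Toy matrix (made-up values): rows 0 and 1 point in the same direction.
toy = np.array([[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]])
unit_toy = toy / np.linalg.norm(toy, axis=1, keepdims=True)

cosine = cosine_similarity_matrix(toy)
dot_of_unit_rows = unit_toy @ unit_toy.T  # what a linear kernel computes
print(np.allclose(cosine, dot_of_unit_rows))
```

If the vectorizer were configured with `norm=None`, the dot products would no longer be bounded by 1 and an explicit normalisation like the one above would be needed.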


In [14]:
# For each movie, store in 'results' its most similar movies as (score, movieId) pairs,
# sorted by decreasing similarity

results = {}
for idx, row in df_movies_100.iterrows():
    # Positions of the 99 highest scores, best first
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], df_movies_100['movieId'].iloc[i]) for i in similar_indices]
    # Skip the first entry, which is normally the movie itself (similarity 1)
    results[idx] = similar_items[1:]

time: 951 ms (started: 2022-11-12 01:26:01 +05:30)
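
The ranking step relies on `argsort`, which returns positions in ascending order of score. A minimal sketch on a made-up similarity row, which removes the query item by index instead of assuming it always comes first (an assumption that can fail when scores tie or a title vector is all zeros):

```python
import numpy as np

# Toy similarity row for item 0 (made-up values); position 0 is the item itself.
similarities = np.array([1.0, 0.2, 0.0, 0.7, 0.4])

# argsort gives ascending order of score; reversing gives descending order.
ranked = similarities.argsort()[::-1]

# Drop the query item by index rather than by position, then keep the top 3.
query_idx = 0
top_3 = [(float(similarities[i]), int(i)) for i in ranked if i != query_idx][:3]
print(top_3)  # [(0.7, 3), (0.4, 4), (0.2, 1)]
```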


In [15]:
# Transform a 'movieId' into its corresponding movie title

def item(movie_id):  # renamed from 'id' to avoid shadowing the built-in
    return df_movies_100.loc[df_movies_100['movieId'] == movie_id]['title'].tolist()[0].split(' - ')[0]

time: 230 µs (started: 2022-11-12 01:26:02 +05:30)


In [16]:
# Transform a 'movieId' into the index id

def get_idx(movie_id):  # renamed from 'id' to avoid shadowing the built-in
    return df_movies_100[df_movies_100['movieId'] == movie_id].index.tolist()[0]

time: 198 µs (started: 2022-11-12 01:26:02 +05:30)


In [17]:
# Finally we put everything together here:

def recommend(item_id, num):
    print(f"Recommending {num} products similar to {item(item_id)}...")
    print("-------")
    recs = results[get_idx(item_id)][:num]
    for score, movie_id in recs:
        print(f"\tRecommended: {item(movie_id)} (score:{score})")

time: 273 µs (started: 2022-11-12 01:26:02 +05:30)


In [18]:
df_movies_100.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 77332   | Don's Plum (2001) | Drama |
| 1 | 164111  | From the Land of the Moon (2016) | Drama\|Romance |
| 2 | 99222   | Silent Night (2012) | Horror |
| 3 | 190817  | The Miles Davis Story (2001) | Documentary |
| 4 | 164685  | Sometimes Aunt Martha Does Dreadful Things (1971) | Horror |


time: 2.15 ms (started: 2022-11-12 01:26:02 +05:30)


In [19]:
random_item_id = random.choice(df_movies_100["movieId"].tolist())
recommend(item_id=random_item_id, num=10)

Recommending 10 products similar to The Miles Davis Story (2001)...
-------
	Recommended: 61* (2001) (score:0.12749864647542272)
	Recommended: Don's Plum (2001) (score:0.08690617461050223)
	Recommended: Legally Blonde (2001) (score:0.08690617461050223)
	Recommended: Human Nature (2001) (score:0.08690617461050223)
	Recommended: The Misfit Brigade (1987) (score:0.0)
	Recommended: Goods: Live Hard, Sell Hard, The (2009) (score:0.0)
	Recommended: Scott Walker: 30 Century Man (2006) (score:0.0)
	Recommended: Battle in Seattle (2007) (score:0.0)
	Recommended: Ukonvaaja (2016) (score:0.0)
	Recommended: Rurouni Kenshin (Rurôni Kenshin: Meiji kenkaku roman tan) (2012) (score:0.0)
time: 2.4 ms (started: 2022-11-12 01:26:02 +05:30)
