## Recommender System
A basic recommender system built in Python.

### Data Understanding
Let's load our datasets and get familiar with them.

We'll start by importing the necessary libraries, then loading the files.

In [90]:
import surprise
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


%matplotlib inline

In [91]:
movies = pd.read_csv('./data/movies.csv')
ratings = pd.read_csv('./data/ratings.csv')
tags = pd.read_csv('./data/tags.csv')
links = pd.read_csv('./data/links.csv')

In [92]:
#show the first five rows of the movies dataset
movies.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [93]:
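#show the first five rows of the ratings dataset
ratings.head()

|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |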


In [94]:
#show the first five rows of the links dataset
links.head()

|   | movieId | imdbId | tmdbId |
|---|---------|--------|--------|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


In [95]:
#first five rows of the tags dataset
tags.head()

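|   | userId | movieId | tag | timestamp |
|---|--------|---------|-----|-----------|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |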


In [97]:
#get a summary of all the datasets
#info() prints its report directly and returns None, so it does not need to be wrapped in print()
for df in (movies, ratings, links, tags):
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   i

Here is a summary of how the ID columns relate across the dataframes:
* User IDs - `userId` values are consistent between `ratings.csv` and `tags.csv`
* Movie IDs - `movieId` values are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv`
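
One way to check this kind of relationship is to test whether every ID in one file also appears in another. The sketch below reads "consistent" as exactly that and uses small made-up frames in place of the real CSV files; the helper name `ids_are_subset` is hypothetical.

```python
import pandas as pd

# Tiny stand-ins for movies.csv, ratings.csv and tags.csv (made-up values)
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_demo = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 3, 3]})
tags_demo = pd.DataFrame({"userId": [2], "movieId": [2]})


def ids_are_subset(child, parent, column):
    """Return True if every value of `column` in `child` also occurs in `parent`."""
    return bool(child[column].isin(parent[column]).all())


ratings_ok = ids_are_subset(ratings_demo, movies_demo, "movieId")
tags_ok = ids_are_subset(tags_demo, movies_demo, "movieId")
print(ratings_ok, tags_ok)
```

The same helper could be pointed at the real `ratings`, `tags`, `links` and `movies` dataframes once they are loaded.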

### Analysis of Each Dataframe (Univariate Analysis)
Perform a preliminary analysis of each dataframe individually to get a better understanding of its columns.


In [98]:
movies.shape

(9742, 3)

In [99]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [100]:
movies.head()

|   | movieId | title | genres |
|---|---------|-------|--------|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [101]:
movies['genres'].value_counts()

Drama                                                  1053
Comedy                                                  946
Comedy|Drama                                            435
Comedy|Romance                                          363
Drama|Romance                                           349
                                                       ... 
Action|Crime|Horror|Mystery|Thriller                      1
Adventure|Animation|Children|Comedy|Musical|Romance       1
Action|Adventure|Animation|Comedy|Crime|Mystery           1
Children|Comedy|Fantasy|Sci-Fi                            1
Action|Animation|Comedy|Fantasy                           1
Name: genres, Length: 951, dtype: int64
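
The counts above treat each pipe-delimited combination as its own category, which is why there are 951 distinct values. If counts per individual genre are more useful, one possible approach is to split the strings and explode them into one row per genre. This is a sketch on made-up rows, not part of the original analysis.

```python
import pandas as pd

# Made-up rows in the same pipe-delimited format as the genres column
movies_demo = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Comedy|Romance", "Comedy", "Drama|Romance"],
})

# Split each string into a list, then give every genre its own row
genre_counts = (
    movies_demo["genres"]
    .str.split("|")
    .explode()
    .value_counts()
)
print(genre_counts)
```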

In [102]:
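#info() prints its own report and returns None, which is why the output below ends with "None None"
print(ratings.info(), movies.info())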

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None None


### Initial Dataframe Merging
Now I will combine dataframes based on their shared columns and explore the result.
* Combine the ratings and movies dataframes on `movieId`


In [103]:
rated_movies = pd.merge(ratings, movies, on='movieId', how='outer')
rated_movies.tail()

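|   | userId | movieId | rating | timestamp | title | genres |
|---|--------|---------|--------|-----------|-------|--------|
| 100849 | NaN | 30892 | NaN | NaN | In the Realms of the Unreal (2004) | Animation\|Documentary |
| 100850 | NaN | 32160 | NaN | NaN | Twentieth Century (1934) | Comedy |
| 100851 | NaN | 32371 | NaN | NaN | Call Northside 777 (1948) | Crime\|Drama\|Film-Noir |
| 100852 | NaN | 34482 | NaN | NaN | Browning Version, The (1951) | Drama |
| 100853 | NaN | 85565 | NaN | NaN | Chalet Girl (2011) | Comedy\|Romance |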


In [104]:
rated_movies['movieId'].value_counts().sum()

100854

In [105]:
rated_movies['rating'].unique()

array([4. , 4.5, 2.5, 3.5, 3. , 5. , 0.5, 2. , 1.5, 1. , nan])

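Time to clean the combined dataframe. The outer merge kept 18 movies that have no ratings (100854 rows versus 100836 ratings), which is where the `nan` values come from.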

In [106]:
rated_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100854 entries, 0 to 100853
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  float64
 1   movieId    100854 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  float64
 4   title      100854 non-null  object 
 5   genres     100854 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 5.4+ MB


In [107]:
rated_movies = rated_movies.dropna()

In [108]:
rated_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  float64
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  float64
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 5.4+ MB
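
Note that `userId` is still `float64` after `dropna()`. That is a side effect of the outer merge: the unrated movies introduced missing values, and pandas stores an integer column with missing values as floats. A possible alternative, sketched here on made-up data, is an inner merge, which never creates those rows in the first place.

```python
import pandas as pd

# Made-up data: movie 3 has no ratings
ratings_demo = pd.DataFrame({"userId": [1, 2], "movieId": [1, 2], "rating": [4.0, 3.5]})
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})

outer = pd.merge(ratings_demo, movies_demo, on="movieId", how="outer")
inner = pd.merge(ratings_demo, movies_demo, on="movieId", how="inner")

print(outer["userId"].dtype)    # float, because the unrated movie contributes NaN
print(inner["userId"].dtype)    # integer, no missing values were introduced
print(len(outer) - len(inner))  # number of movies without any rating
```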


Let's finish the data cleaning by dropping the last three columns (`timestamp`, `title` and `genres`), which the rating model does not use.

In [109]:
#columns= already names the axis, so axis=1 is redundant; reassigning avoids modifying in place
rated_movies = rated_movies.drop(columns=['timestamp', 'title', 'genres'])
rated_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  float64
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 3.1 MB


Now we'll import Python's `surprise` library and use it to start modelling our recommendation system.

In [110]:
from surprise import Dataset, Reader, BaselineOnly, SVD
from surprise.model_selection import cross_validate, train_test_split


In [111]:
#specify the rating scale used in the dataset (ratings run from 0.5 to 5.0, as the unique values above show)
reader = Reader(rating_scale=(0.5, 5))

#load the dataset using Surprise's built-in methods
s_df = Dataset.load_from_df(rated_movies[["userId", "movieId", "rating"]], reader)

In [112]:
cross_validate(BaselineOnly(), s_df, verbose=True)


Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8791  0.8697  0.8710  0.8783  0.8632  0.8723  0.0059  
MAE (testset)     0.6782  0.6688  0.6731  0.6763  0.6655  0.6724  0.0047  
Fit time          0.29    0.31    0.32    0.31    0.32    0.31    0.01    
Test time         0.07    0.06    0.06    0.06    0.06    0.06    0.00    


{'test_rmse': array([0.87907066, 0.86968328, 0.87104975, 0.87829989, 0.86323973]),
 'test_mae': array([0.67815057, 0.66878406, 0.67310131, 0.67629454, 0.66552504]),
 'fit_time': (0.28652143478393555,
  0.313631534576416,
  0.32064390182495117,
  0.3122837543487549,
  0.3208780288696289),
 'test_time': (0.06652307510375977,
  0.06046128273010254,
  0.05817604064941406,
  0.06224822998046875,
  0.06212925910949707)}

With a mean `test_rmse` of `0.8723` and a mean `test_mae` of `0.6724` across the five folds, the baseline is not yet satisfactory, but it's a good start.
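
For context, a baseline model of this kind predicts a rating as the global mean plus a user bias and an item bias. The following is a minimal sketch of the alternating least squares idea behind the "Estimating biases using als..." messages. The function names, default regularisation values and toy data are assumptions for illustration, and this is not Surprise's actual code.

```python
from collections import defaultdict


def fit_baseline_als(ratings, reg_u=15.0, reg_i=10.0, n_epochs=10):
    """Estimate global mean, user biases and item biases from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Update item biases while user biases are held fixed
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Update user biases while item biases are held fixed
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)


def predict_baseline(mu, b_u, b_i, user, item):
    """Baseline estimate; unknown users or items contribute a bias of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Made-up ratings: u1 rates generously, m1 is rated higher than m2
toy = [("u1", "m1", 5.0), ("u1", "m2", 4.0), ("u2", "m1", 3.0), ("u2", "m2", 2.0)]
mu, b_u, b_i = fit_baseline_als(toy)
print(round(mu, 2), predict_baseline(mu, b_u, b_i, "u1", "m1"))
```

Larger `reg_u` and `reg_i` values shrink the biases toward zero, which is the trade-off explored when these two parameters are tuned.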

Now we'll try the `SVD` algorithm alongside the baseline and use Surprise's `GridSearchCV` to tune both and try to improve on this.

In [116]:
from surprise.model_selection import GridSearchCV


#baseline parameter grid
param_grid_baseline = {
    'bsl_options': {'method': ['als'], 'reg_u': [10, 15, 20], 'reg_i': [5, 10, 15]},
    'verbose': [False]
}

#hyperparameter tuning for the baseline algorithm
#(GridSearchCV takes the algorithm class, not an instance)
baseline_gs = GridSearchCV(BaselineOnly, param_grid_baseline, measures=['rmse', 'mae'], cv=5)
baseline_gs.fit(s_df)


#svd parameter grid
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.005, 0.01, 0.015],
    'reg_all': [0.02, 0.04, 0.06]
}

#hyperparameter tuning for the svd algorithm
svd_gs = GridSearchCV(SVD, param_grid_svd, measures=['rmse', 'mae'], cv=5)
svd_gs.fit(s_df)
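
The SVD search takes noticeably longer than the baseline one because a grid search evaluates every combination of the listed values on every fold. The sketch below only illustrates that arithmetic with the standard library; `expand_grid` is a hypothetical helper, not how Surprise builds its grid internally.

```python
from itertools import product

param_grid_svd = {
    "n_factors": [50, 100, 150],
    "n_epochs": [20, 30, 40],
    "lr_all": [0.005, 0.01, 0.015],
    "reg_all": [0.02, 0.04, 0.06],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combos = expand_grid(param_grid_svd)
n_folds = 5
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

With four parameters of three values each, that is 81 combinations, or 405 fits with 5-fold cross-validation.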

Let's get a look at the best parameters from GridSearchCV.

In [120]:
best_baseline_params = baseline_gs.best_params['rmse']
best_baseline_model = baseline_gs.best_estimator['rmse']

best_svd_params = svd_gs.best_params['rmse']
best_svd_model = svd_gs.best_estimator['rmse']

#best_estimator holds an algorithm instance configured with the best parameters, not a score;
#the cross-validated RMSE itself is available as best_score['rmse']
print(f'Best baseline parameters are {best_baseline_params}.')
print(f'Best svd parameters are {best_svd_params}.')

Best baseline parameters are {'bsl_options': {'method': 'als', 'reg_u': 10, 'reg_i': 5}, 'verbose': False}.
Best svd parameters are {'n_factors': 150, 'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.06}.
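
To make the tuned SVD parameters more concrete: `n_factors` is the length of the latent vectors, `lr_all` the learning rate, `reg_all` the regularisation strength and `n_epochs` the number of passes over the data. Below is a rough sketch of a single stochastic gradient descent step for one rating, following the usual biased matrix factorisation update. The function name and toy values are assumptions, and the real library does considerably more.

```python
import random


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.01, reg=0.06):
    """Apply one regularised SGD update for a single observed rating r_ui."""
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    err = r_ui - pred
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


random.seed(0)
n_factors = 3  # kept tiny here; the grid search above selected 150
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u = [random.gauss(0, 0.1) for _ in range(n_factors)]
q_i = [random.gauss(0, 0.1) for _ in range(n_factors)]

errors = []
for _ in range(40):  # 40 passes over one made-up rating, mirroring n_epochs=40
    b_u, b_i, p_u, q_i, err = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
    errors.append(abs(err))
print(errors[0], errors[-1])
```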


It's time to evaluate these tuned models and see how they perform.

In [121]:
from surprise import accuracy

#split the dataset to better assess the performance of the models
trainset, testset = train_test_split(s_df, test_size=0.25, random_state=21)


# Train the models on the training set
best_baseline_model.fit(trainset)
best_svd_model.fit(trainset)

# Test the models on the holdout test set
baseline_predictions = best_baseline_model.test(testset)
svd_predictions = best_svd_model.test(testset)

# Calculate evaluation metrics
baseline_rmse = accuracy.rmse(baseline_predictions)
baseline_mae = accuracy.mae(baseline_predictions)

svd_rmse = accuracy.rmse(svd_predictions)
svd_mae = accuracy.mae(svd_predictions)


RMSE: 0.8726
MAE:  0.6727
RMSE: 0.8605
MAE:  0.6610
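
For reference, both metrics are simple to compute by hand. The sketch below uses made-up (true rating, estimated rating) pairs rather than the model output above. RMSE squares each error before averaging, so it penalises large misses more heavily than MAE and is never smaller than it.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Made-up pairs: three small misses and one exact hit
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(rmse(pairs), mae(pairs))
```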


With a lower RMSE of 0.8605 and MAE of 0.6610, SVD outperformed BaselineOnly, which yielded an RMSE of 0.8726 and an MAE of 0.6727.

Although the difference is small (about 0.012 in both RMSE and MAE), these metrics suggest that the SVD model predicts user ratings more accurately, which gives it better potential for generating precise movie recommendations.
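
As a possible next step, predicted ratings can be turned into actual recommendations by keeping the highest estimates for each user. The sketch below assumes predictions are available as plain (user, item, estimate) triples for movies the user has not rated yet; the `top_n` helper and the values are made up for illustration.

```python
from collections import defaultdict


def top_n(predictions, n=2):
    """Group (user, item, estimate) triples by user and keep the n highest estimates."""
    per_user = defaultdict(list)
    for user, item, est in predictions:
        per_user[user].append((item, est))
    return {
        user: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user, items in per_user.items()
    }


# Made-up estimates standing in for model output
predictions = [
    (1, "Movie A", 4.6), (1, "Movie B", 3.1), (1, "Movie C", 4.9),
    (2, "Movie A", 2.0), (2, "Movie B", 4.2),
]
recommendations = top_n(predictions, n=2)
print(recommendations[1])
```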