![title_pic](./img/title_page.png)
#### TV & Movie recommendation system using a collaborative and content-based filtering approach

# Collaborative Filtering Model

# Project Overview
The main goal of this project is to develop a hybrid movie/TV recommendation system that combines collaborative filtering and content-based filtering to suggest new content to users. These techniques are usually applied independently; our project aims to harness their combined potential.

**Collaborative Filtering**: Analyzes existing user profiles to discover shared preferences and recommend new content based on similarities.

**Content-Based Filtering**: Suggests new content with features similar to those of the movie or TV show that you input.


# Business Understanding
As streaming platforms pile up content, users struggle to pinpoint films or shows that align with their tastes. Bias in platform algorithms exacerbates this challenge, making it harder for users to rely on platform recommendations. Biases emerge from factors like skewed user preferences, popularity bias, or even the platform's promotional agenda. As a result, recommended content may not cater to users' unique tastes, negatively affecting the overall user experience.

Streaming platforms stand to gain from implementing an unbiased hybrid recommendation system that blends content-based and collaborative filtering techniques. This approach leverages the best of both methods, increasing reliability and personalization while mitigating biases. The content-based technique analyzes features like genre and content description, while collaborative filtering harnesses the collective trends of user ratings. Together, they forge a powerful recommendation engine, enhancing user satisfaction and overall experience.

# Collaborative Filtering

In the collaborative filtering approach, we focused on user-to-user filtering, comparing users' profiles to identify similarities in movie preferences by looking at the ratings they gave to movies that they have both watched. We started by creating a baseline model as a benchmark: Surprise's `BaselineOnly`, which predicts the global mean rating adjusted by a per-user and a per-movie bias. To improve the recommendations, we iterated over two models: an SVD model and an SVDpp model. In the iteration process, we used cross-validation in tandem with GridSearchCV to find the best parameters for our models, optimizing their performance.
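
The idea of comparing two users on the titles they have both rated can be sketched as follows. This is only an illustration of user-to-user similarity, not the code used in this notebook (the models trained below are Surprise's matrix-factorization algorithms). The toy ratings and the helper name `cosine_on_common` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical data, not taken from collab_model_revs.csv)
toy = pd.DataFrame({
    'user_id':  ['u1', 'u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'movie_id': ['m1', 'm2', 'm3', 'm1', 'm2', 'm1', 'm2'],
    'rating':   [5, 4, 1, 5, 4, 1, 5],
})

# Rows = users, columns = movies, NaN where a user has not rated a movie
matrix = toy.pivot(index='user_id', columns='movie_id', values='rating')

def cosine_on_common(a, b):
    """Cosine similarity restricted to the movies both users have rated."""
    common = a.notna() & b.notna()
    if not common.any():
        return 0.0
    x = a[common].to_numpy(float)
    y = b[common].to_numpy(float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

sim_12 = cosine_on_common(matrix.loc['u1'], matrix.loc['u2'])
sim_13 = cosine_on_common(matrix.loc['u1'], matrix.loc['u3'])
print(sim_12, sim_13)
```

Here `u1` and `u2` rate their shared movies identically, so they come out as more similar than `u1` and `u3`, whose ratings for the same two movies disagree.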

This cross-validation and GridSearchCV process helped us fine-tune our models to a Root Mean Square Error (RMSE) of about 0.908 (0.9085 for SVDpp in 5-fold cross-validation and 0.9074 for the final model on the held-out test set), compared with 0.9294 for the baseline. Given that our dataset contains user ratings on a scale of 1 to 5, this means the typical prediction error is a little under one rating point. Note that RMSE is not a plain average deviation: it squares the errors before averaging, so large misses weigh more heavily. The lower error enables more accurate and personalized recommendations for users based on their shared preferences, enhancing the overall user experience and satisfaction.
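
To make the metric concrete, here is a small worked example of RMSE next to the mean absolute error (MAE). The ratings and predictions are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up true ratings and model predictions on a 1-5 scale
actual = np.array([5, 4, 3, 5, 1], dtype=float)
predicted = np.array([4.5, 4.0, 4.0, 5.0, 3.0])

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))  # squares the errors before averaging
mae = np.mean(np.abs(errors))         # plain average of the absolute errors

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

The single two-point miss pulls the RMSE well above the MAE. The SVD grid-search output further down shows the same pattern on the real data (`rmse` of about 0.910 against `mae` of about 0.619).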

In [1]:
import random
import warnings

import numpy as np
import pandas as pd
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import BaselineOnly, SVDpp, SVD

# increasing display to view large descriptions and reviewText
pd.set_option('display.max_colwidth', None)

warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('./data/collab_model_revs.csv')
df.head()

| | rating | user_id | movie_id | reviews |
|---|---|---|---|---|
| 0 | 5 | A3JVF9Y53BEOGC | 000503860X | I have seen X live many times, both in the ear... |
| 1 | 5 | A12VPEOEZS1KTC | 000503860X | I was so excited for this! Finally, a live co... |
| 2 | 5 | ATLZNVLYKP9AZ | 000503860X | X is one of the best punk bands ever. I don't ... |
| 3 | 5 | A3TNYNA2360NPA | 000503860X | I've loved X since I first saw them here in Sa... |
| 4 | 5 | A2PANT8U0OJNT4 | 0005419263 | The DVD came in great condition and provided l... |


In [3]:
# the review text is not needed for collaborative filtering
df.drop(columns='reviews', inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133511 entries, 0 to 2133510
Data columns (total 3 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
dtypes: int64(1), object(2)
memory usage: 48.8+ MB


In [4]:
# ratings run from 1 to 5, which is also Surprise's default rating scale
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)

In [5]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [83]:
# benchmark model: global mean plus user and movie biases
baseline = BaselineOnly()
baseline.fit(trainset)
predictions = baseline.test(testset)
base_pred = accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.9294
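
The log line "Estimating biases using als..." refers to how `BaselineOnly` fits an estimate of the form global mean + user bias + movie bias. The sketch below is a simplified alternating update on invented toy data. The regularization values and the ten passes are assumptions chosen for illustration, and this is not Surprise's actual implementation.

```python
import pandas as pd

# Toy ratings (hypothetical data)
toy = pd.DataFrame({
    'user_id':  ['u1', 'u1', 'u2', 'u2', 'u3'],
    'movie_id': ['m1', 'm2', 'm1', 'm3', 'm2'],
    'rating':   [5.0, 3.0, 4.0, 2.0, 1.0],
})

mu = toy['rating'].mean()                       # global mean rating
b_u = pd.Series(0.0, index=toy['user_id'].unique())
b_i = pd.Series(0.0, index=toy['movie_id'].unique())
reg_u, reg_i = 15, 10                           # assumed regularization strengths

for _ in range(10):
    # Movie bias: damped mean of what remains after removing mu and the user bias
    resid = toy['rating'] - mu - toy['user_id'].map(b_u)
    grp = resid.groupby(toy['movie_id'])
    b_i = grp.sum() / (reg_i + grp.count())
    # User bias: same idea after removing mu and the movie bias
    resid = toy['rating'] - mu - toy['movie_id'].map(b_i)
    grp = resid.groupby(toy['user_id'])
    b_u = grp.sum() / (reg_u + grp.count())

def baseline_estimate(user_id, movie_id):
    """mu + b_u + b_i, falling back to a zero bias for unknown users or movies."""
    return mu + b_u.get(user_id, 0.0) + b_i.get(movie_id, 0.0)

print(baseline_estimate('u1', 'm3'))
print(baseline_estimate('new_user', 'm1'))
```

A movie that tends to be rated above the global mean gets a positive bias, so even a brand-new user receives a slightly higher estimate for it.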


In [8]:
svd_cv = SVD()
cv_svd = cross_validate(svd_cv, data, measures=['RMSE'], n_jobs=-1, verbose=True)

for i in cv_svd.items():
    print(i)
print('-----------------------')
print(np.mean(cv_svd['test_rmse']))

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9187  0.9177  0.9143  0.9161  0.9192  0.9172  0.0018  
Fit time          13.15   13.63   13.52   13.32   12.66   13.25   0.34    
Test time         4.10    3.74    3.40    3.17    2.95    3.47    0.41    
('test_rmse', array([0.91874919, 0.91768868, 0.91429583, 0.91609454, 0.91921596]))
('fit_time', (13.146021127700806, 13.625556468963623, 13.51961612701416, 13.316447734832764, 12.663116216659546))
('test_time', (4.099060535430908, 3.7391767501831055, 3.396686315536499, 3.1740715503692627, 2.954120635986328))
-----------------------
0.9172088420568423
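
For readers new to the model: Surprise's `SVD` predicts a rating as the global mean plus a user bias, a movie bias, and the dot product of a user and a movie latent vector, and `n_factors` is the length of those vectors. The sketch below shows that rule with made-up numbers (nothing here is learned from the dataset). The final `np.clip` mimics Surprise's default behaviour of clipping estimates to the rating scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 20                        # length of each latent vector

mu = 4.2                              # global mean rating (made-up value)
b_u, b_i = 0.3, -0.1                  # user and movie biases (made-up values)
p_u = rng.normal(0, 0.1, n_factors)   # latent factors for one user
q_i = rng.normal(0, 0.1, n_factors)   # latent factors for one movie

raw = mu + b_u + b_i + q_i @ p_u      # SVD prediction rule
estimate = float(np.clip(raw, 1, 5))  # keep the estimate on the 1-5 scale
print(round(estimate, 3))
```

Training adjusts the biases and both latent vectors to reduce the squared error on the known ratings; the grid search that follows varies how long those vectors are and how many training epochs are run.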


In [10]:
params = {'n_factors': [20, 50, 100, 150],
          'n_epochs': [10, 20, 40]}
g_s_svd = GridSearchCV(SVD, param_grid=params, n_jobs=-1, joblib_verbose=10)
g_s_svd.fit(data)

print(g_s_svd.best_score)
print(g_s_svd.best_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   33.6s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   55.6s
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  36 out of  60 | elapsed:  2.0min remaining:  1.3min
[Parallel(n_jobs=-1)]: Done  43 out of  60 | elapsed:  2.7min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  50 out of  60 | elapsed:  3.0min remaining:   36.1s
[Parallel(n_jobs=-1)]: Done  57 out of  60 | elapsed:  3.7min remaining:   11.7s


{'rmse': 0.9102883902547211, 'mae': 0.6189911924279894}
{'rmse': {'n_factors': 20, 'n_epochs': 20}, 'mae': {'n_factors': 20, 'n_epochs': 40}}


[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  3.8min finished
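
The "60 out of 60" in the log above is the number of model fits: every parameter combination is fitted once per cross-validation fold. The sketch below reproduces that count, assuming the 5-fold split Surprise uses when no `cv` argument is given. `n_fits` is a helper name made up for this example.

```python
from itertools import product

def n_fits(param_grid, n_folds=5):
    """Number of model fits a full grid search performs."""
    n_combos = len(list(product(*param_grid.values())))
    return n_combos * n_folds

svd_grid = {'n_factors': [20, 50, 100, 150],
            'n_epochs': [10, 20, 40]}
svdpp_grid = {'n_factors': [10, 20, 50, 100],
              'n_epochs': [10, 20, 40],
              'cache_ratings': [True, False]}

print(n_fits(svd_grid))    # 12 combinations x 5 folds
print(n_fits(svdpp_grid))  # 24 combinations x 5 folds
```

The second grid is the one used for SVDpp later in this notebook, which is why that search reports 120 tasks and takes considerably longer.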


In [12]:
SVD_base = SVD()
SVD_base.fit(trainset)
predictions = SVD_base.test(testset)
SVD_first = accuracy.rmse(predictions)

RMSE: 0.9202


In [9]:
svd_pp_cv = SVDpp()
cv_svdpp = cross_validate(svd_pp_cv, data, measures=['RMSE'], n_jobs=-1, verbose=True)

for i in cv_svdpp.items():
    print(i)
print('-----------------------')
print(np.mean(cv_svdpp['test_rmse']))

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9096  0.9085  0.9060  0.9089  0.9096  0.9085  0.0013  
Fit time          34.45   34.78   34.53   34.40   34.60   34.55   0.13    
Test time         9.63    10.14   9.98    9.92    9.86    9.91    0.17    
('test_rmse', array([0.90960273, 0.90850002, 0.90603958, 0.9089005 , 0.90955774]))
('fit_time', (34.44675898551941, 34.77713227272034, 34.52870988845825, 34.39724254608154, 34.604445934295654))
('test_time', (9.634008169174194, 10.143620729446411, 9.9827721118927, 9.919447898864746, 9.857593297958374))
-----------------------
0.908520115436667


In [11]:
params = {'n_factors': [10, 20, 50, 100],
          'n_epochs': [10, 20, 40],
          'cache_ratings': [True, False]}
g_s_svdpp = GridSearchCV(SVDpp, param_grid=params, n_jobs=-1, measures=['RMSE'], joblib_verbose=10)
g_s_svdpp.fit(data)

print(g_s_svdpp.best_score)
print(g_s_svdpp.best_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   53.2s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  66 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 102 out of 120 | elapsed: 14.8min remaining:  2.6min
[Parallel(n_jobs=-1)]: Done 115 out of 120 | elapsed: 19.4min remaining:   50.4s


{'rmse': 0.9054603863502747}
{'rmse': {'n_factors': 10, 'n_epochs': 20, 'cache_ratings': False}}


[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 20.8min finished


In [86]:
# refit SVDpp with the best grid-search parameters and score it on the hold-out set
SVDpp_one = SVDpp(n_factors=10, n_epochs=20)
SVDpp_one.fit(trainset)
predictions = SVDpp_one.test(testset)
SVDpp_first = accuracy.rmse(predictions)

RMSE: 0.9087


In [6]:
SVDpp_final = SVDpp(n_factors=5, n_epochs=20)
SVDpp_final.fit(trainset)
predictions = SVDpp_final.test(testset)
SVDpp_last = accuracy.rmse(predictions)

RMSE: 0.9074


## Function to return content based on similar users predicted by the final model
To use the final model, I created the `recommend_movies` function, which gives content recommendations to existing users based on the trained SVDpp collaborative filtering model. It takes the trained model, a DataFrame of ratings merged with the movie and TV metadata (it must contain `user_id`, `movie_id` and `rating` columns), an optional `N` (defaulting to 5) for the number of recommendations, and an optional `user_id`. If no `user_id` is given, a random user from the DataFrame is chosen. The function predicts a rating for every title the user has not rated, sorts the predictions in descending order, and returns the top N titles with their predicted rating and metadata.

In [24]:
def recommend_movies(trained_model, movie_df, N=5, user_id=None):
    """Return the N unseen titles with the highest predicted rating for a user.

    If user_id is None, a random user from movie_df is chosen.
    """
    if user_id is None:
        user_id = random.choice(movie_df['user_id'].unique().tolist())

    # titles the user has already rated vs. everything else in the catalogue
    user_movies = set(movie_df.loc[movie_df['user_id'] == user_id, 'movie_id'])
    unseen_movies = set(movie_df['movie_id']) - user_movies

    # predict a rating for every unseen title
    predictions_df = pd.DataFrame(
        [{'movie_id': movie_id,
          'predicted_rating': trained_model.predict(user_id, movie_id).est}
         for movie_id in unseen_movies]
    )

    top_N = predictions_df.sort_values('predicted_rating', ascending=False).head(N)

    # one metadata row per recommended title; chaining avoids the
    # SettingWithCopyWarning raised by an in-place drop_duplicates on a slice
    top_N_movies = (movie_df[movie_df['movie_id'].isin(top_N['movie_id'])]
                    .drop_duplicates(subset=['movie_id']))

    top_N_ratings = pd.merge(top_N, top_N_movies, on='movie_id')

    num_movies_reviewed = len(user_movies)
    print(f"The user_id # for these recommendations is {user_id} and they have reviewed {num_movies_reviewed} different Movies &/ TV Shows")

    return top_N_ratings.drop(columns=['user_id', 'rating'])
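
A quick way to sanity-check this kind of ranking logic without training a model is to pass in a stand-in object that exposes the same `predict(...).est` interface. The sketch below is a compact, self-contained variant of the idea, not the function above: `StubModel`, `top_n_unseen`, and the toy data are all hypothetical names and values introduced for illustration.

```python
from collections import namedtuple
import pandas as pd

Prediction = namedtuple('Prediction', ['uid', 'iid', 'est'])

class StubModel:
    """Hypothetical stand-in that returns fixed scores via predict(...).est."""
    def __init__(self, scores):
        self.scores = scores

    def predict(self, uid, iid):
        return Prediction(uid, iid, self.scores.get(iid, 3.0))

# Toy ratings-plus-metadata table (made-up data)
toy = pd.DataFrame({
    'user_id':  ['u1', 'u1', 'u2', 'u2', 'u2'],
    'movie_id': ['m1', 'm2', 'm2', 'm3', 'm4'],
    'title':    ['A', 'B', 'B', 'C', 'D'],
})
model = StubModel({'m1': 4.0, 'm2': 3.5, 'm3': 4.8, 'm4': 2.1})

def top_n_unseen(model, df, user_id, n=5):
    """Rank the titles a user has not rated by predicted rating."""
    seen = set(df.loc[df['user_id'] == user_id, 'movie_id'])
    unseen = sorted(set(df['movie_id']) - seen)
    scored = pd.DataFrame({
        'movie_id': unseen,
        'predicted_rating': [model.predict(user_id, m).est for m in unseen],
    })
    titles = df[['movie_id', 'title']].drop_duplicates('movie_id')
    return scored.nlargest(n, 'predicted_rating').merge(titles, on='movie_id')

result = top_n_unseen(model, toy, 'u1', n=2)
print(result)
```

Because the stub's scores are known in advance, it is easy to confirm that already-rated titles are excluded and that the remaining ones come back in descending order of predicted rating.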

In [88]:
# df_movies = pd.read_csv('./data/collab_meta.csv')
# df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94324 entries, 0 to 94323
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        94324 non-null  object
 1   description  70549 non-null  object
 2   title        94324 non-null  object
 3   starring     94324 non-null  object
 4   movie_id     94324 non-null  object
dtypes: object(5)
memory usage: 3.6+ MB


In [89]:
# merged_df = pd.merge(df, df_movies, on="movie_id", how="left")
# merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133511 entries, 0 to 2133510
Data columns (total 7 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   genre        object
 4   description  object
 5   title        object
 6   starring     object
dtypes: int64(1), object(6)
memory usage: 113.9+ MB


In [104]:
# merged_df.to_csv('./data/collab_merged_df.csv', encoding='utf-8', index=False)

In [7]:
merged_df = pd.read_csv('./data/collab_merged_df.csv')

In [26]:
recommendations = recommend_movies(SVDpp_final, merged_df)
recommendations

The user_id # for these recommendations is A1WHHAHIOC21I5 and they have reviewed 5 different Movies &/ TV Shows


| | movie_id | predicted_rating | genre | description | title | starring |
|---|---|---|---|---|---|---|
| 0 | B000ARXF7S | 5.0 | HBO | \<![CDATA[ Dear America: Letters Home from Viet... | Dear America - Letters Home from Vietnam | Tom Berenger |
| 1 | B004MYOYFC | 5.0 | TV | "Mesmerizing" --The Sun (U.K.) As seen on publ... | Murdoch Mysteries: Season 3 | Peter Outerbridge |
| 2 | B004CWLRHC | 5.0 | Foreign Films | Krister Henriksson. Henning Mankell's Swedish ... | Wallander: Episodes 07 - 09 | Krister Henriksson |
| 3 | B00404ME0G | 5.0 | Broadway Musicals | Join us for a rousing celebration of the life ... | Sondheim: The Birthday Concert | Stephen Sondheim |
| 4 | B000WGTD82 | 5.0 | TV | Martin Clunes returns as the socially-challeng... | Doc Martin - Series 3 - Complete | Martin Clunes |
