![title_pic](./img/title_page.png)
#### TV & Movie recommendation system using a collaborative and content based filtering approach

# Collaborative Filtering Model

# Project Overview
The main goal of this project is to develop a hybrid movie/TV recommendation system that combines collaborative filtering and content-based filtering to suggest new content to users. These techniques are often applied independently; our project aims to harness their combined potential.

**Collaborative Filtering**: Analyzes existing user profiles to discover shared preferences and recommend new content based on similarities.

**Content-Based Filtering**: Suggests new content with features similar to the movie or TV show that you input.


# Business Understanding
As streaming platforms pile up content, users struggle to pinpoint films or shows that align with their tastes. Bias in platform algorithms exacerbates this challenge, making it harder for users to rely on platform recommendations. Biases emerge from factors like skewed user preferences, popularity bias, or even the platform's promotional agenda. As a result, recommended content may not cater to users' unique tastes, negatively affecting the overall user experience.

Streaming platforms stand to gain from implementing an unbiased hybrid recommendation system that blends content-based and collaborative filtering techniques. This approach leverages the best of both methods, increasing reliability and personalization while mitigating biases. The content-based technique analyzes features like genre and content description, while collaborative filtering harnesses the collective trends of user ratings. Together, they forge a powerful recommendation engine, enhancing user satisfaction and overall experience.
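This section does not specify how the two signals are combined. As a minimal sketch of one common option, the block below blends a collaborative predicted rating with a content similarity score using a weight `alpha`. The function name `hybrid_score`, the weight, and the candidate values are all hypothetical and are not taken from the project.

```python
def hybrid_score(predicted_rating, content_similarity, alpha=0.5, rating_scale=(1, 5)):
    """Blend a collaborative prediction with a content similarity in [0, 1].

    Hypothetical helper: the rating is rescaled to [0, 1] so both signals
    share a range, then the two are mixed with weight alpha.
    """
    low, high = rating_scale
    cf_component = (predicted_rating - low) / (high - low)
    return alpha * cf_component + (1 - alpha) * content_similarity


# Illustrative candidates: (predicted rating, content similarity), made-up values
candidates = {
    'title_a': (4.7, 0.10),
    'title_b': (4.2, 0.90),
    'title_c': (3.0, 0.95),
}

scores = {title: hybrid_score(rating, sim) for title, (rating, sim) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With equal weighting, a title that scores well on both signals can outrank one that has the highest predicted rating but little content overlap. Choosing `alpha` would itself need validation.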

# Collaborative Filtering

In the collaborative filtering approach, we compared users' rating profiles to identify similarities in movie preferences, using the ratings users gave to the movies they watched. We started with two benchmark models: a random NormalPredictor and a BaselineOnly model, which predicts ratings from the overall mean adjusted by user and item biases. To improve the recommendations, we iterated over two models: an SVD model and an SVDpp model. In the iteration process, we used cross-validation in tandem with GridSearchCV to find the best parameters for our models, optimizing their performance.
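For intuition about what the SVD model learns: in its default biased form, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below computes that estimate by hand; all numeric values are made up for illustration and do not come from the trained models.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


# Illustrative values only (two latent factors)
estimate = svd_estimate(
    global_mean=3.5,            # average rating over the training set
    user_bias=0.25,             # this user rates slightly above average
    item_bias=-0.5,             # this title is rated slightly below average
    user_factors=[0.5, -1.0],   # p_u
    item_factors=[1.0, 0.25],   # q_i
)
print(estimate)  # 3.5
```

The `n_factors` hyperparameter tuned later controls the length of these factor vectors, and `biased=False` drops the mean and bias terms so that only the dot product remains.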

This cross-validation and GridSearchCV process helped us fine-tune our models to a Root Mean Square Error (RMSE) of approximately 0.91. Given that our dataset contains user ratings on a scale of 1 to 5, this is a reasonable level of performance: the model's predictions are typically off by a little under one rating point. Note that RMSE is not a plain average of the errors; because errors are squared before averaging, large misses weigh more heavily. This performance supports more personalized recommendations for users based on their shared preferences, enhancing the overall user experience and satisfaction.
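To make the metric concrete, here is a small worked example that computes RMSE next to the mean absolute error (MAE) on toy ratings. The values are illustrative and not taken from the dataset; the point is that one large miss pulls RMSE above MAE.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the plain average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy ratings on the 1-5 scale; the last prediction misses by 2 points
actual = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 3.0]

print(round(rmse(actual, predicted), 4))  # 1.0607
print(mae(actual, predicted))             # 0.75
```

The grid search output later in this notebook reports both metrics, and the RMSE there is likewise higher than the MAE.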

### Importing All Packages Used

In [73]:
import pandas as pd
import numpy as np
import random


from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.prediction_algorithms import BaselineOnly,NormalPredictor, SVDpp, SVD


import warnings
warnings.filterwarnings('ignore')

In [74]:
df_merged = pd.read_csv('./data/collab_merged.csv')
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025476 entries, 0 to 2025475
Data columns (total 8 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   reviews      object
 4   genre        object
 5   description  object
 6   title        object
 7   starring     object
dtypes: int64(1), object(7)
memory usage: 123.6+ MB


### Creating a dataframe for evaluating the model later

In [31]:
df_meta = df_merged.drop(columns=['reviews'], axis=1)
df_user_check = df_merged.drop(columns=['movie_id', 'reviews'], axis=1)

In [44]:
def print_user_check_df(df, user_id):
    df_user = df[df['user_id'] == user_id].drop('user_id', axis=1).reset_index(drop=True)
    return df_user

In [45]:
print_user_check_df(df_user_check, 'A2Z9MK8HNYT17Y')

| | rating | genre | description | title | starring |
|---|---|---|---|---|---|
| 0 | 5 | comedy | Get ready to laugh with Mork & Mindy, the insa... | Mork & Mindy: The Complete Series | Robin Williams |
| 1 | 5 | comedy | When widower Mike Brady (Robert Reed) marries ... | Brady Bunch: The Complete Series | Florence Henderson |
| 2 | 5 | unknown | Will Smith stars as a teenager from inner city... | The Fresh Prince of Bel-Air: The Complete Seas... | Will Smith |
| 3 | 5 | comedy | <![CDATA[ Fresh Prince of Bel Air, The: The Co... | The Fresh Prince of Bel-Air: Season 3 | Will Smith |
| 4 | 5 | comedy | <![CDATA[ Fresh Prince of Bel-Air, The: The Co... | The Fresh Prince of Bel-Air: Season 4 | William Smith |


### Creating dataframe for modeling
- removing all columns that will not be used for collaborative modeling

In [4]:
df = df_merged.drop(columns=['reviews', 'genre', 'description',
                            'title', 'starring'], axis=1)
df.head()

| | rating | user_id | movie_id |
|---|---|---|---|
| 0 | 5 | A2M1CU2IRZG0K9 | 5089549 |
| 1 | 5 | A2PANT8U0OJNT4 | 5419263 |
| 2 | 5 | A3TS9EQCNLU0SM | 5092663 |
| 3 | 3 | A2AZ7CE08CWK4F | 5092663 |
| 4 | 4 | A2QWTVQ90KYZZP | 5092663 |


## Instantiating Reader() and splitting the data

In [5]:
reader = Reader()
data = Dataset.load_from_df(df[['user_id', 'movie_id', 'rating']], reader)

In [6]:
#splitting data into 80% training and 20% testing data and setting a random state
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

## Benchmark RMSE scores with NormalPredictor and BaselineOnly
Evaluation: The BaselineOnly model reaches an RMSE of 0.9317 for the predicted 'rating' (scaled 1 to 5), compared with 1.3849 for the random NormalPredictor. Since the BaselineOnly error is below one rating point, it already performs reasonably well at predicting user ratings. However, further improvements may still be possible through hyperparameter tuning and testing more complex models.
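As a rough illustration of what a baseline estimate looks like, the sketch below predicts a rating as the global mean plus an item bias and a user bias. It is a simplification: the biases here are plain averages of residuals, whereas the BaselineOnly output below shows that Surprise fits them with ALS. The function names and toy ratings are hypothetical.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: average deviation of an item's ratings from the global mean
    item_resid = defaultdict(list)
    for _, item, r in ratings:
        item_resid[item].append(r - mu)
    item_bias = {item: sum(v) / len(v) for item, v in item_resid.items()}

    # User bias: average of what is left after removing mean and item bias
    user_resid = defaultdict(list)
    for user, item, r in ratings:
        user_resid[user].append(r - mu - item_bias[item])
    user_bias = {user: sum(v) / len(v) for user, v in user_resid.items()}
    return mu, user_bias, item_bias


def predict_baseline(mu, user_bias, item_bias, user, item):
    # Unknown users or items fall back to a bias of 0
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


toy = [('u1', 'm1', 5), ('u1', 'm2', 3), ('u2', 'm1', 4), ('u2', 'm2', 2)]
mu, user_bias, item_bias = fit_baseline(toy)
print(predict_baseline(mu, user_bias, item_bias, 'u1', 'm1'))     # 5.0
print(predict_baseline(mu, user_bias, item_bias, 'new_user', 'm2'))  # 2.5
```

Because a model like this has no latent factors, it is cheap to fit, which is why it makes a useful yardstick for the SVD models.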

In [7]:
# Using NormalPredictor() as a dummy model to evaluate later model iterations
normal_dummy = NormalPredictor()
normal_dummy.fit(trainset)

# Setting the predicted RMSE score to dum_score
predictions = normal_dummy.test(testset)
dum_score = accuracy.rmse(predictions)

RMSE: 1.3849


In [8]:
# Using BaselineOnly() as a benchmark to evaluate later model iterations
baseline = BaselineOnly()
baseline.fit(trainset)

# Setting the predicted RMSE score to base_score
predictions = baseline.test(testset)
base_score = accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.9317


# SVD() modeling

In [10]:
# cross-validating across the entire dataset to evaluate model performance
# across 5 different folds of the dataset vs baselineOnly()
svd_cv = SVD(random_state=42)
cv_svd = cross_validate(svd_cv, data, measures=['RMSE'], n_jobs=-1, verbose=True)

# printing the average RMSE score across the 5 different folds
np.mean(cv_svd['test_rmse'])

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9269  0.9273  0.9268  0.9239  0.9273  0.9264  0.0013  
Fit time          13.15   13.28   13.26   12.74   11.99   12.89   0.49    
Test time         3.93    3.69    3.40    3.13    2.91    3.41    0.37    


0.9264214899329287

### cross_val Evaluation
- The cross-validation results show that the SVD model performs better than the dummy model and slightly better than the baseline model across all five folds.
- Therefore, after grid searching for the best hyperparameters, an SVD() model should perform even better than the BaselineOnly() model when fit on the training data.
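The cross-validation procedure can be sketched without any library: shuffle the sample indices, deal them into five folds, hold each fold out once, and average the per-fold scores. This is only an illustration of the idea; Surprise's own splitter may assign samples to folds differently. The fold RMSE values are copied from the output above.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Shuffle indices and deal them into n_splits nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def train_test_pairs(folds):
    """Yield (train_idx, test_idx), holding out each fold exactly once."""
    for held_out, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


folds = k_fold_indices(23)
print([len(fold) for fold in folds])  # [5, 5, 5, 4, 4]

# Per-fold RMSE values reported for SVD in the output above
fold_rmse = [0.9269, 0.9273, 0.9268, 0.9239, 0.9273]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))  # 0.9264
```

Averaging over folds gives a steadier estimate than a single train/test split, which is why the mean (with its small standard deviation) is the figure compared against the benchmarks.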

#### Gridsearching SVD()
- using GridSearchCV to find the best parameters for an SVD model.
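Before launching a search, it helps to estimate how many model fits it will trigger. The sketch below counts parameter combinations and multiplies by the number of folds. It assumes 5-fold cross-validation, which is consistent with the 240 tasks reported in the joblib log for the first SVD grid. The helper names are hypothetical.

```python
from itertools import product


def count_fits(param_grid, n_folds=5):
    """Return (number of parameter combinations, total model fits)."""
    n_combinations = 1
    for values in param_grid.values():
        n_combinations *= len(values)
    return n_combinations, n_combinations * n_folds


def iter_combinations(param_grid):
    """Yield every parameter combination as a dict."""
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


first_grid = {'n_factors': [5, 10, 20, 50, 100, 110],
              'n_epochs': [10, 20, 40, 60],
              'biased': [True, False]}

expanded_grid = {'n_factors': [1, 3, 5, 10, 20, 40, 80, 100],
                 'n_epochs': [10, 20, 40, 60, 80],
                 'biased': [True, False],
                 'init_mean': [0, 0.1, 0.5],
                 'reg_all': [0.02, 0.03, 0.05, 0.07],
                 'lr_all': [0.005, 0.01, 0.02]}

print(count_fits(first_grid))     # (48, 240)
print(count_fits(expanded_grid))  # (2880, 14400)
```

The expanded grid needs 60 times as many fits as the first one, which gives a sense of why a wider search is expensive on a dataset of this size.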

In [11]:
# setting parameters to grid search
params = {'n_factors': [5, 10, 20, 50, 100, 110],
          'n_epochs': [10, 20, 40, 60],
          'biased': [True, False]
         }

# applying grid search using the defined parameters
g_s_svd = GridSearchCV(SVD, param_grid=params, n_jobs=-1, joblib_verbose=10)
g_s_svd.fit(data)

# printing the best score and best parameters for the model
print(g_s_svd.best_score)
print(g_s_svd.best_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   29.1s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   52.9s
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  66 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 113 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 149 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  8.1min
[Parallel(n_jobs=-1)]: Done 189 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 234 out of 240 | elapsed: 14.1min remaining:   21.7s


{'rmse': 0.9143161864548596, 'mae': 0.6229083008467438}
{'rmse': {'n_factors': 5, 'n_epochs': 40, 'biased': True}, 'mae': {'n_factors': 5, 'n_epochs': 60, 'biased': True}}


[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 14.3min finished


### Instantiating an SVD() model with the best parameters

In [13]:
SVD_first = SVD(n_factors=5, n_epochs=40, random_state=42)
SVD_first.fit(trainset)
predictions = SVD_first.test(testset)
SVD_one = accuracy.rmse(predictions)

RMSE: 0.9128


In [None]:
params = {'n_factors': [1, 3, 5, 10, 20, 40, 80, 100],
          'n_epochs': [10, 20, 40, 60, 80],
          'biased':[True, False],
          'init_mean': [0, 0.1, 0.5],
          'reg_all': [0.02, 0.03, 0.05, 0.07],
          'lr_all': [0.005, 0.01, 0.02]
         }

g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1, joblib_verbose=10)
g_s_svd.fit(data)

# Surprise's GridSearchCV exposes best_params (no trailing underscore)
print(g_s_svd.best_params)

### Instantiating second SVD() model
- using best params found from the second grid search to save a final SVD() model
- Evaluating performance on testset

In [None]:
SVD_final = SVD(n_factors=1, n_epochs=40, random_state=42)
SVD_final.fit(trainset)
predictions = SVD_final.test(testset)
SVD_last = accuracy.rmse(predictions)

# SVD++() modeling

In [15]:
# cross-validating across the entire dataset to evaluate model performance
# across 5 different folds of the dataset vs baselineOnly()
svd_pp_cv = SVDpp(random_state=42)
cv_svdpp = cross_validate(svd_pp_cv, data, measures=['RMSE'], n_jobs=-1, verbose=True)

# printing the average RMSE score across the 5 different folds
print(np.mean(cv_svdpp['test_rmse']))

Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9183  0.9173  0.9169  0.9162  0.9152  0.9168  0.0011  
Fit time          28.34   28.53   28.45   28.21   28.29   28.36   0.11    
Test time         8.60    8.51    8.49    8.56    8.36    8.50    0.08    
0.9168005594770212


### cross_val Evaluation
- Again, the cross-validation results show that the SVD++ model performs better than the dummy model and slightly better than the baseline model across all five folds.
- Therefore, after grid searching for the best hyperparameters, an SVDpp() model should perform even better than the BaselineOnly() model when fit on the training data.

In [14]:
# setting parameters to grid search
params = {'n_factors': [5, 10, 20, 50, 100],
          'n_epochs': [10, 20, 40, 60],
          'cache_ratings': [True, False]
         }
# applying grid search using the defined parameters
g_s_svdpp = GridSearchCV(SVDpp, param_grid=params, n_jobs=-1, measures=['RMSE'], joblib_verbose=10)
g_s_svdpp.fit(data)

# printing the best score and best parameters for the model
print(g_s_svdpp.best_score)
print(g_s_svdpp.best_params)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   43.7s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done  66 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 113 tasks      | elapsed:  9.4min
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 149 tasks      | elapsed: 15.6min
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed: 20.1min
[Parallel(n_jobs=-1)]: Done 190 out of 200 | elapsed: 27.9min remaining:  1.5min


{'rmse': 0.9122061140158619}
{'rmse': {'n_factors': 5, 'n_epochs': 20, 'cache_ratings': True}}


[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed: 33.2min finished


### Instantiating best SVDpp() model
- using best params found from grid searching to save a final SVDpp() model
- Evaluating performance on testset

In [16]:
SVDpp_one = SVDpp(n_factors=5,n_epochs=20, cache_ratings=True, random_state=42)
SVDpp_one.fit(trainset)
predictions = SVDpp_one.test(testset)
SVDpp_first = accuracy.rmse(predictions)

RMSE: 0.9104


In [None]:
# setting parameters to grid search
params = {'n_factors': [1, 3, 5, 10, 20, 40, 80, 100],
          'n_epochs': [10, 20, 40, 60, 80],
          'cache_ratings': [True, False],
          'init_mean': [0, 0.1, 0.5],
          'reg_all': [0.02, 0.03, 0.05, 0.07],
          'lr_all': [0.005, 0.01, 0.02]
         }
# applying grid search using the defined parameters
g_s_svdpp = GridSearchCV(SVDpp, param_grid=params, n_jobs=-1, measures=['RMSE'], joblib_verbose=10)
g_s_svdpp.fit(data)

# printing the best score and best parameters for the model
print(g_s_svdpp.best_score)
print(g_s_svdpp.best_params)

## Final Model: SVD++ model ('SVDpp_one' variable)
Based on the model results, the grid-searched SVD++ model 'SVDpp_one' is the best performer, with an RMSE of 0.9104. Because of time constraints I did not grid search all possible hyperparameters; with more time, a wider search could lead to a better-performing model.

However, although the final model's RMSE of 0.91 clearly beats the dummy benchmark score of 1.38, it is not much better than the BaselineOnly() model's 0.93. This suggests diminishing returns from tuning, so I may want to do some further feature engineering on the dataset before attempting more hyperparameter tuning.

## Function to return content based on similar users predicted by the final model
To use the final model, I created the recommend_movies function, which gives content recommendations to existing users based on the trained SVDpp collaborative filtering model. It takes the trained model, a DataFrame like my movie and TV meta dataframe, an optional parameter N (defaulting to 5) for the number of titles to recommend, and an optional existing user ID (a random user is chosen when none is given). By predicting the user's ratings for unseen movies and sorting them in descending order, the function returns the N unseen movies with the highest predicted ratings, along with their details.

In [69]:
def recommend_movies(trained_model, movie_df, N=5, user_id=None):
    
    # If no user_id provided, default to a random user_id for evaluations
    if user_id is None:
        all_user_ids = movie_df['user_id'].unique().tolist()
        user_id = random.choice(all_user_ids)

    # Get user's watched movies and all available movies
    user_movies = movie_df[movie_df['user_id'] == user_id]['movie_id'].tolist()
    all_movies = movie_df['movie_id'].tolist()
    
    # Determine the set of unseen movies
    unseen_movies = set(all_movies) - set(user_movies)

    # Predict ratings for unseen movies
    predictions = []
    for movie_id in unseen_movies:
        predicted_rating = trained_model.predict(user_id, movie_id).est
        predictions.append({'movie_id': movie_id, 'predicted_rating': predicted_rating})
    
    # Create a DataFrame with predicted ratings
    predictions_df = pd.DataFrame(predictions)

    # Get the top N movies with the highest predicted ratings
    top_N = predictions_df.sort_values('predicted_rating', ascending=False).head(N)
    top_N_movie_ids = top_N['movie_id'].tolist()

    # Get movie details of the top N movies, keeping one row per movie_id
    # (chained so no in-place change is made on a slice of movie_df)
    top_N_movies = movie_df[movie_df['movie_id'].isin(top_N_movie_ids)].drop_duplicates(subset=['movie_id'])

    # Merge movie details with predicted ratings
    top_N_ratings = pd.merge(top_N, top_N_movies, on='movie_id')
    
    # Print user information
    num_movies_reviewed = len(user_movies)
    print(f"The user_id for this recommendations is {user_id} and they have reviewed {num_movies_reviewed} different movies and/or TV shows")

    # Return top N movie recommendations without unnecessary columns
    return top_N_ratings.drop(columns=['user_id', 'rating', 'movie_id'], axis=1)

In [72]:
recommendations = recommend_movies(SVDpp_one, df_meta, N=10, user_id='A1GF8W7J9526AB')
recommendations

The user_id for this recommendations is A1GF8W7J9526AB and they have reviewed 12 different movies and/or TV shows


| | predicted_rating | genre | description | title | starring |
|---|---|---|---|---|---|
| 0 | 4.710481 | unknown | Based upon the acclaimed comic book and direct... | Kingsman: The Secret Service | Various Artists |
| 1 | 4.596481 | unknown | They are the world's most celebrated bodyguard... | Secret Service VHS | Various Artists |
| 2 | 4.360443 | unknown | HBO presents the new one-hour drama series fro... | The Newsroom: Season 1 | Various Artists |
| 3 | 4.291105 | classics silent films | The Mack Sennett Collection, Vol. Onefeatures ... | The Mack Sennett Collection: Volume 1 | Charlie Chaplin |
| 4 | 4.272097 | military war | The complete third series of this terrific, at... | Foyle's War - Series 3 - Complete 2004 | Michael Kitchen |
| 5 | 4.266229 | art house international italian | Italian cinema dream team Sophia Loren (Marria... | A Special Day | Sophia Loren |
| 6 | 4.254209 | unknown | The beloved nurses and nuns of Nonnatus House ... | Call the Midwife: Season Five | Jenny Agutter |
| 7 | 4.249585 | comedy | Birdemic is one of our favorite bad movies of ... | RiffTrax Live!: Birdemic | Michael J Nelson |
| 8 | 4.248201 | unknown | 12 Angry Men, by Sidney Lumet, may be the most... | 12 Angry Men | Ed Begley |
| 9 | 4.247771 | action adventure | Based on the series of graphic novels by accla... | The Middleman | Matt Keeslar |


In [48]:
print_user_check_df(df_user_check, 'A1GF8W7J9526AB')

| | rating | genre | description | title | starring |
|---|---|---|---|---|---|
| 0 | 3 | unknown | This 90 minute feature film DVD stars Daniel B... | Bloodsport 4 | Daniel Bernhardt |
| 1 | 4 | under 15 | Black Mask Take the art direction and set desi... | Black Mask | Jet Li |
| 2 | 1 | comedy | A wild college party in a remote cabin will go... | Big Bad Wolf | Clint Howard |
| 3 | 5 | unknown | A high-octane, globe-spanning thriller with st... | Strike Back: Season 2 | Various Artists |
| 4 | 3 | horror | An aging caretaker for a powerful vampire must... | The Caretakers | Nick Faust |
| 5 | 1 | science fiction fantasy fantasy | Trapped under a dark spell, Gretel assembles a... | Hansel Vs Gretel | Brent Lydic |
| 6 | 1 | science fiction fantasy fantasy | Six innocent women in Northern Italy were foun... | Darkside Witches | Gerard Diefenthal |
| 7 | 1 | unknown | Reeve, an elite member of a covert unit of Nin... | The Ninja Immovable Heart | Danny Glover |
| 8 | 1 | drama | The big-budget, epic film on young King David ... | David and Goliath | Jerry Sokolosky |
| 9 | 3 | unknown | When the U.S. Government loses all contact wit... | Navy Seals Vs Zombies | Ed Quinn |


### Evaluation of returned recommendations:
- The recommendations for this user seem to be fairly on target. The user has reviewed a variety of action and military-type action & adventure films, with a range of ratings. Without more domain knowledge of the recommended titles it's hard to tell how relevant they are, but they are definitely new to the user and seemingly similar to the movies the user rated highly.

In [70]:
recommendations2 = recommend_movies(SVDpp_one, df_meta, N=10, user_id='A2Z9MK8HNYT17Y')
recommendations2

The user_id for this recommendations is A2Z9MK8HNYT17Y and they have reviewed 5 different movies and/or TV shows


| | predicted_rating | genre | description | title | starring |
|---|---|---|---|---|---|
| 0 | 5.0 | mystery suspense | "Now in original U.K. broadcast orderSuperbly ... | Midsomer Murders Series 8 | John Nettles |
| 1 | 5.0 | unknown | Wazzup? Superstar Martin Lawrence (Open Season... | Martin: Season 2 | Martin Lawrence |
| 2 | 5.0 | art house international italian | Italian cinema dream team Sophia Loren (Marria... | A Special Day | Sophia Loren |
| 3 | 5.0 | documentary | The true story behind the cataclysmic World Wa... | Hitlers Lost Sub VHS | Various Artists |
| 4 | 5.0 | comedy | <![CDATA[ Pushing Daisies: Season 1 & Season 2... | Pushing Daisies: The Complete First and Second... | Lee Pace |
| 5 | 5.0 | documentary | A 6 volume set of VHS tapes, from PBS | Eyes on the Prize America's Civil Rights Years... | Henry Hampton |
| 6 | 5.0 | unknown | EXCLUSIVE Edition which was ONLY available at ... | Wizard of Oz | Various Artists |
| 7 | 5.0 | documentary | The original Emmy Award-winning nine-part seri... | Ken Burns: The Civil War | Morgan Freeman, Garrison Keillor Sam Waterson |
| 8 | 5.0 | drama | *** SEASON SIX *** VOLUME SIX *** CONTAINS THR... | Dr. Quinn Medicine Woman: Season Six - Volume ... | Various Artists |
| 9 | 5.0 | comedy | Travel to Miami for a visit with the original ... | The Golden Girls: Season 5 | Beatrice Arthur |


In [71]:
print_user_check_df(df_user_check, 'A2Z9MK8HNYT17Y')

| | rating | genre | description | title | starring |
|---|---|---|---|---|---|
| 0 | 5 | comedy | Get ready to laugh with Mork & Mindy, the insa... | Mork & Mindy: The Complete Series | Robin Williams |
| 1 | 5 | comedy | When widower Mike Brady (Robert Reed) marries ... | Brady Bunch: The Complete Series | Florence Henderson |
| 2 | 5 | unknown | Will Smith stars as a teenager from inner city... | The Fresh Prince of Bel-Air: The Complete Seas... | Will Smith |
| 3 | 5 | comedy | <![CDATA[ Fresh Prince of Bel Air, The: The Co... | The Fresh Prince of Bel-Air: Season 3 | Will Smith |
| 4 | 5 | comedy | <![CDATA[ Fresh Prince of Bel-Air, The: The Co... | The Fresh Prince of Bel-Air: Season 4 | William Smith |


### Evaluation of returned recommendations (last project run-through):
- The video content recommended to this user seems to be a bit off, given the movies and ratings in their profile. This user has 5 reviews, all rated the highest score of 5/5; three are for the same show, and the other two are very similar in that they are also mainstream comedy TV shows of their time.
- I think what is happening is this: because the dataset is so vast, there are many different users to compare against. Since this user rated all of their very similar videos a perfect 5.0 (which can almost be thought of as a single review in this context), the user_id and movie_id matrix used by the recommendation system treats this user as similar to a large number of user profiles that cover a wider variety of movies and also rate this type of show highly.
- I did not catch this case early and currently do not have time to reprocess and remodel, but it makes me believe that if I want to use a purely collaborative approach, I will have to further filter this prepared dataset so that the minimum number of reviews per user is higher (currently the minimum number of reviews per user_id is 4).
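One further factor may be contributing here and is worth checking: every recommendation for this user has a predicted rating of exactly 5.0. Surprise clips its estimates to the Reader's rating scale (1 to 5 by default), so any raw estimate above 5 is reported as 5.0 and the top-N ordering among those tied titles becomes arbitrary. The sketch below illustrates the effect with made-up raw estimates; it is not a reconstruction of this user's actual predictions.

```python
def clip(estimate, low=1.0, high=5.0):
    """Clamp an estimate to the rating scale."""
    return min(high, max(low, estimate))


# Hypothetical raw estimates for a user whose bias term is very high
raw_estimates = {'title_a': 5.31, 'title_b': 5.02, 'title_c': 5.87, 'title_d': 4.40}

clipped = {title: clip(value) for title, value in raw_estimates.items()}
n_tied_at_max = sum(1 for value in clipped.values() if value == 5.0)

ranked_raw = sorted(raw_estimates, key=raw_estimates.get, reverse=True)
ranked_clipped = sorted(clipped, key=clipped.get, reverse=True)

print(n_tied_at_max)      # 3
print(ranked_raw[0])      # title_c
print(ranked_clipped[0])  # title_a
```

After clipping, the strongest raw candidate is no longer distinguishable from the other tied titles, so whichever happens to come first is returned. If this turns out to apply, ranking on unclipped estimates would be one thing to try.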

# Conclusions

During the analysis of the collaborative filtering recommendation system, we found that recommendations for a user whose profile contained only a few highly rated comedy TV shows did not align well with that user's preferences.

### Possible Causes
- **Vast dataset**: The large and diverse dataset may cause users with a small set of highly-rated videos to correlate strongly with users who reviewed varied movies but rated the same shows highly.
- **Similarity metric**: The current system treats varied-preference users as similar, resulting in inaccurate recommendations.

### Recommended Actions
- **Increase minimum reviews**: Keep only users with more reviews than the current minimum (4), so that comparisons rest on more data per user.
- **Customize similarity metric**: Implement a weighted similarity metric considering overlapping movies and users' genre importance.
- **Hybrid approach**: Use a combination of collaborative filtering and content-based filtering methods to account for user and movie attributes, generating better recommendations.
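The first action above can be sketched in a few lines. The example counts reviews per user and keeps only rows from users who meet a threshold. It uses plain tuples so it runs without dependencies; the function name, threshold, and toy rows are hypothetical.

```python
from collections import Counter


def filter_by_min_reviews(ratings, min_reviews):
    """Keep only rows whose user has at least min_reviews ratings.

    ratings: list of (user_id, movie_id, rating) tuples.
    """
    counts = Counter(user for user, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] >= min_reviews]


toy = [
    ('u1', 'm1', 5), ('u1', 'm2', 4), ('u1', 'm3', 3),
    ('u2', 'm1', 2),
    ('u3', 'm2', 5), ('u3', 'm3', 1),
]

kept = filter_by_min_reviews(toy, min_reviews=2)
print(len(kept))                           # 5
print(sorted({user for user, _, _ in kept}))  # ['u1', 'u3']
```

On the notebook's `df`, the same idea could likely be expressed with pandas as `df[df.groupby('user_id')['movie_id'].transform('count') >= k]` for a chosen `k`. Raising the threshold trades coverage (fewer users served) for denser profiles, so the cutoff would need to be checked against the RMSE and the quality of recommendations.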

These steps could potentially enhance the recommendation system's effectiveness, providing more relevant recommendations for each user profile.


##### Future steps sidenote: Since the dataset is so large, it may make more sense to use the BaselineOnly model to save computation if hyperparameter tuning cannot widen the gap between it and the final tuned model.