# Project 4
## Recommendations of Movies
![First Picture](pictures/Movie.jpg)

## Overview 
A vast number of movies have been made since the medium was first introduced, yet the average person has limited free time in which to watch them. **To help consumers save time and money, and to help companies give consumers the best experience**, recommendations are made to make selection easier. To make these recommendations, we use a dataset of roughly **10,000 movie entries** to predict how a user would rate a given movie, based on the ratings of users with similar reviews. By comparing similar users and movie ratings, we should be able to recommend movies accurately. 

We use **memory-based modeling** and **model-based modeling** to fit the training set. **Pearson similarity** appeared to perform the best for the memory-based models and was used to compare against the others. Grid search was also used for both types to find the best parameter combination for each model. The final model used was the KNN Baseline algorithm.

Lastly, there were **two ways** to recommend the top 5 movies to users. <br>
The **first** was an artificial ranking based on each movie's position in the user-based model and in the item-based model. **The sum of the two ranks is the movie's combined rank, which is compared against the others to make the recommendation.** <br>
The **second** way is to take the predicted ratings from the two models and rank movies by **the average of the two ratings**. The second way appears to give better predictions than the previous models, with a small difference in RMSE. 
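
The two combination strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the predicted ratings are hypothetical, and both models are assumed to have scored the same set of movies for one user.

```python
def rank_positions(scores):
    """Map each movie to its 1-based rank (1 = highest predicted rating)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {movie: pos for pos, movie in enumerate(ordered, start=1)}


def top_n_by_rank_sum(user_scores, item_scores, n=5):
    """First way: add the two rank positions; a smaller sum is better."""
    user_rank = rank_positions(user_scores)
    item_rank = rank_positions(item_scores)
    combined = {m: user_rank[m] + item_rank[m] for m in user_scores}
    return sorted(combined, key=combined.get)[:n]


def top_n_by_mean_rating(user_scores, item_scores, n=5):
    """Second way: average the two predicted ratings; a larger mean is better."""
    mean = {m: (user_scores[m] + item_scores[m]) / 2 for m in user_scores}
    return sorted(mean, key=mean.get, reverse=True)[:n]


# Hypothetical predicted ratings for one user (not taken from the project).
user_based = {"A": 4.8, "B": 4.1, "C": 3.9, "D": 3.0}
item_based = {"A": 4.0, "B": 4.6, "C": 4.5, "D": 2.5}

top_sum = top_n_by_rank_sum(user_based, item_based, n=2)
top_mean = top_n_by_mean_rating(user_based, item_based, n=2)
```

With these made-up numbers the two approaches order the top two movies differently, which is why comparing them on a held-out set is worthwhile.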

## Business Understanding
There has been a boom in streaming services, with thousands of movies available for consumers to watch. Netflix has over **4,000 movies** and Prime Video has roughly **7,000 movies**, not counting the original movies that these big streaming services also offer. To compete with other streaming platforms, these companies should focus on the user experience. <br>

**One aspect to look into is the recommendation system on their website, which recommends movies to users based on the movies they have watched, trending movies, and popular movies**. The user experience needs to entice existing users to stay and welcome new users to join and begin watching movies. 

Recommendations work well in most cases. On average, better-rated movies perform well, and people actively look for them before choosing what to watch. For example, **70% of videos** watched come from recommendations. 

## Data Understanding
The dataset was compiled by the **GroupLens** research group, and the data comes from [MovieLens](https://movielens.org/), a movie recommendation service. The dataset has **9,742 movies** and **100,836 ratings** from **610 users**, and it was last updated on September 26, 2018. Three datasets are loaded, but only two of them are needed: the ratings and the movies. 

The movies dataset has the **movie IDs, titles, and genres** for each movie. It was used mainly for conversions, but the genres, which were explored briefly, could potentially be used as well. 
The ratings dataset has the most information, as it holds the ratings and timestamps. Ratings range from **0.5 to 5 in steps of 0.5**. 

Exploring the datasets reveals some interesting facts. For one, the rating distribution is slightly skewed left, with a **mean** rating of roughly **3.5**. A handful of users contribute a large share of the reviews, which may introduce bias depending on who those users are. **Most movies were not rated below 3**, so it might be difficult to determine a good movie to recommend.  

In [1]:
### Read all the datasets and load them in with proper names ###
### Links were not used for this project ###

import pandas as pd
import helper as hp
movies = pd.read_csv('Data/movies.csv')
ratings = pd.read_csv('Data/ratings.csv')
tags = pd.read_csv('Data/tags.csv')

# links = pd.read_csv('ml-latest-small/links.csv') 

### Exploration Information
* There are **9,742** movies in the dataset
* There are **100,836** ratings and 610 users 
* There are **58** users that make up the **3,683** tags added
* Genres are separated by | if there is more than one
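
Because several genres share one pipe-delimited string, they need to be split before they can be counted or compared. The helper below is a minimal sketch; `split_genres` and the example string are illustrative and not part of the project code.

```python
def split_genres(genres):
    """Split a pipe-delimited genre string into a list of lowercase genre names."""
    return [g.strip().lower() for g in genres.split("|") if g.strip()]


# Illustrative genre string in the same format as the movies dataset.
example = "Adventure|Animation|Children|Comedy|Fantasy"
genre_list = split_genres(example)
```

On a pandas column, `movies['genres'].str.split('|')` gives the same kind of per-row list.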


In [2]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [3]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


## Data Preparation
When exploring the dataset, we found that none of the data was missing and that, other than the timestamps, all columns were usable. 

In [4]:
### Remove the timestamp column and lowercase the genre strings ###
ratings.drop('timestamp', axis=1, inplace=True)
movies['genres'] = movies['genres'].str.lower()

In [5]:
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split

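### Reader lets Surprise parse the ratings; then create the training set and test set ###
### Note: Reader() defaults to rating_scale=(1, 5), and predictions are clipped to that range; ###
### Reader(rating_scale=(0.5, 5)) would match the 0.5 to 5 ratings in this dataset ###
reader = Reader()
data = Dataset.load_from_df(ratings, reader)
trainset, testset = train_test_split(data, test_size=0.3, random_state=69)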

In [6]:
### Necessary libraries for modeling and validating and testing accuracy of the models ### 
from surprise.prediction_algorithms import knns
from surprise.model_selection import cross_validate
from surprise import accuracy

## Method
We used the **Surprise library** to create a recommendation system using different algorithms. Surprise has built-in models and testing functions that can be easily used in this project. The memory-based models are **KnnBasic**, **KnnBaseline**, and **KnnWithMeans**. The model-based algorithm is **SVD, or Singular Value Decomposition**. After finding the best base model, a grid search is used to find the best parameters. We also tried different similarity measures to see whether they improve the model. 

The metric used for evaluation is **RMSE, or Root Mean Square Error**. This metric measures how far, on average, each predicted rating was from the actual rating, with larger errors penalized more heavily. Ideally, we want a score close to 0, meaning the predictions closely match the actual ratings in the test set. 
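
As a minimal sketch of the metric, the function below computes the square root of the mean squared difference between actual and predicted ratings. The `rmse` helper and the sample ratings are illustrative only; the project itself relies on `accuracy.rmse` from Surprise.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings: the errors are 0.5, 0, 1, and -1.
true_ratings = [4.0, 3.5, 5.0, 2.0]
predicted_ratings = [3.5, 3.5, 4.0, 3.0]
score = rmse(true_ratings, predicted_ratings)  # sqrt(2.25 / 4) = 0.75
```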

The **default parameters** are used to compare the models. We perform a grid search once the best-performing model is chosen. We use the **cross-validation** function to get an average of each model's performance. 
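
The idea behind cross-validation can be sketched without any library: the data is shuffled and cut into k folds, and each fold serves once as the test set while the rest is used for training. The generator below is a simplified illustration with a hypothetical name, not how Surprise implements `cross_validate`.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Three folds over ten samples, mirroring cv=3 on a tiny scale.
splits = list(k_fold_indices(n_samples=10, k=3))
```

Averaging the per-fold RMSE gives a steadier estimate than a single train/test split, because every rating is used for testing exactly once.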

### Memory Based Methods 
**Three different** variations of KNN are compared, all user-based and all using cosine similarity. <br>
Cosine similarity performs better for KnnBasic and KnnWithMeans, but the best-performing model, **KnnBaseline**, performs best with **Pearson similarity** instead. 
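
To make the difference between the two similarity measures concrete, here is a small sketch in plain Python. It assumes the two rating vectors already cover only the movies both users rated; the helper names and ratings are illustrative. Pearson similarity is computed as cosine similarity on mean-centered ratings, which removes each user's tendency to rate high or low.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson_sim(u, v):
    """Pearson correlation: cosine similarity after subtracting each user's mean."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_sim(centered_u, centered_v)


# Made-up ratings: bob rates every movie exactly two points lower than alice.
alice = [5.0, 4.0, 5.0]
bob = [3.0, 2.0, 3.0]

cos = cosine_sim(alice, bob)
pear = pearson_sim(alice, bob)
```

Here Pearson treats the two users as perfectly correlated despite the constant offset, while cosine similarity comes out slightly below 1.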

#### KnnBasic

In [10]:
sim_cosine = {"name": "cosine", "user_based": True}
basic = knns.KNNBasic(sim_options=sim_cosine, random_state = 69)
cv_basic = cross_validate(basic, data, measures=['RMSE'], cv=3, verbose=False)['test_rmse'].mean()
basic.fit(trainset)
basic_pred = basic.test(testset)

print('Average Cross Validate RMSE Score: ', cv_basic)
print('Testset RMSE Score: ',accuracy.rmse(basic_pred))

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Average Cross Validate RMSE Score:  0.9780677232980256
RMSE: 0.9819
Testset RMSE Score:  0.9818952607403942


#### KnnWithMeans

In [11]:
knn_means = knns.KNNWithMeans(sim_options=sim_cosine, random_state = 69)
cv_means = cross_validate(knn_means, data, measures=['RMSE'], cv=3, verbose=False)['test_rmse'].mean()
knn_means.fit(trainset)
predictions = knn_means.test(testset)

print('Average Cross Validate RMSE Score: ', cv_means)
print('Testset RMSE Score: ',accuracy.rmse(predictions))

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Average Cross Validate RMSE Score:  0.9081005845177618
RMSE: 0.9103
Testset RMSE Score:  0.910278017316939


#### Knn Baseline 

In [13]:
knn_baseline = knns.KNNBaseline(sim_options=sim_cosine, random_state = 69)
cv_baseline = cross_validate(knn_baseline, data, measures=['RMSE'], cv=3, verbose=False)['test_rmse'].mean()
knn_baseline.fit(trainset)
predictions = knn_baseline.test(testset)

print('Average Cross Validate RMSE Score: ', cv_baseline)
print('Testset RMSE Score: ',accuracy.rmse(predictions))

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Average Cross Validate RMSE Score:  0.8860220078078452
RMSE: 0.8883
Testset RMSE Score:  0.8882986936389531


### Model Based Method

#### SVD algorithm

In [None]:
from surprise.prediction_algorithms import SVD

svd = SVD(random_state = 69)
svd_cv = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)['test_rmse'].mean()
svd.fit(trainset)
svd_pred = svd.test(testset)

print('Average Cross Validate RMSE Score: ', svd_cv)
print('Testset RMSE Score: ',accuracy.rmse(svd_pred))

Average Cross Validate RMSE Score:  0.8740699058628424
RMSE: 0.8833
Testset RMSE Score:  0.883294923683403


### KnnBaseline vs SVD
Both KnnBaseline and SVD performed the best when modeling, so we need to determine which will be used for the final model. A grid search is used to see whether there is a better combination of parameters to use. 
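
Conceptually, a grid search tries every combination of the candidate parameter values and keeps the one with the lowest error. The sketch below shows that loop in plain Python. `grid_search` and `toy_score` are hypothetical; the toy function stands in for a cross-validated RMSE, and its minimum is placed arbitrarily, so the result says nothing about the real models.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination; return (best_params, best_score), lower is better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a cross-validated RMSE; arbitrary minimum at k=20, min_k=3."""
    return abs(params["k"] - 20) / 100 + abs(params["min_k"] - 3) / 100


# 5 values of k times 10 values of min_k gives 50 combinations to evaluate.
grid = {"k": [10, 20, 30, 40, 50], "min_k": list(range(1, 11))}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows multiplicatively with each added parameter, and every combination is itself cross-validated.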

In [14]:
# from surprise.model_selection import GridSearchCV
# params = {'k':[10, 20, 30, 40, 50],
#           'min_k': [1, 2, 3, 4,5,6,7,8,9,10],
#           'sim_options': {'name': ['pearson'], 'user_based': [True]},
#           'random_state':[69]
#          }
# g_s_baseline = GridSearchCV(knns.KNNBaseline,param_grid=params,n_jobs=-1)
# g_s_baseline.fit(data)
# g_s_baseline.best_params['rmse']

In [14]:
sim_pearson = {"name": "pearson", "user_based": True}
knn_baseline = knns.KNNBaseline(k = 30, min_k = 6, sim_options=sim_pearson, random_state = 69)
cv_baseline_best = cross_validate(knn_baseline, data, measures=['RMSE'], cv=3, verbose=False)['test_rmse'].mean()
knn_baseline.fit(trainset)
predictions = knn_baseline.test(testset)
print('Average Cross Validate RMSE Score: ', cv_baseline_best)
print('Testset RMSE Score: ',accuracy.rmse(predictions))

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Average Cross Validate RMSE Score:  0.8728894348485715
RMSE: 0.8753
Testset RMSE Score:  0.8753167080242304


In [None]:
from surprise.model_selection import GridSearchCV
# params = {'n_factors': [20, 50, 100],
#          'reg_all': [0.02, 0.05, 0.1],
#           'lr_all': [.001, .002, .003, .004, .005],
#          'random_state':[69]
#          }
# g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1,cv = 5)
# g_s_svd.fit(data)
# g_s_svd.best_params


In [15]:
from surprise import SVD  # imported here so this cell does not depend on the earlier SVD cell having run

best_svd = SVD(n_factors=50, reg_all=0.05, lr_all=0.005, random_state=69)
cv_svd = cross_validate(best_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)['test_rmse'].mean()
best_svd.fit(trainset)
svd_pred = best_svd.test(testset)

print('Average Cross Validate RMSE Score: ', cv_svd)
print('Testset RMSE Score: ',accuracy.rmse(svd_pred))

### Model Evaluation

KnnBaseline performs slightly better than the SVD algorithm (a test-set RMSE of 0.8753 after tuning, versus 0.8833 for SVD with default parameters), so we will use KnnBaseline as our main model.