## Building a user-based recommendation model for Amazon
### Project 3 || Niladri Sekhar Sardar

#### DESCRIPTION

##### The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

#### Data Dictionary
##### UserID – identifiers of the 4848 customers who provided ratings
##### Movie 1 to Movie 206 – the 206 movies, each rated by some of the 4848 distinct users

#### Data Considerations
##### - Not all users have watched all the movies, so not every movie is rated by every user. These missing values are represented by NA.
##### - Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the highest.

#### Analysis Task
##### - Exploratory Data Analysis:

###### Which movies have maximum views/ratings?
###### What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
###### Define the top 5 movies with the least audience.
##### - Recommendation Model: Some of the movies haven’t been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts ratings for each of the users.

###### Divide the data into training and test data
###### Build a recommendation model on training data
###### Make predictions on the test data

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import surprise
from sklearn.metrics import mean_squared_error, pairwise
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV


In [2]:
df = pd.read_csv("Amazon - Movies and TV Ratings.csv")

In [3]:
df.shape

(4848, 207)

In [4]:
df.head(10)

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,
5,AP57WZ2X4G0AA,,,,,2.0,,,,,...,,,,,,,,,,
6,A3NMBJ2LCRCATT,,,,,5.0,,,,,...,,,,,,,,,,
7,A5Y15SAOMX6XA,,,,,2.0,,,,,...,,,,,,,,,,
8,A3P671HJ32TCSF,,,,,5.0,,,,,...,,,,,,,,,,
9,A3VCKTRD24BG7K,,,,,5.0,,,,,...,,,,,,,,,,


In [5]:
df.describe()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,...,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0
mean,5.0,5.0,2.0,5.0,4.103448,4.0,5.0,5.0,5.0,5.0,...,3.8,5.0,5.0,4.625,4.333333,4.333333,3.0,4.375,4.628571,4.923077
std,,,,0.0,1.496301,,,,,,...,1.643168,0.0,,0.517549,1.154701,1.632993,,1.407886,0.910259,0.27735
min,5.0,5.0,2.0,5.0,1.0,4.0,5.0,5.0,5.0,5.0,...,1.0,5.0,5.0,4.0,3.0,1.0,3.0,1.0,1.0,4.0
25%,5.0,5.0,2.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,4.0,4.0,5.0,3.0,4.75,5.0,5.0
50%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
75%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


In [6]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0
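
The `count` column above shows that most movies have only a handful of ratings out of 4848 users. A minimal sketch of how the overall sparsity of such a matrix could be measured, using a small hypothetical frame (`toy`) in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same layout as df (user_id + one column per movie).
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0, np.nan],
    "Movie3": [np.nan, np.nan, np.nan, 5.0],
})

ratings = toy.drop(columns="user_id")
n_cells = ratings.size                        # users x movies
n_rated = int(ratings.notna().sum().sum())    # cells that actually hold a rating
sparsity = 1 - n_rated / n_cells              # share of empty cells
print(f"{n_rated} of {n_cells} cells rated, sparsity = {sparsity:.2%}")
```

Running the same three lines on `df` instead of `toy` would give the figure for the real data.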


#### 1 - Exploratory Data Analysis: Which movies have maximum views/ratings?

In [26]:
# Top 5 movies by number of ratings (views)
max_view=df.describe().T['count'].sort_values(ascending=False)[:5].to_frame()
max_view

Unnamed: 0,count
Movie127,2313.0
Movie140,578.0
Movie16,320.0
Movie103,272.0
Movie29,243.0


In [27]:
# Top 5 movies by total (summed) rating
max_ratings=df.drop('user_id',axis=1).sum().sort_values(ascending=False)[:5].to_frame()
max_ratings

Unnamed: 0,0
Movie127,9511.0
Movie140,2794.0
Movie16,1446.0
Movie103,1241.0
Movie29,1168.0


#### 2 - Exploratory Data Analysis: What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

In [9]:
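# What is the average rating for each movie?
avg_rating = df.drop('user_id', axis=1).mean()  # mean() skips NaN, so only actual ratings are averaged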


In [10]:
# Define the top 5 movies with the maximum ratings.
df.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5].to_frame()

Unnamed: 0,0
Movie1,5.0
Movie55,5.0
Movie131,5.0
Movie132,5.0
Movie133,5.0
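
All five movies in the list above have a mean of exactly 5.0, and Movie1, for example, has only a single rating, so the ranking mostly reflects movies with very few votes. One possible refinement, sketched here on hypothetical toy data, is to require a minimum number of ratings before ranking by mean. The threshold `min_ratings` is an illustrative assumption, not a value from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same wide layout as df.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],   # one rating, mean 5.0
    "Movie2": [4.0, 5.0, 5.0, 4.0],            # four ratings, mean 4.5
    "Movie3": [np.nan, 3.0, 4.0, np.nan],      # two ratings, mean 3.5
})

ratings = toy.drop(columns="user_id")
stats = pd.DataFrame({"mean": ratings.mean(), "count": ratings.count()})

min_ratings = 2  # hypothetical threshold, chosen only for illustration
top = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(top)
```

With the filter in place, the single-vote Movie1 no longer outranks the more widely rated Movie2.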


#### 3 - Exploratory Data Analysis: Define the top 5 movies with the least audience.

In [11]:
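# Note: this ranks movies by lowest mean rating, not by audience size;
# ranking by audience would use .count() instead of .mean().
df.drop('user_id',axis=1).mean().sort_values(ascending=True)[:5].to_frame()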

Unnamed: 0,0
Movie144,1.0
Movie67,1.0
Movie45,1.0
Movie58,1.0
Movie60,1.0
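
The table above is ordered by mean rating. If "least audience" is read as "fewest users who rated the movie", the ranking would be based on the number of non-missing ratings instead. A minimal sketch on hypothetical toy data (the column name `n_ratings` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same wide layout as df.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, 4.0, 3.0],
    "Movie2": [np.nan, 1.0, np.nan],
    "Movie3": [2.0, np.nan, 5.0],
})

# count() ignores NaN, so it gives the number of users who rated each movie.
audience = toy.drop(columns="user_id").count().sort_values(ascending=True)
least_watched = audience.head(2).to_frame(name="n_ratings")
print(least_watched)
```

On the real data, many movies have a count of 1, so the bottom five by audience would contain ties.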


### - Recommendation Model: Some of the movies haven’t been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts ratings for each of the users.

##### Divide the data into training and test data
##### Build a recommendation model on training data
##### Make predictions on the test data

In [12]:
df_melt = df.melt(id_vars = df.columns[0],value_vars=df.columns[1:],var_name="Movies",value_name="Rating")
df_melt

Unnamed: 0,user_id,Movies,Rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
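
The `melt` call reshapes the wide user-by-movie table into long format, with one row per (user, movie) pair, which is why 4848 users and 206 movies produce 4848 × 206 = 998,688 rows. The same reshaping on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 2 users, 3 movies.
wide = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "Movie1": [5.0, np.nan],
    "Movie2": [np.nan, 3.0],
    "Movie3": [4.0, np.nan],
})

# One row per (user, movie) pair; unrated pairs keep NaN in the Rating column.
long = wide.melt(id_vars="user_id", var_name="Movies", value_name="Rating")
print(long)

# Row count equals users x movies: 2 x 3 = 6.
n_movies = wide.shape[1] - 1
print(len(long) == len(wide) * n_movies)
```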


In [13]:
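# Reader() defaults to rating_scale=(1, 5).
rd = Reader()
# Note: fillna(0) turns every unrated (user, movie) pair into a rating of 0,
# so the model is trained and evaluated mostly on imputed zeros.
data = Dataset.load_from_df(df_melt.fillna(0),reader=rd)
data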

<surprise.dataset.DatasetAutoFolds at 0x7f452dc0c490>
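
Because most (user, movie) pairs are unrated, filling them with 0 makes imputed values the bulk of the dataset. An alternative worth considering is to keep only the observed ratings before loading. This is a sketch on hypothetical data, not the approach used in this notebook, and its results would differ from the ones reported here.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame with the same columns as df_melt.
long = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u2"],
    "Movies": ["Movie1", "Movie1", "Movie2", "Movie2"],
    "Rating": [5.0, np.nan, np.nan, 3.0],
})

filled = long.fillna(0)                      # every unrated pair becomes a 0
observed = long.dropna(subset=["Rating"])    # alternative: keep real ratings only

zero_share = (filled["Rating"] == 0).mean()  # share of rows that are imputed zeros
print(f"share of imputed zeros after fillna(0): {zero_share:.0%}")
print(observed)
```

The `observed` frame keeps the user, item, rating column order that `Dataset.load_from_df` expects, so it could be passed to it in place of the filled frame.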

In [14]:
trainset, testset = train_test_split(data,test_size=0.25)

# Using SVD (Singular Value Decomposition)
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f452dc6abd0>
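
For reference, Surprise's `SVD` is a biased matrix factorization model. As described in the Surprise documentation, the estimated rating of user $u$ for item $i$ combines the global mean $\mu$, a user bias $b_u$, an item bias $b_i$, and the dot product of the latent factor vectors $p_u$ and $q_i$. The parameters are fitted by stochastic gradient descent on a regularized squared error:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

In these terms, `n_factors` is the length of $p_u$ and $q_i$, `n_epochs` is the number of SGD passes over the training set, and `lr_all` is the learning rate shared by all parameters.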

In [15]:
pred = svd.test(testset)
accuracy.rmse(pred)
accuracy.mae(pred)

RMSE: 1.0253
MAE:  1.0118


1.0117511976081337
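
The MAE above sits just over 1. One reading that is consistent with this (though not proven by it) is that Surprise clips estimates to the reader's rating scale by default, so a test row whose "true" value is an imputed 0 contributes an error of at least 1 no matter what the model learned. The NumPy sketch below uses made-up numbers to show the effect. The helper names `rmse` and `mae` are local to this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up test set: 98 imputed zeros followed by 2 real ratings.
y_true = np.array([0.0] * 98 + [5.0, 4.0])
# Made-up raw model output: near 0 for imputed rows, near the truth otherwise.
raw_pred = np.array([0.05] * 98 + [4.5, 4.2])
# Estimates clipped to a 1-5 scale.
clipped = np.clip(raw_pred, 1.0, 5.0)

print("RMSE:", rmse(y_true, clipped))
print("MAE: ", mae(y_true, clipped))
```

In this toy case each imputed row contributes an error of exactly 1, so both metrics land close to 1 regardless of how accurate the two real predictions are.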

In [16]:
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0261  1.0250  1.0276  1.0262  0.0011  
MAE (testset)     1.0120  1.0116  1.0126  1.0121  0.0004  
Fit time          36.53   36.91   37.70   37.05   0.49    
Test time         3.49    3.09    3.09    3.22    0.19    


{'test_rmse': array([1.02609926, 1.02498722, 1.02758553]),
 'test_mae': array([1.01200543, 1.01158986, 1.01264477]),
 'fit_time': (36.5263307094574, 36.909157276153564, 37.70207953453064),
 'test_time': (3.488724946975708, 3.092465877532959, 3.089536666870117)}

In [17]:
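def repeat(ml_type,dframe):
    """Cross-validate ml_type on dframe, then predict one known rating as a spot check."""
    rd = Reader()
    data = Dataset.load_from_df(dframe,reader=rd)
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("--"*15)
    # Known rating from the first row of df: user A3R5OBKS7OM2IR gave Movie1 a 5.0.
    usr_id = 'A3R5OBKS7OM2IR'
    mv = 'Movie1'
    r_u = 5.0
    # verbose=True already prints the prediction, so print() shows it a second time.
    print(ml_type.predict(usr_id,mv,r_ui = r_u,verbose=True))
    print("--"*15)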

In [18]:
repeat(SVD(),df_melt.fillna(df_melt['Rating'].mean()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0865  0.0835  0.0884  0.0861  0.0020  
MAE (testset)     0.0097  0.0098  0.0097  0.0097  0.0000  
Fit time          36.32   36.50   36.39   36.40   0.08    
Test time         3.22    3.68    3.38    3.43    0.19    
{'test_rmse': array([0.08647955, 0.08354766, 0.0883849 ]), 'test_mae': array([0.00967566, 0.00978719, 0.00970534]), 'fit_time': (36.31685996055603, 36.49971675872803, 36.38637113571167), 'test_time': (3.220324754714966, 3.680865526199341, 3.383195400238037)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
------------------------------


In [19]:
repeat(SVD(),df_melt.fillna(df_melt['Rating'].median()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0929  0.0914  0.0936  0.0926  0.0009  
MAE (testset)     0.0071  0.0072  0.0071  0.0071  0.0000  
Fit time          36.48   36.88   36.94   36.77   0.21    
Test time         3.70    3.21    3.92    3.61    0.30    
{'test_rmse': array([0.09291764, 0.09138289, 0.09361348]), 'test_mae': array([0.00711676, 0.00715395, 0.00707688]), 'fit_time': (36.4776656627655, 36.880109548568726, 36.94311809539795), 'test_time': (3.7009661197662354, 3.205291509628296, 3.9201879501342773)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
------------------------------
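
The errors drop sharply once missing values are filled with the mean or median, but this should be read with care: when nearly every row carries the same filled-in value, a predictor that returns that constant looks almost perfect. The sketch below uses made-up numbers, including a hypothetical `fill_value`, to compare the error over all rows with the error over real ratings only.

```python
import numpy as np

n_imputed = 990
fill_value = 4.4  # hypothetical fill constant, for illustration only
# Made-up real ratings on a 1-5 scale.
real = np.array([5, 5, 4, 1, 5, 3, 5, 2, 5, 4], dtype=float)
n_real = len(real)

y_true = np.concatenate([np.full(n_imputed, fill_value), real])
is_real = np.concatenate([np.zeros(n_imputed, dtype=bool), np.ones(n_real, dtype=bool)])

# A "model" that ignores user and movie and always predicts the fill value.
y_pred = np.full_like(y_true, fill_value)

mae_all = np.mean(np.abs(y_true - y_pred))
mae_real = np.mean(np.abs(y_true[is_real] - y_pred[is_real]))
print(f"MAE on all rows: {mae_all:.4f}")
print(f"MAE on real ratings only: {mae_real:.4f}")
```

Here the error over all rows is the real-ratings error diluted by the share of imputed rows, which is why a score computed this way says little about how well actual ratings are predicted.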


In [20]:
# Grid search over n_epochs, lr_all and n_factors to find the best SVD hyperparameters

param_grid = {'n_epochs':[20,30],'lr_all':[0.005,0.001],'n_factors':[50,100]}
gs = GridSearchCV(SVD,param_grid,measures=['rmse','mae'],cv=3)
data1 = Dataset.load_from_df(df_melt.fillna(df_melt['Rating'].mean()),reader=rd)
gs.fit(data1)

In [21]:
gs.best_score

{'rmse': 0.08472511174307518, 'mae': 0.009012022373479157}

In [22]:
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.08472511174307518
{'n_epochs': 30, 'lr_all': 0.001, 'n_factors': 50}
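
`GridSearchCV` evaluates every combination of the values in `param_grid` with cross-validation. The standard-library sketch below lists the combinations for the grid used above and counts the resulting model fits. It does not train anything.

```python
from itertools import product

param_grid = {'n_epochs': [20, 30], 'lr_all': [0.005, 0.001], 'n_factors': [50, 100]}

# Every combination of the listed values is one candidate configuration.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv = 3  # same number of folds as in the grid search above
print(len(combos), "combinations x", cv, "folds =", len(combos) * cv, "fits")
for combo in combos:
    print(combo)
```

With 8 combinations and 3 folds, the search fits 24 models, which is worth keeping in mind given the per-fold fit times of roughly 36 seconds reported earlier.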
