DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

Data Dictionary
user_id – identifier of each of the 4848 customers who provided ratings
Movie1 to Movie206 – ratings for 206 movies, provided by 4848 distinct users

Data Considerations
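- Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
- Ratings are stated to be on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best. The movies visible in the df.describe() output later in this notebook only show ratings between 1 and 5.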

Analysis Task
- Exploratory Data Analysis:

Which movies have maximum views/ratings?
What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
Define the top 5 movies with the least audience.
- Recommendation Model: Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.

Divide the data into training and test data
Build a recommendation model on training data
Make predictions on the test data

In [26]:
pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
     |████████████████████████████████| 11.8MB 2.7MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1670967 sha256=5bdbc539089d29ca0fe10bac56dfd83056fa9a18a3a8b60c162bd993ec2a78c1
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [38]:
# import libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV


In [3]:
# read the data
df = pd.read_csv('/content/drive/My Drive/DeepLearning_Simili/Projects/Machine Learning/Amazon - Movies and TV Ratings.csv')

In [4]:
df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [5]:
df.columns

Index(['user_id', 'Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6',
       'Movie7', 'Movie8', 'Movie9',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=207)

In [6]:
df.shape, df.size

((4848, 207), 1003536)

In [7]:
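# note: df.describe without parentheses does not compute statistics; it only echoes the bound method together with the frame itself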
df.describe

<bound method NDFrame.describe of              user_id  Movie1  Movie2  ...  Movie204  Movie205  Movie206
0     A3R5OBKS7OM2IR     5.0     5.0  ...       NaN       NaN       NaN
1      AH3QC2PC1VTGP     NaN     NaN  ...       NaN       NaN       NaN
2     A3LKP6WPMP9UKX     NaN     NaN  ...       NaN       NaN       NaN
3      AVIY68KEPQ5ZD     NaN     NaN  ...       NaN       NaN       NaN
4     A1CV1WROP5KTTW     NaN     NaN  ...       NaN       NaN       NaN
...              ...     ...     ...  ...       ...       ...       ...
4843  A1IMQ9WMFYKWH5     NaN     NaN  ...       NaN       NaN       5.0
4844  A1KLIKPUF5E88I     NaN     NaN  ...       NaN       NaN       5.0
4845   A5HG6WFZLO10D     NaN     NaN  ...       NaN       NaN       5.0
4846  A3UU690TWXCG1X     NaN     NaN  ...       NaN       NaN       5.0
4847   AI4J762YI6S06     NaN     NaN  ...       NaN       NaN       5.0

[4848 rows x 207 columns]>

In [9]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0


# Which movies have maximum views/ratings?



In [12]:
# movie with the most ratings (count of non-missing values)
df.describe().T['count'].sort_values(ascending=False)[:1].to_frame()

Unnamed: 0,count
Movie127,2313.0


In [13]:
# movie with the highest rating total (sum of rating values, not a view count)
df.drop('user_id',axis=1).sum().sort_values(ascending=False)[:1].to_frame()   

Unnamed: 0,0
Movie127,9511.0
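The two cells above measure different things: `count()` is the number of ratings a movie received, while `sum()` adds up the rating values. For Movie127 both agree, but they need not. A minimal sketch on a toy frame (the values and the names `toy`, `MovieA` to `MovieC` are made up for illustration and are not taken from the Amazon data):

```python
import pandas as pd

# Toy wide-format frame with hypothetical values
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [5.0, 4.0, None, 3.0],   # 3 ratings, total 12
    "MovieB": [None, 5.0, 5.0, None],  # 2 ratings, total 10
    "MovieC": [1.0, 1.0, 1.0, 1.0],    # 4 ratings, total 4
})
ratings = toy.drop(columns="user_id")

n_ratings = ratings.count()    # non-missing ratings per movie
rating_total = ratings.sum()   # sum of rating values per movie (NaN is skipped)

most_rated = n_ratings.idxmax()        # movie with the largest audience
highest_total = rating_total.idxmax()  # movie with the largest rating sum
print(most_rated, highest_total)
```

Here the most-rated movie and the movie with the highest rating total differ, so `count()` is the safer answer to "maximum views".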


# What is the average rating for each movie? Define the top 5 movies with the maximum ratings.


In [14]:
df.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5].to_frame()

Unnamed: 0,0
Movie1,5.0
Movie55,5.0
Movie131,5.0
Movie132,5.0
Movie133,5.0
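A perfect mean of 5.0 can come from a single rating (Movie1, for example, has a count of 1 in the summary table above), so ranking by mean alone favours movies with almost no audience. One possible refinement, sketched here on toy data, is to require a minimum number of ratings before ranking. The threshold `MIN_RATINGS` and the toy values are assumptions for illustration only:

```python
import pandas as pd

# Toy ratings with hypothetical values
toy = pd.DataFrame({
    "MovieA": [5.0, None, None, None],  # a single 5-star rating
    "MovieB": [5.0, 4.0, 5.0, 4.0],     # four ratings, mean 4.5
    "MovieC": [2.0, 3.0, None, 4.0],    # three ratings, mean 3.0
})

MIN_RATINGS = 3  # hypothetical cut-off; tune it to the data

stats = pd.DataFrame({"count": toy.count(), "mean": toy.mean()})
naive_best = stats["mean"].idxmax()  # ranks the single-rating movie first

# Keep only movies with enough ratings, then rank by mean
reliable = stats[stats["count"] >= MIN_RATINGS].sort_values("mean", ascending=False)
top = reliable.index.tolist()
print(naive_best, top)
```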


# Define the top 5 movies with the least audience.


In [15]:
# the five movies with the fewest ratings
df.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

Unnamed: 0,count
Movie1,1.0
Movie71,1.0
Movie145,1.0
Movie69,1.0
Movie68,1.0


# - **Recommendation Model:** Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.



In [24]:
# reshape from wide to long format: one row per (user_id, Movies, Rating) triple
df_melt = df.melt(id_vars = df.columns[0],value_vars=df.columns[1:],var_name="Movies",value_name="Rating")
df_melt

Unnamed: 0,user_id,Movies,Rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
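The melted frame keeps one row for every user and movie pair, including the pairs with no rating. A common alternative, sketched below on toy data, is to drop the missing rows so that only observed ratings remain. The frame `toy` and the names `long_all`, `long_obs` are hypothetical:

```python
import pandas as pd

# Toy wide-format frame with hypothetical values
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "MovieA": [5.0, None, None],
    "MovieB": [None, 2.0, 4.0],
})

# Wide to long: one row per (user, movie) pair, rated or not
long_all = toy.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

# Keep only the pairs that actually carry a rating
long_obs = long_all.dropna(subset=["Rating"]).reset_index(drop=True)

# Fraction of pairs with no rating
sparsity = 1 - len(long_obs) / len(long_all)
print(len(long_all), len(long_obs), sparsity)
```

Training on the observed rows only avoids treating "not rated" as if it were a rating value.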


In [29]:
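# Load the long-format data into a surprise Dataset.
# Caveat: fillna(0) turns every unrated (user, movie) pair into a rating of 0,
# which lies outside Reader()'s default rating scale of 1 to 5.
rd = Reader()
data = Dataset.load_from_df(df_melt.fillna(0),reader=rd)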
data

<surprise.dataset.DatasetAutoFolds at 0x7f2935b37400>

# Divide the data into training and test data

In [30]:
# split the data
trainset, testset = train_test_split(data,test_size=0.25)
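With `test_size=0.25`, a quarter of the rating rows are held out for evaluation. The idea can be sketched in plain Python as a shuffled split of (user, movie, rating) tuples. This is only an illustration of a random hold-out split, not surprise's implementation; `split_ratings` and the toy rows are hypothetical:

```python
import random

def split_ratings(rows, test_size=0.25, seed=0):
    """Shuffle the rating rows and hold out a fraction of them for testing."""
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]  # (train, test)

# Toy (user, movie, rating) rows with hypothetical values
rows = [(f"u{u}", f"Movie{m}", float((u + m) % 5 + 1))
        for u in range(10) for m in range(4)]

train, test = split_ratings(rows)
print(len(train), len(test))
```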

# Build a recommendation model on training data


In [31]:
# Using SVD (Singular Value Decomposition)
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f2935b373c8>
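The SVD algorithm in surprise predicts a rating as a global mean plus a user bias, an item bias, and the dot product of user and item factor vectors, all learned by stochastic gradient descent. The NumPy sketch below illustrates that idea on a tiny hypothetical data set. It is a simplified stand-in, not the library's code; `fit_mf`, `predict`, `rmse` and all hyperparameter values here are assumptions chosen for the toy example:

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
           lr=0.01, reg=0.02, seed=0):
    """Learn biases and latent factors by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])          # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)     # user and item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))      # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))      # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()                          # keep the old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i, lo=1.0, hi=5.0):
    """Predicted rating, clipped to the rating scale [lo, hi]."""
    mu, bu, bi, P, Q = model
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], lo, hi))

def rmse(model, ratings):
    return float(np.sqrt(np.mean([(r - predict(model, u, i)) ** 2
                                  for u, i, r in ratings])))

# Toy ratings: (user index, item index, rating), hypothetical values
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 5.0), (2, 2, 2.0), (3, 0, 3.0), (3, 2, 1.0)]

baseline = rmse(fit_mf(ratings, 4, 3, n_epochs=0), ratings)  # untrained model
trained = rmse(fit_mf(ratings, 4, 3), ratings)               # after SGD
print(baseline, trained)
```

The error reported here is on the training triples, so it only shows that the updates reduce the fit error; it says nothing about generalization.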

# Make predictions on the test data


In [32]:
pred = svd.test(testset)

In [33]:
# Print Root Mean Square Error
accuracy.rmse(pred)

RMSE: 1.0268


1.026838965029349

In [34]:
# Print Mean Absolute Error
accuracy.mae(pred)

MAE:  1.0124


1.012377472601624
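Both metrics are simple averages over the prediction errors: RMSE is the square root of the mean squared error and MAE is the mean absolute error. The hand-written versions below (hypothetical helpers, not surprise's `accuracy` module) also hint at why both scores above sit just over 1. The missing ratings were filled with 0, while surprise clips predictions to the Reader's rating scale (1 to 5 by default), so each filled entry likely contributes an error of at least 1:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy case: three entries filled with 0 and one real rating of 5.
# A prediction clipped to the 1..5 scale cannot go below 1.
y_true = [0.0, 0.0, 0.0, 5.0]
y_pred = [1.0, 1.0, 1.0, 4.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred))
```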

In [35]:
# cross validation using cv = 3
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0252  1.0265  1.0269  1.0262  0.0007  
MAE (testset)     1.0117  1.0123  1.0125  1.0121  0.0003  
Fit time          42.40   42.90   42.75   42.68   0.21    
Test time         3.62    3.22    3.66    3.50    0.20    


{'fit_time': (42.400745153427124, 42.8960120677948, 42.75018286705017),
 'test_mae': array([1.01168832, 1.01225215, 1.01245157]),
 'test_rmse': array([1.02517495, 1.02648831, 1.02685257]),
 'test_time': (3.617727518081665, 3.21506404876709, 3.663652181625366)}

In [36]:
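def repeat(ml_type,dframe):
    # cross-validate ml_type on dframe with 3 folds, then predict one known rating
    rd = Reader()
    data = Dataset.load_from_df(dframe,reader=rd)
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("--"*15)
    usr_id = 'A3R5OBKS7OM2IR'
    mv = 'Movie1'
    r_u = 5.0
    # verbose=True already prints the prediction, so print() shows it a second time
    print(ml_type.predict(usr_id,mv,r_ui = r_u,verbose=True))
    print("--"*15)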

In [37]:
# repeat the training, filling the missing ratings with the mean rating
repeat(SVD(),df_melt.fillna(df_melt['Rating'].mean()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0856  0.0875  0.0854  0.0862  0.0009  
MAE (testset)     0.0096  0.0098  0.0099  0.0098  0.0001  
Fit time          41.92   42.35   42.71   42.33   0.32    
Test time         3.40    3.98    3.44    3.61    0.26    
{'test_rmse': array([0.08560341, 0.08750733, 0.0854297 ]), 'test_mae': array([0.00964415, 0.00980438, 0.00992204]), 'fit_time': (41.923986196517944, 42.353782653808594, 42.710392475128174), 'test_time': (3.4026715755462646, 3.9757375717163086, 3.4378769397735596)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
------------------------------


In [39]:
# repeat the training, filling the missing ratings with the median rating
repeat(SVD(),df_melt.fillna(df_melt['Rating'].median()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0929  0.0950  0.0893  0.0924  0.0023  
MAE (testset)     0.0071  0.0071  0.0070  0.0070  0.0000  
Fit time          42.15   42.55   42.49   42.40   0.18    
Test time         3.85    3.38    3.38    3.54    0.22    
{'test_rmse': array([0.09285554, 0.0949759 , 0.08934389]), 'test_mae': array([0.00706132, 0.00708006, 0.00699587]), 'fit_time': (42.147024154663086, 42.55401682853699, 42.48625826835632), 'test_time': (3.8545894622802734, 3.3822731971740723, 3.3793675899505615)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
------------------------------
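The very low RMSE and MAE after mean or median filling should be read with care. When most entries are the fill value itself, a model that simply outputs that value is already almost always "right", whatever it does on real ratings. The NumPy sketch below illustrates this effect on synthetic data; the sizes, the random ratings and the constant predictor are assumptions for illustration and do not reproduce the notebook's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_observed = 100_000, 500   # hypothetical: 0.5% of entries are real ratings

observed = rng.integers(1, 6, size=n_observed).astype(float)  # real ratings in 1..5
fill = observed.mean()                                        # value used for imputation

# Targets after imputation: real ratings plus a large block of fill values
y_true = np.concatenate([observed, np.full(n_total - n_observed, fill)])
y_pred = np.full(n_total, fill)      # "model" that always predicts the fill value

rmse_all = np.sqrt(np.mean((y_true - y_pred) ** 2))        # looks excellent
rmse_observed = np.sqrt(np.mean((observed - fill) ** 2))   # error on real ratings only
print(rmse_all, rmse_observed)
```

Evaluating on observed ratings only gives a more honest picture of how well the model predicts what users actually rated.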


In [40]:
# apply grid search to find the best values of n_epochs, lr_all and n_factors
param_grid = {'n_epochs':[20,30],
             'lr_all':[0.005,0.001],
             'n_factors':[50,100]}

In [41]:
gs = GridSearchCV(SVD,param_grid,measures=['rmse','mae'],cv=3)
data1 = Dataset.load_from_df(df_melt.fillna(df_melt['Rating'].mean()),reader=rd)
gs.fit(data1)

In [42]:
gs.best_score

{'mae': 0.009015680432007667, 'rmse': 0.08468139302540162}

In [43]:
gs.best_score['rmse']

0.08468139302540162

In [44]:
# best parameters by RMSE: 30 epochs, learning rate 0.001, 50 factors
gs.best_params['rmse']

{'lr_all': 0.001, 'n_epochs': 30, 'n_factors': 50}
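A grid search tries every combination of the listed values, so the cost grows multiplicatively with each added parameter. The sketch below enumerates the grid from this notebook with `itertools.product`, using only the standard library, to show how many models are fitted with `cv=3`. The variable names `combos` and `n_fits` are illustrative:

```python
from itertools import product

# Same grid as in the notebook
param_grid = {'n_epochs': [20, 30],
              'lr_all': [0.005, 0.001],
              'n_factors': [50, 100]}

names = list(param_grid)
# One dict per parameter combination
combos = [dict(zip(names, values))
          for values in product(*(param_grid[n] for n in names))]

cv = 3
n_fits = len(combos) * cv   # each combination is fitted once per fold
print(len(combos), n_fits)
```

Given the fit times of roughly 42 seconds reported by `cross_validate` above, this many fits takes a noticeable amount of time, which is worth keeping in mind before widening the grid.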