
## Building user-based recommendation model for Amazon
### Created By- Pradip Bera

### Analysis Task
##### - Exploratory Data Analysis:

###### Which movies have maximum views/ratings?
###### What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
###### Define the top 5 movies with the least audience.

##### - Recommendation Model: Some of the movies have not been watched and are therefore not rated by the users. Amazon would like to use this as an opportunity to build a machine learning recommendation algorithm that predicts the missing ratings for each user.

###### Divide the data into training and test data
###### Build a recommendation model on training data
###### Make predictions on the test data

In [2]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
pwd()

'C:\\Users\\pradi\\Downloads'

In [4]:
df = pd.read_csv('Amazon - Movies and TV Ratings.csv')

In [5]:
df.head()

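|   | user_id | Movie1 | Movie2 | Movie3 | Movie4 | Movie5 | Movie6 | Movie7 | Movie8 | Movie9 | ... | Movie197 | Movie198 | Movie199 | Movie200 | Movie201 | Movie202 | Movie203 | Movie204 | Movie205 | Movie206 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | AH3QC2PC1VTGP | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A3LKP6WPMP9UKX | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | AVIY68KEPQ5ZD | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | A1CV1WROP5KTTW | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |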


In [6]:
df.shape

(4848, 207)

In [7]:
df_org = df.copy()

In [8]:
df.describe().T

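|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Movie1 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie2 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie3 | 1.0 | 2.000000 | NaN | 2.0 | 2.00 | 2.0 | 2.0 | 2.0 |
| Movie4 | 2.0 | 5.000000 | 0.000000 | 5.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie5 | 29.0 | 4.103448 | 1.496301 | 1.0 | 4.00 | 5.0 | 5.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Movie202 | 6.0 | 4.333333 | 1.632993 | 1.0 | 5.00 | 5.0 | 5.0 | 5.0 |
| Movie203 | 1.0 | 3.000000 | NaN | 3.0 | 3.00 | 3.0 | 3.0 | 3.0 |
| Movie204 | 8.0 | 4.375000 | 1.407886 | 1.0 | 4.75 | 5.0 | 5.0 | 5.0 |
| Movie205 | 35.0 | 4.628571 | 0.910259 | 1.0 | 5.00 | 5.0 | 5.0 | 5.0 |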


### Task 1  - Which movies have maximum views/ratings?

In [9]:
#Movie with highest views
df.describe().T['count'].sort_values(ascending=False)[:1].to_frame() #---Movie127

|   | count |
|---|---|
| Movie127 | 2313.0 |


In [10]:
# Movie with the highest total rating (sum of all ratings, so heavily rated movies dominate)
df.drop('user_id',axis=1).sum().sort_values(ascending=False)[:1].to_frame()  #---Movie127

|   | 0 |
|---|---|
| Movie127 | 9511.0 |
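
The two cells above answer slightly different questions: the number of ratings per movie versus the sum of those ratings. As a minimal sketch on a made-up frame (the `toy` data and its column names are assumptions, not the real dataset), the same two quantities can be read directly with `count()` and `sum()`, without going through `describe()`:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the wide ratings matrix (hypothetical values).
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [5.0, np.nan, 4.0, 3.0],
    "MovieB": [np.nan, 5.0, np.nan, np.nan],
    "MovieC": [2.0, 1.0, np.nan, np.nan],
})

ratings = toy.drop(columns="user_id")

views = ratings.count()   # number of non-null ratings per movie
totals = ratings.sum()    # sum of ratings per movie (NaN is skipped)

most_viewed = views.idxmax()      # label of the movie with the most ratings
highest_total = totals.idxmax()   # label of the movie with the largest rating sum

print(most_viewed, views.max())
print(highest_total, totals.max())
```

`idxmax()` returns only the first label when several movies tie, so it suits a single "top movie" question better than a ranking.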


### Task 2 - What is the average rating for each movie? Define the top 5 movies with the maximum ratings

In [11]:
df.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5].to_frame()

|   | 0 |
|---|---|
| Movie1 | 5.0 |
| Movie66 | 5.0 |
| Movie76 | 5.0 |
| Movie75 | 5.0 |
| Movie74 | 5.0 |
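
A mean of 5.0 can come from a single rating (the `describe()` output shows Movie1 has exactly one), so a ranking by mean alone tends to favor rarely rated movies. One possible refinement, sketched here on made-up data with a hypothetical `min_ratings` threshold, is to keep the rating count next to the mean and rank only movies that pass the threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: MovieA has one perfect rating, the others have more data.
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, np.nan, np.nan],
    "MovieB": [4.0, 5.0, 4.0, 5.0],
    "MovieC": [3.0, 4.0, np.nan, 5.0],
})

# Mean and number of ratings side by side, one row per movie.
summary = pd.DataFrame({"mean": toy.mean(), "count": toy.count()})

min_ratings = 3  # assumed cut-off, chosen only for illustration
ranked = (
    summary[summary["count"] >= min_ratings]
    .sort_values("mean", ascending=False)
)
print(ranked)
```

The right threshold depends on the data; the value used here is arbitrary.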


### Task 3 - Define the top 5 movies with the least audience

In [12]:
df.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

|   | count |
|---|---|
| Movie1 | 1.0 |
| Movie71 | 1.0 |
| Movie145 | 1.0 |
| Movie69 | 1.0 |
| Movie68 | 1.0 |
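
All five movies listed above have a count of 1, so which five appear depends on how ties are ordered. A short sketch (toy data, hypothetical names) of listing every movie tied at the minimum count instead of cutting at five:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: MovieA and MovieB each have a single rating.
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, np.nan],
    "MovieB": [np.nan, 2.0, np.nan],
    "MovieC": [4.0, 5.0, 3.0],
    "MovieD": [np.nan, 4.0, 4.0],
})

counts = toy.count()                  # ratings per movie
min_count = counts.min()              # smallest audience size
tied = counts[counts == min_count].index.tolist()  # every movie at that size

print(min_count, tied)
```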


### Task 4 - Recommendation Model

In [17]:
conda install -c conda-forge scikit-surprise

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\pradi\anaconda3

  added / updated specs:
    - scikit-surprise


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.12.0               |   py38haa244fe_0         1.0 MB  conda-forge
    python_abi-3.8             |           2_cp38           4 KB  conda-forge
    scikit-surprise-1.1.1      |   py38h6f4d8f0_2         597 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.6 MB

The following NEW packages will be INSTALLED:

  python_abi         conda-forge/win-64::python_abi-3.8-2_cp38
  scikit-surprise    conda-forge/win-64::scikit-surprise-1.1.1-py38h6f4d8f0_2

The following packages will be UPDATED:

  conda              pkgs/mai



  current version: 4.10.1
  latest version: 4.12.0

Please update conda by running

    $ conda update -n base -c defaults conda




In [18]:
# !pip install scikit-surprise

In [20]:
import surprise
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise.model_selection import cross_validate

In [21]:
df_melt = df.melt(id_vars = df.columns[0],value_vars=df.columns[1:],var_name="Movies",value_name="Rating")

In [22]:
df_melt

|   | user_id | Movies | Rating |
|---|---|---|---|
| 0 | A3R5OBKS7OM2IR | Movie1 | 5.0 |
| 1 | AH3QC2PC1VTGP | Movie1 | NaN |
| 2 | A3LKP6WPMP9UKX | Movie1 | NaN |
| 3 | AVIY68KEPQ5ZD | Movie1 | NaN |
| 4 | A1CV1WROP5KTTW | Movie1 | NaN |
| ... | ... | ... | ... |
| 998683 | A1IMQ9WMFYKWH5 | Movie206 | 5.0 |
| 998684 | A1KLIKPUF5E88I | Movie206 | 5.0 |
| 998685 | A5HG6WFZLO10D | Movie206 | 5.0 |
| 998686 | A3UU690TWXCG1X | Movie206 | 5.0 |
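
Melting a 4848 x 206 ratings matrix yields 4848 * 206 = 998,688 rows, most of which hold no rating. As a hedged sketch (toy frame, hypothetical user ids), dropping the empty rows after `melt` keeps only the ratings users actually gave, which is the usual input for a collaborative-filtering library:

```python
import numpy as np
import pandas as pd

# Small wide-format frame standing in for the real data (made-up values).
wide = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0],
})

# Wide to long: one row per (user, movie) pair, rated or not.
long = wide.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

# Keep only the pairs that carry a real rating.
observed = long.dropna(subset=["Rating"]).reset_index(drop=True)

print(len(long), len(observed))
```

Whether to drop or impute the missing pairs is a modeling choice; dropping them means the error metrics are computed only on ratings that exist.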


In [23]:
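# Reader() defaults to rating_scale=(1, 5).
# Caution: fillna(0) turns every unrated (user, movie) pair into an explicit rating of 0,
# which lies outside that scale and makes up the bulk of the rows fed to the model.
rd = Reader()
data = Dataset.load_from_df(df_melt.fillna(0),reader=rd)
data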

<surprise.dataset.DatasetAutoFolds at 0x1daad0e7790>

In [24]:
trainset, testset = train_test_split(data,test_size=0.25)
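
For intuition only, a 75/25 random split of long-format ratings can be sketched in plain pandas. This is an analogue of the idea, not what Surprise does internally, and the `observed` frame below is made up:

```python
import pandas as pd

# Hypothetical long-format ratings, one row per observed rating.
observed = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(8)],
    "Movies": ["Movie1", "Movie2"] * 4,
    "Rating": [5.0, 4.0, 3.0, 5.0, 1.0, 2.0, 4.0, 5.0],
})

# Hold out 25% of the rows at random; the rest is used for training.
test = observed.sample(frac=0.25, random_state=42)
train = observed.drop(test.index)

print(len(train), len(test))
```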

In [25]:
#Using SVD (Singular Value Decomposition)
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1daad0e7e50>
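
For reference, the prediction rule and training objective of this `SVD` algorithm, as described in the Surprise documentation, can be written as follows. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are latent factor vectors of length `n_factors`, and $\lambda$ is the regularization weight; the parameters are learned by stochastic gradient descent over `n_epochs` passes with learning rate `lr_all`.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_u^2 + b_i^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```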

In [26]:
pred = svd.test(testset)

In [27]:
accuracy.rmse(pred)

RMSE: 1.0260


1.0260027701236416

In [28]:
accuracy.mae(pred)

MAE:  1.0120


1.0120218261705
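
The two metrics reported above can be reproduced by hand. The sketch below uses made-up true and estimated ratings to show the arithmetic; RMSE is never smaller than MAE, and the two coincide only when every absolute error has the same size, so values as close as 1.026 and 1.012 suggest errors of nearly uniform magnitude.

```python
import numpy as np

# Hypothetical true ratings and model estimates.
true = np.array([5.0, 3.0, 4.0, 1.0])
est = np.array([4.0, 3.0, 2.0, 2.0])

err = est - true                      # per-rating error
rmse = np.sqrt(np.mean(err ** 2))     # root mean squared error
mae = np.mean(np.abs(err))            # mean absolute error

print(rmse, mae)
```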

In [29]:
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0266  1.0254  1.0269  1.0263  0.0007  
MAE (testset)     1.0122  1.0117  1.0124  1.0121  0.0003  
Fit time          43.38   45.10   43.50   43.99   0.78    
Test time         3.61    4.86    4.54    4.34    0.53    


{'test_rmse': array([1.02655857, 1.02535932, 1.026882  ]),
 'test_mae': array([1.01224612, 1.01174657, 1.01236731]),
 'fit_time': (43.37951183319092, 45.097490549087524, 43.50173354148865),
 'test_time': (3.6101443767547607, 4.855774879455566, 4.541630983352661)}

In [30]:
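def repeat(ml_type,dframe):
    # Cross-validate ml_type on dframe, then print one sample prediction.
    rd = Reader()
    data = Dataset.load_from_df(dframe,reader=rd)
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("--"*15)
    usr_id = 'A3R5OBKS7OM2IR'
    mv = 'Movie1'
    r_u = 5.0
    # The model is in whatever state cross_validate left it (fitted on a fold, not on all data).
    # verbose=True already prints the prediction, so the outer print() shows it a second time.
    print(ml_type.predict(usr_id,mv,r_ui = r_u,verbose=True))
    print("--"*15)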


In [31]:
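# Mean imputation: nearly every cell becomes the global mean rating, so the very low
# RMSE/MAE below mostly reflect predicting that constant, not recommendation quality.
repeat(SVD(),df_melt.fillna(df_melt['Rating'].mean()))
#repeat(SVD(),df_melt.fillna(df_melt['Rating'].median()))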

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0834  0.0851  0.0890  0.0858  0.0023  
MAE (testset)     0.0098  0.0098  0.0098  0.0098  0.0000  
Fit time          40.47   40.69   41.69   40.95   0.53    
Test time         3.18    3.06    3.26    3.17    0.08    
{'test_rmse': array([0.08337141, 0.0851464 , 0.08896734]), 'test_mae': array([0.00983467, 0.00983486, 0.00980184]), 'fit_time': (40.466646671295166, 40.686384439468384, 41.69422769546509), 'test_time': (3.1813571453094482, 3.0629024505615234, 3.26265549659729)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.40   {'was_impossible': False}
------------------------------
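
One plausible reading of the drop in RMSE from about 1.03 to about 0.09 is that mean imputation fills almost the whole matrix with a single constant, which any model can predict nearly perfectly. The sketch below illustrates the effect with synthetic data (matrix size, sparsity and the random seed are all assumptions): a predictor that always outputs the global mean scores very well on the mean-filled matrix and much worse on the ratings that were actually observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 200, 50

# Synthetic sparse ratings: roughly 2% of cells hold a rating from 1 to 5.
ratings = np.full((n_users, n_items), np.nan)
mask = rng.random((n_users, n_items)) < 0.02
ratings[mask] = rng.integers(1, 6, size=mask.sum())

mean_rating = np.nanmean(ratings)               # global mean of observed ratings
filled = np.where(mask, ratings, mean_rating)   # mean-imputed matrix

pred = np.full_like(filled, mean_rating)        # constant "model"

rmse_all = np.sqrt(np.mean((pred - filled) ** 2))                # over every cell
rmse_observed = np.sqrt(np.mean((pred[mask] - ratings[mask]) ** 2))  # real ratings only

print(rmse_all, rmse_observed)
```

In this setup `rmse_all` equals `rmse_observed` scaled by the square root of the observed fraction, which is why a sparse matrix makes the imputed score look so small.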


In [32]:
# Grid search over n_epochs, lr_all and n_factors to find the best SVD hyperparameters
from surprise.model_selection import GridSearchCV

In [33]:
param_grid = {'n_epochs':[20,30],
             'lr_all':[0.005,0.001],
             'n_factors':[50,100]}

In [34]:
gs = GridSearchCV(SVD,param_grid,measures=['rmse','mae'],cv=3)
data1 = Dataset.load_from_df(df_melt.fillna(df_melt['Rating'].mean()),reader=rd)
gs.fit(data1)

In [35]:
gs.best_score

{'rmse': 0.08476270562048378, 'mae': 0.009029807812800326}

In [36]:
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.08476270562048378
{'n_epochs': 30, 'lr_all': 0.001, 'n_factors': 50}
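
To make the cost of this search concrete, the grid can be expanded by hand. The sketch below uses only the standard library and the same `param_grid` values as above; it shows that 2 x 2 x 2 = 8 combinations are tried, each fitted once per fold with `cv=3`.

```python
from itertools import product

param_grid = {'n_epochs': [20, 30],
              'lr_all': [0.005, 0.001],
              'n_factors': [50, 100]}

# Every combination of one value per hyperparameter.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3
n_fits = len(combos) * n_folds   # total model fits performed by the search

print(len(combos), n_fits)
```

With fit times of roughly 40 seconds per fold on the full melted frame, as reported earlier, the number of fits is worth checking before enlarging the grid.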
