#### Building a user-based recommendation model for Amazon


#### DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.


#### Data Dictionary
UserID – IDs of the 4848 customers who provided ratings
Movie 1 to Movie 206 – 206 movies for which ratings are provided by 4848 distinct users


#### Data Considerations
- Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best.


#### Analysis Task
- Exploratory Data Analysis:
  - Which movies have maximum views/ratings?
  - What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
  - Define the top 5 movies with the least audience.
- Recommendation Model: Some of the movies have not been watched and therefore are not rated by the users. Amazon would like to take this as an opportunity and build a machine learning recommendation algorithm that provides the ratings for each of the users.
  - Divide the data into training and test data
  - Build a recommendation model on training data
  - Make predictions on the test data

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [31]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("Amazon - Movies and TV Ratings.csv")
df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [3]:
df.describe()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,...,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0
mean,5.0,5.0,2.0,5.0,4.103448,4.0,5.0,5.0,5.0,5.0,...,3.8,5.0,5.0,4.625,4.333333,4.333333,3.0,4.375,4.628571,4.923077
std,,,,0.0,1.496301,,,,,,...,1.643168,0.0,,0.517549,1.154701,1.632993,,1.407886,0.910259,0.27735
min,5.0,5.0,2.0,5.0,1.0,4.0,5.0,5.0,5.0,5.0,...,1.0,5.0,5.0,4.0,3.0,1.0,3.0,1.0,1.0,4.0
25%,5.0,5.0,2.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,4.0,4.0,5.0,3.0,4.75,5.0,5.0
50%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
75%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


#### Exploratory Data Analysis:

#### Which movies have maximum views/ratings?

In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0


In [5]:
df.describe().T["count"].sort_values(ascending = False)[:5]

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
Name: count, dtype: float64

#### Movie127 has the maximum views (2313 ratings).
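The per-movie rating counts can also be obtained directly with `DataFrame.count`, which skips NaN values and avoids computing the full `describe()` summary. The following is a minimal sketch on a small made-up frame; the `toy` data and the `views` name are illustrative assumptions, not part of the Amazon dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [5.0, np.nan, 4.0, 3.0],
    "MovieB": [np.nan, np.nan, 2.0, np.nan],
    "MovieC": [5.0, 4.0, np.nan, np.nan],
})

# count() ignores NaN, so it returns the number of ratings per movie.
views = toy.drop(columns="user_id").count().sort_values(ascending=False)
print(views)
```

On the real data, the same idea would be `df.drop(columns="user_id").count()`, which should match the `count` row of `describe()`.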

#### What is the average rating for each movie? 

#### Define the top 5 movies with the maximum ratings.

In [6]:
#avg rating for each movie
df.describe().T["mean"].to_frame()

Unnamed: 0,mean
Movie1,5.000000
Movie2,5.000000
Movie3,2.000000
Movie4,5.000000
Movie5,4.103448
...,...
Movie202,4.333333
Movie203,3.000000
Movie204,4.375000
Movie205,4.628571


In [7]:
new_data = df.drop('user_id', axis=1)
new_data.head()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,,,,,,,,,...,,,,,,,,,,
1,,,2.0,,,,,,,,...,,,,,,,,,,
2,,,,5.0,,,,,,,...,,,,,,,,,,
3,,,,5.0,,,,,,,...,,,,,,,,,,
4,,,,,5.0,,,,,,...,,,,,,,,,,


In [8]:
# top 5 movies by total (summed) rating
new_data.sum().sort_values(ascending=False).to_frame()[:5]

Unnamed: 0,0
Movie127,9511.0
Movie140,2794.0
Movie16,1446.0
Movie103,1241.0
Movie29,1168.0


Top 5 movies with the maximum ratings (by total rating score):

- Movie127: 9511.0
- Movie140: 2794.0
- Movie16: 1446.0
- Movie103: 1241.0
- Movie29: 1168.0
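Ranking by the summed rating mostly rewards movies with many ratings, which is why this list matches the most-viewed list. An alternative reading of "maximum ratings" is the highest average rating, optionally restricted to movies with a minimum number of ratings. The sketch below contrasts the two on made-up data; `toy`, `min_count`, and the threshold value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (NaN = not rated); values are made up for illustration.
toy = pd.DataFrame({
    "MovieA": [5.0, 4.0, 3.0, 4.0],            # four ratings
    "MovieB": [5.0, np.nan, np.nan, np.nan],   # a single 5-star rating
    "MovieC": [4.0, 5.0, 5.0, np.nan],         # three ratings
})

# Per-movie total, average, and number of ratings (NaN is skipped by all three).
stats = pd.DataFrame({
    "total": toy.sum(),
    "mean": toy.mean(),
    "count": toy.count(),
})

# Ranking by total favours the most-rated movie.
by_total = stats["total"].sort_values(ascending=False)

# Ranking by mean, keeping only movies with at least `min_count` ratings.
min_count = 2  # assumed threshold
by_mean = stats.loc[stats["count"] >= min_count, "mean"].sort_values(ascending=False)

print(by_total)
print(by_mean)
```

Without the count filter, a movie with a single 5-star rating would top the average-based ranking, so some threshold is usually worth considering.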

#### Define the top 5 movies with the least audience.

In [9]:
df.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

Unnamed: 0,count
Movie1,1.0
Movie71,1.0
Movie145,1.0
Movie69,1.0
Movie68,1.0
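Many movies in this dataset have exactly one rating, so which five appear in a "bottom 5" depends on how ties are ordered. One way to make the ties visible is `Series.nsmallest` with `keep="all"`, which returns every item tied at the cutoff. This is a sketch on made-up counts; the `counts` series is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rating counts per movie (made up for illustration).
counts = pd.Series({"MovieA": 1, "MovieB": 1, "MovieC": 1, "MovieD": 7, "MovieE": 2})

# Default behaviour: exactly n rows, ties broken by order of appearance.
first_two = counts.nsmallest(2)

# keep="all": every movie tied with the n-th smallest value is included.
tied = counts.nsmallest(2, keep="all")

print(first_two)
print(tied)
```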


#### Recommendation Model:

In [11]:
# Surprise is an easy-to-use library that allows us to quickly
# build rating-based recommender systems without reinventing the wheel.
# The Reader class is used to parse a file containing ratings.

from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split

In [12]:
df.columns

Index(['user_id', 'Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6',
       'Movie7', 'Movie8', 'Movie9',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=207)

In [13]:
melt_data = df.melt(id_vars=df.columns[0], value_vars= df.columns[1:],
                   var_name="movie name", value_name="ratings")

In [14]:
melt_data

Unnamed: 0,user_id,movie name,ratings
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
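The melted frame has one row per user/movie pair (4848 × 206 = 998,688 pairs according to the data dictionary), and most of those rows carry no rating. The sketch below shows the wide-to-long reshape on a tiny made-up frame and one possible way to keep only the observed ratings with `dropna`; the `wide`, `long`, and `observed` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: one row per user, one column per movie.
wide = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "Movie1": [5.0, np.nan],
    "Movie2": [np.nan, 3.0],
})

# Wide to long: one row per (user, movie) pair, including unrated pairs.
long = wide.melt(id_vars="user_id", var_name="movie name", value_name="ratings")

# Keep only pairs that actually have a rating.
observed = long.dropna(subset=["ratings"]).reset_index(drop=True)

print(long)
print(observed)
```

Feeding only observed ratings to a recommender is a common choice, since a filled-in value is otherwise treated as if the user had really given that rating.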


In [15]:
from surprise import Dataset
reader = Reader(rating_scale=(-1,10))

In [16]:
data = Dataset.load_from_df(melt_data.fillna(0), reader=reader)

In [17]:
trainset, testset = train_test_split(data, test_size=0.25)

In [18]:
#The singular value decomposition (SVD) provides another way
#to factorize a matrix, into singular vectors and singular values.

from surprise import SVD
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f7e736bf430>
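For intuition about the factorization described in the comment, here is a small NumPy sketch of the classical singular value decomposition on a dense made-up matrix. Note the assumption: as far as its documentation describes, Surprise's `SVD` class is a Funk-style latent factor model trained with stochastic gradient descent on the supplied ratings, so this sketch illustrates the underlying idea rather than what `algo.fit` does internally. The matrix `R` and rank `k` are illustrative.

```python
import numpy as np

# Small made-up dense rating matrix (rows: users, columns: movies).
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Classical SVD: R = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Using all singular values reconstructs R (up to floating-point error).
R_full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a low-rank approximation.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 3))
print(np.round(R_k, 2))
```

The low-rank approximation is the part that matters for recommendation: a few latent factors per user and per movie are used to approximate the rating matrix.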

In [19]:
prediction = algo.test(testset)
accuracy.rmse(prediction)

RMSE: 0.2751


0.2750541902990439
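Because missing ratings were filled with 0 before loading, the test set likely consists mostly of zero-valued cells, so a low RMSE here may largely reflect how easy it is to predict 0 for unrated pairs. The NumPy sketch below illustrates the effect with synthetic data: a predictor that always outputs 0 scores a small RMSE over all cells but a large one on the cells that hold real ratings. The sizes, the 1 to 5 rating range, and the random seed are assumptions, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 100_000   # assumed number of user/movie pairs
n_rated = 500       # assumed number of pairs with a real rating

# Zero-filled rating vector with a few real ratings in 1..5.
ratings = np.zeros(n_cells)
rated_idx = rng.choice(n_cells, size=n_rated, replace=False)
ratings[rated_idx] = rng.integers(1, 6, size=n_rated)

# Trivial predictor: always predict 0.
pred = np.zeros(n_cells)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_all = rmse(ratings, pred)                          # over every cell
rmse_rated = rmse(ratings[rated_idx], pred[rated_idx])  # over real ratings only

print(f"RMSE over all cells:   {rmse_all:.4f}")
print(f"RMSE over rated cells: {rmse_rated:.4f}")
```

In this setup the two numbers differ by a factor of `sqrt(n_rated / n_cells)`, which is why evaluating only on observed ratings gives a more informative error.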

In [20]:
# Predict the rating for a single user and movie
user_id = 'A3R50BKS7OM2IR'
movie_id = 'Movie1'
rating = 5.0

In [21]:
algo.predict(user_id, movie_id, r_ui=rating, verbose=True)

user: A3R50BKS7OM2IR item: Movie1     r_ui = 5.00   est = 0.00   {'was_impossible': False}


Prediction(uid='A3R50BKS7OM2IR', iid='Movie1', r_ui=5.0, est=0.00047508964652544575, details={'was_impossible': False})

In [22]:
from surprise.model_selection import cross_validate

In [23]:
cross_validate(algo, data, measures=['RMSE','MAE'], cv = 3, verbose=False)

{'test_rmse': array([0.28305094, 0.28092762, 0.2828386 ]),
 'test_mae': array([0.0431761 , 0.04265138, 0.04269878]),
 'fit_time': (36.44602012634277, 35.75047969818115, 35.565176010131836),
 'test_time': (2.865178108215332, 2.196913957595825, 2.1280221939086914)}

In [26]:
def repeat(algo_type, frame, min_, max_):
    # Cross-validate the given algorithm on `frame`, then predict one sample rating.
    reader = Reader(rating_scale=(min_, max_))
    data = Dataset.load_from_df(frame, reader=reader)
    algo = algo_type
    print(cross_validate(algo, data, measures=['RMSE','MAE'], cv = 3, verbose = True))
    user_id = 'A3R50BKS70M2IR'
    movie_id = 'Movie1'
    rating = 5.0
    algo.predict(user_id, movie_id, r_ui = rating, verbose=True)

In [27]:
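# Keep a smaller subset: the first 1212 users and the first 50 columns (user_id plus Movie1 to Movie49)
df = df.iloc[:1212,:50]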

In [28]:
melt_data = df.melt(id_vars=df.columns[0], value_vars=df.columns[1:],
                   var_name="movie name", value_name="ratings")

In [29]:
repeat(SVD(), melt_data.fillna(0),-1,10)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.4360  0.4487  0.4642  0.4496  0.0115  
MAE (testset)     0.1027  0.1021  0.1039  0.1029  0.0008  
Fit time          2.06    2.12    2.00    2.06    0.05    
Test time         0.10    0.09    0.10    0.10    0.00    
{'test_rmse': array([0.43602263, 0.44869669, 0.46415674]), 'test_mae': array([0.10267312, 0.10207043, 0.10394785]), 'fit_time': (2.058443069458008, 2.1231300830841064, 1.9964628219604492), 'test_time': (0.10438203811645508, 0.09285902976989746, 0.10057187080383301)}
user: A3R50BKS70M2IR item: Movie1     r_ui = 5.00   est = 0.00   {'was_impossible': False}


In [32]:
repeat(SVD(),melt_data.fillna(melt_data.mean(numeric_only=True)),-1,10)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0945  0.0841  0.0934  0.0907  0.0046  
MAE (testset)     0.0199  0.0205  0.0204  0.0202  0.0003  
Fit time          2.02    2.10    1.93    2.02    0.07    
Test time         0.09    0.09    0.09    0.09    0.00    
{'test_rmse': array([0.09445197, 0.08412615, 0.09340356]), 'test_mae': array([0.01986889, 0.02047909, 0.02037124]), 'fit_time': (2.0243237018585205, 2.0983238220214844, 1.9322123527526855), 'test_time': (0.0878610610961914, 0.09418487548828125, 0.08534407615661621)}
user: A3R50BKS70M2IR item: Movie1     r_ui = 5.00   est = 4.61   {'was_impossible': False}


In [33]:
repeat(SVD(),melt_data.fillna(melt_data.median(numeric_only=True)),-1,10)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0944  0.0950  0.1042  0.0979  0.0045  
MAE (testset)     0.0203  0.0195  0.0193  0.0197  0.0004  
Fit time          2.08    2.16    2.19    2.14    0.04    
Test time         0.10    0.09    0.09    0.09    0.00    
{'test_rmse': array([0.09444991, 0.0950339 , 0.10419936]), 'test_mae': array([0.02027261, 0.01949627, 0.01934775]), 'fit_time': (2.0837509632110596, 2.160637378692627, 2.1881678104400635), 'test_time': (0.09638285636901855, 0.09352874755859375, 0.09183526039123535)}
user: A3R50BKS70M2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
