## Description
   The dataset contains movie ratings submitted by Amazon customers between May 1996 and July 2014.
## Data Dictionary
    user_id – identifier of the customer who provided the ratings (4848 distinct customers)
    Movie1 to Movie206 – ratings for 206 movies, one column per movie; a value is present only where that user rated that movie
## Data Considerations
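    - Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
    - Ratings are described as being on a scale of -1 to 10, where -1 is the lowest rating and 10 is the highest. (In the summary statistics shown for this file, the observed ratings range from 1 to 5.)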
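
The two considerations above (how many cells are missing, and which rating values actually occur) can be checked directly. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ratings table: one row per user, one column per movie.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, np.nan, 4.0],
    "Movie3": [np.nan, np.nan, 1.0, np.nan],
})
ratings = toy.drop(columns="user_id")

# Share of (user, movie) cells that hold no rating.
n_cells = ratings.size
n_rated = int(ratings.notna().sum().sum())
sparsity = 1 - n_rated / n_cells

# Smallest and largest rating that actually occur (NaN is skipped by default).
observed_min = ratings.min().min()
observed_max = ratings.max().max()

print(f"rated cells: {n_rated} of {n_cells} (sparsity {sparsity:.1%})")
print(f"observed rating range: {observed_min} to {observed_max}")
```

Running the same lines on the real `df` would show how sparse the matrix is and whether the documented scale matches the values present.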

## Analysis Task
    - Exploratory Data Analysis:

        1) Which movies have maximum views/ratings?
        2) What is the average rating for each movie? Identify the top 5 movies with the highest ratings.
        3) Identify the 5 movies with the smallest audience.
        
    - Recommendation Model: Some movies have not been watched, and are therefore not rated, by some users. Amazon would like to use this as an opportunity to build a machine learning recommendation algorithm that predicts ratings for each user.

        4) Divide the data into training and test data
        5) Build a recommendation model on training data
        6) Make predictions on the test data

In [7]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [8]:
df = pd.read_csv("Amazon_Movies and TV Ratings.csv")
df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [9]:
df.describe()        

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,...,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0
mean,5.0,5.0,2.0,5.0,4.103448,4.0,5.0,5.0,5.0,5.0,...,3.8,5.0,5.0,4.625,4.333333,4.333333,3.0,4.375,4.628571,4.923077
std,,,,0.0,1.496301,,,,,,...,1.643168,0.0,,0.517549,1.154701,1.632993,,1.407886,0.910259,0.27735
min,5.0,5.0,2.0,5.0,1.0,4.0,5.0,5.0,5.0,5.0,...,1.0,5.0,5.0,4.0,3.0,1.0,3.0,1.0,1.0,4.0
25%,5.0,5.0,2.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,4.0,4.0,5.0,3.0,4.75,5.0,5.0
50%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
75%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


## Exploratory Data Analysis

### Question 1: Which movies have the most views/ratings?

In [10]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0


In [11]:
df.describe().T["count"].sort_values(ascending= False)[:5]

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
Name: count, dtype: float64

#### Movie127 has the most ratings (2313), far ahead of Movie140 (578).
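
The same counts can be obtained without building the full summary table, since `DataFrame.count()` returns the number of non-missing values per column. A minimal sketch on a small made-up frame (the `toy` data is an illustrative assumption, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ratings table: one row per user, one column per movie.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "Movie1": [5.0, np.nan, np.nan, np.nan],
    "Movie2": [4.0, 5.0, np.nan, 3.0],
    "Movie3": [np.nan, 2.0, 5.0, np.nan],
})

# count() ignores NaN, so each value is the number of users who rated that movie.
ratings_per_movie = toy.drop(columns="user_id").count()

# Sort descending to see the most-rated movies first.
most_rated = ratings_per_movie.sort_values(ascending=False)
print(most_rated.head(5))
```

This avoids computing quantiles and standard deviations that the question does not need.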

### Question 2: What is the average rating for each movie? Identify the top 5 movies with the highest ratings.

###### a) Average rating for each movie

In [12]:
df.describe().T["mean"].to_frame()

Unnamed: 0,mean
Movie1,5.000000
Movie2,5.000000
Movie3,2.000000
Movie4,5.000000
Movie5,4.103448
...,...
Movie202,4.333333
Movie203,3.000000
Movie204,4.375000
Movie205,4.628571


###### b) Identify the top 5 movies with the highest ratings

In [13]:
new_data = df.drop('user_id',axis=1)
new_data.head()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,,,,,,,,,...,,,,,,,,,,
1,,,2.0,,,,,,,,...,,,,,,,,,,
2,,,,5.0,,,,,,,...,,,,,,,,,,
3,,,,5.0,,,,,,,...,,,,,,,,,,
4,,,,,5.0,,,,,,...,,,,,,,,,,


In [14]:
new_data.sum().sort_values(ascending = False).to_frame()[:5]

Unnamed: 0,0
Movie127,9511.0
Movie140,2794.0
Movie16,1446.0
Movie103,1241.0
Movie29,1168.0


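#### The 5 movies above have the highest total (summed) ratings. Because the ranking uses the sum rather than the mean, it favors movies with many ratings, which is why it matches the most-rated list from Question 1.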
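
An alternative reading of "maximum ratings" is the highest average rating. Ranking by the raw mean would put movies with a single 5-star rating at the top, so one option is to require a minimum number of ratings first. The sketch below shows this on made-up data; the `toy` frame and the `MIN_RATINGS` threshold are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: Movie1 has one perfect rating, the others have several.
toy = pd.DataFrame({
    "Movie1": [5.0, np.nan, np.nan, np.nan],   # 1 rating, mean 5.0
    "Movie2": [4.0, 5.0, 4.0, 3.0],            # 4 ratings, mean 4.0
    "Movie3": [np.nan, 2.0, 5.0, 4.0],         # 3 ratings, mean about 3.67
})

# Count, mean and sum side by side make the difference between rankings visible.
summary = pd.DataFrame({
    "n_ratings": toy.count(),
    "mean_rating": toy.mean(),
    "sum_rating": toy.sum(),
})

MIN_RATINGS = 3  # hypothetical threshold; choose it to suit the real data

# Keep movies with enough ratings, then rank by mean (ties broken by count).
top_by_mean = (
    summary[summary["n_ratings"] >= MIN_RATINGS]
    .sort_values(["mean_rating", "n_ratings"], ascending=False)
)
print(top_by_mean.head(5))
```

With the threshold in place, the single-rating movie drops out even though its mean is the highest.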

### Question 3: Identify the 5 movies with the smallest audience

In [15]:
df.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

Unnamed: 0,count
Movie1,1.0
Movie71,1.0
Movie145,1.0
Movie69,1.0
Movie68,1.0
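
All five movies listed have exactly one rating, so which five appear depends on how the sort breaks ties. A minimal sketch of listing every tied movie instead, using `Series.nsmallest` with `keep="all"`; the `counts` values are made up for illustration.

```python
import pandas as pd

# Hypothetical rating counts per movie; several movies tie at a single rating.
counts = pd.Series(
    {"MovieA": 1, "MovieB": 1, "MovieC": 1, "MovieD": 7, "MovieE": 2, "MovieF": 1}
)

# keep="all" returns every movie tied with the n-th smallest value,
# so the result may contain more than n rows.
least_watched = counts.nsmallest(2, keep="all")
print(least_watched)

# Number of movies that share the minimum count.
n_tied_at_min = int((counts == counts.min()).sum())
print(f"{n_tied_at_min} movies share the minimum of {counts.min()} rating(s)")
```

On the real data, the same call on `df.drop(columns="user_id").count()` would show how many movies share the lowest count.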


## Recommendation Model

In [16]:
# Surprise is an easy-to-use Python library for quickly building rating-based
# recommender systems without reinventing the wheel.

from surprise import Reader         # The Reader class is used to parse a file containing ratings
from surprise import accuracy
from surprise.model_selection import train_test_split

In [17]:
df.columns

Index(['user_id', 'Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6',
       'Movie7', 'Movie8', 'Movie9',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=207)

In [18]:
melt_data = df.melt(id_vars = df.columns[0], value_vars = df.columns[1:], 
                    var_name = "movie name", value_name="ratings")

In [19]:
melt_data

Unnamed: 0,user_id,movie name,ratings
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
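
The melted table has one row per (user, movie) pair, including the pairs with no rating. One common alternative is to keep only the observed ratings before handing the data to Surprise. The sketch below shows that step on a made-up frame; `wide`, the helper `to_surprise_dataset`, and the `rating_scale=(1, 5)` default are assumptions for illustration (the scale is a guess based on the minimum and maximum visible in the summary statistics).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide ratings table.
wide = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0],
})

# Wide to long: one row per (user, movie) pair, rated or not.
long_all = wide.melt(id_vars="user_id", var_name="movie name", value_name="ratings")

# Keep only the pairs that carry an actual rating.
long_observed = long_all.dropna(subset=["ratings"]).reset_index(drop=True)
print(long_observed)


def to_surprise_dataset(frame, rating_scale=(1, 5)):
    """Hypothetical helper: wrap a long-format frame in a Surprise Dataset.

    Surprise is imported lazily so the pandas part runs without it installed.
    load_from_df expects the columns in the order user, item, rating.
    """
    from surprise import Dataset, Reader

    reader = Reader(rating_scale=rating_scale)
    return Dataset.load_from_df(frame[["user_id", "movie name", "ratings"]], reader)
```

With this approach the model is trained and scored only on ratings that users actually gave.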


In [20]:
from surprise import Dataset
reader = Reader(rating_scale=(-1,10))

In [21]:
# Note: fillna(0) turns every unrated (user, movie) pair into an explicit rating of 0.
data = Dataset.load_from_df(melt_data.fillna(0), reader = reader)

In [22]:
trainset, testset = train_test_split(data, test_size = 0.25)

In [23]:
# SVD in Surprise is a matrix-factorization algorithm that learns latent user and item factors from the ratings.

from surprise import SVD       
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1e2ba14dc40>

In [24]:
prediction = algo.test(testset)
accuracy.rmse(prediction)

RMSE: 0.2806


0.28056794254992845

In [26]:
# Predicting the rating for a single (user, movie) pair

In [27]:
user_id = 'A3R5OBKS7OM2IR'
movie_id = 'Movie1'
rating = 5.0

In [28]:
algo.predict(user_id, movie_id, r_ui=rating, verbose = True)

user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 0.06   {'was_impossible': False}


Prediction(uid='A3R5OBKS7OM2IR', iid='Movie1', r_ui=5.0, est=0.06149107982039217, details={'was_impossible': False})
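
One possible reading of the low RMSE (about 0.28) next to an estimate of 0.06 for a true rating of 5.0: because the missing values were filled with 0 before training, most (user, movie) pairs in the test set have a target of 0, and predictions near 0 are rewarded. The sketch below illustrates the effect with made-up numbers; the 990/10 split and the all-zero predictor are assumptions for illustration, not properties of this dataset or of the fitted SVD model.

```python
import numpy as np

# Hypothetical test set: 1,000 (user, movie) pairs, of which only 10 are real
# ratings of 5.0; the other 990 are placeholder zeros created by fillna(0).
true_ratings = np.concatenate([np.zeros(990), np.full(10, 5.0)])

# A trivial "model" that predicts 0 for every pair.
predicted = np.zeros_like(true_ratings)

# RMSE over all pairs, placeholders included.
rmse_all = np.sqrt(np.mean((true_ratings - predicted) ** 2))

# RMSE restricted to the pairs that carry a real rating.
observed = true_ratings > 0
rmse_observed = np.sqrt(np.mean((true_ratings[observed] - predicted[observed]) ** 2))

print(f"RMSE over all pairs:    {rmse_all:.3f}")
print(f"RMSE over real ratings: {rmse_observed:.3f}")
```

In this toy setup the overall RMSE looks small while every real rating is missed by the full 5 points, which is why an error computed on observed ratings only is the more informative number.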

In [29]:
from surprise.model_selection import cross_validate

In [30]:
cross_validate(algo, data, measures=['RMSE','MAE'], cv = 3, verbose = False)

{'test_rmse': array([0.28466763, 0.27853711, 0.28308831]),
 'test_mae': array([0.04290158, 0.04239286, 0.04208872]),
 'fit_time': (39.8145637512207, 39.77973747253418, 38.242653131484985),
 'test_time': (3.3344571590423584, 3.292114019393921, 2.866452932357788)}

In [31]:
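def repeat(algo_type, frame, min_, max_):
    """Cross-validate the algorithm on `frame`, then predict one known rating."""
    # Build a Surprise dataset with the given rating scale.
    reader = Reader(rating_scale=(min_, max_))
    data = Dataset.load_from_df(frame, reader=reader)
    algo = algo_type
    print(cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True))
    # The prediction below relies on `algo` having been fitted during cross-validation.
    user_id = 'A3R5OBKS7OM2IR'
    movie_id = 'Movie1'
    rating = 5.0
    algo.predict(user_id, movie_id, r_ui=rating, verbose=True)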

In [32]:
# Keep the first 1212 rows and the first 50 columns (user_id plus Movie1 to Movie49).
df = df.iloc[:1212, :50]

In [33]:
melt_data = df.melt(id_vars = df.columns[0], value_vars = df.columns[1:], 
                    var_name = "movie name", value_name="ratings")

In [34]:
repeat(SVD(), melt_data.fillna(0), -1, 10)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.4461  0.4456  0.4658  0.4525  0.0094  
MAE (testset)     0.0991  0.1039  0.1078  0.1036  0.0036  
Fit time          2.33    2.32    2.28    2.31    0.02    
Test time         0.11    0.11    0.17    0.13    0.03    
{'test_rmse': array([0.44612658, 0.44561018, 0.46581414]), 'test_mae': array([0.0990521 , 0.10386887, 0.10776172]), 'fit_time': (2.3316242694854736, 2.3152010440826416, 2.284266233444214), 'test_time': (0.11327600479125977, 0.11348891258239746, 0.16948890686035156)}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 0.42   {'was_impossible': False}


In [35]:
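# numeric_only=True skips the string columns, which recent pandas versions no longer ignore silently.
repeat(SVD(), melt_data.fillna(melt_data.mean(numeric_only=True)), -1, 10)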

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0891  0.0833  0.0975  0.0900  0.0058  
MAE (testset)     0.0203  0.0196  0.0204  0.0201  0.0003  
Fit time          2.23    2.19    2.25    2.23    0.03    
Test time         0.11    0.45    0.12    0.23    0.16    
{'test_rmse': array([0.08913923, 0.08327312, 0.09749963]), 'test_mae': array([0.02029403, 0.01958467, 0.02035107]), 'fit_time': (2.2336337566375732, 2.1906981468200684, 2.2531378269195557), 'test_time': (0.10772061347961426, 0.45308494567871094, 0.12307000160217285)}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.61   {'was_impossible': False}


In [36]:
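# numeric_only=True skips the string columns, which recent pandas versions no longer ignore silently.
repeat(SVD(), melt_data.fillna(melt_data.median(numeric_only=True)), -1, 10)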

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.1003  0.0962  0.0979  0.0981  0.0017  
MAE (testset)     0.0196  0.0195  0.0198  0.0196  0.0001  
Fit time          2.41    2.30    2.32    2.34    0.05    
Test time         0.12    0.11    0.12    0.12    0.01    
{'test_rmse': array([0.10031897, 0.09620899, 0.09791962]), 'test_mae': array([0.01955701, 0.01951054, 0.01983827]), 'fit_time': (2.40529727935791, 2.29986572265625, 2.320157527923584), 'test_time': (0.1216588020324707, 0.10907196998596191, 0.11919426918029785)}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
