Building user-based recommendation model for Amazon. Project 3

DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

Data Dictionary UserID – 4848 customers who provided a rating for each movie Movie 1 to Movie 206 – 206 movies for which ratings are provided by 4848 distinct users

Data Considerations

All the users have not watched all the movies and therefore, all movies are not rated. These missing values are represented by NA.
Ratings are on a scale of -1 to 10 where -1 is the least rating and 10 is the best.

Analysis Task

Exploratory Data Analysis:

*Which movies have maximum views/ratings?  

*What is the average rating for each movie? 

*Define the top 5 movies with the maximum ratings.

*Define the top 5 movies with the least audience.

Recommendation Model: Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.
Divide the data into training and test data Build a recommendation model on training data Make predictions on the test data

In [14]:
import pandas as pd
import scipy as sparse
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
import numpy as np

In [15]:
#Reading the Data
dfRaw=pd.read_csv(r'E:\data_sciene_course\simplilearn\Machine Learing\Project_3\Amazon - Movies and TV Ratings.csv')
dfRaw

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4843,A1IMQ9WMFYKWH5,,,,,,,,,,...,,,,,,,,,,5.0
4844,A1KLIKPUF5E88I,,,,,,,,,,...,,,,,,,,,,5.0
4845,A5HG6WFZLO10D,,,,,,,,,,...,,,,,,,,,,5.0
4846,A3UU690TWXCG1X,,,,,,,,,,...,,,,,,,,,,5.0


In [16]:
#Checking the null values
dfRaw.isnull().sum()

user_id        0
Movie1      4847
Movie2      4847
Movie3      4847
Movie4      4846
            ... 
Movie202    4842
Movie203    4847
Movie204    4840
Movie205    4813
Movie206    4835
Length: 207, dtype: int64

In [17]:
#Filling the null values with 0
dfRaw.fillna(0,inplace=True)

In [18]:
#Checking again the null values
dfRaw.isnull().sum()

user_id     0
Movie1      0
Movie2      0
Movie3      0
Movie4      0
           ..
Movie202    0
Movie203    0
Movie204    0
Movie205    0
Movie206    0
Length: 207, dtype: int64

In [19]:
#Creating new userID column marking it from 1 to len(DataFrame) so that the other processing can be easy
df_userID=dfRaw
df_userID.insert(0,"userID",range(1,1+len(df_userID)))
df_userID.head()

Unnamed: 0,userID,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,1,A3R5OBKS7OM2IR,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,AH3QC2PC1VTGP,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,A3LKP6WPMP9UKX,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,AVIY68KEPQ5ZD,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,A1CV1WROP5KTTW,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
#Removing the user_ID column from original Data
df_user_Id_dropped=df_userID.drop("user_id", axis=1)
df_userID_index=df_user_Id_dropped.set_index('userID')
df_userID_index.head(5)

Unnamed: 0_level_0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
#Calculating the total rating for each movie
dfSum=df_userID_index.sum().to_frame().rename(columns={0 : 'Total_Rating'})
dfSum.head(5)

Unnamed: 0,Total_Rating
Movie1,5.0
Movie2,5.0
Movie3,2.0
Movie4,10.0
Movie5,119.0


### Which movies have maximum views/ratings?

In [26]:
#Calulating the movie with highest rating
df1=dfSum
dfMovie=df1[df1["Total_Rating"]==dfSum.Total_Rating.max()]
dfMovie

Unnamed: 0,Total_Rating
Movie127,9511.0


In [27]:
print(f"The movie with highest rating is { dfSum.idxmax()[0]} and rating is { dfMovie.iloc[0,0]}")

The movie with highest rating is Movie127 and rating is 9511.0


### *What is the average rating for each movie?

In [43]:
dfavgRating=dfSum
dfavgRating["Average"]=dfSum.Total_Rating/len(df_userID_index)
dfavgRating.head(5)

Unnamed: 0,Total_Rating,Average
Movie1,5.0,0.001031
Movie2,5.0,0.001031
Movie3,2.0,0.000413
Movie4,10.0,0.002063
Movie5,119.0,0.024546


### Define the top 5 movies with the maximum ratings

In [57]:
sort=dfavgRating.sort_values("Total_Rating", axis=0, ascending=False)
print("Top 5 Movies with Highest rating are")
sort.head(5)

Top 5 Movies with Highest rating are


Unnamed: 0,Total_Rating,Average
Movie127,9511.0,1.96184
Movie140,2794.0,0.57632
Movie16,1446.0,0.298267
Movie103,1241.0,0.255982
Movie29,1168.0,0.240924


### *Define the top 5 movies with the least audience.

In [58]:
print("Top 5 Movies with Least rating are")
sort.tail(5)

Top 5 Movies with Least rating are


Unnamed: 0,Total_Rating,Average
Movie154,1.0,0.000206
Movie144,1.0,0.000206
Movie69,1.0,0.000206
Movie60,1.0,0.000206
Movie67,1.0,0.000206


### Recommendation Model: Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users. Divide the data into training and test data Build a recommendation model on training data Make predictions on the test data

In [48]:
import random 
#Train test split
random.seed(10)
trainset, testset=train_test_split(df_userID_index,test_size=.70 )
#Converting train data to Sparse data Frame
df_sparse_train=csr_matrix(trainset)


#### Functionising the whole process

In [53]:
#Fitting the model
knn_model=NearestNeighbors(metric="cosine", algorithm="brute")
knn_model.fit(df_sparse_train)
def movie_recommender(df_userID_index,user):        
    # Get location of the actual movie in the User-Items matrix
    user_ix =df_userID_index.index.get_loc(user)
    #Getting the distace and 
    distances1, indices1=knn_model.kneighbors(df_userID_index.iloc[user_ix,:].values.reshape(1, -1))
    indices2=np.array(indices1, dtype=np.int64).reshape(5,)
    # Obtain the mean ratings of those users for all movies
    rec_movies = df_userID_index.loc[indices2].mean(0).sort_values(ascending=False)
    # Discard already seen movies
    m_seen_movies = df_userID_index.loc[user].gt(0)
    seen_movies = m_seen_movies.index[m_seen_movies].tolist()# the movie seen by user will be stored
    rec_movies = rec_movies.drop(seen_movies).head(10)#now droping the seen movie and updating the mean rating
    print(f"User {user} since have rated {seen_movies} we recommend watching and rating \n {rec_movies.index.to_frame().reset_index(drop=True).rename(columns={0: ''})}")
    

In [56]:
#Applying the recommendtion on test set
user=np.random.choice(testset.index)#Since we don't know the users splitted to testset, we take random
movie_recommender(testset,user)

User 1222 since have rated ['Movie103'] we recommend watching and rating 
            
0   Movie16
1   Movie29
2   Movie91
3    Movie1
4  Movie132
5  Movie133
6  Movie134
7  Movie135
8  Movie136
9  Movie137
