# Recommendation System Notebook
- User based recommendation
- User based prediction & evaluation
- Item based recommendation
- Item based prediction & evaluation

Common approaches to building a recommendation system:

1. Demographic-based Recommendation System

2. Content-based Recommendation System

3. Collaborative filtering Recommendation System

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
import seaborn as sns

In [2]:
# Read the review dataset from a local CSV file.
# Commented-out alternative: MovieLens ratings hosted on GitHub.
# ratings = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/ratings_final.csv' , encoding='latin-1')
df = pd.read_csv('../Details/dataset/SentimentbasedRecoEngine/sample30.csv', index_col=None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object(14)
memory usage

In [3]:
# Take an explicit copy so the assignments below do not raise SettingWithCopyWarning.
ratings = df[['id', 'name', 'reviews_rating', 'reviews_date']].copy()

# Encode each column with its own LabelEncoder so neither fitted mapping is overwritten.
# Caution: `id` and `name` look like product fields; if so, `reviews_username`
# would be the natural user key. The original column choice is kept here.
ratings['id'] = LabelEncoder().fit_transform(ratings['id'])
ratings['name'] = LabelEncoder().fit_transform(ratings['name'])


In [4]:
ratings = ratings.rename(columns={'id':'userId', 'name':'movieId', 'reviews_rating':'rating'})
ratings.head()

Unnamed: 0,userId,movieId,rating,reviews_date
0,0,182,5,2012-11-30T06:21:45.000Z
1,1,140,5,2017-07-09T00:00:00.000Z
2,1,140,5,2017-07-09T00:00:00.000Z
3,2,120,1,2016-01-06T00:00:00.000Z
4,2,120,1,2016-12-21T00:00:00.000Z


In [5]:
ratings['movieId'].nunique()

271

## Dividing the dataset into train and test

In [6]:
# Test and Train split of the dataset.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings, test_size=0.30, random_state=42)

In [7]:
print(train.shape)
print(test.shape)

(21000, 4)
(9000, 4)


In [8]:
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [9]:
train.head()

Unnamed: 0,userId,movieId,rating,reviews_date
0,267,232,2,2011-03-31T00:00:00.000Z
1,196,183,5,2014-11-17T00:00:00.000Z
2,260,255,5,2013-05-24T00:00:00.000Z
3,93,65,5,2015-02-20T00:00:00.000Z
4,196,183,3,2014-12-07T00:00:00.000Z


In [10]:
# Pivot the training ratings into a matrix with user IDs as rows and movie IDs as columns.
df_pivot = train.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0) #.fillna(0)
df_pivot.head(3)

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0
1,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0
2,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0


### Creating dummy train & dummy test dataset
These datasets will be used for prediction and evaluation.
- Dummy train is used later to predict ratings for movies the user has not rated. To ignore movies the user has already rated, they are marked as 0; movies not rated by the user are marked as 1.

- Dummy test is used for evaluation. Here predictions are made only for movies the user has rated, so those are marked as 1 and all others as 0. This is the opposite of dummy_train.
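
The two masks described above can be sketched on a toy example. This is a minimal illustration, not the notebook's own cells: the frame `toy` and the names `ratings_matrix`, `train_mask` and `test_mask` are hypothetical, and only the column names are borrowed from `train`.

```python
import pandas as pd

# Hypothetical toy ratings; the column names mirror the notebook's `train` frame.
toy = pd.DataFrame({
    "userId":  [0, 0, 1, 2],
    "movieId": [10, 11, 10, 12],
    "rating":  [5, 3, 4, 1],
})

# Ratings matrix with NaN where a user has not rated a movie.
ratings_matrix = toy.pivot_table(index="userId", columns="movieId", values="rating")

# dummy_train-style mask: 1 for unrated movies (candidates), 0 for rated ones.
train_mask = ratings_matrix.isna().astype(int)

# dummy_test-style mask: 1 for rated movies (used for evaluation), 0 otherwise.
test_mask = ratings_matrix.notna().astype(int)

print(train_mask)
print(test_mask)
```

Every cell is 1 in exactly one of the two masks, which is what "opposite" means here.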

In [11]:
# Copy the train dataset into dummy_train
dummy_train = train.copy()

In [12]:
dummy_train.head()

Unnamed: 0,userId,movieId,rating,reviews_date
0,267,232,2,2011-03-31T00:00:00.000Z
1,196,183,5,2014-11-17T00:00:00.000Z
2,260,255,5,2013-05-24T00:00:00.000Z
3,93,65,5,2015-02-20T00:00:00.000Z
4,196,183,3,2014-12-07T00:00:00.000Z


In [13]:
# Mark rated movies as 0; unrated movies become 1 when the pivot is filled in the next cell.
dummy_train['rating'] = dummy_train['rating'].apply(lambda x: 0 if x>=1 else 1)

In [14]:
# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(1)

In [15]:
dummy_train.head()

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Cosine Similarity**

Cosine similarity measures how similar two vectors are (here, the rating vectors) as the cosine of the angle between them.
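
As a small worked example, the function below computes cosine similarity directly from its definition. The helper `cosine_similarity` and the vectors `a`, `b`, `c` are hypothetical names for illustration; the notebook itself relies on `1 - pairwise_distances(..., metric='cosine')`, which should give the same value for non-zero vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

a = [5, 3, 0, 1]
b = [4, 0, 0, 1]
c = [10, 6, 0, 2]  # same direction as `a`, twice the magnitude

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because only the direction of the vectors matters, `a` and `c` come out as (numerically) identical even though every rating in `c` is doubled.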

**Adjusted Cosine**

Adjusted cosine similarity is a modified version of vector-based similarity that accounts for the fact that different users have different rating schemes: some users tend to rate items highly in general, while others tend to give lower ratings. To correct for this, we subtract each user's average rating from that user's ratings for the different movies.
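
A minimal sketch of this mean-centering step, assuming unrated movies are kept as NaN so that each user's mean covers rated movies only. The frame `ratings` and the names `user_mean`, `centered` and `adjusted_cosine` are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie ratings; NaN marks a movie the user has not rated.
ratings = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [2.0, np.nan, 1.0],
     [np.nan, 3.0, 3.0]],
    index=pd.Index([0, 1, 2], name="userId"),
    columns=pd.Index([10, 11, 12], name="movieId"),
)

# Per-user mean over rated movies only (pandas skips NaN by default).
user_mean = ratings.mean(axis=1)

# Subtract each user's mean from that user's row; unrated cells stay NaN.
centered = ratings.sub(user_mean, axis=0)

# Fill NaN with 0 only now, so unrated movies contribute nothing to the dot product.
filled = centered.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for users whose ratings are all equal
unit = filled / norms
adjusted_cosine = unit @ unit.T

print(np.round(adjusted_cosine, 3))
```

User 2 rated everything 3, so the centered row is all zeros and the similarity to every user, including itself, is 0 in this sketch.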



# User Similarity Matrix

## Using Cosine Similarity

In [16]:
df_pivot.index.nunique()

253

In [17]:
from sklearn.metrics.pairwise import pairwise_distances

# Create the user similarity matrix with pairwise_distances (cosine similarity = 1 - cosine distance).
user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [18]:
user_correlation.shape

(253, 253)

## Using adjusted Cosine 

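### Building the user-movie matrix for mean-centering

The intent is to keep unrated entries as NaN so that each user's mean covers only the movies that user has rated. As written, the cell below passes `fill_value=0`, so unrated entries become 0 and are included in the mean; drop `fill_value=0` to obtain the rated-only mean.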

In [19]:
# Create a user-movie matrix.
df_pivot = train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
)

In [20]:
df_pivot.head()

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0
1,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0
2,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0
3,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,4.198413,0.0,0.0
4,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,...,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0


### Normalising each user's ratings around a mean of 0

In [21]:
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

In [22]:
df_subtracted.head()

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763
1,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763
2,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,...,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164,-0.010164
3,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,...,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,-0.016595,4.181818,-0.016595,-0.016595
4,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763


### Finding cosine similarity

In [23]:
from sklearn.metrics.pairwise import pairwise_distances

In [24]:
# Create the user similarity matrix from the mean-centered ratings with pairwise_distances.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[ 1.         -0.00396825 -0.00396825 ... -0.00396825 -0.00396825
  -0.00396825]
 [-0.00396825  1.         -0.00396825 ... -0.00396825 -0.00396825
  -0.00396825]
 [-0.00396825 -0.00396825  1.         ... -0.00396825 -0.00396825
  -0.00396825]
 ...
 [-0.00396825 -0.00396825 -0.00396825 ...  1.         -0.00396825
  -0.00396825]
 [-0.00396825 -0.00396825 -0.00396825 ... -0.00396825  1.
  -0.00396825]
 [-0.00396825 -0.00396825 -0.00396825 ... -0.00396825 -0.00396825
   1.        ]]


In [25]:
user_correlation.shape

(253, 253)

## Prediction - User User

Predictions use only the users that are positively correlated with the current user, since those are the users most similar to them. Correlations below 0 are therefore set to 0.

In [26]:
user_correlation[user_correlation<0]=0
user_correlation

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

The predicted rating for a user (for rated as well as unrated movies) is the sum of all users' ratings for that movie (as present in the rating dataset), weighted by their correlation with the user.

In [27]:
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [28]:
user_predicted_ratings.shape

(253, 253)

In [29]:
user_predicted_ratings

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Since we are interested only in movies the user has not rated, the predictions for movies the user has already rated are set to zero.

In [30]:
user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
user_final_rating.head()

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
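
The prediction and masking steps above can be traced on a toy example. This is a sketch under stated assumptions: the arrays `similarity` and `ratings` are made-up values, and 0 stands for "not rated", following the notebook's `fill_value=0` convention.

```python
import numpy as np

# Hypothetical 3-user similarity matrix (negative values already clipped to 0).
similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Hypothetical user-by-movie ratings, 0 meaning "not rated".
ratings = np.array([
    [5.0, 0.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Each predicted score is a similarity-weighted sum of all users' ratings.
predicted = similarity @ ratings

# Mask: 1 where the user has NOT rated the movie, 0 where they have.
unrated_mask = (ratings == 0).astype(float)

# Keep predictions only for unrated movies.
final = predicted * unrated_mask

print(final)
```

The scores are unnormalised weighted sums, so they are useful for ranking candidates but are not on the original 1 to 5 rating scale.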


### Finding the top 5 recommendations for the *user*

In [52]:
# Take the user ID as input.
user_input = int(input("Enter your user name"))
print(user_input)

Enter your user name2
2


In [53]:
user_final_rating.head(7)

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
d = user_final_rating.loc[user_input].sort_values(ascending=False).head(5)
d

movieId
0      0.0
186    0.0
172    0.0
173    0.0
174    0.0
Name: 2, dtype: float64
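
For reference, the top-N selection can also be written with `Series.nlargest`. The sketch below uses a hypothetical `scores` Series with made-up values. In the output above all five scores are 0.0, so the order there reflects tie-breaking rather than any preference.

```python
import pandas as pd

# Hypothetical final scores for one user, indexed by movieId.
scores = pd.Series(
    [0.0, 1.5, 0.4, 2.2, 0.9, 3.1, 0.0],
    index=pd.Index([10, 11, 12, 13, 14, 15, 16], name="movieId"),
    name="score",
)

# nlargest sorts by value (descending) and keeps the first n entries in one step.
top5 = scores.nlargest(5)

# reset_index exposes movieId as a column, ready to merge with a title lookup table.
top5_df = top5.reset_index()

print(top5_df)
```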

In [55]:
# Map movie IDs to titles and genres.
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')
movie_mapping.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [56]:
d = pd.merge(d,movie_mapping,left_on='movieId',right_on='movieId', how = 'left')
d.head()

Unnamed: 0,movieId,2,title,genres
0,0,0.0,,
1,186,0.0,Nine Months (1995),Comedy|Romance
2,172,0.0,Johnny Mnemonic (1995),Action|Sci-Fi|Thriller
3,173,0.0,Judge Dredd (1995),Action|Crime|Sci-Fi
4,174,0.0,Jury Duty (1995),Comedy


# Evaluation - User User 

Evaluation follows the same steps as the prediction above. The only difference is that we evaluate on movies the user has already rated instead of predicting for movies the user has not rated.

In [57]:
# Keep only the test rows whose user also appears in the training set.
common = test[test.userId.isin(train.userId)]
common.shape

(8980, 4)

In [58]:
common.head()

Unnamed: 0,userId,movieId,rating,reviews_date
0,37,151,5,2016-12-14T00:00:00.000Z
1,182,63,1,2016-08-30T00:00:00.000Z
2,187,93,4,2015-06-24T00:00:00.000Z
3,187,93,5,2014-09-20T00:00:00.000Z
4,42,157,4,2016-07-22T00:00:00.000Z


In [59]:
# convert into the user-movie matrix.
common_user_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

In [60]:
common_user_based_matrix.head()

movieId,0,1,3,5,7,9,10,12,13,14,...,256,257,258,259,263,264,266,268,269,270
userId
2,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
3,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,3.947917,0.0,0.0
5,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
6,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
7,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0


In [61]:
# Convert the user_correlation matrix into dataframe.
user_correlation_df = pd.DataFrame(user_correlation)

In [62]:
user_correlation_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,243,244,245,246,247,248,249,250,251,252
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
df_subtracted.head(1)

movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
userId
0,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763


In [64]:
# df_subtracted.head()

In [65]:
user_correlation_df['userId'] = df_subtracted.index

user_correlation_df.set_index('userId',inplace=True)
user_correlation_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,243,244,245,246,247,248,249,250,251,252
userId
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
common.head(1)

Unnamed: 0,userId,movieId,rating,reviews_date
0,37,151,5,2016-12-14T00:00:00.000Z


In [67]:
list_name = common.userId.tolist()

user_correlation_df.columns = df_subtracted.index.tolist()


user_correlation_df_1 =  user_correlation_df[user_correlation_df.index.isin(list_name)]

In [68]:
user_correlation_df_1.shape

(196, 253)

In [69]:
user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]

In [70]:
user_correlation_df_3 = user_correlation_df_2.T

In [71]:
user_correlation_df_3.head()

Unnamed: 0_level_0,2,3,5,6,7,8,9,11,12,15,...,258,259,260,261,262,263,266,267,269,270
userId
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
user_correlation_df_3.shape

(196, 196)

In [73]:
user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 3.94791667, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [74]:
dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

In [75]:
dummy_test.shape

(196, 196)

In [76]:
common_user_based_matrix.head()

movieId,0,1,3,5,7,9,10,12,13,14,...,256,257,258,259,263,264,266,268,269,270
userId
2,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
3,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,3.947917,0.0,0.0
5,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
6,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0
7,0,0.0,0,0,0.0,0.0,0,0,0,0.0,...,0.0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0.0


In [77]:
dummy_test.head()

movieId,0,1,3,5,7,9,10,12,13,14,...,256,257,258,259,263,264,266,268,269,270
userId
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)

In [79]:
common_user_predicted_ratings.head()

movieId,0,1,3,5,7,9,10,12,13,14,...,256,257,258,259,263,264,266,268,269,270
userId
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.947917,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The RMSE is calculated only over the movies the user has rated. Before computing it, the predicted ratings are rescaled to the range 1 to 5.

In [80]:
from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_user_predicted_ratings.copy() 
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

MinMaxScaler(feature_range=(1, 5))
[[nan nan nan ... nan nan nan]
 [nan nan nan ...  1. nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [81]:
common_ = common.pivot_table(index='userId', columns='movieId', values='rating')

In [82]:
# Count the non-NaN predictions.
total_non_nan = np.count_nonzero(~np.isnan(y))

In [83]:
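# Sum the squared errors with pandas (NaN entries are skipped), so the result
# does not depend on `sum` being shadowed by `from numpy import *`.
rmse = (((common_ - y) ** 2).sum().sum() / total_non_nan) ** 0.5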
print(rmse)

3.478557242196848
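
To make the metric concrete, here is a minimal RMSE sketch restricted to rated entries. The arrays `actual` and `predicted` are hypothetical, and the predictions are assumed to be on the 1 to 5 scale already.

```python
import numpy as np

# Hypothetical actual ratings; NaN marks movies the user did not rate in the test set.
actual = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 4.0, np.nan],
])

# Hypothetical predictions, assumed to be on the same 1-5 scale.
predicted = np.array([
    [4.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
])

# Only positions with an actual rating take part in the error.
rated = ~np.isnan(actual)
squared_error = (actual[rated] - predicted[rated]) ** 2

# Root of the mean squared error over the rated positions.
rmse = np.sqrt(squared_error.mean())

print(rmse)
```

Three positions are rated, with squared errors 1, 0 and 4, so the result is the square root of 5/3.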


## Using Item similarity

# Item Based Similarity

We take the transpose of the rating matrix so that ratings are normalised around the mean of each movie ID. In the user-based similarity, the mean was taken for each user instead of each movie.

In [84]:
df_pivot = train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
).T

df_pivot.head()

userId,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,265,266,267,269,270
movieId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Normalising the ratings of each movie for use with adjusted cosine similarity

In [85]:
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

In [86]:
df_subtracted.head()

userId,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,265,266,267,269,270
movieId
0,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,...,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798,-0.016798
1,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,...,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474,-0.018474
2,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763
3,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,...,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445,-0.018445
4,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,...,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763,-0.019763


Finding the cosine similarity using pairwise distances approach

In [87]:
from sklearn.metrics.pairwise import pairwise_distances

In [88]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

[[ 1.         -0.00396825 -0.00396825 ... -0.00396825 -0.00396825
  -0.00396825]
 [-0.00396825  1.         -0.00396825 ... -0.00396825 -0.00396825
  -0.00396825]
 [-0.00396825 -0.00396825  1.         ... -0.00396825 -0.00396825
  -0.00396825]
 ...
 [-0.00396825 -0.00396825 -0.00396825 ...  1.         -0.00396825
  -0.00396825]
 [-0.00396825 -0.00396825 -0.00396825 ... -0.00396825  1.
  -0.00396825]
 [-0.00396825 -0.00396825 -0.00396825 ... -0.00396825 -0.00396825
   1.        ]]


In [89]:
item_correlation.shape

(253, 253)

Keeping only the positive correlations: values below 0 are set to 0.

In [90]:
item_correlation[item_correlation<0]=0
item_correlation

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

# Prediction - Item Item

In [91]:
item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
item_predicted_ratings

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [92]:
item_predicted_ratings.shape

(253, 253)

In [93]:
dummy_train.shape

(253, 253)
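
The item-based prediction is the same weighted sum as in the user-based case, but with the similarity matrix on the right-hand side. A toy sketch with made-up values follows; `ratings` and `item_similarity` are hypothetical names.

```python
import numpy as np

# Hypothetical user-by-movie ratings (2 users, 3 movies); 0 means "not rated".
ratings = np.array([
    [5.0, 0.0, 0.0],
    [0.0, 4.0, 2.0],
])

# Hypothetical movie-by-movie similarity, negative values already clipped to 0.
item_similarity = np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.3],
    [0.0, 0.3, 1.0],
])

# (users x movies) @ (movies x movies) -> (users x movies)
item_predicted = ratings @ item_similarity

print(item_predicted)
```

The result has one row per user and one column per movie, so it can be multiplied element-wise by the same unrated-movie mask used in the user-based section.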

### Keeping the predicted ratings only for movies the user has not rated

In [94]:
item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
item_final_rating.head()

userId \ movieId,0,1,2,3,4,5,6,7,9,10,...,259,260,262,263,264,266,267,268,269,270
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
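The masking step can be sketched on a tiny example. This assumes, as the heading suggests, that `dummy_train` holds 1 where the user has not rated a movie and 0 where they have; the `toy_*` frames below are made up for illustration.

```python
import numpy as np
import pandas as pd

users = pd.Index([0, 1], name='userId')
movies = pd.Index([10, 11, 12], name='movieId')

# Hypothetical observed ratings (NaN = not rated) and hypothetical predicted scores.
toy_ratings = pd.DataFrame(
    [[5.0, np.nan, 3.0], [np.nan, 4.0, np.nan]], index=users, columns=movies
)
toy_predicted = pd.DataFrame(
    [[4.2, 3.1, 2.5], [1.0, 3.9, 2.2]], index=users, columns=movies
)

# 1 where the user has NOT rated the movie, 0 where they have.
toy_dummy_train = toy_ratings.isna().astype(int)

# Element-wise product zeroes out every movie the user has already rated.
toy_final = np.multiply(toy_predicted, toy_dummy_train)
print(toy_final)
```

After masking, user 0 keeps a score only for movie 11, the one movie they have not rated.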


### Finding the top 5 recommendations for the *user*



In [95]:
# Take the user ID as input
user_input = int(input("Enter your user ID: "))
print(user_input)

Enter your user ID: 5
5


In [96]:
# Recommending the Top 5 products to the user.
d = item_final_rating.loc[user_input].sort_values(ascending=False)[0:5]
d

movieId
0      0.0
186    0.0
172    0.0
173    0.0
174    0.0
Name: 5, dtype: float64
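In the output above every score is 0.0, so the five rows returned are decided by tie order rather than by preference. A minimal sketch of one way to guard against that is to drop zero-score items before taking the top N; the Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical final scores for one user (index = movieId).
toy_user_scores = pd.Series({10: 0.0, 11: 3.1, 12: 0.0, 13: 4.7, 14: 1.2}, name=5)
toy_user_scores.index.name = 'movieId'

# Keep only items with a positive score, then take the N largest.
top_n = toy_user_scores[toy_user_scores > 0].nlargest(2)
print(top_n)
```

If the filtered Series is empty, there is nothing meaningful to recommend for that user with this model, which is easier to detect than a list of zero-score items.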

In [97]:
# Map movie IDs to titles and genres
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv', encoding='latin-1')

In [98]:
d = pd.merge(d, movie_mapping, on='movieId', how='left')
d.head()

,movieId,5,title,genres
0,0,0.0,,
1,186,0.0,Nine Months (1995),Comedy|Romance
2,172,0.0,Johnny Mnemonic (1995),Action|Sci-Fi|Thriller
3,173,0.0,Judge Dredd (1995),Action|Crime|Sci-Fi
4,174,0.0,Jury Duty (1995),Comedy
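Row 0 above has no title, which means that ID found no match in the mapping file. A minimal sketch of how a left merge with `indicator=True` can surface such unmatched IDs is shown here; the two small frames are made up, with titles copied from the output above.

```python
import pandas as pd

# Hypothetical top-N result and a hypothetical slice of the mapping table.
toy_top = pd.DataFrame({'movieId': [0, 186, 172], 'score': [0.0, 0.0, 0.0]})
toy_mapping = pd.DataFrame({
    'movieId': [172, 186],
    'title': ['Johnny Mnemonic (1995)', 'Nine Months (1995)'],
})

# A left merge keeps every recommended ID; '_merge' records whether a match was found.
merged = toy_top.merge(toy_mapping, on='movieId', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'movieId'].tolist()

print(merged)
print(unmatched)
```

If the `movieId` values here are label-encoded product IDs, as the earlier cells suggest, it is also worth verifying that the titles in the mapping file describe the same items.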


In [99]:
train_new = pd.merge(train, movie_mapping, on='movieId', how='left')
train_new[train_new.userId == 1].head()

,userId,movieId,rating,reviews_date,title,genres
10592,1,140,5,2017-07-09T00:00:00.000Z,Up Close and Personal (1996),Drama|Romance
13308,1,140,5,2017-07-09T00:00:00.000Z,Up Close and Personal (1996),Drama|Romance


# Evaluation - Item Item

Evaluation follows the same steps as the prediction above. The only difference is that we now score the movies the user has already rated, instead of predicting for the movies the user has not rated.

In [100]:
test.columns

Index(['userId', 'movieId', 'rating', 'reviews_date'], dtype='object')

In [101]:
common = test[test.movieId.isin(train.movieId)]
common.shape

(8980, 4)

In [102]:
common.head(4)

,userId,movieId,rating,reviews_date
0,37,151,5,2016-12-14T00:00:00.000Z
1,182,63,1,2016-08-30T00:00:00.000Z
2,187,93,4,2015-06-24T00:00:00.000Z
3,187,93,5,2014-09-20T00:00:00.000Z


In [103]:
common_item_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [104]:
common_item_based_matrix.shape

(196, 196)

In [105]:
item_correlation_df = pd.DataFrame(item_correlation)

In [106]:
item_correlation_df.head(1)

,0,1,2,3,4,5,6,7,8,9,...,243,244,245,246,247,248,249,250,251,252
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [107]:
item_correlation_df['movieId'] = df_subtracted.index
item_correlation_df.set_index('movieId',inplace=True)
item_correlation_df.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,243,244,245,246,247,248,249,250,251,252
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [108]:
list_name = common.movieId.tolist()

In [109]:
item_correlation_df.columns = df_subtracted.index.tolist()

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]

In [110]:
item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]

item_correlation_df_3 = item_correlation_df_2.T

In [111]:
item_correlation_df_3.head()

movieId,0,1,3,5,7,9,10,12,13,14,...,256,257,258,259,263,264,266,268,269,270
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
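The two `isin` filters and transposes above cut the similarity matrix down to the items that also appear in the test set. A minimal sketch of the same selection done in one step with label-based indexing follows; `toy_sim` and `toy_test_ids` are made up for illustration.

```python
import pandas as pd

# Hypothetical item-item similarity, labelled by movieId on both axes.
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.5],
     [0.0, 0.5, 1.0]],
    index=[0, 1, 3], columns=[0, 1, 3],
)
# IDs seen in the test set; may contain duplicates and IDs missing from train.
toy_test_ids = [3, 0, 3, 7]

# Keep the labels present in both, in the similarity matrix's own order.
keep = toy_sim.index[toy_sim.index.isin(toy_test_ids)]

# Select the same labels on rows and columns to get a square sub-matrix.
toy_sub = toy_sim.loc[keep, keep]
print(toy_sub)
```

Selecting rows and columns with the same label list keeps the sub-matrix square and aligned, which the later dot product with the pivoted test ratings depends on.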


In [112]:
item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
common_item_predicted_ratings


array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 3.94791667, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [113]:
common_item_predicted_ratings.shape

(196, 196)

dummy_test is used for evaluation. Here we only score predictions for the movies the user has already rated, so those cells are marked 1 and all others 0. This is the opposite of dummy_train.



In [114]:
dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)

Products that are not rated are marked as 0 for evaluation. Next, build the matrix of actual ratings in the same item-based layout.


In [115]:
common_ = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [116]:
from sklearn.preprocessing import MinMaxScaler

# Keep only positive predictions; all other cells become NaN
X = common_item_predicted_ratings.copy()
X = X[X > 0]

# Rescale the predictions to the 1-5 rating range (each column is scaled independently)
scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = scaler.transform(X)

print(y)

MinMaxScaler(feature_range=(1, 5))
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan  1. nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [117]:
# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))

In [118]:
# RMSE over the scored cells; NaN cells are skipped explicitly with np.nansum
squared_error = ((common_ - y) ** 2).to_numpy()
rmse = (np.nansum(squared_error) / total_non_nan) ** 0.5
print(rmse)

3.4785572421968474
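The evaluation rescales the raw scores to the 1-5 range and then computes RMSE = sqrt(mean((actual - predicted)^2)) over the cells that have a prediction. A minimal sketch of both steps on made-up numbers follows (`raw` and `actual` are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw scores (rows = items, columns = users); NaN = nothing to evaluate.
raw = np.array([
    [0.5, np.nan],
    [2.5, 3.0],
    [1.5, np.nan],
])
# Hypothetical actual ratings for the same cells.
actual = np.array([
    [2.0, np.nan],
    [5.0, 4.0],
    [3.0, np.nan],
])

# MinMaxScaler works column by column; NaN is ignored in fit and kept in transform.
scaled = MinMaxScaler(feature_range=(1, 5)).fit_transform(raw)

# RMSE over the cells where both a prediction and an actual rating exist.
scored = ~np.isnan(scaled) & ~np.isnan(actual)
rmse = np.sqrt(np.mean((actual[scored] - scaled[scored]) ** 2))

print(scaled)
print(rmse)
```

Note the second column: it holds a single score, so its minimum equals its maximum and the scaler maps it to 1 regardless of its size. That may be what happens to the 3.94791667 prediction that shows up as `1.` in the scaled output above, and with sparse data this effect could inflate the RMSE.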


## Summary - Recommendation Engine