# Recommendation System Notebook
- User based recommendation
- User based prediction & evaluation
- Item based recommendation
- Item based prediction & evaluation

Different Approaches to Developing a Recommendation System:

1. Demographic Based Recommendation System

2. Content Based Recommendation System

3. Collaborative filtering Recommendation System

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
import seaborn as sns

In [2]:
# Read the ratings data. The commented-out lines are earlier sources (a MovieLens ratings file on GitHub and a local sample file).
# ratings = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/ratings_final.csv' , encoding='latin-1')
# df = pd.read_csv('../Details/dataset/SentimentbasedRecoEngine/sample30.csv', index_col=None)
df = pd.read_csv('final_data.csv', index_col=None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29936 entries, 0 to 29935
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   29936 non-null  object
 1   brand                29936 non-null  object
 2   categories           29936 non-null  object
 3   manufacturer         29936 non-null  object
 4   name                 29936 non-null  object
 5   reviews_date         29896 non-null  object
 6   reviews_text         29936 non-null  object
 7   reviews_title        29936 non-null  object
 8   reviews_doRecommend  27395 non-null  object
 9   reviews_rating       29936 non-null  int64 
 10  reviews_username     29936 non-null  object
 11  user_sentiment       29936 non-null  object
 12  proc_reviews_text    29936 non-null  object
dtypes: int64(1), object(12)
memory usage: 3.0+ MB


In [3]:
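# Keep only the columns needed for collaborative filtering.
ratings = df[['reviews_username', 'name', 'reviews_rating', 'reviews_date']]
# Rename to the userId / movieId / rating convention used below (movieId holds the product name here).
ratings = ratings.rename(columns={'reviews_username':'userId', 'name':'movieId', 'reviews_rating':'rating'})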

In [4]:
# The same encoder is refit for each column, so after this cell `le` can only decode movieId.
le = LabelEncoder()
ratings['userId'] = le.fit_transform(ratings['userId'])
ratings['movieId'] = le.fit_transform(ratings['movieId'])
ratings.head()

,userId,movieId,rating,reviews_date
0,11440,182,5,2012-11-30T06:21:45.000Z
1,6974,140,5,2017-07-09T00:00:00.000Z
2,6974,140,5,2017-07-09T00:00:00.000Z
3,19327,120,1,2016-01-06T00:00:00.000Z
4,24205,120,1,2016-12-21T00:00:00.000Z


In [5]:
ratings['movieId'].nunique()

271

## Dividing the dataset into train and test

In [6]:
# Test and Train split of the dataset.
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings, test_size=0.30, random_state=42)

In [7]:
print(train.shape)
print(test.shape)

(20955, 4)
(8981, 4)


In [8]:
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [9]:
train.head()

,userId,movieId,rating,reviews_date
0,21687,244,5,2015-11-16T00:00:00.000Z
1,3088,268,5,2017-06-16T00:00:00.000Z
2,4979,65,5,2014-12-05T00:00:00.000Z
3,24874,93,5,2014-10-18T00:00:00.000Z
4,16068,147,5,2014-06-30T11:47:00Z


In [10]:
# Pivot the train ratings' dataset into matrix format in which columns are movies and the rows are user IDs.
df_pivot = train.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0) #.fillna(0)
df_pivot.head(3)

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Creating dummy train & dummy test dataset
These datasets will be used for prediction and evaluation.
- Dummy train will be used later to predict ratings for the movies that the user has not rated. To ignore the movies the user has already rated, they are marked as 0 during prediction. The movies not rated by the user are marked as 1 in the dummy train dataset.

- Dummy test will be used for evaluation. To evaluate, we only make predictions for the movies the user has rated, so these are marked as 1. This is the opposite of dummy_train.
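
The two masks described above can be sketched on a toy table. The data here is made up for illustration; only the column names mirror the notebook, and `train_mask` / `test_mask` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy ratings: three users, three items.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 11, 10, 12],
    "rating":  [5, 3, 4, 1],
})

# After the pivot, rated cells hold a value and unrated cells are NaN.
rated = toy.pivot_table(index="userId", columns="movieId", values="rating").notna()

# dummy_train-style mask: 1 where the user has NOT rated the item.
train_mask = (~rated).astype(int)
# dummy_test-style mask: 1 where the user HAS rated the item.
test_mask = rated.astype(int)

print(train_mask)
print(test_mask)
```

Multiplying a matrix of predicted scores by `train_mask` leaves only candidates for recommendation, while multiplying by `test_mask` leaves only the entries that can be compared with known ratings.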

In [11]:
# Copy the train dataset into dummy_train
dummy_train = train.copy()

In [12]:
dummy_train.head()

,userId,movieId,rating,reviews_date
0,21687,244,5,2015-11-16T00:00:00.000Z
1,3088,268,5,2017-06-16T00:00:00.000Z
2,4979,65,5,2014-12-05T00:00:00.000Z
3,24874,93,5,2014-10-18T00:00:00.000Z
4,16068,147,5,2014-06-30T11:47:00Z


In [13]:
# Rated movies become 0 here; unrated movies are filled with 1 after the pivot in the next cell.
dummy_train['rating'] = dummy_train['rating'].apply(lambda x: 0 if x>=1 else 1)

In [14]:
# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(1)

In [15]:
dummy_train.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Cosine Similarity**

Cosine similarity measures how similar two vectors are using the cosine of the angle between them (here, the vectors are rating vectors).

**Adjusted Cosine**

Adjusted cosine similarity is a modified version of vector-based similarity that accounts for the fact that different users have different rating schemes. In other words, some users rate items highly in general, while others tend to give lower ratings. To handle this, we subtract each user's average rating from each of that user's ratings for different movies.
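
As a small illustration of the difference, the sketch below compares plain and adjusted cosine on made-up rating vectors. The helper `cosine_similarity` and the three example users are assumptions for this example, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings by three users for the same four items.
generous = np.array([5.0, 4.0, 5.0, 4.0])   # rates everything highly
strict = np.array([2.0, 1.0, 2.0, 1.0])     # same preferences, lower scale
opposite = np.array([4.0, 5.0, 4.0, 5.0])   # prefers the items 'generous' likes less

def adjusted(a, b):
    """Adjusted cosine: subtract each user's own mean rating first."""
    return cosine_similarity(a - a.mean(), b - b.mean())

plain_same = cosine_similarity(generous, strict)
plain_opposite = cosine_similarity(generous, opposite)
adj_same = adjusted(generous, strict)
adj_opposite = adjusted(generous, opposite)

print(plain_same, plain_opposite)
print(adj_same, adj_opposite)
```

Plain cosine rates both pairs as highly similar because all ratings are positive. After mean-centring, the pair with the same preferences stays close to 1 while the pair with opposite preferences becomes negative.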



# User Similarity Matrix

## Using Cosine Similarity

In [16]:
df_pivot.index.nunique()

18273

In [17]:
from sklearn.metrics.pairwise import pairwise_distances

# Creating the User Similarity Matrix using the pairwise_distances function (cosine similarity = 1 - cosine distance).
user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[1.        0.        0.        ... 0.        0.9486833 0.       ]
 [0.        1.        1.        ... 0.        0.        0.       ]
 [0.        1.        1.        ... 0.        0.        0.       ]
 ...
 [0.        0.        0.        ... 1.        0.        1.       ]
 [0.9486833 0.        0.        ... 0.        1.        0.       ]
 [0.        0.        0.        ... 1.        0.        1.       ]]


In [18]:
user_correlation.shape

(18273, 18273)

## Using adjusted Cosine 

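### Here, the aim is to keep the NaN values and calculate the mean only over the movies rated by the user

Note that the pivot below passes `fill_value=0`, so unrated movies enter the mean as zeros rather than being skipped.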

In [19]:
# Create a user-movie matrix.
df_pivot = train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
)

In [20]:
df_pivot.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Normalising each user's movie ratings around a mean of 0

In [21]:
# Mean rating per user (row-wise), subtracted from every entry in that user's row.
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

In [22]:
df_subtracted.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,...,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625
2,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,...,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719,-0.011719
3,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,...,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531
4,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,...,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625
5,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,...,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531,-0.019531
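
The offsets shown above (for example 0.019531 = 5/256) are averages over all 256 columns, because unrated entries were filled with 0 before the mean was taken. For comparison, the following is a minimal sketch of centring on the mean of rated items only. The toy data and the names `pivot`, `user_mean`, and `centred` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy ratings; column names mirror the notebook.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "movieId": [10, 11, 10, 12],
    "rating":  [5, 3, 4, 2],
})

# No fill_value here, so unrated cells stay NaN.
pivot = toy.pivot_table(index="userId", columns="movieId", values="rating")

# Row means over rated items only (pandas skips NaN by default).
user_mean = pivot.mean(axis=1)

# Centre each user's ratings on that mean, then treat unrated cells as neutral (0).
centred = pivot.sub(user_mean, axis=0).fillna(0)
print(centred)
```

With this variant an unrated item contributes nothing to the similarity, instead of appearing as a small negative deviation.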


### Finding cosine similarity

In [23]:
from sklearn.metrics.pairwise import pairwise_distances

In [24]:
# Creating the User Similarity Matrix using the pairwise_distances function on the mean-centred ratings.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

[[ 1.         -0.00496628 -0.00496628 ... -0.00496628  0.9485598
  -0.00496628]
 [-0.00496628  1.          1.         ... -0.00392157 -0.00392157
  -0.00392157]
 [-0.00496628  1.          1.         ... -0.00392157 -0.00392157
  -0.00392157]
 ...
 [-0.00496628 -0.00392157 -0.00392157 ...  1.         -0.00392157
   1.        ]
 [ 0.9485598  -0.00392157 -0.00392157 ... -0.00392157  1.
  -0.00392157]
 [-0.00496628 -0.00392157 -0.00392157 ...  1.         -0.00392157
   1.        ]]


In [25]:
user_correlation.shape

(18273, 18273)

## Prediction - User User

We make predictions using only the users that are positively correlated with the current user, since we are interested in the users who are most similar to them. Correlation values less than 0 are therefore set to 0.

In [26]:
user_correlation[user_correlation<0]=0
user_correlation

array([[1.       , 0.       , 0.       , ..., 0.       , 0.9485598,
        0.       ],
       [0.       , 1.       , 1.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 1.       , 1.       , ..., 0.       , 0.       ,
        0.       ],
       ...,
       [0.       , 0.       , 0.       , ..., 1.       , 0.       ,
        1.       ],
       [0.9485598, 0.       , 0.       , ..., 0.       , 1.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 1.       , 0.       ,
        1.       ]])

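The predicted rating for a user (for movies both rated and not rated) is the similarity-weighted sum of all users' ratings for that movie (as present in the rating dataset). The sum is not divided by the total similarity, so the values act as ranking scores and can exceed the 1 to 5 rating scale.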

In [27]:
user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings

array([[ 0.        ,  2.74768866,  0.        , ..., 10.10013917,
         2.82816461,  1.44650707],
       [ 0.        , 24.86721747,  0.        , ..., 16.04306553,
         0.        ,  0.        ],
       [ 0.        , 24.86721747,  0.        , ..., 16.04306553,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  8.81492841,  0.        , ..., 10.69510988,
         3.52859468,  0.        ],
       [ 0.        ,  0.79212093,  0.        , ...,  8.12356709,
         2.98503419,  1.31204049],
       [ 0.        ,  8.81492841,  0.        , ..., 10.69510988,
         3.52859468,  0.        ]])

In [28]:
user_predicted_ratings.shape

(18273, 256)
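
To make the weighted sum concrete, here is a minimal sketch on made-up numbers. The similarity and rating matrices are hypothetical, and the normalisation step is an optional variant that the notebook does not apply.

```python
import numpy as np

# Hypothetical similarity between 3 users (diagonal = self-similarity).
similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Hypothetical ratings, 3 users x 2 items; 0 means "not rated".
ratings = np.array([
    [5.0, 0.0],
    [4.0, 2.0],
    [0.0, 1.0],
])

# Unnormalised score, as in the cell above: sum of similarity * rating.
raw_scores = similarity @ ratings

# Optional variant: divide by the total similarity of the users who rated the item.
rated = (ratings > 0).astype(float)
weights = similarity @ rated
normalised = np.divide(
    raw_scores, weights, out=np.zeros_like(raw_scores), where=weights > 0
)

print(raw_scores)
print(normalised)
```

The raw scores grow with the number of similar users who rated an item, which is fine for ranking. The normalised variant stays within the original rating range, which matters if the values are to be read as ratings.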

Since we are interested only in the movies not rated by the user, we ignore the movies the user has already rated by setting their predicted rating to zero.

In [30]:
user_final_rating = np.multiply(user_predicted_ratings, dummy_train)
user_final_rating.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,0.0,2.747689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.654512,...,2.322172,0.0,0.0,0.0,1.939874,0.0,0.0,10.100139,2.828165,1.446507
2,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
3,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
4,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
5,0.0,8.814928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.887445,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.69511,3.528595,0.0


### Finding the top 5 recommendations for the *user*

In [31]:
# Take the user ID as input.
user_input = int(input("Enter your user ID: "))
print(user_input)

Enter your user ID: 5
5


In [32]:
user_final_rating.head(7)

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,0.0,2.747689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.654512,...,2.322172,0.0,0.0,0.0,1.939874,0.0,0.0,10.100139,2.828165,1.446507
2,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
3,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
4,0.0,24.867217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.498594,...,0.0,0.0,0.0,0.0,1.48137,0.0,0.0,16.043066,0.0,0.0
5,0.0,8.814928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.887445,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.69511,3.528595,0.0
6,0.0,8.814928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.887445,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.69511,3.528595,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.87326,0.0,0.0


In [33]:
d = user_final_rating.loc[user_input].sort_values(ascending=False)[0:5]
d

movieId
64     1457.064023
93       40.668291
151      36.009613
157      33.486121
183      29.925775
Name: 5, dtype: float64

In [34]:
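# Mapping with Movie Title / Genres.
# Caution: movieId in this notebook is a LabelEncoder code for product names from final_data.csv,
# not a MovieLens ID, so the titles joined below do not describe the recommended products.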
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv')
movie_mapping.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [35]:
d = pd.merge(d,movie_mapping,left_on='movieId',right_on='movieId', how = 'left')
d.head()

,movieId,5,title,genres
0,64,1457.064023,Two if by Sea (1996),Comedy|Romance
1,93,40.668291,Vampire in Brooklyn (1995),Comedy|Horror|Romance
2,151,36.009613,Rob Roy (1995),Action|Drama|Romance|War
3,157,33.486121,Canadian Bacon (1995),Comedy|War
4,183,29.925775,Mute Witness (1994),Comedy|Horror|Thriller
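
Because `movieId` in this notebook is a label-encoded product name, the integer IDs can be turned back into names with the encoder's `inverse_transform`. The following is a minimal sketch, assuming a dedicated encoder for the item column was kept (called `item_encoder` here, a hypothetical name). The product names and scores are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical product names standing in for the `name` column.
names = pd.Series(["Toaster", "Blender", "Toaster", "Kettle"])

# A dedicated encoder for items, kept so the codes can be decoded later.
item_encoder = LabelEncoder()
codes = item_encoder.fit_transform(names)

# Suppose these encoded IDs came out of the recommender, with made-up scores.
top = pd.Series({2: 0.9, 0: 0.4}, name="score")
top.index.name = "movieId"

# Decode the integer IDs back to the original product names.
decoded = pd.DataFrame({
    "movieId": top.index,
    "score": top.to_numpy(),
    "name": item_encoder.inverse_transform(top.index.to_numpy()),
})
print(decoded)
```

`LabelEncoder` assigns codes in sorted order of the labels, so decoding only works with the encoder that was fitted on the item column, not one refitted on another column afterwards.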


# Evaluation - User User 

Evaluation follows the same steps as the prediction above. The only difference is that you evaluate on the movies already rated by the user instead of predicting for the movies not rated by the user.

In [38]:
# Find out the common users of test and train dataset.
common = test[test.userId.isin(train.userId)]
common.shape

(2049, 4)

In [39]:
common.head()

,userId,movieId,rating,reviews_date
4,10780,65,5,2012-01-25T00:00:00.000Z
8,19813,166,1,2016-01-22T03:40:47.000Z
10,5464,93,3,2014-09-18T00:00:00.000Z
12,7832,166,1,2015-07-28T00:00:00.000Z
13,15891,183,5,2014-11-07T00:00:00.000Z


In [40]:
# convert into the user-movie matrix.
common_user_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)

In [41]:
common_user_based_matrix.head()

movieId,0,1,9,10,14,15,16,17,19,20,...,253,254,255,256,257,258,260,263,264,268
userId
15,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
20,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
44,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
92,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
# Convert the user_correlation matrix into dataframe.
user_correlation_df = pd.DataFrame(user_correlation)

In [43]:
user_correlation_df.head()

,0,1,2,3,4,5,6,7,8,9,...,18263,18264,18265,18266,18267,18268,18269,18270,18271,18272
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.94856,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.94856,0.0
1,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0


In [44]:
df_subtracted.head(1)

movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
userId
1,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,...,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625,-0.015625


In [45]:
# df_subtracted.head()

In [46]:
user_correlation_df['userId'] = df_subtracted.index

user_correlation_df.set_index('userId',inplace=True)
user_correlation_df.head()

,0,1,2,3,4,5,6,7,8,9,...,18263,18264,18265,18266,18267,18268,18269,18270,18271,18272
userId
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.94856,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.94856,0.0
2,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0


In [47]:
common.head(1)

,userId,movieId,rating,reviews_date
4,10780,65,5,2012-01-25T00:00:00.000Z


In [48]:
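# Users that appear in both train and test.
list_name = common.userId.tolist()

# Label the columns with user IDs so rows and columns share the same labels.
user_correlation_df.columns = df_subtracted.index.tolist()

# Keep only the rows for the common users.
user_correlation_df_1 = user_correlation_df[user_correlation_df.index.isin(list_name)]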

In [49]:
user_correlation_df_1.shape

(1694, 18273)

In [50]:
user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]

In [51]:
user_correlation_df_3 = user_correlation_df_2.T

In [52]:
user_correlation_df_3.head()

,15,17,20,44,92,166,246,277,281,283,...,24576,24609,24703,24776,24801,24804,24811,24836,24870,24909
userId
15,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.814888,...,0.0,0.437357,0.0,0.0,0.0,0.705719,0.0,0.0,0.496063,0.0
17,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
20,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
user_correlation_df_3.shape

(1694, 1694)

In [54]:
user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

array([[ 0.        ,  7.69838309,  2.48031496, ...,  1.52456472,
         0.        , 10.45862108],
       [ 0.        ,  0.        ,  2.87540847, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  7.52859468, ...,  0.        ,
         0.        ,  1.27965182],
       ...,
       [ 0.        ,  0.        ,  2.87540847, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.80439363],
       [ 0.        ,  0.        ,  2.87540847, ...,  0.        ,
         0.        ,  0.        ]])

In [55]:
dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

In [56]:
dummy_test.shape

(1694, 139)

In [57]:
common_user_based_matrix.head()

movieId,0,1,9,10,14,15,16,17,19,20,...,253,254,255,256,257,258,260,263,264,268
userId
15,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
20,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
44,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
92,0,0,0,0,0,0,0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
dummy_test.head()

movieId,0,1,9,10,14,15,16,17,19,20,...,253,254,255,256,257,258,260,263,264,268
userId
15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)

In [60]:
common_user_predicted_ratings.head()

movieId,0,1,9,10,14,15,16,17,19,20,...,253,254,255,256,257,258,260,263,264,268
userId
15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Calculating the RMSE only for the movies rated by the user. Before computing the RMSE, the predicted ratings are rescaled to the 1 to 5 range.

In [61]:
from sklearn.preprocessing import MinMaxScaler

# Keep only positive predictions; everything else becomes NaN and is ignored by the scaler.
X = common_user_predicted_ratings.copy()
X = X[X > 0]

# Rescale the predictions to the 1-5 range (MinMaxScaler scales each column separately).
scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = scaler.transform(X)

print(y)

MinMaxScaler(feature_range=(1, 5))
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [62]:
# Actual test ratings in user-movie matrix form; unrated entries stay NaN.
common_ = common.pivot_table(index='userId', columns='movieId', values='rating')

In [63]:
# Count the non-NaN predictions.
total_non_nan = np.count_nonzero(~np.isnan(y))

In [64]:
# Sum the squared errors with np.nansum so that NaN (unrated) entries are skipped.
rmse = (np.nansum((common_.to_numpy() - y) ** 2) / total_non_nan) ** 0.5
print(rmse)

2.100598286179026
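
The masked RMSE computed above can be checked by hand on a tiny example. This is a minimal sketch with made-up matrices; `actual` and `predicted` are hypothetical stand-ins for the test ratings and the rescaled predictions.

```python
import numpy as np

# Hypothetical actual ratings; NaN marks items the user did not rate.
actual = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 2.0, np.nan],
])

# Hypothetical predictions already scaled to the 1-5 range.
predicted = np.array([
    [4.0, 2.5, 3.0],
    [1.0, 4.0, 5.0],
])

# Evaluate only where an actual rating exists.
mask = ~np.isnan(actual)
squared_error = (actual[mask] - predicted[mask]) ** 2
rmse = np.sqrt(squared_error.mean())
print(rmse)
```

Only three entries have an actual rating, so the squared errors are 1, 0, and 4, and the RMSE is the square root of 5/3.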


## Using Item similarity

# Item Based Similarity

Taking the transpose of the rating matrix to normalise the ratings around the mean for each movie ID. In the user-based similarity, we took the mean for each user instead of each movie.

In [65]:
df_pivot = train.pivot_table(
    index='userId',
    columns='movieId',
    values='rating',
    fill_value=0
).T

df_pivot.head()

userId,1,2,3,4,5,6,7,8,9,10,...,24902,24903,24905,24907,24908,24909,24910,24911,24912,24913
movieId
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Normalising the ratings of each movie for the adjusted cosine

In [66]:
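# Mean rating per movie (rows are movies after the transpose), subtracted from every entry in that movie's row.
mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T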

In [67]:
df_subtracted.head()

userId,1,2,3,4,5,6,7,8,9,10,...,24902,24903,24905,24907,24908,24909,24910,24911,24912,24913
movieId
0,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,...,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093,-0.00093
1,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,4.975975,-0.024025,-0.024025,-0.024025,...,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025,-0.024025
2,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,...,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095,-0.001095
3,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,...,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104,-0.00104
4,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,...,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274,-0.000274


Finding the cosine similarity using pairwise distances approach

In [68]:
from sklearn.metrics.pairwise import pairwise_distances

In [69]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

[[ 1.00000000e+00 -1.03255792e-03 -2.14897699e-04 ... -1.58949411e-03
  -6.84484407e-04 -5.39036196e-04]
 [-1.03255792e-03  1.00000000e+00 -1.05202939e-03 ... -4.57132066e-03
  -3.35088610e-03 -2.63884594e-03]
 [-2.14897699e-04 -1.05202939e-03  1.00000000e+00 ... -1.61946801e-03
  -6.97392076e-04 -5.49201074e-04]
 ...
 [-1.58949411e-03 -4.57132066e-03 -1.61946801e-03 ...  1.00000000e+00
  -5.15827113e-03 -4.06217413e-03]
 [-6.84484407e-04 -3.35088610e-03 -6.97392076e-04 ... -5.15827113e-03
   1.00000000e+00 -1.74929547e-03]
 [-5.39036196e-04 -2.63884594e-03 -5.49201074e-04 ... -4.06217413e-03
  -1.74929547e-03  1.00000000e+00]]


In [70]:
item_correlation.shape

(256, 256)

Keeping only the correlations greater than 0 (positively correlated items); negative values are set to 0.

In [71]:
item_correlation[item_correlation<0]=0
item_correlation

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

# Prediction - Item Item

In [72]:
item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
item_predicted_ratings

array([[0.        , 0.0034399 , 0.        , ..., 0.        , 0.        ,
        0.0047605 ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.00793417],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [73]:
item_predicted_ratings.shape

(18273, 256)

In [74]:
dummy_train.shape

(18273, 256)
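
To make the dot product above concrete, here is a minimal sketch on made-up numbers (`toy_ratings` and `toy_item_sim` are illustrative only). Each predicted score is the sum of the user's own ratings weighted by the similarity between the rated item and the target item. The sum is not divided by the total similarity, so the scores are not on the 1 to 5 rating scale.

```python
import numpy as np

# Toy user-by-item ratings (0 = not rated): 2 users, 3 items.
toy_ratings = np.array([
    [5.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
])

# Toy item-item similarity, negatives already set to 0.
toy_item_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Score for (user u, item j) = sum over items i of rating[u, i] * sim[i, j].
toy_pred = np.dot(toy_ratings, toy_item_sim)
print(toy_pred)
```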

### Keeping the predicted ratings only for the movies not rated by the user, for recommendation

In [75]:
item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
item_final_rating.head()

userId \ movieId,0,1,2,3,4,5,6,7,8,9,...,260,261,262,263,264,265,266,268,269,270
1,0.0,0.00344,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016517,...,0.090657,0.0,0.0,0.0,0.055235,0.0,0.0,0.0,0.0,0.00476
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000783,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.001304,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.001043,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
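
The masking step can be sketched on a toy example. The frames below are made up, and `toy_dummy_train` is assumed to follow the same convention as `dummy_train` (1 where the user has not rated the item, 0 where they have).

```python
import numpy as np
import pandas as pd

# Toy predicted scores (users x items) and the matching training ratings.
toy_pred = pd.DataFrame([[5.0, 2.5, 0.0], [1.5, 3.0, 0.6]],
                        index=[10, 11], columns=[0, 1, 2])
toy_train = pd.DataFrame([[5.0, np.nan, np.nan], [np.nan, 3.0, np.nan]],
                         index=[10, 11], columns=[0, 1, 2])

# 1 where the user has NOT rated the item, 0 where they already have.
toy_dummy_train = toy_train.isna().astype(int)

# Element-wise product zeroes out the items the user already rated.
toy_final = np.multiply(toy_pred, toy_dummy_train)
print(toy_final)
```

Only the unrated items keep a non-zero score, so sorting a user's row ranks candidate recommendations without re-suggesting items they already reviewed.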


### Finding the top 5 recommendations for the *user*



In [76]:
# Take the (integer-encoded) user ID as input
user_input = int(input("Enter your user ID: "))
print(user_input)

Enter your user ID: 5
5


In [77]:
# Recommending the Top 5 products to the user.
d = item_final_rating.loc[user_input].sort_values(ascending=False)[0:5]
d

movieId
64     0.052455
158    0.037725
67     0.030557
125    0.016700
75     0.013517
Name: 5, dtype: float64
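
An equivalent way to pick the top items is `Series.nlargest`. The sketch below wraps it in a hypothetical helper, `top_n_items` (not part of this notebook), that also drops zero scores so a user with fewer than five positive predictions does not get padded with arbitrary items.

```python
import pandas as pd

def top_n_items(scores, n=5):
    """Return the n highest positive scores, largest first (hypothetical helper)."""
    positive = scores[scores > 0]
    return positive.nlargest(n)

# Made-up scores for one user, indexed by item id.
toy_scores = pd.Series({64: 0.05, 158: 0.04, 67: 0.03, 3: 0.0, 9: 0.0}, name=5)
toy_top = top_n_items(toy_scores, n=5)
print(toy_top)
```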

In [78]:
# Map movieId to movie title / genres
movie_mapping = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/NLPCapstone/main/movies.csv', encoding='latin-1')

In [79]:
d = pd.merge(d,movie_mapping,left_on='movieId', right_on='movieId', how='left')
d.head()

,movieId,5,title,genres
0,64,0.052455,Two if by Sea (1996),Comedy|Romance
1,158,0.037725,Casper (1995),Adventure|Children
2,67,0.030557,Two Bits (1995),Drama
3,125,0.0167,Flirting With Disaster (1996),Comedy
4,75,0.013517,Big Bully (1996),Comedy|Drama
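
If the integer `movieId` values were produced by label-encoding the product `name` column (an assumption, since that cell is not shown in this part of the notebook), the recommended IDs can also be turned back into the original names with `LabelEncoder.inverse_transform`. The encoder name `item_encoder` and the product names below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical product names standing in for the 'name' column.
toy_names = pd.Series(['Soap', 'Blender', 'Soap', 'Vacuum'])

# Assumed encoder for the item column; classes are sorted alphabetically.
item_encoder = LabelEncoder()
toy_ids = item_encoder.fit_transform(toy_names)

# Map a list of recommended ids back to the original names.
top_ids = [2, 0]
top_names = item_encoder.inverse_transform(top_ids)
print(list(top_names))
```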


In [80]:
train_new = pd.merge(train, movie_mapping,left_on='movieId', right_on='movieId', how='left')
train_new[train_new.userId == 1].head()

,userId,movieId,rating,reviews_date,title,genres
17200,1,157,3,2016-07-09T00:00:00.000Z,Canadian Bacon (1995),Comedy|War
18225,1,151,1,2017-03-14T00:00:00.000Z,Rob Roy (1995),Action|Drama|Romance|War


# Evaluation - Item Item

Evaluation will be the same as you have seen above for the prediction. The only difference is that you will evaluate on the movies already rated by the user instead of predicting for the movies not rated by the user.

In [81]:
test.columns

Index(['userId', 'movieId', 'rating', 'reviews_date'], dtype='object')

In [82]:
common =  test[test.movieId.isin(train.movieId)]
common.shape

(8965, 4)

In [83]:
common.head(4)

,userId,movieId,rating,reviews_date
0,24783,93,5,2014-11-14T00:00:00.000Z
1,1490,65,5,2015-01-03T00:00:00.000Z
2,1150,41,3,2013-03-14T00:00:00.000Z
3,7799,15,5,2016-10-13T00:00:00.000Z


In [84]:
common_item_based_matrix = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [85]:
common_item_based_matrix.shape

(195, 8321)
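
The two steps above (keeping only test rows whose item also appears in train, then pivoting to an item-by-user matrix) can be sketched on toy frames. All values below are made up for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 11], 'rating': [5, 3]})
toy_test = pd.DataFrame({'userId': [1, 2, 3], 'movieId': [11, 12, 10], 'rating': [4, 2, 5]})

# Item 12 never appears in train, so it has no similarity row and is dropped.
toy_common = toy_test[toy_test.movieId.isin(toy_train.movieId)]

# Pivot to users x items, then transpose to items x users.
toy_matrix = toy_common.pivot_table(index='userId', columns='movieId', values='rating').T
print(toy_matrix)
```

Rows are items and columns are users, which matches the orientation needed to multiply by the item similarity matrix from the left.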

In [86]:
item_correlation_df = pd.DataFrame(item_correlation)

In [87]:
item_correlation_df.head(1)

,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [88]:
item_correlation_df['movieId'] = df_subtracted.index
item_correlation_df.set_index('movieId',inplace=True)
item_correlation_df.head()

movieId,0,1,2,3,4,5,6,7,8,9,...,246,247,248,249,250,251,252,253,254,255
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16054,0.0


In [89]:
list_name = common.movieId.tolist()

In [90]:
item_correlation_df.columns = df_subtracted.index.tolist()

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]

In [91]:
item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]

item_correlation_df_3 = item_correlation_df_2.T

In [92]:
item_correlation_df_3.head()

movieId,0,1,3,5,9,10,12,13,14,15,...,256,257,258,260,263,264,266,268,269,270
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
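
The row filter, transpose, and second row filter used above select the same sub-matrix as a single `.loc` call with the same labels on both axes. A minimal sketch on a toy similarity frame (values made up):

```python
import pandas as pd

# Toy item-item similarity labelled by item id on both axes.
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.4],
     [0.0, 0.4, 1.0]],
    index=[0, 1, 2], columns=[0, 1, 2],
)

# Item ids as they appear in the test ratings (duplicates are fine).
toy_test_items = [2, 0, 2]

# Keep only the items present in the test set, on rows and columns at once.
keep = toy_sim.index[toy_sim.index.isin(toy_test_items)]
toy_sub = toy_sim.loc[keep, keep]
print(toy_sub)
```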


In [93]:
item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
common_item_predicted_ratings


array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.07875875, 3.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [94]:
common_item_predicted_ratings.shape

(195, 8321)

`dummy_test` will be used for evaluation. To evaluate, we only make predictions for the movies the user has already rated, so those entries are marked as 1. This is the opposite of `dummy_train`.



In [95]:
dummy_test = common.copy()

dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='userId', columns='movieId', values='rating').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)

Products that were not rated are marked as 0 for evaluation. Next, build the item-based matrix representation of the actual test ratings.


In [96]:
common_ = common.pivot_table(index='userId', columns='movieId', values='rating').T

In [97]:
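from sklearn.preprocessing import MinMaxScaler

# Keep only the positive predictions; everything else becomes NaN.
X = common_item_predicted_ratings.copy()
X = X[X > 0]

# Rescale the predictions to the 1 to 5 rating range.
# MinMaxScaler works column by column, i.e. separately for each user here.
scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = scaler.transform(X)

print(y)

MinMaxScaler(feature_range=(1, 5))
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan  1. nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]


In [98]:
# Count the non-NaN predictions
total_non_nan = np.count_nonzero(~np.isnan(y))

In [99]:
# RMSE over the cells that have both an actual and a predicted rating.
# DataFrame.sum() skips NaN by default, so no star import of numpy is needed.
rmse = (((common_ - y) ** 2).sum().sum() / total_non_nan) ** 0.5
print(rmse)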

3.5435475671613226


## **Summary - Recommendation Engine**