### Recommendation System

Problem Statement: Build a recommendation engine for the following dataset.

Dataset: Movie.csv

In [1]:
#Importing the Required Libraries
import pandas as pd
import numpy as np
#Distance functions used to compute similarity between users
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

In [2]:
#Loading the Dataset
movies_df = pd.read_csv('C:/Users/Akaash/Downloads/Movie.csv')
movies_df.head()

Unnamed: 0,userId,movie,rating
0,3,Toy Story (1995),4.0
1,6,Toy Story (1995),5.0
2,8,Toy Story (1995),4.0
3,10,Toy Story (1995),4.0
4,11,Toy Story (1995),4.5


In [3]:
#To get Shape of Original Dataset
movies_df.shape

(8992, 3)

In [4]:
#number of unique users in the dataset
len(movies_df.userId.unique())

4081

In [5]:
#number of unique movies in the dataset
len(movies_df.movie.unique())

10

Inference:

Unique users = 4081, unique movies = 10.

Here the movies are the products and the users are the customers, so each unique user should become a row and each unique movie a column. A crosstab or pivot table produces this layout.

After pivoting, the table has shape (4081, 10), i.e. (rows, columns).

In [6]:
#Creating a Pivot Table DataFrame
user_movies_df = movies_df.pivot(index='userId', columns='movie', values='rating').reset_index(drop=True)
user_movies_df

movie,Father of the Bride Part II (1995),GoldenEye (1995),Grumpier Old Men (1995),Heat (1995),Jumanji (1995),Sabrina (1995),Sudden Death (1995),Tom and Huck (1995),Toy Story (1995),Waiting to Exhale (1995)
0,,,,,3.5,,,,,
1,,,4.0,,,,,,,
2,,,,,,,,,4.0,
3,,4.0,,3.0,,,,,,
4,,,,,3.0,,,,,
...,...,...,...,...,...,...,...,...,...,...
4076,4.0,,,,,,,,,
4077,3.5,,,,,,,,4.0,
4078,,3.0,4.0,5.0,,3.0,1.0,,4.0,
4079,,,,,,,,,5.0,


Inference: The table now has shape (4081, 10). The userId index produced by pivot was replaced with a default 0-based index by reset_index(drop=True), where drop=True discards the original index instead of keeping it as a column.

In [7]:
#Replacing the index with userId
user_movies_df.index = movies_df.userId.unique()
user_movies_df

movie,Father of the Bride Part II (1995),GoldenEye (1995),Grumpier Old Men (1995),Heat (1995),Jumanji (1995),Sabrina (1995),Sudden Death (1995),Tom and Huck (1995),Toy Story (1995),Waiting to Exhale (1995)
3,,,,,3.5,,,,,
6,,,4.0,,,,,,,
8,,,,,,,,,4.0,
10,,4.0,,3.0,,,,,,
11,,,,,3.0,,,,,
...,...,...,...,...,...,...,...,...,...,...
7044,4.0,,,,,,,,,
7070,3.5,,,,,,,,4.0,
7080,,3.0,4.0,5.0,,3.0,1.0,,4.0,
7087,,,,,,,,,5.0,


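Inference: The default row index is replaced with user IDs. Note that pivot sorts its rows by userId, whereas movies_df.userId.unique() returns IDs in order of first appearance, so these labels match the rows only if the two orders agree. Here they do not appear to: the row labeled 3 shows only a Jumanji (1995) rating, while the raw data for user 3 in cell 14 lists only Toy Story (1995). Keeping the index that pivot produces, by omitting reset_index(drop=True), avoids this mismatch.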
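The difference between the two orderings can be checked on a small example. The sketch below uses a hypothetical `toy_ratings` frame (not the Movie.csv data) in which the user IDs do not first appear in ascending order, and compares the row order that `pivot` produces with the order returned by `unique()`.

```python
import pandas as pd

# Hypothetical toy ratings: user 9 appears first in the file, then 2, then 5.
toy_ratings = pd.DataFrame({
    "userId": [9, 2, 9, 5],
    "movie": ["Toy Story (1995)", "Toy Story (1995)", "Heat (1995)", "Heat (1995)"],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# pivot() already uses userId as the row index, sorted in ascending order.
toy_matrix = toy_ratings.pivot(index="userId", columns="movie", values="rating")

# unique() returns IDs in order of first appearance, which can differ.
pivot_order = toy_matrix.index.tolist()
appearance_order = toy_ratings["userId"].unique().tolist()

print(pivot_order)
print(appearance_order)

# With the pivot index kept, label-based lookups return the right user's row.
print(toy_matrix.loc[9])
```

When the two lists differ, assigning `unique()` as the index attaches each ID to another user's ratings, so keeping the index from `pivot` is the safer choice in this sketch.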

In [8]:
#Impute those NaNs with 0 values
user_movies_df.fillna(0, inplace=True)
user_movies_df

movie,Father of the Bride Part II (1995),GoldenEye (1995),Grumpier Old Men (1995),Heat (1995),Jumanji (1995),Sabrina (1995),Sudden Death (1995),Tom and Huck (1995),Toy Story (1995),Waiting to Exhale (1995)
3,0.0,0.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
10,0.0,4.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
7044,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7070,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
7080,0.0,3.0,4.0,5.0,0.0,3.0,1.0,0.0,4.0,0.0
7087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0


Inference: Missing ratings are filled with 0, on the assumption that users have not rated the movies they did not watch.
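As an alternative sketch, the pivot and the zero-fill can be done in one step with `pivot_table` and its `fill_value` argument. The `toy_ratings` frame below is hypothetical. Unlike `pivot`, `pivot_table` aggregates duplicate (user, movie) pairs, using the mean by default, which may or may not be the behavior wanted for the real data.

```python
import pandas as pd

# Hypothetical toy ratings; user 2 has not rated Jumanji.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movie": ["Heat (1995)", "Jumanji (1995)", "Heat (1995)"],
    "rating": [3.0, 3.5, 5.0],
})

# Build the user x movie matrix and fill missing ratings with 0 in one call.
toy_filled = toy_ratings.pivot_table(
    index="userId", columns="movie", values="rating", fill_value=0
)

print(toy_filled)
```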

#### Checking Similarity -- Correlation-Based Similarity

In [9]:
#Checking similarity using correlation-based similarity -- passing an array of values
user_sim = 1 - pairwise_distances( user_movies_df.values,metric='correlation')
user_sim

array([[ 1.        , -0.11111111, -0.11111111, ..., -0.35136418,
        -0.11111111,  0.47898578],
       [-0.11111111,  1.        , -0.11111111, ...,  0.35136418,
        -0.11111111, -0.21772081],
       [-0.11111111, -0.11111111,  1.        , ...,  0.35136418,
         1.        ,  0.56607411],
       ...,
       [-0.35136418,  0.35136418,  0.35136418, ...,  1.        ,
         0.35136418,  0.13769873],
       [-0.11111111, -0.11111111,  1.        , ...,  0.35136418,
         1.        ,  0.56607411],
       [ 0.47898578, -0.21772081,  0.56607411, ...,  0.13769873,
         0.56607411,  1.        ]])
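To see where values such as 1.0 and -0.11111111 come from, the following minimal sketch builds three hypothetical users over 10 movies and checks that `1 - pairwise_distances(..., metric='correlation')` equals the Pearson correlation between rows. The array `toy` and its ratings are made up for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical users over 10 movies, matching the width of the matrix above.
toy = np.zeros((3, 10))
toy[0, 4] = 3.5   # user A rated only movie 4
toy[1, 4] = 3.0   # user B rated only the same movie
toy[2, 2] = 4.0   # user C rated only a different movie

# Correlation distance is 1 - Pearson correlation, so this recovers Pearson r.
toy_sim = 1 - pairwise_distances(toy, metric="correlation")

# Cross-check against NumPy's row-wise Pearson correlation.
pearson = np.corrcoef(toy)

print(np.round(toy_sim, 6))
print(np.allclose(toy_sim, pearson))
```

Two users who each rated a single, identical movie correlate perfectly even if their ratings differ (3.5 vs 3.0), and two users who each rated a single, different movie get -1/9 when there are 10 columns. Both effects follow from treating the zero-filled entries as real ratings.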

In [10]:
#Store the results of Correlation Based Similarity in a dataframe
user_sim_df = pd.DataFrame(user_sim)

#Set the index and column names to user ids 
user_sim_df.index = movies_df.userId.unique()
user_sim_df.columns = movies_df.userId.unique()
#Getting the first 5 rows and columns
user_sim_df.iloc[0:5,0:5]

Unnamed: 0,3,6,8,10,11
3,1.0,-0.111111,-0.111111,-0.164581,1.0
6,-0.111111,1.0,-0.111111,-0.164581,-0.111111
8,-0.111111,-0.111111,1.0,-0.164581,-0.111111
10,-0.164581,-0.164581,-0.164581,1.0,-0.164581
11,1.0,-0.111111,-0.111111,-0.164581,1.0


Inference: This is the user-user similarity matrix. A value of 1 means two users' rating patterns are perfectly correlated; values at or below 0 mean no similarity or negative similarity.

Each user's similarity with themselves is always 1, so the diagonal elements are set to 0 to exclude them.

In [12]:
#Setting the diagonal to 0 (this shows up in user_sim_df only when it shares memory with user_sim)
np.fill_diagonal(user_sim, 0)
#Getting the first 5 rows and columns
user_sim_df.iloc[0:5, 0:5]

Unnamed: 0,3,6,8,10,11
3,0.0,-0.111111,-0.111111,-0.164581,1.0
6,-0.111111,0.0,-0.111111,-0.164581,-0.111111
8,-0.111111,-0.111111,0.0,-0.164581,-0.111111
10,-0.164581,-0.164581,-0.164581,0.0,-0.164581
11,1.0,-0.111111,-0.111111,-0.164581,0.0


In [13]:
#Most similar user for each user
user_sim_df.idxmax(axis=1)

3         11
6        667
8         16
10      4047
11         3
        ... 
7044     298
7070    1808
7080    4032
7087       8
7105    4110
Length: 4081, dtype: int64

Inference: For each user, the output lists the most similar other user. For example, users 3 and 11 are each other's closest match, and so on down the list.
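The diagonal zeroing and nearest-user lookup can be sketched on a small, made-up similarity matrix. The IDs and values below are hypothetical. Zeroing the diagonal before building the DataFrame avoids relying on the DataFrame sharing memory with the array.

```python
import numpy as np
import pandas as pd

# Hypothetical user IDs and a symmetric similarity matrix.
ids = [3, 6, 11]
toy_sim = np.array([
    [1.0, -0.1, 0.9],
    [-0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
])

# Zero the diagonal first so a user is never their own nearest neighbour.
np.fill_diagonal(toy_sim, 0)
toy_sim_df = pd.DataFrame(toy_sim, index=ids, columns=ids)

# For each row, idxmax returns the column label of the largest similarity.
nearest = toy_sim_df.idxmax(axis=1)
print(nearest)
```

When several users tie for the highest similarity, `idxmax` returns the first one in column order, so the reported match is one of possibly many equally similar users.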

In [14]:
#Checking the ratings of users 3 and 11 to find recommendations
movies_df[(movies_df['userId']==3) | (movies_df['userId']==11)]

Unnamed: 0,userId,movie,rating
0,3,Toy Story (1995),4.0
4,11,Toy Story (1995),4.5
7446,11,GoldenEye (1995),2.5


Inference:

User 3 has seen only Toy Story (1995), whereas user 11 has seen two movies: Toy Story (1995) and GoldenEye (1995).

Since the two users have one movie in common, GoldenEye (1995) can be recommended to user 3 based on user 11.

In [16]:
#Creating a DataFrame for each of the two user IDs to use in the join
user_1=movies_df[movies_df['userId']==6]
user_2=movies_df[movies_df['userId']==667]
print(user_1.movie)
print(user_2.movie)

1              Toy Story (1995)
3725    Grumpier Old Men (1995)
6464             Sabrina (1995)
Name: movie, dtype: object
225    Toy Story (1995)
Name: movie, dtype: object


In [17]:
#Join the movies watched by both users, then recommend the ones one of them has not watched -- like SQL joins
pd.merge(user_1,user_2,on='movie',how='outer')

Unnamed: 0,userId_x,movie,rating_x,userId_y,rating_y
0,6,Toy Story (1995),5.0,667.0,4.5
1,6,Grumpier Old Men (1995),3.0,,
2,6,Sabrina (1995),5.0,,


Inference: Grumpier Old Men (1995) and Sabrina (1995) can now be recommended to user 667, based on the merged table above.
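The manual merge-and-inspect step can be wrapped in a small helper. The function `recommend_from_neighbor` below is a hypothetical sketch, not part of the notebook. It returns the movies a similar user has rated that the target user has not, ordered by the neighbor's rating. The toy data reuses the ratings for users 6 and 667 shown in the merged table.

```python
import pandas as pd

def recommend_from_neighbor(ratings, target_id, neighbor_id):
    """Return movies rated by neighbor_id that target_id has not rated."""
    # Movies the target user has already rated.
    seen_by_target = set(ratings.loc[ratings["userId"] == target_id, "movie"])
    # The neighbor's ratings, restricted to movies the target has not seen.
    neighbor_rows = ratings[ratings["userId"] == neighbor_id]
    unseen = neighbor_rows[~neighbor_rows["movie"].isin(seen_by_target)]
    # Highest-rated movies first.
    return unseen.sort_values("rating", ascending=False)["movie"].tolist()

# Ratings for users 6 and 667, copied from the merged table above.
toy_ratings = pd.DataFrame({
    "userId": [6, 6, 6, 667],
    "movie": ["Toy Story (1995)", "Grumpier Old Men (1995)",
              "Sabrina (1995)", "Toy Story (1995)"],
    "rating": [5.0, 3.0, 5.0, 4.5],
})

recs = recommend_from_neighbor(toy_ratings, target_id=667, neighbor_id=6)
print(recs)
```

Sorting by the neighbor's rating is a design choice in this sketch. It puts Sabrina (1995), rated 5.0, ahead of Grumpier Old Men (1995), rated 3.0.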