**Exploratory Data Analysis**

We first import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

We read the dataset, assign column names, and print its first few rows.

In [2]:
columns = ['productID', 'userID', 'ratings','timestamp']
recomm_df = pd.read_csv('http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Gift_Cards.csv',names=columns)
recomm_df.head()

Unnamed: 0,productID,userID,ratings,timestamp
0,B001GXRQW0,APV13CM0919JD,1.0,1229644800
1,B001GXRQW0,A3G8U1G1V082SN,5.0,1229472000
2,B001GXRQW0,A11T2Q0EVTUWP,5.0,1229472000
3,B001GXRQW0,A9YKGBH3SV22C,5.0,1229472000
4,B001GXRQW0,A34WZIHVF3OKOL,1.0,1229472000


We remove the 'timestamp' column, as it is not needed for this analysis.

In [3]:
recomm_df = recomm_df.drop('timestamp', axis=1)

The following command gives details about the dataset.

In [4]:
recomm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147194 entries, 0 to 147193
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   productID  147194 non-null  object 
 1   userID     147194 non-null  object 
 2   ratings    147194 non-null  float64
dtypes: float64(1), object(2)
memory usage: 3.4+ MB


The summary statistics for the 'ratings' column show that most ratings are 5.0: the 25th percentile, the median, and the 75th percentile are all 5.0, and the mean is about 4.67.

In [5]:
recomm_df.describe()

Unnamed: 0,ratings
count,147194.0
mean,4.67197
std,0.955134
min,1.0
25%,5.0
50%,5.0
75%,5.0
max,5.0
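The percentiles above only hint at how skewed the ratings are. A minimal sketch of how the share of each rating value could be checked directly, assuming a small toy series (`toy_ratings`, with invented values) in place of `recomm_df['ratings']`:

```python
import pandas as pd

# Toy ratings standing in for recomm_df['ratings'] (illustrative values only).
toy_ratings = pd.Series([5.0, 5.0, 5.0, 4.0, 1.0, 5.0, 3.0, 5.0], name="ratings")

# Relative frequency of each rating value, ordered by rating.
rating_share = toy_ratings.value_counts(normalize=True).sort_index()
print(rating_share)

# Fraction of all ratings that are exactly 5.0.
five_star_share = (toy_ratings == 5.0).mean()
print(f"Share of 5-star ratings: {five_star_share:.3f}")
```

On the real data, the same two calls applied to `recomm_df['ratings']` would show how strongly the 5.0 ratings dominate.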


The following command gives the number of missing values in each column. Since all counts are 0, the dataset has no missing values.

In [6]:
recomm_df.isna().sum()

productID    0
userID       0
ratings      0
dtype: int64

We find that there are 147194 rows and 3 columns in our dataset.

In [7]:
recomm_df.shape

(147194, 3)

We list the distinct users along with the number of ratings each of them has given.

In [8]:
recomm_df.userID.value_counts()

A13H0YP0J8PM6V    39
A1U1G73EI5IRZF    32
A3OHGWD8LIDZ8K    26
A1F2NKB1ZKMO2V    23
A2RTTRR421J9KG    22
                  ..
A11DQ1W4EXZ3VJ     1
ABGQHBIJ3I01P      1
ABGSDBKH34JM6      1
A3LNAEWZXHDSUO     1
ANABUB0FRZXRM      1
Name: userID, Length: 128877, dtype: int64

The following commands count the unique users and products and list the distinct rating values.

In [9]:
print('Number of unique users', len(recomm_df['userID'].unique()))
print('Number of unique products', len(recomm_df['productID'].unique()))
print('Unique Ratings', recomm_df['ratings'].unique())

Number of unique users 128877
Number of unique products 1548
Unique Ratings [1. 5. 3. 4. 2.]


**Data Preprocessing**

We keep only those users who have given at least 5 ratings and those products that have received at least 5 ratings.

First we find the users who have given at least 5 ratings and create a new table containing only their ratings.

In [24]:
userID = recomm_df.groupby('userID').count()
top_user = userID[userID['ratings'] >= 5].index
topuser_ratings_df = recomm_df[recomm_df['userID'].isin(top_user)]
topuser_ratings_df.shape

(4832, 3)

In [11]:
topuser_ratings_df.head()

Unnamed: 0,productID,userID,ratings


The data has been reduced to 4832 rows. We now arrange the ratings in decreasing order.

In [12]:
topuser_ratings_df.sort_values(by='ratings', ascending=False).head()

Unnamed: 0,productID,userID,ratings


We now do the same for products.

In [13]:
prodID = recomm_df.groupby('productID').count()

In [25]:
top_prod = prodID[prodID['ratings'] >= 5].index
top_ratings_df = topuser_ratings_df[topuser_ratings_df['productID'].isin(top_prod)]
top_ratings_df.sort_values(by='ratings', ascending=False).head()

Unnamed: 0,productID,userID,ratings
43,B001GXRQW0,A21G4URAAP0EN,5.0
107823,B00MV9GRNW,A35S2WL2F27WC7,5.0
111290,B00MV9H2B8,A2M0BETYD0WVU8,5.0
111248,B00MV9P8MS,A2YMYWMC2FDTZ1,5.0
111161,B00MV9GCYQ,AVCV7UBUUX7YP,5.0


In [26]:
top_ratings_df.shape

(4807, 3)

The number of entries has now been further reduced to 4807.
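The two filtering steps follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration on invented toy data: `filter_by_min_count`, its `reference` parameter, and `toy_df` are hypothetical names, not part of the notebook. As in the cells above, product counts are taken from the full table rather than from the already filtered one.

```python
import pandas as pd

def filter_by_min_count(df, column, min_count, reference=None):
    """Keep rows of df whose value in `column` occurs at least
    `min_count` times in `reference` (defaults to df itself)."""
    if reference is None:
        reference = df
    counts = reference[column].value_counts()
    return df[df[column].map(counts) >= min_count]

# Toy ratings table (invented values) with the same columns as recomm_df.
toy_df = pd.DataFrame({
    "productID": ["p1", "p1", "p2", "p1", "p2", "p3"],
    "userID":    ["u1", "u2", "u1", "u3", "u2", "u1"],
    "ratings":   [5.0, 4.0, 5.0, 1.0, 3.0, 5.0],
})

# Users with at least 2 ratings, then products with at least 2 ratings
# (product counts are computed on the full toy_df, as in the notebook).
active_users = filter_by_min_count(toy_df, "userID", 2)
filtered = filter_by_min_count(active_users, "productID", 2, reference=toy_df)
print(filtered)
```

With a threshold of 5 and `recomm_df` as the input, this would mirror the user and product filters applied above.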

**Building the Collaborative Filtering Model**

We now pivot the data into a user-item matrix, with one row per user and one column per product.

In [27]:
user_ratings = top_ratings_df.pivot_table(index=['userID'], columns=['productID'], values='ratings')
user_ratings.head()

productID,B001GXRQW0,B002MS7BPA,B002XNLC04,B002YEWXZ0,B004KNWWMW,B004KNWWO0,B004KNWWPE,B004KNWWT0,B004KNWWTA,B004KNWWTK,...,B01GF6X20W,B01GF7GNCA,B01GKWEH64,B01GKWEJTO,B01GKWEPBG,B01GKWETLC,B01GKZ37SA,B01GKZ3SQG,B01GP1W4LA,B01H5PPJT4
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A102300ZYSDHRR,,,,,,,,,,,...,,,,,,,,,,
A10CJ0DWV2M12X,,,,,,,,,,,...,,,,,,,,,,
A10PEXB6XAQ5XF,,,,,,,,,,,...,,,,,,,,,,
A11F143J72N3QZ,,,,,,,,,,,...,,,,,,,,,,
A11FTRONEVLMKY,,,,,,,,,,,...,,,,,,,,,,


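The NaN entries are user-product pairs for which no rating exists. We keep only the products that have at least 50 ratings in this matrix and fill the remaining NaN values with 0.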

In [28]:
user_ratings=user_ratings.dropna(thresh=50, axis=1).fillna(0)

In [29]:
user_ratings.head()

productID,B004Q7CK9M,B006PJHP62,B0078EPBHI,B0078EPRPE,B0091JKJ0M,B0091JKVU0,B0091JKY0M,B00AR51Y5I,B00BXLVAD6,B00BXLW5QC,B00CHQ7I2S,B00CXZPG0O,B00F2RZMEA,B00GOLGWVK,B00JDQJZWG,B00PG8502O,B014S24DAI,B015WY0DOQ,B01E4QPDV6,B01E4QUN0W
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
A102300ZYSDHRR,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0
A10CJ0DWV2M12X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A10PEXB6XAQ5XF,0.0,0.0,0.0,0.0,5.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0
A11F143J72N3QZ,0.0,5.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,5.0
A11FTRONEVLMKY,5.0,0.0,0.0,0.0,0.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
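The `dropna(thresh=50, axis=1)` call is easy to misread, so here is a small sketch of the same two-step operation on an invented 3x3 matrix (`toy_matrix` is a hypothetical stand-in for `user_ratings`), using a threshold of 2 instead of 50:

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are products (invented values).
toy_matrix = pd.DataFrame(
    {
        "p1": [5.0, 4.0, np.nan],
        "p2": [np.nan, np.nan, 3.0],
        "p3": [1.0, np.nan, 5.0],
    },
    index=["u1", "u2", "u3"],
)

# Keep only product columns that have at least 2 non-missing ratings.
dense = toy_matrix.dropna(thresh=2, axis=1)

# Treat every remaining missing rating as 0.
filled = dense.fillna(0)
print(filled)
```

Here `p2` is dropped because it has a single rating, while `p1` and `p3` are kept and their gaps become 0. Note that filling with 0 makes "not rated" look like a very low rating, which influences the correlations computed from this matrix.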


We now compute the Pearson correlation between every pair of products to capture the relations that exist between them. This is item-to-item filtering.

In [30]:
item_similarity_df=user_ratings.corr(method='pearson')
item_similarity_df.head(10)

productID,B004Q7CK9M,B006PJHP62,B0078EPBHI,B0078EPRPE,B0091JKJ0M,B0091JKVU0,B0091JKY0M,B00AR51Y5I,B00BXLVAD6,B00BXLW5QC,B00CHQ7I2S,B00CXZPG0O,B00F2RZMEA,B00GOLGWVK,B00JDQJZWG,B00PG8502O,B014S24DAI,B015WY0DOQ,B01E4QPDV6,B01E4QUN0W
productID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
B004Q7CK9M,1.0,-0.020661,-0.032571,-0.030149,0.115784,0.063326,0.086251,0.073412,-0.035637,-0.040187,0.071439,-0.000865,-0.068437,-0.025476,0.095055,-0.0279,0.073866,-0.049553,0.135061,0.083831
B006PJHP62,-0.020661,1.0,0.131454,0.155744,-0.066352,-0.065861,-0.077417,0.032216,0.234358,0.194745,-0.094369,0.208254,0.187105,0.1162,-0.090086,0.179259,-0.040559,0.084657,-0.01547,-0.022827
B0078EPBHI,-0.032571,0.131454,1.0,0.147741,-0.047754,-0.087505,-0.058051,0.077446,0.175501,0.201359,-0.079351,0.213274,0.127616,0.03449,-0.075495,0.170531,-0.043155,0.176143,-0.058574,-0.083756
B0078EPRPE,-0.030149,0.155744,0.147741,1.0,-0.045908,-0.100657,-0.072338,0.059906,0.171252,0.122745,-0.060043,0.117081,0.122447,0.167131,-0.073935,0.10645,-0.081477,0.084017,-0.036373,-0.082092
B0091JKJ0M,0.115784,-0.066352,-0.047754,-0.045908,1.0,0.226957,0.324652,0.009934,-0.068422,-0.069929,0.312465,-0.046113,-0.100998,-0.055554,0.089726,-0.062723,0.125569,-0.084666,0.122018,0.198674
B0091JKVU0,0.063326,-0.065861,-0.087505,-0.100657,0.226957,1.0,0.394512,-0.014765,-0.072666,-0.11275,0.292037,-0.08564,-0.105299,-0.04067,0.081028,-0.078782,0.145547,-0.082154,0.073029,0.290542
B0091JKY0M,0.086251,-0.077417,-0.058051,-0.072338,0.324652,0.394512,1.0,0.001816,-0.058235,-0.058663,0.330594,-0.072484,-0.076913,-0.064444,0.132695,-0.06227,0.057529,-0.085851,0.052444,0.278328
B00AR51Y5I,0.073412,0.032216,0.077446,0.059906,0.009934,-0.014765,0.001816,1.0,0.037458,0.216544,-0.046054,0.062128,0.091101,0.199136,-0.038459,0.115279,0.019554,0.042757,-0.043952,0.028578
B00BXLVAD6,-0.035637,0.234358,0.175501,0.171252,-0.068422,-0.072666,-0.058235,0.037458,1.0,0.114219,-0.030194,0.110988,0.143595,0.09534,-0.075734,0.146128,-0.083021,0.041213,-0.042133,-0.019386
B00BXLW5QC,-0.040187,0.194745,0.201359,0.122745,-0.069929,-0.11275,-0.058663,0.216544,0.114219,1.0,-0.094615,0.138208,0.284236,0.22076,-0.104375,0.192768,-0.023984,0.272578,-0.070378,-0.078131
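To make the correlation step concrete, the sketch below compares `DataFrame.corr` with a hand-written Pearson formula on an invented toy matrix. The names `toy`, `sim`, and `pearson` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix after filling missing ratings with 0 (invented values).
toy = pd.DataFrame(
    {
        "p1": [5.0, 0.0, 5.0, 0.0],
        "p2": [5.0, 0.0, 4.0, 0.0],
        "p3": [0.0, 5.0, 0.0, 5.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Column-by-column Pearson correlation, as in the notebook.
sim = toy.corr(method="pearson")

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D arrays."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson(toy["p1"].to_numpy(), toy["p2"].to_numpy())
print(sim)
print("manual p1/p2:", manual)
```

Products rated by the same users (`p1` and `p2`) get a correlation close to 1, while `p1` and `p3`, rated by disjoint users, get -1 in this toy case.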


Finally, we create a function that scores every product by its similarity to a product the user has rated, weights that score by the rating, and sorts the result in decreasing order. We then pass it an example user, given as a list of (product, rating) pairs, and sum the scores over all rated products.

In [31]:
def get_similar_products(product_name, user_rating):
  similar_score=item_similarity_df[product_name]*(user_rating-2.5)
  similar_score=similar_score.sort_values(ascending=False)
  return similar_score

In [33]:
example=[("B004Q7CK9M", 4),("B006PJHP62", 5),("B0078EPBHI", 3),("B0078EPRPE", 1),("B0091JKJ0M", 2)]
# DataFrame.append was removed in pandas 2.0, so collect the score Series in a list first
rows=[]
for product, rating in example:
  rows.append(get_similar_products(product, rating))
similar_products=pd.DataFrame(rows).reset_index(drop=True)
similar_products.head()
similar_products.sum().sort_values(ascending=False)

B006PJHP62    2.334295
B004Q7CK9M    1.419394
B0078EPBHI    0.582045
B00CXZPG0O    0.473410
B00BXLVAD6    0.397522
B00BXLW5QC    0.378108
B00PG8502O    0.363249
B00F2RZMEA    0.295743
B015WY0DOQ    0.141690
B00AR51Y5I    0.134554
B01E4QPDV6    0.128180
B01E4QUN0W    0.050601
B014S24DAI    0.047255
B00GOLGWVK    0.046611
B00JDQJZWG   -0.054340
B0091JKVU0   -0.075908
B0091JKY0M   -0.147010
B00CHQ7I2S   -0.234608
B0091JKJ0M   -0.447220
B0078EPRPE   -1.059038
dtype: float64

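The top-ranked product, B006PJHP62, is the one the user rated 5, and the lowest-ranked, B0078EPRPE, is the one the user rated 1. This is expected, since every product has a correlation of 1.0 with itself, so the result is a sanity check rather than a measure of recommendation quality. In practice, products the user has already rated would be excluded before recommending.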
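One possible way to keep already-rated products out of the ranking is sketched below. It assumes a small invented similarity matrix (`toy_similarity`) in place of `item_similarity_df`; `recommend_unseen` is a hypothetical helper, and the `midpoint` of 2.5 is taken from `get_similar_products` above.

```python
import pandas as pd

# Hypothetical 4x4 item-similarity matrix standing in for item_similarity_df.
items = ["A", "B", "C", "D"]
toy_similarity = pd.DataFrame(
    [
        [1.0, 0.6, -0.2, 0.1],
        [0.6, 1.0, 0.0, 0.3],
        [-0.2, 0.0, 1.0, 0.5],
        [0.1, 0.3, 0.5, 1.0],
    ],
    index=items,
    columns=items,
)

def recommend_unseen(similarity, rated, midpoint=2.5):
    """Sum rating-weighted similarity scores over the rated products,
    then drop the products the user has already rated."""
    scores = sum(similarity[item] * (rating - midpoint) for item, rating in rated)
    seen = [item for item, _ in rated]
    return scores.drop(labels=seen).sort_values(ascending=False)

# Example user: liked A (rating 5), disliked C (rating 1).
recs = recommend_unseen(toy_similarity, [("A", 5), ("C", 1)])
print(recs)
```

In this toy case only B and D remain as candidates, with B ranked first because it is similar to the product the user liked.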

In [None]:
import pickle

In [None]:
filename='similarity_df.sav'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
  pickle.dump(item_similarity_df, f)

In [None]:
with open('similarity_df.sav', 'rb') as f:
  loaded_model=pickle.load(f)
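As an alternative to calling `pickle` directly, pandas provides `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below shows a round trip on an invented 2x2 matrix written to a temporary directory; `toy_similarity` and the `.pkl` file name are assumptions made for the illustration.

```python
import os
import tempfile

import pandas as pd

# Small invented similarity matrix standing in for item_similarity_df.
toy_similarity = pd.DataFrame(
    [[1.0, 0.4], [0.4, 1.0]],
    index=["A", "B"],
    columns=["A", "B"],
)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_df.pkl")
    toy_similarity.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original exactly.
print(restored.equals(toy_similarity))
```

As with any pickle file, it should only be loaded from a trusted source, since unpickling can execute arbitrary code.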