Memory-based Collaborative Filtering using Cosine Distance and kNN

Recommender systems are an integral part of many online systems, from e-commerce to online streaming platforms.
Recommender systems use the past purchase patterns of their users to predict which other products they may be interested in and likely to purchase. Recommending the right products gives a significant advantage to the business. For some businesses, a major portion of the revenue is generated through recommendations.


Collaborative filtering is very popular on online streaming platforms and e-commerce sites, where the customer interacts with each product (a movie, a song or a consumer product) by liking or disliking it or by giving it a rating of some sort.
One requirement for applying collaborative filtering is that a sufficient number of products have ratings associated with them. User interaction is required.




This notebook walks through the implementation of collaborative filtering using a memory-based technique of distance proximity, using cosine distances and nearest neighbours.

## Importing libraries and initial data checks

In [1]:
# import required libraries
import pandas as pd
import numpy as np

### About the data

This dataset contains over 2 million customer reviews and ratings of beauty products sold on Amazon's website.

It contains:
- the unique UserId (Customer Identification),
- the product ASIN (Amazon's unique product identification code for each product),
- Ratings (ranging from 1-5 based on customer satisfaction) and
- the Timestamp of the rating (in UNIX time)
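
The Timestamp column stores seconds since the UNIX epoch. As a small sketch, pandas can turn such values into calendar dates. The two sample values are taken from the first rows of the dataset; the column name `rated_at` is made up for this illustration and is not used elsewhere in the notebook.

```python
import pandas as pd

# Two sample UNIX timestamps (seconds since 1970-01-01 UTC)
sample = pd.DataFrame({"Timestamp": [1369699200, 1355443200]})

# unit="s" tells pandas the integers are seconds, not nanoseconds
sample["rated_at"] = pd.to_datetime(sample["Timestamp"], unit="s")
print(sample)
```

The timestamps are not needed for the similarity computation below, but readable dates make it easier to sanity-check the data.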

In [2]:
# read the dataset
#df = pd.read_csv('ratings_Beauty.csv')
df = pd.read_csv('https://s3.amazonaws.com/hackerday.datascience/725/ratings_Beauty.csv')
df.shape

(2023070, 4)

In [3]:
# check the first 5 rows
df.head()

Unnamed: 0,UserId,ProductId,Rating,Timestamp
0,A39HTATAQ9V7YF,205616461,5.0,1369699200
1,A3JM6GV9MNOF9X,558925278,3.0,1355443200
2,A1Z513UWSAAO0F,558925278,5.0,1404691200
3,A1WMRR494NWEWV,733001998,4.0,1382572800
4,A3IAAVS479H7M7,737104473,1.0,1274227200


Check if there are any duplicate values present

In [4]:
duplicates = df.duplicated(["UserId","ProductId", "Rating", "Timestamp"]).sum()
print(' Duplicate records: ',duplicates)


 Duplicate records:  0
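
The check above only finds rows that are identical in all four columns. A user could still rate the same product more than once with a different rating or timestamp, and a pivot table has to collapse such pairs into a single cell. The sketch below uses made-up toy data (the IDs are invented) to show how the pair-level check differs, and one possible way to keep only the latest rating per pair.

```python
import pandas as pd

# Toy data: user u1 rated product p1 twice, at different times
toy = pd.DataFrame({
    "UserId":    ["u1", "u1", "u2"],
    "ProductId": ["p1", "p1", "p2"],
    "Rating":    [5.0, 3.0, 4.0],
    "Timestamp": [100, 200, 300],
})

# duplicates over all four columns (the check used in this notebook)
full_row_dupes = toy.duplicated(["UserId", "ProductId", "Rating", "Timestamp"]).sum()
# duplicates over the user-product pair only
pair_dupes = toy.duplicated(["UserId", "ProductId"]).sum()
print(full_row_dupes, pair_dupes)

# one option: keep the most recent rating for each user-product pair
latest = toy.sort_values("Timestamp").drop_duplicates(["UserId", "ProductId"], keep="last")
print(latest)
```

Whether to keep the latest rating or the average of repeated ratings is a design choice; `pd.pivot_table` averages them by default.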


See the number of unique values present

In [5]:
print('unique users:',len(df.UserId.unique()))
print('unique products:',len(df.ProductId.unique()))
print("total ratings: ",df.shape[0])


unique users: 1210271
unique products: 249274
total ratings:  2023070


Check for null values

In [6]:
df.isnull().any()

Unnamed: 0,0
UserId,False
ProductId,False
Rating,False
Timestamp,False


Number of rated products per user

In [7]:
products_user= df.groupby(by = "UserId")["Rating"].count().sort_values(ascending =False)
products_user.head()

Unnamed: 0_level_0,Rating
UserId,Unnamed: 1_level_1
A3KEZLJ59C1JVH,389
A281NPSIMI1C2R,336
A3M174IC0VXOS2,326
A2V5R832QCSOMX,278
A3LJLRIZL38GG3,276


Number of ratings per product

In [8]:
product_rated = df.groupby(by = "ProductId")["Rating"].count().sort_values(ascending = False)
product_rated.head()


Unnamed: 0_level_0,Rating
ProductId,Unnamed: 1_level_1
B001MA0QY2,7533
B0009V1YR8,2869
B0043OYFKU,2477
B0000YUXI0,2143
B003V265QW,2088


Number of products rated by each user

In [9]:
rated_users=df.groupby("UserId")["ProductId"].count().sort_values(ascending=False)
print(rated_users)

UserId
A3KEZLJ59C1JVH           389
A281NPSIMI1C2R           336
A3M174IC0VXOS2           326
A2V5R832QCSOMX           278
A3LJLRIZL38GG3           276
                        ... 
A0014588315DWN77627MY      1
A00144702V3Q8N2EJ3S2G      1
A00126503SUWI86KZBMIN      1
A001235832OWO8HZGS1KC      1
A00120381FL204MYH7G3B      1
Name: ProductId, Length: 1210271, dtype: int64


In [10]:
rated_products=df.groupby("ProductId")["UserId"].count().sort_values(ascending=False)
print(rated_products)

ProductId
B001MA0QY2    7533
B0009V1YR8    2869
B0043OYFKU    2477
B0000YUXI0    2143
B003V265QW    2088
              ... 
B00LIF47O6       1
B00LH81A0I       1
B00LH50A0C       1
B00LH4LD1I       1
B00LH28Q88       1
Name: UserId, Length: 249274, dtype: int64


Number of products with some minimum ratings

In [11]:
print('Number of products with more than 5 reviews/ratings:',rated_products[rated_products>5].count())
print('Number of products with more than 4 reviews/ratings:',rated_products[rated_products>4].count())
print('Number of products with more than 3 reviews/ratings:',rated_products[rated_products>3].count())
print('Number of products with more than 2 reviews/ratings:',rated_products[rated_products>2].count())
print('Number of products with more than 1 reviews/ratings:',rated_products[rated_products>1].count())

Number of products with more than 5 reviews/ratings: 57722
Number of products with more than 4 reviews/ratings: 67345
Number of products with more than 3 reviews/ratings: 81247
Number of products with more than 2 reviews/ratings: 103581
Number of products with more than 1 reviews/ratings: 145790


## Visualizing the data

In [12]:
# plot the data
import plotly.graph_objects as go
index = ['Total size of records', "Number of unique users","Number of unique products"]
values =[len(df),len(df['UserId'].unique()),len(df['ProductId'].unique())]

plot = go.Figure([go.Bar(x=index, y=values,textposition='auto')])
plot.update_layout(title_text='Number of Users and Products w.r.to Total size of Data',
                    xaxis_title="Records",
                    yaxis_title="Total number of Records")

plot.show()


### The ratings given by users

In [13]:
print("Range of Ratings: ", df['Rating'].value_counts())
print(list(df['Rating'].value_counts()))

values = list(df['Rating'].value_counts())

plot = go.Figure([go.Bar(x = df['Rating'].value_counts().index, y = values,textposition='auto')])

plot.update_layout(title_text='Ratings given by user',
                    xaxis_title="Rating",
                    yaxis_title="Total number of Ratings")

plot.show()


Range of Ratings:  Rating
5.0    1248721
4.0     307740
1.0     183784
3.0     169791
2.0     113034
Name: count, dtype: int64
[1248721, 307740, 183784, 169791, 113034]


### Products which are most popular

In [14]:
top_products = df['ProductId'].value_counts().nlargest(5)
print("Products that occurred the most: \n",top_products)

# plot only the five most-rated products so that x and y have the same length
plot = go.Figure([go.Bar(x = top_products.index, y = top_products.values,textposition='auto')])

plot.update_layout(title_text='Most rated products',
                    xaxis_title="ProductID",
                    yaxis_title="Number of times occurred in the data")

plot.show()


Products that occurred the most: 
 ProductId
B001MA0QY2    7533
B0009V1YR8    2869
B0043OYFKU    2477
B0000YUXI0    2143
B003V265QW    2088
Name: count, dtype: int64


### Number of ratings given by each user


In [15]:
ratings_per_user = df.groupby('UserId')['Rating'].count().sort_values(ascending=False)
print("Number of ratings given by each user: ",ratings_per_user.head())

plot = go.Figure(data=[go.Histogram(x=ratings_per_user)])
plot.show()


Number of ratings given by each user:  UserId
A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276
Name: Rating, dtype: int64


In [16]:
ratings_per_product = df.groupby('ProductId')['Rating'].count().sort_values(ascending=False)

plot = go.Figure(data=[go.Histogram(x=ratings_per_product)])
# layout options belong in update_layout(); show() only renders the figure
plot.update_layout(title_text='Number of ratings per product',
                    xaxis_title="Number of ratings",
                    yaxis_title="Number of products")
plot.show()

In [17]:
ratings_per_product = df.groupby('ProductId')['Rating'].count().sort_values(ascending=False)

# same histogram, restricted to the 2000 most-rated products
plot = go.Figure(data=[go.Histogram(x=ratings_per_product.nlargest(2000))])
# layout options belong in update_layout(); show() only renders the figure
plot.update_layout(title_text='Number of ratings per product (2000 most-rated products)',
                    xaxis_title="Number of ratings",
                    yaxis_title="Number of products")
plot.show()

### Products with very few ratings


In [18]:

rating_of_products = df.groupby('ProductId')['Rating'].count()
# convert to a dataframe to analyse the data
number_of_ratings_given = pd.DataFrame(rating_of_products)
print("Products with ratings given by users: \n",number_of_ratings_given.head())

less_than_ten = []
less_than_fifty_greater_than_ten = []
greater_than_fifty_less_than_hundred = []
greater_than_hundred = []
ratings_counts = []

for rating in number_of_ratings_given['Rating']:
    # each product falls into exactly one bucket, based on how many ratings it received
    if rating <= 10:
        less_than_ten.append(rating)
    elif rating <= 50:
        less_than_fifty_greater_than_ten.append(rating)
    elif rating <= 100:
        greater_than_fifty_less_than_hundred.append(rating)
    else:
        greater_than_hundred.append(rating)

    ratings_counts.append(rating)

print("Ratings_count_less_than_ten: ", len(less_than_ten))
print("Ratings_count_greater_than_ten_less_than_fifty: ", len(less_than_fifty_greater_than_ten))
print("Ratings_count_greater_than_fifty_less_than_hundred: ", len(greater_than_fifty_less_than_hundred))
print("Ratings_count_greater_than_hundred: ", len(greater_than_hundred))
print("Average number of ratings per product: ", np.mean(ratings_counts))



Products with ratings given by users: 
             Rating
ProductId         
0205616461       1
0558925278       2
0733001998       1
0737104473       1
0762451459       1
Ratings_count_less_than_ten:  215395
Ratings_count_greater_than_ten_less_than_fifty:  27082
Ratings_count_greater_than_fifty_less_than_hundred:  4110
Ratings_count_greater_than_hundred:  2687
Average number of ratings per product:  8.115848423822781
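
The same bucketing can be written without an explicit loop. The following is only a sketch on made-up counts, using `pd.cut` with the same boundaries as the loop above (10, 50 and 100); the bucket labels are chosen here for readability.

```python
import pandas as pd

# Toy rating counts per product (made-up numbers)
counts = pd.Series([1, 10, 11, 50, 51, 100, 101, 7533], name="Rating")

# each bin includes its upper edge by default: (0, 10], (10, 50], (50, 100], (100, inf]
bins = [0, 10, 50, 100, float("inf")]
labels = ["<=10", "11-50", "51-100", ">100"]
buckets = pd.cut(counts, bins=bins, labels=labels)

# sort=False keeps the buckets in the order of the labels
bucket_sizes = buckets.value_counts(sort=False)
print(bucket_sizes)
```

On the toy data every bucket ends up with two products, because each boundary value (10, 50, 100) is counted in the lower bucket.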


In [19]:
x_values = ["Ratings_count_less_than_ten","Ratings_count_greater_than_ten_less_than_fifty",
           "Ratings_count_greater_than_fifty_less_than_hundred","Ratings_count_greater_than_hundred"]
y_values = [len(less_than_ten),len(less_than_fifty_greater_than_ten),len(greater_than_fifty_less_than_hundred),
            len(greater_than_hundred)]


plot = go.Figure([go.Bar(x = x_values, y = y_values, textposition='auto')])

plot.add_annotation(
        x=1,
        y=100000,
        xref="x",
        yref="y")

plot.update_layout(title_text='Ratings Count on Products',
                    xaxis_title="Ratings Range",
                    yaxis_title="Count of Rating")
plot.show()


In [20]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()


### Converting alphanumeric IDs to numeric codes

In [21]:
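# work on a copy so that the original dataframe is left unchanged
dataset = df.copy()
# the same encoder object is refitted for each column; after these two calls it only remembers the product classes
dataset['user'] = label_encoder.fit_transform(df['UserId'])
dataset['product'] = label_encoder.fit_transform(df['ProductId'])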
dataset.head()


Unnamed: 0,UserId,ProductId,Rating,Timestamp,user,product
0,A39HTATAQ9V7YF,205616461,5.0,1369699200,725046,0
1,A3JM6GV9MNOF9X,558925278,3.0,1355443200,814606,1
2,A1Z513UWSAAO0F,558925278,5.0,1404691200,313101,1
3,A1WMRR494NWEWV,733001998,4.0,1382572800,291075,2
4,A3IAAVS479H7M7,737104473,1.0,1274227200,802842,3
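
The recommendations produced later are expressed as integer product codes, so it helps to keep a way back to the original product ID. The sketch below uses `pd.factorize` on a few sample IDs; this is an alternative to `LabelEncoder`, not what the notebook itself uses. With `sort=True` the codes follow the sorted order of the unique values, which is also how `LabelEncoder` assigns them.

```python
import pandas as pd

# A few sample product IDs (one of them repeated)
product_ids = pd.Series(["B001MA0QY2", "0558925278", "B001MA0QY2", "0205616461"])

# codes: one integer per row; uniques: the original ID for each code
codes, uniques = pd.factorize(product_ids, sort=True)
print(codes)
print(uniques)

# decoding is a plain lookup: position i of uniques holds the ID for code i
decoded = uniques[codes]
print(list(decoded))
```

Keeping `uniques` (or a separate fitted encoder per column) makes it possible to report recommended products by their original ID.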


In [22]:

# average rating given by each user
average_rating = dataset.groupby(by="user", as_index=False)['Rating'].mean()
print("Average rating given by users: \n",average_rating.head())
print("----------------------------------------------------------\n")


# let's merge it with the dataset as we will be using that later
dataset = pd.merge(dataset, average_rating, on="user")
print("Modified dataset: \n", dataset.head())
print("----------------------------------------------------------\n")

# renaming columns
dataset = dataset.rename(columns={"Rating_x": "real_rating", "Rating_y": "average_rating"})
print("Dataset: \n", dataset.head())
print("----------------------------------------------------------\n")


Average rating given by users: 
    user  Rating
0     0     5.0
1     1     5.0
2     2     3.0
3     3     5.0
4     4     5.0
----------------------------------------------------------

Modified dataset: 
            UserId   ProductId  Rating_x   Timestamp    user  product  Rating_y
0  A39HTATAQ9V7YF  0205616461       5.0  1369699200  725046        0      4.25
1  A3JM6GV9MNOF9X  0558925278       3.0  1355443200  814606        1      3.50
2  A1Z513UWSAAO0F  0558925278       5.0  1404691200  313101        1      5.00
3  A1WMRR494NWEWV  0733001998       4.0  1382572800  291075        2      4.00
4  A3IAAVS479H7M7  0737104473       1.0  1274227200  802842        3      1.00
----------------------------------------------------------

Dataset: 
            UserId   ProductId  real_rating   Timestamp    user  product  \
0  A39HTATAQ9V7YF  0205616461          5.0  1369699200  725046        0   
1  A3JM6GV9MNOF9X  0558925278          3.0  1355443200  814606        1   
2  A1Z513UWSAAO0F  05

Certain users tend to give higher ratings while others tend to give lower ratings. To negate this bias, we normalise the ratings by subtracting each user's average rating from the ratings that user gave.
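
As a worked example of this normalisation on made-up numbers: user `a` rates generously and user `b` rates harshly, but after subtracting each user's own mean both end up on the same scale. The sketch uses `groupby(...).transform("mean")`, which is one possible way to attach the per-user mean to every row.

```python
import pandas as pd

# Toy ratings: "a" is a generous rater, "b" a harsh one
toy = pd.DataFrame({
    "user":   ["a", "a", "b", "b"],
    "Rating": [5.0, 4.0, 2.0, 1.0],
})

# transform("mean") returns the per-user mean aligned to each original row
toy["average_rating"] = toy.groupby("user")["Rating"].transform("mean")
toy["normalized_rating"] = toy["Rating"] - toy["average_rating"]
print(toy)
```

Note that a user with a single rating always gets a normalised rating of 0, because that rating is equal to the user's own mean.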

In [23]:
dataset['normalized_rating'] = dataset['real_rating'] - dataset['average_rating']
print("Data with adjusted rating: \n", dataset.head())


Data with adjusted rating: 
            UserId   ProductId  real_rating   Timestamp    user  product  \
0  A39HTATAQ9V7YF  0205616461          5.0  1369699200  725046        0   
1  A3JM6GV9MNOF9X  0558925278          3.0  1355443200  814606        1   
2  A1Z513UWSAAO0F  0558925278          5.0  1404691200  313101        1   
3  A1WMRR494NWEWV  0733001998          4.0  1382572800  291075        2   
4  A3IAAVS479H7M7  0737104473          1.0  1274227200  802842        3   

   average_rating  normalized_rating  
0            4.25               0.75  
1            3.50              -0.50  
2            5.00               0.00  
3            4.00               0.00  
4            1.00               0.00  


# Cosine Similarity

We use cosine similarity, a distance-based metric, to identify similar users. Before computing it, we first remove products that have a very low number of ratings.
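
Cosine similarity between two rating vectors u and v is their dot product divided by the product of their lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). The sketch below computes it with NumPy on made-up vectors. The helper name `cosine_sim` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

u1 = [1.0, -1.0, 0.0]
u2 = [2.0, -2.0, 0.0]   # same direction as u1, different magnitude
u3 = [-1.0, 1.0, 0.0]   # opposite direction to u1
u4 = [0.0, 0.0, 0.0]    # a user with no information in the matrix

print(cosine_sim(u1, u2))
print(cosine_sim(u1, u3))
print(cosine_sim(u1, u4))
```

Only the direction of the vectors matters, not their length, so `u1` and `u2` count as identical in taste while `u1` and `u3` count as opposite.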

## Filter based on number of ratings available

In [24]:
rating_of_product = dataset.groupby('product')['real_rating'].count() # apply groupby
ratings_of_products_df = pd.DataFrame(rating_of_product)
print("Real ratings:\n",ratings_of_products_df.head()) # check for real rating for products


Real ratings:
          real_rating
product             
0                  1
1                  2
2                  1
3                  1
4                  1


In [25]:
filtered_ratings_per_product = ratings_of_products_df[ratings_of_products_df.real_rating >= 200]
print(filtered_ratings_per_product.head())
print(filtered_ratings_per_product.shape)

         real_rating
product             
704              558
719              377
754              288
834              412
843              313
(934, 1)


In [26]:
# build a list of products to keep
popular_products = filtered_ratings_per_product.index.tolist()
print("Number of popular products (at least 200 ratings): ",len(popular_products))
print("--------------------------------------------------------------------------------")

filtered_ratings_data = dataset[dataset["product"].isin(popular_products)]
print("Filtered rated product in the dataset: \n",filtered_ratings_data.head())
print("---------------------------------------------------------------------------------")

print("The size of dataset has changed from ", len(dataset), " to ", len(filtered_ratings_data))
print("---------------------------------------------------------------------------------")

Number of popular products (at least 200 ratings):  934
--------------------------------------------------------------------------------
Filtered rated product in the dataset: 
               UserId   ProductId  real_rating   Timestamp     user  product  \
2589   AIQX2510USU1W  B00004TUBL          5.0  1163289600  1057161      704   
2590  A2AOLV77AF11M2  B00004TUBL          5.0  1334793600   415068      704   
2591  A3B1V3AUUV13R6  B00004TUBL          5.0  1375228800   738916      704   
2592  A3DDQR5XWQFLRE  B00004TUBL          5.0  1363132800   759568      704   
2593   AT8JPRINM298P  B00004TUBL          5.0  1366243200  1150521      704   

      average_rating  normalized_rating  
2589        5.000000           0.000000  
2590        5.000000           0.000000  
2591        3.500000           1.500000  
2592        4.333333           0.666667  
2593        5.000000           0.000000  
------------------------------------------------------------------------------

## Creating the User-item matrix

In [27]:
similarity = pd.pivot_table(filtered_ratings_data,values='normalized_rating',index='UserId',columns='product')
similarity = similarity.fillna(0)
print("Updated Dataset: \n",similarity.head())

Updated Dataset: 
 product                704     719     754     834     843     858     861     \
UserId                                                                          
A0010876CNE3ILIM9HV0      0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A0011102257KBXODKL24I     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A00120381FL204MYH7G3B     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A00126503SUWI86KZBMIN     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A001573229XK5T8PI0OKA     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

product                873     944     981     ...  241604  242018  242048  \
UserId                                         ...                           
A0010876CNE3ILIM9HV0      0.0     0.0     0.0  ...     0.0     0.0     0.0   
A0011102257KBXODKL24I     0.0     0.0     0.0  ...     0.0     0.0     0.0   
A00120381FL204MYH7G3B     0.0     0.0     0.0  ...     0.0     0.0     0.0   
A00126503SUWI86KZBMIN  

As you can see, this is a very sparse matrix: almost every entry shown is 0.
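
How sparse the matrix is can be estimated from the long-format ratings without building the matrix at all. The sketch below uses made-up toy data with the same column names as the filtered dataset, and it assumes there is at most one row per user-product pair.

```python
import pandas as pd

# Toy long-format ratings: 4 ratings, 3 users, 3 products
ratings = pd.DataFrame({
    "UserId":  ["u1", "u1", "u2", "u3"],
    "product": [10, 20, 10, 30],
    "normalized_rating": [0.5, -0.5, 1.0, -1.0],
})

n_users = ratings["UserId"].nunique()
n_products = ratings["product"].nunique()

# fraction of user-product cells that hold an actual rating
density = len(ratings) / (n_users * n_products)
print(f"{n_users} users x {n_products} products, density = {density:.3f}")
```

A low density means most cells of the pivot table are filled with 0 by `fillna(0)` rather than by a real rating.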

In [28]:
from sklearn.metrics.pairwise import cosine_similarity
import operator

In [29]:
selecting_users = list(similarity.index)
selecting_users = selecting_users[:100]
print("You can select users from the below list:\n",selecting_users)

You can select users from the below list:
 ['A0010876CNE3ILIM9HV0', 'A0011102257KBXODKL24I', 'A00120381FL204MYH7G3B', 'A00126503SUWI86KZBMIN', 'A001573229XK5T8PI0OKA', 'A00203203EBR4E6BIUOKF', 'A00222842T0ZYI86C9LHU', 'A00258542AL4VKETFLGIJ', 'A00259242VSCRZPGIWP0M', 'A00262022JQPXX5SXEVJR', 'A00275441WYR3489IKNAB', 'A00328401T70RFN4P1IT6', 'A00349462AOAVUUPEJNQZ', 'A00370223FX3K9TUF1QCL', 'A00407141VL6SB77B1GGG', 'A00414041RD0BXM6WK0GX', 'A00426443G4MEWS3K1XFA', 'A004511036AHSSV5O4SBY', 'A00454102SR84NOYTI0JS', 'A00463203QYS5I5X6MMXW', 'A00473363TJ8YSZ3YAGG9', 'A00491723IYKW5UI74AEX', 'A0058336347PC7BSR0UJC', 'A00612582Z6ZU2SDMRQ07', 'A00615442TZG6MHZXJOIZ', 'A00627983P6OGUFJ3IW8H', 'A006502622TE53S3J9W6H', 'A00656692CXO0VGF00V9I', 'A006680338J29DP17XALU', 'A00669491O55AKJ5QVH9L', 'A00679332RYOO5406ARSG', 'A00700212KB3K0MVESPIY', 'A0072717335KA6520NEMI', 'A0074075A8TZJIPLGZEK', 'A00773851NXKGCZRY43PG', 'A0078719IR14X3NNUG0F', 'A00802872RVW2KLY6DAL0', 'A008374338GH2TUB0S8KP', 'A0085249

In [30]:
def getting_top_5_similar_users(user_id, similarity_table, k=5):
    '''
    Find the users whose rating vectors are most similar to the given user's.

    :param user_id: the user we want to recommend
    :param similarity_table: the user-item matrix (users as rows, products as columns)
    :param k: number of similar users to return
    :return: list of the k user IDs most similar to user_id
    '''

    # create a dataframe of just the current user
    user = similarity_table[similarity_table.index == user_id]
    # and a dataframe of all other users
    other_users = similarity_table[similarity_table.index != user_id]
    # calculate cosine similarity between user and each other user
    similarities = cosine_similarity(user, other_users)[0].tolist()

    indices = other_users.index.tolist()
    index_similarity = dict(zip(indices, similarities))

    # sort by similarity, highest first
    # (ascending sort followed by reverse: users with tied scores come out in reverse index order)
    index_similarity_sorted = sorted(index_similarity.items(), key=operator.itemgetter(1))
    index_similarity_sorted.reverse()

    # keep the IDs of the k most similar users
    top_users_similarities = index_similarity_sorted[:k]
    users = [other_user for other_user, _ in top_users_similarities]

    return users


In [31]:
user_id = "A0010876CNE3ILIM9HV0"
similar_users = getting_top_5_similar_users(user_id, similarity)


In [32]:
print("Top 5 similar users for user_id:",user_id," are: ",similar_users)

Top 5 similar users for user_id: A0010876CNE3ILIM9HV0  are:  ['AXNF1BLDR4P47', 'ARTHT19OB79VZ', 'ARQ9I3Y0VPB6N', 'AOXEXSN7M9ENJ', 'AN0AO97264HP4']
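
When many users share the same similarity score, which of them land in the top k depends only on how ties are broken. The function above sorts ascending and then reverses the list, so tied users come out in reverse index order; the user IDs returned here may be a sign of that, although the notebook does not print the scores to confirm it. The sketch below shows one way to make the tie order explicit and to count how many candidates share the cut-off score. The helper names are hypothetical.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores; tied scores keep their original order."""
    scores = np.asarray(scores, dtype=float)
    # a stable sort on the negated scores gives descending order with ties in input order
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

def ties_at_cutoff(scores, k):
    """Number of candidates whose score equals the k-th best score."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.sort(scores)[::-1][k - 1]
    return int(np.sum(scores == cutoff))

sims = [0.0, 1.0, 0.3, 1.0, 0.0]
print(top_k_indices(sims, 3))
print(ties_at_cutoff(sims, 2))
```

If the number of ties at the cut-off is much larger than k, the selected neighbours are essentially arbitrary and the similarity scores deserve a closer look.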


## Recommend products based on these top similar users

In [33]:
def getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity_table, top_recommendations=5):
    '''
    Recommend the products that the similar users rated highest on average.

    :param user_id: user for whom we want to recommend
    :param similar_users: IDs of the most similar users
    :param similarity_table: the user-item matrix
    :param top_recommendations: no. of recommendations
    :return: list of recommended product codes
    '''

    # rows of the user-item matrix that belong to the similar users
    similar_users = similarity_table[similarity_table.index.isin(similar_users)]

    # mean normalized rating per product across the similar users
    similar_users = similar_users.mean(axis=0)
    similar_users_df = pd.DataFrame(similar_users, columns=['mean'])

    # row of the current user, transposed so that it is easier to filter
    user_df = similarity_table[similarity_table.index == user_id]
    user_df_transposed = user_df.transpose()

    # rename the column as 'rating'
    user_df_transposed.columns = ['rating']

    # keep rows with a 0 value; these are treated as "not rated"
    # (a rating equal to the user's own average also normalizes to 0 and is kept here)
    user_df_transposed = user_df_transposed[user_df_transposed['rating'] == 0]

    # generate a list of products the user has not rated
    products_not_rated = user_df_transposed.index.tolist()

    # filter avg ratings of similar users for only products the current user has not rated
    similar_users_df_filtered = similar_users_df[similar_users_df.index.isin(products_not_rated)]

    # order the dataframe by mean rating, highest first
    similar_users_df_ordered = similar_users_df_filtered.sort_values(by=['mean'], ascending=False)

    # take the top products
    top_products = similar_users_df_ordered.head(top_recommendations)
    top_products_indices = top_products.index.tolist()

    return top_products_indices



In [34]:
print("Top 5 productID recommended are: ",
      getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity))


Top 5 productID recommended are:  [249211, 704, 719, 754, 834]
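
The function above averages the neighbours' normalised ratings with equal weight. A common variant in user-based collaborative filtering weights each neighbour by its similarity to the target user, so that closer neighbours count more. The sketch below shows that idea on made-up numbers; the helper `weighted_scores` is hypothetical and is not part of this notebook.

```python
import numpy as np

def weighted_scores(neighbour_ratings, neighbour_sims):
    """Similarity-weighted average of the neighbours' mean-centred ratings, per product."""
    R = np.asarray(neighbour_ratings, dtype=float)  # shape (n_neighbours, n_products)
    w = np.asarray(neighbour_sims, dtype=float)     # shape (n_neighbours,)
    denom = np.abs(w).sum()
    if denom == 0:
        # no neighbour carries any weight, so there is nothing to score
        return np.zeros(R.shape[1])
    return (w @ R) / denom

# Two neighbours, three products (0 stands for "not rated", as in the pivot table)
R = [[1.0, 0.0, -1.0],
     [0.5, 0.0,  0.5]]
sims = [0.8, 0.2]

scores = weighted_scores(R, sims)
print(scores)
```

Adding the target user's own average rating back to a score turns it into a predicted rating on the original 1 to 5 scale.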


In [35]:
filtered_ratings_data.shape

(370511, 8)

In [36]:
filtered_ratings_data.head()

Unnamed: 0,UserId,ProductId,real_rating,Timestamp,user,product,average_rating,normalized_rating
2589,AIQX2510USU1W,B00004TUBL,5.0,1163289600,1057161,704,5.0,0.0
2590,A2AOLV77AF11M2,B00004TUBL,5.0,1334793600,415068,704,5.0,0.0
2591,A3B1V3AUUV13R6,B00004TUBL,5.0,1375228800,738916,704,3.5,1.5
2592,A3DDQR5XWQFLRE,B00004TUBL,5.0,1363132800,759568,704,4.333333,0.666667
2593,AT8JPRINM298P,B00004TUBL,5.0,1366243200,1150521,704,5.0,0.0


In [37]:
filtered_ratings_data[filtered_ratings_data['UserId']=="A0010876CNE3ILIM9HV0"]

Unnamed: 0,UserId,ProductId,real_rating,Timestamp,user,product,average_rating,normalized_rating
1363214,A0010876CNE3ILIM9HV0,B0055MYJ0U,1.0,1390521600,11,136012,2.5,-1.5


In [38]:
from sklearn.model_selection import train_test_split
# random 80/20 split of the rating rows; no random_state is set, so the split differs between runs
train_data, test_data = train_test_split(filtered_ratings_data,test_size=0.2)

train_data = pd.DataFrame(train_data)
test_data = pd.DataFrame(test_data)

In [39]:
similarity = pd.pivot_table(train_data,values='normalized_rating',index='UserId',columns='product')
similarity = similarity.fillna(0)
print("Updated Dataset: \n",similarity.head())

Updated Dataset: 
 product                704     719     754     834     843     858     861     \
UserId                                                                          
A0010876CNE3ILIM9HV0      0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A0011102257KBXODKL24I     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A00120381FL204MYH7G3B     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A00126503SUWI86KZBMIN     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
A001573229XK5T8PI0OKA     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

product                873     944     981     ...  241604  242018  242048  \
UserId                                         ...                           
A0010876CNE3ILIM9HV0      0.0     0.0     0.0  ...     0.0     0.0     0.0   
A0011102257KBXODKL24I     0.0     0.0     0.0  ...     0.0     0.0     0.0   
A00120381FL204MYH7G3B     0.0     0.0     0.0  ...     0.0     0.0     0.0   
A00126503SUWI86KZBMIN  

In [40]:
similarity.shape

(251871, 934)

In [41]:
selecting_users = list(similarity.index)
selecting_users = selecting_users[:100]
print("You can select users from the below list:\n",selecting_users)

You can select users from the below list:
 ['A0010876CNE3ILIM9HV0', 'A0011102257KBXODKL24I', 'A00120381FL204MYH7G3B', 'A00126503SUWI86KZBMIN', 'A001573229XK5T8PI0OKA', 'A00222842T0ZYI86C9LHU', 'A00258542AL4VKETFLGIJ', 'A00259242VSCRZPGIWP0M', 'A00275441WYR3489IKNAB', 'A00349462AOAVUUPEJNQZ', 'A00407141VL6SB77B1GGG', 'A00414041RD0BXM6WK0GX', 'A00426443G4MEWS3K1XFA', 'A004511036AHSSV5O4SBY', 'A00454102SR84NOYTI0JS', 'A00463203QYS5I5X6MMXW', 'A00473363TJ8YSZ3YAGG9', 'A00491723IYKW5UI74AEX', 'A0058336347PC7BSR0UJC', 'A00615442TZG6MHZXJOIZ', 'A00627983P6OGUFJ3IW8H', 'A006502622TE53S3J9W6H', 'A00656692CXO0VGF00V9I', 'A006680338J29DP17XALU', 'A00669491O55AKJ5QVH9L', 'A00679332RYOO5406ARSG', 'A00700212KB3K0MVESPIY', 'A0072717335KA6520NEMI', 'A0074075A8TZJIPLGZEK', 'A0078719IR14X3NNUG0F', 'A00802872RVW2KLY6DAL0', 'A008374338GH2TUB0S8KP', 'A0086401DFJEZA4RT4OL', 'A0090635250IP002KMMIX', 'A01026292DKV5RYUH42C9', 'A01032093UTJ2SF3EQFS1', 'A010356935P3D9IEDEUIN', 'A010407538LRAQYK3G2RZ', 'A01092453

In [42]:
user_id = "A02720223TDVZSWVZYFN7"
similar_users = getting_top_5_similar_users(user_id, similarity)

In [43]:
print("Top 5 similar users for user_id:",user_id," are: ",similar_users)

Top 5 similar users for user_id: A02720223TDVZSWVZYFN7  are:  ['AZZZRS1YZ8HVP', 'AZZZLM1E5JJ8C', 'AZZZKHVV482YT', 'AZZYW4YOE1B6E', 'AZZWMH759YWOO']


In [44]:
print("Top 5 productID recommended are: ",
      getting_top_5_recommendations_based_on_users(user_id, similar_users, similarity))

Top 5 productID recommended are:  [27327, 149282, 119018, 119506, 119742]


In [45]:
test_data.shape

(74103, 8)

In [46]:
len(test_data.user.unique())

69994
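
Because the split is made row by row, some users in the test set may have no rows in the training set and therefore no row in the user-item matrix built from it. The sketch below, on made-up IDs, shows one way to measure how many test rows belong to users that are known from training; `train_users` stands in for `similarity.index`.

```python
import pandas as pd

# stands in for similarity.index, i.e. the users present in the training matrix
train_users = pd.Index(["u1", "u2", "u3"])
test = pd.DataFrame({"UserId": ["u2", "u4", "u3", "u5"]})

# True where the test row's user also appears in the training matrix
known = test["UserId"].isin(train_users)
coverage = known.mean()
unknown_users = test.loc[~known, "UserId"].tolist()

print(f"share of test rows with a known user: {coverage:.2f}")
print("users without training data:", unknown_users)
```

Users without training data have no rating vector to compare against, so a user-based approach cannot produce neighbours for them.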

In [47]:
test_data.UserId

Unnamed: 0,UserId
352488,A2OI109I9ARL5G
315877,ATVANI7AU076P
158248,A1W7SPGD458CHS
200245,A1CACDKW0B9613
775648,A2I5JMVDTAYLCA
...,...
907415,A3OBHTUDRHW2FT
1115029,AW8UW3OFEZR8T
1245713,A1NU56Q21LEVCJ
1541691,A39XRHV3F3X0P4


In [48]:
test_data.head()

Unnamed: 0,UserId,ProductId,real_rating,Timestamp,user,product,average_rating,normalized_rating
352488,A2OI109I9ARL5G,B000O3OZD6,2.0,1375056000,538309,27745,3.5,-1.5
315877,ATVANI7AU076P,B000KVH1QU,5.0,1384819200,1156190,23784,3.0,2.0
158248,A1W7SPGD458CHS,B0009V1YR8,5.0,1222128000,287295,10516,5.0,0.0
200245,A1CACDKW0B9613,B000C1Z2D2,5.0,1324598400,110267,13509,5.0,0.0
775648,A2I5JMVDTAYLCA,B001RMP7M6,5.0,1376438400,481738,69175,4.4,0.6


In [49]:
def recommend_products_for_user(userId, similarity_matrix):
    # a user without a row in the matrix has no rating vector to compare against
    if userId not in similarity_matrix.index:
        return []
    # use the function arguments rather than the global user_id and similarity variables
    similar_users = getting_top_5_similar_users(userId, similarity_matrix)
    product_list = getting_top_5_recommendations_based_on_users(userId, similar_users, similarity_matrix)
    return product_list

In [50]:
recommend_products_for_user("A2XVNI270N97GL", similarity)

### Conclusion

Recommender systems are a powerful technology that adds to a business's value. Some businesses thrive on their recommender systems. They help the business by creating more sales, and they help the end user by enabling them to find items they like.