# What is Collaborative Filtering?

Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something.

Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.

Collaborative filtering filters information by using the interactions and data collected by the system from other users. For example when we want to find a new movie to watch we'll often ask our friends for recommendations.

Naturally, we have greater trust in the recommendations from friends who share tastes similar to our own. Collaborative filtering does the same job. Collaborative filtering mostly focuses on finding similarity between users and recommend each other their likes. There are various ways to find the similarity measure : Cosine similarity, Pearson similarity, Jaccard similarity etc.

# Importing required libraries

In [1]:
# Installing surprise Library
!pip install scikit-surprise





In [1]:
# Importing basic libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import random
from IPython.display import Image

# Importing scipy.sparse.csr_matrix for kNN data preparation
from scipy.sparse import csr_matrix

# Importing kNN algorithm
from sklearn.neighbors import NearestNeighbors

# Importing cosine_similarity to calculate cosine similarity in memory based collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Importing surprise.Reader,Dataset for surprise data preparation
from surprise import Reader, Dataset

# Importing for surprise model customizations
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV

In [3]:
# Importing algorithms from Surprise package
from surprise.prediction_algorithms import CoClustering
from surprise.prediction_algorithms import NMF

# Importing accuracy to get metrics such as RMSE and MAE
from surprise import accuracy

# Importing the dataset as df

In [5]:
df = pd.read_excel('../input/Rec_sys_data.xlsx')

In [6]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [7]:
df.shape

(272404, 9)

In [8]:
df.isnull().sum().sort_values(ascending=False)

InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64

In [9]:
df1 = df.dropna()

In [10]:
df1.describe()

Unnamed: 0,InvoiceNo,Quantity,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,0.300092,17.053491,15284.323523
std,9778.082879,149.136756,0.176023,10.01321,1714.478624
min,536365.0,1.0,0.0,5.81,12346.0
25%,545312.0,2.0,0.15,5.81,13893.0
50%,553902.0,6.0,0.3,15.22,15157.0
75%,562457.0,12.0,0.45,30.12,16788.0
max,569629.0,74215.0,0.6,30.12,18287.0


In [11]:
df1 = df1[df1.Quantity > 0]

In [12]:
df1.describe()

Unnamed: 0,InvoiceNo,Quantity,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,0.300092,17.053491,15284.323523
std,9778.082879,149.136756,0.176023,10.01321,1714.478624
min,536365.0,1.0,0.0,5.81,12346.0
25%,545312.0,2.0,0.15,5.81,13893.0
50%,553902.0,6.0,0.3,15.22,15157.0
75%,562457.0,12.0,0.45,30.12,16788.0
max,569629.0,74215.0,0.6,30.12,18287.0


In [13]:
df1.shape

(272404, 9)

### Implementation

We are creating a df(matrix) which contains CustomerID and whether they have ever purchased a product using groupby.

In [14]:
purchase = (df1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))

In [15]:
purchase.head(30)

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
12350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,5.0
12353,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0
12358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


We are getting the quantity ordered (example : 48,24,126) while we just want to know if that particular item is purchased or not. 

Thus we need to do encoding as 1(if purchased) or 0(not purchased)


In [16]:
def encode_units(x):
    if x < 1: # If the quantity is less than 1
        return 0 # Not purchased
    if x >= 1: # If the quantity is greater than 1
        return 1 # Purchased


purchase = purchase.applymap(encode_units)

In [17]:
purchase.head(30)

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
12353,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12354,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12355,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12356,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


The purchase matrix is now ready, which describes the behaviour of Customers corresponding to all the items.

We can now apply Collaborative filtering on it.

In [19]:
# Applying cosine_similarity on the purchase matrix
user_similarity = cosine_similarity(purchase)

In [20]:
# Storing the similarity scores in a dataframe
user_similarity_df = pd.DataFrame(user_similarity,index=purchase.index,columns=purchase.index)

In [21]:
user_similarity_df

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12346,1.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.114708,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
12347,0.0,1.000000,0.070632,0.053567,0.048324,0.0,0.029001,0.091885,0.075845,0.000000,...,0.041739,0.000000,0.050669,0.0,0.036811,0.069843,0.000000,0.0,0.087667,0.021253
12348,0.0,0.070632,1.000000,0.051709,0.031099,0.0,0.027995,0.118262,0.146427,0.061546,...,0.000000,0.000000,0.024456,0.0,0.000000,0.000000,0.000000,0.0,0.123091,0.082061
12350,0.0,0.053567,0.051709,1.000000,0.035377,0.0,0.000000,0.000000,0.033315,0.070014,...,0.000000,0.000000,0.027821,0.0,0.000000,0.000000,0.000000,0.0,0.052511,0.000000
12352,0.0,0.048324,0.031099,0.035377,1.000000,0.0,0.095765,0.040456,0.100180,0.084215,...,0.110264,0.065233,0.133855,0.0,0.000000,0.000000,0.000000,0.0,0.094742,0.056143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280,0.0,0.069843,0.000000,0.000000,0.000000,0.0,0.041523,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.105409,1.000000,0.119523,0.0,0.000000,0.000000
18281,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.049629,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.0,0.000000,0.119523,1.000000,0.0,0.054554,0.000000
18282,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.043355,0.0,0.000000,0.000000,0.000000,1.0,0.000000,0.000000
18283,0.0,0.087667,0.123091,0.052511,0.094742,0.0,0.123191,0.040032,0.099131,0.041667,...,0.027277,0.000000,0.165567,0.0,0.000000,0.000000,0.054554,0.0,1.000000,0.111111


This is how the user_similarity_df looks like. It contains the similarity score of users with 0 being the least similar while 1 being the most similar.

#### Making Recommendations

In [22]:
def similar_users(user_id,k=5):
    # separating df rows for the entered user id
    user = user_similarity_df[user_similarity_df.index == user_id]
    
    # a df of all other users
    other_users = user_similarity_df[user_similarity_df.index != user_id]
    
    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user,other_users)[0].tolist()
    
    # create list of indices of these users
    indices = other_users.index.tolist()
    
    # create key/values pairs of user index and their similarity
    index_similarity = dict(zip(indices, similarities))
    
    # sort by similarity
    index_similarity_sorted = sorted(index_similarity.items(),key=lambda x: x[1],reverse=True)
    
    # grab k users off the top
    top_users_similarities = index_similarity_sorted[:k]
    users = [u[0] for u in top_users_similarities]
    
    print('The users with behaviour similar to that of user {0} are:'.format(user_id))
    return users

In [23]:
simu = similar_users(12347)

simu

The users with behaviour similar to that of user 12347 are:


[15502, 16684, 12395, 16710, 17444]

Further the similar users can be stored in a list and later we can display the items purchased by the similar users as done below

In [24]:
def simu_recommendation(userid):
    
    simu = similar_users(userid)

    #obtaining all the items bought by similar users
    simu_rec = []
    for j in simu:
        desc = df1[df1["CustomerID"]==j]['StockCode'].to_list()
        simu_rec.append(desc)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simu_rec:
        for item in sublist:
            flat_list.append(item)
    final_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_recs = random.sample(final_list, 10)
    
    print('Items bought by Similar users based on Cosine Similarity')
    
    #returning 10 random recommendations
    return ten_recs

In [25]:
simu_recommendation(12347)

The users with behaviour similar to that of user 12347 are:
Items bought by Similar users based on Cosine Similarity


[22634, 23170, 23306, 22336, 22093, 22674, 23281, 23005, '85159B', 21260]

### Implementation

We are creating a df(matrix) which contains item names and whether they have been ever purchased by a customer using groupby.

In [26]:
items_purchase = (df1.groupby(['StockCode','CustomerID'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('StockCode'))

In [27]:
items_purchase.head(30)

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15030,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15034,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We are getting the quantity ordered (example : 48,24,126) while we just want to know if that particular item is purchased or not. 

Thus we need to do encoding as 1(if purchased) or 0(not purchased)

In [28]:
items_purchase = items_purchase.applymap(encode_units)

The item_purchase matrix is now ready, which describes if the item was purchased by particular customer or not.

We can now apply Collaborative filtering on it.

In [29]:
# Applying Cosine similarity on the items
item_similarity = cosine_similarity(items_purchase)

In [30]:
# Storing the similarity scores in a dataframe
item_similarity_df = pd.DataFrame(item_similarity,index=items_purchase.index,columns=items_purchase.index)

In [31]:
item_similarity_df.head(10)

StockCode,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
StockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10002,1.0,0.0,0.108821,0.094281,0.062932,0.091902,0.110096,0.059761,0.083771,0.096449,...,0.0,0.0,0.0,0.0,0.0,0.032275,0.0,0.079333,0.0,0.066986
10080,0.0,1.0,0.0,0.043033,0.028724,0.067116,0.0,0.0,0.076472,0.044023,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.108821,0.0,1.0,0.068399,0.068483,0.026669,0.079872,0.086711,0.121547,0.034986,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076739,0.0,0.013885
10125,0.094281,0.043033,0.068399,1.0,0.044499,0.051988,0.0519,0.0,0.03949,0.0341,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074796,0.0,0.063155
10133,0.062932,0.028724,0.068483,0.044499,1.0,0.266043,0.051964,0.075218,0.079078,0.05311,...,0.0,0.0,0.0,0.0,0.0,0.040622,0.0,0.066567,0.049752,0.024089
10135,0.091902,0.067116,0.026669,0.051988,0.266043,1.0,0.080944,0.043937,0.046192,0.044319,...,0.116248,0.116248,0.116248,0.116248,0.047458,0.023729,0.116248,0.068048,0.0,0.028143
11001,0.110096,0.0,0.079872,0.0519,0.051964,0.080944,1.0,0.065795,0.092229,0.092913,...,0.0,0.0,0.0,0.0,0.0,0.071067,0.174078,0.116457,0.0,0.052678
15030,0.059761,0.0,0.086711,0.0,0.075218,0.043937,0.065795,1.0,0.050063,0.086459,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094821,0.0,0.022875
15034,0.083771,0.076472,0.121547,0.03949,0.079078,0.046192,0.092229,0.050063,1.0,0.232288,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055381,0.0,0.032066
15036,0.096449,0.044023,0.034986,0.0341,0.05311,0.044319,0.092913,0.086459,0.232288,1.0,...,0.0,0.0,0.0,0.0,0.031129,0.046693,0.0,0.082892,0.0,0.059993


#### Making Recommendations

In [32]:
def similar_items(item,k=10):
    # separating df rows of the selected item
    item = item_similarity_df[item_similarity_df.index == item]
    
    # a df of all other items
    other_items = item_similarity_df
    
    # calc cosine similarity between selected item with other items
    similarities = cosine_similarity(item,other_items)[0].tolist()
    
    # create list of indices of these items
    indices = other_items.index.tolist()
    
    # create key/values pairs of item index and their similarity
    index_similarity = dict(zip(indices, similarities))
    index_similarity = list(index_similarity.keys())

    # grab k items from the top
    top_item_similarities = index_similarity[:k]

    print('Similar items based on purchase behaviour (item-to-item collaborative filtering)')
    return top_item_similarities

In [33]:
similar_items(22966)

Similar items based on purchase behaviour (item-to-item collaborative filtering)


[10002, 10080, 10120, 10125, 10133, 10135, 11001, 15030, 15034, 15036]

## Collaborative Filtering using k-Nearest Neighbors

### Model building

In [34]:
purchase_matrix = csr_matrix(purchase.values)

# Creating KNN Model with metric parameter as euclidean distance
model_knn = NearestNeighbors(metric = 'euclidean', algorithm = 'brute')

# Fitting the model on purchase_matrix
model_knn.fit(purchase_matrix)

NearestNeighbors(algorithm='brute', metric='euclidean')

### Finding similar users

In [35]:
# Creating empty list where we will store user id of similar users
simu_knn = []

In [36]:
def similar_users_knn(purchase,query_index):
    
    # Storing the distance and index of nearest neighors
    distances, indices = model_knn.kneighbors(purchase.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 5)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(purchase.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase.index[indices.flatten()[i]], distances.flatten()[i]))
            simu_knn.append(purchase.index[indices.flatten()[i]])    

In [37]:
similar_users_knn(purchase,1497)

Recommendations for 14729:

1: 16917, with distance of 8.12403840463596:
2: 16989, with distance of 8.12403840463596:
3: 15124, with distance of 8.12403840463596:
4: 12897, with distance of 8.246211251235321:


In [38]:
simu_knn

[16917, 16989, 15124, 12897]

### Making Recommendations

In [39]:
def simu_recommendation_knn(simu_knn):
    

    #obtaining all the items bought by similar users
    simu_rec = []
    for j in simu_knn:
        desc = df1[df1["CustomerID"]==j]['StockCode'].to_list()
        simu_rec.append(desc)
    
    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simu_rec:
        for item in sublist:
            flat_list.append(item)
    final_list = list(dict.fromkeys(flat_list))
    
    # storing 10 random recommendations in a list
    ten_recs = random.sample(final_list, 10)
    
    print('Items bought by Similar users based on KNN')
    
    #returning 10 random recommendations
    return ten_recs

In [40]:
simu_recommendation_knn(simu_knn)

Items bought by Similar users based on KNN


[22957, '84997A', 22501, 22605, 22469, 22470, 22916, 22926, 22487, 23298]

## Collaborative Filtering using Matrix Factorization

For Matrix Factorization, we are using the Surprise Package.

Surprise package: This package has been specially developed to make recommendation based on collaborative filtering easy. It has default implementation for a variety of Collaborative Filtering algorithms such as NMF, kNN, Co-Clustering, SVD.

In [41]:
df3 = items_purchase.stack().to_frame()

In [42]:
df3.reset_index(inplace=True)

In [43]:
df3

Unnamed: 0,StockCode,CustomerID,0
0,10002,12346,0
1,10002,12347,0
2,10002,12348,0
3,10002,12350,0
4,10002,12352,0
...,...,...,...
12903081,POST,18280,0
12903082,POST,18281,0
12903083,POST,18282,0
12903084,POST,18283,0


3877 unique items x 4339 unique customer ids

Total records in df3 should be 3877x4339 = 1,68,22,303

And this size is too big to pass into an algorithm so we need to reduce the size of dataset by shortlisting.

### Shortlisting customers & items based on no. of orders

In [44]:
# Storing all customer ids in customers
customers = df['CustomerID']

# Storing all item descriptions in items
items = df['StockCode']

In [45]:
from collections import Counter

In [46]:
# counting no. of orders made by each customer
count1 = Counter(customers)

# storing the count and customer id in a dataframe
countdf1 = pd.DataFrame.from_dict(count1, orient='index').reset_index()

# dropping all customer ids with less than 120 orders
countdf1 = countdf1[countdf1[0]>120]

# renaming the index column as CustomerID for inner join
countdf1.rename(columns={'index':'CustomerID'},inplace=True)

In [47]:
countdf1

Unnamed: 0,CustomerID,0
0,17850,297
1,13047,140
2,12583,182
6,14688,265
8,15311,1892
...,...,...
3308,14096,1170
3367,16910,261
3392,16360,226
3413,17728,133


In [48]:
# counting no. of times an item was ordered
count2 = Counter(items)

# storing the count and item description in a dataframe
countdf2 = pd.DataFrame.from_dict(count2, orient='index').reset_index()

# dropping all items which were ordered less than 120 times
countdf2 = countdf2[countdf2[0]>120]

# renaming the index column as Description for inner join
countdf2.rename(columns={'index':'StockCode'},inplace=True)

In [49]:
countdf2

Unnamed: 0,StockCode,0
0,84029E,161
1,71053,220
3,84406B,213
4,22752,229
5,85123A,1606
...,...,...
3295,23294,181
3296,23295,213
3363,23328,129
3373,23356,148


Applying inner join

In [50]:
df4 = pd.merge(df3, countdf2, on='StockCode', how='inner')
df4 = pd.merge(df4, countdf1, on='CustomerID', how='inner')

In [51]:
df4

Unnamed: 0,StockCode,CustomerID,0_x,0_y,0
0,10133,12347,0,124,124
1,15036,12347,0,278,124
2,17003,12347,0,138,124
3,20675,12347,0,188,124
4,20676,12347,0,242,124
...,...,...,...,...,...
385667,85099F,18283,1,540,447
385668,85123A,18283,1,1606,447
385669,85132C,18283,0,127,447
385670,M,18283,1,198,447


In [52]:
# dropping columns which are not necessary
df4.drop(['0_y','0_x'],axis=1,inplace=True)

In [53]:
df4

Unnamed: 0,StockCode,CustomerID,0
0,10133,12347,124
1,15036,12347,124
2,17003,12347,124
3,20675,12347,124
4,20676,12347,124
...,...,...,...
385667,85099F,18283,447
385668,85123A,18283,447
385669,85132C,18283,447
385670,M,18283,447


In [54]:
df4.describe()

Unnamed: 0,CustomerID,0
count,385672.0,385672.0
mean,15360.985915,279.089789
std,1719.468125,337.879413
min,12347.0,121.0
25%,13996.25,151.0
50%,15413.0,198.0
75%,16840.0,290.0
max,18283.0,5095.0


This is how the df4 looks like. We have reduced the size from 1,68,22,303 to 3,85,672.

This format is exactly what is suitable to be passed into surprise library.

In [55]:
# reading the data in a format supported by surprise library.
reader = Reader(rating_scale=(0,5946))
# the range has been set as 0,5946 as the maximum value of quantity is 5946.

# loading Dataset in a format supported by surprise library.
data = Dataset.load_from_df(df4, reader)

In [56]:
# performing train test split on the dataset
trainset, testset = train_test_split(data, test_size= 0.2)

### Implementing NMF

In [57]:
algo1 = NMF()

algo1.fit(trainset)

pred1 = algo1.test(testset)

In [58]:
accuracy.rmse(pred1)

accuracy.mae(pred1)

RMSE: 433.7202
MAE:  273.5832


273.583216643018

In [59]:
cross_validate(algo1, data, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    429.2053424.9090425.7631433.2471427.4516428.11522.9590  
MAE (testset)     272.3958272.7828272.5152273.9501271.5215272.63310.7827  
Fit time          15.77   15.58   21.53   23.87   23.88   20.13   3.74    
Test time         0.45    0.32    0.73    0.62    0.66    0.56    0.15    


{'test_rmse': array([429.20534393, 424.90901061, 425.76309668, 433.24708772,
        427.45158648]),
 'test_mae': array([272.39582706, 272.78282692, 272.51519283, 273.95010672,
        271.52151851]),
 'fit_time': (15.765905380249023,
  15.575916290283203,
  21.530439615249634,
  23.871683597564697,
  23.884162425994873),
 'test_time': (0.45482707023620605,
  0.32199740409851074,
  0.7328190803527832,
  0.6238532066345215,
  0.6646733283996582)}

### Implementing Co-Clustering

In [60]:
algo = CoClustering()

algo.fit(trainset)

pred = algo.test(testset)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  algo.fit(trainset)


In [61]:
accuracy.rmse(pred)

accuracy.mae(pred)

RMSE: 7.1746
MAE:  5.8224


5.822375572786192

In [62]:
cross_validate(algo, data, verbose=True)

Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    7.0594  7.2301  7.5031  6.7457  6.8753  7.0827  0.2667  
MAE (testset)     5.6804  5.8693  6.1813  5.4374  5.5380  5.7413  0.2636  
Fit time          12.33   11.33   10.88   10.83   11.28   11.33   0.54    
Test time         0.69    0.37    0.74    0.64    0.47    0.58    0.14    


{'test_rmse': array([7.05943816, 7.23006135, 7.50310719, 6.74572576, 6.87531768]),
 'test_mae': array([5.68040329, 5.86925624, 6.18128217, 5.43739732, 5.53804887]),
 'fit_time': (12.325151205062866,
  11.334140300750732,
  10.87869381904602,
  10.827836990356445,
  11.278704643249512),
 'test_time': (0.6859066486358643,
  0.3725264072418213,
  0.7365968227386475,
  0.6421375274658203,
  0.4671785831451416)}

### Giving out predictions

In [63]:
pred1

# Predictions given out by NMF

[Prediction(uid=22791, iid=16360, r_ui=226.0, est=5.184114156376821, details={'was_impossible': False}),
 Prediction(uid=22569, iid=17530, r_ui=304.0, est=7.008853545880178, details={'was_impossible': False}),
 Prediction(uid=21888, iid=14307, r_ui=180.0, est=4.164244611248198, details={'was_impossible': False}),
 Prediction(uid=22725, iid=12714, r_ui=191.0, est=4.338683325255341, details={'was_impossible': False}),
 Prediction(uid=21485, iid=18109, r_ui=329.0, est=7.614181460442531, details={'was_impossible': False}),
 Prediction(uid=22548, iid=17211, r_ui=141.0, est=3.205584144445285, details={'was_impossible': False}),
 Prediction(uid=21390, iid=15410, r_ui=122.0, est=2.802323412468582, details={'was_impossible': False}),
 Prediction(uid=22834, iid=15764, r_ui=180.0, est=4.021663224063316, details={'was_impossible': False}),
 Prediction(uid=84947, iid=14092, r_ui=173.0, est=3.9101251319970047, details={'was_impossible': False}),
 Prediction(uid=22430, iid=16110, r_ui=200.0, est=4.61

In [64]:
pred

# Predictions given out by Co-Clustering

[Prediction(uid=22791, iid=16360, r_ui=226.0, est=216.28060296907142, details={'was_impossible': False}),
 Prediction(uid=22569, iid=17530, r_ui=304.0, est=318.75664882373644, details={'was_impossible': False}),
 Prediction(uid=21888, iid=14307, r_ui=180.0, est=187.50694805255944, details={'was_impossible': False}),
 Prediction(uid=22725, iid=12714, r_ui=191.0, est=198.59765100040522, details={'was_impossible': False}),
 Prediction(uid=21485, iid=18109, r_ui=329.0, est=332.794755483224, details={'was_impossible': False}),
 Prediction(uid=22548, iid=17211, r_ui=141.0, est=150.65393071659275, details={'was_impossible': False}),
 Prediction(uid=21390, iid=15410, r_ui=122.0, est=123.33016451863296, details={'was_impossible': False}),
 Prediction(uid=22834, iid=15764, r_ui=180.0, est=179.29408315976684, details={'was_impossible': False}),
 Prediction(uid=84947, iid=14092, r_ui=173.0, est=178.0069188563316, details={'was_impossible': False}),
 Prediction(uid=22430, iid=16110, r_ui=200.0, est

### Best and Worst Predictions made by NMF

In [65]:
def get_item_orders(uid):
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)]) # returns the number of orders made for that item
    except ValueError: # user was not part of the trainset
        return 0
    
def get_customer_orders(iid):
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)]) # returns the number of orders made by that customers
    except ValueError: # item was not part of the trainset
        return 0
    
predictions_df = pd.DataFrame(pred1, columns=['item', 'customer', 'quantity', 'est', 'details'])
predictions_df['item_orders'] = predictions_df.item.apply(get_item_orders)
predictions_df['customer_orders'] = predictions_df.customer.apply(get_customer_orders)
predictions_df['err'] = abs(predictions_df.est - predictions_df.quantity)
best_predictions = predictions_df.sort_values(by='err')[:10]
worst_predictions = predictions_df.sort_values(by='err')[-10:]

In [66]:
predictions_df

Unnamed: 0,item,customer,quantity,est,details,item_orders,customer_orders,err
0,22791,16360,226.0,5.184114,{'was_impossible': False},462,545,220.815886
1,22569,17530,304.0,7.008854,{'was_impossible': False},449,535,296.991146
2,21888,14307,180.0,4.164245,{'was_impossible': False},441,527,175.835755
3,22725,12714,191.0,4.338683,{'was_impossible': False},441,545,186.661317
4,21485,18109,329.0,7.614181,{'was_impossible': False},455,554,321.385819
...,...,...,...,...,...,...,...,...
77130,22644,12362,145.0,3.319100,{'was_impossible': False},469,563,141.680900
77131,20977,13952,137.0,3.129427,{'was_impossible': False},454,531,133.870573
77132,21871,14808,208.0,5.041698,{'was_impossible': False},460,536,202.958302
77133,22845,17827,156.0,3.433016,{'was_impossible': False},454,537,152.566984


In [67]:
best_predictions

Unnamed: 0,item,customer,quantity,est,details,item_orders,customer_orders,err
4368,22983,15089,121.0,3.062984,{'was_impossible': False},464,541,117.937016
73659,22196,15089,121.0,3.018809,{'was_impossible': False},455,541,117.981191
22374,82578,15089,121.0,3.010912,{'was_impossible': False},453,541,117.989088
49560,21755,15089,121.0,3.003918,{'was_impossible': False},460,541,117.996082
70725,22562,15089,121.0,2.984018,{'was_impossible': False},457,541,118.015982
2650,21985,15443,121.0,2.983422,{'was_impossible': False},448,534,118.016578
58417,22222,15089,121.0,2.980661,{'was_impossible': False},462,541,118.019339
53546,22111,15089,121.0,2.980435,{'was_impossible': False},457,541,118.019565
32901,20685,16477,121.0,2.977647,{'was_impossible': False},451,554,118.022353
41162,20972,15443,121.0,2.972454,{'was_impossible': False},450,534,118.027546


In [68]:
worst_predictions

Unnamed: 0,item,customer,quantity,est,details,item_orders,customer_orders,err
14934,22834,17841,5095.0,111.155933,{'was_impossible': False},474,532,4983.844067
3314,84970L,17841,5095.0,110.98459,{'was_impossible': False},447,532,4984.01541
55202,22079,17841,5095.0,110.323833,{'was_impossible': False},441,532,4984.676167
17371,22441,17841,5095.0,110.271043,{'was_impossible': False},445,532,4984.728957
5142,22426,17841,5095.0,110.068774,{'was_impossible': False},451,532,4984.931226
4405,22562,17841,5095.0,110.063804,{'was_impossible': False},457,532,4984.936196
38753,23308,17841,5095.0,108.74728,{'was_impossible': False},467,532,4986.25272
31095,21484,17841,5095.0,107.522526,{'was_impossible': False},455,532,4987.477474
7780,22632,17841,5095.0,105.565253,{'was_impossible': False},453,532,4989.434747
10918,85049E,17841,5095.0,103.714255,{'was_impossible': False},454,532,4991.285745
