# Chapter 5 - Collaborative Filtering (Part 2)

<div style="text-align:center;">
    <img src='images/intro.JPG' width='800'>
</div>

Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something.

Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.

Collaborative filtering filters information by using the interactions and data that the system collects from other users. For example, when we want to find a new movie to watch, we often ask our friends for recommendations.

Naturally, we place greater trust in recommendations from friends whose tastes resemble our own. Collaborative filtering works the same way: it mostly focuses on finding similarity between users and then recommends to each user what similar users liked. Similarity can be measured in several ways, including cosine similarity, Pearson similarity, and Jaccard similarity.
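As a rough illustration of the three similarity measures named above, the following sketch computes each one for two toy purchase vectors. The vectors `alice` and `bob` and the helper function names are hypothetical and are not part of the chapter's dataset or code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

def jaccard_sim(a, b):
    """Jaccard index: items bought by both users over items bought by either."""
    a_set, b_set = a > 0, b > 0
    return float((a_set & b_set).sum() / (a_set | b_set).sum())

# Hypothetical purchase vectors: 1 = bought the item, 0 = did not.
alice = np.array([1, 1, 0, 1, 0])
bob = np.array([1, 0, 0, 1, 1])

print(f"cosine : {cosine_sim(alice, bob):.3f}")
print(f"pearson: {pearson_sim(alice, bob):.3f}")
print(f"jaccard: {jaccard_sim(alice, bob):.3f}")
```

The three measures generally disagree on the same pair of users, so the choice of measure is a modelling decision rather than a detail.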

In [1]:
# Importing basic libraries
import pandas as pd
import numpy as np
import random

# Importing scipy.sparse.csr_matrix for kNN data preparation
from scipy.sparse import csr_matrix

# Importing kNN algorithm
from sklearn.neighbors import NearestNeighbors

# Importing cosine_similarity to calculate cosine similarity in memory based collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing surprise.Reader,Dataset for surprise data preparation
from surprise import Reader, Dataset

# Importing for surprise model customizations
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV

In [3]:
# Importing algorithms from Surprise package
from surprise.prediction_algorithms import NMF,CoClustering,SVD

# Importing accuracy to get metrics such as RMSE and MAE
from surprise import accuracy

### About the Dataset
The following is the data dictionary for the dataset; it has nine features (columns).

    • InvoiceNo: The invoice number of a particular transaction
    • StockCode: The unique identifier for a particular item
    • Quantity: The quantity of that item bought by the customer
    • InvoiceDate: The date and time when the transaction was made
    • DeliveryDate: The date and time when the delivery happened
    • Discount%: Percentage of discount on the purchased item
    • ShipMode: Mode of shipping
    • ShippingCost: Cost of shipping that item
    • CustomerID: The unique identifier of a particular customer

In [6]:
# Read the Excel data
df = pd.read_excel('data/Rec_sys_data.xlsx')

# View the first 5 rows
df.head()

,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272404 entries, 0 to 272403
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     272404 non-null  int64         
 1   StockCode     272404 non-null  object        
 2   Quantity      272404 non-null  int64         
 3   InvoiceDate   272404 non-null  datetime64[ns]
 4   DeliveryDate  272404 non-null  datetime64[ns]
 5   Discount%     272404 non-null  float64       
 6   ShipMode      272404 non-null  object        
 7   ShippingCost  272404 non-null  float64       
 8   CustomerID    272404 non-null  int64         
dtypes: datetime64[ns](2), float64(2), int64(3), object(2)
memory usage: 18.7+ MB


### Data Preparation

In [8]:
# null check
df.isnull().sum().sort_values(ascending=False)

InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64

In [9]:
# Drop NaN
data1 = df.dropna()

data1.describe()

,InvoiceNo,Quantity,InvoiceDate,DeliveryDate,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404,272404,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,2011-05-16 04:33:17.259658240,2011-05-18 04:33:04.572620288,0.300092,17.053491,15284.323523
min,536365.0,1.0,2010-12-01 08:26:00,2010-12-02 08:26:00,0.0,5.81,12346.0
25%,545312.0,2.0,2011-03-01 13:51:00,2011-03-03 14:53:00,0.15,5.81,13893.0
50%,553902.0,6.0,2011-05-19 18:02:00,2011-05-22 08:52:30,0.3,15.22,15157.0
75%,562457.0,12.0,2011-08-05 11:00:00,2011-08-07 12:05:00,0.45,30.12,16788.0
max,569629.0,74215.0,2011-10-05 11:37:00,2011-10-08 11:37:00,0.6,30.12,18287.0
std,9778.082879,149.136756,,,0.176023,10.01321,1714.478624


# Memory-Based Approach
In the memory-based approach, the closest users or items are found directly from the stored interaction data using a similarity measure such as cosine similarity or the Pearson correlation coefficient, both of which rely only on arithmetic operations.

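A common similarity measure is cosine similarity. It can be interpreted geometrically by treating a given user’s (item’s) row (column) of the ratings matrix as a vector. For user-based collaborative filtering, the similarity of two users is the cosine of the angle between their vectors. For users $u$ and $u'$, the cosine similarity is:

$$\mathrm{sim}(u, u') = \cos\theta = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_{u'} \rVert} = \frac{\sum_i r_{ui}\, r_{u'i}}{\sqrt{\sum_i r_{ui}^2}\, \sqrt{\sum_i r_{u'i}^2}}$$

where $\mathbf{r}_u$ is user $u$’s row of the ratings matrix and $r_{ui}$ is that user’s rating of item $i$.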

Because no training or optimization is involved, the approach is easy to use. However, its performance degrades on sparse data, which limits its scalability for most real-world problems.
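The effect of sparsity can be seen on a small example. The sketch below uses a hypothetical binary user-item matrix (not the chapter's data) and computes its density together with the pairwise cosine similarities. Users with no items in common receive a similarity of exactly 0, so a very sparse matrix leaves few informative neighbours.

```python
import numpy as np

# Hypothetical binary user-item matrix: rows are users, columns are items.
purchases = np.array([
    [1, 0, 0, 0, 0, 0],   # user 0
    [0, 0, 1, 0, 0, 0],   # user 1
    [1, 0, 0, 0, 0, 1],   # user 2
])

# Density: fraction of user-item cells that hold an interaction.
density = purchases.sum() / purchases.size

# Pairwise cosine similarity: normalise each row, then take dot products.
norms = np.linalg.norm(purchases, axis=1, keepdims=True)
unit_rows = purchases / norms
similarity = unit_rows @ unit_rows.T

print(f"density: {density:.2%}")
print(np.round(similarity, 3))
```

The similarity matrix has one entry per pair of users, so its size grows with the square of the number of users.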

The memory-based approach is further divided into:

1. User-to-User Collaborative Filtering
2. Item-to-Item Collaborative Filtering

## User-to-User Collaborative Filtering
User-based collaborative filtering predicts the items a user might like from the ratings given to those items by other users whose tastes are similar to the target user's.

<div style="text-align:center;">
    <img src='images/User_based1.JPG' width='400'>
</div>

In user-based collaborative filtering, we create a matrix that describes the behaviour of every user with respect to every item. We then compare users with one another to identify those who are similar.
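The idea can be sketched end to end on a toy matrix. The code below is a minimal illustration under stated assumptions, not the implementation used later in this chapter: the matrix values, the function name `score_items`, and the neighbourhood size `k` are all hypothetical, and every user is assumed to have at least one purchase so that no row has zero norm.

```python
import numpy as np

# Hypothetical binary purchase matrix: 4 users (rows) x 5 items (columns).
purchases = np.array([
    [1, 1, 0, 0, 1],   # user 0 (target)
    [1, 1, 1, 0, 1],   # user 1
    [0, 0, 0, 1, 0],   # user 2
    [1, 0, 1, 0, 1],   # user 3
], dtype=float)

def score_items(purchases, target, k=2):
    """Score unseen items for `target` using its k most similar users."""
    norms = np.linalg.norm(purchases, axis=1)
    # Cosine similarity between the target row and every row.
    sims = purchases @ purchases[target] / (norms * norms[target])
    sims[target] = -np.inf                      # exclude the target itself
    neighbours = np.argsort(sims)[::-1][:k]     # indices of the k nearest users
    # Similarity-weighted vote of the neighbours for each item.
    scores = sims[neighbours] @ purchases[neighbours]
    scores[purchases[target] > 0] = 0.0         # drop items already purchased
    return neighbours, scores

neighbours, scores = score_items(purchases, target=0)
best_item = int(np.argmax(scores))
print("neighbours:", neighbours, "recommended item index:", best_item)
```

Here the two nearest users both bought the item in column 2, which the target has not bought, so that item receives the highest score.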

### Implementation
Using groupby, we build a matrix with one row per CustomerID and one column per StockCode, holding the total quantity of each product that the customer has purchased.

In [12]:
data1.head()

,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


In [14]:
purchase_df = (data1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))

purchase_df.head()

CustomerID,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
12350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,5.0


We then encode each entry as 1 (purchased) or 0 (not purchased):

In [16]:
# Map a summed quantity to a binary purchase flag
def encode_units(x):
    # 1 if at least one unit was purchased, otherwise 0
    return 1 if x >= 1 else 0


# Apply the encoding element-wise (pandas 2.1+ prefers DataFrame.map over applymap)
purchase_df = purchase_df.applymap(encode_units)

purchase_df.head()

CustomerID,10002,10080,10120,10125,10133,10135,11001,15030,15034,15036,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
12346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
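
The element-wise encoding above can also be written as a single vectorized comparison, which is typically faster on a wide matrix. This is a sketch on a small hypothetical frame named `quantities`, not a change to the chapter's code. It gives the same 0/1 result for any quantity, including zero or negative values.

```python
import pandas as pd

# Hypothetical summed quantities per customer (rows) and stock code (columns).
quantities = pd.DataFrame(
    {"10002": [0.0, 12.0, 0.0], "POST": [9.0, 1.0, 0.0]},
    index=pd.Index([12348, 12350, 12352], name="CustomerID"),
)

# The comparison yields a boolean frame; casting to int gives the 0/1 flags.
purchased = (quantities >= 1).astype(int)
print(purchased)
```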


The purchase matrix describes the behaviour of each customer across all items. We can now apply collaborative filtering to it.

In [17]:
# Applying cosine_similarity on the purchase matrix
user_similarities = cosine_similarity(purchase_df)

# Storing the similarity scores in a dataframe, i.e., the similarity scores matrix
user_similarity_data = pd.DataFrame(user_similarities,index=purchase_df.index,columns=purchase_df.index)

user_similarity_data.head()

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
12346,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.114708,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,1.0,0.070632,0.053567,0.048324,0.0,0.029001,0.091885,0.075845,0.0,...,0.041739,0.0,0.050669,0.0,0.036811,0.069843,0.0,0.0,0.087667,0.021253
12348,0.0,0.070632,1.0,0.051709,0.031099,0.0,0.027995,0.118262,0.146427,0.061546,...,0.0,0.0,0.024456,0.0,0.0,0.0,0.0,0.0,0.123091,0.082061
12350,0.0,0.053567,0.051709,1.0,0.035377,0.0,0.0,0.0,0.033315,0.070014,...,0.0,0.0,0.027821,0.0,0.0,0.0,0.0,0.0,0.052511,0.0
12352,0.0,0.048324,0.031099,0.035377,1.0,0.0,0.095765,0.040456,0.10018,0.084215,...,0.110264,0.065233,0.133855,0.0,0.0,0.0,0.0,0.0,0.094742,0.056143


This is what user_similarity_data looks like. It holds the pairwise similarity scores between users, where 0 is the least similar and 1 the most similar.
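
A similarity matrix of this shape can be queried for the users closest to a given customer. The sketch below assumes a small hypothetical matrix with rounded, made-up scores; the helper `most_similar_users` and its parameter `k` are illustrative names, not functions from the chapter.

```python
import pandas as pd

# Hypothetical 4x4 similarity matrix shaped like user_similarity_data.
ids = pd.Index([12346, 12347, 12348, 12350], name="CustomerID")
similarity = pd.DataFrame(
    [[1.00, 0.00, 0.00, 0.00],
     [0.00, 1.00, 0.07, 0.05],
     [0.00, 0.07, 1.00, 0.05],
     [0.00, 0.05, 0.05, 1.00]],
    index=ids, columns=ids,
)

def most_similar_users(similarity, customer_id, k=2):
    """Return the k highest-scoring other users for `customer_id`."""
    # Drop the customer's own entry, which is always 1.
    scores = similarity.loc[customer_id].drop(customer_id)
    return scores.sort_values(ascending=False).head(k)

top = most_similar_users(similarity, 12347)
print(top)
```

Dropping the diagonal entry matters because every user is perfectly similar to themselves and would otherwise always rank first.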

#### Making Recommendations