## Collaborative Filtering
In collaborative filtering, algorithms are used to make automatic predictions about a user's interests by compiling preferences from several users.

Different Types:
- Memory Based: This method makes use of user rating information to calculate the likeness between the users or items. This calculated likeness is then used to make recommendations. User-User/Item-Item Collaborative Filter
- Model Based: Models are created by using data mining, and the system learns algorithms to look for habits according to training data. These models are then used to come up with predictions for actual data. Matrix-Factorisation
- Hybrid: Various programs combine the model-based and memory-based CF algorithms.


## Item-to-Item Collaborative Filtering
An attempt to understand Amazons item-item collaborative filtering method based on their high level algorithm

Method:
- Compute a item-user matrix based on whether a user has purchased an item 1 if they have 0 otherwise.
- Compute an item-item similarity matrix using the jacard similarity
- Create a similar item table lookup:

<code>
    for every item:
        for every customer:
            if customer has purchased item:
                for every other item purchased by customer:
                   if item and other item are similar based on condition(s) e.g. purchased on the same date/jaccard similarity
                       add item and similar items based on condition(s) to the look up table
</code>

- Lookup table consists of for every customer there purchases and the similar items to there purchases
- Recommend items based on a customers purchases and for a given item, similar items that other customers purchased

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

%matplotlib inline

RANDOM_STATE = 42

import warnings
warnings.filterwarnings('ignore')

In [2]:
use_cols = {
    'Description',
    'Customer ID',
    'Quantity',
    'InvoiceDate'
}

data = pd.read_excel('data/UK_Retail.xlsx', usecols=use_cols)

In [3]:
data.head()

Unnamed: 0,Description,Quantity,InvoiceDate,Customer ID
0,"PAPER CRAFT , LITTLE BIRDIE",-80995.0,2011-12-09 09:27:00,16446.0
1,MEDIUM CERAMIC TOP STORAGE JAR,-74215.0,2011-01-18 10:17:00,12346.0
2,ROTATING SILVER ANGELS T-LIGHT HLDR,-9360.0,2010-12-02 14:23:00,15838.0
3,ROTATING SILVER ANGELS T-LIGHT HLDR,-9360.0,2010-12-02 14:23:00,15838.0
4,FAIRY CAKE FLANNEL ASSORTED COLOUR,-3114.0,2011-04-18 13:08:00,15749.0


### Preprocessing

In [4]:
# Data Preprocess

data.dropna(inplace=True) # Drop null values
data['Description'] = data['Description'].apply(lambda x: x.strip()) # Remove trailing/leading whitespace
data['Quantity'] = data['Quantity'].astype(int) # Convert col to int
data = data[data['Quantity'] > 0] # Remove negataive quantities
data['Customer ID'] = data['Customer ID'].astype(str) # Convert to string
data['Customer ID'] = data['Customer ID'].apply(lambda x: x.strip('.0')) # Remove trailing .0
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate']) # Convert invoice to datetime

data.head()

Unnamed: 0,Description,Quantity,InvoiceDate,Customer ID
15395,JOY LARGE WOOD LETTERS,1,2009-12-01 09:08:00,15362
15396,EDWARDIAN TOILET ROLL UNIT,1,2009-12-01 09:57:00,17519
15397,CHARLIE LOLA BLUE HOT WATER BOTTLE,1,2009-12-01 10:59:00,17238
15398,CHARLIE+LOLA PINK HOT WATER BOTTLE,1,2009-12-01 10:59:00,17238
15399,CHARLIE + LOLA RED HOT WATER BOTTLE,1,2009-12-01 10:59:00,17238


In [5]:
items = len(set(data['Description']))
customers = len(set(data['Customer ID']))
print(f'Unique Items: {items} Unique Customers: {customers}')

Unique Items: 5195 Unique Customers: 5335


### Create pivot table and distance matrix

#### Similarity measure
- Use Pearson when your data is subject to user-bias/ different ratings scales of users
- Use Cosine, if data is sparse (many ratings are undefined)
- Use Euclidean, if your data is not sparse and the magnitude of the attribute values is significant
- Use adjusted cosine for Item-based approach to adjust for user-bias

In [6]:
from sklearn.metrics import pairwise_distances
from scipy.sparse import csr_matrix

pivot_data=data.copy()
pivot_data.drop(columns=['InvoiceDate'], inplace=True)
pivot_data = data.groupby(['Customer ID','Description']).count()
pivot_data = pivot_data.reset_index().pivot(index='Description', columns='Customer ID', values='Quantity').fillna(0)
pivot_data = pivot_data.astype(bool).astype(int)
distance_matrix = (1 - pairwise_distances(pivot_data.values, metric='jaccard', n_jobs=-1)) # Jaccard similaity

In [7]:
pivot_data.iloc[:10,:10]

Customer ID,12346,12608,12745,12746,12747,12748,12749,12777,12819,1282
Description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
10 COLOUR SPACEBOY PEN,0,0,0,0,0,1,0,0,0,1
11 PC CERAMIC TEA SET POLKADOT,0,0,0,0,0,0,0,0,0,0
12 ASS ZINC CHRISTMAS DECORATIONS,0,0,0,0,0,0,0,0,0,0
12 COLOURED PARTY BALLOONS,0,0,0,0,0,1,0,0,0,0
12 DAISY PEGS IN WOOD BOX,0,0,0,0,0,0,0,0,0,0
12 EGG HOUSE PAINTED WOOD,0,0,0,0,0,0,0,0,0,0
12 HANGING EGGS HAND PAINTED,0,0,0,0,0,0,0,0,0,0
12 IVORY ROSE PEG PLACE SETTINGS,0,0,0,0,0,1,0,0,0,0
12 MESSAGE CARDS WITH ENVELOPES,0,0,0,0,0,1,0,0,0,0
12 MINI TOADSTOOL PEGS,0,0,0,0,0,0,0,0,0,0


In [8]:
distance_matrix = pd.DataFrame(np.round(distance_matrix, 2), index=pivot_data.index, columns=pivot_data.index)

In [20]:
distance_matrix.iloc[:10,:10]

Description,10 COLOUR SPACEBOY PEN,11 PC CERAMIC TEA SET POLKADOT,12 ASS ZINC CHRISTMAS DECORATIONS,12 COLOURED PARTY BALLOONS,12 DAISY PEGS IN WOOD BOX,12 EGG HOUSE PAINTED WOOD,12 HANGING EGGS HAND PAINTED,12 IVORY ROSE PEG PLACE SETTINGS,12 MESSAGE CARDS WITH ENVELOPES,12 MINI TOADSTOOL PEGS
Description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
10 COLOUR SPACEBOY PEN,1.0,0.0,0.0,0.07,0.05,0.01,0.0,0.04,0.06,0.02
11 PC CERAMIC TEA SET POLKADOT,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12 ASS ZINC CHRISTMAS DECORATIONS,0.0,0.0,1.0,0.02,0.04,0.02,0.0,0.03,0.02,0.02
12 COLOURED PARTY BALLOONS,0.07,0.0,0.02,1.0,0.04,0.03,0.01,0.06,0.09,0.05
12 DAISY PEGS IN WOOD BOX,0.05,0.0,0.04,0.04,1.0,0.02,0.01,0.09,0.09,0.15
12 EGG HOUSE PAINTED WOOD,0.01,0.0,0.02,0.03,0.02,1.0,0.0,0.04,0.02,0.01
12 HANGING EGGS HAND PAINTED,0.0,0.0,0.0,0.01,0.01,0.0,1.0,0.0,0.01,0.0
12 IVORY ROSE PEG PLACE SETTINGS,0.04,0.0,0.03,0.06,0.09,0.04,0.0,1.0,0.07,0.06
12 MESSAGE CARDS WITH ENVELOPES,0.06,0.0,0.02,0.09,0.09,0.02,0.01,0.07,1.0,0.02
12 MINI TOADSTOOL PEGS,0.02,0.0,0.02,0.05,0.15,0.01,0.0,0.06,0.02,1.0


In [19]:
def get_date(data, customer: str, item: str):
    """
        Return a list of date(s) of purchase for a given item
    """
    return data['InvoiceDate'][(data['Customer ID'] == customer) & (data['Description'] == item)].values

### Look up table creation

In [22]:
from time import perf_counter
look_up = {}
pivot_data = pivot_data.iloc[:100,:100]
start = perf_counter()
similarity_threshold = 0.15

for item in pivot_data.index: # for item 
    for customer in pivot_data.columns: # for customer
        if pivot_data[customer][item] == 1: # if customer has bought item
            similar_items = []
            item_dates = get_date(data, customer, item) # get item date(s) of purchase
            purchased_also = list(pivot_data[customer][pivot_data[customer] == 1].index.values) # Get other items purchased by customer
            purchased_also.remove(item)
            for other_item in purchased_also: # for other items
                other_item_dates = get_date(data, customer, other_item) # get other item date(s) of purchase
                if len([date for date in item_dates if date in other_item_dates]) > 0 or (distance_matrix.loc[item, other_item] > similarity_threshold): 
                    similar_items.append(other_item) # If they have been bought on the same day or meet the similarity threshold they're similar

            if customer not in look_up.keys():
                look_up[customer] = {item: similar_items}
            else:
                look_up[customer].update({item: similar_items})
                
end = perf_counter()
execution_time = (end - start)
print(f'{execution_time:.2f}s - {execution_time/60:.2f}mins')

375.61s - 6.26mins


### Recommend Items

In [23]:
from collections import Counter

def most_frequent(List, n):
    return [i for i, item in Counter(List).most_common(n)]

In [24]:
def recommendations_from_purchases(customer: str, n: int):
    """
        Return n most common items that a customer hasn't purchased that are similar to the customers purchases
        
        customer - customer ID
        n - how many items to recommend
    """
    most_similar_item = []
    
    for i, item in enumerate(list(look_up[customer].keys())): # For all purchased items
        items_not_purchased = set(distance_matrix.index) - set(look_up[customer].keys())
        similar_items = distance_matrix.loc[items_not_purchased, look_up[customer]][item].sort_values(ascending=False).index[0] # most similar item
        most_similar_item.append(similar_items) # Create list of the most similar item for each item purchased
    
    recommended =  most_frequent(most_similar_item, n) # Recommend most common items
    return print('Based on your purchases we recommend: ', *recommended, sep='\n')

In [25]:
recommendations_from_purchases(customer='12748', n=3)

Based on your purchases we recommend: 
SET OF 4 ROSE BOTANICAL CANDLES
RIBBON REEL SNOWY VILLAGE
12 PINK ROSE PEG PLACE SETTINGS


In [26]:
def recommendation_from_others_purchases(customer, item: str, n=10):
    """
        Given an item check what other customers also purchased - may be used in a search function
        
        customer - customer ID
        item - item name/ID
        n - n most common items
    """
    also_purchased = []
    items_not_purchased = set(distance_matrix.index) - set(look_up[customer].keys())
    for other_customer in list(set(look_up.keys()) - set(customer)):
        try:
            also_purchased.append(look_up[other_customer][item])
        except:
            continue
    also_purchased = [item for sublist in also_purchased for item in sublist] # Flatten list
    also_purchased = set(also_purchased).intersection(items_not_purchased) # Items customers also purchased that the customer hasn't
    
    if also_purchased:
        return print(f'People that bought {item} also bought these items:', *also_purchased, sep='\n')
    else:
        similar_items = distance_matrix.loc[items_not_purchased, :][item].sort_values(ascending=False).index[1:n]
        return print(f'You may be interested in these items: ', *similar_items, sep='\n')

In [27]:
recommendation_from_others_purchases('12895', '36 DOILIES VINTAGE CHRISTMAS')

People that bought 36 DOILIES VINTAGE CHRISTMAS also bought these items:
12 PENCILS SMALL TUBE RED SPOTTY
12 PENCIL SMALL TUBE WOODLAND
3 TIER CAKE TIN RED AND CREAM
36 DOILIES DOLLY GIRL
