## General Context 

Recommender systems have become an important part of the retail industry by providing decision-making support to customers. Studies such as Iyengar and Lepper (2000) suggest that customers are more likely to buy when the choice in front of them is easy to make rather than overwhelming. Given the number of options available, especially in online shopping, extra guidance on these choices can make a real difference and lead to an increase in sales. As an example, a commonly cited estimate is that **35% of Amazon sales come from recommendations**. Moreover, recommender systems are a useful alternative to search algorithms, since they help users discover items they might not have found otherwise.

Recommender systems usually rely on **collaborative filtering, content-based filtering**, or both. Collaborative filtering approaches build a model from a **user's past behavior (items previously purchased or selected and/or numerical ratings given to those items)** as well as **similar decisions made by other users**; they rely solely on user/item interaction data. In contrast, **content-based filtering relies** on **item attribute** data, which it uses to recommend items with **similar properties to the ones a user has liked** in the past. Modern recommender systems typically combine two or more approaches into a **hybrid system**.
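To make the collaborative filtering idea concrete, here is a minimal sketch of item-to-item similarity computed purely from interaction data. The toy matrix, the function names and the choice of cosine similarity are assumptions made for illustration; this is not the method used later in this notebook.

```python
import numpy as np

# Toy user-item purchase counts (rows: users, columns: items); values are made up.
interactions = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

def item_cosine_similarity(matrix):
    """Cosine similarity between the item columns of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody bought
    normalized = matrix / norms
    return normalized.T @ normalized

similarity = item_cosine_similarity(interactions)

def similar_items(item_index, top_n=2):
    """Indices of the items most similar to item_index, excluding the item itself."""
    scores = similarity[item_index].copy()
    scores[item_index] = -np.inf
    return np.argsort(-scores)[:top_n]

print(similar_items(0, top_n=1))
```

Items bought by the same users end up with high similarity, so no item attributes are needed. A content-based approach would instead compare attributes such as the product description.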

## Business Situation 

ManyGiftsUK is a UK-based and registered non-store online retailer with some 80 members of staff. The company was established in 1981 mainly selling unique all-occasion gifts. For years in the past, the merchant relied heavily on direct mailing catalogues, and orders were taken over phone calls. It was only 2 years ago that the company launched its own web site and shifted completely to the web. Since then the company has maintained a steady and healthy number of customers from all parts of the United Kingdom and the world, and has accumulated a huge amount of data about many customers. The company also uses Amazon.co.uk to market and sell its products. 

With this new data the company expects to build a recommender system that is able to facilitate user choices by recommending items the user likes and improve user experience when making purchases on its website. A particular challenge is the **cold start problem - how can we suggest relevant items to new customers?**

The customer transaction dataset held by the merchant has **8 variables** as shown below, and it contains all the transactions occurring **between 01/12/2010 and 09/12/2011**. Over that particular period, there were **25900 valid transactions** in total, associated with **4070 unique items and 4372 customers from 38 different countries**. The dataset has **541909 instances**, each representing a **particular item contained** in a transaction. It is also important to note that many of ManyGiftsUK's customers are wholesalers.

## Metadata

| Name                        | Meaning                                                                                                                                                        |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| InvoiceNo                   | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation.    |
| StockCode                   | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.                                                            |
| Description                 | Product (item) name. Nominal.                                                                                                                                  |
| Quantity                    | The quantities of each product (item) per transaction. Numeric.                                                                                                |
| InvoiceDate                 | Invoice Date and time. Numeric, the day and time when each transaction was generated.                                                                          |
| UnitPrice                   | Unit price. Numeric, Product price per unit in pounds.                                                                                                         |
| CustomerID                  | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.                                                                        |
| Country                     | Country name. Nominal, the name of the country where each customer resides.                                                                                    |

## Expected Outcomes

1. Explore the data and build models to answer the problems:
    1. Recommender system: the website homepage offers a wide range of products the user might be interested in
    2. Cold start: offer relevant products to new customers
2. Implement adequate evaluation strategies and select an appropriate quality measure
3. In the deployment phase, elaborate on the challenges and recommendations in implementing the recommender system

## Importing Packages

In [1]:
#standard
import os
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import plotly_express as px
from math import ceil
from itertools import combinations
#scipy
from scipy.sparse import coo_matrix
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve
#ALS
import implicit
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import ranking_metrics_at_k
from tqdm import tqdm
import random
#Sklearn for metrics and preprocessing
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler

# Data Exploration

## Dataset Problem

In [2]:
retail = pd.read_csv('retail.csv')
#retail[['day', 'time']] = retail['InvoiceDate'].str.split(' ', 1, expand=True)
retail

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/2011 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/2011 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/2011 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/2011 12:50,4.15,12680.0,France



A few observations about the raw data:
- Records with a null CustomerID correspond to customers who don't have a user account. For recommendation purposes, every such customer should be treated as a new user, because we have no purchase history for them;
- Records with a negative Quantity are associated with cancellations, i.e. records whose InvoiceNo starts with "C";
- Records with a non-positive UnitPrice might occur for several reasons (e.g. offers, refunds, etc.). The Description of the record might contain some information about the reason.

From the user's point of view, it doesn't make sense to recommend products that other customers cancelled or returned, since those products failed to satisfy them or were even flagged as badly manufactured. Following this reasoning, these records were excluded, since they could bias customers towards products that don't meet their expectations or preferences.

In [4]:
df = retail[(retail.Quantity > 0) & (retail.UnitPrice > 0)]

In [5]:
new_users = df[df.CustomerID.isna()].copy()        # purchases made without a user account
retail_active = df[df.CustomerID.notna()].copy()   # purchases by identified customers
refund_cancel = retail[(retail.Quantity < 0) | (retail.UnitPrice < 0)].copy()  # cancellations and refunds

### Cancel or Refund Orders Exploration

In [6]:
print(" There were", len(refund_cancel.InvoiceNo.unique()), "cancelled orders from " , len(refund_cancel.Country.unique()), "countries, relative to ",len(refund_cancel.StockCode.unique()), "items!")

 There were 5174 cancelled orders from  30 countries, relative to  2560 items!


In [7]:
refund_cancel = refund_cancel.assign(Quantity=refund_cancel.Quantity.abs())  # work with positive quantities
most_cancel_items = refund_cancel.groupby(['Description','Country']).agg({'Quantity':['sum']}).reset_index()
most_cancel_items.columns = ['Description','Country', 'Quantity']
most_cancel_items.sort_values(by='Quantity', ascending=False).head(10)


Unnamed: 0,Description,Country,Quantity
1699,"PAPER CRAFT , LITTLE BIRDIE",United Kingdom,80995
1465,MEDIUM CERAMIC TOP STORAGE JAR,United Kingdom,74467
2953,printing smudges/thrown away,United Kingdom,19200
2675,"Unsaleable, destroyed.",United Kingdom,15644
2914,check,United Kingdom,13247
102,?,United Kingdom,9496
2152,ROTATING SILVER ANGELS T-LIGHT HLDR,United Kingdom,9370
1897,Printing smudges/thrown away,United Kingdom,9058
794,Damaged,United Kingdom,7540
2977,throw away,United Kingdom,5368


### New Users Exploration

In [8]:
new_users

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom
1447,536544,21790,VINTAGE SNAP CARDS,9,12/1/2010 14:32,1.66,,United Kingdom
...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,12/9/2011 10:26,4.13,,United Kingdom
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,12/9/2011 10:26,4.13,,United Kingdom
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,12/9/2011 10:26,4.96,,United Kingdom
541539,581498,85174,S/4 CACTI CANDLES,1,12/9/2011 10:26,10.79,,United Kingdom


In [9]:
print(" There were", len(new_users.InvoiceNo.unique()), " invoices from new users from " , len(new_users.Country.unique()), "countries, relative to ",len(new_users.StockCode.unique()), "items!")

 There were 1428  invoices from new users from  9 countries, relative to  3408 items!


In [10]:
new_users = new_users.assign(Quantity=new_users.Quantity.abs())  # quantities are already positive after filtering
most_new_users_items = new_users.groupby(['Description','Country']).agg({'Quantity':['sum']}).reset_index()
most_new_users_items.columns = ['Description','Country', 'Quantity']
most_new_users_items.sort_values(by='Quantity', ascending=False).head(10)


Unnamed: 0,Description,Country,Quantity
708,CHARLOTTE BAG SUKI DESIGN,United Kingdom,9167
2797,POPCORN HOLDER,United Kingdom,5803
2952,RED RETROSPOT CHARLOTTE BAG,United Kingdom,4946
4238,WOODLAND CHARLOTTE BAG,United Kingdom,4041
2483,PAPER CHAIN KIT 50'S CHRISTMAS,United Kingdom,3738
2862,RABBIT NIGHT LIGHT,United Kingdom,3537
2514,PARTY BUNTING,United Kingdom,2981
3758,STRAWBERRY CHARLOTTE BAG,United Kingdom,2951
2435,PACK OF 72 RETROSPOT CAKE CASES,United Kingdom,2521
706,CHARLOTTE BAG PINK POLKADOT,United Kingdom,2403


Since these are **new customers**, the company **doesn't have any data on their historical purchases**.
To build a recommendation system for this **small subset** of clients, the most-bought items in each customer's country are used to make **recommendations to new users**.
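The country-popularity idea can be sketched independently of the notebook's data. The version below is a hedged illustration using only the standard library: the function names, the `min_items` threshold and the fallback to overall popularity for countries with little or no history are assumptions added here, and the quantities in the usage example are made up.

```python
from collections import Counter, defaultdict

def build_popularity(rows):
    """Aggregate quantities from (country, description, quantity) tuples."""
    by_country = defaultdict(Counter)
    overall = Counter()
    for country, description, quantity in rows:
        by_country[country][description] += quantity
        overall[description] += quantity
    return by_country, overall

def recommend_for_new_user(country, by_country, overall, top_n=5, min_items=5):
    """Top-N items for a country; fall back to overall popularity (an assumption)
    when the country has fewer than min_items distinct products on record."""
    counts = by_country.get(country)
    if counts is None or len(counts) < min_items:
        counts = overall
    return [item for item, _ in counts.most_common(top_n)]

# Illustrative rows only; the quantities are not taken from the dataset.
rows = [
    ("France", "RABBIT NIGHT LIGHT", 10),
    ("France", "MINI PAINT SET VINTAGE", 4),
    ("United Kingdom", "POPCORN HOLDER", 7),
    ("United Kingdom", "RABBIT NIGHT LIGHT", 1),
]
by_country, overall = build_popularity(rows)
print(recommend_for_new_user("France", by_country, overall, top_n=1, min_items=2))
print(recommend_for_new_user("Brazil", by_country, overall, top_n=2, min_items=2))
```

The fallback matters for countries with very few invoices, where a per-country ranking would be driven by one or two orders.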

## Recommendation for new users based on Country historical purchases 

Checking for some inconsistencies/missing values

In [11]:
retail_active.isna().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [12]:
retail_active[['StockCode', 'Description', 'Quantity','Country']]

Unnamed: 0,StockCode,Description,Quantity,Country
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,United Kingdom
1,71053,WHITE METAL LANTERN,6,United Kingdom
2,84406B,CREAM CUPID HEARTS COAT HANGER,8,United Kingdom
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,United Kingdom
4,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,United Kingdom
...,...,...,...,...
541904,22613,PACK OF 20 SPACEBOY NAPKINS,12,France
541905,22899,CHILDREN'S APRON DOLLY GIRL,6,France
541906,23254,CHILDRENS CUTLERY DOLLY GIRL,4,France
541907,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,France


### Most bought items from historical data from regular customers

In [13]:
most_bought_items = retail_active.groupby(['Description','Country']).agg({'Quantity':['sum']}).reset_index()
most_bought_items.columns = ['Description','Country', 'Quantity']
most_bought_items.sort_values(by='Quantity', ascending=False).head(10)

Unnamed: 0,Description,Country,Quantity
11074,"PAPER CRAFT , LITTLE BIRDIE",United Kingdom,80995
9417,MEDIUM CERAMIC TOP STORAGE JAR,United Kingdom,76919
18900,WORLD WAR 2 GLIDERS ASSTD DESIGNS,United Kingdom,49182
8272,JUMBO BAG RED RETROSPOT,United Kingdom,41981
18403,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom,34648
1100,ASSORTED COLOUR BIRD ORNAMENT,United Kingdom,32727
12342,POPCORN HOLDER,United Kingdom,28935
10437,PACK OF 12 LONDON TISSUES,United Kingdom,24337
2464,BROCADE RING PURSE,United Kingdom,22711
10799,PACK OF 72 RETROSPOT CAKE CASES,United Kingdom,22465


### Density Matrix

In [14]:
#Calculate the density of the rating matrix by Country

coldstart_matrix = most_bought_items.pivot(index = 'Description', columns ='Country', values = 'Quantity').fillna(0)
print('Shape of coldstart_matrix: ', coldstart_matrix.shape)

#0 represents that the item was not bought, so in order to count the items bought,
#the value has to be different from zero
given_num_of_purchases = np.count_nonzero(coldstart_matrix)
print('given_num_of_purchases = ', given_num_of_purchases)
possible_num_of_purchases = coldstart_matrix.shape[0] * coldstart_matrix.shape[1]
print('possible_num_of_purchases = ', possible_num_of_purchases)
density = (given_num_of_purchases/possible_num_of_purchases)
density *= 100
print ('density: {:4.2f}%'.format(density))

Shape of coldstart_matrix:  (3877, 37)
given_num_of_purchases =  19321
possible_num_of_purchases =  143449
density: 13.47%


### Ranking Assignment: Rank 1 refers to the most-purchased item

In [15]:
# Use popularity based recommender model to make predictions
def recommend(database, country, top_n=5):
    # `database` is kept so that existing calls keep working; the ranking itself
    # is read from the global coldstart_matrix (items x countries quantities)
    train_data_sort = coldstart_matrix[country].sort_values(ascending=False).to_frame()
    train_data_sort['Rank'] = coldstart_matrix[country].rank(ascending=False, method='first')

    # Copy the top rows so that adding a column does not trigger SettingWithCopyWarning
    new_user_recommendations = train_data_sort.head(top_n).copy()

    # Add the user country for which the recommendations are generated, as the first column
    new_user_recommendations.insert(0, 'Country', country)

    return new_user_recommendations

###  An example for a recommendation system for a new user for any of the following countries

In [16]:
find_recom = ['France','United Kingdom','USA']   # This list is user choice.

In [17]:
for i in find_recom:
    print("Here is the recommendation for the user's Country: %s\n" %(i))
    print(recommend(retail,i))    
    print("\n") 

Here is the recommendation for the user's Country: France

                              Country  France  Rank
Description                                        
RABBIT NIGHT LIGHT             France  4000.0   1.0
MINI PAINT SET VINTAGE         France  2196.0   2.0
RED TOADSTOOL LED NIGHT LIGHT  France  1291.0   3.0
SET/6 RED SPOTTY PAPER CUPS    France  1272.0   4.0
ASSORTED COLOUR BIRD ORNAMENT  France  1204.0   5.0


Here is the recommendation for the user's Country: United Kingdom

                                           Country  United Kingdom  Rank
Description                                                             
PAPER CRAFT , LITTLE BIRDIE         United Kingdom         80995.0   1.0
MEDIUM CERAMIC TOP STORAGE JAR      United Kingdom         76919.0   2.0
WORLD WAR 2 GLIDERS ASSTD DESIGNS   United Kingdom         49182.0   3.0
JUMBO BAG RED RETROSPOT             United Kingdom         41981.0   4.0
WHITE HANGING HEART T-LIGHT HOLDER  United Kingdom         34648.0   5.0


Here is the recommendation for the user's Country: USA

                                    Country   USA  Rank
Description                                            
SET 12 COLOURING PENCILS DOILY          USA  88.0   1.0
12 PENCILS SMALL TUBE RED RETROSPOT     USA  72.0   2.0
SET/10 BLUE POLKADOT PARTY CANDLES      USA  72.0   3.0
SET/10 IVORY POLKADOT PARTY CANDLES     USA  72.0   4.0
SET/10 PINK POLKADOT PARTY CANDLES      USA  72.0   5.0




## Recommendation System Predictions for Regular User

### Tackling the Cold Start Problem

There are two kinds of user/item interaction data available: explicit and implicit.
- **Explicit**: A score, such as a rating or a like
- **Implicit**: Not as obvious in terms of preference, such as a click, view, or purchase

The most commonly discussed example of explicit data is movie ratings, which are given on a numeric scale. We can easily see whether a user enjoyed a movie based on the rating provided. The problem, however, is that most of the time people don’t provide ratings at all, so the amount of explicit data available is quite scarce. Sometimes we get access to data about certain interactions between users and items that give us some degree of certainty about whether a user likes an item - this is what we call implicit data. With **implicit data**, the **more interactions a user has with an item**, the more **certain we are about their preference**. A common example is viewing a product on the Amazon website, or even purchasing it.

The right target audience for an advertisement is best calculated by looking at the **former visitors for the ad**. According to the basic assumption of the **collaborative filtering concept**, if an **ad was already popular with a certain group of people**, then **others** that **fit the group’s profile** are likely to **respond well to the ad**.

In our case, we suggested an **Explicit**-style method for the recommendations to **new users**: a ranking (similar to a rating) built from the previous invoices of other customers.

For **Regular Users**, for whom we have historical data about purchases and behaviour, it was decided to apply an **Implicit approach** using **ALS** (Alternating Least Squares).
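A common way to turn purchase quantities into implicit feedback is to separate a binary preference from a confidence that grows with the quantity. The sketch below assumes the usual formulation p = 1 if quantity > 0 and c = 1 + alpha * quantity; the toy matrix is made up, and `alpha = 15` mirrors the scaling factor applied later in this notebook when training ALS.

```python
import numpy as np

# Toy purchase quantities (rows: customers, columns: items); values are illustrative.
quantities = np.array([
    [0, 6, 0],
    [2, 0, 0],
    [0, 1, 40],
])

alpha = 15  # same scaling factor the notebook uses when training ALS

# p_ui: 1 if the customer ever bought the item, 0 otherwise
preference = (quantities > 0).astype(int)

# c_ui: every pair gets a baseline confidence of 1, which grows with the quantity bought
confidence = 1 + alpha * quantities

print(preference)
print(confidence)
```

Under this formulation a pair with no purchase keeps a low confidence rather than being treated as a firm dislike, which suits data where most products are simply never seen by a given customer.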

## Alternating Least Squares

Matrix factorization is applied in the realm of **dimensionality reduction**, where we are trying to reduce the number of features while still **keeping the relevant information**. This is the case with principal component analysis (PCA) and the very similar singular value decomposition (SVD).

Essentially, can we take a large matrix of user/item interactions and figure out the **latent (or hidden) features that relate them to each other** in a much smaller matrix of user features and item features? That’s exactly what ALS is trying to do through matrix factorization.
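To show what "alternating" means in practice, here is a minimal dense NumPy sketch of implicit-feedback ALS. It is a toy written for readability, not the implementation inside the `implicit` library used in this notebook; the function name `implicit_als_sketch`, the toy matrix and the default hyperparameters are assumptions. Each step fixes one set of factors and solves a small regularized least-squares problem for every row of the other set.

```python
import numpy as np

def implicit_als_sketch(quantities, factors=2, alpha=15.0, regularization=0.1,
                        iterations=20, seed=0):
    """Toy dense ALS for implicit feedback; returns (user_vecs, item_vecs)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = quantities.shape
    preference = (quantities > 0).astype(float)   # p_ui
    confidence = 1.0 + alpha * quantities          # c_ui
    user_vecs = rng.normal(scale=0.1, size=(n_users, factors))
    item_vecs = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)

    def solve(fixed, conf, pref):
        # For each row i solve (F^T C_i F + reg) x_i = F^T C_i p_i
        out = np.empty((conf.shape[0], factors))
        for i in range(conf.shape[0]):
            weighted = fixed * conf[i][:, None]    # C_i F
            lhs = fixed.T @ weighted + reg
            rhs = weighted.T @ pref[i]
            out[i] = np.linalg.solve(lhs, rhs)
        return out

    for _ in range(iterations):
        user_vecs = solve(item_vecs, confidence, preference)      # items fixed
        item_vecs = solve(user_vecs, confidence.T, preference.T)  # users fixed
    return user_vecs, item_vecs

# Two groups of customers buying two disjoint groups of items (made-up quantities)
toy = np.array([
    [5, 3, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 6, 1],
    [0, 0, 2, 4],
], dtype=float)

user_vecs, item_vecs = implicit_als_sketch(toy)
scores = user_vecs @ item_vecs.T  # predicted preference for every user/item pair
print(np.round(scores, 2))
```

The loop over rows is slow for real data; libraries avoid building the dense confidence matrix and exploit sparsity instead.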

## Inventory Check

In [18]:
items = retail_active[['StockCode', 'Description']].drop_duplicates() # Only get unique item/description pairs
items['StockCode'] = items.StockCode.astype(str) # Encode as strings for future lookup ease

In [19]:
items

Unnamed: 0,StockCode,Description
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER
1,71053,WHITE METAL LANTERN
2,84406B,CREAM CUPID HEARTS COAT HANGER
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE
4,84029E,RED WOOLLY HOTTIE WHITE HEART.
...,...,...
527067,90214W,"LETTER ""W"" BLING KEY RING"
527069,90214Z,"LETTER ""Z"" BLING KEY RING"
530382,90089,PINK CRYSTAL SKULL PHONE CHARM
537621,85123A,CREAM HANGING HEART T-LIGHT HOLDER


Some items are missing from this lookup table because it is built only from records that have a CustomerID.

In [20]:
retail_active = retail_active.assign(CustomerID=retail_active.CustomerID.astype(int)) # Convert customer ID to int
inventory = retail_active[['StockCode', 'Quantity', 'CustomerID']] # Get rid of unnecessary info


In [21]:
inventory = inventory.groupby(['CustomerID', 'StockCode']).sum().reset_index() # Group together
inventory.loc[inventory.Quantity == 0, 'Quantity'] = 1 # Replace a sum of zero purchases with a one to
# indicate purchased
purchases = inventory.query('Quantity > 0') # Only get customers where purchase totals were positive


In [22]:
purchases

Unnamed: 0,CustomerID,StockCode,Quantity
0,12346,23166,74215
1,12347,16008,24
2,12347,17021,36
3,12347,20665,6
4,12347,20719,40
...,...,...,...
266787,18287,84920,4
266788,18287,85039A,96
266789,18287,85039B,120
266790,18287,85040A,48


In [23]:
customer_codes = purchases.CustomerID.astype('category') # Categories are the ordered unique customer IDs
product_codes = purchases.StockCode.astype('category') # Categories are the ordered unique stock codes

# Take the lookup lists from the categories so that list position i matches matrix row/column i
customers = list(customer_codes.cat.categories) # Get our unique customers
products = list(product_codes.cat.categories) # Get our unique products that were purchased
quantity = list(purchases.Quantity) # All of our purchases

customers_rows = customer_codes.cat.codes # Get the associated row indices
stock_itens_cols = product_codes.cat.codes # Get the associated column indices
matrix_customer = sparse.csr_matrix((quantity, (customers_rows, stock_itens_cols)), shape=(len(customers), len(products)))

In [24]:
matrix_customer

<4338x3665 sparse matrix of type '<class 'numpy.intc'>'
	with 266792 stored elements in Compressed Sparse Row format>

In [25]:
matrix_size = matrix_customer.shape[0]*matrix_customer.shape[1] # Number of possible interactions in the matrix
num_purchases = len(matrix_customer.nonzero()[0]) # Number of items interacted with
sparsity = 100*(1 - (num_purchases/matrix_size))
sparsity

98.3219330803578

**98.3%** of the entries in the interaction matrix are empty, i.e. the matrix is very **sparse**!
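Written out with the matrix shape (4338 customers by 3665 products) and the 266792 stored interactions reported above, the calculation is:

$$
\text{sparsity} = 100 \times \left(1 - \frac{266\,792}{4338 \times 3665}\right) = 100 \times \left(1 - \frac{266\,792}{15\,898\,770}\right) \approx 98.32\%
$$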

## Train and Test Split

In [26]:
#https://jessesw.com/Rec-System/
def make_train(ratings, pct_test = 0.2):
    '''
    This function will take in the original user-item matrix and "mask" a percentage of the original ratings where a
    user-item interaction has taken place for use as a test set. The test set will contain all of the original ratings, 
    while the training set replaces the specified percentage of them with a zero in the original ratings matrix. 
    
    parameters: 
    
    ratings - the original ratings matrix from which you want to generate a train/test set. Test is just a complete
    copy of the original set. This is in the form of a sparse csr_matrix. 
    
    pct_test - The percentage of user-item interactions where an interaction took place that you want to mask in the 
    training set for later comparison to the test set, which contains all of the original ratings. 
    
    returns:
    
    training_set - The altered version of the original data with a certain percentage of the user-item pairs 
    that originally had interaction set back to zero.
    
    test_set - A copy of the original ratings matrix, unaltered, so it can be used to see how the rank order 
    compares with the actual interactions.
    
    user_inds - From the randomly selected user-item indices, which user rows were altered in the training data.
    This will be necessary later when evaluating the performance via AUC.
    '''
    test_set = ratings.copy() # Make a copy of the original set to be the test set. 
    test_set[test_set != 0] = 1 # Store the test set as a binary preference matrix
    training_set = ratings.copy() # Make a copy of the original data we can alter as our training set. 
    nonzero_inds = training_set.nonzero() # Find the indices in the ratings data where an interaction exists
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1])) # Zip these pairs together of user,item index into list
    random.seed(0) # Set the random seed to zero for reproducibility
    num_samples = int(np.ceil(pct_test*len(nonzero_pairs))) # Round the number of samples needed up to the next integer
    samples = random.sample(nonzero_pairs, num_samples) # Sample a random number of user-item pairs without replacement
    user_inds = [index[0] for index in samples] # Get the user row indices
    item_inds = [index[1] for index in samples] # Get the item column indices
    training_set[user_inds, item_inds] = 0 # Assign all of the randomly chosen user-item pairs to zero
    training_set.eliminate_zeros() # Get rid of zeros in sparse array storage after update to save space
    return training_set, test_set, list(set(user_inds)) # Output the unique list of user rows that were altered  

This will return our **training set**, a **test set that has been binarized to 1/0 for purchased/not purchased**, and a list of which **users had at least one item masked**. We will test the performance of the recommender system on these users only.
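The masking step is easier to follow on a matrix small enough to print. The sketch below is a dense NumPy analogue of `make_train`, written only for illustration: the function name `mask_interactions` and the toy ratings are assumptions, and it works on a dense array rather than a sparse `csr_matrix`.

```python
import random
import numpy as np

def mask_interactions(ratings, pct_test=0.2, seed=0):
    """Dense toy analogue of make_train: hide a share of the non-zero entries."""
    test_set = (ratings != 0).astype(int)      # binarized copy of the full data
    training_set = ratings.copy()
    rows, cols = np.nonzero(ratings)           # positions where an interaction exists
    pairs = list(zip(rows.tolist(), cols.tolist()))
    rng = random.Random(seed)                  # local generator for reproducibility
    num_samples = int(np.ceil(pct_test * len(pairs)))
    samples = rng.sample(pairs, num_samples)   # sample without replacement
    for user, item in samples:
        training_set[user, item] = 0           # hide the interaction from training
    altered_users = sorted({user for user, _ in samples})
    return training_set, test_set, altered_users

# Made-up purchase quantities: 5 customers, 4 products, 9 interactions
ratings = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 0],
    [1, 0, 0, 5],
    [0, 4, 2, 0],
    [6, 0, 0, 1],
])

train, test, altered = mask_interactions(ratings, pct_test=0.2)
print(train)
print(altered)
```

With 9 interactions and `pct_test=0.2`, two entries are hidden, so the training matrix keeps 7 non-zero values while the test matrix still marks all 9.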

In [27]:
product_train, product_test, product_users_altered = make_train(matrix_customer, pct_test = 0.2)

In [28]:
product_train

<4338x3665 sparse matrix of type '<class 'numpy.intc'>'
	with 213433 stored elements in Compressed Sparse Row format>

In [29]:
alpha = 15
user_vecs, item_vecs = implicit.alternating_least_squares((product_train*alpha).astype('double'), 
                                                          factors=20, 
                                                          regularization = 0.1, 
                                                         iterations = 50)

This method is deprecated. Please use the AlternatingLeastSquares class instead



In [30]:
def auc_score(predictions, test):
    '''
    This simple function will output the area under the curve using sklearn's metrics. 
    
    parameters:
    
    - predictions: your prediction output
    
    - test: the actual target result you are comparing to
    
    returns:
    
    - AUC (area under the Receiver Operating Characteristic curve)
    '''
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)   

In [31]:
def calc_mean_auc(training_set, altered_users, predictions, test_set):
    '''
    This function will calculate the mean AUC by user for any user that had their user-item matrix altered. 
    
    parameters:
    
    training_set - The training set resulting from make_train, where a certain percentage of the original
    user/item interactions are reset to zero to hide them from the model 
    
    predictions - The matrix of your predicted ratings for each user/item pair as output from the implicit MF.
    These should be stored in a list, with user vectors as item zero and item vectors as item one. 
    
    altered_users - The indices of the users where at least one user/item pair was altered from make_train function
    
    test_set - The test set constructed earlier from make_train function
    
    
    
    returns:
    
    The mean AUC (area under the Receiver Operating Characteristic curve) of the test set only on user-item interactions
    that were originally zero to test ranking ability, in addition to the mean AUC of the most popular items as a benchmark.
    '''
    
    
    store_auc = [] # An empty list to store the AUC for each user that had an item removed from the training set
    popularity_auc = [] # To store popular AUC scores
    pop_items = np.array(test_set.sum(axis = 0)).reshape(-1) # Get sum of item interactions to find most popular
    item_vecs = predictions[1]
    for user in altered_users: # Iterate through each user that had an item altered
        training_row = training_set[user,:].toarray().reshape(-1) # Get the training set row
        zero_inds = np.where(training_row == 0) # Find where the interaction had not yet occurred
        # Get the predicted values based on our user/item vectors
        user_vec = predictions[0][user,:]
        pred = user_vec.dot(item_vecs).toarray()[0,zero_inds].reshape(-1)
        # Get only the items that were originally zero
        # Select all ratings from the MF prediction for this user that originally had no interaction
        actual = test_set[user,:].toarray()[0,zero_inds].reshape(-1) 
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training 
        pop = pop_items[zero_inds] # Get the item popularity for our chosen items
        store_auc.append(auc_score(pred, actual)) # Calculate AUC for the given user and store
        popularity_auc.append(auc_score(pop, actual)) # Calculate AUC using most popular and store
    # End users iteration
    
    return float('%.3f'%np.mean(store_auc)), float('%.3f'%np.mean(popularity_auc))  
    # Return the mean AUC rounded to three decimal places for both test and popularity benchmark

In [32]:
calc_mean_auc(product_train, product_users_altered, 
              [sparse.csr_matrix(user_vecs), sparse.csr_matrix(item_vecs.T)], product_test)
# AUC for our recommender system

(0.875, 0.817)

Our recommender system beat popularity. **Our system** had a mean **AUC of 0.875**, while the **popular item benchmark** had a lower **AUC of 0.817**.

An AUC of 0.875 means that, on average, when the system compares an item the user **had purchased in the test set** with an item the user **never ended up** purchasing, it ranks the purchased item higher about 87.5% of the time.
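
AUC has a direct ranking interpretation: it is the share of (purchased, not purchased) item pairs in which the purchased item receives the higher score. The sketch below illustrates this on made-up data. The function name `pairwise_auc` and the toy arrays are hypothetical and are not part of the notebook, which uses its own `auc_score` helper defined earlier.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]  # scores of items that were purchased
    neg = scores[labels == 0]  # scores of items that were never purchased
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))

# Hypothetical toy data: 1 = purchased in the test set, 0 = never purchased
toy_labels = np.array([1, 0, 1, 0, 0, 0, 1, 0])
toy_scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.4, 0.1])

toy_auc = pairwise_auc(toy_scores, toy_labels)
print(toy_auc)  # 12 of the 15 pairs are ordered correctly
```

In this toy case 12 of the 15 pairs are ranked correctly, so the AUC is 0.8. A value of 0.5 would correspond to random ordering.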

In [33]:
customers_arr = np.array(list(np.sort(purchases.CustomerID.unique()))) # Array of customer IDs from the ratings matrix
products_arr = np.array(list(purchases.StockCode.unique())) # Array of product IDs from the ratings matrix

In [34]:
customers_arr

array([12346, 12347, 12348, ..., 18282, 18283, 18287], dtype=int64)

## Listing item descriptions from the item lookup table created earlier

In [35]:
def get_items_purchased(customer_id, mf_train, customers_list, products_list, item_lookup):
    '''
    Returns the items already purchased by a specific customer in the training set.
    
    parameters: 
    
    customer_id - The id number of the customer whose prior purchases you want to see
    
    mf_train - The initial ratings training set used (without weights applied)
    
    customers_list - The array of customers used in the ratings matrix
    
    products_list - The array of products used in the ratings matrix
    
    item_lookup - A simple pandas dataframe of the unique product ID/product descriptions available
    
    returns:
    
    A dataframe of item IDs and item descriptions for a particular customer that were already purchased in the training set
    '''
    cust_ind = np.where(customers_list == customer_id)[0][0] # Returns the index row of our customer id
    purchased_ind = mf_train[cust_ind,:].nonzero()[1] # Get column indices of purchased items
    prod_codes = products_list[purchased_ind] # Get the stock codes for our purchased items
    return item_lookup.loc[item_lookup.StockCode.isin(prod_codes)]

In [36]:
customers_arr[:5]

array([12346, 12347, 12348, 12349, 12350], dtype=int64)

### Items bought by the client with id number 12346

In [37]:
get_items_purchased(12346, product_train, customers_arr, products_arr, items)

| | StockCode | Description |
|---|---|---|
| 10108 | 22283 | 6 EGG HOUSE PAINTED WOOD |
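
The lookup chain used by `get_items_purchased` (customer id to matrix row, non-zero columns to stock codes, stock codes to descriptions) can be traced on a small example. Everything below is hypothetical toy data (`toy_customers`, `toy_products`, `toy_train`, `toy_lookup`), chosen only to show each intermediate value. It assumes, as the notebook does, that the customer and product arrays are in the same order as the matrix rows and columns.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: 3 customers (rows) x 4 products (columns)
toy_customers = np.array([101, 102, 103])
toy_products = np.array(['A1', 'B2', 'C3', 'D4'])
toy_train = sparse.csr_matrix(np.array([
    [0, 2, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
]))
toy_lookup = pd.DataFrame({
    'StockCode': ['A1', 'B2', 'C3', 'D4'],
    'Description': ['MUG', 'TRAY', 'MAGNET', 'PEN'],
})

# Step 1: customer id -> row position in the matrix
cust_ind = np.where(toy_customers == 101)[0][0]
# Step 2: row -> column positions holding a non-zero purchase count
purchased_ind = toy_train[cust_ind, :].nonzero()[1]
# Step 3: column positions -> stock codes
purchased_codes = toy_products[purchased_ind]
# Step 4: stock codes -> rows of the lookup table
purchased = toy_lookup.loc[toy_lookup.StockCode.isin(purchased_codes)]
print(purchased)
```

Customer 101 sits in row 0, whose non-zero columns are 1 and 3, so the result contains the rows for `B2` and `D4`.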


## What items does the recommender system say this customer should purchase?

In [38]:
def rec_items(customer_id, mf_train, user_vecs, item_vecs, customer_list, item_list, item_lookup, num_items = 10):
    '''
    This function will return the top recommended items to our users 
    
    parameters:
    
    customer_id - Input the customer's id number that you want to get recommendations for
    
    mf_train - The training matrix you used for matrix factorization fitting
    
    user_vecs - the user vectors from your fitted matrix factorization
    
    item_vecs - the item vectors from your fitted matrix factorization
    
    customer_list - an array of the customer's ID numbers that make up the rows of your ratings matrix 
                    (in order of matrix)
    
    item_list - an array of the products that make up the columns of your ratings matrix
                    (in order of matrix)
    
    item_lookup - A simple pandas dataframe of the unique product ID/product descriptions available
    
    num_items - The number of items you want to recommend in order of best recommendations. Default is 10. 
    
    returns:
    
    - The top n recommendations chosen based on the user/item vectors for items never interacted with/purchased
    '''
    
    cust_ind = np.where(customer_list == customer_id)[0][0] # Returns the index row of our customer id
    pref_vec = mf_train[cust_ind,:].toarray() # Get the ratings from the training set ratings matrix
    pref_vec = pref_vec.reshape(-1) + 1 # Add 1 to everything, so that items not purchased yet become equal to 1
    pref_vec[pref_vec > 1] = 0 # Make everything already purchased zero
    rec_vector = user_vecs[cust_ind,:].dot(item_vecs.T) # Get dot product of user vector and all item vectors
    # Scale this recommendation vector between 0 and 1
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0] 
    recommend_vector = pref_vec*rec_vector_scaled 
    # Items already purchased have their recommendation multiplied by zero
    product_idx = np.argsort(recommend_vector)[::-1][:num_items] # Sort the indices of the items into order 
    # of best recommendations
    rec_list = [] # start empty list to store items
    for index in product_idx:
        code = item_list[index]
        rec_list.append([code, item_lookup.Description.loc[item_lookup.StockCode == code].iloc[0]]) 
        # Append our descriptions to the list
    codes = [item[0] for item in rec_list]
    descriptions = [item[1] for item in rec_list]
    final_frame = pd.DataFrame({'StockCode': codes, 'Description': descriptions}) # Create a dataframe 
    return final_frame[['StockCode', 'Description']] # Switch order of columns around

In [39]:
rec_items(12346, product_train, user_vecs, item_vecs, customers_arr, products_arr, items,
                       num_items = 10)

| | StockCode | Description |
|---|---|---|
| 0 | 22473 | TV DINNER TRAY VINTAGE PAISLEY |
| 1 | 22282 | 12 EGG HOUSE PAINTED WOOD |
| 2 | 21015 | DARK BIRD HOUSE TREE DECORATION |
| 3 | 35965 | FOLKART HEART NAPKIN RINGS |
| 4 | 21949 | SET OF 6 STRAWBERRY CHOPSTICKS |
| 5 | 35911A | MULTICOLOUR RABBIT EGG WARMER |
| 6 | 21948 | SET OF 6 CAKE CHOPSTICKS |
| 7 | 23154 | SET OF 4 JAM JAR MAGNETS |
| 8 | 22610 | PENS ASSORTED FUNNY FACE |
| 9 | 22311 | OFFICE MUG WARMER BLACK+SILVER |
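
The core of `rec_items` is three steps: scale the raw scores to the 0 to 1 range, zero out items the customer already bought, and take the highest remaining scores. The sketch below walks through those steps on hypothetical toy vectors (`toy_purchased`, `toy_scores`). It does the min-max scaling with plain NumPy, which for a single column should give the same values as `MinMaxScaler`.

```python
import numpy as np

# Hypothetical toy inputs for one customer and five items
toy_purchased = np.array([0, 1, 0, 0, 1])           # 1 = already purchased in training
toy_scores = np.array([2.0, 9.0, 5.0, -1.0, 7.0])   # raw user-vector . item-vector scores

# Scale the scores to [0, 1]
scaled = (toy_scores - toy_scores.min()) / (toy_scores.max() - toy_scores.min())

# 1.0 for items not yet purchased, 0.0 for items already purchased
not_purchased = (toy_purchased == 0).astype(float)

# Already purchased items have their score multiplied by zero
masked = not_purchased * scaled

# Indices of the two best remaining items, best first
top_idx = np.argsort(masked)[::-1][:2]
print(top_idx)
```

Items 1 and 4 have the highest raw scores but are masked out, so items 2 and 0 are recommended. One consequence of multiplying by zero is that the unpurchased item with the lowest raw score also scales to 0 and ties with the purchased items; this only matters when `num_items` approaches the size of the catalogue.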
