# Collaborative Filtering Models

In this notebook I will use turicreate to build a recommendation system. Turicreate has built-in models such as popularity, cosine similarity and Pearson similarity, which makes it very user friendly. It is also designed to handle large datasets, which suits this project because the transaction table has 31 million rows. We have to convert pandas DataFrames into SFrames so that turicreate can process the data in a memory-efficient way.

I will be building a recommendation system model based on collaborative filtering, which is essentially predicting items a user may like or buy based on their similar purchase behaviour to other users in the past. 
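To make the idea concrete, here is a minimal sketch of neighbourhood-style collaborative filtering on made-up baskets. The customer names, the item names, the Jaccard overlap measure and the `recommend` helper are all assumptions chosen for illustration; this is not what turicreate does internally.

```python
# Toy purchase history: customer -> set of article ids (made-up data).
purchases = {
    "alice": {"tee", "jeans", "socks"},
    "bob": {"tee", "jeans", "hat"},
    "carol": {"dress", "scarf"},
}

def jaccard(a, b):
    """Overlap between two baskets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, purchases, k=2):
    """Score items the target has not bought by the similarity of their buyers."""
    seen = purchases[target]
    scores = {}
    for other, basket in purchases.items():
        if other == target:
            continue
        sim = jaccard(seen, basket)
        for item in basket - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

top_items = recommend("alice", purchases)
```

Bob shares two of Alice's items, so the hat he bought is ranked ahead of Carol's items, which come from a customer with no overlap.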

My prediction results will include a ranking of the top 8 products per customer. This type of collaborative filtering is closely related to what is often called Market Basket Analysis: providing a list of products a customer may want to put in their basket based on previous purchases.

I will be using 3 different variations of my datasets. The first one will be a table with customer id, article id and article purchase count. This will show how many of each article a customer bought. 

The second table will have a purchase dummy column. It ignores how many of an article a customer bought and only marks which articles they bought. This format is useful for cosine and Pearson similarity when we turn the tables into matrices.
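As a rough illustration of how a bought / not-bought matrix feeds a similarity measure, the sketch below computes the cosine similarity between article columns of a tiny made-up matrix. The matrix values and the helper names `column` and `cosine` are assumptions for this example only.

```python
import math

# Rows are customers, columns are articles; 1 = bought, 0 = not bought (toy data).
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]

def column(matrix, j):
    """Return article column j as a list of per-customer values."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

sim_0_1 = cosine(column(matrix, 0), column(matrix, 1))  # articles sharing two buyers
sim_0_2 = cosine(column(matrix, 0), column(matrix, 2))  # articles with no common buyer
```

Articles bought by the same customers get a similarity close to 1, while articles with no buyers in common get 0.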

The third table will have a column holding a normalized version of the article purchase count.

I will assess turicreate's popularity, cosine similarity and Pearson similarity models on each table and evaluate the results to see which model is most effective.

In [1]:
import pandas as pd
import numpy as np
import time
import turicreate as tc

from sklearn.model_selection import train_test_split


Load the data

In [3]:
customers = pd.read_csv('../src/data/cleancustomers.csv', index_col=0)

In [4]:
transactions = pd.read_csv('../src/data/cleantransactions.csv', index_col=0)


In [5]:
articles = pd.read_csv('../src/data/cleanarticles.csv', index_col=0)

In [6]:
transactions.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2,1
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2,1
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221001,0.020322,2,1
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2,1
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687001,0.016932,2,1


Looking at the most popular items by count and assigning to a new table with the column 'no_purchases'

In [7]:
popular_articles = transactions['article_id'].value_counts(ascending=False).rename_axis('article_id').reset_index(name='no_purchases')

In [8]:
popular_articles

Unnamed: 0,article_id,no_purchases
0,706016001,42672
1,706016002,30862
2,372860001,29337
3,610776002,25234
4,759871002,23799
...,...,...
104542,598581001,1
104543,598581002,1
104544,600665002,1
104545,656980001,1


In [9]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105126 entries, 0 to 105125
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   article_id                    105126 non-null  int64 
 1   product_code                  105126 non-null  int64 
 2   prod_name                     105126 non-null  object
 3   product_type_no               105126 non-null  int64 
 4   product_type_name             105126 non-null  object
 5   product_group_name            105126 non-null  object
 6   graphical_appearance_no       105126 non-null  int64 
 7   graphical_appearance_name     105126 non-null  object
 8   colour_group_code             105126 non-null  int64 
 9   colour_group_name             105126 non-null  object
 10  perceived_colour_value_id     105126 non-null  int64 
 11  perceived_colour_value_name   105126 non-null  object
 12  perceived_colour_master_id    105126 non-null  int64 
 13 

As there are about 105,000 different articles in total, I am going to remove those that weren't that popular. I will eventually be making a matrix with the articles as columns, so keeping only relatively popular products should improve run time and keep the suggestions relevant. I will first look at the products that had over 1000 transactions.

In [10]:
overthousand = popular_articles[popular_articles['no_purchases'] > 1000]

In [11]:
overthousand.count()

article_id      6864
no_purchases    6864
dtype: int64

6,864 articles still seems a very big number for my matrix, so I will now try > 2000 purchases

In [12]:
overtwothousand = popular_articles[popular_articles['no_purchases'] > 2000]

In [13]:
overtwothousand.count()

article_id      2188
no_purchases    2188
dtype: int64

I'm going to go with the top 1000 articles 

In [14]:
popular_articles = popular_articles.head(1000)

In [15]:
popular_articles

Unnamed: 0,article_id,no_purchases
0,706016001,42672
1,706016002,30862
2,372860001,29337
3,610776002,25234
4,759871002,23799
...,...,...
995,573716002,2906
996,720125039,2905
997,733098018,2905
998,556539016,2904


Selecting only the rows in the transaction table that include the 1000 most popular articles 

In [16]:
top1000transactions = transactions[transactions['article_id'].isin(popular_articles['article_id'])]

In [17]:
top1000transactions

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count
14,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,377277001,0.008458,2,2
48,2018-09-20,00401a367c5ac085cb9d4b77c56f3edcabf25153615db9...,573937001,0.032186,2,1
56,2018-09-20,00402f4463c8dc1b3ee54abfdea280e96cd87320449eca...,507909001,0.025407,1,1
67,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,611415001,0.016932,2,1
68,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,611415005,0.016932,2,1
...,...,...,...,...,...,...
28813368,2020-09-22,ff57364873464edd79e4807350b1cc0902d14f24490e48...,706016003,0.033881,2,1
28813369,2020-09-22,ff57364873464edd79e4807350b1cc0902d14f24490e48...,706016019,0.033881,2,1
28813386,2020-09-22,ff813df6887c2a6d7065aed247bf1db3d6f629eee23798...,806388002,0.013542,1,1
28813393,2020-09-22,ffb72741f3bc3d98855703b55d34e05bc7893a5d6a99a3...,762846006,0.025407,2,1
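A note on filtering with `isin`: pandas aligns a boolean Series with the frame on index labels, so the mask should keep the frame's own index. The sketch below uses a small made-up frame whose index has gaps (as can happen after cleaning) and renumbers the rows only after filtering. The frame `toy` and the Series `wanted` are assumptions for illustration.

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after rows are dropped during cleaning.
toy = pd.DataFrame(
    {"article_id": [10, 20, 10, 30]},
    index=[0, 2, 5, 9],
)
wanted = pd.Series([10, 30], name="article_id")

# The mask built by isin already carries the frame's index labels.
mask = toy["article_id"].isin(wanted)
filtered = toy[mask]

# Renumber the result afterwards if a clean 0..n-1 index is wanted.
filtered_renumbered = filtered.reset_index(drop=True)
```

Resetting the index of the mask itself, before it is used for selection, would relabel it 0..n-1 and it might no longer line up with the frame's labels.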


Now I will look at the number of articles purchased by each customer. As I will be comparing customers and the items they've bought, I will need to remove those customers who haven't bought many items 

In [18]:
top_customers = top1000transactions.groupby('customer_id')['article_id'].count().sort_values(ascending=False)

In [19]:
top_customers

customer_id
ffc247b933f175b37fccbb4f71c0479d6625e703b36f637be643afc224a8977f    184
be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee985513d9e8e53c6d91b    181
0bf4c6fd4e9d33f9bfb807bb78348cbf5c565846ff4006acf5c1b9aea77b0e54    167
2df54f0d0653811fe06479c93905f3e6ecc6d07edf39d8b56e5b66c86182bedf    162
d190fc2dc41e27e5f79f8b1f58bfcd7a13ab22857f39ca9f1cd12230520d58ab    149
                                                                   ... 
bab2682a1f1bff22a1d65d038948b6062ce0640c69ece88b68ba12905668a33b      1
4b767c31fae9030812089b6f14befb3f51483815d99b813008776b7d20a8d5d8      1
4b77256fcaa3ddbf509f048b30701a51d1b299480e9e8c74b367ee0324f1f686      1
4b773691182c0ced6569159bcec31e77eb2891cb8d1e6a57183006f4167bbac5      1
ffffd7744cebcf3aca44ae7049d2a94b87074c3d4ffe38b2236865d949d4df6a      1
Name: article_id, Length: 904179, dtype: int64

In [20]:
# put these into a df 
top_customers = top_customers.rename_axis('customer_id').reset_index(name='no_purchases')

In [21]:
top_customers

Unnamed: 0,customer_id,no_purchases
0,ffc247b933f175b37fccbb4f71c0479d6625e703b36f63...,184
1,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,181
2,0bf4c6fd4e9d33f9bfb807bb78348cbf5c565846ff4006...,167
3,2df54f0d0653811fe06479c93905f3e6ecc6d07edf39d8...,162
4,d190fc2dc41e27e5f79f8b1f58bfcd7a13ab22857f39ca...,149
...,...,...
904174,bab2682a1f1bff22a1d65d038948b6062ce0640c69ece8...,1
904175,4b767c31fae9030812089b6f14befb3f51483815d99b81...,1
904176,4b77256fcaa3ddbf509f048b30701a51d1b299480e9e8c...,1
904177,4b773691182c0ced6569159bcec31e77eb2891cb8d1e6a...,1


User-item collaborative filtering compares users based on the items they've bought, so it is not useful to compare users who only bought 1 or 2 items. I will keep only customers who bought more than 30 items, which makes 31 the minimum.

In [22]:
above30 = top_customers[top_customers['no_purchases'] > 30]

In [23]:
above30

Unnamed: 0,customer_id,no_purchases
0,ffc247b933f175b37fccbb4f71c0479d6625e703b36f63...,184
1,be1981ab818cf4ef6765b2ecaea7a2cbf14ccd6e8a7ee9...,181
2,0bf4c6fd4e9d33f9bfb807bb78348cbf5c565846ff4006...,167
3,2df54f0d0653811fe06479c93905f3e6ecc6d07edf39d8...,162
4,d190fc2dc41e27e5f79f8b1f58bfcd7a13ab22857f39ca...,149
...,...,...
11061,6da15d4c7e1f1ef76a436840ab38743556648ff800dc45...,31
11062,bebb648bb2d9523e5786d13a87b02dec2548973a901326...,31
11063,f30e2bed1a14665fdf7fc7887a1b2c4ea701f7bdd68f3a...,31
11064,d19933b2457ec2f7dce6d4daabe20de332127fbe18c256...,31


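Now I will select the transactions of the above30 customers from the original transactions table. Because the filter is applied to the original table, this brings back every article those customers bought, not only the top 1000 articles.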

In [24]:
filtered_df1 = transactions[transactions['customer_id'].isin(above30['customer_id'])]

In [25]:
filtered_df1

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id,article_purchase_count
66,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,578374001,0.042356,2,1
67,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,611415001,0.016932,2,1
68,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,611415005,0.016932,2,1
69,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,673677002,0.016932,2,1
70,2018-09-20,00609a1cc562140fa87a6de432bef9c9f0b936b259ad30...,676352001,0.025407,2,1
...,...,...,...,...,...,...
28813410,2020-09-22,ffd4cf2217de4a0a3f9f610cdec334c803692a18af08ac...,856440002,0.042356,2,1
28813411,2020-09-22,ffd4cf2217de4a0a3f9f610cdec334c803692a18af08ac...,896169005,0.050831,2,1
28813412,2020-09-22,ffd4cf2217de4a0a3f9f610cdec334c803692a18af08ac...,902288001,0.022017,2,1
28813413,2020-09-22,ffd4cf2217de4a0a3f9f610cdec334c803692a18af08ac...,910949002,0.050831,2,1


In [26]:
# grouping by customer_id and article id to make a column with the number of times each customer bought a certain item
no_purchases = filtered_df1.groupby(['customer_id', 'article_id'])['article_purchase_count'].count().reset_index()

### No Purchases Table, 1st table for Modeling

In [27]:
no_purchases

Unnamed: 0,customer_id,article_id,article_purchase_count
0,000fb6e772c5d0023892065e659963da90b1866035558e...,108775044,2
1,000fb6e772c5d0023892065e659963da90b1866035558e...,111565001,1
2,000fb6e772c5d0023892065e659963da90b1866035558e...,111586001,1
3,000fb6e772c5d0023892065e659963da90b1866035558e...,111593001,1
4,000fb6e772c5d0023892065e659963da90b1866035558e...,158340001,3
...,...,...,...
2006346,fffb68e203e88449a1dc7173e938b1b3e91b0c93ff4e1d...,841699003,1
2006347,fffb68e203e88449a1dc7173e938b1b3e91b0c93ff4e1d...,842063001,1
2006348,fffb68e203e88449a1dc7173e938b1b3e91b0c93ff4e1d...,878508003,1
2006349,fffb68e203e88449a1dc7173e938b1b3e91b0c93ff4e1d...,897693002,1


The no_purchases table will be my first table used for modeling.
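One detail worth checking: the group-by above uses `.count()`, which counts transaction rows per customer and article. If `article_purchase_count` in the transactions table already holds a per-row quantity (an assumption on my part, the source does not say), then `.sum()` would give a different number. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy transactions; 'article_purchase_count' is ASSUMED to be a per-row quantity.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "b"],
    "article_id": [1, 1, 2],
    "article_purchase_count": [2, 1, 1],
})
grouped = toy.groupby(["customer_id", "article_id"])["article_purchase_count"]

row_counts = grouped.count().reset_index()  # number of transaction rows per pair
quantities = grouped.sum().reset_index()    # total quantity per pair
```

Here customer `a` has two rows for article 1 but a total quantity of 3, so the two aggregations disagree. Which one is appropriate depends on what the column means in the cleaned data.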

### Dummy df - 2nd table for modeling
### Creating 2nd table showing whether a customer bought a product or not

Creating a dummy column which represents 1 for every article id bought.  

In [28]:
# making a copy of previous table
dummy_df = no_purchases.copy()

In [29]:
# adding a column to table 
dummy_df['dummy'] = 1

In [30]:
dummy_df.shape

(2006351, 4)

In [74]:
dummy_df.to_csv('../src/Data/dummy_df.csv')

### Data Normalized - 3rd table for modeling
### Creating 3rd table which normalizes the purchase frequency of a customer per item, 0 = the lowest purchase count recorded for an article and 1 = the highest

Creating a user-item matrix using the no_purchases table

In [31]:
# matrix of customer id as rows and articles id as columns 
df_matrix = pd.pivot_table(no_purchases, values='article_purchase_count', index='customer_id', columns='article_id')

In [32]:
df_matrix

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111565003,111586001,111593001,...,947509001,947599001,947934001,949198001,949551001,949551002,952267001,952938001,953763001,956217002
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000fb6e772c5d0023892065e659963da90b1866035558ec16fca51b0dcfb7e59,,2.0,,,,,1.0,,1.0,1.0,...,,,,,,,,,,
0024dea548c64fb75a563e0b300c0b16210decee446f1aa9ed6dcf2cc965d462,,,,,,,,,,,...,,,,,,,,,,
00357b192b81fc83261a45be87f5f3d59112db7d117513c1e908e6a7021edc35,,,,,,,,,,,...,,,,,,,,,,
0036a44bd648ce2dbc32688a465b9628b7a78395302f26dd57b4ed75dce9b70c,,,,,,,,,,,...,,,,,,,,,,
0040e2fc2d1e7931a38355aca56b2c62b87e65051b72878c813c093d7b9b87aa,,,,,,,,,4.0,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ffddc52a24cd9e170570b48773779eec8ad05bd0cf81639e278ceff3eb7a64d5,,,,,,,,,,,...,,,,,,,,,,
ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,,,,,,,,,,,...,,,,,,,,,,
fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e593881ae6007d775f0f,,,,,,,,,,,...,,,,,,,,,,
fffae8eb3a282d8c43c77dd2ca0621703b71e90904dfde2189bdd644f59071dd,,,,,,,,,,,...,,,,,,,,,,


Scaling the values

In [33]:
# min-max scaling per article column: subtract the column minimum from each cell, then divide by that column's (max - min)
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())

In [34]:
df_matrix_norm

article_id,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111565003,111586001,111593001,...,947509001,947599001,947934001,949198001,949551001,949551002,952267001,952938001,953763001,956217002
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000fb6e772c5d0023892065e659963da90b1866035558ec16fca51b0dcfb7e59,,0.333333,,,,,0.0,,0.0,0.0,...,,,,,,,,,,
0024dea548c64fb75a563e0b300c0b16210decee446f1aa9ed6dcf2cc965d462,,,,,,,,,,,...,,,,,,,,,,
00357b192b81fc83261a45be87f5f3d59112db7d117513c1e908e6a7021edc35,,,,,,,,,,,...,,,,,,,,,,
0036a44bd648ce2dbc32688a465b9628b7a78395302f26dd57b4ed75dce9b70c,,,,,,,,,,,...,,,,,,,,,,
0040e2fc2d1e7931a38355aca56b2c62b87e65051b72878c813c093d7b9b87aa,,,,,,,,,0.3,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ffddc52a24cd9e170570b48773779eec8ad05bd0cf81639e278ceff3eb7a64d5,,,,,,,,,,,...,,,,,,,,,,
ffe6376eb6b854d842e5a7714ea758de127f086a60d67d5cf425ef20361acea1,,,,,,,,,,,...,,,,,,,,,,
fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e593881ae6007d775f0f,,,,,,,,,,,...,,,,,,,,,,
fffae8eb3a282d8c43c77dd2ca0621703b71e90904dfde2189bdd644f59071dd,,,,,,,,,,,...,,,,,,,,,,


In [35]:
# putting this back into a long table to get rid of the many NaN values
df = df_matrix_norm.reset_index()
# using melt to unpivot the table
# id_vars = identifier column kept as-is; every other column becomes an (article_id, value) row
data_normalized = pd.melt(df, id_vars=['customer_id'], value_name='scaled_purchase_freq').dropna()
print(data_normalized.shape)
data_normalized.head()

(1758923, 3)


Unnamed: 0,customer_id,article_id,scaled_purchase_freq
18,00b19d74d6d60c157a9e08e8d83d9b38b1142a32676543...,108775015,0.0
29,00e59bc10e162c83758a8ece0d6536d96fe2c7afdae9d5...,108775015,0.0
31,00ef0d28ba9c5077e4758b08bc04210c4fecdfc154c084...,108775015,0.0
39,010109a899c4706836f7a96ec878ceb7ea9b0c24693205...,108775015,0.0
44,01111989ee5b96b0f0a47c6fed861fbe19f6d63d3cdb48...,108775015,0.0
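The normalized table has 1,758,923 rows, fewer than the 2,006,351 rows of no_purchases. That is consistent with how column-wise min-max scaling behaves: when every buyer of an article has the same count, max and min are equal, the division is 0/0, and the resulting NaN is removed by `dropna()`. A minimal sketch on made-up data (the frame `toy` and its values are assumptions):

```python
import pandas as pd

# Toy long table: article 2 was bought exactly once by everyone who bought it.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c", "a", "b"],
    "article_id": [1, 1, 1, 2, 2],
    "article_purchase_count": [1, 3, 5, 1, 1],
})
matrix = toy.pivot_table(values="article_purchase_count",
                         index="customer_id", columns="article_id")

# Column-wise min-max scaling; a constant column has span 0 and becomes NaN.
span = matrix.max() - matrix.min()
scaled = (matrix - matrix.min()) / span

# Unpivot back to long form and drop the NaN cells.
long_form = (scaled.reset_index()
             .melt(id_vars=["customer_id"], value_name="scaled_purchase_freq")
             .dropna())
```

Both rows for article 2 disappear from `long_form`. It also shows that a scaled value of 0.0 means "the lowest count among buyers of that article", not "never bought".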


In [36]:
data_normalized.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1758923 entries, 18 to 847024420
Data columns (total 3 columns):
 #   Column                Dtype  
---  ------                -----  
 0   customer_id           object 
 1   article_id            object 
 2   scaled_purchase_freq  float64
dtypes: float64(1), object(2)
memory usage: 53.7+ MB


In [37]:
data_normalized['scaled_purchase_freq'].nunique()

117

### Splitting train and test data 

Splitting each table into train and test. Holding out a test set helps prevent data leakage when testing the accuracy of the models later on.
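As a hedged sketch of what "testing the accuracy" can mean for a top-k recommender, the snippet below computes precision and recall at k for one customer. The helper `precision_recall_at_k`, the recommended list and the held-out list are all made up for illustration and are not output of the models in this notebook.

```python
def precision_recall_at_k(recommended, held_out, k):
    """Compare a ranked recommendation list with a customer's held-out purchases."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(held_out))
    precision = hits / k if k else 0.0
    recall = hits / len(held_out) if held_out else 0.0
    return precision, recall

# Made-up example: 8 recommended article ids, 4 articles bought in the test split.
recommended = [706016001, 706016002, 372860001, 610776002,
               759871002, 573716002, 720125039, 733098018]
held_out = [706016002, 759871002, 111111111, 222222222]

precision, recall = precision_recall_at_k(recommended, held_out, k=8)
```

Two of the eight recommendations appear in the held-out purchases, so precision is 2/8 and recall is 2/4.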

In [38]:
df1 = no_purchases
df2 = dummy_df
df3 = data_normalized

In [39]:
# function to split each table, using 0.25 as the test set fraction
# converting to tc.SFrame as it is much better equipped to handle large data than pandas
def tt_split(data):
    train, test = train_test_split(data, test_size = .25)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [40]:
# calling the train test split function
train_purchases, test_purchases = tt_split(df1)
train_dummy, test_dummy = tt_split(df2)
train_normalized, test_normalized = tt_split(df3)

Sanity check that splits of each table have the same number of rows

In [41]:
train_purchases.shape

(1504763, 3)

In [42]:
test_purchases.shape

(501588, 3)

In [43]:
train_dummy.shape

(1504763, 4)

In [44]:
test_dummy.shape

(501588, 4)
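Matching row counts do not mean matching rows: each call to `tt_split` shuffles independently, so the train rows of the purchases table and the dummy table are not the same customer-article pairs. If aligned splits are wanted, one option is to draw the row positions once with a fixed seed and reuse them. The sketch below does this with the standard library; `split_indices` is a hypothetical helper and the tables are toy stand-ins. With scikit-learn, passing the same `random_state` to `train_test_split` for tables of equal length would be the equivalent approach.

```python
import random

def split_indices(n_rows, test_size=0.25, seed=42):
    """Hypothetical helper: shuffle row positions once and cut them into train/test."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_size)
    return positions[n_test:], positions[:n_test]

# Two toy tables with the same rows, standing in for no_purchases and dummy_df.
table_a = [("c1", 10, 2), ("c1", 11, 1), ("c2", 10, 1), ("c3", 12, 4)]
table_b = [row + (1,) for row in table_a]  # same rows plus a dummy column

# Reuse one set of positions so both tables are split identically.
train_idx, test_idx = split_indices(len(table_a))
train_a = [table_a[i] for i in train_idx]
train_b = [table_b[i] for i in train_idx]
test_a = [table_a[i] for i in test_idx]
```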

In order to understand the results of more complex recommendation systems like collaborative filtering, I am going to start with a baseline model which recommends the most popular products to every customer. This way I can check that my more advanced models are actually more accurate than something as obvious as recommending the most popular products.

In [45]:
# defining the constant variables for my turicreate models 
user_id = 'customer_id'
item_id = 'article_id'
# customers_to_rec defines the customers that we will generate recommendations for
customers_to_rec = list(train_purchases['customer_id'].unique())
# n_rec is the number of items we will recommend to each customer
n_rec = 8
# number of rows we want to see in the initial output 
n_display = 24


Defining a function that wraps the 3 different turicreate models that we will be using:

- popularity
- cosine similarity
- Pearson correlation
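For intuition on how the Pearson measure differs from cosine, the sketch below computes a plain Pearson correlation between two item vectors: it is the cosine similarity of the vectors after subtracting each vector's mean. The `pearson` helper and the vectors are made up for illustration, and returning 0.0 for a constant vector is a convention of this sketch, not a statement about turicreate's implementation.

```python
import math

def pearson(u, v):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if denom == 0:
        return 0.0  # convention chosen for this sketch; the value is undefined
    return sum(a * b for a, b in zip(du, dv)) / denom

counts_item_a = [1, 2, 3, 4]
counts_item_b = [2, 4, 6, 8]   # same pattern, larger scale
constant_item = [1, 1, 1, 1]   # constant target, like a dummy column of ones

r_scaled = pearson(counts_item_a, counts_item_b)
r_constant = pearson(counts_item_a, constant_item)
```

A target that never varies has zero variance, so the correlation carries no information. That is worth keeping in mind when a Pearson model is fitted on a column that is always 1.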

In [46]:
def model(train_data, name, user_id, item_id, target, customers_to_rec, n_rec, n_display):
    # build the requested turicreate recommender
    if name == 'popularity':
        rec_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name in ('cosine', 'pearson'):
        # both variants use the item similarity recommender, only the similarity type differs
        rec_model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type=name)
    else:
        # fail early instead of hitting an undefined variable below
        raise ValueError(f"unknown model name: {name!r}")

    # top n_rec items for every customer, printing the first n_display rows
    recom = rec_model.recommend(users=customers_to_rec, k=n_rec)
    recom.print_rows(n_display)
    return rec_model, recom

In [47]:
train_purchases.head(5)

customer_id,article_id,article_purchase_count
d734b8669438f34b1878dc078 1f540dd889deb02f939f6 ...,516859002,1
cae43013ac28d76c74c98fd77 caaea2f937da1c2f743c7 ...,721762002,1
214375b663ccc87230c66d96d 8963a132ae0cb5c07a40c ...,894135001,1
ff0709a27864e67ae3eaef133 6da05d0f93a742a9f0e01 ...,697054008,1
1f21c6fe5ac3385492780d488 54d3cb02e378cfef7f83e ...,456163053,1


In [48]:
train_dummy.head(5)

customer_id,article_id,article_purchase_count,dummy
9de8892353db619dfda6b6ab5 99f7ee745b36f92d0757f ...,663564006,1,1
bfd273cb02e60c527ecc82f2e 6c2102d110c1067714704 ...,632832006,1,1
abf06a79182a8b5e4bba18b62 377eabbc363aee42e8de5 ...,658298007,1,1
2bdcbd16a0e11e87335237a2f bdcc4b4556afc130bad4e ...,854043006,1,1
9e471f421774d518e31b3cefb d7472a2733db531e18c7a ...,824079001,1,1


In [49]:
train_normalized.head(5)

customer_id,article_id,scaled_purchase_freq
8bfb613161a72097298c42551 1479f64e17a1985fea336 ...,803592001,0.0
e47526e62a1804059294f8ef3 9f543836f81a98593efec ...,706016006,0.0
45c29301075615c4e896dddf7 3f5a08b8b0c8d9e6d2bec ...,653828014,0.0
7f83f69e4c5605835f6beba0c 045c8e62ac472115016a8 ...,373506001,0.0
e2b69a4ddfff8ec812755ebfc 57e8bfcec676435d6647c ...,697564022,0.0


## Popularity model

### Train_purchases table

In [50]:
name = 'popularity'
target = 'article_purchase_count'

popularity_purchase, _ = model(train_purchases, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)
_.to_dataframe()["article_id"].nunique()

+-------------------------------+------------+-------+------+
|          customer_id          | article_id | score | rank |
+-------------------------------+------------+-------+------+
| 250039015607871019994d80d9... | 554141018  |  8.0  |  1   |
| 250039015607871019994d80d9... | 822161001  |  5.0  |  2   |
| 250039015607871019994d80d9... | 653358001  |  4.5  |  3   |
| 250039015607871019994d80d9... | 644032001  |  4.0  |  4   |
| 250039015607871019994d80d9... | 652671001  |  4.0  |  5   |
| 250039015607871019994d80d9... | 237222016  |  4.0  |  6   |
| 250039015607871019994d80d9... | 703737003  |  4.0  |  7   |
| 250039015607871019994d80d9... | 647837001  |  4.0  |  8   |
| eae27980951a8e176ee6721ba1... | 554141018  |  8.0  |  1   |
| eae27980951a8e176ee6721ba1... | 822161001  |  5.0  |  2   |
| eae27980951a8e176ee6721ba1... | 653358001  |  4.5  |  3   |
| eae27980951a8e176ee6721ba1... | 644032001  |  4.0  |  4   |
| eae27980951a8e176ee6721ba1... | 652671001  |  4.0  |  5   |
| eae279

11

In [51]:
train_purchases

customer_id,article_id,article_purchase_count
d734b8669438f34b1878dc078 1f540dd889deb02f939f6 ...,516859002,1
cae43013ac28d76c74c98fd77 caaea2f937da1c2f743c7 ...,721762002,1
214375b663ccc87230c66d96d 8963a132ae0cb5c07a40c ...,894135001,1
ff0709a27864e67ae3eaef133 6da05d0f93a742a9f0e01 ...,697054008,1
1f21c6fe5ac3385492780d488 54d3cb02e378cfef7f83e ...,456163053,1
a6e1e553ca1df628a9ef6d209 350077b531c82aa3ab185 ...,649397006,2
682604048737103df0e40f176 e7530a1c604e77c35bb79 ...,160442043,1
4a8be1df5f257d049cce39cb6 69aab869980b13e93482b ...,625773003,1
823a95fecfa24a9b3ab371398 f0b897c95727e245475b6 ...,448509018,1
5a7e208f8b34f09b7ba95c267 225075a4062f64ff43a4e ...,661610001,1


In [52]:
train_pd = train_purchases.to_dataframe()

In [53]:
train_pd.head(5)

Unnamed: 0,customer_id,article_id,article_purchase_count
0,d734b8669438f34b1878dc0781f540dd889deb02f939f6...,516859002,1
1,cae43013ac28d76c74c98fd77caaea2f937da1c2f743c7...,721762002,1
2,214375b663ccc87230c66d96d8963a132ae0cb5c07a40c...,894135001,1
3,ff0709a27864e67ae3eaef1336da05d0f93a742a9f0e01...,697054008,1
4,1f21c6fe5ac3385492780d48854d3cb02e378cfef7f83e...,456163053,1


In [54]:
train_pd.groupby('article_id')['article_purchase_count'].sum().sort_values(ascending=False)

article_id
706016001    3766
706016002    2333
610776002    1985
706016003    1647
610776001    1508
             ... 
830747001       1
688700001       1
688692008       1
688662002       1
714885001       1
Name: article_purchase_count, Length: 72482, dtype: int64

In [55]:
train_pd.groupby('article_id')['article_purchase_count'].mean().sort_values(ascending=False)

article_id
554141018    8.0
822161001    5.0
653358001    4.5
237222016    4.0
652671001    4.0
            ... 
685814037    1.0
685814050    1.0
685814051    1.0
685814052    1.0
956217002    1.0
Name: article_purchase_count, Length: 72482, dtype: float64
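The two group-bys above suggest how the popularity scores arise: the scores printed by the popularity model (8.0, 5.0, 4.5, ...) line up with the mean article_purchase_count per article, not with the total number of purchases. The toy sketch below shows how the two rankings can disagree. The rows are made up, and this is an interpretation of the outputs rather than a statement about turicreate's internals.

```python
from collections import defaultdict

# Toy (customer, article, purchase_count) rows; not taken from the real data.
rows = [(f"c{i}", "A", 1) for i in range(10)]      # bought once by ten customers
rows += [("c1", "B", 8)]                           # bought 8 times by one customer
rows += [("c2", "C", 4), ("c3", "C", 5)]

totals = defaultdict(float)
buyers = defaultdict(int)
for _, article, count in rows:
    totals[article] += count
    buyers[article] += 1

# Mean count per buyer versus total count, each ranked highest first.
mean_score = {a: totals[a] / buyers[a] for a in totals}
by_mean = sorted(mean_score, key=lambda a: (-mean_score[a], a))
by_total = sorted(totals, key=lambda a: (-totals[a], a))
```

Ranking by mean puts the article bought 8 times by a single customer first, while ranking by total puts the widely bought article first. With a target that is always 1, every article has the same mean, so the ranking is decided by ties alone.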

### Train_dummy table

In [56]:
name = 'popularity'
target = 'dummy'

popularity_dummy, _ = model(train_dummy, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-------+------+
|          customer_id          | article_id | score | rank |
+-------------------------------+------------+-------+------+
| 250039015607871019994d80d9... | 698296002  |  1.0  |  1   |
| 250039015607871019994d80d9... | 665082001  |  1.0  |  2   |
| 250039015607871019994d80d9... | 790644001  |  1.0  |  3   |
| 250039015607871019994d80d9... | 732842008  |  1.0  |  4   |
| 250039015607871019994d80d9... | 824079001  |  1.0  |  5   |
| 250039015607871019994d80d9... | 658298007  |  1.0  |  6   |
| 250039015607871019994d80d9... | 632832006  |  1.0  |  7   |
| 250039015607871019994d80d9... | 663564006  |  1.0  |  8   |
| eae27980951a8e176ee6721ba1... | 665082001  |  1.0  |  1   |
| eae27980951a8e176ee6721ba1... | 790644001  |  1.0  |  2   |
| eae27980951a8e176ee6721ba1... | 732842008  |  1.0  |  3   |
| eae27980951a8e176ee6721ba1... | 824079001  |  1.0  |  4   |
| eae27980951a8e176ee6721ba1... | 854043006  |  1.0  |  5   |
| eae279

### Train_normalized table

In [57]:
name = 'popularity'
target = 'scaled_purchase_freq'

popularity_normalized, _ = model(train_normalized, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-------+------+
|          customer_id          | article_id | score | rank |
+-------------------------------+------------+-------+------+
| 250039015607871019994d80d9... | 634786001  |  1.0  |  1   |
| 250039015607871019994d80d9... | 787039001  |  1.0  |  2   |
| 250039015607871019994d80d9... | 708487003  |  1.0  |  3   |
| 250039015607871019994d80d9... | 855809001  |  1.0  |  4   |
| 250039015607871019994d80d9... | 782616026  |  1.0  |  5   |
| 250039015607871019994d80d9... | 627174001  |  1.0  |  6   |
| 250039015607871019994d80d9... | 886603001  |  1.0  |  7   |
| 250039015607871019994d80d9... | 810407001  |  1.0  |  8   |
| eae27980951a8e176ee6721ba1... | 634786001  |  1.0  |  1   |
| eae27980951a8e176ee6721ba1... | 787039001  |  1.0  |  2   |
| eae27980951a8e176ee6721ba1... | 708487003  |  1.0  |  3   |
| eae27980951a8e176ee6721ba1... | 855809001  |  1.0  |  4   |
| eae27980951a8e176ee6721ba1... | 782616026  |  1.0  |  5   |
| eae279

## Cosine similarity

### Train_purchases table

In [58]:
name = 'cosine'
target = 'article_purchase_count'

cosine_purchase, _ = model(train_purchases, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-----------------------+------+
|          customer_id          | article_id |         score         | rank |
+-------------------------------+------------+-----------------------+------+
| 250039015607871019994d80d9... | 611030001  |  0.010002143091435056 |  1   |
| 250039015607871019994d80d9... | 506410009  |  0.010002143091435056 |  2   |
| 250039015607871019994d80d9... | 866578002  |  0.008726678306250263 |  3   |
| 250039015607871019994d80d9... | 844906001  |  0.008699242588427428 |  4   |
| 250039015607871019994d80d9... | 832815003  |  0.00792233720957804  |  5   |
| 250039015607871019994d80d9... | 801621001  | 0.0076560155093241085 |  6   |
| 250039015607871019994d80d9... | 861836006  | 0.0076560155093241085 |  7   |
| 250039015607871019994d80d9... | 817130001  | 0.0076560155093241085 |  8   |
| eae27980951a8e176ee6721ba1... | 716673001  |  0.017499503336454694 |  1   |
| eae27980951a8e176ee6721ba1... | 565379002  |  0.01622780373221

### Train_dummy table

In [59]:
name = 'cosine'
target = 'dummy'

cosine_dummy, _ = model(train_dummy, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)


+-------------------------------+------------+-----------------------+------+
|          customer_id          | article_id |         score         | rank |
+-------------------------------+------------+-----------------------+------+
| 250039015607871019994d80d9... | 706016001  |  0.008599273968433989 |  1   |
| 250039015607871019994d80d9... | 610776002  | 0.0072324738122414856 |  2   |
| 250039015607871019994d80d9... | 693651005  |  0.007073987653290016 |  3   |
| 250039015607871019994d80d9... | 820190001  |  0.007073987653290016 |  4   |
| 250039015607871019994d80d9... | 663277003  |  0.006845822800760684 |  5   |
| 250039015607871019994d80d9... | 706016002  |  0.006827744452849678 |  6   |
| 250039015607871019994d80d9... | 633219005  |  0.006676051495731741 |  7   |
| 250039015607871019994d80d9... | 697531002  |  0.006676051495731741 |  8   |
| eae27980951a8e176ee6721ba1... | 572797001  |  0.014213631195681435 |  1   |
| eae27980951a8e176ee6721ba1... | 565379002  |  0.01166856608220

### Train_normalized table

In [60]:
name = 'cosine'
target = 'scaled_purchase_freq'

cosine_normalized, _ = model(train_normalized, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-----------------------+------+
|          customer_id          | article_id |         score         | rank |
+-------------------------------+------------+-----------------------+------+
| 250039015607871019994d80d9... | 827500005  |  0.004201680672268907 |  1   |
| 250039015607871019994d80d9... | 638355009  |  0.004201680672268907 |  2   |
| 250039015607871019994d80d9... | 821911001  |  0.004201680672268907 |  3   |
| 250039015607871019994d80d9... | 806243004  | 0.0038557267990432867 |  4   |
| 250039015607871019994d80d9... | 654758001  |  0.003758097396177404 |  5   |
| 250039015607871019994d80d9... | 747204008  |  0.003176172240441587 |  6   |
| 250039015607871019994d80d9... | 847086001  | 0.0029710370953343495 |  7   |
| 250039015607871019994d80d9... | 688764001  | 0.0029710370953343495 |  8   |
| eae27980951a8e176ee6721ba1... | 783707151  |  0.007011891797531483 |  1   |
| eae27980951a8e176ee6721ba1... | 188183009  |  0.00701189179753

## Pearson model

### Train_purchases table

In [61]:
name = 'pearson'
target = 'article_purchase_count'

pearson_purchase, _ = model(train_purchases, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-------------------+------+
|          customer_id          | article_id |       score       | rank |
+-------------------------------+------------+-------------------+------+
| 250039015607871019994d80d9... | 554141018  |        8.0        |  1   |
| 250039015607871019994d80d9... | 822161001  |        5.0        |  2   |
| 250039015607871019994d80d9... | 653358001  |        4.5        |  3   |
| 250039015607871019994d80d9... | 618112001  |        4.0        |  4   |
| 250039015607871019994d80d9... | 644032001  |        4.0        |  5   |
| 250039015607871019994d80d9... | 652671001  |        4.0        |  6   |
| 250039015607871019994d80d9... | 237222016  |        4.0        |  7   |
| 250039015607871019994d80d9... | 703737003  |        4.0        |  8   |
| eae27980951a8e176ee6721ba1... | 554141018  |        8.0        |  1   |
| eae27980951a8e176ee6721ba1... | 822161001  |        5.0        |  2   |
| eae27980951a8e176ee6721ba1... | 6533

In [62]:
train_purchases

customer_id,article_id,article_purchase_count
d734b8669438f34b1878dc078 1f540dd889deb02f939f6 ...,516859002,1
cae43013ac28d76c74c98fd77 caaea2f937da1c2f743c7 ...,721762002,1
214375b663ccc87230c66d96d 8963a132ae0cb5c07a40c ...,894135001,1
ff0709a27864e67ae3eaef133 6da05d0f93a742a9f0e01 ...,697054008,1
1f21c6fe5ac3385492780d488 54d3cb02e378cfef7f83e ...,456163053,1
a6e1e553ca1df628a9ef6d209 350077b531c82aa3ab185 ...,649397006,2
682604048737103df0e40f176 e7530a1c604e77c35bb79 ...,160442043,1
4a8be1df5f257d049cce39cb6 69aab869980b13e93482b ...,625773003,1
823a95fecfa24a9b3ab371398 f0b897c95727e245475b6 ...,448509018,1
5a7e208f8b34f09b7ba95c267 225075a4062f64ff43a4e ...,661610001,1


### Train dummy table

In [63]:
name = 'pearson'
target = 'dummy'

pearson_dummy, _ = model(train_dummy, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-------+------+
|          customer_id          | article_id | score | rank |
+-------------------------------+------------+-------+------+
| 250039015607871019994d80d9... | 698296002  |  0.0  |  1   |
| 250039015607871019994d80d9... | 665082001  |  0.0  |  2   |
| 250039015607871019994d80d9... | 790644001  |  0.0  |  3   |
| 250039015607871019994d80d9... | 732842008  |  0.0  |  4   |
| 250039015607871019994d80d9... | 824079001  |  0.0  |  5   |
| 250039015607871019994d80d9... | 658298007  |  0.0  |  6   |
| 250039015607871019994d80d9... | 632832006  |  0.0  |  7   |
| 250039015607871019994d80d9... | 663564006  |  0.0  |  8   |
| eae27980951a8e176ee6721ba1... | 665082001  |  0.0  |  1   |
| eae27980951a8e176ee6721ba1... | 790644001  |  0.0  |  2   |
| eae27980951a8e176ee6721ba1... | 732842008  |  0.0  |  3   |
| eae27980951a8e176ee6721ba1... | 824079001  |  0.0  |  4   |
| eae27980951a8e176ee6721ba1... | 854043006  |  0.0  |  5   |
| eae279
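
Every score in the dummy run above is 0.0. One plausible reading is that Pearson similarity mean-centres each item's values before correlating them, so a target column that is always 1 has nothing left to correlate. The sketch below is the textbook Pearson correlation in plain Python, not Turi Create's internal implementation. The function name `pearson_similarity`, the toy vectors, and the choice to return 0.0 when a vector has no variance are all assumptions made for illustration.

```python
from math import sqrt


def pearson_similarity(ratings_i, ratings_j):
    """Textbook Pearson correlation between two equally long rating vectors.

    Returns 0.0 when either vector has zero variance (a convention assumed
    here; the correlation is mathematically undefined in that case).
    """
    n = len(ratings_i)
    mean_i = sum(ratings_i) / n
    mean_j = sum(ratings_j) / n
    # Mean-centre each vector so only deviations from the average remain
    centered_i = [r - mean_i for r in ratings_i]
    centered_j = [r - mean_j for r in ratings_j]
    numerator = sum(a * b for a, b in zip(centered_i, centered_j))
    denominator = sqrt(
        sum(a * a for a in centered_i) * sum(b * b for b in centered_j)
    )
    if denominator == 0.0:
        return 0.0
    return numerator / denominator


# Hypothetical toy data: a dummy target is 1 for every observed purchase
dummy_a = [1, 1, 1, 1]
dummy_b = [1, 1, 1, 1]
# Purchase counts vary, so the centred vectors are non-zero
counts_a = [1, 2, 3, 4]
counts_b = [2, 4, 6, 8]

sim_dummy = pearson_similarity(dummy_a, dummy_b)
sim_counts = pearson_similarity(counts_a, counts_b)
print(sim_dummy, sim_counts)
```

If this reading is right, the dummy target carries no signal for a Pearson model, which would fit the flat scores in the table.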

### Train normalized table

In [64]:
name = 'pearson'
target = 'scaled_purchase_freq'

pearson_normalized, _ = model(train_normalized, name, user_id, item_id, target, customers_to_rec, n_rec, n_display)

+-------------------------------+------------+-------+------+
|          customer_id          | article_id | score | rank |
+-------------------------------+------------+-------+------+
| 250039015607871019994d80d9... | 903728002  |  1.0  |  1   |
| 250039015607871019994d80d9... | 787039001  |  1.0  |  2   |
| 250039015607871019994d80d9... | 886603001  |  1.0  |  3   |
| 250039015607871019994d80d9... | 634786001  |  1.0  |  4   |
| 250039015607871019994d80d9... | 767473009  |  1.0  |  5   |
| 250039015607871019994d80d9... | 708487003  |  1.0  |  6   |
| 250039015607871019994d80d9... | 855809001  |  1.0  |  7   |
| 250039015607871019994d80d9... | 810407001  |  1.0  |  8   |
| eae27980951a8e176ee6721ba1... | 903728002  |  1.0  |  1   |
| eae27980951a8e176ee6721ba1... | 787039001  |  1.0  |  2   |
| eae27980951a8e176ee6721ba1... | 886603001  |  1.0  |  3   |
| eae27980951a8e176ee6721ba1... | 634786001  |  1.0  |  4   |
| eae27980951a8e176ee6721ba1... | 767473009  |  1.0  |  5   |
| eae279

## Model Evaluation

In [65]:
models_purchases = [popularity_purchase, cosine_purchase, pearson_purchase]
models_w_dummy = [popularity_dummy, cosine_dummy, pearson_dummy]
models_w_norm = [popularity_normalized, cosine_normalized, pearson_normalized]



In [66]:
names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

In [67]:
eval_counts = tc.recommender.util.compare_models(test_purchases, models_purchases, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_normalized, models_w_norm, model_names=names_w_norm)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    |          0.0           |          0.0           |
|   2    | 4.518344478583048e-05  | 1.7378247994550204e-06 |
|   3    | 6.0244593047773936e-05 | 3.5451625908882356e-06 |
|   4    | 4.5183444785830404e-05 | 3.5451625908882356e-06 |
|   5    | 3.6146755828664334e-05 | 3.5451625908882407e-06 |
|   6    | 3.0122296523886964e-05 | 3.5451625908882394e-06 |
|   7    | 2.5819111306188846e-05 | 3.5451625908882356e-06 |
|   8    | 2.2591722392915206e-05 | 3.5451625908882292e-06 |
|   9    | 2.008153101592466e-05  | 3.5451625908882343e-06 |
|   10   | 2.7110066871498268e-05 | 5.103212411089296e-06  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.3910901341478688

Per User RMSE (best)
+---------------


Precision and recall summary statistics by cutoff
+--------+----------------------+-----------------------+
| cutoff |    mean_precision    |      mean_recall      |
+--------+----------------------+-----------------------+
|   1    | 0.11377191397072113  | 0.0037185739785528375 |
|   2    | 0.08526116031086203  |  0.005606717438742665 |
|   3    | 0.06901018133622522  |  0.006855413492126627 |
|   4    | 0.05880625338875835  |  0.00782147460842744  |
|   5    | 0.05199710825953347  |  0.008633698178370211 |
|   6    | 0.04687029339116831  |  0.009360517041297435 |
|   7    | 0.043027548991763975 |  0.009999741282206947 |
|   8    | 0.04029233688776431  |  0.010710571656971356 |
|   9    | 0.03758258529630285  |  0.011240983471420406 |
|   10   | 0.03534249051147663  |  0.011749843274191889 |
+--------+----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.1697217631391916

Per User RMSE (best)
+-------------------------------+--------------------+----


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    |          0.0           |          0.0           |
|   2    | 4.518344478583048e-05  | 1.7378247994550174e-06 |
|   3    | 6.0244593047773956e-05 | 3.5451625908882318e-06 |
|   4    | 4.518344478583048e-05  | 3.545162590888235e-06  |
|   5    | 3.614675582866423e-05  | 3.5451625908882313e-06 |
|   6    | 3.0122296523886964e-05 | 3.5451625908882394e-06 |
|   7    | 2.581911130618883e-05  | 3.545162590888241e-06  |
|   8    | 2.259172239291524e-05  | 3.5451625908882343e-06 |
|   9    | 2.0081531015924662e-05 | 3.5451625908882343e-06 |
|   10   |  2.71100668714983e-05  | 5.103212411089293e-06  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.4041120042920001

Per User RMSE (best)
+---------------


Precision and recall summary statistics by cutoff
+--------+-----------------------+------------------------+
| cutoff |     mean_precision    |      mean_recall       |
+--------+-----------------------+------------------------+
|   1    | 0.0029821073558648106 | 5.791959072644346e-05  |
|   2    | 0.0018073377914332167 | 7.277407918122313e-05  |
|   3    |  0.004759322850774141 | 0.00033426657583933813 |
|   4    |  0.004563527923368877 | 0.00042744009882202615 |
|   5    |  0.003723115850352434 | 0.00043390991729698854 |
|   6    |  0.003509247545032831 | 0.0004969192335062776  |
|   7    | 0.0032661175802328885 | 0.0005334910796344751  |
|   8    | 0.0028804446050966935 | 0.0005358735665216986  |
|   9    | 0.0029319035283249893 | 0.0006129377411978416  |
|   10   | 0.0038767395626242754 | 0.0009003977189694016  |
+--------+-----------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.0

Per User RMSE (best)
+-------------------------------+------+-----


Precision and recall summary statistics by cutoff
+--------+---------------------+-----------------------+
| cutoff |    mean_precision   |      mean_recall      |
+--------+---------------------+-----------------------+
|   1    | 0.11738658955358756 | 0.0037714049358669875 |
|   2    | 0.09786734140610882 |  0.006195835077316818 |
|   3    |  0.0866618470992228 |  0.00830031147639314  |
|   4    | 0.07809958431230786 |  0.010014832932082477 |
|   5    | 0.07191397072112785 |  0.01157767153186948  |
|   6    | 0.06658533646605179 |  0.012872574183222808 |
|   7    | 0.06209496269138438 |  0.014011995990967764 |
|   8    | 0.05862551960961504 |  0.015136627413709802 |
|   9    | 0.05576641163122245 |  0.016192631335151943 |
|   10   | 0.05320802457979405 |  0.01719839233114133  |
+--------+---------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9993247882398781

Per User RMSE (best)
+-------------------------------+--------------------+-------+
|        


Precision and recall summary statistics by cutoff
+--------+-----------------------+------------------------+
| cutoff |     mean_precision    |      mean_recall       |
+--------+-----------------------+------------------------+
|   1    | 0.0029821073558648084 | 5.791959072644343e-05  |
|   2    | 0.0018073377914332174 | 7.277407918122309e-05  |
|   3    |  0.004759322850774141 | 0.0003342665758393384  |
|   4    |  0.004563527923368881 |  0.000427440098822026  |
|   5    |  0.003723115850352434 | 0.00043390991729698865 |
|   6    |  0.003509247545032827 |  0.000496919233506277  |
|   7    | 0.0032661175802328924 | 0.0005334910796344764  |
|   8    |  0.002880444605096693 | 0.0005358735665216982  |
|   9    |  0.002931903528324987 | 0.0006129377411978409  |
|   10   | 0.0038767395626242837 |  0.000900397718969401  |
+--------+-----------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0

Per User RMSE (best)
+-------------------------------+------+-----


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 9.036688957166096e-05  | 9.316174182645462e-07  |
|   2    | 0.00013555033435749144 | 4.6148522232434556e-06 |
|   3    | 0.00012048918609554782 | 1.106963004979066e-05  |
|   4    | 0.00015814205675040657 | 1.6671320280548597e-05 |
|   5    | 0.0001807337791433218  | 2.4837809560357932e-05 |
|   6    | 0.00015061148261943455 | 2.4837809560357966e-05 |
|   7    | 0.00014200511218403832 | 2.5809496544999514e-05 |
|   8    | 0.00012425447316103362 | 2.5809496544999447e-05 |
|   9    | 0.00011044842058758571 | 2.580949654499947e-05  |
|   10   | 9.940357852882707e-05  | 2.580949654499948e-05  |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.2165021251170643

Per User RMSE (best)
+---------------


Precision and recall summary statistics by cutoff
+--------+-----------------------+------------------------+
| cutoff |     mean_precision    |      mean_recall       |
+--------+-----------------------+------------------------+
|   1    |  0.007590818724019523 | 0.00027482298589924227 |
|   2    |  0.005828664377372133 | 0.00041585917899409773 |
|   3    |  0.004789445147298033 | 0.0004985318205117558  |
|   4    |  0.004292427254653901 |  0.000586956480826317  |
|   5    |  0.003976143141153085 | 0.0006628763065622877  |
|   6    | 0.0037652870654858697 |  0.000744218937244094  |
|   7    |  0.003550127804600969 | 0.0008144171001978192  |
|   8    |  0.003445237664919574 | 0.0008981887529255231  |
|   9    |  0.003373697210675339 | 0.0009863785635454052  |
|   10   | 0.0032893547804084735 | 0.0010664103866589272  |
+--------+-----------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.21829171695495195

Per User RMSE (best)
+----------------------------


Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff |     mean_precision     |      mean_recall       |
+--------+------------------------+------------------------+
|   1    | 9.036688957166096e-05  | 9.316174182645444e-07  |
|   2    | 4.518344478583048e-05  | 9.316174182645456e-07  |
|   3    | 6.024459304777405e-05  |  3.03317298969852e-06  |
|   4    | 9.036688957166098e-05  | 6.311187611415642e-06  |
|   5    | 9.036688957166088e-05  | 1.082953208999867e-05  |
|   6    | 7.530574130971732e-05  | 1.0829532089998678e-05 |
|   7    | 7.745733391856653e-05  | 1.3033602567356265e-05 |
|   8    | 9.036688957166092e-05  | 1.6858656094199037e-05 |
|   9    | 0.00011044842058758571 | 2.4693495846624237e-05 |
|   10   | 0.00010844026748599308 | 2.6616195624744716e-05 |
+--------+------------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 0.2165072247726768

Per User RMSE (best)
+---------------
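
The tables above report mean precision and mean recall at each cutoff. As a reminder of what those columns measure, here is a minimal sketch of the standard per-user definitions. It is not the code `compare_models` runs internally, and the function name `precision_recall_at_k` and the toy article ids are assumptions made for illustration.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall for one user at cutoff k.

    recommended: list of item ids ordered by rank (best first)
    relevant:    set of item ids the user actually bought in the test set
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Precision: share of the k recommendations that were bought
    precision = hits / k if k > 0 else 0.0
    # Recall: share of the bought items that were recommended
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four ranked recommendations, three test purchases
recommended_items = ["a", "b", "c", "d"]
bought_items = {"b", "x", "y"}

for cutoff in (1, 2, 4):
    p, r = precision_recall_at_k(recommended_items, bought_items, cutoff)
    print(cutoff, p, r)
```

Averaging these per-user values over all test users gives the mean columns. Once a user's hits stop growing, precision falls as the cutoff increases while recall can only stay flat or rise, which matches the general pattern in the tables.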

In [68]:
test_dummy

customer_id,article_id,article_purchase_count,dummy
7c35b6c2d23b6d684492642a6 fb15ef1e635f8c08292f6 ...,860367001,1,1
a94c6acdbaa9bd579fdc8a976 b45fb3e232228eca544e8 ...,572797002,1,1
b0cd9e07f491ad112dfa2f72a 2616557b22fddf38e394b ...,636938001,1,1
4e5430df1a0621f389b09cef9 b13e6540564a7170bde4c ...,680391001,1,1
13d99b8414400684671c065d1 7760dcc87e60bc698bc64 ...,719530003,1,1
b1fcf0344cb9f8bbf6f88f8b0 17f4464092895c3a73254 ...,771829002,1,1
eb58fdb29b4b31371ff469f76 4494bcaa4acf73c87ea6e ...,622958016,1,1
9db98df68653f8ac6509bc8d3 fcd7fa4d26c9271698c0d ...,792817003,1,1
f5f08b8a672add528657f0934 52d51b7ca461e1607fcfd ...,713754001,1,1
72f1e8e3cac079befde45bf3c 3570951370d2a545ec43f ...,811398005,1,1


In [69]:
test_purchases.shape

(501588, 3)

### Predictions using the best model: cosine similarity on the purchase dummy table

In [70]:
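# Fit an item-similarity recommender with cosine similarity on the purchase dummy table
prediction_model1 = tc.item_similarity_recommender.create(
    tc.SFrame(dummy_df),
    user_id=user_id,
    item_id=item_id,
    target="dummy",
    similarity_type="cosine",
)

# Top 8 recommendations for each customer in customers_to_rec
recommender1 = prediction_model1.recommend(users=customers_to_rec, k=8)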

In [71]:
recommender_df1 = recommender1.to_dataframe()

In [72]:
recommender_df1.head(20)

Unnamed: 0,customer_id,article_id,score,rank
0,250039015607871019994d80d94192ef305063f270fb94...,706016001,0.018025,1
1,250039015607871019994d80d94192ef305063f270fb94...,706016002,0.01446,2
2,250039015607871019994d80d94192ef305063f270fb94...,610776002,0.014011,3
3,250039015607871019994d80d94192ef305063f270fb94...,610776001,0.013103,4
4,250039015607871019994d80d94192ef305063f270fb94...,869232003,0.010265,5
5,250039015607871019994d80d94192ef305063f270fb94...,866578002,0.010265,6
6,250039015607871019994d80d94192ef305063f270fb94...,720125001,0.009979,7
7,250039015607871019994d80d94192ef305063f270fb94...,844906001,0.009897,8
8,eae27980951a8e176ee6721ba135201a4094c291c1219f...,610776002,0.025319,1
9,eae27980951a8e176ee6721ba135201a4094c291c1219f...,706016001,0.024215,2
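
The recommendations come back in long format, one row per customer and rank. If one row per customer is more convenient downstream, a sketch like the following could collapse them with pandas. The helper name `recommendations_per_customer` and the toy frame are hypothetical. Only the column names `customer_id`, `article_id` and `rank` are taken from the output above.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as recommender_df1
toy_recs = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2", "c2", "c2"],
    "article_id": [706016001, 706016002, 610776002,
                   610776002, 706016001, 610776001],
    "rank": [2, 1, 3, 1, 3, 2],
})


def recommendations_per_customer(recs):
    """Return a Series mapping each customer_id to its article ids in rank order."""
    ordered = recs.sort_values(["customer_id", "rank"])
    return ordered.groupby("customer_id")["article_id"].apply(list)


per_customer = recommendations_per_customer(toy_recs)
print(per_customer)
```

Assuming `recommender_df1` keeps the columns shown above, the same helper could be applied to it directly.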
