# How to Build a Recommendation System for Purchase Data (Step-by-Step)
* Description: Documentation on building collaborative filtering models for recommending products to customers
* Link: https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6
* Author: Moorissa Tjokro

## Problem statement
In this data challenge, we are building collaborative filtering models for recommending product items. The steps below aim to recommend to each user the top 10 items to place in their basket. The final output will be a csv file in the `output` folder, and a function that looks up the recommendation list for a specified user:
* Input: user - customer ID
* Returns: ranked list of items (product IDs) that the user is most likely to want to put in their (empty) "basket"
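
As a rough illustration of that input/output contract, the lookup could look like the sketch below. The function name `recommend_for_user` and the in-memory table are hypothetical; the three rows for customer 1553 are copied from the cosine-similarity output in section 6.1 and are only there to have something to look up.

```python
import pandas as pd

# Toy recommendation table using a subset of the columns that turicreate
# prints later on (customerId, productId, rank); for illustration only.
recs = pd.DataFrame({
    "customerId": [1553, 1553, 1553],
    "productId": [41, 35, 148],
    "rank": [2, 1, 3],
})

def recommend_for_user(recs, user):
    """Return the product IDs recommended to `user`, best rank first (hypothetical helper)."""
    user_rows = recs[recs["customerId"] == user].sort_values("rank")
    return user_rows["productId"].tolist()

print(recommend_for_user(recs, 1553))
```

A customer ID that is absent from the table simply yields an empty list in this sketch.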

## 1. Import modules
* `pandas` and `numpy` for data manipulation
* `turicreate` for performing model selection and evaluation
* `sklearn` for splitting the data into train and test set

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys
sys.path.append("..")
import scripts.data_layer as data_layer

## 2. Load data
Two datasets are used in this exercise, both of which can be found in the `data` folder: 
* `recommend_1.csv` consisting of a list of 1000 customer IDs to recommend as output
* `trx_data.csv` consisting of user transactions

The format is as follows.

In [2]:
customers = pd.read_csv('../data/recommend_1.csv')
transactions = pd.read_csv('../data/trx_data.csv')

In [3]:
print(customers.shape)
print(customers.head())

(1000, 1)
   customerId
0        1553
1       20400
2       19750
3        6334
4       27773


In [4]:
print(transactions.shape)
transactions.head()

(62483, 2)


Unnamed: 0,customerId,products
0,0,20
1,1,2|2|23|68|68|111|29|86|107|152
2,2,111|107|29|11|11|11|33|23
3,3,164|227
4,5,2|2


## 3. Data preparation
* Our goal here is to break down each list of items in the `products` column into rows and count the number of products bought by a user

In [5]:
# example 1: split product items
transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])
transactions.head(5).set_index('customerId')['products'].apply(pd.Series).reset_index()

Unnamed: 0,customerId,0,1,2,3,4,5,6,7,8,9
0,0,20.0,,,,,,,,,
1,1,2.0,2.0,23.0,68.0,68.0,111.0,29.0,86.0,107.0,152.0
2,2,111.0,107.0,29.0,11.0,11.0,11.0,33.0,23.0,,
3,3,164.0,227.0,,,,,,,,
4,5,2.0,2.0,,,,,,,,


In [6]:
# example 2: organize a given table into a dataframe with customerId, single productId, and purchase count
pd.melt(transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})

Unnamed: 0,customerId,productId,purchase_count
0,0,20.0,1
1,1,2.0,2
2,1,23.0,1
3,1,29.0,1
4,1,68.0,2
5,1,86.0,1
6,1,107.0,1
7,1,111.0,1
8,1,152.0,1
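
The same reshaping can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later), which avoids building the wide intermediate table. This is an alternative sketch under that assumption, not the approach used in this notebook; the toy frame reuses the raw strings of the first two customers shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "customerId": [0, 1],
    "products": ["20", "2|2|23|68|68|111|29|86|107|152"],
})

long_form = (
    toy.assign(productId=toy["products"].str.split("|"))  # string -> list of item strings
       .explode("productId")                              # one row per purchased item
       .astype({"productId": "int64"})
       .groupby(["customerId", "productId"])
       .size()                                            # repeated items become counts
       .reset_index(name="purchase_count")
)
print(long_form)
```

The result has the same three columns as the table above (`customerId`, `productId`, `purchase_count`), with `productId` kept as an integer throughout.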


### 3.1. Create data with user, item, and target field
* This table will be an input for our modeling later
    * In this case, our user field is `customerId`, our item field is `productId`, and our target field is `purchase_count`

In [7]:
s=time.time()

data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)

print("Execution time:", round((time.time()-s)/60,2), "minutes")

Execution time: 0.34 minutes


In [8]:
print(data.shape)
data.head()

(133585, 3)


Unnamed: 0,customerId,productId,purchase_count
0,0,1,2
1,0,13,1
2,0,19,3
3,0,20,1
4,0,31,2


### 3.2. Create dummy
* A dummy marks whether a customer bought an item or not.
* If a customer buys an item, `purchase_dummy` is set to 1.
* Why create a dummy instead of normalizing it, you ask?
    * Normalizing the purchase count, say by each user, would not work because customers have different buying frequencies and do not share the same taste
    * However, we can normalize items by purchase frequency across all users, which is done in section 3.3 below.

In [9]:
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

In [10]:
data_dummy = create_data_dummy(data)
data_dummy.head()

Unnamed: 0,customerId,productId,purchase_count,purchase_dummy
0,0,1,2,1
1,0,13,1,1
2,0,19,3,1
3,0,20,1,1
4,0,31,2,1


### 3.3. Normalize item values across users
* To do this, we normalize purchase frequency of each item across users by first creating a user-item matrix as follows

In [11]:
df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
df_matrix.head()

customerId,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,,2.0,,,,,,,,,...,,,,,,,,,,
1,,,6.0,,,,,,,,...,,,,1.0,,,1.0,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [12]:
(df_matrix.shape)

(24429, 300)

In [13]:
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
print(df_matrix_norm.shape)
df_matrix_norm.head()

(24429, 300)


customerId,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,,0.1,,,,,,,,,...,,,,,,,,,,
1,,,0.166667,,,,,,,,...,,,,0.0,,,0.0,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [14]:
# create a table for input to the modeling

d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
print(data_norm.shape)
data_norm.head()

(133585, 3)


Unnamed: 0,customerId,productId,scaled_purchase_freq
9,9,0,0.133333
25,25,0,0.133333
32,33,0,0.133333
35,36,0,0.133333
43,44,0,0.133333


#### Define a function for normalizing data

In [15]:
def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()

* This rescales each item's purchase counts to the range 0-1, with 1 being the highest purchase count observed for that item and 0 being the lowest observed purchase count for that item.
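
A small worked example may make the scaling concrete. The toy purchase table below is invented for illustration; the pivot and min-max formula are the same ones used in `normalize_data`. One caveat worth noting: an item whose purchasers all bought it the same number of times has a maximum equal to its minimum, so the division yields NaN and `dropna()` would discard those rows.

```python
import pandas as pd

# Invented purchase counts: item 10 varies across customers, item 20 does not.
toy = pd.DataFrame({
    "customerId": [0, 1, 2, 0, 1],
    "productId": [10, 10, 10, 20, 20],
    "purchase_count": [1, 3, 5, 2, 2],
})

matrix = toy.pivot_table(values="purchase_count", index="customerId", columns="productId")
scaled = (matrix - matrix.min()) / (matrix.max() - matrix.min())  # per-item min-max scaling

print(scaled)
# Item 10 is scaled to 0.0, 0.5 and 1.0 for customers 0, 1 and 2.
# Item 20 becomes NaN everywhere because its max and min are both 2.
```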

## 4. Split train and test set
* Splitting the data into training and testing sets is an important part of evaluating a predictive model, in this case a collaborative filtering model. Typically, we use a larger portion of the data for training and a smaller portion for testing. 
* We use an 80:20 ratio for our train-test set sizes.
* The training portion will be used to develop a predictive model, while the test portion is used to evaluate the model's performance.
* Now that we have three datasets with purchase counts, purchase dummy, and scaled purchase counts, we would like to split each.

In [16]:
train, test = train_test_split(data, test_size = .2)
print(train.shape, test.shape)

(106868, 3) (26717, 3)
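
Note that the split above is random, so the exact rows (and the metrics later on) change from run to run; `train_test_split` accepts a `random_state` argument to make it repeatable. The sketch below illustrates the same idea of a seeded 80:20 holdout using only pandas on an invented toy table. It is not the code used in this notebook.

```python
import pandas as pd

# Invented toy interactions, one row per (customer, product) pair.
toy = pd.DataFrame({
    "customerId": range(10),
    "productId": range(100, 110),
    "purchase_count": [1] * 10,
})

test = toy.sample(frac=0.2, random_state=42)  # seeded 20% holdout
train = toy.drop(test.index)                  # the remaining 80%

print(train.shape, test.shape)
```

Re-running the cell with the same seed returns the same rows, which makes later model comparisons easier to reproduce.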


In [17]:
# Using turicreate library, we convert dataframe to SFrame - this will be useful in the modeling part

train_data = tc.SFrame(train)
test_data = tc.SFrame(test)
train_data

customerId,productId,purchase_count
13876,99,2
4875,19,2
4698,174,1
14311,138,1
499,91,3
1515,24,1
16209,25,1
3031,19,3
1199,153,1
8246,39,1


In [19]:
test_data

customerId,productId,purchase_count
9170,2,2
20910,274,2
10027,57,1
6194,29,2
14653,1,3
10954,162,1
9637,67,2
7393,108,1
16294,36,1
6441,3,3


#### Define a `split_data` function for splitting data to training and test set

In [20]:
# We can define a function for this step as follows

def split_data(data):
    '''
    Splits dataset into training and test set.
    
    Args:
        data (pandas.DataFrame)
        
    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size = .2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [21]:
# let's try with both the dummy table and the scaled/normalized purchase table

train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

## 5. Baseline Model
Before running a more complicated approach such as collaborative filtering, we would like to build a baseline model against which the other models can be compared and evaluated. Since a baseline typically uses a very simple approach, more complex techniques should be chosen only if they show better accuracy that justifies their added complexity.

### 5.1. Using a Popularity model as a baseline
* The popularity model takes the most popular items for recommendation. These items are the products with the highest number of sales across customers.
* We use `turicreate` library for running and evaluating both baseline and collaborative filtering models below
* Training data is used for model selection

#### Using purchase counts

In [22]:
# variables to define field names
user_id = 'customerId'
item_id = 'productId'
target = 'purchase_count'
users_to_recommend = list(transactions[user_id])
n_rec = 10 # number of items to recommend
n_display = 30

In [23]:
popularity_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)

In [24]:
# Get recommendations for the users in users_to_recommend (here, the customer IDs from the transactions file)
# Printed below are the top 30 rows, i.e. the first 3 customers with 10 recommendations each

popularity_recomm = popularity_model.recommend(users=users_to_recommend, k=n_rec)
popularity_recomm.print_rows(n_display)

+------------+-----------+--------------------+------+
| customerId | productId |       score        | rank |
+------------+-----------+--------------------+------+
|     0      |    248    | 3.2127659574468086 |  1   |
|     0      |    132    | 3.096774193548387  |  2   |
|     0      |     37    | 3.007434944237918  |  3   |
|     0      |     34    | 2.9878048780487805 |  4   |
|     0      |     0     | 2.9388145315487573 |  5   |
|     0      |     27    |       2.896        |  6   |
|     0      |     3     | 2.814583333333333  |  7   |
|     0      |    110    | 2.8106508875739644 |  8   |
|     0      |     32    |       2.665        |  9   |
|     0      |    245    | 2.630952380952381  |  10  |
|     1      |    248    | 3.2127659574468086 |  1   |
|     1      |    132    | 3.096774193548387  |  2   |
|     1      |     37    | 3.007434944237918  |  3   |
|     1      |     34    | 2.9878048780487805 |  4   |
|     1      |     0     | 2.9388145315487573 |  5   |
|     1   

#### Define a `model` function for model selection

In [25]:
# Since turicreate is a very accessible library, we can define a model selection function as below

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='cosine')
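    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='pearson')
    else:
        # fail early on an unrecognized name instead of hitting an unbound variable below
        raise ValueError("name must be 'popularity', 'cosine', or 'pearson'")
        
    recom = model.recommend(users=users_to_recommend, k=n_rec)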
    recom.print_rows(n_display)
    return model

In [26]:
# variables to define field names
# constant variables include:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10 # number of items to recommend
n_display = 30 # to print the head / first few rows in a defined dataset

#### Using purchase dummy

In [27]:
# these variables will change accordingly
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-------+------+
| customerId | productId | score | rank |
+------------+-----------+-------+------+
|    1553    |    297    |  1.0  |  1   |
|    1553    |     2     |  1.0  |  2   |
|    1553    |     43    |  1.0  |  3   |
|    1553    |    265    |  1.0  |  4   |
|    1553    |     1     |  1.0  |  5   |
|    1553    |    215    |  1.0  |  6   |
|    1553    |     16    |  1.0  |  7   |
|    1553    |     39    |  1.0  |  8   |
|    1553    |     19    |  1.0  |  9   |
|    1553    |     12    |  1.0  |  10  |
|   20400    |    297    |  1.0  |  1   |
|   20400    |     2     |  1.0  |  2   |
|   20400    |     43    |  1.0  |  3   |
|   20400    |    265    |  1.0  |  4   |
|   20400    |     1     |  1.0  |  5   |
|   20400    |    215    |  1.0  |  6   |
|   20400    |     16    |  1.0  |  7   |
|   20400    |     39    |  1.0  |  8   |
|   20400    |     19    |  1.0  |  9   |
|   20400    |     12    |  1.0  |  10  |
|   19750    |    297    |  1.0  |

#### Using normalized purchase count

In [28]:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+---------------------+------+
| customerId | productId |        score        | rank |
+------------+-----------+---------------------+------+
|    1553    |    226    |  0.7758620689655172 |  1   |
|    1553    |    247    | 0.33604336043360433 |  2   |
|    1553    |    230    | 0.33225806451612866 |  3   |
|    1553    |    248    |  0.2621951219512195 |  4   |
|    1553    |    125    |  0.2562962962962959 |  5   |
|    1553    |    294    | 0.25038167938931266 |  6   |
|    1553    |    204    |  0.2372093023255812 |  7   |
|    1553    |    276    | 0.23484848484848486 |  8   |
|    1553    |     72    | 0.23060796645702306 |  9   |
|    1553    |     74    | 0.23059360730593606 |  10  |
|   20400    |    226    |  0.7758620689655172 |  1   |
|   20400    |    247    | 0.33604336043360433 |  2   |
|   20400    |    230    | 0.33225806451612866 |  3   |
|   20400    |    248    |  0.2621951219512195 |  4   |
|   20400    |    125    |  0.2562962962962959 |

#### Notes
* Once we created the model, we predicted the recommended items using popularity scores. In each model's results above, the rows show the first 30 records out of 1000 users with 10 recommendations each. These 30 records cover 3 users and their recommended items, along with each item's score and rank (rank 1 being the highest score). 
* In the results, although different models produce different recommendation lists, within a given model every user is recommended the same list of 10 items. This is because popularity is calculated by taking the most popular items across all users.
* In the grouping example below, products 248, 132, 37, and 34 are the most popular (best-selling) across customers. Taking their purchase counts divided by the number of customers who bought them, we see that these products are bought about 3 times on average in the training set of transactions (the same values as the first popularity scores on the `purchase_count` variable)

In [29]:
train.groupby(by=item_id)['purchase_count'].mean().sort_values(ascending=False).head(20)

productId
248    3.212766
132    3.096774
37     3.007435
34     2.987805
0      2.938815
27     2.896000
3      2.814583
110    2.810651
32     2.665000
245    2.630952
230    2.593985
10     2.570149
226    2.552448
58     2.551724
82     2.483193
129    2.402516
91     2.398438
87     2.396552
18     2.390428
41     2.346875
Name: purchase_count, dtype: float64

## 6. Collaborative Filtering Model

* In collaborative filtering, we would recommend items based on how similar users purchase items. For instance, if customer 1 and customer 2 bought similar items, e.g. 1 bought X, Y, Z and 2 bought X, Y, we would recommend an item Z to customer 2.

* To define similarity across users, we use the following steps:
    1. Create a user-item matrix, where index values represent unique customer IDs and column values represent unique product IDs
    
    2. Create an item-to-item similarity matrix. The idea is to calculate how similar a product is to another product. There are a number of ways of calculating this. In steps 6.1 and 6.2, we use the cosine and Pearson similarity measures, respectively.  
    
        * To calculate similarity between products X and Y, look at all customers who have rated both these items. For example, both X and Y have been rated by customers 1 and 2. 
        * We then create two item vectors, v1 for item X and v2 for item Y, in the user space of (1, 2) and then find the `cosine` or `pearson` angle/distance between these vectors. A zero angle, i.e. overlapping vectors with a cosine value of 1, means total similarity (per user, the two items get the same rating up to a common scale factor), and an angle of 90 degrees means a cosine of 0, or no similarity.
        
    3. For each customer, we then predict the likelihood of buying a product (or the purchase count) for products that the customer has not bought yet. 
    
        * For our example, we will calculate the rating for user 2 in the case of item Z (target item). To calculate this, we weight the just-calculated similarity measure between the target item and the other items that the customer has already bought. The weighting factor is the purchase count given by the user to the items already bought. 
        * We then scale this weighted sum by the sum of the similarity measures so that the calculated rating remains within predefined limits. Thus, the predicted rating for item Z for user 2 is calculated from the similarity measures.

* While I wrote Python scripts for the whole process, including finding similarity (they can be found in the `scripts` folder), we can use the `turicreate` library for now to try different measures such as `cosine` and `pearson` distance, and evaluate the best model.
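
The three steps above can be sketched with numpy on a toy user-item matrix. The purchase counts are invented, and for brevity the sketch treats "not bought" as 0 instead of restricting each similarity to customers who bought both items, so it is a simplified illustration and not necessarily how `turicreate` computes its scores.

```python
import numpy as np

# Step 1: toy user-item matrix of purchase counts
# (rows: customers 1-3, columns: items X, Y, Z; 0 means "not bought").
R = np.array([
    [2.0, 1.0, 3.0],   # customer 1 bought X, Y, Z
    [1.0, 2.0, 0.0],   # customer 2 bought X, Y
    [0.0, 1.0, 1.0],   # customer 3 bought Y, Z
])

# Step 2: item-to-item cosine similarity between the columns of R.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Step 3: predict customer 2's purchase count for item Z.
user, target = 1, 2                      # row/column indices of customer 2 and item Z
bought = np.flatnonzero(R[user] > 0)     # indices of the items already bought (X and Y)
weights = sim[target, bought]            # similarity of Z to each bought item
prediction = weights @ R[user, bought] / weights.sum()

print(sim.round(3))
print(round(float(prediction), 3))
```

Because the prediction is a similarity-weighted average of the counts customer 2 already has (1 and 2), it necessarily falls between those two values, which is the "predefined limits" property mentioned in step 3.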

### 6.1. `Cosine` similarity
* Similarity is the cosine of the angle between the item vectors of A and B
* It is defined by the following formula
![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnRHSAx1c084UXF2wIHYwaHJLmq2qKtNk_YIv3RjHUO00xwlkt)
* The closer the vectors, the smaller the angle and the larger the cosine

#### Using purchase count

In [30]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     35    | 0.07134616374969482  |  1   |
|    1553    |     41    | 0.06232533852259318  |  2   |
|    1553    |    148    | 0.055234273274739586 |  3   |
|    1553    |     2     | 0.05025319258371989  |  4   |
|    1553    |     5     |  0.0500976045926412  |  5   |
|    1553    |     1     | 0.050045788288116455 |  6   |
|    1553    |    269    | 0.04929194847742716  |  7   |
|    1553    |     33    | 0.044747730096181236 |  8   |
|    1553    |     8     | 0.044434587160746254 |  9   |
|    1553    |     17    | 0.04313389460245768  |  10  |
|   20400    |    284    | 0.04531627893447876  |  1   |
|   20400    |    182    | 0.04524850845336914  |  2   |
|   20400    |    160    | 0.04293942451477051  |  3   |
|   20400    |     1     | 0.04214388132095337  |  4   |
|   20400    |    122    | 0.04

#### Using purchase dummy

In [31]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     2     | 0.10158456563949585  |  1   |
|    1553    |     35    | 0.08096349239349365  |  2   |
|    1553    |     1     | 0.07903692722320557  |  3   |
|    1553    |     5     | 0.07059823274612427  |  4   |
|    1553    |     21    | 0.06008375883102417  |  5   |
|    1553    |     8     | 0.059696948528289794 |  6   |
|    1553    |     33    | 0.054522716999053956 |  7   |
|    1553    |     17    | 0.05138728618621826  |  8   |
|    1553    |     13    | 0.04866786003112793  |  9   |
|    1553    |     47    | 0.04690022468566894  |  10  |
|   20400    |    297    |         0.0          |  1   |
|   20400    |     2     |         0.0          |  2   |
|   20400    |     43    |         0.0          |  3   |
|   20400    |    265    |         0.0          |  4   |
|   20400    |     1     |     

#### Using normalized purchase count

In [32]:
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-----------------------+------+
| customerId | productId |         score         | rank |
+------------+-----------+-----------------------+------+
|    1553    |    269    |          0.0          |  1   |
|    1553    |    113    |          0.0          |  2   |
|    1553    |     62    |          0.0          |  3   |
|    1553    |     26    |          0.0          |  4   |
|    1553    |     11    |          0.0          |  5   |
|    1553    |    176    |          0.0          |  6   |
|    1553    |     46    |          0.0          |  7   |
|    1553    |     10    |          0.0          |  8   |
|    1553    |     2     |          0.0          |  9   |
|    1553    |     59    |          0.0          |  10  |
|   20400    |    269    |          0.0          |  1   |
|   20400    |    113    |          0.0          |  2   |
|   20400    |     62    |          0.0          |  3   |
|   20400    |     26    |          0.0          |  4   |
|   20400    |

### 6.2. `Pearson` similarity
* Similarity is the Pearson correlation coefficient between the two vectors.
* It is defined by the following formula
![](http://critical-numbers.group.shef.ac.uk/glossary/images/correlationKT1.png)
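
The Pearson coefficient can be viewed as the cosine similarity of the two vectors after subtracting each vector's mean. Below is a minimal numpy sketch with made-up purchase counts, checked against `np.corrcoef`; the helper name `pearson_similarity` is hypothetical, and `turicreate` may differ in details such as which customers enter the calculation.

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation: cosine of the angle between the mean-centered vectors."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = [1, 2, 3, 5]   # made-up purchase counts of item X by four customers
y = [2, 2, 4, 7]   # made-up purchase counts of item Y by the same customers

print(pearson_similarity(x, y), np.corrcoef(x, y)[0, 1])
```

Centering is what distinguishes the two measures: adding a constant to every count changes the cosine similarity but leaves the Pearson coefficient unchanged.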

#### Using purchase count

In [33]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+--------------------+------+
| customerId | productId |       score        | rank |
+------------+-----------+--------------------+------+
|    1553    |    248    | 3.2108798081993215 |  1   |
|    1553    |    132    | 3.096774193548386  |  2   |
|    1553    |     37    | 3.005604406124004  |  3   |
|    1553    |     34    | 2.987804878048781  |  4   |
|    1553    |     0     | 2.9365094808577887 |  5   |
|    1553    |     27    | 2.8960000000000004 |  6   |
|    1553    |     3     | 2.8145833333333363 |  7   |
|    1553    |    110    | 2.8010503078353484 |  8   |
|    1553    |     32    | 2.665000000000001  |  9   |
|    1553    |    245    | 2.630952380952382  |  10  |
|   20400    |    248    | 3.212765957446809  |  1   |
|   20400    |    132    | 3.096774193548386  |  2   |
|   20400    |     37    | 3.0074349442379162 |  3   |
|   20400    |     34    | 2.987804878048781  |  4   |
|   20400    |     0     | 2.9388145315487577 |  5   |
|   20400 

#### Using purchase dummy

In [34]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-------+------+
| customerId | productId | score | rank |
+------------+-----------+-------+------+
|    1553    |    297    |  0.0  |  1   |
|    1553    |     2     |  0.0  |  2   |
|    1553    |     43    |  0.0  |  3   |
|    1553    |    265    |  0.0  |  4   |
|    1553    |     1     |  0.0  |  5   |
|    1553    |    215    |  0.0  |  6   |
|    1553    |     16    |  0.0  |  7   |
|    1553    |     39    |  0.0  |  8   |
|    1553    |     19    |  0.0  |  9   |
|    1553    |     12    |  0.0  |  10  |
|   20400    |    297    |  0.0  |  1   |
|   20400    |     2     |  0.0  |  2   |
|   20400    |     43    |  0.0  |  3   |
|   20400    |    265    |  0.0  |  4   |
|   20400    |     1     |  0.0  |  5   |
|   20400    |    215    |  0.0  |  6   |
|   20400    |     16    |  0.0  |  7   |
|   20400    |     39    |  0.0  |  8   |
|   20400    |     19    |  0.0  |  9   |
|   20400    |     12    |  0.0  |  10  |
|   19750    |    297    |  0.0  |

#### Using normalized purchase count

In [35]:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+---------------------+------+
| customerId | productId |        score        | rank |
+------------+-----------+---------------------+------+
|    1553    |    226    |  0.7758620689655175 |  1   |
|    1553    |    247    | 0.33604336043360433 |  2   |
|    1553    |    230    |  0.3319810221272131 |  3   |
|    1553    |    248    | 0.26219512195121963 |  4   |
|    1553    |    125    | 0.25617956059950375 |  5   |
|    1553    |    294    |  0.250381679389313  |  6   |
|    1553    |    204    |  0.2372093023255814 |  7   |
|    1553    |    276    |  0.234683305476651  |  8   |
|    1553    |     72    |  0.2306079664570231 |  9   |
|    1553    |     74    | 0.23047444271986883 |  10  |
|   20400    |    226    |  0.7758620689655175 |  1   |
|   20400    |    247    | 0.33604336043360433 |  2   |
|   20400    |    230    |  0.3322580645161291 |  3   |
|   20400    |    248    | 0.26219512195121963 |  4   |
|   20400    |    125    | 0.25586136120337033 |

#### Note
* In collaborative filtering above, we used two approaches: cosine and Pearson distance. We also applied them to three training datasets: raw counts, dummy, and normalized counts of item purchases.
* We can see that the recommendations are different for each user. This suggests that personalization does exist. 
* But how good is this model compared to the baseline, and to each other? We need some means of evaluating a recommendation engine. Let's focus on that in the next section.

## 7. Model Evaluation
For evaluating recommendation engines, we can use RMSE together with the concepts of precision and recall.

* RMSE (Root Mean Squared Error)
    * Measures the error of predicted values
    * The lower the RMSE value, the better the recommendations
* Recall
    * What percentage of products that a user buys are actually recommended?
    * If a customer buys 5 products and the recommendation engine showed 3 of them, then the recall is 0.6
* Precision
    * Out of all the recommended items, how many did the user actually like?
    * If 5 products were recommended to the customer and the customer buys 4 of them, then precision is 0.8

* Why are both recall and precision important?
    * Consider a case where we recommend all products, so the recommendations will surely cover the items that our customers liked and bought. In this case, we have 100% recall! Does this mean our model is good?
    * We have to consider precision. If we recommend 300 items but the user likes and buys only 3 of them, then precision is 1%! This very low precision indicates that the model is not great, despite its excellent recall.
    * So our aim has to be optimizing both recall and precision (to be as close to 1 as possible).
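
The definitions above can be checked with a few lines of plain Python. The helper names and item IDs below are hypothetical and only mirror the worked examples in the list; `turicreate` computes its own versions of these metrics per cutoff.

```python
import math

def precision_recall(recommended, bought):
    """Precision and recall of a recommendation list against the items actually bought."""
    hits = len(set(recommended) & set(bought))
    return hits / len(recommended), hits / len(bought)

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Recall example: the customer buys 5 products and 3 of them were recommended.
_, recall = precision_recall(recommended=[1, 2, 3, 8, 9], bought=[1, 2, 3, 4, 5])

# Precision example: 5 products are recommended and the customer buys 4 of them.
precision, _ = precision_recall(recommended=[1, 2, 3, 4, 5], bought=[1, 2, 3, 4])

# Recommending a whole 300-item catalog to a customer who buys 3 items.
p_all, r_all = precision_recall(recommended=range(300), bought=[7, 42, 123])

print(recall, precision, p_all, r_all)
```

The last case shows the trade-off directly: recall is perfect because every purchase was in the list, while precision collapses because almost none of the 300 recommendations were bought.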

Let's compare all the models we have built based on their precision-recall characteristics:

In [36]:
# create initial callable variables

models_w_counts = [popularity_model, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]

names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

#### Models on purchase counts

In [37]:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    | 0.0007919936640506857 |  0.00033496874882144  |
|   2    | 0.0006479948160414707 | 0.0005372528448343863 |
|   3    |  0.002087983296133628 | 0.0033771685870469174 |
|   4    | 0.0027179782561739508 |  0.005733197056718624 |
|   5    | 0.0059183526531787455 |  0.01681937986097987  |
|   6    | 0.0055199558403532835 |  0.018813473798338348 |
|   7    |  0.005883381504376558 |  0.022538460881558503 |
|   8    |  0.005552955576355358 |  0.023930021177647608 |
|   9    |  0.005375956992344102 |  0.026495472082611667 |
|   10   | 0.0050327597379220945 |  0.027545063685878963 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0056648217458366

Per User RMSE (best)
+------------+-----------------------+------


Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    | 0.11887104903160778  | 0.07072651955922277 |
|   2    | 0.09543523651810745  | 0.11253182720099338 |
|   3    | 0.07924736602107199  | 0.13611886973863602 |
|   4    | 0.06870545035639669  |  0.1560406211821828 |
|   5    | 0.06106991144070839  | 0.17273709466855516 |
|   6    | 0.05525955792353647  | 0.18657104236697714 |
|   7    | 0.05087273587525585  |  0.1987105926663177 |
|   8    | 0.047069623443012666 | 0.20926129583833106 |
|   9    | 0.04403964768281845  | 0.21898513615787368 |
|   10   | 0.041608467132263016 | 0.22915050924363894 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 1.8649652692790892



Precision and recall summary statistics by cutoff
+--------+-----------------------+------------------------+
| cutoff |     mean_precision    |      mean_recall       |
+--------+-----------------------+------------------------+
|   1    | 0.0007919936640506888 | 0.00033496874882143823 |
|   2    | 0.0006479948160414727 | 0.0005372528448343861  |
|   3    | 0.0020879832961336335 |  0.003275169403040383  |
|   4    | 0.0026999784001727842 |  0.005661197632714032  |
|   5    |  0.005903952768377861 |  0.01681283445879767   |
|   6    |  0.005483956128350943 |  0.01868147485432979   |
|   7    |  0.005852524608374591 |  0.02248674700955516   |
|   8    |  0.005561955504355976 |  0.023948021033648625  |
|   9    |  0.005407956736346131 |  0.026617871103419725  |
|   10   |  0.005075959392324885 |  0.027869061093899684  |
+--------+-----------------------+------------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0027384836750355


#### Models on purchase dummy

In [38]:
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)

PROGRESS: Evaluate model Popularity Model on Purchase Dummy



Precision and recall summary statistics by cutoff
+--------+-----------------------+----------------------+
| cutoff |     mean_precision    |     mean_recall      |
+--------+-----------------------+----------------------+
|   1    | 0.0066858375269590225 | 0.003515206466003599 |
|   2    |  0.005966930265995687 | 0.005723506992061129 |
|   3    |  0.005919003115264766 | 0.00892284659326996  |
|   4    |  0.006919482386772101 | 0.01397621642018399  |
|   5    |  0.007174694464414089 | 0.01847884830904097  |
|   6    |  0.006973400431344361 | 0.021126395465933846 |
|   7    |  0.00700421074252853  | 0.025246019351912872 |
|   8    |  0.006946441409058237 | 0.029070430422909247 |
|   9    |  0.006829618979151715 | 0.032322972273581885 |
|   10   | 0.0066642703091301225 | 0.03495058502488876  |
+--------+-----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.0



Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    |  0.1228612508986338  | 0.07114254933363341 |
|   2    | 0.09701653486700207  | 0.10946407337374749 |
|   3    | 0.08118859333812645  | 0.13547165845722048 |
|   4    | 0.07138749101365932  |  0.1580745659034473 |
|   5    | 0.06416966211358754  | 0.17650957665513736 |
|   6    | 0.058375269590222933 | 0.19156193316655842 |
|   7    | 0.05350724042312809  | 0.20359116122741727 |
|   8    |  0.0496315600287564  |  0.2153786805616669 |
|   9    | 0.04662512980269981  | 0.22699725150266784 |
|   10   | 0.044040258806614135 | 0.23803991397361043 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9697944917087329



Precision and recall summary statistics by cutoff
+--------+-----------------------+----------------------+
| cutoff |     mean_precision    |     mean_recall      |
+--------+-----------------------+----------------------+
|   1    |  0.006685837526959017 | 0.003515206466003602 |
|   2    |  0.005966930265995687 | 0.005723506992061135 |
|   3    | 0.0059190031152647595 | 0.00892284659326991  |
|   4    |  0.006919482386772126 | 0.013976216420183936 |
|   5    |  0.007174694464414096 | 0.01847884830904085  |
|   6    |  0.006973400431344332 | 0.021126395465933947 |
|   7    |  0.007004210742528477 | 0.025246019351912986 |
|   8    |  0.006946441409058246 | 0.02907043042290921  |
|   9    |  0.006829618979151707 | 0.032322972273581774 |
|   10   |  0.00666427030913014  | 0.03495058502488914  |
+--------+-----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0


#### Models on normalized purchase frequency

In [39]:
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

PROGRESS: Evaluate model Popularity Model on Scaled Purchase Counts



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.002587136183974133 | 0.0013445579453484609 |
|   2    | 0.0025152712899748435 |  0.002632422474319713 |
|   3    |  0.002754820936639118 | 0.0048458612094976015 |
|   4    |  0.002299676607976999 |  0.00526447421704341  |
|   5    |  0.002357168523176412 |  0.006963679710715294 |
|   6    | 0.0023236315726434353 |  0.008313541969668405 |
|   7    |  0.002268877367691601 |  0.009465176896006912 |
|   8    | 0.0023266259432267386 |  0.010709782492031819 |
|   9    |  0.003058250489080544 |  0.015165691098138085 |
|   10   | 0.0029320876751706784 |  0.016153833390628237 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.13212658151018739



Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    |  0.0681279195113183  | 0.039010352773959395 |
|   2    | 0.05565936040244344  |   0.06295700964129   |
|   3    | 0.046209126841537926 | 0.07682975797473487  |
|   4    | 0.040819259791591976 | 0.08877259787270143  |
|   5    | 0.036507366151634774 | 0.09885186134488626  |
|   6    | 0.033261468439334044 | 0.10751002685087765  |
|   7    | 0.031086699861403368 | 0.11675593026674184  |
|   8    | 0.029033417175709637 | 0.12441091033278888  |
|   9    | 0.027524254401724714 |  0.1324808598361243  |
|   10   | 0.02613007545813864  | 0.13918964963838207  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.160037702525075



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.002587136183974122 | 0.0013445579453484659 |
|   2    | 0.0025152712899748518 |  0.002632422474319721 |
|   3    | 0.0027548209366391276 |  0.004845861209497577 |
|   4    | 0.0022996766079769913 |  0.005264474217043418 |
|   5    |  0.002357168523176436 |  0.006963679710715252 |
|   6    |  0.002335609054976647 |  0.008349474416668036 |
|   7    | 0.0022791437811200636 |  0.009525064307672977 |
|   8    |  0.002299676607976992 |  0.010703793750865222 |
|   9    |  0.002770790913083399 |  0.013970851681954862 |
|   10   | 0.0029536471433704607 |  0.01629756317862674  |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.13183247132462583


## 8. Model Selection
### 8.1. Evaluation summary
* Based on RMSE


    1. Popularity on purchase counts: 1.1111750034210488
    2. Cosine similarity on purchase counts: 1.9230643981653215
    3. Pearson similarity on purchase counts: 1.9231102838192284
    
    4. Popularity on purchase dummy: 0.9697374361161925
    5. Cosine similarity on purchase dummy: 0.9697509978436404
    6. Pearson similarity on purchase dummy: 0.9697745320187097
    
    7. Popularity on scaled purchase counts: 0.16230660626840343
    8. Cosine similarity on scaled purchase counts: 0.16229800354111104
    9. Pearson similarity on scaled purchase counts: 0.1622982668334026
    
* Based on Precision and Recall
![](../images/model_comparisons.png)
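
The metrics summarized above can be reproduced by hand on a small case. The sketch below uses one common definition of precision and recall at a cutoff `k` (hits divided by `k`, and hits divided by the number of relevant items) together with a plain RMSE. The helper names and the toy lists are hypothetical and are not taken from `trx_data.csv`; Turi Create's own evaluation may differ in details.

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data for a single user
recommended = [1, 2, 35, 33, 61]   # ranked recommendations
relevant = [2, 61, 111]            # items the user bought in the test set

p, r = precision_recall_at_k(recommended, relevant, k=5)
err = rmse([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(p, r, err)
```

Averaging the per-user precision and recall over all test users gives values comparable to the `mean_precision` and `mean_recall` columns in the tables above.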


#### Notes

* Popularity vs. collaborative filtering: The collaborative filtering algorithms work better than the popularity model on purchase counts. This is expected, since the popularity model offers no personalization: it recommends the same list of items to every user.
* Precision and recall: Looking at the summary above, precision and recall rank Purchase Counts > Purchase Dummy > Normalized Purchase Counts. However, because the recommendation scores for the normalized purchase data are zero and constant, we choose the dummy. In fact, the RMSE isn’t much different between the models on the dummy and those on the normalized data.
* RMSE: Since RMSE is higher with Pearson similarity than with cosine, we choose the model with the smaller error, which in this case is cosine.

Therefore, we select the Cosine similarity on Purchase Dummy approach as our final model.
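
To make the two similarity measures compared in these notes concrete, here is a minimal sketch of cosine and Pearson similarity between two item vectors of purchase dummies. These are the textbook definitions and are not necessarily how Turi Create computes them internally; the two item vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_similarity(a, b):
    """Cosine similarity after centering each vector on its own mean."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return float(
        np.dot(a_centered, b_centered)
        / (np.linalg.norm(a_centered) * np.linalg.norm(b_centered))
    )

# Hypothetical purchase dummies: one entry per user, 1 = bought the item
item_a = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
item_b = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

cos_sim = cosine_similarity(item_a, item_b)
pear_sim = pearson_similarity(item_a, item_b)
print(cos_sim, pear_sim)
```

On this toy input the two measures give quite different values, because Pearson subtracts each item's mean before comparing.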

## 8. Final Output
* In this step, we reshape the recommendation output into a format we can export to csv, and define a function that returns the recommendation list for a given customer ID.
* We first need to retrain the selected model on the whole dataset, since the final model was chosen using the train data and evaluated on the test set.

In [40]:
users_to_recommend = list(customers[user_id])

final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy), 
                                            user_id=user_id, 
                                            item_id=item_id, 
                                            target='purchase_dummy', 
                                            similarity_type='cosine')

recom = final_model.recommend(users=users_to_recommend, k=n_rec)
recom.print_rows(n_display)

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     1     | 0.10348175764083863  |  1   |
|    1553    |     2     |  0.0934672474861145  |  2   |
|    1553    |     35    |  0.0845762014389038  |  3   |
|    1553    |     33    |  0.0668614387512207  |  4   |
|    1553    |     61    | 0.06512556076049805  |  5   |
|    1553    |     5     | 0.06496070623397827  |  6   |
|    1553    |     15    | 0.06476415395736694  |  7   |
|    1553    |     11    | 0.05467898845672607  |  8   |
|    1553    |     36    | 0.05048650503158569  |  9   |
|    1553    |     13    | 0.04985467195510864  |  10  |
|   20400    |     26    | 0.05812269449234009  |  1   |
|   20400    |     6     | 0.05361741781234741  |  2   |
|   20400    |    113    | 0.05312788486480713  |  3   |
|   20400    |     1     | 0.05210459232330322  |  4   |
|   20400    |     15    | 0.04

### 8.1. CSV output file

In [41]:
df_rec = recom.to_dataframe()
print(df_rec.shape)
df_rec.head()

(10000, 4)


Unnamed: 0,customerId,productId,score,rank
0,1553,1,0.103482,1
1,1553,2,0.093467,2
2,1553,35,0.084576,3
3,1553,33,0.066861,4
4,1553,61,0.065126,5


In [42]:
df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id].transform(lambda x: '|'.join(x.astype(str)))
df_output = df_rec[['customerId', 'recommendedProducts']].drop_duplicates().sort_values('customerId').set_index('customerId')
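
The reshaping step above turns one row per (customer, product) into one row per customer with a pipe-separated string. As an illustrative alternative, the same result can be sketched with a single `groupby(...).agg(...)` on a toy frame. The frame `toy_rec` and its values are hypothetical and only mimic the columns of `df_rec`.

```python
import pandas as pd

# Hypothetical ranked recommendations in the same long format as df_rec
toy_rec = pd.DataFrame({
    'customerId': [7, 7, 7, 3, 3, 3],
    'productId': [1, 2, 35, 26, 6, 113],
    'rank': [1, 2, 3, 1, 2, 3],
})

toy_output = (
    toy_rec.sort_values(['customerId', 'rank'])        # keep the ranked order
           .groupby('customerId')['productId']
           .agg(lambda x: '|'.join(x.astype(str)))     # one string per customer
           .to_frame('recommendedProducts')
)
print(toy_output)
```

Sorting by `rank` before grouping makes the order of product IDs in each string explicit rather than relying on the order in which the rows happen to arrive.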

#### Define a function to create a desired output

In [43]:
def create_output(model, users_to_recommend, n_rec, print_csv=True):
    """Return one row per customer with the top n_rec product IDs joined by '|'."""
    recommendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recommendation.to_dataframe()
    # Collapse each customer's ranked product IDs into one pipe-separated string
    df_rec['recommendedProducts'] = df_rec.groupby([user_id])[item_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    # Keep one row per customer, sorted and indexed by customer ID
    df_output = df_rec[['customerId', 'recommendedProducts']].drop_duplicates() \
        .sort_values('customerId').set_index('customerId')
    if print_csv:
        df_output.to_csv('../output/option1_recommendation.csv')
        print("An output file can be found in 'output' folder with name 'option1_recommendation.csv'")
    return df_output

In [44]:
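# Note: pear_norm is not the model selected in the notes above, and the output
# shown below comes from pear_norm. Pass final_model instead to export the
# cosine-on-dummy recommendations.
df_output = create_output(pear_norm, users_to_recommend, n_rec, print_csv=True)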
print(df_output.shape)
df_output.head()

An output file can be found in 'output' folder with name 'option1_recommendation.csv'
(1000, 1)


customerId,recommendedProducts
4,226|247|230|248|125|294|204|276|74|72
11,226|247|230|248|125|294|204|276|72|74
12,226|247|230|248|125|294|204|276|74|72
16,226|247|230|248|125|294|204|276|72|74
21,226|247|230|248|125|294|204|276|72|74


### 8.2. Customer recommendation function

In [45]:
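def customer_recomendation(customer_id):
    """Look up the recommended products for one customer in df_output."""
    if customer_id not in df_output.index:
        # Unknown customers are reported and their ID is returned unchanged
        print('Customer not found.')
        return customer_id
    return df_output.loc[customer_id]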

In [46]:
customer_recomendation(4)

recommendedProducts    226|247|230|248|125|294|204|276|74|72
Name: 4, dtype: object

In [47]:
customer_recomendation(21)

recommendedProducts    226|247|230|248|125|294|204|276|72|74
Name: 21, dtype: object
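
The lookup above returns a pandas Series holding a pipe-separated string. If a ranked Python list of product IDs is preferred, as the problem statement describes, a small wrapper could split the string. The sketch below is one possible approach: the function name `recommended_product_list` and the stand-in frame `toy_output` are hypothetical, and the frame only imitates the layout of `df_output`.

```python
import pandas as pd

# Hypothetical stand-in with the same layout as df_output
toy_output = pd.DataFrame(
    {'recommendedProducts': ['226|247|230', '1|2|35']},
    index=pd.Index([4, 1553], name='customerId'),
)

def recommended_product_list(customer_id, output=toy_output):
    """Return the ranked product IDs as a list of ints, or [] if unknown."""
    if customer_id not in output.index:
        return []
    joined = output.loc[customer_id, 'recommendedProducts']
    return [int(product_id) for product_id in joined.split('|')]

print(recommended_product_list(4))
print(recommended_product_list(999))
```

Returning an empty list for unknown customers keeps the return type consistent, which is a design choice made for this sketch rather than the behaviour of the function defined above.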

## Summary
In this exercise, we walked through a step-by-step process for making recommendations to customers. We used collaborative filtering approaches with the `cosine` and `pearson` similarity measures and compared the models with our baseline popularity model. We also prepared three versions of the data, using regular purchase count, purchase dummy, and normalized purchase frequency as the target variable. Using RMSE, precision, and recall, we evaluated our models and observed the impact of personalization. Finally, we selected the cosine approach on the purchase dummy data.