# Sales CF-Based Recommender System

Collaborative filtering recommender systems come in two variations: user-based and item-based. Both recommend an item based on previous opinions of that item, or on previous actions the user has taken on similar items.



![User-Item Matrix](userRatings.png)

# User-based Collaborative Filtering

The goal of user-based collaborative filtering is to predict the rating of item $i_n$ for a user $u_m$, given that user's previous actions. To do so, we find $k$ other users that are similar to $u_m$. To measure similarity, we can use

$\begin{align}
\textbf{Cosine Similarity:} \\
\cos \theta &= \frac{A\cdot B}{\|A\|\|B\|}
= \frac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^nA^2_i}\sqrt{\sum_{i=1}^nB^2_i}}
\end{align}$




$\textbf{Pearson Correlation:}$
$\begin{align}
r &= \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}}\\
\end{align}$
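As a small illustration of the two formulas above, here is a NumPy sketch. The function names `cosine_similarity` and `pearson_correlation` are chosen for this example only, and the sketch assumes dense vectors without missing values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Parallel vectors give a cosine similarity close to 1
cos_parallel = cosine_similarity([1, 2, 3], [2, 4, 6])
# Orthogonal vectors give a cosine similarity of 0
cos_orthogonal = cosine_similarity([1, 0], [0, 1])
# Perfectly opposite trends give a Pearson correlation close to -1
pearson_opposite = pearson_correlation([1, 2, 3], [3, 2, 1])
```

Pearson correlation is the same computation as cosine similarity after subtracting each vector's mean, which is why it is less sensitive to users who rate everything high or everything low.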


Once the $k$ most similar users (the nearest neighbors) are found, the rating is predicted from their ratings as $$p_{a,i} = \bar{r}_a+\frac{\sum_{u \in K}(r_{u,i} - \bar{r}_u) \times w_{a,u}}{\sum_{u \in K}w_{a,u}},$$
where $p_{a,i}$ is the prediction for the target user $a$ (that is, $u_m$) on item $i$ (that is, $i_n$), $r_{u,i}$ is the rating of user $u$ on item $i$, $\bar{r}_a$ and $\bar{r}_u$ are the mean ratings of users $a$ and $u$, $w_{a,u}$ is the similarity between users $a$ and $u$, and $K$ is the neighborhood of similar users.
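The neighborhood prediction can be sketched as follows. This is only an illustration under stated assumptions: `predict_user_based` is a hypothetical helper, the ratings matrix uses `np.nan` for missing ratings, the similarity matrix `sims` is given, and the neighbor weights are assumed to be positive.

```python
import numpy as np

def predict_user_based(ratings, sims, a, i, k=2):
    """Predict the rating of user a on item i from the k most similar users.

    ratings: 2-D array (users x items), np.nan where a rating is missing.
    sims:    2-D array where sims[a, u] is the similarity of users a and u.
    """
    # Only users other than a who actually rated item i can contribute
    candidates = [u for u in range(ratings.shape[0])
                  if u != a and not np.isnan(ratings[u, i])]
    # Keep the k candidates most similar to user a
    neighbors = sorted(candidates, key=lambda u: sims[a, u], reverse=True)[:k]

    mean_a = np.nanmean(ratings[a])
    numerator = sum((ratings[u, i] - np.nanmean(ratings[u])) * sims[a, u]
                    for u in neighbors)
    denominator = sum(sims[a, u] for u in neighbors)
    # Fall back to the user's own mean when no neighbor carries weight
    if denominator == 0:
        return float(mean_a)
    return float(mean_a + numerator / denominator)

# Toy data: user 0 has not rated item 2
toy_ratings = np.array([[5.0, 3.0, np.nan],
                        [4.0, 2.0, 4.0],
                        [1.0, 5.0, 2.0]])
toy_sims = np.array([[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
prediction = predict_user_based(toy_ratings, toy_sims, a=0, i=2, k=2)
```

Each neighbor contributes its deviation from its own mean rating, so a neighbor who rates everything high does not inflate the prediction.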

# Item-based Collaborative Filtering

The goal of item-based collaborative filtering is to predict ratings for a specific item $i_n$ based on ratings of similar previous items rated by all possible users. 

The rating for target item $i_n$ for active user $u_m$ can be found using

$\textbf{Weighted Average}$ $$p_{a,i} = \frac{\sum_{j \in K} r_{a,j}w_{i,j}}{\sum_{j \in K}|w_{i,j}|}$$
where $K$ is the neighborhood of the most similar items rated by user $u_m$ and $w_{i,j}$ is the similarity between items $i$ and $j$. 

$\textbf{Adjusted Cosine Similarity}$ $$\text{sim}_{i,j} = \frac{\sum_{u \in U}(R_{u,i}-\bar{R}_u)(R_{u,j}-\bar{R}_u)}{\sqrt{\sum_{u \in U}(R_{u,i}-\bar{R}_u)^2}\sqrt{\sum_{u \in U}(R_{u,j}-\bar{R}_u)^2}}$$
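A rough NumPy sketch of the two item-based formulas is shown here. The helper names `adjusted_cosine` and `weighted_average` are made up for this example, and the sketch assumes that $U$ is the set of users who rated both items, with `np.nan` marking missing ratings.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine similarity between items i and j.

    R is a users x items array with np.nan for missing ratings. Each rating
    is centered on its user's mean rating before the cosine is taken.
    """
    user_means = np.nanmean(R, axis=1)
    # Assumption: only users who rated both items are included
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
    di = R[both, i] - user_means[both]
    dj = R[both, j] - user_means[both]
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return 0.0 if denom == 0 else float((di * dj).sum() / denom)

def weighted_average(ratings_a, weights):
    """Weighted-average prediction from the active user's ratings of neighbor items."""
    ratings_a = np.asarray(ratings_a, dtype=float)
    weights = np.asarray(weights, dtype=float)
    denom = np.abs(weights).sum()
    return float("nan") if denom == 0 else float((ratings_a * weights).sum() / denom)

toy_R = np.array([[5.0, 5.0, 1.0],
                  [4.0, 4.0, 2.0],
                  [1.0, 1.0, 5.0]])
sim_01 = adjusted_cosine(toy_R, 0, 1)   # items rated identically by every user
sim_02 = adjusted_cosine(toy_R, 0, 2)   # items rated in opposite directions
pred = weighted_average([4.0, 2.0], [0.5, 0.5])
```

Centering on the user mean (rather than the item mean) is what distinguishes adjusted cosine from Pearson correlation between item columns.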

# Implementation

In this program, I am building a collaborative filtering model for recommending product items. 



*   Input: User -  Customer ID
*   Output: Ranked list of items (product ID) that the user most likely wants to buy 



In [3]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

import sys

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 2. Load data
Two datasets are used in this exercise, which can be found in the `data` folder: 
* `recommend_1.csv` consisting of a list of 1000 customer IDs for which recommendations are produced as output
* `trx_data.csv` consisting of user transactions

The format is as follows.

In [None]:
customers = pd.read_csv('/recommend_1.csv')
transactions = pd.read_csv('/trx_data.csv')

In [6]:
print(customers.shape)
customers.head()

(1000, 1)


Unnamed: 0,customerId
0,1553
1,20400
2,19750
3,6334
4,27773


In [7]:
print(transactions.shape)
transactions.head()

(62483, 2)


Unnamed: 0,customerId,products
0,0,20
1,1,2|2|23|68|68|111|29|86|107|152
2,2,111|107|29|11|11|11|33|23
3,3,164|227
4,5,2|2


## 3. Data preparation
* Our goal here is to break down each list of items in the `products` column into rows and count the number of products bought by a user

In [8]:
# example 1: split product items
transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])
transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index()

Unnamed: 0,customerId,0,1,2,3,4,5,6,7,8,9
0,0,20.0,,,,,,,,,
1,1,2.0,2.0,23.0,68.0,68.0,111.0,29.0,86.0,107.0,152.0


In [9]:
# example 2: organize a given table into a dataframe with customerId, single productId, and purchase count
pd.melt(transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})

Unnamed: 0,customerId,productId,purchase_count
0,0,20.0,1
1,1,2.0,2
2,1,23.0,1
3,1,29.0,1
4,1,68.0,2
5,1,86.0,1
6,1,107.0,1
7,1,111.0,1
8,1,152.0,1
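The same reshaping can also be sketched with `DataFrame.explode`, which avoids building a wide intermediate table and keeps product IDs as integers. This is an alternative illustration, not the approach used in this notebook; it assumes pandas 0.25 or later, and the `toy` frame below simply copies the first two rows of the transactions shown earlier.

```python
import pandas as pd

# First two transactions from the table above, in their raw string form
toy = pd.DataFrame({
    'customerId': [0, 1],
    'products': ['20', '2|2|23|68|68|111|29|86|107|152'],
})

# One row per purchased product
long_format = (
    toy.assign(productId=toy['products'].apply(lambda s: s.split('|')))
       .drop(columns='products')
       .explode('productId')
       .astype({'productId': 'int64'})
)

# Count how many times each customer bought each product
counts = (
    long_format.groupby(['customerId', 'productId'])
               .size()
               .reset_index(name='purchase_count')
)
```

The resulting `counts` frame has the same three columns as the table above: `customerId`, `productId`, and `purchase_count`.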


### 3.1. Create data with user, item, and target field
* This table will be an input for our modeling later
    * In this case, the user field is `customerId`, the item field is `productId`, and the target field is `purchase_count`

In [10]:
s=time.time()

data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(), 
             id_vars=['customerId'],
             value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)

print("Execution time:", round((time.time()-s)/60,2), "minutes")

Execution time: 0.28 minutes


In [11]:
print(data.shape)
data.head()

(133585, 3)


Unnamed: 0,customerId,productId,purchase_count
0,0,1,2
1,0,13,1
2,0,19,3
3,0,20,1
4,0,31,2


### 3.2. Create dummy
* A dummy marks whether a customer bought an item or not.
* If a customer buys an item, `purchase_dummy` is marked as 1
* Why create a dummy instead of normalizing it, you ask?
    * Normalizing the purchase count by each user would not work well, because customers with different buying frequencies do not necessarily have the same taste
    * However, we can normalize items by purchase frequency across all users, which is done in section 3.3. below.

In [None]:
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

In [None]:
data_dummy = create_data_dummy(data)

### 3.3. Normalize item values across users
* To do this, we normalize purchase frequency of each item across users by first creating a user-item matrix as follows

In [14]:
df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
df_matrix.head()

productId,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299
customerId
0,,2.0,,,,,,,,,,,,1.0,,,,,,3.0,1.0,,,,,,,,,,,2.0,,,,,,,,,...,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,6.0,,,,,,,,,,,,,,,,,,,,,1.0,,1.0,,,,1.0,,,,,,,,,,,...,,,,,,,,,,,,,,,1.0,,,,,,,,,,,1.0,,,,,,,,1.0,,,1.0,,,
2,,,,,,,,,,,,3.0,,,,,,,,,,,,1.0,,,,,,1.0,,,,1.0,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,


In [15]:
df_matrix.shape

(24429, 300)

In [16]:
df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
print(df_matrix_norm.shape)
df_matrix_norm.head()

(24429, 300)


productId,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299
customerId
0,,0.1,,,,,,,,,,,,0.0,,,,,,0.142857,0.0,,,,,,,,,,,0.2,,,,,,,,,...,0.571429,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,0.166667,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,,,,0.0,,,,,,,,,,,...,,,,,,,,,,,,,,,0.0,,,,,,,,,,,0.0,,,,,,,,0.0,,,0.0,,,
2,,,,,,,,,,,,0.1,,,,,,,,,,,,0.0,,,,,,0.0,,,,0.0,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,0.022222,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,


In [17]:
# create a table for input to the modeling

d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
print(data_norm.shape)
data_norm.head()

(133585, 3)


Unnamed: 0,customerId,productId,scaled_purchase_freq
9,9,0,0.133333
25,25,0,0.133333
32,33,0,0.133333
35,36,0,0.133333
43,44,0,0.133333


#### Define a function for normalizing data

In [None]:
def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix-df_matrix.min())/(df_matrix.max()-df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()

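* We can normalize customers' purchase history for each item to the range 0-1, with 1 being the highest purchase count observed for that item and 0 being the lowest observed purchase count for that item (so a customer who bought an item once typically gets 0 for it, as in the table above).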
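To make the min-max scaling concrete, here is a tiny worked example on invented data (the `toy` frame is not part of the dataset). It also shows one edge case worth knowing about: an item whose observed counts are all identical scales to `NaN`, so its rows would be dropped by the `dropna()` call in `normalize_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    'customerId':     [0, 1, 2, 0],
    'productId':      [10, 10, 10, 11],
    'purchase_count': [1, 3, 5, 2],
})

# Same steps as in section 3.3: pivot to a user-item matrix, then scale per item
matrix = pd.pivot_table(toy, values='purchase_count',
                        index='customerId', columns='productId')
matrix_norm = (matrix - matrix.min()) / (matrix.max() - matrix.min())

# Product 10 has counts 1, 3, 5, which scale to 0.0, 0.5, 1.0
scaled_low = matrix_norm.loc[0, 10]
scaled_mid = matrix_norm.loc[1, 10]
scaled_high = matrix_norm.loc[2, 10]
# Product 11 has a single observed count, so max - min is 0 and the result is NaN
scaled_single = matrix_norm.loc[0, 11]
```

Note that the lowest observed count maps to 0 even though the customer did buy the item, which is one reason to compare this scaled target against the raw counts and the dummy.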

## 4. Split train and test set
* Splitting the data into training and testing sets is an important part of evaluating predictive modeling, in this case a collaborative filtering model. Typically, we use a larger portion of the data for training and a smaller portion for testing. 
* We use 80:20 ratio for our train-test set size.
* The training portion is used to develop a predictive model, while the test portion is used to evaluate the model's performance.
* Now that we have three datasets with purchase counts, purchase dummy, and scaled purchase counts, we would like to split each.

In [19]:
train, test = train_test_split(data, test_size = .2)
print(train.shape, test.shape)

(106868, 3) (26717, 3)


In [None]:
# Using turicreate library, we convert dataframe to SFrame - this will be useful in the modeling part

train_data = tc.SFrame(train)
test_data = tc.SFrame(test)

In [21]:
train_data

customerId,productId,purchase_count
843,140,1
28348,108,1
26980,87,3
466,162,1
15286,15,2
8599,2,2
7519,2,2
24574,87,1
3961,82,1
10954,171,1


In [22]:
test_data

customerId,productId,purchase_count
1750,146,2
2870,76,1
21091,4,2
9881,2,1
10807,189,1
8564,274,1
13417,1,2
25794,228,1
21384,228,2
15982,175,1


#### Define a `split_data` function for splitting data to training and test set

In [None]:
# We can define a function for this step as follows

def split_data(data):
    '''
    Splits dataset into training and test set.
    
    Args:
        data (pandas.DataFrame)
        
    Returns
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size = .2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

In [None]:
# let's try with both the dummy table and the scaled/normalized purchase table

train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

## 5. Baseline Model
Before running a more complicated approach such as collaborative filtering, we would like to use a baseline model to compare and evaluate models against. Since a baseline typically uses a very simple approach, a more complex technique should only be chosen if its gain in accuracy justifies the added complexity.

### 5.1. Using a Popularity model as a baseline
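* The popularity model takes the most popular items for recommendation. These items are the products with the highest sales across customers. When a target column is given, the score shown in the output below matches the item's mean target value in the training set (see the notes at the end of this section).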
* We use `turicreate` library for running and evaluating both baseline and collaborative filtering models below
* Training data is used for model selection

#### Using purchase counts

In [None]:
# variables to define field names
user_id = 'customerId'
item_id = 'productId'
target = 'purchase_count'
users_to_recommend = list(transactions[user_id])
n_rec = 10 # number of items to recommend
n_display = 30

In [27]:
popularity_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)

In [28]:
# Get recommendations for the users in users_to_recommend (here, the customer IDs from the transactions table)
# Printed below are the first 30 rows: 10 recommendations each for the first 3 customers

popularity_recomm = popularity_model.recommend(users=users_to_recommend, k=n_rec)
popularity_recomm.print_rows(n_display)

+------------+-----------+--------------------+------+
| customerId | productId |       score        | rank |
+------------+-----------+--------------------+------+
|     0      |    248    | 3.272727272727273  |  1   |
|     0      |    132    | 3.1724137931034484 |  2   |
|     0      |     37    | 3.032258064516129  |  3   |
|     0      |     34    | 2.9959016393442623 |  4   |
|     0      |     0     | 2.992079207920792  |  5   |
|     0      |     3     | 2.800429184549356  |  6   |
|     0      |    110    | 2.7329545454545454 |  7   |
|     0      |     27    | 2.699248120300752  |  8   |
|     0      |    230    | 2.676258992805755  |  9   |
|     0      |     10    | 2.6516516516516515 |  10  |
|     1      |    248    | 3.272727272727273  |  1   |
|     1      |    132    | 3.1724137931034484 |  2   |
|     1      |     37    | 3.032258064516129  |  3   |
|     1      |     34    | 2.9959016393442623 |  4   |
|     1      |     0     | 2.992079207920792  |  5   |
|     1   

#### Define a `model` function for model selection

In [None]:
# Since turicreate is a very accessible library, we can define a model selection function as below

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        rec_model = tc.popularity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target)
    elif name == 'cosine':
        rec_model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='cosine')
    elif name == 'pearson':
        rec_model = tc.item_similarity_recommender.create(train_data, 
                                                    user_id=user_id, 
                                                    item_id=item_id, 
                                                    target=target, 
                                                    similarity_type='pearson')
    else:
        # fail early on an unknown model name instead of hitting an unbound variable below
        raise ValueError("name must be 'popularity', 'cosine', or 'pearson'")
        
    # print the first n_display recommendation rows, then return the fitted model
    recom = rec_model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return rec_model

In [None]:
# variables to define field names
# constant variables include:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10 # number of items to recommend
n_display = 30 # to print the head / first few rows in a defined dataset

#### Using purchase dummy

In [31]:
# these variables will change accordingly
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-------+------+
| customerId | productId | score | rank |
+------------+-----------+-------+------+
|    1553    |    259    |  1.0  |  1   |
|    1553    |     44    |  1.0  |  2   |
|    1553    |    298    |  1.0  |  3   |
|    1553    |     45    |  1.0  |  4   |
|    1553    |    249    |  1.0  |  5   |
|    1553    |    106    |  1.0  |  6   |
|    1553    |     15    |  1.0  |  7   |
|    1553    |     3     |  1.0  |  8   |
|    1553    |     49    |  1.0  |  9   |
|    1553    |     11    |  1.0  |  10  |
|   20400    |     44    |  1.0  |  1   |
|   20400    |    298    |  1.0  |  2   |
|   20400    |     45    |  1.0  |  3   |
|   20400    |    249    |  1.0  |  4   |
|   20400    |    106    |  1.0  |  5   |
|   20400    |     15    |  1.0  |  6   |
|   20400    |     3     |  1.0  |  7   |
|   20400    |     49    |  1.0  |  8   |
|   20400    |    193    |  1.0  |  9   |
|   20400    |     11    |  1.0  |  10  |
|   19750    |     44    |  1.0  |

#### Using normalized purchase count

In [32]:
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+---------------------+------+
| customerId | productId |        score        | rank |
+------------+-----------+---------------------+------+
|    1553    |    226    |  0.7567567567567568 |  1   |
|    1553    |    247    |  0.3433583959899749 |  2   |
|    1553    |    230    |  0.3319148936170206 |  3   |
|    1553    |    125    | 0.25611510791366876 |  4   |
|    1553    |    294    | 0.24806201550387563 |  5   |
|    1553    |     72    | 0.23844537815126052 |  6   |
|    1553    |    155    | 0.23333333333333325 |  7   |
|    1553    |    248    | 0.22965116279069767 |  8   |
|    1553    |    204    |  0.2282608695652172 |  9   |
|    1553    |    165    |  0.2265193370165746 |  10  |
|   20400    |    226    |  0.7567567567567568 |  1   |
|   20400    |    247    |  0.3433583959899749 |  2   |
|   20400    |    230    |  0.3319148936170206 |  3   |
|   20400    |    125    | 0.25611510791366876 |  4   |
|   20400    |    294    | 0.24806201550387563 |

#### Notes
* Once we created the model, we predicted the recommendation items using scores by popularity. As you can tell from each model's results above, the rows show the first 30 records from 1000 users with 10 recommendations each. These 30 records cover 3 users and their recommended items, along with scores and ranks in descending order of score. 
* In the results, although different models have different recommendation lists, users within one model are recommended nearly the same list of 10 items. This is because popularity is calculated from the most popular items across all users. The small differences between users (for example, product 259 appears for customer 1553 but not for customer 20400) are consistent with items a user has already bought being left out of that user's list.
* In the grouping example below, products 248, 132, 37, and 34 are the most popular across customers. Taking the mean purchase count of each product over the customers who bought it, we see that these products are bought about 3 times on average in the training set of transactions (the same values as the scores of the first popularity model, trained on the `purchase_count` variable).

In [33]:
train.groupby(by=item_id)['purchase_count'].mean().sort_values(ascending=False).head(20)

productId
248    3.272727
132    3.172414
37     3.032258
34     2.995902
0      2.992079
3      2.800429
110    2.732955
27     2.699248
230    2.676259
10     2.651652
32     2.603774
58     2.563218
226    2.514493
245    2.456790
129    2.443787
173    2.439024
252    2.403509
68     2.402490
91     2.384298
41     2.354232
Name: purchase_count, dtype: float64
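To show why the popularity scores match these per-item means, here is a minimal pandas sketch of a popularity baseline. It is an illustration only and not turicreate's implementation: `popularity_recommend` is a hypothetical helper, the toy data is invented, and excluding items the user already bought is an assumption made to mirror the output above.

```python
import pandas as pd

def popularity_recommend(train, user, k=2):
    """Return the k items with the highest mean purchase count that `user` has not bought."""
    # Score every item by its mean purchase count across the customers who bought it
    scores = (train.groupby('productId')['purchase_count']
                   .mean()
                   .sort_values(ascending=False))
    # Items already bought by this user are skipped
    seen = set(train.loc[train['customerId'] == user, 'productId'])
    return [item for item in scores.index if item not in seen][:k]

toy_train = pd.DataFrame({
    'customerId':     [0, 0, 1, 2],
    'productId':      [1, 2, 2, 3],
    'purchase_count': [1, 4, 2, 2],
})

recs_user0 = popularity_recommend(toy_train, user=0)
recs_user1 = popularity_recommend(toy_train, user=1)
recs_user2 = popularity_recommend(toy_train, user=2)
```

Every user draws from the same global ranking, so the lists differ only where a user's own purchases are filtered out.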

## 6. Collaborative Filtering Model

#### Cosine similarity: using purchase count

In [34]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     2     | 0.07160244882106781  |  1   |
|    1553    |     35    |  0.0693587213754654  |  2   |
|    1553    |     1     |  0.0691552609205246  |  3   |
|    1553    |     61    | 0.06251166760921478  |  4   |
|    1553    |     21    | 0.04497094452381134  |  5   |
|    1553    |     76    | 0.04215094447135925  |  6   |
|    1553    |     0     | 0.03977116942405701  |  7   |
|    1553    |    269    | 0.037718966603279114 |  8   |
|    1553    |     8     | 0.03595167398452759  |  9   |
|    1553    |     5     | 0.03342823684215546  |  10  |
|   20400    |    122    | 0.045564889907836914 |  1   |
|   20400    |    215    | 0.04000645875930786  |  2   |
|   20400    |     6     | 0.03698962926864624  |  3   |
|   20400    |     1     | 0.036445021629333496 |  4   |
|   20400    |     54    | 0.03

#### Using purchase dummy

In [35]:
# these variables will change accordingly
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     2     |  0.0987391710281372  |  1   |
|    1553    |     35    | 0.08719536066055297  |  2   |
|    1553    |     1     | 0.08481138944625854  |  3   |
|    1553    |     5     | 0.06987897157669068  |  4   |
|    1553    |     21    | 0.062110412120819095 |  5   |
|    1553    |     33    | 0.05326050519943237  |  6   |
|    1553    |     17    | 0.050858259201049805 |  7   |
|    1553    |     8     | 0.04967168569564819  |  8   |
|    1553    |    105    | 0.049265813827514646 |  9   |
|    1553    |     15    |  0.0440105676651001  |  10  |
|   20400    |     26    | 0.04677283763885498  |  1   |
|   20400    |     54    | 0.04296356439590454  |  2   |
|   20400    |    229    | 0.04230165481567383  |  3   |
|   20400    |     1     | 0.042143821716308594 |  4   |
|   20400    |    220    | 0.04

#### Using normalized purchase count

In [36]:
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-----------------------+------+
| customerId | productId |         score         | rank |
+------------+-----------+-----------------------+------+
|    1553    |     25    |          0.0          |  1   |
|    1553    |    180    |          0.0          |  2   |
|    1553    |    228    |          0.0          |  3   |
|    1553    |     42    |          0.0          |  4   |
|    1553    |    288    |          0.0          |  5   |
|    1553    |     1     |          0.0          |  6   |
|    1553    |    205    |          0.0          |  7   |
|    1553    |     19    |          0.0          |  8   |
|    1553    |    133    |          0.0          |  9   |
|    1553    |     2     |          0.0          |  10  |
|   20400    |     1     | 0.0037458515167236328 |  1   |
|   20400    |     2     | 0.0036238288879394532 |  2   |
|   20400    |     8     |  0.002835109233856201 |  3   |
|   20400    |     38    | 0.0023179936408996584 |  4   |
|   20400    |

#### Pearson similarity: using purchase count

In [37]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+--------------------+------+
| customerId | productId |       score        | rank |
+------------+-----------+--------------------+------+
|    1553    |    248    | 3.269303671338341  |  1   |
|    1553    |    132    | 3.172413793103448  |  2   |
|    1553    |     37    | 3.0322580645161303 |  3   |
|    1553    |     34    | 2.9947963457127105 |  4   |
|    1553    |     0     | 2.9891616797683294 |  5   |
|    1553    |     3     | 2.7993184668951274 |  6   |
|    1553    |    110    | 2.7272471324963994 |  7   |
|    1553    |     27    |  2.69852878164528  |  8   |
|    1553    |    230    | 2.673649873986517  |  9   |
|    1553    |     10    | 2.6516516516516515 |  10  |
|   20400    |    248    | 3.2727272727272725 |  1   |
|   20400    |    132    | 3.172413793103448  |  2   |
|   20400    |     37    | 3.029654354818407  |  3   |
|   20400    |     34    | 2.995901639344264  |  4   |
|   20400    |     0     | 2.992079207920795  |  5   |
|   20400 

#### Using purchase dummy

In [38]:
# these variables will change accordingly
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+-------+------+
| customerId | productId | score | rank |
+------------+-----------+-------+------+
|    1553    |    259    |  0.0  |  1   |
|    1553    |     44    |  0.0  |  2   |
|    1553    |    298    |  0.0  |  3   |
|    1553    |     45    |  0.0  |  4   |
|    1553    |    249    |  0.0  |  5   |
|    1553    |    106    |  0.0  |  6   |
|    1553    |     15    |  0.0  |  7   |
|    1553    |     3     |  0.0  |  8   |
|    1553    |     49    |  0.0  |  9   |
|    1553    |     11    |  0.0  |  10  |
|   20400    |     44    |  0.0  |  1   |
|   20400    |    298    |  0.0  |  2   |
|   20400    |     45    |  0.0  |  3   |
|   20400    |    249    |  0.0  |  4   |
|   20400    |    106    |  0.0  |  5   |
|   20400    |     15    |  0.0  |  6   |
|   20400    |     3     |  0.0  |  7   |
|   20400    |     49    |  0.0  |  8   |
|   20400    |    193    |  0.0  |  9   |
|   20400    |     11    |  0.0  |  10  |
|   19750    |     44    |  0.0  |

#### Using normalized purchase count

In [39]:
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)

+------------+-----------+---------------------+------+
| customerId | productId |        score        | rank |
+------------+-----------+---------------------+------+
|    1553    |    226    |  0.7567567567567564 |  1   |
|    1553    |    247    | 0.34335839598997503 |  2   |
|    1553    |    230    |  0.331357133219428  |  3   |
|    1553    |    125    | 0.25611510791366904 |  4   |
|    1553    |    294    |  0.248062015503876  |  5   |
|    1553    |     72    | 0.23844537815126055 |  6   |
|    1553    |    155    | 0.23333333333333334 |  7   |
|    1553    |    248    | 0.22965116279069767 |  8   |
|    1553    |    204    | 0.22826086956521746 |  9   |
|    1553    |    165    |  0.2265193370165745 |  10  |
|   20400    |    226    |  0.7567425398568846 |  1   |
|   20400    |    247    |  0.3433559915386048 |  2   |
|   20400    |    230    |  0.3319112923043839 |  3   |
|   20400    |    125    | 0.25610130824631067 |  4   |
|   20400    |    294    | 0.24805311653041104 |

#### Note
* In collaborative filtering above, we used two similarity measures: cosine and Pearson. We applied each of them to three training datasets: raw purchase counts, purchase dummy, and normalized purchase counts.
* We can see that the recommendations are different for each user. This suggests that personalization does exist. 
* But how good is this model compared to the baseline, and to each other? We need some means of evaluating a recommendation engine. Let's focus on that in the next section.

## 7. Model Evaluation
For evaluating recommendation engines, we can use RMSE together with precision and recall.

* RMSE (Root Mean Squared Error)
    * Measures the error of predicted values
    * The lower the RMSE, the better the recommendations
* Recall
    * What percentage of the products that a user buys are actually recommended?
    * If a customer buys 5 products and the recommendation showed 3 of them, then the recall is 0.6
* Precision
    * Out of all the recommended items, how many did the user actually buy?
    * If 5 products were recommended to the customer and the customer buys 4 of them, then the precision is 0.8
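The three metrics above can be written out in a few lines of plain Python. This is a simplified sketch for a single user; the function names `precision_recall_at_k` and `rmse` are chosen for this example and are not turicreate functions.

```python
import math

def precision_recall_at_k(recommended, purchased, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(purchased))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(purchased) if purchased else 0.0
    return precision, recall

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual target values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# 5 items recommended, the customer buys 4 of them: precision 0.8
precision_a, recall_a = precision_recall_at_k([1, 2, 3, 4, 5], [1, 2, 3, 4], k=5)
# The customer buys 5 items, 3 of which were recommended: recall 0.6
precision_b, recall_b = precision_recall_at_k([1, 2, 3, 9, 8], [1, 2, 3, 4, 5], k=5)
# Errors of 0 and 2 give an RMSE of sqrt(2)
error = rmse([1.0, 2.0], [1.0, 4.0])
```

In the comparison tables below, these per-user values are averaged over users for each cutoff `k` from 1 to 10.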
    
Let's compare all the models we have built based on RMSE and precision-recall characteristics:

In [None]:
# create initial callable variables

models_w_counts = [popularity_model, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]

names_w_counts = ['Popularity Model on Purchase Counts', 'Cosine Similarity on Purchase Counts', 'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy', 'Cosine Similarity on Purchase Dummy', 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts', 'Cosine Similarity on Scaled Purchase Counts', 'Pearson Similarity on Scaled Purchase Counts']

#### Models on purchase counts

In [41]:
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)

PROGRESS: Evaluate model Popularity Model on Purchase Counts



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.00108256351039261  | 0.0005335491586934968 |
|   2    | 0.0009021362586605099 | 0.0009665745628505458 |
|   3    |  0.002766551193225568 |  0.004019403662157683 |
|   4    | 0.0033559468822170974 |  0.006494450286850699 |
|   5    |  0.006885103926097012 |  0.018628337741091527 |
|   6    |  0.007313317936874524 |  0.023420607914784897 |
|   7    |  0.006619102606400533 |  0.024658940285839433 |
|   8    |  0.00609844110854502  |  0.026038192068346074 |
|   9    | 0.0058057480112907595 |  0.02788129940365914  |
|   10   | 0.0057881062355658215 |  0.031253355861923716 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0938488207270458



Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    |  0.1146073903002305  | 0.06729130020764029 |
|   2    | 0.09577078521939966  | 0.11082678690863827 |
|   3    | 0.08025404157043825  |  0.1373577328310975 |
|   4    | 0.06937427829099273  | 0.15652858915914242 |
|   5    | 0.06127309468822183  | 0.17131354915836378 |
|   6    | 0.05498219784449591  |  0.1837126191102135 |
|   7    | 0.05027218739689864  | 0.19531800065370916 |
|   8    | 0.04689304272517341  | 0.20695190298117455 |
|   9    | 0.043968116499871604 |  0.2176828879796243 |
|   10   | 0.041202367205542885 | 0.22565637178668113 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 1.9244290168309925




Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    | 0.0010825635103926085 | 0.0005335491586934974 |
|   2    | 0.0009021362586605125 | 0.0009665745628505469 |
|   3    |  0.002670323325635082 | 0.0038991188276696386 |
|   4    |  0.004294168591224017 |  0.008912776884233298 |
|   5    |  0.006899538106235589 |  0.01870050864178426  |
|   6    |  0.007349403387220945 |  0.023637120616863343 |
|   7    |  0.006639722863741344 |  0.02480328208722529  |
|   8    |  0.006125505196304863 |  0.02626071901214909  |
|   9    |  0.005845842956120112 |  0.028242153907123403 |
|   10   |  0.005795323325635103 |  0.03137364069641161  |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0909592024577166


#### Models on purchase dummy

In [42]:
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)

PROGRESS: Evaluate model Popularity Model on Purchase Dummy



Precision and recall summary statistics by cutoff
+--------+-----------------------+----------------------+
| cutoff |     mean_precision    |     mean_recall      |
+--------+-----------------------+----------------------+
|   1    |  0.005306942053930007 | 0.002646640985711549 |
|   2    |  0.006024096385542173 | 0.006420523363294428 |
|   3    | 0.0060958118187033885 | 0.00962463362958202  |
|   4    |  0.005773092369477923 | 0.01216364713567811  |
|   5    |  0.005464716006884687 | 0.014559198729809767 |
|   6    |  0.005055938037865744 | 0.01575180930817765  |
|   7    |  0.004794688959921325 | 0.01734487265479288  |
|   8    |  0.004751147446930573 | 0.01943765679899703  |
|   9    |  0.004757123733027372 | 0.021808352475953526 |
|   10   |  0.004783419391853151 | 0.024014600680480523 |
+--------+-----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.0



Precision and recall summary statistics by cutoff
+--------+----------------------+---------------------+
| cutoff |    mean_precision    |     mean_recall     |
+--------+----------------------+---------------------+
|   1    | 0.12220309810671286  |  0.0711984141285199 |
|   2    | 0.09642139988525472  |  0.1101575463697083 |
|   3    | 0.08211417096959273  | 0.13850551402305103 |
|   4    | 0.07198436603557043  | 0.16008376525502574 |
|   5    | 0.06394148020654067  | 0.17673853411856877 |
|   6    | 0.058065595716198286 | 0.19108493802265958 |
|   7    | 0.05293623473485768  | 0.20186450834396769 |
|   8    | 0.04916989386115885  | 0.21326791934935804 |
|   9    | 0.045794288264167655 |  0.2225241021460714 |
|   10   | 0.04333763625932283  | 0.23268339499652782 |
+--------+----------------------+---------------------+
[10 rows x 3 columns]


Overall RMSE: 0.9694624626504055



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.005306942053929974 |  0.002646640985711559 |
|   2    |  0.006024096385542138 | 0.0064205233632944414 |
|   3    |  0.006095811818703401 |  0.009624633629582022 |
|   4    | 0.0057730923694779305 |  0.012163647135678154 |
|   5    |  0.005464716006884674 |  0.014559198729809803 |
|   6    |  0.00505593803786575  |  0.01575180930817772  |
|   7    |  0.00479468895992131  |  0.017344872654792946 |
|   8    |  0.004751147446930579 |  0.01943765679899696  |
|   9    | 0.0047571237330273405 |  0.021808352475953537 |
|   10   | 0.0047834193918531295 |  0.024014600680480627 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.0


#### Models on normalized purchase frequency

In [43]:
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

PROGRESS: Evaluate model Popularity Model on Scaled Purchase Counts



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.002301827075240959 | 0.0013465288767403583 |
|   2    | 0.0020860307869371283 |  0.002178543454978506 |
|   3    |  0.002014098690835855 |  0.003324661519525573 |
|   4    |  0.002068047762911803 |  0.004630174569751619 |
|   5    | 0.0022298949791396887 |  0.006361556294471808 |
|   6    | 0.0033448424687095558 |  0.011385401362367467 |
|   7    |  0.002969767967610004 |  0.01180460563353546  |
|   8    |  0.00268846209178535  |  0.012076748730452001 |
|   9    | 0.0025815630045237293 |  0.012992390721213767 |
|   10   | 0.0027478060710689175 |  0.01511390489495065  |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.13241807080730894



Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.06682491727808942  | 0.038151223472032744 |
|   2    | 0.05240253200978263  | 0.05902012482755494  |
|   3    | 0.045293243178439746 | 0.07485649709973381  |
|   4    | 0.03920299237519769  | 0.08512383526513123  |
|   5    | 0.03527549992806794  | 0.09524048514764222  |
|   6    | 0.032609216899247055 | 0.10502834806760032  |
|   7    | 0.030427276650841415 | 0.11375902059844567  |
|   8    | 0.028125449575600665 | 0.11965517151509744  |
|   9    | 0.02663086046771855  |  0.1272173340608018  |
|   10   | 0.025348870666091178 |  0.134925705140134   |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.16054879245867876



Precision and recall summary statistics by cutoff
+--------+-----------------------+-----------------------+
| cutoff |     mean_precision    |      mean_recall      |
+--------+-----------------------+-----------------------+
|   1    |  0.002301827075240963 | 0.0013465288767403607 |
|   2    | 0.0020860307869371287 | 0.0021785434549784956 |
|   3    | 0.0020140986908358583 |  0.003324661519525555 |
|   4    | 0.0020680477629118057 |  0.004630174569751607 |
|   5    |  0.002229894979139697 |  0.006361556294471809 |
|   6    | 0.0033208651033424493 |  0.011241537170164885 |
|   7    |  0.003000596008796267 |  0.011924492460370958 |
|   8    |  0.002751402675873962 |  0.012346494090831779 |
|   9    | 0.0026534951006250145 |  0.013334068177694901 |
|   10   |  0.002762192490289189 |  0.015173848308368414 |
+--------+-----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 0.13211031619884706


## 8. Model Selection
### 8.1. Evaluation summary
* Based on RMSE

    1. Popularity on purchase counts: 1.1111750034210488
    2. Cosine similarity on purchase counts: 1.9230643981653215
    3. Pearson similarity on purchase counts: 1.9231102838192284

    4. Popularity on purchase dummy: 0.9697374361161925
    5. Cosine similarity on purchase dummy: 0.9697509978436404
    6. Pearson similarity on purchase dummy: 0.9697745320187097

    7. Popularity on scaled purchase counts: 0.16230660626840343
    8. Cosine similarity on scaled purchase counts: 0.16229800354111104
    9. Pearson similarity on scaled purchase counts: 0.1622982668334026
    
* Based on Precision and Recall
![](../images/model_comparisons.png)
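
As a rough cross-check of the figure, the cutoff-10 precision and recall can be tabulated from the summaries printed in section 7. The sketch below copies those values, rounded to four decimals, and assumes that within each cell the unlabeled summaries appear in the same order as the model lists (popularity, cosine, Pearson). The names `rows` and `ranked` are illustrative and not part of the notebook.

```python
# (target variable, model, mean precision at cutoff 10, mean recall at cutoff 10)
# Values are rounded copies of the printed summaries; the model order is an assumption.
rows = [
    ("Purchase Counts", "Popularity", 0.0058, 0.0313),
    ("Purchase Counts", "Cosine", 0.0412, 0.2257),
    ("Purchase Counts", "Pearson", 0.0058, 0.0314),
    ("Purchase Dummy", "Popularity", 0.0048, 0.0240),
    ("Purchase Dummy", "Cosine", 0.0433, 0.2327),
    ("Purchase Dummy", "Pearson", 0.0048, 0.0240),
    ("Scaled Purchase Counts", "Popularity", 0.0027, 0.0151),
    ("Scaled Purchase Counts", "Cosine", 0.0253, 0.1349),
    ("Scaled Purchase Counts", "Pearson", 0.0028, 0.0152),
]

# Rank all nine models by recall at cutoff 10, best first.
ranked = sorted(rows, key=lambda row: row[3], reverse=True)

for target, model, precision, recall in ranked:
    print(f"{model:<10} on {target:<22} precision={precision:.4f} recall={recall:.4f}")
```

Under that ordering assumption, the three cosine models occupy the top three rows.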


#### Notes

* Popularity vs. collaborative filtering: The collaborative filtering algorithms work better than the popularity model on purchase counts. Indeed, the popularity model gives no personalization, as it returns the same list of recommended items to every user.
* Precision and recall: Looking at the summary above, precision and recall rank as Purchase Counts > Purchase Dummy > Normalized Purchase Counts. However, because the recommendation scores for the normalized purchase data are zero and constant, we choose the dummy. In fact, the RMSE does not differ much between the models on the dummy data and those on the normalized data.
* RMSE: Since the RMSE is higher with Pearson similarity than with cosine, we choose the model with the smaller error, which in this case is cosine.

Therefore, we select the Cosine similarity on Purchase Dummy approach as our final model.
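
To make the selected approach concrete, here is a minimal NumPy sketch of item-based cosine similarity on binary purchase data. It is an illustration under stated assumptions, not a reproduction of `turicreate`'s internal scoring: the toy `purchases` matrix is made up, and scoring an item by its mean similarity to the customer's purchased items is one common choice among several.

```python
import numpy as np

# Toy purchase dummy matrix (hypothetical): rows are customers, columns are products.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

def item_cosine_similarity(matrix):
    """Cosine similarity between every pair of product columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for never-purchased products
    normalized = matrix / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)  # a product should not recommend itself
    return sim

def recommend(matrix, user_index, k):
    """Top-k unpurchased products, scored by mean similarity to purchased ones."""
    sim = item_cosine_similarity(matrix)
    owned = matrix[user_index] > 0
    scores = sim[:, owned].mean(axis=1)
    scores[owned] = -np.inf  # never recommend what the customer already bought
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:k] if np.isfinite(scores[i])]

top_products = recommend(purchases, user_index=0, k=2)
print(top_products)
```

The sketch assumes the customer has at least one purchase; a customer with none would need a fallback such as the popularity model.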

## 8. Final Output
* In this step, we reshape the recommendation output into a format that can be exported to CSV, and define a function that returns the recommendation list for a given customer ID.
* We first need to retrain the model on the whole dataset, since the final model was chosen using the training data and evaluated on the test set.

In [44]:
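# Recommend for every customer in the dataset
users_to_recommend = list(customers[user_id])

# Retrain the selected model (cosine similarity on purchase dummy) on the full dataset
final_model = tc.item_similarity_recommender.create(
    tc.SFrame(data_dummy),
    user_id=user_id,
    item_id=item_id,
    target='purchase_dummy',
    similarity_type='cosine',
)

# Top n_rec recommendations per customer
recom = final_model.recommend(users=users_to_recommend, k=n_rec)
recom.print_rows(n_display)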

+------------+-----------+----------------------+------+
| customerId | productId |        score         | rank |
+------------+-----------+----------------------+------+
|    1553    |     1     | 0.10348175764083863  |  1   |
|    1553    |     2     |  0.0934672474861145  |  2   |
|    1553    |     35    |  0.0845762014389038  |  3   |
|    1553    |     33    |  0.0668614387512207  |  4   |
|    1553    |     61    | 0.06512556076049805  |  5   |
|    1553    |     15    | 0.06476415395736694  |  6   |
|    1553    |     11    | 0.05467898845672607  |  7   |
|    1553    |     5     | 0.05406981706619263  |  8   |
|    1553    |     36    | 0.05048650503158569  |  9   |
|    1553    |     13    | 0.04985467195510864  |  10  |
|   20400    |     26    | 0.05812269449234009  |  1   |
|   20400    |     6     | 0.05361741781234741  |  2   |
|   20400    |    113    | 0.05312788486480713  |  3   |
|   20400    |     1     | 0.05210459232330322  |  4   |
|   20400    |     15    | 0.04

### 8.2. Customer recommendation function

In [None]:
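def customer_recomendation(customer_id):
    """Return the recommended products for `customer_id` from `df_output`."""
    if customer_id not in df_output.index:
        # Unknown customer: report it and hand the ID back unchanged
        print('Customer not found.')
        return customer_id
    return df_output.loc[customer_id]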
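
The function above relies on a `df_output` table indexed by customer ID with a single `recommendedProducts` column of pipe-delimited product IDs. The step that builds it is not shown in this section, so the following is only a sketch of one way to produce that shape with pandas. The name `build_output`, the toy `recom_df` rows, and the assumption that the recommendations have been converted to a DataFrame with `customerId`, `productId`, and `rank` columns are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the recommendation table (hypothetical values).
recom_df = pd.DataFrame({
    "customerId": [4, 4, 4, 21, 21, 21],
    "productId": [2, 1, 36, 38, 36, 48],
    "rank": [1, 2, 3, 1, 2, 3],
})

def build_output(recom_df):
    """Collapse one row per recommendation into one row per customer."""
    ordered = recom_df.sort_values(["customerId", "rank"])
    joined = ordered.groupby("customerId")["productId"].apply(
        lambda ids: "|".join(str(i) for i in ids)
    )
    return joined.rename("recommendedProducts").to_frame()

df_output_sketch = build_output(recom_df)
print(df_output_sketch)
```

A frame of this shape can be written out with `to_csv` and looked up by customer ID with `.loc`, which is what the function above does.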

In [53]:
customer_recomendation(4)

recommendedProducts    2|1|36|13|216|61|20|33|25|157
Name: 4, dtype: object

In [54]:
customer_recomendation(21)

recommendedProducts    38|36|48|79|1|2|15|13|25|20
Name: 21, dtype: object
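
Downstream code will usually want the recommendations as a list rather than a pipe-delimited string. A small sketch, assuming product IDs are integers as they appear in the outputs above; `parse_recommendations` is a hypothetical helper, not part of the notebook.

```python
def parse_recommendations(joined):
    """Split a pipe-delimited product string into a list of integer IDs."""
    return [int(pid) for pid in joined.split("|") if pid]

# String copied from the output for customer 21 above.
products = parse_recommendations("38|36|48|79|1|2|15|13|25|20")
print(products)
```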

## Summary
In this exercise, we walked through a step-by-step process for making recommendations to customers. We used collaborative filtering approaches with the `cosine` and `pearson` similarity measures and compared them with our baseline popularity model. We also prepared three versions of the target variable: raw purchase counts, a purchase dummy, and normalized purchase frequency. Using RMSE, precision, and recall, we evaluated our models and observed the impact of personalization. Finally, we selected the cosine approach on the purchase dummy data.