# Data Science for Business

## Spring 2020, module 4 @ HSE

---

## Home assignment 5


Author: **Miron Rogovets**

---

Your goal for this task is twofold:

1. Cluster all the products into distinct groups (clusters)
2. Build a recommender system for customers, but instead of products we will recommend categories.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
pd.set_option('display.float_format', lambda x: '{:.2f}'.format(x))
sns.set_style('darkgrid')

In [3]:
df = pd.read_csv('data/sample_transations.csv', index_col=0)
df.head(3)

| Unnamed: 0 | dd_card_number | store_number | dd_transaction_number | sku | quantity | post_discount_price | date | hour | dbi_item_catgry | dbi_item_sub_catgry | dbi_item_famly_name | dbi_item_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 775 | 7969 | 74 | 1 | 2.29 | 9/2/2015 | Lunch | Beverages | Hot Coffee | Hot Coffee | Large |
| 1 | 0 | 775 | 7969 | 73 | 1 | 0.0 | 9/2/2015 | Lunch | Beverages | Hot Coffee | Hot Coffee | Medium |
| 2 | 0 | 761 | 16021 | 75 | 1 | 2.49 | 9/4/2015 | Morning | Beverages | Hot Coffee | Hot Coffee | X-Large |


In [4]:
df.shape

(51939, 12)

In [5]:
df.dtypes

dd_card_number             int64
store_number               int64
dd_transaction_number      int64
sku                        int64
quantity                   int64
post_discount_price      float64
date                      object
hour                      object
dbi_item_catgry           object
dbi_item_sub_catgry       object
dbi_item_famly_name       object
dbi_item_size             object
dtype: object

In [6]:
df.isna().sum()

dd_card_number           0
store_number             0
dd_transaction_number    0
sku                      0
quantity                 0
post_discount_price      0
date                     0
hour                     0
dbi_item_catgry          0
dbi_item_sub_catgry      0
dbi_item_famly_name      0
dbi_item_size            0
dtype: int64

- `dd_card_number` - customer identifier (categorical)
- `store_number` - store identifier (categorical)
- `dd_transaction_number` - transaction identifier (categorical)
- `sku` - product identifier, may vary across different stores (categorical)
- `quantity` - quantity purchased (numerical)
- `post_discount_price` - transaction price after discount (numerical)
- `date` - transaction date (stored as a string)
- `hour` - time of day of the transaction: Morning, Lunch, Afternoon or Night (categorical)
- `dbi_item_catgry` - product category (categorical)
- `dbi_item_sub_catgry` - product subcategory (categorical)
- `dbi_item_famly_name` - product family name, treated as the product below (categorical)
- `dbi_item_size` - product size (categorical)


### Data Exploration

In [37]:
df.duplicated().sum()

1273

In [38]:
data = df.drop_duplicates()
data.shape

(50666, 12)

In [118]:
{
    'customers': len(data['dd_card_number'].unique()),
    'stores': len(data['store_number'].unique()),
    'transactions': len(data['dd_transaction_number'].unique()),
    'product_sku': len(data['sku'].unique()),
    'product_categories': len(data['dbi_item_catgry'].unique()),
    'product_subcategories': len(data['dbi_item_sub_catgry'].unique()),
    'products': len(data['dbi_item_famly_name'].unique())
}

{'customers': 100,
 'stores': 1501,
 'transactions': 24348,
 'product_sku': 620,
 'product_categories': 5,
 'product_subcategories': 55,
 'products': 120}

In [49]:
data['dbi_item_catgry'].value_counts()

Beverages        30004
Food - Bakery    13786
Food AM           6181
Food PM            427
Other              268
Name: dbi_item_catgry, dtype: int64

In [50]:
data['hour'].value_counts()

Morning      29029
Lunch        14294
Afternoon     5467
Night         1876
Name: hour, dtype: int64

In [51]:
data['dbi_item_sub_catgry'].value_counts()

Hot Coffee                                 15648
Iced Coffee                                 8882
Donut Varieties                             6254
Bagels                                      3880
Muffin                                      3117
Wake Up Wraps                               2226
Hash Brown                                  1139
Iced Espresso                               1121
Iced Tea                                    1036
Frozen Beverages                             805
Hot Espresso                                 744
Cooler Beverages                             683
Bacon, Egg & Cheese                          658
Hot Tea                                      625
Egg & Cheese                                 596
Sausage, Egg & Cheese                        532
Other Hot Beverages                          451
Other Food- Bakery                           413
Egg White Flatbreads                         335
Turkey Sausage Sandwich                      275
K-Cups              

In [52]:
len(data['dbi_item_size'].value_counts())

103

In [53]:
len(data['date'].unique())

365

In [117]:
(data['post_discount_price'] == 0.0).sum()

7305

---

### I. Clustering (20)

1. Feature generation. Use examples from Seminar 6 Plan to generate features for products clustering. You may generate any number of features but you must generate at least 3 features which differ from those, proposed in the plan.

In [68]:
# average item price
data.groupby('dbi_item_famly_name')['post_discount_price'].transform('mean')

0       2.13
1       2.13
2       2.13
3       2.13
4       2.13
        ... 
51934   2.13
51935   1.31
51936   2.10
51937   2.13
51938   1.31
Name: post_discount_price, Length: 50666, dtype: float64

In [69]:
# median item price
data.groupby('dbi_item_famly_name')['post_discount_price'].transform('median')

0       2.09
1       2.09
2       2.09
3       2.09
4       2.09
        ... 
51934   2.09
51935   0.99
51936   1.99
51937   2.09
51938   0.99
Name: post_discount_price, Length: 50666, dtype: float64

In [70]:
# median quantity in a single purchase
data.groupby(['dd_transaction_number', 'dbi_item_famly_name'])['quantity'].transform('median')

0       1.00
1       1.00
2       1.00
3       1.00
4       1.00
        ... 
51934   1.00
51935   2.00
51936   1.00
51937   1.00
51938   2.00
Name: quantity, Length: 50666, dtype: float64

In [95]:
# How many different stores sell this item
data.groupby('dbi_item_famly_name')['store_number'].transform('nunique')

0        1064
1        1064
2        1064
3        1064
4        1064
         ... 
51934    1064
51935     597
51936      27
51937    1064
51938     597
Name: store_number, Length: 50666, dtype: int64

In [97]:
# How many different customers buy this item
data.groupby('dbi_item_famly_name')['dd_card_number'].transform('nunique')

0        97
1        97
2        97
3        97
4        97
         ..
51934    97
51935    99
51936    17
51937    97
51938    99
Name: dd_card_number, Length: 50666, dtype: int64

In [145]:
# Total quantity of this item sold in each time-of-day slot (`hour`)
data.groupby(['dbi_item_famly_name', 'hour'])['quantity'].transform('sum')

0        4368
1        4368
2        9898
3        9898
4        9898
         ... 
51934    9898
51935    4064
51936      30
51937     377
51938     200
Name: quantity, Length: 50666, dtype: int64

In [115]:
# Ratio of purchases within a day for a store
data.groupby(['dbi_item_famly_name', 'date', 'store_number'])['dd_transaction_number'].transform('size') / \
data.groupby(['dbi_item_famly_name', 'date'])['dd_transaction_number'].transform('size')

0       0.05
1       0.05
2       0.07
3       0.07
4       0.07
        ... 
51934   0.03
51935   0.07
51936   1.00
51937   0.03
51938   0.08
Name: dd_transaction_number, Length: 50666, dtype: float64

In [131]:
# Number of other items in the same category
data.groupby('dbi_item_catgry')['dbi_item_famly_name'].transform('nunique') - 1 
# subtract the current item from the total count

0        33
1        33
2        33
3        33
4        33
         ..
51934    33
51935    25
51936    33
51937    33
51938    25
Name: dbi_item_famly_name, Length: 50666, dtype: int64

In [137]:
# Number of other items in the same subcategory
data.groupby('dbi_item_sub_catgry')['dbi_item_famly_name'].transform('nunique') - 1 

0        2
1        2
2        2
3        2
4        2
        ..
51934    2
51935    4
51936    5
51937    2
51938    4
Name: dbi_item_famly_name, Length: 50666, dtype: int64

In [138]:
# Average price of items in the same category
data.groupby('dbi_item_catgry')['post_discount_price'].transform('mean')

0       2.16
1       2.16
2       2.16
3       2.16
4       2.16
        ... 
51934   2.16
51935   1.23
51936   2.16
51937   2.16
51938   1.23
Name: post_discount_price, Length: 50666, dtype: float64

In [139]:
# Average price of items in the same subcategory
data.groupby('dbi_item_sub_catgry')['post_discount_price'].transform('mean')

0       2.11
1       2.11
2       2.11
3       2.11
4       2.11
        ... 
51934   2.11
51935   1.52
51936   1.92
51937   2.11
51938   1.52
Name: post_discount_price, Length: 50666, dtype: float64

In [102]:
# Not implemented:
# - Ratio of purchases within a week for a customer (averaged over all customers)
# - Number of purchases on each day of the week
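
The cells above use `transform`, which returns one value per transaction row, whereas clustering needs one row per product. A minimal sketch of collapsing such features to product level, and of the weekday feature listed as not implemented, might look like this. The `toy` frame is hypothetical stand-in data, the names `product_features` and `weekday_units` are illustrative, and the month/day/year date format is an assumption based on the sample rows.

```python
import pandas as pd

# Toy stand-in for `data` (hypothetical rows; column names follow the notebook).
toy = pd.DataFrame({
    'dd_card_number': [0, 0, 1, 1, 2],
    'dbi_item_famly_name': ['Hot Coffee', 'Donut', 'Hot Coffee', 'Hot Coffee', 'Donut'],
    'quantity': [1, 2, 1, 1, 3],
    'post_discount_price': [2.29, 0.99, 2.49, 0.0, 1.98],
    'date': ['9/2/2015', '9/2/2015', '9/4/2015', '9/5/2015', '9/6/2015'],
})

# Assumption: dates are month/day/year. Derive the weekday name from them.
toy['weekday'] = pd.to_datetime(toy['date'], format='%m/%d/%Y').dt.day_name()

# One row per product: `agg` instead of `transform`.
product_features = toy.groupby('dbi_item_famly_name').agg(
    mean_price=('post_discount_price', 'mean'),
    median_price=('post_discount_price', 'median'),
    n_customers=('dd_card_number', 'nunique'),
)

# Units sold per weekday, one column per weekday that occurs in the data.
weekday_units = toy.pivot_table(
    index='dbi_item_famly_name', columns='weekday',
    values='quantity', aggfunc='sum', fill_value=0,
)

product_features = product_features.join(weekday_units.add_prefix('units_'))
print(product_features)
```

The weekly purchase ratio per customer is not covered here, since its exact definition comes from the seminar plan.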

2. Cluster all products into distinct groups (clusters). You may use any clustering algorithm you want. If you use distance-based clustering (e.g. k-means), do not forget to preprocess your features (normalization, z-scoring or standard scaling). Try different numbers of groups (e.g. from 5 to 30).
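
One possible way to approach this step is sketched below, assuming scikit-learn is available (the assignment does not prescribe a library). The feature table is synthetic and the column names `mean_price` and `total_units` are hypothetical. The loop uses a small range of `k` only because the toy table has 30 rows; on the real product table the range from the task (5 to 30) would apply.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic product-level features (hypothetical values): three loose groups.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 100.0], [1.0, 1000.0], [4.0, 10.0]])
X = np.vstack([c + rng.normal(scale=[0.1, 5.0], size=(10, 2)) for c in centers])
features = pd.DataFrame(X, columns=['mean_price', 'total_units'])

# Standard scaling: without it `total_units` would dominate Euclidean distance.
X_scaled = StandardScaler().fit_transform(features)

# Fit k-means for several k and record the inertia for an elbow plot.
inertia = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = model.inertia_

for k, value in inertia.items():
    print(k, round(value, 3))
```

Plotting `inertia` against `k` gives the "elbow" curve; the labels of a chosen model are available as `model.labels_`.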

3. Write a report. In your report you should present the following information:
   - Put an example screenshot of your features.
   - Explain every single feature you use, in the same way I explain them in the plan (you may skip features from the seminar plan).
   - Cluster information: how many clusters you have and how many objects are in each cluster.
   - Cluster interpretation. Try to provide an interpretation of every single cluster (or group of clusters) you end up with. For example: “Cluster 1 includes hot drinks and beverages often bought in a combination in the morning.”
   - You may include any visualization you find necessary, e.g.: colored PCA components, histograms or pie charts of cluster sizes, “elbows” used for selecting the number of clusters (if you used them).
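
For the cluster information and interpretation items, a small sketch of the summaries that could go into the report is given below. The table, the feature `morning_share` and the cluster labels are all hypothetical; in practice the labels would come from the fitted clustering model.

```python
import pandas as pd

# Hypothetical product table with cluster labels already assigned.
clustered = pd.DataFrame({
    'product': ['Hot Coffee', 'Hot Tea', 'Iced Coffee', 'Bagel', 'Muffin'],
    'mean_price': [2.1, 1.9, 2.5, 1.3, 1.5],
    'morning_share': [0.7, 0.6, 0.4, 0.8, 0.5],
    'cluster': [0, 0, 1, 2, 2],
}).set_index('product')

# How many objects fall into each cluster.
cluster_sizes = clustered['cluster'].value_counts().sort_index()

# Mean feature values per cluster: a starting point for interpretation.
cluster_profile = clustered.groupby('cluster').mean()

# Which products belong to each cluster.
members = {c: list(g.index) for c, g in clustered.groupby('cluster')}

print(cluster_sizes)
print(cluster_profile)
print(members)
```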


### II. Recommender system (25)

1. Prepare user-item data as it was done during the seminar: User, Item, Score. You may construct Score (e.g. see seminar) any way you want, but you must explain it in your report.  
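
As an illustration of the User, Item, Score layout, the sketch below uses `log1p` of the total purchased quantity as the score. This scoring rule is only an assumption chosen for the example, not the one from the seminar, and the `toy` rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy transactions (hypothetical rows; column names follow the notebook).
toy = pd.DataFrame({
    'dd_card_number': [0, 0, 0, 1, 1],
    'dbi_item_famly_name': ['Hot Coffee', 'Hot Coffee', 'Donut', 'Hot Coffee', 'Bagel'],
    'quantity': [1, 2, 1, 1, 4],
})

# Total quantity bought per (user, item) pair.
ratings = (
    toy.groupby(['dd_card_number', 'dbi_item_famly_name'], as_index=False)['quantity']
       .sum()
       .rename(columns={'dd_card_number': 'user', 'dbi_item_famly_name': 'item'})
)

# Assumed score: log(1 + quantity), which dampens very frequent purchases.
ratings['score'] = np.log1p(ratings['quantity'])
ratings = ratings[['user', 'item', 'score']]
print(ratings)
```

The log transform keeps a daily coffee habit from overwhelming every other item for that customer; a raw count or a binary "bought at least once" flag would be equally valid choices to justify in the report.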

2. Split your data into train and test sets (as Leonid explained during the lecture): some of the user-item pairs go to the train set and some to the test set.
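
A minimal sketch of a pair-level split is shown below. It holds out a fixed fraction of each user's pairs, which is one common reading of this step and not necessarily the exact procedure from the lecture. The `ratings` rows are hypothetical, and `DataFrameGroupBy.sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical User, Item, Score table.
ratings = pd.DataFrame({
    'user': [0, 0, 0, 0, 1, 1, 1, 1],
    'item': ['a', 'b', 'c', 'd', 'a', 'b', 'e', 'f'],
    'score': [3.0, 1.0, 2.0, 1.0, 1.0, 4.0, 2.0, 1.0],
})

# Hold out 25% of every user's pairs, so each user appears in both sets.
test = ratings.groupby('user', group_keys=False).sample(frac=0.25, random_state=0)
train = ratings.drop(test.index)

print(len(train), len(test))
```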


3. Build a recommender system using cluster groups (if you have about 20-40 clusters) or item subcategories (the 75 most frequent values of the `dbi_item_famly_name` attribute) as items and `dd_card_number` as users. You may want to experiment with the number of neighbours in your KNN recommender model.
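
To make the role of the number of neighbours concrete, here is a small user-based KNN sketch written directly in NumPy with cosine similarity. It is an illustrative stand-in, not the seminar's model; the `train` rows and the helper `recommend` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training pairs.
train = pd.DataFrame({
    'user': [0, 0, 1, 1, 1, 2, 2],
    'item': ['coffee', 'donut', 'coffee', 'donut', 'bagel', 'tea', 'muffin'],
    'score': [3.0, 1.0, 2.0, 1.0, 2.0, 3.0, 1.0],
})

# User x item matrix, zero where the user never bought the item.
matrix = train.pivot_table(index='user', columns='item', values='score', fill_value=0.0)
R = matrix.to_numpy(dtype=float)

# Cosine similarity between users; a user is not its own neighbour.
norms = np.linalg.norm(R, axis=1, keepdims=True)
unit = R / np.where(norms == 0, 1.0, norms)
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)

def recommend(user, k=1, n=2):
    """Up to n unseen items for `user`, scored by its k most similar users."""
    u = matrix.index.get_loc(user)
    neighbours = np.argsort(sim[u])[::-1][:k]
    scores = sim[u, neighbours] @ R[neighbours]   # similarity-weighted scores
    scores[R[u] > 0] = -np.inf                    # hide items already bought
    top = np.argsort(scores)[::-1][:n]
    return [matrix.columns[i] for i in top if scores[i] > 0]

print(recommend(0))
print(recommend(2))
```

User 2 shares no items with anyone, so all similarities are zero and nothing is recommended; raising `k` only helps when additional neighbours have non-zero similarity.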

4. Compute 3 different recommender performance scores, which were explained during the lecture or seminar to assess the quality of your recommendations (use appropriate metrics).
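
Which metrics count as appropriate depends on the lecture material. As an assumption, the sketch below implements three widely used top-k metrics (precision@k, recall@k and hit rate) on hypothetical recommendations and held-out items.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are in the held-out set."""
    if k <= 0:
        return 0.0
    return sum(item in relevant for item in recommended[:k]) / k

def recall_at_k(recommended, relevant, k):
    """Share of the held-out items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in recommended[:k]) / len(relevant)

def hit_rate(recs_by_user, test_by_user, k):
    """Share of users with at least one held-out item in their top-k."""
    hits = [
        any(item in test_by_user.get(user, set()) for item in recs[:k])
        for user, recs in recs_by_user.items()
    ]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical recommendations and held-out (test) items per user.
recs = {0: ['bagel', 'tea', 'muffin'], 1: ['tea', 'muffin', 'donut']}
held_out = {0: {'bagel', 'muffin'}, 1: {'coffee'}}

print(precision_at_k(recs[0], held_out[0], k=2))
print(recall_at_k(recs[0], held_out[0], k=2))
print(hit_rate(recs, held_out, k=2))
```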

5. Write a report. In your report you should present the following information:
   - Report computed performance scores.
   - Elaborate on the quality of your recommendations.
   - Provide 3-5 examples of `good` recommendations suggested by your recommender system.
   - Provide 3-5 examples of `bad` recommendations suggested by your recommender system.
   - You may report any additional information you find potentially useful to assess the quality of your recommendations: e.g. for a couple of customers, compute the price of their average purchase (or of an item in a purchase) and compare it with the average price of the recommended items.
   - You may use any visualisations you find useful.
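
The price comparison suggested above could be sketched as follows. The transactions and the `recommended` mapping are hypothetical, and "average price of an item" is taken here to be the mean `post_discount_price` over all rows for that item.

```python
import pandas as pd

# Toy transactions (hypothetical rows; column names follow the notebook).
toy = pd.DataFrame({
    'dd_card_number': [0, 0, 0, 1, 1],
    'dbi_item_famly_name': ['Hot Coffee', 'Donut', 'Hot Coffee', 'Bagel', 'Hot Tea'],
    'post_discount_price': [2.0, 1.0, 3.0, 1.5, 2.5],
})

# Hypothetical recommender output: user -> recommended items.
recommended = {0: ['Bagel', 'Hot Tea'], 1: ['Hot Coffee']}

# Average price per item and average price paid per customer.
item_price = toy.groupby('dbi_item_famly_name')['post_discount_price'].mean()
user_price = toy.groupby('dd_card_number')['post_discount_price'].mean()

comparison = pd.DataFrame({
    'avg_purchased_price': user_price,
    'avg_recommended_price': pd.Series(
        {user: item_price.reindex(items).mean() for user, items in recommended.items()}
    ),
})
comparison['difference'] = (
    comparison['avg_recommended_price'] - comparison['avg_purchased_price']
)
print(comparison)
```

A large positive or negative `difference` for a customer may indicate recommendations that sit outside that customer's usual price range.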
