<img src='../img/logo.png' alt='DS Market logo' height='150px'>

# Clustering

## Table of Contents

* [A. Introduction](#introduction)
* [B. Importing Libraries](#libraries)
* [C. Importing Data](#data)
* [D. Product Clustering](#product_clustering)
* [E. Store Clustering](#store_clustering)

## A. Introduction <a class="anchor" id="introduction"></a>

The goal of this notebook is to identify groups of products that behave in a similar way, so that DS Market can evaluate the performance of its campaigns. We will report those groups and how many of them best describe the data.

Additionally, we will check how similar the stores are to one another, to see whether grouping them makes sense.

## B. Importing Libraries <a class="anchor" id="libraries"></a>

In [67]:
# system and path management
import sys
sys.path.append('../scripts') # including helper functions inside the scripts folder

# removing system warnings
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# modelling
from sklearn.cluster import KMeans

# plotting
import matplotlib.pyplot as plt
import plotly.express as px

# plotting options
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (10, 7)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.2f}'.format

# helper functions
import file_management
import clustering

## C. Importing Data <a class="anchor" id="data"></a>

In [68]:
# downloading the processed data files from gdrive
directory = '../data/processed/'
urls = [
    {'filename': 'sales_processed.csv', 'url': 'https://drive.google.com/file/d/1JdeAgraKcaFQJrjG2HPVb5D0VD0iTlNB/view?usp=sharing'},
    {'filename': 'prices_processed.csv', 'url': 'https://drive.google.com/file/d/1pSEJAQfAU-owDjKmxcPrxf3CpGFivwa6/view?usp=sharing'},
    {'filename': 'calendar_processed.csv', 'url': 'https://drive.google.com/file/d/1Lnji96iBkTpFiWo-QXeW3TvESiNYWCML/view?usp=sharing'}
]
        
file_management.download_files_from_url(urls, directory)

sales = pd.read_csv(directory + 'sales_processed.csv', index_col = 0)
prices = pd.read_csv(directory + 'prices_processed.csv', index_col = 0)
calendar = pd.read_csv(directory + 'calendar_processed.csv', index_col = 0)

sales_processed.csv file already exists in ../data/processed/
prices_processed.csv file already exists in ../data/processed/
calendar_processed.csv file already exists in ../data/processed/
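
The `file_management.download_files_from_url` helper lives in `../scripts` and is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical functions below extract the file id from a Google Drive sharing link and skip files that are already on disk. The `uc?export=download&id=...` direct-download form is an assumption here, not something the notebook confirms, and the function names are made up for illustration.

```python
import os
import re
import urllib.request

def drive_direct_url(share_url):
    """Turn a Drive 'file/d/<id>/view' sharing link into a direct-download link.

    Assumption: the 'uc?export=download&id=<id>' form, which may not work
    for large files that trigger Drive's confirmation page.
    """
    match = re.search(r'/file/d/([^/]+)', share_url)
    if match is None:
        raise ValueError(f'no file id found in {share_url!r}')
    return f'https://drive.google.com/uc?export=download&id={match.group(1)}'

def download_if_missing(files, directory):
    """Hypothetical stand-in for file_management.download_files_from_url."""
    os.makedirs(directory, exist_ok=True)
    for entry in files:
        path = os.path.join(directory, entry['filename'])
        if os.path.exists(path):
            # same message as the cell output above; nothing is downloaded
            print(f"{entry['filename']} file already exists in {directory}")
            continue
        urllib.request.urlretrieve(drive_direct_url(entry['url']), path)
```

Skipping existing files keeps the cell cheap to re-run, which matches the "already exists" messages printed above.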


In [69]:
# downloading the feature files from gdrive
directory = '../data/features/'
urls = [
    {'filename': 'sales_by_date.csv', 'url': 'https://drive.google.com/file/d/1JMy2pJUp7DscjnY3_vhCNM7NZk9Th4i9/view?usp=sharing'},
    {'filename': 'sales_by_date_store.csv', 'url': 'https://drive.google.com/file/d/17Na9Eyj_NUGt9Uial1Oepwn8neUTXMmp/view?usp=sharing'},
    {'filename': 'sales_by_date_city.csv', 'url': 'https://drive.google.com/file/d/1Psykw5DZ7JfQkHYcajVW2ZlcaYoj2rmd/view?usp=sharing'},
    {'filename': 'sales_by_product.csv', 'url': 'https://drive.google.com/file/d/1bayt13OJ8NFjfYkscPac6J2HiuyrSqEK/view?usp=sharing'},
    {'filename': 'sales_by_store.csv', 'url': 'https://drive.google.com/file/d/1XC1znbqRNCDSWLkl22HHXgxs3dzFVUVU/view?usp=sharing'}
]

file_management.download_files_from_url(urls, directory)

sales_by_date = pd.read_csv(directory + 'sales_by_date.csv', index_col = 0)
sales_by_date_store = pd.read_csv(directory + 'sales_by_date_store.csv', index_col = 0)
sales_by_date_city = pd.read_csv(directory + 'sales_by_date_city.csv', index_col = 0)
sales_by_product = pd.read_csv(directory + 'sales_by_product.csv', index_col = 0)
sales_by_store = pd.read_csv(directory + 'sales_by_store.csv', index_col = 0)

sales_by_date.csv file already exists in ../data/features/
sales_by_date_store.csv file already exists in ../data/features/
sales_by_date_city.csv file already exists in ../data/features/
sales_by_product.csv file already exists in ../data/features/
sales_by_store.csv file already exists in ../data/features/


## D. Product Clustering <a class="anchor" id="product_clustering"></a>

Let's explore whether our products behave in similar ways. A first approach is to compare their average number of sales against their average price and its variation (standard deviation).

In [50]:
sales_by_product

Unnamed: 0,num_sales,num_sales.1,num_sales.2,sell_price,sell_price.1,sell_price.2
,sum,mean,std,sum,mean,std
item,,,,,,
ACCESORIES_1_001,4093,0.21395713538944067,0.5760331412094313,219344.5842,11.466000219550445,0.7342924677908943
ACCESORIES_1_002,5059,0.2644537375849451,0.5939880615111044,100941.36080000002,5.276600146366964,0.09235967885724389
ACCESORIES_1_003,1435,0.07501306847882906,0.3231332379741486,75518.0251,3.947622848928385,0.12810647698491953
...,...,...,...,...,...,...
SUPERMARKET_3_823,15388,0.8043910088865656,1.714319173069961,63984.096,3.34469921589127,0.22665860179437608
SUPERMARKET_3_824,8325,0.4351803450078411,0.9471949620798986,57897.0,3.026502875065342,0.25630606694953434
SUPERMARKET_3_825,13526,0.7070569785676947,1.2012319487830874,94370.36399999999,4.9331084161003655,0.24141176783901494
SUPERMARKET_3_826,12188,0.6371144798745426,1.2473999037623729,29381.328,1.5358770517511762,0.0064263250869368805


In [51]:
# preparing the dataframe to use for this analysis
# the CSV was written with a two-level column header, so after read_csv the first
# two rows hold the statistic names and the index name; iloc[2:] skips them
df = pd.DataFrame({
    'avg_sell_price' : sales_by_product.iloc[2:, 4],   # sell_price mean
    'avg_num_sales': sales_by_product.iloc[2:, 1],     # num_sales mean
    'std_sell_price': sales_by_product.iloc[2:, -1]    # sell_price std
})

# the values were read as strings because of the extra header rows
df = df.astype('float')

df

Unnamed: 0,avg_sell_price,avg_num_sales,std_sell_price
ACCESORIES_1_001,11.47,0.21,0.73
ACCESORIES_1_002,5.28,0.26,0.09
ACCESORIES_1_003,3.95,0.08,0.13
ACCESORIES_1_004,5.98,2.05,0.28
ACCESORIES_1_005,3.84,0.76,0.22
...,...,...,...
SUPERMARKET_3_823,3.34,0.80,0.23
SUPERMARKET_3_824,3.03,0.44,0.26
SUPERMARKET_3_825,4.93,0.71,0.24
SUPERMARKET_3_826,1.54,0.64,0.01
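
The positional `iloc[2:, ...]` slicing works, but it depends on the column order. An alternative, sketched here on a tiny in-memory copy of the file layout (values rounded from the table above), is to tell `pd.read_csv` that the first two rows form the header. The variable names are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for sales_by_product.csv with the same two-row header layout
csv_text = (
    ",num_sales,num_sales,num_sales,sell_price,sell_price,sell_price\n"
    ",sum,mean,std,sum,mean,std\n"
    "item,,,,,,\n"
    "ACCESORIES_1_001,4093,0.214,0.576,219344.58,11.466,0.734\n"
    "ACCESORIES_1_002,5059,0.264,0.594,100941.36,5.277,0.092\n"
)

# header=[0, 1] tells pandas that the first two rows form a column MultiIndex
products = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Columns are now addressed by (feature, statistic) and are already numeric
features = pd.DataFrame({
    'avg_sell_price': products[('sell_price', 'mean')],
    'avg_num_sales': products[('num_sales', 'mean')],
    'std_sell_price': products[('sell_price', 'std')],
})
print(features)
```

With this approach no `astype('float')` conversion is needed, because the header rows never end up in the data.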


In [52]:
px.scatter_3d(
    df, 
    x ='avg_sell_price', 
    y = 'avg_num_sales', 
    z = 'std_sell_price', 
    color = df.index, 
    title = 'Scatter plot  - Product prices, sales and price variations',
    height = 800,
    hover_name = df.index
)

In [53]:
# scaling data
scaler = StandardScaler()
scaled_selected_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_selected_data, columns = df.columns, index = df.index)
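
Scaling matters here because K-Means relies on Euclidean distances, so a feature with a larger numeric range would otherwise dominate. As a small check of what `StandardScaler` computes, the sketch below standardises the first three rows of the table above by hand and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three products from the table above:
# columns are (avg_sell_price, avg_num_sales, std_sell_price)
X = np.array([
    [11.47, 0.21, 0.73],
    [5.28, 0.26, 0.09],
    [3.95, 0.08, 0.13],
])

# Manual standardisation: z = (x - column mean) / column std (population std, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True
```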

In [54]:
# elbow method to identify the right amount of clusters for K-Means
clustering.kmeans_elbow_plot(scaled_selected_data, (1, 12))

Using the elbow method, we chose 5 clusters.
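
The `clustering.kmeans_elbow_plot` helper is not shown in this notebook. A minimal sketch of the idea behind it, assuming it fits K-Means for each candidate k and plots the inertia, could look like the following. The function name `kmeans_inertias`, the half-open range and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def kmeans_inertias(data, k_range, random_state=0):
    """Hypothetical helper: fit K-Means for each k and return the inertias."""
    inertias = []
    for k in range(k_range[0], k_range[1]):
        model = KMeans(n_clusters=k, init='k-means++', n_init=10,
                       random_state=random_state)
        model.fit(data)
        # inertia_ is the within-cluster sum of squared distances to the centroids
        inertias.append(model.inertia_)
    return inertias

# Synthetic data with 4 well-separated groups, for illustration only
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
inertias = kmeans_inertias(data, (1, 8))

for k, inertia in zip(range(1, 8), inertias):
    print(k, round(inertia, 1))
```

The "elbow" is the k after which the inertia stops dropping sharply. Picking it from a plot is a judgment call, so nearby values of k are often worth comparing.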

In [55]:
n_clusters = 5
kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++')

# fit K-Means on the scaled features and store each product's cluster label
df['cluster'] = clustering.get_clusters(scaled_df, kmeans)

5 different clusters have been generated
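
K-Means starts from a random initialisation, so the cluster numbers can change between runs; the numbers discussed in this notebook refer to the run shown here. The sketch below is a hypothetical version of `clustering.get_clusters` (the real helper is not shown) together with a fixed `random_state`, which is an addition and not part of the original cell.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def get_clusters(scaled_df, model):
    """Hypothetical version of the helper: fit the model, return one label per row."""
    labels = model.fit_predict(scaled_df)
    print(f'{len(np.unique(labels))} different clusters have been generated')
    return labels

# Illustrative, already-scaled data: two obvious groups
scaled_df = pd.DataFrame(
    {'a': [-1.0, -0.9, -1.1, 1.0, 0.9, 1.1],
     'b': [-1.0, -1.1, -0.9, 1.0, 1.1, 0.9]},
    index=list('ABCDEF'),
)

# Fixing random_state makes the label numbering repeatable between runs
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = get_clusters(scaled_df, kmeans)
```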


In [56]:
px.scatter_3d(
        df, 
        x ='avg_sell_price', 
        y = 'avg_num_sales', 
        z = 'std_sell_price', 
        color = 'cluster', 
        title = 'Clustered Products',
        height = 800,
        hover_name = df.index
    )

In [57]:
# check how many items are in each cluster
df.cluster.value_counts(normalize = True) * 100

2   69.01
0   22.50
4    4.69
1    3.67
3    0.13
Name: cluster, dtype: float64

In [58]:
df['category'] = [item.split('_')[0] for item in df.index]
cluster_category_df = (df.groupby(['cluster', 'category']).agg('count')).iloc[:, [0]]
cluster_category_df.columns = ['num_items']
cluster_category_df

Unnamed: 0_level_0,Unnamed: 1_level_0,num_items
cluster,category,Unnamed: 2_level_1
0,ACCESORIES,170
0,HOME,342
0,SUPERMARKET,174
1,ACCESORIES,43
1,HOME,44
1,SUPERMARKET,25
2,ACCESORIES,342
2,HOME,638
2,SUPERMARKET,1124
3,SUPERMARKET,4


The K-Means output gives us 5 types of products:
- Cluster 2: low price, low average sales per day and low price variation (69% of our products, a little over half of them Supermarket products)
- Cluster 4: prices and price variation similar to cluster 2, but selling better (4.69%, mainly Supermarket products)
- Cluster 0: a higher sale price and more price variability, but a low number of sales per day (22.50%, about half of them Home & Garden products)
- Cluster 1: a higher price variation and a wide range of prices, from fairly cheap to expensive, but not selling much (3.67%, spread fairly evenly between Accessories and Home & Garden, with fewer Supermarket items)
- Cluster 3: 4 products with a much higher number of sales, a moderate price and little price variation (0.13%, all of them Supermarket items)
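
Descriptions like the ones above can be read off the per-cluster feature means. A minimal sketch, using made-up rows in place of the clustered `df` (the values and the `example` name are illustrative only):

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the clustered `df`
example = pd.DataFrame({
    'avg_sell_price': [3.0, 3.2, 12.0, 11.0, 2.5],
    'avg_num_sales': [0.3, 0.4, 0.2, 0.3, 9.0],
    'std_sell_price': [0.1, 0.2, 0.9, 0.8, 0.1],
    'cluster': [2, 2, 0, 0, 3],
})

# Mean of each feature per cluster, plus the number of products in it
profile = example.groupby('cluster').agg(
    avg_sell_price=('avg_sell_price', 'mean'),
    avg_num_sales=('avg_num_sales', 'mean'),
    std_sell_price=('std_sell_price', 'mean'),
    num_items=('avg_sell_price', 'size'),
)
print(profile)
```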

The high share of Supermarket items is expected, since DS Market has more Supermarket items than items in either of the other 2 categories.
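
To separate the effect of category size from cluster membership, one option is to look at the share of each category that falls in each cluster. The sketch below uses the counts from the table above; cluster 4 is not listed in that output, so the shares are relative to the listed rows only.

```python
import pandas as pd

# Counts copied from the table above (cluster 4 is not listed there)
counts = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
    'category': ['ACCESORIES', 'HOME', 'SUPERMARKET'] * 3 + ['SUPERMARKET'],
    'num_items': [170, 342, 174, 43, 44, 25, 342, 638, 1124, 4],
})

# One row per cluster, one column per category; missing combinations become 0
table = counts.pivot(index='cluster', columns='category', values='num_items').fillna(0)

# Share of each category's listed items that falls in each cluster (columns sum to 100)
share = table / table.sum(axis=0) * 100
print(share.round(1))
```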

Side note: in this approach we decided not to remove the outliers, as we believe doing so would not change the resulting clusters much.

## E. Store Clustering <a class="anchor" id="store_clustering"></a>

Let's check whether we can also find similarities among the stores themselves. We only have 10 stores, so the results may not be very interesting, but it is worth trying.

In [70]:
sales_by_store

Unnamed: 0,num_sales,num_sales.1,num_sales.2,num_sales.3,sell_price,sell_price.1,sell_price.2,sell_price.3,total_income,total_income.1,total_income.2,total_income.3
,count,sum,mean,std,count,sum,mean,std,count,sum,mean,std
store,,,,,,,,,,,,
BOS_1,5832737,5595292,0.9592909812323099,3.3272319411412368,5832737,32235235.3739,5.526605326778835,4.519343803206978,5832737,19340893.8249,3.3159207803986366,9.607334985720026
BOS_2,5832737,7214384,1.2368779871267983,4.421267271631793,5832737,32179182.3277,5.5169952507202025,4.495633960159359,5832737,25266780.371,4.331890906618968,12.708695667383259
BOS_3,5832737,6089330,1.0439918686544585,3.796400072688688,5832737,32329719.5246,5.542804265750367,4.550445562125641,5832737,21946513.5513,3.762644115669882,13.990806354938323
NYC_1,5832737,7698216,1.3198290956715517,4.058652095858902,5832737,32525656.9327,5.5763969698445175,4.554146834163076,5832737,27735269.8694,4.755103799365546,13.05824513473278
NYC_2,5832737,5685475,0.974752504698909,2.7596788381729387,5832737,32554083.3411,5.581270566648214,4.55813576772769,5832737,21507127.3084,3.6873130587578355,9.260088590510295
NYC_3,5832737,11188180,1.9181698060447436,6.208485547943539,5832737,32334050.1794,5.543546739618124,4.55123660549577,5832737,39492258.6086,6.770793644321697,18.112460688347337
NYC_4,5832737,4103676,0.7035592381415449,2.0042745918603546,5832737,32523875.0424,5.576091471705308,4.554585209022432,5832737,15046818.8427,2.5797183796732135,6.819577584776189
PHI_1,5832737,5149062,0.8827865888689992,2.424396150984676,5832737,32599974.017300002,5.589138344022712,4.556099175980534,5832737,18235243.7722,3.1263613929789735,7.708491142009216


In [72]:
# preparing the dataframe to use for this analysis
# taking the features that differ most between stores
# (iloc[2:] skips the two extra header rows left over from the two-level CSV header)
df = pd.DataFrame({
    'income_sum': sales_by_store.iloc[2:, -3],      # total_income sum
    'income_std': sales_by_store.iloc[2:, -1],      # total_income std
    'num_sales_sum': sales_by_store.iloc[2:, 1]     # num_sales sum
})

# the values were read as strings because of the extra header rows
df = df.astype('float')

df

Unnamed: 0,income_sum,income_std,num_sales_sum
BOS_1,19340893.82,9.61,5595292.0
BOS_2,25266780.37,12.71,7214384.0
BOS_3,21946513.55,13.99,6089330.0
NYC_1,27735269.87,13.06,7698216.0
NYC_2,21507127.31,9.26,5685475.0
NYC_3,39492258.61,18.11,11188180.0
NYC_4,15046818.84,6.82,4103676.0
PHI_1,18235243.77,7.71,5149062.0
PHI_2,21658283.67,11.57,6544012.0
PHI_3,20752293.45,10.67,6427782.0


In [75]:
px.scatter_3d(
    df, 
    x ='income_sum', 
    y = 'income_std', 
    z = 'num_sales_sum', 
    color = df.index, 
    title = 'Scatter plot  - Store incomes and num of sales',
    height = 600,
    hover_name = df.index
)

Even though we are comparing income (sum and standard deviation) and the total number of products sold, this is almost a one-dimensional problem: the three features are strongly correlated, so the stores lie close to a straight line.
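
One way to quantify how close the stores are to a straight line is to look at the correlations between the features and at how much variance a single principal component captures. PCA is not used elsewhere in this notebook; it is brought in here only as a check, using the store values from the table above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Store features copied from the table above
stores = pd.DataFrame(
    {
        'income_sum': [19340893.82, 25266780.37, 21946513.55, 27735269.87, 21507127.31,
                       39492258.61, 15046818.84, 18235243.77, 21658283.67, 20752293.45],
        'income_std': [9.61, 12.71, 13.99, 13.06, 9.26, 18.11, 6.82, 7.71, 11.57, 10.67],
        'num_sales_sum': [5595292.0, 7214384.0, 6089330.0, 7698216.0, 5685475.0,
                          11188180.0, 4103676.0, 5149062.0, 6544012.0, 6427782.0],
    },
    index=['BOS_1', 'BOS_2', 'BOS_3', 'NYC_1', 'NYC_2',
           'NYC_3', 'NYC_4', 'PHI_1', 'PHI_2', 'PHI_3'],
)

# Pairwise Pearson correlations between the three features
print(stores.corr().round(2))

# Share of the standardised variance captured by each principal component
pca = PCA().fit(StandardScaler().fit_transform(stores))
print(pca.explained_variance_ratio_.round(3))
```

If the first component captures most of the variance, the three features carry largely the same information, and the clusters mostly reflect store size.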

Let's see which clusters we can get.

In [77]:
# scaling data
scaler = StandardScaler()
scaled_selected_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_selected_data, columns = df.columns, index = df.index)

In [78]:
# elbow method to identify the right amount of clusters for K-Means
clustering.kmeans_elbow_plot(scaled_selected_data, (1, 10))

In [82]:
# following the elbow method, we will use 3 clusters
n_clusters = 3
kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++')

# fit K-Means on the scaled features and store each store's cluster label
df['cluster'] = clustering.get_clusters(scaled_df, kmeans)

px.scatter_3d(
    df, 
    x ='income_sum', 
    y = 'income_std', 
    z = 'num_sales_sum', 
    color = 'cluster', 
    title = 'Clustered Stores',
    height = 600,
    hover_name = df.index
)

3 different clusters have been generated


In [84]:
df.sort_values('cluster')

Unnamed: 0,income_sum,income_std,num_sales_sum,cluster
BOS_2,25266780.37,12.71,7214384.0,0
BOS_3,21946513.55,13.99,6089330.0,0
NYC_1,27735269.87,13.06,7698216.0,0
PHI_2,21658283.67,11.57,6544012.0,0
PHI_3,20752293.45,10.67,6427782.0,0
BOS_1,19340893.82,9.61,5595292.0,1
NYC_2,21507127.31,9.26,5685475.0,1
NYC_4,15046818.84,6.82,4103676.0,1
PHI_1,18235243.77,7.71,5149062.0,1
NYC_3,39492258.61,18.11,11188180.0,2


As we can see, the stores can be grouped into 3 clusters with the following characteristics:
- Cluster 1 (`BOS_1`, `NYC_2`, `NYC_4`, `PHI_1`): the lowest total income and number of sales, and a small standard deviation of income.
- Cluster 0 (`BOS_2`, `BOS_3`, `NYC_1`, `PHI_2`, `PHI_3`): a higher total income, more variation in income and a slightly higher number of products sold.
- Cluster 2 (`NYC_3`): the strongest store by far. It has the highest number of sales and total income, and also the highest standard deviation of income. Part of its success may come from varying item prices more than the other stores do, although the features used here do not measure that directly.
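
The cluster characteristics can be summarised numerically from the sorted table above. The sketch below also adds an `income_per_sale` column (a name introduced here for illustration) as a rough proxy for each store's price level, which helps when judging the pricing hypothesis for `NYC_3`.

```python
import pandas as pd

# Values copied from the sorted table above
stores = pd.DataFrame(
    {
        'income_sum': [25266780.37, 21946513.55, 27735269.87, 21658283.67, 20752293.45,
                       19340893.82, 21507127.31, 15046818.84, 18235243.77, 39492258.61],
        'income_std': [12.71, 13.99, 13.06, 11.57, 10.67, 9.61, 9.26, 6.82, 7.71, 18.11],
        'num_sales_sum': [7214384.0, 6089330.0, 7698216.0, 6544012.0, 6427782.0,
                          5595292.0, 5685475.0, 4103676.0, 5149062.0, 11188180.0],
        'cluster': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2],
    },
    index=['BOS_2', 'BOS_3', 'NYC_1', 'PHI_2', 'PHI_3',
           'BOS_1', 'NYC_2', 'NYC_4', 'PHI_1', 'NYC_3'],
)

# Average income per unit sold: a rough proxy for the price level of each store
stores['income_per_sale'] = stores['income_sum'] / stores['num_sales_sum']

# Mean of every feature per cluster
summary = stores.groupby('cluster').mean()
print(summary.round(2))
```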