<img src='../img/logo.png' alt='DS Market logo' height='150px'>

# Clustering

## Table of Contents

* [A. Introduction](#introduction)
* [B. Importing Libraries](#libraries)
* [C. Importing Data](#data)
* [D. Product Clustering](#product_clustering)
* [E. Store Clustering](#store_clustering)

## A. Introduction <a class="anchor" id="introduction"></a>

The goal of this notebook is to identify groups of products that behave in a similar way, so DS Market can evaluate the performance of its different campaigns. We will provide those groups and the ideal number of groups.

Additionally, we will try to assess how similar the stores are to one another, if that analysis turns out to make sense.

## B. Importing Libraries <a class="anchor" id="libraries"></a>

In [14]:
# system and path management
import sys
sys.path.append('../scripts') # including helper functions inside the scripts folder

# removing system warnings
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# modelling
from sklearn.cluster import KMeans

# plotting
import matplotlib.pyplot as plt
import plotly.express as px

# plotting options
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (10, 7)

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.2f}'.format

# helper functions
import file_management
import clustering

## C. Importing Data <a class="anchor" id="data"></a>

In [3]:
# downloading the processed data files from gdrive
directory = '../data/processed/'
urls = [
    {'filename': 'sales_processed.csv', 'url': 'https://drive.google.com/file/d/1JdeAgraKcaFQJrjG2HPVb5D0VD0iTlNB/view?usp=sharing'},
    {'filename': 'prices_processed.csv', 'url': 'https://drive.google.com/file/d/1pSEJAQfAU-owDjKmxcPrxf3CpGFivwa6/view?usp=sharing'},
    {'filename': 'calendar_processed.csv', 'url': 'https://drive.google.com/file/d/1Lnji96iBkTpFiWo-QXeW3TvESiNYWCML/view?usp=sharing'}
]
        
file_management.download_files_from_url(urls, directory)

sales = pd.read_csv(directory + 'sales_processed.csv', index_col = 0)
prices = pd.read_csv(directory + 'prices_processed.csv', index_col = 0)
calendar = pd.read_csv(directory + 'calendar_processed.csv', index_col = 0)

sales_processed.csv file already exists in ../data/processed/
prices_processed.csv file already exists in ../data/processed/
calendar_processed.csv file already exists in ../data/processed/


In [4]:
# downloading the feature files from gdrive
directory = '../data/features/'
urls = [
    {'filename': 'sales_by_date.csv', 'url': 'https://drive.google.com/file/d/1JMy2pJUp7DscjnY3_vhCNM7NZk9Th4i9/view?usp=sharing'},
    {'filename': 'sales_by_date_store.csv', 'url': 'https://drive.google.com/file/d/17Na9Eyj_NUGt9Uial1Oepwn8neUTXMmp/view?usp=sharing'},
    {'filename': 'sales_by_date_city.csv', 'url': 'https://drive.google.com/file/d/1Psykw5DZ7JfQkHYcajVW2ZlcaYoj2rmd/view?usp=sharing'},
    {'filename': 'sales_by_product.csv', 'url': 'https://drive.google.com/file/d/1bayt13OJ8NFjfYkscPac6J2HiuyrSqEK/view?usp=sharing'}
]

file_management.download_files_from_url(urls, directory)

sales_by_date = pd.read_csv(directory + 'sales_by_date.csv', index_col = 0)
sales_by_date_store = pd.read_csv(directory + 'sales_by_date_store.csv', index_col = 0)
sales_by_date_city = pd.read_csv(directory + 'sales_by_date_city.csv', index_col = 0)
sales_by_product = pd.read_csv(directory + 'sales_by_product.csv', index_col = 0)

sales_by_date.csv file already exists in ../data/features/
sales_by_date_store.csv file already exists in ../data/features/
sales_by_date_city.csv file already exists in ../data/features/
sales_by_product.csv file already exists in ../data/features/


## D. Product Clustering<a class="anchor" id="product_clustering"></a>

Let's explore whether our products behave in similar ways. A first approach is to compare their average number of sales against their average price and their price variation (standard deviation).

In [5]:
sales_by_product

Unnamed: 0,num_sales,num_sales.1,num_sales.2,sell_price,sell_price.1,sell_price.2
,sum,mean,std,sum,mean,std
item,,,,,,
ACCESORIES_1_001,4093,0.21395713538944067,0.5760331412094313,219344.5842,11.466000219550445,0.7342924677908943
ACCESORIES_1_002,5059,0.2644537375849451,0.5939880615111044,100941.36080000002,5.276600146366964,0.09235967885724389
ACCESORIES_1_003,1435,0.07501306847882906,0.3231332379741486,75518.0251,3.947622848928385,0.12810647698491953
...,...,...,...,...,...,...
SUPERMARKET_3_823,15388,0.8043910088865656,1.714319173069961,63984.096,3.34469921589127,0.22665860179437608
SUPERMARKET_3_824,8325,0.4351803450078411,0.9471949620798986,57897.0,3.026502875065342,0.25630606694953434
SUPERMARKET_3_825,13526,0.7070569785676947,1.2012319487830874,94370.36399999999,4.9331084161003655,0.24141176783901494
SUPERMARKET_3_826,12188,0.6371144798745426,1.2473999037623729,29381.328,1.5358770517511762,0.0064263250869368805


In [6]:
# preparing the dataframe to use for this analysis
df = pd.DataFrame({
    'avg_sell_price' : sales_by_product.iloc[2:, 4],
    'avg_num_sales': sales_by_product.iloc[2:, 1],
    'std_sell_price': sales_by_product.iloc[2:, -1]
})

df['avg_sell_price'] = df['avg_sell_price'].astype('float')
df['avg_num_sales'] = df['avg_num_sales'].astype('float')
df['std_sell_price'] = df['std_sell_price'].astype('float')

df

Unnamed: 0,avg_sell_price,avg_num_sales,std_sell_price
ACCESORIES_1_001,11.47,0.21,0.73
ACCESORIES_1_002,5.28,0.26,0.09
ACCESORIES_1_003,3.95,0.08,0.13
ACCESORIES_1_004,5.98,2.05,0.28
ACCESORIES_1_005,3.84,0.76,0.22
...,...,...,...
SUPERMARKET_3_823,3.34,0.80,0.23
SUPERMARKET_3_824,3.03,0.44,0.26
SUPERMARKET_3_825,4.93,0.71,0.24
SUPERMARKET_3_826,1.54,0.64,0.01
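The `iloc[2:]` slicing in the cell above skips two rows that look like leftover header lines (`sum, mean, std` and `item`). A possible alternative, sketched here under the assumption that `sales_by_product.csv` was written from a grouped aggregation with two column levels, is to let pandas parse both header rows directly. The toy rows and item names below are made up for illustration.

```python
import io
import pandas as pd

# Toy daily rows; item names and numbers are hypothetical.
raw = pd.DataFrame({
    'item': ['A_1_001', 'A_1_001', 'B_1_001', 'B_1_001'],
    'num_sales': [0, 2, 1, 3],
    'sell_price': [10.0, 12.0, 5.0, 5.0],
})

# Aggregating with several functions produces two-level column labels,
# which to_csv writes as two header rows plus an index-name row.
agg = raw.groupby('item').agg({
    'num_sales': ['sum', 'mean', 'std'],
    'sell_price': ['sum', 'mean', 'std'],
})
buffer = io.StringIO()
agg.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] restores the two column levels, so the values stay numeric
# and no positional slicing or astype('float') is needed afterwards.
by_product = pd.read_csv(buffer, header=[0, 1], index_col=0)

features = pd.DataFrame({
    'avg_sell_price': by_product[('sell_price', 'mean')],
    'avg_num_sales': by_product[('num_sales', 'mean')],
    'std_sell_price': by_product[('sell_price', 'std')],
})
print(features)
```

Selecting columns by label rather than by position also keeps the feature table correct if the column order in the file ever changes.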



In [7]:
px.scatter_3d(
    df, 
    x ='avg_sell_price', 
    y = 'avg_num_sales', 
    z = 'std_sell_price', 
    color = df.index, 
    title = 'Scatter plot  - Product prices, sales and price variations',
    height = 800
)

In [8]:
# scaling data
scaler = StandardScaler()
scaled_selected_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_selected_data, columns = df.columns, index = df.index)
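Scaling matters here because K-Means relies on Euclidean distances, so a feature with a wider numeric range (such as the average price) would otherwise dominate the others. The sketch below, using the first rows of `avg_sell_price` and `avg_num_sales` shown earlier, illustrates that `StandardScaler` centres each column and divides it by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First four rows of avg_sell_price and avg_num_sales from the table above.
toy = np.array([
    [11.47, 0.21],
    [5.28, 0.26],
    [3.95, 0.08],
    [5.98, 2.05],
])

scaled = StandardScaler().fit_transform(toy)

# Equivalent manual z-score; StandardScaler uses ddof=0 for the std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```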

In [9]:
# elbow method to identify the right amount of clusters for K-Means
clustering.kmeans_elbow_plot(scaled_selected_data, (1, 12))

Based on the elbow plot, we decided to use 5 clusters.
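The `clustering.kmeans_elbow_plot` helper lives in the scripts folder and is not shown in this notebook. As a rough sketch of the elbow method it presumably implements, the code below fits K-Means for a range of `k` values and records the inertia (within-cluster sum of squared distances). The function name `elbow_inertias`, the half-open interpretation of the range, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(data, k_range):
    """Return {k: inertia} for every k in range(start, stop) (hypothetical helper)."""
    start, stop = k_range
    inertias = {}
    for k in range(start, stop):
        model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
        model.fit(data)
        inertias[k] = model.inertia_
    return inertias

# Synthetic data: three well-separated blobs in three dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 6.0], [-6.0, 6.0, 0.0]])
toy = np.vstack([center + rng.normal(size=(50, 3)) for center in centers])

inertias = elbow_inertias(toy, (1, 8))
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against `k` gives the elbow curve: the chosen `k` is the point after which adding clusters yields only small reductions in inertia.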

In [11]:
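# number of clusters chosen from the elbow plot
n_clusters = 5
kmeans = KMeans(n_clusters = n_clusters, init = 'k-means++')

styles = {
    'title': 'Clusters',
    'height': 800
}

# fitting the model and assigning a cluster label to each product
clusters = clustering.get_clusters(scaled_df, kmeans)
df['cluster'] = clusters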

5 different clusters have been generated
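The `clustering.get_clusters` helper is not shown here. A minimal sketch of what an equivalent function might look like is given below, assuming it simply fits the model and returns one label per row. The name `get_clusters_sketch` and the synthetic data are hypothetical. The sketch also fixes `random_state`, because K-Means labels are arbitrary and can change between runs, which would alter the cluster numbers that any written interpretation refers to.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def get_clusters_sketch(data, model):
    """Fit the model and return one cluster label per row (hypothetical helper)."""
    labels = model.fit_predict(data)
    print(f'{len(set(labels))} different clusters have been generated')
    return labels

# Synthetic data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
features = pd.DataFrame(
    np.vstack([center + rng.normal(size=(30, 2)) for center in centers]),
    columns=['feature_a', 'feature_b'],
)

# A fixed random_state makes the label assignment repeatable across runs.
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = get_clusters_sketch(features, model)

rerun = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
print(np.array_equal(labels, rerun.fit_predict(features)))
```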


In [15]:
px.scatter_3d(
        df, 
        x ='avg_sell_price', 
        y = 'avg_num_sales', 
        z = 'std_sell_price', 
        color = 'cluster', 
        title = 'Clustered Products',
        height = 800
    )

In [44]:
# check how many items are in each cluster
df.cluster.value_counts(normalize = True) * 100

0   69.99
4   21.91
3    4.69
1    3.28
2    0.13
Name: cluster, dtype: float64

In [118]:
df['category'] = [item.split('_')[0] for item in df.index]
cluster_category_df = (df.groupby(['cluster', 'category']).agg('count')).iloc[:, [0]]
cluster_category_df.columns = ['num_items']
cluster_category_df

cluster,category,num_items
0,ACCESORIES,343
0,HOME,656
0,SUPERMARKET,1135
1,ACCESORIES,39
1,HOME,39
1,SUPERMARKET,22
2,SUPERMARKET,4
3,ACCESORIES,10
3,HOME,23
3,SUPERMARKET,110


The K-Means output gives us 5 types of products:
- Cluster 0: Low price, low average daily sales and low price variation (69.99% of our products, mainly Supermarket products).
- Cluster 3: Similarly low prices and price variation as cluster 0, but with higher sales (4.69%, mainly Supermarket products).
- Cluster 4: Higher prices and higher price variability, but low daily sales (21.91%, the vast majority being Home and Garden products).
- Cluster 1: Higher price variation and a wide range of prices, from fairly cheap to expensive, but low sales (3.28%, spread fairly evenly across Accessories, Supermarket and Home & Garden).
- Cluster 2: 4 products with a higher number of sales that are not very expensive and show little price variation (0.13%, all of them Supermarket items).

The large share of Supermarket items is to be expected, given that DS Market has more Supermarket items than items in either of the other 2 categories.

Side note: in this approach we decided not to remove outliers, as we believe doing so would not make much of a difference to the clusters that were found.
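One way to check whether a category is genuinely over-represented in a cluster, rather than just common in the catalogue, is to compare its share inside each cluster with its overall share. The sketch below uses made-up labels and is only meant to illustrate the calculation. A ratio above 1 means the category appears in that cluster more often than its overall share would suggest.

```python
import pandas as pd

# Made-up cluster and category labels for eight products.
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1, 1],
    'category': [
        'SUPERMARKET', 'SUPERMARKET', 'SUPERMARKET', 'HOME',
        'SUPERMARKET', 'HOME', 'HOME', 'ACCESORIES',
    ],
})

# Share of each category inside each cluster (rows sum to 1).
within_cluster = pd.crosstab(toy['cluster'], toy['category'], normalize='index')

# Share of each category across the whole catalogue.
overall = toy['category'].value_counts(normalize=True)

# Ratio of the two shares, aligned on the category labels.
representation = within_cluster.div(overall, axis='columns')
print(representation.round(2))
```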

## E. Store Clustering<a class="anchor" id="store_clustering"></a>

- number of products on sale
- number of sales
- prices they set (these vary very little, but they do vary)
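The candidate store features listed above could be assembled with a grouped aggregation. The sketch below is only an illustration: the column names `store`, `item`, `num_sales` and `sell_price` and the toy rows are assumptions, since the actual layout of the sales data is not shown in this notebook.

```python
import pandas as pd

# Toy sales rows; column names and values are hypothetical.
toy_sales = pd.DataFrame({
    'store': ['S1', 'S1', 'S1', 'S2', 'S2'],
    'item': ['A', 'B', 'A', 'A', 'C'],
    'num_sales': [3, 0, 2, 5, 1],
    'sell_price': [2.0, 4.0, 2.2, 2.1, 6.0],
})

# One row per store: distinct products on sale, total sales, average price.
store_features = toy_sales.groupby('store').agg(
    num_products=('item', 'nunique'),
    total_sales=('num_sales', 'sum'),
    avg_sell_price=('sell_price', 'mean'),
)
print(store_features)
```

A table like this could then be scaled and clustered in the same way as the product features.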

____