# Clustering BY_CHANNEL

### Notebook automatically generated from your model

Model KMeans (k=2) (s1), trained on 2023-07-10 17:54:19.

#### Generated on 2023-07-12 17:33:30.897242

Clustering
This notebook will reproduce the steps for clustering the dataset BY_CHANNEL.

#### Warning

The goal of this notebook is to provide easily readable, explainable code that reproduces the main steps
of training the model. It is not complete: some of the preprocessing done by DSS visual machine learning is not
replicated here, so this notebook will not give the same results or model performance as the DSS visual machine
learning model.

Let's start by importing the required libraries:

In [1]:
import sys
import dataiku
import numpy as np
import pandas as pd
import sklearn as sk
import dataiku.core.pandasutils as pdu
from dataiku.doctor.preprocessing import PCA
from collections import defaultdict, Counter

And tune pandas display options:

In [2]:
pd.set_option('display.width', 3000)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

#### Importing base data

The first step is to get our machine learning dataset:

In [3]:
# We apply the preparation that you defined. You should not modify this.
preparation_steps = []
preparation_output_schema = {'columns': [{'name': 'cod_tienda', 'type': 'bigint'}, {'name': 'sales_surface_sqmeters', 'type': 'bigint'}, {'name': 'ventas_valor_per_store', 'type': 'double'}, {'name': 'Channel', 'type': 'string'}, {'name': 'revenue_per_sqmeter', 'type': 'double'}], 'userModified': False}

ml_dataset_handle = dataiku.Dataset('BY_STORE_V2')
ml_dataset_handle.set_preparation_steps(preparation_steps, preparation_output_schema)
%time ml_dataset = ml_dataset_handle.get_dataframe(limit = 100000)

print ('Base data has %i rows and %i columns' % (ml_dataset.shape[0], ml_dataset.shape[1]))
# First five records
ml_dataset.head(5)

CPU times: user 14 ms, sys: 4.88 ms, total: 18.8 ms
Wall time: 95.6 ms
Base data has 543 rows and 5 columns


| | cod_tienda | sales_surface_sqmeters | ventas_valor_per_store | Channel | revenue_per_sqmeter |
|---|---|---|---|---|---|
| 0 | 243 | 1500 | 376962.96 | Supermarkets | 251.30864 |
| 1 | 38 | 600 | 283402.85 | Supermarkets | 472.338083 |
| 2 | 30 | 400 | 217143.08 | Supermarkets | 542.8577 |
| 3 | 101 | 1100 | 354598.14 | Supermarkets | 322.361945 |
| 4 | 113 | 600 | 191223.41 | Supermarkets | 318.705683 |


#### Initial data management

The preprocessing aims at making the dataset compatible with modeling.
At the end of this step, we will have a matrix of float numbers, with no missing values.
We'll use the features and the preprocessing steps defined in Models.
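
As a quick sanity check of that goal, a small helper can verify that a frame is ready for modeling. This is only a sketch on made-up toy data; `is_model_ready` is a hypothetical name and not part of the generated notebook or of the Dataiku API.

```python
import numpy as np
import pandas as pd

def is_model_ready(df):
    """Return True if every column has a float dtype and no value is missing."""
    all_float = all(pd.api.types.is_float_dtype(dtype) for dtype in df.dtypes)
    no_missing = not df.isnull().values.any()
    return all_float and no_missing

# Toy frames (made-up values) standing in for the training matrix
raw = pd.DataFrame({'sales_surface_sqmeters': [1500.0, np.nan, 400.0]})
clean = raw.fillna(raw['sales_surface_sqmeters'].mean())

print(is_model_ready(raw), is_model_ready(clean))
```

Running such a check just before fitting the model makes it easier to spot a preprocessing step that was skipped.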

Let's keep only the selected features:

In [4]:
ml_dataset = ml_dataset[['sales_surface_sqmeters']]

Let's first coerce categorical columns into unicode and numerical features into floats.

In [5]:
# astype('unicode') does not work as expected

def coerce_to_unicode(x):
    if sys.version_info < (3, 0):
        if isinstance(x, str):
            return unicode(x,'utf-8')
        else:
            return unicode(x)
    else:
        return str(x)


categorical_features = []
numerical_features = ['sales_surface_sqmeters']
text_features = []
from dataiku.doctor.utils import datetime_to_epoch
for feature in categorical_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in text_features:
    ml_dataset[feature] = ml_dataset[feature].apply(coerce_to_unicode)
for feature in numerical_features:
    if ml_dataset[feature].dtype == np.dtype('M8[ns]') or (hasattr(ml_dataset[feature].dtype, 'base') and ml_dataset[feature].dtype.base == np.dtype('M8[ns]')):
        ml_dataset[feature] = datetime_to_epoch(ml_dataset[feature])
    else:
        ml_dataset[feature] = ml_dataset[feature].astype('double')

Let's copy our dataset so that the original values remain available for profiling at the end.

In [6]:
# train is the dataset on which we will apply ML techniques
train = ml_dataset.copy()

#### Features preprocessing

The first thing to do at the feature level is to handle missing values.
Let's reuse the settings defined in the model.

In [7]:
drop_rows_when_missing = []
impute_when_missing = [{'feature': 'sales_surface_sqmeters', 'impute_with': 'MEAN'}]

# Features for which we drop rows with missing values
for feature in drop_rows_when_missing:
    train = train[train[feature].notnull()]
    
    print ('Dropped missing records in %s' % feature)

# Features for which we impute missing values
for feature in impute_when_missing:
    if feature['impute_with'] == 'MEAN':
        v = train[feature['feature']].mean()
    elif feature['impute_with'] == 'MEDIAN':
        v = train[feature['feature']].median()
    elif feature['impute_with'] == 'CREATE_CATEGORY':
        v = 'NULL_CATEGORY'
    elif feature['impute_with'] == 'MODE':
        v = train[feature['feature']].value_counts().index[0]
    elif feature['impute_with'] == 'CONSTANT':
        v = feature['value']
    train[feature['feature']] = train[feature['feature']].fillna(v)
    
    print ('Imputed missing values in feature %s with value %s' % (feature['feature'], coerce_to_unicode(v)))

Imputed missing values in feature sales_surface_sqmeters with value 2298.342541436464
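
Note that the loop above leaves `v` undefined if `impute_with` holds a strategy it does not recognise. A minimal sketch of the same strategy selection with an explicit failure is shown below; `imputation_value` is a hypothetical helper and the series holds made-up values.

```python
import pandas as pd

def imputation_value(series, impute_with, constant=None):
    """Pick the fill value for one feature, mirroring the strategies used above."""
    if impute_with == 'MEAN':
        return series.mean()
    if impute_with == 'MEDIAN':
        return series.median()
    if impute_with == 'CREATE_CATEGORY':
        return 'NULL_CATEGORY'
    if impute_with == 'MODE':
        return series.value_counts().index[0]
    if impute_with == 'CONSTANT':
        return constant
    # Fail loudly instead of reusing a stale or undefined value
    raise ValueError('Unknown imputation strategy: %s' % impute_with)

# Toy series (made-up values) with one missing entry
surface = pd.Series([1.0, None, 3.0, 3.0])
filled = surface.fillna(imputation_value(surface, 'MEDIAN'))
print(filled.tolist())
```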


We would now handle the categorical features (still using the settings defined in Models), but none were selected for this model, so there is nothing to do for them.

Let's rescale numerical features

In [8]:
rescale_features = {'sales_surface_sqmeters': 'AVGSTD'}
for (feature_name, rescale_method) in rescale_features.items():
    if rescale_method == 'MINMAX':
        _min = train[feature_name].min()
        _max = train[feature_name].max()
        scale = _max - _min
        shift = _min
    else:
        shift = train[feature_name].mean()
        scale = train[feature_name].std()
    if scale == 0.:
        del train[feature_name]
        
        print ('Feature %s was dropped because it has no variance' % feature_name)
    else:
        print ('Rescaled %s' % feature_name)
        train[feature_name] = (train[feature_name] - shift).astype(np.float64) / scale

Rescaled sales_surface_sqmeters
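
To make the two rescaling modes concrete, here is a small sketch applied to the five surface values from the sample shown earlier. It only illustrates the arithmetic of the cell above. One detail worth knowing: pandas `std()` uses the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two do not produce identical values.

```python
import pandas as pd

# The five sales_surface_sqmeters values from the first records shown above
values = pd.Series([1500.0, 600.0, 400.0, 1100.0, 600.0])

# AVGSTD: subtract the mean, divide by the sample standard deviation (ddof=1)
avgstd = (values - values.mean()) / values.std()

# MINMAX: subtract the minimum, divide by the range, giving values in [0, 1]
minmax = (values - values.min()) / (values.max() - values.min())

print(avgstd.round(3).tolist())
print(minmax.round(3).tolist())

# The population standard deviation (ddof=0) is always the smaller of the two
print(values.std(ddof=0) < values.std())
```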


Removing outliers

In [9]:
# Remove outliers from train set
from dataiku.doctor.preprocessing.dataframe_preprocessing import detect_outliers

outliers = detect_outliers(train, 0.9, 5, 0.01)
train = train[~outliers]

print ("%s outliers found" % (outliers.sum()))

DEBUG:dku.ml.preprocessing:Outliers detection: fitting PCA
DEBUG:dku.ml.preprocessing:Outliers detection: performing PCA
DEBUG:dku.ml.preprocessing:Outliers detection: performing cubic-root kmeans on df (543, 1)
DEBUG:dku.ml.preprocessing:Outliers detection: selecting mini-clusters
DEBUG:dku.ml.preprocessing:Outliers detection: done


4 outliers found
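
The debug log mentions a PCA step, a "cubic-root kmeans", and a selection of "mini-clusters". The sketch below illustrates one plausible reading of that general idea and is not Dataiku's implementation: it skips the PCA step, runs k-means with roughly the cube root of the number of rows as `k`, and flags rows that land in very small clusters. The function name, the `min_cluster_share` parameter, and the toy data are all assumptions made for illustration, and they do not correspond to the arguments passed to `detect_outliers` above.

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_mini_clusters(X, min_cluster_share=0.05, random_state=0):
    """Flag rows that fall into clusters holding less than min_cluster_share of the data."""
    n_samples = X.shape[0]
    # Assumption: number of clusters is about the cube root of the row count
    k = max(2, int(round(n_samples ** (1.0 / 3.0))))
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    shares = np.bincount(labels, minlength=k) / float(n_samples)
    return shares[labels] < min_cluster_share

# Toy data (made-up): 63 evenly spaced points plus one extreme value
X = np.append(np.linspace(0.0, 1.0, 63), 100.0).reshape(-1, 1)
mask = flag_mini_clusters(X)
print(mask.sum(), "row(s) flagged")
```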


#### Modeling

In [10]:
from sklearn.cluster import KMeans
clustering_model = KMeans(n_clusters=2)

We can finally cluster our dataset!

In [11]:
%time clusters = clustering_model.fit_predict(train)

CPU times: user 126 ms, sys: 20.9 ms, total: 147 ms
Wall time: 32 ms
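
`KMeans(n_clusters=2)` picks its initial centroids at random, so repeated runs may number the clusters differently or, on harder data, converge to different solutions. A minimal sketch of a reproducible fit on made-up toy data follows; `n_init` is set explicitly because its default changed across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 1-D data (made-up values): two well-separated groups
X = np.array([[-1.0], [-0.9], [-1.1], [3.0], [3.2], [2.9]])

# random_state fixes the centroid initialisation, making the run repeatable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_again = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(labels, labels_again)
```

The cluster ids themselves (0 or 1) carry no meaning; only the grouping does.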


Build up our result dataset

#### Results

Inertia

In [12]:
print (clustering_model.inertia_)

54.55071163606398
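
Inertia is the sum of squared distances between each sample and its closest cluster center. The sketch below recomputes it by hand with NumPy on made-up toy data; `inertia` is a hypothetical helper name.

```python
import numpy as np

def inertia(X, centers):
    """Sum of squared Euclidean distances from each row of X to its nearest center."""
    # Pairwise squared distances, shape (n_samples, n_centers)
    sq_dist = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()

# Toy data (made-up): each point lies 0.5 away from its nearest center
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = np.array([[0.5], [10.5]])

print(inertia(X, centers))  # 4 * 0.5**2 = 1.0
```

Because the value depends on the scale of the features and on the number of rows, it is mostly useful for comparing runs on the same preprocessed data.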


Silhouette

In [13]:
from sklearn.metrics import silhouette_score
silhouette = silhouette_score(train.values, clusters, metric='euclidean', sample_size=2000)
print ("Silhouette score :", silhouette)

Silhouette score : 0.8706192305265472
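
For each sample, the silhouette compares the mean distance to the other members of its own cluster (`a`) with the mean distance to the members of the nearest other cluster (`b`), as `(b - a) / max(a, b)`; the score is the average over all samples. A from-scratch sketch on made-up toy data is given below. The function name is hypothetical, and for real data `sklearn.metrics.silhouette_score` remains the better choice.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Convention: a sample alone in its cluster scores 0
            scores.append(0.0)
            continue
        # Mean distance to the other members of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data (made-up): two tight, well-separated groups
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping ones.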


Join the preprocessed training data with the cluster labels we found.

In [14]:
final = train.join(pd.Series(clusters, index=train.index, name='cluster'))
final['cluster'] = final['cluster'].map(lambda cluster_id: 'cluster' + str(cluster_id))

Compute the cluster sizes

In [15]:
size = pd.DataFrame({'size': final['cluster'].value_counts()})
size.head()

| | size |
|---|---|
| cluster0 | 473 |
| cluster1 | 66 |
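
Since `final` holds rescaled values, profiling the clusters is easier on the original scale, which is why `ml_dataset` was copied before preprocessing. The sketch below shows the idea on made-up toy data: the labels are joined back by index, and an inner join drops the rows that were removed as outliers. The frame and label values here are illustrative stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the original, unscaled dataset (made-up values)
ml_dataset = pd.DataFrame(
    {'sales_surface_sqmeters': [400.0, 600.0, 1100.0, 50000.0, 9000.0, 12000.0]})

# Labels exist only for the rows kept for training (row 3 was dropped as an outlier)
labels = pd.Series(['cluster0', 'cluster0', 'cluster0', 'cluster1', 'cluster1'],
                   index=[0, 1, 2, 4, 5], name='cluster')

# Inner join on the index keeps only the labelled rows
profile = (ml_dataset.join(labels, how='inner')
           .groupby('cluster')['sales_surface_sqmeters']
           .agg(['count', 'mean', 'min', 'max']))
print(profile)
```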


Draw a nice scatter plot

In [16]:
axis_x = train.columns[0]   # change me
axis_y = train.columns[1]  # change me

from ggplot import ggplot, aes, geom_point
print(ggplot(aes(axis_x, axis_y, colour='cluster'), final) + geom_point())

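IndexError: index 1 is out of bounds for axis 0 with size 1

The cell fails because only one feature (`sales_surface_sqmeters`) was kept, so `train.columns[1]` does not exist. A scatter plot needs a second column for `axis_y`.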
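
One way to guard the axis selection against a single-feature dataset is sketched below. `pick_plot_axes` is a hypothetical helper, not part of the generated notebook; when only one column is available it reuses that column for both axes, which puts every point on the diagonal but still shows where the clusters separate.

```python
def pick_plot_axes(columns):
    """Return (axis_x, axis_y), reusing the only column when there is just one."""
    columns = list(columns)
    if not columns:
        raise ValueError('no feature columns to plot')
    axis_x = columns[0]
    axis_y = columns[1] if len(columns) > 1 else columns[0]
    return axis_x, axis_y

print(pick_plot_axes(['sales_surface_sqmeters']))
print(pick_plot_axes(['sales_surface_sqmeters', 'revenue_per_sqmeter']))
```

In the notebook, the two values could replace the hard-coded `train.columns[0]` and `train.columns[1]`, for example `axis_x, axis_y = pick_plot_axes(train.columns)`.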

That's it. It's now up to you to tune your preprocessing, your algorithm, and your analysis!
