Say we have usage data for 10,000 customers covering the past 500 days, and we need to build a usage forecasting model for those 10,000 users. Not every user necessarily needs a machine learning method: for some customer segments a simple rule is good enough, while other segments show usage that is too complex to predict with simple rules, and there we have to rely on machine learning. But how do we tell those customer segments apart? That's where segmentation comes into play.
In this notebook we will build a time series clustering model to identify the customer segments mentioned above. The dataset used here is real customer usage data extracted between 2020 and 2021. Let's begin.
from TabDAE.tab_dae import DenoisingAutoencoder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
usage_data = pd.read_parquet("usage_data/sample_10_000_preprocessed.parquet")
usage_data.columns = ['uid','usage']
# pivot to one row per user and one column per day (the raw frame is indexed by date)
usage_data_pivot = usage_data.pivot_table(
    values='usage',
    index='uid',
    columns=usage_data.index
)
usage_data_pivot.shape
(10000, 579)
usage_data_pivot.head()
2020-06-01 | 2020-06-02 | 2020-06-03 | 2020-06-04 | 2020-06-05 | 2020-06-06 | 2020-06-07 | 2020-06-08 | 2020-06-09 | 2020-06-10 | ... | 2021-12-22 | 2021-12-23 | 2021-12-24 | 2021-12-25 | 2021-12-26 | 2021-12-27 | 2021-12-28 | 2021-12-29 | 2021-12-30 | 2021-12-31 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uid | |||||||||||||||||||||
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 255 | 0 | 0 | 0 | 0 | 0 | 176 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1100 | 1726 | 1243 | 1426 | 1366 | 1215 | 859 | 1772 | 0 | 1099 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 341 | 647 | 634 | 46 | 32 | 284 | 374 | 451 | 0 | 1006 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 304 | 0 | 323 | 88 | 22 | 295 | 413 | 392 | 0 | 114 |
5 rows × 579 columns
# replace the date columns with simple integer positions 0..578
usage_data_pivot.columns = list(range(len(usage_data_pivot.columns)))
del usage_data
# replace exact zeros with a tiny constant so the max-normalization below never divides by zero
usage_data_pivot.replace(0, 1e-4, inplace=True)
usage_data_pivot.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uid | |||||||||||||||||||||
0 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | ... | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
1 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | ... | 0.0001 | 255.0000 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 176.0000 | 0.0001 | 0.0001 |
2 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | ... | 1100.0000 | 1726.0000 | 1243.0000 | 1426.0000 | 1366.0000 | 1215.0000 | 859.0000 | 1772.0000 | 0.0001 | 1099.0000 |
3 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | ... | 341.0000 | 647.0000 | 634.0000 | 46.0000 | 32.0000 | 284.0000 | 374.0000 | 451.0000 | 0.0001 | 1006.0000 |
4 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | ... | 304.0000 | 0.0001 | 323.0000 | 88.0000 | 22.0000 | 295.0000 | 413.0000 | 392.0000 | 0.0001 | 114.0000 |
5 rows × 579 columns
Here I'm dividing each user's usage by his/her maximum usage over the 579 days, which gives a 0-1 normalized value for each user. Since my target is identifying users with similar usage patterns (not similar usage quantities), this should be more meaningful.
normalize = lambda row: (row / row.max())
usage_normalized = usage_data_pivot.apply(normalize, axis=1)
usage_normalized.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uid | |||||||||||||||||||||
0 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | ... | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 | 3.885004e-08 |
1 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | ... | 2.524615e-08 | 6.437768e-02 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 2.524615e-08 | 4.443322e-02 | 2.524615e-08 | 2.524615e-08 |
2 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | 1.759944e-08 | ... | 1.935938e-01 | 3.037663e-01 | 2.187610e-01 | 2.509680e-01 | 2.404083e-01 | 2.138332e-01 | 1.511792e-01 | 3.118620e-01 | 1.759944e-08 | 1.934178e-01 |
3 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | 1.786671e-08 | ... | 6.092550e-02 | 1.155976e-01 | 1.132750e-01 | 8.218689e-03 | 5.717349e-03 | 5.074147e-02 | 6.682151e-02 | 8.057888e-02 | 1.786671e-08 | 1.797391e-01 |
4 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | 1.980198e-08 | ... | 6.019802e-02 | 1.980198e-08 | 6.396040e-02 | 1.742574e-02 | 4.356436e-03 | 5.841584e-02 | 8.178218e-02 | 7.762376e-02 | 1.980198e-08 | 2.257426e-02 |
5 rows × 579 columns
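As a quick sanity check (my own addition, not part of the original pipeline), each user's normalized series should now peak at exactly 1:
# every row was divided by its own maximum, so each series should now have max == 1
assert (usage_normalized.max(axis=1) == 1).all()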
Plot sample usage pattern
usage_normalized.iloc[999].plot(alpha=0.5,color='r')
plt.grid(True)
plt.show()
Let's train the model. Please note the output layer activation function used here: it's ReLU (Rectified Linear Unit). I used it because customer usage is always either zero or positive; with a linear output or an activation like tanh, the model could return negative values, which are meaningless for this specific problem. If you want to know more about the model architecture, please refer to this code repo; it contains all the information related to the architecture above. Finally, I have used MAE as the loss function.
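The real implementation lives in the TabDAE repo. Purely as an illustration, and assuming a layer layout and noise mechanism that mirror how the model is constructed and called below (the actual internals may differ), a denoising autoencoder with a ReLU output head could look roughly like this:
# Illustrative sketch only -- not the actual TabDAE implementation
class SketchDenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dims, encoding_dim,
                 dropout_rate=0.1, activation=nn.ReLU, output_activation=nn.ReLU,
                 noise_std=0.05):
        super().__init__()
        self.noise_std = noise_std
        # Encoder: input -> hidden layers -> bottleneck
        enc_dims = [input_dim] + hidden_dims
        enc_layers = []
        for d_in, d_out in zip(enc_dims[:-1], enc_dims[1:]):
            enc_layers += [nn.Linear(d_in, d_out), activation(), nn.Dropout(dropout_rate)]
        enc_layers.append(nn.Linear(enc_dims[-1], encoding_dim))
        self.encoder = nn.Sequential(*enc_layers)
        # Decoder mirrors the encoder; the final ReLU keeps reconstructions non-negative
        dec_dims = [encoding_dim] + hidden_dims[::-1]
        dec_layers = []
        for d_in, d_out in zip(dec_dims[:-1], dec_dims[1:]):
            dec_layers += [nn.Linear(d_in, d_out), activation(), nn.Dropout(dropout_rate)]
        dec_layers += [nn.Linear(dec_dims[-1], input_dim), output_activation()]
        self.decoder = nn.Sequential(*dec_layers)

    def forward(self, x, add_noise=False):
        if add_noise:
            # corrupt the input with Gaussian noise during training (the "denoising" part)
            x = x + self.noise_std * torch.randn_like(x)
        return self.decoder(self.encoder(x))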
# Split the data into train and test sets
train_data, test_data = train_test_split(usage_normalized, test_size=0.2)
# Convert the data to a PyTorch tensor
train_data = torch.tensor(train_data.values).float()
test_data = torch.tensor(test_data.values).float()
# Define the model, optimizer, and loss function
input_dim = usage_normalized.shape[1]
encoding_dim = 64
hidden_dims = [512, 256, 128]
model = DenoisingAutoencoder(input_dim, hidden_dims, encoding_dim, dropout_rate=0.1, activation=nn.ReLU, output_activation=nn.ReLU)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.L1Loss()
# Train the model
batch_size = 5000
num_epochs = 20
for epoch in range(num_epochs):
    # Iterate over the training data in batches
    for i in range(0, len(train_data), batch_size):
        inputs = train_data[i:i+batch_size]
        # Forward pass with input noise (the "denoising" part)
        output = model(inputs, add_noise=True)
        loss = loss_fn(output, inputs)
        # Backward pass and optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Evaluate the model on the test set after every epoch (no gradients needed here)
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for i in range(0, len(test_data), batch_size):
            inputs = test_data[i:i+batch_size]
            output = model(inputs, add_noise=False)
            test_loss += loss_fn(output, inputs).item()
    test_loss /= len(test_data)
    print(f'Test loss: {test_loss:.5f}')
    # Switch back to train mode for the next epoch
    model.train()
Test loss: 0.00008
Test loss: 0.00017
Test loss: 0.00019
Test loss: 0.00028
Test loss: 0.00029
Test loss: 0.00027
Test loss: 0.00021
Test loss: 0.00017
Test loss: 0.00013
Test loss: 0.00010
Test loss: 0.00009
Test loss: 0.00007
Test loss: 0.00006
Test loss: 0.00006
Test loss: 0.00006
Test loss: 0.00006
Test loss: 0.00005
Test loss: 0.00005
Test loss: 0.00005
Test loss: 0.00005
Let's plot some reconstructed and actual usage data. To get the reconstructed data we have to send our data through both the encoder and the decoder, which we can do by simply calling model(input_tensor_dataset). After that I plot each reconstructed series against its original shape.
# Set the number of rows and columns
n_rows = 5
n_cols = 5
# Set the figure size and create the figure
plt.figure(figsize=(18, 8))
# Reconstruct the whole test set once (with dropout disabled) instead of once per subplot
model.eval()
reconstructed_test = model(test_data).detach().numpy()
for i in range(n_rows):
    for j in range(n_cols):
        # Select a random index from the test data
        rand_id = np.random.choice(len(test_data))
        # Get the original and reconstructed series at the selected index
        original_data = test_data[rand_id].detach().numpy()
        constructed_data = reconstructed_test[rand_id]
        # Plot the original and reconstructed series in the current subplot
        plt.subplot(n_rows, n_cols, i*n_cols + j + 1)
        plt.plot(constructed_data, color='red', alpha=0.4)
        plt.plot(original_data, color='blue', alpha=0.4)
        plt.grid(True)
# Show the plot
plt.tight_layout()
plt.show()
Alright, the denoising part is almost complete; the clustering part is left. Let's start with it.
These are the steps I follow (a rough sketch of such a helper is shown right after this list):
- Fit the KMeans algorithm on the original dataset and get the cluster labels.
- Perform PCA on the dataset and project it onto a 2D plane using the first two principal components.

The same will be repeated for the reconstructed data as well.
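The actual plot_kmeans_pca helper lives in plotting.plotting_utils in the repo. Just to make the two steps above concrete, a hypothetical helper (the body below is my own sketch, not the repo's implementation) could look roughly like this:
from sklearn.cluster import KMeans  # PCA and plt are already imported above

def plot_kmeans_pca_sketch(data, n_clusters=5):
    # Step 1: cluster the raw high-dimensional series and keep the labels
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(data)
    # Step 2: project the same data onto its first two principal components
    coords = PCA(n_components=2).fit_transform(data)
    # Scatter plot of the 2D projection, coloured by cluster label
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=8, alpha=0.6)
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.grid(True)
    plt.show()
    return labels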
from plotting.plotting_utils import plot_kmeans_pca, plot_time_series
N_CLUSTERS = 5
Let's start with the original dataset.
test_data_npy = test_data.detach().numpy()
labels = plot_kmeans_pca(test_data_npy,n_clusters=N_CLUSTERS)
Okay, now we have assigned each customer to a segment based on their usage trend. Now it's time to plot some sample series for each customer segment.
clustered_df = pd.DataFrame(test_data_npy)
clustered_df['cluster'] = labels
plot_time_series(clustered_df)
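Again, the real plot_time_series comes from plotting.plotting_utils; the sketch below only illustrates the idea (overlaying a handful of sampled series per cluster), and its sampling logic is my own assumption:
def plot_time_series_sketch(df, n_samples=20):
    clusters = sorted(df['cluster'].unique())
    plt.figure(figsize=(18, 3 * len(clusters)))
    for idx, cluster in enumerate(clusters):
        # overlay a few randomly sampled series from this cluster in one subplot
        members = df[df['cluster'] == cluster].drop(columns='cluster')
        sample = members.sample(min(n_samples, len(members)), random_state=42)
        plt.subplot(len(clusters), 1, idx + 1)
        for _, row in sample.iterrows():
            plt.plot(row.values, color=f'C{idx}', alpha=0.3)
        plt.title(f'Cluster {cluster}')
        plt.grid(True)
    plt.tight_layout()
    plt.show()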
Looking at the generated plot, you may notice that some reasonable segmentation has happened.
For example,
- Red-segment and green-segment users have the most inactive days.
- Blue-segment customers are relatively inactive at the beginning of the series and then gradually increase their usage over time.
- Orange and purple users show relatively active patterns compared to the other segments; between these two, the orange group is the more active one.
Let's repeat the same for the reconstructed dataset as well.
test_reconstructed = model(test_data).detach().numpy()
labels = plot_kmeans_pca(test_reconstructed,n_clusters=N_CLUSTERS)
# note that we plot the original series, grouped by the clusters found on the reconstructed data
reconstructed_clustered_df = pd.DataFrame(test_data_npy)
reconstructed_clustered_df['cluster'] = labels
plot_time_series(reconstructed_clustered_df)