# Deep Clustering for Financial Market Segmentation

### An unsupervised deep learning approach to credit card customer clustering

With the advancement of unsupervised deep learning, autoencoder neural networks are now frequently used to reduce high-dimensional data (e.g., a dataset with thousands of features or more). An autoencoder can also be combined with a supervised learner (e.g., Random Forest) to form a semi-supervised learning method. More recently, the Deep Embedded Clustering (DEC) method [1] was published. It combines an autoencoder with K-means and other machine learning techniques to perform clustering rather than dimensionality reduction. The original implementation of DEC is based on Caffe.

The rest of this notebook is arranged as follows:

* Data Preparation
* Implementation of the DEC Method in Keras
* Summary

### Import packages

In [109]:
# Standard library and scientific stack
from time import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image

# Keras (imported from a single namespace to avoid mixing keras and tensorflow.keras)
import tensorflow as tf
import keras.backend as K
import keras.metrics
from keras import callbacks
from keras.callbacks import TensorBoard
from keras.initializers import VarianceScaling
from keras.layers import Dense, Input, Layer, InputSpec
from keras.models import Model
from keras.optimizers import SGD

# scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline

# 1. Data Preparation

This section describes the common data preprocessing steps required for clustering.

### 1.1 Loading Data
After the Kaggle credit card dataset [2] has been downloaded onto a local machine, it can be loaded into Pandas DataFrame as follows:

In [52]:
np.random.seed(10)  # fix the NumPy random seed so that random selections are reproducible
data = pd.read_csv('./CC GENERAL.csv')
data.head()

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


In [53]:
# How many columns (variables) do we have?
print(data.columns)
print("# of columns: " + str(len(data.columns)))

Index(['CUST_ID', 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES',
       'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
       'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY',
       'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
       'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
       'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT', 'TENURE'],
      dtype='object')
# of columns: 18


### 1.2 Selecting Features
The DataFrame above shows that the CUST_ID field is unique for each customer record. A field with unique values carries no information for clustering, so it can be dropped:

In [54]:
data_x = data.drop(['CUST_ID'], axis=1)

### 1.3 Rescaling Features
It can also be seen from the DataFrame that the value ranges differ greatly across fields/features. It is well known that **K-means is sensitive to the scale of feature values because it uses Euclidean distance as its dissimilarity measure**. To avoid this issue, the values of all features are rescaled into the range [0, 1]:

In [55]:
numeric_columns = data_x.columns.values.tolist()
scaler = MinMaxScaler() 
data_x[numeric_columns] = scaler.fit_transform(data_x[numeric_columns])
data_x.head()

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,0.002148,0.818182,0.001945,0.0,0.00424,0.0,0.166667,0.0,0.083333,0.0,0.0,0.005587,0.03172,0.003979,0.001826,0.0,1.0
1,0.168169,0.909091,0.0,0.0,0.0,0.136685,0.0,0.0,0.0,0.166667,0.03252,0.0,0.232053,0.080893,0.014034,0.222222,1.0
2,0.131026,1.0,0.015766,0.018968,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.03352,0.248748,0.012264,0.00821,0.0,1.0
3,0.087521,0.636364,0.030567,0.036775,0.0,0.004366,0.083333,0.083333,0.0,0.055555,0.00813,0.002793,0.248748,0.0,,0.0,1.0
4,0.04294,1.0,0.000326,0.000393,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,0.002793,0.038397,0.013374,0.003204,0.0,1.0
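
To illustrate why rescaling matters for a Euclidean-distance method such as K-means, here is a minimal sketch using three made-up customers described by two features. The values and the helper functions `min_max_scale` and `distance` are hypothetical and are not part of the notebook; the helper applies the same formula as `MinMaxScaler`, `(x - min) / (max - min)`, and assumes no column is constant.

```python
import numpy as np

# Hypothetical customers: [CREDIT_LIMIT, PURCHASES_FREQUENCY]
customers = np.array([
    [1000.0, 0.0],
    [1100.0, 1.0],
    [7000.0, 0.0],
])

def min_max_scale(values):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    return (values - col_min) / col_range

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

scaled = min_max_scale(customers)

# Distances from customer 0 to customers 1 and 2, before and after rescaling
raw_d01 = distance(customers[0], customers[1])
raw_d02 = distance(customers[0], customers[2])
scaled_d01 = distance(scaled[0], scaled[1])
scaled_d02 = distance(scaled[0], scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw values, the credit limit dominates the distance, so customer 1 looks far closer to customer 0 than customer 2 does, even though customers 0 and 1 have opposite purchase frequencies. After rescaling, both features contribute on a comparable scale.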


### 1.4 Handling Missing Data
The following code checks whether the dataset contains any missing values:

In [56]:
data_x.isnull().sum()

BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64

The output above shows one missing CREDIT_LIMIT value and 313 missing MINIMUM_PAYMENTS values. In this case, it makes sense to fill the missing values with zero:

In [57]:
data_x.fillna(0, inplace=True)
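
Because the fill happens after the features were rescaled, a zero corresponds to the smallest observed value of that column. The sketch below shows this on a made-up MINIMUM_PAYMENTS column; the numbers are hypothetical, and the NaN-aware scaling is written by hand in NumPy on the assumption that it mirrors how `MinMaxScaler` disregards missing values when fitting.

```python
import numpy as np

# Hypothetical MINIMUM_PAYMENTS values with one missing entry
minimum_payments = np.array([139.5, 1072.3, 627.3, np.nan, 244.8])

# NaN-aware min-max scaling: missing entries are ignored when
# computing min and max, and stay NaN after the transformation
col_min = np.nanmin(minimum_payments)
col_max = np.nanmax(minimum_payments)
scaled = (minimum_payments - col_min) / (col_max - col_min)

# Equivalent of data_x.fillna(0): the missing entry becomes 0.0,
# the same value as the smallest observed payment after scaling
filled = np.where(np.isnan(scaled), 0.0, scaled)

print(scaled)
print(filled)
```

If treating a missing value as the column minimum is not appropriate for a given feature, filling with the column median before rescaling is a common alternative.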

# 2. Implementation of the DEC Method in Keras

* Step 1: Estimating the Number of Clusters
* Step 2: Creating and Training a K-means Model
* Step 3: Creating and Training an Autoencoder
* Step 4: Implementing DEC Soft Labeling
* Step 5: Creating a New DEC Model
* Step 6: Training the New DEC Model
* Step 7: Using the Trained DEC Model for Predicting Clustering Classes
* Step 8: Jointly Refining DEC Model
* Step 9: Using Refined DEC Model for Predicting Clustering Classes
* Step 10: Comparing with K-means

### 2.1 Estimating the Number of Clusters
As described before, the DEC method combines Autoencoder with K-means and other machine learning techniques. In order to train a K-means model, **an estimated number of clusters is required**. The number of clusters is estimated in this case by exploring the silhouette values of different K-means model executions:

In [58]:
for num_clusters in range(2, 10):
    clusterer = KMeans(n_clusters=num_clusters)
    preds = clusterer.fit_predict(data_x)
    # centers = clusterer.cluster_centers_
    score = silhouette_score(data_x, preds, metric='euclidean')
    print("For n_clusters = {}, Kmeans silhouette score is {}".format(num_clusters, score))

For n_clusters = 2, Kmeans silhouette score is 0.3867399802746641
For n_clusters = 3, Kmeans silhouette score is 0.3287956842628974
For n_clusters = 4, Kmeans silhouette score is 0.3266576789600756
For n_clusters = 5, Kmeans silhouette score is 0.3195084713077689
For n_clusters = 6, Kmeans silhouette score is 0.2941513825965772
For n_clusters = 7, Kmeans silhouette score is 0.32251271391463787
For n_clusters = 8, Kmeans silhouette score is 0.32459857982009993
For n_clusters = 9, Kmeans silhouette score is 0.32345829231712003


A silhouette value measures how similar a data record is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from −1 to +1, where a high value indicates that the data record is well matched to its own cluster and poorly matched to neighboring clusters.

The silhouette values above indicate that the top two choices for the number of clusters are 2 and 3. For the purposes of this class, 3 clusters are used.
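
The silhouette definition can be made concrete with a small hand-written computation. The sketch below is not part of the notebook: `silhouette_value` is a hypothetical helper and the four points are toy data. For a record `i` it computes `a`, the mean distance to the other records of its own cluster, and `b`, the smallest mean distance to the records of any other cluster, then returns `(b - a) / max(a, b)`. It assumes at least two clusters are present. `silhouette_score` reports the mean of these per-record values.

```python
import numpy as np

def silhouette_value(points, labels, i):
    """Silhouette of record i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    own = labels == labels[i]
    n_own = own.sum()
    if n_own == 1:
        return 0.0  # usual convention for a cluster with a single record
    # Mean distance to the other members of the own cluster
    # (the record's zero distance to itself is excluded via n_own - 1)
    a = dists[own].sum() / (n_own - 1)
    # Smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return float((b - a) / max(a, b))

# Toy data: two tight, well-separated pairs of points
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])
bad_labels = np.array([0, 1, 0, 1])

good_score = float(np.mean([silhouette_value(points, good_labels, i)
                            for i in range(len(points))]))
bad_score = float(np.mean([silhouette_value(points, bad_labels, i)
                           for i in range(len(points))]))
print(good_score, bad_score)
```

The labeling that follows the geometry of the points scores close to +1, while the labeling that splits each pair scores below zero.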

### 2.2 Creating and Training K-means Model
Once the number of clusters is determined, a K-means model can be created:

In [59]:
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters)
y_pred_kmeans = kmeans.fit_predict(data_x)

In [83]:
x = data_x.values
x.shape

(8950, 17)
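
For intuition about what the K-means model does, the following is a minimal sketch of the underlying Lloyd iteration on toy data. It is an illustration only: `lloyd_kmeans` is a hypothetical helper, the initial centers are passed in by hand, and scikit-learn's `KMeans` is considerably more elaborate (for example, in how it initializes the centers).

```python
import numpy as np

def lloyd_kmeans(points, initial_centers, n_iter=10):
    """Alternate between assigning points and updating centers."""
    centers = initial_centers.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its members
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels, centers

# Toy data already scaled to [0, 1], with two obvious groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
labels, centers = lloyd_kmeans(points, initial_centers=points[[0, 2]])
print(labels)
print(centers)
```

Both steps rely on Euclidean distance, which is why the features were rescaled to a common range beforehand.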

### 2.3 Creating and Training Autoencoder 
In addition to K-means, the DEC algorithm [1] requires an autoencoder. The following function creates one:

In [98]:
def autoencoder(dims, act='relu', init='glorot_uniform'):
    """
    Fully connected symmetric autoencoder model.

    dims: list of layer sizes of the encoder, e.g. [input_dim, 500, 500, 2000, 10].
          dims[0] is the input dimension, dims[-1] is the size of the latent hidden layer.
    act: activation function of the internal layers
    init: kernel initializer of the Dense layers

    return:
        (autoencoder_model, encoder_model): the autoencoder model and the encoder model
    """

    n_stacks = len(dims) - 1
    
    input_data = Input(shape=(dims[0],), name='input')
    x = input_data
    
    # internal layers of encoder
    for i in range(n_stacks-1):
        x = Dense(dims[i + 1], activation=act,  kernel_initializer=init, name='encoder_%d' % i)(x)
    # latent hidden layer
    encoded = Dense(dims[-1], kernel_initializer=init, name='encoder_%d' % (n_stacks - 1))(x)
    x = encoded
    # internal layers of decoder
    for i in range(n_stacks-1, 0, -1):
        x = Dense(dims[i], activation=act, kernel_initializer=init, name='decoder_%d' % i)(x)
    # decoder output
    x = Dense(dims[0], kernel_initializer=init, name='decoder_0')(x)
    
    decoded = x
    autoencoder_model = Model(inputs=input_data, outputs=decoded, name='autoencoder')
    encoder_model     = Model(inputs=input_data, outputs=encoded, name='encoder')
    
    return autoencoder_model, encoder_model

An autoencoder model is created as follows:

In [99]:
n_epochs   = 100
batch_size = 128
dims = [x.shape[-1], 500, 500, 2000, 10] 
# VarianceScaling is a weight initialization strategy: it scales the initial
# weights of each layer according to the number of input units (mode='fan_in'),
# which helps keep gradients well-behaved early in training.

init_ = VarianceScaling(scale=1. / 3., mode='fan_in',
                           distribution='uniform') 
pretrain_optimizer = SGD(learning_rate=0.1, momentum=0.9)
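
As a rough illustration of what this initializer does, the sketch below reproduces the sampling rule in plain NumPy. It assumes the documented Keras behaviour for `distribution='uniform'` with `mode='fan_in'`, namely drawing from a uniform distribution on `[-limit, limit]` with `limit = sqrt(3 * scale / fan_in)`. The function `variance_scaling_uniform` and its `seed` argument are hypothetical and only serve this example.

```python
import numpy as np

def variance_scaling_uniform(fan_in, fan_out, scale=1.0 / 3.0, seed=0):
    """Draw a (fan_in, fan_out) weight matrix from U(-limit, limit),
    where limit = sqrt(3 * scale / fan_in)."""
    limit = np.sqrt(3.0 * scale / fan_in)
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return weights, limit

# Shape of the 500 -> 2000 encoder layer in the configuration above
weights, limit = variance_scaling_uniform(fan_in=500, fan_out=2000)

# A uniform distribution on [-limit, limit] has variance limit**2 / 3,
# which equals scale / fan_in
print(limit, weights.var(), (1.0 / 3.0) / 500)
```

With `scale=1/3` the limit reduces to `1 / sqrt(fan_in)`, so layers with more inputs start with proportionally smaller weights.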

In [100]:
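pretrain_epochs = n_epochs
save_dir = './results'
# Note: this rebinds the name `autoencoder` from the factory function to the model,
# so the function definition cell must be re-run before this cell is run again.
autoencoder, encoder = autoencoder(dims, init=init_)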

As described in [1], the sizes of layers [500, 500, 2000, 10] are chosen as a generic configuration of the autoencoder neural network for any dataset.
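
To see what this configuration implies for the 17-feature dataset used here, the following sketch lists the input and output size of every Dense layer that a symmetric autoencoder with these dimensions contains and counts the trainable parameters. The helpers `autoencoder_layer_sizes` and `count_parameters` are hypothetical and do plain arithmetic; they do not build a Keras model.

```python
def autoencoder_layer_sizes(dims):
    """Return the (input_units, output_units) pair of every Dense layer."""
    encoder = list(zip(dims[:-1], dims[1:]))
    # The decoder mirrors the encoder in reverse order
    decoder = [(n_out, n_in) for n_in, n_out in reversed(encoder)]
    return encoder + decoder

def count_parameters(layer_sizes):
    """Each Dense layer has a weight matrix plus one bias per output unit."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)

dims = [17, 500, 500, 2000, 10]
layers = autoencoder_layer_sizes(dims)
total_params = count_parameters(layers)

for n_in, n_out in layers:
    print("{:>4} -> {:<4}".format(n_in, n_out))
print("trainable parameters:", total_params)
```

The network compresses each 17-dimensional record into a 10-dimensional latent vector and has roughly 2.56 million parameters, which is large relative to the 8,950 records in this dataset.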

In [102]:
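autoencoder.compile(optimizer=pretrain_optimizer, loss='mse')

# TensorBoard callback that writes training logs to ./logs
tensorboard_ = TensorBoard(log_dir='logs')
autoencoder.fit(x, x, batch_size=batch_size, epochs=pretrain_epochs, callbacks=[tensorboard_])
# autoencoder.save_weights(save_dir + '/ae_weights.h5')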

Epoch 1/100


[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0810
Epoch 2/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0261
Epoch 3/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0197
Epoch 4/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 0.0193
Epoch 5/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0158
Epoch 6/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 0.0100
Epoch 7/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0086
Epoch 8/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0079
Epoch 9/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 7ms/step - loss: 0.0066
Epoch 10/100
[1m70/70[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 8ms/step - loss: 0.0055
Epoch 11/10

<keras.src.callbacks.history.History at 0x1c7713e5940>

In [107]:
# The %tensorboard magic is only available after the extension is loaded
%load_ext tensorboard
%tensorboard --logdir logs


### References

1. J. Xie, R. Girshick, A. Farhadi, Unsupervised Deep Embedding for Clustering Analysis, May 24, 2016
2. Source of data: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata