## Missing Value Clustering

This notebook identifies missing-value patterns in the data and clusters indicators with similar patterns together.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cm as cm
import matplotlib.colors as colors
import pickle

In [None]:
df = pd.read_csv('dataset/worldbank/API.csv')
meta_country = pd.read_csv('dataset/worldbank/Metadata_Country_API_19_DS2_en_csv_v2_3159902.csv')
meta_indicator = pd.read_csv('dataset/worldbank/Metadata_Indicator_API_19_DS2_en_csv_v2_3159902.csv')

## Preprocessing Data

Missing values: encode each cell as 0 if it is null and 1 if it is not null. This encoding is applied to the year columns only.
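
As a minimal sketch of this encoding on a made-up frame (the column names and values below are illustrative, not taken from the World Bank file), `notna()` yields the presence mask directly. One property worth noting is that a genuinely reported value of 0, or a negative value, still counts as present:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the year columns; the values are invented for illustration.
toy = pd.DataFrame({
    "1960": [np.nan, 0.0, 3.2],
    "1961": [1.5, np.nan, -0.7],
})

# 1 where a value was reported, 0 where it is missing.
presence = toy.notna().astype(int)
print(presence)
```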

In [None]:
# Select the columns whose dtype is float (the year columns).
# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype.

floatColumns = df.dtypes[df.dtypes == np.float64]

# List of the float column names

listOfFloatColumnNames = list(floatColumns.index)

print(listOfFloatColumnNames)

In [None]:
# Select the columns whose dtype is object (the text columns).
# np.object was removed in NumPy 1.24; the builtin object is equivalent.

objectColumns = df.dtypes[df.dtypes == object]

# List of the object column names

listOfObjectColumnNames = list(objectColumns.index)

print(listOfObjectColumnNames)

In [None]:
# Number of unique country codes (used later as the divisor when scaling counts to [0, 1])

len(df['Country Code'].unique())

In [None]:
# What the original (not yet preprocessed) data looks like

df_years = df[listOfFloatColumnNames]
df_years

In [None]:
# Encode NaN as 0 and any reported value as 1.
# notna() keeps reported zeros and negative values marked as present,
# which fillna(0) followed by a "> 0" test would misclassify as missing.

df_years = df_years.notna().astype('int')
df_years

In [None]:
# Only the indicator name is needed; select it by name rather than by position

df_indicator = df[['Indicator Name']]
df_indicator

In [None]:
# Concatenate df_indicator and df_years back together

df_indicatorAndYears = pd.concat([df_indicator, df_years], axis=1)

# final preprocessed data

df_indicatorAndYears

## Group by Indicator

There are several ways to examine missing-value patterns. This section starts by grouping by indicator: for each of the 76 indicators, the presence flags of all countries are summed, which gives the number of countries with a reported value in each year.
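
The grouping step can be sketched on a toy presence table. The indicator names and flags below are invented for illustration; only the `Indicator Name` column name matches the real data:

```python
import pandas as pd

# Toy presence table: one row per (country, indicator); 1 = value reported.
toy = pd.DataFrame({
    "Indicator Name": ["CO2 emissions", "CO2 emissions",
                       "Forest area", "Forest area"],
    "1990": [1, 0, 0, 0],
    "1991": [1, 1, 0, 1],
})

# Summing within each indicator counts how many countries reported that year.
counts = toy.groupby("Indicator Name").sum()
print(counts)
```

A year in which the count drops toward 0 is a year in which most countries lack data for that indicator.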

In [None]:
# Start from the preprocessed data (indicator name plus year flags)
df_groupBy_indicatorCode = df_indicatorAndYears

# Then group by indicator
df_groupBy_indicatorCode = df_groupBy_indicatorCode.groupby(['Indicator Name']).sum()
df_groupBy_indicatorCode

In [None]:
# The first and second indicators have similar missing-value patterns

df_temp = (df_groupBy_indicatorCode.iloc[0:5 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

In [None]:
# Another missing-value pattern

df_temp = (df_groupBy_indicatorCode.iloc[5:10 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

In [None]:
# more of them

df_temp = (df_groupBy_indicatorCode.iloc[10:15 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

## Time series clustering based on indicator

There are 76 indicators, and each may have a different missing-value pattern. This section uses time series clustering to group indicators whose patterns are similar.

The clustering code is adapted from the k-means example in the [tslearn documentation](https://tslearn.readthedocs.io/en/stable/auto_examples/clustering/plot_kmeans.html#sphx-glr-auto-examples-clustering-plot-kmeans-py).
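
Two of the clustering runs in this section use DTW-based metrics instead of the Euclidean distance. To illustrate why that choice can matter for coverage curves, here is a minimal NumPy sketch of textbook dynamic time warping. It is not tslearn's implementation, and the function name `dtw_distance` and the two toy series are assumptions made for this example:

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with squared pointwise cost."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor paths.
            acc[i, j] = cost + min(acc[i - 1, j],
                                   acc[i, j - 1],
                                   acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))

# Two coverage curves with the same shape, shifted by one year.
early = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
late = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

euclidean = float(np.linalg.norm(early - late))
dtw = dtw_distance(early, late)
print(euclidean, dtw)
```

The Euclidean distance penalizes the one-year shift, while DTW aligns the two step changes and treats the curves as identical. Whether that invariance is desirable depends on whether the year in which reporting starts should distinguish clusters.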

In [None]:
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

# Matplotlib customization
%matplotlib inline
mpl.rcParams.update(mpl.rcParamsDefault)
mpl.rcParams['font.size'] = 14
mpl.rcParams['figure.dpi'] = 150.
mpl.rcParams["figure.figsize"] = (20,50)

In [None]:
# copy the data
df_max_scaled = df_groupBy_indicatorCode.copy()
  
# Scale each count by the number of unique country codes so values lie in [0, 1]
n_countries = len(df['Country Code'].unique())
df_max_scaled = df_max_scaled / n_countries
      
# view normalized data
df_max_scaled
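
The scaling step can be read as turning each count into the fraction of countries that reported a value. A small sketch under that assumption follows; the country codes, indicator names, and flags are invented:

```python
import pandas as pd

# Toy long-format presence table; names and values are invented.
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "Indicator Name": ["X", "X", "X", "Y", "Y", "Y"],
    "2000": [1, 1, 0, 0, 0, 1],
    "2001": [1, 1, 1, 0, 1, 1],
})

# Derive the divisor from the data instead of hard-coding it.
n_countries = toy["Country Code"].nunique()

counts = toy.drop(columns="Country Code").groupby("Indicator Name").sum()
coverage = counts / n_countries  # fraction of countries reporting, in [0, 1]
print(coverage)
```

Deriving the divisor from the data keeps the scaled values within [0, 1] even if the input file gains or loses country rows.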

In [None]:
seed = 1
np.random.seed(seed)

X_train = to_time_series_dataset(df_max_scaled.copy())
# X_train

In [None]:
# Set the number of clusters

cluster = 13

In [None]:
print("Euclidean k-means")
km = TimeSeriesKMeans(n_clusters=cluster, 
                      verbose=True, 
                      random_state=seed, 
                      max_iter=10)
y_pred = km.fit_predict(X_train)
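# Cluster label per indicator; print it, since a bare expression
# in the middle of a cell is not displayed
clusters = pd.Series(data=y_pred, index=df_max_scaled.index)
print(clusters)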

plt.figure()
for yi in range(cluster):
    plt.subplot(cluster, 1, yi+1)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi].ravel(), "r-")
    plt.ylim(0, 1)
    plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
        
print("Euclidean k-means Chart")
plt.show()

In [None]:
# DBA-k-means
print("DBA k-means")
dba_km = TimeSeriesKMeans(n_clusters=cluster,
                          n_init=2,
                          metric="dtw",
                          verbose=True,
                          max_iter_barycenter=10,
                          random_state=seed)
y_pred = dba_km.fit_predict(X_train)

for yi in range(cluster):
    plt.subplot(cluster, 1, yi+1)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(dba_km.cluster_centers_[yi].ravel(), "r-")
    plt.ylim(0, 1)
    plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)


print("DBA k-means Chart")
plt.show()

In [None]:
# Soft-DTW-k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(n_clusters=cluster,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(X_train)

for yi in range(cluster):
    plt.subplot(cluster, 1, yi+1)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(sdtw_km.cluster_centers_[yi].ravel(), "r-")
    plt.ylim(0, 1)
    plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)

print("Soft-DTW k-means Chart")
plt.show()

## Clustering Observation
1. Pending (work in progress)