## Missing Value Clustering

This notebook identifies patterns of missing values and clusters similar patterns together.

### Motivation

When analyzing data, some of the trends discovered may be artifacts of missing values rather than real effects. This notebook aims to address that issue.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cm as cm
import matplotlib.colors as colors
import pickle

## Worldbank Data

In [None]:
df = pd.read_csv('dataset/worldbank/API.csv')
meta_country = pd.read_csv('dataset/worldbank/Metadata_Country_API_19_DS2_en_csv_v2_3159902.csv')
meta_indicator = pd.read_csv('dataset/worldbank/Metadata_Indicator_API_19_DS2_en_csv_v2_3159902.csv')

### Preprocessing Data

Missing values: replace each null value with 0 and each non-null value with 1. This encoding is applied only to the year columns.
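
As a minimal sketch of this encoding, consider a made-up three-row frame (the column names and values below are illustrative and not taken from the World Bank file). `DataFrame.notna()` produces the 0/1 mask directly, and a genuine zero in the data still counts as present:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two year columns with some gaps
toy = pd.DataFrame({"1960": [1.5, np.nan, 0.0], "1961": [np.nan, np.nan, 3.2]})

# 1 where a value exists, 0 where it is missing
presence = toy.notna().astype(int)
print(presence)
```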

In [None]:
# Get the columns whose data type is float (the year columns).
# np.float was removed in NumPy 1.24, so the builtin float is used instead.

floatColumns = df.dtypes[df.dtypes == float]

# list of columns whose data type is float

listOfFloatColumnNames = list(floatColumns.index)

print(listOfFloatColumnNames)

In [None]:
# Get the columns whose data type is object.
# np.object was removed in NumPy 1.24, so the builtin object is used instead.

objectColumns = df.dtypes[df.dtypes == object]

# list of columns whose data type is object

listOfObjectColumnNames = list(objectColumns.index)

print(listOfObjectColumnNames)

In [None]:
# number of unique values for country and indicator

df[listOfObjectColumnNames].nunique()

In [None]:
# What the original (not yet preprocessed) data looks like

df_years = df[listOfFloatColumnNames]
df_years.tail()

In [None]:
# Encode missing values as 0 and present values as 1.
# notna() is used instead of fillna(0) so that genuine zero values still count as present.

df_years = df_years.notna().astype('int')
df_years.tail()

In [None]:
df_years.apply(pd.Series.value_counts)

In [None]:
# Only indicator name is needed

df_indicator = df[listOfObjectColumnNames].iloc[: , 2:3]
df_indicator

In [None]:
# concat df_indicator and df_years back together

df_indicatorAndYears = pd.concat([df_indicator, df_years], axis=1)

# final preprocessed data

df_indicatorAndYears.tail()

## Group by Indicator

There are several ways to check the pattern of missing values. This section starts with missing values grouped by indicator: the data of all countries is summed within each of the 76 indicators.
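
To illustrate what the grouped sum represents, here is a small sketch on invented presence data (the indicator names and values are hypothetical). Each cell of the result counts how many rows, that is, countries, have a value for that indicator in that year:

```python
import pandas as pd

# Hypothetical 0/1 presence data: two indicators, two countries each
toy = pd.DataFrame({
    "Indicator Name": ["GDP", "GDP", "Population", "Population"],
    "1960": [1, 0, 1, 1],
    "1961": [1, 1, 0, 1],
})

# Sum the presence flags per indicator to get a count per year
counts = toy.groupby("Indicator Name").sum()
print(counts)
```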

In [None]:
# Make a new copy
df_groupBy_indicatorCode = df_indicatorAndYears.copy()

# Then group by indicator
df_groupBy_indicatorCode = df_groupBy_indicatorCode.groupby(['Indicator Name']).sum()
df_groupBy_indicatorCode.tail()

In [None]:
# drop index column first
df_check = df_groupBy_indicatorCode.reset_index(drop=True)

# check value_counts() for all columns (values should lie between 0 and 266)
df_check.apply(pd.Series.value_counts)

#### Simple EDA for missing value

In [None]:
# The first and second indicators have similar missing-value patterns

df_temp = (df_groupBy_indicatorCode.iloc[0:5 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

In [None]:
# This is another missing value pattern found

df_temp = (df_groupBy_indicatorCode.iloc[5:10 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

In [None]:
# more of them

df_temp = (df_groupBy_indicatorCode.iloc[10:15 , :]).transpose()

sns.lineplot(data=df_temp, palette="tab10", linewidth=2.5)

## Time series clustering based on indicator

There are 76 different indicators, and each may have a different pattern of missing values. This section aims to use time series clustering to group the missing-value patterns into categories.

See the [tslearn documentation](https://tslearn.readthedocs.io/en/stable/auto_examples/clustering/plot_kmeans.html#sphx-glr-auto-examples-clustering-plot-kmeans-py) for the k-means example used as a reference here.
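
This section clusters with both Euclidean and DTW-based metrics. The sketch below shows how the two can differ on availability curves. It uses a plain NumPy implementation of the classic dynamic-programming DTW recurrence, written for illustration only. The function name `dtw_distance` and the two toy series are assumptions, and this is not tslearn's implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """Illustrative DTW distance: squared point costs, square root of the total."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best accumulated cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

# Hypothetical availability curves: same shape, reporting starts one year apart
early = np.array([0, 0, 1, 1, 1, 1], dtype=float)
late = np.array([0, 0, 0, 1, 1, 1], dtype=float)

euclid = float(np.linalg.norm(early - late))
dtw = dtw_distance(early, late)
print(euclid, dtw)
```

Because DTW may stretch the time axis, two series whose reporting starts in different years can end up at distance zero. The Euclidean metric compares year by year and keeps them apart. Which behavior is preferable depends on whether the timing of missing data matters for the analysis.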

In [None]:
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

# Matplotlib customization
%matplotlib inline
mpl.rcParams.update(mpl.rcParamsDefault)
mpl.rcParams['font.size'] = 14
mpl.rcParams['figure.dpi'] = 150.
mpl.rcParams["figure.figsize"] = (20,50)

In [None]:
# copy the data
df_groupBy_indicatorCode_normalized = df_groupBy_indicatorCode.copy()

# max-scaled normalization: divide by the maximum possible count (266, one per country)
# so that every value lies in [0, 1]
df_groupBy_indicatorCode_normalized = df_groupBy_indicatorCode_normalized / 266

# view normalized data
df_groupBy_indicatorCode_normalized.tail()

### Parameters

In [None]:
seed = 1
np.random.seed(seed)

In [None]:
# Set the number of clusters

cluster_number = 20

In [None]:
# training set (there's no testing set)

X_train_indicatorCode = to_time_series_dataset(df_groupBy_indicatorCode_normalized.copy())

### Functions

In [None]:
def euclideanKMeans(cluster, seed, X_train):
    print("Euclidean k-means")
    km = TimeSeriesKMeans(n_clusters=cluster, 
                          verbose=True, 
                          random_state=seed, 
                          max_iter=10)
    y_pred = km.fit_predict(X_train)
#     clusters = pd.Series(data=y_pred, index=X_train.index)
#     clusters

    plt.figure()
    for yi in range(cluster):
        plt.subplot(cluster, 1, yi+1)
        for xx in X_train[y_pred == yi]:
            plt.plot(xx.ravel(), "k-", alpha=.2)
        plt.plot(km.cluster_centers_[yi].ravel(), "r-")
        plt.ylim(0, 1)
        plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
                 transform=plt.gca().transAxes)

    print("Euclidean k-means Chart")
    plt.show()
    return y_pred

In [None]:
# DBA-k-means
def dbaKMeans(cluster, seed, X_train):
    print("DBA k-means")
    dba_km = TimeSeriesKMeans(n_clusters=cluster,
                              n_init=2,
                              metric="dtw",
                              verbose=True,
                              max_iter_barycenter=10,
                              random_state=seed)
    y_pred = dba_km.fit_predict(X_train)

    # start a fresh figure, as in euclideanKMeans
    plt.figure()

    for yi in range(cluster):
        plt.subplot(cluster, 1, yi+1)
        for xx in X_train[y_pred == yi]:
            plt.plot(xx.ravel(), "k-", alpha=.2)
        plt.plot(dba_km.cluster_centers_[yi].ravel(), "r-")
        plt.ylim(0, 1)
        plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
                 transform=plt.gca().transAxes)


    print("DBA k-means Chart")
    plt.show()
    return y_pred

In [None]:
# Soft-DTW-k-means
def softDTWKmean(cluster, seed, X_train):
    print("Soft-DTW k-means")
    sdtw_km = TimeSeriesKMeans(n_clusters=cluster,
                               metric="softdtw",
                               metric_params={"gamma": .01},
                               verbose=True,
                               random_state=seed)
    y_pred = sdtw_km.fit_predict(X_train)

    # start a fresh figure, as in euclideanKMeans
    plt.figure()

    for yi in range(cluster):
        plt.subplot(cluster, 1, yi+1)
        for xx in X_train[y_pred == yi]:
            plt.plot(xx.ravel(), "k-", alpha=.2)
        plt.plot(sdtw_km.cluster_centers_[yi].ravel(), "r-")
        plt.ylim(0, 1)
        plt.text(0.01, 0.50,'Cluster %d' % (yi + 1),
                 transform=plt.gca().transAxes)

    print("Soft-DTW k-means Chart")
    plt.show()
    return y_pred

In [None]:
def mergeClusterNames(y_pred, df_index):
    clusters = pd.Series(data=y_pred, index=df_index.index)
    df_cluster = clusters.to_frame()
    df_cluster.columns = ['cluster']
    return df_cluster

def getSingleCluster(df_cluster, n):
    # cluster n in the chart corresponds to label n-1 in the data
    display(df_cluster[df_cluster['cluster'] == n-1])

### Analysis

<span style="color:red">REMINDER:</span> **cluster 1 in the chart represents cluster 0 in data variable**
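
The offset between chart labels and data labels can be sketched as follows. The predicted labels and series names below are made up for illustration, and the `chart_label` column is a hypothetical helper rather than part of the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels as returned by fit_predict (0-based)
y_pred = np.array([0, 2, 1, 0])
names = ["A", "B", "C", "D"]

clusters = pd.DataFrame({"cluster": y_pred}, index=names)

# The charts label clusters starting from 1, so add 1 to the data label
clusters["chart_label"] = clusters["cluster"] + 1

# Members of the cluster drawn as "Cluster 1" in the chart
members_of_chart_cluster_1 = clusters.index[clusters["chart_label"] == 1].tolist()
print(members_of_chart_cluster_1)
```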

In [None]:
y_pred_Indicator_euclideanKM = euclideanKMeans(cluster_number, seed, X_train_indicatorCode)

In [None]:
print(y_pred_Indicator_euclideanKM)

In [None]:
cluster_Indicator_euclideanKM = mergeClusterNames(y_pred_Indicator_euclideanKM, df_groupBy_indicatorCode)
getSingleCluster(cluster_Indicator_euclideanKM, 2)

In [None]:
y_pred_Indicator_dbaKM = dbaKMeans(cluster_number, seed, X_train_indicatorCode)

In [None]:
print(y_pred_Indicator_dbaKM)

In [None]:
cluster_Indicator_dbaKM = mergeClusterNames(y_pred_Indicator_dbaKM, df_groupBy_indicatorCode)
getSingleCluster(cluster_Indicator_dbaKM, 10)

In [None]:
getSingleCluster(cluster_Indicator_dbaKM, 19)

In [None]:
# y_pred_Indicator_softDTWKM = softDTWKmean(cluster_number, seed, X_train_indicatorCode)

## Group by Country

In [None]:
# Country column

df_country = df[listOfObjectColumnNames].iloc[: , 0:1]
df_country

In [None]:
# concat df_country and df_years back together

df_countryAndYears = pd.concat([df_country, df_years], axis=1)

# final preprocessed data

df_countryAndYears.tail()

In [None]:
# Make a new copy
df_groupBy_countryName = df_countryAndYears.copy()

# Then group by country
df_groupBy_countryName = df_groupBy_countryName.groupby(['Country Name']).sum()
df_groupBy_countryName.tail()

In [None]:
# drop index column first
df_check = df_groupBy_countryName.reset_index(drop=True)

# check value_counts() for all columns (values should lie between 0 and 76)
df_check.apply(pd.Series.value_counts)

In [None]:
# copy the data
df_groupBy_countryName_normalized = df_groupBy_countryName.copy()

# max-scaled normalization: divide by the maximum possible count (76, one per indicator)
# so that every value lies in [0, 1]
df_groupBy_countryName_normalized = df_groupBy_countryName_normalized / 76

# view normalized data
df_groupBy_countryName_normalized.tail()

## Time series clustering based on country

In [None]:
# country train set

X_train_country = to_time_series_dataset(df_groupBy_countryName_normalized.copy())

In [None]:
y_pred_country_euclideanKM = euclideanKMeans(5, seed, X_train_country)

In [None]:
cluster_country_euclideanKM = mergeClusterNames(y_pred_country_euclideanKM, df_groupBy_countryName)

In [None]:
getSingleCluster(cluster_country_euclideanKM, 5)

## Conclusion

