# **KShape Clustering Notebook Description**

In this notebook, we show how we used the KShape algorithm to cluster the readings. We ran the algorithm on several different inputs and found that the best results came from the following pipeline:
* **The timeseries readings windowed from -30 to 40 with respect to sample detect time**
* **Normalizing these readings so that they are all between 0 and 1**
* **Concatenating the 10 most important predictors obtained from the variable importance plot from the RandomForest algorithm**
* **Standardizing each row (reading) to have a mean of 0 and a standard deviation of 1**
* **Clustering with K = 30**
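
The preprocessing steps listed above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual preprocessing code: the function name `preprocess` is hypothetical, and we assume that "between 0 and 1" means each reading is scaled by its own minimum and maximum.

```python
import numpy as np

def preprocess(ts_window, predictors):
    """Min-max scale each reading, append the predictors, then z-normalize each row."""
    ts_window = np.asarray(ts_window, dtype=float)
    predictors = np.asarray(predictors, dtype=float)

    # Step 1 (assumption): scale every reading to [0, 1] using its own min and max
    lo = ts_window.min(axis=1, keepdims=True)
    hi = ts_window.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero for flat readings
    ts_scaled = (ts_window - lo) / span

    # Step 2: concatenate the selected predictors to each reading
    rows = np.hstack([ts_scaled, predictors])

    # Step 3: standardize each row to mean 0 and standard deviation 1
    mu = rows.mean(axis=1, keepdims=True)
    sd = rows.std(axis=1, keepdims=True)
    sd = np.where(sd > 0, sd, 1.0)
    return (rows - mu) / sd

# Toy data standing in for the windowed readings and the 10 predictors
rng = np.random.default_rng(0)
ts_demo = rng.normal(size=(5, 350))
pred_demo = rng.normal(size=(5, 10))
X_demo = preprocess(ts_demo, pred_demo)
```

Each row of `X_demo` then has mean 0 and standard deviation 1, which is the form the KShape step expects.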

Here is a list of the other pipelines that we tried with KShape. None of them yielded successful results:
1. *Using only the timeseries readings*
    * Window timeseries readings from -30 to 40 w.r.t sample detect time
    * Normalize these readings so that they are all between 0 and 1
    * Standardize each reading to have a mean of 0 and a s.d of 1
    * Cluster with K = [20, 30]

2. *Using the timeseries readings + predictors with K = 20*
    * The timeseries readings windowed from -30 to 40 with respect to sample detect time
    * Normalizing these readings so that they are all between 0 and 1
    * Concatenating the 10 most important predictors obtained from the variable importance plot from the RandomForest algorithm
    * Standardizing each row (reading) to have a mean of 0 and a standard deviation of 1
    * Clustering with K = 20
    
3. *Not normalizing the timeseries data*
    * The timeseries readings windowed from -30 to 40 with respect to sample detect time
    * Concatenating the 10 most important predictors obtained from the variable importance plot from the RandomForest algorithm
    * Standardizing each row (reading) to have a mean of 0 and a standard deviation of 1
    * Clustering with K = 30

Clustering with this pipeline took much longer to run because the algorithm was not converging, and the resulting clusters did not represent distinguishable shapes. After reading the research paper on [KShape](http://www1.cs.columbia.edu/~jopa/Papers/PaparrizosSIGMOD2015.pdf), we understood that the distance measure it uses (cross-correlation) requires the readings to be contained "within a specified range [...] in order to meaningfully compare such sequences."
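
To make the role of cross-correlation more concrete, here is a simplified sketch of a shape-based distance: one minus the largest normalized cross-correlation over all shifts. It uses `np.correlate` directly rather than the FFT-based computation described in the paper, and the helper names `sbd` and `znorm` are ours, not part of tslearn.

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation over all shifts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return 1.0  # convention chosen for this sketch when a sequence is all zeros
    cc = np.correlate(x, y, mode="full")  # cross-correlation at every possible shift
    return 1.0 - cc.max() / denom

def znorm(x):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

t = np.linspace(0, 1, 200)
bump = np.exp(-((t - 0.4) ** 2) / 0.002)
shifted = np.exp(-((t - 0.6) ** 2) / 0.002)  # same shape, later in time
ramp = t.copy()                               # a different shape

d_same = sbd(znorm(bump), znorm(bump))
d_shift = sbd(znorm(bump), znorm(shifted))
d_ramp = sbd(znorm(bump), znorm(ramp))
```

In this toy example the shifted bump stays much closer to the original bump than the ramp does, which is the shift-tolerant behaviour the distance is designed for.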

4. *Normalizing the timeseries data, but not standardizing it*
    * The timeseries readings windowed from -30 to 40 with respect to sample detect time
    * Normalizing these readings so that they are all between 0 and 1
    * Concatenating the 10 most important predictors obtained from the variable importance plot from the RandomForest algorithm
    * Clustering with K = 30

We found that without standardizing the rows, the clusters obtained once again did not represent distinguishable shapes. After reading the research paper on [KShape](http://www1.cs.columbia.edu/~jopa/Papers/PaparrizosSIGMOD2015.pdf), we understood that to achieve scale invariance, we had to transform "each sequence [...] so that its mean µ is zero and its standard deviation σ is one". In the context of this problem (clustering the unsuccessful readings), scaling the readings to have mean 0 and s.d 1 is justifiable because it removes the impact that factors such as different fluid types can have on the readings.
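
The scale-invariance argument can be checked on a toy signal. The snippet below is only an illustration (the sine wave and the gain/offset values are arbitrary): a reading and a rescaled, offset copy of it become identical once each is standardized.

```python
import numpy as np

def znorm(x):
    """Standardize a sequence to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

t = np.linspace(0, 1, 100)
reading = np.sin(2 * np.pi * t)
rescaled = 3.5 * reading + 12.0  # same shape, different gain and offset

same_before = np.allclose(reading, rescaled)
same_after_znorm = np.allclose(znorm(reading), znorm(rescaled))
```

Here `same_before` is `False` and `same_after_znorm` is `True`: only the shape of the reading survives the standardization.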
    
5. *Using PCA to reduce the dimension of the timeseries data*
    * The timeseries readings windowed from -30 to 40 with respect to sample detect time
    * Normalizing these readings so that they are all between 0 and 1
    * Standardize each column (representing time points) to have mean 0 and standard deviation 1
    * Applying PCA for dimension reduction
    * The first 3 components accounted for 95 % of the cumulative variance
    * Concatenating the 10 most important predictors obtained from the variable importance plot from the RandomForest algorithm
    * Standardizing each column (first 3 PCs and 10 predictors) to have a mean of 0 and a standard deviation of 1
    * Clustering with K = 30

This yielded poor results. From our understanding, it might not make sense to standardize each column of the timeseries data. By doing so, we are assuming that there is equal variance at each time point across the different readings. This is probably not the case (there is less variance during the calibration window than during the sample window, for example), so by standardizing we are losing information that might be useful for clustering.
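
The effect described above can be reproduced on synthetic data. In this sketch the two windows and their noise levels are made up for illustration; the point is only that column-wise standardization forces every time point to the same variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_readings = 200
calib = rng.normal(0.0, 0.05, size=(n_readings, 50))   # quiet window (toy values)
sample = rng.normal(0.0, 1.00, size=(n_readings, 50))  # more variable window (toy values)
readings = np.hstack([calib, sample])

# Standardize each column (time point) across readings
std_before = readings.std(axis=0)
col_scaled = (readings - readings.mean(axis=0)) / std_before
std_after = col_scaled.std(axis=0)

# Ratio of the spread in the variable window to the spread in the quiet window
ratio_before = std_before[50:].mean() / std_before[:50].mean()
ratio_after = std_after[50:].mean() / std_after[:50].mean()
```

Before scaling, the second window is far more variable than the first; afterwards the ratio is 1, so the difference between the two windows is no longer visible to the clustering.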





## **Imports**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype

from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# Import the in house diagnostic function
from diagnostics import *

# Removes the warning when adding a column to df
pd.options.mode.chained_assignment = None

## **Read in the preprocessed data that is to be clustered**

In [None]:
# Read in all the preprocessed time series. 
ecd_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/ecd_smooth.csv')
syn_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/syn_smooth.csv')
cont_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/cont_smooth.csv')
un_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/un_smooth.csv')


# Make a new data frame with all the timeseries readings (unsuccessful + ECDs)
# Each file is included once; listing un_ts twice would duplicate the unsuccessful readings
ts = pd.concat([un_ts, ecd_ts, syn_ts, cont_ts])

# --------------------------------------------

# Read in the aggregate predictor files
un_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv')
ecd_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/ECD.csv')
syn_pred =  pd.read_csv('../Data/RawData/Predictors/SyntheticPC.csv')
con_pred =  pd.read_csv('../Data/RawData/Predictors/PCAggContaminated.csv')

# Add labels
ecd_pred['Label'] = 'wild' # ECDs in the 'wild'
syn_pred['Label'] = 'syn'  # Synthetic ECDs
con_pred['Label'] = 'con'  # Contaminated ECDs
un_pred['Label'] = 'un'    # Unsuccessful readings

# Make a new data frame with all the predictor readings (unsuccessful + ECDs)
pred = pd.concat([ecd_pred, syn_pred, con_pred, un_pred])
pred = pred.rename({'TestID' : 'TestId'}, axis = 1)

After obtaining a promising cluster (using KMeans) that contained many ECDs, we asked Olivia to verify whether the unsuccessful readings in that same cluster were ECDs. It turned out that 35 out of the 50 actually were. The others were linked to errors for another analyte (sample bubble or ECD). We therefore changed the label of the confirmed readings from `un` to `modified_ecd`.

In [None]:
# Change the label from 'un' to 'modified_ecd' for the TestIds that Olivia confirmed were actually ECD errors
un_ids_checked = pd.read_csv('../Data/RawData/un_ids_checked.csv')
pred = pred.merge(un_ids_checked, on = 'TestId', how = 'left')

In [None]:
# Relabel the unsuccessful readings that were confirmed to be ECD errors
confirmed_ecd = (pred['Label'] == 'un') & (pred['ECD'] == 'Yes')
pred.loc[confirmed_ecd, 'Label'] = 'modified_ecd'

pred = pred.drop(columns = 'ECD')

## **Select a subset of the predictors and concatenate them to our timeseries data**

The predictors that had the highest score when calculating the variable importance after fitting a RandomForest Classifier are used here.

**Using different sets of predictors gives noticeably different clusters. See the 2 sets tried below and the resulting distribution of their "ECD clusters"**

**Uncomment the *pred_subset* you wish to use**

Using the predictors below, we obtain **one** cluster with most of the ECDs. Here is the distribution of this cluster:
* 343 ECDs
* 376 unsuccessful readings.

In [None]:
#Same predictors as for the autoencoder
pred_subset = pred[['list of 22 predictors']]
pred_subset = pred_subset.reset_index(drop = True)

The cell below contains a subset of the most important predictors found using the RandomForest. Using these predictors results in **two** clusters that are highly concentrated in ECD readings, each having a distinguishable shape. They seem to be more 'pure' than the 'ecd' cluster obtained when we used the predictors in the cell above. Here is their distribution:
* 177 ECDs & 280 unsuccessful
* 160 ECDs & 22 unsuccessful

In [None]:
# pred_subset = pred[['list of 12 predictors']]

In [None]:
ts_pred = ts.merge(pred_subset, on = 'TestId', how = 'left')

## **Run KShape Clustering**

In [None]:
# For this method to operate properly, prior scaling is required (mean = 0, standard deviation = 1)
X = TimeSeriesScalerMeanVariance().fit_transform(ts_pred.drop(columns = ['TestId', 'Label']))


`init = 'random'` : ["Choose k observations (rows) at random from data for the initial centroids"](https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KShape.html)

`n_init = 10`: ["Number of time the k-Shape algorithm will be run with different centroid seeds"](https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KShape.html)

Setting `n_init = 10` makes the algorithm run much longer than with `n_init = 1`, but should yield better results because the returned output corresponds to the run that led to the smallest inertia. Inertia is the sum of squared distances between the sequences in a cluster and their corresponding centroid [(see reference here)](https://www.codecademy.com/learn/machine-learning/modules/dspath-clustering/cheatsheet).
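
As an illustration of the definition above, the sketch below computes inertia from a matrix of distances between readings and centroids, then picks the run with the smallest value. It follows the definition given in this notebook and is not tslearn's internal code; the function names and the toy distance matrices are hypothetical.

```python
import numpy as np

def inertia(distances, labels):
    """Sum of squared distances between each reading and its assigned centroid.

    distances: array of shape (n_readings, n_clusters)
    labels: assigned cluster index for each reading
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    own = distances[np.arange(len(labels)), labels]  # distance to the assigned centroid
    return float(np.sum(own ** 2))

def best_run(runs):
    """Return the index of the run with the smallest inertia, plus all the scores."""
    scores = [inertia(d, l) for d, l in runs]
    return int(np.argmin(scores)), scores

# Two toy "runs" of a 2-cluster algorithm on 3 readings
run_a = (np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]), np.array([0, 0, 1]))
run_b = (np.array([[0.5, 0.6], [0.4, 0.9], [0.8, 0.2]]), np.array([0, 0, 1]))
best_idx, scores = best_run([run_a, run_b])
```

With these toy values the first run has the lower inertia, so it is the one that would be kept.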

In [None]:
# Running the algorithm and getting the corresponding cluster number for each reading
# For the sake of this example, set n_init = 1
clusters = KShape(n_clusters = 30, n_init = 1, init='random').fit_predict(X)

## **Diagnostics**

First, we have to get the data into the proper format so that it can be fed to the other diagnostic functions in the `diagnostics.py` file. The `prepare_data` function returns a dataframe that contains all the timeseries and predictor data as well as the corresponding cluster for each `TestId`. Also, since we previously appended a column containing the labels ('un', 'wild', 'syn', 'con', 'modified_ecd') to the predictor dataframe (`pred`), this column is also in `diagnostic_df`.

In [None]:
diagnostic_df = prepare_data(ts_df = ts, pred_df = pred, clusters = clusters)

### **Get label counts for each cluster**

The function `get_label_counts` retrieves the number of readings from each category (contained in the `label_col`) in each cluster (e.g. number of unsuccessful, number of wild ECD errors, etc.) and stores them in a dataframe.

In [None]:
get_label_counts(ts_pred = diagnostic_df, clust_col = 'Cluster', label_col = 'Label').head()
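
The in-house `get_label_counts` is not shown in this notebook. As a rough idea of the kind of table it produces, a per-cluster count of labels can be built with `pd.crosstab`. The function below is a hypothetical stand-in, not the in-house implementation, and it assumes that every label other than `'un'` counts towards the `tot_ecds` column.

```python
import pandas as pd

def label_counts_sketch(df, clust_col="Cluster", label_col="Label"):
    """Count the readings of each label in each cluster (hypothetical stand-in)."""
    counts = pd.crosstab(df[clust_col], df[label_col])
    # Assumption: everything that is not 'un' is some kind of ECD
    ecd_labels = [c for c in counts.columns if c != "un"]
    counts["tot_ecds"] = counts[ecd_labels].sum(axis=1)
    return counts

toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 2],
    "Label": ["un", "wild", "syn", "un", "un", "con"],
})
toy_counts = label_counts_sketch(toy)
```

The result is indexed by cluster, with one column per label plus the total number of ECDs.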

### **Describe clusters**

The function `describe_clusters` prints the number of readings from each category in each cluster (e.g. number of unsuccessful, number of wild ECD errors, etc.). It also displays histograms for the desired aggregate predictors in each cluster. If `show_traces = True`, it plots the waveforms in each cluster, with one plot for the unsuccessful readings and one for the various kinds of ECD errors. The clusters that contain the highest number of ECD errors are displayed first.

In [None]:
describe_clusters(ts_pred = diagnostic_df, feature_list = ['Label', 'ReturnCode'], show_traces = True)

If we are interested in comparing the distributions of certain aggregate predictors we can also do that. 

In [None]:
compare_cluster_densities(ts_pred = diagnostic_df, clust1 = 6, clust2 = 17, 
                          feature_list = ['CMean', 'PMean', 'SMean'], clust_col = 'Cluster')

## **Centroids**

Since the dataframe used in the KShape algorithm includes both timeseries data and the most important predictors from the predictor file, the centroids that we obtain can't be expressed as a function of time (on the x-axis). Also, the centroid of each cluster is the ["eigenvector that corresponds to the largest eigenvalue"](http://www1.cs.columbia.edu/~jopa/Papers/PaparrizosSIGMOD2015.pdf) of a matrix found by maximizing the ["sum of squared similarity [between the centroid] and all the other timeseries sequences"](http://www1.cs.columbia.edu/~jopa/Papers/PaparrizosSIGMOD2015.pdf). This means that it is not clear what the centroid represents when combining aggregate predictors and timeseries data.
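
To illustrate the eigenvector formulation, here is a simplified sketch of the shape-extraction step for sequences that are assumed to be already aligned and standardized (the alignment step of the paper is skipped). The function name `extract_shape` and the sign convention are choices made for this sketch, not tslearn's implementation.

```python
import numpy as np

def extract_shape(aligned):
    """Centroid as the dominant eigenvector, for sequences that are already aligned."""
    X = np.asarray(aligned, dtype=float)
    m = X.shape[1]
    S = X.T @ X                              # similarity structure of the cluster members
    Q = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    M = Q.T @ S @ Q
    eigvals, eigvecs = np.linalg.eigh(M)     # M is symmetric; eigenvalues are ascending
    centroid = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
    # The sign of an eigenvector is arbitrary: pick the one pointing like the mean reading
    if centroid @ X.mean(axis=0) < 0:
        centroid = -centroid
    return centroid

# Toy cluster: noisy copies of one underlying shape, standardized row by row
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 80)
shape = np.sin(2 * np.pi * t)
members = np.array([shape + rng.normal(0, 0.1, t.size) for _ in range(20)])
members = (members - members.mean(axis=1, keepdims=True)) / members.std(axis=1, keepdims=True)

centroid = extract_shape(members)
similarity = np.corrcoef(centroid, shape)[0, 1]
```

On timeseries-only input the extracted vector closely follows the common shape. Once predictors are appended to each row, the same computation still runs, but the resulting vector no longer has a time axis to be read against.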

This being said, had we only used timeseries data for the clustering, the centroid would have effectively represented the "summary" shape found in the corresponding cluster. 

The code below can be used in the case where clustering is performed only with timeseries data and allows the user to create:
* A dataframe containing the centroid for each cluster using `format_centroids`
* Diagnostic plots comparing the waveforms in a cluster and its corresponding centroid shape using `centroid_diagnostic`

#### **Performing KShape only with timeseries data**

In [None]:
# Example of centroids had we used only timeseries data to cluster

# Dataframe containing only the timeseries data (the columns with numeric names) and the Label column
ts_only = ts_pred.loc[:,ts_pred.columns[~ts_pred.columns.str.isalpha()]]
ts_only = pd.concat([ts_pred['Label'], ts_only], axis = 1)

# For this method to operate properly, prior scaling is required (mean = 0, standard deviation = 1)
X = TimeSeriesScalerMeanVariance().fit_transform(ts_only.drop(columns = ['Label']))

# Here we are running `fit` and `predict` separately instead of the more efficient `fit_predict` because we need the information contained in 
# `.cluster_centers_` to plot the centroids.

# Running the algorithm and getting the corresponding cluster number for each reading
# n_init = 1 just for the sake of the example (shorter to run), but suggest increasing in case of actual use
k_shape = KShape(n_clusters = 30, n_init=1, init='random').fit(X)
clusters_centroid = k_shape.predict(X)

#### **Obtain the centroids**

In [None]:
def format_centroids(data, fit, clusters):
    """ Outputs a dataframe containing the centroid for each cluster. Each row is the centroid of one cluster. Each column is a timestamp.
    
    Args:
        data: The dataframe that was scaled and passed to the KShape algorithm when fitting. Its column names are used to name the centroid columns.
        fit: The fitted object returned by the `.fit()` method. Contains the attribute `cluster_centers_`.
        clusters: Numpy array containing the cluster label of each reading. The output from the `.predict()` statement.
        
    Returns:
        Pandas dataframe containing the centroids, with one row per cluster, one column per timestamp and a `Cluster` column.
    """
    # Going from a 3D numpy array (where 3rd dimension is 1), to a 2D array
    centroids = pd.DataFrame(fit.cluster_centers_[:,:,-1])
    name_columns = data.columns
    
    # Renaming the columns
    centroids.columns = name_columns
    
    # Keeping only the columns that correspond to a timestamp (excludes the predictors)
    name_columns_timestamp = name_columns[~name_columns.str.isalpha()]
    
    # Copy the slice so that adding the 'Cluster' column does not modify a view of `centroids`
    centroids_df = centroids.loc[:,name_columns.isin(name_columns_timestamp)].copy()
    centroids_df['Cluster'] = np.sort(np.unique(clusters))
    
    return centroids_df


In [None]:
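# Use the labels from the timeseries-only fit (`clusters_centroid`), not those from the earlier combined run
centroids = format_centroids(data = ts_only.drop(columns = ['Label']), fit = k_shape, clusters = clusters_centroid)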

#### **Diagnostics for the centroid**

In [None]:
diagnostic_df_centroid = prepare_data(ts_df = ts, pred_df = pred, clusters = clusters_centroid)

In [None]:
def centroid_diagnostic(diagnostic_df, centroid_df, clust_col = 'Cluster', label_col = 'Label'):
    """ Prints the total number of ECD and unsuccessful readings in each cluster and displays two plots per cluster.
    The first one shows the traces of each reading in that cluster and the second one shows its centroid.
    
    Args:
        diagnostic_df: A data frame containing the time series data and predictors as well as a column corresponding to the clusters (use the df output by `prepare_data` function).
        centroid_df: Pandas dataframe containing the centroids (output by the `format_centroids` function). 
        clust_col: The name of the column containing cluster labels. 
        label_col: The name of the column containing the data labels ('un'/'syn'/'con'/etc) 
        
    Returns:
        None. The counts are printed and the plots are displayed.
    """
    
    # Getting the names of the columns that are numeric/correspond to timeseries 
    name_columns = diagnostic_df.columns[~diagnostic_df.columns.str.isalpha()]
    
    # Here we remove the last 4 seconds because there seems to be some distortion
    # at the extremities. More research needs to be done to understand the root cause  of this distortion
    name_columns = name_columns[0:int(len(name_columns)-(4/0.2))] 
    
    # Dataframe containing only the timeseries data and a column corresponding to the assigned cluster
    timeseries = diagnostic_df.loc[:,name_columns]
    timeseries = pd.concat([timeseries, diagnostic_df[clust_col]], axis = 1)
    
    # Applying the same truncation (last 4 seconds removed) to the centroids
    clusters = centroid_df[clust_col] 
    centroid_df = centroid_df.loc[:, name_columns]
    centroid_df = pd.concat([centroid_df, clusters], axis = 1)
    
    start = timeseries.columns[0]
    end = timeseries.drop(columns = clust_col).columns[-1]
    
    counts = get_label_counts(diagnostic_df, clust_col = clust_col, label_col = label_col)
    
    # Printing the number of readings for each label
    for cluster in counts.index:
        # Print the cluster number: 
        print('\nCluster', cluster, '\n------------------------------------')
        # Subset the data frame with just the cluster.
        clust_subset = diagnostic_df[diagnostic_df[clust_col] == cluster]
        # Print some summary information
        for label in ['tot_ecds', 'un']:
            print("Number of", label, ":", counts[label][cluster])
    
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize = (75, 15))
        axes[0].plot(timeseries[timeseries[clust_col] == cluster].drop(columns = clust_col).transpose())
        axes[1].plot(centroid_df.drop(columns = clust_col).columns, centroid_df.drop(columns = clust_col).iloc[cluster])

        # Formatting the plots
        for i in range(2):
            axes[i].set_xlabel('time (secs) w.r.t sample detect', fontsize = 30)
            axes[i].xaxis.set_ticks([start,'-0.0', end])
            axes[i].tick_params(axis='x', labelsize=25)
            axes[i].tick_params(axis='y', labelsize=25)

        axes[0].set_ylabel('signal', fontsize = 30)
        axes[1].set_ylabel('centroid', fontsize = 30)

        axes[0].set_title(f'Readings in cluster {cluster}', fontsize = 35)
        axes[1].set_title(f'Centroid for cluster {cluster}', fontsize = 35)


        plt.show()

In [None]:
centroid_diagnostic(diagnostic_df = diagnostic_df_centroid, centroid_df = centroids)