# **KMeans Clustering Notebook Description**

Description: This notebook gives an example of running KMeans with Euclidean distance to cluster waveforms. It also demonstrates the in-house diagnostic functions that describe the characteristics of the waveforms in each cluster.

## **Imports**

In [None]:
from tslearn.clustering import KShape, TimeSeriesKMeans, silhouette_score
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline 

# Import the in-house diagnostic functions
from diagnostics import *

## **Read in the preprocessed data that is to be clustered**

In [None]:
# Read in all the preprocessed time series. 
ecd_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/ecd_smooth.csv')
syn_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/syn_smooth.csv')
cont_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/cont_smooth.csv')
un_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/un_smooth.csv')

# Make a new data frame with all of the different kinds of ECDs. 
allecd_ts = pd.concat([ecd_ts, cont_ts, syn_ts])
all_ts = pd.concat([un_ts, allecd_ts])

# Read in the aggregate predictor files
un_pred = pd.read_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv')
ecd_pred = pd.read_csv('../Data/PreprocessedData/Predictors/ECD.csv')
syn_pred = pd.read_csv('../Data/RawData/Predictors/SyntheticECD.csv')
con_pred = pd.read_csv('../Data/RawData/Predictors/ECDAggContaminated.csv')

# Make a new data frame with all of the predictors. 
all_preds = pd.concat([un_pred, ecd_pred, syn_pred, con_pred])
# Rename TestID to TestId to match the time series data. 
all_preds = all_preds.rename({'TestID':'TestId'}, axis = 1)

Save the test ids for later use. 

In [None]:
ECDids = allecd_ts['TestId']
unids = un_ts['TestId']

Create the labels for the different kinds of readings and store them for future use. 

In [None]:
# Create the labels. 
un_lab = pd.Series(['un']).repeat(len(un_ts))
ECD_lab = pd.Series(['ECD']).repeat(len(allecd_ts))
wild_lab = pd.Series(['wild']).repeat(len(ecd_ts))
cont_lab = pd.Series(['cont']).repeat(len(cont_ts))
syn_lab = pd.Series(['synth']).repeat(len(syn_ts))

# Store labels for all the categories, in the same row order as all_ts (un, wild ECD, cont, syn)
labs = pd.concat([un_lab, wild_lab, cont_lab, syn_lab]).reset_index(drop = True)
                                      
all_ts.reset_index(drop = True, inplace = True)
ids = all_ts['TestId']
#all_ts = all_ts.drop('TestId', axis = 1)
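
The labels only line up with the rows of `all_ts` because both are concatenated in the same order (unsuccessful, wild ECD, contaminated, synthetic). A minimal sketch of a sanity check for that assumption is given here on toy frames; the `TestId` values are made up and `build_labels` is a hypothetical helper, not part of the diagnostics script.

```python
import pandas as pd

# Toy stand-ins for the real frames; the TestId values are made up.
un_toy = pd.DataFrame({'TestId': [1, 2]})
ecd_toy = pd.DataFrame({'TestId': [3]})
cont_toy = pd.DataFrame({'TestId': [4, 5]})
syn_toy = pd.DataFrame({'TestId': [6]})


def build_labels(frames_by_label):
    """Hypothetical helper: one label per row, in concatenation order."""
    parts = [pd.Series([label]).repeat(len(frame))
             for label, frame in frames_by_label.items()]
    return pd.concat(parts).reset_index(drop=True)


# Dict order fixes both the row order and the label order.
frames = {'un': un_toy, 'wild': ecd_toy, 'cont': cont_toy, 'synth': syn_toy}
toy_ts = pd.concat(list(frames.values())).reset_index(drop=True)
toy_labs = build_labels(frames)

# The two objects must have the same length to be aligned row by row.
assert len(toy_labs) == len(toy_ts)

# Map each TestId to its label for spot checks.
label_by_id = dict(zip(toy_ts['TestId'], toy_labs))
```

Building the rows and the labels from one ordered mapping means a change to the concatenation order cannot silently shift the labels.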

Add in the CMean, PMean, and SMean (stored in the three `AggPred` columns) as additional predictors.

In [None]:
preds_subset = all_preds[['TestId','AggPred1', 'AggPred2', 'AggPred3']]

In [None]:
all_ts_and_preds = all_ts.merge(preds_subset, on = 'TestId', how = 'left')
all_ts_and_preds = all_ts_and_preds.drop('TestId', axis = 1)
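
A left merge keeps every row of `all_ts`, so a `TestId` with no matching predictor row ends up with `NaN` in the predictor columns, and a `TestId` that appears twice in the predictor table duplicates the corresponding time series row. Either case would affect the clustering input. The following is a small sketch of how both could be checked, using toy frames with made-up values rather than the real data.

```python
import pandas as pd

# Toy data: TestId 2 has no predictor row (values are made up).
ts_toy = pd.DataFrame({'TestId': [1, 2, 3], 't0': [0.1, 0.2, 0.3]})
preds_toy = pd.DataFrame({'TestId': [1, 3], 'AggPred1': [10.0, 30.0]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = ts_toy.merge(preds_toy, on='TestId', how='left', indicator=True)

# Time series rows that found no predictors.
missing_ids = merged.loc[merged['_merge'] == 'left_only', 'TestId'].tolist()

# Duplicate keys on the right side would multiply rows in the merge.
has_duplicate_ids = bool(preds_toy['TestId'].duplicated().any())

print('TestIds without predictors:', missing_ids)
print('Duplicate TestIds in predictors:', has_duplicate_ids)
```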

## **Run KMeans Clustering**

In [None]:
# Scale each series to zero mean and unit variance; the clustering needs this to work properly
X_train = TimeSeriesScalerMeanVariance().fit_transform(all_ts_and_preds)
sz = X_train.shape[1]


# KMeans clustering with Euclidean distance
ks = TimeSeriesKMeans(n_clusters=20, verbose=True, random_state=0, metric = 'euclidean', n_jobs = -1)
y_pred = ks.fit_predict(X_train)
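
With its default settings, `TimeSeriesScalerMeanVariance` rescales each series to zero mean and unit variance along the time axis. Because the aggregate predictors are merged in as extra columns, they are scaled together with the waveform samples of the same row. The snippet below is a NumPy-only sketch of that per-row scaling on toy numbers; `z_normalize_rows` is a hypothetical name and this is not the tslearn implementation.

```python
import numpy as np


def z_normalize_rows(X, eps=1e-12):
    """Scale each row to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    # Leave constant rows at zero instead of dividing by zero.
    std = np.where(std < eps, 1.0, std)
    return (X - mean) / std


# Two toy series: one increasing, one constant.
toy = np.array([[1.0, 2.0, 3.0, 4.0],
                [10.0, 10.0, 10.0, 10.0]])
scaled = z_normalize_rows(toy)
print(scaled)
```

After this scaling, Euclidean distance compares the shapes of the series rather than their offsets or amplitudes.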

## **Do some diagnostics**

In [None]:
ts_pred = prepare_data(all_ts, all_preds, clusters = y_pred, labels = labs)
describe_clusters(ts_pred, ['ReturnCode', 'AggPred1', 'AggPred2', 'AggPred3'], start = -30, end = 39.8)

Let's say we're interested in how clusters 5 and 6 differ in terms of CMean, SMean, and PMean. We can use the `compare_cluster_densities()` function from the diagnostics script to compare the densities of these predictors for the two clusters.

In [None]:
compare_cluster_densities(ts_pred, clust1 = 5, clust2 = 6, 
                          feature_list = ['AggPred1', 'AggPred2', 'AggPred3'], clust_col = 'Cluster')