# **Autoencoder and Clustering**

["Autoencoders are a type of artificial neural network used to learn efficient data patterns in an unsupervised manner"](https://www.sciencedirect.com/topics/engineering/autoencoder#:~:text=An%20autoencoder%20is%20a%20type,by%20a%20hidden%20layer%20h.). An autoencoder typically consists of an encoder and a decoder.
- Encoder: Used to encode the raw waveform and reduce it to a smaller dimension.
- Decoder: Used to reconstruct the original waveform with the encoded data.

Important features of an autoencoder:
- Noise reduction
- Dimensionality Reduction
- Feature Extraction

To obtain better clustering results, we will use both the timeseries data (waveforms) and the aggregate predictors file. Here is the pipeline used in this notebook:

1. **Use an encoder on the timeseries data (from -30 to 40 w.r.t sample detect, normalized and smoothed) for feature extraction**
2. **Perform dimension reduction on these features using PCA to explain 95% of the cumulative variance**
3. **Concatenate the most important predictors (found using the RandomForest Classifier variable importance) to the principal components obtained in step 2**
4. **Perform clustering using Gaussian Mixture Modelling**


## Imports

In [None]:
import numpy as np  
import pandas as pd                                  
import altair as alt
import matplotlib.pyplot as plt
import tensorflow as tf 
import random as python_random

from numpy.random import seed                       
from sklearn.model_selection import train_test_split 
from keras.layers import Input, Dense                # keras function for defining the embedded layers within the autoencoder
from keras.models import Model                       # to model the autoencoder      
from sklearn.mixture import GaussianMixture          
from sklearn.decomposition import PCA 

# Our script to plot various diagnostic plots
from diagnostics import *

np.random.seed(123)                                 
python_random.seed(123)                             
tf.random.set_seed(1234)  

alt.data_transformers.disable_max_rows()

# Removes the warning when adding a column to df
pd.options.mode.chained_assignment = None

## Read in the preprocessed data to be clustered

#### Timeseries

In [None]:
ecd_ts = pd.read_csv("../Data/PreprocessedData/TimeSeries/ecd_smooth.csv")
syn_ts = pd.read_csv("../Data/PreprocessedData/TimeSeries/syn_smooth.csv")
con_ts = pd.read_csv("../Data/PreprocessedData/TimeSeries/cont_smooth.csv")
un_ts = pd.read_csv("../Data/PreprocessedData/TimeSeries/un_smooth.csv")

#### Predictors

In [None]:
un_pred = pd.read_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv')
ecd_pred = pd.read_csv('../Data/PreprocessedData/Predictors/ecdContact.csv')
syn_pred = pd.read_csv('../Data/RawData/Predictors/SyntheticECD.csv')
con_pred =  pd.read_csv('../Data/RawData/Predictors/ECDAggContaminated.csv')

## Data Wrangling

Here we add labels to the timeseries and predictor files (e.g. 'ecd', 'syn'). We then merge them to obtain one dataframe with all the timeseries data and another with all the predictor data, aligned on TestId so that they can be combined later.

In [None]:
## Adding labels for predictor data
un_pred['Label'] = "un"
ecd_pred['Label'] = "ecd"
syn_pred['Label'] = "syn"
con_pred['Label'] = "con"

In [None]:
## Adding labels for timeseries data
un_ts['Label'] = "un"
ecd_ts['Label'] = "ecd"
syn_ts['Label'] = "syn"
con_ts['Label'] = "con"

In [None]:
## Merging the timeseries data into a single dataframe.
ts = pd.concat([ecd_ts, syn_ts, con_ts, un_ts])            
ts = ts.reset_index(drop = True)

In [None]:
## Merging the predictor files and renaming the TestID column to match with timeseries file.
preds = pd.concat([un_pred, ecd_pred, syn_pred, con_pred])
preds = preds.rename({'TestID':'TestId'}, axis = 1)

# Because we preprocessed the timeseries data, some TestIds in the predictor files are no longer in the timeseries data
# Thus, we keep only the predictor rows whose TestId is still present in ts (left merge from the ts TestIds)
preds = pd.DataFrame(ts['TestId']).merge(preds, on = 'TestId', how = 'left')
preds = preds.reset_index(drop = True)
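
To illustrate the left merge above, here is a minimal sketch on toy data (the ID values and the `AggPred1` numbers are made up). Starting from the `TestId` column of the timeseries frame keeps one row per waveform, in the timeseries order, and drops predictor rows whose `TestId` was removed during preprocessing. The `validate` argument is an optional safeguard that is not used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are illustrative only.
ts_toy = pd.DataFrame({"TestId": [3, 1, 2]})
preds_toy = pd.DataFrame({"TestId": [1, 2, 3, 4], "AggPred1": [0.1, 0.2, 0.3, 0.4]})

# Left merge from the timeseries ids: keeps the ts row order and drops TestId 4.
aligned = ts_toy[["TestId"]].merge(preds_toy, on="TestId", how="left")

# validate="one_to_one" raises if either side holds duplicate TestIds,
# which would otherwise silently add rows and break positional alignment.
checked = ts_toy[["TestId"]].merge(
    preds_toy, on="TestId", how="left", validate="one_to_one"
)
print(aligned)
```

Keeping the row order identical in both frames matters later, because labels and TestIds are re-attached by position with `pd.concat(..., axis = 1)`.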

After obtaining a promising cluster (using KMeans) that contained many ecd contacts, we asked Olivia to verify whether the unsuccessful readings in that same cluster were actually ecd contacts. It turned out that 35 of the 50 were. The others were linked to errors for another analyte (sample bubble or ecd contact). Thus, we changed the label of the confirmed readings from `un` to `modified_ecd`.

In [None]:
# Change the label from 'un' to 'modified_ecd' for the TestIds that Olivia confirmed were actually ecd contact errors
un_ids_checked = pd.read_csv('../Data/RawData/un_ids_checked.csv')

preds = preds.merge(un_ids_checked, on = 'TestId', how = 'left')
ts = ts.merge(un_ids_checked, on = 'TestId', how = 'left')

In [None]:
# Relabel with a boolean mask and .loc: chained indexing such as preds['Label'][row] = ...
# may write to a temporary copy, and looping over rows is slow
preds.loc[(preds['Label'] == 'un') & (preds['ecd'] == 'Yes'), 'Label'] = 'modified_ecd'
ts.loc[(ts['Label'] == 'un') & (ts['ecd'] == 'Yes'), 'Label'] = 'modified_ecd'

preds = preds.drop(columns = 'ecd')
ts = ts.drop(columns = 'ecd')

After wrangling, we have two dataframes:
* ts : Dataframe which contains only the smoothed (convolution filter) and normalized (between 0 and 1) waveforms from -30 to 40 (w.r.t sample detect time) 

* preds : Dataframe which contains all the aggregate predictors (only for the TestIds that are also in ts).
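
The smoothing and normalization themselves happen in an earlier preprocessing step that is not shown here. As a rough sketch of what "convolution filter, then scaling to [0, 1]" can look like, assuming a simple moving-average kernel (the window size, the kernel shape, and the helper name `smooth_and_normalize` are assumptions, not the project's actual preprocessing):

```python
import numpy as np

def smooth_and_normalize(waveform, window=5):
    """Moving-average smoothing followed by min-max scaling to [0, 1]."""
    kernel = np.ones(window) / window                       # uniform averaging kernel
    smoothed = np.convolve(waveform, kernel, mode="same")   # output has the input length
    lo, hi = smoothed.min(), smoothed.max()
    if hi == lo:                                            # flat signal: avoid dividing by zero
        return np.zeros_like(smoothed)
    return (smoothed - lo) / (hi - lo)

# Made-up waveform with a single peak.
raw = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0, 0.0])
processed = smooth_and_normalize(raw, window=3)
print(processed.round(3))
```

Scaling every waveform to [0, 1] is also what makes a sigmoid output layer a sensible choice for the reconstruction.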


## **1. Extracting features using an autoencoder**

In [None]:
# Removing the labels from the data to feed in to the autoencoder.
X = ts.drop(columns = ['TestId', 'Label']) 
Y = ts['Label']   

# Storing the TestIds in a separate variable
ids = ts['TestId'] 

We need to split the data into training and testing sets: the autoencoder learns to extract features from the training set, and the testing set is used to validate the reconstruction. The autoencoder has to be trained with both the encoder and the decoder, but afterwards we use only the encoder part to extract the features. 

In [None]:
# shuffle = True shuffles the rows before splitting, so both sets contain a mix of labels instead of blocks of a single label.
    
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, shuffle = True)
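
The call above does not pass `random_state`, so the split depends on NumPy's global random state. The sketch below shows, with plain NumPy and a made-up row count, what a seeded, shuffled 80/20 split amounts to. The helper name `shuffled_split` and the seed are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def shuffled_split(n_rows, test_size=0.2, seed=123):
    """Return (train_idx, test_idx) for a reproducible shuffled split."""
    rng = np.random.default_rng(seed)            # local generator, independent of global state
    order = rng.permutation(n_rows)              # shuffled row positions
    n_test = int(np.ceil(n_rows * test_size))    # size of the held-out set
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))
```

In scikit-learn, passing `random_state` to `train_test_split` serves the same purpose of making the split repeatable regardless of which cells were run beforehand.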


In [None]:
# Defining the number of features
n_col = X.shape[1]                                       

# Defining encoding dimensions
# Since we are only extracting features we have chosen 349, as we do not want to reduce the dimensionality of the data.
encoding_dim = 349                                        

# Defining the input_dim which needs to be equal to output_dim
# Important: The input and output dimensions should always be the same 
input_dim = Input(shape = (n_col, ))                      

# Defining the encoding layers: stacking several layers lets the model learn more complex features.
# relu function : a piecewise linear function that outputs the input directly if it is positive and zero otherwise.
# This is the encoder: the layer widths shrink from 3000 to 250 (dimension reduction) before the final encoding layer of size encoding_dim.
en1 = Dense(3000, activation = 'relu')(input_dim)
en2 = Dense(2750, activation = 'relu')(en1)
en3 = Dense(2500, activation = 'relu')(en2)
en4 = Dense(2250, activation = 'relu')(en3)
en5 = Dense(2000, activation = 'relu')(en4)               
en6 = Dense(1750, activation = 'relu')(en5)               
en7 = Dense(1500, activation = 'relu')(en6) 
en8 = Dense(1250, activation = 'relu')(en7)
en9 = Dense(1000, activation = 'relu')(en8)
en10 = Dense(750, activation = 'relu')(en9)
en11 = Dense(500, activation = 'relu')(en10)
en12 = Dense(250, activation = 'relu')(en11)
en13 = Dense(encoding_dim, activation = 'relu')(en12)

# Defining the decoder layers (same number of layers as the encoder)
# Parameters were chosen after multiple trial-and-error runs
# This is the decoder: the layer widths grow back from 250 to 3000 (for reconstruction), and the sigmoid output matches the [0, 1] range of the normalized waveforms
de1 = Dense(250, activation = 'relu')(en13)               
de2 = Dense(500, activation = 'relu')(de1)                 
de3 = Dense(750, activation = 'relu')(de2)                
de4 = Dense(1000, activation = 'relu')(de3)
de5 = Dense(1250, activation = 'relu')(de4)
de6 = Dense(1500, activation = 'relu')(de5)
de7 = Dense(1750, activation = 'relu')(de6)
de8 = Dense(2000, activation = 'relu')(de7)
de9 = Dense(2250, activation = 'relu')(de8)
de10 = Dense(2500, activation = 'relu')(de9)
de11 = Dense(2750, activation = 'relu')(de10)
de12 = Dense(3000, activation = 'relu')(de11)
de13 = Dense(n_col, activation = 'sigmoid')(de12)

# Combining both encoding and decoding layers
autoencoder = Model(inputs = input_dim, outputs = de13) 

# Compiling the Model
autoencoder.compile(optimizer = 'adadelta', loss = 'binary_crossentropy')  

In [None]:
# summary() prints each layer's output shape and number of trainable parameters (uncomment to inspect the model)
# autoencoder.summary() 
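
To get a feel for the size of this model without running `summary()`, the parameter count of a stack of `Dense` layers can be computed by hand: each layer has `n_in * n_out` weights plus `n_out` biases. The sketch below applies this to the layer widths defined above. The input width `n_col_assumed = 700` is a hypothetical placeholder (the real value is `X.shape[1]`), and `dense_param_count` is a helper introduced only for this illustration.

```python
def dense_param_count(widths):
    """Total weights + biases for a stack of fully connected layers.

    widths[0] is the input size; every later entry is a layer's unit count.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

n_col_assumed = 700  # hypothetical input width; replace with X.shape[1]

encoder_widths = [n_col_assumed, 3000, 2750, 2500, 2250, 2000, 1750, 1500,
                  1250, 1000, 750, 500, 250, 349]
decoder_widths = [349, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2250,
                  2500, 2750, 3000, n_col_assumed]

total = dense_param_count(encoder_widths) + dense_param_count(decoder_widths)
print(f"{total:,} trainable parameters")
```

With widths in the thousands, the count lands in the tens of millions, which is worth comparing against the number of training waveforms when judging the risk of overfitting.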

In [None]:
# Training the autoencoder on X_train and validating on the X_test dataset,
# parameters:
    # epochs = 10 (can be increased to further reduce the loss)
    # batch_size = 32 (each gradient update uses 32 waveforms; can be tuned further)

autoencoder.fit(X_train, X_train, epochs = 10, batch_size = 32, shuffle = False, validation_data = (X_test, X_test))
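
When reading the loss values printed during training, it helps to know that binary cross-entropy does not reach zero when the targets are continuous values in [0, 1] rather than hard 0/1 labels: even a perfect reconstruction leaves a positive floor. A small NumPy sketch with made-up values (the clipping constant `eps` is an assumption chosen for numerical safety):

```python
import numpy as np

def binary_crossentropy(target, recon, eps=1e-7):
    """Mean binary cross-entropy between targets and reconstructions in [0, 1]."""
    recon = np.clip(recon, eps, 1 - eps)   # avoid log(0)
    losses = -(target * np.log(recon) + (1 - target) * np.log(1 - recon))
    return float(np.mean(losses))

target = np.array([0.1, 0.5, 0.9])                               # normalized waveform samples
perfect = binary_crossentropy(target, target)                    # exact reconstruction
worse = binary_crossentropy(target, np.array([0.3, 0.5, 0.7]))   # imperfect reconstruction

print(perfect, worse)
```

The loss is still smallest when the reconstruction equals the target, so it remains a valid training objective. Only its absolute value should not be read as "distance from perfect".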

Now that the autoencoder is fully trained, we can use only the encoder part to extract features from the timeseries.

In [None]:
# output is the last layer of the encoding part (here en13)
encoder = Model(inputs = input_dim, outputs = en13)

# Input placeholder with encoding_dim dimensions (would feed a standalone decoder; it is not used in the rest of this notebook)
encoded_input = Input(shape = (encoding_dim, )) 

# Pass the entire dataset X through the trained encoder to extract the features.
features = pd.DataFrame(encoder.predict(X)) 

# Adding the column names for the extracted features as feature_*
features = features.add_prefix('feature_') 

# Adding the labels and TestIds back to the newly created dataframe features
features['Label'] = Y                                       
features['TestId'] = ids                                     

features.head()

We can see that a few of the columns contain only zeros. Hence, they don't add any value and will be removed.

In [None]:
# Removing the features whose values are all zero, as they carry no information.
data_without_zero_features = features.loc[:, (features != 0).any(axis=0)]
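
The column filter above keeps a column as soon as at least one of its values is non-zero. Columns that are zero for every row typically come from ReLU units that never activate. A toy illustration with made-up values:

```python
import pandas as pd

# feature_0 plays the role of a "dead" ReLU unit: zero for every row.
toy = pd.DataFrame({
    "feature_0": [0.0, 0.0, 0.0],
    "feature_1": [0.0, 1.2, 0.4],
    "feature_2": [0.7, 0.0, 0.0],
})

keep = (toy != 0).any(axis=0)   # True for columns with at least one non-zero entry
kept = toy.loc[:, keep]
print(list(kept.columns))
```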

## **2. Performing Principal Component Analysis(PCA) for dimension reduction**

We perform PCA to reduce the number of encoded features (~350 here) to the components that explain 95% of the variance.

In [None]:
data_without_zero_features.head()

In [None]:
# Splitting the "predictors" (X) and the "response" variable (Y)
X_pca = data_without_zero_features.drop(columns = ['Label', 'TestId'])          
Y_pca = data_without_zero_features[["TestId",'Label']] 

In [None]:
## Fitting PCA with all components to plot the cumulative variance explained, which helps decide how many components to keep in the next steps.

def pca_plot(data):
    '''
    Function to determine how many components to choose based on the cumulative variance explained.
    Args:
        data : The dataframe on which PCA needs to be performed.
    '''
    pca = PCA().fit(data) 
    # Plotting the cumulative variance explained with number of components as x axis and variation explained on y axis.
    plt.plot(np.cumsum(pca.explained_variance_ratio_))  
    plt.xlim(0, 20)   # xlim takes only the left and right limits (no step argument)
    plt.xlabel('Number of components')                  
    plt.ylabel('Cumulative variance explained')        

In [None]:
pca_plot(X_pca)

In [None]:
## Setting the threshold to 0.95, i.e choose only those components that account for a variation of 95% 

def pca_components(data, percent = 0.95):
    '''Function to get Principal Components Analysis based on the number of components or the threshold passed.
    Args :
        data : Predictors on which the PCA needs to be applied on.
        percent : The percent of variation we want to account for with the PCA or the number of components we wish to obtain.
    Returns:
        Dataframe with the PCA.
    '''
    pca = PCA(n_components = percent)                             
    pc = pca.fit_transform(data)                                
    component_names = [f"PC{i+1}" for i in range(pc.shape[1])]  
    newdata = pd.DataFrame(pc, columns=component_names)         
    return newdata                                              

In [None]:
pca_df = pca_components(X_pca,0.95)
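
Passing a fraction such as 0.95 as `n_components` asks scikit-learn's `PCA` to keep just enough components to reach that share of explained variance. A rough NumPy sketch of that selection rule on synthetic data follows. The helper `components_for_variance` and the toy matrix are assumptions for illustration, and scikit-learn's exact behaviour when the cumulative variance equals the threshold may differ.

```python
import numpy as np

def components_for_variance(data, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)   # explained variance ratio
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold) + 1)        # first index reaching threshold
    return k, cumulative

# Synthetic data: two high-variance directions and three nearly flat ones.
rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

k, cumulative = components_for_variance(toy)
print(k, cumulative.round(3))
```

This is the same quantity that `pca_plot` displays: the number of components is read off where the cumulative curve first crosses the chosen threshold.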

In [None]:
# Adding the TestId and Label back to the new dataframe pca_df created
pca_df = pd.concat([pca_df, Y_pca], axis = 1)
pca_df.head()

In [None]:
## Plotting the first two PC components to visualize the points
alt.Chart(pca_df).mark_circle(size=60).encode(
    x='PC1',
    y='PC2',
    color='Label'
).interactive()

## **3. Concatenate the most important predictors to the principal components obtained above**

Here, we use the predictors (from the aggregate predictors file) that scored the highest in terms of variable importance when running a RandomForest Classifier.

These were the predictors that were kept:

- 12 aggregate predictors (deleted here for confidentiality)

In [None]:
pred_subset = preds[['list of 12 predictors']]
pred_subset = pred_subset.reset_index(drop = True)

Now that we have two dataframes ready, one with the principal components (pca_df) and one with the aggregate predictors (pred_subset), we combine them to form our final predictors for the clustering.

In [None]:
# Dropping the `Label` column because it is already in the pca_df
pca_pred = pca_df.merge(pred_subset.drop(columns = 'Label'), on = 'TestId', how = 'left')

In [None]:
# Storing the TestId and Label in a separate dataframe.
Y_pca_pred = pca_pred[['Label','TestId']]                         

# Dropping them from the final predictors
pca_pred = pca_pred.drop(columns = ['Label','TestId']) 

## **4. Clustering using Gaussian Mixture Modeling**

Gaussian mixture modelling is a soft-clustering approach: it assumes the data come from a mixture of Gaussian distributions, so each data point belongs to each cluster with a certain probability.
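
To make the "certain probability" concrete, here is a minimal NumPy sketch of how membership probabilities (often called responsibilities) are computed for a one-dimensional mixture with known parameters. The weights, means, and standard deviations are made up, and the helper `responsibilities` is only illustrative. In scikit-learn the corresponding quantity is returned by `GaussianMixture.predict_proba`, while `predict` returns the most probable component.

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    """Posterior probability of each 1-D Gaussian component for each point."""
    x = np.asarray(x, dtype=float)[:, None]                      # shape (n_points, 1)
    density = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    weighted = weights * density                                 # prior times likelihood
    return weighted / weighted.sum(axis=1, keepdims=True)        # normalize per point

weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

resp = responsibilities([0.0, 2.0, 4.0], weights, means, stds)
print(resp.round(3))
```

A point halfway between the two means is split evenly between the components, whereas points near a mean are assigned almost entirely to that component.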

In [None]:
# Helper function to plot the BIC scores for different numbers of clusters. This helps us choose the number of clusters based on the lowest BIC score. 

def cluster_plot(data,n):
    '''
    Function to plot the number of components with BIC scores to choose the ideal number of clusters.
    Args :
        data : Dataframe on which clustering needs to be performed
        n : Upper bound (exclusive) on the number of clusters to try 
    '''
    ## candidate numbers of components: 1, ..., n - 1
    n_components = np.arange(1, n)
    
    ## fitting one Gaussian mixture model per candidate number of components
    models = [GaussianMixture(k, covariance_type='full', random_state=0).fit(data) for k in n_components]
    plt.plot(n_components, [m.bic(data) for m in models], label='BIC')
    plt.legend(loc='best')
    plt.xlabel('Number of clusters');


In [None]:
## To choose the number of clusters, fit models with 1 to 29 components and look for the lowest BIC. BIC rewards a high likelihood but penalizes the number of model parameters, so lower is better.
cluster_plot(pca_pred, 30)

The smallest value of BIC occurs when the number of clusters is approximately 20. There does seem to be an elbow at approximately 5, but with that number of clusters we noticed that many different waveform shapes were being clustered together, hence we decided to increase the number of clusters.
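
For reference, BIC is computed as `n_parameters * ln(n_samples) - 2 * log_likelihood`. The sketch below counts the free parameters of a full-covariance Gaussian mixture and shows how the penalty term grows with the number of clusters. The feature count of 20 and the sample count of 5000 are made-up numbers, and both helper functions are illustrative rather than part of the notebook.

```python
import math

def gmm_full_n_parameters(n_components, n_features):
    """Free parameters of a Gaussian mixture with full covariance matrices."""
    covariances = n_components * n_features * (n_features + 1) // 2   # symmetric matrices
    means = n_components * n_features
    mixing_weights = n_components - 1                                 # weights sum to 1
    return covariances + means + mixing_weights

def bic(total_log_likelihood, n_parameters, n_samples):
    """Bayesian information criterion; lower values are preferred."""
    return n_parameters * math.log(n_samples) - 2.0 * total_log_likelihood

# Penalty term only, for hypothetical data with 20 features and 5000 rows.
for n_clusters in (5, 20):
    n_params = gmm_full_n_parameters(n_clusters, n_features=20)
    print(n_clusters, n_params, round(n_params * math.log(5000)))
```

Because the parameter count grows linearly with the number of clusters, a model with more clusters only achieves a lower BIC if its log-likelihood improves by more than the added penalty.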

In [None]:
### Fitting the Gaussian mixture model

def gaussian_fit(data, n):
    '''
    Function to fit a gaussian mixture model 
    Args :
        data : Dataframe on which clustering needs to be performed
        n : The number of potential clusters 
    
    Returns : The input dataframe (data) with an appended column for the corresponding cluster labels obtained
    
    '''
    gmm = GaussianMixture(n_components=n)                              
    gmm.fit(data)                                                      
    labels = gmm.predict(data) 
    # Assigning the Cluster to a column in the dataframe
    data['Cluster'] = labels                                          
    
    return data

In [None]:
gaussian_df = gaussian_fit(pca_pred, 20)

In [None]:
# Merging back the TestId and Label
gaussian_df = pd.concat([gaussian_df, Y_pca_pred], axis = 1)                         

## **Diagnostics**

In [None]:
# Adding the cluster label to the timeseries data
ts_df = ts.merge(gaussian_df[['TestId', 'Cluster']], on = 'TestId', how = 'left')

In [None]:
# Dropping the Label column from ts_df because it is already in pred_df
diagnostic_df = prepare_data(ts_df = ts_df.drop(columns = 'Label'), pred_df = preds)

In [None]:
get_label_counts(ts_pred = diagnostic_df)
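
The `get_label_counts` helper comes from our `diagnostics` script, which is not shown here. As a rough idea of the kind of summary that is useful at this stage (this is a sketch, not that function's implementation), a cross-tabulation of cluster against label shows which clusters are dominated by which kind of reading. The toy values below are made up.

```python
import pandas as pd

# Made-up cluster assignments and labels.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1, 2],
    "Label": ["ecd", "modified_ecd", "un", "un", "syn", "con"],
})

counts = pd.crosstab(toy["Cluster"], toy["Label"])    # rows: clusters, columns: labels
share = counts.div(counts.sum(axis=1), axis=0)        # label proportions within each cluster
print(counts)
print(share.round(2))
```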

In [None]:
describe_clusters(ts_pred = diagnostic_df, feature_list = ['AggPred1', 'AggPred2'])