# **SOM Notebook Description**

This notebook uses a self-organizing map (SOM) to cluster the waveforms. Each waveform is first subset to the window from 30 seconds before sample detect to 40 seconds after sample detect, then normalized to the range 0 to 1, and finally smoothed with a convolution smoother. The most important features according to the random forest are used as additional predictors.
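The preprocessing itself happens upstream and is not shown in this notebook. The following is only a minimal sketch of the three steps described above, assuming a moving-average kernel as the convolution smoother, sample detect at time 0, and a window that excludes the 40-second endpoint. The function name `preprocess_waveform` and the `window` parameter are hypothetical.

```python
import numpy as np

def preprocess_waveform(times, values, start=-30.0, end=40.0, window=5):
    """Subset to [start, end), min-max scale to [0, 1], then smooth.

    Hypothetical helper: the kernel and window width are assumptions,
    not the settings used to produce the *_smooth.csv files.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)

    # Keep only the samples around sample detect (assumed to be time 0).
    mask = (times >= start) & (times < end)
    t, v = times[mask], values[mask]

    # Min-max normalization; a flat signal maps to all zeros.
    span = v.max() - v.min()
    scaled = (v - v.min()) / span if span > 0 else np.zeros_like(v)

    # Moving-average convolution; mode='same' keeps the original length.
    kernel = np.ones(window) / window
    smoothed = np.convolve(scaled, kernel, mode='same')
    return t, smoothed
```

Note that `mode='same'` pads the edges with zeros, so the first and last few smoothed values are pulled toward zero.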

## **Imports**

In [None]:
import pandas as pd            
import numpy as np           
import matplotlib.pyplot as plt                                          
# import package used for SOM algorithm
from minisom import MiniSom
# import in-house diagnostic functions
from diagnostics import *

## **Read in the Preprocessed Data that is to be Clustered**

## Load the smoothed time series

In [None]:
ecd_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/ecd_smooth.csv')            
syn_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/syn_smooth.csv')               
cont_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/cont_smooth.csv')       
un_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/un_smooth.csv')                

# Make a new data frame with all of the different kinds of ECDs. 
allecd_ts = pd.concat([ecd_ts, cont_ts, syn_ts])
# Make a data frame with all of the waveforms together. 
all_ts = pd.concat([un_ts, allecd_ts]) 

## Load the predictor files

In [None]:
un_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv')
ecd_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/ECD.csv')
syn_pred =  pd.read_csv('../Data/RawData/Predictors/SyntheticPC.csv')
con_pred =  pd.read_csv('../Data/RawData/Predictors/PCAggContaminated.csv')

# Make a new data frame with all of the predictors. 
all_preds = pd.concat([un_pred, ecd_pred, syn_pred, con_pred])
# Rename TestID to TestId to match the time series data. 
all_preds = all_preds.rename({'TestID':'TestId'}, axis = 1)

## Make a list of the most important features according to random forest. 

In [None]:
# Subset of the predictors data containing the most important features according to random forest. 
preds_subset = all_preds[['TestId', 'CExtrapolation',
                          'CDrift', 'CNoise', 'CSecond',
                          'CWindowMovedBack', 'SDrift',
                          'SNoise', 'PSecond', 'TransDrift', 'AFirst']]

## Make a new dataframe combining the time series and important features. 

In [None]:
all_ts_and_preds = all_ts.merge(preds_subset, on = 'TestId', how = 'left')
all_ts_and_preds = all_ts_and_preds.drop('TestId', axis = 1)
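Because the merge is a left join, any waveform whose `TestId` has no predictor row ends up with missing values, and duplicate `TestId` rows in the predictors would duplicate waveforms. A small sketch of how both could be checked is shown here on toy frames; the column `t0` and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for all_ts and preds_subset (hypothetical values).
ts = pd.DataFrame({'TestId': [1, 2, 3], 't0': [0.1, 0.5, 0.9]})
preds = pd.DataFrame({'TestId': [1, 2], 'CDrift': [0.3, 0.7]})

# validate='many_to_one' raises a MergeError if a TestId appears
# more than once in the predictors frame.
merged = ts.merge(preds, on='TestId', how='left', validate='many_to_one')

# Rows without matching predictors carry NaN after a left merge.
missing = merged[merged.isna().any(axis=1)]
n_missing = len(missing)
print(f"{n_missing} row(s) lack predictor values")
```

If a check like this reports missing rows on the real data, they would need to be dropped or imputed before the values are passed to the SOM.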

Create the labels for the different kinds of readings and store them for future use. 

In [None]:
# Create the labels. 
un_lab = pd.Series(['un']).repeat(len(un_ts))
ecd_lab = pd.Series(['pc']).repeat(len(allecd_ts))
wild_lab = pd.Series(['wild']).repeat(len(ecd_ts))
cont_lab = pd.Series(['cont']).repeat(len(cont_ts))
syn_lab = pd.Series(['synth']).repeat(len(syn_ts))

# Store labels for all the categories (un, wild, cont, and synth), in the same row order as all_ts. 
labs = pd.concat([un_lab, wild_lab, cont_lab, syn_lab]).reset_index(drop = True)
                                      
# Save a copy of ids for future use. 
ids = all_ts['TestId']

## **Functions to Create Self-Organizing Maps**

There are two functions defined: <br>
- minisom_func: Initializes and trains the minisom model
- create_cluster_map: Converts the trained map into a cluster label for each data point

In [None]:
def minisom_func(ts_list,som_x,som_y,sigma,activation_distance,learning_rate,epochs,neighborhood_function):
    """ The minisom function for finding clusters. This function and the docstring are heavily copied from the original package
        (https://github.com/JustGlowing/minisom/blob/master/minisom.py)
       
        Parameters
        ----------
        ts_list : list
            The dataframe (converted to list for minisom package) containing the data to cluster
        som_x : int
            x dimension of the SOM.
        som_y : int
            y dimension of the SOM.
        sigma : float, optional (default=1.0)
            Spread of the neighborhood function, needs to be adequate
            to the dimensions of the map.
            (at the iteration t we have sigma(t) = sigma / (1 + t/T)
            where T is #num_iteration/2)
        activation_distance : string, callable optional (default='euclidean')
            Distance used to activate the map.
            Possible values: 'euclidean', 'cosine', 'manhattan', 'chebyshev'
            Example of callable that can be passed:
            def euclidean(x, w):
                return linalg.norm(subtract(x, w), axis=-1)
        learning_rate : float
            Initial learning rate.
            (at the iteration t we have
            learning_rate(t) = learning_rate / (1 + t/T)
            where T is #num_iteration/2)
        epochs : int
            Number of training iterations passed to som.train_random.
        neighborhood_function : string, optional (default='gaussian')
            Function that weights the neighborhood of a position in the map.
            Possible values: 'gaussian', 'mexican_hat', 'bubble', 'triangle'
    
        Returns
        -------
        som : MiniSom
            The trained SOM object.
        """
    # initializing minisom
    som = MiniSom(som_x, som_y, len(ts_list[0]), sigma=sigma,
                  learning_rate=learning_rate, activation_distance=activation_distance,
                  neighborhood_function=neighborhood_function, random_seed=30)

    # initializing the weights with random samples from the data
    som.random_weights_init(ts_list)

    # Training minisom
    print("Training minisom function")
    som.train_random(ts_list, epochs)
    print("\n...maps are ready!")

    return som
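The docstring above states that both `sigma` and `learning_rate` shrink during training according to `value / (1 + t/T)` with `T = num_iteration / 2`. The sketch below simply evaluates that formula as written; the helper name `decayed` is hypothetical, and the decay actually applied may differ depending on the installed MiniSom version.

```python
def decayed(initial, t, num_iteration):
    """Value at iteration t under initial / (1 + t / T), T = num_iteration / 2."""
    T = num_iteration / 2
    return initial / (1 + t / T)

# Evaluate the schedule at the start, middle, and end of 5000 iterations
# for sigma = 0.7 and learning_rate = 0.01.
for t in (0, 2500, 5000):
    print(t, decayed(0.7, t, 5000), decayed(0.01, t, 5000))
```

Under this formula the value is halved at the midpoint of training and falls to one third of its initial value by the final iteration.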

In [None]:
def create_cluster_map(ts_list,parameters):
    """ Function to define the clusters in SOM and distribution of clusters in different waveforms
       
        Parameters
        ----------
        ts_list : list
            The dataframe (converted to list for minisom package) containing the data to cluster
        parameters : dict
            dictionary of parameters that go into the minisom_func
        Returns : list
            a list with the cluster labels produced by the SOM. 
        """
    # calling minisom function
    som=minisom_func(ts_list,parameters['som_x'],parameters['som_y'],\
                     parameters['sigma'],parameters['activation_distance'],\
                     parameters['learning_rate'],parameters['epochs'],parameters['neighborhood_function'])
    # Obtaining maps for each data point
    win_map=som.win_map(ts_list)                            
    
    # list to hold the number of data points in each cluster
    cluster_c = []  
    # list to hold the name of each cluster
    cluster_n = []                                          
    
    # loop to populate clusters and their counts.
    for x in range(parameters['som_x']):                    
        for y in range(parameters['som_y']):
            cluster = (x,y)
            if cluster in win_map.keys():
                cluster_c.append(len(win_map[cluster]))
            else:
                cluster_c.append(0)
            cluster_number = x*parameters['som_y']+y+1
            cluster_n.append(f"Cluster {cluster_number}")
    


    # Create a list associating each data point to its cluster.
    cluster_map = []                                       
    for idx in range(len(ts_list)):
        winner_node = som.winner(ts_list[idx])
        cluster_number=winner_node[0]*parameters['som_y']+winner_node[1]+1
        cluster_map.append(cluster_number)

    # Return the list that labels each data point with its assigned cluster. 
    return cluster_map
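The cluster number used in `create_cluster_map` is a row-major index of the winning node, offset by one: `x*som_y + y + 1`. A short sketch of that mapping and its inverse follows; the helper names `node_to_cluster` and `cluster_to_node` are hypothetical and not part of the notebook.

```python
def node_to_cluster(x, y, som_y):
    """Map a SOM node (x, y) to its 1-based cluster number."""
    return x * som_y + y + 1

def cluster_to_node(cluster_number, som_y):
    """Recover the SOM node (x, y) from a 1-based cluster number."""
    return divmod(cluster_number - 1, som_y)

# With som_y = 2, every node on a 15 x 2 map round-trips cleanly.
for x in range(15):
    for y in range(2):
        assert cluster_to_node(node_to_cluster(x, y, 2), 2) == (x, y)
```

On a 15 by 2 map, for example, clusters 12 and 16 correspond to nodes (5, 1) and (7, 1).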

## **Run SOM Clustering**

In [None]:
# Specify the parameters for minisom function.
# The product of som_x and som_y is the number of clusters that will be produced. 
parameters={'som_x':15,'som_y':2,'sigma':0.7,'activation_distance':'manhattan',\
            'learning_rate':0.01,'epochs':5000,'neighborhood_function':'gaussian'}

# Convert the dataframe to a list for SOM. 
ts_list = all_ts_and_preds.values.tolist()

# Get the clusters according to the minisom algorithm
y_pred = create_cluster_map(ts_list = ts_list, parameters = parameters)                                      

## **Do some diagnostics**

In [None]:
# Define start and end times for plotting. 
start = -30
end = 39.8
# Get the data into the correct format for diagnostics. 
ts_pred = prepare_data(all_ts, all_preds, clusters = y_pred, labels = labs)
# Plot some info about the clusters. 
describe_clusters(ts_pred, ['ReturnCode', 'AggPred1', 'AggPred2'], start = start, end = end)

If we just want a quick data frame to see which clusters have most of the ECDs, we can call the following function. 

In [None]:
get_label_counts(ts_pred)
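The implementation of `get_label_counts` lives in the in-house `diagnostics` module and is not shown here. As a rough stand-in for the same kind of summary, the sketch below counts how often each label falls in each cluster, assuming only a list of cluster assignments and a matching list of labels. The toy values are made up.

```python
from collections import Counter

# Toy stand-ins for y_pred and labs (hypothetical values).
clusters = [1, 1, 2, 2, 2, 3]
labels = ['un', 'wild', 'wild', 'cont', 'wild', 'un']

# Count each (cluster, label) pair.
counts = Counter(zip(clusters, labels))

for (cluster, label), n in sorted(counts.items()):
    print(f"Cluster {cluster}: {label} = {n}")
```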

If we are interested in comparing the distributions of certain aggregate predictors we can also do that. 

In [None]:
compare_cluster_densities(ts_pred, clust1 = 12, clust2 = 16, 
                          feature_list = ['AggPred1', 'AggPred2', 'AggPred3'], clust_col = 'Cluster')

Here it looks like these two clusters have a lot of overlap on these predictors. 