188.413 Self-Organising Systems
# Exercise 3: SOM Evaluation Report - Experiment 06

**Authors:**
* Gunnar Sjúrðarson Knudsen, 12028205
* Michael Ferdinand Moser, 01123077
* Magnus Wagner, 12034922

## Goal of this notebook
**6) Analyze different scalings:**
- Train a "regular" SOM with obviously wrongly scaled data
- Analyse cluster structure, quantization errors, topology violations. To what extent does this map differ from the maps analyzed above?
- **Describe and compare the structures found** (providing detailed info on visualizations and parameters)

## Comments
As this is the only task that concerns the preprocessing step rather than the parameters of the SOM, we will NOT be using the preprocessed data here.
Instead, we fetch the data in the same way as in the preprocessing notebook and define a function that preprocesses it with whichever scaler is being tested.
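
Why the scaler matters for a SOM can be illustrated with a small sketch. The SOM in this notebook uses Euclidean distance to find the best-matching unit, so a feature with a much wider value range can dominate that distance. The toy array and the helper `share_of_squared_distance` below are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical toy data: column 0 spans roughly 0..1, column 1 spans 0..1000.
X = np.array([[0.1, 100.0],
              [0.9, 110.0],
              [0.1, 900.0]])

def share_of_squared_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Unscaled: the wide-range second feature accounts for almost all of the distance.
raw_share = share_of_squared_distance(X[0], X[1])

# Min-max scaled to [0, 1] per column: the first feature can now contribute.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_share = share_of_squared_distance(X_scaled[0], X_scaled[1])

print("unscaled share per feature:", raw_share.round(4))
print("scaled share per feature:  ", scaled_share.round(4))
```

In the unscaled case the two points look far apart almost entirely because of the second column, even though they differ strongly in the first one as well.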

## Sources:
* https://github.com/smnishko/SOMToolbox/
* http://www.ifs.tuwien.ac.at/dm/somtoolbox/
* https://somoclu.readthedocs.io/en/stable/example.html
* https://github.com/JustGlowing/minisom

## Setup
Libraries, constants and other stuff

### Constants
Mainly file names, so that the SOM results of this experiment are stored separately

In [None]:
EXPERIMENT_DESCRIPTION = 'Experiment_06'
EXPERIMENT_DATA_FOLDER = 'experiment_results/experiment_06'

### SOM Parameters
Only the parameters that are held constant for this experiment

In [None]:
# Initialization and training
RANDOM_SEED = 0
N_NEURONS = 30                     # int: x dimension of the SOM
M_NEURONS = 30                     # int: y dimension of the SOM
SIGMA = N_NEURONS/4                # Spread of the neighborhood function, needs to be adequate to the dimensions of the map.
LEARNING_RATE = 0.7                # initial learning rate
TOPOLOGY = 'rectangular'           #  'rectangular', 'hexagonal'
ACTIVATION_DISTANCE = 'euclidean'  # 'euclidean', 'cosine', 'manhattan', 'chebyshev'
NEIGHBORHOOD_FUNCTION = 'gaussian' # 'gaussian', 'mexican_hat', 'bubble', 'triangle'

NUM_ITERATIONS = 50000
RANDOM_ORDER = False
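
To get a feeling for how `SIGMA` and `LEARNING_RATE` shrink over the `NUM_ITERATIONS` training steps, the following sketch evaluates the decay formula that is noted in a comment of the training cell, `learning_rate / (1 + t / (max_iterations / 2))`. It is an assumption that the installed MiniSom version uses exactly this schedule for both values; `asymptotic_decay` is re-implemented here for illustration only.

```python
def asymptotic_decay(initial_value, t, max_iter):
    """Decay of the form value / (1 + t / (max_iter / 2))."""
    return initial_value / (1 + t / (max_iter / 2))

# Values copied from the constants cell above.
N_NEURONS = 30
SIGMA = N_NEURONS / 4
LEARNING_RATE = 0.7
NUM_ITERATIONS = 50000

for t in (0, NUM_ITERATIONS // 2, NUM_ITERATIONS):
    lr = asymptotic_decay(LEARNING_RATE, t, NUM_ITERATIONS)
    sigma = asymptotic_decay(SIGMA, t, NUM_ITERATIONS)
    print(f"t={t:>6}: learning_rate={lr:.4f}, sigma={sigma:.4f}")
```

Under this assumption both values fall to one third of their starting value by the end of training, so the neighborhood radius would go from 7.5 to 2.5 units.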

### Load required libraries

In [None]:
# Data Science Libraries
import numpy as np
import pandas as pd
import datetime
import gzip
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset selection
import openml
from openml.datasets import edit_dataset, fork_dataset, get_dataset

# SOM Helpers
import panel as pn
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')

# ML utilities
from sklearn import datasets, preprocessing

# SOM stuff
from somtoolbox import SOMToolbox
from SOMToolBox_Parse import SOMToolBox_Parse
from minisom import MiniSom    

# More som stuff
import somoclu

import pickle

%matplotlib inline

from bokeh.io import export_png

# Methods used for this experiment
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE

### Download dataset

In [None]:
did = 1497
# Download dataset
dataset = openml.datasets.get_dataset(did)

# Create dataframe of the data, and show how it looks
X, y, categorical_indicator, attribute_names = dataset.get_data(dataset_format="array"
                                                                , target=dataset.default_target_attribute
                                                               )
df = pd.DataFrame(X, columns= attribute_names)
df["class"] = y
display(df)

category = df.select_dtypes(include='object')
categorial_columns = category.columns
numerical = df.select_dtypes(exclude='object')
numerical_columns = numerical.columns

### Define functions for preprocessing, training, and visualizing each experiment

#### Preprocess
Ensures that every step is performed in the same way, except for the scaler/sampler that is being tested

In [None]:
def preprocess_dataset(df, description, scaler = None, sampler = None, verbose = True):
    # Work on a copy: scaling in place would modify the global dataframe,
    # so every later experiment would start from already-scaled data
    df = df.copy()
    feature_columns = df.columns != 'class'

    # Scale the features only; the class label stays untouched
    if scaler is not None:
        df.loc[:, feature_columns] = scaler.fit_transform(df.loc[:, feature_columns])
    
    # Visualize results
    if verbose:
        os.makedirs(EXPERIMENT_DATA_FOLDER, exist_ok=True)

        f = plt.figure(figsize=(12, 8), dpi=600)
        plt.title('Distribution of Columns (KDE)', color='black')

        df.plot(kind="kde",  ax=f.gca())
        plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
        plt.savefig(f"{EXPERIMENT_DATA_FOLDER}/KDE_{description}.png")
        plt.show()

        f = plt.figure(figsize=(12, 8), dpi=600)
        plt.title('Distribution of Columns (Boxplot)', color='black')

        df.plot(kind="box",  ax=f.gca())
        plt.savefig(f"{EXPERIMENT_DATA_FOLDER}/boxplot_{description}.png")
        plt.show()
    
    ### Restructure to MiniSom format
    # Features (every column except the class label)
    data = df.loc[:, feature_columns].values

    # Target
    target = df['class']

    # TODO: check that this re-encoding matches the label order of the dataset
    label_names = {0: 'Move-Forward'
                 , 1: 'Slight-Right-Turn'
                 , 2: 'Sharp-Right-Turn'
                 , 3: 'Slight-Left-Turn'
                  }
    # Oversample (applied after scaling, so the sampler sees the scaled features)
    if sampler is not None:
        data, target = sampler.fit_resample(data,target)
        
    return target, label_names, data
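
The scalers compared in this notebook differ mainly in how they treat the mean and outliers. The sketch below re-computes them with plain NumPy on a hypothetical feature containing one outlier; the formulas follow the default behaviour of the corresponding scikit-learn classes, and the variable names are for illustration only.

```python
import numpy as np

# Hypothetical single feature with one outlier (not taken from the dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())   # MinMaxScaler
unit_var = x / x.std()                          # StandardScaler(with_mean=False)
z_score = (x - x.mean()) / x.std()              # StandardScaler(with_mean=True)
max_abs = x / np.abs(x).max()                   # MaxAbsScaler

q25, q75 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q75 - q25)       # RobustScaler, quantile_range=(25, 75)

for name, values in [("min_max", min_max), ("unit_var", unit_var),
                     ("z_score", z_score), ("max_abs", max_abs),
                     ("robust", robust)]:
    print(f"{name:>8}: {np.round(values, 3)}")
```

With min-max and max-abs scaling the outlier squeezes the remaining values into a narrow band near zero, whereas the robust scaler keeps them spread out and leaves the outlier far away from the rest.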

In [None]:
def javafy_data(target, label_names, data, description):
    # Write the SOMLib files (input vectors, template, class info) for the Java SOMToolbox
    folder = f'./java_folder/experiment_06/{description}'
    os.makedirs(folder, exist_ok=True)
    n_vectors, vec_dim = data.shape

    #### Input Vector
    header = f"$TYPE vec\n$XDIM {n_vectors}\n$YDIM 1\n$VEC_DIM {vec_dim}"

    # Append a 1-based running index as the label column
    # NOTE: the float16 cast keeps only about 3 significant digits per value
    data_concatted = np.concatenate((data.astype(np.float16), np.arange(1, n_vectors + 1).reshape(-1, 1)), axis=1)

    # One '%1.6f' per feature plus '%i' for the index; the header is written in the same call
    fmt = ' '.join(['%1.6f'] * vec_dim + ['%i'])
    np.savetxt(f'{folder}/input.vec', data_concatted, fmt=fmt, header=header, comments='')
    
    #### Template
    feature_lines = '\n'.join(f'{i} V{i + 1}' for i in range(vec_dim))
    template = f'$TYPE template\n$XDIM 2\n$YDIM {n_vectors}\n$VEC_DIM {vec_dim}\n{feature_lines}'
    with open(f'{folder}/template.tv', 'w') as f:
        f.write(template)
    
    #### Class info
    PATH = f'{folder}/class_info.cls'
    header = f'$TYPE class_information\n$NUM_CLASSES 4\n$CLASS_NAMES Move_Forward Slight_Right_Turn Sharp_Right_Turn Slight_Left_turn\n$XDIM 2\n$YDIM {n_vectors}\n'

    # Re-index the labels from 1 so that they match the index column of input.vec
    target_new = pd.DataFrame(target.copy(deep=True))
    target_new["index"] = np.arange(1, n_vectors + 1)
    target_new = target_new.set_index("index")

    with open(PATH, 'w') as f:
        f.write(header)
    target_new.to_csv(PATH, header=None, index=True, sep=' ', mode='a')
    
    print("Done!")
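
Before handing the exported files to the Java SOMToolbox, it can help to read one back and check that the header agrees with the body. The parser below is a minimal sketch for the input-vector layout written above (header lines starting with `$`, then one row per vector with the index in the last column). `parse_somlib_vec` and the sample text are hypothetical and not part of any SOMToolbox API.

```python
import io
import numpy as np

def parse_somlib_vec(text):
    """Split SOMLib input-vector text into a header dict, the data and the labels."""
    header, rows = {}, []
    for line in text.splitlines():
        if line.startswith("$"):
            key, _, value = line.partition(" ")
            header[key] = value
        elif line.strip():
            rows.append(line)

    table = np.loadtxt(io.StringIO("\n".join(rows)), ndmin=2)
    data, labels = table[:, :-1], table[:, -1].astype(int)

    # Consistency checks between header and body.
    assert int(header["$XDIM"]) == data.shape[0], "vector count mismatch"
    assert int(header["$VEC_DIM"]) == data.shape[1], "dimension mismatch"
    return header, data, labels

# Hypothetical two-vector file with three features per vector.
sample = "$TYPE vec\n$XDIM 2\n$YDIM 1\n$VEC_DIM 3\n0.5 1.0 2.0 1\n1.5 0.0 3.25 2\n"
header, data, labels = parse_somlib_vec(sample)
print(header, data.shape, labels)
```

For a real export, the text would come from reading `input.vec` in the experiment's folder.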

#### Train SOM

In [None]:
def train_som(_df, _description, _scaler = None, _sampler = None, _verbose = True):
    
    ### Preprocess data
    _target, _label_names, _data = preprocess_dataset(_df
                                                  , _description
                                                  , scaler = _scaler
                                                  , sampler = _sampler
                                                  , verbose = _verbose
                                                 )
    
    ### Train a (single) SOM - from MiniSom documentation
    som = MiniSom(x = N_NEURONS 
                  , y = M_NEURONS 
                  , input_len = _data.shape[1] # int: Number of the elements of the vectors in input.
                  , sigma = SIGMA 
                  , learning_rate = LEARNING_RATE 
                 # decay_function is left at the MiniSom default
                 # TODO: confirm that it computes learning_rate / (1 + t / (max_iterations / 2))
                  , neighborhood_function = NEIGHBORHOOD_FUNCTION 
                  , topology = TOPOLOGY
                  , activation_distance = ACTIVATION_DISTANCE
                  , random_seed = RANDOM_SEED
                 )

    som.pca_weights_init(_data)
    som.train(_data
              , num_iteration = NUM_ITERATIONS
              , random_order = RANDOM_ORDER  
              , verbose=_verbose
             )  # samples are presented sequentially unless RANDOM_ORDER is True
    
    # Reformat data for SOMToolbox structure
    weights = som.get_weights().reshape(-1, _data.shape[1]) # weights['arr']
    n_neurons = N_NEURONS                        # weights['xdim']
    m_neurons = M_NEURONS                        # weights['ydim']
    dimension = _data.shape[1]                    # weights['vec_dim']
    classes = _target                             # classes['arr']
    component_names = list(_label_names.values()) # classes['classes_names'] (these are the class labels)
    data = _data                                  # idata['arr']    
    
    # Pack files to a pickle
    os.makedirs(EXPERIMENT_DATA_FOLDER, exist_ok=True)
    with open(EXPERIMENT_DATA_FOLDER + '/TRAINED_SOM_DATA_' + _description + '.pkl', 'wb') as f:
        pickle.dump([weights, n_neurons, m_neurons, dimension, classes, component_names, data], f)
    
    ### Use SOMToolbox on our newly generated SOM
    sm = SOMToolbox(weights = weights
                    , m = m_neurons
                    , n = n_neurons
                    , dimension = dimension
                    , input_data = data
                    , classes = classes
                    #, component_names = component_names
                   )
    
    #return _target, _label_names, _data
    return weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm
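
The goal of this notebook asks for quantization errors and topology violations, which can be computed directly from the flattened `weights` and `data` returned above. The following NumPy sketch is one possible way to do so. `som_errors` is a hypothetical helper, and it assumes a rectangular map whose units are stored in row-major order and an 8-neighbourhood definition of adjacency. Recent MiniSom versions also provide `quantization_error(data)` and `topographic_error(data)` on the trained object.

```python
import numpy as np

def som_errors(weights, data, n_rows, n_cols):
    """Return (quantization error, topographic error) for a rectangular SOM.

    weights: array of shape (n_rows * n_cols, dim), units in row-major order.
    data:    array of shape (n_samples, dim).
    """
    # Pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = ((data ** 2).sum(axis=1)[:, None]
          + (weights ** 2).sum(axis=1)[None, :]
          - 2.0 * data @ weights.T)
    dists = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding

    order = np.argsort(dists, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]

    # Quantization error: mean distance of each sample to its best-matching unit.
    qe = dists[np.arange(len(data)), bmu1].mean()

    # Topographic error: share of samples whose best and second-best units
    # are not adjacent on the grid (Chebyshev grid distance greater than 1).
    r1, c1 = np.divmod(bmu1, n_cols)
    r2, c2 = np.divmod(bmu2, n_cols)
    not_adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) > 1
    return qe, not_adjacent.mean()

# Hypothetical 1 x 3 map with one-dimensional weights.
toy_weights = np.array([[0.0], [10.0], [1.0]])
toy_data = np.array([[0.4], [9.0]])
qe, te = som_errors(toy_weights, toy_data, n_rows=1, n_cols=3)
print(qe, te)
```

Note that the quantization error is measured in the units of the scaled data, so values from differently scaled experiments are not directly comparable, while the topographic error is a ratio and can be compared across experiments.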

#### Define static visualizations

In [None]:
#HitHistogram
def HitHist(_m, _n, _weights, _idata):
    hist = np.zeros(_m * _n)
    for vector in _idata: 
        position =np.argmin(np.sqrt(np.sum(np.power(_weights - vector, 2), axis=1)))
        hist[position] += 1

    return hist.reshape(_m, _n)

#U-Matrix - implementation
def UMatrix(_m, _n, _weights, _dim):
    U = _weights.reshape(_m, _n, _dim)
    U = np.insert(U, np.arange(1, _n), values=0, axis=1)
    U = np.insert(U, np.arange(1, _m), values=0, axis=0)
    #calculate interpolation
    for i in range(U.shape[0]): 
        if i%2==0:
            for j in range(1,U.shape[1],2):
                U[i,j][0] = np.linalg.norm(U[i,j-1] - U[i,j+1], axis=-1)
        else:
            for j in range(U.shape[1]):
                if j%2==0: 
                    U[i,j][0] = np.linalg.norm(U[i-1,j] - U[i+1,j], axis=-1)
                else:      
                    U[i,j][0] = (np.linalg.norm(U[i-1,j-1] - U[i+1,j+1], axis=-1) + np.linalg.norm(U[i+1,j-1] - U[i-1,j+1], axis=-1))/(2*np.sqrt(2))

    U = np.sum(U, axis=2) #move from Vector to Scalar

    for i in range(0, U.shape[0], 2): #count new values
        for j in range(0, U.shape[1], 2):
            region = []
            if j>0: region.append(U[i][j-1]) #check left border
            if i>0: region.append(U[i-1][j]) #check bottom
            if j<U.shape[1]-1: region.append(U[i][j+1]) #check right border
            if i<U.shape[0]-1: region.append(U[i+1][j]) #check upper border

            U[i,j] = np.median(region)

    return U

#SDH - implementation
def SDH(_m, _n, _weights, _idata, factor, approach):
    import heapq

    sdh_m = np.zeros( _m * _n)

    cs=0
    for i in range(factor): cs += factor-i

    for vector in _idata:
        dist = np.sqrt(np.sum(np.power(_weights - vector, 2), axis=1))
        c = heapq.nsmallest(factor, range(len(dist)), key=dist.__getitem__)
        if (approach==0): # normalized
            for j in range(factor):  sdh_m[c[j]] += (factor-j)/cs 
        if (approach==1):# based on distance
            for j in range(factor): sdh_m[c[j]] += 1.0/dist[c[j]] 
        if (approach==2): 
            dmin, dmax = min(dist[c]), max(dist[c])
            for j in range(factor): sdh_m[c[j]] += 1.0 - (dist[c[j]]-dmin)/(dmax-dmin)

    return sdh_m.reshape(_m, _n)
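
In the normalized SDH variant (`approach == 0`), each input vector distributes one vote over its `factor` closest units, with higher weight for closer ranks. The sketch below reproduces just that weighting so the effect of `factor` is easier to see; `sdh_rank_weights` is a hypothetical helper extracted from the loop above.

```python
def sdh_rank_weights(factor):
    """Vote weights per rank for the normalized SDH variant (approach 0)."""
    total = sum(factor - i for i in range(factor))  # equals factor * (factor + 1) / 2
    return [(factor - j) / total for j in range(factor)]

weights_3 = sdh_rank_weights(3)
print(weights_3)                       # closest unit gets 3/6, then 2/6, then 1/6
print(sum(sdh_rank_weights(25)))       # the weights always add up to (about) 1
```

With `factor = 1` the SDH reduces to the plain hit histogram, since the single best-matching unit receives the whole vote.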

In [None]:
def generate_visualizations(weights, n_neurons, m_neurons, dimension, classes, component_names, data, description):
    # The map is square in this experiment, so the order of m_neurons and n_neurons does not change the result
    hithist = hv.Image(HitHist(m_neurons
                               , n_neurons
                               , weights
                               , data
                              )
                      ).opts(xaxis=None, yaxis=None) 

    um = hv.Image(UMatrix(m_neurons
                          , n_neurons
                          , weights
                          , dimension # length of each weight vector (24 for this dataset)
                         )
                 ).opts(xaxis=None, yaxis=None) 

    sdh = hv.Image(SDH(m_neurons
                       , n_neurons
                       , weights
                       , data 
                       , 25 # factor: number of closest units that each input vector votes for
                       , 0  # approach: 0 = rank-based normalized votes, 1 = inverse distance, 2 = min-max scaled distance
                      )
                  ).opts(xaxis=None, yaxis=None)   

    allthree =  hv.Layout([hithist.relabel('HitHist').opts(cmap='kr')
                           , um.relabel('U-Matrix').opts(cmap='jet')
                           , sdh.relabel('SDH').opts(cmap='viridis')
                          ]
                         )
    #display(allthree)
    
    #hv.save(hithist, filename="plot.png", fmt="png")
    display(hithist.relabel('HitHist').opts(cmap='kr'))
    display(um.relabel('U-Matrix').opts(cmap='jet'))
    display(sdh.relabel('SDH').opts(cmap='viridis'))
    
    # TODO: save these plots to EXPERIMENT_DATA_FOLDER
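
One possible way to save the static maps is to bypass HoloViews for the export: `HitHist`, `UMatrix` and `SDH` all return plain 2-D NumPy arrays, which Matplotlib can render and write to disk. The sketch below swaps in Matplotlib for that purpose, so the saved image will not look identical to the Bokeh plot. `save_static_map` is a hypothetical helper and the random array only stands in for a real hit histogram.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside the notebook
import matplotlib.pyplot as plt
import numpy as np

def save_static_map(values, title, path, cmap="viridis"):
    """Render a 2-D array (e.g. a hit histogram) as a heat map and write it to disk."""
    fig, ax = plt.subplots(figsize=(4, 4))
    image = ax.imshow(values, cmap=cmap)
    ax.set_title(title)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.colorbar(image, ax=ax)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)  # free the figure so repeated calls do not accumulate memory
    return path

# Placeholder data standing in for the output of HitHist(...).
demo = np.random.default_rng(0).integers(0, 10, size=(30, 30))
out_path = save_static_map(demo, "HitHist", os.path.join(tempfile.mkdtemp(), "hithist_demo.png"))
print(out_path)
```

In the notebook, the target path would be built from `EXPERIMENT_DATA_FOLDER` and the experiment description, in the same way as for the KDE and box plots.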

# Run experiments

In [None]:
# Enable the interactive SOMToolbox view
# Very helpful for understanding the map, but very memory intensive
INTERACTIVE_EXPLORATION = True

## Without oversampling

### E01) Without Scaling

In [None]:
# Define Experiment "parameters"
experiment_description = 'e01_nonsampled_unscaled'
scaler = None
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E02) Using Min-Max-Scaler

In [None]:
# Define Experiment "parameters"
experiment_description = 'e02_nonsampled_MinMax'
scaler = MinMaxScaler()
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E03) Using Unit Variance Scaling (without centering)

In [None]:
# Define Experiment "parameters"
experiment_description = 'e03_nonsampled_ZeroMeanUnitVariance'
scaler = StandardScaler(with_mean = False)  # divides by the standard deviation only; the mean is not removed
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E04) Using Z-Scaling (zero mean, unit variance)

In [None]:
# Define Experiment "parameters"
experiment_description = 'e04_nonsampled_ZScaling'
scaler = StandardScaler(with_mean = True)
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E05) Using Max-Abs

In [None]:
# Define Experiment "parameters"
experiment_description = 'e05_nonsampled_MaxAbs'
scaler = MaxAbsScaler()
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E06) Using Robust-Scaler

In [None]:
# Define Experiment "parameters"
experiment_description = 'e06_nonsampled_Robust'
scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
sampler = None

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

## With oversampling
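
The experiments in this part rebalance the classes with SMOTE, which creates synthetic minority samples by interpolating between a sample and one of its same-class neighbours. Because `preprocess_dataset` scales before it resamples, the neighbours are searched in the scaled feature space, so the choice of scaler also influences which synthetic points are created. The sketch below shows only the interpolation step in a simplified form; it is not the imblearn implementation, and `smote_like_sample` is a hypothetical helper.

```python
import numpy as np

def smote_like_sample(x, neighbour, rng):
    """Create one synthetic point on the segment between x and a same-class neighbour."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbour - x)

rng = np.random.default_rng(42)
x = np.array([1.0, 10.0])            # hypothetical minority sample
neighbour = np.array([3.0, 20.0])    # hypothetical same-class neighbour
synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

The synthetic point always lies between the two originals in every feature, which is why oversampled classes tend to fill in the regions they already occupy on the map rather than spread to new ones.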

### E07) Without Scaling

In [None]:
# Define Experiment "parameters"
experiment_description = 'e07_smote_unscaled'
scaler = None
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E08) Using Min-Max-Scaler

In [None]:
# Define Experiment "parameters"
experiment_description = 'e08_smote_MinMax'
scaler = MinMaxScaler()
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E09) Using Unit Variance Scaling (without centering)

In [None]:
# Define Experiment "parameters"
experiment_description = 'e09_smote_ZeroMeanUnitVariance'
scaler = StandardScaler(with_mean = False)  # divides by the standard deviation only; the mean is not removed
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E10) Using Z-Scaling (zero mean, unit variance)

In [None]:
# Define Experiment "parameters"
experiment_description = 'e10_smote_ZScaling'
scaler = StandardScaler(with_mean = True)
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E11) Using Max-Abs

In [None]:
# Define Experiment "parameters"
experiment_description = 'e11_smote_MaxAbs'
scaler = MaxAbsScaler()
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

### E12) Using Robust-Scaler

In [None]:
# Define Experiment "parameters"
experiment_description = 'e12_smote_Robust'
scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
sampler = SMOTE(random_state = 42)

# Preprocess and train
weights, n_neurons, m_neurons, dimension, classes, component_names, data, sm = train_som(df
                                                                                      , _description = experiment_description
                                                                                      , _scaler = scaler
                                                                                      , _sampler = sampler
                                                                                      , _verbose = True
                                                                                     )

In [None]:
# Do entire preprocessing again, and save in correct java folder
target, label_names, data = preprocess_dataset(df, description = experiment_description, scaler = scaler, sampler = sampler, verbose = False)
javafy_data(target, label_names, data, experiment_description)

In [None]:
# Start interactive exploration
if INTERACTIVE_EXPLORATION:
    display(sm._mainview)

# Conclusion
**No conclusions have been drawn yet.**

**Remaining Todo:**
  * Generate more static visualizations
  * Figure out how to save the static visualizations