In [1]:
# Read data from Java SOMToolbox
from SOMToolBox_Parse import SOMToolBox_Parse
idata   = SOMToolBox_Parse("datasets/iris/iris.vec").read_weight_file()
weights = SOMToolBox_Parse("datasets/iris/iris.wgt.gz").read_weight_file()
classes = SOMToolBox_Parse("datasets/iris/iris.cls").read_weight_file()

In [2]:
# Visualization by PySOMVis
from pysomvis import PySOMVis

vis = PySOMVis(weights=weights['arr'], m=weights['ydim'],n=weights['xdim'],
                dimension=weights['vec_dim'], input_data=idata['arr'],
                classes=classes['arr'][:,1], component_names=classes['classes_names'])
vis._mainview

In [3]:
# Use any library for training the SOM (e.g. MiniSOM, SOMOClu, SOMpy, PopSOM etc.)
from pysomvis import PySOMVis
from minisom import MiniSom    

som = MiniSom(10, 10, 4)
som.train(idata['arr'], 10000)

vis = PySOMVis(weights=som._weights, input_data=idata['arr'])
vis._mainview

# Dataset
## Select a data set from the OpenML Machine Learning Repository

The data set must come from OpenML (http://www.openml.org) and meet the following requirements:<br>
a. minimum 1000 instances,<br>
b. minimum 20 attributes,<br>
c. minimum 4 class labels (for visualizing class distributions on the map).<br>
Alternatively, you can also<br>
 opt to create an artificial dataset, preferably via parameterized scripts (in Matlab, Java, R,<br>
Python…) similar to the 10-Gaussians dataset, creating data of different densities combining<br>
i. Data on a finite area of a 1-d (line), 2-d, 3-d, 5-d hyperplanes<br>
ii. Data on (hyper-)spheres with different radius as well as Gaussians<br>
iii. Linear data sets in different intertwined settings<br>
iv. Other cluster characteristics that you find interesting<br>

## Register the dataset you picked with your group number in the TUWEL Wiki.<br>
 You must make sure<br>
that your dataset is unique, i.e. no two groups may take the same data set! (first come, first serve -<br>
do it early to get a data set that you also find interesting to work.)<br>

## Create a machine-actionable description of the dataset following Croissant / Schema.org<br>
descriptions for datasets (c.f. Croissant: https://neurips.cc/virtual/2024/poster/97627,<br>
https://docs.mlcommons.org/croissant/docs/croissant-spec.html; schema.org:<br>
https://schema.org/Dataset, c.f. the JSON example provided at https://schema.org/Dataset#eg-0478)<br>
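
As a rough illustration of what such a machine-actionable description could look like, the sketch below builds a plain schema.org `Dataset` record as a Python dictionary and serializes it to JSON-LD. It only uses generic schema.org properties; the Croissant-specific parts (file objects, record sets) are left out, the `url` is a placeholder, and the keywords are illustrative choices rather than values taken from OpenML.

```python
import json

# Hypothetical, minimal schema.org description of the dataset.
# Values marked PLACEHOLDER must be replaced with the real ones.
dataset_description = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Cardiotocography",
    "description": (
        "2126 fetal cardiotocograms (CTGs), automatically processed, with a "
        "10-class morphologic pattern label and a 3-class fetal state label."
    ),
    "url": "PLACEHOLDER: link to the dataset page on OpenML",
    "citation": (
        "Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated "
        "Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318"
    ),
    # Illustrative keywords, not taken from the source
    "keywords": ["cardiotocography", "fetal heart rate", "classification"],
    # Only two of the attributes are listed here to keep the sketch short
    "variableMeasured": [
        {"@type": "PropertyValue", "name": "LB",
         "description": "FHR baseline (beats per minute)"},
        {"@type": "PropertyValue", "name": "CLASS",
         "description": "FHR pattern class code (1 to 10)"},
    ],
}

json_ld = json.dumps(dataset_description, indent=2)
print(json_ld)
```

A complete description would list every attribute under `variableMeasured` and add a `distribution` entry pointing to the ARFF file.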

## Analyze and describe the characteristics of the dataset (size, attribute types as discussed in class,<br>
value ranges, sparsity, min/max values, outliers, missing values, correlations, ...), and describe this<br>
in the report. Also, describe any hypotheses you might have concerning the distribution of the data,<br>
number of clusters and their relationship, majority/minority classes.<br>
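
The characteristics listed above can be collected programmatically. The following is a minimal sketch using NumPy on a generic numeric matrix; the function name `describe_features` and the correlation threshold are assumptions made for illustration, and the toy data only demonstrates that a duplicated column shows up as a perfectly correlated pair.

```python
import numpy as np

def describe_features(X, corr_threshold=0.95):
    """Summarize a numeric feature matrix (rows = instances, columns = attributes)."""
    X = np.asarray(X, dtype=float)
    summary = {
        "n_instances": X.shape[0],
        "n_attributes": X.shape[1],
        "missing_per_column": np.isnan(X).sum(axis=0),
        "min": np.nanmin(X, axis=0),
        "max": np.nanmax(X, axis=0),
        # Share of exact zeros per column as a simple sparsity indicator
        "zero_fraction": (X == 0).mean(axis=0),
    }
    # Pearson correlation between columns; keep only strongly correlated pairs
    corr = np.corrcoef(X, rowvar=False)
    summary["highly_correlated_pairs"] = [
        (i, j, float(corr[i, j]))
        for i in range(X.shape[1])
        for j in range(i + 1, X.shape[1])
        if abs(corr[i, j]) >= corr_threshold
    ]
    return summary

# Toy data: column 1 is an exact copy of column 0, column 2 is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = np.column_stack([a, a, rng.normal(size=200)])
report = describe_features(toy)
print(report["highly_correlated_pairs"])
```

Applied to the real data, the same call would take the numeric columns of the DataFrame, for example via `df.select_dtypes("number").to_numpy()`.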

## Analyzing the dataset

The dataset we chose is related to the paper SisPorto 2.0: a program for automated analysis of cardiotocograms  (https://pubmed.ncbi.nlm.nih.gov/11132590/), which proposes a system to analyze cardiotocograms. 

A cardiotocogram (CTG) is the graphical representation produced during cardiotocography monitoring. 

Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318

### Dataset description from the source
2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments.


We will be using the morphologic patterns as the target classes to end up with a 10-class experiment.

### Cardiotocography
Cardiotocography, also known as electronic fetal monitoring (EFM), is a medical technique used to monitor both the fetal heart rate and uterine contractions during pregnancy and labor.


### Feature descriptions:
Attribute Information:

- LB - FHR baseline (beats per minute)
- AC - # of accelerations per second
- FM - # of fetal movements per second
- UC - # of uterine contractions per second
- DL - # of light decelerations per second
- DS - # of severe decelerations per second
- DP - # of prolonged decelerations per second
- ASTV - percentage of time with abnormal short term variability
- MSTV - mean value of short term variability
- ALTV - percentage of time with abnormal long term variability
- MLTV - mean value of long term variability
- Width - width of FHR histogram
- Min - minimum of FHR histogram
- Max - Maximum of FHR histogram
- Nmax - # of histogram peaks
- Nzeros - # of histogram zeros
- Mode - histogram mode
- Mean - histogram mean
- Median - histogram median
- Variance - histogram variance
- Tendency - histogram tendency
- CLASS - FHR pattern class code (1 to 10)
- NSP - fetal state class code (N=normal; S=suspect; P=pathologic)

### ARFF inaccuracies
The feature descriptions above do not exactly match the ARFF file: its attributes are unnamed (V1 to V35 plus Class), and there are more of them than described. The ten binary columns V26 to V35 appear to be a one-hot encoding of the class.
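
The one-hot guess can be checked rather than assumed. One hint is already in the `df.describe()` output further down: the means of V26 to V35 add up to 1.0, as expected if exactly one of them is set per row. The sketch below shows a stricter row-wise check on toy data; the helper `is_one_hot_of` is hypothetical, and it assumes that indicator column i belongs to class code i + 1, which may not hold for the actual file.

```python
import numpy as np

def is_one_hot_of(block, labels):
    """Return True if `block` is a one-hot encoding of `labels`.

    block  : (n, k) array of 0/1 indicator columns
    labels : (n,) array of integer class codes in 1..k
    Assumes indicator column i corresponds to class i + 1 (a guess).
    """
    block = np.asarray(block)
    labels = np.asarray(labels)
    only_binary = np.isin(block, (0, 1)).all()
    one_per_row = (block.sum(axis=1) == 1).all()
    matches = (block.argmax(axis=1) + 1 == labels).all()
    return bool(only_binary and one_per_row and matches)

labels = np.array([3, 1, 2, 3])
block = np.eye(3, dtype=int)[labels - 1]      # one-hot rows for the labels
print(is_one_hot_of(block, labels))           # True
print(is_one_hot_of(block[:, ::-1], labels))  # False: column order does not match
```

For the real data, `block` would be the V26 to V35 columns and `labels` the `Class` column converted to integers.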



# Preprocessing: 

Get the data into the form needed for training SOMs. Describe your preprocessing<br>
steps (e.g. transcoding, scaling), why you did it and how you did it. Specifically, if your dataset turns<br>
out to be extremely large (very high-dimensional and huge number of vectors so that it does not fit<br>
into memory for training SOMs) you may choose to apply subsampling for the training data.<br>
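
SOM training relies on distances between vectors, so attributes with large value ranges dominate unless the data is scaled. The sketch below shows min-max scaling with a guard for constant columns, plus an optional random subsample for datasets that do not fit into memory. The function names are illustrative and not part of any library.

```python
import numpy as np

def min_max_scale(X):
    """Scale every column to [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero for constant columns
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

def subsample(X, n_samples, seed=0):
    """Draw a random subset of rows without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    return X[idx]

# Toy matrix with very different value ranges and one constant column
X_raw = np.array([[120.0, 0.5, 7.0],
                  [160.0, 2.5, 7.0],
                  [140.0, 1.5, 7.0]])
X_scaled = min_max_scale(X_raw)
print(X_scaled)
```

With roughly two thousand instances, subsampling is not needed for this dataset; it is included only because the task description mentions it.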

In [70]:
## TODO LOADING AND PREPROCESSING THE DATA
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff('datasets/cardiotography/cardiotography.arff')

df = pd.DataFrame(data)

In [71]:
df_dropped = df.drop(columns=["V26", "V27", "V28", "V29", "V30", "V31", "V32", "V33", "V34", "V35"])

In [73]:
df_dropped.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V17,V18,V19,V20,V21,V22,V23,V24,V25,Class
0,23.0,240.0,357.0,120.0,120.0,0.0,0.0,0.0,73.0,0.5,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,b'9'
1,45.0,5.0,632.0,132.0,132.0,4.0,0.0,4.0,17.0,2.1,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,b'6'
2,45.0,177.0,779.0,133.0,133.0,2.0,0.0,5.0,16.0,2.1,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,b'6'
3,45.0,411.0,1192.0,134.0,134.0,2.0,0.0,6.0,16.0,2.4,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,b'6'
4,45.0,533.0,1147.0,132.0,132.0,4.0,0.0,5.0,16.0,2.4,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,b'2'


In [49]:

#print(meta)

#df.info()

#meta.names

In [47]:
feature_names = meta.names()  # meta.names() already returns the attribute names as strings
print("\nFeature Names From Metadata:\n", feature_names)


Feature Names From Metadata:
 ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'Class']


In [42]:
df.describe()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35
count,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,...,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0,2115.0
mean,25.118203,880.31253,1705.182506,133.313002,133.313002,2.730969,7.268558,3.674232,47.002837,1.334421,...,0.181087,0.271868,0.024113,0.038298,0.034043,0.156974,0.118676,0.050591,0.032151,0.092199
std,15.217105,894.93955,931.816753,9.836045,9.836045,3.566235,37.218533,2.845813,17.186054,0.884448,...,0.385182,0.445027,0.153438,0.19196,0.181381,0.363862,0.323484,0.219213,0.176444,0.289374
min,1.0,0.0,287.0,106.0,106.0,0.0,0.0,0.0,12.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.0,55.0,1009.5,126.0,126.0,0.0,0.0,1.0,32.0,0.7,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,29.0,538.0,1241.0,133.0,133.0,1.0,0.0,3.0,49.0,1.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,39.0,1522.0,2437.5,140.0,140.0,4.0,2.0,5.0,61.0,1.7,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,48.0,3296.0,3599.0,160.0,160.0,26.0,564.0,23.0,87.0,7.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [63]:
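# NOTE: the names are assigned purely by position and are a guess. In df.describe(),
# V4 and V5 are identical (mean 133.3, range 106 to 160), which looks like an FHR
# baseline in beats per minute, so the true column order probably differs from this list.
feature_names = [
    "LB", "AC", "FM", "UC", "DL", "DS", "DP", "ASTV", "MSTV", "ALTV",
    "MLTV", "Width", "Min", "Max", "Nmax", "Nzeros", "Mode", "Mean",
    "Median", "Variance", "Tendency", "V22", "V23", "V24",
    "V25", "V26", "V27", "V28", "V29", "V30", "V31", "V32", "V33", "V34", "V35", "CLASS",
]
df.columns = feature_names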

In [64]:
df.head()["CLASS"]

0    b'9'
1    b'6'
2    b'6'
3    b'6'
4    b'2'
Name: CLASS, dtype: object

In [65]:
df.drop(columns=["V26", "V27", "V28", "V29", "V30", "V31", "V32", "V33", "V34", "V35"], inplace=True)

In [None]:
question whether to drop v

In [66]:
df.head()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,...,Mode,Mean,Median,Variance,Tendency,V22,V23,V24,V25,CLASS
0,23.0,240.0,357.0,120.0,120.0,0.0,0.0,0.0,73.0,0.5,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,b'9'
1,45.0,5.0,632.0,132.0,132.0,4.0,0.0,4.0,17.0,2.1,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,b'6'
2,45.0,177.0,779.0,133.0,133.0,2.0,0.0,5.0,16.0,2.1,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,b'6'
3,45.0,411.0,1192.0,134.0,134.0,2.0,0.0,6.0,16.0,2.4,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,b'6'
4,45.0,533.0,1147.0,132.0,132.0,4.0,0.0,5.0,16.0,2.4,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,b'2'


In [6]:
#standard preprocessing for the DataFrame 
def preprocess_data(df):
    # Drop rows with missing values
    df = df.dropna()
    # Drop duplicate rows
    df = df.drop_duplicates()
    # Drop columns with constant values
    df = df.loc[:, (df != df.iloc[0]).any()]
    return df   

In [7]:
df = preprocess_data(df)
#df.info()

#print(df["Class"].value_counts())

# C) SOM Training and Analysis

## 1) Train a reasonably sized „regular“ SOM

In [129]:
from minisom import MiniSom
from sklearn.datasets import load_iris
import numpy as np
from pysomvis import PySOMVis
import pandas as pd
from scipy.io import arff
def calculate_som_size(data_size, fraction=0.1):
    """Calculate SOM dimensions based on the dataset size."""
    # Calculate the total number of units based on the data size and fraction
    total_units = int(data_size * fraction)
    # Use the square root to determine SOM dimensions (assumes a square grid)
    som_dim = int(np.sqrt(total_units))
    return som_dim, som_dim

def train_som(dataloader, **params):
    """
    Train a Self-Organizing Map (SOM) with specified parameters.

    Parameters:
    - dataloader (function): A function that returns the dataset. The dataset should have 
      'data' (features) and 'target' (labels) attributes.
    - size_fraction (float): Fraction of the dataset used to calculate the SOM size (default is 0.1).
    - sigma (float): The neighborhood radius used for the SOM (default is 1.0). Controls how far 
      neighboring neurons will be affected by the learning process.
    - learning_rate (float): The learning rate for the SOM (default is 0.5). It determines how 
      much the weights of the SOM are adjusted during training.
    - num_iterations (int): The number of iterations to train the SOM (default is 1000).
    - random_seed (int, optional): The random seed for initializing the SOM. If None, a random seed is used.
      Providing a fixed value ensures reproducibility.
    - neighborhood_function (str): The type of neighborhood function used for training ('gaussian' is default).
      This determines the shape of the neighborhood that is affected during training.

    Returns:
    - som (MiniSom): The trained SOM object.
    - X (array): The input features used for training.
    - y (array): The target labels corresponding to the input features.
    - class_names (array): The class names of the target labels.
    """
    
    # Load dataset using the provided dataloader function
    data = dataloader()
    X = data.data  # Extract the data (already normalized by the dataloader)
    y = data.target
    class_names = data.target_names

    # Calculate SOM size based on the number of data points and specified fraction
    som_x, som_y = calculate_som_size(len(X), fraction=params.get('size_fraction', 0.1))

    # Get the dimensionality of the input data (number of features per data point)
    input_len = X.shape[1]

    # Initialize SOM with the calculated dimensions and other specified parameters
    som = MiniSom(som_x, som_y, input_len, 
                  sigma=params.get('sigma', 1.0),  # Neighborhood radius
                  learning_rate=params.get('learning_rate', 0.5),  # Learning rate
                  neighborhood_function=params.get('neighborhood_function', 'gaussian'),  # Neighborhood function
                  random_seed=params.get('random_seed', None))  # Random seed for initialization

    # Train the SOM using the normalized data and specified number of iterations
    som.train(X, params.get('num_iterations', 1000))

    # Return trained SOM and the input data
    return som, X, y, class_names

def iris_dataloader():
    """Load and normalize Iris dataset."""
    data = load_iris()
    # Normalize data to range [0, 1]
    data.data = (data.data - data.data.min(axis=0)) / (data.data.max(axis=0) - data.data.min(axis=0))
    return data

def cardiotography_dataloader():
    """Load and normalize the Cardiotocography dataset."""
    data, meta = arff.loadarff('datasets/cardiotography/cardiotography.arff')
    df = pd.DataFrame(data)

    # Drop the columns that appear to be a one-hot encoding of the class
    df.drop(columns=["V26", "V27", "V28", "V29", "V30", "V31", "V32", "V33", "V34", "V35"], inplace=True)
    df = preprocess_data(df)

    # Split df into features (all columns except the last) and target (last column)
    features = df.iloc[:, :-1].values.astype(float)
    # Class codes are stored as bytes (e.g. b'9'); convert them to integers
    target = df.iloc[:, -1].values.astype(int)

    # Min-max normalize only the feature columns to the range [0, 1]
    features = (features - features.min(axis=0)) / (features.max(axis=0) - features.min(axis=0))

    class Data():
        def __init__(self, data, target):
            self.data = data
            self.target = target
            # Class names are the distinct class codes, not the column names
            self.target_names = np.unique(target).astype(str)

    return Data(features, target)




def calculate_som_metadata(som, X, y, class_names, compact_output = True):
  
    metadata = {}

    # 1. Quantization Error
    metadata['quantization_error'] = som.quantization_error(X)

    # 2. Topographic Error
    metadata['topographic_error'] = som.topographic_error(X)

    # 3. Best Matching Units (BMUs)
    metadata['bmus'] = [som.winner(x) for x in X]
    metadata['bmu_indices'] = [np.ravel_multi_index(bmu, som.get_weights().shape[:2]) for bmu in metadata['bmus']]

    # 4. Neuron Weights
    metadata['neuron_weights'] = som.get_weights()

    # 5. Node Counts (Hit Map)
    metadata['node_counts'] = np.zeros(som.get_weights().shape[:2], dtype=int)
    for bmu in metadata['bmus']:
        metadata['node_counts'][bmu] += 1
 

    if compact_output:
        print("--- SOM Metadata ---")
        print(f"Quantization Error: {metadata['quantization_error']:.4f}")
        print(f"Topographic Error:  {metadata['topographic_error']:.4f}")

        return None
    else:
        return metadata

    
    

Train a SOM with „regular“ size (i.e. number of units as a certain fraction of the number of data<br>
items) and reasonable training parameters (sufficiently large initial neighborhood, learning<br>
rate; provide a justification for the selection of the parameters. NOTE: Learning rates for SOMs<br>
differ from those usually encountered in Deep Neural Networks, c.f. lecture)<br>
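
To judge whether an initial neighborhood radius is "sufficiently large", it helps to relate it to the side length of the map. The sketch below reproduces the square-grid sizing rule used by `calculate_som_size` in this notebook and prints the resulting map sizes for a few fractions. The instance count of 2115 is taken from the `df.describe()` output and may be slightly lower after duplicate rows are dropped.

```python
import math

def som_side(n_instances, fraction):
    """Side length of a square map with roughly n_instances * fraction units."""
    return int(math.sqrt(int(n_instances * fraction)))

n_instances = 2115  # row count reported by df.describe(); an approximation
for fraction in (0.05, 0.4, 1.0):
    side = som_side(n_instances, fraction)
    print(f"fraction={fraction}: {side}x{side} map, {side * side} units")
```

With `size_fraction=0.4` and `sigma=1.5`, as used for the regular SOM in this notebook, the radius spans roughly 5% of the 29-unit side length, which is worth keeping in mind when justifying the choice.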

In [140]:
# TODO TRAIN SOM WITH REG SIZE

params = {
    'size_fraction': 0.4,
    'sigma': 1.5,
    'learning_rate': 0.3,
    'num_iterations': 5000,
    'random_seed': 42  # Specify a random seed for initialization
}

som, X, y, class_names = train_som(cardiotography_dataloader, **params)

Analyse in detail the class distribution, cluster structure, quantization errors, topology
violations.<br> a) Can you identify the border effect and magnification factors.<br> b) How well do class
distribution and cluster structure match?<br> c) Which classes fall into sub-clusters, which classes
are split across clusters, which classes mix in clusters.<br> d) How is the quantization error
distributed on the map, how does this correspond with perceived cluster separation and
quality?
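
For question d), a single global quantization error is not enough; the error has to be broken down per unit. The following sketch computes the mean distance between each unit's weight vector and the inputs mapped to it, using only NumPy. It assumes Euclidean distance and a weight array of shape `(rows, cols, dim)`, as returned by `som.get_weights()`; the function name is illustrative.

```python
import numpy as np

def unit_quantization_error(weights, X):
    """Mean distance between each unit's weight vector and the inputs mapped to it.

    weights : (rows, cols, dim) array of unit weight vectors
    X       : (n, dim) array of input vectors
    Returns a (rows, cols) array; units without hits are NaN.
    """
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    bmu = np.empty(len(X), dtype=int)
    bmu_dist = np.empty(len(X))
    for i, x in enumerate(X):
        d = np.linalg.norm(flat - x, axis=1)  # distance to every unit
        bmu[i] = d.argmin()                   # best-matching unit
        bmu_dist[i] = d[bmu[i]]
    hits = np.bincount(bmu, minlength=rows * cols)
    total = np.bincount(bmu, weights=bmu_dist, minlength=rows * cols)
    mean_qe = np.full(rows * cols, np.nan)
    occupied = hits > 0
    mean_qe[occupied] = total[occupied] / hits[occupied]
    return mean_qe.reshape(rows, cols)

# Toy 1x2 map: both inputs lie near the first unit, none near the second
toy_weights = np.array([[[0.0, 0.0], [1.0, 1.0]]])
toy_X = np.array([[0.0, 0.1], [0.1, 0.0]])
qe_map = unit_quantization_error(toy_weights, toy_X)
print(qe_map)
```

Plotting the resulting array as a heat map next to the U-matrix or SDH shows whether regions of high error coincide with cluster borders.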

In [139]:
calculate_som_metadata(som, X, y, class_names)

--- SOM Metadata ---
Quantization Error: 0.4055
Topographic Error:  0.1957


In [138]:
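# Request the full metadata dictionary; with compact_output=True the function returns None
metadata = calculate_som_metadata(som, X, y, class_names, compact_output=False)
print("Quantization Error:", metadata['quantization_error'])
print("Topographic Error:", metadata['topographic_error'])
#print("Node Counts:\n", metadata['node_counts'])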

Quantization Error: 0.40545093079315864
Topographic Error: 0.19574468085106383


In [141]:
#TODO above
weights = {
    'arr': som.get_weights(),
    'xdim': som._weights.shape[0],
    'ydim': som._weights.shape[1],
    'vec_dim': som._weights.shape[2]
}

# Initialize PySOMVis with correct m and n
# vis = PySOMVis(weights=weights['arr'], 
#                m=weights['xdim'],  # m should be xdim
#                n=weights['ydim'],  # n should be ydim
#                dimension=weights['vec_dim'], 
#                input_data=X, 
#                classes=y, 
#                component_names=class_names)

# # Display the visualization
# vis._mainview()

vis = PySOMVis(weights=som._weights, input_data=X)
vis._mainview


In [136]:
#Iris SOM to compare
params_iris = {
    'size_fraction': 0.4,
    'sigma': 1.5,
    'learning_rate': 0.3,
    'num_iterations': 5000,
    'random_seed': 42  # Specify a random seed for initialization
}

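# Use separate names so the cardiotocography som, X, y and class_names are not overwritten
som_iris, X_iris, y_iris, class_names_iris = train_som(iris_dataloader, **params_iris)
weights_iris = {
    'arr': som_iris.get_weights(),
    'xdim': som_iris._weights.shape[0],
    'ydim': som_iris._weights.shape[1],
    'vec_dim': som_iris._weights.shape[2]
}

vis_iris = PySOMVis(weights=som_iris._weights, input_data=X_iris)
#vis_iris._mainview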

**Describe and compare the structures found**(providing detailed info on visualizations and
parameters)


When looking at the SDH, we can see a concentration in the bottom right corner, with some smaller clusters emerging. When increasing the smoothing factor to around 100, two meta-clusters become visible.

In comparison, when using the k-means clustering visualization with the number of clusters set to the 10 intended classes, some well-defined clusters form (in the area that also looks well structured on the SOM), but the rest is just noise.






In [84]:
#TODO show structres

#TODO comparision text

## 2) Analyze different initializations of the SOM

Train one further „regular-sized“ SOM using the same training parameters as above, but using
a different random seed for initializing the SOM.

In [142]:
params["random_seed"] = 43

# Train the SOM using the updated parameters (same dataset, different seed)
som2, X, y, class_names = train_som(cardiotography_dataloader, **params)


**Show and describe** <br> a) how the cluster structures and class distributions shift on the two
SOMs,<br> b) the effect on topology violations, cluster relationships, etc.<br> c) Which clusters show
a stable relationship, which ones change their relative position?<br> d) Which data instances are
stably mapped with similar data instances, which change a lot? Are they part of the same
clusters?
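
Question d) can be approached quantitatively by comparing, for every instance, the set of instances that share its best-matching unit on the first map with the corresponding set on the second map. The sketch below computes the Jaccard overlap of these two sets per instance. The function name is an illustrative choice, and the flat BMU indices are assumed to be computed as in `calculate_som_metadata` (via `np.ravel_multi_index`).

```python
import numpy as np

def co_mapping_stability(bmu_a, bmu_b):
    """Per-instance Jaccard overlap of the sets of instances sharing its unit on two maps.

    bmu_a, bmu_b : (n,) arrays with the flat best-matching-unit index of every
                   instance on map A and on map B.
    Returns an (n,) array in (0, 1]; 1 means the instance keeps exactly the
    same unit-mates on both maps.
    """
    bmu_a = np.asarray(bmu_a)
    bmu_b = np.asarray(bmu_b)
    same_a = bmu_a[:, None] == bmu_a[None, :]  # who shares a unit on map A
    same_b = bmu_b[:, None] == bmu_b[None, :]  # who shares a unit on map B
    intersection = (same_a & same_b).sum(axis=1)
    union = (same_a | same_b).sum(axis=1)
    return intersection / union

# Toy example: four instances, mapped differently on the two maps
stability = co_mapping_stability([0, 0, 1, 1], [5, 5, 5, 7])
print(stability)
```

Instances with low values change their company between the two maps; checking their class labels shows whether they belong to the same clusters. Note that this measure only looks at identical units, so a variant based on grid distance may be more informative on large maps.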

In [143]:
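from IPython.display import display

# Visualize the first SOM (som); display() is used because a cell only renders
# its last expression automatically
vis1 = PySOMVis(weights=som._weights, input_data=X)
display(vis1._mainview)

# Visualize the second SOM (som2) with a different random seed
vis2 = PySOMVis(weights=som2._weights, input_data=X)
vis2._mainview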

In [146]:
som2.quantization_error(X)

np.float64(0.3697331290873076)

When looking at the SDH with a smoothing factor of 150 here, we can see four distinct clusters emerging. This is still fewer clusters than the 10 classes would suggest, which is odd.

**Describe and compare the structures found** (providing detailed info on visualizations and
parameters)

In [87]:
#TODO show structures

Hit Histogram: Shows that there are a few nodes on the map where there are lots of data points mapped to it, but the majority of it 

#todo comparision text

## 3) Analyze different map sizes

Train 2 additional SOMs varying the size (very small / very large) (provide reasons for choice
of sizes)<br>
Train each map with rather large neighborhood radius and high learning rate (provide reasons
for the definition of „high“!)

In [105]:
# Train a very small SOM (small grid, less resolution)
params_small = {
    'size_fraction': 0.05,  # Very small grid size (5% of the data size)
    'sigma': 3.0,  # Large neighborhood radius (larger value for more smoothing)
    'learning_rate': 0.9,  # High learning rate (close to 1 for fast convergence)
    'num_iterations': 5000,
    'random_seed': 44  # Different seed for initialization
}
som_small, X, y, class_names = train_som(cardiotography_dataloader, **params_small)

# Train a very large SOM (large grid, higher resolution)
params_large = {
    'size_fraction': 1.0,  # Very large grid size (100% of the data size)
    'sigma': 3.0,  # Large neighborhood radius (larger value for more smoothing)
    'learning_rate': 0.9,  # High learning rate (close to 1 for fast convergence)
    'num_iterations': 5000,
    'random_seed': 45  # Different seed for initialization
}
som_large, X, y, class_names = train_som(cardiotography_dataloader, **params_large)


# Train an extremely large SOM (ten times as many units as data points)
params_huge = {
    'size_fraction': 10.0,  # Extreme grid size (1000% of the data size)
    'sigma': 3.0,  # Large neighborhood radius (larger value for more smoothing)
    'learning_rate': 0.9,  # High learning rate (close to 1 for fast convergence)
    'num_iterations': 5000,
    'random_seed': 45  # Same seed as the very large SOM
}
som_huge, X, y, class_names = train_som(cardiotography_dataloader, **params_huge)


Analyse in detail the<br> a) class distribution,<br> b) cluster structure,<br> c) quantization errors,<br> d)
topology violations. Also,<br> e) analyze how clusters shift, change in relative size, and how their
relative position to each other changes or remains the same.<br> f) Check for aspects such as
magnification factors. What is the resulting granularity of clusters visible on the small and large
maps? Are the same clusters visible in the very large map as in the regular map?

In [89]:


# Visualize the very small SOM
vis_small = PySOMVis(weights=som_small._weights, input_data=X)
vis_small._mainview




In [106]:
# Visualize the very large SOM
vis_large = PySOMVis(weights=som_large._weights, input_data=X)
vis_large._mainview

In [116]:
#vis_huge = PySOMVis(weights=som_huge._weights, input_data=X)
#vis_huge._mainview

**Describe and compare the structures found** (providing detailed info on visualizations and
parameters)

In [91]:
#todo maybe show structures

# Comparison
Comparing the small to the large SOM, we can clearly see more distinct (though still connected) cluster regions emerging in the large, 1.0x SOM. 


## 4) Analyze different initial neighborhood radius settings

Train the very large SOM as specified above, but with a much too small neighborhood radius.

In [92]:
param_small_neighborhood = {
    'size_fraction': 0.5,
    'sigma': 0.1,  # Small neighborhood radius
    'learning_rate': 0.9,
    'num_iterations': 5000,
    'random_seed': 46
}

som_small_neighborhood, X, y, class_names = train_som(cardiotography_dataloader, **param_small_neighborhood)

Analyse the<br> a) cluster structure,<br> b) quantization errors,<br> c) topology violations.<br> d) In how far
does this map differ from the very large map trained with a correct/high initial neighborhood
radius?

In [20]:
# Visualize the SOM with a small neighborhood radius
vis_small_neighborhood = PySOMVis(weights=som_small_neighborhood._weights, input_data=X)
vis_small_neighborhood._mainview

**Describe and compare the structures found** (what is the effect of a „too small“
neighborhood radius? How to detect it?)
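
One possible way to detect a too small radius numerically is to compare how similar grid-adjacent units are with how similar arbitrary units are. On a topologically ordered map, neighbors have similar weight vectors, so the ratio is well below 1; if the neighborhood was too small for units to organize together, the ratio moves towards 1. The sketch below is an illustrative measure under these assumptions, not an established quality index, and it uses a synthetic 5x5 map instead of a trained SOM.

```python
import numpy as np

def neighbor_distance_ratio(weights):
    """Mean weight distance of grid-adjacent units divided by the mean over all unit pairs."""
    rows, cols, dim = weights.shape
    horizontal = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    vertical = np.linalg.norm(weights[1:, :] - weights[:-1, :], axis=2)
    adjacent_mean = np.concatenate([horizontal.ravel(), vertical.ravel()]).mean()

    # Pairwise distances between all units via the Gram matrix
    flat = weights.reshape(-1, dim)
    sq = (flat ** 2).sum(axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T, 0.0, None)
    upper = np.triu_indices(len(flat), k=1)
    all_pairs_mean = np.sqrt(d2[upper]).mean()
    return adjacent_mean / all_pairs_mean

# Synthetic ordered map: the weight vector of each unit equals its grid position
grid = np.linspace(0.0, 1.0, 5)
ordered = np.stack(np.meshgrid(grid, grid, indexing="ij"), axis=2)

# Same weight vectors, but randomly assigned to units (no topological order)
rng = np.random.default_rng(0)
shuffled = rng.permutation(ordered.reshape(-1, 2)).reshape(5, 5, 2)

ordered_ratio = neighbor_distance_ratio(ordered)
shuffled_ratio = neighbor_distance_ratio(shuffled)
print(ordered_ratio, shuffled_ratio)
```

On a trained map, the function would be called with `som.get_weights()`; comparing the value for the small-radius map with the value for the map trained with a large radius complements the topographic error and the U-matrix.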

In [21]:
#TODO maybe show structures.

#TODO describe from above

## 5) Analyze different initial learning rates

Train the regular-sized SOM as specified above, but with a (I) much too large / (II) much too<br>
small learning rate (provide justification for the setting of the parameter)

In [22]:
param_som_low_learning_rate = {
    'size_fraction': 0.5,
    'sigma': 1.0,
    'learning_rate': 0.1,  # Low learning rate
    'num_iterations': 5000,
    'random_seed': 47
}

som_low_learning_rate, X, y, class_names = train_som(cardiotography_dataloader, **param_som_low_learning_rate)

param_som_large_learning_rate = {
    'size_fraction': 0.5,
    'sigma': 1.0,
    'learning_rate': 5,  # High learning rate
    'num_iterations': 5000,
    'random_seed': 48
}

som_large_learning_rate, X, y, class_names = train_som(cardiotography_dataloader, **param_som_large_learning_rate)

Analyse for both (I) and (II)<br> a) cluster structure,<br> b) quantization errors,<br> c) topology violations.<br>
d) In how far do these two maps differ from the well-trained map analyzed above?

In [23]:
# Visualize the SOM with a low learning rate
vis_low_learning_rate = PySOMVis(weights=som_low_learning_rate._weights, input_data=X)
vis_low_learning_rate._mainview

In [24]:
# Visualize the SOM with a high learning rate
vis_large_learning_rate = PySOMVis(weights=som_large_learning_rate._weights, input_data=X)
vis_large_learning_rate._mainview

Describe and compare the structures found (how can you detect „too small“ learning<br>
rates? When do they start to make sense?)

In [25]:
#TODO maybe show structures

#todo comparision text

## 6) Analyze different max iterations

Train a regular SOM using 2, 5, 10, 50, 100, 1000, 5000, 10000 iterations

In [26]:
iterations = [2,5,10,50,100,1000,5000,10000]

params_iterations = {
    'size_fraction': 0.5,
    'sigma': 1.0,
    'learning_rate': 0.5,
    'random_seed': 49
}

trained_soms = []

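for n_iter in iterations:
    # Use a separate name so the regular SOM stored in `som` is not overwritten
    som_iter, X, y, class_names = train_som(cardiotography_dataloader, num_iterations=n_iter, **params_iterations)

    trained_soms.append((som_iter, X, y, class_names, n_iter))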
    

Analyse cluster structure. <br>a) When do cluster structures start to emerge?<br> b) After how many
iterations do they stabilize?<br> c) How can you tell from the quality measures whether the map is
stable?<br> d) Which visualizations help you discover not-yet stable SOM mappings?
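
For question c), one simple criterion is that the quality measure stops changing noticeably between consecutive runs. The sketch below finds the first run from which the relative change of the error stays below a tolerance. The error values are made up purely to demonstrate the function and are not results from this notebook; the tolerance of 5% is likewise an arbitrary choice.

```python
def first_stable_index(errors, tolerance=0.01):
    """Smallest index k such that all later relative changes stay below `tolerance`."""
    for k in range(len(errors)):
        changes = [abs(b - a) / a for a, b in zip(errors[k:], errors[k + 1:])]
        if all(c < tolerance for c in changes):
            return k

iterations = [2, 5, 10, 50, 100, 1000, 5000, 10000]
# Made-up quantization errors for illustration only (NOT measured values)
made_up_qe = [0.90, 0.80, 0.70, 0.55, 0.50, 0.42, 0.405, 0.404]

k = first_stable_index(made_up_qe, tolerance=0.05)
print(f"Error is stable from {iterations[k]} iterations onwards")
```

With real values, `made_up_qe` would be replaced by the quantization error of each map stored in `trained_soms`, for example `[s[0].quantization_error(s[1]) for s in trained_soms]`; the topographic error can be treated the same way.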

In [27]:
# visualize first som
print(f"Iterations = {trained_soms[0][4]}")
vis_iterations = PySOMVis(weights=trained_soms[0][0]._weights, input_data=trained_soms[0][1])
vis_iterations._mainview

Iterations = 2


In [28]:
# visualize second som
print(f"Iterations = {trained_soms[1][4]}")
vis_iterations = PySOMVis(weights=trained_soms[1][0]._weights, input_data=trained_soms[1][1])
vis_iterations._mainview

Iterations = 5


In [29]:
# visualize third som
print(f"Iterations = {trained_soms[2][4]}")
vis_iterations = PySOMVis(weights=trained_soms[2][0]._weights, input_data=trained_soms[2][1])
vis_iterations._mainview


Iterations = 10


In [30]:
# visualize fourth som
print(f"Iterations = {trained_soms[3][4]}")
vis_iterations = PySOMVis(weights=trained_soms[3][0]._weights, input_data=trained_soms[3][1])
vis_iterations._mainview

Iterations = 50


In [31]:
# visualize fifth som
print(f"Iterations = {trained_soms[4][4]}")
vis_iterations = PySOMVis(weights=trained_soms[4][0]._weights, input_data=trained_soms[4][1])
vis_iterations._mainview


Iterations = 100


In [32]:
#visualize sixth som
print(f"Iterations = {trained_soms[5][4]}")
vis_iterations = PySOMVis(weights=trained_soms[5][0]._weights, input_data=trained_soms[5][1])
vis_iterations._mainview

Iterations = 1000


In [33]:
#visualize seventh som
print(f"Iterations = {trained_soms[6][4]}")
vis_iterations = PySOMVis(weights=trained_soms[6][0]._weights, input_data=trained_soms[6][1])
vis_iterations._mainview

Iterations = 5000


In [34]:
#visualize eighth som
print(f"Iterations = {trained_soms[7][4]}")
vis_iterations = PySOMVis(weights=trained_soms[7][0]._weights, input_data=trained_soms[7][1])
vis_iterations._mainview


Iterations = 10000


Describe and compare the structures found (how can you detect „too small“ learning
rates? When do they start to make sense?)

In [35]:
#TODO maybe show structures

#todo comparision text

## 7) Detailed analysis of an „Optimal SOM“

Train a SOM using what you consider to be „optimal parameters“ based on sub-tasks 1-6.

In [118]:
optimal_params = {
    'size_fraction': 0.5,
    'sigma': 1.0,
    'learning_rate': 0.5,
    'num_iterations': 1000,
    'random_seed': 50
}

# Train the SOM with the optimal parameters
som_optimal, X, y, class_names = train_som(cardiotography_dataloader, **optimal_params)

# Visualize the SOM with the optimal parameters
vis_optimal = PySOMVis(weights=som_optimal._weights, input_data=X)
vis_optimal._mainview

Describe the final model following the FAIR4ML schema (cf.
https://doi.org/10.5281/zenodo.14002310, https://rda-fair4ml.github.io/FAIR4MLschema/
release/0.1.0/index.html, https://github.com/RDA-FAIR4ML/FAIR4ML-schema)

#todo above

### SUBTASKS a-e

Provide a detailed interpretation of the cluster/class structures using a combination of<br>
visualizations and their parameter settings. Describe the findings in detail, specifically<br>
analyzing and providing rationale for

#### a 

Cluster densities / cardinalities, shapes: what can you tell about the cluster sizes<br>
shapes, their cardinalities and densities? Can you observe areas of higher/lower<br>
densities? Compare different visualizations that support (or contradict) your hypothesis<br>
and reason/explain why they do so.

In [37]:
#TODO above 

#### b

Hierarchical cluster relationships: can you detect any hierarchies in the data? How do<br>
they seem to be structured? Which clusters are similar, which are very distant, how<br>
could they be related? Compare different visualizations that support (or contradict)<br>
your hypothesis and reason/explain why they do so.

In [38]:
#TODO above

#### c

Topological relations / violations: in which areas can you observe topology violations?<br>
What types of violations do you observe in which areas of the map (i.e. actual violations<br>
due to bad training or the inherent structure of the data vs. cluster data that is mapped<br>
onto the plane). In how far do different visualizations agree on these violations?<br>
Compare different visualizations that support (or contradict) your hypothesis and<br>
reason/explain why they do so.

In [39]:
#TODO above

#### d 

Class distribution: Which classes are mapped onto which parts of the map? How do<br>
they relate to each other? In how far does the class distribution match the cluster<br>
structure? Which classes are well-separated, which ones less so? What might be the<br>
reason for these overlaps? Is the mapping less correct in these regions (e.g. higher<br>
error measures)? Are these areas well-separated. Which classes form homogeneous<br>
clusters, which form sub-clusters, how similar are these sub-clusters?
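
To back up statements about class separation with numbers, the class labels can be aggregated per unit. The sketch below computes, for every occupied unit, its majority class, the share of hits belonging to that class (purity), and the hit count. It uses only the standard library; the function name and the toy labels are illustrative.

```python
from collections import Counter, defaultdict

def unit_class_purity(bmu_indices, labels):
    """For each occupied unit, return (majority class, purity, hit count)."""
    per_unit = defaultdict(Counter)
    for unit, label in zip(bmu_indices, labels):
        per_unit[unit][label] += 1
    result = {}
    for unit, counts in per_unit.items():
        majority, majority_hits = counts.most_common(1)[0]
        total = sum(counts.values())
        result[unit] = (majority, majority_hits / total, total)
    return result

# Toy example: unit 0 mixes two classes, units 1 and 2 are pure
toy_bmus = [0, 0, 0, 1, 1, 2]
toy_labels = [6, 6, 2, 9, 9, 2]
purity = unit_class_purity(toy_bmus, toy_labels)
print(purity)
```

Units with low purity mark the regions where classes mix; comparing them with the per-unit quantization error indicates whether the mapping is also less accurate there.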

In [40]:
#TODO above

#### e

Quality of the map in terms of vector quantization and topology violation: is the quality<br>
homogeneous, are there certain areas or classes where the quality of the mapping is<br>
lower, others where it is higher?

In [41]:
#TODO above

# D) Summarize your findings



## 1

Summarize your overall findings and lessons learned:<br>
a. Which parameters have what kind of influence on the SOM?<br>
b. How sensitive is the setting of these parameters <br>
c. Which visualizations are most useful to reveal what kind of information? Which combination

WRITE SUMMARY

## 2

(optional) Provide feedback on the exercise in general: which parts were useful / less useful; which<br>
other kind of experiment would have been interesting, … (this section is, obviously, optional and will<br>
not be considered for grading. You may also decide to provide that kind of feedback anonymously via<br>
the feedback mechanism in TISS – in any case we would appreciate learning about it to adjust the<br>
exercises for next year.)
