# SOM Coding Report

## Team Members
- Huber Maximilian [01526935]
- Süss Maximilian [12225947]

## Public Repository 
https://github.com/MxHub1/PySOMVis_t-product.git

## Abstract
This work addresses the visualization of the **topographic product (TP)** of a self-organising map (SOM) by using and extending the public Python repository PySOMVis on GitHub. To evaluate the results of this project, data and reference code from the Java SOMToolBox are used. The visualizations are tested and validated using the chainlink and 10-cluster datasets, which contain the files needed to train a new SOM as well as an already trained map against which a reference implementation can be verified. After visualizing the provided trained maps, new SOMs of dimensions 10x10 and 100x60 are trained, and the topographic product is calculated and visualized for them.

## Preparation 
The following code handles the necessary imports and loads the data (input vectors and weights) for the chainlink or the 10-cluster dataset. To select a dataset, set `dataset_str`; the paths to the provided files are then built in an OS-agnostic way.

In [1]:
import os
from SOMToolBox_Parse import SOMToolBox_Parse

dataset_str = "datalink"

# Specify the vector and weight data to use for the following visualizations
idata_path = os.path.join("datasets", dataset_str, dataset_str + ".vec")
weights_path = os.path.join("datasets", dataset_str, dataset_str + ".wgt.gz")


# Parse provided dataset files using the SOMToolbox Parser
idata = SOMToolBox_Parse(idata_path).read_weight_file()
weights = SOMToolBox_Parse(weights_path).read_weight_file()

The following imports are necessary for the calculation of the topographic product as well as its corresponding visualization.

In [2]:
import holoviews as hv
from holoviews import opts
from coding_assignment import TopoProd
import seaborn as sns
import matplotlib.pyplot as plt

hv.extension('bokeh')

## Visualization of Topographic Product for provided SOMs
This section contains the visualizations for the provided pre-trained SOMs, which originate from the SOMToolBox, for the chainlink and 10-cluster datasets. The provided input vectors and weights are imported and parsed using the SOMToolBox parser. Multiple visualizations are created by varying the parameter configuration. The topographic product has only one parameter, 'k', which is the number of nearest neighbours considered for each unit. We therefore chose the following values of k to compare the resulting visualizations:
- k = 1
- k = 2
- k = 4
- k = |weights| - 1 (number of weights - 1)
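
For orientation, the quantities behind the three maps can be sketched in plain NumPy. This is a minimal illustration based on the usual definition of the topographic product (Bauer and Pawelzik), not the `TopoProd` implementation from `coding_assignment`. The function name `topographic_product_sketch`, the row-by-row unit layout, and the Euclidean grid distance are assumptions made for this example.

```python
import numpy as np

def topographic_product_sketch(weights, ydim, xdim, k=None):
    """Illustrative per-unit P1, P2, P3 maps; returns an array of shape (ydim, xdim, 3)."""
    n_units = ydim * xdim
    k = n_units - 1 if k is None else k
    w = np.asarray(weights, dtype=float).reshape(n_units, -1)

    # Assumption: units are stored row by row, so unit i sits at (i // xdim, i % xdim).
    rows, cols = np.divmod(np.arange(n_units), xdim)
    grid = np.stack([rows, cols], axis=1).astype(float)

    # Pairwise Euclidean distances in input space (weights) and output space (grid).
    d_in = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=2)
    d_out = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=2)

    def k_nearest(dist):
        # Force each unit to rank first in its own row, then drop that column.
        masked = dist.copy()
        np.fill_diagonal(masked, -1.0)
        return np.argsort(masked, axis=1, kind="stable")[:, 1:k + 1]

    nn_in, nn_out = k_nearest(d_in), k_nearest(d_out)
    unit = np.arange(n_units)[:, None]

    # Ratios between the l-th neighbour by grid order and by input-space order.
    # Note: duplicate weight vectors would give zero distances and break the logs.
    q1 = d_in[unit, nn_out] / d_in[unit, nn_in]
    q2 = d_out[unit, nn_out] / d_out[unit, nn_in]

    # Geometric means over the k neighbours, computed in log space.
    p1 = np.exp(np.log(q1).mean(axis=1))
    p2 = np.exp(np.log(q2).mean(axis=1))
    p3 = np.sqrt(p1 * p2)
    return np.stack([p1, p2, p3], axis=1).reshape(ydim, xdim, 3)
```

In this sketch a value of 1 (a logarithm of 0) means that the neighbourhood order of a unit agrees in input and output space. Whether `TopoProd` returns raw ratios or their logarithms is not assumed here.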

For the first visualizations of the topographic product on the pre-trained SOMs, we chose to compare two different kinds of visualization and colour palettes in order to extract more information and to mitigate the effects of colour blindness. Later in the report we stuck to grayscale only, so as not to overload the document with redundant visualisations.

### Topographic product for pre-trained SOM

#### TP for k=1

In [3]:
# Calculating the topographic product
p = TopoProd(weights['ydim'], weights['xdim'], weights['arr'], k=1)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

# Plotting of the topographic product
h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=2

In [4]:
# Calculating the topographic product
p = TopoProd(weights['ydim'], weights['xdim'], weights['arr'], k=2)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

# Plotting of the topographic product
h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=4

In [5]:
# Calculating the topographic product
p = TopoProd(weights['ydim'], weights['xdim'], weights['arr'], k=4)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

# Plotting of the topographic product
h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k = |weights| - 1

In [6]:
# Calculating the topographic product
p = TopoProd(weights['ydim'], weights['xdim'], weights['arr'])

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

# Plotting of the topographic product
h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

## Visualization of Topographic Product for self-trained SOMs
This section contains the visualizations for self-trained SOMs of dimensions 10x10 (small) and 100x60 (large) for the chainlink and 10-cluster datasets. We chose the MiniSom library for SOM training with 10,000 iterations, since in our tests this did not have a large negative effect on the performance of our algorithm. The provided input vectors and weights are imported and parsed using the SOMToolBox parser. Multiple visualizations are created by varying the parameter configuration. The topographic product has only one parameter, 'k', which is the number of nearest neighbours considered for each unit. We therefore chose different values of k to compare the resulting visualizations.

In [7]:
from minisom import MiniSom

### Small (10x10) SOM 

In [8]:
# Train 10x10 Map for the selected Dataset

if (dataset_str == "10clusters"):
    dimension = 10
elif (dataset_str == "datalink"):
    dimension = 3

som = MiniSom(10, 10, dimension)
som.train(idata['arr'], 10000)
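
The branch above leaves `dimension` undefined for any other dataset name. As an alternative sketch, the dimensionality can be read from the data itself, assuming that the parsed input vectors form a 2-D array with one row per vector. `infer_input_dim` is a hypothetical helper written for this example and is not part of PySOMVis or MiniSom.

```python
import numpy as np

def infer_input_dim(vectors):
    """Return the number of features per input vector (hypothetical helper)."""
    arr = np.asarray(vectors)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array of input vectors, got shape {arr.shape}")
    return arr.shape[1]

# Example with stand-in data shaped like a 3-dimensional dataset (values are made up).
demo_vectors = np.zeros((5, 3))
dimension = infer_input_dim(demo_vectors)
```

With the parsed data from the preparation cell, the call would be `infer_input_dim(idata['arr'])`, provided that array has the assumed layout.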

#### TP for k=1

In [9]:
p = TopoProd(10, 10, som._weights, k=1)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=2

In [10]:
p = TopoProd(10, 10, som._weights, k=2)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=4

In [11]:
p = TopoProd(10, 10, som._weights, k=4)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k = |weights| - 1

In [12]:
p = TopoProd(10, 10, som._weights)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

### Large (100x60) SOM 

In [13]:
# Train 100x60 Map for the selected Dataset

if (dataset_str == "10clusters"):
    dimension = 10
elif (dataset_str == "datalink"):
    dimension = 3

som = MiniSom(100, 60, dimension)
som.train(idata['arr'], 10000)

#### TP for k=1

In [14]:
p = TopoProd(100, 60, som._weights, k=1)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=2

In [15]:
p = TopoProd(100, 60, som._weights, k=2)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k=4

In [16]:
p = TopoProd(100, 60, som._weights, k=4)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])

#### TP for k = |weights| - 1

In [17]:
p = TopoProd(100, 60, som._weights)

p1 = p[..., 0] # Distortion in input space
p2 = p[..., 1] # Distortion in output space
p3 = p[..., 2] # Geometric mean between p1 and p2

h_p1 = hv.Image(p1).opts(xaxis=None, yaxis=None)
h_p2 = hv.Image(p2).opts(xaxis=None, yaxis=None)
h_p3 = hv.Image(p3).opts(xaxis=None, yaxis=None)

hv.Layout([
    h_p1.relabel('TopoProd P1').opts(cmap='gray'),
    h_p2.relabel('TopoProd P2').opts(cmap='gray'),
    h_p3.relabel('TopoProd P3').opts(cmap='gray'),
])


## Validation of correctness

To validate whether our implementation of the calculation and visualization of the topographic product is correct, we compared our results with those of the Java SOMToolBox (0.7.5-4 svn4367). To make the validation reproducible, we compared our visualizations of the provided SOMs for the chainlink and 10-cluster datasets.

### Comparison with Java SOMToolBox topographic product
While comparing our visualizations with the ones created by the Java SOMToolBox, we found that the Java SOMViewer was not deterministic for the pre-trained SOM: it returned different visualizations of the topographic product when executed multiple times. We used the following configuration for the experiment:

- SDK: Java 11
- Java SOMToolBox version: 0.7.5-4 svn4367

We executed the SOMViewer with the following command for the chainlink dataset (and the 10-cluster dataset accordingly), using the provided files which we downloaded from http://www.ifs.tuwien.ac.at/dm/somtoolbox/datasets.html

```
java at.tuwien.ifs.somtoolbox.apps.viewer.SOMViewer 
    -u datasets\chainlink\chainlink.unit
    -w datasets\chainlink\chainlink.wgt
    -v datasets\chainlink\chainlink.vec
    -t datasets\chainlink\chainlink.tv
    -c datasets\chainlink\chainlink.cls
    -m datasets\chainlink\chainlink.map
```


We then used the user interface of the SOMViewer to generate the visualizations of the topographic product with different values of 'k'. We executed the same visualization multiple times for each value of 'k' (1, 2, 4).
Furthermore, after each execution of the topographic product, we cleared the tool's visualization cache in order to avoid unwanted side effects. To cross-check the tool's behaviour, we tested various other visualizations in the same manner (with the same input files); all of them were deterministic. This leads us to assume that there might be an algorithmic error in the Java SOMToolBox implementation of the topographic product.
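
The same repeat-and-compare check can be applied to any Python visualization function. The sketch below is a hedged illustration: `is_deterministic` and the two toy functions are hypothetical names for this example, and the check compares raw arrays rather than rendered images.

```python
import numpy as np

def is_deterministic(compute, runs=3):
    """Call `compute` several times and report whether all results are identical arrays."""
    reference = np.asarray(compute())
    return all(np.array_equal(reference, np.asarray(compute())) for _ in range(runs - 1))

# Toy stand-ins for a visualization computation (not SOMToolBox or PySOMVis code).
fixed_input = np.arange(12, dtype=float).reshape(3, 4)
unseeded_rng = np.random.default_rng()

def stable_map():
    # Depends only on the fixed input, so every call returns the same array.
    return fixed_input / fixed_input.max()

def unstable_map():
    # Adds fresh random noise on every call, so repeated runs differ.
    return fixed_input + unseeded_rng.random(fixed_input.shape)
```

Here `is_deterministic(stable_map)` holds while `is_deterministic(unstable_map)` does not. Passing such a check shows repeatability only, not correctness.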

The following visualizations of the topographic product were produced by the Java SOMToolBox in repeated runs on identical data and parameter configurations, yet with different outcomes:

<table style="text-align: center">
  <tr>
    <th style="text-align: center" colspan="2">Java SOMViewer Chainlink TP for k = 2</th>
  </tr>
  <tr>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/chainlink/JavaSOMTool_visualization_chainlink_TopographicProduct_k2_a.png"></td>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/chainlink/JavaSOMTool_visualization_chainlink_TopographicProduct_k2_b.png"></td>
  </tr>
  <tr>
    <th style="text-align: center" colspan="2">Java SOMViewer Chainlink TP for k = 4</th>
  </tr>
  <tr>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/chainlink/JavaSOMTool_visualization_chainlink_TopographicProduct_k4_a.png"></td>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/chainlink/JavaSOMTool_visualization_chainlink_TopographicProduct_k4_b.png"></td>
  </tr>
  <tr>
    <th style="text-align: center" colspan="2">Java SOMViewer 10-cluster TP for k = 2</th>
  </tr>
  <tr>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/10-cluster/JavaSOMTool_visualization_10-Cluster_TopographicProduct_k2_a.png"></td>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/10-cluster/JavaSOMTool_visualization_10-Cluster_TopographicProduct_k2_b.png"></td>
  </tr>
  <tr>
    <th style="text-align: center" colspan="2">Java SOMViewer 10-cluster TP for k = 4</th>
  </tr>
  <tr>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/10-cluster/JavaSOMTool_visualization_10-Cluster_TopographicProduct_k4_a.png"></td>
    <td><img width="300px" src="./visualizations/Java_SOMToolBox_reference_visualizations/10-cluster/JavaSOMTool_visualization_10-Cluster_TopographicProduct_k4_b.png"></td>
  </tr>
 </table>

While there are clearly visible similarities, two runs on the same data return different visualizations.

### Evaluation of the visualizations

Adapting the parameter 'k' changes the granularity of the structural evaluation of the SOM. With low values (e.g. k=1), topographic structures can only be captured in a locally limited area, while higher values of k reveal more detail about the topographic differences in the neighbourhood. This is particularly visible in the uniformly coloured areas of the heat maps, which represent homogeneous values of the topographic product. Since a low value of the topographic product (towards 0) indicates an optimal representation of the neighbourhood, larger uniformly coloured areas appear at low values of 'k' than at higher ones.
However, very large values of 'k' (e.g. the number of all other weights) do not provide satisfactory results in some cases either, as a large amount of detail is lost here too.
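
To complement the visual comparison across 'k', each map can also be condensed into a single number. The sketch below averages log(P3) over all units, a simplified per-k variant of the averaged logarithm used in the usual definition of the topographic product. It assumes that the P3 map holds positive ratios rather than values that are already logarithms; `mean_log_p3` and the example numbers are hypothetical.

```python
import numpy as np

def mean_log_p3(p3_map):
    """Average log(P3) over all units; 0 indicates matching neighbourhood order."""
    values = np.asarray(p3_map, dtype=float)
    if np.any(values <= 0):
        raise ValueError("P3 values must be positive ratios before taking the log")
    return float(np.log(values).mean())

# Made-up 2x2 example maps for two k values (illustrative numbers only).
example_maps = {
    1: np.array([[1.0, 1.0], [1.0, 1.0]]),
    4: np.array([[1.2, 0.9], [1.0, 1.1]]),
}
summary = {k: mean_log_p3(p3) for k, p3 in example_maps.items()}
```

Values further from 0 then indicate stronger deviations between the neighbourhood order in input and output space for that 'k'.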