# Visualizing a representation of the `penguins` dataset in 2D with a *Self-Organizing Map*

In [01_kmeans_clustering.ipynb](./01_kmeans_clustering.ipynb) we used the `K-Means` algorithm to find groups of data points. Here, we want to learn a representation of the dataset. K-Means can also be used for representation learning in the sense that it discretizes the data space into disjoint regions. However, once a dataset has more than three dimensions, visualizing these groups becomes challenging.

A *Self-Organizing Map* (SOM), on the other hand, is similar to `K-Means` in the sense that it also operates with centroids to which data samples get mapped. One very important difference is that the centroids are chained together from the very beginning. After the SOM has discovered and mapped the dataset, this neighborhood information can be used to flatten the SOM grid into a 2D visual. As hinted, a SOM could be used for clustering as well, yet in this notebook the objective is to learn and visualize a simpler representation of the original dataset.
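
To make the "chained centroids" idea concrete, here is a minimal sketch of a single online SOM update written with plain NumPy. It is a generic textbook-style update with a Gaussian neighborhood, not necessarily what the training utility used later in this notebook does internally; the function name `som_update_step` and the toy values are illustrative assumptions.

```python
import numpy as np

def som_update_step(weights, x, learning_rate=0.5, sigma=1.0):
    """One online SOM update for a single sample x (illustrative sketch).

    weights: array of shape (d1, d2, n_features), node positions in data space.
    Returns the updated weights and the grid index of the best-matching unit.
    """
    d1, d2, _ = weights.shape
    # Best-matching unit (BMU): the node closest to x in data space
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (d1, d2))
    # Gaussian neighborhood measured on the grid, not in data space
    rows, cols = np.meshgrid(np.arange(d1), np.arange(d2), indexing="ij")
    grid_dist_sq = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    h = np.exp(-grid_dist_sq / (2 * sigma ** 2))
    # Every node moves toward x; grid neighbors of the BMU move the most
    new_weights = weights + learning_rate * h[:, :, None] * (x - weights)
    return new_weights, bmu

rng = np.random.default_rng(42)
weights = rng.normal(size=(3, 3, 2))  # toy 3x3 grid in a 2D data space
x = np.array([2.0, 2.0])              # a single toy sample
new_weights, bmu = som_update_step(weights, x)
print("BMU grid index:", bmu)
```

In K-Means only the nearest centroid would be affected by `x`; here the neighborhood term `h` drags the BMU's grid neighbors along, which is what keeps the grid ordered.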

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

RAND = 42

## Dataset

In [None]:
X_scaled = pd.read_csv("../datasets/penguins/simple/X_scaled.csv", index_col=0, header=0)
y = pd.read_csv("../datasets/penguins/simple/y.csv", index_col=0, header=0)

In [None]:
# TODO: Do some exploratory data analysis (Optional)
# ...

## Experiment 1: Use only two features

In this experiment, we treat the first two features of the dataset as if they were a dataset with arbitrary dimensions, use a SOM to learn a representation of it, and visualize it in 2D. In this case we will be able to verify convergence easily.

1. Define hyperparameters

	> use the `calc_recommended_grid_size` function to get candidate values for `d1` and `d2`

	> e.g.: `print("Recommended total node count and dimension size for SOM:", calc_recommended_grid_size(...))`

2. Fit SOM

	> use the `train_som` function to obtain a `som` object fitted on the data

	> observe the quality of the resulting SOM, and experiment with different hyperparameter combinations to arrive at a better representation

	> *Quantization error (QE)* measures the mean distance between each data point and its best-matching node (a.k.a. its nearest centroid)

	> *Topographic error (TE)* measures the proportion of data points for which the 1st and 2nd best-matching units are not neighbors in the SOM grid

3. Extract learned node weights (prototype / representative point positions) from SOM

	> use `som.get_weights()` and flatten them

4. Visualize these 2D coordinates on the 2D dataset

	We can expect a result similar to the following, though yours could be better:

	![](./_images/som_representation_expectation.png)

5. Visualize the representation via the SOM maps

	We can expect a result similar to the following, though yours could be better:

	![](./_images/som_representation_maps.png)
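
The two quality measures from step 2 can be sketched directly from the node weights. The helper below is a minimal NumPy illustration, not the implementation used by the notebook's utilities. It assumes a rectangular grid and treats two nodes as neighbors when they differ by at most one step in each grid direction (the 8-neighborhood); both the function name and that neighborhood definition are assumptions.

```python
import numpy as np

def quantization_and_topographic_error(weights, X):
    """Return (QE, TE) for node weights of shape (d1, d2, n_features)."""
    d1, d2, n_features = weights.shape
    flat = weights.reshape(-1, n_features)
    # Pairwise distances between samples and nodes: shape (n_samples, n_nodes)
    dists = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]  # 1st and 2nd best-matching units
    # QE: mean distance from each sample to its best-matching node
    qe = dists[np.arange(len(X)), bmu1].mean()
    # TE: share of samples whose two best nodes are not adjacent on the grid
    r1, c1 = np.unravel_index(bmu1, (d1, d2))
    r2, c2 = np.unravel_index(bmu2, (d1, d2))
    not_neighbors = (np.abs(r1 - r2) > 1) | (np.abs(c1 - c2) > 1)
    te = not_neighbors.mean()
    return qe, te

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 2))           # stand-in for the real data
weights_demo = rng.normal(size=(4, 4, 2))   # stand-in for som.get_weights()
qe, te = quantization_and_topographic_error(weights_demo, X_demo)
print(f"QE: {qe:.3f}, TE: {te:.3f}")
```

A low QE means the nodes sit close to the data; a low TE means the grid is not folded over itself. Randomly placed nodes, as in this demo, will typically score poorly on TE.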

In [None]:
from _utilities.som import calc_recommended_grid_size, train_som
from _utilities.som_plot import visualize_distance_map, visualize_hitmap, place_node_edges

In [None]:
selected_columns = X_scaled.columns[:2]
print("Selected features:", selected_columns.tolist())

X_scaled_F12 = X_scaled[selected_columns].copy().values
print("Data shape:", X_scaled_F12.shape)

In [None]:
# TODO: Hyperparameters

hyparams = {
    "d1": ?,
    "d2": ?,
    "sigma": ?, # start with D/2 for a more robust map, or D/3 or smaller for a more flexible one
    "learning_rate": .5, # start with .5, usually between .5-1
    "num_iteration": 10, # a single iteration touches every data point in the dataset
    "topology": "rectangular"
}

# TODO: Fit SOM

som = train_som(X=?, **hyparams, random_seed=RAND, verb=True)

# TODO: Extract learned node weights

node_weights = som.get_weights()
node_weights_flat = node_weights.reshape(-1, node_weights.shape[2])

In [None]:
# TODO: Visualize

plt.figure(figsize=(8, 6))

# data points
plt.scatter(?, ?, color="cornflowerblue", edgecolors="royalblue", alpha=.75)

# node positions
plt.scatter(node_weights_flat[:, 0], node_weights_flat[:, 1], color="red", edgecolors="red", alpha=.9)
place_node_edges(som, ax=plt.gca())

plt.title(?)
plt.gca().set_aspect("equal")
plt.xlabel(?)
plt.ylabel(?)

plt.show()

In [None]:
# TODO: Visualize with the SOM maps

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 4))

visualize_distance_map(som=?, X=?, ax=ax0)
visualize_hitmap(som=?, X=?, ax=ax1)

plt.tight_layout()
plt.show()

## Experiment 2: Use all four features

Follow the same steps, but use the original dataset (`X_scaled`) instead of the limited one. With four features, the node positions can no longer be shown directly in a single two- or three-dimensional plot. Instead, we can plot the features pairwise, plot three features at a time in 3D, or use PCA to reduce the dimensions. Still, we expect the final SOM maps to capture the original high-dimensional structure and to place items that are similar in the original space onto nearby locations in the SOM map. Compare the final results of the two scenarios.
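
As one possible way to do the PCA option, the sketch below projects both the data and the flattened node weights onto the first two principal components of the data, using only NumPy's SVD. The helper name `pca_project` and the random stand-in arrays are assumptions for illustration; in the notebook you would pass `X_scaled.values` and the flattened `som.get_weights()` instead.

```python
import numpy as np

def pca_project(X, points, n_components=2):
    """Project `points` onto the top principal components fitted on X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    return (points - mean) @ components.T

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))       # stand-in for X_scaled.values
nodes_demo = rng.normal(size=(5, 5, 4))  # stand-in for som.get_weights()
nodes_flat = nodes_demo.reshape(-1, nodes_demo.shape[2])

X_2d = pca_project(X_demo, X_demo)       # data in the PCA plane
nodes_2d = pca_project(X_demo, nodes_flat)  # nodes in the same plane
print(X_2d.shape, nodes_2d.shape)
```

Because both arrays are projected with the same mean and components, they can be drawn on the same scatter plot, just like the data points and node positions in Experiment 1.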

In [None]:
# TODO: Follow the same steps but for the original dataset