# Project 1: Clustering with SOM
## Imports and Setup

Basic imports, sklearn utilities, MiniSom, and matplotlib configuration. MiniSom is imported directly, so it must already be installed (for example with `pip install minisom`).

In [5]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from minisom import MiniSom

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
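
If MiniSom may be missing from the environment, a guarded import is one option. The following is a minimal sketch, assuming `pip` is available to the running interpreter; `ensure_package` is a hypothetical helper introduced here for illustration and is not part of MiniSom or sklearn.

```python
import importlib
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if the import fails.

    Hypothetical helper for illustration; not part of MiniSom or sklearn.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the same interpreter that is running this notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demonstrated on a standard-library module so the sketch runs offline.
json_module = ensure_package("json")

# In this notebook the call would be (needs network access if MiniSom is missing):
# MiniSom = ensure_package("minisom").MiniSom
```

Installing from inside a notebook is convenient for a one-off project, but pinning dependencies in a requirements file is usually easier to reproduce.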

## Data Loading and Preparation

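Load the digits dataset, scale the features to [0, 1] with MinMaxScaler, and split into training, validation, and test sets (60/20/20, stratified by class). The split is done in two stages: 20% is held out as the test set, then 25% of the remaining 80% becomes the validation set. Note that the scaler is fit on the full dataset before splitting.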
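
The 60/20/20 proportions come from applying two fractions in sequence, which can be checked with plain arithmetic. This is a small sketch, assuming the test count is rounded up and the remainder goes to training (which reproduces the sizes this notebook prints); `split_sizes` is a hypothetical helper, and 1797 is the sample count of the digits dataset.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a fractional hold-out split.

    Assumes the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


n_total = 1797  # samples in the digits dataset

# Stage 1: hold out 20% of everything as the final test set.
n_train_init, n_test = split_sizes(n_total, 0.2)

# Stage 2: hold out 25% of the remaining 80% as the validation set.
n_train_final, n_val = split_sizes(n_train_init, 0.25)

for name, n in [("train", n_train_final), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%} of the data)")
```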

In [6]:
# Load the digits dataset
digits = load_digits()
X = digits.data    # Original pixel data (8x8 images flattened to 64 features)
y = digits.target  # Labels (0-9)
print(f"Digits dataset loaded: {X.shape[0]} samples, {X.shape[1]} features.")
print(f"Number of classes: {len(np.unique(y))}")

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(f"Data scaled using {type(scaler).__name__}.")

# 1. Split into initial train (80%) and final test (20%) sets.
#    Scaled X, y, and original X are passed together so their rows stay aligned.
X_train_init, X_test, y_train_init, y_test, X_train_init_orig, X_test_orig = train_test_split(
    X_scaled, y, X,
    test_size=0.2,
    random_state=42,
    stratify=y,
)
print(f"Initial data split: Initial Train ({X_train_init.shape[0]}), Final Test ({X_test.shape[0]})")

# 2. Split initial train into final train and validation sets
X_train_final, X_val, y_train_final, y_val, X_train_final_orig, X_val_orig = train_test_split(
    X_train_init, y_train_init, X_train_init_orig,  # Scaled and original initial train data
    test_size=0.25,  # 25% of the initial 80% = 20% of all data
    random_state=42,
    stratify=y_train_init,
)
print(f"Train/Validation split: Final Train ({X_train_final.shape[0]}), Validation ({X_val.shape[0]})")
print("Digits data prepared and split into Train/Validation/Test sets.")
print("-" * 30)

Digits dataset loaded: 1797 samples, 64 features.
Number of classes: 10
Data scaled using MinMaxScaler.
Initial data split: Initial Train (1437), Final Test (360)
Train/Validation split: Final Train (1077), Validation (360)
Digits data prepared and split into Train/Validation/Test sets.
------------------------------
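
The cell above fits the scaler on all samples before splitting, so the validation and test sets contribute to the scaling statistics. A common alternative is to fit the scaler on the training split only and reuse it for the other splits. The sketch below shows that variant with the same split parameters; it is an illustration of the alternative, not the procedure used in this notebook, and the `_s` variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

# Same two-stage stratified split parameters, applied to the unscaled data.
X_train_init, X_test, y_train_init, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_init, y_train_init, test_size=0.25, random_state=42, stratify=y_train_init
)

# Fit on the training split only, then reuse the fitted scaler everywhere else.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print("Train range:", X_train_s.min(), X_train_s.max())
print("Validation range:", X_val_s.min(), X_val_s.max())
```

With this variant, held-out values can fall slightly outside [0, 1] whenever a validation or test pixel exceeds the range seen in training.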
