Carlos Bravo Garrán - 100474964

# __Seed Clustering__

In [None]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# Apply the ggplot style once (the original line called two equivalent APIs)
plt.style.use('ggplot')

from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_blobs

import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### __1. Load the dataset__

Load the seed dataset from a CSV file. 

The features are stored in `X`, and the target class labels are stored in `y`.

A `seed` variable (`100474964`) is also defined so that every step that takes a random state is reproducible.


In [None]:
data = pd.read_csv('data/semillas.csv')
X = data.drop(columns=['clase'])
y = data['clase']

seed = 100474964

print(data.head())

### __2. Comparison of Scalers__

In this section, we aim to identify the most appropriate scaler for the seed dataset before applying clustering algorithms. Scaling is crucial to ensure that all features contribute equally to the distance calculations.

We compare three scalers:
- MinMaxScaler
- RobustScaler
- StandardScaler

The scaled data is projected into 2D using PCA for visual evaluation.
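
To illustrate why scaling matters for distance-based methods, here is a minimal sketch on hypothetical toy data (`X_toy` and `feature_share_of_distance` are illustrative names, not part of the seed dataset or the analysis). It measures how much each feature contributes to the squared Euclidean distance between two rows, before and after each scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical toy data: column 0 is in the hundreds, column 1 is below one.
X_toy = np.array([
    [100.0, 0.1],
    [200.0, 0.9],
    [300.0, 0.5],
    [400.0, 0.3],
])


def feature_share_of_distance(X):
    """Share of the squared Euclidean distance between rows 0 and 1 due to each feature."""
    sq_diff = (X[0] - X[1]) ** 2
    return sq_diff / sq_diff.sum()


# Without scaling, the large-valued feature dominates the distance.
raw_share = feature_share_of_distance(X_toy)
print("raw:", raw_share.round(4))

# After scaling, the small-valued feature is no longer drowned out.
scaled_shares = {}
for scaler in (MinMaxScaler(), RobustScaler(), StandardScaler()):
    name = type(scaler).__name__
    scaled_shares[name] = feature_share_of_distance(scaler.fit_transform(X_toy))
    print(f"{name}:", scaled_shares[name].round(4))
```

In this toy case the unscaled distance is driven almost entirely by the first column, while each scaler lets the second column contribute according to its relative spread.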

In [None]:
scalers = {
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'StandardScaler': StandardScaler()
}

plt.figure(figsize=(18, 5))

for i, (name, scaler) in enumerate(scalers.items(), 1):
    pipeline = make_pipeline(scaler, PCA(n_components=2, random_state=seed))
    
    X_pca = pipeline.fit_transform(X)
    
    plt.subplot(1, 3, i)
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
    plt.title(f'{name} + PCA')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.grid(True)

plt.suptitle('Comparison of Scalers with PCA (By Seed Class)', fontsize=16)
plt.tight_layout()
plt.show()


#### 2.1 Best Scaler Selection

After observing the PCA plots:

- **MinMaxScaler**: The data points are well distributed with moderate separation between different seed classes. Although some overlap exists, the distribution appears balanced and suitable for clustering.
- **RobustScaler**: There is a good separation of classes, but the spread of the data is larger, which may not be ideal for density-based methods.
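- **StandardScaler**: The classes are not well separated in the projection, making clustering more difficult. This scaler relies on the mean and standard deviation, so it is also more sensitive to outliers than RobustScaler.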

We select **MinMaxScaler** because it provides a balanced and homogeneous distribution of the data, facilitating the identification of clusters without introducing large variations in scale.
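
A visual comparison like this could be cross-checked with a numeric separation measure. The sketch below is only an illustration on synthetic data from `make_blobs` (an assumed stand-in with 3 classes and 7 features, not the seed dataset): it computes the silhouette score of the known labels in each scaler + PCA projection. The names `X_demo`, `y_demo` and `separation_scores` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in data (assumption: 3 classes, 7 features).
X_demo, y_demo = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)

separation_scores = {}
for scaler in (MinMaxScaler(), RobustScaler(), StandardScaler()):
    pipeline = make_pipeline(scaler, PCA(n_components=2, random_state=0))
    X_demo_pca = pipeline.fit_transform(X_demo)
    # Silhouette of the known labels: values near 1 mean compact, well separated classes.
    separation_scores[type(scaler).__name__] = silhouette_score(X_demo_pca, y_demo)

for name, score in separation_scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, the same loop could be run with `X` and `y` in place of the synthetic arrays; the resulting scores would support or challenge the visual impression rather than replace it.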


#### 2.2 Variance Explained by PCA

To ensure that the 2D projection using PCA retains enough information from the original dataset, we calculate, after applying each scaler, the variance explained by the two principal components and the total variance.
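
For reference, the explained variance ratio of a component is its eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The following sketch checks this by hand on hypothetical random data (`X_demo` is an illustrative array, not the seed dataset) and compares it with what `PCA` reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 correlated features.
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

# The first two manual ratios should match scikit-learn's explained_variance_ratio_.
pca = PCA(n_components=2).fit(X_demo)
print("manual :", manual_ratio[:2])
print("sklearn:", pca.explained_variance_ratio_)
```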


In [None]:
variance_ratios = {}

for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2, random_state=seed)
    X_pca = pca.fit_transform(X_scaled)
    
    variance_explained = np.sum(pca.explained_variance_ratio_)
    variance_ratios[name] = {
        'PC1': pca.explained_variance_ratio_[0],
        'PC2': pca.explained_variance_ratio_[1],
        'Total': variance_explained
    }

variance_table = pd.DataFrame.from_dict(variance_ratios, orient='index')
variance_table.index.name = 'Scaler'
variance_table.reset_index(inplace=True)

display(variance_table)


The results show that, for all three scalers, the first two principal components explain more than 85% of the variance, indicating that the 2D PCA projection is representative of the scaled dataset in every case.

Among them, **MinMaxScaler** achieves the highest variance explained, with 91.81%.

This supports the choice of **MinMaxScaler**: its 2D projection retains the largest share of the variance of the scaled data. Each ratio is computed relative to the variance of its own scaled data, so this figure complements the visual comparison rather than replacing it. We therefore proceed with MinMaxScaler for the clustering tasks.

#### 2.3 Scaling and PCA
The dataset is scaled using the selected scaler (**MinMaxScaler**) and reduced to two dimensions using PCA.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=2, random_state=seed)
)

X_final_pca = pipeline.fit_transform(X)

plt.figure(figsize=(4, 3))
plt.scatter(X_final_pca[:, 0], X_final_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
# Single title call: the original set the title twice and the second call overrode the first
plt.title('MinMaxScaler + PCA', fontsize=10)
plt.grid(True)
plt.show()
