Carlos Bravo Garrán - 100474964

# __Seed Clustering__

In [None]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples
import matplotlib.cm as cm



import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### __1. Load the dataset__

Load the seed dataset from a CSV file. 

The features are stored in `X`, and the target class labels are stored in `y`.

A `seed` variable (`100474964`) is also defined so that random states are reproducible.


In [None]:
data = pd.read_csv('data/semillas.csv')
X = data.drop(columns=['clase'])
y = data['clase']

seed = 100474964

print(data.head())

### __2. Comparison of Scalers__

This section identifies the most appropriate scaler for the seed dataset before applying clustering algorithms. Scaling ensures that all features contribute equally to the distance calculations.
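
As a minimal sketch of why scaling matters for distance-based clustering, the example below uses a hypothetical three-point toy array (`X_toy`, not the seed dataset) whose two features live on very different scales. The helper `euclidean` is also illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (NOT the seed dataset): column 0 spans thousands,
# column 1 spans fractions of a unit.
X_toy = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],
    [2000.0, 0.1],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D points."""
    return float(np.linalg.norm(a - b))

# Unscaled: the distance is driven almost entirely by column 0.
d01_raw = euclidean(X_toy[0], X_toy[1])
d02_raw = euclidean(X_toy[0], X_toy[2])

# Min-max scaled: each column is mapped to [0, 1], so both features count.
X_toy_scaled = MinMaxScaler().fit_transform(X_toy)
d01_scaled = euclidean(X_toy_scaled[0], X_toy_scaled[1])
d02_scaled = euclidean(X_toy_scaled[0], X_toy_scaled[2])

print(f"raw:    d(0,1)={d01_raw:.3f}  d(0,2)={d02_raw:.3f}")
print(f"scaled: d(0,1)={d01_scaled:.3f}  d(0,2)={d02_scaled:.3f}")
```

Before scaling, point 2 looks far more distant from point 0 than point 1 does, only because of the units of the first column. After scaling, the large difference in the second column carries comparable weight.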

Three scalers are compared:
- MinMaxScaler
- RobustScaler
- StandardScaler

The scaled data is projected into 2D using PCA for visual evaluation.

In [None]:
scalers = {
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler(),
    'StandardScaler': StandardScaler()
}

plt.figure(figsize=(18, 5))

for i, (name, scaler) in enumerate(scalers.items(), 1):
    pipeline = make_pipeline(scaler, PCA(n_components=2, random_state=seed))
    
    X_pca = pipeline.fit_transform(X)
    
    plt.subplot(1, 3, i)
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
    plt.title(f'{name} + PCA')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.grid(True)

plt.suptitle('Comparison of Scalers with PCA (By Seed Class)', fontsize=16)
plt.tight_layout()
plt.show()


#### 2.1 Best Scaler Selection

After observing the PCA plots:

- **MinMaxScaler**: The data points are well distributed with moderate separation between different seed classes. Although some overlap exists, the distribution appears balanced and suitable for clustering.
- **RobustScaler**: There is a good separation of classes, but the spread of the data is larger, which may not be ideal for density-based methods.
- **StandardScaler**: Unlike RobustScaler, it is sensitive to outliers, and in this projection the classes are not well separated, making clustering more difficult.

The scaler selected is **MinMaxScaler** because it provides a balanced and homogeneous distribution of the data, facilitating the identification of clusters without introducing large variations in scale.


#### 2.2 Variance Explained by PCA

To check that the 2D PCA projection retains enough information from the original dataset, the variance explained by each of the two principal components, and their total, is computed after applying each scaler.


In [None]:
variance_ratios = {}

for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2, random_state=seed)
    X_pca = pca.fit_transform(X_scaled)
    
    variance_explained = np.sum(pca.explained_variance_ratio_)
    variance_ratios[name] = {
        'PC1': pca.explained_variance_ratio_[0],
        'PC2': pca.explained_variance_ratio_[1],
        'Total': variance_explained
    }

variance_table = pd.DataFrame.from_dict(variance_ratios, orient='index')
variance_table.index.name = 'Scaler'
variance_table.reset_index(inplace=True)

display(variance_table)


The results show that all three scalers achieve a high variance explanation (>85%), indicating that the 2D PCA projection is representative of the original dataset in every case.

Among them, **MinMaxScaler** achieves the highest variance explained, with 91.81%.

This supports the choice of **MinMaxScaler**: it retains the largest proportion of the scaled data's variance after dimensionality reduction. Therefore, MinMaxScaler is selected for the clustering tasks.
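
For reference, the explained variance ratio reported by PCA is each component's eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The sketch below checks this on hypothetical synthetic data (`X_demo`, `latent` and `mixing` are made up for illustration and are not the seed dataset).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data (NOT the seed dataset): 200 samples, 4 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0, 0.0],
                   [0.0, 1.0, 0.5, 0.3]])
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]

# Share of the total variance carried by each of the first two components.
manual_ratio = eigenvalues[:2] / eigenvalues.sum()

pca_demo = PCA(n_components=2).fit(X_demo)
print("manual :", manual_ratio)
print("sklearn:", pca_demo.explained_variance_ratio_)
```

Because the ratio is computed on whatever data PCA receives, the percentages describe the variance of the scaled features, which is why they change from one scaler to another.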

#### 2.3 Scaling and PCA
The dataset was scaled using the selected scaler (**MinMaxScaler**) and reduced to two dimensions using PCA.

In [None]:
pipeline = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=2, random_state=seed)
)

X_final_pca = pipeline.fit_transform(X)

plt.figure(figsize=(4, 3))
plt.scatter(X_final_pca[:, 0], X_final_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
# A single title call: a second plt.title() would silently override the first
plt.title('MinMaxScaler + PCA', fontsize=10)
plt.grid(True)
plt.show()


### __3. Clustering__

Apply unsupervised clustering techniques on the PCA-transformed seed dataset to identify natural groupings:

- K-Means
- Hierarchical Clustering
- DBSCAN

For each method, the key hyperparameters are tuned, the results are visualized, and their performance is compared to determine which best captures the structure of the data.


#### __3.1 K-means clustering__

Based on the visual inspection of the PCA plot, k = 3 is initially selected as a reasonable number of clusters.

##### __Model creation and training__

In [None]:
modelo_kmeans = KMeans(
    n_clusters=3,
    n_init=25,
    random_state=seed
)

modelo_kmeans.fit(X_final_pca)


##### __Cluster prediction__

In [None]:
y_pred = modelo_kmeans.predict(X_final_pca)


##### __Cluster Visualization and Evaluation__


In [None]:
# One distinct color per cluster index
colors = ['yellow', 'purple', 'green', 'orange', 'blue']

fig, ax = plt.subplots(1, 1, figsize=(6, 5))

for cluster in np.unique(y_pred):
    ax.scatter(
        x = X_final_pca[y_pred == cluster, 0],
        y = X_final_pca[y_pred == cluster, 1],
        color = colors[cluster],
        label = f"Cluster {cluster}",
        edgecolor = 'black',
        marker = 'o'
    )

ax.scatter(
    x = modelo_kmeans.cluster_centers_[:, 0],
    y = modelo_kmeans.cluster_centers_[:, 1],
    c = 'red',
    s = 200,
    marker = '*',
    label = 'Centroids'
)

ax.set_title('Clusters generated by KMeans (K=3)')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.legend()
ax.grid(True)
plt.show()


The K-Means clustering with k = 3 produced three clearly defined groups in the PCA-transformed space.

- Each cluster is compact and well-separated from the others, with minimal overlap between groups.
- The centroids are located close to the center of each cluster, indicating that the algorithm has correctly captured the dense regions of the data.
- The overall structure suggests that the data naturally forms three main groups, supporting the choice of k = 3.
- Although the clusters are not perfectly spherical, the distribution around each centroid is balanced and coherent.

In conclusion, the visual inspection confirms that K-Means with k = 3 is a good fit for the structure observed in the seed dataset.

Even so, other values of k will be tested to ensure robustness.

#### __Evaluation of K-Means with Different Values of k__

In this section, the performance of the K-Means algorithm is evaluated using different values of k (number of clusters). This helps identify the optimal number of clusters that best represents the structure of the data.

Plots are generated for k = 2 and k = 4, comparing the distribution of data in the PCA space.


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Results for K = 2
y_predict_2 = KMeans(n_clusters=2, n_init=25, random_state=seed).fit_predict(X=X_final_pca)
ax[0].scatter(
    x = X_final_pca[:, 0],
    y = X_final_pca[:, 1],
    c = y_predict_2,
    marker = 'o',
    edgecolor = 'black'
)
ax[0].set_title('KMeans with K=2')
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[0].grid(True)

# Results for K = 4
y_predict_4 = KMeans(n_clusters=4, n_init=25, random_state=seed).fit_predict(X=X_final_pca)
ax[1].scatter(
    x = X_final_pca[:, 0],
    y = X_final_pca[:, 1],
    c = y_predict_4,
    marker = 'o',
    edgecolor = 'black'
)
ax[1].set_title('KMeans with K=4')
ax[1].set_xlabel('PC1')
ax[1].set_ylabel('PC2')
ax[1].grid(True)

plt.tight_layout()
plt.show()



- With **k = 2**, the data is divided into two large groups, losing important internal structures.
- With **k = 4**, the clustering tends to oversegment the data, creating small clusters that do not correspond to natural groupings.
- With **k = 3**, the data is divided into three compact and well-separated groups, without significant overlap or oversegmentation.

The visual comparison confirms that **k = 3** is the most appropriate number of clusters for this dataset.

##### __Elbow Method__

To support the selection of the number of clusters, the elbow method is applied, which analyzes how the inertia decreases as the number of clusters increases.

The optimal number of clusters is identified at the point where adding an additional cluster no longer significantly reduces the inertia.

In [None]:
inertia = []
K = range(1, 11)

for k in K:
    # Same n_init and seed as the other K-Means runs, so results are comparable
    kmeans = KMeans(n_clusters=k, n_init=25, random_state=seed)
    kmeans.fit(X_final_pca)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(6, 4))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()


The plot of inertia against the number of clusters shows a clear "elbow" at k = 3, which again supports k = 3 as the most appropriate number of clusters for this dataset.
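
As a brief sketch of what the inertia value represents, the example below recomputes it by hand as the sum of squared distances from each point to its assigned centroid. It runs on hypothetical synthetic blobs (`X_demo`) standing in for the PCA-transformed data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical synthetic 2D data (NOT the seed dataset).
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 0], [0, 5]],
    cluster_std=0.8,
    random_state=0,
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centroids) ** 2))

print(f"manual inertia : {manual_inertia:.4f}")
print(f"KMeans.inertia_: {km.inertia_:.4f}")
```

Since every extra centroid can only bring points closer to their nearest center, inertia tends to fall as k grows, which is why the elbow looks for the point of diminishing returns rather than the minimum.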

##### __Silhouette Method__

The silhouette coefficient measures how close each point is to the other points in its own cluster compared with the points in the nearest other cluster. It ranges from -1 to 1, and the closer to 1, the better defined the cluster.
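
For a single point, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below implements this by hand on hypothetical synthetic blobs (`X_demo`) and compares it with `silhouette_samples`. The helper `silhouette_of_point` is illustrative and assumes every cluster has more than one point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Hypothetical synthetic 2D data (NOT the seed dataset).
X_demo, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [5, 0], [0, 5]],
    cluster_std=1.0,
    random_state=0,
)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

def silhouette_of_point(i, X, labels):
    """Silhouette s = (b - a) / max(a, b) for sample i (its cluster must have >1 point)."""
    dist = np.linalg.norm(X - X[i], axis=1)  # distances from point i to every point
    own = labels == labels[i]
    # a: mean distance to the other members of the same cluster (self-distance is 0).
    a = dist[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster.
    b = min(dist[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array(
    [silhouette_of_point(i, X_demo, labels_demo) for i in range(len(X_demo))]
)
reference = silhouette_samples(X_demo, labels_demo)
print("max abs difference:", np.abs(manual - reference).max())
```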

We will calculate and visualize the average silhouette coefficient for different values of k (from 2 to 10):

In [None]:
silhouette_scores = []
K = range(2, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, n_init=25, random_state=seed)
    labels = kmeans.fit_predict(X_final_pca)
    score = silhouette_score(X_final_pca, labels)
    silhouette_scores.append(score)

plt.figure(figsize=(6, 4))
plt.plot(K, silhouette_scores, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Method for Optimal k')
plt.grid(True)
plt.show()


It can be observed that k = 2 has the highest silhouette coefficient (~0.56), suggesting good separation between two groups. However, this may be because the classes are divided into two large generic groups, ignoring more natural subdivisions.

This conflicts with the earlier choice of k = 3. To resolve the discrepancy, individual silhouette plots are generated for k = 2 to k = 5 to check whether the clusters are well defined and how consistent the assignments are.

In [None]:
Ks = [2, 3, 4, 5]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, k in enumerate(Ks):
    ax = axes[idx]
    kmeans = KMeans(n_clusters=k, n_init=25, random_state=seed)
    cluster_labels = kmeans.fit_predict(X_final_pca)

    silhouette_vals = silhouette_samples(X_final_pca, cluster_labels)

    y_lower = 10
    for i in range(k):
        ith_cluster_silhouette_vals = silhouette_vals[cluster_labels == i]
        ith_cluster_silhouette_vals.sort()

        size_cluster_i = ith_cluster_silhouette_vals.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_vals,
            facecolor=color,
            edgecolor=color,
            alpha=0.7
        )

        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10 

    avg_score = silhouette_score(X_final_pca, cluster_labels)
    ax.text(0.7, 0.9, f'Avg Silhouette: {avg_score:.3f}', transform=ax.transAxes, color="red", fontsize=10)


    ax.axvline(x=avg_score, color="red", linestyle="--")
    ax.set_title(f'Silhouette plot for K = {k}')
    ax.set_xlabel('Silhouette coefficient values')
    ax.set_ylabel('Cluster label')
    ax.set_yticks([])
    ax.set_xlim([-0.1, 1])
    ax.grid(True)

plt.tight_layout()
plt.show()


| Number of Clusters (k) | Average Silhouette | Analysis                                                                 |
|------------------------|--------------------|-------------------------------------------------------------------------------|
| 2                      | ~0.56           | Clear and compact structure, but may be an excessive simplification          |
| 3                      | ~0.51           | Good separation, consistent with the actual number of classes (3)            |
| 4                      | ~0.43           | Overclustering, appearance of small clusters and overlap between groups      |
| 5                      | ~0.40           | Worse cohesion, some artificial clusters and poor point assignment           |


The silhouette plots show that k = 4 and k = 5 are poor choices: their average silhouette scores are lower (~0.43 and ~0.40), and clusters of very unequal size appear.

On the other hand, the silhouette analysis is more ambiguous when comparing k = 2 and k = 3, as we have seen before. The value of k = 2 presents the highest average silhouette coefficient (~0.56), and its clusters are clearly separated, although the grouping may be too general, hiding more detailed structures. This can also be seen in the plot, where one of the clusters (cluster 0) is larger and encompasses multiple subsets.

With k = 3, the average silhouette coefficient is slightly lower (~0.51), but the three clusters exhibit good internal cohesion and relative separation, with more balanced sizes and consistent bars. Additionally, k = 3 is consistent with the results obtained previously.

In summary, although k = 2 achieves the highest average score, the visual and conceptual analysis suggests that k = 3 offers a richer, more interpretable segmentation aligned with the actual structure of the dataset.
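
One possible way to check the agreement between clusters and known classes is a contingency table together with the adjusted Rand index. The sketch below is illustrative only: it uses hypothetical synthetic blobs with three known groups (`X_demo`, `y_demo`) in place of the seed data and the `clase` column, so its numbers say nothing about the actual dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical synthetic data with 3 known groups (NOT the seed dataset).
X_demo, y_demo = make_blobs(
    n_samples=210,
    centers=[[0, 0], [5, 0], [0, 5]],
    cluster_std=0.6,
    random_state=0,
)

scores = {}
for k in (2, 3):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Adjusted Rand index: 1.0 means the partition matches the known groups exactly.
    scores[k] = adjusted_rand_score(y_demo, labels_k)
    print(f"k={k}  ARI={scores[k]:.3f}")
    # Rows: known group, columns: assigned cluster.
    print(pd.crosstab(pd.Series(y_demo, name='true'),
                      pd.Series(labels_k, name='cluster')))
```

Applied to the notebook's own variables, the same calls would take `y` and the K-Means labels computed on `X_final_pca`. Cluster indices are arbitrary, so the table should be read by rows and columns rather than by matching label numbers.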

