# K-Means Clustering on Iris Dataset
**Authors:** Magudeshwaran and Senthilkumaran

**Goal:** Group similar iris flowers into clusters using the K-Means algorithm, after reducing data dimensions with PCA.

### Step 1: Import Libraries
We need `pandas` for loading the data, `numpy` for array operations, `matplotlib` for plotting, and scikit-learn (`sklearn`) for scaling, PCA, and K-Means.

In [None]:
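import os

# Limit OpenMP to one thread to work around the thread-related warning or error
# that KMeans can trigger (for example on Windows with MKL). This must be set
# before numpy and scikit-learn are imported, otherwise it may have no effect.
os.environ["OMP_NUM_THREADS"] = "1"

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans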

### Step 2: Load the Data
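We load the Iris dataset, a classic benchmark for clustering. It contains 150 flowers from three species, each described by four measurements in centimeters: sepal length, sepal width, petal length, and petal width.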

In [None]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)
df.head()
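
Before clustering, a quick sanity check on the data can be useful. The sketch below is an illustration rather than part of the original notebook. To stay runnable offline it uses scikit-learn's bundled copy of Iris (`load_iris`), which is assumed to match the CSV in content but names its columns differently; `check_df` is a hypothetical name chosen here.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for the CSV loaded above.
iris = load_iris(as_frame=True)
check_df = iris.frame  # four feature columns plus an integer 'target' column

n_rows, n_cols = check_df.shape
missing = int(check_df.isna().sum().sum())           # total count of missing cells
per_species = check_df['target'].value_counts().sort_index()

print(f"rows={n_rows}, columns={n_cols}, missing values={missing}")
print(per_species)
```

The same three checks (shape, missing values, class balance) can be run on `df` directly, using `df['species']` in place of `target`.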

### Step 3: Prepare the Data
We prepare our data by:
- Selecting only the numerical features for clustering.
- Standardizing the features to zero mean and unit variance, so that no single measurement dominates the distance calculations.
- Using PCA to reduce the data to 2 dimensions for easy plotting.
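
The scaling step in the list above amounts to a z-score per column. As a minimal sketch on hypothetical toy data (the `toy` array is made up for illustration), the manual calculation should agree with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler applied to the same data for comparison
scaled = StandardScaler().fit_transform(toy)

print(manual)
print("Matches StandardScaler:", np.allclose(manual, scaled))
```

After scaling, both columns contribute on the same footing to the Euclidean distances that K-Means relies on.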

In [None]:
# Select numerical features
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
y_true = df['species'].values # True labels for comparison, not used in clustering

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
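
Reducing four features to two discards some information, so it can help to check how much variance the two components keep. The following is a sketch under the assumption that scikit-learn's bundled Iris data is an acceptable stand-in for the CSV; `pca_check` and `features_scaled` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assumption: bundled Iris data stands in for the CSV used in this notebook
features = load_iris().data

features_scaled = StandardScaler().fit_transform(features)
pca_check = PCA(n_components=2).fit(features_scaled)

# Fraction of the total variance captured by each principal component
ratios = pca_check.explained_variance_ratio_
total = ratios.sum()

print("Variance explained per component:", ratios.round(3))
print("Total variance kept:", round(float(total), 3))
```

In the notebook itself, the same attribute is available as `pca.explained_variance_ratio_` once `pca.fit_transform` has run.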

### Step 4: Determine Optimal K (Elbow Method)
The Elbow Method helps us choose the number of clusters (K). We plot the inertia (the sum of squared distances from each sample to its closest cluster center) for K from 1 to 10. Inertia generally falls as K grows, so we look for the "elbow": the point after which adding more clusters yields only small improvements.

In [None]:
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) # Set n_init explicitly
    kmeans.fit(X_pca)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()
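
The inertia plotted above can be recomputed by hand, which makes its definition concrete. This is a minimal sketch on hypothetical toy data (three synthetic blobs, not the Iris data); the names `toy_points` and `manual_inertia` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three loose blobs in 2D
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_points = np.vstack(
    [c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers]
)

model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(toy_points)

# Squared distance from every point to every center: shape (n_samples, n_clusters)
diffs = toy_points[:, None, :] - model.cluster_centers_[None, :, :]
sq_dist = (diffs ** 2).sum(axis=2)

# Each point contributes its squared distance to the closest center
manual_inertia = sq_dist.min(axis=1).sum()

print("manual:", manual_inertia)
print("sklearn:", model.inertia_)
```

The two values should agree up to floating-point error.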

### Step 5: Apply K-Means with Optimal K
From the elbow plot, K=3 seems to be a good choice, and it also matches the three species in the dataset. Now we apply K-Means with 3 clusters.
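
Reading an elbow by eye can be subjective, so one possible aid is to print how much the inertia falls at each step. The sketch below assumes scikit-learn's bundled Iris data as a stand-in for the CSV and rebuilds the scaled, PCA-reduced data under the illustrative name `data_2d`.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Assumption: bundled Iris data stands in for the CSV used in this notebook
data_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_iris().data)
)

ks = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(data_2d).inertia_
    for k in ks
]

# Relative improvement gained by moving from K-1 to K clusters
for k, prev, curr in zip(ks[1:], inertias[:-1], inertias[1:]):
    drop = (prev - curr) / prev
    print(f"K={k}: inertia falls by {drop:.1%}")
```

Large drops followed by consistently small ones indicate where the elbow lies.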

In [None]:
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) # Set n_init explicitly
clusters = kmeans.fit_predict(X_pca)
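
Since the true species are known, the cluster assignments can be compared against them with a contingency table. This is a sketch assuming scikit-learn's bundled Iris data in place of the CSV; `labels` and `table` are illustrative names, and the cluster ids themselves are arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Assumption: bundled Iris data stands in for the CSV used in this notebook
iris = load_iris()
species = iris.target_names[iris.target]  # species name for each sample

data_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data)
)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(data_2d)

# Rows: true species; columns: K-Means cluster ids
table = pd.crosstab(
    pd.Series(species, name='species'),
    pd.Series(labels, name='cluster'),
)
print(table)
```

In the notebook, the equivalent call would be `pd.crosstab(y_true, clusters)`. A row spread across several columns means that species was split between clusters.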

### Step 6: Visualize the Clusters
Finally, we plot the clustered data in the two PCA dimensions. Each point is colored by its assigned cluster, and the red X markers show the cluster centers.

In [None]:
plt.figure(figsize=(8, 6))
# Keep a handle to the data scatter so the colorbar reflects the cluster labels
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', marker='o', s=50, edgecolor='k')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X') # Plot cluster centers
plt.title('K-Means Clustering of Iris Data (PCA-reduced)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, ticks=[0, 1, 2], label='Cluster Label')
plt.grid(True)
plt.show()
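
The cluster centers live in PCA space, which is hard to interpret directly. One way to read them, sketched below, is to map them back to the original measurements by undoing PCA and then undoing the scaling. The reconstruction is approximate because two components were dropped. The sketch assumes scikit-learn's bundled Iris data, and the `demo_` names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Assumption: bundled Iris data stands in for the CSV used in this notebook
iris = load_iris()
demo_scaler = StandardScaler().fit(iris.data)
demo_scaled = demo_scaler.transform(iris.data)
demo_pca = PCA(n_components=2).fit(demo_scaled)
demo_model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(
    demo_pca.transform(demo_scaled)
)

# Undo PCA (approximately), then undo the standardization
centers_scaled = demo_pca.inverse_transform(demo_model.cluster_centers_)
centers_original = demo_scaler.inverse_transform(centers_scaled)

centers_df = pd.DataFrame(centers_original, columns=iris.feature_names).round(2)
print(centers_df)
```

Each row then describes a "typical" flower for one cluster in centimeters.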