# Clustering Model for Cats and Dogs

---

This is an ML model created by Keshav Ghai (an aspiring AI/ML dev).

This is an **unsupervised clustering model** that groups cat and dog images without using labels during training. Unlike supervised models, which learn from labeled examples, it looks for natural groupings in the image data. The training script standardizes the pixel values, applies **dimensionality reduction** with PCA (Principal Component Analysis), and then runs **KMeans clustering** to split the images into two groups. Labels inferred from the filenames are used only afterwards, to evaluate how well the clusters line up with cats and dogs.

## What makes this different?

Traditional supervised models require labeled training data. This clustering model:
- Works with **unlabeled data** - no need for training labels
- Uses **PCA** to reduce image dimensionality from 150,528 pixel values (224×224 pixels × 3 channels) to 100 components
- Uses **KMeans** algorithm to find 2 natural clusters in the data
- Demonstrates **unsupervised learning** - the model discovers patterns on its own
- Shows how **feature extraction** and **clustering** work together

The model is experimental and educational, designed to teach the fundamentals of unsupervised learning.
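
The three stages listed above (scale, reduce, cluster) can be sketched end to end on synthetic data. This is only an illustration under stated assumptions, not the notebook's training script: the random blobs from `make_blobs` stand in for flattened images, and the sizes (200 samples, 50 features, 5 components) are arbitrary values chosen so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for flattened images: 200 samples, 50 features, 2 groups.
X_demo, _ = make_blobs(n_samples=200, n_features=50, centers=2, random_state=0)

# Same three stages as the notebook: scale -> reduce -> cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
cluster_ids = pipeline.fit_predict(X_demo)

print("Cluster sizes:", np.bincount(cluster_ids))
```

Wrapping the stages in one pipeline keeps the scaler and PCA fitted on the same data that KMeans sees, which is also what the separate notebook cells do by hand.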

## Imports

---

In [None]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, silhouette_score
from scipy.optimize import linear_sum_assignment

## 0 - Dataset Normalization (Data Preprocessing)

---

Before starting the clustering model, we must normalize the dataset. This step is crucial for image data:

**Why normalize?** Raw images come in different sizes and formats. Normalization standardizes them so the model can process them consistently.

### Normalization Steps:
1. **Read images** using OpenCV from the dataset folder
2. **Convert color space** from BGR (OpenCV's default) to RGB (standard format)
3. **Resize to 224×224** - ensures all images have the same dimensions for flattening
4. **Save normalized images** to a new clean directory

The normalization script handles this automatically, creating a `normalized_dataset` folder with properly formatted images ready for clustering.
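
As a small aside on step 2: for an 8-bit, 3-channel image, the BGR to RGB conversion amounts to reversing the channel axis. The sketch below shows this with plain NumPy on a made-up 2×2 image (the pixel values are hypothetical), so it runs without OpenCV.

```python
import numpy as np

# A tiny 2x2 "image" in OpenCV's BGR channel order (hypothetical pixel values).
bgr = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [10, 20, 30]]],
    dtype=np.uint8,
)

# For a 3-channel image, BGR -> RGB is a reversal of the last axis.
rgb = bgr[..., ::-1]

# Reversing again restores the original, so a BGR -> RGB -> BGR round trip
# leaves the pixel data unchanged.
round_trip = rgb[..., ::-1]

print(rgb[0, 0])                        # the pure-blue BGR pixel, now in RGB order
print(np.array_equal(round_trip, bgr))
```

Because the round trip is lossless, converting to RGB and back to BGR before saving does not alter the files on disk; the resize is the step that actually changes the pixel data.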

In [None]:
# normalization.py

import os
import cv2

INPUT = './dataset'
OUTPUT = './normalized_dataset'

size = (224, 224)  # All images resized to 224x224

def ensure_dir(path):
    """Create directory if it doesn't exist."""
    if not path:
        return
    if not os.path.exists(path):
        os.makedirs(path)

def process_image(rel_path):
    """Read, convert, and resize a single image. `rel_path` is relative to INPUT."""
    if not rel_path.lower().endswith(('.jpg', '.png', '.jpeg')):
        return

    INPUT_PATH = os.path.join(INPUT, rel_path)
    OUTPUT_PATH = os.path.join(OUTPUT, rel_path)

    # Ensure output directory exists
    parent_dir = os.path.dirname(OUTPUT_PATH)
    ensure_dir(parent_dir)

    # Read image with OpenCV
    img = cv2.imread(INPUT_PATH)

    if img is None:
        print(f"Warning: could not read image, skipping: {INPUT_PATH}")
        return

    # Convert BGR → RGB
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    # Resize to 224×224
    img = cv2.resize(img, size)

    # Save normalized image (convert back to BGR for saving).
    # The output directory was already created above.
    cv2.imwrite(OUTPUT_PATH, cv2.cvtColor(img, cv2.COLOR_RGB2BGR))

def main():
    """Process all images in the input directory recursively, preserving structure."""
    ensure_dir(OUTPUT)
    for root, _, files in os.walk(INPUT):
        for fname in files:
            rel_root = os.path.relpath(root, INPUT)
            rel_path = fname if rel_root == '.' else os.path.join(rel_root, fname)
            process_image(rel_path)

    print("Normalization complete! All images resized to 224x224.")

if __name__ == "__main__":
    main()

## 1 - Loading and Flattening Images

---

> Images are read from the normalized dataset and converted into flat vectors.
> Each 224×224 RGB image becomes a 1D array of 150,528 values (224×224 pixels × 3 color channels).

In [None]:
import os
import cv2
import numpy as np

DATA_DIR = "./normalized_dataset"

images = []
labels = []

# Load all images
for filename in os.listdir(DATA_DIR):
    if not filename.lower().endswith((".jpg", ".png", ".jpeg")):
        continue

    path = os.path.join(DATA_DIR, filename)

    # Read image with OpenCV
    img = cv2.imread(path)
    if img is None:
        print(f"Warning: could not read image, skipping: {path}")
        continue

    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Extract label from filename for evaluation
    fname_lower = filename.lower()
    if "cat" in fname_lower and "dog" not in fname_lower:
        label = 0  # Cats = 0
    elif "dog" in fname_lower and "cat" not in fname_lower:
        label = 1  # Dogs = 1
    else:
        # Ambiguous or unknown filename - skip the image entirely so that
        # X and y keep the same length and stay row-aligned
        print(f"Warning: could not determine label from filename, skipping image: {filename}")
        continue

    # Flatten 224×224×3 image into 1D array (150,528 values)
    images.append(img.flatten())
    labels.append(label)

# Convert to numpy arrays
X = np.array(images)
y = np.array(labels)

print(f"Dataset loaded:")
print(f"  Total images: {X.shape[0]}")
if X.size:
    print(f"  Dimensions per image: {X.shape[1]} (224×224×3 pixels)")
print(f"  Label distribution: {np.sum(y == 0)} cats, {np.sum(y == 1)} dogs")

Dataset shape: (1000, 150528)
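
The flattening arithmetic can be checked on a placeholder array. This is a minimal sketch: the zero-filled image and the marked pixel are made up, standing in for an image returned by `cv2.imread`.

```python
import numpy as np

height, width, channels = 224, 224, 3

# Placeholder image filled with zeros; a real image would come from cv2.imread.
img = np.zeros((height, width, channels), dtype=np.uint8)
img[0, 0] = [10, 20, 30]           # mark the top-left pixel

flat = img.flatten()               # row-major copy, shape (150528,)
print(flat.shape[0] == height * width * channels)

# The first three entries are the channels of the top-left pixel.
print(flat[:3])

# Reshaping with the original shape recovers the image exactly.
restored = flat.reshape(height, width, channels)
print(np.array_equal(restored, img))
```

Flattening discards the 2D layout: neighbouring pixels in the image are no longer treated as neighbours, which is one reason raw-pixel clustering is a fairly blunt tool.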


## 2 - Standardization

---

> Standardization rescales each feature (pixel value) to zero mean and unit variance across the dataset.
> This prevents features with large variance from dominating PCA and the distance calculations in KMeans.

In [None]:
from sklearn.preprocessing import StandardScaler

# Create scaler and fit on data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Data standardized:")
print(f"  Mean (should be ~0): {X_scaled.mean():.6f}")
print(f"  Std Dev (should be ~1): {X_scaled.std():.6f}")
print(f"  Min value: {X_scaled.min():.2f}")
print(f"  Max value: {X_scaled.max():.2f}")
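
What the scaler computes can be written out by hand. The sketch below uses a small made-up matrix (6 samples, 4 features, with one feature held constant on purpose) and the population standard deviation, which is what `StandardScaler` uses; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 samples, 4 features on very different scales.
X_demo = rng.normal(loc=[0.0, 50.0, 200.0, 5.0],
                    scale=[1.0, 10.0, 40.0, 0.5],
                    size=(6, 4))
X_demo[:, 3] = 7.0   # a constant feature, like a pixel that never changes

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)                   # population std (ddof=0)
std_safe = np.where(std == 0, 1.0, std)    # avoid dividing by zero

X_std = (X_demo - mean) / std_safe

print("column means:", np.round(X_std.mean(axis=0), 6))
print("column stds: ", np.round(X_std.std(axis=0), 6))
```

Constant features end up as all zeros rather than unit variance. If the dataset contains such pixels, the overall standard deviation printed by the cell above can come out slightly below 1.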

## 3 - Dimensionality Reduction with PCA

---

> **Principal Component Analysis (PCA)** reduces 150,528 dimensions to 100 while preserving most of the variance.
> This makes clustering faster and helps avoid the "curse of dimensionality."
> 
> **How it works:**
> - Finds directions of maximum variance in the data
> - Projects data onto these principal components
> - Keeps top 100 components that capture the most information

In [None]:
from sklearn.decomposition import PCA

# Apply PCA with 100 components
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X_scaled)

# Calculate explained variance
total_variance = np.sum(pca.explained_variance_ratio_)

print(f"PCA Dimensionality Reduction:")
print(f"  Original dimensions: {X_scaled.shape[1]:,}")
print(f"  Reduced dimensions: {X_pca.shape[1]}")
print(f"  Reduction: {X_scaled.shape[1] / X_pca.shape[1]:.1f}x smaller")
print(f"  Total explained variance: {total_variance*100:.2f}%")
print(f"\nExplained variance by first 10 components:")
for i, var in enumerate(pca.explained_variance_ratio_[:10]):
    print(f"  Component {i+1}: {var*100:.2f}%")

Reduced shape: (1000, 100)
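
The "how it works" steps can be reproduced with a plain SVD. This is a sketch on made-up data (100 samples, 20 features, most variance concentrated in 3 directions), not a replacement for `sklearn.decomposition.PCA`, which also centers the data but may choose a different solver internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 20 features, most variance in 3 directions.
latent = rng.normal(size=(100, 3)) * np.array([5.0, 3.0, 2.0])
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(100, 20))

# 1. Center each feature.
X_centered = X_demo - X_demo.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance along each direction, as a fraction of the total.
explained_variance = S**2 / (X_demo.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the top-k directions.
k = 3
X_reduced = X_centered @ Vt[:k].T

print("reduced shape:", X_reduced.shape)
print("variance kept by top 3:", round(float(explained_ratio[:k].sum()), 4))
```

Here `explained_ratio` plays the role of `pca.explained_variance_ratio_`, with one difference: the ratios above are normalized over all available directions, so they sum to 1, whereas a PCA truncated to 100 components reports only the share captured by those 100.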


### Explained Variance Visualization

In [None]:
plt.figure(figsize=(12, 4))

# Plot 1: Individual component variance
plt.subplot(1, 2, 1)
plt.bar(range(1, len(pca.explained_variance_ratio_[:50]) + 1), 
        pca.explained_variance_ratio_[:50], alpha=0.7)
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Variance Explained by Each Component (First 50)")
plt.grid(True, alpha=0.3)

# Plot 2: Cumulative explained variance
plt.subplot(1, 2, 2)
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumsum_var) + 1), cumsum_var, 'b-', linewidth=2)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Cumulative Explained Variance")
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
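os.makedirs("./tensorflow/clustering/cat_dog", exist_ok=True)  # savefig fails if the folder is missing
plt.savefig("./tensorflow/clustering/cat_dog/pca_variance.png", dpi=150, bbox_inches="tight")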
plt.show()
plt.close()
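
The cumulative curve can also be read off numerically. The helper below is a hypothetical sketch (`components_for_variance` is not part of the notebook, and the ratios are made up): it returns how many leading components are needed to reach a target share of variance, or `None` if the available components never reach it.

```python
import numpy as np

def components_for_variance(ratios, target):
    """Smallest number of leading components whose ratios sum to >= target."""
    cumulative = np.cumsum(ratios)
    if cumulative[-1] < target:
        return None                      # target not reachable with these components
    # Index of the first cumulative value that reaches the target, plus one.
    return int(np.searchsorted(cumulative, target) + 1)

# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.03, 0.01])

print(components_for_variance(ratios, 0.95))       # enough components available
print(components_for_variance(ratios[:4], 0.95))   # first four alone fall short
```

The `None` case matters in practice: if the retained components explain less than 95% of the variance in total, the cumulative curve in the plot never crosses the dashed 95% line.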

## 4 - KMeans Clustering with Elbow Method

---

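> **KMeans** partitions data into k clusters by minimizing the within-cluster sum of squared distances to each cluster's centroid (the inertia).
> The **Elbow Method** and silhouette scores are plotted for k = 2 to 10 as a sanity check; the final model uses k=2 because the dataset contains two classes.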
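
To make the objective concrete, here is a bare-bones version of the classic Lloyd iteration in NumPy. It is a teaching sketch under simple assumptions (random initialization from data points, a fixed number of iterations, two synthetic 2D blobs); scikit-learn's `KMeans` adds k-means++ initialization, multiple restarts via `n_init`, and convergence checks. The function name `lloyd_kmeans` is made up for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if empty
                centers[j] = X[labels == j].mean(axis=0)
    # Final assignment and inertia for the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centers, inertia

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(5.0, 0.5, size=(50, 2))])

labels, centers, inertia = lloyd_kmeans(X_demo, k=2)
print(np.bincount(labels), round(inertia, 2))
```

The returned `inertia` is the same quantity that the elbow plot tracks through `kmeans_temp.inertia_`.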

In [None]:
from sklearn.cluster import KMeans

# Elbow method: try different numbers of clusters
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(X_pca)
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X_pca, kmeans_temp.labels_))

# Plot elbow curve
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=6)
plt.axvline(x=2, color='r', linestyle='--', label='k=2 (cats vs dogs)')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-cluster sum of squares)")
plt.title("Elbow Method - Finding Optimal k")
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, 'go-', linewidth=2, markersize=6)
plt.axvline(x=2, color='r', linestyle='--', label='k=2')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score by k")
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
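os.makedirs("./tensorflow/clustering/cat_dog", exist_ok=True)  # savefig fails if the folder is missing
plt.savefig("./tensorflow/clustering/cat_dog/elbow_method.png", dpi=150, bbox_inches="tight")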
plt.show()
plt.close()

# Apply KMeans with k=2
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_pca)

print(f"KMeans clustering complete:")
print(f"  Cluster 0 size: {np.sum(clusters == 0)}")
print(f"  Cluster 1 size: {np.sum(clusters == 1)}")
print(f"  Silhouette score: {silhouette_score(X_pca, clusters):.4f}")
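
For intuition about the silhouette score printed above, the coefficient can be computed by hand on a toy example. This sketch assumes Euclidean distances and uses four made-up points on a line; `silhouette_by_hand` is an illustrative name, not a scikit-learn function.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                      # singleton clusters score 0
        # a: mean distance to the other points in the same cluster.
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight, distant groups on a line (hypothetical data).
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_by_hand(X_demo, [0, 0, 1, 1])   # sensible grouping
bad = silhouette_by_hand(X_demo, [0, 1, 0, 1])    # groups mixed up
print(round(good, 4), round(bad, 4))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points sit closer to another cluster than to their own.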

## 5 - Evaluation and Performance Metrics

---

> Evaluate clustering quality using multiple metrics:
> - **Accuracy**: How well clusters match true labels
> - **Silhouette Score**: How well-separated are the clusters
> - **Confusion Matrix**: Detailed breakdown of cluster assignments

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
from scipy.optimize import linear_sum_assignment

# Note: KMeans labels are arbitrary (0 could map to cats or dogs)
# We need to find the best mapping using Hungarian algorithm

# Create confusion matrix
cm = confusion_matrix(y, clusters)

# Find best assignment using Hungarian algorithm
row_ind, col_ind = linear_sum_assignment(-cm)
best_mapping = {col_ind[i]: row_ind[i] for i in range(len(col_ind))}

# Remap clusters to match true labels optimally
clusters_remapped = np.array([best_mapping[c] for c in clusters])

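# Accuracy before and after remapping. The remapped labels come from the best
# one-to-one cluster-to-label assignment, so acc_remapped is never below acc_original.
acc_original = accuracy_score(y, clusters)
acc_remapped = accuracy_score(y, clusters_remapped)
accuracy = max(acc_original, acc_remapped)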

print(f"Clustering Performance:")
print(f"  Raw Accuracy: {acc_original*100:.2f}%")
print(f"  Best Accuracy: {accuracy*100:.2f}%")
print(f"  Silhouette Score: {silhouette_score(X_pca, clusters):.4f}")

# Confusion matrix visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Original confusion matrix
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axes[0],
            xticklabels=["Cluster 0", "Cluster 1"],
            yticklabels=["Cats (0)", "Dogs (1)"])
axes[0].set_title("Confusion Matrix (Original Labeling)")
axes[0].set_ylabel("True Label")
axes[0].set_xlabel("Predicted Cluster")

# Remapped confusion matrix
cm_remapped = confusion_matrix(y, clusters_remapped)
sns.heatmap(cm_remapped, annot=True, fmt="d", cmap="Greens", ax=axes[1],
            xticklabels=["Cats", "Dogs"],
            yticklabels=["Cats", "Dogs"])
axes[1].set_title("Confusion Matrix (After Optimal Remapping)")
axes[1].set_ylabel("True Label")
axes[1].set_xlabel("Predicted Cluster")

plt.tight_layout()
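os.makedirs("./tensorflow/clustering/cat_dog", exist_ok=True)  # savefig fails if the folder is missing
plt.savefig("./tensorflow/clustering/cat_dog/confusion_matrices.png", dpi=150, bbox_inches="tight")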
plt.show()
plt.close()

print(f"\nCluster Composition (after remapping):")
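# Rows of cm_remapped are true labels, columns are remapped clusters,
# so the members of one cluster are read down a column.
print(f"  Cluster 0 (mapped to cats): {cm_remapped[0,0]} cats (correct), {cm_remapped[1,0]} dogs (misassigned)")
print(f"  Cluster 1 (mapped to dogs): {cm_remapped[1,1]} dogs (correct), {cm_remapped[0,1]} cats (misassigned)")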

Clustering accuracy: 0.502
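
The remapping step can be illustrated without SciPy. For a handful of clusters, trying every permutation of labels gives the same answer as the Hungarian algorithm, so the sketch below uses brute force as a stand-in; the label arrays are made up.

```python
from itertools import permutations

import numpy as np

# Hypothetical true labels and raw cluster IDs where the IDs happen to be flipped.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
cluster_ids = np.array([1, 1, 1, 0, 0, 0, 1, 1])

n_classes = 2
# Contingency table: rows are true labels, columns are cluster IDs.
table = np.zeros((n_classes, n_classes), dtype=int)
for t, c in zip(y_true, cluster_ids):
    table[t, c] += 1

# Try every assignment of cluster IDs to labels; perm[c] is the label for cluster c.
best_perm = max(permutations(range(n_classes)),
                key=lambda perm: sum(table[perm[c], c] for c in range(n_classes)))

remapped = np.array([best_perm[c] for c in cluster_ids])
raw_acc = float(np.mean(cluster_ids == y_true))
best_acc = float(np.mean(remapped == y_true))
print(best_perm, raw_acc, best_acc)   # (1, 0) 0.125 0.875
```

With two clusters and two classes, flipping the mapping turns an accuracy of a into 1 - a, so the remapped accuracy can never drop below 50%. A value close to 0.5, such as the 0.502 shown above, therefore means the clusters carry almost no information about cat versus dog.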


In [None]:
print("=== PCA Variance Analysis ===")
print(f"Explained variance ratio (first 10 components):")
for i, var in enumerate(pca.explained_variance_ratio_[:10]):
    cumsum = np.sum(pca.explained_variance_ratio_[:i+1])
    print(f"  Component {i+1}: {var*100:.2f}% (cumulative: {cumsum*100:.2f}%)")

total_explained = np.sum(pca.explained_variance_ratio_)
print(f"\nTotal explained variance (100 components): {total_explained*100:.2f}%")
print(f"Variance lost: {(1 - total_explained)*100:.2f}%")

Explained variance ratio (first 10):
[0.1824243  0.10601848 0.07280924 0.0594309  0.02874591 0.0271235
 0.02327433 0.0201696  0.01796689 0.0162103 ]
Total explained variance (100 components): 0.8296756869607148


## 6 - Cluster Visualization with t-SNE

---

> **t-SNE (t-Distributed Stochastic Neighbor Embedding)** visualizes high-dimensional data in 2D.
> This helps us see if clusters are well-separated and how images group together.

In [None]:
from sklearn.manifold import TSNE

# Note: t-SNE is computationally expensive. For large datasets, consider using a sample.
print("Computing t-SNE (this may take a minute)...")

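# The iteration count is left at its default (1000): the `n_iter` argument was
# renamed to `max_iter` in scikit-learn 1.5, so passing it by name is version-dependent.
tsne = TSNE(n_components=2, random_state=42, perplexity=30)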
X_tsne = tsne.fit_transform(X_pca)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: True labels
scatter1 = axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="coolwarm", 
                           alpha=0.6, s=30, edgecolors='black', linewidth=0.5)
axes[0].set_title("t-SNE Visualization - True Labels")
axes[0].set_xlabel("t-SNE Dimension 1")
axes[0].set_ylabel("t-SNE Dimension 2")
cbar1 = plt.colorbar(scatter1, ax=axes[0])
cbar1.set_label("0=Cats, 1=Dogs")

# Plot 2: Cluster assignments
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters_remapped, 
                           cmap="viridis", alpha=0.6, s=30, edgecolors='black', linewidth=0.5)
axes[1].set_title("t-SNE Visualization - Cluster Assignments")
axes[1].set_xlabel("t-SNE Dimension 1")
axes[1].set_ylabel("t-SNE Dimension 2")
cbar2 = plt.colorbar(scatter2, ax=axes[1])
cbar2.set_label("Cluster")

plt.tight_layout()
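os.makedirs("./tensorflow/clustering/cat_dog", exist_ok=True)  # savefig fails if the folder is missing
plt.savefig("./tensorflow/clustering/cat_dog/tsne_visualization.png", dpi=150, bbox_inches="tight")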
plt.show()
plt.close()

print("t-SNE visualization complete!")
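
If t-SNE is too slow on the full dataset, one option hinted at in the comment above is to embed a random subset. The sketch below shows one way to draw such a subset; `subsample` and the demo arrays are hypothetical stand-ins for `X_pca`, `y`, and the cluster assignments.

```python
import numpy as np

def subsample(n_samples, max_points, seed=42):
    """Return sorted indices of at most `max_points` rows, drawn without replacement."""
    rng = np.random.default_rng(seed)
    if n_samples <= max_points:
        return np.arange(n_samples)
    return np.sort(rng.choice(n_samples, size=max_points, replace=False))

# Stand-ins for X_pca and y (hypothetical shapes and values).
X_demo = np.random.default_rng(0).normal(size=(5000, 100))
y_demo = np.arange(5000) % 2

idx = subsample(len(X_demo), max_points=1000)
X_small, y_small = X_demo[idx], y_demo[idx]
print(X_small.shape, y_small.shape)
```

The same index array has to be applied to the features, the true labels, and the cluster assignments, otherwise the colors in the two scatter plots no longer refer to the same images.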

## 7 - Key Takeaways and Interpretations

---

### What did the model learn?

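1. **Unsupervised Pattern Discovery**
   - KMeans always returns 2 clusters when asked for k=2; whether they correspond to cats and dogs has to be checked against the labels
   - In the recorded run the clustering accuracy was 0.502, which is chance level for two classes, so clustering on raw pixels did not separate cats from dogs

2. **Dimensionality Reduction Power**
   - Reduced 150,528 dimensions to just 100 (about 1,500x smaller)
   - Retained about 83% of the variance in the recorded run (total explained variance 0.8297)
   - Made KMeans much cheaper to compute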

3. **Cluster Quality**
   - Use silhouette scores to assess cluster separation
   - Compare accuracy before/after optimal label remapping
   - Examine confusion matrices for misclassifications

4. **When This Works Best**
   - Data with natural groupings (like cats vs dogs)
   - Sufficient feature differences between classes
   - Enough data to establish patterns

### Limitations

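- KMeans assumes compact, roughly spherical clusters, and on raw pixels the distances can be dominated by background, lighting, and pose rather than by the animal
- Requires choosing k in advance (though the elbow method and silhouette scores help)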
- Labels are arbitrary (Cluster 0 might be dogs or cats)
- Sensitive to initialization (use n_init for stability)

## 8 - Testing New Images

---

> Test the clustering model on new images to assign them to clusters.

In [None]:
def predict_cluster_for_image(img_path):
    """Predict cluster assignment for a new image path and return a dict result."""
    if not os.path.exists(img_path):
        return None

    # Read and process image
    img = cv2.imread(img_path)
    if img is None:
        return None

    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))

    # Flatten and standardize
    img_flat = img.flatten().reshape(1, -1)
    img_scaled = scaler.transform(img_flat)

    # Reduce dimensionality
    img_pca = pca.transform(img_scaled)

    # Find distances to cluster centers
    distances = np.linalg.norm(img_pca - kmeans.cluster_centers_, axis=1)
    cluster_distances = {best_mapping[i]: float(distances[i]) for i in range(len(distances))}

    return {
        'cluster': int(best_mapping[int(np.argmin(distances))]),
        'cluster_label': 'Cat' if best_mapping[int(np.argmin(distances))] == 0 else 'Dog',
        'distances': cluster_distances
    }

def run_interactive_prompt():
    """Run an interactive prompt for predicting clusters. Call manually when desired."""
    print("\nTesting clustering on new images (interactive mode):")
    print("="*50)

    while True:
        img_path = input("Enter image path (or 'quit' to exit): ").strip()

        if img_path.lower() == 'quit':
            break

        result = predict_cluster_for_image(img_path)

        if result is None:
            print(f"Could not predict for: {img_path}\n")
            continue

        print(f"\nPredicted Cluster: {result['cluster']}")
        print(f"Predicted Type: {result['cluster_label']}")
        print(f"Distance to Cluster 0: {result['distances'].get(0, float('nan')):.4f}")
        print(f"Distance to Cluster 1: {result['distances'].get(1, float('nan')):.4f}")
        print()

# Note: Do NOT call run_interactive_prompt() automatically in this notebook. To use interactive mode,
# run `run_interactive_prompt()` in a cell when you want to interact with the model. For non-interactive
# uses, call `predict_cluster_for_image(path)` directly (works for scripts and batch processing).
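
For batch use, the distance computation in `predict_cluster_for_image` extends to many points at once through broadcasting. The sketch below uses made-up cluster centers and points in a 3-dimensional space purely for illustration; in the notebook the centers would be `kmeans.cluster_centers_` and the points would already be scaled and PCA-transformed.

```python
import numpy as np

# Hypothetical cluster centers in a 3-dimensional PCA space.
centers = np.array([[0.0, 0.0, 0.0],
                    [4.0, 4.0, 4.0]])

# Three new points, assumed to be scaled and projected like the training data.
points = np.array([[0.5, -0.2, 0.1],
                   [3.8, 4.1, 4.3],
                   [2.5, 2.5, 2.5]])

# Broadcasting yields a (n_points, n_centers) distance matrix.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print(nearest)   # [0 1 1]
```

With a fitted model, `kmeans.predict(X_new)` performs the same nearest-centroid assignment; the raw cluster IDs would still need to go through `best_mapping` to be read as cat or dog.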