# Week 4, Day 2: Hierarchical Clustering

## Learning Objectives
- Understand hierarchical clustering concepts
- Learn agglomerative and divisive clustering
- Master dendrogram interpretation
- Practice implementing hierarchical clustering

## Topics Covered
1. Types of Hierarchical Clustering
2. Linkage Methods
3. Dendrogram Analysis
4. Applications

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

## 1. Basic Hierarchical Clustering Example

In [None]:
def basic_hierarchical_clustering():
    # Generate synthetic data
    np.random.seed(42)
    n_samples = 150
    
    # Create three distinct clusters
    cluster1 = np.random.normal(0, 1, (n_samples, 2))
    cluster2 = np.random.normal(5, 1, (n_samples, 2))
    cluster3 = np.random.normal(2.5, 1, (n_samples, 2)) + np.array([0, 5])
    
    # Combine clusters
    X = np.vstack([cluster1, cluster2, cluster3])
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Create linkage matrix
    linkage_matrix = linkage(X_scaled, method='ward')
    
    # Plot dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(linkage_matrix)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    plt.show()
    
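    # Apply clustering (Ward linkage is the scikit-learn default; it is set
    # explicitly here so the labels match the Ward dendrogram above)
    clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
    clusters = clustering.fit_predict(X_scaled)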
    
    # Visualize clusters
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
    plt.title('Hierarchical Clustering Results')
    plt.colorbar(scatter)
    plt.show()
    
    # Print cluster information
    print("Cluster sizes:")
    print(pd.Series(clusters).value_counts())

basic_hierarchical_clustering()
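
A dendrogram can also be turned into flat cluster labels directly from the linkage matrix, without refitting a model. The sketch below is one way to do this with SciPy's `fcluster`, either by requesting a number of clusters or by cutting the tree at a height. The data set, the variable names, and the choice of cut height (the midpoint between the last two merges that matter) are illustrative assumptions, not part of the example above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small synthetic data set: three well-separated 2-D blobs (illustrative only)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0, 0.5, (20, 2)),
    rng.normal(5, 0.5, (20, 2)),
    rng.normal(10, 0.5, (20, 2)),
])

Z = linkage(X_demo, method='ward')

# Option 1: ask for at most three flat clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# Option 2: cut the tree at a height. Column 2 of Z holds the merge heights;
# Z[-3, 2] is the merge that leaves three clusters and Z[-2, 2] the merge
# that leaves two, so any height between them yields three clusters.
cut_height = (Z[-3, 2] + Z[-2, 2]) / 2
labels_by_height = fcluster(Z, t=cut_height, criterion='distance')

print("Clusters (by count): ", np.unique(labels_by_count))
print("Clusters (by height):", np.unique(labels_by_height))
```

A large gap between consecutive merge heights is a common visual cue for where to cut, but it is a heuristic rather than a guarantee.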

## 2. Comparing Linkage Methods

In [None]:
def compare_linkage_methods():
    # Generate data
    np.random.seed(42)
    n_samples = 100
    
    # Create two clusters
    X = np.vstack([
        np.random.normal(0, 1, (n_samples, 2)),
        np.random.normal(4, 1, (n_samples, 2))
    ])
    
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Different linkage methods
    methods = ['single', 'complete', 'average', 'ward']
    
    # Plot dendrograms
    plt.figure(figsize=(20, 5))
    
    for i, method in enumerate(methods, 1):
        plt.subplot(1, 4, i)
        linkage_matrix = linkage(X_scaled, method=method)
        dendrogram(linkage_matrix)
        plt.title(f'{method.capitalize()} Linkage')
        plt.xlabel('Sample Index')
        if i == 1:
            plt.ylabel('Distance')
    
    plt.tight_layout()
    plt.show()
    
    # Compare clustering results
    plt.figure(figsize=(20, 5))
    
    for i, method in enumerate(methods, 1):
        plt.subplot(1, 4, i)
        clustering = AgglomerativeClustering(n_clusters=2, linkage=method)
        clusters = clustering.fit_predict(X_scaled)
        plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
        plt.title(f'{method.capitalize()} Linkage Clusters')
    
    plt.tight_layout()
    plt.show()

compare_linkage_methods()
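
Besides comparing the plots by eye, the linkage methods can be compared numerically with the cophenetic correlation coefficient, which measures how closely the dendrogram's merge heights track the original pairwise distances. The following is a minimal sketch using SciPy's `cophenet` and `pdist`; the data set and the `coph_scores` dictionary are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two Gaussian blobs, similar in spirit to the data above (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 1, (50, 2)),
    rng.normal(4, 1, (50, 2)),
])

# Condensed vector of original pairwise Euclidean distances
pairwise = pdist(X_demo)

coph_scores = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X_demo, method=method)
    # cophenet returns the correlation coefficient and the cophenetic distances
    c, _ = cophenet(Z, pairwise)
    coph_scores[method] = c
    print(f"{method:>8}: cophenetic correlation = {c:.3f}")
```

Values closer to 1 indicate that the hierarchy preserves the original distances more faithfully. A high score does not by itself mean the flat clusters are the most useful ones for a given task.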

## 3. Document Clustering Example

In [None]:
def document_clustering_example():
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Sample documents
    documents = [
        "Machine learning is fascinating",
        "Deep learning revolutionizes AI",
        "Python programming is fun",
        "Data science uses machine learning",
        "Programming in Python is easy",
        "AI transforms technology",
        "Learning Python programming",
        "Data analysis with machine learning"
    ]
    
    # Convert text to TF-IDF features
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(documents)
    
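    # Create linkage matrix. Ward linkage works on Euclidean distances;
    # TfidfVectorizer L2-normalizes each row by default, so Euclidean
    # distance here grows as cosine similarity falls.
    linkage_matrix = linkage(X.toarray(), method='ward')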
    
    # Plot dendrogram
    plt.figure(figsize=(12, 8))
    dendrogram(linkage_matrix)
    plt.title('Document Clustering Dendrogram')
    plt.xlabel('Document Index')
    plt.ylabel('Distance')
    plt.show()
    
    # Apply clustering (Ward linkage, matching the dendrogram above)
    clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
    clusters = clustering.fit_predict(X.toarray())
    
    # Print each document with its assigned cluster label
    print("\nDocument Clusters:")
    for doc, cluster in zip(documents, clusters):
        print(f"Cluster {cluster}: {doc}")

document_clustering_example()
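
Text vectors are often compared with cosine distance rather than Euclidean distance. Ward linkage is defined for Euclidean distances, so the sketch below swaps in average linkage on a precomputed cosine distance matrix. This is an alternative to the approach above, not a replacement for it; the four-document subset and the variable names are chosen only for illustration.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# A subset of the documents from the example above
docs = [
    "Python programming is fun",
    "Programming in Python is easy",
    "Data science uses machine learning",
    "Data analysis with machine learning",
]

# Dense TF-IDF matrix (fine for a handful of short documents)
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs).toarray()

# Condensed cosine distance vector, passed to linkage in place of raw features
cosine_dist = pdist(tfidf, metric='cosine')
Z = linkage(cosine_dist, method='average')

# Cut the tree into at most two flat clusters
doc_labels = fcluster(Z, t=2, criterion='maxclust')

for label, doc in zip(doc_labels, docs):
    print(f"Cluster {label}: {doc}")
```

Cosine distance is undefined for an all-zero vector, so a document made up entirely of stop words would need to be filtered out before this step.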

## Practical Exercises

In [None]:
# Exercise 1: Customer Segmentation

def customer_segmentation_exercise():
    # Generate customer data
    np.random.seed(42)
    n_customers = 200
    
    # Create features
    age = np.random.normal(35, 10, n_customers)
    income = np.random.normal(50000, 20000, n_customers)
    spending_score = np.random.normal(50, 25, n_customers)
    
    # Create DataFrame
    customer_data = pd.DataFrame({
        'Age': age,
        'Income': income,
        'SpendingScore': spending_score
    })
    
    print("Sample of customer data:")
    print(customer_data.head())
    
    # Task: Segment customers using hierarchical clustering
    # 1. Preprocess the data
    # 2. Create and analyze dendrogram
    # 3. Apply clustering
    # 4. Visualize and interpret results
    
    # Your code here

customer_segmentation_exercise()

In [None]:
# Exercise 2: Image Segmentation

def image_segmentation_exercise():
    from sklearn.datasets import load_sample_images
    
    # Load sample image
    dataset = load_sample_images()
    image = dataset.images[0]
    
    # Display original image
    plt.figure(figsize=(5, 5))
    plt.imshow(image)
    plt.title('Original Image')
    plt.axis('off')
    plt.show()
    
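    # Task: Segment image using hierarchical clustering
    # 1. Prepare image data (downsample first, e.g. image[::8, ::8]; the
    #    cost of hierarchical clustering grows quickly with the pixel count)
    # 2. Apply clustering
    # 3. Reconstruct segmented image
    # 4. Compare results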
    
    # Your code here

image_segmentation_exercise()
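
As a hint for Exercise 2, the sketch below shows one possible workflow on a tiny synthetic image rather than the sample photo, so the exercise itself is left open. It assumes Ward linkage with a pixel-grid connectivity constraint built by scikit-learn's `grid_to_graph`, which restricts merges to neighbouring pixels and keeps the computation tractable. The toy image, its size, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Toy RGB image: left half reddish, right half bluish, plus slight noise
rng = np.random.default_rng(0)
h, w = 20, 30
toy_image = np.zeros((h, w, 3))
toy_image[:, : w // 2] = [0.9, 0.1, 0.1]
toy_image[:, w // 2 :] = [0.1, 0.1, 0.9]
toy_image += rng.normal(0, 0.02, toy_image.shape)

# 1. Prepare data: one row per pixel, one column per colour channel
pixels = toy_image.reshape(-1, 3)

# Only pixels adjacent on the h x w grid are allowed to merge
connectivity = grid_to_graph(h, w)

# 2. Apply clustering
model = AgglomerativeClustering(
    n_clusters=2, linkage='ward', connectivity=connectivity
)
pixel_labels = model.fit_predict(pixels)

# 3. Reconstruct: paint every pixel with the mean colour of its segment
segment_means = np.array(
    [pixels[pixel_labels == k].mean(axis=0) for k in range(2)]
)
segmented = segment_means[pixel_labels].reshape(h, w, 3)

print("Segment sizes:", np.bincount(pixel_labels))
```

For a real photo, downsampling before clustering and choosing the number of segments by inspecting the result are both reasonable starting points.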

## MCQ Quiz

1. What is the main advantage of hierarchical clustering over K-means?
   - a) Faster computation
   - b) No need to fix the number of clusters in advance
   - c) Better with categorical data
   - d) Less memory usage

2. What does a dendrogram show?
   - a) Feature importance
   - b) Cluster hierarchy
   - c) Data distribution
   - d) Error rate

3. Which linkage method minimizes within-cluster variance?
   - a) Single
   - b) Complete
   - c) Average
   - d) Ward

4. What time complexity do optimized agglomerative algorithms such as SLINK (single linkage) achieve, compared with O(n³) for the naive algorithm?
   - a) O(n)
   - b) O(n log n)
   - c) O(n²)
   - d) O(n³)

5. What determines the height in a dendrogram?
   - a) Number of clusters
   - b) Distance between clusters
   - c) Sample size
   - d) Feature count

6. Which is NOT a type of hierarchical clustering?
   - a) Agglomerative
   - b) Divisive
   - c) Iterative
   - d) Bottom-up

7. What is cophenetic correlation used for?
   - a) Feature selection
   - b) Cluster validation
   - c) Distance calculation
   - d) Data preprocessing

8. Which linkage method is most prone to chaining, where a few noisy points or outliers bridge otherwise separate clusters?
   - a) Single
   - b) Complete
   - c) Average
   - d) Ward

9. What is the main disadvantage of hierarchical clustering?
   - a) Difficult to interpret
   - b) High computational cost
   - c) Poor performance
   - d) Limited applications

10. For which kind of cluster shape is single linkage best suited?
    - a) Compact clusters
    - b) Elongated clusters
    - c) Spherical clusters
    - d) Dense clusters

Answers: 1-b, 2-b, 3-d, 4-c, 5-b, 6-c, 7-b, 8-a, 9-b, 10-b
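
The link between Ward linkage and within-cluster variance (question 3) and the meaning of dendrogram heights (question 5) can be checked numerically. The sketch below assumes SciPy's convention for Ward heights, under which half the squared merge height equals the increase in within-cluster sum of squares caused by that merge. Summed over all merges, this should recover the total sum of squares around the global mean. The data and variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))

Z = linkage(X_demo, method='ward')

# Column 2 of Z holds the dendrogram height of each merge.
# Under SciPy's Ward convention, height**2 / 2 is the increase in
# within-cluster sum of squares caused by that merge.
merge_costs = Z[:, 2] ** 2 / 2

# Total sum of squares around the global mean: the within-cluster sum of
# squares once everything has been merged into a single cluster
total_ss = ((X_demo - X_demo.mean(axis=0)) ** 2).sum()

print("Sum of merge costs:  ", merge_costs.sum())
print("Total sum of squares:", total_ss)
```

At each step Ward picks the merge with the smallest such increase, which is why it tends to produce compact, similarly sized clusters.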