# Module 15: Machine Learning Clustering and Dimensionality Reduction - Student Lab

## Titanic Dataset: Unsupervised Learning Challenge

Welcome to the hands-on lab for unsupervised machine learning! In this lab, you'll discover hidden patterns in the Titanic dataset using clustering and dimensionality reduction techniques.

### Learning Objectives:
- Practice data preprocessing for unsupervised learning
- Implement and compare different clustering algorithms
- Apply dimensionality reduction techniques
- Evaluate clustering results using appropriate metrics
- Interpret clusters and understand their real-world meaning
- Use RAPIDS for GPU acceleration (if available)

### Instructions:
1. Follow the TODO comments to complete each section
2. Run all cells in order
3. Answer the questions in the markdown cells
4. Experiment with different parameters and techniques

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import warnings
warnings.filterwarnings('ignore')

# Try to import RAPIDS libraries
try:
    import cuml
    import cudf
    RAPIDS_AVAILABLE = True
    print("RAPIDS libraries available for GPU acceleration")
except ImportError:
    RAPIDS_AVAILABLE = False
    print("RAPIDS libraries not available, using CPU implementations")

# Try to import UMAP
try:
    import umap
    UMAP_AVAILABLE = True
    print("UMAP library available")
except ImportError:
    UMAP_AVAILABLE = False
    print("UMAP library not available")

print("\n=== Module 15: Unsupervised Learning Lab ===")
print("Titanic Dataset Clustering and Dimensionality Reduction Challenge")

## Part 1: Data Loading and Preprocessing

**TODO 1.1:** Load the Titanic dataset and examine its structure

In [None]:
# TODO 1.1: Load the Titanic dataset
# Hint: Use pd.read_csv() with the URL provided
titanic = # YOUR CODE HERE

print(f"Dataset shape: {titanic.shape}")
print("\nFirst 5 rows:")
display(titanic.head())

# Basic statistics
print("\nBasic statistics:")
display(titanic.describe())

# Data types and missing values
# Note: info() prints its report directly and returns None, so it is not wrapped in print()
print("\nData types and missing values:")
titanic.info()
print("\nMissing values count:")
print(titanic.isnull().sum())

**Question 1.1:** What are the main challenges with this dataset for unsupervised learning? How does it differ from supervised learning preprocessing?

**TODO 1.2:** Implement data preprocessing for unsupervised learning
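
The Age and Title hints in the cell below rely on two pandas patterns that are easy to get slightly wrong. Here is a minimal sketch on a made-up toy frame (the rows and names are hypothetical, not taken from the Titanic file): a group-wise median fill, and a regular expression that pulls the title out of a `Surname, Title. Given names` string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Pclass, Sex) groups, each with one missing Age.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
    "Name": ["Doe, Mr. John", "Roe, Mr. Sam", "Poe, Dr. Alan",
             "Lee, Miss. Ann", "Kim, Mrs. Jane", "Ray, Miss. Eve"],
})

# transform("median") returns one value per row, aligned with the original index,
# so the result can be passed straight to fillna().
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Capture the text between the comma and the first period.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(toy[["Pclass", "Sex", "Age", "Title"]])
```

Filling with a group median rather than the global median keeps the imputed ages consistent with the passenger's class and sex, which matters later because clustering treats every filled value as real data.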

In [None]:
# TODO 1.2: Implement data preprocessing
def preprocess_unsupervised(df):
    # Create a copy
    df_clean = df.copy()
    
    # Handle missing Age values
    # Hint: Use median age by Pclass and Sex
    # YOUR CODE HERE
    
    # Handle missing Embarked values
    # Hint: Fill with mode
    # YOUR CODE HERE
    
    # Handle missing Fare values (if any)
    # Hint: Use median fare by Pclass
    # YOUR CODE HERE
    
    # Create Has_Cabin feature from Cabin column
    # YOUR CODE HERE
    
    # Extract Title from Name
    # YOUR CODE HERE
    
    # Create Family_Size feature
    # YOUR CODE HERE
    
    # Encode categorical variables
    # Hint: Use LabelEncoder for Sex, Embarked, Title
    # YOUR CODE HERE
    
    # Keep PassengerId, Survived, and Name for later analysis
    passenger_info = df_clean[['PassengerId', 'Survived', 'Name']].copy()
    
    # Drop unnecessary columns for clustering
    # YOUR CODE HERE
    
    return df_clean, passenger_info

# Apply preprocessing
titanic_clean, passenger_info = preprocess_unsupervised(titanic)
print("Preprocessed data shape:", titanic_clean.shape)
print("\nRemaining missing values:")
print(titanic_clean.isnull().sum())

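# Scale the features
# Hint: Use StandardScaler, then wrap the result in a pd.DataFrame
# (with titanic_clean's columns and index) because the code below calls .describe()
# YOUR CODE HERE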

print("\nScaled data shape:", titanic_scaled.shape)
print("\nScaled data statistics:")
display(titanic_scaled.describe().round(3))

**Question 1.2:** Why do we need to scale features for clustering? What happens if we don't scale them?
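
To make the question concrete, here is a small sketch with three hypothetical passengers described only by age and fare. It compares Euclidean distances before and after `StandardScaler`; the numbers are invented for illustration and are not Titanic values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical passengers: columns are [age in years, fare].
X = np.array([
    [20.0, 10.0],
    [60.0, 12.0],   # very different age, almost the same fare as row 0
    [21.0, 200.0],  # almost the same age, very different fare
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_age_gap = dist(X[0], X[1])
raw_fare_gap = dist(X[0], X[2])

X_scaled = StandardScaler().fit_transform(X)
scaled_age_gap = dist(X_scaled[0], X_scaled[1])
scaled_fare_gap = dist(X_scaled[0], X_scaled[2])

print(f"Raw:    age gap = {raw_age_gap:.2f}, fare gap = {raw_fare_gap:.2f}")
print(f"Scaled: age gap = {scaled_age_gap:.2f}, fare gap = {scaled_fare_gap:.2f}")
```

On the raw values the fare difference dominates the distance simply because fares span a wider numeric range. After scaling, a large age difference and a large fare difference count about equally, which is what distance-based algorithms such as K-means need.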

## Part 2: K-means Clustering

**TODO 2.1:** Determine the optimal number of clusters

In [None]:
# TODO 2.1: Determine optimal number of clusters
# Use Elbow method and Silhouette analysis
# Hint: Try k from 2 to 10
inertias = []
silhouette_scores = []
K_range = # YOUR CODE HERE

for k in K_range:
    if RAPIDS_AVAILABLE:
        # Use RAPIDS cuML
        kmeans = # YOUR CODE HERE
        kmeans.fit(titanic_scaled)
        inertias.append(kmeans.inertia_)
        labels = kmeans.labels_
    else:
        # Use scikit-learn
        kmeans = # YOUR CODE HERE
        kmeans.fit(titanic_scaled)
        inertias.append(kmeans.inertia_)
        labels = kmeans.labels_
    
    # Calculate silhouette score
    sil_score = silhouette_score(titanic_scaled, labels)
    silhouette_scores.append(sil_score)

# Plot Elbow method and Silhouette scores
# YOUR CODE HERE

plt.tight_layout()
plt.show()

# Print optimal k suggestions
optimal_k_elbow = # YOUR CHOICE BASED ON PLOT
optimal_k_silhouette = K_range[np.argmax(silhouette_scores)]
print(f"\nSuggested k from Elbow method: {optimal_k_elbow}")
print(f"Suggested k from Silhouette score: {optimal_k_silhouette}")
print(f"Best silhouette score: {max(silhouette_scores):.3f}")

**Question 2.1:** Based on the plots, what is the optimal number of clusters? How do the Elbow method and Silhouette score agree or disagree?
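
If the two criteria are hard to read on the Titanic data, it can help to see how they behave when the answer is known. The sketch below uses synthetic blobs with four well-separated centers (an assumption chosen for illustration, not a property of the lab data) and records inertia and silhouette for each k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four clearly separated groups (illustrative only).
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

k_values = list(range(2, 9))
inertias, sil_scores = [], []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)                          # within-cluster sum of squares
    sil_scores.append(silhouette_score(X, model.labels_))    # ranges from -1 to 1

best_k = k_values[int(np.argmax(sil_scores))]
for k, inertia, sil in zip(k_values, inertias, sil_scores):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
print("k with the highest silhouette:", best_k)
```

Inertia keeps falling as k grows, so the elbow is the point where the drop flattens out, while silhouette peaks at a specific k. On real data such as Titanic the two often disagree, and the elbow is usually less sharp than in this toy case.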

**TODO 2.2:** Perform K-means clustering and analyze results

In [None]:
# TODO 2.2: Perform K-means clustering
# Choose your optimal k
optimal_k = # YOUR CHOICE

if RAPIDS_AVAILABLE:
    kmeans_final = # YOUR CODE HERE
    kmeans_labels = kmeans_final.fit_predict(titanic_scaled)
else:
    kmeans_final = # YOUR CODE HERE
    kmeans_labels = kmeans_final.fit_predict(titanic_scaled)

# Add cluster labels to the data
# YOUR CODE HERE

print(f"K-means clustering completed with {optimal_k} clusters")
print("\nCluster distribution:")
print(titanic_with_clusters['Kmeans_Cluster'].value_counts().sort_index())

# Analyze clusters
# YOUR CODE HERE

print("\nCluster characteristics:")
display(cluster_analysis)

**Question 2.2:** What do the clusters represent? Can you describe the characteristics of each cluster in plain English?
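
One common way to describe clusters in plain English is to summarize each one in the original (unscaled) units. This is a sketch on a hypothetical six-row frame with labels already attached; the column names mirror the lab's features, but the values are invented.

```python
import pandas as pd

# Hypothetical passengers with cluster labels already assigned.
toy = pd.DataFrame({
    "Age":     [22, 25, 58, 61, 30, 35],
    "Fare":    [7.5, 8.0, 80.0, 90.0, 15.0, 13.0],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Named aggregation: one row per cluster, one column per summary statistic.
profile = toy.groupby("Cluster").agg(
    n_passengers=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_fare=("Fare", "mean"),
)
print(profile)
```

Profiling the unscaled values keeps the table readable ("older passengers paying high fares") instead of reporting z-scores.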

## Part 3: Other Clustering Algorithms

**TODO 3.1:** Try hierarchical clustering

In [None]:
# TODO 3.1: Hierarchical clustering
# Use a sample for efficiency
sample_size = min(200, len(titanic_scaled))
titanic_sample = # YOUR CODE HERE

# Perform hierarchical clustering
# YOUR CODE HERE

# Add hierarchical cluster labels
# YOUR CODE HERE

print(f"Hierarchical clustering completed on {sample_size} samples with {optimal_k} clusters")
print("\nHierarchical cluster distribution:")
print(titanic_sample_with_clusters['Hierarchical_Cluster'].value_counts().sort_index())

# Compare with K-means on the sample
# YOUR CODE HERE

**Question 3.1:** How do hierarchical clustering results compare to K-means? Which method do you prefer and why?
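
When comparing two clusterings, keep in mind that cluster IDs are arbitrary: cluster 0 from one algorithm need not match cluster 0 from another. The adjusted Rand index from scikit-learn accounts for this. The label arrays below are made up to show the two extremes.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Same grouping as labels_a, but with the cluster IDs renamed.
labels_b = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
# A genuinely different grouping of the same nine points.
labels_c = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(f"Renamed but identical grouping: ARI = {ari_same:.3f}")
print(f"Different grouping:             ARI = {ari_diff:.3f}")
```

A score of 1 means the two methods group the points identically, and scores near or below 0 mean the agreement is no better than chance. Comparing raw label values directly (for example with `==`) would wrongly report the first pair as completely different.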

**TODO 3.2:** Experiment with DBSCAN

In [None]:
# TODO 3.2: DBSCAN clustering
# Try different eps and min_samples values
eps_values = [0.5, 0.8, 1.0]
min_samples_values = [3, 5, 10]

best_silhouette = -1
best_params = None
best_labels = None

for eps in eps_values:
    for min_samples in min_samples_values:
        if RAPIDS_AVAILABLE:
            dbscan = # YOUR CODE HERE
            labels = dbscan.fit_predict(titanic_scaled)
        else:
            dbscan = # YOUR CODE HERE
            labels = dbscan.fit_predict(titanic_scaled)
        
        # Only evaluate if we have more than 1 cluster and enough non-noise points
        # (label -1 marks noise; every other label is a core or border point)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_clustered_points = (labels != -1).sum()
        
        if n_clusters > 1 and n_clustered_points > 10:
            try:
                sil_score = silhouette_score(titanic_scaled[labels != -1], labels[labels != -1])
                if sil_score > best_silhouette:
                    best_silhouette = sil_score
                    best_params = (eps, min_samples)
                    best_labels = labels
            except ValueError:
                # silhouette_score raises ValueError for invalid label sets; skip this combination
                continue

# Analyze best DBSCAN results
# (best_params stays None if no combination produced at least two clusters)
if best_labels is not None:
    print(f"Best DBSCAN parameters: eps={best_params[0]}, min_samples={best_params[1]}")
    print(f"Best silhouette score: {best_silhouette:.3f}")
    
    n_clusters = len(set(best_labels)) - (1 if -1 in best_labels else 0)
    n_noise = list(best_labels).count(-1)
    
    print(f"\nNumber of clusters found: {n_clusters}")
    print(f"Number of noise points: {n_noise} ({n_noise/len(titanic_scaled)*100:.1f}%)")
    
    # Add to main dataframe
    titanic_with_clusters['DBSCAN_Cluster'] = best_labels
else:
    print("No parameter combination produced at least two clusters; try other eps and min_samples values")

**Question 3.2:** What does DBSCAN find that the other methods don't? When would you choose DBSCAN over K-means?
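
The difference is easiest to see on data whose clusters are not round. The sketch below uses scikit-learn's two-moons generator rather than the Titanic data, and the `eps` and `min_samples` values are assumptions tuned for that toy set, not recommendations for the lab.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a shape that centroid-based methods handle poorly.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no dense region with the label -1.
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))

km_ari = adjusted_rand_score(y_true, km_labels)
db_ari = adjusted_rand_score(y_true, db_labels)

print(f"DBSCAN found {n_clusters} clusters and {n_noise} noise points")
print(f"Agreement with the true moons: K-means ARI = {km_ari:.2f}, DBSCAN ARI = {db_ari:.2f}")
```

DBSCAN follows dense regions of any shape and does not need the number of clusters up front, at the cost of being sensitive to `eps`. K-means splits the plane with a straight boundary between two centroids, so it cuts each moon in half.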

## Part 4: Dimensionality Reduction

**TODO 4.1:** Apply PCA for dimensionality reduction

In [None]:
# TODO 4.1: PCA dimensionality reduction
# Apply PCA to reduce to 2 dimensions
# YOUR CODE HERE

print("PCA completed")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")

# Create DataFrame with PCA results
# YOUR CODE HERE

# Plot PCA results colored by clusters
# YOUR CODE HERE

plt.tight_layout()
plt.show()

**Question 4.1:** How much variance is explained by the first two principal components? What does this tell us about the data structure?
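
As a reference point for reading `explained_variance_ratio_`, here is a sketch on synthetic data where five features are built from only two hidden factors plus a little noise. The mixing weights are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: 5 observed features driven by 2 hidden factors plus small noise.
factors = rng.normal(size=(n, 2))
mixing = np.array([
    [1.0, 0.8, 0.6, 0.0, 0.2],
    [0.0, 0.3, 0.5, 1.0, 0.9],
])
X = factors @ mixing + 0.1 * rng.normal(size=(n, 5))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("Cumulative:", np.round(cumulative, 3))
```

Here two components capture nearly all the variance because the data really is two-dimensional underneath. If the first two components of the Titanic data explain much less, a 2D PCA plot is hiding a substantial part of the structure, and clusters that overlap in the plot may still be separated in the full feature space.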

**TODO 4.2:** Try t-SNE for non-linear dimensionality reduction

In [None]:
# TODO 4.2: t-SNE dimensionality reduction
# Apply t-SNE to reduce to 2 dimensions
# YOUR CODE HERE

# Create DataFrame with t-SNE results
# YOUR CODE HERE

# Plot t-SNE results
# YOUR CODE HERE

plt.tight_layout()
plt.show()

**Question 4.2:** How does t-SNE visualization compare to PCA? Which one shows clearer cluster separation?
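
When comparing the two plots, note that a t-SNE picture depends on its `perplexity` setting, so it is worth looking at more than one value before drawing conclusions. A minimal sketch on synthetic blobs follows; the perplexity values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic data set: 150 points, 6 features, 3 groups.
X, y = make_blobs(n_samples=150, centers=3, n_features=6, random_state=0)

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape = {embeddings[perplexity].shape}")
```

Unlike PCA, t-SNE preserves local neighborhoods rather than global distances, so the spacing between groups and their apparent sizes in the plot should not be read as meaningful quantities.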

**TODO 4.3:** Try UMAP if available

In [None]:
# TODO 4.3: UMAP dimensionality reduction (if available)
if UMAP_AVAILABLE:
    print("Performing UMAP dimensionality reduction...")
    
    if RAPIDS_AVAILABLE:
        # Use RAPIDS cuML UMAP
        umap_reducer = # YOUR CODE HERE
        titanic_umap = umap_reducer.fit_transform(titanic_scaled)
    else:
        # Use CPU UMAP
        umap_reducer = # YOUR CODE HERE
        titanic_umap = umap_reducer.fit_transform(titanic_scaled)
    
    # Create DataFrame with UMAP results
    # YOUR CODE HERE
    
    # Plot UMAP results
    # YOUR CODE HERE
    
    plt.tight_layout()
    plt.show()
else:
    print("UMAP not available, skipping UMAP visualization")

**Question 4.3:** Compare PCA, t-SNE, and UMAP. Which method works best for this dataset and why?

## Part 5: Clustering Evaluation and Interpretation

**TODO 5.1:** Evaluate clustering quality

In [None]:
# TODO 5.1: Evaluate clustering quality
# Use silhouette score, Calinski-Harabasz score, Davies-Bouldin score
def evaluate_clustering(X, labels, method_name):
    if len(set(labels)) > 1:
        try:
            silhouette = # YOUR CODE HERE
            ch_score = # YOUR CODE HERE
            db_score = # YOUR CODE HERE
            
            print(f"\n{method_name} Clustering Evaluation:")
            print(f"Silhouette Score: {silhouette:.3f}")
            print(f"Calinski-Harabasz Score: {ch_score:.3f}")
            print(f"Davies-Bouldin Score: {db_score:.3f}")
            
            return {
                'silhouette': silhouette,
                'ch_score': ch_score,
                'db_score': db_score
            }
        except ValueError:
            print(f"\n{method_name}: Could not calculate all metrics")
            return None
    else:
        print(f"\n{method_name}: Only one cluster found")
        return None

# Evaluate K-means
# YOUR CODE HERE

# Evaluate DBSCAN (non-noise points only, i.e. labels != -1)
# YOUR CODE HERE

**Question 5.1:** Which clustering method performs best according to the metrics? What do these metrics tell us?
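
The three metrics do not point in the same direction, which is a frequent source of confusion. The sketch below scores a sensible labeling and a random labeling of the same synthetic blobs (illustrative data only) so the direction of each metric is visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Three well-separated synthetic groups; true_labels is the "good" clustering.
X, true_labels = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                            cluster_std=0.8, random_state=1)

# A "bad" clustering: labels assigned at random.
rng = np.random.default_rng(1)
random_labels = rng.integers(0, 3, size=len(X))

def summarize(X, labels):
    """Return the three internal validation metrics for one labeling."""
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

good = summarize(X, true_labels)
bad = summarize(X, random_labels)
for name in good:
    print(f"{name}: good labeling = {good[name]:.3f}, random labeling = {bad[name]:.3f}")
```

All three are internal metrics: they reward compact, well-separated groups and say nothing about whether the groups are meaningful, so they should be read alongside the cluster interpretation rather than instead of it.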

**TODO 5.2:** Interpret the clusters

In [None]:
# TODO 5.2: Interpret clusters
# Analyze each cluster's characteristics
print("\n=== CLUSTER INTERPRETATION ===")

# Analyze K-means clusters in detail
# YOUR CODE HERE

# Compare clustering with survival patterns
# YOUR CODE HERE

# Show example passengers from each cluster
# YOUR CODE HERE

**Question 5.2:** What real-world insights can you derive from the clusters? How do they relate to the survival patterns we know from supervised learning?
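
Because `Survived` was set aside before clustering, it can serve as an external check on what the clusters captured. One way to do that is a row-normalized cross-tabulation, sketched here on a hypothetical eight-row frame.

```python
import pandas as pd

# Hypothetical cluster labels and survival outcomes.
toy = pd.DataFrame({
    "Cluster":  [0, 0, 0, 0, 1, 1, 1, 1],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# normalize="index" turns each row into proportions that sum to 1,
# so the column for Survived == 1 is the survival rate per cluster.
rates = pd.crosstab(toy["Cluster"], toy["Survived"], normalize="index")
print(rates)
```

Large differences in survival rate between clusters suggest that the features used for clustering carry information about survival, even though the algorithm never saw the outcome.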

## Part 6: Advanced Challenge (Optional)

**TODO 6.1:** Try different preprocessing approaches

In [None]:
# TODO 6.1: Advanced preprocessing (Optional)
# Try different feature engineering approaches

# Option 1: Include more categorical features
# Option 2: Try different scaling methods
# Option 3: Create interaction features

# Example: Scale only the continuous features and leave the encoded ones as-is
# titanic_partial = titanic_clean.copy()
# scaler_partial = StandardScaler()
# features_to_scale = ['Age', 'Fare', 'Family_Size']
# titanic_partial[features_to_scale] = scaler_partial.fit_transform(titanic_partial[features_to_scale])

# Then compare clustering results
# YOUR CODE HERE

print("Advanced preprocessing completed - compare results with basic approach")

## Summary and Reflection

**TODO:** Complete the summary questions below

### Key Learnings:
1. What was the most surprising discovery from your clustering analysis?
2. Which clustering algorithm worked best for this dataset and why?
3. How do dimensionality reduction techniques help understand the data?
4. What are the practical applications of unsupervised learning on this dataset?
5. How does unsupervised learning complement supervised learning?

### Answers:
1. 
2. 
3. 
4. 
5. 

### Next Steps:
- Try other clustering algorithms (Gaussian Mixture Models, Spectral Clustering)
- Experiment with different distance metrics and linkage methods
- Apply clustering to other datasets
- Explore semi-supervised learning approaches
- Try advanced dimensionality reduction techniques

### Performance Notes:
- RAPIDS acceleration: [Used / Not used] (value of `RAPIDS_AVAILABLE`)
- UMAP available: [Yes / No] (value of `UMAP_AVAILABLE`)
- Dataset size: [`len(titanic_scaled)`] samples, [`titanic_scaled.shape[1]`] features
- Best clustering method: [Your choice]
- Best dimensionality reduction: [Your choice]