# Week 1: Clustering for Innovation - Part 2
## Technical Implementation & Design Integration

This notebook continues from Part 1, covering the technical deep dive and design applications.

**Part 2 Contents:**
- Section 0: Complete Setup & ALL Functions (50+ total)
- Section 3: Technical Deep Dive (function calls only)
- Section 4: Design Integration (function calls only)

**Note:** All code is organized as functions at the beginning for modularity and reusability.

**Prerequisites:** Run Part 1 first to understand the foundation, or use the quick setup below.

## Quick Setup
If you're starting directly with Part 2, run this cell to import the essential libraries and set the plotting style and random seed.

In [None]:
# Essential imports (if starting fresh)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs, make_moons, make_circles
import plotly.graph_objects as go
import plotly.express as px
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("✅ Part 2 setup complete!")

In [None]:
# Design Integration Functions

def transform_clusters_to_insights():
    """
    Transform technical clustering results into actionable innovation insights.
    Generate comprehensive innovation dataset and apply clustering.
    """
    print("💡 Transforming Clusters into Innovation Insights\n")
    
    # Generate comprehensive innovation dataset
    n_innovations = 1000
    n_features = 10
    n_clusters = 5
    
    # Generate base data
    from sklearn.datasets import make_blobs
    X_innovation, y_true = make_blobs(n_samples=n_innovations, 
                                     n_features=n_features,
                                     centers=n_clusters,
                                     cluster_std=1.2,
                                     random_state=42)
    
    # Feature names
    feature_names = [
        'Technical_Complexity', 'Market_Readiness', 'Investment_Required',
        'User_Impact', 'Implementation_Time', 'Risk_Level',
        'Innovation_Score', 'Scalability', 'Regulatory_Compliance', 'ROI_Potential'
    ]
    
    # Standardize
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_innovation)
    
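    # Apply clustering
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    
    # Create DataFrame from the standardized features so that per-cluster means
    # are in z-score units (the downstream risk thresholds of +/-0.5 and the
    # quadrant lines at 0 assume this scale)
    innovation_df = pd.DataFrame(X_scaled, columns=feature_names)
    innovation_df['Cluster'] = labels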
    
    # Visualize clusters in 2D using PCA
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Scatter plot
    colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
    for i in range(n_clusters):
        mask = labels == i
        ax1.scatter(X_pca[mask, 0], X_pca[mask, 1], 
                   c=[colors[i]], s=30, alpha=0.6,
                   label=f'Cluster {i+1}', edgecolors='black', linewidth=0.5)
    
    # Add cluster centers
    centers_pca = pca.transform(kmeans.cluster_centers_)
    ax1.scatter(centers_pca[:, 0], centers_pca[:, 1],
               c='black', marker='*', s=300,
               edgecolors='white', linewidth=2, zorder=10)
    
    ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
    ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
    ax1.set_title('Innovation Clusters (PCA Visualization)', fontsize=12, fontweight='bold')
    ax1.legend(loc='best')
    ax1.grid(True, alpha=0.3)
    
    # Cluster sizes
    cluster_sizes = innovation_df['Cluster'].value_counts().sort_index()
    ax2.bar(range(n_clusters), cluster_sizes.values, color=colors, alpha=0.7, edgecolor='black')
    ax2.set_xlabel('Cluster', fontsize=11)
    ax2.set_ylabel('Number of Innovations', fontsize=11)
    ax2.set_title('Innovation Distribution Across Clusters', fontsize=12, fontweight='bold')
    ax2.set_xticks(range(n_clusters))
    ax2.set_xticklabels([f'Cluster {i+1}' for i in range(n_clusters)])
    
    # Add value labels
    for i, v in enumerate(cluster_sizes.values):
        ax2.text(i, v + 5, str(v), ha='center', fontweight='bold')
    
    plt.suptitle('Innovation Landscape Overview', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print(f"\n📊 Innovation Clustering Results:")
    print(f"Total Innovations: {n_innovations}")
    print(f"Number of Clusters: {n_clusters}")
    print(f"Average Cluster Size: {n_innovations/n_clusters:.0f}")
    print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.3f}")
    
    return innovation_df, X_scaled, labels, kmeans


def create_innovation_archetypes():
    """
    Create innovation archetypes from clusters with detailed characterization.
    Maps clusters to meaningful innovation personas.
    """
    print("🎭 Creating Innovation Archetypes\n")
    
    # Get data from previous function or generate new
    innovation_df, X_scaled, labels, kmeans = transform_clusters_to_insights()
    
    n_clusters = len(np.unique(labels))
    
    # Define archetype characteristics
    archetype_names = [
        'Digital Pioneers',
        'Market Disruptors', 
        'Efficiency Optimizers',
        'Customer Champions',
        'Platform Builders'
    ]
    
    archetype_descriptions = [
        'High-tech, high-risk innovations targeting early adopters',
        'Game-changing solutions that redefine market dynamics',
        'Process improvements focusing on cost and time savings',
        'User-centric innovations prioritizing experience',
        'Ecosystem solutions creating network effects'
    ]
    
    # Analyze each cluster
    archetypes = []
    feature_names = innovation_df.columns[:-1]  # Exclude 'Cluster' column
    
    for cluster_id in range(n_clusters):
        cluster_data = innovation_df[innovation_df['Cluster'] == cluster_id]
        
        # Calculate statistics
        archetype = {
            'Cluster': cluster_id + 1,
            'Name': archetype_names[cluster_id % len(archetype_names)],
            'Description': archetype_descriptions[cluster_id % len(archetype_descriptions)],
            'Size': len(cluster_data),
            'Percentage': f"{len(cluster_data)/len(innovation_df)*100:.1f}%"
        }
        
        # Top features
        feature_means = cluster_data[feature_names].mean()
        top_features = feature_means.nlargest(3).index.tolist()
        archetype['Top_Features'] = ', '.join(top_features)
        
        # Risk profile
        if 'Risk_Level' in cluster_data.columns:
            risk_level = cluster_data['Risk_Level'].mean()
            if risk_level > 0.5:
                archetype['Risk_Profile'] = 'High Risk'
            elif risk_level > -0.5:
                archetype['Risk_Profile'] = 'Medium Risk'
            else:
                archetype['Risk_Profile'] = 'Low Risk'
        
        archetypes.append(archetype)
    
    # Create archetype cards visualization (one polar subplot per archetype)
    fig, axes = plt.subplots(1, n_clusters, figsize=(18, 6),
                             subplot_kw={'projection': 'polar'})
    axes = np.atleast_1d(axes)
    
    colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
    
    for idx, archetype in enumerate(archetypes):
        ax = axes[idx]
        
        # Create radar chart for each archetype
        cluster_data = innovation_df[innovation_df['Cluster'] == idx]
        feature_values = cluster_data[feature_names[:6]].mean().values
        
        # Normalize to 0-1 scale
        feature_values = (feature_values - feature_values.min()) / (feature_values.max() - feature_values.min() + 1e-10)
        
        # Create radar chart
        angles = np.linspace(0, 2*np.pi, len(feature_names[:6]), endpoint=False)
        feature_values = np.concatenate((feature_values, [feature_values[0]]))
        angles = np.concatenate((angles, [angles[0]]))
        
        ax.plot(angles, feature_values, 'o-', linewidth=2, color=colors[idx])
        ax.fill(angles, feature_values, alpha=0.25, color=colors[idx])
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels([f.replace('_', '\n') for f in feature_names[:6]], fontsize=8)
        ax.set_ylim(0, 1)
        ax.set_title(f"{archetype['Name']}\n({archetype['Size']} innovations)", 
                    fontsize=10, fontweight='bold', pad=20)
        ax.grid(True)
    
    plt.suptitle('Innovation Archetype Profiles', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Display archetype summary
    archetypes_df = pd.DataFrame(archetypes)
    print("\n📋 Innovation Archetype Summary:")
    display(archetypes_df[['Name', 'Size', 'Percentage', 'Risk_Profile', 'Top_Features']])
    
    print("\n💡 How to Use Archetypes:")
    print("• Tailor innovation strategies per archetype")
    print("• Allocate resources based on archetype characteristics")
    print("• Design specific support programs for each type")
    print("• Track archetype evolution over time")
    
    return archetypes_df, innovation_df


def generate_opportunity_analysis():
    """
    Generate comprehensive opportunity analysis with heatmaps and priority matrices.
    Identifies white spaces and strategic opportunities.
    """
    print("🔥 Innovation Opportunity Analysis\n")
    
    # Get clustered data
    innovation_df, X_scaled, labels, kmeans = transform_clusters_to_insights()
    n_clusters = len(np.unique(labels))
    
    # Calculate opportunity scores
    opportunity_dimensions = [
        'Market_Size', 'Growth_Rate', 'Competition',
        'Tech_Readiness', 'Investment_Need', 'Time_to_Market',
        'Risk_Level', 'Regulatory', 'Customer_Demand'
    ]
    
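    # Create opportunity matrix
    # NOTE: these scores are random placeholders drawn from a standard normal
    # distribution for illustration; they are not derived from the clusters
    np.random.seed(42)
    opportunity_matrix = np.random.randn(n_clusters, len(opportunity_dimensions))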
    
    # Create heatmap
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Heatmap
    im = ax1.imshow(opportunity_matrix, cmap='RdYlGn', aspect='auto', vmin=-2, vmax=2)
    ax1.set_xticks(range(len(opportunity_dimensions)))
    ax1.set_xticklabels(opportunity_dimensions, rotation=45, ha='right')
    ax1.set_yticks(range(n_clusters))
    ax1.set_yticklabels([f'Cluster {i+1}' for i in range(n_clusters)])
    ax1.set_title('Innovation Opportunity Heatmap', fontsize=12, fontweight='bold')
    
    # Add values
    for i in range(n_clusters):
        for j in range(len(opportunity_dimensions)):
            text = ax1.text(j, i, f'{opportunity_matrix[i, j]:.1f}',
                           ha='center', va='center', color='black', fontsize=8)
    
    plt.colorbar(im, ax=ax1, label='Opportunity Score')
    
    # Priority matrix
    cluster_sizes = innovation_df['Cluster'].value_counts().sort_index()
    impact = innovation_df.groupby('Cluster')['User_Impact'].mean().values
    effort = innovation_df.groupby('Cluster')['Implementation_Time'].mean().values
    
    colors = plt.cm.Set3(np.linspace(0, 1, n_clusters))
    
    ax2.scatter(effort, impact, s=cluster_sizes.values*2, c=colors, 
               alpha=0.6, edgecolors='black', linewidth=2)
    
    for i in range(n_clusters):
        ax2.annotate(f'C{i+1}', (effort[i], impact[i]),
                    ha='center', va='center', fontsize=9, fontweight='bold')
    
    # Add quadrant lines
    ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
    ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    
    # Add quadrant labels
    ax2.text(1, 1, 'High Impact\nHigh Effort', ha='center', va='center', 
            fontsize=10, alpha=0.5, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))
    ax2.text(-1, 1, 'High Impact\nLow Effort', ha='center', va='center', 
            fontsize=10, alpha=0.5, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.3))
    ax2.text(-1, -1, 'Low Impact\nLow Effort', ha='center', va='center', 
            fontsize=10, alpha=0.5, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.3))
    ax2.text(1, -1, 'Low Impact\nHigh Effort', ha='center', va='center', 
            fontsize=10, alpha=0.5, bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.3))
    
    ax2.set_xlabel('Implementation Effort', fontsize=11)
    ax2.set_ylabel('User Impact', fontsize=11)
    ax2.set_title('Innovation Priority Matrix', fontsize=12, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎯 Strategic Recommendations:")
    print("\n🟢 Quick Wins (High Impact, Low Effort):")
    print("  • Focus immediate resources here")
    print("  • Rapid prototyping and testing")
    print("\n🟡 Strategic Initiatives (High Impact, High Effort):")
    print("  • Long-term investment required")
    print("  • Build dedicated teams")
    print("\n🔵 Fill-ins (Low Impact, Low Effort):")
    print("  • Good for learning and experimentation")
    print("  • Assign to junior teams")
    print("\n🔴 Avoid (Low Impact, High Effort):")
    print("  • Deprioritize or eliminate")
    print("  • Redirect resources elsewhere")
    
    return opportunity_matrix, innovation_df


def build_innovation_taxonomy():
    """
    Build hierarchical innovation taxonomy using hierarchical clustering.
    Shows relationships between innovation clusters.
    """
    print("🌳 Building Innovation Taxonomy\n")
    
    # Get cluster centers from k-means
    _, X_scaled, labels, kmeans = transform_clusters_to_insights()
    
    # Use cluster centers for hierarchical clustering
    from scipy.cluster.hierarchy import dendrogram, linkage
    
    archetype_names = [
        'Digital Pioneers',
        'Market Disruptors',
        'Efficiency Optimizers',
        'Customer Champions',
        'Platform Builders'
    ]
    
    # Hierarchical clustering on centers
    linkage_matrix = linkage(kmeans.cluster_centers_, method='ward')
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Dendrogram
    dendrogram(linkage_matrix, ax=ax1, labels=archetype_names,
              color_threshold=0, above_threshold_color='gray')
    ax1.set_title('Innovation Taxonomy Hierarchy', fontsize=12, fontweight='bold')
    ax1.set_xlabel('Innovation Archetype')
    ax1.set_ylabel('Distance')
    
    # Lifecycle stages
    lifecycle_stages = ['Ideation', 'Validation', 'Development', 'Launch', 'Scale', 'Maturity']
    n_clusters = len(kmeans.cluster_centers_)
    stage_distribution = np.random.dirichlet(np.ones(len(lifecycle_stages)), size=n_clusters)
    
    # Stack bar chart for lifecycle
    bottom = np.zeros(n_clusters)
    stage_colors = plt.cm.coolwarm(np.linspace(0, 1, len(lifecycle_stages)))
    
    for stage_idx, stage in enumerate(lifecycle_stages):
        values = stage_distribution[:, stage_idx]
        ax2.bar(range(n_clusters), values, bottom=bottom, 
               color=stage_colors[stage_idx], label=stage, alpha=0.8)
        bottom += values
    
    ax2.set_xlabel('Innovation Archetype', fontsize=11)
    ax2.set_ylabel('Proportion', fontsize=11)
    ax2.set_title('Innovation Lifecycle Distribution by Archetype', fontsize=12, fontweight='bold')
    ax2.set_xticks(range(n_clusters))
    ax2.set_xticklabels([name.split()[0] for name in archetype_names], rotation=45)
    ax2.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Taxonomy Insights:")
    print("• Digital Pioneers and Market Disruptors are closely related")
    print("• Efficiency Optimizers form a distinct branch")
    print("• Platform Builders bridge multiple categories")
    
    return linkage_matrix, stage_distribution


def create_innovation_ecosystem():
    """
    Create innovation ecosystem network showing relationships between
    archetypes and stakeholders.
    """
    print("🌐 Innovation Ecosystem Network\n")
    
    import networkx as nx
    
    # Create network graph
    G = nx.Graph()
    
    archetype_names = [
        'Digital Pioneers',
        'Market Disruptors',
        'Efficiency Optimizers',
        'Customer Champions',
        'Platform Builders'
    ]
    
    # Add nodes for archetypes
    cluster_sizes = [200, 180, 150, 170, 200]  # Example sizes
    for i, name in enumerate(archetype_names):
        G.add_node(name, node_type='archetype', size=cluster_sizes[i])
    
    # Add stakeholder nodes
    stakeholders = ['Customers', 'Partners', 'Investors', 'Regulators', 'Competitors']
    for stakeholder in stakeholders:
        G.add_node(stakeholder, node_type='stakeholder', size=100)
    
    # Add edges (connections)
    connections = [
        ('Digital Pioneers', 'Investors', 0.8),
        ('Digital Pioneers', 'Partners', 0.6),
        ('Market Disruptors', 'Competitors', 0.9),
        ('Market Disruptors', 'Customers', 0.7),
        ('Efficiency Optimizers', 'Partners', 0.8),
        ('Efficiency Optimizers', 'Regulators', 0.5),
        ('Customer Champions', 'Customers', 0.9),
        ('Customer Champions', 'Partners', 0.6),
        ('Platform Builders', 'Partners', 0.9),
        ('Platform Builders', 'Investors', 0.7)
    ]
    
    for source, target, weight in connections:
        G.add_edge(source, target, weight=weight)
    
    # Visualize network
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Layout
    pos = nx.spring_layout(G, k=2, iterations=50)
    
    # Draw nodes
    archetype_nodes = [n for n in G.nodes() if G.nodes[n]['node_type'] == 'archetype']
    stakeholder_nodes = [n for n in G.nodes() if G.nodes[n]['node_type'] == 'stakeholder']
    
    colors = plt.cm.Set3(np.linspace(0, 1, len(archetype_nodes)))
    
    # Archetype nodes
    nx.draw_networkx_nodes(G, pos, nodelist=archetype_nodes,
                          node_color=colors,
                          node_size=[G.nodes[n]['size']*5 for n in archetype_nodes],
                          alpha=0.7, ax=ax)
    
    # Stakeholder nodes
    nx.draw_networkx_nodes(G, pos, nodelist=stakeholder_nodes,
                          node_color='lightgray',
                          node_size=500,
                          node_shape='s',
                          alpha=0.8, ax=ax)
    
    # Draw edges
    edges = G.edges()
    weights = [G[u][v]['weight'] for u, v in edges]
    nx.draw_networkx_edges(G, pos, width=[w*3 for w in weights],
                          alpha=0.5, ax=ax)
    
    # Labels
    labels = {n: n.split()[0] if len(n.split()) > 1 else n for n in G.nodes()}
    nx.draw_networkx_labels(G, pos, labels, font_size=10, font_weight='bold', ax=ax)
    
    ax.set_title('Innovation Ecosystem Network', fontsize=14, fontweight='bold')
    ax.axis('off')
    
    # Add legend
    from matplotlib.patches import Rectangle, Circle
    legend_elements = [
        Circle((0, 0), 0.1, facecolor=colors[0], alpha=0.7, label='Innovation Archetypes'),
        Rectangle((0, 0), 0.1, 0.1, facecolor='lightgray', alpha=0.8, label='Stakeholders')
    ]
    ax.legend(handles=legend_elements, loc='upper right')
    
    plt.tight_layout()
    plt.show()
    
    print("\n🌐 Ecosystem Insights:")
    print("• Digital Pioneers have strong investor connections")
    print("• Customer Champions directly connect with users")
    print("• Platform Builders bridge multiple stakeholder groups")
    print("• Market Disruptors create competitive tension")
    print("\n💡 Use this network to:")
    print("• Identify collaboration opportunities")
    print("• Understand influence patterns")
    print("• Design stakeholder engagement strategies")
    
    return G

print("Design integration functions loaded successfully!")
print("\n✅ All functions for Part 2 are now ready!")
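
The design integration functions above all rest on the same small pipeline: standardize the features, cluster them, then read each cluster's mean feature values as a profile. The sketch below isolates that step on a tiny synthetic dataset. The three feature names, the sample sizes, and the `risk_bucket` helper are assumptions made for illustration only; the ±0.5 cut-offs mirror the ones used in `create_innovation_archetypes` and are easiest to interpret when the means are in z-score units.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names, used only for this illustration.
feature_names = ["Risk_Level", "User_Impact", "Implementation_Time"]

X, _ = make_blobs(n_samples=300, n_features=3, centers=3,
                  cluster_std=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Per-cluster mean of each standardized feature (z-score units).
profile = (pd.DataFrame(X_scaled, columns=feature_names)
             .assign(Cluster=kmeans.labels_)
             .groupby("Cluster")
             .mean())


def risk_bucket(z):
    """Map a mean z-score to a coarse label using +/-0.5 cut-offs."""
    if z > 0.5:
        return "High Risk"
    if z > -0.5:
        return "Medium Risk"
    return "Low Risk"


profile["Risk_Profile"] = profile["Risk_Level"].apply(risk_bucket)

# The centers live in the same standardized space the PCA is fitted on,
# so they can be projected with the same fitted PCA object.
pca = PCA(n_components=2).fit(X_scaled)
centers_2d = pca.transform(kmeans.cluster_centers_)

print(profile.round(2))
print(centers_2d.shape)
```

Keeping the profile in standardized units means a value near 0 reads as "about average across all innovations", which is what the quadrant lines in the priority matrix assume as well.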

### 0.3 Design Integration Functions
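
One of the functions in the following cell, `find_optimal_k_elbow`, locates the elbow of the inertia curve from second differences. The indexing is easy to get wrong, so here is a minimal sketch of that step alone. The helper name `elbow_by_second_difference` and the inertia values are made up for illustration, and the sketch assumes a decreasing, roughly convex inertia curve, which lets it take the largest second difference directly.

```python
import numpy as np


def elbow_by_second_difference(k_values, inertias):
    """Return the k at which the inertia curve bends most sharply.

    Hypothetical helper for illustration. Applying np.diff twice gives
    second[i] = inertias[i+2] - 2*inertias[i+1] + inertias[i],
    which is centered on position i + 1 of the original sequence.
    """
    k_values = list(k_values)
    inertias = np.asarray(inertias, dtype=float)
    if len(inertias) < 3:
        raise ValueError("need at least three inertia values")
    second = np.diff(inertias, n=2)
    # Shift by one to map the second-difference index back to k_values.
    return k_values[int(np.argmax(second)) + 1]


# Made-up inertia values for k = 1..7 with an obvious bend at k = 4.
k_values = range(1, 8)
inertias = [1000, 700, 400, 100, 90, 80, 70]
print(elbow_by_second_difference(k_values, inertias))  # prints 4
```

On real data the largest second difference often sits at the smallest k, where inertia drops fastest, so it is worth cross-checking the result against the silhouette and Davies-Bouldin curves.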

In [None]:
# Algorithm Demonstration Functions

def demonstrate_kmeans_step_by_step(X=None, n_clusters=3, n_iterations=5):
    """
    Visualize K-means algorithm step by step.
    Shows how centers converge to optimal positions.
    """
    print("🎯 K-Means Clustering: Step-by-Step Process\n")
    
    if X is None:
        X, y_true = generate_blob_data(n_samples=300, centers=3, cluster_std=0.8)
    
    np.random.seed(42)
    
    # Initialize random centers
    idx = np.random.choice(len(X), n_clusters, replace=False)
    centers = X[idx].copy()
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for iteration in range(n_iterations):
        ax = axes[iteration]
        
        # Assign points to nearest center
        distances = np.zeros((len(X), n_clusters))
        for k in range(n_clusters):
            distances[:, k] = np.linalg.norm(X - centers[k], axis=1)
        labels = np.argmin(distances, axis=1)
        
        # Visualize current state
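        # Palette cycles, so n_clusters > 3 does not raise an IndexError
        base_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
                       '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
        colors = [base_colors[k % len(base_colors)] for k in range(n_clusters)]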
        for k in range(n_clusters):
            mask = labels == k
            ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], 
                      s=50, alpha=0.6, label=f'Cluster {k+1}')
        
        # Plot centers
        ax.scatter(centers[:, 0], centers[:, 1], c='red', 
                  marker='*', s=300, edgecolors='black', 
                  linewidth=2, label='Centers', zorder=10)
        
        # Update centers
        new_centers = np.zeros_like(centers)
        for k in range(n_clusters):
            if np.sum(labels == k) > 0:
                new_centers[k] = X[labels == k].mean(axis=0)
            else:
                new_centers[k] = centers[k]
        
        # Draw movement arrows
        for k in range(n_clusters):
            ax.arrow(centers[k, 0], centers[k, 1],
                    new_centers[k, 0] - centers[k, 0],
                    new_centers[k, 1] - centers[k, 1],
                    head_width=0.1, head_length=0.1,
                    fc='black', ec='black', alpha=0.5)
        
        centers = new_centers.copy()
        
        ax.set_title(f'Iteration {iteration + 1}', fontsize=12, fontweight='bold')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
        if iteration == 0:
            ax.legend(loc='upper right', fontsize=8)
    
    # Final result
    ax = axes[5]
    for k in range(n_clusters):
        mask = labels == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], 
                  s=50, alpha=0.6, label=f'Cluster {k+1}')
    ax.scatter(centers[:, 0], centers[:, 1], c='red', 
              marker='*', s=300, edgecolors='black', 
              linewidth=2, label='Final Centers', zorder=10)
    ax.set_title('Final Result', fontsize=12, fontweight='bold')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend(loc='upper right', fontsize=8)
    
    plt.suptitle('K-Means Algorithm: Watch Centers Converge to Optimal Positions', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    silhouette = silhouette_score(X, labels)
    print(f"\n📊 Algorithm Performance:")
    print(f"Silhouette Score: {silhouette:.3f}")
    print(f"Ran {n_iterations} iterations (convergence is not checked)")
    
    return centers, labels


def demonstrate_kmeans_implementation():
    """
    Hands-on K-Means implementation with different K values.
    Shows impact of K on clustering quality.
    """
    print("🔧 Hands-on K-Means Implementation\n")
    
    # Create innovation dataset
    df, X_scaled, y_true = generate_innovation_data(n_samples=1000, n_features=10, n_clusters=4)
    
    print(f"Dataset shape: {X_scaled.shape}")
    print(f"Features: {', '.join(df.columns[:10])}\n")
    
    # Apply K-means with different K values
    k_values = [2, 3, 4, 5, 6]
    results = {}
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    # Use PCA for visualization
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(StandardScaler().fit_transform(X_scaled))
    
    for idx, k in enumerate(k_values):
        # Fit K-means
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_scaled)
        
        # Calculate metrics
        silhouette = silhouette_score(X_scaled, labels)
        inertia = kmeans.inertia_
        
        results[k] = {
            'labels': labels,
            'centers': kmeans.cluster_centers_,
            'silhouette': silhouette,
            'inertia': inertia
        }
        
        # Visualize
        ax = axes[idx]
        scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, 
                            cmap='viridis', s=20, alpha=0.6)
        
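        # Plot centers in PCA space, reusing the standardization statistics of
        # X_scaled that the PCA was fitted on (a scaler must not be refit on the centers)
        centers_std = (kmeans.cluster_centers_ - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
        centers_pca = pca.transform(centers_std)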
        ax.scatter(centers_pca[:, 0], centers_pca[:, 1],
                  c='red', marker='*', s=300, 
                  edgecolors='black', linewidth=2)
        
        ax.set_title(f'K={k}, Silhouette={silhouette:.3f}', 
                    fontsize=11, fontweight='bold')
        ax.set_xlabel('First Principal Component')
        ax.set_ylabel('Second Principal Component')
        plt.colorbar(scatter, ax=ax)
    
    # Hide extra subplot
    axes[-1].set_visible(False)
    
    plt.suptitle('K-Means with Different K Values (PCA Visualization)', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print comparison
    print("\n📊 K-Means Results Comparison:")
    print("K | Silhouette | Inertia")
    print("-" * 30)
    for k, metrics in results.items():
        print(f"{k} | {metrics['silhouette']:.3f}      | {metrics['inertia']:.1f}")
    
    # Best K
    best_k = max(results.keys(), key=lambda k: results[k]['silhouette'])
    print(f"\n✨ Best K={best_k} with silhouette score {results[best_k]['silhouette']:.3f}")
    
    return results


def implement_kmeans_from_scratch():
    """
    Exercise: Implement K-means from scratch.
    Compare with sklearn implementation.
    """
    print("🎯 Exercise: Implement K-Means from Scratch\n")
    
    class MyKMeans:
        """Simple K-Means implementation for learning"""
        
        def __init__(self, n_clusters=3, max_iters=100, tol=1e-4):
            self.n_clusters = n_clusters
            self.max_iters = max_iters
            self.tol = tol
            self.centers = None
            self.labels = None
        
        def fit(self, X):
            """Fit K-means to data"""
            n_samples = X.shape[0]
            
            # Initialize centers randomly
            idx = np.random.choice(n_samples, self.n_clusters, replace=False)
            self.centers = X[idx].copy()
            
            for iteration in range(self.max_iters):
                # Assign points to nearest center
                distances = np.zeros((n_samples, self.n_clusters))
                for k in range(self.n_clusters):
                    distances[:, k] = np.linalg.norm(X - self.centers[k], axis=1)
                self.labels = np.argmin(distances, axis=1)
                
                # Update centers
                new_centers = np.zeros_like(self.centers)
                for k in range(self.n_clusters):
                    if np.sum(self.labels == k) > 0:
                        new_centers[k] = X[self.labels == k].mean(axis=0)
                    else:
                        new_centers[k] = self.centers[k]
                
                # Check convergence
                if np.linalg.norm(new_centers - self.centers) < self.tol:
                    print(f"Converged at iteration {iteration + 1}")
                    break
                
                self.centers = new_centers
            
            return self
        
        def predict(self, X):
            """Predict cluster labels"""
            distances = np.zeros((X.shape[0], self.n_clusters))
            for k in range(self.n_clusters):
                distances[:, k] = np.linalg.norm(X - self.centers[k], axis=1)
            return np.argmin(distances, axis=1)
    
    # Test implementation
    X_test, _ = generate_blob_data(n_samples=200, centers=3)
    
    # Your implementation
    my_kmeans = MyKMeans(n_clusters=3)
    my_kmeans.fit(X_test)
    my_labels = my_kmeans.labels
    
    # Sklearn implementation
    sklearn_kmeans = KMeans(n_clusters=3, random_state=42)
    sklearn_labels = sklearn_kmeans.fit_predict(X_test)
    
    # Compare results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    
    ax1.scatter(X_test[:, 0], X_test[:, 1], c=my_labels, cmap='viridis', s=50)
    ax1.scatter(my_kmeans.centers[:, 0], my_kmeans.centers[:, 1],
               c='red', marker='*', s=300, edgecolors='black', linewidth=2)
    ax1.set_title('Your Implementation', fontsize=12, fontweight='bold')
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    
    ax2.scatter(X_test[:, 0], X_test[:, 1], c=sklearn_labels, cmap='viridis', s=50)
    ax2.scatter(sklearn_kmeans.cluster_centers_[:, 0], 
               sklearn_kmeans.cluster_centers_[:, 1],
               c='red', marker='*', s=300, edgecolors='black', linewidth=2)
    ax2.set_title('Sklearn Implementation', fontsize=12, fontweight='bold')
    ax2.set_xlabel('Feature 1')
    ax2.set_ylabel('Feature 2')
    
    plt.suptitle('K-Means Implementation Comparison', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print(f"\n✅ Your implementation silhouette score: {silhouette_score(X_test, my_labels):.3f}")
    print(f"✅ Sklearn silhouette score: {silhouette_score(X_test, sklearn_labels):.3f}")
    
    return my_kmeans, sklearn_kmeans


def find_optimal_k_elbow():
    """
    Comprehensive elbow method analysis with multiple metrics.
    Shows how to find the optimal number of clusters.
    """
    print("📈 Finding Optimal K: The Elbow Method\n")
    
    # Generate data with known clusters
    X_elbow, y_true = generate_blob_data(n_samples=500, centers=4, cluster_std=1.0)
    
    # Test range of K values
    k_range = range(1, 11)
    inertias = []
    silhouettes = []
    davies_bouldins = []
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X_elbow)
        
        inertias.append(kmeans.inertia_)
        
        if k > 1:  # Metrics need at least 2 clusters
            silhouettes.append(silhouette_score(X_elbow, labels))
            davies_bouldins.append(davies_bouldin_score(X_elbow, labels))
        else:
            silhouettes.append(0)
            davies_bouldins.append(0)
    
    # Calculate elbow point as the largest change in slope of the inertia curve
    deltas = np.diff(inertias)
    delta_deltas = np.diff(deltas)
    # delta_deltas[i] is centered on inertias[i + 1], so shift the index by 1
    elbow_idx = np.argmax(np.abs(delta_deltas)) + 1
    
    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(14, 5))
    
    # Inertia/Elbow plot
    ax1 = axes[0]
    ax1.plot(k_range, inertias, 'bo-', linewidth=2, markersize=8)
    ax1.axvline(x=list(k_range)[elbow_idx], color='red', 
               linestyle='--', alpha=0.7, label=f'Elbow at k={list(k_range)[elbow_idx]}')
    ax1.set_xlabel('Number of Clusters (k)', fontsize=11)
    ax1.set_ylabel('Inertia', fontsize=11)
    ax1.set_title('Elbow Method', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Silhouette scores
    ax2 = axes[1]
    ax2.plot(k_range[1:], silhouettes[1:], 'go-', linewidth=2, markersize=8)
    best_silhouette_k = list(k_range)[np.argmax(silhouettes[1:]) + 1]  # skip the k=1 placeholder
    ax2.axvline(x=best_silhouette_k, color='red', linestyle='--', 
               alpha=0.7, label=f'Best at k={best_silhouette_k}')
    ax2.set_xlabel('Number of Clusters (k)', fontsize=11)
    ax2.set_ylabel('Silhouette Score', fontsize=11)
    ax2.set_title('Silhouette Analysis', fontsize=12, fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Davies-Bouldin Index (lower is better)
    ax3 = axes[2]
    ax3.plot(k_range[1:], davies_bouldins[1:], 'ro-', linewidth=2, markersize=8)
    best_db_k = list(k_range)[np.argmin(davies_bouldins[1:]) + 1 if davies_bouldins[1:] else 0]
    ax3.axvline(x=best_db_k, color='green', linestyle='--', 
               alpha=0.7, label=f'Best at k={best_db_k}')
    ax3.set_xlabel('Number of Clusters (k)', fontsize=11)
    ax3.set_ylabel('Davies-Bouldin Index', fontsize=11)
    ax3.set_title('Davies-Bouldin Index (lower is better)', fontsize=12, fontweight='bold')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    plt.suptitle('Multiple Methods for Finding Optimal K', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Optimal K Recommendations:")
    print(f"Elbow Method: k={list(k_range)[elbow_idx]}")
    print(f"Silhouette Score: k={best_silhouette_k}")
    print(f"Davies-Bouldin Index: k={best_db_k}")
    print("\nTrue number of clusters: 4")
    print("\n💡 Tip: When methods disagree, consider domain knowledge and use case!")
    
    return {'elbow_k': list(k_range)[elbow_idx], 'silhouette_k': best_silhouette_k, 'db_k': best_db_k}


def demonstrate_dbscan_parameters():
    """
    DBSCAN parameter exploration showing impact of eps and min_samples.
    Helps understand how to tune DBSCAN for different datasets.
    """
    print("🔍 DBSCAN: Understanding eps and min_samples\n")
    
    # Generate data with outliers
    X_dbscan, _ = generate_blob_data(n_samples=300, centers=3, cluster_std=0.5)
    # Add noise points
    X_noise = np.random.uniform(-6, 6, (50, 2))
    X_dbscan = np.vstack([X_dbscan, X_noise])
    
    # Test different parameter combinations
    eps_values = [0.3, 0.5, 0.7, 1.0]
    min_samples_values = [3, 5, 10, 20]
    
    fig, axes = plt.subplots(len(eps_values), len(min_samples_values), 
                            figsize=(16, 12))
    
    for i, eps in enumerate(eps_values):
        for j, min_samples in enumerate(min_samples_values):
            ax = axes[i, j]
            
            # Apply DBSCAN
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(X_dbscan)
            
            # Count clusters and noise
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = list(labels).count(-1)
            
            # Plot
            unique_labels = set(labels)
            colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
            
            for k, col in zip(unique_labels, colors):
                if k == -1:
                    col = 'black'
                    marker = 'x'
                else:
                    marker = 'o'
                
                class_member_mask = (labels == k)
                xy = X_dbscan[class_member_mask]
                ax.scatter(xy[:, 0], xy[:, 1], c=[col], 
                          marker=marker, s=30, alpha=0.7)
            
            ax.set_title(f'eps={eps}, min={min_samples}\nC={n_clusters}, N={n_noise}',
                        fontsize=9)
            ax.set_xticks([])
            ax.set_yticks([])
    
    # Add labels
    for i, eps in enumerate(eps_values):
        axes[i, 0].set_ylabel(f'eps={eps}', fontsize=10, fontweight='bold')
    for j, min_samples in enumerate(min_samples_values):
        axes[0, j].set_xlabel(f'min_samples={min_samples}', fontsize=10, fontweight='bold')
        axes[0, j].xaxis.set_label_position('top')
    
    plt.suptitle('DBSCAN Parameter Grid: Impact of eps and min_samples\n'
                'C=Clusters, N=Noise points', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n📚 Parameter Guidelines:")
    print("• eps: Maximum distance between points in same neighborhood")
    print("  - Too small: Many clusters, more noise")
    print("  - Too large: Few clusters, points merge")
    print("\n• min_samples: Minimum points to form dense region")
    print("  - Too small: More clusters, less noise")
    print("  - Too large: Fewer clusters, more noise")
    print("\n💡 Start with min_samples = 2 * dimensions, adjust eps based on data")
    
    return X_dbscan


def demonstrate_hierarchical_clustering():
    """
    Hierarchical clustering demonstration with dendrograms.
    Compares four linkage methods on the same data.
    """
    print("🌳 Hierarchical Clustering: Building Innovation Taxonomy\n")
    
    # Generate hierarchical data
    X_hier, y_hier = generate_blob_data(n_samples=100, centers=4, cluster_std=0.5)
    
    # Different linkage methods
    linkage_methods = ['ward', 'complete', 'average', 'single']
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    for idx, method in enumerate(linkage_methods):
        ax = axes[idx]
        
        # Perform hierarchical clustering
        from scipy.cluster.hierarchy import dendrogram, linkage
        linkage_matrix = linkage(X_hier, method=method)
        
        # Plot dendrogram
        dendrogram(linkage_matrix, ax=ax, truncate_mode='level', 
                  p=5, color_threshold=0, above_threshold_color='gray')
        
        ax.set_title(f'Linkage: {method.capitalize()}', fontsize=12, fontweight='bold')
        ax.set_xlabel('Sample Index')
        ax.set_ylabel('Distance')
    
    plt.suptitle('Hierarchical Clustering with Different Linkage Methods', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Linkage Method Comparison:")
    print("• Ward: Merges the pair that least increases within-cluster variance (most common)")
    print("• Complete: Maximum pairwise distance between clusters")
    print("• Average: Average distance over all cross-cluster pairs")
    print("• Single: Minimum pairwise distance (can create chains)")
    
    return X_hier


def demonstrate_gmm():
    """
    Gaussian Mixture Models demonstration.
    Shows soft clustering with probabilities.
    """
    print("🔮 Gaussian Mixture Models: Soft Clustering\n")
    
    # Generate overlapping clusters
    X_gmm, y_gmm = generate_blob_data(n_samples=400, centers=3, cluster_std=1.2)
    
    # Fit GMM
    from sklearn.mixture import GaussianMixture
    gmm = GaussianMixture(n_components=3, random_state=42)
    gmm.fit(X_gmm)
    
    # Get predictions and probabilities
    gmm_labels = gmm.predict(X_gmm)
    gmm_probs = gmm.predict_proba(X_gmm)
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # Hard clustering
    ax1 = axes[0]
    scatter1 = ax1.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, 
                          cmap='viridis', s=30, alpha=0.7)
    ax1.scatter(gmm.means_[:, 0], gmm.means_[:, 1],
               c='red', marker='*', s=300, edgecolors='black', linewidth=2)
    ax1.set_title('Hard Assignment', fontsize=11, fontweight='bold')
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    plt.colorbar(scatter1, ax=ax1)
    
    # Soft clustering - show uncertainty
    ax2 = axes[1]
    uncertainty = -np.sum(gmm_probs * np.log(gmm_probs + 1e-10), axis=1)
    scatter2 = ax2.scatter(X_gmm[:, 0], X_gmm[:, 1], c=uncertainty, 
                          cmap='RdYlGn_r', s=30, alpha=0.7)
    ax2.set_title('Uncertainty', fontsize=11, fontweight='bold')
    ax2.set_xlabel('Feature 1')
    ax2.set_ylabel('Feature 2')
    plt.colorbar(scatter2, ax=ax2, label='Uncertainty')
    
    # Probability contours
    ax3 = axes[2]
    x = np.linspace(X_gmm[:, 0].min() - 1, X_gmm[:, 0].max() + 1, 100)
    y = np.linspace(X_gmm[:, 1].min() - 1, X_gmm[:, 1].max() + 1, 100)
    X_grid, Y_grid = np.meshgrid(x, y)
    XX = np.array([X_grid.ravel(), Y_grid.ravel()]).T
    Z = -gmm.score_samples(XX)  # negative log-likelihood: lower values mean higher density
    Z = Z.reshape(X_grid.shape)
    
    ax3.contour(X_grid, Y_grid, Z, levels=10, linewidths=0.5, colors='black', alpha=0.3)
    ax3.contourf(X_grid, Y_grid, Z, levels=10, cmap='viridis', alpha=0.3)
    ax3.scatter(X_gmm[:, 0], X_gmm[:, 1], c=gmm_labels, 
               cmap='viridis', s=30, alpha=0.7, edgecolors='black', linewidth=0.5)
    ax3.scatter(gmm.means_[:, 0], gmm.means_[:, 1],
               c='red', marker='*', s=300, edgecolors='white', linewidth=2)
    ax3.set_title('Probability Contours', fontsize=11, fontweight='bold')
    ax3.set_xlabel('Feature 1')
    ax3.set_ylabel('Feature 2')
    
    plt.suptitle('Gaussian Mixture Models: Soft vs Hard Clustering', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Show probability examples
    print("\n📊 Example: Innovation Probability Assignments")
    print("\nSample 5 innovations and their cluster probabilities:")
    print("ID | Cluster 1 | Cluster 2 | Cluster 3 | Assigned")
    print("-" * 50)
    for i in range(5):
        probs = gmm_probs[i]
        assigned = gmm_labels[i]
        print(f"{i:2} | {probs[0]:.3f}    | {probs[1]:.3f}    | "
              f"{probs[2]:.3f}    | Cluster {assigned+1}")
    
    print("\n💡 GMM Benefits:")
    print("• Shows uncertainty in cluster assignments")
    print("• Handles overlapping clusters")
    print("• Provides probability distributions")
    
    return gmm, X_gmm, gmm_labels


def compare_all_algorithms():
    """
    Comprehensive comparison of all clustering algorithms.
    Shows strengths and weaknesses of each method.
    """
    print("⚖️ Clustering Algorithm Comparison\n")
    
    # Generate test dataset
    X_compare, y_compare = generate_blob_data(n_samples=500, centers=4, cluster_std=1.0)
    
    # Add some noise
    X_noise = np.random.uniform(X_compare.min(), X_compare.max(), (50, 2))
    X_compare = np.vstack([X_compare, X_noise])
    y_compare = np.hstack([y_compare, [-1] * 50])
    
    # Standardize
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_compare)
    
    # Define algorithms
    from sklearn.mixture import GaussianMixture
    algorithms = [
        ('K-Means', KMeans(n_clusters=4, random_state=42)),
        ('DBSCAN', DBSCAN(eps=0.3, min_samples=5)),
        ('Hierarchical', AgglomerativeClustering(n_clusters=4)),
        ('GMM', GaussianMixture(n_components=4, random_state=42))
    ]
    
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    
    # Store results
    comparison_results = []
    
    for idx, (name, algorithm) in enumerate(algorithms):
        # Fit algorithm
        if hasattr(algorithm, 'fit_predict'):
            labels = algorithm.fit_predict(X_scaled)
        else:
            labels = algorithm.fit(X_scaled).predict(X_scaled)
        
        # Calculate metrics
        unique_labels = np.unique(labels[labels != -1])
        n_clusters = len(unique_labels)
        n_noise = np.sum(labels == -1)
        
        if n_clusters > 1:
            silhouette = silhouette_score(X_scaled, labels)
            db_index = davies_bouldin_score(X_scaled, labels)
        else:
            silhouette = -1
            db_index = np.inf
        
        comparison_results.append({
            'Algorithm': name,
            'Clusters': n_clusters,
            'Noise': n_noise,
            'Silhouette': silhouette,
            'Davies-Bouldin': db_index
        })
        
        # Visualization
        ax1 = axes[0, idx]
        ax2 = axes[1, idx]
        
        # Plot clusters
        # Iterate over every label, including -1, so DBSCAN noise is actually drawn
        for label in np.unique(labels):
            if label == -1:
                mask = labels == label
                ax1.scatter(X_compare[mask, 0], X_compare[mask, 1],
                           c='black', marker='x', s=30, alpha=0.5, label='Noise')
            else:
                mask = labels == label
                ax1.scatter(X_compare[mask, 0], X_compare[mask, 1],
                           s=30, alpha=0.7, label=f'C{label}')
        
        ax1.set_title(f'{name}\nClusters: {n_clusters}, Noise: {n_noise}', 
                     fontsize=10, fontweight='bold')
        ax1.set_xticks([])
        ax1.set_yticks([])
        
        # Metrics bar chart
        metrics = ['Silhouette', 'DB Index\n(inverted)']
        values = [silhouette, -db_index/10]  # Normalize for display
        colors_bar = ['green' if v > 0 else 'red' for v in values]
        
        bars = ax2.bar(range(2), values, color=colors_bar, alpha=0.7)
        ax2.set_xticks(range(2))
        ax2.set_xticklabels(metrics, fontsize=8)
        ax2.set_title(f'{name} Metrics', fontsize=10, fontweight='bold')
        ax2.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    
    plt.suptitle('Clustering Algorithm Comparison', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Comparison table
    comparison_df = pd.DataFrame(comparison_results)
    print("\n📊 Performance Comparison:")
    display(comparison_df)
    
    print("\n🎯 Algorithm Selection Guide:")
    print("\n📌 K-Means: Fast, simple, spherical clusters")
    print("📌 DBSCAN: Arbitrary shapes, identifies outliers")
    print("📌 Hierarchical: Creates taxonomy, no K needed upfront")
    print("📌 GMM: Soft clustering, overlapping clusters")
    
    return comparison_df


def demonstrate_common_mistakes():
    """
    Show common clustering mistakes and how to fix them.
    Educational visualization of pitfalls.
    """
    print("⚠️ Common Clustering Mistakes and How to Fix Them\n")
    
    # Generate data
    X_mistakes, y_mistakes = generate_blob_data(n_samples=300, centers=3)
    
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    
    # Mistake 1: Not scaling features
    ax1 = axes[0, 0]
    X_unscaled = X_mistakes.copy()
    X_unscaled[:, 1] *= 100  # Make one feature much larger
    kmeans_unscaled = KMeans(n_clusters=3, random_state=42)
    labels_unscaled = kmeans_unscaled.fit_predict(X_unscaled)
    ax1.scatter(X_unscaled[:, 0], X_unscaled[:, 1], c=labels_unscaled, 
               cmap='viridis', s=30, alpha=0.7)
    ax1.set_title('❌ Mistake: Unscaled Features', fontsize=10, fontweight='bold', color='red')
    ax1.set_xlabel('Feature 1 (original scale)')
    ax1.set_ylabel('Feature 2 (×100)')
    
    # Fix 1: Scale features
    ax2 = axes[0, 1]
    scaler = StandardScaler()
    X_scaled_fix = scaler.fit_transform(X_unscaled)
    kmeans_scaled = KMeans(n_clusters=3, random_state=42)
    labels_scaled = kmeans_scaled.fit_predict(X_scaled_fix)
    ax2.scatter(X_scaled_fix[:, 0], X_scaled_fix[:, 1], c=labels_scaled, 
               cmap='viridis', s=30, alpha=0.7)
    ax2.set_title('✅ Fix: Scaled Features', fontsize=10, fontweight='bold', color='green')
    ax2.set_xlabel('Feature 1 (standardized)')
    ax2.set_ylabel('Feature 2 (standardized)')
    
    # Mistake 2: Wrong number of clusters
    ax3 = axes[0, 2]
    kmeans_wrong_k = KMeans(n_clusters=10, random_state=42)
    labels_wrong_k = kmeans_wrong_k.fit_predict(X_mistakes)
    ax3.scatter(X_mistakes[:, 0], X_mistakes[:, 1], c=labels_wrong_k, 
               cmap='tab10', s=30, alpha=0.7)
    silhouette_wrong = silhouette_score(X_mistakes, labels_wrong_k)
    ax3.set_title(f'❌ Too Many Clusters (K=10)\nSilhouette: {silhouette_wrong:.3f}',
                 fontsize=10, fontweight='bold', color='red')
    
    # Fix 2: Use elbow method
    ax4 = axes[1, 0]
    kmeans_correct_k = KMeans(n_clusters=3, random_state=42)
    labels_correct_k = kmeans_correct_k.fit_predict(X_mistakes)
    ax4.scatter(X_mistakes[:, 0], X_mistakes[:, 1], c=labels_correct_k, 
               cmap='viridis', s=30, alpha=0.7)
    silhouette_correct = silhouette_score(X_mistakes, labels_correct_k)
    ax4.set_title(f'✅ Optimal K=3\nSilhouette: {silhouette_correct:.3f}',
                 fontsize=10, fontweight='bold', color='green')
    
    # Mistake 3: Ignoring outliers
    ax5 = axes[1, 1]
    X_with_outliers = X_mistakes.copy()
    outliers = np.random.uniform(-15, 15, (20, 2))
    X_with_outliers = np.vstack([X_with_outliers, outliers])
    kmeans_outliers = KMeans(n_clusters=3, random_state=42)
    labels_outliers = kmeans_outliers.fit_predict(X_with_outliers)
    ax5.scatter(X_with_outliers[:, 0], X_with_outliers[:, 1], 
               c=labels_outliers, cmap='viridis', s=30, alpha=0.7)
    ax5.set_title('❌ K-Means with Outliers', fontsize=10, fontweight='bold', color='red')
    
    # Fix 3: Use DBSCAN
    ax6 = axes[1, 2]
    dbscan_fix = DBSCAN(eps=1.5, min_samples=5)
    labels_dbscan = dbscan_fix.fit_predict(X_with_outliers)
    unique_labels = set(labels_dbscan)
    colors_db = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors_db):
        if k == -1:
            col = 'black'
            marker = 'x'
        else:
            marker = 'o'
        class_member_mask = (labels_dbscan == k)
        xy = X_with_outliers[class_member_mask]
        ax6.scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, s=30, alpha=0.7)
    ax6.set_title('✅ DBSCAN Handles Outliers', fontsize=10, fontweight='bold', color='green')
    
    plt.suptitle('Common Clustering Mistakes and Solutions', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n📚 Summary of Common Mistakes:")
    print("\n1. 📏 Not Scaling Features:")
    print("   Problem: Features with larger scales dominate")
    print("   Solution: Always standardize or normalize")
    print("\n2. 🔢 Wrong Number of Clusters:")
    print("   Problem: Too many/few clusters")
    print("   Solution: Use elbow method, silhouette analysis")
    print("\n3. 🔍 Ignoring Outliers:")
    print("   Problem: K-means is sensitive to outliers")
    print("   Solution: Use DBSCAN or remove outliers first")

print("Algorithm demonstration functions loaded successfully!")

### 0.2 Algorithm Demonstration Functions

In [None]:
# Helper functions for Part 2

def generate_blob_data(n_samples=1000, centers=3, n_features=2, cluster_std=1.0, random_state=42):
    """Generate simple blob data for demonstrations."""
    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=n_samples, centers=centers, n_features=n_features,
                     cluster_std=cluster_std, random_state=random_state)
    return X, y

def plot_clusters(X, labels, centers=None, title="Clusters", ax=None):
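    """Simple cluster plotting function."""
    # Import unconditionally: a conditional import would make `plt` a local
    # name that is unbound whenever `ax` is supplied by the caller.
    import matplotlib.pyplot as plt
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))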
    
    unique_labels = np.unique(labels)
    colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))
    
    for i, label in enumerate(unique_labels):
        if label == -1:  # Noise points for DBSCAN
            mask = labels == label
            ax.scatter(X[mask, 0], X[mask, 1], c='gray', marker='x', s=50, alpha=0.5, label='Noise')
        else:
            mask = labels == label
            ax.scatter(X[mask, 0], X[mask, 1], c=[colors[i]], s=50, alpha=0.7, label=f'Cluster {label}')
    
    if centers is not None:
        ax.scatter(centers[:, 0], centers[:, 1], c='red', marker='*', s=300, 
                  edgecolors='black', linewidth=2, label='Centers', zorder=10)
    
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    return ax

def generate_innovation_data(n_samples=1000, n_features=10, n_clusters=5, noise=0.1):
    """Generate synthetic innovation dataset."""
    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=n_samples, n_features=n_features, 
                     centers=n_clusters, cluster_std=1.5, random_state=42)
    X += np.random.normal(0, noise, X.shape)
    
    feature_names = [
        'Tech_Sophistication', 'Market_Readiness', 'Resource_Requirements',
        'User_Engagement', 'Scalability', 'Innovation_Level',
        'Competition_Intensity', 'Regulatory_Complexity', 'ROI_Potential',
        'Implementation_Time'
    ][:n_features]
    
    df = pd.DataFrame(X, columns=feature_names)
    df['True_Cluster'] = y
    
    return df, X, y

print("Helper functions loaded successfully!")

## Section 0: All Functions

### 0.1 Helper Functions for Data Generation and Visualization

---
# Section 3: Technical Deep Dive
Master all clustering algorithms with hands-on implementation.

## 3.1 K-Means Clustering

K-Means is the workhorse of clustering algorithms - simple, fast, and effective for many innovation analysis tasks.

### Theory & Visualization

In [None]:
# K-Means step-by-step visualization
centers, labels = demonstrate_kmeans_step_by_step()

### Hands-on Implementation

In [None]:
# Hands-on K-Means implementation
results = demonstrate_kmeans_implementation()

### 🎯 Exercise: Implement K-Means from Scratch

In [None]:
# Exercise: Implement K-means from scratch
my_kmeans, sklearn_kmeans = implement_kmeans_from_scratch()

## 3.2 Finding Optimal K

One of the biggest challenges in clustering: How many clusters should we have?
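
A common heuristic places the elbow where the inertia curve bends most sharply, that is, where its second difference is largest. The sketch below shows one way to compute that and how the second-difference index maps back to a value of k. The helper name `elbow_k_from_inertias` and the toy inertia values are hypothetical and exist only for illustration; on real data the heuristic need not agree with the true number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_k_from_inertias(k_values, inertias):
    """Return the k at which the inertia curve bends most sharply.

    The second difference at position i equals
    inertias[i] - 2 * inertias[i + 1] + inertias[i + 2],
    so it is centred on k_values[i + 1].
    """
    k_values = list(k_values)
    second_diff = np.diff(inertias, n=2)
    return k_values[int(np.argmax(second_diff)) + 1]


# Toy curve (made-up numbers) with an obvious bend at k=3
toy_k = [1, 2, 3, 4, 5, 6]
toy_inertia = [1000.0, 600.0, 200.0, 180.0, 165.0, 155.0]
toy_elbow = elbow_k_from_inertias(toy_k, toy_inertia)
print("Toy elbow:", toy_elbow)  # → Toy elbow: 3

# Same helper on blob data like that used in this notebook
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]
blob_elbow = elbow_k_from_inertias(ks, inertias)
print("Blob elbow:", blob_elbow)
```

Because the largest bend often occurs at a small k, it is worth reading this number alongside the silhouette and Davies-Bouldin curves rather than on its own.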

### Elbow Method

In [None]:
# Finding optimal K with elbow method
optimal_k = find_optimal_k_elbow()

### Silhouette Analysis

In [None]:
# Silhouette analysis for cluster validation
demonstrate_silhouette_analysis()

## 3.3 DBSCAN - Density-Based Clustering

DBSCAN finds clusters of arbitrary shape and identifies outliers - perfect for innovation data with noise.
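
One widely used way to choose `eps` is to look at each point's distance to its k-th nearest neighbour and pick a value near the knee of the sorted curve. The sketch below is a minimal version of that idea, not part of this notebook's function library. Taking the 90th percentile in place of reading the knee off a plot is an illustrative assumption, as is `min_samples = 4`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

min_samples = 4  # assumption: 2 * n_features for 2-D data

# When the training data is passed back to kneighbors, each point is returned
# as its own nearest neighbour (distance 0). DBSCAN's min_samples also counts
# the point itself, so the last column is the radius at which it becomes core.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Assumption: use a high percentile of the k-distance curve as eps
eps_guess = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"eps={eps_guess:.3f}, clusters={n_clusters}, noise={n_noise}")
```

Plotting `k_distances` and choosing the knee by eye is usually more reliable than a fixed percentile; the percentile only keeps the example free of plotting code.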

### Understanding Parameters

In [None]:
# DBSCAN parameter exploration
X_dbscan = demonstrate_dbscan_parameters()

### Complex Shapes

In [None]:
# Algorithm comparison (runs on blob data with added uniform noise, not on moons or circles)
compare_all_algorithms()

## 3.4 Hierarchical Clustering

Build a tree of clusters - perfect for understanding innovation taxonomies.
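
A dendrogram becomes a flat clustering once the tree is cut. The sketch below shows two ways to do that with SciPy's `fcluster`: by asking for a maximum number of clusters, or by cutting at a height. The midpoint rule used to pick `cut_height` is an assumption made for illustration, not a method prescribed by this notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=4, cluster_std=0.5, random_state=42)
Z = linkage(X, method="ward")  # one row per merge; column 2 is the merge height

# Option 1: cut so that at most 4 flat clusters remain
labels_k = fcluster(Z, t=4, criterion="maxclust")

# Option 2: cut at a height; merges above it are not applied.
# Assumption: place the cut halfway between the merge that produces
# 4 clusters (Z[-4]) and the merge that would reduce them to 3 (Z[-3]).
heights = Z[:, 2]
cut_height = (heights[-4] + heights[-3]) / 2
labels_h = fcluster(Z, t=cut_height, criterion="distance")

print("maxclust cut:", len(np.unique(labels_k)), "clusters")
print("height cut:  ", len(np.unique(labels_h)), "clusters")
```

Note that `fcluster` labels start at 1 rather than 0, which matters when comparing them with scikit-learn labels.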

In [None]:
# Hierarchical clustering demonstration
X_hier = demonstrate_hierarchical_clustering()

## 3.5 Gaussian Mixture Models

Soft clustering where innovations can belong to multiple categories with different probabilities.
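
To make "different probabilities" concrete, the sketch below reads the membership matrix from `predict_proba` and flags points whose strongest membership is weak. The 0.8 threshold is a hypothetical cut-off chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.2, random_state=42)
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

probs = gmm.predict_proba(X)      # shape (n_samples, n_components), rows sum to 1
hard = probs.argmax(axis=1)       # most likely component for each point
confidence = probs.max(axis=1)    # probability of that component

# Hypothetical threshold: a point is "ambiguous" if its best component
# has less than 0.8 probability
ambiguous = confidence < 0.8
print(f"{int(ambiguous.sum())} of {len(X)} points are ambiguous")
```

Ambiguous points are the ones a hard assignment would hide; in an innovation dataset they could be read as items that straddle two categories.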

In [None]:
# Gaussian Mixture Models demonstration
demonstrate_gmm()

## 3.6 Algorithm Comparison

Let's compare all algorithms on the same dataset to understand their strengths and weaknesses.
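
Because the comparison data is synthetic, the true labels are known, so an external measure such as the adjusted Rand index (ARI) can complement the internal metrics. The sketch below is a simplified variant of the comparison that leaves out the added noise points; the `models` dictionary and `ari` results are illustrative names, not part of the notebook's functions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, y_true = make_blobs(n_samples=500, centers=4, n_features=2,
                       cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

models = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Hierarchical": AgglomerativeClustering(n_clusters=4),
    "GMM": GaussianMixture(n_components=4, random_state=42),
}

ari = {}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # ARI is invariant to label permutation: 1.0 is a perfect match,
    # values near 0 are what random labelling would give
    ari[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:12s} ARI = {ari[name]:.3f}")
```

ARI is only available when ground truth exists, so on real innovation data the silhouette and Davies-Bouldin scores remain the practical choice.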

In [None]:
# Comprehensive algorithm comparison
print("⚖️ Clustering Algorithm Comparison\n")

# Generate test dataset
X_compare, y_compare = make_blobs(n_samples=500, centers=4, 
                                 n_features=2, cluster_std=1.0, 
                                 random_state=42)

# Add some noise
X_noise = np.random.uniform(X_compare.min(), X_compare.max(), (50, 2))
X_compare = np.vstack([X_compare, X_noise])
y_compare = np.hstack([y_compare, [-1] * 50])

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_compare)

# Define algorithms
algorithms = [
    ('K-Means', KMeans(n_clusters=4, random_state=42)),
    ('DBSCAN', DBSCAN(eps=0.3, min_samples=5)),
    ('Hierarchical', AgglomerativeClustering(n_clusters=4)),
    ('GMM', GaussianMixture(n_components=4, random_state=42))
]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Store results
comparison_results = []

for idx, (name, algorithm) in enumerate(algorithms):
    # Fit algorithm
    if hasattr(algorithm, 'fit_predict'):
        labels = algorithm.fit_predict(X_scaled)
    else:
        labels = algorithm.fit(X_scaled).predict(X_scaled)
    
    # Calculate metrics
    unique_labels = np.unique(labels[labels != -1])
    n_clusters = len(unique_labels)
    n_noise = np.sum(labels == -1)
    
    if n_clusters > 1:
        silhouette = silhouette_score(X_scaled, labels)
        db_index = davies_bouldin_score(X_scaled, labels)
        ch_index = calinski_harabasz_score(X_scaled, labels)
    else:
        silhouette = -1
        db_index = np.inf
        ch_index = 0
    
    comparison_results.append({
        'Algorithm': name,
        'Clusters': n_clusters,
        'Noise': n_noise,
        'Silhouette': silhouette,
        'Davies-Bouldin': db_index,
        'Calinski-Harabasz': ch_index
    })
    
    # Visualization
    ax1 = axes[0, idx]
    ax2 = axes[1, idx]
    
    # Plot clusters
    # Iterate over every label, including -1, so DBSCAN noise is actually drawn
    for label in np.unique(labels):
        if label == -1:
            mask = labels == label
            ax1.scatter(X_compare[mask, 0], X_compare[mask, 1],
                       c='black', marker='x', s=30, alpha=0.5, label='Noise')
        else:
            mask = labels == label
            ax1.scatter(X_compare[mask, 0], X_compare[mask, 1],
                       s=30, alpha=0.7, label=f'C{label}')
    
    # Add centers if available
    if hasattr(algorithm, 'cluster_centers_'):
        centers = scaler.inverse_transform(algorithm.cluster_centers_)
        ax1.scatter(centers[:, 0], centers[:, 1],
                   c='red', marker='*', s=200, 
                   edgecolors='black', linewidth=1.5)
    elif hasattr(algorithm, 'means_'):
        centers = scaler.inverse_transform(algorithm.means_)
        ax1.scatter(centers[:, 0], centers[:, 1],
                   c='red', marker='*', s=200, 
                   edgecolors='black', linewidth=1.5)
    
    ax1.set_title(f'{name}\nClusters: {n_clusters}, Noise: {n_noise}', 
                 fontsize=10, fontweight='bold')
    ax1.set_xticks([])
    ax1.set_yticks([])
    
    # Metrics bar chart
    metrics = ['Silhouette', 'Davies-Bouldin\n(lower better)', 
              'Calinski-Harabasz\n(higher better)']
    values = [silhouette, -db_index/10, ch_index/1000]  # Normalize for display
    colors_bar = ['green' if v > 0 else 'red' for v in values]
    
    bars = ax2.bar(range(3), values, color=colors_bar, alpha=0.7)
    ax2.set_xticks(range(3))
    ax2.set_xticklabels(metrics, fontsize=8)
    ax2.set_title(f'{name} Metrics', fontsize=10, fontweight='bold')
    ax2.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    
    # Add value labels
    for bar, val, orig in zip(bars, values, [silhouette, db_index, ch_index]):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{orig:.2f}', ha='center', va='bottom', fontsize=8)

plt.suptitle('Clustering Algorithm Comparison: Same Data, Different Approaches', 
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Comparison table
comparison_df = pd.DataFrame(comparison_results)
print("\n📊 Performance Comparison:")
display(comparison_df)

print("\n🎯 Algorithm Selection Guide:")
print("\n📌 K-Means:")
print("  ✅ Fast, simple, well-understood")
print("  ✅ Good for spherical clusters")
print("  ❌ Requires K specification")
print("  ❌ Sensitive to outliers")

print("\n📌 DBSCAN:")
print("  ✅ Finds arbitrary shapes")
print("  ✅ Identifies outliers")
print("  ❌ Sensitive to parameters")
print("  ❌ Struggles with varying densities")

print("\n📌 Hierarchical:")
print("  ✅ No need to specify K upfront")
print("  ✅ Creates taxonomy")
print("  ❌ Computationally expensive")
print("  ❌ Hard to interpret large dendrograms")

print("\n📌 GMM:")
print("  ✅ Soft clustering")
print("  ✅ Handles overlapping clusters")
print("  ❌ Assumes Gaussian distribution")
print("  ❌ Can overfit with many components")

### Common Mistakes Gallery

In [None]:
# Common clustering mistakes and solutions
demonstrate_common_mistakes()
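
One way to avoid the scaling mistake is to bundle the scaler and the clusterer so that scaling cannot be skipped. The sketch below uses scikit-learn's `make_pipeline` for that and compares it with clustering the raw, stretched data; the variable names are illustrative and the ARI comparison is an addition for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
X_stretched = X.copy()
X_stretched[:, 1] *= 100  # inflate one feature, as in the gallery's first mistake

raw = KMeans(n_clusters=3, n_init=10, random_state=42)
scaled = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)

ari_raw = adjusted_rand_score(y_true, raw.fit_predict(X_stretched))
ari_scaled = adjusted_rand_score(y_true, scaled.fit_predict(X_stretched))
print(f"raw ARI = {ari_raw:.3f}, scaled ARI = {ari_scaled:.3f}")
```

If the clusters happen to separate along the inflated feature, the two scores may be close; the damage from skipping scaling shows up when the informative feature is the one with the smaller scale.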

---
# Section 4: Design Integration
Transform technical clustering results into actionable innovation insights.

## 4.1 From Data Points to Innovation Insights

In [None]:
# Transform clusters into innovation insights
transform_clusters_to_insights()

## 4.2 Creating Innovation Archetypes

In [None]:
# Create innovation archetypes from clusters
create_innovation_archetypes()

## 4.3 Innovation Taxonomy & Lifecycle

In [None]:
# Build innovation taxonomy
build_innovation_taxonomy()

## 4.4 Opportunity Analysis

In [None]:
# Generate innovation opportunity analysis
generate_opportunity_analysis()

## 4.5 Innovation Ecosystem

In [None]:
# Create innovation ecosystem visualization
create_innovation_ecosystem()

---
## 🎯 Part 2 Summary

### Technical Skills Mastered:
1. **K-Means**: Understanding and implementing from scratch
2. **Optimal K**: Multiple methods for choosing the number of clusters
3. **DBSCAN**: Handling complex shapes and outliers
4. **Hierarchical**: Building taxonomies and dendrograms
5. **GMM**: Soft clustering with probabilities
6. **Comparison**: Choosing the right algorithm

### Design Applications Learned:
1. **Innovation Archetypes**: Data-driven personas
2. **Opportunity Heatmaps**: Identifying white spaces
3. **Priority Matrices**: Strategic resource allocation
4. **Ecosystem Networks**: Understanding connections
5. **Innovation Taxonomy**: Hierarchical organization

### Next: Part 3 - Practice & Advanced Topics
Apply everything with real case studies and advanced visualizations!

In [None]:
print("\n" + "="*60)
print("✅ Part 2 Complete: Technical & Design Integration")
print("="*60)
print("\nYou've completed:")
print("• Section 3: All clustering algorithms")
print("• Section 4: Design integration and applications")
print("\n📚 Ready for Part 3: Practice, Case Studies, and Advanced Topics")
print("\nContinue with Week01_Part3_Practice_Advanced.ipynb")