# Week 1: Clustering for Innovation Discovery - Practice Exercise

## Objective
Apply clustering algorithms to discover patterns in innovation data from a company hackathon.

## Dataset
- 500 innovation proposals from employees
- Features: complexity, impact, feasibility, and novelty scores
- Goal: Discover natural groupings to inform innovation strategy

---

## Step 1: Setup and Imports

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Set style and seed
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print('Libraries loaded successfully!')

## Step 2: Generate Synthetic Innovation Data
In practice, you would load real data. Here we simulate it for learning purposes.

In [None]:
# Generate synthetic innovation data
n_innovations = 500

# Create 4 innovation archetypes with different characteristics
# Archetype 1: Quick Wins (low complexity, moderate impact)
quick_wins = np.random.multivariate_normal(
    [3, 6, 7, 5],  # [complexity, impact, feasibility, novelty]
    [[1, 0.2, 0.3, 0.1], 
     [0.2, 1.5, 0.2, 0.3],
     [0.3, 0.2, 1, 0.2],
     [0.1, 0.3, 0.2, 1]], 
    125
)

# Archetype 2: Moonshots (high complexity, high impact)
moonshots = np.random.multivariate_normal(
    [8, 9, 4, 9],
    [[1.5, 0.4, -0.3, 0.5], 
     [0.4, 1, 0.1, 0.4],
     [-0.3, 0.1, 1.5, -0.2],
     [0.5, 0.4, -0.2, 1]], 
    100
)

# Archetype 3: Incremental (low complexity, low impact)
incremental = np.random.multivariate_normal(
    [2, 3, 8, 2],
    [[1, 0.3, 0.4, 0.2], 
     [0.3, 1.2, 0.3, 0.1],
     [0.4, 0.3, 1, 0.3],
     [0.2, 0.1, 0.3, 1]], 
    175
)

# Archetype 4: Transformative (moderate complexity, high impact)
transformative = np.random.multivariate_normal(
    [6, 8, 6, 7],
    [[1.2, 0.3, 0.2, 0.4], 
     [0.3, 1.3, 0.3, 0.3],
     [0.2, 0.3, 1.1, 0.2],
     [0.4, 0.3, 0.2, 1.2]], 
    100
)

# Combine all data
X = np.vstack([quick_wins, moonshots, incremental, transformative])
true_labels = np.array([0]*125 + [1]*100 + [2]*175 + [3]*100)

# Create DataFrame
feature_names = ['Complexity', 'Impact', 'Feasibility', 'Novelty']
df = pd.DataFrame(X, columns=feature_names)
df['True_Archetype'] = true_labels

print(f'Dataset shape: {df.shape}')
print(f'\nFirst 5 rows:')
df.head()
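
`np.random.multivariate_normal` expects each covariance matrix to be symmetric and positive semi-definite. If you change the archetype parameters above, a quick check like the following sketch can catch an invalid matrix early. The helper name `is_valid_covariance` and the tolerance are assumptions for illustration, not part of the exercise.

```python
import numpy as np

def is_valid_covariance(cov, tol=1e-8):
    """Return True if cov is square, symmetric, and positive semi-definite."""
    cov = np.asarray(cov, dtype=float)
    if cov.ndim != 2 or cov.shape[0] != cov.shape[1]:
        return False
    if not np.allclose(cov, cov.T, atol=tol):
        return False
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues
    return bool(np.linalg.eigvalsh(cov).min() >= -tol)

# Covariance used for the Quick Wins archetype in this step
quick_wins_cov = [[1, 0.2, 0.3, 0.1],
                  [0.2, 1.5, 0.2, 0.3],
                  [0.3, 0.2, 1, 0.2],
                  [0.1, 0.3, 0.2, 1]]

# Hypothetical invalid matrix: its eigenvalues are 3 and -1
bad_cov = [[1.0, 2.0],
           [2.0, 1.0]]

print(is_valid_covariance(quick_wins_cov))
print(is_valid_covariance(bad_cov))
```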

## Step 3: Exploratory Data Analysis
Understand your data before clustering!

In [None]:
# TODO: Create visualizations to understand the data
# Hint: Use pairplot, correlation matrix, or distribution plots

# Your code here:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution plots
for i, col in enumerate(feature_names):
    ax = axes[i//2, i%2]
    ax.hist(df[col], bins=30, alpha=0.7, color='steelblue', edgecolor='black')
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df[feature_names].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

## Step 4: Data Preprocessing
Standardize features for clustering

In [None]:
# TODO: Standardize the features
# Why is this important for clustering?

# Your code here:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_names])

print('Data standardized!')
print(f'Mean: {X_scaled.mean(axis=0)}')
print(f'Std: {X_scaled.std(axis=0)}')
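
One way to answer the question in the TODO: K-Means minimizes squared Euclidean distances, so a feature with a much larger numeric range dominates the result. The sketch below illustrates this with two hypothetical features (a 1 to 10 score and a budget in dollars) that are not part of the exercise dataset. The helper `variance_share` is likewise an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (not from the exercise data)
score = rng.uniform(1, 10, size=200)             # 1-10 rating
budget = rng.uniform(10_000, 100_000, size=200)  # dollars
X_demo = np.column_stack([score, budget])

def variance_share(X):
    """Fraction of the total variance contributed by each column."""
    variances = np.var(X, axis=0)
    return variances / variances.sum()

raw_share = variance_share(X_demo)
scaled_share = variance_share(StandardScaler().fit_transform(X_demo))

print('Share of variance before scaling:', raw_share.round(4))
print('Share of variance after scaling: ', scaled_share.round(4))
```

Before scaling, almost all of the variance (and therefore almost all of the distance) comes from the budget column. After scaling, both features contribute equally. The four features in this exercise already have similar ranges, so the effect here is smaller, but standardizing remains a sensible default.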

## Step 5: Determine Optimal Number of Clusters
Use elbow method and silhouette analysis

In [None]:
# TODO: Implement elbow method
# Calculate inertia for k=2 to k=10

# Your code here:
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow Method')
ax1.grid(True)

# Silhouette plot
ax2.plot(K_range, silhouettes, 'ro-')
ax2.set_xlabel('Number of Clusters (k)')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette Analysis')
ax2.grid(True)

plt.tight_layout()
plt.show()

# What is your optimal k?
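
Reading the elbow by eye is subjective, so it can help to pair it with a simple rule such as "pick the k with the highest silhouette score". The following is a minimal sketch of that rule on hypothetical, well-separated blobs. The function name `best_k_by_silhouette` and the demo data are assumptions, and on real data the highest silhouette is a starting point rather than a final answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=42):
    """Fit K-Means for each k and return (best_k, {k: silhouette score})."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical demo data: four well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

best_k, scores = best_k_by_silhouette(X_demo, range(2, 8))
for k, s in scores.items():
    print(f'k={k}: silhouette={s:.3f}')
print(f'Best k by silhouette: {best_k}')
```

In the exercise you could call the same function with `X_scaled` and `K_range`, then check whether the result agrees with the elbow plot.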

## Step 6: Apply Multiple Clustering Algorithms
Compare different approaches

In [None]:
# TODO: Apply K-Means, DBSCAN, hierarchical clustering, and GMM
# Compare their performance

# Your code here:
algorithms = {
    'K-Means': KMeans(n_clusters=4, random_state=42, n_init=10),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),
    'Hierarchical': AgglomerativeClustering(n_clusters=4),
    'GMM': GaussianMixture(n_components=4, random_state=42)
}

results = {}
for name, algorithm in algorithms.items():
    # All four estimators expose fit_predict, so no special case is needed
    labels = algorithm.fit_predict(X_scaled)
    
    # Calculate metrics; skip them if DBSCAN marked noise points (label -1)
    # or if only one cluster was found
    if len(set(labels)) > 1 and -1 not in labels:
        sil_score = silhouette_score(X_scaled, labels)
        db_score = davies_bouldin_score(X_scaled, labels)
        ch_score = calinski_harabasz_score(X_scaled, labels)
    else:
        sil_score = db_score = ch_score = None
    
    results[name] = {
        'labels': labels,
        'silhouette': sil_score,
        'davies_bouldin': db_score,
        'calinski_harabasz': ch_score
    }
    
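    print(f'{name}:')
    # Noise points (label -1, DBSCAN only) do not count as a cluster
    n_clusters = len(set(labels) - {-1})
    print(f'  Clusters found: {n_clusters}')
    # Compare with None so that a score of exactly 0.0 is still reported
    if sil_score is not None:
        print(f'  Silhouette Score: {sil_score:.3f}')
        print(f'  Davies-Bouldin: {db_score:.3f}')
        print(f'  Calinski-Harabasz: {ch_score:.1f}')
    else:
        print('  Metrics skipped (noise points or a single cluster)')
    print()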
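
The loop above skips all metrics whenever DBSCAN labels any point as noise. An alternative is to score only the clustered points and report the noise fraction alongside. The sketch below shows that approach on hypothetical blob data with three planted outliers. The helper `silhouette_without_noise` is an assumption, and a score computed this way is not directly comparable to scores that include every point.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Return (silhouette on non-noise points or None, fraction of noise points)."""
    labels = np.asarray(labels)
    mask = labels != -1                      # True for clustered points
    noise_fraction = 1.0 - mask.mean()
    n_clusters = len(set(labels[mask]))
    # silhouette_score needs at least 2 clusters and more points than clusters
    if n_clusters < 2 or mask.sum() <= n_clusters:
        return None, noise_fraction
    return silhouette_score(X[mask], labels[mask]), noise_fraction

# Hypothetical demo data: three tight blobs plus three isolated outliers
X_blobs, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
outliers = np.array([[30.0, 30.0], [-30.0, -30.0], [30.0, -30.0]])
X_demo = np.vstack([X_blobs, outliers])

labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_demo)
score, noise_fraction = silhouette_without_noise(X_demo, labels)

print(f'Noise fraction: {noise_fraction:.1%}')
if score is not None:
    print(f'Silhouette (noise excluded): {score:.3f}')
```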

## Step 7: Visualize Clustering Results
Use PCA for 2D visualization

In [None]:
# TODO: Create PCA visualization for each algorithm

# Your code here:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Plot true labels
scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=true_labels, 
                         cmap='viridis', alpha=0.6, s=30)
axes[0].set_title('True Archetypes')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%})')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%})')
plt.colorbar(scatter, ax=axes[0])

# Plot each algorithm's results
for i, (name, result) in enumerate(results.items(), 1):
    scatter = axes[i].scatter(X_pca[:, 0], X_pca[:, 1], c=result['labels'], 
                             cmap='viridis', alpha=0.6, s=30)
    axes[i].set_title(name)
    axes[i].set_xlabel('PC1')
    axes[i].set_ylabel('PC2')
    # Compare with None so that a score of exactly 0.0 is still shown
    if result['silhouette'] is not None:
        axes[i].text(0.02, 0.98, f'Sil: {result["silhouette"]:.3f}',
                    transform=axes[i].transAxes, fontsize=9,
                    verticalalignment='top')

# Remove extra subplot
axes[-1].remove()

plt.tight_layout()
plt.show()
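
The PCA axes are easier to interpret once you know which original features each component weights most heavily. The following sketch tabulates the component loadings. The helper `pca_loadings` and the three demo features `a`, `b`, `c` are hypothetical and chosen so that `a` and `b` are strongly correlated.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_loadings(X_scaled, feature_names, n_components=2):
    """Return (loadings DataFrame, explained variance ratio) for a fitted PCA."""
    pca = PCA(n_components=n_components).fit(X_scaled)
    columns = [f'PC{i + 1}' for i in range(n_components)]
    # components_ has shape (n_components, n_features); transpose for readability
    loadings = pd.DataFrame(pca.components_.T, index=feature_names, columns=columns)
    return loadings, pca.explained_variance_ratio_

# Hypothetical demo data: b closely tracks a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
X_demo = StandardScaler().fit_transform(np.column_stack([a, b, c]))

loadings, ratio = pca_loadings(X_demo, ['a', 'b', 'c'])
print(loadings.round(3))
print('Explained variance ratio:', ratio.round(3))
```

In the exercise, calling `pca_loadings(X_scaled, feature_names)` gives a table with one row per feature. Features with large absolute loadings on PC1 contribute most to the horizontal spread in the scatter plots. The sign of a component is arbitrary, so compare magnitudes rather than signs.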

## Step 8: Interpret Innovation Clusters
What patterns do you see?

In [None]:
# TODO: Analyze cluster characteristics
# What makes each cluster unique?

# Your code here:
best_algorithm = 'K-Means'  # Choose based on metrics
best_labels = results[best_algorithm]['labels']

# Add cluster labels to dataframe
df['Cluster'] = best_labels

# Calculate cluster centers
cluster_summary = df.groupby('Cluster')[feature_names].mean()
print('Cluster Centers (Original Scale):')
print(cluster_summary.round(2))
print()

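# Interpret clusters
# Caution: K-Means cluster IDs are arbitrary and need not match the order
# in which the archetypes were generated. Compare each row of cluster_summary
# with the archetype profiles and reorder the entries below before treating
# them as conclusions.
interpretations = {
    0: 'Quick Wins - Low complexity, good feasibility',
    1: 'Moonshots - High impact, high novelty',
    2: 'Incremental - Low risk, easy implementation',
    3: 'Transformative - Balanced high potential'
}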

print('Cluster Interpretations:')
for cluster_id in range(len(cluster_summary)):
    size = (df['Cluster'] == cluster_id).sum()
    print(f'Cluster {cluster_id}: {size} innovations')
    if cluster_id in interpretations:
        print(f'  → {interpretations[cluster_id]}')
    print()
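
Because this dataset is synthetic, the `True_Archetype` column makes it possible to check which cluster ID corresponds to which archetype instead of assuming the numbering lines up. One common approach, sketched below, is to build a contingency table and solve a one-to-one assignment with SciPy's `linear_sum_assignment`. The helper `align_clusters_to_reference` and the toy labels are assumptions. With real data there is no ground truth, so this check is only available in a simulation like this one.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def align_clusters_to_reference(reference, predicted):
    """Return {predicted_id: reference_id} maximizing the total overlap (one-to-one)."""
    reference = np.asarray(reference)
    predicted = np.asarray(predicted)
    ids = np.unique(np.concatenate([reference, predicted]))
    # Rows are reference labels, columns are predicted cluster IDs
    table = confusion_matrix(reference, predicted, labels=ids)
    ref_idx, pred_idx = linear_sum_assignment(table, maximize=True)
    return {int(ids[p]): int(ids[r]) for r, p in zip(ref_idx, pred_idx)}

# Toy example: same grouping idea, but cluster IDs are permuted and one point differs
reference = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
predicted = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

mapping = align_clusters_to_reference(reference, predicted)
aligned = np.array([mapping[p] for p in predicted])

print('Mapping (cluster ID -> reference ID):', mapping)
print('Agreement after alignment:', (aligned == reference).mean().round(3))
print('Adjusted Rand index:', round(adjusted_rand_score(reference, predicted), 3))
```

In the exercise, `align_clusters_to_reference(true_labels, best_labels)` returns the mapping to use when writing the interpretations. The adjusted Rand index does not depend on how clusters are numbered, so it is a convenient single-number summary of how well the clusters recover the archetypes.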

## Step 9: Innovation Strategy Recommendations
Based on your clustering analysis

In [None]:
# TODO: Create actionable recommendations
# How should the company prioritize these innovation clusters?

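# Your analysis here:
# Note: the cluster numbers in this template assume the mapping from Step 8.
# Update them to match the clusters you actually observed in cluster_summary.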
recommendations = """
INNOVATION PORTFOLIO STRATEGY:

1. QUICK WINS (Cluster 0): {:.0f} innovations
   - Implement immediately for rapid value
   - Low resource requirement
   - Build momentum and credibility

2. MOONSHOTS (Cluster 1): {:.0f} innovations  
   - Select top 2-3 for long-term investment
   - High risk, high reward
   - Requires dedicated innovation lab

3. INCREMENTAL (Cluster 2): {:.0f} innovations
   - Delegate to operational teams
   - Continuous improvement focus
   - Minimal oversight needed

4. TRANSFORMATIVE (Cluster 3): {:.0f} innovations
   - Strategic priorities for next quarter
   - Balance of risk and reward
   - Cross-functional teams required

RECOMMENDED PORTFOLIO MIX:
- 40% Quick Wins (immediate impact)
- 30% Transformative (strategic growth)
- 20% Incremental (continuous improvement)
- 10% Moonshots (future disruption)
""".format(
    (df['Cluster'] == 0).sum(),
    (df['Cluster'] == 1).sum(),
    (df['Cluster'] == 2).sum(),
    (df['Cluster'] == 3).sum()
)

print(recommendations)

## Step 10: Save Results and Next Steps

In [None]:
# Save clustered data
df.to_csv('innovation_clusters.csv', index=False)
print('Results saved to innovation_clusters.csv')

# Summary statistics
print('\nFINAL SUMMARY:')
print(f'Total innovations analyzed: {len(df)}')
print(f'Clusters discovered: {len(set(best_labels))}')
print(f'Best algorithm: {best_algorithm}')
print(f'Silhouette score: {results[best_algorithm]["silhouette"]:.3f}')

print('\nNEXT STEPS:')
print('1. Validate clusters with domain experts')
print('2. Deep dive into each cluster for detailed insights')
print('3. Create innovation roadmap based on clusters')
print('4. Set up monitoring for cluster evolution over time')

## Bonus Challenge: Advanced Analysis

Try these extensions:
1. **Optimal eps for DBSCAN**: Use k-distance plot
2. **Cluster stability**: Run clustering multiple times with different seeds
3. **Feature importance**: Which features drive the clustering?
4. **Outlier analysis**: Identify and analyze outlier innovations
5. **Time complexity**: Measure and compare algorithm performance
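
For the first extension, a k-distance plot sorts every point's distance to its k-th nearest neighbor. A reasonable `eps` candidate is the distance at which the sorted curve bends sharply upward. The sketch below computes that curve on hypothetical blob data. The helper `k_distances` is an assumption, and setting `k` equal to `min_samples` is a common heuristic rather than a rule.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    # Querying the training data returns each point as its own nearest
    # neighbor at distance 0, so ask for k + 1 neighbors and keep the last one
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Hypothetical demo data
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

kd = k_distances(X_demo, k=5)
print('Median 5th-neighbor distance:', kd[len(kd) // 2].round(3))
print('Largest 5th-neighbor distance:', kd[-1].round(3))

# To inspect the curve visually:
#   import matplotlib.pyplot as plt
#   plt.plot(kd); plt.xlabel('Points (sorted)'); plt.ylabel('5th-neighbor distance')
```

Applying `k_distances(X_scaled, k=5)` to the exercise data gives a starting range for `eps` that you can then refine by checking how many points DBSCAN marks as noise.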

In [None]:
# Your bonus code here:


---

## Submission Instructions

1. Complete all TODO sections
2. Add your interpretations and insights
3. Export notebook as PDF/HTML
4. Submit via course portal by [deadline]

## Grading Rubric

- **Data Exploration** (20%): Thorough EDA with visualizations
- **Algorithm Implementation** (30%): Correct application of multiple algorithms
- **Evaluation** (20%): Proper use of metrics and comparison
- **Interpretation** (20%): Meaningful cluster analysis and insights
- **Recommendations** (10%): Actionable innovation strategy

Good luck! 🚀