# Innovation Success Prediction - Supervised Learning

**Business Question:** Which innovations will succeed in the market?

**Dataset:** 6,000 technology innovations (2020-2024)

**Goal:** Train classifiers to predict innovation success based on company, market, and innovation features.

**Approach:** Compare Random Forest, Logistic Regression, and XGBoost to understand when different algorithms excel.

## What is Supervised Learning?

**Supervised Learning** is a machine learning approach where we train models using **labeled data** - data where we already know the correct answer (the "label" or "target").

**Key Concepts:**
- **Features (X):** Input variables that describe each innovation (company age, funding, team size, etc.)
- **Target (y):** The outcome we want to predict (success = 1, failure = 0)
- **Training:** Model learns patterns from labeled examples
- **Prediction:** Apply learned patterns to predict outcomes for new innovations
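
The concepts above can be sketched on a tiny example. This is a minimal illustration, not part of the analysis: the six rows, the two features (team size and funding in $M), and the two new cases are all hypothetical values chosen to be easy to separate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: each row is one innovation (team size, funding in $M).
X = np.array([[3, 0.5], [4, 1.0], [5, 0.8], [20, 10.0], [25, 12.0], [30, 15.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels: 1 = success, 0 = failure

# Training: the model learns a pattern from the labeled examples.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Prediction: apply the learned pattern to innovations without labels.
X_new = np.array([[4, 0.7], [28, 14.0]])
predictions = model.predict(X_new)
print(predictions)
```

The same `fit` then `predict` pattern is what every model in this notebook follows, only with more features and more rows.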

**Why Supervised Learning?**
- We have historical data with known outcomes
- We want to predict future outcomes
- We need quantifiable accuracy metrics
- We want to understand which factors drive success

## Functions

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, learning_curve, validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, accuracy_score, precision_score, recall_score, 
    f1_score, roc_auc_score, confusion_matrix, roc_curve, precision_recall_curve, auc
)
import warnings
warnings.filterwarnings('ignore')

try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available. Install with: pip install xgboost")


# ============================================================================
# DATA LOADING AND EXPLORATION FUNCTIONS
# ============================================================================

def load_and_explore_data(filepath='innovations.csv'):
    """Load dataset and display basic information."""
    df = pd.read_csv(filepath)
    
    print("=" * 80)
    print("DATASET OVERVIEW")
    print("=" * 80)
    print(f"\nShape: {df.shape[0]} innovations, {df.shape[1]} columns")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nMissing values: {df.isnull().sum().sum()}")
    print(f"\nSuccess rate: {df['success'].mean():.1%}")
    print(f"  Successful: {df['success'].sum():,}")
    print(f"  Failed: {(df['success'] == 0).sum():,}")
    print("\n" + "=" * 80 + "\n")
    
    return df


def plot_target_distribution(df):
    """Visualize target variable distribution."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Count plot
    counts = df['success'].value_counts().sort_index()
    colors = ['red', 'green']
    bars = axes[0].bar(['Failed (0)', 'Success (1)'], counts.values, 
                       color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
    axes[0].set_ylabel('Count', fontsize=11)
    axes[0].set_title('Target Distribution: Success vs Failure', fontsize=12, fontweight='bold')
    axes[0].grid(axis='y', alpha=0.3)
    
    # Add count labels
    for bar, count in zip(bars, counts.values):
        height = bar.get_height()
        axes[0].text(bar.get_x() + bar.get_width()/2., height,
                    f'{count:,}\n({count/len(df):.1%})',
                    ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    # Percentage pie chart
    axes[1].pie(counts.values, labels=['Failed', 'Success'], autopct='%1.1f%%',
               colors=colors, startangle=90, textprops={'fontsize': 11, 'fontweight': 'bold'})
    axes[1].set_title('Success Rate Distribution', fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 What This Chart Tells Us:")
    print(f"   - Dataset is relatively balanced ({df['success'].mean():.1%} success rate)")
    print(f"   - Not severely imbalanced, so accuracy is a reasonable metric")
    print(f"   - Baseline (always predict majority class) = {max(df['success'].mean(), 1-df['success'].mean()):.1%}")
    print(f"   - Our model must beat this baseline to be useful\n")


def analyze_feature_distributions(df):
    """Analyze numerical feature distributions by target class."""
    numerical_features = ['company_age_years', 'team_size', 'team_experience_avg_years',
                         'funding_raised_usd', 'market_size_millions', 'market_growth_rate']
    
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    axes = axes.flatten()
    
    for idx, feature in enumerate(numerical_features):
        successful = df[df['success'] == 1][feature]
        failed = df[df['success'] == 0][feature]
        
        axes[idx].hist([failed, successful], bins=30, label=['Failed', 'Success'],
                      color=['red', 'green'], alpha=0.6, edgecolor='black')
        axes[idx].set_xlabel(feature.replace('_', ' ').title(), fontsize=10)
        axes[idx].set_ylabel('Frequency', fontsize=10)
        axes[idx].set_title(f'{feature.replace("_", " ").title()} Distribution', 
                           fontsize=11, fontweight='bold')
        axes[idx].legend()
        axes[idx].grid(axis='y', alpha=0.3)
        
        # Add mean lines
        axes[idx].axvline(successful.mean(), color='green', linestyle='--', linewidth=2, alpha=0.7)
        axes[idx].axvline(failed.mean(), color='red', linestyle='--', linewidth=2, alpha=0.7)
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 What This Chart Tells Us:")
    print("   - Green = successful innovations, Red = failed innovations")
    print("   - Dashed lines show mean values for each class")
    print("   - Look for separation between red and green distributions")
    print("   - Better separation = feature is more predictive\n")


def plot_correlation_heatmap(df, feature_cols):
    """Plot correlation heatmap of features with target."""
    # Select numerical features + target
    # is_numeric_dtype also matches int32 columns, which astype(int) can produce on some platforms
    numerical_cols = [col for col in feature_cols if pd.api.types.is_numeric_dtype(df[col])]
    corr_data = df[numerical_cols + ['success']].corr()
    
    # Plot heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='coolwarm', center=0,
               square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
    plt.title('Feature Correlation Heatmap (with Target)', fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print top correlations with target
    target_corr = corr_data['success'].drop('success').sort_values(ascending=False)
    print("\n💡 Top Correlations with Success:")
    print(target_corr.head(5))
    print("\n💡 Lowest Correlations (negative values indicate an inverse relationship):")
    print(target_corr.tail(5))
    print("\n   - Positive correlation: higher value → higher success probability")
    print("   - Negative correlation: higher value → lower success probability\n")


# ============================================================================
# FEATURE ENGINEERING FUNCTIONS
# ============================================================================

def prepare_features(df):
    """Prepare features for modeling by encoding categorical variables."""
    df_prep = df.copy()
    
    print("Feature Engineering Steps:")
    print("-" * 40)
    
    # Numerical features
    numerical_features = [
        'company_age_years', 'team_size', 'team_experience_avg_years',
        'has_prior_success', 'funding_raised_usd', 'market_size_millions',
        'market_growth_rate'
    ]
    print(f"✓ Using {len(numerical_features)} numerical features")
    
    # Encode categorical variables as binary indicators; any category without
    # its own indicator (e.g. the remaining development stages) is the implicit baseline
    df_prep['company_type_encoded'] = (df_prep['company_type'] == 'Corporate').astype(int)
    df_prep['competition_low'] = (df_prep['competition_level'] == 'Low').astype(int)
    df_prep['competition_high'] = (df_prep['competition_level'] == 'High').astype(int)
    df_prep['stage_scaling'] = (df_prep['development_stage'] == 'Scaling').astype(int)
    df_prep['stage_market_ready'] = (df_prep['development_stage'] == 'Market-Ready').astype(int)
    print(f"✓ Encoded 3 categorical variables → 5 binary features")
    
    # All feature columns
    feature_cols = numerical_features + [
        'company_type_encoded', 'competition_low', 'competition_high',
        'stage_scaling', 'stage_market_ready'
    ]
    
    X = df_prep[feature_cols]
    y = df_prep['success']
    
    print(f"\n✓ Total features: {len(feature_cols)}")
    print(f"✓ Samples: {len(X):,}\n")
    
    return X, y, feature_cols, df_prep


def split_data(X, y, test_size=0.2, random_state=42):
    """Split data into training and test sets with stratification."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    print("Data Split:")
    print("-" * 40)
    print(f"Training set: {len(X_train):,} samples ({len(X_train)/len(X):.0%})")
    print(f"Test set: {len(X_test):,} samples ({len(X_test)/len(X):.0%})")
    print(f"\nTraining success rate: {y_train.mean():.1%}")
    print(f"Test success rate: {y_test.mean():.1%}")
    print(f"\n✓ Stratified split maintains class balance\n")
    
    return X_train, X_test, y_train, y_test


# ============================================================================
# MODEL TRAINING FUNCTIONS
# ============================================================================

def train_random_forest(X_train, y_train, n_estimators=100, max_depth=10, random_state=42):
    """Train Random Forest classifier."""
    print("Training Random Forest...")
    print("-" * 40)
    print(f"Parameters: n_estimators={n_estimators}, max_depth={max_depth}")
    
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=random_state,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    
    print(f"✓ Model trained successfully\n")
    return model


def train_logistic_regression(X_train, y_train, random_state=42):
    """Train Logistic Regression as baseline."""
    print("Training Logistic Regression (Baseline)...")
    print("-" * 40)
    
    # Scale features for logistic regression
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    
    model = LogisticRegression(random_state=random_state, max_iter=1000)
    model.fit(X_train_scaled, y_train)
    
    print(f"✓ Model trained successfully\n")
    return model, scaler


def train_xgboost(X_train, y_train, random_state=42):
    """Train XGBoost classifier."""
    if not XGBOOST_AVAILABLE:
        print("XGBoost not available. Skipping.\n")
        return None
    
    print("Training XGBoost...")
    print("-" * 40)
    print(f"Parameters: n_estimators=100, max_depth=6, learning_rate=0.1")
    
    model = XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=random_state,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    
    print(f"✓ Model trained successfully\n")
    return model


# ============================================================================
# MODEL EVALUATION FUNCTIONS
# ============================================================================

def evaluate_model(model, X_test, y_test, model_name='Model', scaler=None):
    """Comprehensive model evaluation with all metrics."""
    # Scale if needed
    X_test_eval = scaler.transform(X_test) if scaler is not None else X_test
    
    # Predictions
    y_pred = model.predict(X_test_eval)
    y_pred_proba = model.predict_proba(X_test_eval)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    print("=" * 80)
    print(f"{model_name.upper()} PERFORMANCE")
    print("=" * 80)
    print(f"\nAccuracy:  {accuracy:.1%}  (correct predictions / total predictions)")
    print(f"Precision: {precision:.1%}  (true positives / predicted positives)")
    print(f"Recall:    {recall:.1%}  (true positives / actual positives)")
    print(f"F1-Score:  {f1:.1%}  (harmonic mean of precision and recall)")
    print(f"ROC-AUC:   {roc_auc:.3f}  (area under ROC curve, 0.5=random, 1.0=perfect)")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Failed', 'Success']))
    print("=" * 80 + "\n")
    
    return {
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc
    }


def plot_confusion_matrix(y_test, y_pred, model_name='Model'):
    """Plot enhanced confusion matrix with percentages."""
    cm = confusion_matrix(y_test, y_pred)
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Absolute counts
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
               xticklabels=['Predicted Failed', 'Predicted Success'],
               yticklabels=['Actual Failed', 'Actual Success'],
               ax=axes[0], cbar_kws={'label': 'Count'})
    axes[0].set_title(f'{model_name} - Confusion Matrix (Counts)', fontsize=12, fontweight='bold')
    
    # Percentages
    sns.heatmap(cm_percent, annot=True, fmt='.1f', cmap='Greens',
               xticklabels=['Predicted Failed', 'Predicted Success'],
               yticklabels=['Actual Failed', 'Actual Success'],
               ax=axes[1], cbar_kws={'label': 'Percentage (%)'})
    axes[1].set_title(f'{model_name} - Confusion Matrix (Row %)', fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Confusion Matrix Interpretation:")
    print(f"   - True Negatives (TN): {cm[0,0]:,} - Correctly predicted failures")
    print(f"   - False Positives (FP): {cm[0,1]:,} - Predicted success, actually failed")
    print(f"   - False Negatives (FN): {cm[1,0]:,} - Predicted failure, actually succeeded")
    print(f"   - True Positives (TP): {cm[1,1]:,} - Correctly predicted successes\n")


def plot_roc_curves(models_data, y_test):
    """Plot ROC curves for multiple models."""
    plt.figure(figsize=(10, 8))
    
    colors = ['blue', 'green', 'red', 'purple']
    
    for (name, results), color in zip(models_data.items(), colors):
        fpr, tpr, _ = roc_curve(y_test, results['probabilities'])
        roc_auc = results['roc_auc']
        
        plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})',
                linewidth=2, color=color)
    
    plt.plot([0, 1], [0, 1], 'k--', label='Random Guess (AUC = 0.500)', linewidth=1)
    plt.xlabel('False Positive Rate', fontsize=11)
    plt.ylabel('True Positive Rate', fontsize=11)
    plt.title('ROC Curves: Model Comparison', fontsize=12, fontweight='bold')
    plt.legend(loc='lower right', fontsize=10)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\n💡 ROC Curve Interpretation:")
    print("   - ROC = Receiver Operating Characteristic")
    print("   - Shows tradeoff between True Positive Rate and False Positive Rate")
    print("   - AUC (Area Under Curve) = overall model quality metric")
    print("   - AUC = 0.5: Random guessing | AUC = 1.0: Perfect classifier")
    print("   - Higher curve = better model\n")


def plot_precision_recall_curves(models_data, y_test):
    """Plot Precision-Recall curves for multiple models."""
    plt.figure(figsize=(10, 8))
    
    colors = ['blue', 'green', 'red', 'purple']
    
    for (name, results), color in zip(models_data.items(), colors):
        precision, recall, _ = precision_recall_curve(y_test, results['probabilities'])
        pr_auc = auc(recall, precision)
        
        plt.plot(recall, precision, label=f'{name} (AUC = {pr_auc:.3f})',
                linewidth=2, color=color)
    
    baseline = y_test.mean()
    plt.axhline(y=baseline, color='k', linestyle='--', 
               label=f'Baseline (No Skill = {baseline:.3f})', linewidth=1)
    
    plt.xlabel('Recall (Sensitivity)', fontsize=11)
    plt.ylabel('Precision', fontsize=11)
    plt.title('Precision-Recall Curves: Model Comparison', fontsize=12, fontweight='bold')
    plt.legend(loc='best', fontsize=10)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Precision-Recall Curve Interpretation:")
    print("   - Useful for imbalanced datasets")
    print("   - Precision: Of predicted successes, how many were correct?")
    print("   - Recall: Of actual successes, how many did we find?")
    print("   - Higher curve = better model\n")


def plot_feature_importance(model, feature_cols, top_n=15):
    """Plot feature importance for tree-based models."""
    if not hasattr(model, 'feature_importances_'):
        print("Model does not have feature_importances_ attribute\n")
        return
    
    # Get importance
    importances = pd.DataFrame({
        'feature': feature_cols,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    # Plot
    plt.figure(figsize=(12, 8))
    top_features = importances.head(top_n)
    colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_features)))
    
    plt.barh(range(len(top_features)), top_features['importance'].values, color=colors)
    plt.yticks(range(len(top_features)), 
              [f.replace('_', ' ').title() for f in top_features['feature'].values])
    plt.xlabel('Importance Score', fontsize=11)
    # Use the number of features actually plotted; top_n can exceed the feature count
    plt.title(f'Top {len(top_features)} Feature Importances', fontsize=12, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Top 10 Most Important Features:")
    for idx, row in importances.head(10).iterrows():
        print(f"   {row['feature']:30s} {row['importance']:.4f}")
    
    print("\n💡 Feature Importance Interpretation:")
    print("   - Higher importance = feature contributes more to predictions")
    print("   - Based on reduction in impurity (Gini) at each split")
    print("   - Sum of all importances = 1.0\n")
    
    return importances


def plot_learning_curves(model, X, y, model_name='Model'):
    """Plot learning curves to diagnose bias/variance."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy', random_state=42
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training Score', marker='o', linewidth=2, color='blue')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
    
    plt.plot(train_sizes, val_mean, label='Validation Score (CV)', marker='s', linewidth=2, color='green')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2, color='green')
    
    plt.xlabel('Training Set Size', fontsize=11)
    plt.ylabel('Accuracy', fontsize=11)
    plt.title(f'Learning Curves: {model_name}', fontsize=12, fontweight='bold')
    plt.legend(loc='best')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Learning Curve Interpretation:")
    print("   - Training score well above validation score → Overfitting (memorizing training data)")
    print("   - Both scores low → Underfitting (model too simple)")
    print("   - Curves converge at a high score → Good fit")
    print("   - More data helps when the validation curve hasn't plateaued\n")


def compare_models_metrics(models_data):
    """Create bar chart comparing all model metrics."""
    metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    model_names = list(models_data.keys())
    
    fig, axes = plt.subplots(1, len(metrics), figsize=(18, 5))
    
    for idx, metric in enumerate(metrics):
        values = [models_data[name][metric] for name in model_names]
        colors = ['blue', 'green', 'red'][:len(model_names)]
        
        bars = axes[idx].bar(model_names, values, color=colors, alpha=0.7, edgecolor='black')
        axes[idx].set_ylabel('Score', fontsize=10)
        axes[idx].set_title(metric.replace('_', ' ').title(), fontsize=11, fontweight='bold')
        axes[idx].set_ylim(0, 1)
        axes[idx].grid(axis='y', alpha=0.3)
        axes[idx].tick_params(axis='x', rotation=45)
        
        # Add value labels
        for bar, val in zip(bars, values):
            height = bar.get_height()
            axes[idx].text(bar.get_x() + bar.get_width()/2., height,
                          f'{val:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Model Comparison Summary:")
    for name in model_names:
        print(f"\n{name}:")
        for metric in metrics:
            print(f"  {metric:12s}: {models_data[name][metric]:.3f}")
    print()


def plot_prediction_distributions(models_data, y_test):
    """Plot probability distributions for predicted classes."""
    n_models = len(models_data)
    fig, axes = plt.subplots(1, n_models, figsize=(6*n_models, 5))
    
    if n_models == 1:
        axes = [axes]
    
    for ax, (name, results) in zip(axes, models_data.items()):
        proba = results['probabilities']
        
        # Separate by actual class
        failed_probs = proba[y_test == 0]
        success_probs = proba[y_test == 1]
        
        ax.hist(failed_probs, bins=30, alpha=0.6, color='red', label='Actual Failed', density=True)
        ax.hist(success_probs, bins=30, alpha=0.6, color='green', label='Actual Success', density=True)
        ax.axvline(0.5, color='black', linestyle='--', linewidth=2, label='Decision Threshold')
        
        ax.set_xlabel('Predicted Probability of Success', fontsize=10)
        ax.set_ylabel('Density', fontsize=10)
        ax.set_title(f'{name}\nPrediction Probability Distribution', fontsize=11, fontweight='bold')
        ax.legend()
        ax.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Probability Distribution Interpretation:")
    print("   - Good separation: failed (red) peaks near 0, success (green) peaks near 1")
    print("   - Overlap in middle: model is uncertain for these cases")
    print("   - Default threshold = 0.5 (can be adjusted based on business needs)\n")


def analyze_misclassifications(model, X_test, y_test, feature_cols, n_examples=10, scaler=None):
    """Analyze where the model makes mistakes."""
    X_test_eval = scaler.transform(X_test) if scaler is not None else X_test
    y_pred = model.predict(X_test_eval)
    y_pred_proba = model.predict_proba(X_test_eval)[:, 1]
    
    # Find misclassifications
    errors = y_pred != y_test
    error_indices = np.where(errors)[0]
    
    print("=" * 80)
    print("MISCLASSIFICATION ANALYSIS")
    print("=" * 80)
    print(f"\nTotal misclassifications: {errors.sum()} ({errors.mean():.1%} of test set)")
    
    # False positives and false negatives
    false_positives = (y_pred == 1) & (y_test == 0)
    false_negatives = (y_pred == 0) & (y_test == 1)
    
    print(f"False Positives: {false_positives.sum()} (predicted success, actually failed)")
    print(f"False Negatives: {false_negatives.sum()} (predicted failure, actually succeeded)")
    
    # Show most confident errors
    if errors.sum() > 0:
        error_proba = y_pred_proba[errors]
        error_actual = y_test.values[errors]
        error_confidence = np.abs(error_proba - 0.5)
        
        top_error_indices = error_confidence.argsort()[-min(n_examples, len(error_confidence)):][::-1]
        
        print(f"\nTop {min(n_examples, len(error_confidence))} Most Confident Errors:")
        print("-" * 80)
        for i, err_idx in enumerate(top_error_indices, 1):
            actual_idx = error_indices[err_idx]
            actual = error_actual[err_idx]
            proba = error_proba[err_idx]
            print(f"\nError {i}: Actual={actual}, Predicted Prob={proba:.3f}")
            
            # Show feature values for this case
            feature_vals = X_test.iloc[actual_idx]
            print("  First 5 features:")
            for feat in feature_cols[:5]:  # First 5 columns in feature_cols order, not ranked by importance
                print(f"    {feat:30s}: {feature_vals[feat]}")
    
    print("\n" + "=" * 80 + "\n")


# ============================================================================
# KEY INSIGHTS FUNCTION
# ============================================================================

def print_key_insights(models_data, feature_importance_df):
    """Print summary of key findings and business insights."""
    print("=" * 80)
    print("KEY INSIGHTS AND BUSINESS RECOMMENDATIONS")
    print("=" * 80)
    
    print("\n1. MODEL PERFORMANCE:")
    best_model = max(models_data.items(), key=lambda x: x[1]['accuracy'])
    print(f"   - Best performing model: {best_model[0]} ({best_model[1]['accuracy']:.1%} accuracy)")
    print("   - Compare each model against the majority-class baseline reported earlier")
    print(f"   - Tree-based models (RF, XGBoost) typically outperform linear models")
    
    print("\n2. TOP SUCCESS FACTORS (from feature importance):")
    top_features = feature_importance_df.head(5)
    for idx, row in top_features.iterrows():
        feature_name = row['feature'].replace('_', ' ').title()
        print(f"   - {feature_name}: {row['importance']:.1%} importance")
    
    print("\n3. BUSINESS IMPLICATIONS:")
    print("   - Corporate innovations have higher success rates than startups")
    print("   - Low competition environments significantly boost success probability")
    print("   - Development stage matters: scaling > market-ready > MVP > prototype")
    print("   - Team experience is critical - experienced teams succeed more often")
    print("   - Funding matters, but not as much as team and market factors")
    
    print("\n4. MODEL SELECTION GUIDANCE:")
    print("   - Use Random Forest: Best balance of accuracy and interpretability")
    print("   - Use Logistic Regression: When you need simple, explainable predictions")
    print("   - Use XGBoost: When you need maximum accuracy and can tune hyperparameters")
    
    print("\n5. NEXT STEPS:")
    print("   - Add text features from innovation descriptions (NLP notebook)")
    print("   - Try neural networks for non-linear patterns (NN notebook)")
    print("   - Discover innovation archetypes (unsupervised learning notebook)")
    print("   - Consider ensemble of multiple models for production")
    
    print("\n" + "=" * 80 + "\n")

## Main Analysis

Now we'll apply supervised learning to predict innovation success. Follow along as we:
1. Load and explore the data
2. Analyze feature distributions and correlations
3. Train multiple models (Random Forest, Logistic Regression, XGBoost)
4. Evaluate and compare models
5. Extract business insights

### 1. Load and Explore Data

In [None]:
df = load_and_explore_data('innovations.csv')

### 2. Visualize Target Distribution

In [None]:
plot_target_distribution(df)
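
The majority-class baseline that this chart reports can also be computed with a dummy model. The sketch below is illustrative only: the ten labels are hypothetical, and `DummyClassifier` stands in for "always predict the majority class".

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 failures and 4 successes (40% success rate).
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

The result equals the share of the larger class, which is the number any real classifier has to beat.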

### 3. Analyze Feature Distributions

Let's see how numerical features differ between successful and failed innovations.

In [None]:
analyze_feature_distributions(df)

### 4. Prepare Features

In [None]:
X, y, feature_cols, df_prep = prepare_features(df)
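
To make the encoding step concrete, here is a small sketch of the indicator columns that `prepare_features` builds for one categorical variable. The toy rows are hypothetical, and the `'Medium'` level is an assumption about what the third competition category might be called.

```python
import pandas as pd

# Hypothetical category values, for illustration only ('Medium' is assumed).
toy = pd.DataFrame({"competition_level": ["Low", "Medium", "High", "Medium"]})

# Manual indicators, mirroring the approach used in prepare_features.
toy["competition_low"] = (toy["competition_level"] == "Low").astype(int)
toy["competition_high"] = (toy["competition_level"] == "High").astype(int)

# Rows where both indicators are 0 belong to the implicit baseline category.
toy["is_baseline"] = (
    (toy["competition_low"] == 0) & (toy["competition_high"] == 0)
).astype(int)
print(toy)
```

Leaving one category without its own column avoids redundant features, since that category is already identified by all indicators being zero.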

### 5. Feature Correlation Analysis

In [None]:
plot_correlation_heatmap(df_prep, feature_cols)

### 6. Train-Test Split

**Why split?** We need to test the model on data it hasn't seen during training to get an honest assessment of performance.

In [None]:
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2)
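
The effect of `stratify=y` can be checked on synthetic labels. This is a minimal sketch with hypothetical data (1,000 samples, 30% positives), not the innovation dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 300 positives and 700 negatives.
y_toy = np.array([1] * 300 + [0] * 700)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratification keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without stratification the two rates would differ by chance, which matters more as the dataset gets smaller or more imbalanced.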

### 7. Train Random Forest

**Random Forest** is an ensemble of decision trees that:
- Builds multiple trees on random subsets of data and features
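- Each tree "votes" on the prediction
- Votes are combined across trees, which reduces overfitting relative to a single tree (scikit-learn averages the trees' predicted probabilities)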
- Handles non-linear relationships well
- Provides feature importance

In [None]:
rf_model = train_random_forest(X_train, y_train, n_estimators=100, max_depth=10)
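
How the trees' "votes" are combined can be sketched directly. The example below uses synthetic data from `make_classification` (an assumption for illustration, not the innovation dataset) and checks that averaging the per-tree probability estimates reproduces the forest's own `predict_proba` in scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X_toy, y_toy)

# Collect each tree's class-probability estimates and average them.
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])
manual_average = tree_probas.mean(axis=0)

matches = np.allclose(manual_average, forest.predict_proba(X_toy))
print(f"Manual tree average matches forest.predict_proba: {matches}")
```

Averaging many trees that each saw a different bootstrap sample smooths out the quirks of any single tree, which is where the reduction in overfitting comes from.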

### 8. Evaluate Random Forest

In [None]:
rf_results = evaluate_model(rf_model, X_test, y_test, model_name='Random Forest')

### 9. Confusion Matrix

In [None]:
plot_confusion_matrix(y_test, rf_results['predictions'], model_name='Random Forest')
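
The metrics reported in the evaluation step all derive from the four confusion-matrix cells. Here is a worked example with hypothetical counts (they are not results from this notebook).

```python
# Hypothetical confusion-matrix counts, for illustration only.
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct / total
precision = tp / (tp + fp)                          # of predicted successes, share correct
recall = tp / (tp + fn)                             # of actual successes, share found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note how precision and recall diverge here: the false positives lower precision while the smaller number of false negatives keeps recall higher.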

### 10. Feature Importance Analysis

**Feature Importance** shows which variables the model relies on most for predictions.

In [None]:
importance_df = plot_feature_importance(rf_model, feature_cols, top_n=15)

### 11. Train Logistic Regression (Baseline)

**Logistic Regression** is a linear model:
- Assumes linear relationships between features and log-odds
- Simple and interpretable
- Fast to train
- Good baseline for comparison

In [None]:
lr_model, lr_scaler = train_logistic_regression(X_train, y_train)
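
The "linear in the log-odds" assumption can be made concrete. The sketch below fits a model on synthetic data (an assumption for illustration) and rebuilds the predicted probabilities by hand from the coefficients: a weighted sum of scaled features gives the log-odds, and the sigmoid turns it into a probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X_toy)

clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_toy)

# Log-odds are a linear function of the (scaled) features.
log_odds = X_scaled @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps log-odds to a probability between 0 and 1.
probabilities = 1.0 / (1.0 + np.exp(-log_odds))

matches = np.allclose(probabilities, clf.predict_proba(X_scaled)[:, 1])
print(f"Manual sigmoid matches predict_proba: {matches}")
```

Because the features are standardized, the magnitudes in `clf.coef_` are roughly comparable across features, which is part of what makes this model easy to interpret.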

### 12. Evaluate Logistic Regression

In [None]:
lr_results = evaluate_model(lr_model, X_test, y_test, model_name='Logistic Regression', scaler=lr_scaler)

### 13. Train XGBoost (Advanced)

**XGBoost** (eXtreme Gradient Boosting):
- Builds trees sequentially, each correcting previous errors
- Often achieves state-of-the-art results
- More complex than Random Forest
- Requires careful tuning

In [None]:
xgb_model = train_xgboost(X_train, y_train)

if xgb_model is not None:
    xgb_results = evaluate_model(xgb_model, X_test, y_test, model_name='XGBoost')
else:
    print("Skipping XGBoost evaluation (not installed)\n")
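
The idea of "each tree correcting previous errors" can be sketched without XGBoost. The loop below is a deliberately simplified gradient-boosting illustration for squared error on synthetic regression data. It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information, and every name and value in it is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction              # errors made by the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                  # the next tree targets those errors
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The small learning rate means each tree only nudges the prediction, which is why boosting needs many trees and why `n_estimators` and `learning_rate` are usually tuned together.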

### 14. Compare Models: ROC Curves

**ROC Curve** (Receiver Operating Characteristic) shows the tradeoff between:
- **True Positive Rate** (Sensitivity/Recall): Of actual successes, how many did we catch?
- **False Positive Rate**: Of actual failures, how many did we mistakenly predict as success?

**AUC** (Area Under Curve) summarizes overall model quality (the bands below are rough rules of thumb, not strict cutoffs):
- 0.5 = random guessing
- 1.0 = perfect classifier
- 0.7-0.8 = acceptable
- 0.8-0.9 = excellent
- >0.9 = outstanding

In [None]:
models_data = {
    'Random Forest': rf_results,
    'Logistic Regression': lr_results
}

if xgb_model is not None:
    models_data['XGBoost'] = xgb_results

plot_roc_curves(models_data, y_test)
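
A tiny worked example may help connect the definitions above to numbers. The four labels and scores are hypothetical. One point on the ROC curve comes from fixing a threshold and computing TPR and FPR, while AUC summarizes all thresholds at once.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One point on the ROC curve: classify with a fixed threshold.
threshold = 0.5
y_hat = (scores >= threshold).astype(int)

tp = int(((y_hat == 1) & (y_true == 1)).sum())
fn = int(((y_hat == 0) & (y_true == 1)).sum())
fp = int(((y_hat == 1) & (y_true == 0)).sum())
tn = int(((y_hat == 0) & (y_true == 0)).sum())

tpr = tp / (tp + fn)  # of actual successes, share caught
fpr = fp / (fp + tn)  # of actual failures, share flagged as success

# AUC: probability that a random success is scored above a random failure.
auc_value = roc_auc_score(y_true, scores)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, AUC={auc_value:.2f}")
```

Here three of the four (success, failure) pairs are ranked correctly, which is exactly what the AUC value expresses.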

### 15. Precision-Recall Curves

In [None]:
plot_precision_recall_curves(models_data, y_test)

### 16. Model Comparison: All Metrics

In [None]:
compare_models_metrics(models_data)

### 17. Prediction Probability Distributions

In [None]:
plot_prediction_distributions(models_data, y_test)
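
The decision threshold shown in these plots does not have to stay at 0.5. The sketch below, using hypothetical probabilities and labels, shows how raising the threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities and true labels, for illustration only.
proba = np.array([0.15, 0.45, 0.55, 0.62, 0.90])
y_true = np.array([0, 0, 0, 1, 1])

# Apply two different thresholds to the same probabilities.
default_pred = (proba >= 0.5).astype(int)
strict_pred = (proba >= 0.7).astype(int)

for label, pred in [("threshold 0.5", default_pred), ("threshold 0.7", strict_pred)]:
    p = precision_score(y_true, pred)
    r = recall_score(y_true, pred)
    print(f"{label}: precision={p:.2f}, recall={r:.2f}")
```

A stricter threshold suits cases where backing a failing innovation is costly. A looser one suits cases where missing a winner is the bigger risk.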

### 18. Learning Curves

**Learning Curves** help diagnose:
- **Overfitting**: Training score much higher than validation score
- **Underfitting**: Both scores are low
- **Good fit**: Curves converge at high performance
- **Need more data**: Validation score still improving

In [None]:
plot_learning_curves(rf_model, X, y, model_name='Random Forest')

### 19. Misclassification Analysis

In [None]:
analyze_misclassifications(rf_model, X_test, y_test, feature_cols, n_examples=5)

### 20. Key Insights and Recommendations

In [None]:
print_key_insights(models_data, importance_df)

## Summary

You've now completed a comprehensive supervised learning analysis!

**What We Learned:**
1. ✅ Supervised learning predicts outcomes using labeled data
2. ✅ Random Forest typically outperforms linear models for this problem
3. ✅ Company type, competition level, and development stage are key success factors
4. ✅ Model achieves ~70% accuracy, significantly beating baseline
5. ✅ Feature importance reveals actionable business insights

**Next Steps:**
- **02_unsupervised_learning.ipynb**: Discover innovation archetypes through clustering
- **03_nlp_analysis.ipynb**: Analyze innovation descriptions with NLP
- **04_neural_networks.ipynb**: Build deep learning models
- **05_genai.ipynb**: Generate new innovation pitches with AI