# Task 3.11: Learning Curve Analysis

## Overview
This task runs a learning curve analysis on our three top-performing models to assess the bias-variance tradeoff and to determine whether collecting more data would improve performance.

## Objectives
1. **Generate learning curves** for XGBoost, Random Forest, and MLP Classifier
2. **Analyze bias-variance tradeoff** by examining training vs validation score gaps
3. **Determine data saturation** - whether more data would help or if models have reached their limit
4. **Provide actionable recommendations** based on the analysis

## What is a Learning Curve?
A learning curve shows how model performance changes as the training set size increases. It plots:
- **Training Score**: How well the model fits the training data
- **Validation Score**: How well the model generalizes to unseen data

## Interpreting Learning Curves

### High Bias (Underfitting)
- Both training and validation scores are low
- Small gap between training and validation curves
- Adding more data won't help much
- **Solution**: Use more complex model, add features

### High Variance (Overfitting)
- Training score is high, validation score is low
- Large gap between training and validation curves
- Adding more data may help
- **Solution**: Regularization, reduce model complexity, get more data

### Good Fit
- Both scores are high and converge
- Small gap between curves
- Model is well-balanced

---

## Step 1: Import Libraries and Setup

We import all necessary libraries for:
- Data manipulation (pandas, numpy)
- Machine learning models (sklearn, xgboost)
- Learning curve generation (sklearn.model_selection)
- Visualization (matplotlib, seaborn)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Create output directories
Path('../../outputs').mkdir(parents=True, exist_ok=True)
Path('../../outputs/figures').mkdir(parents=True, exist_ok=True)

print("="*60)
print("TASK 3.11: LEARNING CURVE ANALYSIS")
print("="*60)
print("\nLibraries loaded successfully!")

## Step 2: Load and Prepare Data

### Data Loading
We load the preprocessed and scaled datasets from Task 1.6.

### Removing Leaky Features
**Critical Step**: We must remove features that cause data leakage. These are features that were used to create the target variable (`value_category`):

- `price`, `price_normalized`, `price_per_person`, `price_per_bathroom`, `price_per_bedroom` - Price features used in FP score calculation
- `review_scores_rating`, `review_scores_value` - Rating features used in FP score calculation
- `value_density`, `estimated_revenue_l365d` - Derived from price

Keeping these features would result in artificially high accuracy (~99%) because the model can reverse-engineer the target.
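The column check can be sketched as a small helper. This is an illustrative sketch, not part of the notebook: `split_leaky_columns` is a hypothetical name, and the example column list (including `bedrooms` and `accommodates`) is made up for demonstration. It also reports leaky names that are absent from the data, which is worth checking because a renamed column would otherwise stay in the feature set unnoticed.

```python
def split_leaky_columns(columns, leaky_features):
    """Split the leaky feature list into names present in the data and names not found."""
    present = set(columns)
    to_drop = [c for c in leaky_features if c in present]
    missing = [c for c in leaky_features if c not in present]
    return to_drop, missing


# Hypothetical column list, for illustration only
example_columns = ['price', 'bedrooms', 'accommodates', 'review_scores_rating']
leaky = ['price', 'price_normalized', 'review_scores_rating']

to_drop, missing = split_leaky_columns(example_columns, leaky)
print("Will drop:", to_drop)
print("Not found in data:", missing)
```

A non-empty "not found" list is not necessarily an error, but it is a prompt to confirm that the feature was never created rather than renamed.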

In [None]:
# Load data
print("Loading preprocessed data...\n")

X_train = pd.read_csv('../../data/processed/X_train_scaled.csv')
X_test = pd.read_csv('../../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../../data/processed/y_train.csv')["value_category"]
y_test = pd.read_csv('../../data/processed/y_test.csv')["value_category"]

# Drop id column if exists
if 'id' in X_train.columns:
    X_train = X_train.drop('id', axis=1)
if 'id' in X_test.columns:
    X_test = X_test.drop('id', axis=1)

# Remove leaky features to prevent data leakage
leaky_features = [
    'price', 'price_normalized', 'price_per_person', 'price_per_bathroom',
    'price_per_bedroom', 'review_scores_rating', 'review_scores_value',
    'value_density', 'estimated_revenue_l365d'
]

cols_to_drop = [col for col in leaky_features if col in X_train.columns]
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

print(f"Dropped {len(cols_to_drop)} leaky features:")
for col in cols_to_drop:
    print(f"  - {col}")
print(f"\nRemaining features: {X_train.shape[1]}")

# Encode target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Combine train and test for learning curve analysis:
# learning_curve performs its own cross-validation splits, so the hold-out split is not needed here
X = pd.concat([X_train, X_test], axis=0).reset_index(drop=True)
y = np.concatenate([y_train_encoded, y_test_encoded])

print(f"\n" + "="*60)
print("DATASET SUMMARY")
print("="*60)
print(f"Total samples: {len(y):,}")
print(f"Number of features: {X.shape[1]}")
print(f"\nClass distribution:")
for i, cls in enumerate(label_encoder.classes_):
    count = (y == i).sum()
    pct = count / len(y) * 100
    print(f"  {cls}: {count:,} ({pct:.1f}%)")

## Step 3: Define Top 3 Models

Based on our Week 2 model comparison results, we select the top 3 performing models:

### 1. XGBoost Classifier
- **Type**: Gradient Boosting ensemble method
- **Strengths**: Handles non-linear relationships, built-in regularization, feature importance
- **Expected behavior**: May show some overfitting (high variance) due to its complexity

### 2. Random Forest Classifier
- **Type**: Bagging ensemble of decision trees
- **Strengths**: Robust to overfitting, handles high-dimensional data well
- **Expected behavior**: Generally good bias-variance balance

### 3. MLP Classifier (Neural Network)
- **Type**: Multi-layer Perceptron neural network
- **Strengths**: Can learn complex patterns, flexible architecture
- **Expected behavior**: May require more data to generalize well

We use regularized parameters to prevent excessive overfitting during learning curve analysis.

In [None]:
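# Define top 3 models with regularized settings to limit overfitting
models = {
    'XGBoost': XGBClassifier(
        n_estimators=100,
        max_depth=4,              # shallow trees limit model complexity
        learning_rate=0.05,       # small step size per boosting round
        min_child_weight=10,      # minimum sum of instance weights in a child
        subsample=0.7,            # row subsampling per tree
        colsample_bytree=0.7,     # column subsampling per tree
        reg_alpha=0.5,            # L1 penalty on leaf weights
        reg_lambda=2.0,           # L2 penalty on leaf weights
        gamma=0.1,                # minimum loss reduction required to split
        random_state=42,
        n_jobs=-1,
        verbosity=0
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=100,
        max_depth=6,              # shallow trees limit model complexity
        min_samples_split=20,     # minimum samples to split an internal node
        min_samples_leaf=12,      # minimum samples per leaf
        max_features=0.3,         # fraction of features considered per split
        max_samples=0.7,          # fraction of rows drawn for each tree
        random_state=42,
        n_jobs=-1
    ),
    'MLP Classifier': MLPClassifier(
        hidden_layer_sizes=(64, 32),  # two hidden layers
        activation='relu',
        solver='adam',
        alpha=0.1,                # L2 penalty strength
        learning_rate='adaptive', # note: only takes effect with solver='sgd'
        max_iter=500,
        early_stopping=True,      # stop when the internal validation score stalls
        validation_fraction=0.15, # share of training data held out for early stopping
        n_iter_no_change=15,      # patience (epochs) before stopping
        random_state=42
    )
}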

## Step 4: Generate Learning Curves

### Methodology
We use `sklearn.model_selection.learning_curve` which:
1. Trains the model on increasing subsets of the training data
2. Evaluates performance using cross-validation at each training size
3. Returns training and validation scores for each size
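To make step 1 concrete, the sketch below estimates the absolute training set sizes that fractional `train_sizes` translate to. It assumes plain k-fold splitting, where each split trains on roughly (k-1)/k of the samples. `approx_train_sizes` is a hypothetical helper, the sample count is made up, and scikit-learn's own rounding may differ by a sample.

```python
def approx_train_sizes(n_samples, n_folds=5, n_steps=10):
    """Approximate absolute training sizes for evenly spaced fractions 1/n_steps .. 1."""
    # Samples left for training once one fold is held out for validation
    n_max_train = n_samples - n_samples // n_folds
    # Integer arithmetic avoids floating-point surprises such as 0.3 * 800
    return [n_max_train * step // n_steps for step in range(1, n_steps + 1)]


# Hypothetical dataset of 10,000 rows with 5-fold cross-validation
sizes = approx_train_sizes(10_000, n_folds=5, n_steps=10)
print(sizes)
```

The point to notice is that "100%" on the x-axis corresponds to the training portion of a split, not to the full dataset.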

### Parameters
- **train_sizes**: 10 training set sizes from 10% to 100% of the training portion of each CV split (with 5 folds, about 80% of all samples)
- **cv=5**: 5-fold cross-validation for robust estimates
- **scoring='f1_macro'**: Macro F1-score for multi-class classification
- **n_jobs=-1**: Parallel processing for speed
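Macro F1 computes an F1 score per class and then takes the unweighted mean, so a small class counts as much as a large one. The pure-Python sketch below illustrates the calculation on made-up labels. `macro_f1` is a hypothetical helper written for this example; the notebook itself relies on scikit-learn's `f1_macro` scorer.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(f"Macro F1: {macro_f1(y_true, y_pred):.4f}")
```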

### What We're Looking For
- **Convergence**: Do training and validation curves converge?
- **Gap size**: Large gap = high variance (overfitting)
- **Plateau**: Does validation score plateau? (data saturation)
- **Trend**: Is validation score still improving at max training size?

In [None]:
# Define training sizes (10% to 100% of data)
train_sizes = np.linspace(0.1, 1.0, 10)

# Store results
learning_curve_results = {}

print("="*60)
print("GENERATING LEARNING CURVES")
print("="*60)
print(f"\nTraining sizes: {[f'{s:.0%}' for s in train_sizes]}")
print(f"Cross-validation folds: 5")
print(f"Scoring metric: F1-score (macro)")
print("\nThis may take a few minutes...\n")

for name, model in models.items():
    print(f"Processing {name}...", end=" ")
    
    train_sizes_abs, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=train_sizes,
        cv=5,
        scoring='f1_macro',
        n_jobs=-1,
        shuffle=True,      # required for random_state to take effect
        random_state=42
    )
    
    learning_curve_results[name] = {
        'train_sizes': train_sizes_abs,
        'train_scores_mean': train_scores.mean(axis=1),
        'train_scores_std': train_scores.std(axis=1),
        'val_scores_mean': val_scores.mean(axis=1),
        'val_scores_std': val_scores.std(axis=1)
    }
    
    print(f"Done! Final val score: {val_scores.mean(axis=1)[-1]:.4f}")

print("\n" + "="*60)
print("Learning curves generated successfully!")
print("="*60)

## Step 5: Visualize Learning Curves

### Visualization 1: Individual Learning Curves
We create a subplot for each model showing:
- **Blue line**: Training score (how well model fits training data)
- **Orange line**: Validation score (how well model generalizes)
- **Shaded area**: Standard deviation across CV folds (uncertainty)
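The shaded band is the mean plus or minus one standard deviation over the folds at each training size. The sketch below shows that calculation on a made-up score matrix with one row per training size and one column per fold, using only the standard library. The numbers and the `band` helper are hypothetical. `statistics.pstdev` is the population standard deviation, which matches NumPy's default `std` used in the notebook.

```python
from statistics import mean, pstdev


def band(scores_per_size):
    """Return (lower, centre, upper) for each training size: mean -/+ one std over folds."""
    result = []
    for fold_scores in scores_per_size:
        centre = mean(fold_scores)
        spread = pstdev(fold_scores)
        result.append((centre - spread, centre, centre + spread))
    return result


# Made-up validation scores: 2 training sizes x 5 folds
val_scores_example = [
    [0.60, 0.62, 0.58, 0.61, 0.59],
    [0.70, 0.71, 0.69, 0.70, 0.70],
]
for lower, centre, upper in band(val_scores_example):
    print(f"{lower:.4f}  {centre:.4f}  {upper:.4f}")
```

A wide band at small training sizes that narrows as the size grows is typical, since scores vary more between folds when less data is available.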

### How to Read the Plots
- **Gap between curves**: Indicates variance (overfitting)
- **Curve convergence**: Good generalization
- **Flat validation curve**: Model may be saturated (more data won't help)
- **Rising validation curve**: More data could improve performance

In [None]:
# Create individual learning curve plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = {'train': 'steelblue', 'val': 'darkorange'}

for idx, (name, results) in enumerate(learning_curve_results.items()):
    ax = axes[idx]
    
    # Plot training scores
    ax.plot(results['train_sizes'], results['train_scores_mean'], 
            'o-', color=colors['train'], label='Training Score', linewidth=2, markersize=6)
    ax.fill_between(results['train_sizes'],
                    results['train_scores_mean'] - results['train_scores_std'],
                    results['train_scores_mean'] + results['train_scores_std'],
                    alpha=0.2, color=colors['train'])
    
    # Plot validation scores
    ax.plot(results['train_sizes'], results['val_scores_mean'],
            'o-', color=colors['val'], label='Validation Score', linewidth=2, markersize=6)
    ax.fill_between(results['train_sizes'],
                    results['val_scores_mean'] - results['val_scores_std'],
                    results['val_scores_mean'] + results['val_scores_std'],
                    alpha=0.2, color=colors['val'])
    
    # Calculate and display gap
    final_gap = results['train_scores_mean'][-1] - results['val_scores_mean'][-1]
    
    ax.set_xlabel('Training Set Size', fontsize=12)
    ax.set_ylabel('F1-Score (Macro)', fontsize=12)
    ax.set_title(f'{name}\nFinal Gap: {final_gap:.4f}', fontsize=14, fontweight='bold')
    ax.legend(loc='lower right', fontsize=10)
    ax.grid(True, alpha=0.3)
    ax.set_ylim([0.5, 1.0])

plt.suptitle('Learning Curves for Top 3 Models', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../../outputs/figures/learning_curves_individual.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nSaved: outputs/figures/learning_curves_individual.png")

## Step 6: Comparative Analysis

### Visualization 2: All Models Comparison
We overlay all validation curves on a single plot to directly compare:
- Which model learns fastest (steeper initial curve)
- Which model achieves highest final performance
- Which model benefits most from additional data

In [None]:
# Comparative plot - all models on same axes
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

colors_models = ['steelblue', 'forestgreen', 'coral']

# Left plot: Validation scores comparison
ax1 = axes[0]
for idx, (name, results) in enumerate(learning_curve_results.items()):
    ax1.plot(results['train_sizes'], results['val_scores_mean'],
             'o-', color=colors_models[idx], label=name, linewidth=2, markersize=6)
    ax1.fill_between(results['train_sizes'],
                     results['val_scores_mean'] - results['val_scores_std'],
                     results['val_scores_mean'] + results['val_scores_std'],
                     alpha=0.15, color=colors_models[idx])

ax1.set_xlabel('Training Set Size', fontsize=12)
ax1.set_ylabel('Validation F1-Score (Macro)', fontsize=12)
ax1.set_title('Validation Score Comparison', fontsize=14, fontweight='bold')
ax1.legend(loc='lower right', fontsize=11)
ax1.grid(True, alpha=0.3)

# Right plot: Bias-Variance Gap
ax2 = axes[1]
for idx, (name, results) in enumerate(learning_curve_results.items()):
    gap = results['train_scores_mean'] - results['val_scores_mean']
    ax2.plot(results['train_sizes'], gap,
             'o-', color=colors_models[idx], label=name, linewidth=2, markersize=6)

ax2.axhline(y=0.05, color='red', linestyle='--', alpha=0.7, label='Acceptable Gap (0.05)')
ax2.set_xlabel('Training Set Size', fontsize=12)
ax2.set_ylabel('Train-Validation Gap', fontsize=12)
ax2.set_title('Bias-Variance Gap Analysis', fontsize=14, fontweight='bold')
ax2.legend(loc='upper right', fontsize=11)
ax2.grid(True, alpha=0.3)

plt.suptitle('Model Comparison: Learning Behavior', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../../outputs/figures/learning_curves_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nSaved: outputs/figures/learning_curves_comparison.png")

## Step 7: Bias-Variance Analysis

### Understanding Bias and Variance

| Metric | High Bias | High Variance |
|--------|-----------|---------------|
| Training Score | Low | High |
| Validation Score | Low | Low |
| Gap | Small | Large |
| Problem | Underfitting | Overfitting |
| Solution | More complex model | Regularization, more data |

### Analysis Metrics
We calculate:
1. **Final Training Score**: Performance on training data at max size
2. **Final Validation Score**: Generalization performance
3. **Bias-Variance Gap**: Difference between train and validation
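4. **Score Improvement**: How much the validation score improved from the smallest to the largest training size
5. **Recent Improvement**: Change in validation score between the last two training sizes, used for the saturation check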
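As a worked example of these metrics, the sketch below computes them for a single made-up learning curve. The score values and the `summarize_curve` helper are hypothetical; the 0.005 and 0.01 saturation cut-offs are the ones applied in this notebook's analysis cell.

```python
def summarize_curve(train_means, val_means):
    """Compute gap, total improvement, recent improvement and a saturation label."""
    gap = train_means[-1] - val_means[-1]
    total_improvement = val_means[-1] - val_means[0]
    recent_improvement = val_means[-1] - val_means[-2]

    if recent_improvement < 0.005:
        saturation = "SATURATED"
    elif recent_improvement < 0.01:
        saturation = "NEAR SATURATION"
    else:
        saturation = "NOT SATURATED"

    return {
        'gap': gap,
        'total_improvement': total_improvement,
        'recent_improvement': recent_improvement,
        'saturation': saturation,
    }


# Made-up mean scores at five increasing training sizes
train_means = [0.95, 0.90, 0.86, 0.84, 0.83]
val_means = [0.60, 0.66, 0.70, 0.712, 0.715]

summary = summarize_curve(train_means, val_means)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Because the saturation label depends only on the last two points, it is sensitive to fold-to-fold noise and is best read alongside the shaded bands in the plots.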

In [None]:
print("="*70)
print("BIAS-VARIANCE ANALYSIS")
print("="*70)

analysis_results = []

for name, results in learning_curve_results.items():
    train_final = results['train_scores_mean'][-1]
    val_final = results['val_scores_mean'][-1]
    gap = train_final - val_final
    
    # Score improvement from 10% to 100% data
    val_improvement = results['val_scores_mean'][-1] - results['val_scores_mean'][0]
    
    # Check if still improving (compare last two points)
    recent_improvement = results['val_scores_mean'][-1] - results['val_scores_mean'][-2]
    
    # Determine bias-variance status
    if gap > 0.15:
        status = "HIGH VARIANCE (Overfitting)"
        recommendation = "Add regularization, reduce complexity, or get more data"
    elif gap > 0.08:
        status = "MODERATE VARIANCE"
        recommendation = "Consider light regularization or more data"
    elif val_final < 0.65:
        status = "HIGH BIAS (Underfitting)"
        recommendation = "Use more complex model or add features"
    else:
        status = "GOOD BALANCE"
        recommendation = "Model is well-tuned"
    
    # Data saturation check
    if recent_improvement < 0.005:
        saturation = "SATURATED - More data unlikely to help significantly"
    elif recent_improvement < 0.01:
        saturation = "NEAR SATURATION - Diminishing returns from more data"
    else:
        saturation = "NOT SATURATED - More data could improve performance"
    
    analysis_results.append({
        'Model': name,
        'Train_Score': train_final,
        'Val_Score': val_final,
        'Gap': gap,
        'Improvement': val_improvement,
        'Recent_Improvement': recent_improvement,
        'Status': status,
        'Saturation': saturation
    })
    
    print(f"\n{'='*70}")
    print(f"MODEL: {name}")
    print(f"{'='*70}")
    print(f"\n  Performance Metrics:")
    print(f"    • Final Training Score:   {train_final:.4f}")
    print(f"    • Final Validation Score: {val_final:.4f}")
    print(f"    • Bias-Variance Gap:      {gap:.4f}")
    print(f"    • Total Improvement:      {val_improvement:.4f} ({val_improvement/results['val_scores_mean'][0]*100:.1f}%)")
    print(f"    • Recent Improvement:     {recent_improvement:.4f}")
    print(f"\n  Diagnosis:")
    print(f"    • Status: {status}")
    print(f"    • {saturation}")
    print(f"\n  Recommendation:")
    print(f"    → {recommendation}")

## Step 8: Summary Table and Visualization

We create a comprehensive summary comparing all models across key metrics.

In [None]:
# Create summary DataFrame
summary_df = pd.DataFrame(analysis_results)

print("\n" + "="*70)
print("SUMMARY TABLE")
print("="*70)
print(summary_df[['Model', 'Train_Score', 'Val_Score', 'Gap', 'Improvement']].to_string(index=False))

# Save summary to CSV
summary_df.to_csv('../../outputs/learning_curve_analysis.csv', index=False)
print("\nSaved: outputs/learning_curve_analysis.csv")

# Create summary visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

models_list = summary_df['Model'].tolist()
x = np.arange(len(models_list))
width = 0.35

# Plot 1: Train vs Validation Scores
ax1 = axes[0]
bars1 = ax1.bar(x - width/2, summary_df['Train_Score'], width, label='Training', color='steelblue')
bars2 = ax1.bar(x + width/2, summary_df['Val_Score'], width, label='Validation', color='darkorange')
ax1.set_ylabel('F1-Score', fontsize=12)
ax1.set_title('Training vs Validation Scores', fontsize=14, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(models_list, rotation=15, ha='right')
ax1.legend()
ax1.set_ylim([0.6, 1.0])
ax1.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars1:
    ax1.annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 xytext=(0, 3), textcoords='offset points', ha='center', fontsize=9)
for bar in bars2:
    ax1.annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 xytext=(0, 3), textcoords='offset points', ha='center', fontsize=9)

# Plot 2: Bias-Variance Gap
ax2 = axes[1]
colors_gap = ['green' if g < 0.08 else 'orange' if g < 0.15 else 'red' for g in summary_df['Gap']]
bars3 = ax2.bar(x, summary_df['Gap'], color=colors_gap, edgecolor='black', linewidth=1.5)
ax2.axhline(y=0.08, color='orange', linestyle='--', label='Moderate threshold')
ax2.axhline(y=0.15, color='red', linestyle='--', label='High variance threshold')
ax2.set_ylabel('Gap (Train - Validation)', fontsize=12)
ax2.set_title('Bias-Variance Gap', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(models_list, rotation=15, ha='right')
ax2.legend(fontsize=9)
ax2.grid(axis='y', alpha=0.3)

for bar in bars3:
    ax2.annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 xytext=(0, 3), textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

# Plot 3: Improvement from more data
ax3 = axes[2]
bars4 = ax3.bar(x, summary_df['Improvement'], color='mediumseagreen', edgecolor='black', linewidth=1.5)
ax3.set_ylabel('F1-Score Improvement', fontsize=12)
ax3.set_title('Improvement (10% → 100% data)', fontsize=14, fontweight='bold')
ax3.set_xticks(x)
ax3.set_xticklabels(models_list, rotation=15, ha='right')
ax3.grid(axis='y', alpha=0.3)

for bar in bars4:
    ax3.annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                 xytext=(0, 3), textcoords='offset points', ha='center', fontsize=10, fontweight='bold')

plt.suptitle('Learning Curve Analysis Summary', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../../outputs/figures/learning_curve_summary.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nSaved: outputs/figures/learning_curve_summary.png")

## Step 9: Final Conclusions and Recommendations

Based on our learning curve analysis, we provide final conclusions about:
1. Which model has the best bias-variance tradeoff
2. Whether more data would help improve performance
3. Actionable recommendations for model improvement

In [None]:
print("\n" + "="*70)
print("FINAL CONCLUSIONS AND RECOMMENDATIONS")
print("="*70)

# Find best model
best_model_idx = summary_df['Val_Score'].idxmax()
best_model = summary_df.loc[best_model_idx, 'Model']
best_score = summary_df.loc[best_model_idx, 'Val_Score']

# Find model with best bias-variance balance
best_balance_idx = summary_df['Gap'].idxmin()
best_balance_model = summary_df.loc[best_balance_idx, 'Model']

print(f"\n1. BEST PERFORMING MODEL")
print(f"   → {best_model} with validation F1-score of {best_score:.4f}")

print(f"\n2. BEST BIAS-VARIANCE BALANCE")
print(f"   → {best_balance_model} with gap of {summary_df.loc[best_balance_idx, 'Gap']:.4f}")

print(f"\n3. DATA SATURATION ANALYSIS")
for _, row in summary_df.iterrows():
    print(f"   • {row['Model']}: {row['Saturation']}")




print("\n" + "="*70)
print("TASK 3.11: LEARNING CURVE ANALYSIS - COMPLETE")
print("="*70)

## Files Generated

### Analysis Files (outputs/)
- `learning_curve_analysis.csv` - Summary metrics for all models

### Visualization Files (outputs/figures/)
- `learning_curves_individual.png` - Individual learning curves for each model
- `learning_curves_comparison.png` - Comparative analysis of all models
- `learning_curve_summary.png` - Summary dashboard with key metrics

---

## Key Takeaways

1. **Bias-Variance Tradeoff**: All models show some degree of overfitting (training score > validation score), which is normal for complex models.

2. **Data Saturation**: The validation curves are flattening, indicating that adding more data of the same kind is unlikely to improve performance substantially.

3. **Model Selection**: XGBoost provides the best balance of performance and generalization for this dataset.

4. **Practical Implications**: For production deployment, focus on regularization and feature engineering rather than data collection.