# Regression Algorithms: From Linear to Advanced 📈

Welcome to your comprehensive regression workshop! Today you'll master:

- 🏠 **Real estate price prediction** with linear regression
- 📊 **Advanced regression techniques** (Ridge, Lasso, Elastic Net)
- 🌟 **Polynomial features** and non-linear relationships
- 🔍 **Feature selection** and regularization
- 📏 **Comprehensive evaluation** beyond just R²
- 🎯 **End-to-end pipeline** for production-ready models

Ready to predict the future? Let's go! 🚀

In [None]:
# Essential imports for regression analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression, load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression, RFE
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📈 REGRESSION WORKSHOP INITIALIZED!")
print("Ready to predict continuous values with confidence!")

## Dataset 1: California Housing Prices 🏠

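Let's start with everyone's favorite: predicting house prices!
This dataset is derived from the 1990 California census. Each row describes a census block group (not an individual house), with features like:
- Median income in the block group
- Median house age
- Average number of rooms and bedrooms per household
- Population and average household occupancy
- Geographic location (latitude and longitude)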

**Business Goal:** Build a model to estimate house prices for real estate valuation.

In [None]:
# Load and create California housing dataset
from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
X_housing = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
y_housing = housing_data.target  # Median house value, in units of $100,000

print("🏠 CALIFORNIA HOUSING DATASET")
print(f"Samples: {X_housing.shape[0]:,}")
print(f"Features: {X_housing.shape[1]}")
print("Target: Median house value (in $100,000s)")
# Multiply by 100 to convert $100k units into $k for display
print(f"Price range: ${y_housing.min() * 100:.1f}k - ${y_housing.max() * 100:.1f}k")
print(f"Average price: ${y_housing.mean() * 100:.1f}k")

# Display feature information
print(f"\nFeatures:")
for i, feature in enumerate(housing_data.feature_names):
    print(f"  {i+1}. {feature}")

print(f"\nFirst few samples:")
housing_df = X_housing.copy()
housing_df['price'] = y_housing
print(housing_df.head())

In [None]:
# Comprehensive Exploratory Data Analysis
fig, axes = plt.subplots(3, 3, figsize=(20, 15))
axes = axes.ravel()

# Distribution of target variable
axes[0].hist(y_housing, bins=50, alpha=0.7, edgecolor='black')
axes[0].set_title('House Price Distribution')
axes[0].set_xlabel('Price ($100k)')
axes[0].set_ylabel('Frequency')

# Feature distributions
features_to_plot = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']
for i, feature in enumerate(features_to_plot):
    axes[i+1].hist(X_housing[feature], bins=30, alpha=0.7, edgecolor='black')
    axes[i+1].set_title(f'{feature} Distribution')
    axes[i+1].set_xlabel(feature)

# Correlation with target
correlations = X_housing.corrwith(pd.Series(y_housing, name='Price')).sort_values(ascending=False)
axes[6].barh(range(len(correlations)), correlations.values, color='skyblue', edgecolor='black')
axes[6].set_yticks(range(len(correlations)))
axes[6].set_yticklabels(correlations.index)
axes[6].set_xlabel('Correlation with Price')
axes[6].set_title('Feature-Price Correlations')

# Scatter plot: Income vs Price (strongest correlation)
axes[7].scatter(X_housing['MedInc'], y_housing, alpha=0.5, s=1)
axes[7].set_xlabel('Median Income')
axes[7].set_ylabel('House Price ($100k)')
axes[7].set_title('Income vs House Price')

# Geographic scatter (Longitude vs Latitude colored by price)
scatter = axes[8].scatter(X_housing['Longitude'], X_housing['Latitude'], 
                         c=y_housing, cmap='viridis', alpha=0.6, s=1)
axes[8].set_xlabel('Longitude')
axes[8].set_ylabel('Latitude') 
axes[8].set_title('Geographic Price Distribution')
plt.colorbar(scatter, ax=axes[8], label='Price ($100k)')

plt.tight_layout()
plt.show()

print("🔍 KEY INSIGHTS:")
print(f"- Median Income has strongest correlation with price: {correlations['MedInc']:.3f}")
print(f"- Geographic location clearly matters (coastal areas more expensive)")
print(f"- Price distribution is right-skewed (some very expensive areas)")

### 🎮 Interactive Exercise 1: Linear Regression Fundamentals

**Your First Challenge:** Build a simple linear regression model to predict house prices.

**Tasks:**
1. Split the data (80% train, 20% test)
2. Train a basic linear regression model  
3. Evaluate using multiple regression metrics
4. Visualize predictions vs actual values
5. Analyze residuals (prediction errors)

**Learning Goal:** Understand what linear regression can and cannot capture!
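
As a quick reference for task 3, the three metrics used in this workshop can be computed by hand. This is a minimal sketch on made-up numbers (the arrays `y_true_toy` and `y_pred_toy` are illustrative and are not taken from the housing data); scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` should agree with these values.

```python
import numpy as np

# Toy values in $100k units (made up for illustration)
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true_toy - y_pred_toy

# MAE: average size of the errors, in the same units as the target
mae_toy = np.mean(np.abs(errors))

# RMSE: square, average, then take the root, so large errors count more
rmse_toy = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(f"MAE={mae_toy:.3f}  RMSE={rmse_toy:.3f}  R²={r2_toy:.3f}")
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off. That is why it helps to report both.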

In [None]:
# TODO: Implement linear regression pipeline

# Step 1: Split the data
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

print("✅ Data split complete!")
print(f"Training samples: {X_train_housing.shape[0]:,}")
print(f"Test samples: {X_test_housing.shape[0]:,}")

# Step 2: Train linear regression
lr_housing = LinearRegression()
lr_housing.fit(X_train_housing, y_train_housing)

# Step 3: Make predictions
y_pred_train = lr_housing.predict(X_train_housing)
y_pred_test = lr_housing.predict(X_test_housing)

print("✅ Linear regression model trained!")

# Step 4: Comprehensive evaluation
def evaluate_regression_model(y_true, y_pred, dataset_name="Dataset"):
    """
    Comprehensive regression model evaluation
    """
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    # The target is in units of $100,000, so multiply by 100 to report errors in $k
    print(f"\n📊 {dataset_name.upper()} EVALUATION:")
    print(f"R² Score:     {r2:.3f} - Variance explained")
    print(f"RMSE:         ${rmse * 100:.1f}k - Root Mean Squared Error")
    print(f"MAE:          ${mae * 100:.1f}k - Mean Absolute Error")
    print(f"MSE:          {mse:.3f} - Mean Squared Error (in squared $100k units)")
    
    # Business interpretation
    print(f"\n💰 BUSINESS INTERPRETATION:")
    print(f"- On average, predictions are off by ${mae * 100:.1f}k (MAE)")
    print(f"- Model explains {r2:.1%} of price variation")
    print(f"- RMSE of ${rmse * 100:.1f}k weights large errors more heavily than MAE")
    
    return {'R2': r2, 'RMSE': rmse, 'MAE': mae, 'MSE': mse}

# Evaluate on both training and test sets
train_metrics = evaluate_regression_model(y_train_housing, y_pred_train, "Training")
test_metrics = evaluate_regression_model(y_test_housing, y_pred_test, "Test")

# Check for overfitting
print(f"\n🔍 OVERFITTING CHECK:")
print(f"Training R²: {train_metrics['R2']:.3f}")
print(f"Test R²:     {test_metrics['R2']:.3f}")
print(f"Difference:  {train_metrics['R2'] - test_metrics['R2']:.3f}")

if train_metrics['R2'] - test_metrics['R2'] > 0.1:
    print("⚠️  Possible overfitting detected!")
else:
    print("✅ No significant overfitting")

In [None]:
# Step 5: Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Predictions vs Actual
axes[0,0].scatter(y_test_housing, y_pred_test, alpha=0.6, s=20)
axes[0,0].plot([y_test_housing.min(), y_test_housing.max()], 
               [y_test_housing.min(), y_test_housing.max()], 'r--', lw=2)
axes[0,0].set_xlabel('Actual Price ($100k)')
axes[0,0].set_ylabel('Predicted Price ($100k)')
axes[0,0].set_title('Predictions vs Actual Values')
axes[0,0].grid(True, alpha=0.3)

# Residuals (errors)
residuals = y_test_housing - y_pred_test
axes[0,1].scatter(y_pred_test, residuals, alpha=0.6, s=20)
axes[0,1].axhline(y=0, color='r', linestyle='--')
axes[0,1].set_xlabel('Predicted Price ($100k)')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Residual Plot')
axes[0,1].grid(True, alpha=0.3)

# Residuals histogram
axes[1,0].hist(residuals, bins=30, alpha=0.7, edgecolor='black')
axes[1,0].set_xlabel('Residuals')
axes[1,0].set_ylabel('Frequency')
axes[1,0].set_title('Residuals Distribution')
axes[1,0].axvline(x=0, color='r', linestyle='--')

# Feature coefficients
coefficients = pd.DataFrame({
    'Feature': X_housing.columns,
    'Coefficient': lr_housing.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

axes[1,1].barh(range(len(coefficients)), coefficients['Coefficient'])
axes[1,1].set_yticks(range(len(coefficients)))
axes[1,1].set_yticklabels(coefficients['Feature'])
axes[1,1].set_xlabel('Coefficient Value')
axes[1,1].set_title('Coefficients (Unscaled Features)')

plt.tight_layout()
plt.show()

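print(f"\n🎯 COEFFICIENT INTERPRETATION:")
# Coefficients are in $100k per unit of each unscaled feature, so multiply by 100 for $k
for _, row in coefficients.head().iterrows():
    direction = "increases" if row['Coefficient'] > 0 else "decreases"
    print(f"- {row['Feature']}: {direction} predicted price by ${abs(row['Coefficient']) * 100:.1f}k per unit, holding other features fixed")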

## Advanced Regression: Regularization Techniques 🎚️

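Ordinary least squares can overfit when there are many features or when features are strongly correlated, because nothing constrains the size of the coefficients. Let's explore regularization techniques that add such a constraint: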

- **Ridge Regression (L2)**: Shrinks coefficients smoothly
- **Lasso Regression (L1)**: Can set coefficients to exactly zero (feature selection)
- **Elastic Net**: Combines both L1 and L2 penalties

These techniques help prevent overfitting and can improve generalization!
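
To make the difference between the two penalties concrete, here is a minimal sketch of the textbook special case where the features are orthonormal. Under that assumption, and with the loss written as ½‖y − Xβ‖² plus either ½α‖β‖² (ridge) or α‖β‖₁ (lasso), both solutions are simple functions of the OLS coefficients. The function names and the numbers are illustrative, and scikit-learn scales `alpha` differently, so the constants will not match its output exactly.

```python
import numpy as np

def ridge_shrink(beta_ols, alpha):
    """Ridge, orthonormal case: every coefficient is scaled by 1 / (1 + alpha)."""
    return beta_ols / (1.0 + alpha)

def lasso_shrink(beta_ols, alpha):
    """Lasso, orthonormal case: soft-thresholding moves each coefficient
    toward zero by alpha and clips it at zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

beta_ols = np.array([3.0, -1.0, 0.2])  # hypothetical OLS coefficients
alpha_demo = 0.5

ridge_demo = ridge_shrink(beta_ols, alpha_demo)
lasso_demo = lasso_shrink(beta_ols, alpha_demo)

print("Ridge:", ridge_demo)
print("Lasso:", lasso_demo)
```

With these numbers, lasso sets the smallest coefficient to exactly zero while ridge keeps all three non-zero, only smaller. That is the mechanism behind lasso's feature selection.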

In [None]:
# Compare different regularization techniques
regularization_models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Ridge (α=10.0)': Ridge(alpha=10.0), 
    'Ridge (α=100.0)': Ridge(alpha=100.0),
    'Lasso (α=0.1)': Lasso(alpha=0.1, max_iter=2000),
    'Lasso (α=1.0)': Lasso(alpha=1.0, max_iter=2000),
    'ElasticNet (α=1.0)': ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=2000)
}

# Need to scale features for regularized models
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_housing)
X_test_scaled = scaler.transform(X_test_housing)

regularization_results = {}

print("🎚️ REGULARIZATION TECHNIQUES COMPARISON")
for name, model in regularization_models.items():
    # Fit every model on the scaled features. Scaling does not change OLS predictions,
    # and it keeps coefficients comparable for the non-zero count below
    # (on raw features, the tiny Population coefficient would fall under the threshold).
    model.fit(X_train_scaled, y_train_housing)
    y_pred = model.predict(X_test_scaled)
    
    r2 = r2_score(y_test_housing, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test_housing, y_pred))
    mae = mean_absolute_error(y_test_housing, y_pred)
    
    # Count non-zero coefficients (for feature selection analysis)
    n_features_used = np.sum(np.abs(model.coef_) > 1e-4) if hasattr(model, 'coef_') else len(X_housing.columns)
    
    regularization_results[name] = {
        'R²': r2,
        'RMSE': rmse,
        'MAE': mae,
        'Features Used': n_features_used
    }

# Display results
reg_df = pd.DataFrame(regularization_results).T
print("\n📊 REGULARIZATION RESULTS:")
print(reg_df.round(3))

# Visualize regularization effects
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Performance comparison
metrics = ['R²', 'RMSE', 'MAE']
for i, metric in enumerate(metrics):
    if i < 3:
        ax_idx = (i//2, i%2) if i < 2 else (1, 0)
        values = reg_df[metric]
        bars = axes[ax_idx].bar(range(len(values)), values, alpha=0.7)
        axes[ax_idx].set_xticks(range(len(values)))
        axes[ax_idx].set_xticklabels(values.index, rotation=45, ha='right')
        axes[ax_idx].set_ylabel(metric)
        axes[ax_idx].set_title(f'{metric} Comparison')
        axes[ax_idx].grid(True, alpha=0.3)
        
        # Highlight best performer
        if metric == 'R²':
            best_idx = values.idxmax()
        else:
            best_idx = values.idxmin()
        best_bar_idx = values.index.get_loc(best_idx)
        bars[best_bar_idx].set_color('gold')

# Feature usage
axes[1,1].bar(range(len(reg_df)), reg_df['Features Used'], alpha=0.7, color='lightcoral')
axes[1,1].set_xticks(range(len(reg_df)))
axes[1,1].set_xticklabels(reg_df.index, rotation=45, ha='right')
axes[1,1].set_ylabel('Number of Features Used')
axes[1,1].set_title('Feature Selection Effect')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Best performing model
best_model = reg_df['R²'].idxmax()
print(f"\n🏆 BEST PERFORMING MODEL: {best_model}")
print(f"R² Score: {reg_df.loc[best_model, 'R²']:.3f}")
print(f"Uses {reg_df.loc[best_model, 'Features Used']:.0f} out of {len(X_housing.columns)} features")

In [None]:
# Analyze coefficient paths for different regularization strengths
alphas = np.logspace(-3, 2, 50)  # From 0.001 to 100

ridge_coefs = []
lasso_coefs = []

print("🔍 Analyzing regularization paths...")

for alpha in alphas:
    # Ridge coefficients
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train_housing)
    ridge_coefs.append(ridge.coef_)
    
    # Lasso coefficients  
    lasso = Lasso(alpha=alpha, max_iter=2000)
    lasso.fit(X_train_scaled, y_train_housing)
    lasso_coefs.append(lasso.coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

# Plot coefficient paths
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Ridge path
for i in range(ridge_coefs.shape[1]):
    axes[0].plot(alphas, ridge_coefs[:, i], label=X_housing.columns[i])
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (Regularization Strength)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Regression: Coefficient Paths')
axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].grid(True, alpha=0.3)

# Lasso path
for i in range(lasso_coefs.shape[1]):
    axes[1].plot(alphas, lasso_coefs[:, i], label=X_housing.columns[i])
axes[1].set_xscale('log')
axes[1].set_xlabel('Alpha (Regularization Strength)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Lasso Regression: Coefficient Paths')
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📈 COEFFICIENT PATH INSIGHTS:")
print("- Ridge: Coefficients shrink smoothly but never reach exactly zero")
print("- Lasso: Coefficients can become exactly zero (automatic feature selection)")
print("- Higher alpha = more regularization = smaller coefficients")

### 🎮 Interactive Exercise 2: Hyperparameter Tuning with Cross-Validation

**Challenge:** Find the optimal regularization strength using GridSearchCV!

**Your Mission:**
1. Define parameter grids for Ridge and Lasso
2. Use cross-validation to find best parameters
3. Compare optimized models
4. Analyze the trade-off between bias and variance

**Pro Tip:** Use learning curves to understand if you need more data or a different model complexity!
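
One detail worth keeping in mind for this exercise: if the scaler is fit before cross-validation, every validation fold has already influenced the scaling. A common way around this is to put the scaler and the model in a single `Pipeline` and tune that. The sketch below illustrates the idea on synthetic data from `make_regression` (a stand-in, not the housing set); all `demo_` names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Step parameters are addressed as <step name>__<parameter>
demo_grid = {"ridge__alpha": np.logspace(-2, 2, 5)}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="r2")
demo_search.fit(X_demo, y_demo)

print("Best alpha:", demo_search.best_params_["ridge__alpha"])
print(f"Best CV R²: {demo_search.best_score_:.3f}")
```

For standardization the leakage is usually small, but the pipeline pattern costs nothing and carries over to steps where it matters more, such as feature selection.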

In [None]:
# TODO: Implement hyperparameter tuning

from sklearn.model_selection import validation_curve

print("🎯 HYPERPARAMETER TUNING WITH CROSS-VALIDATION")

# Define parameter grids
ridge_params = {'alpha': np.logspace(-2, 2, 20)}  # 0.01 to 100
lasso_params = {'alpha': np.logspace(-3, 1, 20)}  # 0.001 to 10

# Models to tune
models_to_tune = {
    'Ridge': (Ridge(), ridge_params),
    'Lasso': (Lasso(max_iter=2000), lasso_params)
}

tuning_results = {}

for model_name, (model, param_grid) in models_to_tune.items():
    print(f"\nTuning {model_name}...")
    
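    # Grid search with cross-validation.
    # Note: X_train_scaled was standardized using the whole training set, so each
    # validation fold has already influenced the scaler. The effect is small here;
    # wrapping the scaler and model in a Pipeline avoids it entirely.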
    grid_search = GridSearchCV(
        model, param_grid, cv=5, scoring='r2', n_jobs=-1
    )
    
    grid_search.fit(X_train_scaled, y_train_housing)
    
    # Store results
    tuning_results[model_name] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'best_estimator': grid_search.best_estimator_
    }
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV R² score: {grid_search.best_score_:.3f}")

# Test optimized models on test set
print(f"\n🏆 OPTIMIZED MODELS TEST PERFORMANCE:")

for model_name, results in tuning_results.items():
    best_model = results['best_estimator']
    y_pred_optimized = best_model.predict(X_test_scaled)
    test_r2 = r2_score(y_test_housing, y_pred_optimized)
    test_rmse = np.sqrt(mean_squared_error(y_test_housing, y_pred_optimized))
    
    print(f"{model_name}:")
    print(f"  CV R² Score: {results['best_cv_score']:.3f}")
    print(f"  Test R² Score: {test_r2:.3f}")
    print(f"  Test RMSE: ${test_rmse * 100:.1f}k")
    print(f"  Generalization gap: {results['best_cv_score'] - test_r2:.3f}")

In [None]:
# Create validation curves to understand the bias-variance tradeoff
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for i, (model_name, (model, param_grid)) in enumerate(models_to_tune.items()):
    param_name = list(param_grid.keys())[0]
    param_range = param_grid[param_name]
    
    train_scores, val_scores = validation_curve(
        model, X_train_scaled, y_train_housing,
        param_name=param_name, param_range=param_range,
        cv=5, scoring='r2', n_jobs=-1
    )
    
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    axes[i].semilogx(param_range, train_mean, 'o-', color='blue', 
                     label='Training Score')
    axes[i].fill_between(param_range, train_mean - train_std, 
                        train_mean + train_std, alpha=0.1, color='blue')
    
    axes[i].semilogx(param_range, val_mean, 'o-', color='red',
                     label='Validation Score') 
    axes[i].fill_between(param_range, val_mean - val_std,
                        val_mean + val_std, alpha=0.1, color='red')
    
    # Mark best parameter
    best_alpha = tuning_results[model_name]['best_params'][param_name]
    axes[i].axvline(best_alpha, color='green', linestyle='--', 
                   label=f'Best α = {best_alpha:.3f}')
    
    axes[i].set_xlabel('Alpha (Regularization Strength)')
    axes[i].set_ylabel('R² Score')
    axes[i].set_title(f'{model_name} Validation Curve')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 VALIDATION CURVE INSIGHTS:")
print("- Left side (low alpha): High variance, potential overfitting")
print("- Right side (high alpha): High bias, potential underfitting")  
print("- Sweet spot: Where validation score peaks")
print("- Gap between curves indicates overfitting tendency")

## Polynomial Features: Capturing Non-Linear Relationships 🌟

Linear models assume linear relationships, but real life is often non-linear!
Let's add polynomial features to capture curves and interactions.

**Example:** Instead of just `income`, we can include `income²`, `income³`, `income × age`, etc.

In [None]:
# Demonstrate polynomial features with a subset of data
print("🌟 POLYNOMIAL FEATURES EXPLORATION")

# Use a smaller subset for clear visualization (seeded so the results are reproducible)
subset_size = 1000
rng = np.random.default_rng(42)
indices = rng.choice(len(X_housing), subset_size, replace=False)
X_subset = X_housing.iloc[indices]
y_subset = y_housing[indices]

# Focus on the most important feature: MedInc (Median Income)
X_simple = X_subset[['MedInc']].copy()
y_simple = y_subset.copy()

# Split the simple dataset
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Compare different polynomial degrees
degrees = [1, 2, 3, 4, 5]
poly_results = {}

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for i, degree in enumerate(degrees):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly.fit_transform(X_train_simple)
    X_test_poly = poly.transform(X_test_simple)
    
    # Fit linear regression with polynomial features
    lr_poly = LinearRegression()
    lr_poly.fit(X_train_poly, y_train_simple)
    
    # Predictions
    y_pred_train = lr_poly.predict(X_train_poly)
    y_pred_test = lr_poly.predict(X_test_poly)
    
    # Metrics
    train_r2 = r2_score(y_train_simple, y_pred_train)
    test_r2 = r2_score(y_test_simple, y_pred_test)
    test_rmse = np.sqrt(mean_squared_error(y_test_simple, y_pred_test))
    
    poly_results[f'Degree {degree}'] = {
        'Train R²': train_r2,
        'Test R²': test_r2,
        'RMSE': test_rmse,
        'Overfitting': train_r2 - test_r2
    }
    
    # Plot the polynomial fit
    if i < len(axes):
        # Create smooth line for visualization
        X_plot = np.linspace(X_simple['MedInc'].min(), X_simple['MedInc'].max(), 100)
        X_plot_poly = poly.transform(X_plot.reshape(-1, 1))
        y_plot = lr_poly.predict(X_plot_poly)
        
        axes[i].scatter(X_test_simple['MedInc'], y_test_simple, alpha=0.5, s=20)
        axes[i].plot(X_plot, y_plot, 'r-', linewidth=2)
        axes[i].set_xlabel('Median Income')
        axes[i].set_ylabel('House Price ($100k)')
        axes[i].set_title(f'Polynomial Degree {degree}\nTest R² = {test_r2:.3f}')
        axes[i].grid(True, alpha=0.3)

# Performance comparison in the last subplot
if len(degrees) < len(axes):
    poly_df = pd.DataFrame(poly_results).T
    
    axes[-1].plot(degrees, poly_df['Train R²'], 'o-', label='Training R²', linewidth=2)
    axes[-1].plot(degrees, poly_df['Test R²'], 'o-', label='Test R²', linewidth=2)
    axes[-1].set_xlabel('Polynomial Degree')
    axes[-1].set_ylabel('R² Score')
    axes[-1].set_title('Polynomial Degree vs Performance')
    axes[-1].legend()
    axes[-1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Display results table
poly_df = pd.DataFrame(poly_results).T
print("📊 POLYNOMIAL FEATURES RESULTS:")
print(poly_df.round(3))

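# Identify optimal degree.
# Note: choosing the degree by test R² means the test score is no longer an
# unbiased estimate; in a real project, select the degree with cross-validation.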
optimal_degree = poly_df['Test R²'].idxmax()
print(f"\n🎯 OPTIMAL POLYNOMIAL DEGREE: {optimal_degree}")
print(f"Achieves Test R² of {poly_df.loc[optimal_degree, 'Test R²']:.3f}")

if poly_df.loc[optimal_degree, 'Overfitting'] > 0.1:
    print("⚠️  Warning: Significant overfitting detected!")
    print("Consider regularization with polynomial features.")

In [None]:
# Combine polynomial features with regularization
print("🎚️ POLYNOMIAL FEATURES + REGULARIZATION")

# Use the optimal degree found above
optimal_deg = int(optimal_degree.split()[1])

# Create pipeline with polynomial features and Ridge regression
poly_ridge_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=optimal_deg, include_bias=False)),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

# Grid search for optimal regularization
param_grid_poly = {
    'ridge__alpha': np.logspace(-2, 2, 20)
}

print("Searching for optimal regularization with polynomial features...")
grid_search_poly = GridSearchCV(
    poly_ridge_pipeline, param_grid_poly, cv=5, scoring='r2', n_jobs=-1
)

grid_search_poly.fit(X_train_simple, y_train_simple)

print(f"Best alpha for polynomial Ridge: {grid_search_poly.best_params_['ridge__alpha']:.4f}")
print(f"Best CV R² score: {grid_search_poly.best_score_:.3f}")

# Test performance
y_pred_poly_ridge = grid_search_poly.predict(X_test_simple)
test_r2_poly_ridge = r2_score(y_test_simple, y_pred_poly_ridge)
test_rmse_poly_ridge = np.sqrt(mean_squared_error(y_test_simple, y_pred_poly_ridge))

print(f"\n🏆 POLYNOMIAL + RIDGE RESULTS:")
print(f"Test R² Score: {test_r2_poly_ridge:.3f}")
print(f"Test RMSE: ${test_rmse_poly_ridge * 100:.1f}k")
print(f"Improvement over linear: {test_r2_poly_ridge - poly_df.loc['Degree 1', 'Test R²']:.3f}")

# Compare all approaches on the simple dataset
comparison_simple = {
    'Linear': poly_df.loc['Degree 1', 'Test R²'],
    f'Polynomial (deg {optimal_deg})': poly_df.loc[optimal_degree, 'Test R²'],
    f'Polynomial + Ridge': test_r2_poly_ridge
}

plt.figure(figsize=(10, 6))
bars = plt.bar(comparison_simple.keys(), comparison_simple.values(), 
               alpha=0.7, color=['blue', 'orange', 'green'])
plt.ylabel('Test R² Score')
plt.title('Model Comparison: Linear vs Polynomial vs Regularized Polynomial')
plt.ylim(0, max(comparison_simple.values()) * 1.1)

# Add value labels on bars
for bar, value in zip(bars, comparison_simple.values()):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 🎮 Interactive Exercise 3: Complete Regression Pipeline

**Final Challenge:** Build a production-ready regression pipeline for the full housing dataset!

**Your Mission:**
1. Use all features with polynomial interactions
2. Include feature selection to avoid overfitting
3. Apply proper cross-validation
4. Compare multiple algorithms (Linear, Ridge, Random Forest)
5. Create learning curves to assess if more data would help

**Real-World Skill:** This is how you'd approach a regression problem in practice!
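
Before deciding how many features to keep in task 2, it helps to know how many columns the interaction step produces. The sketch below counts them for 8 input features, both by formula and by asking `PolynomialFeatures` directly. It fits on an array of zeros because only the shape matters for counting; the variable names are illustrative.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 8  # the housing data has 8 input columns

# interaction_only=True keeps the original columns plus one product
# for every pair of distinct features (no squares)
expected_columns = n_inputs + comb(n_inputs, 2)

interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions.fit(np.zeros((1, n_inputs)))  # only the shape matters here

print(expected_columns, interactions.n_output_features_)
```

Both counts come to 36, so any `k` passed to `SelectKBest` after this step has to be at most 36, and a value such as 20 keeps a little over half of the expanded columns.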

In [None]:
# TODO: Build complete production pipeline

print("🏭 BUILDING PRODUCTION-READY REGRESSION PIPELINE")

# Define comprehensive pipeline components
from sklearn.feature_selection import SelectFromModel

# Multiple pipeline configurations to compare
pipelines = {
    'Linear + PolyFeatures': Pipeline([
        ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ('scaler', StandardScaler()),
        ('feature_select', SelectKBest(f_regression, k=20)),  # Select top 20 features
        ('regressor', LinearRegression())
    ]),
    
    'Ridge + PolyFeatures': Pipeline([
        ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ('scaler', StandardScaler()),
        ('feature_select', SelectKBest(f_regression, k=20)),
        ('regressor', Ridge(alpha=1.0))
    ]),
    
    'Lasso + PolyFeatures': Pipeline([
        ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ('scaler', StandardScaler()),
        ('regressor', Lasso(alpha=0.1, max_iter=2000))  # Lasso does its own feature selection
    ]),
    
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
    ]),
    
    'Optimized Ridge': Pipeline([
        ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ('scaler', StandardScaler()),
        ('feature_select', SelectKBest(f_regression, k=15)),
        ('regressor', Ridge())
    ])
}

# Parameter grids for optimization
param_grids = {
    'Linear + PolyFeatures': {},  # No hyperparameters to tune
    
    'Ridge + PolyFeatures': {
        'regressor__alpha': [0.1, 1.0, 10.0]
    },
    
    'Lasso + PolyFeatures': {
        'regressor__alpha': [0.01, 0.1, 1.0]
    },
    
    'Random Forest': {
        'regressor__n_estimators': [50, 100, 200],
        'regressor__max_depth': [10, 20, None]
    },
    
    'Optimized Ridge': {
        'feature_select__k': [10, 15, 20],
        'regressor__alpha': [0.1, 1.0, 10.0, 100.0]
    }
}

# Cross-validate all pipelines
pipeline_results = {}

print("Training and evaluating pipelines...")
for name, pipeline in pipelines.items():
    print(f"\nProcessing {name}...")
    
    if param_grids[name]:  # Has hyperparameters to tune
        grid_search = GridSearchCV(
            pipeline, param_grids[name], cv=5, scoring='r2', n_jobs=-1
        )
        grid_search.fit(X_train_housing, y_train_housing)
        
        best_pipeline = grid_search.best_estimator_
        cv_score = grid_search.best_score_
        best_params = grid_search.best_params_
        
        print(f"Best parameters: {best_params}")
        
    else:  # No hyperparameters to tune
        cv_scores = cross_val_score(pipeline, X_train_housing, y_train_housing, 
                                   cv=5, scoring='r2', n_jobs=-1)
        cv_score = cv_scores.mean()
        best_pipeline = pipeline
        best_pipeline.fit(X_train_housing, y_train_housing)
        best_params = "No tuning"
    
    # Test set evaluation
    y_pred_pipeline = best_pipeline.predict(X_test_housing)
    test_r2 = r2_score(y_test_housing, y_pred_pipeline)
    test_rmse = np.sqrt(mean_squared_error(y_test_housing, y_pred_pipeline))
    test_mae = mean_absolute_error(y_test_housing, y_pred_pipeline)
    
    pipeline_results[name] = {
        'CV_R2': cv_score,
        'Test_R2': test_r2,
        'Test_RMSE': test_rmse,
        'Test_MAE': test_mae,
        'Generalization_Gap': cv_score - test_r2,
        'Best_Pipeline': best_pipeline
    }
    
    print(f"CV R²: {cv_score:.3f}")
    print(f"Test R²: {test_r2:.3f}")
    print(f"Test RMSE: ${test_rmse * 100:.1f}k")

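# Create comprehensive results DataFrame.
# Dropping the pipeline objects leaves object-dtype columns, so cast to float
# to make rounding, plotting, and idxmax() behave as expected.
pipeline_df = pd.DataFrame(pipeline_results).T.drop('Best_Pipeline', axis=1).astype(float)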
print("\n🏆 FINAL PIPELINE COMPARISON:")
print(pipeline_df.round(3))

In [None]:
# Visualize pipeline comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# R² Scores
axes[0,0].bar(range(len(pipeline_df)), pipeline_df['Test_R2'], alpha=0.7, color='skyblue')
axes[0,0].set_xticks(range(len(pipeline_df)))
axes[0,0].set_xticklabels(pipeline_df.index, rotation=45, ha='right')
axes[0,0].set_ylabel('Test R² Score')
axes[0,0].set_title('Model R² Performance')
axes[0,0].grid(True, alpha=0.3)

# RMSE
axes[0,1].bar(range(len(pipeline_df)), pipeline_df['Test_RMSE'], alpha=0.7, color='lightcoral')
axes[0,1].set_xticks(range(len(pipeline_df)))
axes[0,1].set_xticklabels(pipeline_df.index, rotation=45, ha='right')
axes[0,1].set_ylabel('Test RMSE ($100k)')
axes[0,1].set_title('Model RMSE (Lower is Better)')
axes[0,1].grid(True, alpha=0.3)

# Generalization Gap
axes[1,0].bar(range(len(pipeline_df)), pipeline_df['Generalization_Gap'], alpha=0.7, color='gold')
axes[1,0].set_xticks(range(len(pipeline_df)))
axes[1,0].set_xticklabels(pipeline_df.index, rotation=45, ha='right')
axes[1,0].set_ylabel('CV R² - Test R²')
axes[1,0].set_title('Generalization Gap (Closer to 0 is Better)')
axes[1,0].grid(True, alpha=0.3)

# CV vs Test R²
axes[1,1].scatter(pipeline_df['CV_R2'], pipeline_df['Test_R2'], s=100, alpha=0.7)
axes[1,1].plot([0, 1], [0, 1], 'r--', alpha=0.5)  # Perfect generalization line
for i, model in enumerate(pipeline_df.index):
    axes[1,1].annotate(model, (pipeline_df.iloc[i]['CV_R2'], pipeline_df.iloc[i]['Test_R2']),
                      xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1,1].set_xlabel('Cross-Validation R²')
axes[1,1].set_ylabel('Test R²')
axes[1,1].set_title('Cross-Validation vs Test Performance')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Identify champion
best_model_name = pipeline_df['Test_R2'].idxmax()
best_model = pipeline_results[best_model_name]['Best_Pipeline']

print(f"\n🥇 CHAMPION MODEL: {best_model_name}")
print(f"Test R² Score: {pipeline_df.loc[best_model_name, 'Test_R2']:.3f}")
print(f"Test RMSE: ${pipeline_df.loc[best_model_name, 'Test_RMSE'] * 100:.1f}k")
print(f"Generalization Gap: {pipeline_df.loc[best_model_name, 'Generalization_Gap']:.3f}")

# Business interpretation
best_rmse = pipeline_df.loc[best_model_name, 'Test_RMSE']
best_r2 = pipeline_df.loc[best_model_name, 'Test_R2']

print(f"\n💰 BUSINESS IMPACT:")
print(f"- Model explains {best_r2:.1%} of house price variation")
print(f"- Typical prediction error (RMSE): ${best_rmse * 100:.1f}k")
print(f"- Suitable for: Property valuation, market analysis, investment decisions")

In [None]:
# Learning curves for the best model
print(f"📈 LEARNING CURVES FOR {best_model_name}")

train_sizes, train_scores_lc, val_scores_lc = learning_curve(
    best_model, X_train_housing, y_train_housing, 
    cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2', n_jobs=-1
)

plt.figure(figsize=(12, 8))

# Learning curve
plt.subplot(2, 2, 1)
plt.plot(train_sizes, train_scores_lc.mean(axis=1), 'o-', label='Training Score')
plt.plot(train_sizes, val_scores_lc.mean(axis=1), 'o-', label='Validation Score')
plt.fill_between(train_sizes, 
                 train_scores_lc.mean(axis=1) - train_scores_lc.std(axis=1),
                 train_scores_lc.mean(axis=1) + train_scores_lc.std(axis=1), alpha=0.1)
plt.fill_between(train_sizes,
                 val_scores_lc.mean(axis=1) - val_scores_lc.std(axis=1),
                 val_scores_lc.mean(axis=1) + val_scores_lc.std(axis=1), alpha=0.1)
plt.xlabel('Training Set Size')
plt.ylabel('R² Score')
plt.title('Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Prediction vs Actual for best model
y_pred_best = best_model.predict(X_test_housing)
plt.subplot(2, 2, 2)
plt.scatter(y_test_housing, y_pred_best, alpha=0.6, s=20)
plt.plot([y_test_housing.min(), y_test_housing.max()], 
         [y_test_housing.min(), y_test_housing.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($100k)')
plt.ylabel('Predicted Price ($100k)')
plt.title(f'{best_model_name}: Predictions vs Actual')
plt.grid(True, alpha=0.3)

# Residual plot
residuals_best = y_test_housing - y_pred_best
plt.subplot(2, 2, 3)
plt.scatter(y_pred_best, residuals_best, alpha=0.6, s=20)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($100k)')
plt.ylabel('Residuals ($100k)')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)

# Residual histogram
plt.subplot(2, 2, 4)
plt.hist(residuals_best, bins=30, alpha=0.7, edgecolor='black')
plt.xlabel('Residuals ($100k)')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.axvline(x=0, color='r', linestyle='--')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Learning curve interpretation
final_gap = train_scores_lc.mean(axis=1)[-1] - val_scores_lc.mean(axis=1)[-1]
final_val_score = val_scores_lc.mean(axis=1)[-1]

print(f"\n📊 LEARNING CURVE ANALYSIS:")
print(f"Final validation score: {final_val_score:.3f}")
print(f"Training-validation gap: {final_gap:.3f}")

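# Heuristic thresholds: a train/validation gap above 0.05 suggests high variance,
# and a validation R² below 0.7 suggests high bias. Adjust both to the problem.
if final_gap > 0.05: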
    print("🔴 Still overfitting - consider:")
    print("  • More regularization")
    print("  • More training data") 
    print("  • Feature selection")
elif final_val_score < 0.7:
    print("🟡 Underfitting - consider:")
    print("  • More complex model")
    print("  • More features")
    print("  • Less regularization")
else:
    print("🟢 Good bias-variance tradeoff!")

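# Check if more data would help. This compares only the last two training sizes,
# so a small difference may be fold-to-fold noise; treat the result as a hint.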
if val_scores_lc.mean(axis=1)[-1] > val_scores_lc.mean(axis=1)[-2]:
    print("📈 Performance still improving - more data might help!")
else:
    print("📉 Performance plateaued - more data may not help much")
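
The learning-curve checks above can be wrapped in a small reusable helper. The sketch below is one way to do it, not the workshop's own method: `diagnose_learning_curve`, its default thresholds, and the rule that a gain must exceed the fold-to-fold standard deviation are all assumptions made for illustration. It runs on synthetic data with a `Ridge` model, not on the housing set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve


def diagnose_learning_curve(train_scores, val_scores, gap_tol=0.05, min_val=0.7):
    """Summarise a learning curve; the thresholds are heuristic assumptions."""
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    gap = train_mean[-1] - val_mean[-1]
    last_gain = val_mean[-1] - val_mean[-2]

    if gap > gap_tol:
        verdict = "high variance"
    elif val_mean[-1] < min_val:
        verdict = "high bias"
    else:
        verdict = "balanced"

    return {
        "gap": float(gap),
        "final_val": float(val_mean[-1]),
        "verdict": verdict,
        # Count the last step as an improvement only if it exceeds fold noise
        "still_improving": bool(last_gain > val_std[-1]),
    }


# Synthetic stand-in data (not the housing set)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
sizes, train_sc, val_sc = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5), scoring="r2",
)
report = diagnose_learning_curve(train_sc, val_sc)
print(report)
```

Comparing the final gain against the spread across folds is stricter than checking whether the last point is merely higher, which reduces false "more data might help" signals on noisy curves.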

## 🏆 Workshop Summary & Key Insights

Congratulations! You've mastered comprehensive regression analysis. Here's what you've accomplished:

### 🎯 Technical Skills Mastered:

1. **Linear Regression Fundamentals**: Built and evaluated basic models
2. **Regularization Techniques**: Applied Ridge, Lasso, and Elastic Net
3. **Polynomial Features**: Captured non-linear relationships
4. **Pipeline Construction**: Built production-ready ML pipelines
5. **Hyperparameter Tuning**: Optimized models with GridSearchCV
6. **Feature Selection**: Automated feature selection techniques
7. **Model Evaluation**: Comprehensive metrics beyond just R²
8. **Learning Curves**: Diagnosed bias vs variance tradeoffs

### 🎨 Advanced Techniques Applied:

- **Feature Engineering**: Polynomial and interaction terms
- **Cross-Validation**: Robust model selection
- **Regularization Paths**: Understanding coefficient behavior
- **Pipeline Integration**: Seamless preprocessing and modeling
- **Ensemble Methods**: Random Forest for non-linear patterns

### 💰 Business Impact Understanding:

- **Model Selection**: Choosing algorithms based on problem requirements
- **Performance Metrics**: RMSE reports error in the target's own units, which makes its magnitude easy to explain
- **Overfitting Detection**: Recognizing and preventing poor generalization
- **Production Readiness**: Building models that work in the real world
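
To make the metrics point concrete, here is a minimal sketch that computes RMSE, MAE, and R² and converts the errors to dollars. The prices are made-up values in units of $100k, chosen to match the scale of the housing target; `UNIT_DOLLARS` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices in units of $100k (illustrative, not workshop data)
y_true = np.array([2.0, 3.5, 1.2, 4.8, 2.6])
y_pred = np.array([2.3, 3.1, 1.0, 4.2, 2.9])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # weights large misses more heavily
mae = mean_absolute_error(y_true, y_pred)           # mean absolute miss
r2 = r2_score(y_true, y_pred)                       # share of variance explained

UNIT_DOLLARS = 100_000  # one target unit expressed in dollars
print(f"RMSE: ${rmse * UNIT_DOLLARS:,.0f}")
print(f"MAE:  ${mae * UNIT_DOLLARS:,.0f}")
print(f"R²:   {r2:.3f}")
```

RMSE is never smaller than MAE on the same predictions. A large gap between the two points to a few big misses rather than uniformly moderate errors.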

### 🚀 Next Level Recommendations:

1. **Advanced Ensemble Methods**: XGBoost, LightGBM
2. **Time Series Regression**: When data has temporal components
3. **Bayesian Regression**: Uncertainty quantification
4. **Deep Learning**: Neural networks for complex patterns
5. **Model Deployment**: Flask, FastAPI, cloud platforms
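
As a first step toward the uncertainty quantification mentioned above, scikit-learn's `BayesianRidge` can return a predictive standard deviation alongside each prediction. The sketch below uses synthetic data, and the 1.96 multiplier assumes the model's Gaussian predictive distribution is a reasonable fit, which should be checked on real data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing set)
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = BayesianRidge()
model.fit(X_train, y_train)

# return_std=True also returns the standard deviation of the predictive distribution
y_mean, y_std = model.predict(X_test, return_std=True)

# Rough 95% interval under the model's Gaussian assumption
lower = y_mean - 1.96 * y_std
upper = y_mean + 1.96 * y_std
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Share of test targets inside the interval: {coverage:.2f}")
```

Checking the empirical coverage against the nominal 95% is a quick sanity test of whether the reported uncertainty can be trusted.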

In [None]:
print("🎉 REGRESSION WORKSHOP COMPLETE! 🎉")
print("\n📈 Regression Mastery Achieved:")
print("✅ Linear and regularized regression")
print("✅ Polynomial feature engineering")
print("✅ Advanced model pipelines")
print("✅ Hyperparameter optimization")
print("✅ Production-ready evaluation")

print(f"\n🏆 Your Champion Model: {best_model_name}")
print(f"📊 Performance: R² = {pipeline_df.loc[best_model_name, 'Test_R2']:.3f}")
print(f"💰 Business Value: ${pipeline_df.loc[best_model_name, 'Test_RMSE']:.1f}k typical error (RMSE)")

print("\n🚀 You're now ready to:")
print("• Predict house prices, sales figures, stock prices")
print("• Build recommendation systems with ratings")
print("• Forecast demand, revenue, and other business metrics")
print("• Tackle a wide range of continuous prediction problems!")

print("\n🎯 FINAL CHALLENGE:")
print("Can you build a model that achieves R² > 0.85 on the housing dataset?")
print("Hint: Try feature engineering, ensemble methods, or stacking!")

# Nothing is written to disk in this cell; persisting the pipeline is a next step
print("\n💾 Saving the best model pipeline:")
print("In production, you'd use joblib.dump() to save this pipeline!")
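
A minimal sketch of the `joblib.dump()` step mentioned above, including a round-trip check. It fits a stand-in pipeline on synthetic data, where the workshop would pass `best_model` instead. The file name `best_model.joblib` and the temporary directory are assumptions for the demo.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline on synthetic data; in the workshop this would be best_model
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
pipeline = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipeline.fit(X, y)

# Hypothetical file name, written to a temporary directory for this demo
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(pipeline, model_path)

# Reload and confirm the restored pipeline predicts identically
restored = joblib.load(model_path)
same = np.allclose(pipeline.predict(X), restored.predict(X))
print(f"Restored predictions match: {same}")
```

Saving the whole pipeline keeps the fitted scaler and the model together, so new data goes through the same preprocessing at prediction time. Load such files only from trusted sources and with a matching scikit-learn version, since they are pickle-based.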