# Support Vector Regression (SVR) Analysis

This notebook demonstrates Support Vector Regression on real-world datasets, comparing different kernel approaches and baseline regression models.

## Table of Contents
1. [SVR Theory and Implementation](#svr-theory)
2. [California Housing Dataset Analysis](#california-housing)
3. [Wine Quality Prediction](#wine-quality)
4. [Model Comparison and Evaluation](#comparison)
5. [Hyperparameter Optimization](#optimization)
6. [Results and Insights](#results)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
sys.path.append(os.path.abspath('..'))

# Import our custom implementations
from src.svm.svr import SVR
from src.svm.kernels import *
from src.utils.data_loader import DataLoader
from src.utils.preprocessing import DataPreprocessor
from src.utils.visualization import SVMVisualizer
from src.utils.evaluation import RegressionEvaluator, ModelComparator
from src.utils.baseline_models import ModelBenchmark

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Initialize components
data_loader = DataLoader()
preprocessor = DataPreprocessor()
visualizer = SVMVisualizer()
evaluator = RegressionEvaluator()
comparator = ModelComparator()

## 1. SVR Theory and Implementation {#svr-theory}

Support Vector Regression (SVR) extends the SVM concept to regression problems. Instead of finding a separating hyperplane, SVR finds a function that approximates the training data within a specified tolerance $\epsilon$.

### Key Concepts:

1. **ε-insensitive loss**: Errors smaller than ε are not penalized
2. **Support vectors**: Points on or outside the ε-tube
3. **Regularization**: Balance between model complexity and approximation error
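
For reference, the ε-insensitive loss named in the first concept is commonly written as follows, where $f(x) = w^T x + b$ denotes the model's prediction:

$$L_\epsilon(y, f(x)) = \max\left(0,\; |y - f(x)| - \epsilon\right)$$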

### Mathematical Formulation:

The primal optimization problem:
$$\min_{w,b,\xi,\xi^*} \frac{1}{2}||w||^2 + C \sum_{i=1}^n (\xi_i + \xi_i^*)$$

Subject to:
- $y_i - w^T x_i - b \leq \epsilon + \xi_i$
- $w^T x_i + b - y_i \leq \epsilon + \xi_i^*$
- $\xi_i, \xi_i^* \geq 0$
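
To make the formulation concrete, the sketch below evaluates the primal objective for a fixed linear model on a toy dataset. It assumes the slack variables take their smallest feasible values for the given $(w, b)$. The function names and toy data are illustrative only and are not part of the project's `SVR` class.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

def primal_objective(w, b, X, y, C, epsilon):
    """0.5 * ||w||^2 + C * sum(xi_i + xi_i^*) for a fixed (w, b)."""
    residuals = y - (X @ w + b)
    # Smallest slacks that satisfy the two inequality constraints
    xi = np.maximum(0.0, residuals - epsilon)        # point lies above the tube
    xi_star = np.maximum(0.0, -residuals - epsilon)  # point lies below the tube
    return 0.5 * np.dot(w, w) + C * np.sum(xi + xi_star)

# Toy data (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.05, 2.5, 2.0])
w_toy, b_toy = np.array([1.0]), 0.0

losses = epsilon_insensitive_loss(y_toy, X_toy @ w_toy + b_toy, epsilon=0.1)
objective = primal_objective(w_toy, b_toy, X_toy, y_toy, C=1.0, epsilon=0.1)
print("Per-sample losses:", losses)
print("Primal objective:", objective)
```

The first two points fall inside the tube and contribute nothing, so only the last two add to the slack term.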

In [None]:
# Demonstrate SVR concept with synthetic data
def demonstrate_svr_concept():
    np.random.seed(42)
    
    # Generate synthetic 1D data
    X_demo = np.linspace(0, 10, 100).reshape(-1, 1)
    y_demo = 2 * np.sin(X_demo.flatten()) + 0.5 * X_demo.flatten() + np.random.normal(0, 0.3, 100)
    
    # Train SVR with different epsilon values
    epsilons = [0.1, 0.5, 1.0]
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    for idx, epsilon in enumerate(epsilons):
        # Train SVR
        svr_demo = SVR(kernel='rbf', C=1.0, epsilon=epsilon, gamma=0.1)
        svr_demo.fit(X_demo, y_demo)
        
        # Make predictions
        X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
        y_pred = svr_demo.predict(X_plot)
        
        # Plot results
        ax = axes[idx]
        ax.scatter(X_demo, y_demo, alpha=0.6, s=30, label='Training Data')
        ax.plot(X_plot, y_pred, 'r-', linewidth=2, label='SVR Prediction')
        
        # Plot epsilon tube
        ax.fill_between(X_plot.flatten(), y_pred - epsilon, y_pred + epsilon, 
                       alpha=0.2, color='red', label=f'ε-tube (ε={epsilon})')
        
        # Highlight support vectors
        if hasattr(svr_demo, 'support_vector_indices_'):
            support_indices = svr_demo.support_vector_indices_
            ax.scatter(X_demo[support_indices], y_demo[support_indices], 
                      s=100, facecolors='none', edgecolors='black', linewidth=2,
                      label=f'Support Vectors ({len(support_indices)})')
        
        ax.set_title(f'SVR with ε = {epsilon}')
        ax.set_xlabel('X')
        ax.set_ylabel('y')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.suptitle('Support Vector Regression: Effect of ε Parameter', fontsize=16)
    plt.tight_layout()
    plt.show()
    
    print("Key Observations:")
    print("• Smaller ε: More support vectors, tighter fit")
    print("• Larger ε: Fewer support vectors, smoother fit")
    print("• Support vectors are points on or outside the ε-tube")

demonstrate_svr_concept()
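
The observation that a larger ε yields fewer support vectors can also be checked numerically. The sketch below is an assumption-laden cross-check: it uses scikit-learn's `SVR` (imported as `SklearnSVR`) rather than the custom class, on synthetic data similar to the demonstration, so the exact counts will differ from the plots.

```python
import numpy as np
from sklearn.svm import SVR as SklearnSVR

# Synthetic 1D data in the same spirit as the demonstration
rng = np.random.default_rng(42)
X_check = np.linspace(0, 10, 100).reshape(-1, 1)
y_check = 2 * np.sin(X_check.ravel()) + 0.5 * X_check.ravel() + rng.normal(0, 0.3, 100)

sv_counts = {}
for eps in [0.1, 0.5, 1.0]:
    model = SklearnSVR(kernel='rbf', C=1.0, epsilon=eps, gamma=0.1).fit(X_check, y_check)
    # support_ holds the indices of training points with nonzero dual coefficients
    sv_counts[eps] = len(model.support_)
    print(f"epsilon={eps}: {sv_counts[eps]} support vectors")
```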

## 2. California Housing Dataset Analysis {#california-housing}

Let's analyze the California Housing dataset - a classic regression problem.

In [None]:
# Load California Housing dataset
print("Loading California Housing Dataset...")
housing_data = data_loader.load_california_housing_data()

X_housing = housing_data['X']
y_housing = housing_data['y']
feature_names = housing_data['feature_names']

print(f"Dataset shape: {X_housing.shape}")
print(f"Features: {feature_names}")
print(f"Target range: {y_housing.min():.2f} - {y_housing.max():.2f}")
print(f"Target mean: {y_housing.mean():.2f}")

In [None]:
# Exploratory Data Analysis for Housing dataset
housing_df = pd.DataFrame(X_housing, columns=feature_names)
housing_df['MedHouseVal'] = y_housing

print("Dataset Info:")
print(housing_df.info())
print("\nBasic Statistics:")
print(housing_df.describe())

# Visualizations
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
axes = axes.ravel()

# Target distribution
axes[0].hist(y_housing, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Distribution of House Values')
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')

# Feature distributions
for idx, feature in enumerate(feature_names[:7], 1):
    axes[idx].hist(housing_df[feature], bins=30, alpha=0.7, color='lightgreen')
    axes[idx].set_title(f'Distribution of {feature}')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

# Correlation with target (drop the target's own self-correlation of 1.0)
correlations = housing_df.corr()['MedHouseVal'].drop('MedHouseVal').sort_values(ascending=False)
axes[8].barh(range(len(correlations)), correlations.values,
            color=['red' if x > 0 else 'blue' for x in correlations.values])
axes[8].set_yticks(range(len(correlations)))
axes[8].set_yticklabels(correlations.index)
axes[8].set_title('Feature Correlation with House Value')
axes[8].set_xlabel('Correlation Coefficient')

plt.tight_layout()
plt.show()

In [None]:
# Preprocess California Housing data
print("Preprocessing California Housing data...")

# Split the data
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

# Scale the features
X_train_housing_scaled, X_test_housing_scaled = preprocessor.scale_features(
    X_train_housing, X_test_housing
)

print(f"Training set shape: {X_train_housing_scaled.shape}")
print(f"Test set shape: {X_test_housing_scaled.shape}")
print(f"Training target range: {y_train_housing.min():.2f} - {y_train_housing.max():.2f}")
print(f"Test target range: {y_test_housing.min():.2f} - {y_test_housing.max():.2f}")
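
`preprocessor.scale_features` is a project helper whose source is not shown in this notebook. The sketch below shows what such a helper typically does, assuming it standardizes features with statistics learned from the training split only. The name `scale_features_sketch` and the random data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_features_sketch(X_train, X_test):
    """Hypothetical stand-in for preprocessor.scale_features."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training set only
    X_test_scaled = scaler.transform(X_test)        # reuse them; never refit on test data
    return X_train_scaled, X_test_scaled

# Random stand-in data (not the housing features)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

X_tr_s, X_te_s = scale_features_sketch(X_tr, X_te)
print("Train mean:", X_tr_s.mean(axis=0).round(3))
print("Train std: ", X_tr_s.std(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps test-set information out of the model, which matters for SVR because kernel values depend directly on feature scale.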

In [None]:
# Train SVR models on California Housing data
print("Training SVR models on California Housing data...")

# Linear SVR
print("Training Linear SVR...")
svr_linear_housing = SVR(kernel='linear', C=1.0, epsilon=0.1)
svr_linear_housing.fit(X_train_housing_scaled, y_train_housing)
y_pred_linear_housing = svr_linear_housing.predict(X_test_housing_scaled)

# RBF SVR
print("Training RBF SVR...")
svr_rbf_housing = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=0.1)
svr_rbf_housing.fit(X_train_housing_scaled, y_train_housing)
y_pred_rbf_housing = svr_rbf_housing.predict(X_test_housing_scaled)

# Polynomial SVR
print("Training Polynomial SVR...")
svr_poly_housing = SVR(kernel='polynomial', degree=2, C=1.0, epsilon=0.1)
svr_poly_housing.fit(X_train_housing_scaled, y_train_housing)
y_pred_poly_housing = svr_poly_housing.predict(X_test_housing_scaled)

# Evaluate models
print("\nSVR Model Results on California Housing:")
print("-" * 50)

svr_models_housing = {
    'Linear SVR (Custom)': (svr_linear_housing, y_pred_linear_housing),
    'RBF SVR (Custom)': (svr_rbf_housing, y_pred_rbf_housing),
    'Polynomial SVR (Custom)': (svr_poly_housing, y_pred_poly_housing)
}

housing_results_custom = {}
for name, (model, y_pred) in svr_models_housing.items():
    metrics = evaluator.evaluate(y_test_housing, y_pred)
    housing_results_custom[name] = metrics
    
    print(f"\n{name}:")
    print(f"  RMSE: {metrics['rmse']:.4f}")
    print(f"  MAE: {metrics['mae']:.4f}")
    print(f"  R² Score: {metrics['r2_score']:.4f}")
    print(f"  MAPE: {metrics['mape']:.2f}%")
    if hasattr(model, 'support_vectors_'):
        print(f"  Support Vectors: {len(model.support_vectors_)}")
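
`evaluator.evaluate` is project code, so its exact definitions are not visible here. Assuming it follows the standard formulas and reports MAPE in percent, the four metrics can be reproduced roughly as follows. The name `evaluate_sketch` and the sample values are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_sketch(y_true, y_pred):
    """Hypothetical stand-in returning the keys used in this notebook."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        'rmse': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'mae': float(mean_absolute_error(y_true, y_pred)),
        'r2_score': float(r2_score(y_true, y_pred)),
        # MAPE in percent; undefined when y_true contains zeros
        'mape': float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }

metrics_demo = evaluate_sketch([2.0, 4.0, 5.0], [2.5, 3.5, 5.0])
for key, value in metrics_demo.items():
    print(f"{key}: {value:.4f}")
```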

In [None]:
# Visualize SVR results for California Housing
# Predictions vs Actual plots
for name, (model, y_pred) in svr_models_housing.items():
    # Open the figure before plotting so the helper draws into it
    # (assumes the evaluator's plot helpers draw on the current figure)
    plt.figure(figsize=(6, 6))
    evaluator.plot_predictions_vs_actual(y_test_housing, y_pred,
                                        title=f'{name}\nPredictions vs Actual')
    plt.show()

# Residual plots
for name, (model, y_pred) in svr_models_housing.items():
    plt.figure(figsize=(12, 5))
    evaluator.plot_residuals(y_test_housing, y_pred,
                           title=f'{name}\nResidual Analysis')
    plt.show()

## 3. Wine Quality Prediction {#wine-quality}

Now let's work with the Wine Quality dataset.

In [None]:
# Load Wine Quality dataset
print("Loading Wine Quality Dataset...")
wine_data = data_loader.load_wine_quality_data()

X_wine = wine_data['X']
y_wine = wine_data['y']
feature_names_wine = wine_data['feature_names']

print(f"Dataset shape: {X_wine.shape}")
print(f"Features: {feature_names_wine}")
print(f"Quality range: {y_wine.min()} - {y_wine.max()}")
print(f"Quality mean: {y_wine.mean():.2f}")
print(f"Quality distribution: {np.bincount(y_wine.astype(int))}")

In [None]:
# Exploratory Data Analysis for Wine Quality
wine_df = pd.DataFrame(X_wine, columns=feature_names_wine)
wine_df['quality'] = y_wine

print("Wine Quality Dataset Info:")
print(wine_df.info())

# Visualizations
fig, axes = plt.subplots(3, 4, figsize=(20, 15))
axes = axes.ravel()

# Quality distribution
quality_counts = pd.Series(y_wine).value_counts().sort_index()
axes[0].bar(quality_counts.index, quality_counts.values, color='purple', alpha=0.7)
axes[0].set_title('Wine Quality Distribution')
axes[0].set_xlabel('Quality Score')
axes[0].set_ylabel('Count')

# Feature distributions
for idx, feature in enumerate(feature_names_wine[:10], 1):
    axes[idx].hist(wine_df[feature], bins=30, alpha=0.7, color='lightcoral')
    axes[idx].set_title(f'Distribution of {feature}')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

# Correlation heatmap
correlation_matrix = wine_df.corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
ax = axes[11]
sns.heatmap(correlation_matrix, mask=mask, annot=False, cmap='coolwarm', 
           center=0, ax=ax, cbar_kws={'shrink': 0.8})
ax.set_title('Feature Correlation Matrix')

plt.tight_layout()
plt.show()

# Show features most correlated with quality
quality_corr = wine_df.corr()['quality'].drop('quality').sort_values(ascending=False)
print("\nFeatures most correlated with wine quality:")
print(quality_corr.round(3))  # quality's self-correlation is dropped above

In [None]:
# Preprocess Wine Quality data
print("Preprocessing Wine Quality data...")

# Split the data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42
)

# Scale the features
X_train_wine_scaled, X_test_wine_scaled = preprocessor.scale_features(
    X_train_wine, X_test_wine
)

print(f"Training set shape: {X_train_wine_scaled.shape}")
print(f"Test set shape: {X_test_wine_scaled.shape}")
print(f"Training quality range: {y_train_wine.min()} - {y_train_wine.max()}")
print(f"Test quality range: {y_test_wine.min()} - {y_test_wine.max()}")

In [None]:
# Train SVR models on Wine Quality data
print("Training SVR models on Wine Quality data...")

# Linear SVR
print("Training Linear SVR...")
svr_linear_wine = SVR(kernel='linear', C=1.0, epsilon=0.1)
svr_linear_wine.fit(X_train_wine_scaled, y_train_wine)
y_pred_linear_wine = svr_linear_wine.predict(X_test_wine_scaled)

# RBF SVR
print("Training RBF SVR...")
svr_rbf_wine = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=0.1)
svr_rbf_wine.fit(X_train_wine_scaled, y_train_wine)
y_pred_rbf_wine = svr_rbf_wine.predict(X_test_wine_scaled)

# Polynomial SVR
print("Training Polynomial SVR...")
svr_poly_wine = SVR(kernel='polynomial', degree=2, C=1.0, epsilon=0.1)
svr_poly_wine.fit(X_train_wine_scaled, y_train_wine)
y_pred_poly_wine = svr_poly_wine.predict(X_test_wine_scaled)

# Evaluate models
print("\nSVR Model Results on Wine Quality:")
print("-" * 45)

svr_models_wine = {
    'Linear SVR (Custom)': (svr_linear_wine, y_pred_linear_wine),
    'RBF SVR (Custom)': (svr_rbf_wine, y_pred_rbf_wine),
    'Polynomial SVR (Custom)': (svr_poly_wine, y_pred_poly_wine)
}

wine_results_custom = {}
for name, (model, y_pred) in svr_models_wine.items():
    metrics = evaluator.evaluate(y_test_wine, y_pred)
    wine_results_custom[name] = metrics
    
    print(f"\n{name}:")
    print(f"  RMSE: {metrics['rmse']:.4f}")
    print(f"  MAE: {metrics['mae']:.4f}")
    print(f"  R² Score: {metrics['r2_score']:.4f}")
    print(f"  MAPE: {metrics['mape']:.2f}%")
    if hasattr(model, 'support_vectors_'):
        print(f"  Support Vectors: {len(model.support_vectors_)}")

In [None]:
# Visualize Wine Quality SVR results
for name, (model, y_pred) in svr_models_wine.items():
    # Predictions vs Actual
    # Open the figure before plotting so the helper draws into it
    # (assumes the evaluator's plot helpers draw on the current figure)
    plt.figure(figsize=(8, 8))
    evaluator.plot_predictions_vs_actual(y_test_wine, y_pred,
                                        title=f'{name}\nWine Quality Predictions vs Actual')
    plt.show()

    # Residual analysis
    plt.figure(figsize=(15, 6))
    evaluator.plot_residuals(y_test_wine, y_pred,
                           title=f'{name}\nWine Quality Residual Analysis')
    plt.show()

## 4. Model Comparison and Evaluation {#comparison}

Let's compare our SVR implementations with baseline regression models.
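
`ModelBenchmark` is a project-specific class, so the exact set of baselines it trains is not visible in this notebook. As a rough idea of what such a benchmark does, the sketch below fits a few scikit-learn regressors on synthetic data and ranks them by test R². The choice of models, their settings, and the data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR as SklearnSVR

# Synthetic stand-in data (not one of the notebook's datasets)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Scale using training statistics only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVR (RBF)': SklearnSVR(kernel='rbf', C=100.0, epsilon=0.1),
}

benchmark_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    benchmark_scores[name] = r2_score(y_te, model.predict(X_te))

# Rank models from best to worst test R²
for name, score in sorted(benchmark_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R² = {score:.4f}")

best_name = max(benchmark_scores, key=benchmark_scores.get)
print(f"Best model on the synthetic data: {best_name}")
```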

In [None]:
# Comprehensive model comparison on California Housing dataset
print("Running comprehensive model comparison on California Housing dataset...")

# Prepare custom models for comparison
custom_models_housing = {
    'Linear SVR (Custom)': svr_linear_housing,
    'RBF SVR (Custom)': svr_rbf_housing,
    'Polynomial SVR (Custom)': svr_poly_housing
}

# Run benchmark
benchmark = ModelBenchmark(random_state=42)
housing_comparison_results = benchmark.run_regression_benchmark(
    X_train_housing_scaled, X_test_housing_scaled, 
    y_train_housing, y_test_housing, 
    custom_models=custom_models_housing
)

# Get best model
best_model_housing, best_metrics_housing = benchmark.get_best_model('regression')
print(f"\nBest model for California Housing: {best_model_housing}")
print(f"Best R² Score: {best_metrics_housing['r2_score']:.4f}")

In [None]:
# Model comparison on Wine Quality dataset
print("Running model comparison on Wine Quality dataset...")

custom_models_wine = {
    'Linear SVR (Custom)': svr_linear_wine,
    'RBF SVR (Custom)': svr_rbf_wine,
    'Polynomial SVR (Custom)': svr_poly_wine
}

wine_comparison_results = benchmark.run_regression_benchmark(
    X_train_wine_scaled, X_test_wine_scaled, 
    y_train_wine, y_test_wine, 
    custom_models=custom_models_wine
)

best_model_wine, best_metrics_wine = benchmark.get_best_model('regression')
print(f"\nBest model for Wine Quality: {best_model_wine}")
print(f"Best R² Score: {best_metrics_wine['r2_score']:.4f}")

In [None]:
# Create comprehensive comparison visualizations
def build_comparison_df(results):
    """Collect the headline metrics of every successfully evaluated model."""
    return pd.DataFrame({
        model: {
            'R² Score': metrics['r2_score'],
            'RMSE': metrics['rmse'],
            'MAE': metrics['mae'],
            'MAPE': metrics['mape']
        }
        for model, metrics in results.items()
        if isinstance(metrics, dict) and 'r2_score' in metrics
    }).T


def plot_metric_bars(ax, series, title, ylabel, base_color):
    """Bar chart of one metric; custom models are highlighted in red."""
    bars = ax.bar(range(len(series)), series.values,
                  color=['red' if 'Custom' in name else base_color for name in series.index])
    ax.set_title(title)
    ax.set_ylabel(ylabel)
    ax.set_xticks(range(len(series)))
    ax.set_xticklabels([name.replace(' (Custom)', '\n(Custom)') for name in series.index],
                       rotation=45, ha='right')
    # Annotate each bar with its value
    for bar, value in zip(bars, series.values):
        ax.text(bar.get_x() + bar.get_width() / 2., bar.get_height() + 0.01,
                f'{value:.3f}', ha='center', va='bottom')


housing_comparison_df = build_comparison_df(housing_comparison_results)
wine_comparison_df = build_comparison_df(wine_comparison_results)

# Plot comparisons: R² sorted best-first (descending), RMSE sorted best-first (ascending)
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

plot_metric_bars(axes[0, 0], housing_comparison_df['R² Score'].sort_values(ascending=False),
                 'California Housing - R² Score Comparison', 'R² Score', 'skyblue')
plot_metric_bars(axes[0, 1], housing_comparison_df['RMSE'].sort_values(ascending=True),
                 'California Housing - RMSE Comparison', 'RMSE', 'lightcoral')
plot_metric_bars(axes[1, 0], wine_comparison_df['R² Score'].sort_values(ascending=False),
                 'Wine Quality - R² Score Comparison', 'R² Score', 'lightgreen')
plot_metric_bars(axes[1, 1], wine_comparison_df['RMSE'].sort_values(ascending=True),
                 'Wine Quality - RMSE Comparison', 'RMSE', 'orange')

plt.suptitle('Regression Model Comparison', fontsize=16)
plt.tight_layout()
plt.show()

print("\nCalifornia Housing Results Summary:")
print(housing_comparison_df.round(4))

print("\nWine Quality Results Summary:")
print(wine_comparison_df.round(4))

## 5. Hyperparameter Optimization {#optimization}

Let's optimize SVR hyperparameters to improve performance.

In [None]:
# Hyperparameter optimization for SVR
from sklearn.model_selection import validation_curve
from sklearn.svm import SVR as SklearnSVR

print("SVR Hyperparameter Optimization...")

# Test different C values for California Housing
C_range = np.logspace(-2, 3, 6)
train_scores, test_scores = validation_curve(
    SklearnSVR(kernel='rbf', gamma='scale'), 
    X_train_housing_scaled, y_train_housing, 
    param_name='C', param_range=C_range, 
    cv=5, scoring='r2'
)

train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.figure(figsize=(15, 5))

# C parameter validation curve
plt.subplot(1, 3, 1)
plt.semilogx(C_range, train_mean, 'o-', color='blue', label='Training')
plt.semilogx(C_range, test_mean, 'o-', color='red', label='Cross-validation')
plt.xlabel('C Parameter')
plt.ylabel('R² Score')
plt.title('California Housing: C Parameter')
plt.legend()
plt.grid(True, alpha=0.3)

optimal_C = C_range[np.argmax(test_mean)]
print(f"Optimal C for California Housing: {optimal_C}")

# Test different epsilon values
epsilon_range = np.logspace(-3, 0, 4)
train_scores_eps, test_scores_eps = validation_curve(
    SklearnSVR(kernel='rbf', C=optimal_C, gamma='scale'), 
    X_train_housing_scaled, y_train_housing, 
    param_name='epsilon', param_range=epsilon_range, 
    cv=5, scoring='r2'
)

train_mean_eps = np.mean(train_scores_eps, axis=1)
test_mean_eps = np.mean(test_scores_eps, axis=1)

plt.subplot(1, 3, 2)
plt.semilogx(epsilon_range, train_mean_eps, 'o-', color='blue', label='Training')
plt.semilogx(epsilon_range, test_mean_eps, 'o-', color='red', label='Cross-validation')
plt.xlabel('Epsilon Parameter')
plt.ylabel('R² Score')
plt.title('California Housing: Epsilon Parameter')
plt.legend()
plt.grid(True, alpha=0.3)

optimal_epsilon = epsilon_range[np.argmax(test_mean_eps)]
print(f"Optimal Epsilon for California Housing: {optimal_epsilon}")

# Test different gamma values
gamma_range = np.logspace(-4, 0, 5)
train_scores_gamma, test_scores_gamma = validation_curve(
    SklearnSVR(kernel='rbf', C=optimal_C, epsilon=optimal_epsilon), 
    X_train_housing_scaled, y_train_housing, 
    param_name='gamma', param_range=gamma_range, 
    cv=5, scoring='r2'
)

train_mean_gamma = np.mean(train_scores_gamma, axis=1)
test_mean_gamma = np.mean(test_scores_gamma, axis=1)

plt.subplot(1, 3, 3)
plt.semilogx(gamma_range, train_mean_gamma, 'o-', color='blue', label='Training')
plt.semilogx(gamma_range, test_mean_gamma, 'o-', color='red', label='Cross-validation')
plt.xlabel('Gamma Parameter')
plt.ylabel('R² Score')
plt.title('California Housing: Gamma Parameter')
plt.legend()
plt.grid(True, alpha=0.3)

optimal_gamma = gamma_range[np.argmax(test_mean_gamma)]
print(f"Optimal Gamma for California Housing: {optimal_gamma}")

plt.tight_layout()
plt.show()

In [None]:
# Grid search for optimal SVR hyperparameters
print("\nRunning Grid Search for optimal SVR hyperparameters...")

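# Parameter grid for California Housing
# gamma is ignored by the linear kernel, so it is searched only for RBF
param_grid_housing = [
    {
        'kernel': ['rbf'],
        'C': [0.1, 1, 10, 100],
        'epsilon': [0.01, 0.1, 0.2],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1]
    },
    {
        'kernel': ['linear'],
        'C': [0.1, 1, 10, 100],
        'epsilon': [0.01, 0.1, 0.2]
    }
]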

grid_search_housing = GridSearchCV(
    SklearnSVR(), param_grid_housing, cv=5, scoring='r2', n_jobs=-1
)

grid_search_housing.fit(X_train_housing_scaled, y_train_housing)

print(f"Best parameters for California Housing: {grid_search_housing.best_params_}")
print(f"Best cross-validation R² score: {grid_search_housing.best_score_:.4f}")

# Test the optimized model
best_svr_housing = grid_search_housing.best_estimator_
y_pred_best_housing = best_svr_housing.predict(X_test_housing_scaled)
best_r2_housing = r2_score(y_test_housing, y_pred_best_housing)

print(f"Test R² score with best parameters: {best_r2_housing:.4f}")

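# Parameter grid for Wine Quality
# gamma is ignored by the linear kernel, so it is searched only for RBF
param_grid_wine = [
    {
        'kernel': ['rbf'],
        'C': [1, 10, 100],
        'epsilon': [0.01, 0.1, 0.2],
        'gamma': ['scale', 0.01, 0.1, 1]
    },
    {
        'kernel': ['linear'],
        'C': [1, 10, 100],
        'epsilon': [0.01, 0.1, 0.2]
    }
]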

grid_search_wine = GridSearchCV(
    SklearnSVR(), param_grid_wine, cv=5, scoring='r2', n_jobs=-1
)

grid_search_wine.fit(X_train_wine_scaled, y_train_wine)

print(f"\nBest parameters for Wine Quality: {grid_search_wine.best_params_}")
print(f"Best cross-validation R² score: {grid_search_wine.best_score_:.4f}")

best_svr_wine = grid_search_wine.best_estimator_
y_pred_best_wine = best_svr_wine.predict(X_test_wine_scaled)
best_r2_wine = r2_score(y_test_wine, y_pred_best_wine)

print(f"Test R² score with best parameters: {best_r2_wine:.4f}")
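
The searches in this section run on features that were scaled once before cross-validation, so each validation fold has already influenced the scaler. One common alternative, sketched below on synthetic data, wraps the scaler and the model in a scikit-learn `Pipeline` so that scaling is refit inside every fold. The grid values and data are illustrative assumptions, not the notebook's tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR as SklearnSVR

# Synthetic stand-in data (unscaled on purpose)
X_syn, y_syn = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),       # refit on the training part of each fold
    ('svr', SklearnSVR(kernel='rbf')),
])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {
    'svr__C': [1, 10, 100],
    'svr__epsilon': [0.1, 1.0],
    'svr__gamma': ['scale', 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='r2')
search.fit(X_syn, y_syn)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validation R² score: {search.best_score_:.4f}")
```

The fitted `search.best_estimator_` is itself a pipeline, so it can be applied to unscaled test features directly.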

## 6. Results and Insights {#results}

Let's summarize our findings and provide practical insights.

In [None]:
# Comprehensive results summary
print("=" * 60)
print("COMPREHENSIVE SVR REGRESSION ANALYSIS SUMMARY")
print("=" * 60)

print("\n1. CALIFORNIA HOUSING DATASET RESULTS:")
print("-" * 45)
print(f"Dataset size: {X_housing.shape[0]} samples, {X_housing.shape[1]} features")
print(f"Task: Regression (Median House Value Prediction)")
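# Targets are in units of $100k (see the histogram axis label in the EDA cell), so convert to $k
print(f"Target range: ${y_housing.min() * 100:.0f}k - ${y_housing.max() * 100:.0f}k")
print(f"Best performing model: {best_model_housing}")
print(f"Best R² Score: {best_metrics_housing['r2_score']:.4f}")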
print(f"Best RMSE: {best_metrics_housing['rmse']:.4f}")

print("\nOptimized SVR Performance:")
print(f"  Best parameters: {grid_search_housing.best_params_}")
print(f"  Optimized R² Score: {best_r2_housing:.4f}")

print("\nTop 3 performing models:")
housing_sorted = sorted(housing_comparison_results.items(), 
                       key=lambda x: x[1]['r2_score'] if isinstance(x[1], dict) else 0, 
                       reverse=True)[:3]
for i, (name, metrics) in enumerate(housing_sorted, 1):
    if isinstance(metrics, dict):
        print(f"  {i}. {name}: R²={metrics['r2_score']:.4f}, RMSE={metrics['rmse']:.4f}")

print("\n2. WINE QUALITY DATASET RESULTS:")
print("-" * 40)
print(f"Dataset size: {X_wine.shape[0]} samples, {X_wine.shape[1]} features")
print(f"Task: Regression (Wine Quality Score Prediction)")
print(f"Quality range: {y_wine.min()} - {y_wine.max()}")
print(f"Best performing model: {best_model_wine}")
print(f"Best R² Score: {best_metrics_wine['r2_score']:.4f}")
print(f"Best RMSE: {best_metrics_wine['rmse']:.4f}")

print("\nOptimized SVR Performance:")
print(f"  Best parameters: {grid_search_wine.best_params_}")
print(f"  Optimized R² Score: {best_r2_wine:.4f}")

print("\nTop 3 performing models:")
wine_sorted = sorted(wine_comparison_results.items(), 
                    key=lambda x: x[1]['r2_score'] if isinstance(x[1], dict) else 0, 
                    reverse=True)[:3]
for i, (name, metrics) in enumerate(wine_sorted, 1):
    if isinstance(metrics, dict):
        print(f"  {i}. {name}: R²={metrics['r2_score']:.4f}, RMSE={metrics['rmse']:.4f}")

In [None]:
# Key insights and practical recommendations
print("\n3. KEY INSIGHTS AND OBSERVATIONS:")
print("-" * 40)

insights = [
    "📊 PERFORMANCE INSIGHTS:",
    "   • SVR with RBF kernel often performs well on non-linear regression tasks",
    "   • Linear SVR is competitive for linear relationships and high-dimensional data",
    "   • Random Forest consistently shows robust performance across different datasets",
    "   • Feature scaling is crucial for SVR performance",
    "",
    "⚙️  HYPERPARAMETER INSIGHTS:",
    f"   • Optimal C for Housing: {grid_search_housing.best_params_['C']}",
    f"   • Optimal C for Wine: {grid_search_wine.best_params_['C']}",
    f"   • Optimal ε for Housing: {grid_search_housing.best_params_['epsilon']}",
    f"   • Optimal ε for Wine: {grid_search_wine.best_params_['epsilon']}",
    "   • ε parameter significantly affects the number of support vectors",
    "   • Higher C values weaken regularization and raise the risk of overfitting",
    "",
    "🔍 KERNEL COMPARISON:",
    "   • RBF kernel: Best for non-linear patterns, requires gamma tuning",
    "   • Linear kernel: Fast, interpretable, good for high-dimensional data",
    "   • Polynomial kernel: Can capture interactions but prone to overfitting",
    "",
    "📈 DATASET-SPECIFIC FINDINGS:",
    "   • California Housing: Linear relationships dominate, simpler models work well",
    "   • Wine Quality: More complex relationships, benefits from non-linear kernels",
    "   • Feature importance varies significantly between datasets",
    "",
    "💡 PRACTICAL RECOMMENDATIONS:",
    "   • Start with Linear SVR for interpretability and speed",
    "   • Use RBF kernel when non-linear relationships are suspected",
    "   • Always tune ε parameter - it controls model complexity",
    "   • Consider ensemble methods for robust performance",
    "   • Validate hyperparameters using cross-validation",
    "",
    "⚠️  LIMITATIONS OBSERVED:",
    "   • SVR can be sensitive to outliers",
    "   • Computational complexity increases with dataset size",
    "   • Memory usage depends on number of support vectors",
    "   • Hyperparameter tuning is crucial but time-consuming",
    "",
    "🎯 WHEN TO USE SVR:",
    "   ✅ Non-linear regression problems",
    "   ✅ Medium-sized datasets (< 100k samples)",
    "   ✅ When robustness to outliers is needed (with appropriate ε)",
    "   ✅ High-dimensional feature spaces",
    "",
    "❌ WHEN NOT TO USE SVR:",
    "   • Very large datasets (computational cost)",
    "   • When interpretability is paramount (use Linear Regression)",
    "   • When training time is critical",
    "   • Extremely noisy data with many outliers"
]

for insight in insights:
    print(insight)

print("\n" + "=" * 60)
print("SVR Analysis completed successfully! 🎉")
print("=" * 60)

In [None]:
# Save regression results
import pickle

regression_results_summary = {
    'california_housing': {
        'results': housing_comparison_results,
        'best_model': best_model_housing,
        'best_metrics': best_metrics_housing,
        'optimal_params': grid_search_housing.best_params_,
        'optimized_r2': best_r2_housing
    },
    'wine_quality': {
        'results': wine_comparison_results,
        'best_model': best_model_wine,
        'best_metrics': best_metrics_wine,
        'optimal_params': grid_search_wine.best_params_,
        'optimized_r2': best_r2_wine
    }
}

# Create results directory
os.makedirs('../results', exist_ok=True)

# Save results
with open('../results/regression_results.pkl', 'wb') as f:
    pickle.dump(regression_results_summary, f)

print("Results saved to '../results/regression_results.pkl'")
print("\nSVR Analysis notebook execution completed! ✅")

# Final performance summary
print("\nFINAL PERFORMANCE SUMMARY:")
print(f"California Housing - Best R²: {max(best_r2_housing, best_metrics_housing['r2_score']):.4f}")
print(f"Wine Quality - Best R²: {max(best_r2_wine, best_metrics_wine['r2_score']):.4f}")
print("\nCustom SVR implementations demonstrate competitive performance!")
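
To reuse the saved summary in a later session, the pickle file can be read back with `pickle.load`. The sketch below demonstrates the round trip on a temporary file with placeholder values, so it does not touch `../results`. The dictionary contents are made up for illustration and are not results from this notebook.

```python
import os
import pickle
import tempfile

# Placeholder structure only; the values are not real results
demo_summary = {'california_housing': {'optimized_r2': 0.0, 'optimal_params': {'C': 1}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'regression_results.pkl')

    # Write the summary the same way the notebook does
    with open(path, 'wb') as f:
        pickle.dump(demo_summary, f)

    # Read it back; only unpickle files from sources you trust
    with open(path, 'rb') as f:
        loaded_summary = pickle.load(f)

print(loaded_summary['california_housing']['optimal_params'])
```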