# Logistic Regression for Steering Classification

This notebook trains and evaluates Logistic Regression models on the steering image dataset:
- **Standard Logistic Regression** (no regularization)
- **L1 Regularized (Lasso)** - promotes sparsity, feature selection
- **L2 Regularized (Ridge)** - shrinks coefficients, prevents overfitting

We train each variant using both:
1. Raw vectorized image data (with PCA for dimensionality reduction)
2. Engineered features (38 domain-specific features)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import warnings
# Note: this silences all warnings, including ConvergenceWarning from the solvers below
warnings.filterwarnings('ignore')

# Import shared utilities
from utils import (
    load_data, preprocess_data, evaluate_model, 
    cross_validate_model, save_results, get_class_weights,
    print_class_distribution, CLASSES, RANDOM_STATE
)

print("Libraries loaded successfully!")


## 1. Load Data


In [None]:
# Load both raw and engineered features
data = load_data()

X_raw, y_raw = data['raw']
X_eng, y_eng, feature_names = data['engineered']

print(f"\nRaw features shape: {X_raw.shape}")
print(f"Engineered features shape: {X_eng.shape}")


## 2. Preprocess Data

- Apply PCA to raw data (4096 → ~50 features, keeping 95% of the variance) to handle multicollinearity
- Standardize all features
- Stratified train/test split (80/20)
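
The steps listed above can be sketched with plain scikit-learn. This is a minimal illustration only, not the actual `preprocess_data` implementation from `utils` (which is not shown here). It assumes a split, then scale, then PCA order, and uses synthetic data with hypothetical sizes in place of the image matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the image matrix (assumption: not the real dataset)
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=10,
    n_classes=3, weights=[0.74, 0.16, 0.10], random_state=0,
)

# Stratified 80/20 split keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler and PCA on the training split only to avoid test-set leakage
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95, svd_solver="full").fit(scaler.transform(X_train))

X_train_p = pca.transform(scaler.transform(X_train))
X_test_p = pca.transform(scaler.transform(X_test))

print(f"{X.shape[1]} features -> {pca.n_components_} components")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")
```

Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction, so the final dimensionality depends on the data rather than being fixed in advance.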


In [None]:
# Preprocess raw data with PCA
print("Preprocessing RAW data with PCA:")
raw_processed = preprocess_data(
    X_raw, y_raw, 
    test_size=0.2, 
    apply_pca_reduction=True, 
    pca_variance=0.95,
    scale=True
)

print("\nPreprocessing ENGINEERED data:")
eng_processed = preprocess_data(
    X_eng, y_eng, 
    test_size=0.2, 
    apply_pca_reduction=False,
    scale=True
)

# Check class distribution in training set
print("\nTraining set class distribution:")
print_class_distribution(raw_processed['y_train'], raw_processed['label_encoder'])


## 3. Standard Logistic Regression (No Regularization)

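scikit-learn's `LogisticRegression` applies L2 regularization by default, with `C` as the inverse regularization strength. Setting a very high `C` (here `1e6`) makes the penalty negligible, which approximates unregularized logistic regression.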
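
To see the effect of `C` in isolation, the sketch below fits the same model at three values of `C` and compares the size of the coefficient vector. It uses a synthetic binary problem (an assumption made for brevity, not the steering dataset), and `coef_norm` is a hypothetical helper defined only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# flip_y adds label noise so the classes are not perfectly separable
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    flip_y=0.1, random_state=0,
)

def coef_norm(C):
    """Fit an L2-penalized model at the given C and return the coefficient norm."""
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=5000)
    model.fit(X, y)
    return float(np.linalg.norm(model.coef_))

norm_strong = coef_norm(0.01)   # strong regularization
norm_default = coef_norm(1.0)   # scikit-learn default
norm_weak = coef_norm(1e6)      # penalty is negligible

print(f"C=0.01: {norm_strong:.3f}")
print(f"C=1.0 : {norm_default:.3f}")
print(f"C=1e6 : {norm_weak:.3f}")
```

The coefficient norm grows as `C` increases, because a weaker penalty lets the coefficients move closer to the unpenalized maximum-likelihood fit.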


In [None]:
# Get class weights to handle imbalance
# (printed for reference only; the models below pass class_weight='balanced' directly)
class_weights = get_class_weights()
print("Class weights:", class_weights)


In [None]:
# Standard Logistic Regression - Raw (PCA) Features
print("Training Standard Logistic Regression on RAW (PCA) features...")

lr_standard_raw = LogisticRegression(
    C=1e6,  # Very high C = minimal regularization
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    solver='lbfgs',
    multi_class='multinomial'
)

lr_standard_raw.fit(raw_processed['X_train'], raw_processed['y_train'])

results_standard_raw = evaluate_model(
    lr_standard_raw,
    raw_processed['X_test'],
    raw_processed['y_test'],
    model_name='Logistic Regression (Standard)',
    feature_type='raw',
    label_encoder=raw_processed['label_encoder']
)


In [None]:
# Standard Logistic Regression - Engineered Features
print("Training Standard Logistic Regression on ENGINEERED features...")

lr_standard_eng = LogisticRegression(
    C=1e6,
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    solver='lbfgs',
    multi_class='multinomial'
)

lr_standard_eng.fit(eng_processed['X_train'], eng_processed['y_train'])

results_standard_eng = evaluate_model(
    lr_standard_eng,
    eng_processed['X_test'],
    eng_processed['y_test'],
    model_name='Logistic Regression (Standard)',
    feature_type='engineered',
    label_encoder=eng_processed['label_encoder']
)


## 4. L1 Regularized Logistic Regression (Lasso)

L1 regularization promotes sparsity: it can shrink coefficients to exactly zero, which effectively performs feature selection.
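
The sparsity effect can be shown on a small toy problem. The sketch below is illustrative only: it uses synthetic binary data with 3 informative features out of 20, and the `liblinear` solver rather than the `saga` solver used in this notebook, because it is quick on small binary problems. The value `C=0.05` is an arbitrary choice that makes the penalty strong enough to show the contrast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, only 3 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty
l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print(f"L1: {n_zero_l1}/{X.shape[1]} coefficients exactly zero")
print(f"L2: {n_zero_l2}/{X.shape[1]} coefficients exactly zero")
```

With the L1 penalty, coefficients of uninformative features tend to be driven to exactly zero, while the L2 model keeps small non-zero values for all of them.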


In [None]:
# L1 Logistic Regression with Cross-Validation for C - Raw Features
print("Training L1 Logistic Regression on RAW (PCA) features with CV...")

# Use LogisticRegressionCV for automatic C selection
lr_l1_raw = LogisticRegressionCV(
    penalty='l1',
    Cs=10,  # 10 values of C to try
    cv=5,
    scoring='f1_macro',
    solver='saga',
    max_iter=2000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    multi_class='multinomial',
    n_jobs=-1
)

lr_l1_raw.fit(raw_processed['X_train'], raw_processed['y_train'])

print(f"Best C: {lr_l1_raw.C_[0]:.4f}")

results_l1_raw = evaluate_model(
    lr_l1_raw,
    raw_processed['X_test'],
    raw_processed['y_test'],
    model_name='Logistic Regression (L1/Lasso)',
    feature_type='raw',
    label_encoder=raw_processed['label_encoder']
)


In [None]:
# L1 Logistic Regression - Engineered Features
print("Training L1 Logistic Regression on ENGINEERED features with CV...")

lr_l1_eng = LogisticRegressionCV(
    penalty='l1',
    Cs=10,
    cv=5,
    scoring='f1_macro',
    solver='saga',
    max_iter=2000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    multi_class='multinomial',
    n_jobs=-1
)

lr_l1_eng.fit(eng_processed['X_train'], eng_processed['y_train'])

print(f"Best C: {lr_l1_eng.C_[0]:.4f}")

results_l1_eng = evaluate_model(
    lr_l1_eng,
    eng_processed['X_test'],
    eng_processed['y_test'],
    model_name='Logistic Regression (L1/Lasso)',
    feature_type='engineered',
    label_encoder=eng_processed['label_encoder']
)


In [None]:
# Analyze L1 feature selection on engineered features
print("\nL1 Feature Selection (Engineered Features):")
print("="*50)

# Get coefficients
coefs = lr_l1_eng.coef_

# Count non-zero coefficients per class
# (rows of coef_ follow lr_l1_eng.classes_; this assumes CLASSES is in that same order)
for i, cls in enumerate(CLASSES):
    n_nonzero = np.sum(coefs[i] != 0)
    print(f"{cls}: {n_nonzero}/{len(feature_names)} features selected")

# Find most important features (by max absolute coefficient across classes)
max_coefs = np.max(np.abs(coefs), axis=0)
top_indices = np.argsort(max_coefs)[::-1][:10]

print("\nTop 10 Features by L1 Coefficient Magnitude:")
for idx in top_indices:
    print(f"  {feature_names[idx]}: {max_coefs[idx]:.4f}")


## 5. L2 Regularized Logistic Regression (Ridge)

L2 regularization shrinks coefficients toward zero but never exactly to zero. It generally predicts better than L1 when many features carry signal.
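
How strongly the coefficients are shrunk depends on `C`, which the cells in this notebook choose with `LogisticRegressionCV(Cs=10, ...)`. When `Cs` is an integer, scikit-learn builds a logarithmic grid between `1e-4` and `1e4`. The sketch below, on synthetic binary data (an assumption for brevity), shows that grid and the value selected by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    flip_y=0.1, random_state=0,
)

# Default penalty is L2; cv=5 uses stratified folds for classifiers
clf = LogisticRegressionCV(Cs=10, cv=5, scoring="f1_macro", max_iter=2000)
clf.fit(X, y)

# An integer Cs expands to this logarithmic grid
expected_grid = np.logspace(-4, 4, 10)

# np.ravel keeps this robust whether C_ is stored as an array or a scalar
best_C = float(np.ravel(clf.C_)[0])

print("Candidate C values:", clf.Cs_)
print("Selected C:", best_C)
```

Because the grid is coarse (10 points over 8 orders of magnitude), a selected `C` at either end of the range is a hint that a wider or finer grid may be worth trying.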


In [None]:
# L2 Logistic Regression with Cross-Validation for C - Raw Features
print("Training L2 Logistic Regression on RAW (PCA) features with CV...")

lr_l2_raw = LogisticRegressionCV(
    penalty='l2',
    Cs=10,
    cv=5,
    scoring='f1_macro',
    solver='lbfgs',
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    multi_class='multinomial',
    n_jobs=-1
)

lr_l2_raw.fit(raw_processed['X_train'], raw_processed['y_train'])

print(f"Best C: {lr_l2_raw.C_[0]:.4f}")

results_l2_raw = evaluate_model(
    lr_l2_raw,
    raw_processed['X_test'],
    raw_processed['y_test'],
    model_name='Logistic Regression (L2/Ridge)',
    feature_type='raw',
    label_encoder=raw_processed['label_encoder']
)


In [None]:
# L2 Logistic Regression - Engineered Features
print("Training L2 Logistic Regression on ENGINEERED features with CV...")

lr_l2_eng = LogisticRegressionCV(
    penalty='l2',
    Cs=10,
    cv=5,
    scoring='f1_macro',
    solver='lbfgs',
    max_iter=1000,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    multi_class='multinomial',
    n_jobs=-1
)

lr_l2_eng.fit(eng_processed['X_train'], eng_processed['y_train'])

print(f"Best C: {lr_l2_eng.C_[0]:.4f}")

results_l2_eng = evaluate_model(
    lr_l2_eng,
    eng_processed['X_test'],
    eng_processed['y_test'],
    model_name='Logistic Regression (L2/Ridge)',
    feature_type='engineered',
    label_encoder=eng_processed['label_encoder']
)


## 6. Results Summary


In [None]:
# Compile all results
all_results = [
    results_standard_raw,
    results_standard_eng,
    results_l1_raw,
    results_l1_eng,
    results_l2_raw,
    results_l2_eng
]

# Create summary DataFrame
summary_df = pd.DataFrame([
    {
        'Model': r['model_name'],
        'Features': r['feature_type'],
        'Accuracy': r['accuracy'],
        'Balanced Acc': r['balanced_accuracy'],
        'F1 (Macro)': r['f1_macro'],
        'ROC-AUC': r['roc_auc']
    }
    for r in all_results
])

print("\n" + "="*80)
print("LOGISTIC REGRESSION RESULTS SUMMARY")
print("="*80)
print(summary_df.to_string(index=False))


In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Prepare data for plotting
models = ['Standard', 'L1 (Lasso)', 'L2 (Ridge)']
raw_f1 = [results_standard_raw['f1_macro'], results_l1_raw['f1_macro'], results_l2_raw['f1_macro']]
eng_f1 = [results_standard_eng['f1_macro'], results_l1_eng['f1_macro'], results_l2_eng['f1_macro']]

x = np.arange(len(models))
width = 0.35

# F1 Score comparison
bars1 = axes[0].bar(x - width/2, raw_f1, width, label='Raw (PCA)', color='steelblue')
bars2 = axes[0].bar(x + width/2, eng_f1, width, label='Engineered', color='coral')

axes[0].set_ylabel('Macro F1 Score', fontsize=12)
axes[0].set_title('Logistic Regression: F1 Score Comparison', fontsize=14, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(models)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].set_ylim(0, 1)

# Add value labels
for bar in list(bars1) + list(bars2):
    height = bar.get_height()
    axes[0].annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                     xytext=(0, 3), textcoords='offset points', ha='center', fontsize=9)

# Balanced Accuracy comparison
raw_ba = [results_standard_raw['balanced_accuracy'], results_l1_raw['balanced_accuracy'], results_l2_raw['balanced_accuracy']]
eng_ba = [results_standard_eng['balanced_accuracy'], results_l1_eng['balanced_accuracy'], results_l2_eng['balanced_accuracy']]

bars3 = axes[1].bar(x - width/2, raw_ba, width, label='Raw (PCA)', color='steelblue')
bars4 = axes[1].bar(x + width/2, eng_ba, width, label='Engineered', color='coral')

axes[1].set_ylabel('Balanced Accuracy', fontsize=12)
axes[1].set_title('Logistic Regression: Balanced Accuracy Comparison', fontsize=14, fontweight='bold')
axes[1].set_xticks(x)
axes[1].set_xticklabels(models)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_ylim(0, 1)

for bar in list(bars3) + list(bars4):
    height = bar.get_height()
    axes[1].annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                     xytext=(0, 3), textcoords='offset points', ha='center', fontsize=9)

plt.tight_layout()
plt.show()


In [None]:
# Save results for final comparison
save_results(all_results, 'logistic_regression')
print("\nResults saved successfully!")


## 7. Key Observations

### Regularization Effects
- **Standard (no regularization)**: May overfit, especially with high-dimensional data
- **L1 (Lasso)**: Performs feature selection by zeroing out irrelevant coefficients
- **L2 (Ridge)**: Shrinks all coefficients, often better for prediction

### Feature Types
- **Raw (PCA)**: Uses principal components from 4096 pixel values
- **Engineered**: Uses 38 domain-specific features (edges, spatial, texture, etc.)

### Class Imbalance Handling
- Used `class_weight='balanced'` to account for the class imbalance (~74% forward, ~16% left, ~9% right)
- Evaluated with Macro F1 and Balanced Accuracy (not just accuracy)
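
The two points above can be made concrete with a small sketch. The label counts below are illustrative, chosen only to mirror the approximate 74/16/9 split; they are not the real dataset. The code shows how `'balanced'` weights are derived, as `n_samples / (n_classes * class_count)`, and why plain accuracy is misleading for a classifier that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label counts mirroring the approximate class split
counts = {"forward": 74, "left": 16, "right": 9}
y = np.repeat(list(counts.keys()), list(counts.values()))

# 'balanced' weights: n_samples / (n_classes * count of each class)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.array([counts[c] for c in classes]))

for cls, w in zip(classes, weights):
    print(f"{cls}: weight {w:.3f}")

# Baseline that always predicts the majority class
y_pred = np.full_like(y, "forward")
acc = accuracy_score(y, y_pred)
bal_acc = balanced_accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
print(f"Macro F1: {f1:.3f}")
```

The majority-class baseline scores well on accuracy but only one third on balanced accuracy, since it recalls one of three classes. That gap is why Macro F1 and Balanced Accuracy are the metrics to watch here.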

### Next Steps
- Compare with non-linear models (kNN, trees, SVM with kernels)
- Final model comparison in `10_model_comparison.ipynb`
