# 1. Linear Regression

## 📖 What is Linear Regression?

**Linear Regression** is a supervised learning algorithm that models the relationship between input features (X) and a continuous target variable (y) using a linear equation.

**Mathematical Formula:**
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Where:
  y = observed target value
  β₀ = intercept (bias)
  β₁, β₂, ..., βₙ = coefficients (weights)
  x₁, x₂, ..., xₙ = features
  ε = error term
```

**Goal:** Minimize the **Sum of Squared Errors (SSE)**:
```
SSE = Σ(yᵢ - ŷᵢ)²
```
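As a quick illustration of the SSE formula, the sketch below computes it with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen for the example, not output from any model in this notebook.

```python
import numpy as np

# Hypothetical observed targets and model predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Residuals are the differences between observed and predicted values
residuals = y_true - y_pred

# SSE is the sum of the squared residuals
sse = np.sum(residuals ** 2)
print(f"SSE = {sse}")
```

Squaring makes every residual count as a positive error and penalizes large misses more than small ones.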

**Key Concepts:**
- **Ordinary Least Squares (OLS)**: Standard method to fit the line
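- **R² Score**: Fraction of target variance explained by the model (1 is a perfect fit; it can drop below 0 on held-out data when the model does worse than predicting the mean)
- **Assumptions**: Linearity, independence of errors, homoscedasticity, normality of residuals
- **Interpretability**: Each coefficient gives the direction and size of a feature's effect per unit change, holding the other features fixed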

## 🎯 Why Use Linear Regression?

### **Advantages:**
1. **Simplicity** - Easy to understand and implement
2. **Interpretability** - Coefficients explain feature impact
3. **Fast Training** - Closed-form solution (no iterations needed)
4. **Low Computational Cost** - Works well with limited resources
5. **Baseline Model** - Good starting point for any regression problem
6. **Statistical Inference** - Provides p-values, confidence intervals
7. **Extrapolation** - Can predict beyond the training data range, though such predictions are only reliable if the linear trend still holds there

### **Disadvantages:**
1. **Linear Assumption** - Can't capture non-linear relationships
2. **Sensitive to Outliers** - Single outlier can skew the line
3. **Multicollinearity** - Correlated features cause instability
4. **Overfitting** - With many features, may not generalize
5. **Assumes Normality** - p-values and confidence intervals assume normally distributed residuals

## ⏱️ When to Use Linear Regression

### ✅ **Use When:**

**1. Relationship is Linear**
- Example: House price vs square footage
- Why: Price increases proportionally with size
- Data: Scatter plot shows straight-line trend

**2. Need Interpretability**
- Example: Understand impact of marketing spend on sales
- Why: Coefficients tell you "$1 increase in ads = $X increase in sales"
- Use case: Business decisions, stakeholder communication

**3. Small Dataset**
- Example: 100 samples, 5 features
- Why: Complex models would overfit
- Benefit: Linear regression generalizes well with limited data

**4. Fast Predictions Needed**
- Example: Real-time pricing engine
- Why: O(p) prediction time per sample for p features (just multiply and add)
- Performance: Milliseconds for inference

**5. Baseline/Benchmark**
- Example: Start of any regression project
- Why: Establishes minimum performance threshold
- Next step: Try more complex models if R² is low

**6. Statistical Inference Required**
- Example: Medical study analyzing treatment effect
- Why: Need p-values to determine significance
- Use case: Academic research, regulatory compliance

**7. Few Features (< 10)**
- Example: Predict salary from years of experience and education
- Why: Low dimensionality, relationships likely linear
- Benefit: Avoid overfitting, easy to interpret

### ❌ **Don't Use When:**

**1. Relationship is Non-Linear**
- Problem: House price doesn't increase linearly with age (decreases after peak)
- Better: Polynomial regression, decision trees, neural networks
- Why: Linear model can't capture curves

**2. Many Correlated Features**
- Problem: Height in cm, height in inches, height in feet (all correlated)
- Better: Ridge/Lasso regression, PCA + linear regression
- Why: Multicollinearity makes coefficients unstable
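To make the height example concrete, here is a minimal sketch with made-up heights. Storing the same measurement in two units gives a design matrix that has three columns but only rank 2, so ordinary least squares has no unique set of coefficients.

```python
import numpy as np

# Hypothetical heights; the second column is the same quantity in inches
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0, 191.0])
height_in = height_cm / 2.54

# Design matrix: intercept column plus the two redundant features
X = np.column_stack([np.ones(len(height_cm)), height_cm, height_in])

# The rank is lower than the number of columns, so X^T X is singular
rank = np.linalg.matrix_rank(X)
print(f"Columns: {X.shape[1]}, rank: {rank}")

# Pearson correlation between the two height columns
corr = np.corrcoef(height_cm, height_in)[0, 1]
print(f"Correlation: {corr:.6f}")
```

With nearly (rather than exactly) duplicated features the matrix is technically invertible, but tiny changes in the data can swing the coefficients widely.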

**3. Outliers Present**
- Problem: One house sold for $10M in a $200K neighborhood
- Better: Huber regression, RANSAC, remove outliers
- Why: Outliers pull the line away from true trend
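The effect of a single extreme sale can be sketched as follows. The data are synthetic: ten hypothetical houses priced at exactly $100 per square foot plus a $50K base, with one price then overwritten by $10M. `np.polyfit` with degree 1 is used here as a simple least-squares line fit.

```python
import numpy as np

# Hypothetical neighborhood: price follows 100 * sqft + 50,000 exactly
sqft = np.linspace(1000, 1900, 10)
price = 100.0 * sqft + 50_000.0

# Least-squares line on the clean data (returns slope first, then intercept)
slope_clean, intercept_clean = np.polyfit(sqft, price, 1)

# Replace the last sale with a $10M outlier and refit
price_outlier = price.copy()
price_outlier[-1] = 10_000_000.0
slope_outlier, intercept_outlier = np.polyfit(sqft, price_outlier, 1)

print(f"Slope without outlier: {slope_clean:,.2f} $/sqft")
print(f"Slope with outlier:    {slope_outlier:,.2f} $/sqft")
```

Because the errors are squared, one point far from the trend can dominate the fit, which is why robust alternatives are suggested above.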

**4. High-Dimensional Data (p >> n)**
- Problem: 10,000 features, 100 samples
- Better: Regularized regression (Ridge/Lasso), dimensionality reduction
- Why: Overfits, and X^T X is singular when p > n, so the normal equation has no unique solution

**5. Need Probability Estimates**
- Problem: Predict if customer will churn (yes/no)
- Better: Logistic regression, classification models
- Why: Linear regression outputs can be < 0 or > 1
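The out-of-range problem is easy to reproduce. This sketch fits a straight line to a made-up churn dataset (1 = churned, 0 = stayed) against customer tenure in months. The variable names and values are illustrative assumptions.

```python
import numpy as np

# Hypothetical data: short-tenure customers churned, long-tenure ones stayed
tenure_months = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
churned = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Fit a straight line to the 0/1 labels with least squares
slope, intercept = np.polyfit(tenure_months, churned, 1)

# Predict for a very new customer, a mid-range one, and a long-standing one
new_tenure = np.array([1.0, 4.5, 12.0])
predicted = slope * new_tenure + intercept

for months, value in zip(new_tenure, predicted):
    print(f"tenure={months:4.1f} months -> predicted 'probability' {value:.3f}")
```

The fitted line exceeds 1 for the newest customer and falls below 0 for the long-standing one, so its outputs cannot be read as probabilities.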

## 📊 How It Works

**Training Process:**
1. **Input**: Dataset with features X and target y
2. **Solve**: Normal equation: β = (X^T X)^(-1) X^T y
3. **Output**: Coefficients β₀, β₁, ..., βₙ
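The training steps above can be sketched in a few lines of NumPy. The dataset is made up and noise-free (generated from y = 1 + 2·x₁ + 3·x₂) so the recovered coefficients are easy to check. Solving the linear system instead of forming the inverse is a common numerical choice, not something this notebook prescribes.

```python
import numpy as np

# Tiny made-up dataset generated exactly by y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept is estimated as the first coefficient
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (X^T X) beta = X^T y rather than inverting X^T X
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# SVD-based least-squares solver, which also copes with rank-deficient inputs
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Normal equation:", np.round(beta_normal, 6))
print("lstsq:          ", np.round(beta_lstsq, 6))
```

Both approaches recover the intercept and the two slopes. When X^T X is close to singular, the least-squares solver is usually the safer choice.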

**Prediction:**
```
ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
```

**Evaluation Metrics:**
- **R² Score**: 1 - (SS_res / SS_tot) → at most 1, closer to 1 is better (negative when the model is worse than predicting the mean)
- **RMSE**: √(Σ(y - ŷ)² / n) → Lower is better
- **MAE**: Σ|y - ŷ| / n → Lower is better
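The three metrics can be computed directly from their definitions. The arrays below are illustrative values, and `y_bad` is a deliberately poor set of predictions added to show that R² is not bounded below by 0.

```python
import numpy as np

# Hypothetical targets and two sets of predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])
y_bad = np.array([9.0, 7.0, 5.0, 3.0])  # trend reversed on purpose

def regression_metrics(y, y_hat):
    """Return (R^2, RMSE, MAE) computed from the textbook formulas."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    r2 = 1 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y))
    mae = np.mean(np.abs(y - y_hat))
    return r2, rmse, mae

r2, rmse, mae = regression_metrics(y_true, y_pred)
r2_bad, _, _ = regression_metrics(y_true, y_bad)

print(f"Good fit: R2={r2:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}")
print(f"Bad fit:  R2={r2_bad:.3f}")
```

RMSE and MAE are in the units of the target, while R² is unitless, so they answer slightly different questions about the same residuals.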

## 🌍 Real-World Applications

1. **Real Estate** - Predict house prices from size, location, bedrooms
2. **Finance** - Stock price prediction, risk assessment
3. **Marketing** - ROI prediction from ad spend
4. **Healthcare** - Disease progression from patient metrics
5. **Economics** - GDP forecasting from economic indicators
6. **Retail** - Sales forecasting from historical data
7. **Manufacturing** - Quality prediction from process parameters
8. **Agriculture** - Crop yield from weather, soil conditions
9. **Energy** - Power consumption forecasting
10. **Insurance** - Claim amount prediction

## 💡 Key Insights

✅ **Always visualize** data first - scatter plots reveal linearity  
✅ **Check assumptions**: linearity, independence, homoscedasticity, normality of residuals  
✅ **Feature scaling** is not required for OLS predictions (coefficients rescale with the units), but scale features before comparing coefficient magnitudes  
✅ **Regularization** (Ridge/Lasso) if features are correlated  
✅ **R² = 0.7+** is often considered good, but the threshold depends heavily on the domain  
✅ **Coefficients** show direction (+/-) and effect per unit of each feature; compare magnitudes only on standardized features  
✅ **Investigate outliers**: remove them only if they are errors, or use robust regression  
✅ **Check VIF** (Variance Inflation Factor) for multicollinearity  
✅ **Residual plots** should show random scatter (no pattern)  
✅ **Use as baseline** - always try linear regression first!
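The VIF check mentioned above can be sketched with NumPy alone. The helper `vif` is a hypothetical function written for this example: it regresses each feature on the others and returns 1 / (1 - R²). The data are synthetic, with the second column built as a noisy copy of the first.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features (with an intercept)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features: b is almost a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {value:.1f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. Here the two correlated columns land far above that range, while the independent one stays near 1.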

In [None]:
# LINEAR REGRESSION - COMPLETE EXAMPLE

print("="*80)
print("LINEAR REGRESSION - COMPREHENSIVE GUIDE")
print("="*80)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# 1. GENERATE SYNTHETIC DATA (HOUSE PRICES)
print("\n1. GENERATING SYNTHETIC HOUSE PRICE DATA")
print("-"*80)

np.random.seed(42)
n_samples = 200

# Features
square_feet = np.random.uniform(800, 3500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.uniform(0, 50, n_samples)

# Target: Price (linear relationship + noise)
price = (150 * square_feet +        # $150 per sqft
         20000 * bedrooms +          # $20K per bedroom
         -2000 * age +               # -$2K per year of age
         100000 +                    # Base price
         np.random.normal(0, 30000, n_samples))  # Noise

# Create DataFrame
df = pd.DataFrame({
    'square_feet': square_feet,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

print("Dataset created:")
print(df.head(10))
print(f"\nDataset shape: {df.shape}")
print(f"\nStatistics:")
print(df.describe())

# 2. EXPLORATORY DATA ANALYSIS
print("\n2. EXPLORATORY DATA ANALYSIS")
print("-"*80)

# Correlation matrix
correlation = df.corr()
print("Correlation with price:")
print(correlation['price'].sort_values(ascending=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Scatter plots
axes[0, 0].scatter(df['square_feet'], df['price'], alpha=0.5)
axes[0, 0].set_xlabel('Square Feet')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('Price vs Square Feet')

axes[0, 1].scatter(df['bedrooms'], df['price'], alpha=0.5)
axes[0, 1].set_xlabel('Bedrooms')
axes[0, 1].set_ylabel('Price ($)')
axes[0, 1].set_title('Price vs Bedrooms')

axes[1, 0].scatter(df['age'], df['price'], alpha=0.5)
axes[1, 0].set_xlabel('Age (years)')
axes[1, 0].set_ylabel('Price ($)')
axes[1, 0].set_title('Price vs Age')

# Correlation heatmap
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, ax=axes[1, 1])
axes[1, 1].set_title('Correlation Matrix')

plt.tight_layout()
plt.savefig('linear_regression_eda.png', dpi=100, bbox_inches='tight')
print("\nEDA visualizations saved to 'linear_regression_eda.png'")
plt.close()

# 3. PREPARE DATA
print("\n3. PREPARING DATA")
print("-"*80)

# Features and target
X = df[['square_feet', 'bedrooms', 'age']]
y = df['price']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Train-test split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# 4. TRAIN LINEAR REGRESSION MODEL
print("\n4. TRAINING LINEAR REGRESSION MODEL")
print("-"*80)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"\nModel coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature:15s}: ${coef:12,.2f}")
print(f"  {'Intercept':15s}: ${model.intercept_:12,.2f}")

print(f"\nInterpretation:")
print(f"  - Each additional sqft increases price by ${model.coef_[0]:.2f}")
print(f"  - Each additional bedroom increases price by ${model.coef_[1]:,.2f}")
print(f"  - Each year of age decreases price by ${abs(model.coef_[2]):,.2f}")

# 5. MAKE PREDICTIONS
print("\n5. MAKING PREDICTIONS")
print("-"*80)

# Predict on training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Show some predictions
print("Sample predictions (first 5 test samples):")
comparison = pd.DataFrame({
    'Actual': y_test.values[:5],
    'Predicted': y_test_pred[:5],
    'Difference': y_test.values[:5] - y_test_pred[:5]
})
print(comparison)

# 6. EVALUATE MODEL
print("\n6. MODEL EVALUATION")
print("-"*80)

# Training metrics
train_r2 = r2_score(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)

print("Training Set Performance:")
print(f"  R² Score: {train_r2:.4f}")
print(f"  RMSE: ${train_rmse:,.2f}")
print(f"  MAE: ${train_mae:,.2f}")

# Test metrics
test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)

print(f"\nTest Set Performance:")
print(f"  R² Score: {test_r2:.4f}")
print(f"  RMSE: ${test_rmse:,.2f}")
print(f"  MAE: ${test_mae:,.2f}")

print(f"\nModel Interpretation:")
print(f"  - {test_r2:.2%} of the variance in price is explained by the model")
print(f"  - Average prediction error: ${test_mae:,.0f}")
print(f"  - Train/test R² gap: {train_r2 - test_r2:+.4f} (a small gap suggests good generalization)")

# 7. VISUALIZE RESULTS
print("\n7. VISUALIZING RESULTS")
print("-"*80)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Actual vs Predicted
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6)
axes[0, 0].plot([y_test.min(), y_test.max()], 
                [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Actual Price ($)')
axes[0, 0].set_ylabel('Predicted Price ($)')
axes[0, 0].set_title(f'Actual vs Predicted (R² = {test_r2:.4f})')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Residuals plot
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Price ($)')
axes[0, 1].set_ylabel('Residuals ($)')
axes[0, 1].set_title('Residual Plot (should be random)')
axes[0, 1].grid(True, alpha=0.3)

# Residuals distribution
axes[1, 0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0, color='r', linestyle='--', lw=2)
axes[1, 0].set_xlabel('Residuals ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residuals Distribution (should be normal)')
axes[1, 0].grid(True, alpha=0.3)

# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=True)

axes[1, 1].barh(feature_importance['Feature'], feature_importance['Coefficient'])
axes[1, 1].set_xlabel('Coefficient Value')
axes[1, 1].set_title('Coefficients (raw units, not comparable across features)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('linear_regression_results.png', dpi=100, bbox_inches='tight')
print("Results visualizations saved to 'linear_regression_results.png'")
plt.close()

# 8. PREDICTION ON NEW DATA
print("\n8. PREDICTING ON NEW DATA")
print("-"*80)

# New houses to predict
new_houses = pd.DataFrame({
    'square_feet': [1500, 2500, 3000],
    'bedrooms': [3, 4, 5],
    'age': [5, 10, 2]
})

print("New houses to predict:")
print(new_houses)

# Make predictions
new_predictions = model.predict(new_houses)

print(f"\nPredicted prices:")
for i, (idx, row) in enumerate(new_houses.iterrows()):
    print(f"  House {i+1}: {row['square_feet']:.0f} sqft, "
          f"{row['bedrooms']:.0f} beds, {row['age']:.0f} years old")
    print(f"           → Predicted price: ${new_predictions[i]:,.2f}")

# 9. MODEL EQUATION
print("\n9. MODEL EQUATION")
print("-"*80)

print("Fitted equation:")
print(f"\nPrice = {model.intercept_:,.2f}")
for feature, coef in zip(X.columns, model.coef_):
    sign = '+' if coef >= 0 else '-'
    print(f"        {sign} {abs(coef):,.2f} × {feature}")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print(f"✓ Linear Regression trained on {len(X_train)} samples")
print(f"✓ Test R² = {test_r2:.4f} (explains {test_r2:.1%} of variance)")
print(f"✓ Test RMSE = ${test_rmse:,.2f}")
print(f"✓ Test MAE = ${test_mae:,.2f}")
print(f"✓ Model is interpretable: coefficients show feature impact")
print(f"✓ Train R² = {train_r2:.4f} vs test R² = {test_r2:.4f} (a small gap suggests little overfitting)")
print(f"✓ Check the residual histogram to confirm residuals look roughly normal")
print(f"✓ Good baseline model for house price prediction")
print("="*80)

# 2. Ridge Regression (L2 Regularization)

## 📖 What is Ridge Regression?

**Ridge Regression** is a regularized version of Linear Regression that adds an L2 penalty term to prevent overfitting and handle multicollinearity.

**Mathematical Formula:**
```
Cost = SSE + α × Σ(βᵢ²)

Where:
  SSE = Sum of Squared Errors (same as Linear Regression)
  α = Regularization strength (hyperparameter)
  Σ(βᵢ²) = Sum of squared coefficients (L2 penalty)
```
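The cost function can be written out directly. `ridge_cost` below is a hypothetical helper for illustration, and the small `X`, `y`, and `beta` arrays are made up. The intercept is left out of the penalty, which is the usual convention (scikit-learn's `Ridge` follows it too).

```python
import numpy as np

def ridge_cost(X, y, beta, intercept, alpha):
    """SSE plus the L2 penalty on the coefficients (intercept not penalized)."""
    residuals = y - (X @ beta + intercept)
    sse = np.sum(residuals ** 2)
    l2_penalty = alpha * np.sum(beta ** 2)
    return sse + l2_penalty

# Made-up data that beta = [1, 2] fits perfectly (SSE = 0)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

cost_no_penalty = ridge_cost(X, y, beta, intercept=0.0, alpha=0.0)
cost_with_penalty = ridge_cost(X, y, beta, intercept=0.0, alpha=0.5)

print(f"alpha=0.0 -> cost = {cost_no_penalty}")
print(f"alpha=0.5 -> cost = {cost_with_penalty}")
```

Even a perfect fit pays a price proportional to the squared size of its coefficients, which is what pushes Ridge toward smaller weights.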

**Key Difference from Linear Regression:**
- **Linear Regression**: Minimize SSE only
- **Ridge Regression**: Minimize SSE + penalty for large coefficients

**Effect:**
- **Shrinks coefficients** toward zero (but never exactly zero)
- **Reduces variance** at the cost of increased bias
- **Prevents overfitting** by limiting coefficient magnitude

## 🎯 Why Use Ridge Regression?

### **Advantages:**
1. **Handles Multicollinearity** - Works when features are correlated
2. **Prevents Overfitting** - Regularization reduces model complexity
3. **Stable Coefficients** - Small data changes don't drastically change model
4. **Works with p > n** - Can handle more features than samples
5. **Closed-Form Solution** - Fast to train (no iterations)
6. **All Features Retained** - Doesn't set coefficients to exactly zero

### **Disadvantages:**
1. **Not Feature Selection** - Keeps all features (unlike Lasso)
2. **Hyperparameter Tuning** - Need to find optimal α
3. **Requires Scaling** - Features must be on same scale
4. **Less Interpretable** - Coefficients are shrunk, harder to interpret

## ⏱️ When to Use Ridge Regression

### ✅ **Use When:**

**1. Multicollinearity Present**
- Example: Height in cm and height in inches (perfectly correlated)
- Why: Linear regression coefficients become unstable
- Solution: Ridge shrinks correlated coefficients

**2. Many Features (High Dimensionality)**
- Example: Gene expression data (20,000 genes, 100 samples)
- Why: Linear regression overfits badly
- Benefit: Regularization prevents overfitting

**3. All Features are Relevant**
- Example: Image pixels (all contribute to recognition)
- Why: Don't want to remove features (Lasso would)
- Use Ridge: Keeps all features, just shrinks coefficients

**4. Overfitting Detected**
- Example: Train R² = 0.99, Test R² = 0.60
- Why: Model memorized training data
- Solution: Ridge adds penalty for complexity

**5. Small Dataset**
- Example: 50 samples, 20 features
- Why: Linear regression would overfit
- Benefit: Ridge generalizes better

### ❌ **Don't Use When:**

**1. Need Feature Selection**
- Problem: Want to identify most important 10 features from 100
- Better: Lasso regression (sets coefficients to exactly zero)
- Why: Ridge shrinks all coefficients but never eliminates features

**2. Features Not Scaled**
- Problem: Feature 1 in [0, 1], Feature 2 in [0, 1000]
- Better: Scale features first, then use Ridge
- Why: Penalty affects large-scale features more

**3. No Multicollinearity**
- Problem: All features are independent
- Better: Standard Linear Regression
- Why: No need for regularization if no overfitting

## 📊 How It Works

**Training:**
```
β_ridge = (X^T X + α I)^(-1) X^T y

Where:
  α = regularization parameter
  I = identity matrix
```
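A minimal NumPy sketch of this closed-form solution follows. `ridge_fit` is a hypothetical helper, and the data are synthetic. Centering X and y first is one common way to keep the intercept out of the penalty; it is an assumption of this sketch rather than part of the formula above.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data; returns (coefficients, intercept)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centering removes the need to penalize the intercept
    yc = y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    beta = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Synthetic data from a known linear model plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 5.0 + 0.1 * rng.normal(size=50)

beta_ols, intercept_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 is plain OLS
beta_ridge, intercept_ridge = ridge_fit(X, y, alpha=10.0)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```

Adding α to the diagonal also makes the matrix invertible when X^T X alone is singular, which is why Ridge still works with correlated features or p > n.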

**α (Alpha) Selection:**
- **α = 0**: Ridge = Linear Regression (no penalty)
- **α = small (0.1)**: Light regularization
- **α = large (100)**: Heavy regularization, coefficients → 0
- **α = ∞**: All coefficients → 0
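The effect of α is easiest to see in a special case chosen for illustration: orthonormal features and no intercept. There X^T X = I, so the ridge solution is simply the OLS solution divided by (1 + α). The data and the helper `ridge_no_intercept` are hypothetical.

```python
import numpy as np

# Orthonormal design (identity matrix), so the OLS solution equals y itself
X = np.eye(3)
y = np.array([3.0, -2.0, 1.0])

def ridge_no_intercept(X, y, alpha):
    """Ridge coefficients from (X^T X + alpha I) beta = X^T y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 0.1, 1.0, 100.0]:
    beta = ridge_no_intercept(X, y, alpha)
    print(f"alpha={alpha:>6}: beta={np.round(beta, 4)}")
```

Every coefficient shrinks by the same factor 1 / (1 + α) in this setting, so it gets closer to zero as α grows but never reaches it for any finite α.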

**Find Optimal α:**
- Cross-validation (try α = 0.001, 0.01, 0.1, 1, 10, 100)
- Choose Œ± with best validation performance
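The selection procedure can be sketched by hand with a simple k-fold loop. `ridge_fit` and `cv_mse` are hypothetical helpers written for this example, and the data are synthetic with features already on a similar scale, so no scaling step is shown.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data; returns (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta

def cv_mse(X, y, alpha, n_folds=5):
    """Mean validation MSE over contiguous folds for one value of alpha."""
    indices = np.arange(len(X))
    errors = []
    for fold in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, fold)      # everything outside the fold
        beta, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[fold] @ beta + intercept
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data with two highly correlated features and one independent one
rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=n), rng.normal(size=n)])
y = 3 * X[:, 0] + 2 * X[:, 1] + 4 * X[:, 2] + 0.5 * rng.normal(size=n)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha, mse in scores.items():
    print(f"alpha={alpha:>7}: CV MSE = {mse:.4f}")
print(f"Best alpha: {best_alpha}")
```

In practice a library routine such as scikit-learn's `RidgeCV` does the same search with less code. The loop is shown only to make the mechanics visible.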

## 🌍 Real-World Applications

1. **Genomics** - Predict disease from gene expression (20K features)
2. **Finance** - Stock prediction with correlated economic indicators
3. **Image Processing** - Pixel-based predictions (correlated pixels)
4. **Marketing** - Sales prediction with correlated channels
5. **Climate Science** - Weather prediction (correlated sensors)
6. **Medical Diagnosis** - Predict from correlated patient metrics
7. **Text Analysis** - Sentiment with high-dimensional word features

## 💡 Key Insights

✅ **Always scale features** before Ridge (StandardScaler)  
✅ **Use cross-validation** to find optimal α  
✅ **α = 1.0** is a good starting point  
✅ **Ridge keeps all features**, Lasso does feature selection  
✅ **Check multicollinearity** with VIF before using Ridge  
✅ **RidgeCV** automatically finds best α via cross-validation  
✅ **Compare with Linear Regression** to see if regularization helps  
✅ **Larger α** = more regularization = simpler model  
✅ **Works well** when most features are relevant

In [None]:
# RIDGE REGRESSION - COMPLETE EXAMPLE

print("="*80)
print("RIDGE REGRESSION (L2 REGULARIZATION) - COMPREHENSIVE GUIDE")
print("="*80)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# 1. GENERATE DATA WITH MULTICOLLINEARITY
print("\n1. GENERATING DATA WITH CORRELATED FEATURES")
print("-"*80)

np.random.seed(42)
n_samples = 100

# Create correlated features (multicollinearity)
x1 = np.random.randn(n_samples)
x2 = x1 + np.random.randn(n_samples) * 0.1  # Highly correlated with x1
x3 = x1 + np.random.randn(n_samples) * 0.1  # Also correlated with x1
x4 = np.random.randn(n_samples)  # Independent
x5 = np.random.randn(n_samples)  # Independent

# Target variable
y = 3*x1 + 2*x2 + 1.5*x3 + 4*x4 + 2*x5 + np.random.randn(n_samples) * 0.5

# Create DataFrame
X = pd.DataFrame({
    'feature1': x1,
    'feature2': x2,  # Correlated with feature1
    'feature3': x3,  # Correlated with feature1
    'feature4': x4,
    'feature5': x5
})

print("Dataset created with multicollinearity:")
print(X.head())

# Check correlations
print(f"\nCorrelation matrix:")
correlations = X.corr()
print(correlations)
print(f"\nNote: feature1, feature2, feature3 are highly correlated!")

# 2. SPLIT AND SCALE DATA
print("\n2. SPLITTING AND SCALING DATA")
print("-"*80)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Scale features (IMPORTANT for Ridge!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeatures scaled to mean=0, std=1")
print(f"Train mean: {X_train_scaled.mean(axis=0)}")
print(f"Train std: {X_train_scaled.std(axis=0)}")

# 3. COMPARE LINEAR REGRESSION VS RIDGE
print("\n3. COMPARING LINEAR REGRESSION VS RIDGE")
print("-"*80)

# Linear Regression (baseline)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

# Ridge Regression (α=1.0)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)

# Compare coefficients
coef_comparison = pd.DataFrame({
    'Feature': X.columns,
    'Linear Reg': lr.coef_,
    'Ridge (α=1)': ridge.coef_
})
print("\nCoefficient comparison:")
print(coef_comparison)
print(f"\nObservation: Ridge coefficients are shrunk toward zero")

# Compare performance
print(f"\nPerformance comparison:")
print(f"Linear Regression R²: {r2_score(y_test, y_pred_lr):.4f}")
print(f"Ridge Regression R²:  {r2_score(y_test, y_pred_ridge):.4f}")

# 4. FIND OPTIMAL ALPHA WITH CROSS-VALIDATION
print("\n4. FINDING OPTIMAL ALPHA WITH CROSS-VALIDATION")
print("-"*80)

# Try different alpha values
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# RidgeCV automatically finds best alpha
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Alphas tested: {alphas}")
print(f"\nBest alpha: {ridge_cv.alpha_}")

# Train Ridge with best alpha
best_ridge = Ridge(alpha=ridge_cv.alpha_)
best_ridge.fit(X_train_scaled, y_train)
y_pred_best = best_ridge.predict(X_test_scaled)

# 5. EVALUATE BEST MODEL
print("\n5. EVALUATING BEST RIDGE MODEL")
print("-"*80)

# Metrics
r2 = r2_score(y_test, y_pred_best)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_best))
mae = mean_absolute_error(y_test, y_pred_best)

print(f"Best Ridge (α={ridge_cv.alpha_}) Performance:")
print(f"  R² Score: {r2:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE: {mae:.4f}")

print(f"\nCoefficients with best alpha:")
for feature, coef in zip(X.columns, best_ridge.coef_):
    print(f"  {feature}: {coef:8.4f}")

# 6. VISUALIZE REGULARIZATION PATH
print("\n6. VISUALIZING REGULARIZATION PATH")
print("-"*80)

# Train Ridge with different alphas
alphas_range = np.logspace(-3, 3, 100)
coefs = []

for alpha in alphas_range:
    ridge_temp = Ridge(alpha=alpha)
    ridge_temp.fit(X_train_scaled, y_train)
    coefs.append(ridge_temp.coef_)

coefs = np.array(coefs)

# Plot regularization path
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Coefficients vs alpha
for i, feature in enumerate(X.columns):
    axes[0].plot(alphas_range, coefs[:, i], label=feature)
axes[0].set_xscale('log')
axes[0].set_xlabel('Alpha (regularization strength)')
axes[0].set_ylabel('Coefficient value')
axes[0].set_title('Ridge Regularization Path')
axes[0].axvline(ridge_cv.alpha_, color='red', linestyle='--', 
                label=f'Best α={ridge_cv.alpha_:.3f}')
axes[0].legend()  # called after axvline so the "Best α" line appears in the legend
axes[0].grid(True, alpha=0.3)

# Actual vs Predicted
axes[1].scatter(y_test, y_pred_best, alpha=0.6)
axes[1].plot([y_test.min(), y_test.max()], 
             [y_test.min(), y_test.max()], 
             'r--', lw=2)
axes[1].set_xlabel('Actual Values')
axes[1].set_ylabel('Predicted Values')
axes[1].set_title(f'Ridge Predictions (α={ridge_cv.alpha_:.3f}, R²={r2:.4f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('ridge_regression_results.png', dpi=100, bbox_inches='tight')
print("Visualizations saved to 'ridge_regression_results.png'")
plt.close()

# 7. COEFFICIENT MAGNITUDE COMPARISON
print("\n7. COEFFICIENT MAGNITUDE COMPARISON")
print("-"*80)

comparison = pd.DataFrame({
    'Feature': X.columns,
    'Linear Reg': lr.coef_,
    'Ridge (α=0.1)': Ridge(alpha=0.1).fit(X_train_scaled, y_train).coef_,
    'Ridge (α=1)': Ridge(alpha=1.0).fit(X_train_scaled, y_train).coef_,
    'Ridge (α=10)': Ridge(alpha=10.0).fit(X_train_scaled, y_train).coef_,
    'Ridge (α=100)': Ridge(alpha=100.0).fit(X_train_scaled, y_train).coef_
})

print("Coefficient shrinkage with increasing alpha:")
print(comparison.to_string(index=False))
print(f"\nAs α increases, coefficients shrink toward zero")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print(f"✓ Ridge Regression adds L2 penalty to prevent overfitting")
print(f"✓ Best alpha found via CV: {ridge_cv.alpha_:.4f}")
print(f"✓ Test R² = {r2:.4f}")
print(f"✓ Handles multicollinearity (correlated features)")
print(f"✓ Shrinks coefficients but never sets to exactly zero")
print(f"✓ Feature scaling is REQUIRED before Ridge")
print(f"✓ Use RidgeCV to automatically find best alpha")
print(f"✓ Larger alpha = more regularization = simpler model")
print("="*80)