# Ridge Regression From Scratch

Ridge Regression is a regularized version of Linear Regression that adds an **L2 penalty** to the cost function to reduce overfitting.

## Key Concepts:
- **L2 Regularization**: Adds a penalty proportional to the sum of squared coefficients
- **Alpha (α)**: Regularization strength parameter, often written λ in the literature
- **Reduces Overfitting**: Shrinks coefficients toward zero, though rarely exactly to zero
- **Handles Multicollinearity**: Works well when features are correlated

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 1. Mathematical Foundation

### Linear Regression Cost Function:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

### Ridge Regression Cost Function (with L2 penalty):
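$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \alpha \sum_{j=1}^{n} \theta_j^2 \right]$$

The penalty sits inside the $\frac{1}{2m}$ factor so that the minimizer agrees with the closed-form solution. The penalty sum starts at $j=1$ because the bias $\theta_0$ is not penalized.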

### Closed-Form Solution:
$$\theta = (X^T X + \alpha I)^{-1} X^T y$$

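where:
- $X$ is the design matrix with a leading column of ones for the bias term
- $\alpha$ is the regularization parameter
- $I$ is the identity matrix, with its first diagonal entry set to 0 so that the bias term $\theta_0$ is not penalized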
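
As a quick numerical sanity check, the sketch below assumes the objective is written as $\|X\theta - y\|^2 + \alpha\|\theta\|^2$ (no intercept, for simplicity) and confirms that its gradient $2X^T(X\theta - y) + 2\alpha\theta$ vanishes at the closed-form solution. The random data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Closed-form ridge solution: solve (X^T X + alpha I) theta = X^T y
theta = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Gradient of ||X theta - y||^2 + alpha ||theta||^2 evaluated at theta
gradient = 2 * X.T @ (X @ theta - y) + 2 * alpha * theta

print("theta:", theta)
print("max |gradient|:", np.max(np.abs(gradient)))
```

A gradient that is zero up to floating-point error indicates that the closed-form expression is the stationary point of this objective.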

## 2. Implementation

In [None]:
class RidgeRegression:
    def __init__(self, alpha=1.0):
        """
        Initialize Ridge Regression
        
        Parameters:
        -----------
        alpha : float
            Regularization strength (default=1.0)
            - alpha = 0: equivalent to Linear Regression
            - larger alpha: stronger regularization
        """
        self.alpha = alpha
        self.weights = None
        self.bias = None
    
    def fit(self, X, y):
        """
        Fit Ridge Regression using closed-form solution
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        """
        # Add bias term (column of ones)
        n_samples, n_features = X.shape
        X_with_bias = np.c_[np.ones((n_samples, 1)), X]
        
        # Create identity matrix (don't penalize bias term)
        I = np.eye(n_features + 1)
        I[0, 0] = 0  # Don't regularize bias
        
        # Closed-form solution: solve (X^T X + αI) θ = X^T y
        # (np.linalg.solve is more stable than forming the explicit inverse)
        XtX = X_with_bias.T @ X_with_bias
        theta = np.linalg.solve(XtX + self.alpha * I, X_with_bias.T @ y)
        
        # Extract bias and weights
        self.bias = theta[0]
        self.weights = theta[1:]
        
        return self
    
    def predict(self, X):
        """
        Make predictions
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        
        Returns:
        --------
        y_pred : array, shape (n_samples,)
        """
        return X @ self.weights + self.bias
    
    def score(self, X, y):
        """
        Calculate R² score
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        
        Returns:
        --------
        r2_score : float
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

## 3. Testing on Synthetic Data

In [None]:
# Generate synthetic data with noise
np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")
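
The standardization step above matters because the L2 penalty depends on the units of each feature. The following sketch illustrates this with scikit-learn's `Ridge` and `make_pipeline`: it rescales one feature by a factor of 1000 and compares predictions with and without a `StandardScaler`. The dataset, the factor, and `alpha=10.0` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# Same data, but feature 0 expressed in different units (1000x larger)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

# Without standardization the penalty treats the two versions differently
raw_a = Ridge(alpha=10.0).fit(X, y).predict(X)
raw_b = Ridge(alpha=10.0).fit(X_rescaled, y).predict(X_rescaled)

# With standardization inside a pipeline the unit change has no effect
pipe_a = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y).predict(X)
pipe_b = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X_rescaled, y).predict(X_rescaled)

raw_gap = np.max(np.abs(raw_a - raw_b))
scaled_gap = np.max(np.abs(pipe_a - pipe_b))
print(f"Max prediction gap without scaling: {raw_gap:.4f}")
print(f"Max prediction gap with scaling:    {scaled_gap:.2e}")
```

Without scaling, the model's predictions change when a feature's units change. With scaling, they should agree up to floating-point error.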

In [None]:
# Train Ridge Regression
ridge = RidgeRegression(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Make predictions
y_pred_train = ridge.predict(X_train_scaled)
y_pred_test = ridge.predict(X_test_scaled)

# Calculate scores
train_score = ridge.score(X_train_scaled, y_train)
test_score = ridge.score(X_test_scaled, y_test)

print(f"\nRidge Regression (alpha={ridge.alpha})")
print(f"Train R² Score: {train_score:.4f}")
print(f"Test R² Score: {test_score:.4f}")

## 4. Comparison with Scikit-learn

In [None]:
from sklearn.linear_model import Ridge as SklearnRidge

# Train sklearn Ridge
sklearn_ridge = SklearnRidge(alpha=1.0)
sklearn_ridge.fit(X_train_scaled, y_train)

# Compare scores
sklearn_train_score = sklearn_ridge.score(X_train_scaled, y_train)
sklearn_test_score = sklearn_ridge.score(X_test_scaled, y_test)

print("\nComparison:")
print(f"{'Method':<20} {'Train R²':<12} {'Test R²':<12}")
print("-" * 44)
print(f"{'Our Ridge':<20} {train_score:<12.4f} {test_score:<12.4f}")
print(f"{'Sklearn Ridge':<20} {sklearn_train_score:<12.4f} {sklearn_test_score:<12.4f}")
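
Matching R² scores suggest the two models agree, but comparing the fitted parameters directly is a stricter check. The sketch below re-implements the same closed form inline (so the snippet is self-contained) and compares it with scikit-learn's `Ridge` on unscaled synthetic data. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=20, random_state=42)
alpha = 1.0

# Closed-form solution with an unpenalized intercept column
Xb = np.c_[np.ones(len(X)), X]
penalty = np.eye(Xb.shape[1])
penalty[0, 0] = 0.0
theta = np.linalg.solve(Xb.T @ Xb + alpha * penalty, Xb.T @ y)

# Reference fit from scikit-learn (fits an unpenalized intercept by default)
reference = Ridge(alpha=alpha).fit(X, y)

coef_gap = np.max(np.abs(theta[1:] - reference.coef_))
intercept_gap = abs(theta[0] - reference.intercept_)
print(f"Max coefficient difference: {coef_gap:.2e}")
print(f"Intercept difference:       {intercept_gap:.2e}")
```

Both differences should be at the level of floating-point noise, since scikit-learn minimizes the same objective with an unpenalized intercept.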

## 5. Effect of Alpha (Regularization Strength)

In [None]:
# Test different alpha values
alphas = [0, 0.1, 1, 10, 100, 1000]
train_scores = []
test_scores = []

for alpha in alphas:
    ridge = RidgeRegression(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    
    train_scores.append(ridge.score(X_train_scaled, y_train))
    test_scores.append(ridge.score(X_test_scaled, y_test))

# Plot results on evenly spaced positions, because alpha=0 has no position on a log axis
positions = np.arange(len(alphas))
plt.figure(figsize=(10, 6))
plt.plot(positions, train_scores, 'o-', label='Train Score', linewidth=2)
plt.plot(positions, test_scores, 's-', label='Test Score', linewidth=2)
plt.xticks(positions, [str(a) for a in alphas])
plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Ridge Regression: Effect of Regularization', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print("\nAlpha vs Performance:")
print(f"{'Alpha':<10} {'Train R²':<12} {'Test R²':<12}")
print("-" * 34)
for i, alpha in enumerate(alphas):
    print(f"{alpha:<10} {train_scores[i]:<12.4f} {test_scores[i]:<12.4f}")

## 6. Coefficient Comparison

In [None]:
# Compare coefficients with different alphas
ridge_weak = RidgeRegression(alpha=0.1)
ridge_strong = RidgeRegression(alpha=100)

ridge_weak.fit(X_train_scaled, y_train)
ridge_strong.fit(X_train_scaled, y_train)

# Plot coefficient magnitudes
plt.figure(figsize=(12, 5))

x_pos = np.arange(len(ridge_weak.weights))
width = 0.35

plt.bar(x_pos - width/2, ridge_weak.weights, width, label='Alpha=0.1 (Weak)', alpha=0.8)
plt.bar(x_pos + width/2, ridge_strong.weights, width, label='Alpha=100 (Strong)', alpha=0.8)

plt.xlabel('Feature Index', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Ridge Regression: Coefficient Shrinkage', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3, axis='y')
plt.show()

print("\nCoefficient L2 Norms:")
print(f"Alpha=0.1:  {np.linalg.norm(ridge_weak.weights):.4f}")
print(f"Alpha=100:  {np.linalg.norm(ridge_strong.weights):.4f}")
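
One way to see why the coefficient norm falls as alpha grows is through the singular value decomposition. On centered data, the ridge solution can be written as $\theta = V\,\mathrm{diag}\!\left(\frac{s_i}{s_i^2 + \alpha}\right) U^T y$, so each direction of the least-squares fit is scaled by $\frac{s_i^2}{s_i^2 + \alpha}$. The sketch below is a minimal illustration on random data. The helper `ridge_via_svd` is a hypothetical name, not part of the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=80)

# Center X and y so the intercept drops out of the problem
Xc = X - X.mean(axis=0)
yc = y - y.mean()

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def ridge_via_svd(alpha):
    # theta = V diag(s / (s^2 + alpha)) U^T y
    return Vt.T @ ((s / (s**2 + alpha)) * (U.T @ yc))

alphas = [0.1, 1, 10, 100, 1000]
norms = [np.linalg.norm(ridge_via_svd(a)) for a in alphas]

for a, norm in zip(alphas, norms):
    shrink = s**2 / (s**2 + a)  # per-direction shrinkage factor
    print(f"alpha={a:<6} ||theta||={norm:.4f}  shrinkage factors={np.round(shrink, 3)}")
```

Directions with small singular values are shrunk the most, which is also why ridge tends to stabilize fits on correlated features.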

## 7. Key Takeaways

### Advantages:
- ✅ Reduces overfitting through regularization
- ✅ Handles multicollinearity well
- ✅ Has a closed-form solution (fast training when the number of features is moderate)
- ✅ All features retained (unlike Lasso)
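
The multicollinearity point can be illustrated with two nearly identical features. The sketch below uses scikit-learn's `LinearRegression` and `Ridge` on synthetic data. The noise levels and `alpha=1.0` are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)      # nearly a copy of x1
X = np.c_[x1, x2]
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # only the shared signal matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

With features this strongly correlated, unregularized least squares tends to produce large coefficients that are sensitive to noise. Ridge tends to split the shared effect between the two features.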

### Disadvantages:
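- ❌ Doesn't perform feature selection (coefficients shrink but rarely reach exactly zero)
- ❌ Sensitive to feature scale, so features should be standardized first
- ❌ Requires tuning the alpha hyperparameter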
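
Alpha is usually tuned by cross-validation on the training data rather than by looking at test scores. A minimal sketch, assuming scikit-learn's `RidgeCV` and an arbitrary logarithmic grid of candidate values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate grid (an illustrative choice): 13 values from 1e-3 to 1e3
candidate_alphas = np.logspace(-3, 3, 13)

# Scaling lives inside the pipeline so it is fit on the training data only
model = make_pipeline(StandardScaler(), RidgeCV(alphas=candidate_alphas, cv=5))
model.fit(X_train, y_train)

best_alpha = model.named_steps["ridgecv"].alpha_
print(f"Selected alpha: {best_alpha}")
print(f"Test R²: {model.score(X_test, y_test):.4f}")
```

The test set is used only once, after alpha has been chosen, so the reported score is not biased by the tuning itself.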

### When to Use:
- Many correlated features
- Want to keep all features
- Prevent overfitting on high-dimensional data
- Linear relationships in data