#### This notebook walks through implementing Ridge and Lasso regression from scratch

# Regularization


### 1. What is Regularization?

Regularization is a technique used in machine learning to prevent overfitting by discouraging overly complex models. It introduces a penalty term to the loss function that the algorithm tries to minimize, which constrains the magnitude of the model parameters (coefficients). By doing so, regularization encourages the model to learn simpler patterns that generalize better to unseen data.

---

### 2. L1 and L2 Regularization

- **L1 Regularization (Lasso Regression):**
  - Adds the **absolute values** of the coefficients as penalty to the loss function.
  - Formula: `Loss + α * Σ|wᵢ|`
  - Encourages **sparse solutions**—some coefficients may become exactly zero, effectively performing feature selection.

- **L2 Regularization (Ridge Regression):**
  - Adds the **squared values** of the coefficients as penalty.
  - Formula: `Loss + α * Σ(wᵢ²)`
  - Tends to **shrink all coefficients** toward zero, but rarely makes any of them exactly zero.
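
As a rough illustration of the two penalty terms above, the sketch below evaluates them for a made-up weight vector. The values of `w` and `alpha` are illustrative assumptions, not taken from any model in this notebook.

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustrative values only)
w = np.array([0.5, -2.0, 0.0, 3.0])
alpha = 0.1

l1_penalty = alpha * np.sum(np.abs(w))  # Lasso term: alpha * Σ|wᵢ|
l2_penalty = alpha * np.sum(w ** 2)     # Ridge term: alpha * Σ(wᵢ²)

print(f"L1 penalty: {l1_penalty:.3f}")
print(f"L2 penalty: {l2_penalty:.3f}")
```

The squared penalty is dominated by the largest weights, while the absolute-value penalty grows linearly with each weight.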

---

### 3. Why is Regularization Important?

- **Reduces Overfitting**: Helps prevent the model from fitting noise in the training data.
- **Improves Generalization**: Encourages simpler models that work better on new, unseen data.
- **Feature Selection (L1)**: Lasso can automatically eliminate irrelevant features by assigning them zero weight.
- **Model Stability**: Regularization makes the fitted model less sensitive to small changes in the training data (lower variance).

---

🔑 In summary, regularization is a critical tool in a machine learning practitioner's toolbox to build robust, generalizable models, especially when working with high-dimensional or noisy datasets.


In [1]:
# comparing all models 

In [2]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load and split the dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Linear Regression (Closed-form)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)

# Linear Regression with SGDRegressor
sgd = SGDRegressor(max_iter=1000, learning_rate='invscaling', eta0=0.01, penalty=None, random_state=42)
sgd.fit(X_train_scaled, y_train)
sgd_preds = sgd.predict(X_test_scaled)

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
ridge_preds = ridge.predict(X_test_scaled)

# Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
lasso_preds = lasso.predict(X_test_scaled)

# Evaluation function
def evaluate_model(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name:<35} | MSE: {mse:.4f} | R²: {r2:.4f}")

# Compare models
print("📊 Model Comparison Results:\n")
evaluate_model("Linear Regression (Closed-form)", y_test, lr_preds)
evaluate_model("SGDRegressor (Linear GD)", y_test, sgd_preds)
evaluate_model("Ridge Regression", y_test, ridge_preds)
evaluate_model("Lasso Regression", y_test, lasso_preds)


📊 Model Comparison Results:

Linear Regression (Closed-form)     | MSE: 0.5559 | R²: 0.5758
SGDRegressor (Linear GD)            | MSE: 0.5506 | R²: 0.5798
Ridge Regression                    | MSE: 0.5559 | R²: 0.5758
Lasso Regression                    | MSE: 0.6796 | R²: 0.4814


In [5]:
from sklearn.linear_model import SGDRegressor
        
sgd = SGDRegressor(penalty=None, max_iter=1000, learning_rate='invscaling', eta0=0.01, random_state=42)
sgd.fit(X_train_scaled, y_train)
y_pred_sgd = sgd.predict(X_test_scaled)

mse_sgd = mean_squared_error(y_test, y_pred_sgd)
r2_sgd = r2_score(y_test, y_pred_sgd)

print("SGDRegressor (Linear GD):")

print(mse_sgd)
print(r2_sgd)

SGDRegressor (Linear GD):
0.5506075204179468
0.5798200946722103


# Ridge from scratch

## 📘 Ridge Regression with Gradient Descent

### What is Ridge Regression?

Ridge Regression is a type of **linear regression** that includes **L2 regularization**. It modifies the standard linear regression loss function by adding a penalty proportional to the **square of the magnitude of the coefficients**.

The goal is to **prevent overfitting** by discouraging the model from learning excessively large weights.

---

### 🧮 Ridge Loss Function with L2 Penalty

The objective function Ridge minimizes is:

\[
J(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^m (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^n w_j^2
\]

- \( m \): number of training samples
- \( y_i \): true value
- \( \hat{y}_i \): predicted value
- \( w_j \): model weight (coefficient)
- \( \alpha \): regularization strength
- The first term is half the **mean squared error** (the factor \( \frac{1}{2} \) cancels when differentiating)
- The second term is the **L2 penalty** (sum of squared weights)
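
A minimal sketch of this objective in NumPy is given below. The function name `ridge_loss` and the tiny dataset are assumptions made for illustration; the function penalizes every weight, exactly as the formula is written.

```python
import numpy as np

def ridge_loss(X, y, w, alpha):
    """J(w) = 1/(2m) * Σ(y_i - ŷ_i)² + alpha * Σ w_j², penalizing all weights."""
    m = X.shape[0]
    residuals = y - X @ w
    data_term = (residuals @ residuals) / (2 * m)
    penalty = alpha * np.sum(w ** 2)
    return data_term + penalty

# Tiny made-up dataset (illustrative only)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.0])

loss_value = ridge_loss(X, y, w, alpha=0.1)
print(loss_value)
```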

---

### ✅ Gradient Descent Update Rule

Since the loss function is differentiable, we can apply standard gradient descent. The gradient of the Ridge loss with respect to the weights is:

\[
\nabla J(\mathbf{w}) = \frac{1}{m} X^\top (X\mathbf{w} - y) + 2\alpha \mathbf{w}
\]

**Weight Update Rule:**

\[
\mathbf{w} = \mathbf{w} - \eta \cdot \nabla J(\mathbf{w})
\]

Where:
- \( \eta \): learning rate
- The regularization term \( 2\alpha \mathbf{w} \) **shrinks** the weights during training
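
One way to sanity-check the gradient formula is to compare it against central finite differences of the loss. The sketch below does this on random data; the helper names `ridge_loss` and `ridge_gradient`, the data, and the step size `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def ridge_loss(X, y, w, alpha):
    # J(w) = 1/(2m) * ||Xw - y||² + alpha * ||w||²
    m = X.shape[0]
    r = X @ w - y
    return (r @ r) / (2 * m) + alpha * np.sum(w ** 2)

def ridge_gradient(X, y, w, alpha):
    # ∇J(w) = (1/m) * Xᵀ(Xw - y) + 2 * alpha * w
    m = X.shape[0]
    return (X.T @ (X @ w - y)) / m + 2 * alpha * w

# Random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
alpha = 0.5

# Central finite-difference approximation of each partial derivative
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (ridge_loss(X, y, w + step, alpha)
                  - ridge_loss(X, y, w - step, alpha)) / (2 * eps)

analytic = ridge_gradient(X, y, w, alpha)
max_abs_diff = np.max(np.abs(numeric - analytic))
print(f"max |numeric - analytic| = {max_abs_diff:.2e}")
```

If the two gradients agree to within rounding error, the analytic expression is consistent with the loss as written.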

---

### 💡 Key Properties of Ridge Regression

- Encourages **small but non-zero** weights (unlike Lasso, which can set them to zero)
- Useful when you have **multicollinearity** or **high-dimensional data**
- Bias term is typically **excluded** from regularization

---

### 🛠️ Why Use Ridge Regression?

- Reduces model complexity and variance
- Helps prevent overfitting
- Does **not perform feature selection** (all features are retained)
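
The shrinkage behaviour can be sketched with the closed-form minimizer of the objective in this section. Setting the gradient to zero gives \( \mathbf{w} = (X^\top X + 2m\alpha I)^{-1} X^\top y \). The example below assumes no intercept and uses synthetic data, so it is only an illustration, not the notebook's training setup.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimizer of 1/(2m)*||y - Xw||² + alpha*||w||² (no intercept; sketch only)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + 2 * m * alpha * np.eye(n), X.T @ y)

# Synthetic data with one truly irrelevant feature (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.5, 1.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

alphas = [0.0, 0.1, 1.0, 10.0]
norms = []
for alpha in alphas:
    w = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>5}: ||w|| = {norms[-1]:.4f}, exact zeros = {np.sum(w == 0.0)}")
```

The weight norm shrinks as `alpha` grows, while the coefficients typically stay non-zero, which is why Ridge does not act as a feature selector.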


In [4]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load and prepare data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Add bias term
X_train_bias = np.c_[np.ones((X_train_scaled.shape[0], 1)), X_train_scaled]
X_test_bias = np.c_[np.ones((X_test_scaled.shape[0], 1)), X_test_scaled]

# Ridge Regression using Gradient Descent
class RidgeRegressionGD:
    def __init__(self, alpha=1.0, learning_rate=0.01, n_iterations=1000):
        self.alpha = alpha
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.zeros(n)

        for _ in range(self.n_iterations):
            predictions = X @ self.weights
            errors = predictions - y

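            # Gradient of the data term plus the L2 term, with the bias (index 0) excluded.
            # Note: the penalty gradient here is (2*alpha/m)*w, not the 2*alpha*w of the
            # formula above, so the effective regularization strength is alpha/m.
            gradient = (1/m) * (X.T @ errors) + (2 * self.alpha / m) * np.r_[0, self.weights[1:]]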
            self.weights -= self.learning_rate * gradient

    def predict(self, X):
        return X @ self.weights

# Train and evaluate
ridge_gd = RidgeRegressionGD(alpha=1.0, learning_rate=0.1, n_iterations=1000)
ridge_gd.fit(X_train_bias, y_train)
preds = ridge_gd.predict(X_test_bias)

mse = mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

print("📉 Ridge Regression (Gradient Descent):")
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")


📉 Ridge Regression (Gradient Descent):
MSE: 0.5559
R² Score: 0.5758


# Lasso from scratch (coordinate descent)


In [3]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load and prepare data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Add intercept manually
X_train_bias = np.c_[np.ones((X_train_scaled.shape[0], 1)), X_train_scaled]
X_test_bias = np.c_[np.ones((X_test_scaled.shape[0], 1)), X_test_scaled]

class LassoRegressionScratch:
    def __init__(self, alpha=0.1, max_iter=1000, tol=1e-4):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol

    def soft_thresholding(self, rho, alpha):
        if rho < -alpha:
            return rho + alpha
        elif rho > alpha:
            return rho - alpha
        else:
            return 0.0

    def fit(self, X, y):
        m, n = X.shape
        self.coef_ = np.zeros(n)
        for iteration in range(self.max_iter):
            coef_old = self.coef_.copy()
            for j in range(n):
                X_j = X[:, j]
                residual = y - X @ self.coef_ + self.coef_[j] * X_j
                rho = np.dot(X_j, residual)

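                if j == 0:  # Intercept (bias) term - no regularization
                    self.coef_[j] = rho / np.dot(X_j, X_j)
                else:
                    # Note: rho is an unnormalized sum over all m samples, so thresholding
                    # at alpha (rather than m * alpha) corresponds to a penalty of alpha / m
                    # in a 1/(2m)-scaled objective. With alpha=0.1 the fit is therefore
                    # almost unregularized and close to ordinary least squares.
                    self.coef_[j] = self.soft_thresholding(rho, self.alpha) / np.dot(X_j, X_j)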

            # Check convergence
            if np.sum(np.abs(self.coef_ - coef_old)) < self.tol:
                break

    def predict(self, X):
        return X @ self.coef_

# Train Lasso
lasso_scratch = LassoRegressionScratch(alpha=0.1)
lasso_scratch.fit(X_train_bias, y_train)
y_pred = lasso_scratch.predict(X_test_bias)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("🛠 Lasso Regression (From Scratch):")
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")


🛠 Lasso Regression (From Scratch):
MSE: 0.5559
R² Score: 0.5758


## 📘 Lasso Regression with Gradient Descent

### Why is Lasso Harder Than Ridge?

Lasso regression applies **L1 regularization**, adding a penalty proportional to the sum of the **absolute values of the coefficients**. Unlike the squared penalty used by Ridge, the absolute value function is **not differentiable at 0**, so standard gradient descent cannot be applied directly.

To handle this, we use **subgradient descent**, which extends gradient descent to convex functions that are not differentiable everywhere.

---

### 🧮 Lasso Loss Function with L1 Penalty

The objective function to minimize is:

\[
J(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^m (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^n |w_j|
\]

- \( \hat{y}_i \) is the model prediction for sample \( i \)
- \( \alpha \) is the regularization strength
- The first term is half the mean squared error
- The second term is the L1 penalty (sum of absolute values of weights)

---

### 🧮 Subgradient of the L1 Term

Since the derivative of \( |w_j| \) is undefined at \( w_j = 0 \), we use a **subgradient** instead. At \( w_j = 0 \), any value in \( [-1, 1] \) is a valid subgradient; we choose 0:

\[
\partial |w_j| \ni
\begin{cases}
1, & \text{if } w_j > 0 \\
-1, & \text{if } w_j < 0 \\
0, & \text{if } w_j = 0
\end{cases}
\]

This allows us to apply a modified form of gradient descent, called **subgradient descent**, for optimizing Lasso.
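
The subgradient choice above coincides with the sign function, so it can be sketched with `np.sign`. The helper names `l1_subgradient` and `lasso_subgradient` and the small inputs below are assumptions for illustration; all weights are penalized here, as in the formula.

```python
import numpy as np

def l1_subgradient(w):
    """One valid subgradient of Σ|w_j|: sign(w_j), choosing 0 where w_j == 0."""
    return np.sign(w)

def lasso_subgradient(X, y, w, alpha):
    """Subgradient of J(w) = 1/(2m)*Σ(y_i - ŷ_i)² + alpha*Σ|w_j|."""
    m = X.shape[0]
    return (X.T @ (X @ w - y)) / m + alpha * l1_subgradient(w)

# Made-up inputs (illustrative only)
w = np.array([-2.0, 0.0, 3.0])
X = np.eye(3)
y = np.array([1.0, 1.0, 1.0])

print(l1_subgradient(w))
g = lasso_subgradient(X, y, w, alpha=0.1)
print(g)
```

A subgradient descent step is then `w -= learning_rate * g`, the same form as the Ridge update.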

---

### 💡 Summary

- Lasso Regression adds L1 penalty which **encourages sparsity** in weights.
- The **absolute value** function requires using **subgradients** instead of standard gradients.
- Lasso is powerful when you want both **regularization and feature selection** in a linear model.
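
To illustrate the sparsity point, the sketch below applies a vectorized soft-thresholding operator, the same rule as `soft_thresholding` in the `LassoRegressionScratch` class, to a made-up vector. The function name `soft_threshold` and the input values are assumptions for illustration.

```python
import numpy as np

def soft_threshold(rho, threshold):
    """Shrink each value toward zero by `threshold`; values within it become exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - threshold, 0.0)

rho = np.array([-1.5, -0.05, 0.0, 0.08, 2.0])  # made-up values
threshold = 0.1

shrunk = soft_threshold(rho, threshold)
print(shrunk)
print("exact zeros:", int(np.sum(shrunk == 0.0)))
```

Entries whose magnitude is below the threshold are set to exactly zero. A plain subgradient step, by contrast, moves each weight by a fixed amount and generally does not land exactly on zero, so weights trained that way tend to be small rather than exactly sparse.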



In [5]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load and prepare data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Add intercept term
X_train_bias = np.c_[np.ones((X_train_scaled.shape[0], 1)), X_train_scaled]
X_test_bias = np.c_[np.ones((X_test_scaled.shape[0], 1)), X_test_scaled]

# Lasso Regression using Gradient Descent
class LassoRegressionGD:
    def __init__(self, alpha=0.1, learning_rate=0.01, n_iterations=1000):
        self.alpha = alpha
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    def subgradient(self, w):
        return np.where(w > 0, 1, np.where(w < 0, -1, 0))

    def fit(self, X, y):
        m, n = X.shape
        self.weights = np.zeros(n)

        for _ in range(self.n_iterations):
            predictions = X @ self.weights
            errors = predictions - y

            grad = (1/m) * (X.T @ errors) + self.alpha * self.subgradient(self.weights)
            grad[0] -= self.alpha * self.subgradient(self.weights[0])  # exclude bias from regularization
            self.weights -= self.learning_rate * grad

    def predict(self, X):
        return X @ self.weights

# Train and evaluate
lasso_gd = LassoRegressionGD(alpha=0.1, learning_rate=0.01, n_iterations=1000)
lasso_gd.fit(X_train_bias, y_train)
preds = lasso_gd.predict(X_test_bias)

mse = mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)

print("📉 Lasso Regression (Gradient Descent):")
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")


📉 Lasso Regression (Gradient Descent):
MSE: 0.6795
R² Score: 0.4815
