# **RIDGE REGRESSION**

## **A. CONCEPT**

***Ridge Regression*** is a type of linear regression that includes an additional **penalty term** to help reduce overfitting, especially when the dataset has many predictors (features) or when multicollinearity exists.

### **1. Linear Regression Formula**

In standard linear regression, we try to find the line/hyperplane that best fits the data. This is done by minimizing the residual sum of squares, $SS_{residuals}$, between the observed values and the predicted values.

$\text{Cost function} = \sum_{i=1}^n(y_i - \hat{y_i})^2$

where:
- $y_i$: true values
- $\hat{y_i}$: predicted values

### **2. Ridge Regression Formula**

Ridge Regression modifies the linear regression cost function by adding a penalty term to prevent large coefficients. The penalty is proportional to the sum of the squared coefficients, controlled by a parameter $\lambda$ *(also known as the regularization parameter)*.

$\text{Cost function} = \sum_{i=1}^n(y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^p \beta_j^2$

where:
- $\beta_j$: coefficients of the features
- $\lambda$: regularization strength
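
As a small worked example, the cost above can be evaluated by hand on toy numbers. The function name `ridge_cost` and the arrays below are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def ridge_cost(X, y, beta, lam):
    """Residual sum of squares plus lam times the sum of squared coefficients."""
    residuals = y - X @ beta
    return np.sum(residuals ** 2) + lam * np.sum(beta ** 2)

# Toy data, chosen only so the arithmetic is easy to follow
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
y_toy = np.array([1.0, 2.0])
beta_toy = np.array([0.5, 0.0])

# Predictions are [0.5, 1.5], so both residuals are 0.5 and the RSS is 0.5
cost_ols = ridge_cost(X_toy, y_toy, beta_toy, lam=0.0)
# The penalty adds 2.0 * (0.5**2 + 0.0**2) = 0.5
cost_ridge = ridge_cost(X_toy, y_toy, beta_toy, lam=2.0)

print(cost_ols)    # 0.5
print(cost_ridge)  # 1.0
```

With $\lambda = 0$ the penalty vanishes and the cost reduces to the linear regression cost from the previous section.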

The penalty term $\lambda \sum_{j=1}^p \beta_j^2$ forces the coefficient $\beta_j$ to be smaller. Larger values of $\lambda$ result in stronger regularization, shrinking the coefficients more toward zero. This reduces the complexity of the model and helps prevent overfitting.

The parameter $\lambda$ is typically chosen through *cross-validation*. A larger $\lambda$ results in greater regularization (more shrinkage of coefficients), while a smaller $\lambda$ allows the model to resemble ordinary least squares (OLS) regression.
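
One way to picture the cross-validation step is a hand-rolled k-fold loop. This is a minimal sketch on synthetic data: the helpers `ridge_fit` and `cv_mse`, the candidate grid, and the data are assumptions made for illustration. It also uses the closed-form ridge solution without an intercept, rather than the gradient descent used later in this notebook.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept): solve (X^T X + lam I) b = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge regression over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        scores.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(scores))

# Synthetic regression problem, used only to exercise the loop
rng = np.random.default_rng(1)
X_cv = rng.normal(size=(120, 8))
y_cv = X_cv @ rng.normal(size=8) + rng.normal(scale=2.0, size=120)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X_cv, y_cv, lam) for lam in candidates}
best_lambda = min(scores, key=scores.get)
print(best_lambda, scores[best_lambda])
```

The value with the lowest average validation error is kept. In practice the grid is often spaced logarithmically, as it is here.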

Ridge Regression is especially useful when there are many correlated features (multicollinearity), as it helps stabilize the coefficient estimates and reduce variance in the model.
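
The stabilizing effect can be illustrated with the closed-form ridge solution $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$ on two nearly identical features. The data and the helper `ridge_closed_form` below are assumptions made for this sketch, and no intercept is fitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1 (strong collinearity)
X_syn = np.column_stack([x1, x2])
y_syn = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta_hat = ridge_closed_form(X_syn, y_syn, lam)
    norms.append(np.linalg.norm(beta_hat))
    print(f"lambda={lam:>6}: beta={beta_hat}, norm={norms[-1]:.4f}")
```

The Euclidean norm of the coefficient vector does not increase as $\lambda$ grows. With collinear features, the unpenalized fit ($\lambda = 0$) can split the effect between the two columns in an unstable way.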

### **3. Gradient Descent for Ridge Regression**

Ridge Regression has a closed-form solution, but it can also be trained with Gradient Descent, which is the approach used in this notebook. Gradient Descent updates the model parameters iteratively to minimize the cost function:

$\beta_j \leftarrow \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}$

The gradient for Ridge Regression is:

$\frac{\partial J}{\partial \beta_j} = -2 \sum_{i=1}^n x_{ij}(y_i - \hat{y_i}) + 2 \lambda \beta_j$

where:
- $x_{ij}$: the j-th feature of the i-th sample.
- $\hat{y_i} = x_i^T \beta$: the prediction.
- $\alpha$: the learning rate.
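
A quick way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below uses random data; the function names and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def ridge_cost(X, y, beta, lam):
    """J(beta) = RSS + lam * sum(beta_j^2)."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

def ridge_gradient(X, y, beta, lam):
    """Analytic gradient: -2 X^T (y - X beta) + 2 lam beta."""
    return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

rng = np.random.default_rng(0)
X_g = rng.normal(size=(20, 3))
y_g = rng.normal(size=20)
beta_g = rng.normal(size=3)
lam, eps = 0.5, 1e-6

# Central differences, one coordinate at a time
numeric = np.zeros(3)
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    numeric[j] = (ridge_cost(X_g, y_g, beta_g + step, lam)
                  - ridge_cost(X_g, y_g, beta_g - step, lam)) / (2 * eps)

analytic = ridge_gradient(X_g, y_g, beta_g, lam)
print(np.allclose(numeric, analytic, atol=1e-4))
```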

## **B. IMPLEMENTATION**

### **0. Preparing data**

The dataset used here is the ***California Housing Dataset***.

Derived from the 1990 U.S. Census, this dataset includes various attributes for California districts, such as *median house value, median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude.*

- **Labels**: Continuous values representing the median house value (in US dollars).

- **Scope**: Includes various attributes of districts in California such as median income, house age, and geographical coordinates.

- **Size**: 20,640 samples, each with 10 attributes (9 numerical and the categorical `ocean_proximity`).

- **Language**: N/A (numerical data).

In [234]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [235]:
# File path
DATA_PATH = './data/housing[1].csv'

In [236]:
# Load data
data = pd.read_csv(DATA_PATH)
data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [237]:
data_encoded = pd.get_dummies(data['ocean_proximity'], prefix='ocean_proximity')
data = pd.concat([data, data_encoded], axis=1)
data = data.drop('ocean_proximity', axis=1)

In [238]:
# Separate data into features and target
X = data.drop('median_house_value', axis=1)
y = data['median_house_value']

print(len(X.columns))
print(X.shape)
print(y.shape)

13
(20640, 13)
(20640,)


In [239]:
# Normalize data
from sklearn.preprocessing import StandardScaler
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X = scaler_X.fit_transform(X)
y = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()

print(X[:5])
print(y[:5])

[[-1.32783522  1.05254828  0.98214266 -0.8048191  -0.97032521 -0.9744286
  -0.97703285  2.34476576 -0.89115574 -0.68188905 -0.01556621  2.83074203
  -0.38446649]
 [-1.32284391  1.04318455 -0.60701891  2.0458901   1.34827594  0.86143887
   1.66996103  2.33223796 -0.89115574 -0.68188905 -0.01556621  2.83074203
  -0.38446649]
 [-1.33282653  1.03850269  1.85618152 -0.53574589 -0.82556097 -0.82077735
  -0.84363692  1.7826994  -0.89115574 -0.68188905 -0.01556621  2.83074203
  -0.38446649]
 [-1.33781784  1.03850269  1.85618152 -0.62421459 -0.71876767 -0.76602806
  -0.73378144  0.93296751 -0.89115574 -0.68188905 -0.01556621  2.83074203
  -0.38446649]
 [-1.33781784  1.03850269  1.85618152 -0.46240395 -0.61197437 -0.75984669
  -0.62915718 -0.012881   -0.89115574 -0.68188905 -0.01556621  2.83074203
  -0.38446649]]
[2.12963148 1.31415614 1.25869341 1.16510007 1.17289952]


In [240]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(16512, 13)
(4128, 13)
(16512,)
(4128,)
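
The cells above standardize the full dataset before splitting, so test-set statistics influence the scaling. A common alternative is to split first and compute the mean and standard deviation from the training rows only. The sketch below shows this with plain NumPy on synthetic data. The arrays and variable names are assumptions. The arithmetic matches `StandardScaler`, which also uses the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for a feature matrix (not the housing data)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

# Shuffle the row indices and hold out 20% as a test set
idx = rng.permutation(len(X_raw))
n_test = int(0.2 * len(X_raw))
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Scaling statistics come from the training rows only
mu = X_raw[train_idx].mean(axis=0)
sigma = X_raw[train_idx].std(axis=0)

X_train_s = (X_raw[train_idx] - mu) / sigma
X_test_s = (X_raw[test_idx] - mu) / sigma   # reuses the training mean and std

print(X_train_s.mean(axis=0))  # approximately zero by construction
print(X_test_s.mean(axis=0))   # near zero, but not exactly
```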


In [241]:
# Drop rows with NaN values
# Combine X and y for training and testing datasets
train_data = np.column_stack((X_train, y_train))
test_data = np.column_stack((X_test, y_test))

# Drop rows with NaN values from the combined train and test data
train_data = train_data[~np.isnan(train_data).any(axis=1)]
test_data = test_data[~np.isnan(test_data).any(axis=1)]

# Separate back into X and y after dropping NaN rows
X_train = train_data[:, :-1]
y_train = train_data[:, -1]
X_test = test_data[:, :-1]
y_test = test_data[:, -1]

### **1. Training model**

In [242]:
# Setting up hyperparameters
ALPHA = 0.01 # The learning rate (lr)
LAMBDA = 0.5 # The regularization parameter (lambda)
EPOCHS = 5000 # The number of iterations

#### *1.1. Add intercept column to the features (X)*

In [243]:
# Add intercept term to X_train and X_test
X_train = np.hstack((np.ones((X_train.shape[0], 1)), X_train)) if X_train.shape[1] == 13 else X_train
X_test = np.hstack((np.ones((X_test.shape[0], 1)), X_test)) if X_test.shape[1] == 13 else X_test

print(X_train[:5])

[[ 1.          1.26764451 -1.36797628  0.34647803  0.22471827  0.21152061
   0.7722505   0.32292363 -0.32165429 -0.89115574 -0.68188905 -0.01556621
  -0.35326426  2.60100692]
 [ 1.          0.7036268  -0.87169852  1.61780729  0.34206536  0.59123012
  -0.09843989  0.67079931 -0.03061993 -0.89115574 -0.68188905 -0.01556621
  -0.35326426  2.60100692]
 [ 1.         -0.45435647 -0.45501247 -1.95780625 -0.33863945 -0.49094197
  -0.45077809 -0.42775547  0.1503488  -0.89115574 -0.68188905 -0.01556621
  -0.35326426  2.60100692]
 [ 1.          1.22771405 -1.37734001  0.58485227 -0.55683169 -0.40550733
  -0.00660236 -0.37805894 -1.01494666 -0.89115574 -0.68188905 -0.01556621
  -0.35326426  2.60100692]
 [ 1.         -0.11494758  0.53754306  1.14105882 -0.11632172 -0.25362353
  -0.48698327 -0.31266878 -0.16658335 -0.89115574  1.46651424 -0.01556621
  -0.35326426 -0.38446649]]


#### *1.2. Initialize Coefficients*

In [244]:
beta = np.random.randn(X_train.shape[1]) * 0.01
print(beta)

[-0.00333474 -0.00679754 -0.0111908   0.0005447  -0.01538228  0.00744551
 -0.01440408 -0.0020603   0.01690558 -0.01130915  0.0098676   0.02075594
  0.01103969  0.01896849]


#### *1.3. Implement Gradient Descent*

In [245]:
for epoch in range(EPOCHS):
    # Predict with the current coefficients
    y_pred = np.dot(X_train, beta)
    # Calculate residuals
    residuals = y_train - y_pred
    # Gradient of the ridge cost, averaged over the training samples
    # (the intercept beta[0] is penalized here along with the other coefficients)
    gradient = (-2 * (np.dot(X_train.T, residuals)) + (2 * LAMBDA * beta)) / X_train.shape[0]
    # Update all coefficients, intercept included
    beta -= ALPHA * gradient
    if epoch % 100 == 0:
        # The reported cost is the training MSE, without the penalty term
        print(f"Epoch {epoch}, Cost: {np.mean(residuals ** 2)}, Beta: {beta[:10]}")

print("Trained Ridge Regression Coefficients: ", beta)

Epoch 0, Cost: 0.9873005288066656, Beta: [-0.00320672 -0.007595   -0.01414565  0.00246856 -0.01230926  0.00884514
 -0.01442485 -0.00033401  0.03055097 -0.00566761]
Epoch 100, Cost: 0.38740653196496916, Beta: [-3.86535085e-04 -8.02280082e-02 -6.82993995e-02  8.44175185e-02
  6.90711240e-02  5.07100171e-02 -1.06209833e-01  4.16763589e-02
  5.37523339e-01  9.94342137e-02]
Epoch 200, Cost: 0.3732680497423016, Beta: [-0.00067174 -0.10104476 -0.08618224  0.11032794  0.07007082  0.08955498
 -0.17213288  0.0711088   0.60140827  0.09308952]
Epoch 300, Cost: 0.36776886645619944, Beta: [-0.00075715 -0.11868728 -0.10515683  0.12124189  0.0598668   0.12302271
 -0.22029459  0.09720096  0.61513963  0.09083751]
Epoch 400, Cost: 0.36425629487809363, Beta: [-0.00077912 -0.13593011 -0.12371669  0.12515964  0.04690765  0.15026612
 -0.25684621  0.11894188  0.62043955  0.08968024]
Epoch 500, Cost: 0.3618061108158911, Beta: [-0.00078934 -0.15255026 -0.14157383  0.12644829  0.03338043  0.17247619
 -0.28469757

### **2. Evaluation**

In [246]:
# Predict
y_test_pred = X_test @ beta
print(y_test_pred)
y_test_pred_original = scaler_y.inverse_transform(y_test_pred.reshape(-1, 1)).flatten()
y_test_pred_original

[-1.30691446  0.3300369  -0.02068932 ...  2.01136131 -0.73958894
 -0.20282866]


array([ 56047.27104609, 244939.70509301, 204468.41774545, ...,
       438952.47120035, 121512.56334535, 183450.84588621])

#### *2.1. MSE*

$MSE = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2$

In [247]:
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

print("Mean Squared Error: ", mean_squared_error(y_test, y_test_pred))

Mean Squared Error:  0.36281572712363325
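
Because the target was standardized, this MSE is in squared standard-deviation units rather than dollars. An error in original units can be recovered by multiplying the root of the scaled MSE by the target's standard deviation. The sketch below checks that identity on synthetic numbers. All values and names in it are assumptions, not results from the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic targets and predictions in "dollar" units
y_dollars = rng.uniform(50_000, 500_000, size=500)
pred_dollars = y_dollars + rng.normal(scale=40_000, size=500)

# Standardize both with the statistics of the true targets
mu_y, sigma_y = y_dollars.mean(), y_dollars.std()
y_scaled = (y_dollars - mu_y) / sigma_y
pred_scaled = (pred_dollars - mu_y) / sigma_y

mse_scaled = np.mean((y_scaled - pred_scaled) ** 2)

# RMSE in original units, computed two ways
rmse_from_scaled = np.sqrt(mse_scaled) * sigma_y
rmse_direct = np.sqrt(np.mean((y_dollars - pred_dollars) ** 2))
print(rmse_from_scaled, rmse_direct)
```

In this notebook, the corresponding standard deviation should be available as `scaler_y.scale_[0]`.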


#### *2.2. R-Squared*

$R^2 = 1 - \frac{SS_{residuals}}{SS_{total}}$

In [248]:
def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    rs = 1 - (ss_res / ss_total)
    return rs

print("R-Squared: ", r_squared(y_test, y_test_pred))

R-Squared:  0.6323689736157119
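
To make the scale of $R^2$ concrete, the same formula can be evaluated on a few hand-checkable cases. The toy arrays are assumptions for illustration: a perfect prediction gives 1, predicting the mean everywhere gives 0, and anything in between reflects the share of variance explained.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_total

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_demo, y_demo)
r2_mean = r_squared(y_demo, np.full(4, y_demo.mean()))
# Residuals are +/-0.5, so SS_res = 1.0 while SS_total = 5.0
r2_partial = r_squared(y_demo, np.array([1.5, 1.5, 3.5, 3.5]))

print(r2_perfect)  # 1.0
print(r2_mean)     # 0.0
print(r2_partial)  # 0.8
```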


## **C. PUTTING EVERYTHING TOGETHER**

In [249]:
import numpy as np
from typing import Annotated

class RidgeRegressionModel:
    def __init__(self,
                 lambda_: Annotated[float, 'Regularization parameter'] = 1.0,
                 epochs: Annotated[int, 'Number of epochs'] = 1000,
                 learning_rate: Annotated[float, 'Learning rate'] = 0.01,
                 ):
        # Training data is passed to fit(), so only hyperparameters are stored here
        self.lambda_ = lambda_
        self.epochs = epochs
        self.learning_rate = learning_rate

    def fit(self, X: np.ndarray, y: np.ndarray):
        _, n_features = X.shape
        # Start from zero weights and zero bias
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.epochs):
            self.update_weights(X, y)
            if epoch % 100 == 0:
                print(f'Epoch: {epoch}, Loss: {self.loss}')

    def update_weights(self, X: np.ndarray, y: np.ndarray):
        n_rows, _ = X.shape
        # Include the bias so the residuals match what predict() returns
        y_pred = X @ self.w + self.b
        res = y - y_pred
        # The penalty applies to the weights only; the bias is not regularized
        dW = (-2 * (X.T @ res) + 2 * self.lambda_ * self.w) / n_rows
        dB = (-2 * np.sum(res)) / n_rows
        self.w -= self.learning_rate * dW
        self.b -= self.learning_rate * dB
        # Training MSE before this update (penalty term not included)
        self.loss = np.mean(res ** 2)

    def predict(self, X: np.ndarray):
        return X @ self.w + self.b
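
A gradient-descent ridge fit of this kind can be sanity-checked against the closed-form solution. The sketch below is a standalone check on synthetic data, not a test of the class itself. The function `ridge_gd` is a hypothetical helper that assumes the bias is included in the prediction and left unpenalized, and its learning rate and epoch count were chosen only so that this well-conditioned toy problem converges.

```python
import numpy as np

def ridge_gd(X, y, lam=1.0, lr=0.1, epochs=2000):
    """Gradient descent on MSE + (lam / n) * ||w||^2 with an unpenalized bias."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        res = y - (X @ w + b)
        w -= lr * (-2 * X.T @ res + 2 * lam * w) / n
        b -= lr * (-2 * np.sum(res)) / n
    return w, b

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(300, 4))
y_chk = X_chk @ np.array([1.0, -2.0, 0.5, 0.0]) + 3.0 + rng.normal(scale=0.1, size=300)

lam = 1.0
w_gd, b_gd = ridge_gd(X_chk, y_chk, lam=lam)

# Closed form on centered data: w = (Xc^T Xc + lam I)^{-1} Xc^T yc
Xc = X_chk - X_chk.mean(axis=0)
yc = y_chk - y_chk.mean()
w_cf = np.linalg.solve(Xc.T @ Xc + lam * np.eye(4), Xc.T @ yc)
# The unpenalized bias follows from the means
b_cf = y_chk.mean() - X_chk.mean(axis=0) @ w_cf

print(np.allclose(w_gd, w_cf, atol=1e-6), np.isclose(b_gd, b_cf, atol=1e-6))
```

If the two solutions disagree, the usual suspects are a learning rate that is too large, too few epochs, or a mismatch in how the penalty is scaled by the number of samples.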