### **Ridge Regression Gradient Descent**

Ridge regression is a linear regression method that adds an L2 penalty term, proportional to the sum of the squared coefficients, to the ordinary least squares (OLS) loss function. This penalty discourages overly complex models by penalizing large coefficients. The ridge regression cost function can be defined as:

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \]

Where:
- \( J(\theta) \) is the cost function.
- \( m \) is the number of training examples.
- \( h_\theta(x^{(i)}) \) is the hypothesis function, i.e. the model's prediction for the ith training example.
- \( y^{(i)} \) is the actual output for the ith training example.
- \( \theta_j \) represents the jth model parameter (or coefficient), and \( n \) is the number of features.
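- \( \lambda \) is the regularization parameter, which controls the strength of the regularization: a larger \( \lambda \) shrinks the coefficients more strongly. scikit-learn calls this parameter `alpha`, which should not be confused with the learning rate \( \alpha \) used in the update rule.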
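
A cost of this form can be evaluated with a few lines of NumPy. This is a minimal sketch, not part of the original notebook: the function name `ridge_cost` and the toy data are hypothetical, the first column of `X` is assumed to be all ones so that `theta[0]` is the unpenalized intercept, and the penalty is scaled by \( \frac{\lambda}{2m} \), the scaling whose derivative is the \( \frac{\lambda}{m}\theta_j \) term in the gradient descent update rule.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """Ridge cost: squared-error term plus an L2 penalty.

    Assumes X has a leading column of ones, so theta[0] is the
    intercept and is left out of the penalty (the sum starts at j=1).
    """
    m = X.shape[0]
    residuals = X @ theta - y                    # h_theta(x) - y for every example
    mse_term = (residuals @ residuals) / (2 * m)
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return mse_term + penalty

# Toy data (made up for illustration): 3 examples, intercept column + 1 feature
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])                     # fits the toy data exactly

cost_no_penalty = ridge_cost(theta, X, y, lam=0.0)
cost_with_penalty = ridge_cost(theta, X, y, lam=6.0)
```

Because `theta` fits the toy data exactly, the first call returns 0. The second returns only the penalty, \( \frac{6}{2 \cdot 3} \cdot 1^2 = 1 \).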

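Gradient descent is an iterative optimization algorithm that searches for a minimum of a function by repeatedly stepping in the direction opposite to its gradient. Because the ridge regression cost function is convex, gradient descent with a suitable learning rate converges to its global minimum. The gradient descent update rule for ridge regression is: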

\[ \theta_j := \theta_j - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j \right) \]

Where:
- \( \alpha \) is the learning rate, determining the size of the steps in each iteration.

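In the update rule, the first term (\( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \)) is the partial derivative of the squared-error (OLS) term with respect to \( \theta_j \), and the second term (\( \frac{\lambda}{m}\theta_j \)) is the partial derivative of the regularization term. Because the penalty sum starts at \( j = 1 \), the intercept \( \theta_0 \) is not regularized and its update uses the first term only. All \( \theta_j \) must be updated simultaneously, that is, every gradient is computed from the current parameter values before any parameter changes. Repeating this update moves the parameters toward the values that minimize the ridge regression cost function.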
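
The update rule can be sketched in vectorized NumPy as follows. This is an illustrative sketch under stated assumptions rather than the notebook's own implementation: the names `ridge_gradient` and `ridge_gd` and the synthetic data are hypothetical, `X` is assumed to carry a leading column of ones, and the intercept `theta[0]` is left unpenalized.

```python
import numpy as np

def ridge_gradient(theta, X, y, lam):
    """Gradient of the ridge cost with a lambda/(2m) penalty.

    Assumes X has a leading column of ones; the intercept theta[0]
    gets no penalty term.
    """
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m     # derivative of the squared-error term
    reg = (lam / m) * theta              # derivative of the penalty term
    reg[0] = 0.0                         # do not shrink the intercept
    return grad + reg

def ridge_gd(X, y, lam, lr, epochs):
    """Run batch gradient descent, updating all theta_j simultaneously."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta = theta - lr * ridge_gradient(theta, X, y, lam)
    return theta

# Synthetic, noise-free data (made up for illustration): y = 4 + 3*x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])
y = 4.0 + 3.0 * x

theta_plain = ridge_gd(X, y, lam=0.0, lr=0.1, epochs=2000)
theta_ridge = ridge_gd(X, y, lam=50.0, lr=0.1, epochs=2000)
```

With `lam=0.0` the loop recovers the intercept 4 and slope 3 of the generating line. With `lam=50.0` the slope is pulled toward zero, which is the shrinkage effect the penalty is meant to produce.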

It's important to standardize (or otherwise rescale) the features before applying ridge regression so that all features are on a similar scale. The penalty treats every coefficient alike, so a feature measured on a small scale, which needs a large coefficient, would otherwise be shrunk more heavily than the rest. The choice of the regularization parameter (\( \lambda \)) is also crucial and is usually made with techniques such as cross-validation.
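
One way to combine both recommendations with scikit-learn is sketched below. It is only an illustration, not the procedure used in this notebook: the grid of candidate values and the 5-fold split are arbitrary assumptions, and `RidgeCV` exposes \( \lambda \) under the name `alpha`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Candidate regularization strengths (an arbitrary grid chosen for illustration)
alphas = np.logspace(-3, 3, 13)

# Scale features to zero mean and unit variance, then pick alpha by 5-fold CV
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X_train, y_train)

best_alpha = model.named_steps["ridgecv"].alpha_
test_r2 = model.score(X_test, y_test)
print("selected alpha:", best_alpha)
print("test R2:", test_r2)
```

Putting the scaler inside the pipeline means the scaling statistics are learned from the training data only and then reused on the test data.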

In [1]:
# Using the scikit-learn library
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor


In [2]:
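X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)
# penalty='l2' gives the ridge penalty; alpha is the regularization strength
# and eta0 is the constant learning rate. No random_state is set, so the
# shuffling inside SGD makes the results vary slightly between runs.
reg = SGDRegressor(penalty='l2', max_iter=500, eta0=0.1,
                   learning_rate='constant', alpha=0.001)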


In [3]:
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("R2_score:", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)


R2_score: 0.39776487428979046
[  51.86761198 -136.85378032  351.3687856   262.28843363   -2.16367803
  -52.81062271 -170.44594101  138.81965823  317.79207598  103.88778649]
[138.26603412]


In [4]:
# Using scikit-learn's Ridge estimator with the iterative 'sparse_cg' solver
# (a conjugate gradient method, not plain gradient descent)
from sklearn.linear_model import Ridge
reg = Ridge(alpha=0.001, max_iter=500, solver='sparse_cg')


In [5]:
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("R2_score:", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)


R2_score: 0.4625010162027918
[  34.52192778 -290.84083871  482.40181675  368.06786931 -852.44872818
  501.59160694  180.11115474  270.76334443  759.73534802   37.49135796]
151.101985182554


### **Ridge Regression Gradient Descent - Scratch Code**

In [6]:
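class GDRidgeRegression:
    """Ridge regression fitted by batch gradient descent."""

    def __init__(self, epochs, learning_rate, alpha):
        self.learning_rate = learning_rate  # step size for each update
        self.epochs = epochs                # number of full-batch iterations
        self.alpha = alpha                  # L2 regularization strength (lambda)
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X_train, y_train):
        # Start from coefficients of 1 and an intercept of 0,
        # stacked into one vector with the intercept first
        self.coef_ = np.ones(X_train.shape[1])
        self.intercept_ = 0
        theta = np.insert(self.coef_, 0, self.intercept_)

        # Prepend a column of ones so the intercept is learned with the weights
        X_train = np.insert(X_train, 0, 1, axis=1)

        for i in range(self.epochs):
            # Gradient of 0.5*||X theta - y||^2 + 0.5*alpha*||theta||^2.
            # Note: it is not divided by m, so learning_rate absorbs that factor,
            # and the intercept is penalized along with the coefficients.
            theta_der = np.dot(X_train.T, X_train).dot(
                theta) - np.dot(X_train.T, y_train) + self.alpha*theta
            theta = theta - self.learning_rate*theta_der

        self.coef_ = theta[1:]
        self.intercept_ = theta[0]

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_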


In [7]:
reg = GDRidgeRegression(epochs=500, alpha=0.001, learning_rate=0.005)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("R2 score", r2_score(y_test, y_pred))
print(reg.coef_)
print(reg.intercept_)


R2 score 0.4738018280260915
[  46.65050914 -221.3750037   452.12080647  325.54248128  -29.09464178
  -96.47517735 -190.90017011  146.32900372  400.80267299   95.09048094]
150.8697531671347
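
A gradient descent implementation like the class above can be sanity-checked against the closed-form ridge solution. The sketch below is a hedged illustration on synthetic data, not a result from this notebook: the data, the variable names, and the eigenvalue-based step size are all assumptions made for the example. It uses the same unnormalized gradient and the same starting point as `GDRidgeRegression`, and it penalizes the intercept as that class does.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic features (illustration only)
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=200)
alpha = 0.001

Xb = np.insert(X, 0, 1, axis=1)               # prepend the intercept column
A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])   # penalizes the intercept too

# Closed-form minimizer: solve (Xb^T Xb + alpha I) theta = Xb^T y
theta_closed = np.linalg.solve(A, Xb.T @ y)

# Gradient descent with the same unnormalized gradient; the step size is
# set from the largest eigenvalue of A so the iteration cannot diverge
lr = 1.0 / np.linalg.eigvalsh(A).max()
theta = np.ones(Xb.shape[1])
theta[0] = 0.0                                # intercept starts at 0, coefficients at 1
for _ in range(500):
    grad = A @ theta - Xb.T @ y
    theta = theta - lr * grad

max_gap = np.max(np.abs(theta - theta_closed))
print("largest coefficient gap:", max_gap)
```

On well-conditioned data such as this, the gap shrinks to numerical noise within the 500 iterations. On poorly conditioned data, plain gradient descent may need far more iterations to get equally close.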
