# Overview  

**Implementing Machine and Deep Learning Algorithms from Scratch**  

Welcome to the **second notebook** in this series!  

In this notebook, we’ll dive into **three closely related algorithms**:  
- **Ridge Regression (L2 Regularization)**  
- **Lasso Regression (L1 Regularization)**  
- **Elastic Net** (a combination of the two)  

We’ll explore how each works, implement them from scratch, and understand the intuition behind their regularization techniques.  

**Let’s get started!**  

# Imports

In [1]:
import math
import numpy as np
import pandas as pd
import plotly.express as px

# Data Loading and Analysis

In [2]:
#Load the training and testing data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [3]:
train_data.shape, test_data.shape

((700, 2), (300, 2))

In [4]:
#Checking for missing values
print(train_data.isna().sum())
print(test_data.isna().sum())

x    0
y    1
dtype: int64
x    0
y    0
dtype: int64


In [5]:
train_data = train_data.dropna()

In [6]:
print(train_data.isna().sum())
print(test_data.isna().sum())

x    0
y    0
dtype: int64
x    0
y    0
dtype: int64


In [7]:
train_data.head()

| | x | y |
|---|---|---|
| 0 | 24.0 | 21.549452 |
| 1 | 50.0 | 47.464463 |
| 2 | 15.0 | 17.218656 |
| 3 | 38.0 | 36.586398 |
| 4 | 87.0 | 87.288984 |


# Data Visualization

In [8]:
fig = px.scatter(x=train_data['x'], y=train_data['y'], template='seaborn')
fig.show(renderer='iframe')

# Data Preprocessing

In [9]:
#Setting training features and labels
X_train = train_data['x'].values
y_train = train_data['y'].values

#Setting testing features and labels
X_test = test_data['x'].values
y_test = test_data['y'].values

In [10]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((699,), (300,), (699,), (300,))

## Standardize the data

Standardization is a preprocessing technique used in machine learning to rescale and transform the features (variables) of a dataset to have a mean of 0 and a standard deviation of 1. It is also known as "z-score normalization" or "z-score scaling." Standardization is an essential step in the data preprocessing pipeline for various reasons:

### Why Use Standardization in Machine Learning?

1. **Mean Centering**: Standardization centers the data by subtracting the mean from each feature, so the transformed feature has a mean of 0. For a least-squares linear model with centered features, the optimal intercept is then simply the mean of the target.

2. **Common Scale**: Standardization divides each centered feature by its standard deviation, so every feature ends up with unit variance. The original units of a feature then no longer determine how strongly it influences scale-sensitive algorithms. Without standardization, features with larger scales may dominate the learning process.

3. **Improved Convergence**: Many machine learning algorithms, such as gradient-based optimization algorithms (e.g., gradient descent), converge faster when the features are standardized. It reduces the potential for numerical instability and overflow/underflow issues during training.

4. **Comparability**: Standardizing the features makes it easier to compare and interpret the importance of each feature. This is especially important in models like linear regression, where the coefficients represent the feature's impact on the target variable.

5. **Regularization**: In Ridge and Lasso regression, the penalty acts on the size of each coefficient, and that size depends on the scale of the corresponding feature. Standardization puts all features on the same scale, so a single regularization strength penalizes them comparably.

### How to Standardize Data

The standardization process involves the following steps:

1. Calculate the mean ($\mu$) and standard deviation ($\sigma$) for each feature in the dataset.
      $$
      \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
      $$
      $$
      \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
      $$
2. For each data point (sample), subtract the mean ($\mu$) of the feature and then divide by the standard deviation ($\sigma$) of the feature.

Mathematically, the standardized value for a feature `x` in a dataset is calculated as:

$$
\text{Standardized value} = \frac{x - \mu}{\sigma}
$$

Here, `x` is the original value of the feature, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.
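
As a quick worked example of the formula above, the sketch below standardizes a small made-up array (the values are purely illustrative) and checks that the result has mean 0 and standard deviation 1. It assumes the population standard deviation (dividing by $n$), which is what `np.std` computes by default.

```python
import numpy as np

# Illustrative values, not taken from the dataset
x = np.array([2.0, 4.0, 6.0, 8.0])

mu = x.mean()      # 5.0
sigma = x.std()    # population standard deviation, sqrt(5)

# Apply the z-score formula element-wise
z = (x - mu) / sigma

print(z)
print(z.mean(), z.std())
```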


In [11]:
class Standardization:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit_transform(self, X_train):
        """
        Calculating the Mean and Standard Deviation using the training data.

        Parameter: X_train(np.ndarray)
        Returns: Standardized X_train
        """
        self.mean = np.mean(X_train, axis=0)
        self.std = np.std(X_train, axis=0)

        X_train = (X_train - self.mean) / (self.std)

        return X_train

    def transform(self, X_test):
        """
        Parameter: X_test(np.ndarray)
        Returns: Standardized X_test
        """
        X_test = (X_test - self.mean) / self.std

        return X_test

In [12]:
standardizer = Standardization()
X_train = standardizer.fit_transform(X_train)
X_test = standardizer.transform(X_test)

In [13]:
# Currently our data is 1-D, but our model would expect 2-D data.
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)

# Model Implementation

# Lasso vs Ridge vs ElasticNet
Regularization methods such as Lasso, Ridge, and Elastic Net improve linear regression models by reducing overfitting. They also help with multicollinearity and, in the case of L1 penalties, with feature selection. The result is typically a more stable model that generalizes better. This section explains how each technique works and how they differ.


## Ridge Regression (L2 Regularization)
Ridge regression addresses overfitting by adding a penalty on model complexity. It introduces an L2 penalty (also called L2 regularization), which is the sum of the squares of the model's coefficients. This penalty shrinks large coefficients but keeps all features in the model, which makes the fit more stable when features are correlated.

**Cost function for Ridge Regression:**
$$RidgeLoss = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2 + \lambda\sum_{j=1}^{n}\beta_j^2$$

### Notation  

- $y^{(i)}$: Actual value for the $i$-th sample  
- $\hat{y}^{(i)}$: Predicted value for the $i$-th sample
- $m$: The total number of samples
- $n$: The total number of features
- $\beta_j$: Coefficient for the $j$-th feature (model parameters)  
- $\lambda$: Regularization parameter controlling the penalty strength  
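
To illustrate how the L2 penalty shrinks coefficients without removing any, here is a minimal sketch that minimizes the Ridge cost above in closed form for several values of $\lambda$. Note that this swaps in the closed-form solution rather than the gradient descent used later in this notebook, omits the intercept for brevity, and runs on synthetic data. The function name `ridge_closed_form` and the data are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/(2m)) * ||y - X @ beta||^2 + lam * ||beta||^2 (no intercept)."""
    m, n = X.shape
    # Setting the gradient to zero gives (X^T X / m + 2*lam*I) beta = X^T y / m
    A = X.T @ X / m + 2 * lam * np.eye(n)
    return np.linalg.solve(A, X.T @ y / m)

# Synthetic data with known coefficients (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

lambdas = [0.0, 0.1, 1.0, 10.0]
betas = [ridge_closed_form(X, y, lam) for lam in lambdas]
for lam, beta in zip(lambdas, betas):
    print(f"lambda={lam}: beta={beta}, norm={np.linalg.norm(beta):.4f}")
```

The coefficient norm decreases as $\lambda$ grows, but no coefficient is set exactly to zero.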


## Lasso Regression (L1 Regularization)
Lasso regression addresses overfitting by adding an L1 penalty, i.e. the sum of the absolute values of the coefficients, to the model's loss function. This encourages some coefficients to become exactly zero, which effectively removes less important features and simplifies the model by keeping only the key ones.

**Cost function for Lasso Regression:**
$$LassoLoss = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2 + \lambda\sum_{j=1}^{n}|\beta_j|$$

### Notation  

- $y^{(i)}$: Actual value for the $i$-th sample  
- $\hat{y}^{(i)}$: Predicted value for the $i$-th sample
- $m$: The total number of samples
- $n$: The total number of features
- $\beta_j$: Coefficient for the $j$-th feature (model parameters)  
- $\lambda$: Regularization parameter controlling the penalty strength
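
One way to see why the L1 penalty can drive coefficients to zero while the L2 penalty only shrinks them is to compare the (sub)gradients of the two penalty terms. The sketch below uses made-up coefficient values and hypothetical helper names; `np.sign` returns 0 at $\beta_j = 0$.

```python
import numpy as np

def l1_penalty_grad(beta, lam):
    # Subgradient of lam * sum(|beta_j|)
    return lam * np.sign(beta)

def l2_penalty_grad(beta, lam):
    # Gradient of lam * sum(beta_j ** 2)
    return 2 * lam * beta

# Illustrative coefficients: large, medium, and close to zero
beta = np.array([2.0, 0.5, 0.01])
lam = 0.1

l1_grad = l1_penalty_grad(beta, lam)
l2_grad = l2_penalty_grad(beta, lam)
print("L1 penalty gradient:", l1_grad)
print("L2 penalty gradient:", l2_grad)
```

The L1 pull has the same magnitude $\lambda$ for every nonzero coefficient, however small, whereas the L2 pull $2\lambda\beta_j$ fades as $\beta_j$ approaches zero.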

## Elastic Net Regression (L1 + L2 Regularization)
Elastic Net regression combines the L1 (Lasso) and L2 (Ridge) penalties to perform feature selection, manage multicollinearity, and balance coefficient shrinkage. It works well when there are many correlated features, because it avoids the problem where Lasso might arbitrarily pick one of them and ignore the others.

**Cost function for Elastic Net Regression:**
$$ElasticNetLoss = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2 + \lambda^{(1)}\sum_{j=1}^{n}|\beta_j| + \lambda^{(2)}\sum_{j=1}^{n}\beta_j^2$$

- $y^{(i)}$: Actual value for the $i$-th sample  
- $\hat{y}^{(i)}$: Predicted value for the $i$-th sample
- $m$: The total number of samples  
- $n$: The total number of features  
- $\beta_j$: Coefficient for the $j$-th feature (model parameters)  
- $\lambda^{(1)}$: Regularization parameter controlling the penalty strength for Lasso(L1)
- $\lambda^{(2)}$: Regularization parameter controlling the penalty strength for Ridge(L2)

**Note: While it’s possible to create three separate classes for Ridge, Lasso, and Elastic Net, I’ll instead implement a single Elastic Net class. By adjusting the regularization parameters, this one class can replicate pure Ridge (L2) or pure Lasso (L1) behavior as well.**
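
The following sketch illustrates that note on a tiny made-up example: a single cost function covers all three cases, depending on which penalty strength is set to zero. The function name `elastic_net_loss` and the parameter names `lam1` and `lam2` (standing for $\lambda^{(1)}$ and $\lambda^{(2)}$) are assumptions for illustration.

```python
import numpy as np

def elastic_net_loss(y, y_hat, beta, lam1, lam2):
    """Elastic Net cost: halved MSE plus L1 and L2 penalties on beta."""
    m = len(y)
    mse_term = np.sum((y - y_hat) ** 2) / (2 * m)
    return mse_term + lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)

# Illustrative targets, predictions, and coefficients
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.5])
beta = np.array([2.0, -1.0])

lasso = elastic_net_loss(y, y_hat, beta, lam1=0.5, lam2=0.0)  # pure L1
ridge = elastic_net_loss(y, y_hat, beta, lam1=0.0, lam2=0.5)  # pure L2
both = elastic_net_loss(y, y_hat, beta, lam1=0.5, lam2=0.5)   # L1 + L2

print(lasso, ridge, both)
```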


## The Class Functions:

### Forward Pass
The forward pass is the step where we compute the predicted output for the input data using the current weights and biases. In other words, it’s the process of applying our model to the input data to generate predictions.
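
As a small illustration with made-up numbers, the forward pass for a linear model is a matrix-vector product plus the intercept:

```python
import numpy as np

# Three samples with two features each (illustrative values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W = np.array([0.5, -1.0])  # one weight per feature
b = 2.0                    # intercept

# One prediction per sample: y_hat[i] = sum_j X[i, j] * W[j] + b
y_hat = X @ W + b
print(y_hat)
```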

### Cost Function:
For this implementation, I’ll be using the Elastic Net cost function, which combines both L1 (Lasso) and L2 (Ridge) regularization terms:
$$ElasticNetLoss = \frac{1}{2m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2 + \lambda^{(1)}\sum_{j=1}^{n}|\beta_j| + \lambda^{(2)}\sum_{j=1}^{n}\beta_j^2$$

### Backward Pass (Gradient Computation)
The backward pass computes the gradients of the cost function with respect to the weights and biases. These gradients are essential for updating the model parameters during training using gradient descent.

The gradients are computed as follows:

Gradient with respect to weights ($\beta_j$):
$$\frac{\partial Loss}{\partial \beta_j} = -\frac{1}{m}\sum_{i=1}^{m}X_j^{(i)}(y^{(i)} - \hat{y}^{(i)}) + \lambda^{(1)}\text{sign}(\beta_j) + 2\lambda^{(2)}\beta_j$$

Where
- $X_j^{(i)}$: the $j$-th feature of the $i$-th sample
- $y^{(i)}$: Actual value for the $i$-th sample  
- $\hat{y}^{(i)}$: Predicted value for the $i$-th sample
- $\lambda^{(1)}$ and $\lambda^{(2)}$ : The L1 and L2 penalty parameters respectively
- The sign function is defined as:
$$
\text{sign}(\beta_j) =
\begin{cases}
+1, & \beta_j > 0 \\
-1, & \beta_j < 0 \\
0, & \beta_j = 0
\end{cases}
$$


Gradient with respect to intercept (b):
$$\frac{\partial Loss}{\partial b} = -\frac{1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})$$

- $y^{(i)}$: Actual value for the $i$-th sample  
- $\hat{y}^{(i)}$: Predicted value for the $i$-th sample
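
A common way to sanity-check gradient formulas like these is to compare them against finite differences of the cost. The sketch below does this on random synthetic data; the function names, the data, and the penalty values are assumptions for illustration, and the weights are kept away from zero because $|\beta_j|$ is not differentiable there.

```python
import numpy as np

def elastic_net_loss(X, y, W, b, lam1, lam2):
    """Elastic Net cost as defined in this notebook."""
    m = len(y)
    errors = y - (X @ W + b)
    return (np.sum(errors ** 2) / (2 * m)
            + lam1 * np.sum(np.abs(W)) + lam2 * np.sum(W ** 2))

def analytic_grads(X, y, W, b, lam1, lam2):
    """Gradients from the formulas above."""
    m = len(y)
    errors = y - (X @ W + b)
    dW = -(X.T @ errors) / m + lam1 * np.sign(W) + 2 * lam2 * W
    db = -np.sum(errors) / m
    return dW, db

def numeric_grads(X, y, W, b, lam1, lam2, eps=1e-6):
    """Central finite differences of the cost."""
    dW = np.zeros_like(W)
    for j in range(len(W)):
        step = np.zeros_like(W)
        step[j] = eps
        dW[j] = (elastic_net_loss(X, y, W + step, b, lam1, lam2)
                 - elastic_net_loss(X, y, W - step, b, lam1, lam2)) / (2 * eps)
    db = (elastic_net_loss(X, y, W, b + eps, lam1, lam2)
          - elastic_net_loss(X, y, W, b - eps, lam1, lam2)) / (2 * eps)
    return dW, db

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
W = np.array([1.5, -0.7, 0.3])  # nonzero, so the L1 term is differentiable here
b = 0.2

dW_a, db_a = analytic_grads(X, y, W, b, lam1=0.1, lam2=0.05)
dW_n, db_n = numeric_grads(X, y, W, b, lam1=0.1, lam2=0.05)
print("max |dW difference|:", np.max(np.abs(dW_a - dW_n)))
print("|db difference|:", abs(db_a - db_n))
```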

## Training Process  

The training process involves iteratively updating the weights and biases to minimize the cost function.  
This is typically done using an optimization algorithm like **Gradient Descent**.  

The update equations for the parameters are:  

$$\beta_j \leftarrow \beta_j - \alpha\frac{\partial Loss}{\partial \beta_j}$$
$$b \leftarrow b - \alpha\frac{\partial Loss}{\partial b}$$

Here, $\alpha$ represents the **learning rate**, which controls the step size during parameter updates.  

By repeatedly performing the **forward pass**, computing the **cost**, executing the **backward pass** (to compute gradients), and updating the parameters, the model gradually learns to make better predictions and fit the data more effectively.
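
The loop described above can be sketched in a few lines. This is a minimal version on synthetic, noiseless data with one standardized feature; the penalty strengths, learning rate, and iteration count are hypothetical values chosen for illustration.

```python
import numpy as np

# Synthetic data: the target is 3 * (standardized x) + 2
x = np.linspace(-1.0, 1.0, 50)
x = (x - x.mean()) / x.std()   # standardized, as in this notebook
X = x.reshape(-1, 1)
y = 3.0 * x + 2.0

lam1, lam2, alpha = 0.1, 0.1, 0.1  # hypothetical penalties and learning rate
W, b = np.ones(1), 0.0
m = len(y)

for _ in range(500):
    y_hat = X @ W + b                    # forward pass
    errors = y - y_hat
    # backward pass: gradients of the Elastic Net cost
    dW = -(X.T @ errors) / m + lam1 * np.sign(W) + 2 * lam2 * W
    db = -np.sum(errors) / m
    W -= alpha * dW                      # parameter updates
    b -= alpha * db

print(W, b)
```

For this one-feature setup, setting the weight gradient to zero gives $W = (3 - \lambda^{(1)}) / (1 + 2\lambda^{(2)}) \approx 2.417$, so both penalties pull the weight below the true slope of 3, while the unpenalized intercept converges to 2.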


In [14]:
import numpy as np

class ElasticNet:
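    def __init__(self, l1_ratio, l2_ratio, learning_rate, convergence_tol=1e-6, iterations=1000, plot_cost=False):
        # l1_ratio and l2_ratio are the penalty strengths lambda^(1) and lambda^(2)
        # from the cost function, not a mixing ratio between the two penalties.
        # Safety checks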
        assert iterations > 0, "Iterations must be greater than 0"
        assert 0 <= l1_ratio <= 1, "L1 ratio must be in the range 0-1"
        assert 0 <= l2_ratio <= 1, "L2 ratio must be in the range 0-1"
        assert learning_rate > 0, "Learning rate must be positive"
        assert convergence_tol > 0, "Convergence tolerance must be positive"
        self.l1_ratio = l1_ratio
        self.l2_ratio = l2_ratio
        self.learning_rate = learning_rate
        self.convergence_tol = convergence_tol
        self.plot_cost = plot_cost
        self.iterations = iterations

        self.W = None
        self.b = None

    def initialize_parameters(self, n_features):
        """
        Initialize model parameters

        Parameters:
        n_features (int): The number of features in the input data
        """
        self.W = np.ones(n_features)
        self.b = 0

    def forward_pass(self, X):
        """
        Compute the forward pass of the model

        Parameters:
        X (numpy.ndarray): Input data of shape(m, n_features)

        Returns:
        Predictions (numpy.ndarray): Predictions of shape(m,)
        """
        return np.dot(X, self.W) + self.b

    def calculate_cost(self, predictions):
        """
        Compute the Elastic Net cost: half the mean squared error
        plus the L1 and L2 penalties on the weights

        Parameters:
        predictions (numpy.ndarray): Predictions of shape(m,)

        Returns:
            float: Elastic Net cost.
        """
        m = len(self.y)
        errors = self.y - predictions

        cost = (1/(2*m)) * np.sum(np.square(errors)) \
               + self.l1_ratio * np.sum(np.abs(self.W)) \
               + self.l2_ratio * np.sum(np.square(self.W))
        return cost

    def backward_pass(self, predictions):
        """
        Compute gradients for the model

        Parameters:
        predictions (numpy.ndarray): Predictions of shape (m,).

        Updates:
            self.dW (numpy.ndarray): Gradient of the cost with respect to W.
            self.db (float): Gradient of the cost with respect to b.
        """
        m = len(predictions)
        errors = self.y - predictions

        sign_W = np.where(self.W > 0, 1, np.where(self.W < 0, -1, 0))

        self.dW = (-1/m) * np.dot(self.X.T, errors)\
             + self.l1_ratio * sign_W \
             + 2 * self.l2_ratio * self.W
        self.db = (-1/m) * np.sum(errors)
        

    def fit(self, X, y):
        # Safety Checks
        assert isinstance(X, np.ndarray), "X must be a NumPy array"
        assert isinstance(y, np.ndarray), "y must be a NumPy array"
        assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
        
        self.X = X
        self.y = y
        self.initialize_parameters(X.shape[1])
        costs = []

        for i in range(self.iterations):
            predictions = self.forward_pass(X)
            costs.append(self.calculate_cost(predictions))
            self.backward_pass(predictions)

            self.W -= self.learning_rate * self.dW
            self.b -= self.learning_rate * self.db

            if i % 100 == 0:
                print(f'Iteration: {i}, Cost: {costs[-1]}')

            if i > 0 and abs(costs[-1] - costs[-2]) < self.convergence_tol:
                print(f'Converged after {i} iterations.')
                break

        if self.plot_cost:
            fig = px.line(y=costs, title="Cost vs Iteration", template="plotly_dark")
            fig.update_layout(
                title_font_color="#41BEE9",
                xaxis=dict(color="#41BEE9", title="Iterations"),
                yaxis=dict(color="#41BEE9", title="Cost")
            )
            fig.show(renderer='iframe')


    def predict(self, X):
        """
        Predict target values for new input data.

        Parameters:
            X (numpy.ndarray): Input data of shape (m, n_features).

        Returns:
            numpy.ndarray: Predicted target values of shape (m,).
        """
        return self.forward_pass(X)

In [15]:
# l1_ratio=1 with l2_ratio=0 keeps only the L1 penalty, i.e. pure Lasso
elastic_net = ElasticNet(l1_ratio=1, l2_ratio=0, learning_rate=0.01, plot_cost=True)
elastic_net.fit(X_train, y_train)

Iteration: 0, Cost: 1642.6095800555427
Iteration: 100, Cost: 248.1238460597492
Iteration: 200, Cost: 61.29110082492663
Iteration: 300, Cost: 36.25931036554461
Iteration: 400, Cost: 32.90555921868395
Iteration: 500, Cost: 32.45622473047303
Iteration: 600, Cost: 32.39602304184007
Iteration: 700, Cost: 32.38795723917112
Iteration: 800, Cost: 32.38687658555207
Converged after 861 iterations.


## Evaluation

### Imports

In [16]:
from sklearn.metrics import mean_squared_error, r2_score

In [17]:
# Using the model to make predictions
y_pred = elastic_net.predict(X_test)

In [18]:
mse_value = mean_squared_error(y_test, y_pred)
rmse_value = np.sqrt(mse_value)
r_squared_value = r2_score(y_test, y_pred)

n = X_test.shape[0]  # number of samples
k = X_test.shape[1]  # number of features

adj_r_squared_value = 1 - (1 - r_squared_value) * (n - 1) / (n - k - 1)

print("MSE:", mse_value)
print("RMSE:", rmse_value)
print("R²:", r_squared_value)
print("Adjusted R²:", adj_r_squared_value)

MSE: 11.207972373216705
RMSE: 3.347830995318716
R²: 0.9866941443104955
Adjusted R²: 0.9866494937880476
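
In keeping with the from-scratch theme, the same metrics can also be computed directly with NumPy. The sketch below is a minimal version run on made-up values; the function name `regression_metrics` is hypothetical, and for one-dimensional targets it should agree with `mean_squared_error` and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, R^2, and adjusted R^2 for 1-D targets."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return mse, rmse, r2, adj_r2

# Illustrative values, not the notebook's test set
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

mse, rmse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_features=1)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)
print("Adjusted R²:", adj_r2)
```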
