# Linear Regression
Linear Regression is a fundamental and widely used algorithm in machine learning and statistics. Its goal is to model the relationship between a dependent variable `y` and one or more independent variables `x`, assuming this relationship is linear.

## Linear Regression Hypothesis Function and Loss

### Hypothesis Function
In Linear Regression, the hypothesis function is used to predict the output (dependent variable) based on the input features (independent variables). The hypothesis function is defined as:

$$ \hat{y} = \mathbf{w}^T\mathbf{x} + \beta $$

Where:
- $\hat{y} $ is the predicted value.
- $\beta$ is the bias term (intercept).
- $\mathbf{w}$ are the coefficients (weights) for the input features.
- $\mathbf{x}$ are the input features.

In matrix form, the hypothesis function for all samples at once can be written as:

$$ \hat{\mathbf{y}} = \mathbf{X}\mathbf{w} $$

Where `X` is the input matrix with a column of ones prepended for the bias term, and `w` is the weight vector with the bias $\beta$ as its first entry.
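As a small sanity check, the sketch below compares the per-sample form with the matrix form. The feature values, weights, and bias are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w_features = np.array([0.5, -1.0])   # weights for the two features
beta = 2.0                           # bias (intercept)

# Per-sample form: y_hat_i = w^T x_i + beta
y_hat_loop = np.array([w_features @ x_i + beta for x_i in X])

# Matrix form: prepend a column of ones and place beta first in the weight vector
X_bias = np.hstack((np.ones((X.shape[0], 1)), X))
w = np.concatenate(([beta], w_features))
y_hat_matrix = X_bias @ w

print(y_hat_loop, y_hat_matrix)
```

Both forms should produce the same predictions, which is why the bias can be folded into the weight vector.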

### Loss Function
The loss function in Linear Regression is used to measure the error between the predicted values and the actual target values. The most commonly used loss function is the Mean Squared Error (MSE), which is defined as:

$$ \mathcal{L}(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2$$

Where:
- N is the number of training samples.

The goal of Linear Regression is to minimize the loss function $\mathcal{L}$ by finding the optimal values of $ \mathbf{w} $. This can be achieved using optimization techniques such as Gradient Descent or the Normal Equation.
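The MSE formula above can be sketched in a few lines of NumPy. The helper name `mse_loss` and the toy data are assumptions made for this illustration; the design matrix is assumed to already contain the bias column.

```python
import numpy as np

def mse_loss(X_bias, w, y):
    """Mean squared error (1/N) * ||X w - y||^2.

    X_bias is assumed to already include the column of ones for the bias.
    """
    residuals = X_bias @ w - y
    return float(residuals @ residuals) / len(y)

# Toy data generated by y = 1 + 2x (hypothetical values)
X_bias = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

loss_exact = mse_loss(X_bias, np.array([1.0, 2.0]), y)  # the generating weights
loss_off = mse_loss(X_bias, np.array([0.0, 2.0]), y)    # bias is off by 1

print(loss_exact, loss_off)
```

With the generating weights every residual is zero, while shifting the bias by 1 makes every residual equal to -1, so the mean of the squared residuals is 1.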

## Import libraries

In [1]:
import numpy as np

from sklearn.datasets import make_regression

# Base Regression Model

A base class for regression models.

This class provides a template for implementing regression models. 
It includes methods for adding a bias term to the input data, 
calculating the R-squared score, and placeholders for fitting 
and predicting, which should be implemented by subclasses.

In [None]:
class RegressionModel:
    """
    A base class for regression models.

    This class provides a template for implementing regression models. 
    It includes methods for adding a bias term to the input data, 
    calculating the R-squared score, and placeholders for fitting 
    and predicting, which should be implemented by subclasses.
    """
    def __init__(self):
        """
        Initializes the RegressionModel with default attributes for coefficients, bias, and weights.
        """
        self.coef = None
        self.bias = None
        self.w = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Placeholder method for fitting the model to the data.

        Parameters:
        - X (np.ndarray): The input features.
        - y (np.ndarray): The target values.

        Raises:
        - NotImplementedError: This method should be implemented by subclasses.
        """
        raise NotImplementedError('Subclasses should implement this method')
    
    def predict(self, X: np.ndarray):
        """
        Placeholder method for making predictions.

        Parameters:
        - X (np.ndarray): The input features.

        Raises:
        - NotImplementedError: This method should be implemented by subclasses.
        """
        raise NotImplementedError('Subclasses should implement this method')

    def add_bias(self, X: np.ndarray):
        """
        Adds a bias term (column of ones) to the input features.

        Parameters:
        - X (np.ndarray): The input features.

        Returns:
        - np.ndarray: The input features with an added bias term.
        """
        n_samples = X.shape[0]
        bias_term = np.ones((n_samples, 1))
        X_bias = np.hstack((bias_term, X))

        return X_bias
    
    def score(self, X: np.ndarray, y: np.ndarray):
        """
        Calculates the R-squared score of the model.

        Parameters:
        - X (np.ndarray): The input features.
        - y (np.ndarray): The true target values.

        Returns:
        - float: The R-squared score, a measure of how well the model explains the variance in the data.
        """
        predictions = self.predict(X)
        ss_total = np.sum((y - np.mean(y)) ** 2)
        ss_residual = np.sum((y - predictions) ** 2)
        return 1 - (ss_residual / ss_total)

### A Regression Model Trained by the Normal Equation

One technique for minimizing the loss function is to set its gradient to zero:
$$ \frac{\partial{\mathcal{L}}}{\partial{\mathbf{w}}} = 0$$

Expanding the gradient of the MSE loss gives:
$$ \frac{\partial{\mathcal{L}}}{\partial{\mathbf{w}}} = \frac{2}{N}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) = 0$$

Solving for $\mathbf{w}$, assuming $\mathbf{X}^T\mathbf{X}$ is invertible, results in:
$$ \boxed{\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}$$

That is, the normal equation gives the optimal weights for the loss function in closed form.
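The closed-form solution can be checked numerically. The sketch below uses synthetic, noise-free data (an assumption made so the true weights are recoverable exactly) and solves the linear system $(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly. `np.linalg.lstsq` is included as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# Synthetic, noise-free data (hypothetical): the bias is the first weight
X = rng.normal(size=(N, d))
X_bias = np.hstack((np.ones((N, 1)), X))
true_w = np.array([4.0, 1.5, -2.0, 0.5])
y = X_bias @ true_w

# Normal equation: solve (X^T X) w = X^T y without an explicit inverse
w_normal = np.linalg.solve(X_bias.T @ X_bias, X_bias.T @ y)

# Cross-check with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Solving the system directly is generally preferred over computing the inverse, since it is cheaper and tends to be more accurate numerically.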

### Limitations of Using the Normal Equation for Linear Regression

1. **Computational Complexity (O(d³))**:
   - The normal equation requires computing $\mathbf{X}^T \mathbf{X}$, which costs $O(N d^2)$, and then inverting the resulting $d \times d$ matrix, which costs $O(d^3)$, where `N` is the number of samples and `d` is the number of features (dimensionality of the feature vector).
   - This becomes inefficient for **high-dimensional** datasets, because the inversion cost grows cubically with `d`. For very large `d`, it may become impractically slow.

2. **Singular Matrix Issue**:
   - If $\mathbf{X}^T \mathbf{X}$ is **singular** or **non-invertible** (i.e., its determinant is zero), the normal equation cannot be solved. This typically happens when:
     - **Multicollinearity** exists in the dataset (i.e., some features are linearly dependent or highly correlated).
     - **Insufficient data** relative to the number of features.
   - In these cases, regularization techniques or dimensionality reduction methods are needed to ensure invertibility.

3. **Memory Usage**:
   - For very large datasets with many features `d`, storing and manipulating the matrices $\mathbf{X}^T \mathbf{X}$ can require substantial memory, which may not be feasible in low-memory environments.

4. **Doesn't Scale Well with Large Datasets**:
   - For datasets with a large number of training examples `N`, **gradient descent** can be more efficient because it updates the weights iteratively, and can process the data in batches, rather than forming and inverting $\mathbf{X}^T \mathbf{X}$ over the full dataset. This makes **gradient descent** better suited for big data.

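5. **Limited Flexibility for Regularization**:
   - The plain normal equation doesn't incorporate **regularization**, which is often necessary to prevent overfitting in high-dimensional problems. L2 (Ridge) regularization still admits a closed-form solution, $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + N\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ for a penalty $\lambda\|\mathbf{w}\|^2$ added to the MSE loss, but L1 (Lasso) regularization does not and needs an iterative method. In contrast, gradient descent can be easily modified to include either kind of penalty.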
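The singular-matrix case from item 2 can be illustrated with a deliberately degenerate dataset. In this sketch (all values hypothetical) the second feature is exactly twice the first, so $\mathbf{X}$, and therefore $\mathbf{X}^T\mathbf{X}$, has rank 1 and cannot be inverted. `np.linalg.lstsq` still returns the minimum-norm least-squares solution, which is the same solution a pseudo-inverse would give.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=20)

# Second feature is an exact multiple of the first: perfect multicollinearity
X = np.column_stack((x1, 2.0 * x1))
y = 3.0 * x1

# lstsq reports the numerical rank and returns the minimum-norm solution
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("rank of X:", rank)          # fewer than the 2 columns
print("minimum-norm weights:", w)
print("fit is exact:", np.allclose(X @ w, y))
```

Any weights satisfying $w_1 + 2 w_2 = 3$ fit this data perfectly; the minimum-norm choice among them is $(0.6, 1.2)$.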

In [3]:
class NormalRegression(RegressionModel):
    def __init__(self):
        super().__init__()

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = self.add_bias(X)

        self.w = np.linalg.pinv(X.T @ X) @ X.T @ y
    
    def predict(self, X):
        X = self.add_bias(X)
        return X @ self.w

## Linear Regression with Gradient Descent

### A Regression model trained by Gradient Descent

Instead of solving the above gradient equation analytically, we can take **iterative steps** in the opposite direction of the gradient:

Given the same loss function:
$$
\mathcal{L}(\mathbf{w}) = \frac{1}{N} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2
$$

The gradient is:
$$
\nabla_{\mathbf{w}} \mathcal{L} = \frac{2}{N} \mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y})
$$

Absorbing the constant factor of 2 into the learning rate $\eta$, the **gradient descent update rule** is:
$$
\boxed{
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot \frac{1}{N} \mathbf{X}^T (\mathbf{X}\mathbf{w}^{(t)} - \mathbf{y})
}
$$
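The update rule above can be sketched as a short loop. The data here is synthetic and noise-free, and the learning rate and iteration count are assumptions chosen so that the loop converges on this particular problem.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Synthetic, noise-free data (hypothetical): the bias is the first weight
X = rng.normal(size=(N, 2))
X_bias = np.hstack((np.ones((N, 1)), X))
true_w = np.array([1.0, 2.0, -3.0])
y = X_bias @ true_w

eta = 0.1            # learning rate (assumed value)
w = np.zeros(3)      # start from zero weights

for _ in range(1000):
    # Gradient step: w <- w - eta * (1/N) X^T (X w - y)
    gradient = X_bias.T @ (X_bias @ w - y) / N
    w = w - eta * gradient

print(w)
```

On this well-conditioned problem the iterates approach the weights that generated the data, the same solution the normal equation would give.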

### Gradient Descent Variants for Linear Regression

There are different types of **Gradient Descent** methods, each with its own advantages and trade-offs. Here's an overview of the **Batch**, **Stochastic**, and **Mini-Batch** Gradient Descent algorithms.

---

#### 1. **Batch Gradient Descent**

**Batch Gradient Descent** computes the gradient using the entire dataset in each iteration. This is a deterministic approach where the weights are updated only after computing the gradient for all samples.

- **Advantages:**
  - Stable convergence.
  - The model is updated after considering the entire dataset.

- **Disadvantages:**
  - Very slow for large datasets, as it needs to process the entire dataset for each update.
  - Can consume a lot of memory.

---

In this method:
- $\mathbf{w}$ are the model parameters (weights).
- $X$ is the feature matrix, and $y$ is the target values.
- The gradient is computed using the entire dataset, and the weights are updated after each epoch.

---

#### 2. **Stochastic Gradient Descent (SGD)**

**Stochastic Gradient Descent (SGD)** computes the gradient using only a single randomly selected sample at each iteration. This makes the updates faster, but they can be noisy and lead to fluctuations in the loss curve.

- **Advantages:**
  - Very fast since it updates the weights after every single sample.
  - Works well for large datasets and online learning.

- **Disadvantages:**
  - Noisy updates can lead to unstable convergence.
  - Requires a larger number of epochs to converge to the optimal solution.


In this method:
- A **random index** is selected, and the gradient is computed based on that single sample.
- The weight update happens after every data point, which speeds up the process but can lead to noisy convergence.

---

#### 3. **Mini-Batch Gradient Descent**

**Mini-Batch Gradient Descent** is a hybrid approach that combines elements of both Batch and Stochastic Gradient Descent. It divides the dataset into small batches and updates the weights based on each mini-batch. This method provides a balance between the efficiency of SGD and the stability of Batch Gradient Descent.

- **Advantages:**
  - Faster than Batch Gradient Descent for large datasets.
  - Can lead to more stable convergence than SGD.
  - Suitable for parallel processing (especially in deep learning).

- **Disadvantages:**
  - Still requires tuning the batch size and learning rate.
  - Can still be noisy if the batch size is too small.


In this method:
- The dataset is randomly divided into **mini-batches**.
- The weight update happens based on the average gradient computed from a subset of the data, providing a good trade-off between speed and stability.
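One common way to form mini-batches is to shuffle the sample indices once per epoch and walk through them in fixed-size slices. The sketch below follows that scheme; the helper `iterate_minibatches`, the batch size, and the learning rate are assumptions for this illustration. Note that it differs from sampling a random batch with replacement at every step.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that partition a random permutation of range(n_samples)."""
    permutation = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield permutation[start:start + batch_size]

rng = np.random.default_rng(0)
N = 200

# Synthetic, noise-free data (hypothetical): the bias is the first weight
X = rng.normal(size=(N, 2))
X_bias = np.hstack((np.ones((N, 1)), X))
true_w = np.array([0.5, -1.0, 2.0])
y = X_bias @ true_w

w = np.zeros(3)
eta = 0.1  # learning rate (assumed value)

for epoch in range(200):
    for idx in iterate_minibatches(N, batch_size=20, rng=rng):
        X_b, y_b = X_bias[idx], y[idx]
        # Average gradient over the current mini-batch only
        gradient = X_b.T @ (X_b @ w - y_b) / len(idx)
        w = w - eta * gradient

print(w)
```

Setting `batch_size=1` in this sketch gives a stochastic variant, and `batch_size=N` recovers batch gradient descent.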

---

### Comparison of Gradient Descent Variants

| **Method**            | **Computation Per Update**            | **Speed**  | **Convergence** | **Memory Usage**  |
|-----------------------|---------------------------------------|------------|-----------------|-------------------|
| **Batch Gradient Descent**   | Entire dataset                       | Slow       | Stable, deterministic | High              |
| **Stochastic Gradient Descent (SGD)** | Single sample                     | Fast       | Noisy, fluctuates | Low               |
| **Mini-Batch Gradient Descent** | Small batch                        | Medium     | More stable than SGD | Medium            |

---

### Conclusion

- **Batch Gradient Descent** is ideal for smaller datasets and when you want precise updates.
- **Stochastic Gradient Descent** is better for large datasets and online learning, but it requires careful tuning to reduce noise.
- **Mini-Batch Gradient Descent** strikes a good balance and is widely used in practice, especially in machine learning and deep learning.



In [None]:
class LinearRegression(RegressionModel):
    """
    A Linear Regression model that supports Batch, Stochastic, and Mini-Batch Gradient Descent.

    This class extends the `RegressionModel` base class and implements the Linear Regression algorithm
    with support for different gradient descent variants. It includes methods for training the model
    using Batch, Stochastic, or Mini-Batch Gradient Descent, and for making predictions.

    Attributes:
        epoches (int): The number of iterations for training the model.
        learning_rate (float): The step size for gradient descent updates.
        n_samples (int): The number of training samples in the dataset.
        type (str): The type of gradient descent to use ('batch', 'stochastic', or 'mini-batch').
        batch_size (int or None): The size of mini-batches for Mini-Batch Gradient Descent. If None, defaults to 10% of the dataset size.

    Methods:
        fit(X: np.ndarray, y: np.ndarray):
            Trains the model using the specified gradient descent type.

        fit_batch(X: np.ndarray, y: np.ndarray):
            Trains the model using Batch Gradient Descent.

        fit_stochastic(X: np.ndarray, y: np.ndarray):
            Trains the model using Stochastic Gradient Descent.

        fit_mini_batch(X: np.ndarray, y: np.ndarray):
            Trains the model using Mini-Batch Gradient Descent.

        predict(X: np.ndarray) -> np.ndarray:
            Makes predictions for the given input features.
    """
    def __init__(self, epoches=1000, learning_rate=0.1, lr_type='batch', batch_size=None):
        """
        Initializes the LinearRegression model with the specified parameters.

        Parameters:
            epoches (int): The number of iterations for training the model. Default is 1000.
            learning_rate (float): The step size for gradient descent updates. Default is 0.1.
            lr_type (str): The type of gradient descent to use ('batch', 'stochastic', or 'mini-batch'). Default is 'batch'.
            batch_size (int or None): The size of mini-batches for Mini-Batch Gradient Descent. Default is None.
        """
        super().__init__()
        self.epoches = epoches
        self.learning_rate = learning_rate
        self.n_samples = 0
        self.type = lr_type
        self.batch_size = batch_size

    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Trains the model using the specified gradient descent type.

        Parameters:
            X (np.ndarray): The input features.
            y (np.ndarray): The target values.
        """
        self.n_samples, n_features = X.shape[0], X.shape[1]
        self.w = np.random.randn(n_features + 1) * 0.01
        X_bias = self.add_bias(X)

        if self.type == 'batch':
            self.fit_batch(X_bias, y)
        elif self.type == 'stochastic':
            self.fit_stochastic(X_bias, y)
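        elif self.type == 'mini-batch':
            self.fit_mini_batch(X_bias, y)
        else:
            # Fail loudly instead of silently leaving the weights untrained
            raise ValueError(f"Unknown lr_type: {self.type!r}")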

    def fit_batch(self, X: np.ndarray, y: np.ndarray):
        """
        Trains the model using Batch Gradient Descent.

        Parameters:
            X (np.ndarray): The input features with a bias term added.
            y (np.ndarray): The target values.
        """
        for _ in range(self.epoches):
            predictions = X @ self.w
            errors = predictions - y
            gradient = (X.T @ errors) / self.n_samples

            self.w = self.w - self.learning_rate * gradient

    def fit_stochastic(self, X: np.ndarray, y: np.ndarray):
        """
        Trains the model using Stochastic Gradient Descent.

        Parameters:
            X (np.ndarray): The input features with a bias term added.
            y (np.ndarray): The target values.
        """
        for _ in range(self.epoches):
            random_idx = np.random.randint(self.n_samples)
            random_sample = X[random_idx]
            predict = self.w.T.dot(random_sample)
            error = predict - y[random_idx]

            stoch = error * random_sample
            self.w = self.w - self.learning_rate * stoch

    def fit_mini_batch(self, X: np.ndarray, y: np.ndarray):
        """
        Trains the model using Mini-Batch Gradient Descent.

        Parameters:
            X (np.ndarray): The input features with a bias term added.
            y (np.ndarray): The target values.
        """
        for _ in range(self.epoches):
            size = self.batch_size if self.batch_size else max(1, int(self.n_samples * 0.1))
            random_inds = np.random.randint(0, self.n_samples, size=size)

            batch_samples, batch_labels = X[random_inds], y[random_inds]
            predictions = batch_samples @ self.w
            errors = predictions - batch_labels

            gradient = (batch_samples.T @ errors) / size

            self.w = self.w - self.learning_rate * gradient

    def predict(self, X: np.ndarray):
        """
        Makes predictions for the given input features.

        Parameters:
            X (np.ndarray): The input features.

        Returns:
            np.ndarray: The predicted values.
        """
        X = self.add_bias(X)
        return X @ self.w

# Linear Regression with Regularizations

## Why Use Regularization?

In linear regression, the goal is to fit a model that minimizes the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. However, in many real-world problems, this approach can lead to **overfitting**, where the model learns the noise or random fluctuations in the data instead of the actual underlying patterns. This typically happens when we have a large number of features or when the model is too complex for the available data.

### Overfitting
Overfitting occurs when the model is too "flexible" and captures not just the true relationships but also the noise in the data. This can result in a model that performs well on the training data but poorly on unseen (test) data, as it has essentially "memorized" the training data rather than learning generalizable patterns.

### Regularization
Regularization is a technique used to reduce overfitting by adding a penalty term to the loss function. This penalty term discourages the model from fitting the noise by shrinking the magnitude of the coefficients. In this way, the model is forced to find a balance between fitting the training data and keeping the model as simple as possible.

By introducing a regularization term, we add a cost to the model’s complexity, typically by penalizing large weights. This helps in reducing the model's variance and improving its ability to generalize to new data.

## Types of Regularization

There are primarily three types of regularization techniques used in linear regression:

### 1. **L2 Regularization (Ridge Regression)**

Ridge regression adds a penalty proportional to the sum of the squares of the coefficients. The loss function becomes:

$$ \mathcal{L}(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2 + \lambda \| \mathbf{w} \|^2 $$

Where:
- $\lambda$ is the regularization strength (also called the Ridge parameter).
- $\mathbf{w}$ is the vector of coefficients.

This penalty shrinks the coefficients by forcing them to be small, but it does not set them exactly to zero.
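The shrinkage effect can be observed directly, because the Ridge loss above has a closed-form minimizer: setting its gradient to zero gives $(\mathbf{X}^T\mathbf{X} + N\lambda\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$. The sketch below assumes synthetic data and omits the bias term for brevity; `ridge_closed_form` is a hypothetical helper name.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimizer of (1/N) * ||X w - y||^2 + lam * ||w||^2 (no bias term, for brevity)."""
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)

# Synthetic data with a little noise (hypothetical values)
X = rng.normal(size=(40, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.normal(size=40)

lambdas = (0.0, 0.1, 1.0, 10.0)
weights = [ridge_closed_form(X, y, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]

for lam, w in zip(lambdas, weights):
    print(lam, w)
```

As $\lambda$ grows, the norm of the weight vector decreases, yet the individual coefficients stay nonzero.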

### 2. **L1 Regularization (Lasso Regression)**

Lasso regression adds a penalty proportional to the sum of the absolute values of the coefficients. The loss function becomes:

$$ \mathcal{L}(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2 + \lambda \| \mathbf{w} \|_1 $$

Where:
- $\lambda$ is the regularization strength (also called the Lasso parameter).
- $\mathbf{w}$ is the vector of coefficients.

Lasso has the added benefit of performing **feature selection**, as it can drive some of the coefficients to exactly zero.

### 3. **Elastic Net Regularization**

Elastic Net combines both L1 and L2 penalties. The loss function becomes:

$$ \mathcal{L}(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2 + \lambda \left( \alpha \| \mathbf{w} \|_1 + \frac{1-\alpha}{2} \| \mathbf{w} \|_2^2 \right) $$

Where:
- $\lambda$ is the regularization strength.
- $\alpha$ controls the mix between Lasso $(\alpha = 1)$ and Ridge $(\alpha = 0)$ regularization.
- $\mathbf{w}$ is the vector of coefficients.

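Elastic Net is useful when there are many correlated features. Lasso tends to select one feature from a group of correlated features and drop the rest, whereas the L2 part of Elastic Net encourages correlated features to be kept or dropped together.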
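To make the Elastic Net penalty concrete, the sketch below evaluates it and a subgradient for a small weight vector. The function names and the example weights are assumptions for illustration; at $w_j = 0$ the L1 term is not differentiable, and `np.sign` picks 0 there, which is one valid subgradient.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    return lam * (alpha * np.sum(np.abs(w)) + 0.5 * (1.0 - alpha) * np.sum(w ** 2))

def elastic_net_subgradient(w, lam, alpha):
    """A subgradient of the penalty with respect to w (sign(0) is taken as 0)."""
    return lam * (alpha * np.sign(w) + (1.0 - alpha) * w)

w = np.array([2.0, -1.0, 0.0])  # hypothetical weights

penalty_lasso = elastic_net_penalty(w, lam=0.1, alpha=1.0)  # pure L1
penalty_ridge = elastic_net_penalty(w, lam=0.1, alpha=0.0)  # pure L2 (with the 1/2 factor)
subgrad = elastic_net_subgradient(w, lam=0.1, alpha=0.5)

print(penalty_lasso, penalty_ridge, subgrad)
```

For these weights, $\|\mathbf{w}\|_1 = 3$ and $\|\mathbf{w}\|_2^2 = 5$, so the two pure penalties evaluate to 0.3 and 0.25 respectively.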


In [None]:
class RegularizedRegression(RegressionModel):
    def __init__(self, learning_rate=0.01, epochs=1000, alpha=0.5, l=0.1):
        """
        Initializes the Regularized Regression model.

        Parameters:
        - learning_rate: The learning rate for gradient descent.
        - epochs: The number of iterations for gradient descent.
        - alpha: The mixing parameter between L1 (Lasso) and L2 (Ridge) regularization.
          alpha = 1 corresponds to Lasso, and alpha = 0 corresponds to Ridge.
        - l: The regularization strength (lambda).
        """
        super().__init__()
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.alpha = alpha
        self.l = l

    def fit(self, X, y):
        """
        Trains the model using gradient descent.

        Parameters:
        - X: Feature matrix (n_samples x n_features).
        - y: Target vector (n_samples).
        """
        X_bias = self.add_bias(X)
        self.n_samples, self.n_features = X_bias.shape
        self.w = np.zeros(self.n_features)

        for _ in range(self.epochs):
            predictions = X_bias @ self.w
            errors = predictions - y

            gradient = (X_bias.T @ errors) / self.n_samples

            l1_reg = self.alpha * self.l * np.sign(self.w)
            l2_reg = (1 - self.alpha) * self.l * self.w

            gradient += l1_reg + l2_reg

            self.w -= self.learning_rate * gradient

    def predict(self, X):
        """
        Makes predictions on new data.

        Parameters:
        - X: Feature matrix (n_samples x n_features).
        
        Returns:
        - np.ndarray: Predicted values for the input X.
        """
        X_bias = self.add_bias(X)
        return X_bias @ self.w

# Logistic Regression

Logistic Regression is a classification algorithm used to predict the probability of a binary outcome (i.e., two classes). Unlike Linear Regression, which is used for regression tasks, Logistic Regression is used for classification, where the output is a probability between 0 and 1, which can then be mapped to a class label.

### Logistic Regression Hypothesis Function

In Logistic Regression, we aim to predict the probability that an input sample belongs to a particular class (e.g., class 1). The hypothesis function is defined using the **sigmoid function**:

$$ \hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + \beta) $$

Where:
- $ \hat{y} $ is the predicted probability that the input belongs to class 1.
- $ \mathbf{w} $ is the vector of weights for the features.
- $ \mathbf{x} $ is the input feature vector.
- $ \beta $ is the bias term (intercept).
- $ \sigma(z) $ is the sigmoid function.

In matrix form, the hypothesis function becomes:

$$ \hat{y} = \sigma(\mathbf{X} \mathbf{w}) $$

Where:
- $ \mathbf{X} $ is the feature matrix with the bias term added (typically a column of ones).
- $ \mathbf{w} $ is the vector of weights (including the bias term).
- $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the **sigmoid function**.

### Sigmoid Function

The **sigmoid function** (also known as the logistic function) maps any real-valued number to a value between 0 and 1, making it suitable for modeling probabilities. The sigmoid function is defined as:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where:
- $ z = \mathbf{w}^T \mathbf{x} $ is the linear combination of input features and weights (the "logit").
- The output of the sigmoid function is a probability, which is used for binary classification.

The sigmoid function has the following properties:
- $ \sigma(z) $ is continuous and smooth.
- $ \sigma(z) $ is strictly increasing.
- $ \sigma(z) $ outputs values between 0 and 1.
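These properties can be checked numerically. The sketch below uses a piecewise formulation that avoids overflow in `np.exp` for large negative inputs; this is one common approach, not the only one, and `stable_sigmoid` is a hypothetical helper that expects an array input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated without overflowing np.exp for large |z| (array input)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p = stable_sigmoid(z)

print(p)
```

In floating point the outputs saturate to exactly 0 and 1 at the extremes, which is why log-loss implementations usually clip probabilities before taking logarithms.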

### Logistic Regression Loss Function

In logistic regression, we use the **log loss** (also called binary cross-entropy loss) to measure the difference between the predicted probabilities and the actual binary labels. The loss function for logistic regression is given by:

$$ \mathcal{L}(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \right) $$

Where:
- $ N $ is the number of training samples.
- $ y_i $ is the actual class label for the `i-th` sample (either 0 or 1).
- $ \hat{y_i} $ is the predicted probability for the `i-th` sample, computed using the sigmoid function $ \sigma(\mathbf{w}^T \mathbf{x_i}) $.

The goal of Logistic Regression is to **minimize** this loss function by adjusting the weights $ \mathbf{w} $ using optimization techniques like **Gradient Descent**.
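As a rough illustration of how the log loss behaves, the sketch below evaluates it for three sets of hypothetical predictions: confident and correct, uninformative (all 0.5), and confident but wrong. The helper name `binary_cross_entropy` and the clipping constant are assumptions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean log loss; probabilities are clipped so that log(0) never occurs."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])

confident_right = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.9, 0.1]))
uninformative = binary_cross_entropy(y_true, np.full(4, 0.5))
confident_wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.1, 0.9]))

print(confident_right, uninformative, confident_wrong)
```

Predicting 0.5 everywhere gives a loss of $\log 2$, and the loss grows quickly when the model is confidently wrong.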

### Gradient Descent for Logistic Regression

To update the weights in logistic regression, we compute the gradient of the loss function with respect to the weights $ \mathbf{w} $. The gradient is given by:

$$ \boxed{\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} ( \hat{y_i} - y_i ) \mathbf{x_i}} $$

Where:
- $\hat{y_i} = \sigma(\mathbf{w}^T \mathbf{x_i})$ is the predicted probability for the `i-th` sample.
- $\mathbf{x_i}$ is the feature vector for the `i-th` sample.

This gradient can be used to update the weights using the **Gradient Descent** algorithm:

$$ \mathbf{w} = \mathbf{w} - \eta \cdot \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}) $$

Where:
- $\eta $ is the learning rate.
- $\mathbf{w}$ is the weight vector.

The update rule involves iterating over the training data and adjusting the weights to minimize the loss function.
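The gradient formula above can be verified against a finite-difference approximation of the log loss. This is a minimal sketch on random synthetic data; the function names, step size `h`, and data sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X_bias, y):
    p = sigmoid(X_bias @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_loss_gradient(w, X_bias, y):
    # (1/N) * sum_i (y_hat_i - y_i) * x_i, written in matrix form
    return X_bias.T @ (sigmoid(X_bias @ w) - y) / len(y)

rng = np.random.default_rng(0)
X_bias = np.hstack((np.ones((30, 1)), rng.normal(size=(30, 2))))
y = (rng.random(30) < 0.5).astype(float)   # random binary labels
w = rng.normal(size=3) * 0.1               # small random weights

analytic = log_loss_gradient(w, X_bias, y)

# Central finite differences, one coordinate direction at a time
h = 1e-6
numeric = np.array([
    (log_loss(w + h * e, X_bias, y) - log_loss(w - h * e, X_bias, y)) / (2 * h)
    for e in np.eye(3)
])

print(analytic)
print(numeric)
```

Close agreement between the two vectors suggests the analytic gradient is implemented consistently with the loss.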



In [5]:
class LogisticRegression(RegressionModel):
    def __init__(self, epoches=1000, learning_rate=0.1):
        super().__init__()
        self.epoches = epoches
        self.learning_rate = learning_rate
        self.probs = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def loss_function(self, X: np.ndarray, y: np.ndarray):
        sigmoids = self.sigmoid(X @ self.w)

        epsilon = 1e-15
        # to be sure log(0) not happens
        sigmoids = np.clip(sigmoids, epsilon, 1 - epsilon)
        
        loss = - np.mean(((y * np.log(sigmoids)) + ((1-y) * np.log(1 - sigmoids))))
        return loss

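    def fit(self, X: np.ndarray, y: np.ndarray):
        # Add the bias column once, outside the training loop
        X_bias = self.add_bias(X)
        n_samples = X_bias.shape[0]
        self.w = np.zeros(X_bias.shape[1])
        for _ in range(self.epoches):
            # Use predicted probabilities, not thresholded labels, so the
            # update matches the gradient of the log loss
            probs = self.sigmoid(X_bias @ self.w)
            errors = probs - y

            gradient = (X_bias.T @ errors) / n_samples

            self.w = self.w - self.learning_rate * gradient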

    def predict(self, X: np.ndarray):
        X = self.add_bias(X)
        self.probs = self.sigmoid(X @ self.w)
        return (self.probs >= 0.5).astype(int)