
# Logistic Regression / Softmax Regression

### Softmax Regression

Logistic regression starts from a linear score $Z$; the decision boundary is the set of points where this score is zero ($Z = 0$), so the boundary is linear:

$Z = W_1 X_1 + W_2 X_2 + \dots + W_n X_n + C$

Where:
- $W_1, W_2, \dots, W_n$ are the weights associated with the features $X_1, X_2, \dots, X_n$,
- $C$ is the bias term,
- $Z$ is the linear combination of weights and inputs.
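
A minimal sketch of this linear score in NumPy, assuming made-up weights, bias, and feature values chosen only for illustration (none of these numbers come from a trained model):

```python
import numpy as np

# Hypothetical weights W_1..W_n, bias C, and one input point X_1..X_n.
W = np.array([0.5, -1.0, 2.0])
C = 0.25
X = np.array([2.0, 1.0, 0.5])

# Z = W_1*X_1 + W_2*X_2 + ... + W_n*X_n + C
Z = W @ X + C
print(Z)
```

Points with $Z > 0$ fall on one side of the boundary and points with $Z < 0$ on the other.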

### Adjusting the Decision Boundary

To improve the model, every point should move the decision boundary: a misclassified point should "pull" the line toward itself, a correctly classified point should "push" it away, and the size of the move should depend on how far the point lies from the boundary. A step function can classify points, but it is not differentiable at zero and has zero gradient everywhere else, which makes gradient-based optimization impractical.

Instead of using a **Step Function** to classify points, we use the **Sigmoid Function** to satisfy the two key conditions:
- Push the decision boundary away from points that are correctly classified.
- Scale each adjustment by the point's distance from the decision boundary.

The **Sigmoid Function** is given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps the output to a probability value between 0 and 1, which is useful for binary classification problems.
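
To make the contrast with the step function concrete, here is a small sketch that evaluates both on a few sample scores. The helper names `step` and `sigmoid` and the sample values are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def step(z):
    # Hard classifier: 1 when z >= 0, else 0. Its slope is zero almost everywhere.
    return (np.asarray(z) >= 0).astype(float)

def sigmoid(z):
    # Smooth alternative: 1 / (1 + e^{-z}), with outputs between 0 and 1.
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
step_out = step(z)
sigmoid_out = sigmoid(z)

# Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), largest at z = 0.
sigmoid_grad = sigmoid_out * (1.0 - sigmoid_out)

print(step_out)
print(np.round(sigmoid_out, 3))
print(np.round(sigmoid_grad, 3))
```

The step output jumps from 0 to 1 with no usable gradient, while the sigmoid changes gradually and has a non-zero derivative at every score shown.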

---

### Limitations of Using the Step Function Directly

Even with the sigmoid function, looping over the points and nudging the boundary one point at a time only shifts the line; it does not guarantee a solution that generalizes well. This is because:
- Shifting the boundary point by point is inefficient.
- Nothing in the procedure defines which line is "best"; a loss function is needed so that the line can be found by optimization.

### Using the Logistic Function for Prediction

For each point $i$, the predicted probability of being in a particular class (e.g., class 1) is given by:

$$\hat{y}_i = \frac{1}{1 + e^{-z_i}} = \frac{1}{1 + e^{-(W_1 x_1 + W_2 x_2 + \dots + W_n x_n + C)}}$$

Where $\hat{y}_i$ is the predicted probability for point $i$, $x_1, \dots, x_n$ are the feature values of that point, and $z_i$ is its linear score.

This can be interpreted as:
- $\hat{y}_i$ is the probability that point $i$ belongs to the positive class (here, green).
- For binary classification:
  - $\hat{y}_i = P(G)$, the probability that the point is green.
  - $1 - \hat{y}_i = P(R)$, the probability that the point is red.

---

### Loss Function: Maximum Likelihood Estimation

To fit the logistic regression model, we choose the weights that maximize the probability of observing the true labels under the model's predicted probabilities. This is called **Maximum Likelihood Estimation (MLE)**. In the case of logistic regression, the likelihood function is:

$$L = P(y_1) \cdot P(y_2) \cdot P(y_3) \dots P(y_N)$$

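Where $P(y_i)$ is the probability assigned to the correct class for point $i$: $\hat{y}_i$ when $y_i = 1$ and $1 - \hat{y}_i$ when $y_i = 0$. Writing the likelihood as a product treats the points as independent.

However, multiplying many probabilities below 1 quickly produces values too small for floating-point numbers to represent (underflow). Because the logarithm is monotonic, maximizing the log of the likelihood yields the same weights, and it turns the product into a sum. Negating and averaging that sum gives the **Log-Loss** (or **Binary Cross-Entropy**).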

---

### Binary Cross-Entropy (Log-Loss) Function

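The **Log-Loss Function** is the negative log-likelihood averaged over the $N$ data points. Written as the cross-entropy $H(p, q)$ between the true label distribution $p$ and the predicted distribution $q$, it is: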

$$H(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Where:
- $y_i$ is the true label of point $i$ (either 0 or 1),
- $\hat{y}_i$ is the predicted probability for point $i$,
- $N$ is the total number of data points.

The goal of logistic regression is to minimize this **cross-entropy loss** function, which pushes the predicted probabilities $\hat{y}_i$ toward the true labels $y_i$ and penalizes confident wrong predictions most heavily.
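
The formula above can be sketched as a small function. The name `binary_cross_entropy`, the clipping constant `eps`, and the sample labels and predictions are assumptions made for this illustration; the clipping is a common safeguard against `log(0)`, not something the formula itself requires.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # eps is an assumed safeguard: clip predictions away from 0 and 1
    # so that log() never receives exactly 0.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])  # probabilities close to the labels
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])  # probabilities far from the labels

loss_right = binary_cross_entropy(y, mostly_right)
loss_wrong = binary_cross_entropy(y, mostly_wrong)
print(loss_right, loss_wrong)
```

The loss for `mostly_right` is much smaller than for `mostly_wrong`, which is the behavior the optimizer exploits.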

---

### Summary

- **Logistic Regression** uses the sigmoid function to model the probability that a given input belongs to a particular class.
- The model aims to maximize the likelihood of the true labels by minimizing the **binary cross-entropy** loss function.
- Minimizing this loss with gradient descent moves the decision boundary toward the position that best explains the training labels.


In [77]:
# data manipulation
import numpy as np
import pandas as pd

class LogisticRegression:
  def __init__(self, alpha=0.5, epochs = 500, threshold = 1e-5):
    self.alpha = alpha
    self.epochs = epochs
    self.threshold = threshold


  def fit(self, X, y):
    num_of_classes = len(np.unique(y))
    num_of_columns = X.shape[1]
    num_of_rows = X.shape[0]
    self.constants = np.random.rand(num_of_classes)
    self.weights = np.random.rand(num_of_classes,num_of_columns)
    Y = np.eye(num_of_classes)[y] ## One-hot encode y by indexing rows of the identity matrix (assumes integer labels 0..K-1)
    prev_J = np.inf
    for epoch in range(self.epochs):
      z = (X @ self.weights.T) + self.constants ## Linear scores Z, one column per class
      sig_out = np.exp(z)/np.sum(np.exp(z), axis=1, keepdims=True) ## Softmax probabilities for Z
      errors = Y - sig_out ## Error: one-hot labels minus predicted probabilities
      J = -((Y * np.log(sig_out)).sum(axis=1)).mean() ## Cost Function: mean cross-entropy
      if np.abs(prev_J - J) < self.threshold:
        print(f"Algorithm completed at {epoch} Iterations")
        break
      else:
        ## Calculating Gradient
        weight_gradient = (-1 /num_of_rows) * (errors.T @ X)
        constant_gradient = -errors.mean(axis=0)

        ## Update self.weights and Bias
        self.weights = self.weights - (self.alpha * weight_gradient)
        self.constants = self.constants - (self.alpha * constant_gradient)
        prev_J = J

  def predict(self, X):
    z = (X @ self.weights.T) + self.constants ## Linear scores Z
    sig_out = np.exp(z)/np.sum(np.exp(z), axis=1, keepdims=True) ## Softmax probabilities
    return np.argmax(sig_out, axis=1) ## Most probable class per row


  def predict_proba(self, X):
    z = (X @ self.weights.T) + self.constants ## Linear scores Z
    sig_out = np.exp(z)/np.sum(np.exp(z), axis=1, keepdims=True) ## Softmax probabilities
    return np.round(sig_out, decimals=2)
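
The class above computes `np.exp(z)` directly, which can overflow when scores grow large. A common safeguard, not used in the code above, is to subtract each row's maximum before exponentiating. The sketch below assumes a standalone helper named `stable_softmax` (a hypothetical name) and made-up scores.

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the row-wise max leaves the result unchanged mathematically
    # but keeps np.exp from overflowing on large scores.
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # np.exp overflows on this row without the shift
probs = stable_softmax(scores)
print(probs)
print(probs.sum(axis=1))
```

Both rows give the same probabilities because softmax depends only on the differences between scores, and each row sums to 1.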


In [73]:
# dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=7)

In [78]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

array([2, 0, 0, 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0,
       1, 0, 2, 2, 1, 1, 0, 2])

In [79]:
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression  # alias avoids shadowing the class defined above

lr = SklearnLogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

array([2, 0, 0, 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0,
       1, 0, 2, 2, 2, 1, 0, 2])
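
The two prediction arrays printed above can be compared directly. The sketch below copies them by hand; the variable names `scratch_pred` and `sklearn_pred` are chosen here for illustration.

```python
import numpy as np

# Predictions copied from the two outputs above.
scratch_pred = np.array([2, 0, 0, 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0,
                         1, 0, 2, 2, 1, 1, 0, 2])
sklearn_pred = np.array([2, 0, 0, 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0,
                         1, 0, 2, 2, 2, 1, 0, 2])

agreement = np.mean(scratch_pred == sklearn_pred)               # fraction of matching predictions
differing = np.flatnonzero(scratch_pred != sklearn_pred)        # positions where the models disagree
print(agreement, differing)
```

The two models disagree on a single test point (index 26). Agreement between models is not the same as accuracy; comparing either array against `y_test`, for example with `sklearn.metrics.accuracy_score`, would measure that.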