# Logistic Regression From Scratch 
--- 

This is a full implementation of the classification algorithm **Logistic Regression** from scratch using only **NumPy**, including:
- Logistic Function
- Gradient Descent
  - Full Batch
  - Mini-Batch
  - Stochastic
- Cross entropy Loss (Cost function)
- Prediction and accuracy
- Ridge Logistic Regression **L2**
- Lasso Logistic Regression **L1**
- Cross Validation

There will be another Jupyter notebook for the **Heart Disease** dataset (mini study) plus an **sklearn** benchmark.

## Implementation 

In [2]:
import numpy as np

- We only need **NumPy** to implement every logistic regression functionality in this project
- The class-based implementation can be found in the `Logistic_Regression_Scratch.py` file, which follows better coding practices, similar to the `sklearn` library

### Generating Data

In [10]:
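def generate_dummy_data(n=1000,p=3,seed=12):
    np.random.seed(seed)

    # n observations of p standard-normal features
    X= np.random.randn(n,p)

    # true parameters used to simulate the responses
    coeff_true = np.array([2]*p)
    intercept = 3

    # success probability of each observation from the logistic function
    logit = X @ coeff_true + intercept
    P_x = np.exp(logit)/(1+ np.exp(logit))

    # draw each binary response with its own probability
    y = np.random.binomial(1,P_x)

    return X,coeff_true, intercept , y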

- This function generates dummy data for testing our functions
- `n` is the number of **observations** and `p` is the number of **features** (predictors)
- In binary Logistic Regression each response follows a **Bernoulli** distribution (a **Binomial** with a single trial), so the **true responses** are drawn at random with success probability $p(X)$

$$  p(X)=\frac{e^{\beta_{0}+X\beta}}{1+e^{\beta_{0}+X\beta}}=\frac{1}{e^{-(\beta_{0}+X\beta)}+1}$$

- This is the **Logistic Function**, also called the sigmoid function, which maps any real value to a result between $0$ and $1$
- $p(X)$ gives the probability that the response is $1$ for each observation
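One practical caveat with the formula above: `np.exp` overflows for large arguments, so evaluating the logistic function naively can produce warnings or `nan` for extreme logits. The sketch below shows one common NumPy-only workaround, which picks the algebraically equivalent form that keeps the exponent non-positive. It is not part of the original notebook, and `stable_sigmoid` is a hypothetical helper name.

```python
import numpy as np

def stable_sigmoid(z):
    """Evaluate 1 / (1 + exp(-z)) without overflow (hypothetical helper)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the second form of the formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) < 1, so the first form of the formula is safe.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p = stable_sigmoid(z)
print(p)
```

Both branches compute the same function, so the result stays between $0$ and $1$ and satisfies $p(-z) = 1 - p(z)$.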

### Logistic Function

In [14]:
def sigmoid_function(X,intercept,beta):
    logit = X @ beta + intercept

    P_x = 1/(np.exp(-logit)+1)

    return P_x , logit 

- This `sigmoid_function` calculates the probability for each observation, as stated above, and returns it together with the `logit`
- The `logit` is the log of the **odds**
$$ \text{odds} = \frac{p(X)}{1-p(X)} $$

- The `logit` is linear in the coefficients, just like a **Linear Regression** equation, which allows us to do **inference** and statistical analysis on **Logistic Regression**
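To make the link between the `logit` and the odds concrete, here is a small numerical check. The coefficient values and the two-row `X` are illustrative assumptions, and `odds_of` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

beta0 = 3.0                        # illustrative intercept
beta = np.array([2.0, 2.0, 2.0])   # illustrative coefficients
X = np.array([[0.1, -0.4, 0.3],
              [-1.2, 0.5, -0.8]])

def odds_of(X):
    """Return the odds p/(1-p) and the logit for each row of X."""
    logit = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-logit))
    return p / (1.0 - p), logit

odds, logit = odds_of(X)
# Taking the log of the odds recovers the linear predictor.
print(np.allclose(np.log(odds), logit))

# Raising the first feature by one unit multiplies the odds by exp(beta[0]).
X_shifted = X.copy()
X_shifted[:, 0] += 1.0
odds_shifted, _ = odds_of(X_shifted)
odds_ratio = odds_shifted / odds
print(np.allclose(odds_ratio, np.exp(beta[0])))
```

This is why the coefficients are usually interpreted on the odds scale: each $\beta_j$ is the change in the log-odds for a one-unit change in feature $j$.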

### Cross Entropy Loss (Cost Function)

In [15]:
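def cross_entropy_loss(y,X,beta,intercept):

    # sigmoid_function returns (probabilities, logit); only the probabilities are needed
    sigmoid_fn, _ = sigmoid_function(X,intercept,beta)
    sigmoid_fn = np.ravel(sigmoid_fn)
    y = np.ravel(y)
    # average negative log-likelihood over the observations
    cost_fn = -np.mean(y*np.log(sigmoid_fn)+(1-y)*np.log(1-sigmoid_fn))

    return cost_fn

- The cost function for the **Logistic Regression** is called the **cross-entropy loss**, given by: $$J(\beta)=-\frac{1}{n}[y^T \log(p(X))+(1-y)^T\log(1-p(X))] $$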

- It is the negative log-likelihood of the model (up to a constant factor), so minimizing it is the same as **maximum likelihood** estimation
- The likelihood is a product of Bernoulli **PMFs**, one per observation (the Binomial PMF with a single trial)
- The `-` in the equation turns the maximization into a minimization so that **Gradient Descent** can be applied <br>
(more information and details in the documentation PDF)
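One numerical detail worth keeping in mind: if a predicted probability is exactly $0$ or $1$, the loss evaluates `log(0)` and returns `inf` or `nan`. A common guard is to clip the probabilities slightly away from the boundaries. The sketch below assumes the loss is averaged over the observations; `clipped_cross_entropy` and the `eps` value are illustrative choices, not part of the original code.

```python
import numpy as np

def clipped_cross_entropy(y, p, eps=1e-12):
    """Average cross-entropy with probabilities clipped to [eps, 1 - eps]."""
    p = np.clip(np.ravel(p), eps, 1.0 - eps)
    y = np.ravel(y)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 0])
p_confident = np.array([1.0, 0.0, 1.0, 0.0])   # perfectly confident and correct
p_uninformed = np.full(4, 0.5)                 # always predicts 0.5

loss_confident = clipped_cross_entropy(y, p_confident)
loss_uninformed = clipped_cross_entropy(y, p_uninformed)
print(loss_confident, loss_uninformed)
```

The uninformed predictor scores $\log 2$, which is a handy baseline: a fitted model should do better than that on balanced data.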

### Gradient Descent

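- Time to fit our logistic regression and estimate the coefficients $\beta$. This function covers all 3 common types of gradient descent through `batch_size`: `batch_size = n` gives full batch, `1 < batch_size < n` gives mini-batch, and `batch_size = 1` gives stochastic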

In [19]:
def gradient_descent(lr,n_itr,batch_size,Y,X,n):
    p = X.shape[1]
    beta_est = np.zeros((p,1))
    intercept_est = 0

    for i in range(n_itr):
        idx = np.random.choice(n,size = batch_size , replace = False)
        X_GD = X[idx]
        Y_GD = Y[idx].reshape(-1,1)

        # sigmoid_function returns (probabilities, logit); keep only the probabilities
        sigmoid_fn, _ = sigmoid_function(X_GD,intercept_est,beta_est)

        gradient_cel = (1/batch_size)*(X_GD.T@(sigmoid_fn-Y_GD))

        gradient_intercept = (1/batch_size)*np.sum(sigmoid_fn-Y_GD)

        beta_est = beta_est - (lr*gradient_cel)
        intercept_est = intercept_est - (lr*gradient_intercept)

    return beta_est , intercept_est

- **Logistic Regression** has no closed-form solution, unlike **Linear Regression** (OLS), so the coefficients must be estimated with an iterative method such as gradient descent. The gradient of the cost function (cross-entropy loss) is: $$ \nabla J(\beta)=\frac{1}{n}X^T(\sigma(\beta_{0}+X\beta)-y)$$

And for the intercept it is: $$ \nabla J(\beta_{0})=\frac{1}{n}\sum(\sigma(\beta_{0}+X\beta)-y)$$

- They are simply the partial derivatives of the cost function with respect to $\beta$ and the intercept $\beta_{0}$

- `idx` holds the indices of a random batch of `batch_size` observations, drawn without replacement from the `n` observations
- Both `X_GD` and `Y_GD` are the sampled batch used to calculate the gradient for the next step of the **Gradient Descent** algorithm, so in the code the $n$ of the formulas is the batch size
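To see how the batch size alone selects the gradient descent variant, here is a compact, self-contained sketch that re-implements the same update rule with 1-D arrays. `fit_logistic`, the learning rate, the iteration count and the batch size of 32 are illustrative assumptions rather than the notebook's settings, and the data is simulated with the same true coefficients as above.

```python
import numpy as np

def fit_logistic(X, y, lr, n_itr, batch_size, seed=0):
    """Gradient descent for logistic regression on random batches (illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    for _ in range(n_itr):
        # Sample a batch of row indices without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Residual sigma(beta_0 + X beta) - y on the batch.
        resid = 1.0 / (1.0 + np.exp(-(Xb @ beta + intercept))) - yb
        beta -= lr * (Xb.T @ resid) / batch_size
        intercept -= lr * resid.mean()
    return beta, intercept

# Simulated data: 3 features, true coefficients 2 and true intercept 3.
rng = np.random.default_rng(12)
n = 1000
X = rng.standard_normal((n, 3))
true_logit = X @ np.array([2.0, 2.0, 2.0]) + 3.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

variants = {"full batch": n, "mini-batch": 32, "stochastic": 1}
for name, batch_size in variants.items():
    beta, intercept = fit_logistic(X, y, lr=0.1, n_itr=2000, batch_size=batch_size)
    print(name, np.round(beta, 2), round(float(intercept), 2))
```

Smaller batches make each step cheaper but noisier, so the stochastic estimates typically fluctuate more from run to run than the full-batch ones.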

### Prediction & Accuracy 

In [23]:
def predict(X,intercept,beta,threshold=0.5):
    # sigmoid_function returns (probabilities, logit); keep only the probabilities
    predicted_probability, _ = sigmoid_function(X,intercept,beta)

    predicted_class = (predicted_probability>= threshold).astype(int)
    return predicted_class, predicted_probability

- This function calculates the probability for each observation using the estimated coefficients from the `gradient_descent` function
- It then classifies each observation as $0$ or $1$ based on a `threshold`, usually set to $0.5$
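The threshold is a free choice, and moving it trades positives for negatives. The toy probabilities below are made up purely for illustration and show how the predicted classes change with it.

```python
import numpy as np

# Made-up predicted probabilities for six observations.
probs = np.array([0.05, 0.35, 0.45, 0.55, 0.65, 0.95])
thresholds = (0.3, 0.5, 0.7)

counts = []
for threshold in thresholds:
    labels = (probs >= threshold).astype(int)
    counts.append(int(labels.sum()))
    print(threshold, labels)
```

A lower threshold labels more observations as $1$, which can matter when missing a positive case is more costly than a false alarm.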

In [None]:
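def accuracy(Y,Y_pred):
    # flatten both arrays so shapes (n,) and (n,1) compare element-wise instead of broadcasting to (n,n)
    return np.mean(np.ravel(Y_pred)==np.ravel(Y))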

- Compares the true values of the response `Y` with the predicted values `Y_pred` and returns the fraction that match
- This will come in handy when we compare different regularizations and hyperparameters
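One shape pitfall to watch for when comparing labels with `==`: `gradient_descent` returns `beta_est` as a column of shape `(p, 1)`, so `predict` returns predictions of shape `(n, 1)`, while `y` from `generate_dummy_data` has shape `(n,)`. NumPy broadcasts that comparison to an `(n, n)` matrix, which silently gives the wrong accuracy. The tiny arrays below are made up to show the effect.

```python
import numpy as np

Y = np.array([1, 0, 1, 1])                # true labels, shape (4,)
Y_pred = np.array([[1], [0], [0], [1]])   # predictions, shape (4, 1)

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) matrix of all pairs.
broadcast = (Y_pred == Y)
naive = np.mean(broadcast)

# Flattening both sides compares the labels element by element.
safe = np.mean(np.ravel(Y_pred) == np.ravel(Y))

print(broadcast.shape, naive, safe)
```

Here three of the four predictions are correct, which is what the flattened comparison reports, while the broadcast version averages over all sixteen pairs.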

### Ridge Logistic Regression 