# Logistic Regression

Logistic regression is a widely used algorithm for classification problems, in which the target $y$ is a categorical variable (nominal or ordinal). Algorithms for these problems are usually called "classifiers", while algorithms for problems with a continuous target $y$ are called "regressors".

## Table of Contents:

* [Introduction to Logistic Regression](#introduction-to-logistic-regression)
* [Likelihood](#likelihood)
* [Gradient Ascent](#gradient-ascent)
* [Code](#code)

## Introduction to Logistic Regression

Suppose we try to use linear regression for a binary (2-class) classification problem (e.g., is a tumor malignant or not, based on its size?).

![Linear Regression in Classification - case 1](assets/images/linear-regression-in-classification-1.png)

We can fit this line and predict malignant when $h_\theta(x) \geq 0.5$, and not malignant otherwise.

<img src="assets/images/linear-regression-in-classification-1-details.png">

This looks reasonable so far. Now suppose we introduce another sample:

<img src="assets/images/linear-regression-in-classification-2.png" width="1500" height="600">

The fitted line is now skewed towards the new sample, which shifts the tumor-size threshold at which $h_\theta(x) \geq 0.5$, although the threshold should be the same as in the previous case.

<img src="assets/images/linear-regression-in-classification-2-details.png">

Thus linear regression isn't a good option for classification problems, where we need to find the optimal decision boundary given some data samples. This is where logistic regression comes in.
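The threshold shift described above can be reproduced with a small sketch. The tumor sizes and labels below are made-up illustrative values (not the data behind the figures), and `threshold_at_half` is a hypothetical helper name. It fits a least-squares line and reports the size at which the line crosses 0.5, before and after adding one large malignant sample.

```python
import numpy as np

def threshold_at_half(sizes, labels):
    """Fit labels ~ a * size + b by least squares; return the size where the line equals 0.5."""
    a, b = np.polyfit(sizes, labels, deg=1)
    return (0.5 - b) / a

# Hypothetical tumor sizes and labels (0 = not malignant, 1 = malignant)
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

before = threshold_at_half(sizes, labels)

# Add a single, very large malignant tumor
after = threshold_at_half(np.append(sizes, 20.0), np.append(labels, 1))

print(f"threshold before: {before:.2f}, after: {after:.2f}")
```

With these values the threshold moves from 3.5 to a size above 4, so the malignant sample of size 4 would now be predicted as not malignant, even though the new sample carries no information that should move the boundary.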

The logistic regression model uses a hypothesis based on the ***sigmoid (logistic) function***, which gives a value in the range (0, 1):

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

$$\boxed{ h_\theta(x) = \sigma(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}} } \tag{1}$$

$$h_\theta(x) \in (0, 1)$$


<img src="assets/images/sigmoid-function.png" width="1000" height="600">

The hypothesis $h_\theta(x)$ is interpreted as the probability that $y = 1$ given $x$, parameterized by $\theta$, for binary classification:

$$P(y = 1|x;\theta) = h_\theta(x)$$
$$P(y = 0|x;\theta) = 1 - h_\theta(x)$$
$$P(y|x;\theta) = h_\theta(x)^y (1 - h_\theta(x))^{(1-y)}$$
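As a minimal sketch of these formulas, the snippet below evaluates the sigmoid hypothesis and the combined expression $h_\theta(x)^y (1 - h_\theta(x))^{(1-y)}$ for one sample. The parameter and feature values are arbitrary assumptions chosen for illustration, and `prob_of_label` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_of_label(theta, x, y):
    """P(y | x; theta) for y in {0, 1}, using h^y * (1 - h)^(1 - y)."""
    h = sigmoid(theta @ x)
    return h**y * (1 - h)**(1 - y)

theta = np.array([0.0, 1.0])  # assumed parameters, for illustration only
x = np.array([1.0, 2.0])      # x_0 = 1 is the intercept term

p1 = prob_of_label(theta, x, 1)  # equals h_theta(x)
p0 = prob_of_label(theta, x, 0)  # equals 1 - h_theta(x)
print(p1, p0, p1 + p0)
```

The single expression reduces to $h_\theta(x)$ when $y = 1$ and to $1 - h_\theta(x)$ when $y = 0$, so the two probabilities always sum to 1.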

## Likelihood

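The likelihood of the parameters $\theta$, $L(\theta)$, is the probability of observing the true labels given the inputs when the parameters are set to specific values. The higher the likelihood, the better the parameters explain the data. Assuming the $m$ training samples are independent, it is the product of the per-sample probabilities: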

$$L(\theta) = \prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^{m}  \left[ h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{(1-y^{(i)})} \right]$$

We can calculate the log-likelihood, $l(\theta)$, which turns the product into a sum:

$$\boxed{ l(\theta) = \log(L(\theta)) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left(1 - h_\theta(x^{(i)})\right) \right] } \tag{2}$$

Minimizing the cost function (the negative log-likelihood) is the same as **maximizing the likelihood**, because the logarithm is monotonically increasing. We will use **gradient *ascent*** to find the parameters that maximize the log-likelihood.

**Note:** you can use the negative of the log-likelihood as the *cost* function and use gradient descent. It's just some mathematical manipulation.
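To make the relationship concrete, here is a small sketch that computes the log-likelihood as the sum of $y \log h + (1 - y) \log(1 - h)$ over the samples, and its negative as a cost. The four samples and both parameter vectors are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Sum over samples of y * log(h) + (1 - y) * log(1 - h)."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data (assumed): first column is the intercept term x_0 = 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

ll_zero = log_likelihood(np.zeros(2), X, y)           # every h is 0.5
ll_good = log_likelihood(np.array([0.0, 2.0]), X, y)  # parameters that fit the labels
cost_good = -ll_good                                  # negative log-likelihood as a cost

print(ll_zero, ll_good, cost_good)
```

The parameters that fit the labels get a higher (less negative) log-likelihood and, equivalently, a lower cost.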

## Gradient Ascent

Gradient ascent works the same way as gradient descent. With gradient descent, we look for the parameters located at the minimum of a cost function; with gradient ascent, we look for the parameters located at the maximum of the likelihood function.

<img src="assets/images/gradient-descent-ascent.png">

$$\theta_j = \theta_j + \alpha * \frac{\partial{}}{\partial{\theta_j}}l(\theta)$$

$$\boxed{ \frac{\partial{}}{\partial{\theta_j}}l(\theta) = \frac{1}{m} \sum_{i=1}^m \left( y^{(i)} - h_\theta \left( x^{(i)} \right) \right) x_j^{(i)} } \tag{3}$$

$$\boxed { \theta_j := \theta_j + \alpha * \frac{\partial{}}{\partial{\theta_j}} l(\theta) } \tag{4}$$

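Equation (3) and the matrix forms in the Code section scale the log-likelihood by $\frac{1}{m}$ (the average over the samples). This constant factor does not change the maximizer; it only rescales the learning rate $\alpha$.

See the logistic regression part of the CS229 [lecture notes](http://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf) by Prof. Andrew Ng for the derivation of the gradient of the log-likelihood function.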
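A quick way to gain confidence in equation (3) without going through the derivation is a finite-difference check. The sketch below compares the analytic gradient with a numerical one. Because equation (3) carries a $\frac{1}{m}$ factor, the check uses the average log-likelihood. The random data, the seed, and the helper names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    """Equation (3) for all j at once: (1/m) * sum_i (y_i - h_i) * x_i."""
    return X.T @ (y - sigmoid(X @ theta)) / X.shape[0]

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mean_log_likelihood(theta + step, X, y)
                   - mean_log_likelihood(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # intercept + 2 features
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(theta, X, y) - numeric_gradient(theta, X, y)))
print(max_diff)
```

If the two gradients disagree by more than a tiny numerical error, the analytic formula (or its implementation) is likely wrong.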

## Code

The implementation follows the same logic as linear regression. The main differences are the hypothesis and the cost (likelihood) function; calculating the gradients and updating the parameters work the same way. In matrix form:

$$
\begin{bmatrix}
z^{(1)} \\
z^{(2)} \\
\vdots \\
z^{(m)}
\end{bmatrix}
_{m * 1}
=
\begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix}
_{m * (n + 1)}
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n
\end{bmatrix}
_{(n + 1) * 1}
$$


$$\boxed{ Z_{m * 1} = X_{m*(n+1)} \hspace{1mm} \theta_{(n+1)*1} } \tag{5}$$

$$
\begin{bmatrix}
h^{(1)} \\
h^{(2)} \\
\vdots \\
h^{(m)}
\end{bmatrix}
_{m * 1}
=
\begin{bmatrix}
\sigma\left(z^{(1)}\right) \\
\sigma\left(z^{(2)}\right) \\
\vdots \\
\sigma\left(z^{(m)}\right)
\end{bmatrix}
_{m * 1}
$$

$$
\boxed{ H_{m * 1} = \sigma \left( Z_{m*1} \right) } \tag{6}
$$

$$
l(\theta)
=
\frac{1}{m}
\left(
\begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(m)}
\end{bmatrix}^T
_{1 * m}
\begin{bmatrix}
\log \left( h^{(1)} \right) \\
\log \left( h^{(2)} \right) \\
\vdots \\
\log \left( h^{(m)} \right)
\end{bmatrix}
_{m * 1}
+
\begin{bmatrix}
\left( 1 - y^{(1)} \right) \\
\left( 1 - y^{(2)} \right) \\
\vdots \\
\left( 1 - y^{(m)} \right)
\end{bmatrix}^T
_{1 * m}
\begin{bmatrix}
\log \left( 1 - h^{(1)} \right) \\
\log \left( 1 - h^{(2)} \right) \\
\vdots \\
\log \left( 1 - h^{(m)} \right)
\end{bmatrix}
_{m * 1}
\right)
$$

$$\boxed{ l(\theta) = \frac{1}{m} \left( \hspace{1mm} Y^T_{1*m} * \hspace{1mm} \log(H)_{m*1} + \hspace{1mm} (\vec{1}_{m*1} - Y)^T_{1*m} * \hspace{1mm} \log(\vec{1}_{m*1} - H)_{m*1}\right) } \tag{7}$$

$$
\begin{bmatrix}
\frac{\partial{l(\theta)}}{\partial{\theta_0}} \\ \\
\frac{\partial{l(\theta)}}{\partial{\theta_1}} \\ \\
\vdots \\ \\
\frac{\partial{l(\theta)}}{\partial{\theta_n}}
\end{bmatrix}
_{(n + 1) * 1}
=
\frac{1}{m}
\left(
\begin{bmatrix}
x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\
\vdots & \vdots & \ddots & \vdots \\
x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)}
\end{bmatrix}
_{(n+1) * m}
\begin{bmatrix}
y^{(1)} - h_\theta(x^{(1)}) \\
y^{(2)} - h_\theta(x^{(2)}) \\
\vdots \\
y^{(m)} - h_\theta(x^{(m)})
\end{bmatrix}
_{m * 1}
\right)
$$


$$ \boxed{ \frac{\partial{l(\theta)}}{\partial{\theta}}_{(n+1) * 1} = \frac{1}{m} X^T_{(n+1) * m} (Y - H)_{m * 1} } \tag{8}$$

$$ \boxed{\theta_{(n + 1) * 1} := \theta_{(n + 1) * 1} + \alpha * \frac{\partial{l(\theta)}}{\partial{\theta}} _{(n + 1) * 1}} \tag{9}$$
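Equations (5) to (9) map almost line by line onto NumPy. The following is a minimal sketch under stated assumptions, not the notebook's own training code: the function name `fit_gradient_ascent`, the synthetic data, the seed, the learning rate, and the iteration count are all chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_ascent(X, Y, alpha=0.1, n_iterations=5000):
    """Maximize the average log-likelihood. X is m x (n+1) with a leading column of ones, Y is m x 1."""
    m, n_plus_1 = X.shape
    theta = np.zeros((n_plus_1, 1))
    history = []
    for _ in range(n_iterations):
        Z = X @ theta                       # equation (5)
        H = sigmoid(Z)                      # equation (6)
        ll = (Y.T @ np.log(H) + (1 - Y).T @ np.log(1 - H)).item() / m  # equation (7)
        history.append(ll)
        grad = X.T @ (Y - H) / m            # equation (8)
        theta = theta + alpha * grad        # equation (9)
    return theta, history

# Synthetic data (assumed): labels drawn from a logistic model with known parameters
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_theta = np.array([[0.5], [2.0], [-1.0]])
Y = (rng.random((200, 1)) < sigmoid(X @ true_theta)).astype(float)

theta, history = fit_gradient_ascent(X, Y)
print(theta.ravel(), history[0], history[-1])
```

Recording the log-likelihood at each step is optional, but it gives a simple convergence check: the value should increase from $\log(0.5)$ (all-zero parameters) and then level off.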

### Preparing the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("assets/data.csv")
data.head()

| | Exam 1 marks | Exam 2 marks | Admission status |
|---|---|---|---|
| 0 | 34.62366 | 78.024693 | 0 |
| 1 | 30.286711 | 43.894998 | 0 |
| 2 | 35.847409 | 72.902198 | 0 |
| 3 | 60.182599 | 86.308552 | 1 |
| 4 | 79.032736 | 75.344376 | 1 |


In [3]:
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values.reshape(-1, 1)

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

In [5]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

### Training

In [6]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [7]:
m_train = X_train.shape[0]
m_test = X_test.shape[0]

In [8]:
X_train_intercept = np.hstack((np.ones((m_train, 1)), X_train_scaled))
X_test_intercept = np.hstack((np.ones((m_test, 1)), X_test_scaled))

In [9]:
n = X_train_intercept.shape[1]

In [10]:
theta = np.random.random((n, 1))

In [11]:
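alpha = 0.01  # learning rate
for i in range(100000):
    z = X_train_intercept @ theta                                  # equation (5)
    h = sigmoid(z)                                                 # equation (6)
    grads = (1 / m_train) * (X_train_intercept.T @ (y_train - h))  # equation (8)
    theta = theta + alpha * grads                                  # equation (9): gradient ascent step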

In [12]:
from sklearn.metrics import confusion_matrix
# Predicted probabilities on the test set, thresholded at 0.5 to get class labels
h_hat = sigmoid(X_test_intercept @ theta)
h_hat = (h_hat >= 0.5).astype(int)
confusion_matrix(y_test.ravel(), h_hat.ravel())

array([[5, 2],
       [0, 8]], dtype=int64)
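The confusion matrix above can be turned into summary metrics by hand. This short sketch hard-codes the matrix shown in the output and relies on scikit-learn's layout (rows are true labels, columns are predicted labels), so `ravel()` yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[5, 2],
               [0, 8]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 13 correct out of 15
precision = tp / (tp + fp)        # 8 of the 10 predicted positives are correct
recall = tp / (tp + fn)           # all 8 actual positives are found

print(accuracy, precision, recall)
```

The accuracy of 13/15 ≈ 0.867 is the same value that the `score` calls report further down in this notebook.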

### sklearn model

In [13]:
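from sklearn.linear_model import LogisticRegression

# penalty='none' disables regularization; newer scikit-learn releases use penalty=None instead
lr = LogisticRegression(penalty = 'none')
# fit expects a 1-D label array; ravel() avoids a DataConversionWarning
lr.fit(X_train_scaled, y_train.ravel())
lr.coef_, lr.intercept_
# Compare sklearn's predictions with the predictions of the manual model above (h_hat)
confusion_matrix(lr.predict(X_test_scaled), h_hat)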


array([[ 5,  0],
       [ 0, 10]], dtype=int64)

In [14]:
lr.score(X_test_scaled, y_test)

0.8666666666666667

### our model

In [15]:
from BinaryLogisticRegression import BinaryLogisticRegression

blr = BinaryLogisticRegression(X_train_scaled, y_train)
blr.fit(n_iterations = 100000)
blr.theta

array([[2.18518579],
       [4.78832816],
       [3.95238248]])

In [16]:
y_pred = blr.predict(X_test_scaled)
blr.score(y_test, y_pred, probabilistic=True)

0.8666666666666667

### TODO: Using softmax for multiclass regression.