# Derivations related to the Logistic Regression model

### What is Logistic Regression?

Logistic regression is a binary classification algorithm, that converts a feature vector $x$ into a probability distribution over $2$ classes.

The idea is that we give the model a $x \in \mathbb{R}^{n}$ and we want it to output the probability of obtaining a certain class. We usually denote the classes 0 and 1 respecitively, and the model outputs the probability of obtaining a 1 given the input features. We denote our hypothesis function as:

$$ h_{\theta}(x) = \frac{1}{1 + exp[\theta^Tx + b]} $$

This is also known as the sigmoid function and we can denote this as:
$$ h_{\theta}(x) = \sigma(\theta^Tx + b) $$

You may notice that we are still using a linear model where we multiply weights $\theta$ with our feature vector and add a bias $b$. The difference is that after performing this operation we place it into a sigmoid function which effectively squashes the line into a range of $(0, 1)$. This means that we have converted from a range of $(-\infty, \infty)$ to $(0, 1)$, which allows us to interpret the output in a different way. This concept is the main idea behind Generalized Linear Models which both Logistic and Linear regression are a part of. The function that converts the output into a specific probability distribution is the inverse of what is known as the 'canonical link function'. In the case of logistic regression the canonical link function is known as the logit function.

Like I have mentioned previously, we wish to make this an accurate model to represent the probability of a certain outcome happening given some inputs. In order to do so, we need to optimize our parameters to some data.

$$\text{Let } X = (x^{(1)}, x^{(2)}, \rightarrow x^{(m)})^{T} \in \mathbb{R}^{m \times n}, x^{(i)} \in \mathbb{R}^{n} $$
$$ y = (y^{(1)}, y^{(2)}, ..., y^{(m)}) \in \mathbb{R}^{m}, y^{(i)} \in \mathbb{R}$$

In other words: $X$ is a matrix where each column represents a feature, and each row represents a training example, and $y$ is a vector of each training example output

A common loss function to optimize over for Logistic Regression is the Binary Cross Entropy Loss.
If you want to know more about how this loss is derived and why we use it for Logistic Regression, check out the 
CrossEntropyLoss.ipynb notebook for more info. We will define our loss function as a function of our parameters
$J(\theta, b) = BCE$

$$ \implies J(\theta, b) = -\sum_{i = 1}^{m}y^{(i)} \cdot \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \cdot \log(1 - h_{\theta}(x^{(i)})) $$
$$ = -\sum_{i = 1}^{m}y^{(i)} \cdot \log(\sigma(\theta^Tx^{(i)} + b)) + (1 - y^{(i)}) \cdot \log(1 - \sigma(\theta^Tx^{(i)} + b)) 


Basically, we get a very large negative number when we predict very incorrectly and a very small negative number when
we predict close to correctly. This results in a very large loss if there were a lot of very large discrepencies between
our predictions and the actual output or a smaller loss when we predicted close to correct most of the time.

To optimize our parameters we need to find the parameters that minimize this function. We need to find:

$$ \underset{\theta, b}{\mathrm{argmin}} [J(\theta, b)]$$

To do so we may use an optimization algorithm such as gradient descent. This means we need to compute the gradient of the loss function w.r.t. the parameters of the model. The gradient of the loss is defined as:

$$  \nabla_{J}(\theta) = - \frac{\partial h}{\partial \theta}^T y_{h} $$
$$ \text{Where } \; \frac{\partial h}{\partial \theta} = (\frac{\partial h}{\partial \theta_{1}} \rightarrow \frac{\partial h}{\partial \theta_{n}}, \frac{\partial h}{\partial b}) \in \mathbb{R}^{m \times n + 1},$$
$$ \frac{\partial h}{\partial \theta_{j}} = (\frac{\partial h_{\theta}(x^{(1)})}{\partial \theta_{j}}, ..., \frac{\partial h_{\theta}(x^{(m)})}{\partial \theta_{j}}) \in \mathbb{R}^{m},$$
$$ \frac{\partial h}{\partial b} = (\frac{\partial h_{\theta}(x^{(1)})}{\partial b}, ..., \frac{\partial h_{\theta}(x^{(m)})}{\partial b}) \in \mathbb{R}^{m}$$
$$ \text{And } \; y_g = (y - h_{\theta}(x)) \otimes (h_{\theta}(x) - h_{\theta}(x) \odot h_{\theta}(x))  $$

Where $\otimes$ is the Hadamard division (element-wise division) and $\odot$ is the Hadamard product (element-wise product)
$\frac{\partial g}{\partial \theta}$ is known as the Jacobian of $h$
<br><br> *If you are interested in how this gradient was computed, check out the CrossEntropyLoss.ipynb notebook where I go over it in detail*

This means we only have to compute $\frac{\partial g}{\partial \theta}$ and let the loss function compute the rest and then matrix by vector multiply!

Let's start by computing $\frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_j}$ for some arbitrary $i \in \{1,...,m\}, j \in \{1,...,n\}$

$$ h_{\theta}(x^{(i)}) = \sigma(\theta^Tx^{(i)} + b) $$

The derivative of $\theta^Tx^{(i)}$ w.r.t $\theta_j$ is just $x^{(i)}_j$, and the derivative of $\sigma(a)$ is $\sigma(a)(1-\sigma(a))$. Thus by chain rule:

$$ \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_j} =  \sigma(\theta^Tx^{(i)} + b) \cdot (1 - \sigma(\theta^Tx^{(i)} + b)) \cdot x^{(i)}_j$$

Then we have the derivative w.r.t $b$ which by chain rule is the same, except that the derivative of $\theta^Tx^{(i)}$ w.r.t $b$ is just $1$. So we have:

$$ \frac{\partial h_{\theta}(x^{(i)})}{\partial b} =  \sigma(\theta^Tx^{(i)} + b) \cdot (1 - \sigma(\theta^Tx^{(i)} + b))$$

So finally, we have our Jacobian as:

$$\frac{\partial h}{\partial \theta} = 
\begin{bmatrix} 
    h_{\theta}(x^{(1)})\cdot (1 - h_{\theta}(x^{(1)})) \cdot x^{(1)}_1 & \dots & h_{\theta}(x^{(1)})\cdot (1 - h_{\theta}(x^{(1)})) \cdot x^{(1)}_n & h_{\theta}(x^{(1)})\cdot (1 - h_{\theta}(x^{(1)})) \\
    \vdots & \ddots  & \vdots & \vdots \\
    h_{\theta}(x^{(m)})\cdot (1 - h_{\theta}(x^{(m)})) \cdot x^{(m)}_1 & \dots & h_{\theta}(x^{(m)})\cdot (1 - h_{\theta}(x^{(m)})) \cdot x^{(m)}_n & h_{\theta}(x^{(m)})\cdot (1 - h_{\theta}(x^{(m)}))
\end{bmatrix}

Let's decompose this matrix further to make calculations easier.

$$ \text{Let } \; \partial\sigma_h = (h_{\theta}(x^{(1)})\cdot (1 - h_{\theta}(x^{(1)})), ..., h_{\theta}(x^{(m)})\cdot (1 - h_{\theta}(x^{(m)})))) \in \mathbb{R}^{m} $$
Then if we brodcast $\partial\sigma_h$ to a matrix,

$$ H = (\leftarrow \partial\sigma_h \rightarrow) \in \mathbb{R}^{m \times n + 1} $$

Then our Jacobian matrix is:
$$\frac{\partial h}{\partial \theta} = [X \quad b_m] \odot H, \quad b = (1, 1, ..., 1) \in \mathbb{R}^{m} $$

Thus:

$$  \nabla_{J}(\theta) = ([X \quad b_m] \odot H)^T (-y_g) $$

We can now convert our broadcasted matrix back into a vector, since we can see that when multiplying $-y_g$ by a matrix it will effectively be broadcasted before summed so we can use this to our advantage.

Our final gradient is:

$$  \nabla_{J}(\theta) = [X \quad b_m]^T (\partial\sigma_h \odot (-y_g)) $$

Altough this may look daunting, remember we only have to worry about the components that we calculated from our Jacobian. This means we do not have to worry about the $-y_g$ term as that will be provided by our loss function. And we can see that $X$ is provided to us and $b_m$ is just a vector of 1. So the only computation we really have to worry about is $\partial \sigma_h$

### Coding our model

In [15]:
import numpy as np
import sys
from pathlib import Path

# Add the directory containing your code file to sys.path
sys.path.append(str(Path().resolve().parent.parent))

from loss.CrossEntropyLoss import BinaryCrossEntropyLoss
from optimizers.SGDOptimizer import SGDOptimizer

# CREATE SOME DUMMY DATA
x = np.linspace(0, 10, 20)
y = np.array([1 if el + 2*np.random.rand() else 0 for el in x])

# reshape the data to place into the model
X = np.reshape(x, (x.shape[0], 1))

# initialize our loss (I will use Binary Cross Entropy Loss here) and our optimizer
loss = BinaryCrossEntropyLoss()
optimizer = SGDOptimizer(0.01)


Firstly we need to initialize the weights of the model. We need to specify the feature space. In this example I will use a dimension of 1, however this scales to $n$ dimensions

In [6]:
weights = np.random.rand(1) # shape is (1,)
bias = np.random.rand(1) # shape is (1,)
print(f"Weights: {weights}, Bias: {bias}")

Weights: [0.32456918], Bias: [0.36886216]


Let's define the forward pass of our model, i.e the hypothesis function, and run through it for all training examples to get a vector of predictions. We first must calculate the linear component $\theta^Tx$ and then calculate the sigmoid of that linear component.

In [7]:
def forward(X):
    linear = np.dot(X, weights) + bias # linear
    return 1 / (1 + np.exp(-linear)) # sigmoid

y_pred = forward(X)
y_pred

array([0.59118401, 0.63173985, 0.67051471, 0.70709979, 0.74119056,
       0.77258958, 0.80120098, 0.82701863, 0.85011066, 0.87060235,
       0.88865949, 0.90447329, 0.91824767, 0.93018916, 0.9404994 ,
       0.94936981, 0.95697831, 0.96348741, 0.96904355, 0.97377722])

Now let's specify our gradient function. This should take in the calculated gradients for the loss function and we will multiply with the derivatives of our model (Using the formula we derived above)

We start by running a forward pass to obtain our $h_{\theta}(x)$. Then using this, we know we need the vector $\partial \sigma_h$ which is just a vector of $h_{\theta}(x) \odot (1 - h_{\theta}(x)) = h_{\theta}(x) - h_{\theta}(x) \odot h_{\theta}(x)$ We then multiply this by the $-y_g$ which is supplied via our loss function. 
Finally we take the dot with our input matrix X, and concatenate on taking the dot with a vector of $1s$ which is equivalent to just summing up the entries of a vector.

In [11]:
def grads(X, loss_grad):
        y_pred = forward(X) # we need to run a forward pass again, since we use this in our Jacobian
        adjusted_loss_grads = loss_grad * (y_pred - y_pred**2) # this line is equivalent to our vector of partials of the sigmoids * (-yg)
        return np.concatenate((np.dot(X.T, adjusted_loss_grads), [np.sum(adjusted_loss_grads)]))

loss_grad = loss.grads(y_pred, y)
model_grads = grads(X, loss_grad)
model_grads

array([-8.87121986, -3.14202359])

Finally we need to apply them using the optimizer to get the updated parameters

In [16]:
params = np.concatenate((weights, bias))
optimizer.step(params, model_grads)
weights = np.array(params[:weights.size])
bias = np.array(params[weights.size:])
print(f"Weights: {weights}, Bias: {bias}")

Weights: [0.41328138], Bias: [0.40028239]


Then we would repeat this until convergence

Note: You may extend the concept of linear regression into multiple classes. This is known as Softmax Regression. The same concepts apply, however our hypothesis function now outputs a probability distribution (vector) over our $k$ classes. Each of the entries in the vector gets it's own linear model with it's own parameters to optimize. However after evaluating each of these linear models, the entries are placed into what is known as a softmax function. $$S(x^i) = \frac{\exp[x^i]}{\sum_{j = 1}^{k}\exp[x^j]} $$


For k = 2, this boils down to Logistic Regression, and this can be proved, thus it is an extension of Logistic Regression into multiple classes. Your loss function is also likely to change to Categorical Cross Entropy Loss which works on multi class problems.