# Lecture 7: Logistic Regression and Neural Networks

This notebook provides a comprehensive guide to implementing classification with the scikit-learn library.

By: Bryce Li, NUS

Logistic regression is a statistical method used for prediction when the dependent variable (the output) is categorical. In its binary form, it tells us whether a particular data point belongs to class 0 or class 1. To do so, it estimates the probability that the output is $y=1$ given an input vector $x$. Writing $y^{\prime}$ for the predicted value when the input is $x$, this can be defined mathematically as:
$$
y^{\prime}=p(y=1 \mid x)
$$
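
As a quick preview of how this probability is obtained in practice, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data. The data, the labelling rule, and the variable names are assumptions made up for illustration. Note that scikit-learn expects one row per example, i.e. an array of shape `(samples, features)`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data: 100 examples with 2 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # shape (samples, features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # class 1 when the features sum to > 0

clf = LogisticRegression()
clf.fit(X, y)

# Column 1 of predict_proba holds the estimated p(y = 1 | x) for each example.
proba = clf.predict_proba(X)[:, 1]
labels = clf.predict(X)                      # hard 0/1 class predictions

print(proba[:5])
print(labels[:5])
```

`predict_proba` returns one column per class, so the second column corresponds to $y^{\prime}=p(y=1 \mid x)$.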

## Mathematical Model
Input: $X$ is an input matrix of dimensions $n \times m$, where $n$ is the number of features in $X$ and $m$ is the number of training examples.

Parameters: $W$ is a Weight Matrix of dimensions $n \times 1$ where $n$ is the number of features in $X$. Bias $b$ helps in controlling the value at which the activation function will trigger.

Output:
$$
y^{\prime}=\sigma\left(W^T X+b\right)
$$

## Activation Function
Activation functions are what allow an artificial neural network to learn complicated relationships: they introduce nonlinearity into the network. Their purpose is to convert the input signal of a node into an output signal, which is then used as an input to the next layer of the network. The activation used above is the sigmoid function, which maps any real number to a value in the interval $(0, 1)$, so its output can be read as a probability. Mathematically, it is defined as:
$$
\sigma(x)=\frac{1}{1+e^{-x}}
$$
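
The model output $y^{\prime}=\sigma(W^T X+b)$ can be sketched in NumPy as follows. The toy shapes ($n=3$, $m=4$), the random input, and the zero-valued parameters are assumptions chosen only to make the array dimensions visible.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy shapes (assumed for illustration): n = 3 features, m = 4 examples.
n, m = 3, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))    # one column per training example
W = np.zeros((n, 1))           # one weight per feature
b = 0.0                        # scalar bias

# W.T has shape (1, n), so W.T @ X has shape (1, m):
# one predicted probability per training example.
y_pred = sigmoid(W.T @ X + b)

print(y_pred.shape)
print(y_pred)
```

With all weights and the bias set to zero, $z=0$ for every example, so every prediction sits at the midpoint of the sigmoid.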

## Loss Function
Loss is the error between the actual output and the predicted output. We want the loss to be as low as possible, because a low loss means the predicted value is close to the actual value. The appropriate loss function varies from case to case, so it is important to select one suited to the task so that the network trains properly. The loss function we use for logistic regression is defined as:
$$
L\left(y^{\prime}, y\right)=-\left[y \log y^{\prime}+(1-y) \log \left(1-y^{\prime}\right)\right]
$$

Let us study why this loss function is a good fit for logistic regression:

1. When $y=1$, the loss reduces to $L\left(y^{\prime}, y\right)=-\log y^{\prime}$. To make the loss small, $\log y^{\prime}$ must be large, which happens when $y^{\prime}$ is large, i.e. close to 1, so the predicted value matches the actual value.
2. When $y=0$, the loss reduces to $L\left(y^{\prime}, y\right)=-\log \left(1-y^{\prime}\right)$. To make the loss small, $\log \left(1-y^{\prime}\right)$ must be large, which happens when $y^{\prime}$ is small, i.e. close to 0, so again the predicted value matches the actual value.
3. For logistic regression, this loss is convex in $W$ and $b$, which means every local minimum is a global minimum, so training cannot get stuck in the kind of local minima that non-convex loss functions have.
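
The first two points can be checked numerically. The sketch below evaluates the loss for a few hand-picked predictions; the helper name `log_loss` and the sample values are illustrative assumptions.

```python
import math

def log_loss(y_pred, y):
    """Loss for one example: -[y*log(y') + (1 - y)*log(1 - y')]."""
    return -(y * math.log(y_pred) + (1 - y) * math.log(1 - y_pred))

# The loss shrinks as the prediction moves toward the true label.
for y_pred in (0.9, 0.5, 0.1):
    print(
        f"y' = {y_pred}: "
        f"loss when y=1 is {log_loss(y_pred, 1):.3f}, "
        f"loss when y=0 is {log_loss(y_pred, 0):.3f}"
    )
```

A confident correct prediction is penalised lightly, while a confident wrong prediction is penalised heavily.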

## Cost Function
The loss function is evaluated on a single training example, whereas the cost function is evaluated over the whole training dataset in one iteration. The cost function is the average of the loss values over the whole training dataset. Mathematically, it can be defined as:
$$
J(W, b)=\frac{1}{m} \sum_{i=1}^m L\left(y^{\prime(i)}, y^{(i)}\right)
$$

In the above equation, $m$ is the total number of training examples. The objective of training the network is to find Weight matrix $W$ and Bias $b$ such that the value of cost function $J$ is minimized.
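
A minimal NumPy sketch of the cost, computed as the average of the per-example losses as described above. The function name `cost` and the four sample labels and predictions are assumptions for illustration.

```python
import numpy as np

def cost(y_pred, y):
    """Average cross-entropy loss over all m training examples."""
    losses = -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    return losses.mean()

# Four illustrative examples: true labels and predicted probabilities.
y = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])

print(cost(y_pred, y))
```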

## Gradient Descent
The Weight Matrix $W$ is randomly initialized. We then use gradient descent, a first-order iterative optimization algorithm for finding a minimum of a function, to minimize the Cost Function $J$ and obtain the optimal Weight Matrix $W$ and Bias $b$. Mathematically, it can be defined as:
$$
\begin{aligned}
& \text { Repeat }\{ \\
& W=W-\alpha \frac{\partial}{\partial W} J(W, b) \\
& b=b-\alpha \frac{\partial}{\partial b} J(W, b) \\
& \}
\end{aligned}
$$

The first equation updates the Weight Matrix $W$, whereas the second updates the Bias $b$. The size of each change is determined by the learning rate $\alpha$ and the derivatives of the cost $J$ with respect to $W$ and $b$. We repeat the updates of $W$ and $b$ until the Cost Function $J$ has been minimized. Now let us understand how gradient descent works with the help of the following graph:

![image.png](attachment:image.png)
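
The update rule can also be tried on a one-dimensional function before applying it to logistic regression. The sketch below assumes a toy cost $J(w)=(w-3)^2$, a learning rate of 0.1, and 100 steps; none of these come from the lecture and they are chosen only to show the iterates approaching the minimum.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeat w = w - alpha * dJ/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)

print(w_min)
```

Each step moves $w$ against the sign of the derivative, so the iterates slide downhill toward $w=3$. A learning rate that is too large would overshoot the minimum instead.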

## Logistic Regression using Gradient Descent
So far, we have covered the mathematical model of both logistic regression and gradient descent. In this section, we will see how gradient descent is used to learn the Weight Matrix $W$ and Bias $b$ in the context of logistic regression. Let us summarize the equations we have so far.
$$
\begin{aligned}
& z=W^T X+b \\
& y^{\prime}=a=\sigma(z) \\
& \mathcal{L}(a, y)=-(y \log (a)+(1-y) \log (1-a))
\end{aligned}
$$
1. The first equation computes $z$, the product of the Weight Matrix $W$ with the input $X$, plus the Bias $b$.
2. The second equation applies the sigmoid activation function, which introduces non-linearity.
3. The third equation is the loss function, which measures the loss between the actual output $y$ and the predicted output $y^{\prime}$.
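
Putting the three equations together with the gradient descent update gives the training loop sketched below. It relies on the standard result that, for the averaged cost with a sigmoid output, $\partial J / \partial z = a - y$ for each example. The synthetic data, the hyperparameters, and the function name `train` are assumptions for illustration, and the weights start at zero rather than at random values, which is safe here because the cost is convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression.

    X has shape (n, m): n features, m examples. Y has shape (1, m).
    """
    n, m = X.shape
    W = np.zeros((n, 1))   # assumption: zero init instead of random init
    b = 0.0
    for _ in range(iterations):
        A = sigmoid(W.T @ X + b)   # forward pass, shape (1, m)
        dZ = A - Y                 # dJ/dz for each example, shape (1, m)
        dW = (X @ dZ.T) / m        # dJ/dW, shape (n, 1)
        db = dZ.sum() / m          # dJ/db, scalar
        W = W - alpha * dW
        b = b - alpha * db
    return W, b

# Synthetic, linearly separable data: class 1 when the two features sum to > 0.
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)

W, b = train(X, Y)

# Threshold the predicted probabilities at 0.5 to get class labels.
accuracy = np.mean((sigmoid(W.T @ X + b) > 0.5) == Y)
print(accuracy)
```

Because the labels depend on the sum of the two features, both learned weights should come out positive and of similar size.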