# Classification and Logistic Regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values $y$ we now want to predict take on only a small number of discrete values. For now, we will focus on the **binary classification** problem, in which $y$ can take on only two values, 0 and 1. (Most of what this notebook says here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y^{(i)}$ may be 1 if it is a piece of spam mail and 0 otherwise. 0 is called the **negative class** and 1 the **positive class**; they are sometimes also denoted by the symbols "$-$" and "$+$". Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the **label** for the training example.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd /content/drive/MyDrive/files

/content/drive/MyDrive/files


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12.0, 9.0)

In [13]:
df = pd.read_csv("iris.csv")
df = df.sample(frac=1).reset_index(drop=True)
df[:5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,6.6,2.9,4.6,1.3,Iris-versicolor
1,5.7,3.8,1.7,0.3,Iris-setosa
2,5.1,3.8,1.6,0.2,Iris-setosa
3,5.5,3.5,1.3,0.2,Iris-setosa
4,6.3,2.5,4.9,1.5,Iris-versicolor


In [None]:
df['class'].unique()

Since we are working with logistic regression, where the target variable $y \in \{0, 1\}$, it's better to convert the class variable to the values 0 and 1.
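A minimal sketch of such a conversion follows. It assumes that `Iris-setosa` is treated as the positive class (1) and every other species as the negative class (0). That choice, the constant `POSITIVE_CLASS`, and the new column name `y` are illustrative assumptions, not something fixed by this notebook. A tiny inline frame stands in for `iris.csv` so the snippet runs on its own.

```python
import pandas as pd

# Tiny stand-in for the shuffled iris frame (same "class" column as above).
df = pd.DataFrame({
    "sepal_length": [6.6, 5.7, 5.1, 6.3],
    "class": ["Iris-versicolor", "Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# Assumption: Iris-setosa is the positive class; every other species is negative.
POSITIVE_CLASS = "Iris-setosa"

# Comparing against the positive class gives booleans; cast them to 0/1 integers.
df["y"] = (df["class"] == POSITIVE_CLASS).astype(int)

labels = df["y"].tolist()
print(labels)
```

Any other one-vs-rest split would work the same way; only the value of `POSITIVE_CLASS` changes.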

# 2.1 Logistic regression

We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that $y \in \{0, 1\}$.
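To make the problem concrete, here is a small sketch that fits ordinary least squares to binary labels and then evaluates the fitted line away from the training range. The six data points and the two query points are made up for illustration.

```python
import numpy as np

# Made-up 1-D training set: small x -> class 0, large x -> class 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Design matrix with the intercept column x_0 = 1.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares fit of y on x.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the linear hypothesis at two points outside the training range.
x_new = np.array([-5.0, 20.0])
preds = np.column_stack([np.ones_like(x_new), x_new]) @ theta
print(preds)
```

The first prediction falls below 0 and the second above 1, so neither can be read as a probability or as a valid label.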

To fix this, let's change the form for our hypothesis $h_\theta(x)$. We will choose
$$h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}},$$
where
$$g(z) = \frac{1}{1+e^{-z}}$$
is called the **logistic function** or the **sigmoid function**. Here is a plot showing $g(z)$:
<figure>
  <img src="TraditionalML/images/sigmoid_activation_function.png" width="40%"/>
  <figcaption>Ref: https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf</figcaption>
</figure>
 <br>

Notice that $g(z)$ tends towards 1 as $z \to \infty$, and $g(z)$ tends towards $0$ as $z \to -\infty$. Moreover, $g(z)$, and hence also $h(x)$, is always bounded between $0$ and $1$. As before, we are keeping the convention of letting $x_0 = 1$, so that $ \theta^Tx = \theta_0 + \sum_{j=1}^d\theta_jx_j$.<br>
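The limits described above can be checked numerically. The sketch below is one possible NumPy implementation of $g$ and $h_\theta$; the function names `sigmoid` and `hypothesis` are our own, and the split into two branches is a common trick to avoid overflow in `np.exp` for large $|z|$. It assumes array inputs.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied elementwise to an array."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the algebraically equal form exp(z) / (1 + exp(z)).
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for each row of X; the first column of X is x_0 = 1."""
    return sigmoid(X @ theta)

values = sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

With $\theta = 0$ every input gets $h_\theta(x) = g(0) = 0.5$, which is a handy sanity check for `hypothesis`.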

For now, let's take the choice of $g$ as given. Other functions that smoothly increase from $0$ to $1$ can also be used, but for a couple of reasons that we'll see later (when we talk about GLMs, and when we talk about generative learning algorithms), the choice of the logistic function is a fairly natural one. Before moving on, here's a useful property of the derivative of the sigmoid function, which we write as $g'$:<br>

$$\begin{align*}g'(z) &= \frac{d}{dz}\frac{1}{1 + e^{-z}} \\
&= \frac{1}{(1 + e^{-z})^2}\left(e^{-z}\right) \\
&= \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right) \\
&= g(z)(1 - g(z)).\end{align*}$$
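As a quick sanity check, the closed form $g'(z) = g(z)(1 - g(z))$ can be compared against a central finite difference. This is only a numerical sketch; the grid of $z$ values and the step size `eps` are arbitrary choices.

```python
import numpy as np

def g(z):
    """Sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Closed-form derivative g'(z) = g(z) * (1 - g(z))."""
    return g(z) * (1.0 - g(z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central difference approximation of the derivative at each z.
numeric = (g(z + eps) - g(z - eps)) / (2.0 * eps)

max_err = np.max(np.abs(numeric - g_prime(z)))
print(max_err)
```

The largest discrepancy should be tiny compared with the derivative itself, whose maximum is $g'(0) = 0.25$.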

 So, given the logistic regression model, how do we fit $\theta$ for it? Following how we saw that least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood. <br>

 Let us assume that
 $$\begin{align*}
 P(Y = 1 | x; \theta) &= h_\theta(x) \\
 P(Y = 0 | x; \theta) &= 1 - h_\theta(x)
 \end{align*}$$

 <br>Note that this can be written more compactly as<br>
 $$\begin{align*}
 p(y | x;\theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}
 \end{align*}$$
 <br>
 Assuming that the $n$ training examples were generated independently, we can then write down the likelihood of the parameters as
 $$\begin{align*}
 L(\theta) &= p(\vec{y} | X;\theta) \\
 &= \prod_{i=1}^n p(y^{(i)} | x^{(i)}; \theta) \\
 &= \prod_{i=1}^n (h_\theta(x^{(i)}))^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}
\end{align*}$$

As before, it will be easier to maximize the log likelihood:
$$\begin{align*}
l(\theta) &= \log L(\theta) \\
&= \sum_{i=1}^n y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))
\end{align*}$$
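The log likelihood translates almost directly into NumPy. The sketch below is one way to compute it. The function name `log_likelihood` and the toy data are our own, and `np.logaddexp` is used in place of a literal `np.log(sigmoid(z))` to avoid taking the log of 0 when $|\theta^Tx|$ is large.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Sum over i of y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i))."""
    z = X @ theta
    # log g(z)       = -log(1 + e^{-z}) = -logaddexp(0, -z)
    # log(1 - g(z))  = -log(1 + e^{z})  = -logaddexp(0,  z)
    return np.sum(-y * np.logaddexp(0.0, -z) - (1.0 - y) * np.logaddexp(0.0, z))

# Toy data: four examples, intercept column plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

ll_zero = log_likelihood(np.zeros(2), X, y)            # every h = 0.5
ll_good = log_likelihood(np.array([0.0, 2.0]), X, y)   # theta that separates the data
print(ll_zero, ll_good)
```

At $\theta = 0$ each example contributes $\log 0.5$, so the total is $n \log 0.5$; a $\theta$ that fits the labels better gives a larger (less negative) value.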

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $\theta := \theta + \alpha\nabla_\theta l(\theta)$. (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example $(x,y)$, and take derivatives to derive the stochastic gradient ascent rule:

$$
\begin{align*}
\frac{\partial}{\partial \theta_j} l(\theta) &= \left( y \frac{1}{g(\theta^Tx)} - (1-y)\frac{1}{1-g(\theta^Tx)} \right)\frac{\partial}{\partial \theta_j}g(\theta^Tx)\\
&= \left( y \frac{1}{g(\theta^Tx)} - (1-y)\frac{1}{1-g(\theta^Tx)} \right) g(\theta^Tx)(1 - g(\theta^Tx))\frac{\partial}{\partial \theta_j}\theta^Tx \\
&= \left(y(1 - g(\theta^Tx)) - (1-y)g(\theta^Tx)\right)x_j \\
&= (y - h_\theta(x))x_j
\end{align*}$$
<br>
Above, we used the fact that $g'(z) = g(z)(1-g(z))$. This therefore gives us the stochastic gradient ascent rule:
$$ \theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}$$
<br>
If we compare this to the LMS update rule, we see that it looks identical; but this is *not* the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^Tx^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem.
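The stochastic gradient ascent rule can be sketched in a few lines of NumPy. This is an illustrative implementation under our own assumptions, not a tuned one: the function name `sgd_logistic`, the learning rate `alpha=0.1`, the number of epochs, the random seeds, and the synthetic one-feature dataset are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=50, seed=0):
    """Stochastic gradient ascent on the logistic log likelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        # Visit the training examples one at a time, in random order.
        for i in rng.permutation(n):
            error = y[i] - sigmoid(X[i] @ theta)   # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]          # update every theta_j at once
    return theta

# Synthetic data: class 0 centred at -2, class 1 centred at +2.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
X = np.column_stack([np.ones_like(x), x])          # intercept column x_0 = 1
y = np.concatenate([np.zeros(50), np.ones(50)])

theta = sgd_logistic(X, y)

# Predict class 1 whenever h_theta(x) >= 0.5 and measure training accuracy.
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
print(theta, accuracy)
```

The inner update is written exactly as in the LMS rule; the only difference is that the prediction passes through `sigmoid` before the error is computed.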
