# Lecture #7: Classification with Logistic Regression

## A: Motivation

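* If $y$ can take only one of two values (e.g. 0 or 1), we have **binary classification**
* Linear regression is a poor fit for classification problems: its output is not confined to $[0, 1]$, and a single outlying example can shift the fitted line and with it the decision threshold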

## B: The Model

$$
f_{\vec w, b}(\vec x) = g(\vec w \cdot \vec x + b) = \frac{1}{1+e^{-(\vec w \cdot \vec x + b)}}
$$

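* Note: $0 < g(z) < 1$ for every $z$, which is what lets us read the output as a probability
* The interpretation of the model: $f_{\vec w, b}(\vec x)$ is the probability that the example $\vec x$ belongs to the positive class, i.e. $P(y = 1 \mid \vec x; \vec w, b)$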
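
To make the model formula concrete, here is a minimal NumPy sketch. The function names `sigmoid` and `predict_proba`, and the parameter values, are assumptions chosen for illustration; they do not come from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Model output f_{w,b}(x) = g(w . x + b), one value per row of X."""
    return sigmoid(X @ w + b)

# Hypothetical parameters and inputs, for illustration only.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[0.0, 0.0],
              [3.0, 1.0]])

probs = predict_proba(X, w, b)  # each entry lies strictly between 0 and 1
print(probs)
```

Each entry of `probs` is read as the estimated probability that the corresponding row of `X` belongs to the positive class.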

## C: Decision Boundaries

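* We predict $y = 1$ when $f_{\vec w, b}(\vec x) \ge 0.5$, which happens exactly when $z \ge 0$; the decision boundary is the set of points where $z = 0$
* You can have linear decision boundaries (when $z = \vec w \cdot \vec x + b$ is linear in the features)
* You can also have non-linear decision boundaries (when $z$ includes polynomial terms)
  * ex: $f_{\vec w, b}(\vec x) = g(z) = g(w_1x_1^2 + w_2x_2^2 + b)$, predicting $y = 1$ when $f_{\vec w, b}(\vec x) \ge 0.5$ (a circular decision boundary when $w_1 = w_2 > 0$ and $b < 0$; an ellipse when $w_1 \ne w_2$)
* More complex decision boundaries: $f_{\vec w, b}(\vec x) = g(z) = g(w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_1x_2 + w_5x_2^2 + b)$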
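
As a rough illustration of a non-linear boundary, the sketch below hard-codes $w_1 = w_2 = 1$ and $b = -1$, so the boundary $x_1^2 + x_2^2 = 1$ is the unit circle. These values and the helper name `predict_circle` are assumptions picked for the example, not parameters from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_circle(X, w1=1.0, w2=1.0, b=-1.0):
    """Classify points using z = w1*x1^2 + w2*x2^2 + b (assumed parameters)."""
    z = w1 * X[:, 0] ** 2 + w2 * X[:, 1] ** 2 + b
    # g(z) >= 0.5 exactly when z >= 0, i.e. on or outside the circle
    return (sigmoid(z) >= 0.5).astype(int)

X = np.array([[0.0, 0.0],   # inside the circle
              [0.5, 0.5],   # inside the circle
              [1.0, 1.0],   # outside the circle
              [2.0, 0.0]])  # outside the circle

labels = predict_circle(X)
print(labels)  # [0 0 1 1]
```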

## D: Cost Function for Logistic Regression

* We cannot use the squared error loss: combined with the sigmoid it makes $J(\vec w, b)$ non-convex, so gradient descent could stop at a local minimum rather than the global minimum (which is what we want)
* As such, we keep the cost as the average of a per-example loss $L$, but choose a different $L$:

$$
J(\vec w, b) = \frac{1}{m} \sum_{i=1}^m L(f_{\vec w, b}(\vec x^{(i)}), y^{(i)})
$$

With loss function:

$$
L(f_{\vec w, b}(\vec x^{(i)}), y^{(i)}) = -\log(f_{\vec w, b}(\vec x^{(i)}))
$$

if the true label is $y^{(i)} = 1$

* As $f_{\vec w, b} \to 1$, then the loss $\to 0$
* As $f_{\vec w, b} \to 0$, then the loss $\to \infty$

OR

$$
L(f_{\vec w, b}(\vec x^{(i)}), y^{(i)}) = -\log(1 - f_{\vec w, b}(\vec x^{(i)}))
$$

if the true label is $y^{(i)} = 0$.

* As $f_{\vec w, b} \to 0$, then the loss $\to 0$
* As $f_{\vec w, b} \to 1$, then the loss $\to \infty$

Because $y^{(i)}$ is always either 0 or 1, the two cases can be combined into a single simplified loss function (one of the two terms always vanishes):

$$
L(f_{\vec w, b}(\vec x^{(i)}), y^{(i)}) = -y^{(i)} \log(f_{\vec w, b}(\vec x^{(i)})) - (1 - y^{(i)}) \log(1 - f_{\vec w, b}(\vec x^{(i)}))
$$

So the cost function becomes:

$$
J(\vec w, b) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)} \log(f_{\vec w, b}(\vec x^{(i)})) + (1 - y^{(i)}) \log(1 - f_{\vec w, b}(\vec x^{(i)}))]
$$
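
The cost formula above can be sketched in a few lines of NumPy. The name `compute_cost` and the toy data are assumptions for illustration; the sketch also takes no precautions against `log(0)`, which a more careful implementation would need.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average cross-entropy loss J(w, b) over the m rows of X."""
    f = sigmoid(X @ w + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return loss.mean()

# Hypothetical one-feature data set, for illustration only.
X = np.array([[0.5], [1.5], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# With w = 0 and b = 0 every prediction is 0.5, so the cost is log(2).
cost_zero = compute_cost(X, y, np.array([0.0]), 0.0)

# Parameters that put every example on the correct side give a lower cost.
cost_better = compute_cost(X, y, np.array([2.0]), -4.5)
print(cost_zero, cost_better)
```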

## E: Gradient Descent for Logistic Regression

Repeat until convergence, updating every $w_j$ and $b$ simultaneously:

$$
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec w, b)
$$

$$
b = b - \alpha \frac{\partial}{\partial b}J(\vec w, b)
$$


The above simplifies to the following:

$$
w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^m [(f_{\vec w, b}(\vec x^{(i)})-y^{(i)})x_j^{(i)}]
$$

$$
b = b - \alpha \frac{1}{m} \sum_{i=1}^m [(f_{\vec w, b}(\vec x^{(i)})-y^{(i)})]
$$

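* These updates look the same as the ones for linear regression, but be careful: $f$ is different. Here $f_{\vec w, b}(\vec x) = g(\vec w \cdot \vec x + b)$ is the sigmoid of the linear term, not the linear term $\vec w \cdot \vec x + b$ itself.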
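
The update rules can be put together as a short vectorized sketch. The function names, the toy data, the learning rate `alpha = 0.1`, and the iteration count are all assumptions chosen for illustration rather than values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient(X, y, w, b):
    """Partial derivatives of J with respect to w (vector) and b (scalar)."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y      # f(x^(i)) - y^(i) for every example
    dj_dw = X.T @ err / m
    dj_db = err.mean()
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Run num_iters simultaneous updates of w and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        # Both gradients are computed before either parameter changes.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Hypothetical one-feature data set, for illustration only.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = gradient_descent(X, y, np.zeros(1), 0.0, alpha=0.1, num_iters=10000)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(w, b, preds)
```

The only line that differs from a linear regression implementation is the call to `sigmoid` inside `compute_gradient`.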