# Lecture05 Logistic Regression

## Classification and Representation

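One way to attempt classification is to fit linear regression and threshold its output: map predictions of at least 0.5 to 1 and predictions below 0.5 to 0. This does not work well, because the relationship between the features and a class label is not linear: the regression output can fall outside $[0, 1]$, and a single outlying example can shift the fitted line and with it the threshold.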

**Logistic Regression Model**: $h_{\theta}(x) = g(\theta^Tx)$, where $z = \theta^Tx$ and $g(z) = \frac{1}{1+e^{-z}}$ is called the logistic function or sigmoid function.
* interpretation of hypothesis output: $h_{\theta}(x)$ = estimated probability that y=1 on input x.
* decision boundary: we predict $y = 1$ when $h_{\theta}(x) \geq 0.5$, i.e. when $g(z) \geq 0.5$, which holds exactly when $\theta^Tx \geq 0$. Therefore, the decision boundary is the set of points where $\theta^Tx = 0$; it separates the region where we predict y=0 from the region where we predict y=1.
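
The hypothesis and the decision rule can be sketched in a few lines of Octave. The parameter vector `theta` and the example matrix `X` below are made-up values chosen only for illustration; they are not from the lecture.

``` Octave
% Sigmoid (logistic) function, applied element-wise.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters: the decision boundary is the line x1 + x2 = 3.
theta = [-3; 1; 1];

% Each row is one example [x0, x1, x2] with x0 = 1 (made-up values).
X = [1 1 1;
     1 2 2;
     1 3 0];

h = sigmoid(X * theta);       % estimated probability that y = 1
predictions = (h >= 0.5);     % same result as testing X * theta >= 0
```

The third example lies exactly on the boundary ($\theta^Tx = 0$, so $h_{\theta}(x) = 0.5$) and is therefore assigned to class 1 under the $\geq$ convention.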


## Logistic Regression Model

Training Set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, with $m$ examples. Each $x$ is an $(n+1) \times 1$ vector with $x_0 = 1$, and $y \in \{0, 1\}$.

Since the squared-error cost $Cost(h_{\theta}(x), y) = \frac{1}{2}(h_{\theta}(x) - y)^2$ makes $J(\theta)$ non-convex when $h_{\theta}$ is the sigmoid hypothesis, the logistic regression model instead uses $Cost(h_{\theta}(x), y) = \left\{\begin{matrix} -\log(h_{\theta}(x)) & \text{if} & y = 1 \\ -\log(1 - h_{\theta}(x)) & \text{if} & y = 0 \end{matrix}\right.$ as the **cost function**.
* If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
* If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.

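**Simplified cost function**: $Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y) \log(1 - h_{\theta}(x))$. Because $y \in \{0, 1\}$, exactly one of the two terms is nonzero, so this single expression reproduces both cases above.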

The overall cost function is $J(\theta) = \frac{1}{m}\sum^m_{i=1}Cost(h_{\theta}(x^{(i)}), y^{(i)})$:
* to fit the parameters $\theta$: $\min_{\theta} J(\theta)$.
* to make a prediction for a new $x$, output $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$.
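
As a rough illustration, $J(\theta)$ can be computed in vectorized form. The data `X`, `y` and the two parameter vectors evaluated below are assumptions made up for this sketch.

``` Octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up training set: intercept column x0 = 1 plus one feature.
X = [1 0.5;
     1 1.5;
     1 2.5;
     1 3.5];
y = [0; 0; 1; 1];
m = length(y);

% J(theta) as the mean of the simplified per-example cost.
J = @(theta) (1 / m) * sum(-y .* log(sigmoid(X * theta)) ...
                           - (1 - y) .* log(1 - sigmoid(X * theta)));

J0 = J([0; 0]);     % every h is 0.5 here, so the cost equals log(2)
J1 = J([-4; 2]);    % a parameter vector that fits this data better
```

With $\theta = 0$ every prediction is 0.5, so the cost is $-\log(0.5) = \log 2$; parameters that place the boundary between the two groups give a lower value.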


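**Gradient descent**: the update rule looks identical to the one for linear regression, $\theta_j := \theta_j - \frac{\alpha}{m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$ (with all $\theta_j$ updated simultaneously). The difference is that $h_{\theta}(x)$ is now $g(\theta^Tx)$ rather than $\theta^Tx$.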
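
A minimal sketch of batch gradient descent for logistic regression, assuming the standard vectorized update $\theta := \theta - \frac{\alpha}{m}X^T(h - y)$. The data set, the learning rate `alpha`, and the iteration count `numIters` are arbitrary choices for illustration.

``` Octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up, non-separable data: intercept column plus one feature.
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];
m = length(y);

cost = @(t) (1 / m) * sum(-y .* log(sigmoid(X * t)) ...
                          - (1 - y) .* log(1 - sigmoid(X * t)));

alpha = 0.1;            % learning rate (assumed value)
numIters = 2000;        % number of iterations (assumed value)
theta = zeros(2, 1);
initialCost = cost(theta);

for iter = 1:numIters
  h = sigmoid(X * theta);               % predictions with the current theta
  grad = (1 / m) * X' * (h - y);        % gradient of J(theta)
  theta = theta - alpha * grad;         % simultaneous update of all theta_j
end

finalCost = cost(theta);
```

The loop body has the same shape as in linear regression; only the line computing `h` differs.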

**Optimization algorithms**: Conjugate gradient, BFGS, and L-BFGS. Their advantages are that there is no need to manually pick $\alpha$ and that they are often faster than gradient descent, but they are more complex.

To use a library implementation of these algorithms, we write a single function that returns both the cost $J(\theta)$ and its gradient:

``` Octave
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```
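
To show what a filled-in template might look like, here is a sketch for a hypothetical toy objective $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$. This objective is not logistic regression; it is an assumption chosen because its gradient is easy to verify by hand.

``` Octave
1;  % a leading statement makes Octave treat this file as a script

function [jVal, gradient] = costFunction(theta)
  % Toy cost (hypothetical): minimized at theta = [5; 5].
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % Partial derivatives of the toy cost with respect to theta(1), theta(2).
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

% Evaluate cost and gradient at theta = [0; 0].
[jVal, gradient] = costFunction([0; 0]);
```

At $\theta = (0, 0)$ the cost is $25 + 25 = 50$ and each partial derivative is $2(0 - 5) = -10$.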

Then we can use Octave's "fminunc()" optimization routine together with the "optimset()" function, which creates an options structure containing the settings we want to pass to "fminunc()".

``` Octave
% 'GradObj', 'on': costFunction also returns the gradient; 'MaxIter': at most 100 iterations.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
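
Putting the pieces together for logistic regression might look like the sketch below. The function name `logisticCost`, the extra `X` and `y` arguments, and the toy data are assumptions for illustration; the anonymous wrapper is one common way to give "fminunc()" a function of `theta` alone.

``` Octave
% Made-up, non-separable data: intercept column x0 = 1 plus one feature.
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];

% Cost and gradient of logistic regression, returned together.
function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                           % sigmoid hypothesis
  jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));
  gradient = (1 / m) * X' * (h - y);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);

% The wrapper fixes X and y, so fminunc sees a function of theta only.
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) logisticCost(t, X, y), initialTheta, options);
```

The data is deliberately not linearly separable, so that the minimum of $J(\theta)$ is attained at a finite $\theta$.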

## Multiclass Classification: one-vs-all

Train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.

To make a prediction on a new $x$, pick the class $i$ that maximizes $h_{\theta}^{(i)}(x)$.
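
The prediction step can be sketched as follows, assuming the per-class parameters have already been trained and are stacked as the rows of a matrix. The name `allTheta` and all numeric values are hypothetical.

``` Octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical trained parameters: row i holds theta for the class-i classifier.
allTheta = [ 1 -2 -2;
            -1  2 -2;
            -1 -2  2];

% New examples, one per row: [x0, x1, x2] with x0 = 1 (made-up values).
X = [1 0 0;
     1 3 0;
     1 0 3];

% probs(k, i) is the class-i classifier's estimate that example k has y = i.
probs = sigmoid(X * allTheta');

% For each example, pick the class whose classifier is most confident.
[~, predictedClass] = max(probs, [], 2);
```

Training itself would fit one binary classifier per class, each time relabeling the targets as `y == i`.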

