## Use Logistic Regression
### Hypothesis Representation
$h_{\theta}(x) = P(y=1|x;\theta)$: the probability that the output is 1, given input x, parameterized by $\theta$
- If $h_\theta(x) \geq 0.5$, predict $y = 1$
- If $h_\theta(x) < 0.5$, predict $y = 0$

### Logistic Function/Sigmoid Function
$h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}}$
- Derived from the Sigmoid Function
    - $h_{\theta}(x) = g(z)$
    - Sigmoid Function: $g(z) = \frac{1}{1+e^{-z}}$
    - $z = \theta^T x$
![W3-Sigmoid](Plots/W3-Sigmoid.png)
- The sigmoid maps any real-valued $z$ into the interval $(0, 1)$, turning an arbitrary-valued function into one better suited for classification
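
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `hypothesis` are chosen here for illustration, and `x` is assumed to already include the bias term $x_0 = 1$.

```python
import math
from typing import Sequence

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never receives a large
    # positive argument (which would raise OverflowError).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```

For example, `sigmoid(0)` is exactly `0.5`, which is the threshold used for prediction.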

### Benefit
- $h_{\theta}(x)$ always falls in the range $(0,1)$, so it can be read as a probability

### Why Not Use Linear Regression
- $h_{\theta}(x)$ from Linear Regression can fall outside the target range (i.e., $[0,1]$)
    - Even if every $y$ in the training set falls within the range, $h_{\theta}(x)$ on the test set may fall outside it
- The relationship between the features and the class may not be linear
    - A straight line may not classify the data effectively
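
To make the range problem concrete, the sketch below fits an ordinary least-squares line to binary labels. The toy data and the helper names `fit_line` and `linear_h` are assumptions made up for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical toy data: one feature, binary labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
a, b = fit_line(xs, ys)

def linear_h(x):
    """Linear Regression hypothesis fitted on the toy data above."""
    return a + b * x

print(linear_h(0), linear_h(10))
```

On this toy data the fitted line returns a value below 0 at `x = 0` and a value above 1 at `x = 10`, neither of which can be interpreted as a probability.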

### Decision Boundary
$h_{\theta}(x) = g(z) \geq 0.5$ when $z = \theta^T x \geq 0$
- Solving the equation $\theta^T x = 0$ gives the decision boundary
![W3-Decision-Boundary](Plots/W3-Decision-Boundary.png)

The **Decision Boundary** is the line that **separates the region** where $y = 0$ from the region where $y = 1$
- The decision boundary is a property of the parameters $\theta$, not of the data
    - We can plot the decision boundary without the dataset
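
As a small sketch of this idea, the code below classifies points and traces the boundary using only $\theta$, with no dataset involved. The parameter values `[-3, 1, 1]` and the names `predict` and `boundary_x2` are assumptions chosen for illustration; with these values the boundary is the line $x_1 + x_2 = 3$.

```python
def predict(theta, x):
    """Predict y = 1 when theta^T x >= 0, else y = 0 (x includes x_0 = 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def boundary_x2(theta, x1):
    """Solve theta_0 + theta_1*x1 + theta_2*x2 = 0 for x2 (needs theta_2 != 0)."""
    t0, t1, t2 = theta
    return -(t0 + t1 * x1) / t2

# Hypothetical parameters: boundary is x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]
print(predict(theta, [1.0, 2.0, 2.0]))  # point (2, 2), where x1 + x2 > 3
print(predict(theta, [1.0, 1.0, 1.0]))  # point (1, 1), where x1 + x2 < 3
print(boundary_x2(theta, 1.0))          # x2 on the boundary when x1 = 1
```

Evaluating `boundary_x2` over a range of `x1` values yields the points needed to plot the line.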

### Cost Function
$J(\theta) = \frac{1}{m} \sum_{i = 1}^{m}Cost(h_{\theta}(x^{(i)}),y^{(i)})$
- If $y = 1$, $Cost(h_{\theta}(x),y) = -\log(h_{\theta}(x))$
    - The cost becomes very large as $h_{\theta}(x)$ approaches 0
- If $y = 0$, $Cost(h_{\theta}(x),y) = -\log(1-h_{\theta}(x))$
    - The cost becomes very large as $h_{\theta}(x)$ approaches 1
![W3-LGR-COST](Plots/W3-LGR-COST.png)
- This choice of cost guarantees that $J(\theta)$ is convex
    - If we used the same squared-error cost function as Linear Regression, $J(\theta)$ would become non-convex, making it difficult for Gradient Descent to converge to the global minimum

#### Simplified (Compressed) Cost Function
- $Cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))$
- $J(\theta)=-\frac{1}{m}\sum^m_{i=1}[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))]$
- Take the derivative of $J(\theta)$: $\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}]$
    - [Good Reference](https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression)
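
The cost and its partial derivatives can be computed together, as in the sketch below. It is a plain-Python illustration under a few assumptions: the name `cost_and_gradient` is invented here, each row of `X` includes $x_0 = 1$, and $h$ is clipped by a small `eps` inside the logarithm to avoid `log(0)` (a numerical safeguard that is not part of the formula).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_and_gradient(theta, X, y, eps=1e-12):
    """Return (J(theta), [dJ/dtheta_j for each j]) for logistic regression."""
    m, n = len(y), len(theta)
    cost = 0.0
    grad = [0.0] * n
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        # Clip only inside the log; the gradient uses the unclipped h.
        hc = min(max(h, eps), 1.0 - eps)
        cost += -yi * math.log(hc) - (1 - yi) * math.log(1.0 - hc)
        for j in range(n):
            grad[j] += (h - yi) * xi[j]
    return cost / m, [g / m for g in grad]

# Hypothetical toy data: two examples, bias term plus one feature.
J, grad = cost_and_gradient([0.0, 0.0], [[1.0, 1.0], [1.0, 2.0]], [0, 1])
print(J, grad)
```

With $\theta = 0$ every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels, which is a handy sanity check.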

#### Gradient Descent
$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}]$ (update all $\theta_j$ simultaneously)
- Vectorized Implementation: $\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}]$, where $x^{(i)}$ is the full feature vector of example $i$

Feature Scaling can also speed up Gradient Descent for Logistic Regression
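
The update rule can be sketched as a short batch Gradient Descent loop. The function name `gradient_descent`, the default values for `alpha` and `iterations`, and the toy data are assumptions for illustration only; each row of `X` is assumed to include $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression, starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Prediction errors h_theta(x^(i)) - y^(i), computed with the current theta.
        errors = [
            sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for xi, yi in zip(X, y)
        ]
        # Simultaneous update: every theta_j uses the same errors list.
        theta = [
            theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(n)
        ]
    return theta

# Hypothetical toy data: class 0 for negative x1, class 1 for positive x1.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
print(theta)
```

Building the new `theta` list from the old one in a single expression is what makes the update simultaneous; updating `theta[j]` in place inside a loop over `j` would mix old and new values.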

#### Other Optimization Algorithms
- Other than the Gradient Descent:
    - Conjugate Gradient
    - BFGS
    - L-BFGS
- Advantages:
    - No need to manually pick the learning rate $\alpha$
    - Often much faster than Gradient Descent
- Disadvantages:
    - More complex: they require expertise to implement
    
### Multi-class Classification
Example: Email Foldering, Tagging

#### One-vs-All Method
- Training: Train a Logistic Regression classifier $h^{(i)}_{\theta}(x)$ for each class $i$ to predict the probability that $y=i$ (i.e., that $x$ belongs to class $i$)
- Prediction: for a new input $x$, pick the class whose classifier gives the highest output: $\hat{y} = \arg\max_{i} h^{(i)}_{\theta}(x)$
![W3-MULTI-CLASS](Plots/W3-MULTI-CLASS.png)
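
The prediction step can be sketched as follows, assuming the per-class parameter vectors have already been trained. The name `predict_one_vs_all` and the three parameter vectors are hypothetical values made up for illustration, not trained results.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict_one_vs_all(thetas, x):
    """Return the index i of the classifier with the largest h^(i)_theta(x)."""
    scores = [hypothesis(theta, x) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical parameters for three classes (one vector per classifier).
thetas = [
    [2.0, -1.0],   # class 0: confident for small x1
    [0.0, 0.0],    # class 1: always outputs 0.5
    [-2.0, 1.0],   # class 2: confident for large x1
]
print(predict_one_vs_all(thetas, [1.0, 0.0]))
print(predict_one_vs_all(thetas, [1.0, 4.0]))
```

When two classifiers tie, `max` returns the first index with the highest score, so a real implementation may want an explicit tie-breaking rule.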