-----------
# Outline of Notebook
- ### Classification - Logistic Regression
- ### Gradient Descent for Logistic Regression
- ### Overfitting & Underfitting + Regularization
-----------

# Classification - Logistic Regression

<u>Sigmoid Function</u>: $g(z) = \frac{1}{1 + e^{-z}}$

![](2022-07-20-12-45-15.png)

You can verify the shape of $g(z)$:
- $\lim\limits_{z \to \infty} g(z) = 1$
- $\lim\limits_{z \to -\infty} g(z) = 0$
- $g(0) = \frac{1}{2}$
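
The three properties above can be checked numerically. This is a minimal sketch using NumPy; the function name `sigmoid` is just an illustrative choice, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # exactly one half
print(sigmoid(10.0))   # approaches 1 as z grows
print(sigmoid(-10.0))  # approaches 0 as z becomes very negative
```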

Remember the Linear Regression model: $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$

Now, let's set $z$ equal to the output of that linear model: $z = \vec{w} \cdot \vec{x} + b$

And let's plug $z$ into the Sigmoid Function: $g(z) = \frac{1}{1 + e^{-z}}$

<u>Model for Logistic Regression:</u> $f_{\vec{w}, b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$
- Remember that $f$ always returns a number between 0 and 1, because the sigmoid function is applied to the linear output

<u>Interpreting Output of Model:</u> "probability" that class is 1
- Ex. $x$ is "tumor size", $y$ is 0 (benign) or 1 (malignant)
- If $f$ outputs 0.1, then there is a 10% chance that the current tumor is malignant
- If $f$ outputs 0.5, then there is a 50% chance that the current tumor is malignant
- If $f$ outputs 0.8, then there is an 80% chance that the current tumor is malignant
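
As a rough illustration of the model and its probability interpretation, the sketch below evaluates $f$ on a few tumor sizes. The feature values and the parameters `w` and `b` are made up for the example, and `predict_proba` is a hypothetical helper name.

```python
import numpy as np

def predict_proba(X, w, b):
    """f_{w,b}(x) = g(w . x + b), computed for each row x of X."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: one feature (tumor size) and made-up parameters
X = np.array([[1.0], [3.0], [5.0]])
w = np.array([1.0])
b = -3.0

probs = predict_proba(X, w, b)
print(probs)  # estimated probability that each tumor is malignant (y = 1)
```

With these made-up parameters, the tumor of size 3 sits exactly at $z = 0$, so its output is 0.5.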

<u>Decision Boundary:</u> the dividing line between one class and another
- Using a threshold of 0.5, the model predicts 1 whenever $g(z) \geq 0.5$, which happens exactly when $z \geq 0$, as seen in the plot above
- Therefore, the model predicts 1 whenever $\vec{w} \cdot \vec{x} + b \geq 0$
- As a result, the decision boundary is the set of points where $\vec{w} \cdot \vec{x} + b = 0$

![](2022-07-20-14-11-44.png)
- The decision boundary for $f$ (with 2 features) when the parameters $w_1 = 1, w_2 = 1, b = -3$
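
With $w_1 = 1, w_2 = 1, b = -3$, the boundary is the line $x_1 + x_2 = 3$. A small sketch of the resulting classifier follows; the test points are hypothetical and `predict` is an illustrative name.

```python
import numpy as np

def predict(X, w, b):
    """Predict 1 when w . x + b >= 0 (equivalently g(z) >= 0.5), else 0."""
    z = X @ w + b
    return (z >= 0).astype(int)

w = np.array([1.0, 1.0])
b = -3.0

# Hypothetical points: below the line, on the line, above the line
X = np.array([[1.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
labels = predict(X, w, b)
print(labels)
```

Points on the boundary itself are assigned class 1 here because the rule uses $\geq$.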

![](2022-07-20-14-15-06.png)
- A non-linear decision boundary for another $f$ with two features but a different $z$ (for example, $z = w_1x_1^2 + w_2x_2^2 + b$, which gives the circle $x_1^2 + x_2^2 = 1$) where $w_1 = 1, w_2 = 1, b = -1$



<u>Loss Function for Logistic Regression:</u> $L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)}) = \left\{\begin{array}{lr} -\log{(f_{\vec{w}, b}(\vec{x}^{(i)}))}, & \text{if } y^{(i)} = 1 \\ -\log{(1 - f_{\vec{w}, b}(\vec{x}^{(i)}))}, & \text{if } y^{(i)} = 0 \end{array} \right.$

<u>Simplified Loss Function:</u> $L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)}\log(f_{\vec{w}, b}(\vec{x}^{(i)})) - (1 - y^{(i)})\log(1 - f_{\vec{w}, b}(\vec{x}^{(i)}))$

<u>Cost Function for Logistic Regression:</u> $J(\vec{w}, b) = \frac{1}{N}\sum_{i = 1}^{N}L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)})$
- You cannot reuse the squared-error Cost Function from Linear Regression: with the sigmoid inside $f$, that cost is not convex (it can have local minima other than the global minimum), which makes it hard for Gradient Descent to converge to the global minimum
- This Cost Function also makes confidently wrong classifications very expensive: if $y^{(i)} = 1$ and the model predicts 0 with 100% confidence, the loss goes to infinity. The same happens when $y^{(i)} = 0$ and the model predicts 1 with 100% confidence
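
The simplified loss and the cost can be sketched in a few lines of NumPy. The toy data, the helper names `log_loss` and `cost`, and the clipping constant `eps` (added only to keep `log` away from exactly 0) are assumptions for illustration.

```python
import numpy as np

def log_loss(f, y, eps=1e-15):
    """Per-example loss: -y*log(f) - (1 - y)*log(1 - f)."""
    f = np.clip(f, eps, 1 - eps)  # implementation detail: avoid log(0)
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

def cost(X, y, w, b):
    """J(w, b): mean loss over the N training examples."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.mean(log_loss(f, y))

# Hypothetical toy data
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With all parameters at zero every prediction is 0.5, so J = log(2)
j_zero = cost(X, y, np.zeros(1), 0.0)
print(j_zero)

# True label 0: a confident wrong prediction costs far more than a mild one
losses = log_loss(np.array([0.6, 0.999]), np.array([0.0, 0.0]))
print(losses)
```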

# Gradient Descent for Logistic Regression

Repeat until convergence, performing simultaneous updates {
- $w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\vec{w}, b)$
    - Where $j = 1 \ldots m$ (one update per feature)
- $b = b - \alpha\frac{\partial}{\partial b}J(\vec{w}, b)$

}
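
For the logistic cost $J$ defined earlier, the partial derivatives work out to $\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})$. The update rule can then be sketched as below. The toy data, learning rate, iteration count, and zero initialization are assumptions chosen for the example, not values from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for (unregularized) logistic regression."""
    N, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y  # f(x^(i)) - y^(i), shape (N,)
        dj_dw = X.T @ err / N         # one partial derivative per feature
        dj_db = err.mean()
        # Simultaneous update: both gradients were computed from the old w, b
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Hypothetical, linearly separable toy data with one feature
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b = gradient_descent(X, y, alpha=0.1, num_iters=5000)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(w, b, preds)
```

Computing `dj_dw` and `dj_db` before touching `w` or `b` is what makes the update simultaneous.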


# Overfitting & Underfitting

![](2022-07-20-15-46-45.png)

![](2022-07-20-15-50-40.png)

<u>Combating Overfitting</u>
- Gather more training data
- Use fewer features
- Regularization = shrink the size of your parameters $w_1 \ldots w_m$
    - Benefit: it lets you keep all your features while preventing any of them from having an overly large effect
    - It is like having all the features but a simpler model, since the additional features have only a small effect
    - Regularizing $b$ makes little practical difference, so it is usually left unregularized

<u>Regularization in Practice:</u>
- Assume you overfit the model $w_1x + w_2x^2 + w_3x^3 + w_4x^4 + b$ to data
- To reduce overfitting, you want to make $w_3, w_4$ really small
- You can do this by altering your cost function: $J(\vec{w}, b) = \frac{1}{2N}\sum_{i = 1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})^2 + 1000w_3^2 + 1000w_4^2$
    - The addition at the end would penalize the model if $w_3, w_4$ are large
    - Therefore, the model would choose small values for $w_3, w_4$

<u>How Regularization is Usually Done:</u>
- If you don't know which features to penalize (say you have 100 features), then you penalize all of them a little bit
- Your Linear Regression Cost Function changes to $J(\vec{w}, b) = \frac{1}{2N}\sum_{i = 1}^{N}(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2N}\sum_{j = 1}^{m} (w_j)^2$
    - $\lambda$ is called the regularization parameter (similar to learning rate $\alpha$, you also have to choose a number for $\lambda$)
    - If $\lambda$ is too small, then the model may overfit
    - If $\lambda$ is too big, then the model may underfit, as all the parameters will be very close to 0 except $b$, which is not regularized
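
The regularized Linear Regression cost above translates fairly directly into NumPy. In this sketch the data, the parameter values, and the name `regularized_linear_cost` are illustrative assumptions.

```python
import numpy as np

def regularized_linear_cost(X, y, w, b, lam):
    """Squared-error cost plus (lambda / 2N) * sum_j w_j^2; b is not penalized."""
    N = X.shape[0]
    err = X @ w + b - y
    fit_term = (err @ err) / (2 * N)
    reg_term = lam / (2 * N) * (w @ w)
    return fit_term + reg_term

# Hypothetical data and parameters
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])
b = 0.0

j_plain = regularized_linear_cost(X, y, w, b, lam=0.0)
j_reg = regularized_linear_cost(X, y, w, b, lam=1.0)
print(j_plain, j_reg)
```

The gap between the two printed values is exactly the penalty $\frac{\lambda}{2N}\sum_j w_j^2$, so a larger $\lambda$ pushes the minimizer toward smaller weights.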

![](2022-07-20-18-24-56.png)

- Your Cost Function for Logistic Regression looks similar:

![](2022-07-20-23-27-16.png)

- Note that in this picture, $m$ represents the number of data points and $n$ represents the number of features, whereas the rest of these notes use $N$ for data points and $m$ for features
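
A sketch of the regularized Logistic Regression cost follows, written in these notes' notation ($N$ data points). It assumes the pictured formula adds the same $\frac{\lambda}{2N}\sum_j w_j^2$ penalty to the logistic cost that was added to the Linear Regression cost. The data, the function name, and the clipping constant `eps` are illustrative.

```python
import numpy as np

def regularized_logistic_cost(X, y, w, b, lam, eps=1e-15):
    """Logistic cost plus (lambda / 2N) * sum_j w_j^2; b is not penalized."""
    N = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    f = np.clip(f, eps, 1 - eps)  # implementation detail: avoid log(0)
    fit_term = np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))
    reg_term = lam / (2 * N) * (w @ w)
    return fit_term + reg_term

# Hypothetical data chosen so that z = 0 for every example
X = np.array([[1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0])
w = np.array([1.0, -1.0])
b = 0.0

j_plain = regularized_logistic_cost(X, y, w, b, lam=0.0)
j_reg = regularized_logistic_cost(X, y, w, b, lam=1.0)
print(j_plain, j_reg)
```

Because every prediction here is 0.5, the unregularized value is $\log 2$ and the regularized one is larger by $\frac{\lambda}{2N}\sum_j w_j^2 = 0.5$.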