**<font size=3>2.1 Binary classification</font>**

 **Some notations in the course**

$(x, y)$: an instance / example.  
For logistic regression, $x \in R^{n_x}$, $y \in \{0, 1\}$, where $n_x$ is the number of features.  

$m$: the size of the data set.  
If there are $m$ training examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$  
$m_{train}$:  
the size of the training set.  
$m_{test}$:  
the size of the test set.  

Vectorized notation:  
$X = [x^{(1)}, x^{(2)}, \cdots, x^{(m)}]$, $n_x$ rows and $m$ columns. Each row represents a feature and each column represents an example.  
$Y = [y^{(1)}, y^{(2)}, \cdots, y^{(m)}]$
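
A minimal sketch of this layout in NumPy, assuming a made-up toy data set with $m = 3$ examples and $n_x = 2$ features (the values and the name `examples` are hypothetical):

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n_x = 2 features each.
examples = [
    (np.array([1.0, 2.0]), 0),
    (np.array([3.0, 4.0]), 1),
    (np.array([5.0, 6.0]), 1),
]

# Stack each x^(i) as a column: X has shape (n_x, m).
X = np.column_stack([x for x, _ in examples])
# Put the labels in a row vector: Y has shape (1, m).
Y = np.array([[y for _, y in examples]])

n_x, m = X.shape
print(X.shape, Y.shape)  # (2, 3) (1, 3)
```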

**<font size=3>2.2 Logistic Regression</font>**

**Some notations**

$\hat y$: the prediction  
$\hat y = p(y = 1 | x)$: the probability that $y = 1$ given the input features $x$.  

When programming neural networks, we usually keep the parameters $w$ and $b$ separate, where $b$ corresponds to an intercept (bias) term.

**Logistic Regression**

Given $x$, want $\hat y = p(y = 1 | x)$  
  
$x \in R^{n_x}$, $0 \leq \hat y \leq 1$  
  
Parameters:  
$w \in R^{n_x}$, $b \in R$  
  
Output:  
$\hat y = \sigma(w^T x + b) = \sigma(z)$, where $z = w^T x + b$, and $\sigma(z) = \frac{1}{1 + e^{-z}}$  
  
More precisely, for the $i$-th example:  
$\hat y^{(i)} = \sigma(w^T x^{(i)} + b) = \sigma(z^{(i)})$, where $\sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}} = \frac{1}{1 + e^{-(w^T x^{(i)} + b)}}$

**the sigmoid function**
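
A small sketch of $\sigma(z) = \frac{1}{1 + e^{-z}}$ in NumPy. The function name `sigmoid` is my own choice, and the sketch does not guard against overflow for very negative $z$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
```

Large positive $z$ gives values close to 1 and large negative $z$ gives values close to 0, which is why $0 \leq \hat y \leq 1$.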

**<font size=3>2.3 Logistic Regression cost function</font>**

Given $(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})$, want $\hat y^{(i)} \approx y^{(i)}$

**Loss function**

Note that the squared error $L(\hat y, y) = \frac{1}{2}(\hat y - y)^2$ is not used for logistic regression, because when optimizing with gradient descent there can be multiple **local optima**. <font color=red># why?</font> Composed with the sigmoid, the squared error is not convex in $w$ and $b$.  
  
Instead, we will use:  
$L(\hat y, y) = -(y \log \hat y + (1 - y) \log (1 - \hat y))$  

Some explanation:  
When $y = 1$, we want $L(\hat y, y) = -\log \hat y$ to be as small as possible  
-> $\log \hat y$ as large as possible  
-> $\log \hat y = \ln \hat y$ here, which is an increasing function  
-> $\hat y$ as large as possible.  
  
Since $0 \leq \hat y \leq 1$, this means  
$\hat y$ should be as close to 1 as possible.

When $y = 0$, we want $L(\hat y, y) = -\log (1 - \hat y)$ to be as small as possible  
-> $\log (1 - \hat y)$ as large as possible  
-> $\log (1 - \hat y) = \ln (1 - \hat y)$ here, which is an increasing function  
-> $1 - \hat y$ as large as possible  
-> $- \hat y$ as large as possible  
-> $\hat y$ as small as possible  
  
Since $0 \leq \hat y \leq 1$, this means  
$\hat y$ should be as close to 0 as possible.
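
The two cases can be checked numerically. This is only a sketch: the name `loss` is my own, and it assumes $0 < \hat y < 1$ so that the logarithms are finite.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example; assumes 0 < y_hat < 1."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: the loss shrinks as y_hat approaches 1.
print(loss(0.9, 1) < loss(0.1, 1))  # True
# y = 0: the loss shrinks as y_hat approaches 0.
print(loss(0.1, 0) < loss(0.9, 0))  # True
```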

**Loss func and cost func**

The **loss func** is defined with respect to **a single training example**. It measures how well you're doing on a **single** training example.  
  
The **cost func** measures how well you're doing on the **entire** training set.  
$J(w, b) = \frac{1}{m} \sum^m_{i = 1} L(\hat y^{(i)}, y^{(i)}) = -\frac {1}{m} \sum^m_{i = 1} [y^{(i)} \log \hat y^{(i)} + (1 - y^{(i)}) \log (1 - \hat y^{(i)})]$.  
  
Note that $J(w, b)$ here is a cost function.
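
A hedged sketch of the cost as the average of the per-example losses. The function name `cost` and the sample predictions and labels are hypothetical.

```python
import numpy as np

def cost(A, Y):
    """Average cross-entropy over m examples; A and Y have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m

A = np.array([[0.9, 0.2, 0.7]])  # hypothetical predictions
Y = np.array([[1, 0, 1]])        # hypothetical labels
print(cost(A, Y))
```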

<font size=3>**2.4 Gradient Descent**</font>

Repeat {  
&nbsp;&nbsp;&nbsp;&nbsp;$w := w - \alpha \frac{\partial J(w, b)}{\partial w}$  
&nbsp;&nbsp;&nbsp;&nbsp;$b := b - \alpha \frac{\partial J(w, b)}{\partial b}$  
}  

This is the inner loop of gradient descent (GD).  
When writing code, the derivative terms will be written as dw and db.

When the derivative is greater than 0, $w$ will be decreased.  
When the derivative is less than 0, $w$ will be increased.
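
To see the sign behaviour in isolation, here is a sketch on a toy one-dimensional cost $J(w) = (w - 3)^2$. This cost is an assumption chosen for illustration only; it is not the logistic regression cost.

```python
def gradient_descent_1d(w, alpha=0.1, steps=100):
    """Minimize the toy cost J(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        dw = 2 * (w - 3)     # dJ/dw
        w = w - alpha * dw   # positive dw lowers w, negative dw raises w
    return w

print(gradient_descent_1d(w=10.0))  # approaches 3 from above
print(gradient_descent_1d(w=-4.0))  # approaches 3 from below
```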

<font size=3>**2.8 Derivatives with a Computation Graph**</font>

**Some notations**

In the code you write, dvar represents the derivative of the **final output** you care about (such as $J$, or sometimes the loss $L$) with respect to the various **intermediate quantities** you're computing in your code.  
  
For example, $\frac{dJ}{dv} = dv$, $\frac{dJ}{da} = da$
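
A sketch of this naming convention on a small computation graph. The graph $u = bc$, $v = a + u$, $J = 3v$ and the input values are assumptions made for illustration.

```python
# Forward pass through a small hypothetical graph: u = b*c, v = a + u, J = 3*v.
a, b, c = 5.0, 3.0, 2.0
u = b * c
v = a + u
J = 3 * v

# Backward pass: each dvar holds dJ/dvar.
dv = 3.0        # dJ/dv
da = dv * 1.0   # dJ/da = dJ/dv * dv/da
du = dv * 1.0   # dJ/du = dJ/dv * dv/du
db = du * c     # dJ/db = dJ/du * du/db
dc = du * b     # dJ/dc = dJ/du * du/dc

print(J, da, db, dc)  # 33.0 3.0 6.0 9.0
```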

<font size=3>**2.9 Logistic Regression Gradient Descent**</font>

**some notations**

$\hat y = a = \sigma(z)$  
$L(a, y) = -(y \log a + (1 - y) \log (1 - a))$

**Logistic regression derivatives**

$da = -\frac{y}{a} + \frac{1 - y}{1 - a}$  
$dz = a - y$  
  
$dw_1 = x_1 dz$  
$dw_2 = x_2 dz$  
$db = dz$
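
The step from $da$ to $dz$ uses the chain rule with $\frac{da}{dz} = a(1 - a)$, which simplifies to $a - y$. The sketch below checks this, and checks $dw_1$ against a finite difference. All numeric values and the helper name `loss_single` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_single(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for one example with two features.
w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 1.5, 2.0, 1

# Analytic derivatives from the formulas above.
a = sigmoid(w1 * x1 + w2 * x2 + b)
da = -y / a + (1 - y) / (1 - a)
dz = da * a * (1 - a)          # chain rule; simplifies to a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Central finite difference for dL/dw1 as a sanity check.
eps = 1e-6
dw1_num = (loss_single(w1 + eps, w2, b, x1, x2, y)
           - loss_single(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
print(abs(dz - (a - y)) < 1e-12, abs(dw1 - dw1_num) < 1e-6)  # True True
```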

**GD with respect to a single example**

Repeat {  
&nbsp;&nbsp;&nbsp;&nbsp;$w_1 := w_1 - \alpha dw_1$   
&nbsp;&nbsp;&nbsp;&nbsp;$w_2 := w_2 - \alpha dw_2$   
&nbsp;&nbsp;&nbsp;&nbsp;$b := b - \alpha db$  
}  

Plugging in the derivatives:

Repeat {  
&nbsp;&nbsp;&nbsp;&nbsp;$w_1 := w_1 - \alpha x_1 dz$   
&nbsp;&nbsp;&nbsp;&nbsp;$w_2 := w_2 - \alpha x_2 dz$   
&nbsp;&nbsp;&nbsp;&nbsp;$b := b - \alpha dz$  
}  

<font size=3>**2.10 Gradient descent on $m$ examples**</font>

$\frac{\partial}{\partial w_1}J(w, b) = \frac{1}{m} \sum^m_{i = 1} \frac{\partial}{\partial w_1} L(a^{(i)}, y^{(i)})$

**GD for $m$ examples in one step**

// accumulators  
J = 0, <font color=blue>dw1 = 0, dw2 = 0</font>, db = 0  
  
// over the training set, compute the derivatives with respect to each training example  
For i = 1 to m {   
&nbsp;&nbsp;&nbsp;&nbsp;// forward prop  
&nbsp;&nbsp;&nbsp;&nbsp;$z^{(i)} = w^T x^{(i)} + b$  
&nbsp;&nbsp;&nbsp;&nbsp;$a^{(i)} = \sigma(z^{(i)})$  
  
&nbsp;&nbsp;&nbsp;&nbsp;J += $-[y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)})]$  
  
&nbsp;&nbsp;&nbsp;&nbsp;//back prop  
&nbsp;&nbsp;&nbsp;&nbsp;$dz^{(i)} = a^{(i)} - y^{(i)}$  
<font color=blue>
&nbsp;&nbsp;&nbsp;&nbsp;$dw_1 += x^{(i)}_1 dz^{(i)}$  
&nbsp;&nbsp;&nbsp;&nbsp;$dw_2 += x^{(i)}_2 dz^{(i)}$
</font>  
&nbsp;&nbsp;&nbsp;&nbsp;$db += dz^{(i)}$  
}  
  
// mean  
J /= m  
<font color=blue>
$dw_1$ /= m  
$dw_2$ /= m  
</font>
db /= m  
  
// GD  
// a for loop will be needed to calculate the derivative of each feature  
// if there are multiple features  
$w_1 := w_1 - \alpha \cdot dw_1$   
$w_2 := w_2 - \alpha \cdot dw_2$   
$b := b - \alpha \cdot db$  

Weakness: this approach needs two for loops.  
  
One loop iterates over the $m$ examples and the other computes the derivative with respect to each feature.  
  
Note that there are only two features here, so a for loop over the features is not needed. If there are many features, a for loop is needed.
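
A runnable sketch of the pseudocode above with both loops written out explicitly. The function name `gd_step_loops`, the argument layout, and the toy data are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step_loops(w, b, X, Y, alpha):
    """One gradient descent step using explicit loops.
    w: list of n_x weights, X: (n_x, m) array, Y: length-m array of labels."""
    n_x, m = X.shape
    J, dw, db = 0.0, [0.0] * n_x, 0.0
    for i in range(m):                      # loop over examples
        z = sum(w[j] * X[j, i] for j in range(n_x)) + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        for j in range(n_x):                # loop over features
            dw[j] += X[j, i] * dz
        db += dz
    J, db = J / m, db / m
    dw = [d / m for d in dw]
    w = [w[j] - alpha * dw[j] for j in range(n_x)]
    b = b - alpha * db
    return w, b, J

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]])
Y = np.array([0, 0, 1, 1])
w, b, J = gd_step_loops([0.0, 0.0], 0.0, X, Y, alpha=0.1)
print(J)  # equals log(2) when starting from w = 0, b = 0
```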

<font size=3>**2.11 Vectorization**</font>

// accumulators  
J = 0, <font color=blue>dw = np.zeros([$n_x$, 1])</font>, db = 0  
  
// over the training set, compute the derivatives with respect to each training example  
For i = 1 to m {   
&nbsp;&nbsp;&nbsp;&nbsp;// forward prop  
&nbsp;&nbsp;&nbsp;&nbsp;$z^{(i)} = w^T x^{(i)} + b$  
&nbsp;&nbsp;&nbsp;&nbsp;$a^{(i)} = \sigma(z^{(i)})$  
  
&nbsp;&nbsp;&nbsp;&nbsp;J += $-[y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)})]$  
  
&nbsp;&nbsp;&nbsp;&nbsp;//back prop  
&nbsp;&nbsp;&nbsp;&nbsp;$dz^{(i)} = a^{(i)} - y^{(i)}$  
<font color=blue> 
&nbsp;&nbsp;&nbsp;&nbsp;$dw += x^{(i)}dz^{(i)}$
</font>  
&nbsp;&nbsp;&nbsp;&nbsp;$db += dz^{(i)}$  
}  
  
// mean  
J /= m  
<font color=blue>
dw /= m  
</font>
db /= m  
  
// GD  
$w := w - \alpha \cdot dw$   
$b := b - \alpha \cdot db$  
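
The same step as a sketch in NumPy, with the loop over features replaced by a vector `dw` while the loop over examples remains. The function name `gd_step_one_loop` and the toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step_one_loop(w, b, X, Y, alpha):
    """One GD step; w: (n_x, 1), X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):
        x_i = X[:, i:i + 1]                # column vector, shape (n_x, 1)
        y_i = Y[0, i]
        z = (w.T @ x_i).item() + b
        a = sigmoid(z)
        J += -(y_i * np.log(a) + (1 - y_i) * np.log(1 - a))
        dz = a - y_i
        dw += x_i * dz                     # replaces the loop over features
        db += dz
    J, dw, db = J / m, dw / m, db / m
    return w - alpha * dw, b - alpha * db, J

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
w_new, b_new, J = gd_step_one_loop(np.zeros((2, 1)), 0.0, X, Y, alpha=0.1)
print(w_new.shape)  # (2, 1)
```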

<font size=3>**2.13 Vectorizing Logistic Regression**</font>

**A vectorized implementation of the forward prop for all $m$ training examples at the same time**

$$
\begin{aligned}
X &= [x^{(1)}, x^{(2)}, \cdots, x^{(m)}] \\
\\
Z &= [z^{(1)}, z^{(2)}, \cdots, z^{(m)}] \\
&= w^T X + [b, b, \cdots, b] \\
&= [w^T x^{(1)} + b, w^T x^{(2)} + b, \cdots, w^T x^{(m)} + b] \\
\\
A &= [a^{(1)}, a^{(2)}, \cdots, a^{(m)}] = \sigma(Z)
\end{aligned}
$$
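
In NumPy, the row $[b, b, \cdots, b]$ does not have to be built by hand, because adding a scalar to an array broadcasts it. A minimal sketch, with the function name `forward` and the sample values as assumptions:

```python
import numpy as np

def forward(w, b, X):
    """Return A = sigmoid(w^T X + b); w: (n_x, 1), X: (n_x, m), b: scalar."""
    Z = w.T @ X + b          # b is broadcast across all m columns
    return 1.0 / (1.0 + np.exp(-Z))

w = np.array([[1.0], [-1.0]])             # hypothetical parameters
X = np.array([[0.0, 2.0], [0.0, 1.0]])    # two examples as columns
A = forward(w, 0.0, X)
print(A.shape)  # (1, 2)
```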

<font size=3 >**2.14 Vectorizing Logistic Regression's Gradient Computation**</font>

**a vectorized implementation of the back prop**

$$
\begin{aligned}
dZ &= [dz^{(1)}, dz^{(2)}, \cdots, dz^{(m)}] \\
&= A - Y \\
&= [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \cdots, a^{(m)} - y^{(m)}] \\
\\
db &= \frac{1}{m} \sum^m_{i=1}dz^{(i)} \\
&= \frac{1}{m} \text{np.sum}(dZ) \\
\\
dw &= \frac{1}{m}X \cdot dZ^T \\
&= \frac{1}{m} [x^{(1)}, x^{(2)}, \cdots, x^{(m)}]
\begin{bmatrix}
dz^{(1)} \\
dz^{(2)} \\
\vdots \\
dz^{(m)} \\
\end{bmatrix} \\
&= \frac{1}{m}
\begin{bmatrix}
\sum^m_{i=1} x^{(i)}_1 dz^{(i)} \\
\sum^m_{i=1} x^{(i)}_2 dz^{(i)} \\
\vdots \\
\sum^m_{i=1} x^{(i)}_{n_x} dz^{(i)} \\
\end{bmatrix} \\
&= 
\begin{bmatrix}
dw_1 \\
dw_2 \\
\vdots \\
dw_{n_x} \\
\end{bmatrix} \\
\end{aligned}
$$
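
A sketch of the vectorized gradients in NumPy. The function name `gradients` and the toy inputs are assumptions; the shapes follow the notation above.

```python
import numpy as np

def gradients(A, X, Y):
    """Return (dw, db); A and Y: (1, m), X: (n_x, m)."""
    m = X.shape[1]
    dZ = A - Y                 # shape (1, m)
    dw = (X @ dZ.T) / m        # shape (n_x, 1)
    db = np.sum(dZ) / m        # scalar
    return dw, db

# Hypothetical inputs: activations of 0.5 for every example.
X = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
A = np.full((1, 4), 0.5)
dw, db = gradients(A, X, Y)
print(dw.shape)  # (2, 1)
```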

**Implementing Logistic Regression**
(one iteration of gradient descent, without an explicit for loop)

// accumulators  
J = 0, dw = np.zeros([$n_x$, 1]), db = 0  
  
// forward prop over the whole training set at once  
<font color=blue>
$Z = w^T X + b$  
$A = \sigma(Z)$  

//back prop  
$dZ = A - Y$  
$dw = XdZ^T$  
$db = np.sum(dZ)$
</font>
    
// mean  
dw /= m  
db /= m  
  
// GD   
<font color=blue>
$w := w - \alpha \cdot dw$   
$b := b - \alpha \cdot db$  
</font>

**GD for 1000 iterations**

for iter = 1 to 1000 {  
&nbsp;&nbsp;&nbsp;&nbsp;// accumulators  
&nbsp;&nbsp;&nbsp;&nbsp;J = 0, dw = np.zeros([$n_x$, 1]), db = 0  
&nbsp;&nbsp;&nbsp;&nbsp;  
&nbsp;&nbsp;&nbsp;&nbsp;// forward prop over the whole training set at once  
<font color=blue>
&nbsp;&nbsp;&nbsp;&nbsp;$Z = w^T X + b$  
&nbsp;&nbsp;&nbsp;&nbsp;$A = \sigma(Z)$  
  
&nbsp;&nbsp;&nbsp;&nbsp;//back prop  
&nbsp;&nbsp;&nbsp;&nbsp;$dZ = A - Y$  
&nbsp;&nbsp;&nbsp;&nbsp;$dw = XdZ^T$  
&nbsp;&nbsp;&nbsp;&nbsp;$db = np.sum(dZ)$
&nbsp;&nbsp;&nbsp;&nbsp;</font>
  
&nbsp;&nbsp;&nbsp;&nbsp;// mean  
&nbsp;&nbsp;&nbsp;&nbsp;dw /= m  
&nbsp;&nbsp;&nbsp;&nbsp;db /= m  
&nbsp;&nbsp;&nbsp;&nbsp;  
&nbsp;&nbsp;&nbsp;&nbsp;// GD   
<font color=blue>
&nbsp;&nbsp;&nbsp;&nbsp;$w := w - \alpha \cdot dw$   
&nbsp;&nbsp;&nbsp;&nbsp;$b := b - \alpha \cdot db$  
</font>
}
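
Putting the pieces together, a hedged sketch of the full training loop in NumPy. The function name `train`, the default learning rate, and the one-feature toy data are all assumptions made for illustration.

```python
import numpy as np

def train(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent for logistic regression.
    X: (n_x, m), Y: (1, m). Returns (w, b)."""
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0
    for _ in range(num_iterations):
        A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))   # forward prop
        dZ = A - Y                                  # back prop
        dw = (X @ dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw                          # GD update
        b = b - alpha * db
    return w, b

# Hypothetical 1-feature data: the label is 1 when x > 0.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
predictions = (1.0 / (1.0 + np.exp(-(w.T @ X + b))) > 0.5).astype(int)
print(predictions)  # [[0 0 1 1]]
```

The outer loop over iterations cannot be vectorized away, since each update depends on the parameters produced by the previous one.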

<font size=3>**2.18 Explanation of logistic regression cost func**</font>

(My understanding)  
If the label of an example is 1, we want the estimated probability $p(y = 1 | x)$ to be as large as possible. If the label of an example is 0, we want $p(y = 0 | x)$ to be as large as possible. Thus for any example, we want $p(y | x)$ to be as large as possible.

 **Loss function (for one example)**

$\hat y = p(y = 1 | x)$  
That is, $\hat y$ is the probability that $y = 1$ given an example $x$.  
  
if $y = 1$, $p(y | x) = p(y = 1 | x) = \hat y$  
if $y = 0$, $p(y | x) = p(y = 0 | x) = 1 - p(y = 1 | x) = 1 - \hat y$  
  
Combining these two equations into a single one, we get:  
$p(y | x) = {\hat y}^y{(1 - \hat y)}^{(1 - y)}$  
  
Our goal is to maximize $p(y | x)$, and since $\log$ is an increasing function, we can equivalently maximize $\log p(y | x)$:

$$
\begin{aligned}
\max\{\log p(y | x)\} &= \max\{y \log \hat y + (1 - y) \log (1 - \hat y)\} \\
&= \max\{-L(\hat y, y)\}
\end{aligned}
$$
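
A quick numerical check that the negative log of $p(y | x)$ matches the loss. This is a sketch with a made-up prediction; the function names `likelihood` and `loss` are my own.

```python
import numpy as np

def likelihood(y_hat, y):
    """p(y | x) = y_hat^y * (1 - y_hat)^(1 - y) for y in {0, 1}."""
    return y_hat ** y * (1 - y_hat) ** (1 - y)

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.8  # hypothetical prediction
print(likelihood(y_hat, 1), likelihood(y_hat, 0))
print(np.isclose(-np.log(likelihood(y_hat, 1)), loss(y_hat, 1)))  # True
```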

**Cost function (for multiple examples)**

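For multiple examples, assuming the training examples are drawn independently, $p(\text{labels in the training set}) = \prod^m_{i=1} p(y^{(i)} | x^{(i)})$, and we want to maximize it.  
  
Similarly, since $\log$ is increasing,

$$
\begin{aligned}
\max\{p(\text{labels in the training set})\}
&\iff \max\{\log \prod^m_{i=1} p(y^{(i)} | x^{(i)})\} \\
&\iff \max\{\sum^m_{i=1} \log p(y^{(i)} | x^{(i)})\} \\
&\iff \max\{-\sum^m_{i=1} L(\hat y^{(i)}, y^{(i)})\} \\
&\iff \min \sum^m_{i=1} L(\hat y^{(i)}, y^{(i)})
\end{aligned}
$$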

For convenience, to keep the quantities on a better scale, we add an extra scaling factor $\frac{1}{m}$.
  
Thus $J(w, b) = \frac{1}{m}\sum^m_{i=1} L(\hat y^{(i)}, y^{(i)})$

(My understanding)  
**Why classify according to the probability?**  
Say a weather expert wants to predict whether tomorrow will be rainy. He may estimate the probability of precipitation and predict from it. If this probability is high, it is likely that tomorrow will be rainy. The classification task here is to predict whether tomorrow is rainy, where sunny is one class and rainy is the other, and the prediction is made according to the probability of precipitation: if it is high, tomorrow is classified as a rainy day. So, thinking from our daily life, it is natural for a computer to classify using probabilities. And computing probabilities is a kind of calculation, which computers are good at.  
  
**Why does a linear function work?**  
For logistic regression, we get one of the two classes through a step involving a linear function. How can we convert the outcome of the linear function into two classes, or for logistic regression, the two values 0 and 1? The classification is done by splitting the real numbers into two halves, where one half represents one class and the other half represents the other class. For logistic regression, the class is determined by whether the outcome of the linear function is greater than 0. The sigmoid function maps the outcome of the linear function into the interval between 0 and 1, and that value is then compared with 0.5 to get the label 0 or 1. In a way, the classifier can work without the sigmoid function by using the outcome of the linear function directly: if the outcome is greater than 0, the example belongs to one class, and if it is less than 0, the example belongs to the other class. Therefore, what the linear function does is compute a real number for each example, and that number is then used to classify according to some rule. The sigmoid function followed by the 0.5 threshold can be considered such a rule: it determines which class an example belongs to from the outcome of the linear function.
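
A small sketch of this point: thresholding $\sigma(z)$ at 0.5 gives the same labels as thresholding $z$ at 0, because $\sigma(0) = 0.5$ and $\sigma$ is increasing. The function names and the parameter values are hypothetical.

```python
import numpy as np

def predict_from_probability(w, b, X):
    """Classify by comparing sigmoid(w^T X + b) with 0.5."""
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    return (A > 0.5).astype(int)

def predict_from_linear(w, b, X):
    """Classify by the sign of the linear function alone."""
    return ((w.T @ X + b) > 0).astype(int)

w, b = np.array([[2.0], [-1.0]]), 0.5       # hypothetical parameters
X = np.array([[1.0, -1.0, 0.0, 3.0],
              [0.5, 2.0, 4.0, 1.0]])        # four examples as columns
print(predict_from_probability(w, b, X))    # [[1 0 0 1]]
print(predict_from_linear(w, b, X))         # [[1 0 0 1]]
```

What the sigmoid adds is a probability for each prediction, which is what the loss function above needs during training.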