# Logistic Regression

## Binary Classification
### Notation
E.g. cat vs. non-cat image classification\
label: 1 (cat) or 0 (non-cat)\
image shape: (num_px, num_px, color) = (height, width, color)\
* num_px = 64, 64x64 image
* color = 3, RGB

x=[x1,x2,...,x_nx] is a vector of features, of length nx
* in this case nx = 64x64x3 = 12288

y is a label (0 or 1)

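m is the number of training examples
* training set: {(x(1),y(1)),(x(2),y(2)),...,(x(m),y(m))}, where x(i) is the feature vector of the i-th example

X is a matrix of features, one example per column
- X=[x(1),x(2),...,x(m)], X.shape = (nx,m)

Y is a vector of labels
- Y=[y(1),y(2),...,y(m)], Y.shape = (1,m)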
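The shapes above can be illustrated with a small NumPy sketch. The random pixel values, the labels, and the choice of m = 5 are made up for illustration only; the point is how a stack of images is flattened into X with shape (nx, m) and Y with shape (1, m).

```python
import numpy as np

# Hypothetical toy dataset: m = 5 random "images" of shape (64, 64, 3).
rng = np.random.default_rng(0)
m, num_px = 5, 64
images = rng.integers(0, 256, size=(m, num_px, num_px, 3))
labels = np.array([1, 0, 0, 1, 1])  # made-up labels: 1 = cat, 0 = non-cat

# Flatten each image into a vector of length nx = 64 * 64 * 3,
# then transpose so that every column of X is one training example.
X = images.reshape(m, -1).T   # shape (nx, m)
Y = labels.reshape(1, m)      # shape (1, m)

print(X.shape)  # (12288, 5)
print(Y.shape)  # (1, 5)
```

Storing examples as columns is the convention these notes use; many libraries store one example per row instead, so the transpose may not be needed elsewhere.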

### Logistic Regression
> Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

Given $x$, $\hat{y}=P(y=1|x)=\sigma(w^Tx+b)$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function\
$x\in\mathbb{R}^{n_x}$, $w\in\mathbb{R}^{n_x}$, $b\in\mathbb{R}$, $y\in\{0,1\}$, and $\hat{y}=P(y=1|x)=\sigma(z)\in(0,1)$.
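A minimal sketch of the prediction formula above, assuming NumPy and column-vector shapes. The helper name `predict_proba` and the sample numbers are illustrative, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b).

    w: weights of shape (nx, 1); x: features of shape (nx, 1) or (nx, m);
    b: scalar bias. (Hypothetical helper, named here for illustration.)
    """
    return sigmoid(w.T @ x + b)

w = np.zeros((3, 1))
b = 0.0
x = np.array([[1.0], [2.0], [3.0]])

# With all-zero parameters z = 0, so the predicted probability is 0.5.
y_hat = predict_proba(w, b, x)
print(y_hat)  # [[0.5]]
```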


### Logistic Regression Cost Function
- $\hat{y} = g(z)$
- $g(z) = \frac{1}{1+e^{-z}}$
- $z(x)=w^Tx+b$
- $J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$
- $L(\hat{y},y) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})$ (for $y\in\{0,1\}$)
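The cost $J$ can be computed directly from the formulas above. This is a sketch assuming predictions and labels are stored as arrays of shape (1, m) and that every prediction lies strictly between 0 and 1 (otherwise the logarithm is undefined); the function name `cost` and the sample values are made up.

```python
import numpy as np

def cost(Y_hat, Y):
    """Average cross-entropy loss over m examples.

    Y_hat: predicted probabilities, shape (1, m), each strictly in (0, 1).
    Y: true labels in {0, 1}, shape (1, m).
    """
    m = Y.shape[1]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return float(np.sum(losses) / m)

Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.1, 0.5]])

J = cost(Y_hat, Y)
print(round(J, 4))  # 0.3013
```

Confident correct predictions (0.9 for a 1, 0.1 for a 0) contribute a small loss, while the uncertain prediction 0.5 contributes the largest term.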

### Gradient Descent
> Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.

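repeat until convergence: {\
$w=w-\alpha\frac{\partial J(w,b)}{\partial w}$\
$b=b-\alpha\frac{\partial J(w,b)}{\partial b}$\
}

where $\alpha$ is the learning rate, and both parameters are updated in every iteration.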
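To see the update rule in action, here is a sketch on a toy cost function rather than the logistic regression cost. The function $J(w,b)=(w-3)^2+(b+1)^2$, the learning rate, and the stopping tolerance are all assumptions chosen so the minimum (w = 3, b = -1) is known in advance.

```python
def grad_J(w, b):
    """Gradient of the toy cost J(w, b) = (w - 3)^2 + (b + 1)^2."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

w, b = 0.0, 0.0      # arbitrary starting point
alpha = 0.1          # learning rate (assumed value)
tol = 1e-8           # stop once the gradient is this small
max_steps = 1000     # safety cap on the number of iterations

for step in range(max_steps):
    dw, db = grad_J(w, b)
    if max(abs(dw), abs(db)) < tol:
        break        # treated as "converged"
    # Step against the gradient, i.e. in the direction of steepest descent.
    w = w - alpha * dw
    b = b - alpha * db

print(round(w, 6), round(b, 6))  # 3.0 -1.0
```

In practice "until convergence" is often replaced by a fixed number of iterations; the tolerance check here is just one possible stopping rule.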

### Computation Graphs
> A computation graph is a way of writing a mathematical expression as a graph. It is composed of nodes and edges. An edge from node A to node B means that the value of node A is an input to node B. A node can be a variable, a constant, or an operation. A computation graph can be evaluated to compute the value of each node.

E.g. $J(a,b,c) = 3(a+bc)$ as $u=bc, v=a+u, J=3v$\
for a=5, b=3, c=2: u=bc=6, v=a+u=11, J=3v=33\
$\frac{dJ}{dv}=3, \frac{dv}{da}=1, \frac{dJ}{da}=\frac{dJ}{dv}\frac{dv}{da}=3$\
$\frac{dJ}{du}=\frac{dJ}{dv}\frac{dv}{du}=3, \frac{du}{db}=c, \frac{dJ}{db}=\frac{dJ}{du}\frac{du}{db}=3c=6$\
$\frac{du}{dc}=b, \frac{dJ}{dc}=\frac{dJ}{du}\frac{du}{dc}=3b=9$

Note: this is just the chain rule from calculus, applied backwards through the graph (back propagation).
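The worked example can be checked with a few lines of plain Python: a forward pass computes each node left to right, and a backward pass applies the chain rule right to left. The variable names such as `dJ_dv` are just a naming choice for this sketch.

```python
# Forward pass: evaluate each node of J = 3 * (a + b * c).
a, b, c = 5.0, 3.0, 2.0
u = b * c        # u = 6
v = a + u        # v = 11
J = 3.0 * v      # J = 33

# Backward pass: chain rule, starting from the output node.
dJ_dv = 3.0              # J = 3v
dJ_da = dJ_dv * 1.0      # v = a + u, so dv/da = 1
dJ_du = dJ_dv * 1.0      # v = a + u, so dv/du = 1
dJ_db = dJ_du * c        # u = bc, so du/db = c
dJ_dc = dJ_du * b        # u = bc, so du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```

Each backward step reuses a derivative that was already computed (`dJ_dv`, then `dJ_du`), which is what makes back propagation efficient on larger graphs.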

### Logistic Regression Gradient Descent
- e.g. for a single training example $(x,y)$ we have features $x_1,\dots,x_{n_x}$, weights $w_1,\dots,w_{n_x}$, and bias $b$
- $z=\sum_{i=1}^{n_x}w_ix_i+b$
- $\hat{y}=\sigma(z)$
- $L(\hat{y},y)=-y\log(\hat{y})-(1-y)\log(1-\hat{y})$
* back prop
- $\frac{\partial L}{\partial \hat{y}}=-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$
- $\frac{\partial \hat{y}}{\partial z}=\hat{y}(1-\hat{y})$
- $\frac{\partial z}{\partial w_i}=x_i$
- $\frac{\partial z}{\partial b}=1$
* chain rule applied
- $\frac{\partial L}{\partial z}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}=\hat{y}-y$
- $\frac{\partial L}{\partial w_i}=\frac{\partial L}{\partial z}\frac{\partial z}{\partial w_i}=x_i(\hat{y}-y)$
- $\frac{\partial L}{\partial b}=\frac{\partial L}{\partial z}\frac{\partial z}{\partial b}=\hat{y}-y$
* update
- $w_i=w_i-\alpha\frac{\partial L}{\partial w_i}$
- $b=b-\alpha\frac{\partial L}{\partial b}$
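The derivation above can be sketched for one training example in NumPy. The feature values, initial weights, and learning rate are made-up numbers, and the function names `loss` and `gradients` are illustrative; this is a single update step, not a full training loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for a single example (x, y)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def gradients(w, b, x, y):
    """Gradients of the loss with respect to w and b via the chain rule."""
    y_hat = sigmoid(np.dot(w, x) + b)
    dz = y_hat - y     # dL/dz
    dw = x * dz        # dL/dw_i = x_i * (y_hat - y), for all i at once
    db = dz            # dL/db  = y_hat - y
    return dw, db

# Made-up example with two features and label y = 1.
x = np.array([1.0, 2.0])
y = 1.0
w = np.array([0.1, -0.2])
b = 0.0
alpha = 0.1  # learning rate (assumed value)

loss_before = loss(w, b, x, y)
dw, db = gradients(w, b, x, y)
w = w - alpha * dw
b = b - alpha * db
loss_after = loss(w, b, x, y)

print(loss_after < loss_before)  # True
```

For a training set of m examples, the same gradients would be averaged over all examples before each update, matching the cost $J$ defined as the mean of the per-example losses.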

```to be continued```