# Neural Networks Basics

> Week2

- [Logistic Regression as a Neural Network](#logistic-regression-as-a-neural-network)
  - [Binary Classification](#binary-classification)
  - [Logistic Regression](#logistic-regression)
  - [Logistic Regression Cost Function](#logistic-regression-cost-function)
  - [Gradient Descent](#gradient-descent)
  - [Computation Graphs](#computation-graphs)
  - [Logistic Regression Gradient Descent](#logistic-regression-gradient-descent)
  - [Gradient Descent on m Examples](#gradient-descent-on-m-examples)
- [Python and Vectorization](#python-and-vectorization)
  - [Vectorization](#vectorization)
  - [Examples](#examples)
  - [Vectorizing Logistic Regression](#vectorizing-logistic-regression)
  - [Vectorizing Gradient](#vectorizing-gradient)
  - [Broadcasting in Python](#broadcasting-in-python)
  - [Note on Vectors](#note-on-vectors)
  - [Logistic Regression Cost Function](#logistic-regression-cost-function-1)

## Logistic Regression as a Neural Network

### Binary Classification
**Notation**

E.g. Cat vs. Non-Cat, with labels 1 (cat) and 0 (non-cat)

image shape: (num_px, num_px, color) = (height, width, color)
* num_px = 64, i.e. a 64x64 image
* color = 3, RGB

$x=[x_1,x_2,...,x_{n_x}]$ is a vector of features
* in this case $n_x$ = 64x64x3 = 12288

$y\in\{0,1\}$ is a label

m is the number of training examples
* training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$

X is a matrix of features, one training example per column
- $X=[x^{(1)},x^{(2)},...,x^{(m)}]$, X.shape = (n_x, m)

Y is a vector of labels
- $Y=[y^{(1)},y^{(2)},...,y^{(m)}]$, Y.shape = (1, m)
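
As a rough illustration of this notation, the sketch below flattens a handful of images into the columns of X. The random arrays stand in for a real dataset, and the names `images` and `labels` are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, num_px = 5, 64

# Hypothetical stand-in data: m RGB images and m binary labels.
images = rng.integers(0, 256, size=(m, num_px, num_px, 3))
labels = rng.integers(0, 2, size=m)

# Flatten each image into one column of length n_x = 64 * 64 * 3.
X = images.reshape(m, -1).T
Y = labels.reshape(1, m)

n_x = X.shape[0]
print(X.shape, Y.shape)  # (12288, 5) (1, 5)
```

Storing one example per column is what lets the later vectorized code compute all m predictions with a single matrix product.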

### Logistic Regression
> Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

Given $x$, we want $\hat{y}=P(y=1|x)$, modelled as $\hat{y}=\sigma(w^Tx+b)$, where $\sigma(z)=\frac{1}{1+e^{-z}}$\
$x\in\mathbb{R}^{n_x}$, $w\in\mathbb{R}^{n_x}$, $b\in\mathbb{R}$, $y\in\{0,1\}$, and $\hat{y}=P(y=1|x)=\sigma(z)\in(0,1)$.
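
A minimal sketch of this prediction step, assuming NumPy and made-up values for `w`, `b`, and `x` (the helper name `predict_proba` is hypothetical):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for one feature vector x."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.25])   # illustrative weights
b = 0.1                      # illustrative bias
x = np.array([2.0, 4.0])     # illustrative features

y_hat = predict_proba(w, b, x)   # here z = 0.1, so y_hat is slightly above 0.5
print(y_hat)
```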


### Logistic Regression Cost Function

$\hat{y} = g(z)$

$g(z) = \frac{1}{1+e^{-z}}$

$z(x)=w^Tx+b$

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$

$L(\hat{y},y) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})$ (for $y\in\{0,1\}$)
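
The loss and cost above can be sketched in a few lines of NumPy. The labels and predictions below are invented for illustration only.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss L(y_hat, y) for labels y in {0, 1}."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(Y_hat, Y):
    """Cost J: the mean loss over all m examples."""
    return np.mean(loss(Y_hat, Y))

Y = np.array([1, 0, 1])              # illustrative labels
Y_hat = np.array([0.9, 0.2, 0.6])    # illustrative predictions

print(loss(Y_hat, Y))   # per-example losses
print(cost(Y_hat, Y))   # their average
```

Confident correct predictions (0.9 for a label of 1) contribute a small loss, while the hesitant 0.6 contributes the most.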


### Gradient Descent
> Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.

repeat until convergence: {

$w=w-\alpha\frac{\partial J(w,b)}{\partial w}$

$b=b-\alpha\frac{\partial J(w,b)}{\partial b}$

}

where $\alpha$ is the learning rate.
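
To see the update rule in action, here is a small sketch on a one-parameter toy function, $J(w)=(w-3)^2$. The function, learning rate, and iteration count are chosen purely for illustration and are not part of the course material.

```python
def J(w):
    """Toy convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of the toy cost."""
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate (illustrative)
w = 0.0       # starting point

for _ in range(100):
    w = w - alpha * dJ_dw(w)   # step against the gradient

print(w)   # very close to 3
```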

### Computation Graphs
> A computation graph is a way of writing a mathematical expression as a graph. It is composed of nodes and edges. An edge from node A to node B means that the value of node A is an input to node B. A node can be a variable, a constant, or an operation. A computation graph can be evaluated to compute the value of each node.

E.g. $J(a,b,c) = 3(a+bc)$ as $u=bc, v=a+u, J=3v$

for a=5, b=3, c=2, u=bc=6, v=a+u=11, J=3v=33

$\frac{dJ}{dv}=3, \frac{dv}{da}=1, \frac{dJ}{da}=\frac{dJ}{dv}\frac{dv}{da}=3$

$\frac{dJ}{du}=3, \frac{du}{db}=c, \frac{dJ}{db}=\frac{dJ}{du}\frac{du}{db}=3c=6$

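c works the same way as b: $\frac{du}{dc}=b, \frac{dJ}{dc}=\frac{dJ}{du}\frac{du}{dc}=3b=9$

Note: this is just the chain rule from calculus, applied backward through the graph from the output to the inputs (back propagation).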
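
The same forward and backward passes can be written out directly in Python. This is only a hand-rolled sketch of the example above, not a general autodiff implementation.

```python
# Forward pass: evaluate each node from inputs to output.
a, b, c = 5.0, 3.0, 2.0
u = b * c        # 6
v = a + u        # 11
J = 3 * v        # 33

# Backward pass: apply the chain rule from output to inputs.
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0      # dv/du = 1
dJ_da = dJ_dv * 1.0      # dv/da = 1
dJ_db = dJ_du * c        # du/db = c
dJ_dc = dJ_du * b        # du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```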

### Logistic Regression Gradient Descent

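e.g. for a single training example with $n$ features

we have $x^{(i)}, w^{(i)}$ for $i=1,...,n$, and $b$ (in this section the superscript $(i)$ indexes features, not training examples)

$z=\sum_{i=1}^{n}w^{(i)}x^{(i)}+b$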

$\hat{y}=\sigma(z)$

$L(\hat{y},y)=-y\log(\hat{y})-(1-y)\log(1-\hat{y})$

- back prop

$\frac{\partial L}{\partial \hat{y}}=-\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}}$

$\frac{\partial \hat{y}}{\partial z}=\hat{y}(1-\hat{y})$

$\frac{\partial z}{\partial w^{(i)}}=x^{(i)}$

$\frac{\partial z}{\partial b}=1$

- chain rule applied

$\frac{\partial L}{\partial z}=\hat{y}-y$

$\frac{\partial L}{\partial w^{(i)}}=x^{(i)}(\hat{y}-y)$

$\frac{\partial L}{\partial b}=\hat{y}-y$

- update

$w^{(i)}=w^{(i)}-\alpha\frac{\partial L}{\partial w^{(i)}}$

$b=b-\alpha\frac{\partial L}{\partial b}$
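
A quick way to sanity-check these derivatives is to compare them against a numerical gradient. The sketch below does this for one made-up example with two features; all values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for a single example."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Illustrative values for one example with two features.
w = np.array([0.2, -0.4])
b = 0.1
x = np.array([1.0, 2.0])
y = 1

# Analytic gradients from the chain rule.
y_hat = sigmoid(np.dot(w, x) + b)
dz = y_hat - y
dw = x * dz
db = dz

# Numerical check of dL/db using a central difference.
eps = 1e-6
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print(db, db_numeric)   # the two values should agree closely
```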


### Gradient Descent on m Examples
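```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# X: (n_x, m) features, Y: (1, m) labels, w: (n_x, 1) weights, b: scalar bias
J, db, dw = 0.0, 0.0, np.zeros((n_x, 1))    # accumulators start at zero
for i in range(m):
    x_i = X[:, i].reshape(n_x, 1)           # i-th example as a column vector
    a_i = sigmoid(np.dot(w.T, x_i).item() + b)
    J += -Y[0, i] * np.log(a_i) - (1 - Y[0, i]) * np.log(1 - a_i)
    dz_i = a_i - Y[0, i]                    # dL/dz for example i
    dw += x_i * dz_i
    db += dz_i
J /= m                                      # average over the m examples
dw /= m
db /= m
```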

## Python and Vectorization

### Vectorization
- Vectorization is the art of getting rid of explicit for-loops in code.
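- Vectorization is important in deep learning because vectorized operations run far faster than explicit loops: libraries such as NumPy hand them to optimized compiled routines that can exploit parallel hardware (e.g. SIMD instructions).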

```python
import numpy as np
import time

w = np.random.rand(1000000)
x = np.random.rand(1000000)

# non-vectorized implementation:
tic = time.time()
z = 0
for i in range(len(w)):
    z += w[i] * x[i]
toc = time.time()
print("non-vectorized version:" + str(1000 * (toc - tic)) + "ms")

# vectorized implementation:
tic = time.time()
z = np.dot(w, x)
toc = time.time()
print("vectorized version:" + str(1000 * (toc - tic)) + "ms")

# expected output:
# timings change on every run, but the point is the gap between the two
# non-vectorized version:578.0763626098633ms
# vectorized version:1.3744831085205078ms
# here the vectorized version is about 420 times faster
```

```
non-vectorized version:573.4927654266357ms
vectorized version:1.3382434844970703ms
```



### Examples
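```python
import numpy as np

n_x = 3                        # example number of features
v = np.array([1.0, 2.0, 3.0])  # example input vector
dw = np.zeros((n_x, 1))        # (n_x, 1) column vector of zeros
u = np.exp(v)                  # element-wise exponential
u = np.log(v)                  # element-wise natural log
u = np.maximum(0, v)           # element-wise max(0, v)
```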

### Vectorizing Logistic Regression

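```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Z is a 1*m vector, where m is the number of training examples
# w is a n*1 vector, where n is the number of features
# X is a n*m matrix, where n is the number of features
# b is a scalar, broadcast across all m columns
# A is a 1*m vector, where m is the number of training examples
Z = np.dot(w.T, X) + b   # matrix product, not element-wise *
A = sigmoid(Z)
```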
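
One way to convince yourself that the matrix product computes every activation at once is to compare it against an explicit loop on small random data. The sizes and random values below are assumptions made for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 3, 4
w = rng.standard_normal((n, 1))
X = rng.standard_normal((n, m))
b = 0.5

# Vectorized: all m activations in one matrix product.
A_vec = sigmoid(np.dot(w.T, X) + b)

# Loop: one example (one column of X) at a time.
A_loop = np.zeros((1, m))
for i in range(m):
    A_loop[0, i] = sigmoid(np.dot(w[:, 0], X[:, i]) + b)

print(np.allclose(A_vec, A_loop))  # True
```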

### Vectorizing Gradient

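```python
import numpy as np

# Y is a 1*m vector, A is a 1*m vector, dZ is a 1*m vector
# X is a n*m matrix, w and dw are n*1 vectors, b and db are scalars
dZ = A - Y
db = np.sum(dZ) / m          # np.sum adds all entries; built-in sum would not
dw = np.dot(X, dZ.T) / m     # matrix product, (n, m) x (m, 1) -> (n, 1)
w = w - alpha * dw
b = b - alpha * db
```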
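
Putting the vectorized forward pass and gradient step together gives a complete training loop. The following is a minimal sketch on a tiny made-up dataset. The function name `train`, the learning rate, and the iteration count are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, num_iterations=1000):
    """Vectorized batch gradient descent for logistic regression."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)     # (1, m) predictions
        dZ = A - Y                          # (1, m)
        dw = np.dot(X, dZ.T) / m            # (n_x, 1)
        db = np.sum(dZ) / m                 # scalar
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Toy, linearly separable data: the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])

w, b = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)  # [[0 0 1 1]]
```

The only remaining explicit loop is over gradient descent iterations. No loop over training examples or features is needed.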

### Broadcasting in Python

> Broadcasting is a very useful feature of NumPy that allows us to perform operations on two arrays/tensors, even if these arrays/tensors do not have the same shape.


```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
print(A)
cal = A.sum(axis=0)  # sum of each column
print(cal)
print(cal.shape)
percentage = 100 * A / cal.reshape(1, 4)  # reshape(1,4) makes cal an explicit row vector
print(percentage)
```

```
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[15 18 21 24]
(4,)
[[ 6.66666667 11.11111111 14.28571429 16.66666667]
 [33.33333333 33.33333333 33.33333333 33.33333333]
 [60.         55.55555556 52.38095238 50.        ]]
```


where A is a (3,4) matrix and cal has shape (4,), reshaped to (1,4)\
percentage is a (3,4) matrix divided by a (1,4) row vector, which is broadcast across the 3 rows

**General**
- (m,n) + (1,n) behaves like (m,n) + (m,n): the (1,n) row is copied m times
- (m,n) + (m,1) behaves like (m,n) + (m,n): the (m,1) column is copied n times
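
Both rules can be checked on a small example. The arrays below are arbitrary values picked for illustration.

```python
import numpy as np

M = np.arange(6).reshape(2, 3)       # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([[10, 20, 30]])       # shape (1, 3)
col = np.array([[100], [200]])       # shape (2, 1)

print(M + row)   # row is added to every row of M
print(M + col)   # col is added to every column of M
```

The same stretching applies to the other element-wise operators (`-`, `*`, `/`).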

### Note on Vectors
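use `np.random.randn(5,1)` instead of `np.random.randn(5)`\
the former is a rank 2 column vector of shape (5,1); the latter is a rank 1 array of shape (5,), whose transpose is itself\
`np.random.randn(1,5)` is a row vector\
add `assert(a.shape == (5,1))` to check that a vector has the shape you expect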

```python
import numpy as np

a = np.random.randn(5)       # rank 1 array
print(a)
print(a.shape)
print(a.T)                   # transposing a rank 1 array changes nothing
print(np.dot(a, a.T))        # a single number (inner product)
a = np.random.randn(5, 1)    # column vector
print(a)
print(a.shape)
print(a.T)                   # a (1,5) row vector
print(np.dot(a, a.T))        # a (5,5) matrix (outer product)
```

```
[-1.05704483  1.52958024 -0.766576   -0.61909726  0.76353203]
(5,)
[-1.05704483  1.52958024 -0.766576   -0.61909726  0.76353203]
5.010860818185966
[[ 0.50686633]
 [ 1.48000822]
 [-1.19011419]
 [-1.09778839]
 [-0.66861646]]
(5, 1)
[[ 0.50686633  1.48000822 -1.19011419 -1.09778839 -0.66861646]]
[[ 0.25691348  0.75016634 -0.60322881 -0.55643197 -0.33889917]
 [ 0.75016634  2.19042433 -1.76137878 -1.62473583 -0.98955786]
 [-0.60322881 -1.76137878  1.41637178  1.30649353  0.79572994]
 [-0.55643197 -1.62473583  1.30649353  1.20513934  0.73399939]
 [-0.33889917 -0.98955786  0.79572994  0.73399939  0.44704797]]
```
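
If a rank 1 array does turn up, one possible fix is to reshape it into an explicit column vector before using it. A small sketch:

```python
import numpy as np

a = np.random.randn(5)      # rank 1 array, shape (5,)
a = a.reshape(5, 1)         # explicit column vector, shape (5, 1)
assert a.shape == (5, 1)

outer = np.dot(a, a.T)      # (5, 1) x (1, 5) -> (5, 5) outer product
inner = np.dot(a.T, a)      # (1, 5) x (5, 1) -> (1, 1) inner product
print(outer.shape, inner.shape)  # (5, 5) (1, 1)
```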


### Logistic Regression Cost Function

$z(x)=w^Tx+b$

$\sigma(z)=\frac{1}{1+e^{-z}}$

$\hat{y}=\sigma(z)$

we interpret $\hat{y}$ as the probability that $y=1$ given $x$.

$P(y=1|x)=\hat{y}$

$P(y=0|x)=1-\hat{y}$

$P(y|x)=\hat{y}^{y}(1-\hat{y})^{(1-y)}$

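$\log P(y|x)=y\log\hat{y}+(1-y)\log(1-\hat{y})=-L(\hat{y},y)$

Assuming the training examples are independent, $\log P(\text{labels in training set})=\log\prod_{i=1}^{m}P(y^{(i)}|x^{(i)})=\sum_{i=1}^{m}\log P(y^{(i)}|x^{(i)})=-\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$

By maximum likelihood estimation, we want to maximize the above expression, which is the same as minimizing its negative. Scaling by $\frac{1}{m}$ gives the cost:

$J(w,b)=-\frac{1}{m}\log\prod_{i=1}^{m}P(y^{(i)}|x^{(i)})=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$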
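
The link between the likelihood and the cross-entropy cost can be checked numerically. The labels and predictions in this sketch are invented for illustration.

```python
import numpy as np

Y = np.array([1, 0, 1, 1])               # illustrative labels
Y_hat = np.array([0.8, 0.3, 0.6, 0.9])   # illustrative predictions
m = Y.size

# P(y | x) = y_hat^y * (1 - y_hat)^(1 - y) for each example
P = Y_hat ** Y * (1 - Y_hat) ** (1 - Y)

# Cost as the negative average log-likelihood of the training labels
J_likelihood = -np.log(np.prod(P)) / m

# Cost as the average cross-entropy loss
J_loss = np.mean(-Y * np.log(Y_hat) - (1 - Y) * np.log(1 - Y_hat))

print(np.isclose(J_likelihood, J_loss))  # True
```

In practice the sum-of-logs form is preferred, since the product of many probabilities can underflow to zero for large m.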