## DNN - Dense Neural Networks (Linear Regression)

Learn the features from data.

![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*63sGPbvLLpvlD16hG1bvmA.gif)

Assume we have `m` examples, each with `n` features, and `k` categories.

$$\begin{aligned}
\text{Given}\quad \mathbb{X}_ {(m,n)}&=\begin{bmatrix}
x_ {11} & x_ {12} & \cdots & x_ {1n}\\
x_ {21} & x_ {22} & \cdots & x_ {2n}\\
\vdots & \vdots & \ddots & \vdots\\
x_ {m1} & x_ {m2} & \cdots & x_ {mn}
\end{bmatrix}
\\
\\
\text{Given}\quad \mathbb{Y}_ {(m,1)}&=\begin{bmatrix}
y_ {1}\\
y_ {2}\\
\vdots\\
y_ {m}
\end{bmatrix}=\begin{bmatrix}
\text{cat}\\
\text{dog}\\
\vdots\\
\text{cat}
\end{bmatrix}
\end{aligned}$$

In [None]:
import numpy as np
arr = np.loadtxt("sample_data/mnist_train_small.csv", delimiter=",", dtype=int)
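# column 0 is the label, the remaining 784 columns are pixel values;
# loadtxt parsed every row as integers, so there is no header row to skip
X = arr[:, 1:]
Y = arr[:, 0]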
print(f"X: {len(X)}\n", X, "\n", X[0])
print(f"Y: {len(Y)}\n", Y, "\n", Y[0])

In [None]:
from PIL import Image
img = X[0]
print(Y[0])
Image.fromarray(img.reshape(28, 28).astype(np.uint8)).resize((140, 140)).show()

## One-hot Encoding

$$\mathbb{Y}_ {(m,k)}=\begin{bmatrix}
y_ {11} & y_ {12} & \cdots & y_ {1k}\\
y_ {21} & y_ {22} & \cdots & y_ {2k}\\
\vdots & \vdots & \ddots & \vdots\\
y_ {m1} & y_ {m2} & \cdots & y_ {mk}
\end{bmatrix}=\begin{bmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
1 & 0 & \cdots & 0
\end{bmatrix}$$

In [None]:
def oneHot(y):
    # labels: sorted unique values; idx: position of each entry of y in labels
    labels, idx = np.unique(y, return_inverse=True)
    # map each class index back to its original label (as a string)
    label_dict = {key: str(value) for key, value in enumerate(labels)}
    # row i of the identity matrix is the one-hot vector of class i
    y = np.eye(len(labels), dtype=int)[idx]
    return y, label_dict

Y_onehot, label_dict = oneHot(Y)
X_train, X_test = X[:18000, :], X[18000:, :]
Y_train, Y_test = Y_onehot[:18000], Y_onehot[18000:]
Y_onehot
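
The identity-matrix indexing behind one-hot encoding is easier to see on a tiny example. The sketch below uses made-up labels (`y_toy` is hypothetical, not part of the dataset) and shows that `argmax` undoes the encoding.

```python
import numpy as np

# Toy labels standing in for the MNIST digits (hypothetical values).
y_toy = np.array([2, 0, 1, 2])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye(k) with the labels builds the (m, k) matrix.
k = 3
Y_toy = np.eye(k, dtype=int)[y_toy]
print(Y_toy)

# argmax along each row recovers the original class indices.
recovered = Y_toy.argmax(axis=1)
print(recovered)
```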

## Weight and bias

![](https://c.mql5.com/18/20/NN1__1.gif)

In [None]:
W1 = np.random.randn(784, 128) * 0.1
b1 = np.random.randn(128) * 0.1
W2 = np.random.randn(128, 64) * 0.1
b2 = np.random.randn(64) * 0.1
W3 = np.random.randn(64, 10) * 0.1
b3 = np.random.randn(10) * 0.1
# count the weights and the biases of all three layers
print("parameters:", (784 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10))
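
For other layer widths the same count can be computed with a small helper. This is only a sketch: `count_parameters` is a hypothetical name, and it assumes every layer is fully connected with one bias per output unit.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a dense network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix of shape (n_in, n_out)
        total += n_out         # one bias per output unit
    return total

print(count_parameters([784, 128, 64, 10]))  # 109386
```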

## Activation Function

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*ZafDv3VUm60Eh10OeJu1vw.png)

* sigmoid function

$$\sigma(z)=\cfrac{1}{1+e^{-z}}=a$$
    
$$\sigma'(z)=\sigma(z)(1-\sigma(z))=a(1-a)$$

* tanh (second choice)
    
$$\sigma(z)=tanh(z)=\cfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}}=a$$
    
$$\sigma'(z)=1-(tanh(z))^2=1-a^2$$

* ReLU (the most common choice; its outputs are non-negative)
    
$$\sigma(z)=max(0,z)$$

$$\sigma'(z)=\begin{cases}0,\quad if\ z<0\\
1,\quad if\ z\ge0
\end{cases}$$

* Leaky ReLU

$$\sigma(z)=max(0.01z,z)$$

$$\sigma'(z)=\begin{cases}0.01,\quad if\ z<0\\
1,\quad \quad \ if\ z\ge0
\end{cases}$$

* SoftMax

$$\sigma(z)_ {i}=\cfrac{e^{z_ {i}}}{\sum_ {j}{e^{z_ {j}}}}=a_ {i}$$

$$\cfrac{∂C}{∂z}=\hat{Y}-Y\quad\text{(softmax combined with the categorical cross-entropy loss)}$$

In [None]:
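def relu(z):
    return np.maximum(0, z)

def relu_backward(z):
    # derivative of ReLU at the pre-activation z: 1 where z > 0, otherwise 0
    return 1 * (z > 0)

def softmax(z):
    # subtract the row-wise maximum so that np.exp cannot overflow
    z = z - np.max(z, axis=1, keepdims=True)
    z_exp = np.exp(z)
    y_hat = z_exp / np.sum(z_exp, axis=1, keepdims=True)
    # replace exact zeros so that a later log(y_hat) stays finite
    y_hat = np.where(y_hat == 0, 10**-10, y_hat)
    return y_hat

def softmax_backward(y_hat, Y):
    # gradient of the cross-entropy cost with respect to the softmax input z
    return y_hat - Y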
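
The derivative formulas for sigmoid, tanh and Leaky ReLU listed above can be checked numerically. The following is a minimal sketch, not part of the network itself: the helper names (`sigmoid_grad`, `numeric_grad`, and so on) are made up for this check, and the test points avoid `z = 0`, where Leaky ReLU has a kink.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

def numeric_grad(f, z, eps=1e-6):
    # central difference approximation of f'(z), element by element
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z_points = np.array([-2.0, -0.5, 0.5, 2.0])
checks = {
    "sigmoid": np.allclose(sigmoid_grad(z_points), numeric_grad(sigmoid, z_points), atol=1e-6),
    "tanh": np.allclose(tanh_grad(z_points), numeric_grad(np.tanh, z_points), atol=1e-6),
    "leaky_relu": np.allclose(leaky_relu_grad(z_points), numeric_grad(leaky_relu, z_points), atol=1e-6),
}
print(checks)
```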

## Forward Propagation

$$\begin{aligned}
X^{[0]}_ {(m,n)}&=X_ {(m,n)}\\
\\
X^{[1]}_ {(m, h_ {1})}&=\sigma^{[1]}(Z^{[1]}_ {(m, h_ {1})})=\sigma^{[1]}(X^{[0]}_ {(m,n)}{\ \cdot\ }W^{[1]}_ {(n,h_ {1})}+B^{[1]}_ {(1,h_ {1})})\\
\\
X^{[2]}_ {(m, h_ {2})}&=\sigma^{[2]}(Z^{[2]}_ {(m, h_ {2})})=\sigma^{[2]}(X^{[1]}_ {(m, h_ {1})}{\ \cdot\ }W^{[2]}_ {(h_ {1},h_ {2})}+B^{[2]}_ {(1,h_ {2})})\\
\\
&\bullet\\
&\bullet\\
&\bullet\\
\\
X^{[l]}_ {(m, h_ {l})}&=\sigma^{[l]}(Z^{[l]}_ {(m, h_ {l})})=\sigma^{[l]}(X^{[l-1]}_ {(m, h_ {l-1})}{\ \cdot\ }W^{[l]}_ {(h_ {l-1},h_ {l})}+B^{[l]}_ {(1,h_ {l})})\\
\\
\hat{Y}_ {(m,k)}&=g^{[o]}(Z^{[o]}_ {(m,k)})=g^{[o]}(X^{[l]}_ {(m, h_ {l})}{\ \cdot\ }W^{[o]}_ {(h_ {l},k)}+B^{[o]}_ {(1,k)})\\
\end{aligned}$$

In [None]:
Z1 = np.dot(X_train, W1) + b1
X1 = relu(Z1)
Z2 = np.dot(X1, W2) + b2
X2 = relu(Z2)
Z3 = np.dot(X2, W3) + b3
Y_hat = softmax(Z3)
Y_hat
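
The forward equations are the same for every layer, so they can be written as a loop over `(W, b)` pairs. The sketch below assumes ReLU on the hidden layers and softmax on the output, as in the cell above; `forward`, `stable_softmax` and the random demo data are hypothetical and only illustrate the shapes.

```python
import numpy as np

def stable_softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def forward(X, params):
    """params is a list of (W, b) pairs; ReLU on hidden layers, softmax on the last."""
    A = X
    for W, b in params[:-1]:
        A = np.maximum(0, A @ W + b)
    W, b = params[-1]
    return stable_softmax(A @ W + b)

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
demo_params = [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]

X_demo = rng.random((5, 784))            # 5 hypothetical examples
Y_hat_demo = forward(X_demo, demo_params)
print(Y_hat_demo.shape)                  # (5, 10)
print(Y_hat_demo.sum(axis=1))            # each row sums to 1 (up to rounding)
```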

## Accuracy and Cost (mean of loss)

![](https://www.researchgate.net/publication/367393140/figure/fig4/AS:11431281114710300@1674648981676/Confusion-matrix-Precision-Recall-Accuracy-and-F1-score.jpg)

* MSE

$$loss=(\hat{Y} - Y)^2$$

* CategoricalCrossentropy

$$loss=-\sum_ {k}{Y_ {k}*log(\hat{Y}_ {k})}$$

* Cost

$$C=\cfrac{1}{m}\sum{L(\hat{Y},Y)}$$

In [None]:
m = X_train.shape[0]
# mean squared error, used here only to monitor training;
# the gradient Y_hat - Y used in backpropagation corresponds to softmax with cross-entropy
loss = np.average(np.square(Y_hat - Y_train))
acc = (np.argmax(Y_hat, axis=1) == np.argmax(Y_train, axis=1)).sum() / m
print("loss:", loss)
print("acc:", acc)
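
The categorical cross-entropy listed above can be computed in a few lines. This is a sketch under the assumption that `Y` is one-hot and each row of `Y_hat` is a probability distribution; `cross_entropy_cost` and the two small prediction matrices are made up for illustration.

```python
import numpy as np

def cross_entropy_cost(Y_hat, Y, eps=1e-10):
    """Mean categorical cross-entropy for one-hot Y and row-wise probabilities Y_hat."""
    Y_hat = np.clip(Y_hat, eps, 1.0)   # guard against log(0)
    per_example = -np.sum(Y * np.log(Y_hat), axis=1)
    return per_example.mean()

Y_demo = np.array([[1, 0, 0], [0, 1, 0]])
Y_hat_good = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
Y_hat_bad = np.array([[0.1, 0.8, 0.1], [0.9, 0.05, 0.05]])

cost_good = cross_entropy_cost(Y_hat_good, Y_demo)
cost_bad = cross_entropy_cost(Y_hat_bad, Y_demo)
print(cost_good, cost_bad)  # the confident wrong predictions cost more
```

Only the probability assigned to the true class enters the loss, which is why confident wrong predictions are penalised heavily.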

## Gradient Descent (Optimizers)

$$\begin{aligned}
\mathbb{W}^{[l]}_ {(h_ {l-1},h_ {l})}&=\mathbb{W}^{[l]}_ {(h_ {l-1},h_ {l})}-\alpha\cfrac{∂C}{∂W}^{[l]}_ {(h_ {l-1},h_ {l})}
\\
\\
\mathbb{B}^{[l]}_ {(1,h_ {l})}&=\mathbb{B}^{[l]}_ {(1,h_ {l})}-\alpha\cfrac{∂C}{∂B}^{[l]}_ {(1,h_ {l})}
\end{aligned}$$
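
The update rule is easiest to see on a one-dimensional cost. The sketch below applies it to the made-up cost `C(w) = (w - 3)^2`; the function name, learning rate and step count are illustrative assumptions.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeat w <- w - lr * grad(w) starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# C(w) = (w - 3)^2 has gradient 2 (w - 3) and its minimum at w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)  # very close to 3
```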

## Backward Propagation

![](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*CnoTckCQhlXMDjjR8_n7IQ.gif)

### Output Layer (Softmax)

$$\begin{cases}
\cfrac{∂C}{∂Z}^{[o]}_ {(m,k)}&=\hat{Y}_ {(m,k)}-Y_ {(m,k)}\\
\\
\cfrac{∂C}{∂W}^{[o]}_ {(h_ {l},k)}&=\cfrac{1}{m}X^{[l]T}_ {(h_ {l},m)}{\ \cdot\ }\cfrac{∂C}{∂Z}^{[o]}_ {(m,k)}\\
\\
\cfrac{∂C}{∂B}^{[o]}_ {(1,k)}&=\cfrac{1}{m}\sum{(\cfrac{∂C}{∂Z}^{[o]}_ {(m,k)})}\\
\end{cases}$$

### Hidden Layer

$$\begin{cases}
\cfrac{∂C}{∂Z}^{[l]}_ {(m,h_ {l})}&=[\cfrac{∂C}{∂Z}^{[o]}_ {(m,k)}{\ \cdot\ }W^{[o]T}_ {(k,h_ {l})}]\times[\sigma'^{[l]}(Z^{[l]}_ {(m,h_ {l})})]\\
\\
\cfrac{∂C}{∂W}^{[l]}_ {(h_ {l-1},h_ {l})}&=\cfrac{1}{m}X^{[l-1]T}_ {(h_ {l-1},m)}{\ \cdot\ }\cfrac{∂C}{∂Z}^{[l]}_ {(m,h_ {l})}\\
\\
\cfrac{∂C}{∂B}^{[l]}_ {(1,h_ {l})}&=\cfrac{1}{m}\sum{(\cfrac{∂C}{∂Z}^{[l]}_ {(m,h_ {l})})}\\
\end{cases}$$

$\quad$

$$\begin{cases}
\cfrac{∂C}{∂Z}^{[l-1]}_ {(m,h_ {l-1})}&=[\cfrac{∂C}{∂Z}^{[l]}_ {(m,h_ {l})}{\ \cdot\ }W^{[l]T}_ {(h_ {l},h_ {l-1})}]\times[\sigma'^{[l-1]}(Z^{[l-1]}_ {(m,h_ {l-1})})]\\
\\
\cfrac{∂C}{∂W}^{[l-1]}_ {(h_ {l-2},h_ {l-1})}&=\cfrac{1}{m}X^{[l-2]T}_ {(h_ {l-2},m)}{\ \cdot\ }\cfrac{∂C}{∂Z}^{[l-1]}_ {(m,h_ {l-1})}\\
\\
\cfrac{∂C}{∂B}^{[l-1]}_ {(1,h_ {l-1})}&=\cfrac{1}{m}\sum{(\cfrac{∂C}{∂Z}^{[l-1]}_ {(m,h_ {l-1})})}\\
\end{cases}$$

$$\begin{aligned}
\\
&\quad \quad \quad \quad \quad \quad \bullet\\
&\quad \quad \quad \quad \quad \quad \bullet\\
&\quad \quad \quad \quad \quad \quad \bullet\\
\\
\end{aligned}$$

$$\begin{cases}
\cfrac{∂C}{∂Z}^{[2]}_ {(m,h_ {2})}&=[\cfrac{∂C}{∂Z}^{[3]}_ {(m,h_ {3})}{\ \cdot\ }W^{[3]T}_ {(h_ {3},h_ {2})}]\times[\sigma'^{[2]}(Z^{[2]}_ {(m,h_ {2})})]\\
\\
\cfrac{∂C}{∂W}^{[2]}_ {(h_ {1},h_ {2})}&=\cfrac{1}{m}X^{[1]T}_ {(h_ {1},m)}{\ \cdot\ }\cfrac{∂C}{∂Z}^{[2]}_ {(m,h_ {2})}\\
\\
\cfrac{∂C}{∂B}^{[2]}_ {(1,h_ {2})}&=\cfrac{1}{m}\sum{(\cfrac{∂C}{∂Z}^{[2]}_ {(m,h_ {2})})}\\
\end{cases}$$

### Input Layer

$$\begin{cases}
\cfrac{∂C}{∂Z}^{[1]}_ {(m,h_ {1})}&=[\cfrac{∂C}{∂Z}^{[2]}_ {(m,h_ {2})}{\ \cdot\ }W^{[2]T}_ {(h_ {2},h_ {1})}]\times[\sigma'^{[1]}(Z^{[1]}_ {(m,h_ {1})})]\\
\\
\cfrac{∂C}{∂W}^{[1]}_ {(n,h_ {1})}&=\cfrac{1}{m}X^{[0]T}_ {(n,m)}{\ \cdot\ }\cfrac{∂C}{∂Z}^{[1]}_ {(m,h_ {1})}\\
\\
\cfrac{∂C}{∂B}^{[1]}_ {(1,h_ {1})}&=\cfrac{1}{m}\sum{(\cfrac{∂C}{∂Z}^{[1]}_ {(m,h_ {1})})}\\
\end{cases}$$

In [None]:
# layer 3
dZ3 = softmax_backward(Y_hat, Y_train)
dW3 = np.dot(X2.T, dZ3) / m
db3 = np.sum(dZ3, axis=0) / m
# layer 2
dZ2 = np.dot(dZ3, W3.T) * relu_backward(Z2)
dW2 = np.dot(X1.T, dZ2) / m
db2 = np.sum(dZ2, axis=0) / m
# layer 1
dZ1 = np.dot(dZ2, W2.T) * relu_backward(Z1)
dW1 = np.dot(X_train.T, dZ1) / m
db1 = np.sum(dZ1, axis=0) / m

lr = 1e-3
W1 = W1 - lr * dW1
W2 = W2 - lr * dW2
W3 = W3 - lr * dW3
b1 = b1 - lr * db1
b2 = b2 - lr * db2
b3 = b3 - lr * db3

# one full-batch gradient step (one epoch) is now complete
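
A common way to gain confidence in backward-propagation formulas is a numerical gradient check. The sketch below does this for a small one-hidden-layer network with softmax output and cross-entropy cost. All names (`ce_cost`, `backprop_dW1`, `numeric_dW1`) and the random data are hypothetical, and the network is far smaller than the one above so that the check runs quickly.

```python
import numpy as np

def stable_softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def ce_cost(X, Y, W1, b1, W2, b2):
    """Mean cross-entropy of a 1-hidden-layer ReLU network with softmax output."""
    A1 = np.maximum(0, X @ W1 + b1)
    Y_hat = stable_softmax(A1 @ W2 + b2)
    return -np.sum(Y * np.log(Y_hat)) / X.shape[0]

def backprop_dW1(X, Y, W1, b1, W2, b2):
    """dC/dW1 from the backward-propagation formulas."""
    m = X.shape[0]
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    dZ2 = stable_softmax(A1 @ W2 + b2) - Y      # output layer
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)               # hidden layer, elementwise product
    return X.T @ dZ1 / m

def numeric_dW1(X, Y, W1, b1, W2, b2, eps=1e-6):
    """Central-difference estimate of dC/dW1, one entry at a time."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (ce_cost(X, Y, Wp, b1, W2, b2) - ce_cost(X, Y, Wm, b1, W2, b2)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Xs = rng.standard_normal((6, 4))                 # 6 hypothetical examples, 4 features
Ys = np.eye(3)[rng.integers(0, 3, size=6)]       # random one-hot labels, 3 classes
Wa, ba = rng.standard_normal((4, 5)) * 0.5, np.zeros(5)
Wb, bb = rng.standard_normal((5, 3)) * 0.5, np.zeros(3)

max_diff = np.max(np.abs(backprop_dW1(Xs, Ys, Wa, ba, Wb, bb)
                         - numeric_dW1(Xs, Ys, Wa, ba, Wb, bb)))
print(max_diff)   # should be tiny when the analytic and numeric gradients agree
```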

## Train in Batches

In [None]:
# hyperparameters
lr = 1e-3
epochs = 50
batch_size = 16 # batch_size=1 -> like the blind men and the elephant: each update sees only one example

# weight and bias
W1 = np.random.randn(784, 128) * 0.1
b1 = np.random.randn(128) * 0.1
W2 = np.random.randn(128, 64) * 0.1
b2 = np.random.randn(64) * 0.1
W3 = np.random.randn(64, 10) * 0.1
b3 = np.random.randn(10) * 0.1

# set batch
batch_iteration = int(np.ceil(X_train.shape[0] / batch_size))  # the last batch may be smaller
for i in range(epochs):
    loss = acc = val_loss = val_acc = 0
    for j in range(batch_iteration):
        batch_X = X_train[j*batch_size:(j+1)*batch_size]
        batch_Y = Y_train[j*batch_size:(j+1)*batch_size]
        m = batch_X.shape[0]
        # forward propagation
        Z1 = np.dot(batch_X, W1) + b1
        X1 = relu(Z1)
        Z2 = np.dot(X1, W2) + b2
        X2 = relu(Z2)
        Z3 = np.dot(X2, W3) + b3
        Y_hat = softmax(Z3)
        # batch loss (mean squared error, for monitoring only), acc
        batch_loss = np.average(np.square(Y_hat - batch_Y))
        batch_acc = (np.argmax(Y_hat, axis=1) == np.argmax(batch_Y, axis=1)).sum() / m
        loss = loss + batch_loss / batch_iteration
        acc = acc + batch_acc / batch_iteration
        # backward propagation
        # layer 3
        dZ3 = softmax_backward(Y_hat, batch_Y)
        dW3 = np.dot(X2.T, dZ3) / m
        db3 = np.sum(dZ3, axis=0) / m
        # layer 2
        dZ2 = np.dot(dZ3, W3.T) * relu_backward(Z2)
        dW2 = np.dot(X1.T, dZ2) / m
        db2 = np.sum(dZ2, axis=0) / m
        # layer 1
        dZ1 = np.dot(dZ2, W2.T) * relu_backward(Z1)
        dW1 = np.dot(batch_X.T, dZ1) / m
        db1 = np.sum(dZ1, axis=0) / m
        # gradient descent
        W1 = W1 - lr * dW1
        W2 = W2 - lr * dW2
        W3 = W3 - lr * dW3
        b1 = b1 - lr * db1
        b2 = b2 - lr * db2
        b3 = b3 - lr * db3
    # val_loss, val_acc on the held-out test split
    Z1 = np.dot(X_test, W1) + b1
    X1 = relu(Z1)
    Z2 = np.dot(X1, W2) + b2
    X2 = relu(Z2)
    Z3 = np.dot(X2, W3) + b3
    Y_hat = softmax(Z3)
    val_loss = np.average(np.square(Y_hat - Y_test))
    val_acc = (np.argmax(Y_hat, axis=1) == np.argmax(Y_test, axis=1)).sum() / Y_test.shape[0]
    print("Epoch {: 5d}/{}\t- loss: {:.4f} - acc: {:.4f} - val_loss: {:.4f} - val_acc: {:.4f}".format(i+1, epochs, loss, acc, val_loss, val_acc))
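
The loop above visits the batches in the same order every epoch. A common variation, shown here only as a sketch, reshuffles the examples before each epoch. `iterate_minibatches` is a hypothetical helper and the ten-row arrays are made-up data.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield (X_batch, Y_batch) pairs covering every example once, in random order."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
X_small = np.arange(10).reshape(10, 1)   # 10 hypothetical examples, 1 feature
Y_small = np.arange(10)                  # label i belongs to example i

batch_sizes = []
seen = []
for bx, by in iterate_minibatches(X_small, Y_small, batch_size=4, rng=rng):
    batch_sizes.append(len(bx))
    seen.extend(by.tolist())

print(batch_sizes)      # [4, 4, 2]
print(sorted(seen))     # every example appears exactly once
```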

## Validation

In [None]:
import random
i_test = random.randint(0, len(X_test) - 1)  # randint includes both endpoints
img = X_test[i_test]
Z1 = np.dot(np.array([img]), W1) + b1
X1 = relu(Z1)
Z2 = np.dot(X1, W2) + b2
X2 = relu(Z2)
Z3 = np.dot(X2, W3) + b3
Y_hat = softmax(Z3)
predict = label_dict[Y_hat[0].argmax()]
print("predict:", predict)
print("answer:", label_dict[Y_test[i_test].argmax()])
Image.fromarray(img.reshape(28, 28).astype(np.uint8)).resize((140, 140)).show()
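
Checking one random image says little about overall quality. A confusion matrix over the whole test set shows which classes get mixed up. The sketch below uses made-up class indices; in the notebook they would come from `np.argmax(Y_test, axis=1)` and `np.argmax(Y_hat, axis=1)` after a forward pass on `X_test`. The helper `confusion_matrix` is hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """cm[i, j] counts examples of true class i that were predicted as class j."""
    cm = np.zeros((k, k), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Hypothetical class indices for six examples and three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, k=3)
accuracy = np.trace(cm) / cm.sum()        # correct predictions lie on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)     # fraction recovered per true class
print(cm)
print(accuracy, recall)
```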