---
layout: post
title:  "Dense Neural Network"
date:   2023-03-08 10:14:54 +0700
categories: jekyll update
---

# TOC

- [Definition](#define)
- [Backpropagation](#backprop)
- [Gradient descent](#grad)
- [Code example](#code)


# Definition

Recall the linear combination of the input x (note that the features in x can themselves be nonlinear):

$$ \hat{y}=h_{\theta}(x) = \theta \cdot x $$ 

Also recall how we wrapped this linear combination in various nonlinear functions (sigmoid, sign, softmax). Other popular nonlinear functions include tanh and ReLU. In general, these transformations are called activation functions. They transform the data as it flows through the model, and they are what makes the model useful for complex problems: without them, the whole model would collapse into one big linear combination.

In deep learning, each of these nonlinear transformations is one neuron. Hence the perceptron has one neuron, and since it uses the sign function, we can call it a sign neuron. The last neurons, which output classes (using softmax) or values, form the output layer. The layers of neurons between the input and the output layer are called hidden layers, since they keep transforming the input before the network outputs something for classification or regression.

A network in which each neuron of one layer is connected to (that is, feeds its output into) every neuron of the next layer is called a dense network, or a fully connected feedforward network. It is called feedforward (or sequential) because the input flows, and is transformed, one way forward from the input layer to the output layer.

## ReLU

ReLU, short for rectified linear unit, is a fast, simple, and successful activation function. Writing $$ z = h_{\theta}(x) $$ for the linear combination, it is:

$$ ReLU(z) = \max(0, z) $$

ReLU returns either 0 or the linear combination of input, whichever is greater.
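
As a quick illustration, ReLU is a one-liner in NumPy. This is a minimal sketch; the function name `relu` and the sample inputs are illustrative choices, not part of any library.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

# Illustrative inputs: negative entries become 0, the rest pass through unchanged
z = np.array([-3.0, -0.5, 0.0, 2.0, 7.5])
print(relu(z))
```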

## A 2-layer neural network

A 2-layer neural network has one hidden (middle) layer and one output layer. Let's begin the calculation. Say we have 3 attributes $$ x_1, x_2, x_3 $$; the linear combination would be:

$$ \hat{y} = h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $$

with $$ x_0 = 1 $$ and $$ \theta_0 $$ as the bias. Give each hidden neuron its own set of parameters, pass the linear combination through ReLU, and we have the first neuron of the hidden layer:

$$ a_{11} = ReLU(\theta_{01} x_0 + \theta_{11} x_1 + \theta_{21} x_2 + \theta_{31} x_3) $$

For the second neuron of the hidden layer:

$$ a_{21} = ReLU(\theta_{02} x_0 + \theta_{12} x_1 + \theta_{22} x_2 + \theta_{32} x_3) $$

For the output layer:

$$ \hat{y} = h_\theta(x) = \theta_0 + \theta_1 a_{11} + \theta_2 a_{21}  $$

Usually we don't use an activation function for an output layer that predicts a value (regression). If we need a nonnegative value, we can use ReLU. For a classification problem, we can use a softmax.
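
The forward computation of this 2-layer network can be sketched for a single sample as follows. The function name `forward`, the array layout (one row of weights per hidden neuron, bias first), and the numeric values are assumptions made for illustration only.

```python
import numpy as np

def forward(x, theta_hidden, theta_out):
    """Forward pass of the 2-layer network for one sample.

    x            : features (x_1, x_2, x_3), shape (3,)
    theta_hidden : one row per hidden neuron, bias first, shape (2, 4)
    theta_out    : output weights (theta_0, theta_1, theta_2), shape (3,)
    """
    x = np.concatenate([[1.0], x])        # prepend x_0 = 1 for the bias
    a = np.maximum(0, theta_hidden @ x)   # hidden activations a_11, a_21
    a = np.concatenate([[1.0], a])        # bias input for the output layer
    return theta_out @ a                  # no activation: regression output

# Hypothetical weights and input, chosen only for illustration
theta_hidden = np.array([[0.5, 1.0, -1.0, 0.0],
                         [-1.0, 0.0, 2.0, 1.0]])
theta_out = np.array([0.1, 2.0, -1.0])
x = np.array([1.0, 2.0, 3.0])
y_hat = forward(x, theta_hidden, theta_out)
print(y_hat)
```

With these values the first hidden neuron receives a negative input, so ReLU switches it off and only the second neuron contributes to the prediction.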

With this setup, we generally use MSE as the loss function for a regression problem and cross entropy for a classification problem. To minimize the loss function, we use gradient descent. Backpropagation is a technique for calculating the gradient so that we can use it in the descent step. In essence, training a neural network means:

- randomly initialize the parameter vector

- use those starting parameters to do a forward calculation (multiply with the input, then transform), outputting a prediction

- measure the error of the prediction

- do a backward pass: calculate how much each parameter is responsible for the error (i.e. take the partial derivative of the error with respect to each parameter, since the gradient measures how much the error changes given a small change in each parameter)

- update the parameters in the direction that descends the gradient, so that the error moves toward its minimum

The backward pass is called backpropagation: we backward propagate the error.
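
The five steps above can be sketched as a short NumPy training loop. Everything specific here is an assumption for illustration: the toy data, the learning rate, the number of epochs, the variable names, and the choice to average the loss over the samples. The gradient expressions follow from the chain rule applied to a ReLU hidden layer and a linear output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 samples, 3 features, noiseless linear target
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
Xb = np.hstack([np.ones((20, 1)), X])         # prepend x_0 = 1

# 1. randomly initialize the parameters
W = rng.normal(scale=0.5, size=(4, 2))        # hidden layer, one column per neuron
v = rng.normal(scale=0.5, size=3)             # output layer, bias first

lr = 0.01                                     # learning rate (assumed value)
losses = []
for epoch in range(200):
    # 2. forward pass
    Z = Xb @ W                                # inputs to the ReLUs
    A = np.maximum(0, Z)                      # hidden activations
    Ab = np.hstack([np.ones((20, 1)), A])
    y_hat = Ab @ v

    # 3. measure the error (half squared error, averaged over samples)
    err = y_hat - y
    losses.append(0.5 * np.mean(err ** 2))

    # 4. backward pass: chain rule, layer by layer
    grad_v = Ab.T @ err / len(y)
    delta = np.outer(err, v[1:]) * (Z > 0)    # error signal at the hidden layer
    grad_W = Xb.T @ delta / len(y)

    # 5. gradient descent update
    v -= lr * grad_v
    W -= lr * grad_W

print(losses[0], losses[-1])
```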

## Backpropagation

Here is the loss function:

$$ L = \frac{1}{2}(y - \hat{y})^2 $$ 

Here are the derivatives of the loss function with respect to the parameters in the output layer $$ \theta_0, \theta_1, \theta_2 $$. Note that $$ \frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) $$:

$$ \frac{\partial L}{\partial \theta_{0}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \theta_{0}} = -(y - \hat{y}) \frac{\partial (\theta_0 + \theta_1 a_{11} + \theta_2 a_{21})}{\partial \theta_0} = -(y - \hat{y}) $$


$$ \frac{\partial L}{\partial \theta_1} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \theta_{1}} = -(y - \hat{y}) \frac{\partial (\theta_0 + \theta_1 a_{11} + \theta_2 a_{21})}{\partial \theta_1} = -(y - \hat{y}) a_{11} $$

$$ \frac{\partial L}{\partial \theta_2} = -(y - \hat{y})a_{21} $$

Here are the derivatives of the loss function with respect to the parameters $$ \theta_{01},..,\theta_{31}, \theta_{02},..,\theta_{32} $$ in the hidden layer. Let $$ z_1 = \theta_{01} x_0 + \theta_{11} x_1 + \theta_{21} x_2 + \theta_{31} x_3 $$ be the input to the ReLU of the first hidden neuron, so that $$ a_{11} = ReLU(z_1) $$:

$$ \frac{\partial L}{\partial \theta_{01}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_{11}} \frac{\partial a_{11}}{\partial \theta_{01}} = -(y - \hat{y}) \theta_{1} \frac{\partial ReLU(z_1)}{\partial \theta_{01}} $$

with $$ \frac{\partial ReLU(z_1)}{\partial \theta_{01}} =
\begin{cases}
      0 & \text{if } z_1 < 0\\
      x_0 = 1 & \text{if } z_1 > 0\\
\end{cases}
$$

$$ \Leftrightarrow \frac{\partial L}{\partial \theta_{01}} =
\begin{cases}
      0 & \text{if } z_1 < 0\\
      -(y - \hat{y}) \theta_{1} & \text{if } z_1 > 0\\
\end{cases}
$$

The other hidden-layer parameters follow the same pattern: for $$ \theta_{i1} $$ the factor $$ x_0 $$ is replaced by $$ x_i $$, and for the second neuron $$ \theta_1 $$ and $$ z_1 $$ are replaced by $$ \theta_2 $$ and $$ z_2 $$.
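
One way to gain confidence in chain-rule derivatives like these is to compare them against a finite-difference estimate. The sketch below does that for $$ \theta_{01} $$ on a single sample; the function names `loss` and `gradients` and all numeric values are hypothetical, chosen so that the first hidden neuron is active and the second is switched off by ReLU.

```python
import numpy as np

def loss(theta_hidden, theta_out, x, y):
    """L = 1/2 (y - y_hat)^2 for one sample; x already includes x_0 = 1."""
    a = np.maximum(0, theta_hidden @ x)
    y_hat = theta_out[0] + theta_out[1:] @ a
    return 0.5 * (y - y_hat) ** 2

def gradients(theta_hidden, theta_out, x, y):
    """Analytic gradients obtained with the chain rule."""
    z = theta_hidden @ x                          # inputs to the ReLUs
    a = np.maximum(0, z)
    y_hat = theta_out[0] + theta_out[1:] @ a
    dL_dyhat = -(y - y_hat)
    grad_out = dL_dyhat * np.concatenate([[1.0], a])
    grad_hidden = np.outer(dL_dyhat * theta_out[1:] * (z > 0), x)
    return grad_hidden, grad_out

# Hypothetical parameters and sample
theta_hidden = np.array([[0.2, -0.1, 0.4, 0.3],
                         [-0.5, -0.6, 0.1, -0.2]])
theta_out = np.array([0.1, 0.7, -0.3])
x = np.array([1.0, 2.0, 1.0, 0.5])
y = 1.5

grad_hidden, grad_out = gradients(theta_hidden, theta_out, x, y)

# Central finite difference with respect to theta_01
eps = 1e-6
plus, minus = theta_hidden.copy(), theta_hidden.copy()
plus[0, 0] += eps
minus[0, 0] -= eps
numeric = (loss(plus, theta_out, x, y) - loss(minus, theta_out, x, y)) / (2 * eps)
print(grad_hidden[0, 0], numeric)
```

The two printed numbers should agree closely, and the gradient row of the inactive second neuron is all zeros.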

## Code example

Consider the following worked example with 3 features and 2 input data points:

| $$ x_1 $$ | $$ x_2$$  |  $$x_3  $$ | y| 
|--|--|--|--|
| -2 | 4 | 3 | 380 | 
| 5  | 8 | 10| 1950|

We have 11 parameters:

|$$\theta_{01}$$|$$\theta_{11}$$|$$\theta_{21}$$|$$\theta_{31}$$| $$\theta_{02}$$|$$\theta_{12}$$|$$\theta_{22}$$|$$\theta_{32}$$| $$\theta_0$$|$$\theta_1$$|$$\theta_2$$ |
|--|--|--|--|--|--|--|--|--|--|--|
|4|7|8|2|4|5|7|9|1|10|3|


```python
import numpy as np

# Inputs (2 data points, 3 features) and targets
X = np.array([[-2, 4, 3], [5, 8, 10]])
y = np.array([[380], [1950]])

# Hidden layer: one row per neuron, bias first. Output layer: bias first.
theta = np.array([[4, 7, 8, 2], [4, 5, 7, 9]])
theta_output = np.array([1, 10, 3]).reshape(3, 1)

# Prepend the bias column x_0 = 1
o = np.ones(2).reshape(2, 1)
X = np.concatenate([o, X], axis=1)
# X is now
# array([[ 1., -2.,  4.,  3.],
#        [ 1.,  5.,  8., 10.]])

# Hidden layer: transpose theta so that each column holds one neuron's
# parameters (a plain reshape(4, 2) would scramble them), then apply ReLU
X2 = np.maximum(0, np.dot(X, theta.T))
X2 = np.concatenate([o, X2], axis=1)
# X2 is now
# array([[  1.,  28.,  49.],
#        [  1., 123., 175.]])

# y hat
y_hat = np.dot(X2, theta_output)

# error
y - y_hat
# array([[-48.],
#        [194.]])

# loss
L = np.sum(np.square(y - y_hat)) / 2
L
# 19970.0
```
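
To connect the worked example with gradient descent, here is a sketch of a single update step on the same data. It lays out the hidden parameters one row per neuron, exactly as listed in the parameter table. The learning rate `lr`, the helper `total_loss`, and the other variable names are assumptions made for illustration.

```python
import numpy as np

# Data and parameters from the tables above (x_0 = 1 already prepended)
X = np.array([[1, -2, 4, 3], [1, 5, 8, 10]], dtype=float)
y = np.array([380, 1950], dtype=float)
theta = np.array([[4, 7, 8, 2], [4, 5, 7, 9]], dtype=float)   # one row per hidden neuron
theta_output = np.array([1, 10, 3], dtype=float)

def total_loss(theta, theta_output):
    """Sum over samples of 1/2 (y - y_hat)^2."""
    A = np.maximum(0, X @ theta.T)
    y_hat = theta_output[0] + A @ theta_output[1:]
    return np.sum((y - y_hat) ** 2) / 2

# Forward pass, keeping the intermediate values for the backward pass
Z = X @ theta.T
A = np.maximum(0, Z)
y_hat = theta_output[0] + A @ theta_output[1:]
err = -(y - y_hat)                                    # dL/dy_hat per sample

# Backward pass: sum the per-sample gradients
grad_output = np.concatenate([[err.sum()], A.T @ err])
grad_theta = ((err[:, None] * theta_output[1:]) * (Z > 0)).T @ X

# One descent step with an assumed learning rate
lr = 1e-6
loss_before = total_loss(theta, theta_output)
loss_after = total_loss(theta - lr * grad_theta, theta_output - lr * grad_output)
print(loss_before, loss_after)
```

The learning rate has to be this small because the activations and errors in this example are in the hundreds, which makes the gradients very large; a larger step could overshoot and increase the loss.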