# Neural Networks



  1. Dataset exploration
  2. Neural Networks Explanation
  3. Neural Networks Implementation
  4. Training
  5. Evaluation
  




### Dataset

Lets talk about the dataset, I have downloaded [Used Car Price Dataset](https://www.kaggle.com/datasets/rishabhkarn/used-car-dataset) from Kaggle website. The dataset contains `13` feature/independent variables($x_i$) and a target/dependant variable($y$), from those 13 feature/independant variables, I will be using 5 feature/independent variables they are `kms_driven` ,`mileage(kmpl)`, `engine(cc)`, `max_power(bhp)` and `torque(Nm). ` and `price(in lakhs)` as the target variable.<br><br>


kms_driven   | mileage(kmpl)| engine(cc) | max_power(bhp) | torque(Nm) | price(in lakhs)
-------------|-------------|------------|----------------|------------|-------------
 56000       |   7.81      |2996        |	   2996        |     333    |       63.75
 30615       |  17.4      |    999     |     999        |     9863   |       8.99
24000        |  20.68     |    1995    |     1995       |      188   |       23.75
18378        |  16.5      |    1353    |     1353       |     13808  |       13.56


<br>
<br>
Here $x$'s are the five-dimensional vector in $\mathbb{R}^5$. For instance, $X_1^{(i)}$ is the `kms_driven`, $x_2^{(i)}$ is the `mileage(kmpl)`, $x_3^{(i)}$ is the `engine(cc)`, $x_4$^{(i)} is the `max_power(bhp)`, and $x_5^{(i)}$ is the `torque(Nm)` of the $i$-th house in the training data


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd /content/drive/MyDrive/files

/content/drive/MyDrive/files


In [None]:
ls

imdb_data.csv  iris.csv  Used_Car_Dataset.csv  yelp.csv


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
np.warnings.filterwarnings('ignore', 'overflow')
from sklearn.model_selection import train_test_split
plt.rcParams['figure.figsize'] = (12.0, 9.0)

In [None]:
df = pd.read_csv("Used_Car_Dataset.csv")
df = df[['kms_driven','mileage(kmpl)', 'engine(cc)', 'max_power(bhp)', 'torque(Nm)', 'price(in lakhs)']]
df = df.sample(frac=0.2).reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   kms_driven       311 non-null    int64  
 1   mileage(kmpl)    311 non-null    float64
 2   engine(cc)       311 non-null    float64
 3   max_power(bhp)   311 non-null    float64
 4   torque(Nm)       311 non-null    float64
 5   price(in lakhs)  311 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 14.7 KB


###Neural Networks Explanation & Implementation

Neural networks refer to broad type of non-linear models/parametrizations $h_\theta(x)$ tht involve combinations of matrix multiplications and other entry-wise non-linear operations. We will start small and slowly build up a neural network, step by step

**Neural Network with a Single Neuron.**
We define a parameterized function $h_\theta(x)$ with input  x, parameterized by $\theta$, which outpus the price of the house $y$. Formally, $h_\theta : x \to y$. Perhaps one of the simplest parametrization would be
$$h_\theta(x) = max(wx + b, 0), \quad\text{where } \theta = (w, b) \in \mathbb{R}^2$$

Here $h_\theta(x)$ returns a single value: $(wx + b)$ or zero, whichever is greater. In the context of neural networks, the function max$\{t, 0\}$ is called ReLU (pronounced "ray-lu"), or rectified linear unit, and often denoted by ReLU$(t) \triangleq \text{max}\{t, 0\}$.
<br>
Generally, a one-dimensional non-linear function that maps $\mathbb{R}$ to $\mathbb{R}$ such as ReLU is often referred to as an **activation function**. The model $h_\theta(x)$ is said to have a singel neuron partly because it has a single non-liear activation function.<br>

When the input $x \in \mathbb{R}^d$ has mulitple dimensions, a neural network with a single neuron can be written as
$$\begin{align} h_\theta(x) = \sigma(w^Tx + b), \quad\text{where } w \in \mathbb{R}^d, b \in \mathbb{R}, \text{ and } \theta = (w, b) \tag{nn.7}
\end{align}$$
The term $b$ is often referred to as the "bias", and the vector $w$ is referred to as the weight vector. Such a neural network has 1 layer.<br>

**Stacking Neurons.**
A more complex neural network may take the single neuron described above and "stack" them together such that one neuron passes its output as input into the next neuron, resulting in a more complex function.








**Two-layer Fully-Connected Neural Networks.**
In the equation $(nn.8)$, based on the prior knowledge we have constructed "family size", "walkable" and "school quality" features based on the inputs. Such a prior knowledge might not be available for other applications. To make it more flexible and general to have a generic parameterization and can be written as intermediate variable $a_1$ as a function of all $x_1,\cdots,x_4$. <br>
$$\begin{align*}
a_1 &= \sigma(w_1^Tx + b_1),\quad\text{where } w_1 \in \mathbb{R}^4 \text{ and  } b_1 \in \mathbb{R} \tag{nn.9}\\
a_2 &= \sigma(w_2^Tx + b_2),\quad \text{where } w_2 \in \mathbb{R}^4 \text{ and  } b_2 \in \mathbb{R} \\
a_3 &= \sigma(w_3^Tx + b_3),\quad \text{where } w_3 \in \mathbb{R}^4 \text{ and  } b_3 \in \mathbb{R}
\end{align*}$$

Now $h_\theta(x)$ is defined similar to equation $(\text{nn}.8)$ and we have a **fully-connected neural network** where the intermediate $a_i$'s depend on all the inputs $x_i$'s.<br>

To fully generalize a two-layer fully-connected neural network with $m$ hidden units and $d$ dimentionsa input $x \in \mathbb{R}^d $ is define as
$$\begin{align*}
\forall j \in [1,\cdots,m], \quad z_j &= w_{j}^{[1]^T} x + b_j^{[1]} \quad \text{where } w_j^{[1]} \in \mathbb{R}^d, b_j^{[1]} \in R \tag{nn.10}\\
a_j &= \sigma(z_j), \\
a &= [a_1,\cdots,a_m]^T \in \mathbb{R}^m\\
h\theta(x) &= w^{[2]^T}a + b^{[2]} \quad \text{where } w^{[2]} \in \mathbb{R}^m,\; b^{[2]} \in \mathbb{R} \tag{nn.11}
\end{align*}$$
Note that by deafult the vectors in $\mathbb{R}^d$ are views as column vectors, and in particular $a$ is a column vector with components $a_1, a_2,\cdots,a_m$. The indices ${}^{[1]}$ and ${}^{[2]}$ are used to distinguish two sets of parametes: the $w_j^{[1]}$'s (each of which is a vector in $\mathbb{R}^d$) and $w^{[2]}$ (which is a vector in $\mathbb{R}^m$)<br>

**Vectorization.** Writing the neural networks in terms of matrix and vector notations, will help us to add more layers and more comlpex structures. Vectorization is the speed perspective in the implementation as it takes advantages of matrix algebra and highly optimized numerical linear algebra packages (e.g., BLAS)<br><br><br>
$$ W^{[1]} = \left[ \begin{array}{c}
-w_1^{[1]^T}- \\
-w_2^{[1]^T}-\\
\vdots\\
-w_m^{[1]^T}-
\end{array}\right] \in \mathbb{R}^{m \times d} \tag{nn.12}$$
<br>
Now by teh definition of matrix vector multiplication, we can write $z = [z_1,...,z_m]^T \in \mathbb{R}^m$ as <br>
$$ \underbrace{\left[ \begin{array}{c} z_1 \\
\vdots \\
\vdots \\
z_m\end{array}\right]}_{z\; \in \; \mathbb{R}^{m \times 1}} =
\underbrace{\left[ \begin{array}{c}
-w_1^{[1]^T}- \\
-w_2^{[1]^T}-\\
\vdots\\
- w_m^{[1]^T}-
\end{array}\right] }_{W^{[1]} \; \in \; \mathbb{R}^{m \times d}}
\underbrace{\left[ \begin{array}{c} x_1 \\
x_2 \\
\vdots \\
x_d\end{array}\right]}_{x\; \in \; \mathbb{R}^{d \times 1}} +
\underbrace{\left[ \begin{array}{c} b_1 \\
b_2 \\
\vdots \\
b_m\end{array}\right]}_{b\; \in \; \mathbb{R}^{m \times 1}} \tag{nn.13}
$$
Or succinctly,
$$ z = W^{[1]}x + b^{[1]} \tag{nn.14}$$


**Multi-layer fully-connected neural networks**. With this succinct notations, we can stack more layers to get a deeper fully-connected neural network. Let r be the number of layers (weight matrices). Let
$W^{[1]}, . . . , W^{[r]}$, $b^{[1]}, . . . ,b^{[r]}$ be the weight matrices and biases of all the layers.
Then a multi-layer neural network can be written as
$$\begin{align}a^{[1]} &= \sigma(W^{[1]}x + b^{[1]}) \\
a^{[2]} &= \sigma(W^{[2]}a^{[1]} + b^{[2]})\\
\cdots\\
a^{[r−1]} &= \sigma(W^{[r−1]}a^{[r−2]} + b^{[r−1]})\\
a^{[r]} &= z^{[r]}=W^{[r]}a^{[r−1]} + b^{[r]}\\
J &= \frac{1}{2}(a^{[r]}-y)^2
\end{align}$$

Here we define both $a^{[r]}$ and $z^{[r]}$ and $h_\theta(x)$ for notational simplicaity. Define<br>
$$ \delta^{[k]} = \frac{\partial J}{\partial z^{[k]} } \quad \text{for } k = r \text{ to } 1.$$

#### Python Implementation

In [None]:
np.random.seed(seed=123456)
X = df[['kms_driven','mileage(kmpl)', 'engine(cc)', 'max_power(bhp)', 'torque(Nm)']].to_numpy()
y = np.log(df[['price(in lakhs)']].to_numpy())

In [None]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.15, random_state=42)

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [None]:
X_test.shape, y_test.shape

((47, 5), (47, 1))

In [None]:
# define inputs, hidden layer size
input_layer = 5
hidden_1 = 10
hidden_2 = 5
hidden_3 = 7
output_layer = 1

In [None]:
cache = {}

In [None]:
def sigmoid(x, derivative=False):
    if derivative:
        return (np.exp(-x))/((np.exp(-x)+1)**2)
    return np.exp(x)/(1 + np.exp(x))


In [None]:
### I'm creating four layer network where three are hidden layer and last is output layer
params = {
    'W1': np.random.randn(hidden_1, input_layer,),
    'b1': np.random.randn(hidden_1, 1),
    'W2': np.random.randn(hidden_2, hidden_1),
    'b2': np.random.randn(hidden_2, 1),
    'W3': np.random.randn(hidden_3, hidden_2),
    'b3': np.random.randn(hidden_3, 1),
    'W4': np.random.randn(output_layer, hidden_3),
    'b4': np.random.randn(output_layer, 1),

}

In [None]:
### Feedforward network
def feed_forward(x, cache, params):
    cache['X'] = x
    cache['Z1'] = np.matmul(params["W1"], cache["X"].T) + params["b1"]
    cache['A1'] = sigmoid(cache['Z1'])
    cache['Z2'] = np.matmul(params["W2"], cache["A1"]) + params["b2"]
    cache['A2'] = sigmoid(cache['Z2'])
    cache['Z3'] = np.matmul(params["W3"], cache["A2"]) + params["b3"]
    cache["A3"] = sigmoid(cache['Z3'])

    cache["output"] = np.matmul(params["W4"], cache["A3"]) + params["b4"]

    return cache["output"]


In [None]:
### loss function
def loss_fuction(y_hat, y):
    lenth = y.shape[0]
    return (2/lenth)*np.sum((y - y_hat))

#### Back Propagation

**Preliminary: chain rule**<br>
We first recall the chain rule in calculus. Suppose the variable J depends on the variables $\theta_1, . . . , \theta_p$ via the intermediate variable $g_1, . . . , g_k$:

$$g_j = g_j(\theta1, . . . , \theta_p), \forall_j \in {1, · · · , k} $$
$$J = J(g_1, . . . , g_k) $$
Here we overload the meaning of $g_j$’s: they denote both the intermediate
variables but also the functions used to compute the intermediate variables.
Then, by the chain rule, we have that $\forall_i$,
$$\frac{\partial J}{\partial\theta_i} = \sum_{j=1}^k
\frac{\partial J}
{\partial g_j}\frac{\partial g_j}{\partial \theta_i}$$
For the ease of invoking the chain rule in the following subsections in various ways, we will call $J$ the output variable, $g_1, . . . , g_k$ intermediate variables, and $\theta_1, . . . , \theta_p$ the input variable in the chain rule.<br>

**Two-layer neural network with vector notation**<br>
$$\text{Let $\quad$ } \delta^{[2]} \overset{\Delta}{=} \frac{\partial J}{\partial o} \in \mathbb{R}$$
$$\delta^{[1]}  \overset{\Delta}{=} \frac{\partial J}{\partial z} \in \mathbb{R}^m$$

**Back-propagation for two-layer neural networks in vectorized
notations**<br>

1. Compute the values of $z \in \mathbb{R}^m, \; a \in \mathbb{R}^m, \text{ and } o$ <br>
2. Compute $\delta^{[2]} = (o - y) \in \mathbb{R}$ <br>
3. Compute $\delta^{[1]} = (o - y) \; \cdot \; W^{[2]^T} \; \odot \sigma'(z) \in \mathbb{R}$ <br>
4. Compute $$\begin{align} \frac{\partial J}{\partial W^{[2]}} &= \delta^{[2]}a^T \in \mathbb{R}^{1 \times m} \\
\frac{\partial J}{\partial b^{[2]}} &= \delta^{[2]} \in \mathbb{R} \\
\frac{\partial J}{\partial W^{[1]}} &= \delta^{[1]}x^T \in \mathbb{R}^{m \times d} \\
\frac{\partial J}{\partial b^{[1]}} &= \delta^{[1]} \in \mathbb{R}^m
\end{align}$$

**Vectorization over training examples** Back-propagation for multi-layer neural networks.
<br>
1. Compute and store the values of $a^{[k]}$'s and $z^{[k]}$'s for $ k =1,...,r-1$ and $J\quad \quad \text{This often called as "forward pass"}$ <br>
2. Compute $\delta^{[r]} = \frac{\partial J}{\partial z^{[r]}} = (z^{[r]} -o)$ <br>
3. $\text{for }\;  k = r - 1 \text{ to } 1 \text{ do}$ <br>
4. Compute $$\delta^{[k]} = \frac{\partial J}{\partial z^{[r]}} = \left( W^{[k+1]^T} \delta^{[k+1]} \right)\odot \sigma'(z^{[k]}) \in \mathbb{R}$$ <br>
5. Compute $$\begin{align} \frac{\partial J}{\partial W^{[k+1]}} &= \delta^{[k+1]}a{[k]^T}\\
\frac{\partial J}{\partial b^{[k+1]}} &= \delta^{[k+1]}  
\end{align}$$ <br><br>

The above logic is applied in the back propagation `python` code, where I calculated  $a^{[k]}$'s and $z^{[k]}$'s and multiplied with respective $W^{[k]}$'s and $b^{[k]}$'s matrices

In [None]:
def back_propagation(x, y, output, params):
    current_batch_size = y.shape[0]

    dloss = output - y.T

    dA3 = np.matmul(params["W4"].T, dloss)
    dZ3 = dA3 * sigmoid(cache["Z3"], derivative=True)
    dA2 = np.matmul(params["W3"].T, dZ3)
    dZ2 = dA2 * sigmoid(cache["Z2"], derivative=True)
    dA1 = np.matmul(params["W2"].T, dZ2)
    dZ1 = dA1 * sigmoid(cache["Z1"], derivative=True)

    dW4 = (1./current_batch_size) * np.matmul(dloss, cache["A3"].T)
    db4 = (1./current_batch_size) * np.sum(dloss, axis=1, keepdims=True)

    dW3 = (1./current_batch_size) * np.matmul(dZ3, cache["A2"].T)
    db3 = (1./current_batch_size) * np.sum(dZ3, axis=1, keepdims=True)

    dW2 = (1./current_batch_size) * np.matmul(dZ2, cache["A1"].T)
    db2 = (1./current_batch_size) * np.sum(dZ2, axis=1, keepdims=True)

    dW1 = (1./current_batch_size) * np.matmul(dZ1, cache["X"])
    db1 = (1./current_batch_size) * np.sum(dZ1, axis=1, keepdims=True)

    grads = {'W4' : dW4, 'b4' : db4, 'W3' : dW3, 'b3': db3, 'W2' : dW2, 'b2' : db2, 'W1' : dW1, 'b1' : db1}
    return  grads

#### Training

In [None]:
def training(x_train, y_train, x_test, y_test, cache, params, epochs=100000000,\
             batch_size = 64, optimizer="sgd", l_rate=0.001):

    num_batches = (x_train.shape[0]//batch_size) + 1
    start_time = time.time()
    template = "Epoch {}: {:.2f}s, train loss={:.2f}"

    # Train
    for i in range(epochs):
        # Shuffle
        permutation = np.random.permutation(x_train.shape[0])
        x_train_shuffled = x_train[permutation]
        y_train_shuffled = y_train[permutation]

        for j in range(num_batches):
            # Batch
            begin = j * batch_size
            end = min(begin + batch_size, x_train.shape[0]-1)
            x = x_train_shuffled[begin:end]
            y = y_train_shuffled[begin:end]

            #forward
            output = feed_forward(x, cache, params)
            #backprop
            grads = back_propagation(x, y, output, params)
            for key in params:
                params[key] = params[key] - l_rate * grads[key]

            # import sys
            # sys.exit(0)

        # Training data
        output = feed_forward(x_train, cache, params)
        train_loss = loss_fuction(output,  y_train)


        print(template.format(i+1, time.time()-start_time,  train_loss))





In [None]:
training(X_train, y_train, X_test, y_test, cache, params, epochs=200,\
             batch_size = 64, )

Epoch 1: 0.00s, train loss=1619.39
Epoch 2: 0.01s, train loss=1587.97
Epoch 3: 0.01s, train loss=1557.30
Epoch 4: 0.02s, train loss=1525.95
Epoch 5: 0.02s, train loss=1497.27
Epoch 6: 0.03s, train loss=1469.71
Epoch 7: 0.03s, train loss=1442.22
Epoch 8: 0.03s, train loss=1414.33
Epoch 9: 0.03s, train loss=1387.58
Epoch 10: 0.04s, train loss=1360.82
Epoch 11: 0.04s, train loss=1334.51
Epoch 12: 0.04s, train loss=1308.66
Epoch 13: 0.05s, train loss=1283.49
Epoch 14: 0.05s, train loss=1257.16
Epoch 15: 0.06s, train loss=1233.28
Epoch 16: 0.06s, train loss=1209.28
Epoch 17: 0.06s, train loss=1184.79
Epoch 18: 0.06s, train loss=1161.20
Epoch 19: 0.06s, train loss=1138.69
Epoch 20: 0.07s, train loss=1115.88
Epoch 21: 0.07s, train loss=1094.02
Epoch 22: 0.07s, train loss=1071.53
Epoch 23: 0.07s, train loss=1050.60
Epoch 24: 0.08s, train loss=1030.39
Epoch 25: 0.08s, train loss=1010.88
Epoch 26: 0.08s, train loss=990.72
Epoch 27: 0.08s, train loss=971.02
Epoch 28: 0.08s, train loss=952.96
Epoc

#### Evaluation

In [None]:
# Test data
output = feed_forward(X_test, cache, params)
output

array([[2.11514447, 2.07164962, 2.08335919, 2.07228222, 2.08213417,
        2.08069017, 2.08242817, 2.08237428, 2.08408262, 2.06829722,
        2.09371075, 2.07656254, 2.08953671, 2.07294164, 2.0896713 ,
        2.07293797, 2.08037512, 2.09703914, 2.08302123, 2.07961042,
        2.07361027, 2.0839098 , 2.09911021, 2.08584386, 2.09733804,
        2.0884576 , 2.09688344, 2.08387309, 2.0849225 , 2.08807728,
        2.08104995, 2.10434862, 2.09108127, 2.0922988 , 2.10699648,
        2.09559404, 2.09637999, 2.08859352, 2.07247683, 2.0858388 ,
        2.07508118, 2.06928778, 2.07694884, 2.07016694, 2.07371564,
        2.0796634 , 2.08692599]])

In [None]:
results = pd.DataFrame({'predicted_value': output.T[:,0], 'actual_value':y_test[:,0]})
results[:5]

Unnamed: 0,predicted_value,actual_value
0,2.115144,1.18479
1,2.07165,2.99072
2,2.083359,2.297573
3,2.072282,1.7492
4,2.082134,1.938742


In [None]:
from sklearn.metrics import mean_squared_error

errors = mean_squared_error(y_test, output.T, squared=True)
# report error
print(errors)

1.3400598060859434


**Observations**
1. The model is too complex for the data <br>
2. when the complete dataset is taken, the params and cache are filled with `nan` <br>
3. Using Sigmoid have overfit the model <br>
4. Better option is using **ReLU** and 2 layer network <br>
5. Using L1/L2 regularization is again good choice <br>

**References:**<br>
1. https://github.com/lionelmessi6410/Neural-Networks-from-Scratch/blob/main/NN-from-Scratch.ipynb
2. https://cs229.stanford.edu/lectures-spring2022/main_notes.pdf