# Neural Networks



  1. Dataset exploration
  2. Neural Networks Explanation
  3. Neural Networks Implementation
  4. Training
  5. Evaluation
  




### Dataset

Let's start with the dataset. I downloaded the [Used Car Price Dataset](https://www.kaggle.com/datasets/rishabhkarn/used-car-dataset) from Kaggle. It contains `13` feature (independent) variables ($x_i$) and one target (dependent) variable ($y$). Of those 13 features I will use 5: `kms_driven`, `mileage(kmpl)`, `engine(cc)`, `max_power(bhp)` and `torque(Nm)`, with `price(in lakhs)` as the target variable.<br><br>


kms_driven | mileage(kmpl) | engine(cc) | max_power(bhp) | torque(Nm) | price(in lakhs)
-----------|---------------|------------|----------------|------------|----------------
56000      | 7.81          | 2996       | 2996           | 333        | 63.75
30615      | 17.4          | 999        | 999            | 9863       | 8.99
24000      | 20.68         | 1995       | 1995           | 188        | 23.75
18378      | 16.5          | 1353       | 1353           | 13808      | 13.56


<br>
<br>
Here each $x^{(i)}$ is a five-dimensional vector in $\mathbb{R}^5$. For instance, $x_1^{(i)}$ is the `kms_driven`, $x_2^{(i)}$ is the `mileage(kmpl)`, $x_3^{(i)}$ is the `engine(cc)`, $x_4^{(i)}$ is the `max_power(bhp)`, and $x_5^{(i)}$ is the `torque(Nm)` of the $i$-th car in the training data.


In [40]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [41]:
cd /content/drive/MyDrive/files

/content/drive/MyDrive/files


In [42]:
ls

imdb_data.csv  iris.csv  Used_Car_Dataset.csv  yelp.csv


In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
plt.rcParams['figure.figsize'] = (12.0, 9.0)

In [44]:
df = pd.read_csv("Used_Car_Dataset.csv")
df = df[['kms_driven','mileage(kmpl)', 'engine(cc)', 'max_power(bhp)', 'torque(Nm)', 'price(in lakhs)']]
df = df.sample(frac=.1).reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   kms_driven       155 non-null    int64  
 1   mileage(kmpl)    154 non-null    float64
 2   engine(cc)       154 non-null    float64
 3   max_power(bhp)   154 non-null    float64
 4   torque(Nm)       154 non-null    float64
 5   price(in lakhs)  155 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 7.4 KB


### Neural Networks Explanation & Implementation

Neural networks refer to a broad class of non-linear models/parametrizations $h_\theta(x)$ that involve combinations of matrix multiplications and other entry-wise non-linear operations. We will start small and build up a neural network step by step.

**Neural Network with a Single Neuron.**
We define a parameterized function $h_\theta(x)$ with input $x$, parameterized by $\theta$, which outputs the price of the house $y$. Formally, $h_\theta : x \to y$. Perhaps one of the simplest parametrizations would be
$$h_\theta(x) = \max(wx + b, 0), \quad\text{where } \theta = (w, b) \in \mathbb{R}^2$$

Here $h_\theta(x)$ returns a single value: $(wx + b)$ or zero, whichever is greater. In the context of neural networks, the function $\max\{t, 0\}$ is called ReLU (pronounced "ray-lu"), or rectified linear unit, and is often denoted by $\text{ReLU}(t) \triangleq \max\{t, 0\}$.
<br>
Generally, a one-dimensional non-linear function that maps $\mathbb{R}$ to $\mathbb{R}$, such as ReLU, is referred to as an **activation function**. The model $h_\theta(x)$ is said to have a single neuron partly because it has a single non-linear activation function.<br>

When the input $x \in \mathbb{R}^d$ has multiple dimensions, a neural network with a single neuron can be written as
$$\begin{align} h_\theta(x) = \text{ReLU}(w^Tx + b), \quad\text{where } w \in \mathbb{R}^d, b \in \mathbb{R}, \text{ and } \theta = (w, b) \tag{nn.7}
\end{align}$$
The term $b$ is often referred to as the "bias", and the vector $w$ is referred to as the weight vector. Such a neural network has 1 layer.<br>
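
To make equation (nn.7) concrete, here is a minimal NumPy sketch of a single neuron. The function names `relu` and `single_neuron` and the numeric values are illustrative assumptions, not part of the dataset or of the implementation later in this notebook.

```python
import numpy as np

def relu(t):
    """Rectified linear unit: max(t, 0), applied entry-wise."""
    return np.maximum(t, 0.0)

def single_neuron(x, w, b):
    """Single-neuron model h_theta(x) = ReLU(w^T x + b) for x in R^d."""
    return relu(np.dot(w, x) + b)

# Hypothetical values for a d = 3 input (not taken from the dataset)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.1

h_neg = single_neuron(x, w, b)    # w^T x + b = -0.65, so ReLU clips it to 0
h_pos = single_neuron(x, -w, b)   # -w^T x + b = 0.85, passed through unchanged
```

The first call shows the "rectified" part: a negative pre-activation produces an output of exactly zero.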

**Stacking Neurons.**
A more complex neural network may take the single neuron described above and "stack" copies of it, so that one neuron passes its output as input to the next neuron, resulting in a more complex function.

Let us now deepen the house price prediction example. In addition to the size of the house, suppose that you know the number of bedrooms, the zip code and the wealth of the neighborhood. Building neural networks is analogous to building with Lego bricks: you take individual bricks and stack them together to build complex structures. The same applies to neural networks: we take individual neurons and stack them together to create complex neural networks.

Given these features (size, number of bedrooms, zip code, and wealth), we might decide that the price of the house depends on the maximum family size it can accommodate. Suppose the family size is a function of the size of the house and the number of bedrooms. The zip code may provide additional information, such as how walkable the neighborhood is (i.e., can you walk to the grocery store or do you need to drive everywhere). Combining the zip code with the wealth of the neighborhood may predict the quality of the local elementary school. Given these three derived features (family size, walkable, school quality), we may conclude that the price of the home ultimately depends on these three features.
<br>

Formally, the input to a neural network is a set of input features $x_1, x_2, x_3, x_4$. We denote the intermediate variables for "family size", "walkable", and "school quality" by $a_1, a_2, a_3$, and represent each of these $a_i$'s as a neural network with a single neuron that takes a subset of $x_1,\cdots,x_4$ as inputs. We then have the parameterization:<br>
$$\begin{align*}
a_1 &= \text{ReLU}(\theta_1x_1 + \theta_2x_2 + \theta_3) \\
a_2 &= \text{ReLU}(\theta_4x_3 + \theta_5) \\
a_3 &= \text{ReLU}(\theta_6x_3 + \theta_7x_4 + \theta_8)
\end{align*}$$
where $(\theta_1, \theta_2,\cdots,\theta_8)$ are parameters. Now we represent the final output $h_\theta(x)$ as another linear function with $a_1, a_2, a_3$ as inputs, and we get:
$$h_\theta(x) = \theta_9a_1 + \theta_{10}a_2 + \theta_{11}a_3 + \theta_{12} \tag{nn.8}$$
where $\theta$ contains all the parameters $(\theta_1,\cdots,\theta_{12})$.<br>
This makes the output a fairly complex function of $x$ with parameters $\theta$. You can then use this parametrization $h_\theta$ with the usual training machinery (a loss function minimized by gradient descent) to learn the parameters $\theta$.
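
The hand-designed network of equation (nn.8) can be sketched in a few lines. The function name `house_price_model`, the 0-based indexing of `theta`, and the input values are assumptions made purely for illustration.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def house_price_model(x, theta):
    """Evaluate (nn.8); theta[k] holds theta_{k+1} from the text."""
    x1, x2, x3, x4 = x  # size, bedrooms, zip code, wealth
    a1 = relu(theta[0] * x1 + theta[1] * x2 + theta[2])   # "family size"
    a2 = relu(theta[3] * x3 + theta[4])                   # "walkable"
    a3 = relu(theta[5] * x3 + theta[6] * x4 + theta[7])   # "school quality"
    return theta[8] * a1 + theta[9] * a2 + theta[10] * a3 + theta[11]

# Hypothetical input and parameters (all parameters set to 1 for easy checking)
x = np.array([2.0, 3.0, 1.0, 4.0])
theta = np.ones(12)

# a1 = 6, a2 = 2, a3 = 6, so the output is 6 + 2 + 6 + 1 = 15
h = house_price_model(x, theta)
```

Note that each hidden unit only sees the inputs we decided were relevant to it, which is exactly the prior knowledge built into this design.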







**Two-layer Fully-Connected Neural Networks.**
In equation $(\text{nn}.8)$ we used prior knowledge to construct the "family size", "walkable" and "school quality" features from the inputs. Such prior knowledge might not be available in other applications. A more flexible and general approach is a generic parameterization in which each intermediate variable, e.g. $a_1$, is written as a function of all of $x_1,\cdots,x_4$:<br>
$$\begin{align*}
a_1 &= \text{ReLU}(w_1^Tx + b_1),\quad\text{where } w_1 \in \mathbb{R}^4 \text{ and  } b_1 \in \mathbb{R} \tag{nn.9}\\
a_2 &= \text{ReLU}(w_2^Tx + b_2),\quad \text{where } w_2 \in \mathbb{R}^4 \text{ and  } b_2 \in \mathbb{R} \\
a_3 &= \text{ReLU}(w_3^Tx + b_3),\quad \text{where } w_3 \in \mathbb{R}^4 \text{ and  } b_3 \in \mathbb{R}
\end{align*}$$

Now $h_\theta(x)$ is defined similar to equation $(\text{nn}.8)$ and we have a **fully-connected neural network** where the intermediate $a_i$'s depend on all the inputs $x_i$'s.<br>

In full generality, a two-layer fully-connected neural network with $m$ hidden units and a $d$-dimensional input $x \in \mathbb{R}^d$ is defined as
$$\begin{align*}
\forall j \in [1,\cdots,m], \quad z_j &= {w_j^{[1]}}^T x + b_j^{[1]} \quad \text{where } w_j^{[1]} \in \mathbb{R}^d,\; b_j^{[1]} \in \mathbb{R} \tag{nn.10}\\
a_j &= \text{ReLU}(z_j), \\
a &= [a_1,\cdots,a_m]^T \in \mathbb{R}^m\\
h_\theta(x) &= {w^{[2]}}^T a + b^{[2]} \quad \text{where } w^{[2]} \in \mathbb{R}^m,\; b^{[2]} \in \mathbb{R} \tag{nn.11}
\end{align*}$$
Note that by default vectors in $\mathbb{R}^d$ are viewed as column vectors, and in particular $a$ is a column vector with components $a_1, a_2,\cdots,a_m$. The indices ${}^{[1]}$ and ${}^{[2]}$ are used to distinguish two sets of parameters: the $w_j^{[1]}$'s (each of which is a vector in $\mathbb{R}^d$) and $w^{[2]}$ (which is a vector in $\mathbb{R}^m$).<br>
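
Equations (nn.10) and (nn.11) translate almost line by line into a loop over the hidden units. The sketch below is illustrative only: the function name `two_layer_forward_loop` and the toy numbers are assumptions, and the rows of `W1` play the role of the vectors $w_j^{[1]}$.

```python
import numpy as np

def two_layer_forward_loop(x, W1, b1, w2, b2):
    """Two-layer network for one input x in R^d, one hidden unit at a time."""
    m = W1.shape[0]
    a = np.zeros(m)
    for j in range(m):
        z_j = np.dot(W1[j], x) + b1[j]   # (nn.10): z_j = w_j^T x + b_j
        a[j] = max(z_j, 0.0)             # a_j = ReLU(z_j)
    return np.dot(w2, a) + b2            # (nn.11): linear output layer

# Hypothetical example with d = 2 inputs and m = 3 hidden units
x = np.array([1.0, 2.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [-1.0, -1.0]])
b1 = np.zeros(3)
w2 = np.ones(3)
b2 = 0.5

# z = [1, 2, -3], a = [1, 2, 0], output = 1 + 2 + 0 + 0.5 = 3.5
h = two_layer_forward_loop(x, W1, b1, w2, b2)
```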

**Vectorization.**
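
One way to vectorize (nn.10) and (nn.11) is to stack the row vectors ${w_j^{[1]}}^T$ into a matrix $W^{[1]} \in \mathbb{R}^{m \times d}$ and the biases into $b^{[1]} \in \mathbb{R}^m$, so that $z = W^{[1]}x + b^{[1]}$ is computed in a single matrix product. The sketch below goes one step further and processes a whole batch at once; the function name, the random toy data and the row-per-example layout are assumptions chosen for illustration.

```python
import numpy as np

def two_layer_forward(X, W1, b1, w2, b2):
    """Vectorized two-layer network.

    X: (n, d) batch of inputs, W1: (m, d), b1: (m,), w2: (m,), b2: scalar.
    Returns an array of n predictions.
    """
    Z = X @ W1.T + b1          # (n, m): all pre-activations at once
    A = np.maximum(Z, 0.0)     # entry-wise ReLU
    return A @ w2 + b2         # (n,)

rng = np.random.default_rng(0)
n, d, m = 4, 5, 3
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(m, d))
b1 = rng.normal(size=m)
w2 = rng.normal(size=m)
b2 = 0.25

h_vec = two_layer_forward(X, W1, b1, w2, b2)

# Reference: the same computation written unit by unit and example by example
h_loop = np.array([
    sum(w2[j] * max(W1[j] @ x + b1[j], 0.0) for j in range(m)) + b2
    for x in X
])
```

Both versions compute the same numbers; the vectorized one simply hands the loops over to optimized matrix routines.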


### Python Implementation

In [45]:
np.random.seed(seed=123456)
X = df[['kms_driven','mileage(kmpl)', 'engine(cc)', 'max_power(bhp)', 'torque(Nm)']].to_numpy()
y = np.log(df[['price(in lakhs)']].to_numpy())
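
Because the target is the natural log of the price, anything the network predicts lives in log space and has to be mapped back with `np.exp` before it can be read as a price in lakhs. A small sketch of that round trip, using the four prices from the sample table above:

```python
import numpy as np

# Prices (in lakhs) copied from the sample rows shown earlier
price_lakhs = np.array([[63.75], [8.99], [23.75], [13.56]])

y_log = np.log(price_lakhs)     # what the network is trained to predict
recovered = np.exp(y_log)       # inverse transform back to lakhs
```

The log transform compresses the wide range of prices, which usually makes a squared-error loss less dominated by the most expensive cars.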

In [46]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.15, random_state=42)
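
The cell above fits the scaler on all rows before splitting, so the test rows influence the minimum and maximum used for scaling. A common alternative is to compute the scaling statistics on the training rows only and reuse them for the test rows. The sketch below shows the idea with plain NumPy on hypothetical toy data; the array names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(20, 3))   # hypothetical feature matrix

# Simple random split: 17 training rows, 3 test rows
idx = rng.permutation(len(X_toy))
X_tr, X_te = X_toy[idx[:17]], X_toy[idx[17:]]

# Min-max statistics come from the training rows only
col_min = X_tr.min(axis=0)
col_range = X_tr.max(axis=0) - col_min
col_range[col_range == 0] = 1.0             # guard against constant columns

X_tr_scaled = (X_tr - col_min) / col_range
X_te_scaled = (X_te - col_min) / col_range  # may fall slightly outside [0, 1]
```

With scikit-learn the same idea is `fit_transform` on the training split followed by `transform` on the test split.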

In [47]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [48]:
X_test.shape, y_test.shape

((24, 5), (24, 1))

In [49]:
# define inputs, hidden layer size
input_layer = 5
hidden_1 = 10
hidden_2 = 5
hidden_3 = 7
output_layer = 1

In [50]:
cache = {}

In [51]:
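def sigmoid(x, derivative=False):
    # Stable logistic function: exp() only sees non-positive arguments, so it cannot overflow
    e = np.exp(-np.abs(x))
    s = np.where(x >= 0, 1/(1 + e), e/(1 + e))
    if derivative:
        return s * (1 - s)  # same as exp(-x)/(1 + exp(-x))**2
    return s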
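
A quick way to gain confidence in an activation's derivative is to compare it against a central finite difference. The sketch below does this for the logistic sigmoid, whose derivative can be written either as $\sigma(x)(1-\sigma(x))$ or as $e^{-x}/(1+e^{-x})^2$. The helper names and the step size `eps` are assumptions for this check only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
eps = 1e-6

# Central difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# The two closed forms of the derivative
closed_form = np.exp(-x) / (np.exp(-x) + 1) ** 2
max_err = np.max(np.abs(numeric - sigmoid_grad(x)))
```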


In [52]:
### Four-layer network: three hidden layers followed by one output layer
params = {
    'W1': np.random.randn(hidden_1, input_layer),
    'b1': np.random.randn(hidden_1, 1),
    'W2': np.random.randn(hidden_2, hidden_1),
    'b2': np.random.randn(hidden_2, 1),
    'W3': np.random.randn(hidden_3, hidden_2),
    'b3': np.random.randn(hidden_3, 1),
    'W4': np.random.randn(output_layer, hidden_3),
    'b4': np.random.randn(output_layer, 1),

}
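
As a sanity check on the shapes above, the number of trainable parameters can be counted directly from the layer sizes (5, 10, 5, 7, 1). The variable names in this sketch are illustrative.

```python
# Layer widths: input, three hidden layers, output
layer_sizes = [5, 10, 5, 7, 1]

# Each layer contributes a weight matrix (n_out x n_in) and a bias vector (n_out)
n_params = sum(
    n_out * n_in + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
)
# 60 + 55 + 42 + 8 = 165
```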

In [53]:
### Feedforward network
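def feed_forward(x, cache, params):
    # x: (n_samples, n_features). Every Z/A is stored as (units, n_samples),
    # so the returned output has shape (1, n_samples).
    cache['X'] = x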
    cache['Z1'] = np.matmul(params["W1"], cache["X"].T) + params["b1"]
    cache['A1'] = sigmoid(cache['Z1'])
    cache['Z2'] = np.matmul(params["W2"], cache["A1"]) + params["b2"]
    cache['A2'] = sigmoid(cache['Z2'])
    cache['Z3'] = np.matmul(params["W3"], cache["A2"]) + params["b3"]
    cache["A3"] = sigmoid(cache['Z3'])

    cache["output"] = np.matmul(params["W4"], cache["A3"]) + params["b4"]

    return cache["output"]


In [54]:
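### loss function (mean squared error)
def loss_fuction(y_hat, y):
    # y_hat has shape (1, n) and y has shape (n, 1); flatten both so the
    # subtraction is element-wise instead of broadcasting to an (n, n) matrix
    y_hat, y = np.ravel(y_hat), np.ravel(y)
    return np.mean((y_hat - y)**2)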
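
Shapes matter when computing this loss: the network output has shape `(1, n)` while the targets have shape `(n, 1)`, and subtracting the two directly makes NumPy broadcast to an `(n, n)` matrix of all pairwise differences. The toy numbers below are hypothetical and only illustrate the effect.

```python
import numpy as np

y_hat = np.array([[1.0, 2.0, 3.0]])      # shape (1, 3), like the network output
y = np.array([[1.0], [2.0], [5.0]])      # shape (3, 1), like the targets

# Direct subtraction broadcasts to every (prediction, target) pair
diff_broadcast = y_hat - y               # shape (3, 3)
mse_wrong = np.sum(diff_broadcast ** 2) / y.shape[0]

# Flattening both arrays first gives the intended element-wise difference
mse_right = np.mean((y_hat.ravel() - y.ravel()) ** 2)
```

Here the element-wise errors are 0, 0 and -2, so the intended mean squared error is 4/3, while the broadcast version sums 9 pairwise terms and reports 12.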

In [55]:
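def back_propagation(x, y, output, params):
    # Reads Z1..Z3, A1..A3 and X from the global `cache` filled by feed_forward
    current_batch_size = y.shape[0]

    # Gradient of 0.5 * (output - y)**2 w.r.t. the output, shape (1, n);
    # the factor 2 of the plain MSE gradient is absorbed into the learning rate
    dloss = output - y.T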

    dA3 = np.matmul(params["W4"].T, dloss)
    dZ3 = dA3 * sigmoid(cache["Z3"], derivative=True)
    dA2 = np.matmul(params["W3"].T, dZ3)
    dZ2 = dA2 * sigmoid(cache["Z2"], derivative=True)
    dA1 = np.matmul(params["W2"].T, dZ2)
    dZ1 = dA1 * sigmoid(cache["Z1"], derivative=True)

    dW4 = (1./current_batch_size) * np.matmul(dloss, cache["A3"].T)
    db4 = (1./current_batch_size) * np.sum(dloss, axis=1, keepdims=True)

    dW3 = (1./current_batch_size) * np.matmul(dZ3, cache["A2"].T)
    db3 = (1./current_batch_size) * np.sum(dZ3, axis=1, keepdims=True)

    dW2 = (1./current_batch_size) * np.matmul(dZ2, cache["A1"].T)
    db2 = (1./current_batch_size) * np.sum(dZ2, axis=1, keepdims=True)

    dW1 = (1./current_batch_size) * np.matmul(dZ1, cache["X"])
    db1 = (1./current_batch_size) * np.sum(dZ1, axis=1, keepdims=True)

    grads = {'W4' : dW4, 'b4' : db4, 'W3' : dW3, 'b3': db3, 'W2' : dW2, 'b2' : db2, 'W1' : dW1, 'b1' : db1}
    return  grads
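
Hand-written backpropagation is easy to get subtly wrong, so it is worth comparing it against numerical gradients. The sketch below does this for a smaller, self-contained network (one sigmoid hidden layer, linear output, loss $\tfrac{1}{2}\cdot$ mean squared error) that uses the same column-wise layout as the code above. All names and the toy data are assumptions; it is a checking technique, not a replacement for the function above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    """X: (n, d). Returns hidden activations (m, n) and output (1, n)."""
    A1 = sigmoid(p["W1"] @ X.T + p["b1"])
    return A1, p["W2"] @ A1 + p["b2"]

def half_mse(X, y, p):
    _, out = forward(X, p)
    return 0.5 * np.mean((out - y.T) ** 2)

def analytic_grads(X, y, p):
    n = y.shape[0]
    A1, out = forward(X, p)
    d_out = out - y.T                               # (1, n)
    dZ1 = (p["W2"].T @ d_out) * A1 * (1.0 - A1)     # (m, n)
    return {
        "W2": d_out @ A1.T / n,
        "b2": d_out.sum(axis=1, keepdims=True) / n,
        "W1": dZ1 @ X / n,
        "b1": dZ1.sum(axis=1, keepdims=True) / n,
    }

def numeric_grad(X, y, p, key, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    g = np.zeros_like(p[key])
    for idx in np.ndindex(p[key].shape):
        old = p[key][idx]
        p[key][idx] = old + eps
        plus = half_mse(X, y, p)
        p[key][idx] = old - eps
        minus = half_mse(X, y, p)
        p[key][idx] = old                           # restore the parameter
        g[idx] = (plus - minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.uniform(size=(8, 5))                        # hypothetical inputs
y = rng.normal(size=(8, 1))                         # hypothetical targets
p = {"W1": rng.normal(size=(4, 5)), "b1": rng.normal(size=(4, 1)),
     "W2": rng.normal(size=(1, 4)), "b2": rng.normal(size=(1, 1))}

grads = analytic_grads(X, y, p)
max_diff = max(np.max(np.abs(grads[k] - numeric_grad(X, y, p, k))) for k in p)
```

If `max_diff` is tiny, the analytic gradients agree with the numerical ones; a large value usually points to a transposed matrix or a missing derivative term.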

In [56]:
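def training(x_train, y_train, x_test, y_test, cache, params, epochs=100000000,
             batch_size=64, optimizer="sgd", l_rate=0.0001, beta=0.9):
    # `optimizer` and `beta` are accepted but unused: only plain SGD is implemented

    # Ceiling division avoids an extra empty batch when n is a multiple of batch_size
    num_batches = int(np.ceil(x_train.shape[0] / batch_size))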
    start_time = time.time()
    template = "Epoch {}: {:.2f}s, train loss={:.2f}, test loss={:.2f}"

    # Train
    for i in range(epochs):
        # Shuffle
        permutation = np.random.permutation(x_train.shape[0])
        x_train_shuffled = x_train[permutation]
        y_train_shuffled = y_train[permutation]

        for j in range(num_batches):
            # Batch
            begin = j * batch_size
            end = min(begin + batch_size, x_train.shape[0])  # slice end is exclusive, so no -1
            x = x_train_shuffled[begin:end]
            y = y_train_shuffled[begin:end]

            #forward
            output = feed_forward(x, cache, params)
            #backprop
            grads = back_propagation(x, y, output, params)
            print(f'cache -- {cache}')
            print(f'params -- {params}')
            for key in params:
                params[key] = params[key] - l_rate * grads[key]

            # import sys
            # sys.exit(0)

        # Training data
        output = feed_forward(x_train, cache, params)
        train_loss = loss_fuction(output,  y_train)

        # Test data
        output = feed_forward(x_test, cache, params)
        test_loss = loss_fuction( output, y_test,)
        #print(train_loss, test_loss)
        print(template.format(i+1, time.time()-start_time,  train_loss, test_loss))
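
The mini-batch bookkeeping in the loop above can be checked in isolation. The sketch below uses a hypothetical helper, `batch_slices`, to generate `(begin, end)` pairs that cover every sample exactly once, including the case where the number of samples is an exact multiple of the batch size.

```python
def batch_slices(n_samples, batch_size):
    """Yield (begin, end) pairs covering range(n_samples) exactly once."""
    for begin in range(0, n_samples, batch_size):
        # Python slice ends are exclusive, so the last valid end is n_samples
        yield begin, min(begin + batch_size, n_samples)

slices_131 = list(batch_slices(131, 64))   # last batch is shorter: (128, 131)
slices_128 = list(batch_slices(128, 64))   # exact multiple: no empty batch
```

An empty batch is worth avoiding because averaging gradients over zero samples divides by zero.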





In [57]:
training(X_train, y_train, X_test, y_test, cache, params, epochs=1, batch_size=64)

cache -- {'X': array([[8.60425844e-02, 3.22979435e-03, 3.42167285e-10, 3.42167285e-10,
        6.84175307e-03],
       [4.66527166e-01, 4.28048056e-03, 3.57817985e-10, 3.57817985e-10,
        1.01134561e-05],
       [5.44787078e-01, 9.70442868e-04, 5.85827216e-10, 5.85827216e-10,
        1.52047385e-02],
       [8.67841410e-01, 2.72827349e-03, 4.23182677e-10, 4.23182677e-10,
        9.08778311e-03],
       [2.29052863e-01, 3.49058520e-03, 2.19109813e-10, 2.19109813e-10,
        3.93666280e-03],
       [1.61527166e-02, 2.88123735e-03, 5.87054722e-10, 5.87054722e-10,
        1.57415945e-02],
       [4.25183554e-01, 4.25540451e-03, 3.57817985e-10, 3.57817985e-10,
        6.93614532e-04],
       [6.58810573e-02, 2.58784765e-03, 4.34230231e-10, 4.34230231e-10,
        9.48642184e-03],
       [2.69471366e-01, 1.46193331e-03, 4.20113913e-10, 4.20113913e-10,
        6.65802528e-05],
       [7.46108664e-02, 4.22280566e-03, 2.81098863e-10, 2.81098863e-10,
        5.48570716e-03],
       [5.16409

In [58]:
# Test data
output = feed_forward(X_test, cache, params)
output

array([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
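
Every prediction above is `nan`. A likely cause (an assumption, based on the `df.info()` output showing one missing value in four of the feature columns) is that a row with missing features ended up in the training data: a single `nan` in a batch makes the whole gradient `nan`, and the update then overwrites the parameters with `nan`. The sketch below reproduces the effect on hypothetical toy arrays with a one-layer linear model and shows one possible remedy, dropping the incomplete rows.

```python
import numpy as np

# Hypothetical toy data with one missing feature value
X_toy = np.array([[0.1, 0.2],
                  [np.nan, 0.4],
                  [0.5, 0.6]])
y_toy = np.array([[1.0], [2.0], [3.0]])
w = np.array([[0.3, -0.2]])                       # (1, 2) weight matrix

# One nan sample poisons every entry of the averaged gradient
pred = w @ X_toy.T                                # (1, 3)
grad = (pred - y_toy.T) @ X_toy / len(y_toy)      # (1, 2), all nan

# Remedy: keep only rows without missing values
rows_ok = ~np.isnan(X_toy).any(axis=1)
X_clean, y_clean = X_toy[rows_ok], y_toy[rows_ok]
grad_clean = (w @ X_clean.T - y_clean.T) @ X_clean / len(y_clean)
```

In this notebook the equivalent step would be something like calling `df.dropna()` before building `X` and `y`; imputing the missing values is another option.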

In [59]:
cache

{'X': array([[2.43759178e-01, 2.89377537e-03, 4.60314733e-10, 4.60314733e-10,
         1.01134561e-03],
        [3.17797357e-01, 3.10190653e-03, 4.60314733e-10, 4.60314733e-10,
         1.01134561e-03],
        [4.64023495e-01, 2.22424502e-03, 4.63076621e-10, 4.63076621e-10,
         9.70049000e-04],
        [6.69801762e-01, 3.70373157e-03, 3.57817985e-10, 3.57817985e-10,
         1.01134561e-05],
        [5.68627019e-01, 3.58085895e-03, 3.42167285e-10, 3.42167285e-10,
         7.40642103e-03],
        [2.50352423e-01, 2.70068984e-03, 3.42167285e-10, 3.42167285e-10,
         6.84680979e-03],
        [2.01556535e-01, 1.67257207e-03, 4.63076621e-10, 4.63076621e-10,
         9.70049000e-04],
        [3.46549192e-01, 1.97348459e-03, 5.87668475e-10, 5.87668475e-10,
         6.57374648e-05],
        [3.40477239e-01, 2.48220228e-01, 1.80750252e-10, 1.80750252e-10,
         2.35980643e-05],
        [4.86182085e-01, 3.01163278e-03, 3.42167285e-10, 3.42167285e-10,
         6.84680979e-03],
     

In [60]:
y_test

array([[2.01490302],
       [2.22137504],
       [2.18041746],
       [1.31103188],
       [1.41098697],
       [1.69193913],
       [2.40243043],
       [2.22462355],
       [1.05779029],
       [1.41585316],
       [2.99322914],
       [0.33647224],
       [1.97129938],
       [2.90142159],
       [2.86220088],
       [1.28093385],
       [1.5260563 ],
       [0.90421815],
       [1.38629436],
       [1.25276297],
       [1.54968791],
       [1.53686722],
       [2.74084002],
       [2.67414865]])