# An Introduction to Neural Networks

### How to implement a neural network from scratch in Python

The basic unit of a neural network is the <b>neuron</b>.

A neuron takes inputs, does some math with them, and produces one output.

</br>

![neuron](./img/neuron.png)

Three things happen in a neuron. First, each input is multiplied by a weight:

$x_{1}\rightarrow x_{1}\star w_{1}$

$x_{2}\rightarrow x_{2}\star w_{2}$

</br>

Next, all the weighted inputs are added together with a bias b:

$(x_{1} \star w_{1}) + (x_{2} \star w_{2}) + b$

</br>

Finally, the sum is passed through an activation function:

$y = f((x_{1} \star w_{1}) + (x_{2} \star w_{2}) + b)$

</br>

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function:

![sigmoid_function](./img/sigmoid_function.png)

The sigmoid function only outputs numbers in the range $(0, 1)$. You can think of it as compressing $(-\infty, +\infty)$ to $(0, 1)$ - big negative numbers become $\approx 0$, and big positive numbers become $\approx 1$.
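
To make the squashing behaviour concrete, here is a minimal sketch that evaluates the sigmoid at a few inputs. The sample values are arbitrary and chosen only for illustration:

```python
import numpy as np

def sigmoid(x):
  # f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

# Arbitrary sample inputs, from very negative to very positive
samples = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(samples)

for x, y in zip(samples, outputs):
  print("f(%6.1f) = %.5f" % (x, y))
```

Every output stays strictly between 0 and 1, and an input of 0 lands exactly in the middle at 0.5.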

### A Simple Example
</br>
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:

$w = [0, 1]$

$b = 4$

</br>

$w = [0, 1]$ is just a way of writing $w_{1} = 0$, $w_{2} = 1$ in vector form.

Now, let’s give the neuron an input of $x = [2, 3]$.

We’ll use the dot product to write things more concisely:

$(w\cdot x) + b = ((w_{1} \star x_{1}) + (w_{2} \star x_{2}) + b) = 0 \star 2 + 1 \star 3 + 4 = 7$

$y = f(w \cdot x + b) = f(7) \approx 0.999$

</br>

The neuron outputs $0.999$ given the inputs $x = [2, 3]$. That’s it! This process of passing inputs forward to get an output is known as <b>feedforward</b>.

### Coding a Neuron

We’ll use NumPy, a popular and powerful computing library for Python, to help us do math:

In [None]:
import numpy as np
def sigmoid(x):
  # Our activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

class Neuron:
  def __init__(self, weights, bias):
    self.weights = weights
    self.bias = bias

  def feedforward(self, inputs):
    # Weight inputs, add bias, then use the activation function
    total = np.dot(self.weights, inputs) + self.bias
    return sigmoid(total)

weights = np.array([0, 1]) # w1 = 0, w2 = 1
bias = 4                   # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])       # x1 = 2, x2 = 3
print(n.feedforward(x))    # 0.9990889488055994

0.9990889488055994


Recognize those numbers? That’s the example we just did! We get the same answer of $0.999$.

## 2. Combining Neurons into a Neural Network

A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:

</br>

![simple_NN](./img/simple_NN.png)

</br>

This network has 2 inputs, a hidden layer with 2 neurons ($h_{1}$  and $h_{2}$), and an output layer with 1 neuron ($o_{1}$). Notice that the inputs for $o_{1}$ are the outputs from $h_{1}$ and $h_{2}$ - that’s what makes this a network.

</br>

A <b>hidden layer</b> is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!

### An Example: Feedforward

</br>

Let’s use the network pictured above and assume all neurons have the same weights $w = [0, 1]$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_{1}$, $h_{2}$, $o_{1}$ denote the outputs of the neurons they represent.

</br>

What happens if we pass in the input $x = [2, 3]$?

$h_{1} = h_{2} = f(w \cdot x + b) = f((0\star2)+(1\star3)+0) = f(3) = 0.9526$

$o_{1} = f(w \cdot [h_{1}, h_{2}] + b) = f((0 \star h_{1}) + (1 \star h_{2}) + 0) = f(0.9526) = 0.7216$

</br>

The output of the neural network for input $x = [2, 3]$ is $0.7216$. 

</br>

A neural network can have <b>any number of layers</b> with <b>any number of neurons</b> in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this post.

### Coding a Neural Network: Feedforward

In [None]:
class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)
  Each neuron has the same weights and bias:
    - w = [0, 1]
    - b = 0
  '''
  def __init__(self):
    weights = np.array([0, 1])
    bias = 0

    # The Neuron class here is from the previous section
    self.h1 = Neuron(weights, bias)
    self.h2 = Neuron(weights, bias)
    self.o1 = Neuron(weights, bias)

  def feedforward(self, x):
    out_h1 = self.h1.feedforward(x)
    out_h2 = self.h2.feedforward(x)

    # The inputs for o1 are the outputs from h1 and h2
    out_o1 = self.o1.feedforward(np.array([out_h1, out_h2]))

    return out_o1

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x)) # 0.7216325609518421

0.7216325609518421


We got $0.7216$ again, the same value as in the hand calculation!
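
Since a network can have any number of layers and neurons, the same feedforward idea can also be written with one weight matrix per layer. The following is only a sketch, not the approach used in the rest of this post; `feedforward_layers` and the `(W, b)` layer representation are hypothetical names introduced for illustration:

```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def feedforward_layers(x, layers):
  # layers is a list of (W, b) pairs; W has one row per neuron in that layer
  a = x
  for W, b in layers:
    a = sigmoid(W @ a + b)
  return a

# Same parameters as the example: every neuron uses w = [0, 1] and b = 0
hidden = (np.array([[0.0, 1.0], [0.0, 1.0]]), np.zeros(2))  # h1, h2
output = (np.array([[0.0, 1.0]]), np.zeros(1))              # o1

y = feedforward_layers(np.array([2.0, 3.0]), [hidden, output])
print(y)  # a one-element array holding the output of o1
```

With these parameters the single output should agree with the $0.7216$ computed by hand, and adding another layer only means appending another `(W, b)` pair to the list.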

## 3. Training a Neural Network, Part 1

</br>

Say we have the following measurements:

</br>

![measurements](./img/measurements.png)

</br>

Let’s train our network to predict someone’s gender given their weight and height:

</br>

![predict_gender](./img/predict_gender.png)

</br>

We’ll represent Male with a 0 and Female with a 1, and we’ll also shift the data to make it easier to use:

</br>

![measurements_numbers](./img/measurements_numbers.png)

</br>



### Loss

</br>

Before we train our network, we first need a way to quantify how well it’s doing so that it can try to do better. That’s what the <b>loss</b> is.


We’ll use the mean squared error (MSE) loss:

</br>

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}$

</br>

Let’s break this down:


* $n$ is the number of samples, which is 4 (Alice, Bob, Charlie, Diana).
* y represents the variable being predicted, which is Gender.
* $y_{true}$ is the true value of the variable (the “correct answer”). For example, $y_{true}$ for Alice would be 1 (Female).
* $y_{pred}$ is the predicted value of the variable. It’s whatever our network outputs.

</br>

$(y_{true} - y_{pred})^{2}$ is known as the <b>squared error</b>. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!

</br>

Better predictions = Lower loss.

<b>Training a network = trying to minimize its loss.</b>



### An Example Loss Calculation

</br>

Let’s say our network always outputs 0 - in other words, it’s confident all humans are Male 🤔. What would our loss be?

</br>

![example_loss_calculation](./img/example_loss_calculation.png)

</br>

$MSE = \frac{1}{4}(1 + 0 + 0 + 1) = 0.5$

### Code: MSE Loss

In [None]:
def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])

print(mse_loss(y_true, y_pred)) # 0.5

0.5
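
To see "better predictions = lower loss" in numbers, here is a small sketch comparing three sets of predictions against the same labels. The prediction values are made up for illustration and do not come from any trained network:

```python
import numpy as np

def mse_loss(y_true, y_pred):
  # Mean of the squared errors
  return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])

# Hypothetical predictions, roughly from worst to best
candidates = {
  "always male":  np.array([0.0, 0.0, 0.0, 0.0]),
  "unsure":       np.array([0.5, 0.5, 0.5, 0.5]),
  "mostly right": np.array([0.9, 0.1, 0.2, 0.8]),
}

losses = {name: mse_loss(y_true, y_pred) for name, y_pred in candidates.items()}
for name, loss in losses.items():
  print("%-12s loss: %.3f" % (name, loss))
```

The loss shrinks as the predictions move closer to the true labels.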


## 4. Training a Neural Network, Part 2

</br>

We now have a clear goal: <b>minimize the loss</b> of the neural network. We know we can change the network’s weights and biases to influence its predictions, but how do we do so in a way that decreases loss?

</br>


For simplicity, let’s pretend we only have Alice in our dataset:

</br>

![only_Alice](./img/only_Alice.png)

</br>

Then the mean squared error loss is just Alice’s squared error:

</br>

$MSE = \frac{1}{1}\sum_{i=1}^{1}(y_{true} - y_{pred})^{2} = (y_{true} - y_{pred})^{2} = (1 - y_{pred})^{2}$

</br>

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network:

</br>

![predict_gender_with_w_and_b](./img/predict_gender_with_w_and_b.png)

</br>

Then, we can write loss as a multivariable function:

$L(w_{1}, w_{2}, w_{3}, w_{4}, w_{5}, w_{6}, b_{1}, b_{2}, b_{3})$

</br>

Imagine we wanted to tweak $w_{1}$. How would loss $L$ change if we changed $w_{1}$?

That’s a question the partial derivative $\frac{\partial L}{\partial w_{1}}$ can answer. How do we calculate it?

To start, let’s rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_{1}}$ instead:



$\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial y_{pred}} \star \frac{\partial y_{pred}}{\partial w_{1}}$

(This works because of the Chain Rule.)

</br>

We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^{2}$ above:



$\frac{\partial L}{\partial y_{pred}} = \frac{\partial (1 - y_{pred})^{2}}{\partial y_{pred}} = -2(1 - y_{pred})$
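
One way to sanity-check this derivative is to compare it with a finite-difference estimate. This is just a sketch: the prediction value `0.3` and the step size `eps` are arbitrary assumptions, and the function names are made up for illustration:

```python
def loss(y_pred, y_true=1.0):
  # L = (y_true - y_pred)^2, with y_true = 1 for Alice
  return (y_true - y_pred) ** 2

def d_loss_d_ypred(y_pred, y_true=1.0):
  # Analytic derivative from the formula above: -2 * (y_true - y_pred)
  return -2 * (y_true - y_pred)

y_pred = 0.3  # arbitrary prediction, for illustration only
eps = 1e-6    # small nudge used for the numerical estimate

# Central difference: nudge y_pred both ways and measure the change in loss
numeric = (loss(y_pred + eps) - loss(y_pred - eps)) / (2 * eps)
analytic = d_loss_d_ypred(y_pred)

print("numeric:  %.6f" % numeric)
print("analytic: %.6f" % analytic)
```

The two values should agree to several decimal places. The derivative is negative here because the prediction is below the true value of 1, so increasing $y_{pred}$ would decrease the loss.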

</br>

Now, let’s figure out what to do with $\frac{\partial y_{pred}}{\partial w_{1}}$. Just like before, let $h_{1}, h_{2}, o_{1}$ be the outputs of the neurons they represent. Then

</br>

$y_{pred} = o_{1} = f(w_{5}h_{1} + w_{6}h_{2} + b_{3})$   
(f is the sigmoid activation function)

</br>

Since $w_{1}$ only affects $h_{1}$ (not $h_{2}$), we can write



$\frac{\partial y_{pred}}{\partial w_{1}} = \frac{\partial y_{pred}}{\partial h_{1}} \star \frac{\partial h_{1}}{\partial w_{1}}$

$\frac{\partial y_{pred}}{\partial h_{1}} = w_{5} \star f^{'}(w_{5}h_{1} + w_{6}h_{2} + b_{3})$

(More Chain Rule.)


</br>

We do the same thing for $\frac{\partial h_{1}}{\partial w_{1}}$:



$h_{1} = f(w_{1}x_{1} + w_{2}x_{2} + b_{1})$

$\frac{\partial h_{1}}{\partial w_{1}} = x_{1} \star f^{'}(w_{1}x_{1} + w_{2}x_{2} + b_{1})$

(You guessed it, Chain Rule.)

</br>

$x_{1}$ here is weight, and $x_{2}$ is height. This is the second time we’ve seen $f^{'}(x)$ (the derivative of the sigmoid function) now! Let’s derive it:

$f(x) = \frac{1}{1 + e^{-x}}$

$f^{'}(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = f(x) \star (1 - f(x))$

</br>

We’ll use this nice form for $f^{'}(x)$ later.
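
The second equality holds because $1 - f(x) = \frac{e^{-x}}{1 + e^{-x}}$. As a quick numerical sketch, we can check that both forms of $f^{'}(x)$ agree with each other and with a finite-difference estimate. The grid of test points and the step size `eps` are arbitrary choices:

```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # The "nice form": f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

xs = np.linspace(-5, 5, 11)  # arbitrary test points
eps = 1e-6

# The quotient form: e^(-x) / (1 + e^(-x))^2
quotient_form = np.exp(-xs) / (1 + np.exp(-xs)) ** 2

# Central-difference estimate of the derivative
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

print(np.max(np.abs(deriv_sigmoid(xs) - quotient_form)))
print(np.max(np.abs(deriv_sigmoid(xs) - numeric)))
```

Both printed differences should be tiny, which is what we expect if the derivation is right.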

</br>

We’re done! We’ve managed to break down $\frac{\partial L}{\partial w_{1}}$ into several parts we can calculate:

$\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial y_{pred}} \star \frac{\partial y_{pred}}{\partial h_{1}} \star \frac{\partial h_{1}}{\partial w_{1}}$

</br>

This system of calculating partial derivatives by working backwards is known as <b>backpropagation</b>, or “backprop”.

</br>

Let’s do an example to see this in action!

### Example: Calculating the Partial Derivative

</br>

We’re going to continue pretending only Alice is in our dataset:

![only_Alice](./img/only_Alice.png)
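
Here is a sketch of how this calculation could go in code. Alice's shifted measurements `[-2, -1]` and label `1` are taken from the dataset used in this notebook. The starting parameters are an assumption made purely for illustration: every weight is set to 1 and every bias to 0. The `params` dictionary and the `predict` and `loss` helpers are hypothetical names:

```python
import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  fx = sigmoid(x)
  return fx * (1 - fx)

def predict(p, x):
  h1 = sigmoid(p["w1"] * x[0] + p["w2"] * x[1] + p["b1"])
  h2 = sigmoid(p["w3"] * x[0] + p["w4"] * x[1] + p["b2"])
  return sigmoid(p["w5"] * h1 + p["w6"] * h2 + p["b3"])

def loss(p, x, y_true):
  return (y_true - predict(p, x)) ** 2

x = np.array([-2.0, -1.0])  # Alice (shifted weight, shifted height)
y_true = 1.0                # Female

# ASSUMPTION for illustration only: all weights are 1, all biases are 0
params = {name: 1.0 for name in ("w1", "w2", "w3", "w4", "w5", "w6")}
params.update({"b1": 0.0, "b2": 0.0, "b3": 0.0})

# Feedforward, keeping the intermediate sums for the derivatives
sum_h1 = params["w1"] * x[0] + params["w2"] * x[1] + params["b1"]
sum_h2 = params["w3"] * x[0] + params["w4"] * x[1] + params["b2"]
h1, h2 = sigmoid(sum_h1), sigmoid(sum_h2)
sum_o1 = params["w5"] * h1 + params["w6"] * h2 + params["b3"]
y_pred = sigmoid(sum_o1)

# Chain rule: dL/dw1 = dL/dy_pred * dy_pred/dh1 * dh1/dw1
d_L_d_ypred = -2 * (y_true - y_pred)
d_ypred_d_h1 = params["w5"] * deriv_sigmoid(sum_o1)
d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
d_L_d_w1 = d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1

# Finite-difference check: nudge w1 and measure the change in loss
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss(plus, x, y_true) - loss(minus, x, y_true)) / (2 * eps)

print("chain rule:        %.4f" % d_L_d_w1)
print("finite difference: %.4f" % numeric)
```

Under these assumed parameters both estimates should agree, and the result comes out as a small positive number, meaning that increasing $w_{1}$ a little would increase the loss a little.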

### Code: A Complete Neural Network

In [None]:
def sigmoid(x):
  # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)

  *** DISCLAIMER ***:
  The code below is intended to be simple and educational, NOT optimal.
  Real neural net code looks nothing like this. DO NOT use this code.
  Instead, read/run it to understand how this specific network works.
  '''
  def __init__(self):
    # Weights
    self.w1 = np.random.normal()
    self.w2 = np.random.normal()
    self.w3 = np.random.normal()
    self.w4 = np.random.normal()
    self.w5 = np.random.normal()
    self.w6 = np.random.normal()

    # Biases
    self.b1 = np.random.normal()
    self.b2 = np.random.normal()
    self.b3 = np.random.normal()

  def feedforward(self, x):
    # x is a numpy array with 2 elements.
    h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
    h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
    o1 = sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)
    return o1

  def train(self, data, all_y_trues):
    '''
    - data is a (n x 2) numpy array, n = # of samples in the dataset.
    - all_y_trues is a numpy array with n elements.
      Elements in all_y_trues correspond to those in data.
    '''
    learn_rate = 0.1
    epochs = 1000 # number of times to loop through the entire dataset

    for epoch in range(epochs):
      for x, y_true in zip(data, all_y_trues):
        # --- Do a feedforward (we'll need these values later)
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        y_pred = o1

        # --- Calculate partial derivatives.
        # --- Naming: d_L_d_w1 represents "partial L / partial w1"
        d_L_d_ypred = -2 * (y_true - y_pred)

        # Neuron o1
        d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
        d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
        d_ypred_d_b3 = deriv_sigmoid(sum_o1)

        d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
        d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

        # Neuron h1
        d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
        d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
        d_h1_d_b1 = deriv_sigmoid(sum_h1)

        # Neuron h2
        d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
        d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
        d_h2_d_b2 = deriv_sigmoid(sum_h2)

        # --- Update weights and biases
        # Neuron h1
        self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
        self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
        self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1

        # Neuron h2
        self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
        self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
        self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2

        # Neuron o1
        self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
        self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
        self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

      # --- Calculate total loss at the end of each epoch
      if epoch % 10 == 0:
        y_preds = np.apply_along_axis(self.feedforward, 1, data)
        loss = mse_loss(all_y_trues, y_preds)
        print("Epoch %d loss: %.3f" % (epoch, loss))

# Define dataset
data = np.array([
  [-2, -1],  # Alice
  [25, 6],   # Bob
  [17, 4],   # Charlie
  [-15, -6], # Diana
])
all_y_trues = np.array([
  1, # Alice
  0, # Bob
  0, # Charlie
  1, # Diana
])

# Train our neural network!
network = OurNeuralNetwork()
network.train(data, all_y_trues)

Epoch 0 loss: 0.353
Epoch 10 loss: 0.237
Epoch 20 loss: 0.146
Epoch 30 loss: 0.101
Epoch 40 loss: 0.075
Epoch 50 loss: 0.059
Epoch 60 loss: 0.047
Epoch 70 loss: 0.039
Epoch 80 loss: 0.033
Epoch 90 loss: 0.028
Epoch 100 loss: 0.025
Epoch 110 loss: 0.022
Epoch 120 loss: 0.020
Epoch 130 loss: 0.018
Epoch 140 loss: 0.016
Epoch 150 loss: 0.015
Epoch 160 loss: 0.013
Epoch 170 loss: 0.012
Epoch 180 loss: 0.012
Epoch 190 loss: 0.011
Epoch 200 loss: 0.010
Epoch 210 loss: 0.010
Epoch 220 loss: 0.009
Epoch 230 loss: 0.009
Epoch 240 loss: 0.008
Epoch 250 loss: 0.008
Epoch 260 loss: 0.007
Epoch 270 loss: 0.007
Epoch 280 loss: 0.007
Epoch 290 loss: 0.006
Epoch 300 loss: 0.006
Epoch 310 loss: 0.006
Epoch 320 loss: 0.006
Epoch 330 loss: 0.005
Epoch 340 loss: 0.005
Epoch 350 loss: 0.005
Epoch 360 loss: 0.005
Epoch 370 loss: 0.005
Epoch 380 loss: 0.005
Epoch 390 loss: 0.004
Epoch 400 loss: 0.004
Epoch 410 loss: 0.004
Epoch 420 loss: 0.004
Epoch 430 loss: 0.004
Epoch 440 loss: 0.004
Epoch 450 loss: 0.004

We can now use the network to predict genders:

In [None]:
# Make some predictions
emily = np.array([-7, -3]) # 128 pounds, 63 inches
frank = np.array([20, 2])  # 155 pounds, 68 inches
print("Emily: %.3f" % network.feedforward(emily)) # 0.967 - F
print("Frank: %.3f" % network.feedforward(frank)) # 0.039 - M

Emily: 0.967
Frank: 0.039
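
For convenience, the shifting and the final Male/Female decision could be wrapped in small helpers. This is only a sketch: the offsets of 135 pounds and 66 inches are inferred from the comments above (128 pounds maps to -7, 63 inches maps to -3), the 0.5 cutoff is an assumption, and `to_network_input` and `to_label` are hypothetical names:

```python
import numpy as np

# Offsets inferred from the comments above, not stated explicitly in the post
WEIGHT_SHIFT_LB = 135
HEIGHT_SHIFT_IN = 66

def to_network_input(weight_lb, height_in):
  # Shift raw measurements the same way the training data was shifted
  return np.array([weight_lb - WEIGHT_SHIFT_LB, height_in - HEIGHT_SHIFT_IN])

def to_label(score, threshold=0.5):
  # Scores near 1 mean Female, near 0 mean Male; the cutoff is an assumption
  return "F" if score >= threshold else "M"

print(to_network_input(128, 63))         # Emily's shifted measurements
print(to_network_input(155, 68))         # Frank's shifted measurements
print(to_label(0.967), to_label(0.039))  # labels for the two scores above
```

With a trained `network`, a prediction would then be `to_label(network.feedforward(to_network_input(128, 63)))`.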
