# Neural Networks

## Biological Neuron
A biological neuron is an unusual-looking cell mostly found in animal brains. It is composed of a cell body containing the nucleus and most of the cell's complex components, many branching extensions called dendrites, and one very long extension called the axon. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites or cell bodies of other neurons. Biological neurons produce short electrical impulses called action potentials (or simply signals), which travel along the axons and make the synapses release chemical signals called neurotransmitters. When a neuron receives a sufficient amount of these neurotransmitters within a few milliseconds, it fires its own electrical impulses.

<img src='images\neurons.png' height='200px'>
<img src='images\bnn.png' height='200px'>

## Artificial Neurons

Artificial networks are inspired by the human brain and how it is interconnected. They both have neurons, activations, and large interconnectivity. However, it is not a perfect comparison as the underlying process is different.


<img src='images\bio vs ai.png' height='300px'>

This is similar to what we saw in logistic regression. The only difference is that in logistic regression the activation function was always a sigmoid, whereas in neural networks there is a wide variety of activation functions to choose from. We will discuss activation functions later in this lesson.

Within a neural network, each neuron performs a computation akin to logistic regression. It takes input from the previous layer, applies weights to these inputs, sums them up, and then applies an activation function. This process resembles the calculation in logistic regression where features are weighted and summed before being passed through a sigmoid activation function.

A shallow neural network would have only 2 layers, one hidden and one output layer. 

<img src='images\shallow network.png' height='300px'>

Adding more hidden layers will make it a deep neural network.

<img src='images\2hidden.jpg' height='300px'>


**- Learning Complex Patterns:** By linking multiple neurons together in layers, neural networks can capture intricate patterns and relationships in the data. While logistic regression is limited to linear relationships, neural networks excel at modeling nonlinearities due to their layered structure and the activation functions applied at each neuron.

**- Scaling Complexity:** Just as complex structures can be built from simple building blocks, neural networks scale the capabilities of logistic regression by layering multiple logistic regression units. Each layer adds another level of abstraction, allowing the network to learn increasingly complex representations of the data.

### Anatomy of a Neuron:

**1. Inputs:** Neurons receive input signals from the previous layer or directly from the input data. These inputs represent the features of the dataset.

**2. Weights:** Each input is associated with a weight, which determines its importance in the computation. Just like logistic regression assigns coefficients to features, neural networks adjust weights during training to optimize performance.

**3. Activation Function:** After summing the weighted inputs, the neuron applies an activation function. This function introduces nonlinearity into the model, allowing neural networks to learn complex mappings between inputs and outputs. Common activation functions include the sigmoid function, which is used in logistic regression, as well as others like Softmax, ReLU (Rectified Linear Unit) and tanh.

**4. Output:** Finally, the neuron produces an output value, which is transmitted to the next layer of neurons. In classification tasks, this output typically represents the probability of a certain class, just as logistic regression outputs probabilities for binary classification problems.
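The four parts above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `sigmoid` and `neuron`, the input values, and the weights are made up for the example, and a bias term is added to the weighted sum, as in logistic regression.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the activation used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Hypothetical values, chosen only for illustration
x = np.array([0.5, -1.0, 2.0])   # 1. inputs (features)
w = np.array([0.4, 0.3, -0.2])   # 2. one weight per input
b = 0.1                          # bias term

output = neuron(x, w, b)         # 3. activation applied, 4. output produced
print(output)
```

Swapping `sigmoid` for another function such as ReLU or tanh changes only the `activation` argument; the weighted sum stays the same.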

#### Notation

$$ f_i^{[l]}(x) = w_{i, 1}^{[l]}.f_1^{[l-1]}(x)+ w_{i, 2}^{[l]}.f_2^{[l-1]}(x) + b_i^{[l]} $$

- The superscript $[l]$ indicates the layer.
- In the subscript $i, j$, $i$ indicates the neuron number and $j$ indicates the coefficient number, that is, which output of the previous layer the weight multiplies.
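In code, the notation maps naturally onto a weight matrix, where row $i$ holds the weights of neuron $i$. The sketch below assumes a hypothetical layer with three neurons fed by two outputs from the previous layer; all numbers are invented for illustration, and no activation is applied, matching the formula as written.

```python
import numpy as np

# Hypothetical outputs of layer l-1: f_1^{[l-1]}(x) and f_2^{[l-1]}(x)
f_prev = np.array([1.0, 2.0])

# W[i, j] stores w_{i+1, j+1}^{[l]} because NumPy indices start at 0
W = np.array([[0.5, -1.0],
              [2.0,  0.25],
              [0.0,  1.0]])
b = np.array([0.1, -0.2, 0.3])   # one bias b_i^{[l]} per neuron

# All neurons of the layer at once: f^{[l]} = W f^{[l-1]} + b
f_curr = W @ f_prev + b

# The same computation written neuron by neuron, as in the formula
f_loop = np.array([W[i, 0] * f_prev[0] + W[i, 1] * f_prev[1] + b[i]
                   for i in range(W.shape[0])])

print(f_curr)
print(np.allclose(f_curr, f_loop))
```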

## Activation Functions

Activation functions play a vital role in neural networks by transforming the input signal of a node into an output signal, which is then forwarded to the next layer. They enable neural networks to learn intricate patterns in data, breaking away from solely linear relationships. By introducing non-linearities, activation functions allow neural networks to capture complex mappings between inputs and outputs. Without them, the network's capacity to learn would be limited.

So why do we need them in the first place? A neural network whose layers only have linear activations is equivalent to a single linear layer.

<img src='images\linear activation.png' height='200px'>

$$
\begin{aligned}
\text{Layer 1, neuron 1:} \quad f_1^{[1]}(x) &= w_{11}^{[1]}.x + b_1^{[1]} \\[10pt]
\text{Layer 1, neuron 2:} \quad f_2^{[1]}(x) &= w_{21}^{[1]}.x + b_2^{[1]} \\[20pt]
\text{Layer 2, neuron 1:} \quad f_1^{[2]}(x) &= w_{11}^{[2]}.f_1^{[1]}(x)+ w_{12}^{[2]}.f_2^{[1]}(x) + b_1^{[2]} \\[10pt]
f_1^{[2]}(x) &= w_{11}^{[2]}.(w_{11}^{[1]}.x + b_1^{[1]})+ w_{12}^{[2]}.(w_{21}^{[1]}.x + b_2^{[1]}) + b_1^{[2]} \\[10pt]
f_1^{[2]}(x) &= (w_{11}^{[2]}w_{11}^{[1]} + w_{12}^{[2]}w_{21}^{[1]})x + w_{11}^{[2]}b_1^{[1]} + w_{12}^{[2]}b_2^{[1]} + b_1^{[2]} \\[10pt]
f_1^{[2]}(x) &= w.x + b
\end{aligned}
$$
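The algebra above can also be checked numerically. The following sketch uses random weights (the names `W1`, `b1`, `W2`, `b2` and the seed are arbitrary choices for this example) and confirms that two stacked linear layers give the same output as one linear layer with $w = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 1: 1 input -> 2 neurons; layer 2: 2 neurons -> 1 output (no activation)
W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)

def two_linear_layers(x):
    """Forward pass through both layers with linear (identity) activations."""
    h = W1 @ x + b1
    return W2 @ h + b2

# Collapsed single layer, following the derivation term by term
w = W2 @ W1
b = W2 @ b1 + b2

x = np.array([0.7])
print(two_linear_layers(x), w @ x + b)   # the two values should match
```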

We are back to a linear model, which is why we need nonlinear activations to fit more complex relations. To illustrate, take a network with 1 input, two hidden layers of 8 neurons each, and 1 output neuron, and train it to fit a sine function:

<img src='images\1881 nn.png' height='300px'>

With only linear activations, here is how it will fit a sine function:

<img src='images\1881 linear activation nn.png' height='200px'>

Whereas this is the result when we use a ***ReLU*** activation function:

<img src='images\relu function.png' height='200px'> &nbsp; <img src='images\1881 relu activation nn.png' height='200px'>
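One way to build intuition for why ReLU helps: each ReLU neuron adds a "kink" at the point where its input crosses zero, so a sum of several ReLU neurons is a piecewise-linear curve instead of a straight line. The sketch below uses three hand-picked (not trained) neurons purely as an illustration; a trained network would choose its own kinks to follow the sine curve.

```python
import numpy as np

def relu(z):
    """ReLU: passes positive values through and clips negatives to zero."""
    return np.maximum(0.0, z)

x = np.linspace(-2.0, 2.0, 9)

# Hand-picked combination of three ReLU neurons (illustrative weights only).
# The result is a "bump": flat, rising, falling, then flat again,
# something no single linear function can represent.
y = relu(x + 1.0) - 2.0 * relu(x) + relu(x - 1.0)

for xi, yi in zip(x, y):
    print(f"x = {xi:5.2f}  ->  y = {yi:4.2f}")
```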


Let's see how using the ***ReLU*** activation function can help a neural network learn a sine function:

<video controls src="videos\how activations work.mp4"></video>

### Backpropagation

From what we have seen, a single-layer neural network is just like logistic regression, so we can simply use gradient descent to update the weights and biases, which is straightforward to implement. However, it is much more complicated for deeper networks. It was complicated enough that it took about 30 years before researchers figured out how to do it.

For many years researchers struggled, without success, to find a way to train neural networks. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the backpropagation training algorithm, which is still used today.

In short, it is Gradient Descent combined with an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm computes the gradient of the network's error with regard to every single model parameter.

In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

<img src='images\backprop_diagram.png' height='500px'>

In the forward pass through the network, our data and operations go from bottom to top here. We pass the input $x$ through a linear transformation $L_1$ with weights $W_1$ and biases $b_1$. The output then goes through the sigmoid operation $S$ and another linear transformation $L_2$. Finally we calculate the loss $\ell$. We use the loss as a measure of how bad the network's predictions are. The goal then is to adjust the weights and biases to minimize the loss.

To train the weights with gradient descent, we propagate the gradient of the loss backwards through the network. Each operation has some gradient between the inputs and outputs. As we send the gradients backwards, we multiply the incoming gradient with the gradient for the operation. Mathematically, this is really just calculating the gradient of the loss with respect to the weights using the chain rule.

$$
\large \frac{\partial \ell}{\partial W_1} = \frac{\partial \ell}{\partial L_2} \frac{\partial L_2}{\partial S} \frac{\partial S}{\partial L_1} \frac{\partial L_1}{\partial W_1} 
$$


No need to sweat over the details; following them fully requires some vector calculus. What matters is that backpropagation gives us the gradient of the loss with respect to each parameter, so we can apply gradient descent to all of them.

We update our weights using this gradient with some learning rate $\alpha$. 

$$
\large W_1 = W_1 - \alpha \frac{\partial \ell}{\partial W_1}
$$

The learning rate $\alpha$ is set such that the weight update steps are small enough that the iterative method settles in a minimum.
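The forward pass, the chain rule, and the update step can be sketched by hand in NumPy for the small network described above ($L_1$, then $S$, then $L_2$, then the loss). Several things here are assumptions made only for the example: the layer sizes, the random input and target, and the use of a squared-error loss, since the text does not specify one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # hypothetical input with 3 features
y = np.array([1.0])                    # hypothetical target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(W1):
    """Forward pass: linear L1 -> sigmoid S -> linear L2 -> loss."""
    L1 = W1 @ x + b1
    S = sigmoid(L1)
    L2 = W2 @ S + b2
    loss = 0.5 * np.sum((L2 - y) ** 2)   # assumed squared-error loss
    return L1, S, L2, loss

L1, S, L2, loss = forward(W1)

# Backward pass: multiply the incoming gradient by each operation's gradient
dloss_dL2 = L2 - y                     # d loss / d L2, shape (1,)
dloss_dS = W2.T @ dloss_dL2            # through L2 = W2 S + b2, shape (4,)
dloss_dL1 = dloss_dS * S * (1 - S)     # through the sigmoid, shape (4,)
dloss_dW1 = np.outer(dloss_dL1, x)     # through L1 = W1 x + b1, shape (4, 3)

# Gradient descent step with learning rate alpha
alpha = 0.1
W1_new = W1 - alpha * dloss_dW1
new_loss = forward(W1_new)[3]
print("loss before:", loss, "loss after:", new_loss)
```

In practice this bookkeeping is not written by hand; frameworks such as TensorFlow compute the same gradients automatically.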

# Introduction to Tensorflow

To build neural networks we will use a library called TensorFlow. TensorFlow is an open-source, end-to-end platform for building machine learning models. Being end-to-end means you can prepare data, build models, diagnose and improve them, and deploy them, all within the same platform.

TensorFlow ships with Keras as its high-level API. Keras is a well-designed API for building deep learning models in popular fields such as computer vision and natural language processing.

TensorFlow has a strong community, with many users, learning resources, and a wide range of technical support. It is used in Google products such as YouTube, Maps, and Google Photos, and it is also widely used across startups and other large tech companies.

## The Basics of Tensors

TensorFlow uses arrays called tensors. A tensor is a multidimensional array whose elements all share the same data type. A tensor can be a scalar (a single number), a vector, a matrix, or an array with more dimensions.

Tensors are like NumPy arrays, except that tensors have GPU (Graphics Processing Unit) support.

A typical tensor has the following information:

- Shape: The length (number of elements) of each of the tensor's dimensions/axes.
- Rank: The number of dimensions/axes in a tensor. A scalar tensor (a single number) has rank 0, a vector has rank 1 (it is 1D), and a matrix has rank 2 (it is 2D).
- Axis/Dimension: A particular dimension of a tensor.
- Size: This is the total number of items in the tensor.
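As a quick sketch, these properties can be read directly from a tensor. The tensor `t` below is a hypothetical example filled with zeros; the block imports TensorFlow itself so that it runs on its own.

```python
import tensorflow as tf

t = tf.zeros((3, 2, 5))   # hypothetical example tensor

print("Shape:", t.shape)                      # length of every axis
print("Rank:", t.ndim)                        # number of axes
print("Length of axis 0:", t.shape[0])        # one particular axis
print("Size:", tf.size(t).numpy())            # total number of elements
```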
  
But why use tensor/NumPy arrays?

Well, almost all types of data can be represented as an array of numbers. Take an example:

- An Image can be represented as an array of pixels.
- Any text data can be converted into an array of numbers (or tokens representing words)
- Video (made of sequence of images) can be represented as an array of numbers.
  
Being able to convert raw data into tensors/arrays makes it easy to preprocess, whether we are performing conventional numerical computations or preparing the data to feed to a machine learning model. As a simple example, we cannot feed raw text to a machine learning model. That text has to be converted into numbers first.
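To make the text example concrete, here is a toy sketch in plain Python that maps each distinct word to an integer id. The sentence and the first-appearance numbering scheme are invented for illustration; real pipelines typically rely on a dedicated tokenizer.

```python
text = "the cat sat on the mat"

# Build a vocabulary: each distinct word gets an id in order of first appearance
vocab = {}
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)

# Replace every word with its id; repeated words share the same id
token_ids = [vocab[word] for word in text.split()]

print(vocab)
print(token_ids)   # [0, 1, 2, 3, 0, 4]
```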

### Packages

In [None]:
import tensorflow as tf
import numpy as np

### Constant Tensors

The most basic way of creating a tensor is `tf.constant()`, which creates a constant tensor that is immutable (it cannot be changed).

Here is a "scalar" or "rank-0" tensor. A scalar contains a single value and has no "axes".

In [3]:
rank_0_tensor = tf.constant(4) # A scalar
print(rank_0_tensor)

tf.Tensor(4, shape=(), dtype=int32)


A "vector" or "rank-1" tensor is like a list of values. A vector has one axis:

In [5]:
# Let's make this a float tensor.
rank_1_tensor = tf.constant([2.0, 3.0, 4.0])
print(rank_1_tensor)

tf.Tensor([2. 3. 4.], shape=(3,), dtype=float32)


A "matrix" or "rank-2" tensor has two axes:

In [6]:
# If you want to be specific, you can set the dtype (see below) at creation time
rank_2_tensor = tf.constant([[1, 2],
                             [3, 4],
                             [5, 6]], dtype=tf.float16)
print(rank_2_tensor)

tf.Tensor(
[[1. 2.]
 [3. 4.]
 [5. 6.]], shape=(3, 2), dtype=float16)


<table style="background-color: white; color: black;">
<tr>
  <th>A scalar, shape: <code>[]</code></th>
  <th>A vector, shape: <code>[3]</code></th>
  <th>A matrix, shape: <code>[3, 2]</code></th>
</tr>
<tr>
  <td>
   <img src="images/scalar.png" alt="A scalar, the number 4" />
  </td>

  <td>
   <img src="images/vector.png" alt="The line with 3 sections, each one containing a number."/>
  </td>
  <td>
   <img src="images/matrix.png" alt="A 3x2 grid, with each cell containing a number.">
  </td>
</tr>
</table>

The tensor above has a shape `(3,2)` which means our tensor has 3 rows and 2 columns. 

You can also check the number of dimensions or axes of a tensor using `tensor_name.ndim`

In [11]:
print('Number of dimensions in rank 0 tensor:', rank_0_tensor.ndim)
print('Number of dimensions in rank 1 tensor:', rank_1_tensor.ndim)
print('Number of dimensions in rank 2 tensor:', rank_2_tensor.ndim)


Number of dimensions in rank 0 tensor: 0
Number of dimensions in rank 1 tensor: 1
Number of dimensions in rank 2 tensor: 2


Just like a NumPy array, a tensor can have many dimensions. Here is an example of a tensor with 3 dimensions.

In [14]:
tensor_3d = tf.constant([
  [[0, 1, 2, 3, 4],
   [5, 6, 7, 8, 9]],
  [[10, 11, 12, 13, 14],
   [15, 16, 17, 18, 19]],
  [[20, 21, 22, 23, 24],
   [25, 26, 27, 28, 29]],])

print(tensor_3d)
print('Number of dimensions in tensor_3d:', tensor_3d.ndim)

tf.Tensor(
[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]]

 [[10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]]], shape=(3, 2, 5), dtype=int32)
Number of dimensions in tensor_3d: 3


<table style="background-color: white; color: black;">
<tr>
  <th colspan=3>A 3-axis tensor, shape: <code>[3, 2, 5]</code></th>
</tr>
<tr>
  <td>
   <img src="images/3-axis_numpy.png"/>
  </td>
  <td>
   <img src="images/3-axis_front.png"/>
  </td>

  <td>
   <img src="images/3-axis_block.png"/>
  </td>
</tr>

</table>

A tensor can be converted into a NumPy array by calling `tensor_name.numpy()` or `np.array(tensor_name)`.

TensorFlow is well integrated with NumPy. It also provides an implementation of a subset of the NumPy API in the `tf.experimental.numpy` module.
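A minimal sketch of both conversions, using a small made-up tensor `t`:

```python
import numpy as np
import tensorflow as tf

t = tf.constant([[1, 2], [3, 4]])

a = t.numpy()      # method call: note the parentheses
b = np.array(t)    # the NumPy constructor also accepts a tensor

print(type(a), type(b))
print(np.array_equal(a, b))
```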

### Variable Tensors

A tensor created with `tf.constant()` is immutable: it cannot be changed. Such a tensor cannot be used as the weights of a neural network, because weights need to be updated during training (for example, after each backpropagation step).

With `tf.Variable()`, we can create mutable tensors, which can therefore be used for things like updating the weights of a neural network.

Creating a variable tensor is as simple as creating a constant one.

In [15]:
var_tensor = tf.Variable([
                         [[1,2,3,4,5],
                         [6,7,8,9,8]],
                         [[1,3,5,7,9],
                         [2,4,6,8,1]],
                         [[1,2,3,5,4],
                         [3,4,5,6,7]], ])

print(var_tensor)

<tf.Variable 'Variable:0' shape=(3, 2, 5) dtype=int32, numpy=
array([[[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 8]],

       [[1, 3, 5, 7, 9],
        [2, 4, 6, 8, 1]],

       [[1, 2, 3, 5, 4],
        [3, 4, 5, 6, 7]]])>


It can also be converted to a NumPy array, just like tensors created with `tf.constant()`.

In [16]:
# Converting a variable tensor into NumPy array

var_tensor.numpy()

array([[[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 8]],

       [[1, 3, 5, 7, 9],
        [2, 4, 6, 8, 1]],

       [[1, 2, 3, 5, 4],
        [3, 4, 5, 6, 7]]])
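What makes a variable useful is that its values can be changed in place. The sketch below uses a small made-up variable `v` and the `assign` and `assign_sub` methods of `tf.Variable`; `assign_sub` has the same form as the weight update $W = W - \alpha \, \partial \ell / \partial W$.

```python
import tensorflow as tf

v = tf.Variable([1.0, 2.0, 3.0])

v.assign([4.0, 5.0, 6.0])        # overwrite all values in place
v.assign_sub([1.0, 1.0, 1.0])    # subtract in place: v = v - [1, 1, 1]
print(v.numpy())

# A constant tensor does not support in-place item assignment
c = tf.constant([1.0, 2.0, 3.0])
try:
    c[0] = 10.0
except TypeError as err:
    print("Cannot modify a constant tensor:", err)
```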

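TensorFlow can also compute gradients automatically with `tf.GradientTape`, the mechanism behind backpropagation in TensorFlow. The cell below differentiates $y = x^2$ and compares the result with the analytic gradient $2x$.

In [2]: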

# Set the random seed so things are reproducible
tf.random.set_seed(7)

# Create a random tensor
x = tf.random.normal((2,2))

# Calculate gradient
with tf.GradientTape() as g:
    g.watch(x)
    y = x ** 2
    
dy_dx = g.gradient(y, x)

# Calculate the actual gradient of y = x^2
true_grad = 2 * x

# Print the gradient calculated by tf.GradientTape
print('Gradient calculated by tf.GradientTape:\n', dy_dx)

# Print the actual gradient of y = x^2
print('\nTrue Gradient:\n', true_grad)

# Print the maximum difference between true and calculated gradient
print('\nMaximum Difference:', np.abs(true_grad - dy_dx).max())

Gradient calculated by tf.GradientTape:
 tf.Tensor(
[[1.1966898  0.12552416]
 [0.29263481 0.96963763]], shape=(2, 2), dtype=float32)

True Gradient:
 tf.Tensor(
[[1.1966898  0.12552415]
 [0.29263484 0.9696376 ]], shape=(2, 2), dtype=float32)

Maximum Difference: 5.9604645e-08
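Putting the pieces together, a `tf.Variable`, `tf.GradientTape`, and the update rule are enough for a tiny gradient descent loop. The toy loss $(w - 3)^2$, the learning rate, and the number of steps below are arbitrary choices made for this sketch, not part of any real model.

```python
import tensorflow as tf

w = tf.Variable(0.0)   # parameter to learn (variables are watched automatically)
alpha = 0.1            # learning rate, chosen arbitrarily for this sketch

for step in range(50):
    with tf.GradientTape() as tape:
        loss = (w - 3.0) ** 2          # toy loss with its minimum at w = 3
    grad = tape.gradient(loss, w)      # d loss / d w = 2 (w - 3)
    w.assign_sub(alpha * grad)         # w = w - alpha * grad

print(w.numpy())   # should end up close to 3
```

Training a neural network follows the same pattern, only with many variables and a loss computed from the network's predictions.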
