# The Perceptron
## The structural building block of deep learning
The perceptron (a single neuron) is the fundamental building block of a neural network.

A perceptron is defined by its forward propagation of information: it takes a linear combination of its inputs, adds a bias term, and passes the result through a non-linear activation function. The coefficients of the linear combination are called "weights". The bias is a scalar that shifts the activation function left or right along its input axis, which helps with data sets whose classes are not centred about the origin. For simplicity, I will call the weighted sum plus bias z (that is, z is the input to the activation function).

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*5a_Ubdxg86cybTVhJAfWow.png" alt="Perceptron with Input, Weights, Summation, and Activation Function">
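
The forward pass described above can be sketched in plain Python, without any framework. This is only a minimal illustration: the function name `perceptron_forward` and the example numbers are assumptions made for this sketch, and sigmoid is picked as just one possible activation.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_forward(inputs, weights, bias):
    """Forward pass of one neuron: activation(w . x + b)."""
    # z is the linear combination of the inputs plus the bias term
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Illustrative values, not taken from any dataset
y = perceptron_forward(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(y)  # z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
```

Changing `bias` moves z up or down before the activation is applied, which is the "shift" the text refers to.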

## Activation Functions
The point of activation functions is to introduce non-linearities, because real-world data is heavily non-linear.

Non-linear activation functions help us approximate arbitrarily complex functions, given enough depth in the model. Without them, a stack of linear layers would collapse into a single linear function, no matter how deep it is.

Common activation functions are the sigmoid and the hyperbolic tangent (tanh).

In TensorFlow they are tf.math.sigmoid(z) and tf.math.tanh(z).

In PyTorch they are torch.sigmoid(z) and torch.tanh(z).
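
To get a feel for the two functions without installing a framework, here is a small sketch using only Python's `math` module. The sample points are arbitrary choices for illustration. It shows that sigmoid stays in (0, 1) while tanh stays in (-1, 1) and is centred on zero.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z)), with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate both activations at a few arbitrary points
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z))

# The two are closely related: tanh(z) = 2 * sigmoid(2z) - 1
print(abs(math.tanh(1.0) - (2.0 * sigmoid(2.0) - 1.0)) < 1e-12)  # True
```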

## Building Neural Networks with Perceptrons
First we will build a multi-output perceptron. Both neurons receive the same inputs, but because their weights differ, their outputs differ.

In [5]:
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    """A dense layer written by hand (the manual way to do it)."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # Trainable parameters: a weight matrix and a bias row vector
        self.w = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])

    def call(self, inputs):
        # Linear combination of the inputs plus the bias
        z = tf.matmul(inputs, self.w) + self.b
        # Non-linear activation
        return tf.math.sigmoid(z)

In [6]:
layer = tf.keras.layers.Dense(units=2)
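# Built-in equivalent of the manual layer; it is callable, e.g. layer(inputs).
# Dense applies no activation by default; pass activation="sigmoid" to match the manual version.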

In [8]:
# Multi-output perceptrons stacked in sequence: 5 units, then 2 outputs
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5),
    tf.keras.layers.Dense(2)
])

More complex tasks require more depth, so you introduce more hierarchical non-linearities. A single dense layer followed by one non-linearity can extract only a limited amount of complexity, because its expressive capacity is bounded by that single non-linearity. As tasks become more complex, you need deeper, more expressive functions. More outputs, on the other hand, simply mean you need to predict more things.

For example, to generate an image you need to produce a value for every pixel of that image, which implies a lot of outputs.

## Applying Neural Networks
Without training, a neural network is like a baby with no knowledge of the world: it knows nothing about the problem and must first learn about it.

To train our model, it first needs to know when it makes bad predictions. That means being able to quantify how bad or how good a prediction is. This measure is called the loss.
The loss measures how far the network's predictions are from the ground truth: the smaller the loss, the closer the prediction is to the truth. The empirical loss aggregates the loss over our entire dataset (typically as an average). Cross-entropy loss can be used with models that output a probability between 0 and 1. Mean squared error loss can be used with regression models that output continuous real numbers.
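
The two losses named above can be sketched in a few lines of plain Python. This is an illustrative version, not the TensorFlow or PyTorch implementation: the function names, the `eps` clamp, and the sample values are assumptions made for this example, and the cross-entropy shown is the binary form averaged over the examples.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Regression example: only the last prediction is off, by 2
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3

# Classification example: the same labels with good and bad probabilities
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(good < bad)  # True: confident wrong predictions give a larger loss
```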

## Training Neural Networks
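We want to find the network weights that achieve the minimal loss over our entire dataset.

The loss is a function of the network weights, so finding the best weights is an optimization problem. To solve it we use a method called gradient descent: repeatedly compute the gradient of the loss with respect to the weights and take a small step in the opposite direction.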
<img src="https://global.discourse-cdn.com/dlai/optimized/3X/f/5/f58df86a4c92695569d9536d7e752161cd0f98fb_2_690x371.jpeg">


In [None]:
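import tensorflow as tf

lr = 0.1  # learning rate (illustrative value)

def compute_loss(w):
    # Placeholder loss for illustration only; substitute the real loss
    return tf.reduce_sum(tf.square(w))

weights = tf.Variable(tf.random.normal(shape=[1]))
for _ in range(100):  # a fixed number of steps instead of looping forever
    with tf.GradientTape() as g:
        loss = compute_loss(weights)
    gradient = g.gradient(loss, weights)  # d(loss) / d(weights)
    weights.assign_sub(lr * gradient)     # update in place so the variable stays tracked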

## Neural Networks in Practice
Loss functions can be difficult to optimize.

The learning rate decides how large a step we take along the negative gradient in each update.
If we set the learning rate too small, we make very slow progress from the starting point and can get stuck in a local minimum that may not be the best minimum we could reach.
If we set it too large, we overshoot: we start stepping in the right direction but then diverge out of the stable region of learning.
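
The effect of the learning rate can be seen on a toy problem. The sketch below minimises f(w) = w**2 (an assumed example function whose gradient is 2*w) with three arbitrary learning rates. Because this function has a single minimum at w = 0, it shows slow progress and divergence, but not the local-minimum problem.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, from a fixed start."""
    for _ in range(steps):
        w = w - lr * 2.0 * w  # step against the gradient
    return w

# Too small: still far from 0 after 20 steps.
# Moderate: close to 0.
# Too large: every step overshoots and |w| keeps growing.
for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```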

### How do we set the learning rate?
Use an optimizer that adapts itself while optimising. That is, the learning rate increases or decreases as a function of the gradients and the data.
Many adaptive learning rate methods have been built and are available in TF and Torch.

In [1]:
import tensorflow as tf

tf.keras.optimizers.SGD
tf.keras.optimizers.Adam
tf.keras.optimizers.Adadelta
tf.keras.optimizers.Adagrad
tf.keras.optimizers.RMSprop
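
To make "adaptive" concrete, here is a simplified Adagrad-style update in plain Python. It is a sketch of the idea, not the code behind `tf.keras.optimizers.Adagrad`: the function name, the toy loss f(w) = w**2, and the default values are assumptions chosen for illustration.

```python
import math

def adagrad_minimise(lr=0.5, steps=50, w=1.0, eps=1e-8):
    """Adagrad-style updates on f(w) = w**2 (gradient 2*w)."""
    accumulated = 0.0  # running sum of squared gradients
    for _ in range(steps):
        grad = 2.0 * w
        accumulated += grad ** 2
        # The effective step size shrinks as squared gradients accumulate
        w -= lr / (math.sqrt(accumulated) + eps) * grad
    return w

print(adagrad_minimise())  # very close to the minimum at w = 0
```

The base rate `lr` is divided by the square root of the accumulated squared gradients, so the step size depends on the gradients seen so far rather than being fixed.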


## Overfitting
Ideally in ML we don't want models that work well only on our training set. We want them to work well on a brand new dataset. We use the training data as a proxy, so that the model learns to perform well on unseen test data.
<img src="https://miro.medium.com/v2/1*_7OPgojau8hkiPUiHoGK_w.png">
What you want is to end up in the middle: fit your training points, but without relying on them too much or memorizing them.

## Regularization
During training we randomly set some activations to 0. With dropout at 25%, for example, roughly 25% of the neurons' activations are zeroed in each training step, which forces the network not to rely too much on the output of any one neuron. Even if we feed the same data to the model twice, a different set of neurons is dropped each time, so the model cannot simply memorize it; this increases the stochasticity of training.
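
A dropout step can be sketched in plain Python as follows. The function name and the seeded generator are assumptions for this example. It also rescales the surviving activations by 1 / (1 - rate), the common "inverted dropout" convention, which the text above does not mention.

```python
import random

def dropout(activations, rate=0.25, rng=None):
    """Zero each activation with probability `rate`; scale the survivors
    by 1 / (1 - rate) so the expected sum stays the same."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the example repeatable
out = dropout([1.0] * 1000, rate=0.25, rng=rng)
dropped = sum(1 for a in out if a == 0.0)
print(dropped / len(out))  # close to 0.25; the exact value depends on the seed
```
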
Early stopping means stopping training before the model starts to overfit the training data.
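
One common way to decide when to stop is to watch the loss on held-out validation data and stop once it has failed to improve for a few epochs in a row. The sketch below assumes that rule: the `patience` parameter, the function name, and the loss values are illustrative and not taken from the text above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Validation loss improves until epoch 2, then gets worse
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
print(early_stopping_epoch(losses))  # 4
```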