**Introduction to neural networks**

Think of a mathematical function such as f(x) = x*x. If we input any number into the function, it returns that number multiplied by itself (its square). 
In general, a function can be viewed as a machine that does something to our input (or entry). 
The anatomy of a neural network is simple: we have neurons arranged in layers. For instance, the first layer may have 3 neurons stacked above each other (O1, O2 & O3).
Each of these neurons is a function of the form Oi = Σj (Ij * Wji) + Bi, and multiple inputs can go into each neuron together. Here j is the input number and i is the neuron number. Say we have 4 inputs; the equations would look like:


O1 = I1W11 + I2W21 + I3W31 + I4W41 + B1

O2 = I1W12 + I2W22 + I3W32 + I4W42 + B2

O3 = I1W13 + I2W23 + I3W33 + I4W43 + B3


O1 would read as: 
output 1 = input 1 * weight from input 1 to output 1 + input 2 * weight from input 2 to output 1 + input 3 * weight from input 3 to output 1 + input 4 * weight from input 4 to output 1 + bias of output 1. 

Note:
1. Here output 1 is the 1st neuron in the first layer.
2. The inputs here refer to components of a single entry.
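
The three equations above follow one pattern, which can be written compactly as a sum. The summation notation is an addition for illustration; as in the equations above, i is the neuron number and j is the input number:

```latex
O_i = \sum_{j=1}^{4} I_j W_{ji} + B_i, \qquad i = 1, 2, 3
```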

Therefore O1, O2 & O3 make up the first layer. If we have a 2nd layer with 2 neurons, the outputs from the first layer become the inputs of the second layer.

One question to have is:
  what are W and B here, and where do they come from?


To understand this, let's first go back to our function example.
In a neural network the function is not defined for us, which means that the equation part "x*x" is not given. Instead we have multiple inputs (multiple entries) like 2, 3, 5, 6, 7 and the corresponding outputs 4, 9, 25, 36, 49. The network is given the inputs one by one (here a single input neuron would feed in each value), together with the corresponding true output. The neural network is a machine that works out the equation 'x*x' for us. 

Coming back to W and B: these are randomly initialized numbers. Every input that is given first goes through all the calculations of the first layer, then the second layer, after which it reaches some final output(s). Initially, these outputs will be wrong most of the time.  

Remember that we have a true output as well. The inputs and the true outputs were the only things we had to begin with. The network output is compared with the true output, which gives us an idea of how the network has done. We do this for several entries and get a number that measures how well our network is doing. This number is called the 'loss'; the lower it is, the better the network is doing.
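
As a rough illustration of comparing network outputs with true outputs, the sketch below scores some made-up predictions with a mean squared error. The predicted values and the choice of mean squared error are assumptions for illustration only; the loss used later in this notebook is categorical cross-entropy.

```python
# True outputs for the inputs 2, 3, 5, 6, 7 from the x*x example
true_outputs = [4, 9, 25, 36, 49]
# Hypothetical outputs of an untrained network (made-up numbers)
network_outputs = [3.0, 10.0, 20.0, 30.0, 50.0]

# Squared difference for every entry, then the average over all entries
squared_errors = [(pred - true) ** 2 for pred, true in zip(network_outputs, true_outputs)]
loss = sum(squared_errors) / len(squared_errors)

print(loss)  # 12.8; a smaller number means the outputs are closer to the true outputs
```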

After this we tweak the weights and biases using calculus so that the loss becomes as small as possible, which makes the network's outputs match the true outputs as closely as it can. This process of working backwards from the loss to find how the weights and biases should be updated is called back-propagation.
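
To make the idea of tweaking a weight with calculus concrete, here is a minimal sketch with a single weight and no bias. The data (y = 3x), the learning rate, and the number of steps are all assumptions chosen for illustration; a real network repeats the same idea for every weight and bias.

```python
# Toy data following y = 3 * x (hypothetical, for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.5               # starting guess for the weight
learning_rate = 0.01  # assumed step size

for step in range(200):
    # Derivative of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Move w a small step in the direction that lowers the loss
    w -= learning_rate * grad

print(w)  # ends up close to 3.0
```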


After we have given enough examples, we will have a set of weights and biases that acts like that 'equation'. If we then have more inputs for which we don't have the outputs, this 'equation', the combination of 'trained' weights and biases, gives us the correct outputs for those inputs, or at least it does so most of the time. This brings us to the accuracy of a 'model', usually given as a percentage, which tells us how many times out of 100 inputs we get a correct output.
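
Accuracy as described above can be sketched in a few lines. The predicted and true labels below are made up for illustration:

```python
# Hypothetical predicted classes and true classes for 10 entries
predictions = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
true_labels = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Count the entries where the prediction matches the true label
correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
accuracy = 100 * correct / len(true_labels)

print(accuracy)  # 80.0, i.e. 8 out of 10 entries predicted correctly
```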


This can be used for much more complex problems, like a cat-dog classifier: after training our weights and biases on hundreds or thousands of examples, the network will be able to recognize whether an image contains a cat or a dog. Another example is recognizing letters from handwriting. There are several other applications of neural networks, but they generally expect structured data, which simply means that every entry has well-defined components/attributes (these are the different inputs that go into the neurons of the first layer).



**What we will cover**

We will program the neurons, layers, activation functions, optimization and back-propagation all from scratch in Python. First we'll do it without using any 3rd party libraries, and then we will use numpy.

Optimization is the process of adjusting the weights and biases so that the loss function we will be using is minimized. Activation functions are functions applied to every single neuron in a layer to bring about non-linearity. We will understand this in more detail later. 

numpy stands for Numerical Python. It allows us to work with arrays, which are like lists. The difference is that numerical operations on arrays are much faster than on lists. numpy also offers various other functionality, like broadcasting. We will look at these later as well.
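
As a small preview of broadcasting, the sketch below adds a single row to every row of a 2-d array without writing a loop. The values are made up for illustration:

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])   # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])     # shape (3,)

# Broadcasting: the row is added to every row of the matrix
result = matrix + row

print(result.shape)  # (2, 3)
print(result)
```

This is the same mechanism that later lets a list of biases be added to every row of a batch of outputs.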


**Let's get started**

In a fully connected neural network, every neuron is connected to every neuron of the previous layer. Let's say that 3 neurons feed into the neuron we are going to build; their outputs become the inputs of this neuron.
Please note that we are just making up the inputs & weights here. In reality the inputs come from data, which may originate from sensors, databases or previous neurons.
Since we cannot really change the input values, we need weights and biases to have some control over the value of our output. They are randomly initialized first and then tuned during back-propagation. 
Note that the best possible values of the weights and biases for our purpose are reached iteratively, by repeating forward propagation and back-propagation many times. This is how the 'learning' happens.
The optimizer tunes the values of the weights and biases in an attempt to fit the data.

In [None]:
#Coding a single neuron having 3 inputs from the previous layer

inputs = [2.2,3.4,3] # Outputs from sensors or previous-layer neurons, stored in a list
weights = [1.3,4.3,5.6] # Every input to this new neuron has its own weight
bias = 5 # Every neuron also has its own bias

output = inputs[0]*weights[0] + inputs[1]*weights[1] + inputs[2]*weights[2] + bias # The linear calculation for our neuron; the number in square brackets is the index into each list
print(output)

39.28


Now let's say that we have a layer of 3 neurons. Each of them receives the same inputs from the neurons of the previous layer, but the weights assigned are different, and every neuron of our layer has its own bias.
We'll eventually use loops or numpy to make this a lot less messy. 

In [None]:
#coding a layer of 3 neurons having 4 inputs from the previous layer

inputs = [2,1,3,4] # We will only have one set of inputs coming from previous layer

weights1 = [1.2,3.3,5.4,6.2] #Set of weights that will multiply with inputs to make the first neuron in current layer
weights2 = [3.2,5.5,4.9,2.3] #Set of weights that will multiply with inputs to make the second neuron in current layer
weights3 = [5.2,1.8,2.4,2.2] #Set of weights that will multiply with inputs to make the third neuron in current layer

bias1 = 2 # For the first neuron
bias2 = 8 # For the second neuron
bias3 = 4 # For the third neuron

output = [inputs[0]*weights1[0] + inputs[1]*weights1[1] + inputs[2]*weights1[2] + inputs[3]*weights1[3] + bias1, 
          inputs[0]*weights2[0] + inputs[1]*weights2[1] + inputs[2]*weights2[2] + inputs[3]*weights2[3] + bias2, 
          inputs[0]*weights3[0] + inputs[1]*weights3[1] + inputs[2]*weights3[2] + inputs[3]*weights3[3] + bias3]

print(output) # we get 3 values, one for each neuron in the layer

[48.7, 43.8, 32.2]



**Lists, Arrays and matrices**

Now we will use numpy, and we will also take a look at the shapes of arrays.

*   A list corresponds to a 1-d array in numpy and a vector in mathematics, with shape (n,).
*   A list of lists corresponds to a 2-d array in numpy and a matrix in mathematics, with shape (n, m). Arrays have to be homogeneous, which means that every row must have the same length; otherwise we could not say what the shape would be.
*   A list of lists of lists corresponds to a 3-d array (loosely, a 3-d matrix), with shape (n, m, k).

The dot product of two vectors results in a single scalar value.
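
The shapes described above can be checked directly with numpy. The values below are made up for illustration:

```python
import numpy as np

vector = np.array([1, 2, 3])                 # built from a list
matrix = np.array([[1, 2, 3], [4, 5, 6]])    # built from a list of lists
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])          # built from a list of lists of lists

print(vector.shape)  # (3,)
print(matrix.shape)  # (2, 3)
print(cube.shape)    # (2, 2, 2)

# The dot product of two vectors is a single number
dot = np.dot(vector, np.array([4, 5, 6]))
print(dot)           # 1*4 + 2*5 + 3*6 = 32
```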

Now that we have the terminology out of the way, let's simplify things.

In [20]:

import numpy as np

inputs = [2,1,3,4]
weights  = np.array([[1.2,3.3,5.4,6.2],[3.2,5.5,4.9,2.3],[5.2,1.8,2.4,2.2]]) # 2-d array built from a list of lists
biases = [2,4,8]

# First let's try using a for loop
# We start with an empty list, which will hold the outputs of our layer
# For each neuron we multiply its list of weights with the inputs element by element, sum the products and add that neuron's bias

layer = []
for neuron_weights, neuron_bias in zip(weights, biases): # zip pairs the ith element of each argument into a tuple; the for loop unpacks them
  output = 0
  for input_n, weight_n in zip(inputs, neuron_weights):
    output += input_n*weight_n
  output += neuron_bias
  layer.append(output)

print(layer)




[48.7, 39.8, 36.2]


So now we will take the dot product, where the elements of two vectors are multiplied element-wise and the products are then summed up. For matrices this generalizes to matrix multiplication.

In [4]:
#doing the same with numpy
import numpy as np

inputs = [1,2,3,4]
weight = [5,6,7,8]
bias = 5

print(np.dot(weight,inputs)+bias) # elements at index i of inputs and weight are multiplied, then all the products are added together along with the bias
# Note that inputs and weight are both vectors here, so the order in which we write them does not matter; it will matter once we use matrices


inputs = [12,34,53,43]
weights  = [[1.2,3.3,5.4,6.2],[3.2,5.5,4.9,2.3],[5.2,1.8,2.4,2.2]]
bias = [2,5,7]

output = np.dot(weights,inputs) + bias # We'd get a shape error if inputs came first, because a vector of 4 elements cannot be multiplied with a matrix that has 3 rows. The bias is then added element by element
output


75


array([681.4, 589. , 352.4])

Now we will take inputs in batches, which means we have several individual entries, each with the same number of attributes, and we put them all into the list of inputs. Note that the number of neurons stays the same, so the number of weights and biases does not change (only their values will be updated later on).

In [5]:
inputs = [[12,34,53,43],[34,53,23,74],[6,4,2,5]] #inputs is a matrix now
weights  = [[1.2,3.3,5.4,6.2],[3.2,5.5,4.9,2.3],[5.2,1.8,2.4,2.2]]
bias = [2,5,7]

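output = np.dot(inputs,np.array(weights).T) + bias # np.dot(inputs, weights) would raise a shape error because the inner dimensions (4 and 3) don't match, so we transpose the weights. The order matters: with inputs first we get one row per entry and one column per neuron, which lines up with the bias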
output

array([[681.4, 589. , 352.4],
       [800.7, 688.2, 497.2],
       [ 64.2,  67.5,  61.2]])
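
The shapes involved in the batched product can be checked directly. This sketch reuses the values from the cell above and compares both orders of multiplication; the names `samples_first` and `neurons_first` are introduced here for illustration only.

```python
import numpy as np

inputs = np.array([[12, 34, 53, 43], [34, 53, 23, 74], [6, 4, 2, 5]])  # (3 entries, 4 attributes)
weights = np.array([[1.2, 3.3, 5.4, 6.2],
                    [3.2, 5.5, 4.9, 2.3],
                    [5.2, 1.8, 2.4, 2.2]])                              # (3 neurons, 4 weights)

samples_first = np.dot(inputs, weights.T)   # (3, 4) x (4, 3) -> one row per entry
neurons_first = np.dot(weights, inputs.T)   # (3, 4) x (4, 3) -> one row per neuron

print(samples_first.shape, neurons_first.shape)
# Swapping the order gives the transpose of the result, not the same matrix
print(np.allclose(samples_first, neurons_first.T))
```

Keeping one row per entry is convenient because a list with one bias per neuron then broadcasts across the columns of every row.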

In [19]:
#Doing this for 2 layers: we take the output from the first layer as the input for the second layer

inputs = [[1,2,3,2.5],[2.0,5.0,-1.0,2.0],[-1.5,2.7,3.3,-0.8]]
weights = [[0.2,0.8,-0.5,1],[0.5,-0.91,0.26,-0.5],[-0.26,-0.27,0.17,0.87]]
bias =[ 2,3,0.5]

weights2 = [[0.1,-0.14,0.5],[-0.5,0.12,-0.33],[-0.44,0.73,-0.13]]
bias2 =[ -1,2,-0.5]

layer1_outputs = np.dot(inputs, np.array(weights).T)+bias
layer2_outputs = np.dot(layer1_outputs, np.array(weights2).T)+bias2 #We take the layer 1 outputs as the input for this layer, which has its own set of weights and biases

print(layer2_outputs)



[[ 0.5031  -1.04185 -2.03875]
 [ 0.2434  -2.7332  -5.7633 ]
 [-0.99314  1.41254 -0.35655]]


If we want to add an arbitrary number of neurons and layers, we need to create a class. The np.random.randn function in numpy generates an array of random numbers of the given shape (drawn from a standard normal distribution). The biases are all initialized as zeros.
We don't initialize the weights as zeros because that would mean multiplying the inputs by zero, and every neuron in the next layer would receive 0 as its input. 
We also scale down our weights: since the weights multiply the inputs at every layer, the values could explode into really big numbers by the final layer after a few multiplications, making further calculations difficult.
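
The effect of scaling the weights can be seen in a rough experiment. The sketch below pushes a random batch through a few purely linear layers, once with unscaled weights and once with weights multiplied by 0.10. The layer width, number of layers, seed, and the helper name `final_magnitude` are assumptions for illustration, and biases and activations are left out.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
x = rng.standard_normal((3, 64))     # hypothetical batch: 3 entries, 64 attributes

def final_magnitude(scale, n_layers=5, width=64):
    """Push x through n_layers linear layers and return the mean absolute output."""
    out = x
    for _ in range(n_layers):
        weights = scale * rng.standard_normal((width, width))
        out = np.dot(out, weights)
    return np.mean(np.abs(out))

unscaled = final_magnitude(1.0)
scaled = final_magnitude(0.10)

print(unscaled)  # grows very large after a few layers
print(scaled)    # stays small
```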

In [None]:
#Generalizing for everything

inputs = [[1,2,3,2.5],
          [2.0,5.0,-1.0,2.0],
          [-1.5,2.7,3.3,-0.8]] #The inputs will be the same 

class layer_dense:
  def __init__(self, n_inputs, n_neurons):
    self.weights = 0.10* np.random.randn(n_inputs,n_neurons) # weights are stored with shape (n_inputs, n_neurons) so that we don't have to transpose every time; multiplying by 0.10 scales them down
    self.bias = np.zeros((1,n_neurons))
  def forward(self, inputs):
    self.output = np.dot(inputs,self.weights) + self.bias

layer1 = layer_dense(4,5)
layer2 = layer_dense(5,9)

layer1.forward(inputs)
layer2.forward(layer1.output)

print(layer2.output)




[[ 4.57089034e-02 -5.41558048e-03 -1.98420004e-02 -1.80550777e-02
  -1.77800832e-02  6.12640653e-02 -3.69887455e-02 -8.20820222e-05
   1.91196544e-02]
 [ 5.60240361e-02  5.66563448e-02 -6.94201882e-02 -7.57556957e-02
  -3.17718544e-02 -9.86291349e-03 -5.49817928e-02 -3.79518315e-02
   5.12720715e-02]
 [ 5.35053018e-02 -8.28123966e-03 -5.70026774e-02 -2.37405478e-03
   1.84340650e-02  3.57871734e-02 -8.05422651e-03  2.85464219e-02
   2.30146612e-02]]


**Activation functions**

Now let's take a look at activation functions.
They are necessary to bring non-linearity to our neural network. Some of them, such as sigmoid, also squash the output into a fixed range.
The output of every neuron so far is linear, like a straight line, and stacking linear layers still gives a linear function. This makes it difficult to model data that is distributed, for example, in a spiral. Activation functions help the model shape itself according to the data's distribution, giving higher accuracy during predictions or testing.
First we have the sigmoid function, y = 1/(1+e^(-x)), which squashes its input into the range between 0 and 1. Then we have ReLU, which is linear for positive x but clips any value of x below 0 to 0. This may still seem linear, but the non-linearity shows up when multiple neurons with ReLU work together.

In [None]:
import numpy as np

class Activation_ReLu:
  def forward(self,inputs):
    self.output = np.maximum(0,inputs)

#We will use ReLU on all of our layers except the final one
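
For comparison, the sigmoid function mentioned above could be written in the same style. This is only a sketch: the class name `Activation_Sigmoid` is hypothetical and the class is not used in the rest of this notebook.

```python
import numpy as np

class Activation_Sigmoid:  # hypothetical class, named to match Activation_ReLu
  def forward(self, inputs):
    # 1 / (1 + e^(-x)) squashes every value into the range (0, 1)
    self.output = 1 / (1 + np.exp(-np.array(inputs, dtype=float)))

values = [-2.0, 0.0, 2.0]

sigmoid = Activation_Sigmoid()
sigmoid.forward(values)
print(sigmoid.output)         # all values between 0 and 1, with 0.0 mapped to 0.5

print(np.maximum(0, values))  # ReLU: negatives clipped to 0, positives unchanged
```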

**Softmax activation for final layer**

We will also take a look at the softmax function, since ReLU is not a good fit for the final layer of our network: it clips negative values to 0. Softmax gives us the probability of each class being the right answer, for example the probability that a certain picture is a cat picture or a dog picture.
First we exponentiate our output values, then we normalize them, which means dividing each exponentiated value by the sum of all the exponentiated values for the same entry (the same row of the output matrix).
One problem that we come across with exponentiation is the explosion of values as the input to the function grows, which can lead to overflow.
To prevent this we subtract the maximum value of each entry's outputs from every value in that entry. The values being exponentiated are then at most 0, so the exponentials lie between 0 and 1. One concern we may have is how this impacts the output of the softmax activation function. Once we do the normalization, we'll see that the actual output is just the same as before, because the common factor cancels out.
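
The claim that subtracting the maximum leaves the softmax output unchanged can be checked numerically. This is a small sketch with made-up values; the variable names are introduced here for illustration.

```python
import numpy as np

outputs = np.array([1.2, 3.5, 2.5])

# Plain softmax
plain = np.exp(outputs) / np.sum(np.exp(outputs))

# Softmax after subtracting the maximum value
shifted = outputs - np.max(outputs)   # largest entry becomes 0, the rest are negative
stable = np.exp(shifted) / np.sum(np.exp(shifted))

print(np.allclose(plain, stable))     # True: same probabilities either way

# With large inputs the shifted version still works,
# whereas np.exp(1000.0) on its own would overflow to inf
large = np.array([1000.0, 1001.0, 1002.0])
large_shifted = large - np.max(large)
large_softmax = np.exp(large_shifted) / np.sum(np.exp(large_shifted))
print(large_softmax)
```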


In [None]:
import numpy as np


outputs = [1.2,3.5,2.5]

exponentiated = np.exp(outputs) 
normalized = exponentiated / np.sum(exponentiated) # This combination of exponentiation and normalization is called softmax

class Activation_softmax:
  def forward(self,inputs):
    exponentiated_values = np.exp(inputs - np.max(inputs, axis = 1, keepdims = True)) # axis=1 takes the max of each row (entry); without it we'd get the max of the whole batch of inputs
    normalized = exponentiated_values / np.sum(exponentiated_values, axis = 1, keepdims= True) # axis=1 and keepdims=True give one sum per row; without them we'd get a single value for the whole batch
    self.output = normalized



print(sum(normalized)) # note that all probabilities add up to 1

1.0


**The forward pass**

Now that we have all the components for the forward pass, let's piece them together. We are still using made-up inputs, but any inputs of the right shape would work. I have used only 2 layers for the purpose of illustration.

In [None]:
x = [[1,2,3,2.5],
     [2.0,5.0,-1.0,2.0],
     [-1.5,2.7,3.3,-0.8]]

layer1 = layer_dense(4,5)
layer1_activation = Activation_ReLu()

layer2 = layer_dense(5,2)
layer2_activation = Activation_softmax()

layer1.forward(x)
layer1_activation.forward(layer1.output)

layer2.forward(layer1_activation.output)
layer2_activation.forward(layer2.output)

print(layer2_activation.output) #This is our final output
print(np.sum(layer2_activation.output, axis=1, keepdims= True)) #Note that the sum of all the probabilities add to 1


[[0.55838635 0.44161365]
 [0.54539434 0.45460566]
 [0.51320077 0.48679923]]
[[1.]
 [1.]
 [1.]]
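
One way to turn these probabilities into a predicted class per entry is to take the index of the largest value in each row. The sketch below copies the output printed above; since the weights are random, the numbers will differ on every run.

```python
import numpy as np

# Softmax output copied from the run above
probabilities = np.array([[0.55838635, 0.44161365],
                          [0.54539434, 0.45460566],
                          [0.51320077, 0.48679923]])

# Index of the largest probability in each row = predicted class for that entry
predicted_classes = np.argmax(probabilities, axis=1)
print(predicted_classes)  # [0 0 0]
```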


**Loss function (Categorical-cross entropy)**

We now need a loss function. A loss function tells us how bad the current values of the weights and biases are. We will use a loss called categorical cross-entropy, and we will one-hot encode our target classes (true outputs). 
As an example of one-hot encoding, suppose we have 3 classes. We encode the class at the first index as [1,0,0], the second index as [0,1,0] and the third index as [0,0,1], giving one such vector as the true output for every entry.
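
One-hot encoding can be sketched with numpy as follows. The labels are made up, and using rows of the identity matrix is just one convenient way of doing it:

```python
import numpy as np

class_targets = [1, 1, 0]   # class index for each of 3 entries (hypothetical labels)
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[class_targets]
print(one_hot)

# Going back from one-hot vectors to class indexes
print(np.argmax(one_hot, axis=1))  # [1 1 0]
```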


Categorical cross-entropy takes the negative sum of the target value multiplied by the log of the predicted value, over every class in the distribution.
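
Written as a formula for a single entry, with y_i as the target value and ŷ_i as the predicted probability of class i (symbols introduced here for illustration):

```latex
L = -\sum_{i} y_i \log(\hat{y}_i)
```

When the target is one-hot encoded with the correct class at index k, every term except one vanishes and this reduces to L = -log(ŷ_k).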


In [None]:
#let's understand with an example first, note that this is for a single input.

softmax_output = [0.2,0.5,0.3] #The predicted output 
target_output = [1,0,0] #The desired output

loss = -(np.log(softmax_output[0])*target_output[0] +
         np.log(softmax_output[1])*target_output[1] +
         np.log(softmax_output[2])*target_output[2])

print(loss)
print(-np.log(0.2)) #Both of these values are the same, because the one-hot target zeroes out every term except the one for the correct class

#Now we need to take these inputs in batches

softmax_outputs = [[0.3,0.4,0.3],
                   [0.1,0.7,0.2],
                   [0.9,0.1,0.0]]

#Let's say that the targets are cat, cat, dog, where the first index in each row is the dog class, the 2nd is the cat class and the third is the bird class
#so the indexes of the correct classes are [1,1,0]
#and the predicted probabilities of the correct classes are [0.4,0.7,0.9]
#we can pick them out the following way

softmax_outputs = np.array([[0.3,0.4,0.3],
                            [0.1,0.7,0.2],
                            [0.9,0.1,0.0]])

class_target = [1,1,0]


print(softmax_outputs[[0,1,2], class_target]) # The first list holds the row indexes and the second list holds the index of the value to pick from each row


log_loss = -np.log(softmax_outputs[range(len(softmax_outputs)),class_target])
avg_loss = np.mean(log_loss)
print(avg_loss)

#here we use range(len(softmax_outputs)) to generalize to any number of inputs.
#Finally we take the negative log of these values to get the corresponding losses for the inputs
#and then we take the mean of those losses
#Sometimes the predicted probability of the correct class is 0; the log of 0 is negative infinity, so the loss is infinite and the mean of the losses is infinite too
#To deal with this we clip the values to the range from 1e-7 (a pretty insignificant value) to 1-1e-7


class Loss:
  def calculate(self, output, y):
    sample_losses = self.forward(output,y) # one loss value per entry
    data_loss = np.mean(sample_losses) # average loss over the batch
    return data_loss

class Loss_CategoricalCrossEntropy(Loss):
  def forward(self, y_pred, y_true):
    samples = len(y_pred)
    y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7) # avoid taking the log of 0

    if len(y_true.shape) == 1 : # targets given as class indexes, e.g. [1,1,0]
      correct_prob = y_pred_clipped[range(samples), y_true]
    elif len(y_true.shape) == 2 : # targets given as one-hot vectors
      correct_prob = np.sum(y_pred_clipped*y_true, axis =1)

    negative_log_loss = -np.log(correct_prob) # the minus sign makes the loss positive, since the log of a probability is never above 0
    return negative_log_loss


1.6094379124341003
1.6094379124341003
[0.4 0.7 0.9]
0.4594420638235713


In [None]:
x = np.array([[1,2,3,2.5],
     [2.0,5.0,-1.0,2.0],
     [-1.5,2.7,3.3,-0.8]]) #input values

y = np.array([[0,1,0],
     [0,1,0],
     [1,0,0]]) #Correct labels (one-hot encoded)


layer1 = layer_dense(4,5)
layer1_activation = Activation_ReLu()

layer2 = layer_dense(5,3)
layer2_activation = Activation_softmax()

layer1.forward(x)
layer1_activation.forward(layer1.output)

layer2.forward(layer1_activation.output)
layer2_activation.forward(layer2.output)

loss_function = Loss_CategoricalCrossEntropy()
loss = loss_function.calculate(layer2_activation.output, y) #we need to make sure that the shape of the softmax output matches the shape of the correct target labels
 

print(loss)

1.1259439547329413
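
The clipping step used in Loss_CategoricalCrossEntropy can be checked on its own. In this sketch the prediction is made up so that the correct class receives a probability of exactly 0:

```python
import numpy as np

# Hypothetical softmax output where the correct class (index 2) was given probability 0
y_pred = np.array([[0.9, 0.1, 0.0]])
correct_class = 2

# Same clipping range as in Loss_CategoricalCrossEntropy
clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
loss = -np.log(clipped[0, correct_class])

print(loss)  # large but finite (about 16.1) instead of inf
```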


#Backpropagation will be continued soon