##### Neural networks were built to mimic the functioning of the human brain.

##### Traditional learning algorithms were not able to scale their performance with the amount of data available and reached a stagnation point. This is where neural networks come into play. They are able to learn from large amounts of data and generalize well on unseen data. 

##### To understand how neural networks work, we first need to understand their basic building block. Let us take a simple example of demand prediction for a product. Let x be the price of the product and y be the demand for the product. We can represent this relationship as y = f(x). The function f(x) is the neural network that we are trying to learn. Let us use logistic regression to classify the product as high demand or low demand. The logistic regression model can be represented as y = sigmoid(w*x + b). Here w and b are the weight and bias of the model. The sigmoid function converts the output of the linear model to a probability value between 0 and 1. This is a neural network with a single neuron.
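
##### As a minimal sketch of the single neuron described above, the snippet below computes sigmoid(w*x + b) for two prices. The function name `predict_high_demand` and the values of w and b are assumptions chosen for illustration, not learned parameters.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_high_demand(price, w, b):
    """Single-neuron model: probability that demand is high at this price."""
    return sigmoid(w * price + b)

# Hypothetical parameters: a negative weight means demand falls as price rises.
w, b = -0.5, 5.0
p_cheap = predict_high_demand(4.0, w, b)       # z = 3.0
p_expensive = predict_high_demand(16.0, w, b)  # z = -3.0
print(round(p_cheap, 3), round(p_expensive, 3))
```

##### Thresholding the probability at 0.5 would label the cheaper product as high demand and the more expensive one as low demand.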

##### Now let us consider a more complex example where we have multiple features to predict the demand for the product. We can represent this as y = f(x1, x2, x3, ..., xn). The function f(x1, x2, x3, ..., xn) is the neural network that we are trying to learn. With a single neuron this becomes y = sigmoid(w1*x1 + w2*x2 + w3*x3 + ... + wn*xn + b), which is still logistic regression, only with n inputs. A multi-layer neural network goes further: several such neurons, each with its own weights and bias, form a hidden layer, and their outputs become the inputs of the next layer.

##### A layer in a neural network is a collection of neurons that take the same inputs and apply the same kind of operation, each with its own weights and bias. The input layer is the first layer of the neural network and takes the input data. The hidden layers are the layers between the input and output layers. The output layer is the last layer of the neural network and produces the output.

##### Activation functions are used to introduce non-linearity in the neural network. The most commonly used activation functions are sigmoid, tanh, and ReLU. The sigmoid function converts the output of the linear model to a value between 0 and 1, which can be read as a probability. The tanh function converts the output of the linear model to a value between -1 and 1. The ReLU function sets negative values to 0 and passes positive values through unchanged.

##### ReLU is one of the most commonly used activation functions in deep learning, especially in hidden layers. It is defined as f(x) = max(0, x). ReLU is computationally efficient and is less prone to the vanishing gradient problem than sigmoid or tanh, because its gradient is exactly 1 for every positive input. The vanishing gradient problem occurs when the gradient of the activation function becomes very small, leading to slow convergence of the neural network.
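
##### To make the gradient argument concrete, here is a small sketch that compares the derivative of ReLU with the derivative of sigmoid at a few inputs. The helper names are illustrative, and treating the ReLU derivative at exactly 0 as 0 is a convention assumed here.

```python
import math

def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which shrinks toward 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.5, 10.0):
    print(x, relu(x), relu_grad(x), sigmoid_grad(x))
```

##### For large positive inputs the sigmoid gradient is close to 0 while the ReLU gradient stays at 1. For negative inputs the ReLU gradient is 0, so ReLU reduces the problem rather than removing it entirely.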

##### Each neuron in a particular layer has access to all the outputs of the previous layer and it eventually learns to ignore the less important features or activations. This is done by adjusting the weights and biases of the neurons during the training process. The weights and biases are updated using an optimization algorithm such as gradient descent.

##### A neural network does the feature engineering which we had to do manually in traditional machine learning algorithms. It automatically learns the important features from the data and generalizes well on unseen data. This is the power of neural networks and deep learning.

##### Face recognition is done using neural networks. Basically any image is represented as a matrix of pixel values. The neural network learns the important features from the image and classifies it as a particular person. This is done by training the neural network on a large dataset of images.

##### By convention we use layer 0 for the input layer, layer 1 for the first hidden layer, layer 2 for the second hidden layer, and so on. The output layer is the last layer of the neural network. We use a superscript in square brackets to denote the layer number. For example, a[1] is the activation of the first hidden layer and so on.

##### For unit (neuron) j of layer l, the pre-activation value is z[l][j] = w[l][j] . a[l-1] + b[l][j]. The activation value of that unit is a[l][j] = g(z[l][j]), where g is the activation function. One example of g is the sigmoid function.

#### Forward propagation

##### Forward propagation is used for making predictions from neural networks. It is the process of passing the input data through the neural network, layer by layer, to get the output. The output of the neural network is the prediction made by the model. The forward propagation algorithm is as follows:

##### 1. Initialize the input layer with the input data.
##### 2. For each layer l from 1 to L:
#####     z[l] = w[l] . a[l-1] + b[l]
#####     a[l] = g(z[l])
##### 3. The output of the neural network is the activation value of the output layer.


#### Implementation of forward propagation

In [2]:
import numpy as np

In [1]:
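def dense(W, b, a_in):
    """Compute the activations of one dense layer with sigmoid units."""
    # The columns of W are the weights of the units
    units = W.shape[1]
    a_out = np.zeros(units)
    for i in range(units):
        w = W[:, i]                       # weight vector of unit i
        z = np.dot(w, a_in) + b[i]        # pre-activation: w . a_in + b
        a_out[i] = 1 / (1 + np.exp(-z))   # sigmoid activation
    return a_out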
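
##### The `dense` function handles a single layer. Forward propagation chains several layers so that each layer's output feeds the next. The sketch below assumes a vectorized variant of `dense` and a hypothetical `forward` helper. The layer sizes, random weights, and input values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(W, b, a_in):
    """One dense layer; the columns of W hold the weights of each unit."""
    return sigmoid(a_in @ W + b)

def forward(layers, x):
    """Pass x through each (W, b) pair in order and return the final activation."""
    a = x
    for W, b in layers:
        a = dense(W, b, a)
    return a

# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output unit.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(2, 3)), np.zeros(3)),
    (rng.normal(size=(3, 1)), np.zeros(1)),
]
x = np.array([0.5, -1.2])
y_hat = forward(layers, x)
print(y_hat.shape)  # (1,)
```

##### Replacing the Python loop over units with a matrix product gives the same result per layer and is usually much faster for large layers.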


##### While compiling our model in TensorFlow we need to specify the optimizer and the loss function. The loss function measures the error between the predicted output and the actual output, and the optimizer updates the weights and biases of the neural network to minimize that loss during training. One example of a loss function is the binary cross entropy function, which is used for binary classification problems. It is defined as L(y, y_hat) = -1/m * sum(y * log(y_hat) + (1-y) * log(1-y_hat)), where y is the actual output (0 or 1), y_hat is the predicted probability, and the sum runs over the m training examples. One example of an optimizer is the Adam optimizer, an extension of the stochastic gradient descent algorithm that is computationally efficient.
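
##### The binary cross entropy formula can be checked by hand with a short NumPy sketch. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions of this example. Clipping is added only to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Average binary cross entropy over all examples."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

y = [1, 0, 1, 0]
good = binary_cross_entropy(y, [0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = binary_cross_entropy(y, [0.1, 0.9, 0.2, 0.8])   # confident and wrong
print(good < bad)  # True
```

##### Confident wrong predictions are penalized much more heavily than confident correct ones, which is what pushes the predicted probabilities toward the true labels.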

##### TensorFlow computes the partial derivatives needed for gradient descent using backpropagation. Backpropagation is the process of computing the gradients of the loss function with respect to the weights and biases of the neural network, by applying the chain rule backwards from the output layer. The gradients are then used to update the weights and biases during training. One training step is as follows:

##### 1. Compute the error between the predicted output and the actual output.
##### 2. Compute the gradients of the loss function with respect to the weights and biases of the neural network.
##### 3. Update the weights and biases of the neural network using the gradients and the learning rate.
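
##### The three steps above can be sketched for the simplest possible network, a single sigmoid neuron trained with the cross entropy loss. For that combination the gradient with respect to z is simply y_hat - y. The toy data, the learning rate, and the helper names are assumptions, and this is not how TensorFlow implements training internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    """Average binary cross entropy of the neuron on (X, y)."""
    y_hat = sigmoid(X @ w + b)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def train_step(w, b, X, y, lr):
    """One gradient descent step for a single sigmoid neuron."""
    y_hat = sigmoid(X @ w + b)               # forward pass
    error = y_hat - y                        # step 1: error, equal to dLoss/dz here
    grad_w = X.T @ error / len(y)            # step 2: gradients via the chain rule
    grad_b = error.mean()
    return w - lr * grad_w, b - lr * grad_b  # step 3: update with the learning rate

# Toy data (made up): the label is 1 when the single feature is positive.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = np.zeros(1), 0.0
before = loss(w, b, X, y)
for _ in range(100):
    w, b = train_step(w, b, X, y, lr=0.5)
after = loss(w, b, X, y)
print(before > after)  # True
```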


##### ReLU stands for Rectified Linear Unit. It is max(0, x), where x is the input to the neuron. It is cheap to compute and, because its gradient is 1 for every positive input, it is less prone to the vanishing gradient problem than sigmoid or tanh.

### Choosing the right activation function

##### When the output is either 0 or 1 (a binary classification problem), the sigmoid function is used as the activation function of the output layer.
##### When the output is a value between -1 and 1, the tanh function is used as the activation function of the output layer.
##### When the output can only be 0 or greater, the ReLU function is used as the activation function of the output layer.

##### In the hidden layers we mostly use only the ReLU function as the activation function. This is because the ReLU function is computationally efficient.

##### For multiclass classification problems we use softmax regression, which is a generalization of logistic regression.

### Softmax regression

##### If there are n different possible outputs then the softmax function is used as the activation function of the output layer. It is defined as a[i] = e^(z[i]) / sum(e^(z[j])), where the sum runs over j = 1 to n and i = 1 to n. Each a[i] is a value between 0 and 1 and the n values add up to 1, so the output of the softmax function is a probability distribution over the n possible outputs. The output with the highest probability is the predicted output of the model.
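
##### A minimal NumPy sketch of the softmax function follows. Subtracting the largest score before exponentiating is an extra step assumed here to avoid overflow. It does not change the result, because softmax is unaffected by adding a constant to every score. The input scores are made up.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # same result, but exp() can no longer overflow
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs.round(3), int(probs.argmax()))
```

##### The index returned by `argmax` is the predicted class, here the one with the largest score.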

##### The cost function of softmax regression is the cross entropy loss function. It is defined as L(y, y_hat) = -1/m * sum(y * log(y_hat)), where y is the actual output (one-hot encoded), y_hat is the predicted probability distribution, and the sum runs over the classes and the m training examples. For a single example whose true class is c, this reduces to -log(y_hat[c]). The cross entropy loss function is minimized during training to improve the performance of the neural network.

##### For using softmax regression we simply change the activation function of the output layer to softmax function and the loss function to cross entropy loss function. The rest of the neural network remains the same. Also, the number of output neurons is equal to the number of classes in the multiclass classification problem.

##### Multi-label classification problems: If there can be multiple labels associated with a single input then it is a multi-label classification problem.

##### SparseCategoricalCrossentropy or CategoricalCrossentropy
TensorFlow has two possible formats for target values, and the choice of loss determines which one is expected.

SparseCategoricalCrossentropy: expects the target to be an integer corresponding to the index. For example, if there are 10 potential target values, y would be between 0 and 9.
CategoricalCrossentropy: expects the target value of an example to be one-hot encoded, where the value at the target index is 1 while the other N-1 entries are zero. An example with 10 potential target values, where the target is 2, would be [0,0,1,0,0,0,0,0,0,0].
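
The two target formats carry the same information, which the following NumPy sketch illustrates. The helper functions are hypothetical stand-ins written for this example, not the TensorFlow loss classes, and the predicted probabilities are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class indices into one-hot rows."""
    return np.eye(num_classes)[labels]

def categorical_ce(y_one_hot, probs):
    """Cross entropy with one-hot targets, averaged over examples."""
    return float(-np.mean(np.sum(y_one_hot * np.log(probs), axis=1)))

def sparse_categorical_ce(labels, probs):
    """Same loss with integer targets: pick out the true class's probability."""
    rows = np.arange(len(labels))
    return float(-np.mean(np.log(probs[rows, labels])))

print(one_hot(np.array([2]), 10)[0])       # the one-hot example from the text

labels = np.array([2, 0])                  # integer (sparse) targets
probs = np.array([[0.1, 0.2, 0.7],         # made-up predicted probabilities
                  [0.6, 0.3, 0.1]])
same = np.isclose(categorical_ce(one_hot(labels, 3), probs),
                  sparse_categorical_ce(labels, probs))
print(same)  # True
```

Both functions return the same value, so the choice between the two formats depends only on how the labels are stored.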

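Using softmax directly as the activation function of the output layer can be numerically unstable, because the probabilities are rounded off before the loss takes their logarithm. A better method is to let the output layer produce the raw scores (logits) and have the loss function apply softmax internally.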

##### Not preferred way of implementing softmax regression
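import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(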
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

##### Preferred way of implementing softmax regression
preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)
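
The reason for preferring `from_logits=True` can be illustrated without TensorFlow. The sketch below compares a two-step computation (softmax, then logarithm) with a one-step computation that works on the logits directly using the log-sum-exp trick. It only mimics the idea. TensorFlow's actual implementation is not shown here, and the extreme logit values are chosen to force the failure.

```python
import numpy as np

def loss_via_probabilities(logits, label):
    """Two-step route: softmax first, then -log of the true class's probability."""
    exp_z = np.exp(logits)
    probs = exp_z / exp_z.sum()
    return -np.log(probs[label])

def loss_from_logits(logits, label):
    """One-step route: -log(softmax) computed with the log-sum-exp trick."""
    shifted = logits - logits.max()
    return -(shifted[label] - np.log(np.exp(shifted).sum()))

logits = np.array([1000.0, 0.0, -1000.0])  # deliberately extreme scores
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = loss_via_probabilities(logits, 1)
stable = loss_from_logits(logits, 1)
print(naive, stable)
```

The two-step route overflows and returns an infinite loss, while the one-step route returns the finite value 1000.0. With a linear output layer, the model outputs logits, so a softmax still has to be applied to them when probabilities are needed at prediction time.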



Today there are better optimization algorithms than plain gradient descent. One of them is Adam (Adaptive Moment Estimation). Instead of using a single fixed learning rate, it adapts the step size of every weight and bias during training, using running averages of the gradient and of the squared gradient. It is an extension of the stochastic gradient descent algorithm, it is computationally efficient, and it is used to minimize the loss function during training.
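
To show what adapting the learning rate means in practice, here is a sketch of a single Adam update following the commonly published update rule. The function name `adam_step`, the toy objective, and the choice of lr=0.1 are assumptions for illustration. In TensorFlow the optimizer performs the equivalent bookkeeping internally.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given their gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (made up): minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

Because the step is divided by the square root of the averaged squared gradient, the very first update has a size of roughly lr regardless of how large the gradient is.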

The layers we have been using are called dense layers. In a dense layer each neuron gets all the outputs of the previous layer as its input.

Convolutional layer: Convolutional layers are used to detect patterns and extract features from the input data, most commonly in image recognition tasks. Each neuron takes only a part of the output of the previous layer as its input, which makes the training process computationally efficient. A neural network with multiple convolutional layers is called a Convolutional Neural Network (CNN).
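
The idea that each neuron sees only part of the previous layer can be sketched in one dimension. The function `conv1d_valid` and the two-weight kernel below are hypothetical examples. As is usual in deep learning, the operation slides the kernel without flipping it.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal; each output sees only len(kernel) inputs."""
    k = len(kernel)
    n_out = len(signal) - k + 1
    return np.array([np.dot(signal[i:i + k], kernel) for i in range(n_out)])

signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
edge_kernel = np.array([-1.0, 1.0])  # hypothetical kernel that responds to a rising step
features = conv1d_valid(signal, edge_kernel)
print(features)
```

The output is positive where the signal steps up, negative where it steps down, and zero elsewhere. The same two weights are reused at every position, whereas a dense layer mapping these 6 inputs to 5 outputs would need 30 weights.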

## Debugging a learning algorithm

Suppose we implemented regularized linear regression and it is not working as expected. We can try the following steps to debug the learning algorithm:

1. Get more training examples: If the training set is small then the model may not be able to learn the underlying pattern in the data. More training examples help most when the model is overfitting.

2. Try a smaller set of features: If the model is overfitting the training data, we can remove the less important features to improve the performance of the model.

3. Try a larger set of features: If the model is underfitting the training data, we can add more features to improve the performance of the model.

4. Try a different model: If the model is not performing well, a different learning algorithm may suit the problem better.

5. Try a different optimization algorithm: If the model is not converging, a different optimization algorithm may help.

6. Try a different activation function: If the model is not converging, a different activation function may help.

7. Try a different loss function: If the model is not converging, a different loss function may help.

8. Try a different regularization parameter: If the model is overfitting the training data, a larger regularization parameter can reduce the overfitting. If it is underfitting, a smaller one can help.

9. Try a different learning rate: If the model is not converging, try a different learning rate. A rate that is too large can make the loss diverge, and one that is too small makes training slow.

10. Try a different batch size: If the model is not converging, a different batch size may help.

11. Try a different number of epochs: If the model is not converging, a different number of epochs may help.

12. Try a different initialization method: If the model is not converging, a different initialization method may help.



### Evaluating the performance of a model

We split the data into training and testing data. We train the model on the training data and evaluate the performance of the model on the testing data. 

To decide which model to use, we split the data into three subsets: training, validation and testing data. We train each candidate model on the training data and evaluate its performance on the validation data. We select the model with the best performance on the validation data. We then evaluate the performance of the selected model on the testing data, which has played no part in training or model selection and therefore gives a fair estimate of performance on unseen data. This is called the holdout method.
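
A three-way split can be sketched in a few lines of NumPy. The function name `train_val_test_split`, the 60/20/20 proportions, and the toy arrays are assumptions for illustration. Libraries such as scikit-learn provide their own splitting utilities.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the examples once, then cut them into three disjoint subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X = np.arange(20).reshape(10, 2)   # 10 toy examples with 2 features each
y = np.arange(10)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```

Shuffling before splitting matters when the data is stored in some order, for example sorted by label or by date of collection.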