**Articles:**<br/>
[Perceptrons](https://direct.mit.edu/books/edited-volume/5431/chapter-abstract/3958520/1969-Marvin-Minsky-and-Seymour-Papert-Perceptrons?redirectedFrom=PDF)<br/>
[The Organization of Behavior](https://pubmed.ncbi.nlm.nih.gov/10643472/)<br/>
[Learning Internal Representations by Error Propagation](https://www.semanticscholar.org/paper/Learning-internal-representations-by-error-Rumelhart-Hinton/319f22bd5abfd67ac15988aa5c7f705f018c3ccd)<br/>
[A logical calculus of the ideas immanent in nervous activity](https://link.springer.com/article/10.1007/BF02478259)<br/>
[The perceptron: A probabilistic model for information storage and organization in the brain](https://www.semanticscholar.org/paper/The-perceptron%3A-a-probabilistic-model-for-storage-Rosenblatt/5d11aad09f65431b5d3cb1d85328743c9e53ba96)<br/>

----
**Backpropagation algorithm in deep learning & machine learning model**<br/>
***1- Forward Pass***<br/>
`Input Layer:` The input features are fed into the network.<br/>
`Hidden Layers:` Each neuron in a hidden layer sums up the weighted input from the previous layer and applies an activation function to produce its own output. This process continues through all hidden layers.<br/>
`Output Layer:` The final layer produces the network’s output using the same process of weighted sums and activation.<br/>
$z^l = W^l a^{l-1} + b^l$<br/>
$a^l = f(z^l)$<br/>
Here each layer *l* has weights $W^l$ and biases $b^l$; $z^l$ is the output of layer *l* before the activation function is applied; $a^{l-1}$ is the output of the previous layer after its activation function has been applied (for the input layer, $a^0 = x$); and *f* is the activation function (e.g., sigmoid, ReLU).<br/>
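
As a minimal sketch of the two forward-pass equations, the NumPy snippet below runs one input through a small network. The layer sizes `[3, 4, 2]`, the sigmoid activation, the random initialization, and the names `forward`, `weights`, and `biases` are illustrative assumptions, not something these notes prescribe.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, used here as the assumed choice of f."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Apply z^l = W^l a^{l-1} + b^l and a^l = f(z^l) layer by layer.

    Returns the pre-activations z^l and the activations a^l,
    where activations[0] is the input x (a^0 = x).
    """
    zs = []
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        zs.append(z)
        activations.append(f(z))
    return zs, activations

# Hypothetical network: 3 inputs, one hidden layer of 4 neurons, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))   # one input as a column vector
zs, activations = forward(x, weights, biases)
output = activations[-1]          # a^L, shape (2, 1)
```

Keeping every $z^l$ and $a^l$ rather than only the final output is deliberate: the backward pass needs them to evaluate $f'(z^l)$ and $(a^{l-1})^T$.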

***2- Loss Calculation***<br/>
After the forward pass, compare the output of the network to the actual target values using a loss function (like mean squared error for regression tasks or cross-entropy for classification tasks).<br/>
Calculate the total error (loss).<br/>
$C = \frac{1}{2} \sum (y - a^L)^2$<br/>
Here the loss function *C* is defined from the network’s output $a^L$ (where *L* is the last layer) and the true labels *y*; the sum runs over the output units.<br/>
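
The loss formula can be checked by hand on a tiny example. The sketch below assumes a two-unit output layer and made-up values for `y` and `a_L`; the function name `squared_error_loss` is hypothetical.

```python
import numpy as np

def squared_error_loss(y, a_L):
    """C = 1/2 * sum((y - a^L)^2), summed over the output units."""
    return 0.5 * np.sum((y - a_L) ** 2)

y = np.array([[1.0], [0.0]])      # target values (illustrative)
a_L = np.array([[0.8], [0.3]])    # network output (illustrative)

# 0.5 * (0.2**2 + 0.3**2) = 0.5 * 0.13, i.e. approximately 0.065
loss = squared_error_loss(y, a_L)
```

The factor of one half is a convenience: it cancels the 2 that appears when differentiating the square, which gives the simple gradient $(a^L - y)$ used in the backward pass.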

***3- Backward Pass (Backpropagation)***<br/>
`Compute Output Error:` Determine the error at the output layer (the difference between the predicted and actual values).<br/>
`Gradient of the Loss Function:` Calculate the gradient of the loss function with respect to the output of the network. This gradient tells how much the loss changes for a small change in each output value.<br/>
`Backpropagate the Error:`<br/>
  3.1- *Output to Hidden Layer:* For each neuron in the output layer, distribute its error backward to the neurons in the last hidden layer that feed into it, in proportion to the strength (weight) of each connection and scaled by the derivative of the activation function at the receiving neuron.<br/>
  3.2- *Hidden Layers to Input:* Repeat this process for each hidden layer, moving backward from the last hidden layer to the first. The input layer has no weights of its own, so no error term is computed for it.<br/>
The error for the output layer $\delta^L$ is calculated as: $\delta^L = \frac{\partial C}{\partial a^L} \odot f'(z^L)$, where $\odot$ denotes element-wise multiplication.
For the squared-error loss above, $\frac{\partial C}{\partial a^L} = (a^L - y)$, and $f'$ is the derivative of the activation function.<br/>
For each layer *l* from *L-1* to *1*, the error $\delta^l$ is calculated as: $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot f'(z^l)$<br/>
The gradient of the cost function with respect to the weights and biases in each layer is calculated as:
$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T$, $\frac{\partial C}{\partial b^l} = \delta^l$<br/>
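
The backward-pass equations translate almost line for line into NumPy. The following is a sketch for a single training example, assuming sigmoid activations and the squared-error loss; the layer sizes, the random data, and the names `backprop` and `loss_value` are illustrative. It ends with a finite-difference check on one weight, a common way to sanity-check a hand-written gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, weights, biases):
    """Return pre-activations z^l and activations a^l (activations[0] = x)."""
    zs, activations = [], [x]
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations

def loss_value(x, y, weights, biases):
    _, activations = forward(x, weights, biases)
    return 0.5 * np.sum((y - activations[-1]) ** 2)

def backprop(x, y, weights, biases):
    """Return dC/dW^l and dC/db^l for every layer, for one example."""
    zs, activations = forward(x, weights, biases)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)

    # Output layer: delta^L = (a^L - y) ⊙ f'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W[-1] = delta @ activations[-2].T
    grads_b[-1] = delta

    # Earlier layers: delta^l = ((W^{l+1})^T delta^{l+1}) ⊙ f'(z^l).
    # List index i holds layer l = i + 1, so activations[i] is a^{l-1}.
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ delta) * sigmoid_prime(zs[i])
        grads_W[i] = delta @ activations[i].T
        grads_b[i] = delta
    return grads_W, grads_b

# Hypothetical 3-4-2 network and one random example.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.standard_normal((3, 1))
y = np.array([[1.0], [0.0]])

grads_W, grads_b = backprop(x, y, weights, biases)

# Finite-difference check on a single weight, W^1[0, 0].
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][0, 0] += eps
w_minus[0][0, 0] -= eps
numeric = (loss_value(x, y, w_plus, biases)
           - loss_value(x, y, w_minus, biases)) / (2 * eps)
analytic = grads_W[0][0, 0]
```

If the two values `numeric` and `analytic` disagree noticeably, the usual suspects are an off-by-one in the layer indices or a missing transpose.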

***4- Update Weights and Biases***<br/>
`Calculate Gradient:` For each weight and bias, calculate the gradient of the loss function with respect to that parameter.<br/>
`Adjust Parameters:` Update the weights and biases in the opposite direction of the gradient to minimize the loss. This is usually done using an optimizer like gradient descent. The size of the step taken in each update is determined by the learning rate.<br/>
Update the weights and biases by moving against the gradient:
$W^l \leftarrow W^l - \eta \frac{\partial C}{\partial W^l}$, $b^l \leftarrow b^l - \eta \frac{\partial C}{\partial b^l}$<br/>
$\eta$: Learning rate.<br/>
Repeat steps 1 through 4 for multiple epochs or until the network's performance stops improving. Each full pass through the training data is called an epoch.
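
Putting the four steps together, a training loop might look like the sketch below. Everything specific in it is an assumption made for illustration: the XOR toy dataset, the 2-3-1 layer sizes, sigmoid activations, the learning rate `eta=0.5`, the fixed 2000 epochs, and the choice to sum the gradients over all samples before each update (full-batch gradient descent). Per-sample or mini-batch updates are equally valid variants.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Steps 1-3 for one sample: return (loss, dC/dW list, dC/db list)."""
    # Step 1: forward pass, keeping every z^l and a^l.
    zs, activations = [], [x]
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Step 2: squared-error loss.
    loss = 0.5 * np.sum((y - activations[-1]) ** 2)
    # Step 3: backward pass, from the output layer to the first layer.
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    for i in range(len(weights) - 1, -1, -1):
        grads_W[i] = delta @ activations[i].T
        grads_b[i] = delta
        if i > 0:
            delta = (weights[i].T @ delta) * sigmoid_prime(zs[i - 1])
    return loss, grads_W, grads_b

def train(samples, weights, biases, eta=0.5, epochs=2000):
    """Run full-batch gradient descent; return the total loss per epoch."""
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        sum_W = [np.zeros_like(W) for W in weights]
        sum_b = [np.zeros_like(b) for b in biases]
        for x, y in samples:
            loss, grads_W, grads_b = backprop(x, y, weights, biases)
            total_loss += loss
            for i in range(len(weights)):
                sum_W[i] += grads_W[i]
                sum_b[i] += grads_b[i]
        # Step 4: move each parameter against its gradient.
        for i in range(len(weights)):
            weights[i] -= eta * sum_W[i]
            biases[i] -= eta * sum_b[i]
        history.append(total_loss)
    return history

# Toy dataset (XOR), each input and target stored as a column vector.
samples = [
    (np.array([[0.0], [0.0]]), np.array([[0.0]])),
    (np.array([[0.0], [1.0]]), np.array([[1.0]])),
    (np.array([[1.0], [0.0]]), np.array([[1.0]])),
    (np.array([[1.0], [1.0]]), np.array([[0.0]])),
]

rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

history = train(samples, weights, biases)
```

Plotting or printing `history` shows how the total loss evolves over the epochs. In practice the stopping rule is often based on a held-out validation set rather than a fixed epoch count.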