**Import Libraries**

In [1]:
import numpy as np
import tensorflow as tf                                         # print(tf.__version__)
from tensorflow import keras
from tensorflow.keras.datasets import fashion_mnist, mnist     # same Keras package as the import above
import matplotlib.pyplot as plt

***Classification***

----
**Step 1. Load data**<br/>
Data path (Keras download cache): C:\Users\name pc\.keras<br/>
[MNIST Dataset](https://www.tensorflow.org/datasets/catalog/mnist)<br/>
[Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist)

In [5]:
name = 1                                                            # 1:mnist; 2:fashion_mnist
if name == 1:
    (x_train, y_train), (x_test, y_test) = mnist.load_data()         # 70,000 28x28
else:
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data() # 70,000 28x28
    
print(f"x_train.shape:{x_train.shape}")                              # (number of images, 28, 28)
print(f"x_test.shape:{x_test.shape}")                                # (number of images, 28, 28)
print(f"y_train.shape:{y_train.shape}")                              # (number of labels,)
print(f"y_test.shape:{y_test.shape}")                                # (number of labels,)
print(f"y_train:{y_train}")                                          # Train labels (NumPy abbreviates long arrays)
print(f"x_train[0,]: {x_train[0, 0:2, :]}")                          # First 2 of 28 rows of image 0 (digit 5 for MNIST, ankle boot for Fashion MNIST)

labels = np.unique(y_test)
print(f"labels: {labels}")

x_train.shape:(60000, 28, 28)
x_test.shape:(10000, 28, 28)
y_train.shape:(60000,)
y_test.shape:(10000,)
y_train:[5 0 4 ... 5 6 8]
x_train[0,]: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
labels: [0 1 2 3 4 5 6 7 8 9]


**Step 2. Normalize data, reshape & one-hot encode labels**<br/>
Pixel values range from 0 to 255. Scale them to the range 0 to 1 before feeding them to the neural network model, add a channel axis so each image has shape (28, 28, 1), and one-hot encode the labels.

In [7]:
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

x_train = np.expand_dims(x_train, -1)          # Make sure images have shape (28, 28, 1)
x_test = np.expand_dims(x_test, -1)

y_train = keras.utils.to_categorical(y_train, len(labels)) # convert class vectors to binary class matrices
y_test = keras.utils.to_categorical(y_test, len(labels))
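The three preprocessing operations can be checked on a tiny array without downloading any data. The sketch below is only an illustration: the 2x2 "images", the labels, and `num_classes` are made-up values, and `np.eye(...)[labels]` is used as a plain NumPy stand-in for one-hot encoding.

```python
import numpy as np

# Hypothetical mini-batch: two 2x2 "images" with raw pixel values in 0..255.
images = np.array([[[0, 255], [128, 64]],
                   [[32, 16], [8, 4]]], dtype=np.uint8)
int_labels = np.array([2, 0])   # integer class ids (made up for this example)
num_classes = 3                 # assumed number of classes for this toy example

scaled = images.astype("float32") / 255      # values now lie in [0, 1]
scaled = np.expand_dims(scaled, -1)          # add a channel axis -> (2, 2, 2, 1)

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype="float32")[int_labels]

print(scaled.shape)
print(one_hot)
```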

**Articles:**<br/>
[Perceptrons](https://direct.mit.edu/books/edited-volume/5431/chapter-abstract/3958520/1969-Marvin-Minsky-and-Seymour-Papert-Perceptrons?redirectedFrom=PDF)<br/>
[The Organization of Behavior](https://pubmed.ncbi.nlm.nih.gov/10643472/)<br/>
[Learning Internal Representations by Error Propagation](https://www.semanticscholar.org/paper/Learning-internal-representations-by-error-Rumelhart-Hinton/319f22bd5abfd67ac15988aa5c7f705f018c3ccd)<br/>
[A logical calculus of the ideas immanent in nervous activity](https://link.springer.com/article/10.1007/BF02478259)<br/>
[The perceptron: A probabilistic model for information storage and organization in the brain](https://www.semanticscholar.org/paper/The-perceptron%3A-a-probabilistic-model-for-storage-Rosenblatt/5d11aad09f65431b5d3cb1d85328743c9e53ba96)<br/>

----
**Backpropagation algorithm in deep learning & machine learning models**<br/>
***1- Forward Pass***<br/>
`Input Layer:` The input features are fed into the network.<br/>
`Hidden Layers:` Each neuron in a hidden layer sums up the weighted input from the previous layer and applies an activation function to produce its own output. This process continues through all hidden layers.<br/>
`Output Layer:` The final layer produces the network’s output using the same process of weighted sums and activation.<br/>
$z^l = W^l a^{l-1} + b^l$<br/>
$a^l = f(z^l)$<br/>
Where each layer *l* has weights $W^l$ and biases $b^l$. $z^l$ is the weighted input of layer *l*, that is, its output before the activation function is applied. $a^{l-1}$ is the output of the previous layer after the activation function has been applied (for the input layer, $a^0 = x$). *f* is the activation function (e.g., sigmoid, ReLU).<br/>
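The two forward-pass equations translate almost line by line into NumPy. The following is a minimal sketch, not the notebook's Keras model: the layer sizes, the random weights, the `forward` helper, and the choice of sigmoid for *f* are all assumptions made for illustration, and inputs are treated as column vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Return the lists of pre-activations z^l and activations a^l (a^0 = x)."""
    activations = [x]
    pre_activations = []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # z^l = W^l a^{l-1} + b^l
        pre_activations.append(z)
        activations.append(f(z))         # a^l = f(z^l)
    return pre_activations, activations

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                        # assumed sizes: 3 inputs, 4 hidden, 2 outputs
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = rng.normal(size=(3, 1))              # one input sample as a column vector
zs, activations = forward(x, weights, biases)
print(activations[-1].shape)             # shape of the network output a^L
```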

***2- Loss Calculation***<br/>
After the forward pass, compare the output of the network to the actual target values using a loss function (like mean squared error for regression tasks or cross-entropy for classification tasks).<br/>
Define the loss function *C* based on the network’s output $a^L$ (where *L* is the last layer) and the true labels *y*, then calculate the total error (loss). For a squared-error loss:<br/>
$C = \frac{1}{2} \sum (y - a^L)^2$<br/>
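As a small worked example of the squared-error formula, the sketch below evaluates *C* for one made-up prediction and target. The vectors and the `quadratic_cost` helper name are hypothetical.

```python
import numpy as np

def quadratic_cost(a_L, y):
    """C = 1/2 * sum((y - a^L)^2) for a single sample."""
    return 0.5 * np.sum((y - a_L) ** 2)

a_L = np.array([[0.8], [0.1], [0.1]])   # made-up network output
y = np.array([[1.0], [0.0], [0.0]])     # made-up one-hot target

cost = quadratic_cost(a_L, y)
# By hand: 0.5 * (0.2^2 + 0.1^2 + 0.1^2) = 0.5 * 0.06 = 0.03
print(cost)
```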

***3- Backward Pass (Backpropagation)***<br/>
`Compute Output Error:` Determine the error at the output layer (the difference between the predicted and actual values).<br/>
`Gradient of the Loss Function:` Calculate the gradient of the loss function with respect to the output of the network. This gradient will tell how much the loss would change with a small change in output.<br/>
`Backpropagate the Error:`<br/>
  3.1- *Output to Hidden Layer:* For each neuron in the output layer, distribute its error backward to all neurons in the hidden layers that contribute directly to it, based on the strength (weight) of their connection and the gradient of the activation function used at the neurons.<br/>
  3.2- *Hidden Layers to Input:* Repeat this process for each hidden layer, moving from the last hidden layer back toward the first one (the layer closest to the input).<br/>
The error for the output layer $\delta^L$ is calculated as: $\delta^L = \frac{\partial C}{\partial a^L} \odot f'(z^L)$. 
For mean squared error, $\frac{\partial C}{\partial a^L} = (a^L - y)$, and $f'$ is the derivative of the activation function.<br/>
For each layer *l* from *L-1* to *1*, the error $\delta^l$ is calculated as: $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot f'(z^l)$<br/>
The gradient of the cost function with respect to the weights and biases in each layer is calculated as:
$\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T$; $\frac{\partial C}{\partial b^l} = \delta^l$<br/>
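The backward-pass equations can be sketched in NumPy and sanity-checked against a finite-difference estimate of one gradient entry. This is an illustrative sketch for a single sample with the squared-error cost and sigmoid activations. The network sizes, random values, and helper names (`cost`, `backprop`) are assumptions, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(x, y, weights, biases):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def backprop(x, y, weights, biases):
    # Forward pass, storing every z^l and a^l.
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Output error: delta^L = (a^L - y) * f'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_W[-1] = delta @ activations[-2].T
    grads_b[-1] = delta
    # Hidden layers: delta^l = ((W^{l+1})^T delta^{l+1}) * f'(z^l)
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = delta @ activations[l].T   # activations[l] feeds weights[l]
        grads_b[l] = delta
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                # assumed layer sizes
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]
x = rng.normal(size=(3, 1))
y = np.array([[1.0], [0.0]])

grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite difference on one weight as an independent check.
eps = 1e-6
original = weights[0][0, 0]
weights[0][0, 0] = original + eps
c_plus = cost(x, y, weights, biases)
weights[0][0, 0] = original - eps
c_minus = cost(x, y, weights, biases)
weights[0][0, 0] = original
numeric = (c_plus - c_minus) / (2 * eps)
grad_error = abs(numeric - grads_W[0][0, 0])
print(grad_error)                                # should be very close to zero
```

If the analytic and numeric values disagree noticeably, the usual suspects are an off-by-one in the layer index or a missing transpose.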

***4- Update Weights and Biases***<br/>
`Calculate Gradient:` For each weight and bias, calculate the gradient of the loss function with respect to that parameter.<br/>
`Adjust Parameters:` Update the weights and biases in the opposite direction of the gradient to minimize the loss. This is usually done using an optimizer like gradient descent. The size of the step taken in each update is determined by the learning rate.<br/>
Update the weights and biases by moving against the gradient:
$W^l = W^l - \eta \frac{\partial C}{\partial W^l}$; $b^l = b^l - \eta \frac{\partial C}{\partial b^l}$<br/>
$\eta$: Learning rate.<br/>
Repeat steps 1 through 4 for multiple epochs or until the network's performance stops improving. Each full pass through the training data is called an epoch.
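Putting the four steps together, a complete training loop fits in a few lines. The sketch below is a toy illustration under several assumptions: the data is the logical OR function, the network has one hidden layer of 3 sigmoid neurons, and the learning rate and epoch count are arbitrary. The whole dataset is processed as one batch (one sample per column), so the per-sample gradients are summed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy data (assumption): the logical OR function, one sample per column.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])

# One hidden layer with 3 neurons (an arbitrary choice for illustration).
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))
eta = 0.5          # learning rate
epochs = 2000
losses = []

for epoch in range(epochs):
    # 1. Forward pass
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # 2. Loss calculation (squared error summed over the batch)
    losses.append(0.5 * np.sum((Y - A2) ** 2))
    # 3. Backward pass; for the sigmoid, f'(z) = a * (1 - a)
    delta2 = (A2 - Y) * A2 * (1.0 - A2)
    delta1 = (W2.T @ delta2) * A1 * (1.0 - A1)
    # 4. Update weights and biases against the gradient
    W2 -= eta * (delta2 @ A1.T)
    b2 -= eta * delta2.sum(axis=1, keepdims=True)
    W1 -= eta * (delta1 @ X.T)
    b1 -= eta * delta1.sum(axis=1, keepdims=True)

print(f"loss at first epoch: {losses[0]:.4f}, at last epoch: {losses[-1]:.4f}")
```

In practice an optimizer from a framework such as Keras handles steps 3 and 4, usually on mini-batches rather than the full dataset.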

**Loss functions in regression**<br/>
1. `Mean Squared Error (MSE):` $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$<br/>
MSE calculates the average of the squared differences between the actual values $y_i$ and the predicted values $\hat{y}_i$.
MSE heavily penalizes large errors due to the squared errors, making it sensitive to outliers. It is a smooth, differentiable function that is useful for gradient-based optimization algorithms.

2. `Mean Absolute Error (MAE):` $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$<br/>
MAE calculates the average of the absolute differences between the actual and predicted values. Unlike MSE, MAE penalizes all errors linearly, which makes it more robust to outliers. However, the absolute value function is not differentiable at zero and its gradient has constant magnitude everywhere else, which can make gradient-based optimization more challenging.

3. `Huber Loss:`
   $L_{\delta}(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \leq \delta \\ \delta(|a| - \frac{1}{2}\delta) & \text{for } |a| > \delta \end{cases};  a = y_i - \hat{y}_i$ <br/>
  The Huber loss function combines the advantages of both MSE and MAE, behaving quadratically for small errors and linearly for large errors, with a threshold parameter $\delta$ determining the transition. It is less sensitive to outliers than MSE but more sensitive than MAE, and the parameter $\delta$ can be adjusted based on the data. Unlike MAE, the Huber loss function is differentiable everywhere, making it suitable for gradient-based optimization.
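The three regression losses can be compared on the same data to see how each reacts to an outlier. This is a minimal NumPy sketch. The sample values are made up, the function names are hypothetical, and the Huber loss is averaged over the samples here (an assumption, since the formula above is stated per error term).

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    a = y - y_hat
    quadratic = 0.5 * a ** 2                      # used where |a| <= delta
    linear = delta * (np.abs(a) - 0.5 * delta)    # used where |a| > delta
    return np.mean(np.where(np.abs(a) <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 8.0])            # the last prediction is an outlier

mse_value = mse(y, y_hat)        # (0.25 + 0 + 1 + 16) / 4   = 4.3125
mae_value = mae(y, y_hat)        # (0.5 + 0 + 1 + 4) / 4     = 1.375
huber_value = huber(y, y_hat)    # (0.125 + 0 + 0.5 + 3.5) / 4 = 1.03125
print(mse_value, mae_value, huber_value)
```

The single error of 4 contributes 16 to the MSE sum but only 4 to the MAE sum and 3.5 to the Huber sum, which is the outlier sensitivity described above.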

**Loss functions in classification**<br/>
  1. `Categorical Cross-Entropy Loss (Log Loss):` Cross-entropy loss compares the true labels with the predicted probabilities. This type of loss is widely used in multi-class and binary classifications (binary cross-entropy). The loss value increases when the predicted probability deviates from the true label, and it penalizes confident but incorrect predictions. Cross-entropy is differentiable, making it suitable for optimization methods based on gradients. Commonly used in neural networks for multi-class classification with softmax output.<br/>
  *Target Labels:* The true labels are provided as a one-hot encoded vector of length *C* (the number of classes), where only the index corresponding to the true class has a value of 1 and all other indices are 0 ([1,0,0],[0,1,0],[0,0,1]).<br/>
  $\text{Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{p}_{i,c})$<br/>
  $y_{i,c}$ is a binary indicator that equals 1 if class label *c* is the correct classification for observation *i* and 0 otherwise.<br/>
  $\hat{p}_{i,c}$ is the predicted probability that observation *i* is of class *c*.<br/>
  *C* is the number of classes.

  2. `Sparse Categorical Cross-Entropy:` It computes the same quantity as categorical cross-entropy but takes the labels as integers representing the correct class index. This avoids building one-hot vectors, which makes it a more memory-efficient alternative, especially for problems with a large number of classes.<br/>
  *Target Labels:* The true labels are provided as integers instead of one-hot encoded vectors ([0],[1],[2]).<br/>
  $\text{Sparse Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \log(\hat{p}_{i,\text{true label}})$ <br/>
  $\hat{p}_{i,\text{true label}}$ is the predicted probability for the correct class.
  
  3. `Hinge Loss (Used in Support Vector Machines - SVMs):` Hinge Loss is primarily used in training SVMs. Its primary goal is to ensure that predictions are not only correct but also confidently correct by a certain margin. Correct predictions with a margin of at least 1 have zero hinge loss, while incorrect predictions are penalized based on their distance from the decision boundary. Hinge loss is used for binary classification but can be adapted for multi-class classification.<br/> 
  $\text{Hinge Loss} = \sum_{i=1}^{n} \max(0, 1 - y_i \cdot f(\mathbf{x}_i))$<br/> 
  $y_i$ is the true label (-1 or +1 for binary classification)<br/> 
  $f(\mathbf{x}_i)$ is the predicted score (before applying the sign function to determine the class).

  4. `Kullback-Leibler Divergence (KL Divergence):` KL Divergence measures the difference between two probability distributions, commonly used in classification tasks where understanding the probabilistic interpretation of the output is important. When the predicted distribution matches the true distribution, the KL divergence is zero, but it increases as the distributions diverge. KL divergence is often used alongside other loss functions or in tasks such as language modeling, where output distributions are compared.<br/>
  $D_{\text{KL}}(P \parallel Q) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{q_i}\right)$<br/>
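  $p_i$ is the probability that the true distribution *P* assigns to outcome *i*.<br/>
  $q_i$ is the probability that the predicted distribution *Q* assigns to outcome *i*.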
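The four classification losses can be sketched in plain NumPy to make the formulas concrete. The probabilities, labels, and scores below are made-up values and the function names are hypothetical. Both cross-entropy variants are averaged over the samples in this sketch, probabilities are clipped to avoid `log(0)`, and terms with $p_i = 0$ are skipped in the KL sum, following the usual convention that they contribute zero.

```python
import numpy as np

def categorical_cross_entropy(y_one_hot, p_hat, eps=1e-12):
    p = np.clip(p_hat, eps, 1.0)
    return -np.mean(np.sum(y_one_hot * np.log(p), axis=1))

def sparse_categorical_cross_entropy(y_int, p_hat, eps=1e-12):
    p = np.clip(p_hat, eps, 1.0)
    picked = p[np.arange(len(y_int)), y_int]   # probability of the true class per row
    return -np.mean(np.log(picked))

def hinge_loss(y_pm1, scores):
    return np.sum(np.maximum(0.0, 1.0 - y_pm1 * scores))

def kl_divergence(p, q):
    mask = p > 0                               # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Two samples, three classes (made-up predicted probabilities).
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_int = np.array([0, 1])
y_one_hot = np.eye(3)[y_int]

cce = categorical_cross_entropy(y_one_hot, p_hat)
scce = sparse_categorical_cross_entropy(y_int, p_hat)
print(cce, scce)                               # same value from both label formats

# Hinge loss: labels in {-1, +1}, raw scores f(x).
y_pm1 = np.array([1.0, -1.0, 1.0])
scores = np.array([2.0, -0.5, 0.3])
hinge_value = hinge_loss(y_pm1, scores)        # 0 + 0.5 + 0.7 = 1.2
print(hinge_value)

# KL divergence: zero for identical distributions, positive otherwise.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
kl_same = kl_divergence(p, p)
kl_diff = kl_divergence(p, q)
print(kl_same, kl_diff)
```

Note that the first hinge term is zero because that sample is classified correctly with a margin above 1, while the second is also correct but falls inside the margin and is still penalized.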