In [13]:
from tensorflow.keras.datasets import mnist
import numpy as np

# Introduction to Neural Networks

### What is a Neural Network?
A neural network is a series of algorithms that attempts to recognize underlying relationships in a set of data. The network consists of layers of nodes, or neurons, each performing a simple computation and passing its result to the next layer.

## Components of a Neural Network 
1. Neurons: The basic units (nodes) of a neural network. Each neuron receives input, processes it, and passes the result on to other neurons. Each neuron has weights and a bias that are adjusted during training to minimize error. Neurons are organized in layers.

2. Layers: There are three types of layers:
    - Input Layer: The first layer that receives the input data.
    - Hidden Layers: Intermediate layers that process inputs from the input layer. There can be one or more hidden layers in a neural network.
    - Output Layer: The final layer that produces the output predictions.
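
To make the layer structure concrete, here is a minimal sketch of a forward pass through one hidden layer and one output layer. The layer sizes, the random weights, the `forward` helper, and the choice of ReLU in the hidden layer are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes: 4 input features, 5 hidden neurons, 3 output neurons.
n_inputs, n_hidden, n_outputs = 4, 5, 3

# Each layer has a weight matrix and a bias vector (these are what training adjusts).
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

def forward(x):
    """Pass a batch x of shape (num_samples, n_inputs) through both layers."""
    hidden = np.maximum(0, x @ W1 + b1)  # hidden layer: weighted sum, then ReLU
    return hidden @ W2 + b2              # output layer: one raw score per output neuron

x = rng.normal(size=(2, n_inputs))  # the input layer simply holds the data
scores = forward(x)
print(scores.shape)  # (2, 3)
```

Each row of `scores` holds the output-layer values for one input sample.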


# Loss
 The difference between the actual output and the predicted output is calculated using a loss function. This error measure guides the learning process.
 
 
One simple choice of loss is the mean squared error (MSE), which averages the squared $\ell_2$ distance between the actual output $y_i$ and the predicted output $\hat{y}_i$:
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
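
As a quick check of the MSE formula, the following sketch computes it with NumPy. The function name `mse` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values (not taken from any dataset): only the last prediction is off by 2.
error = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(error)  # (0 + 0 + 4) / 3, roughly 1.333
```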

In neural networks for classification, we will use the cross-entropy loss.
Cross-Entropy Loss: Used for classification tasks, it measures the difference between two probability distributions: the true labels and the predicted probabilities. For binary labels $y_i \in \{0, 1\}$ it is:
$\text{Cross-Entropy Loss} = -\sum_{i=1}^{n} [y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i)]$
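
The binary form of the formula can be sketched directly in NumPy. The name `binary_cross_entropy` and the clipping constant `eps` (used to avoid `log(0)`) are assumptions for this example; the task below asks for the categorical version instead.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-10):
    """Sum of the per-sample binary cross-entropy terms."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    terms = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.sum(terms))

# Confident, correct predictions should give a lower loss than hesitant ones.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.6, 0.4])
print(confident < unsure)  # True
```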

- Role in Training: The loss function guides the optimization process. During training, the goal is to minimize the loss, which means making the predictions as accurate as possible.

- Loss Curve: A plot of loss versus training epochs can help visualize how well the model is learning. A decreasing loss indicates that the model is improving.
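
The idea of a loss curve can be illustrated with a toy model. The sketch below fits a single weight to data generated from $y = 2x$ and records the MSE at every epoch; the data, the learning rate, and the number of epochs are assumptions chosen only so that the example converges.

```python
import numpy as np

# Toy data, assumed for illustration: the targets follow y = 2x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0               # single weight, deliberately started far from 2
learning_rate = 0.01
loss_history = []     # one loss value per epoch: the data behind a loss curve

for epoch in range(50):
    y_hat = w * x
    loss_history.append(float(np.mean((y - y_hat) ** 2)))  # MSE for this epoch
    grad = np.mean(-2.0 * x * (y - y_hat))                 # derivative of the MSE w.r.t. w
    w -= learning_rate * grad                              # step against the gradient

print(loss_history[0], loss_history[-1])
```

Plotting `loss_history` against the epoch index would give the loss curve; here the values shrink at every epoch as `w` approaches 2.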

#### Task: Implement the cross-entropy loss.


In [14]:
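def crossEntropy_loss(output, targets, buffer=1e-10):
    """
    Calculates the categorical cross-entropy loss.

    Args:
        output (np.ndarray): Predicted probabilities of shape (num_samples, output_size).
        targets (np.ndarray): One-hot target labels of shape (num_samples, output_size).
        buffer (float): Small constant added inside the log to avoid log(0).

    Returns:
        float: Categorical cross-entropy loss, averaged over the samples.
    """
    # Sum over the classes of each sample, then average over the samples.
    per_sample = -np.sum(targets * np.log(output + buffer), axis=1)
    return float(np.mean(per_sample))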

Another way to evaluate our model is accuracy.

Accuracy is a metric used to evaluate the performance of a classification model. It is the number of correct predictions divided by the total number of predictions, and it is often expressed as a percentage:

$Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \cdot 100$



If $y_i$ is the actual label and $\hat{y}_i$ is the predicted label for each of $n$ samples, the accuracy as a fraction is:
$Accuracy = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i)$, where $\mathbb{1}$ is the indicator function, which is 1 if the condition is true and 0 otherwise.
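
The indicator-function form maps directly onto an elementwise comparison in NumPy. The label arrays below are made-up values for illustration.

```python
import numpy as np

# Made-up binary labels: three of the five predictions match the actual labels.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

matches = (y_true == y_pred)  # the indicator: True where a prediction is correct
acc = float(matches.mean())   # averaging booleans gives the fraction of correct predictions
print(acc)  # 0.6
```

Multiplying `acc` by 100 gives the percentage form of the metric.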

#### Implement the following accuracy function for a classification problem (possible labels are 0 and 1)

In [15]:
def accuracy(output: np.ndarray, targets: np.ndarray) -> float:
    """
    Calculates the accuracy of the model predictions.

    Args:
        output (np.ndarray): Model predictions of shape (num_samples, output_size).
        targets (np.ndarray): One-hot target labels of shape (num_samples, output_size).

    Returns:
        float: Fraction of correct predictions, between 0 and 1.
    """
    # The predicted and true classes are the indices of the largest entry in each row.
    predicted_labels = np.argmax(output, axis=1)
    true_labels = np.argmax(targets, axis=1)
    return float(np.mean(predicted_labels == true_labels))

## Activation functions
Activation functions play a crucial role in neural networks by introducing non-linearity into the model. This non-linearity allows the network to learn and model complex relationships between inputs and outputs. Here’s a detailed overview of various activation functions and their roles:

1. Non-linearity: Without activation functions, a neural network would simply perform linear transformations, making it incapable of solving non-linear problems.
2. Enabling Learning: Activation functions are differentiable (almost everywhere), so backpropagation can compute the gradients needed for updating the weights.
3. Controlling Outputs: Some activation functions, such as sigmoid and softmax, squash the output into a specific range, making the network's behavior more predictable and stable.
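
The non-linearity point can be checked numerically: two stacked linear layers without an activation are equivalent to a single linear layer, while inserting a ReLU between them breaks that equivalence. The small matrices below are arbitrary values chosen for illustration, and biases are omitted for brevity.

```python
import numpy as np

# Arbitrary small example: 2 samples, 2 features, 2 hidden units, 1 output.
x = np.array([[1.0, -1.0], [2.0, 0.5]])
W1 = np.array([[1.0, -2.0], [0.5, 1.0]])
W2 = np.array([[1.0], [-1.0]])

# Without an activation, the two layers collapse into the single matrix W1 @ W2.
two_linear_layers = (x @ W1) @ W2
one_linear_layer = x @ (W1 @ W2)
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# With a ReLU in between, the result is no longer the same linear map.
with_relu = np.maximum(0, x @ W1) @ W2
print(np.allclose(with_relu, one_linear_layer))  # False
```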

## Common Activation Functions:
- Sigmoid/logistic function: $\sigma(x) = \frac{1}{1+e^{-x}}$.
  Properties:
  - Outputs values between 0 and 1.
  - Smooth gradient, preventing abrupt changes in output.
- Rectified Linear Unit (ReLU): $\text{ReLU}(x) = \max\{0, x\}$.
  Properties:
  - Outputs values between 0 and infinity.
  - Introduces sparsity by setting negative values to zero.
- Softmax: $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}}$, where $C$ is the number of classes.
  Properties:
  - Outputs a probability distribution over classes.
  - Commonly used in the output layer of multi-class classification problems.
  - Provides a probabilistic interpretation.


In [16]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

def ReLU(x):
    return np.maximum(0, x)

def softmax(x):
    exp_values = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)

### Gradient
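The gradient is a vector of partial derivatives that points in the direction of the fastest increase of a function; its magnitude is the rate of that increase. In neural networks, the weights and biases are adjusted in the opposite direction of the gradient of the loss in order to minimize the loss function.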
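
As a small illustration, the sketch below computes the gradient of a simple two-variable function analytically, compares it with a finite-difference estimate, and takes one step against the gradient. The function `f`, the helper names, and the step size are assumptions for this example; a real network would use the gradient of its loss with respect to all weights and biases.

```python
import numpy as np

def f(w):
    """Toy function standing in for a loss: f(w) = w0^2 + 3 * w1^2."""
    return w[0] ** 2 + 3.0 * w[1] ** 2

def grad_f(w):
    """Analytic gradient of f: the vector of partial derivatives."""
    return np.array([2.0 * w[0], 6.0 * w[1]])

def numerical_grad(func, w, h=1e-6):
    """Central-difference estimate of the gradient, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (func(w + step) - func(w - step)) / (2 * h)
    return grad

w = np.array([1.0, -2.0])
print(np.allclose(grad_f(w), numerical_grad(f, w), atol=1e-4))  # True

# Moving against the gradient lowers the function value.
w_new = w - 0.1 * grad_f(w)
print(f(w_new) < f(w))  # True
```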