In [35]:
from tensorflow.keras.datasets import mnist
import numpy as np

# Introduction to Neural Networks

### What is a Neural Network?
A neural network is a series of algorithms that attempt to recognize underlying relationships in a set of data. The network consists of layers of nodes, or neurons, each performing simple computations and passing the results to the next layer.

# Basic Terms

## Loss
The difference between the actual output and the predicted output is calculated using a loss function. This error measure guides the learning process.

A simple example is the mean squared error (MSE), which averages the squared difference (the squared ${\ell}_2$ distance) between the actual output $y_i$ and the predicted output $\hat y_i$:
$MSE={\frac 1 n} \sum ^n _{i=1} (y_i - \hat y_i)^2$
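
As a quick illustration, the MSE formula can be computed directly with NumPy. The function name `mse_loss` and the sample values are made up for this sketch.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]
mse_value = mse_loss(y_true, y_pred)
print(mse_value)  # (0 + 0.25 + 1) / 3
```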

In neural networks for classification, we will use the cross-entropy loss.
Cross-Entropy Loss: Used for classification tasks, it measures the difference between two probability distributions: the true labels and the predicted probabilities. For two classes (binary labels $y_i \in \{0,1\}$) it is:
$\text{Cross-Entropy Loss} = -\sum ^n _{i=1}\left[y_i \cdot \log(\hat {y_i})+(1-y_i)\cdot \log(1-\hat{y_i})\right]$
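
The formula above is the two-class (binary) form. A minimal sketch of it follows, summing over samples exactly as the formula does. The name `binary_cross_entropy` and the clipping constant `eps` are assumptions for this example, added so that `log` never receives 0.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over samples (eps is an assumed safety margin)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = [1, 0, 1]
confident = binary_cross_entropy(y_true, [0.9, 0.1, 0.8])  # predictions close to the labels
unsure = binary_cross_entropy(y_true, [0.6, 0.4, 0.5])     # predictions close to 0.5
print(confident, unsure)
```

Predictions that sit closer to the true labels give the smaller loss, which is the behavior training tries to reward.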

- Role in Training: The loss function guides the optimization process. During training, the goal is to minimize the loss, which means making the predictions as accurate as possible.

- Loss Curve: A plot of loss versus training epochs can help visualize how well the model is learning. A decreasing loss indicates that the model is improving.

#### Task: implement cross entropy loss.


In [36]:
def crossEntropy_loss(output, targets, buffer=1e-10):
    """
    Calculates the categorical cross-entropy loss.

    Args:
        output (np.ndarray): Predicted probabilities of shape (num_samples, output_size).
        targets (np.ndarray): One-hot target labels of shape (num_samples, output_size).
        buffer (float): Small constant added inside the log to avoid log(0).

    Returns:
        float: Categorical cross-entropy loss averaged over samples.
    """
    # Sum over the classes of each sample, then average over the samples.
    loss = -np.mean(np.sum(targets * np.log(output + buffer), axis=-1))
    return loss

## Accuracy 
Another way to evaluate our model is accuracy.
Accuracy is a metric used to evaluate the performance of a classification model. It is the number of correct predictions divided by the total number of predictions, and is often expressed as a percentage:

$\text{Accuracy} = {\frac {\text {Number of Correct Predictions}} {\text {Total Number of Predictions}} \cdot 100}$

If $y_i$ is the actual label and $\hat {y_i}$ is the predicted label for $n$ samples, the same quantity as a fraction between 0 and 1 is:
$\text{Accuracy} = {\frac 1 n} \sum ^n _{i=1} \mathbb{1}(y_i = \hat{y_i})$, where $\mathbb{1}$ is the indicator function, which is 1 if the condition is true and 0 otherwise.
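
With integer labels, the indicator-function formula reduces to comparing two arrays element by element. The label values below are hypothetical.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])   # actual labels
y_pred = np.array([0, 1, 0, 0, 1])   # predicted labels

matches = (y_true == y_pred)         # indicator: True where the prediction is correct
accuracy_fraction = matches.mean()   # mean of the indicator = fraction of correct predictions
print(accuracy_fraction)
```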

#### Implement the following accuracy function for a classification problem (possible labels are 0,1)

In [37]:
def accuracy(output: np.ndarray, targets: np.ndarray) -> float:
    """
    Calculates the accuracy of the model predictions.

    Args:
        output (np.ndarray): Model predictions of shape (num_samples, output_size).
        targets (np.ndarray): One-hot target labels of shape (num_samples, output_size).

    Returns:
        float: Fraction of correct predictions, between 0 and 1.
    """
    # The predicted and true classes are the indices of the largest entry in each row.
    predicted_labels = np.argmax(output, axis=1)
    true_labels = np.argmax(targets, axis=1)
    return float(np.mean(predicted_labels == true_labels))

## Activation functions
Activation functions play a crucial role in neural networks by introducing non-linearity into the model. This non-linearity allows the network to learn and model complex relationships between inputs and outputs. Their main roles are:

1. Non-linearity: Without activation functions, a neural network would only perform linear transformations, and any stack of linear layers collapses into a single linear map, making it incapable of solving non-linear problems.
2. Enabling Learning: Differentiable activation functions provide the gradients that backpropagation needs to update the weights.
3. Controlling Outputs: Bounded functions such as sigmoid squash the output to a specific range, making the network's behavior more predictable and stable.
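
The non-linearity point can be checked numerically. The sketch below uses random weights and omits biases for brevity; the shapes and the seed are arbitrary assumptions. It compares two stacked linear layers with a single combined layer, and then repeats the comparison with a ReLU in between.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # 4 samples, 3 features
W1 = rng.normal(size=(3, 5))   # weights of the first layer
W2 = rng.normal(size=(5, 2))   # weights of the second layer

# Two linear layers without an activation in between...
two_linear_layers = (x @ W1) @ W2
# ...equal one linear layer with the combined weight matrix W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapsed = np.allclose(two_linear_layers, one_linear_layer)

# With a ReLU between the layers, the result is no longer that single linear map.
with_relu = np.maximum(0, x @ W1) @ W2
differs = not np.allclose(with_relu, one_linear_layer)
print(collapsed, differs)
```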

## Common Activation Functions:
- Sigmoid/logistic function: $\sigma (x) =  \frac 1 {1+e^{-x}}$.
  Properties:
  - Outputs values between 0 and 1.
  - Smooth gradient, preventing abrupt changes in output.
- Rectified Linear Unit (ReLU): $\text{ReLU}(x)=\max\left\{ 0,x\right\}$.
  Properties:
  - Outputs values between zero and infinity.
  - Introduces sparsity by setting negative values to zero.
- Softmax: $\text{Softmax}(x_i)=\frac {e^{x_i}} {\sum ^C _{j=1}e^{x_j}}$, where $C$ is the number of classes.
  Properties:
  - Outputs a probability distribution over classes.
  - Commonly used in the output layer of multi-class classification problems.
  - Provides probabilistic interpretation.


In [38]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def ReLU(x):
    return np.maximum(0, x)


def softmax(x):
    exp_values = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)
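
The `softmax` above subtracts the row maximum before exponentiating. A small sketch of why, assuming deliberately large inputs: the shift does not change the result mathematically, but it keeps `np.exp` from overflowing.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])  # hypothetical, deliberately large values

# Naive version: np.exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits))

# Shifted version: the largest exponent is now 0, so nothing overflows.
shifted = np.exp(logits - np.max(logits))
stable = shifted / np.sum(shifted)
print(naive, stable, stable.sum())
```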

### Gradient
The gradient is a vector that represents the direction and rate of the fastest increase of a function. In neural networks, it is used to adjust the weights and biases to minimize the loss function. Usually, we denote the gradient with the symbol $\nabla$.

In a single dimension, the gradient of a function $f(x)$ with respect to $x$ is the derivative $\frac {df} {dx}$. It represents the rate of change of the function at a specific point, and its sign indicates the direction of increase.

In multiple dimensions, the gradient generalizes to a vector of partial derivatives. For a function $f(x)$ where $x=[x_1,x_2,\ldots,x_n]$ is an $n$-dimensional vector, the gradient is a vector of the form:
$\nabla f(x)=\left[\frac{\partial f}{\partial x_{1}},\frac{\partial f}{\partial x_{2}},...,\frac{\partial f}{\partial x_{n}}\right]$

Each component $\frac{\partial f}{\partial x_{i}} $ represents the rate of change of $f$ with respect to the variable $x_{i}$. 

In [39]:
""" Say we have the function f(x,y,z) = x + 2y**2 + 5z**3. Calculate the gradient. """
# Define the function f
def f(v):
    x, y, z = v
    return x + 2 * (y ** 2) + 5 * (z ** 3)


# Define the function to compute the gradient of f
def grad_f(v):
    x, y, z = v
    df_dx = 1.0  # the derivative of x with respect to x is 1
    df_dy = 2 * (2 * y)
    df_dz = 5 * (3 * z**2)
    return np.array([df_dx, df_dy, df_dz])

# Usage
v = np.array([3.0, 4.0, 5.0])
function_value = f(v)
gradient = grad_f(v)
print(f"f({v}) = {function_value}, Gradient of f at {v} is {gradient}")

f([3. 4. 5.]) = 660.0, Gradient of f at [3. 4. 5.] is [  1.  16. 375.]


In [40]:
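# This was a gradient of a very specific case, derived by hand.
# np.gradient(v) would not help here: it differentiates an array of samples,
# so it returns the differences between the entries of v, not the gradient of f at v.
# Instead, approximate the gradient of f numerically with central differences.
eps = 1e-6
gradient = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(len(v))])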

##### Properties of the Gradient:
1. Direction of Steepest Ascent: The gradient vector points in the direction of the steepest increase of the function. Moving in the opposite direction of the gradient leads to the steepest decrease, which is used in optimization algorithms like gradient descent.
2. Magnitude: The magnitude of the gradient vector indicates how steep the slope is. A larger magnitude means a steeper slope, while a smaller magnitude indicates a flatter region.
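
Both properties can be observed on the example function $f(x,y,z) = x + 2y^2 + 5z^3$. The sketch below is self-contained, so it redefines the function and its analytic gradient; the step size is an arbitrary small value chosen for illustration.

```python
import numpy as np

def f(v):
    x, y, z = v
    return x + 2 * y**2 + 5 * z**3

def grad_f(v):
    x, y, z = v
    return np.array([1.0, 4 * y, 15 * z**2])

v = np.array([3.0, 4.0, 5.0])
g = grad_f(v)
steepness = np.linalg.norm(g)   # magnitude: how steep the function is at v
direction = g / steepness       # unit vector pointing in the direction of steepest ascent

step = 1e-3                     # assumed small step size
uphill = f(v + step * direction)     # a small step along the gradient increases f
downhill = f(v - step * direction)   # a small step against the gradient decreases f
print(steepness, f(v), uphill, downhill)
```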

## Learning Rate

The learning rate is a hyperparameter in the training of neural networks and other machine learning models. 
It determines the size of the steps the model takes to update the weights in response to the error computed during training. 

- If the learning rate is too high, the model may take steps that are too large and overshoot the optimal point. This can cause the loss function to oscillate or even diverge, failing to converge to a minimum.
- If the learning rate is too low, the model will take tiny steps, making the training process slow. It might get stuck in local minima and may take a long time to converge to the global minimum, if at all.
- An optimal learning rate is one that is small enough to ensure convergence and large enough to make the training process efficient. Finding this optimal value often requires experimentation and tuning.
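
These three regimes show up even on the simplest possible loss. The sketch below minimizes $L(w) = w^2$, whose gradient is $2w$, with three learning rates. The function name, the starting point, and the specific rates are assumptions picked to make each regime visible.

```python
def run_gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize L(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

too_high = run_gradient_descent(1.1)    # each step multiplies w by -1.2: oscillates and diverges
too_low = run_gradient_descent(0.001)   # each step multiplies w by 0.998: barely moves
reasonable = run_gradient_descent(0.3)  # each step multiplies w by 0.4: converges quickly
print(abs(too_high), abs(too_low), abs(reasonable))
```

The minimum is at $w = 0$, so the closer the final value is to zero, the better that learning rate worked within the 20 steps.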

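#### Using the Learning Rate and Gradient to Update the Weights in "Gradient Descent"
Every time the weights are updated, they move one step in the direction opposite to the gradient of the loss, and the learning rate sets the size of that step: $w \leftarrow w - \eta \nabla L(w)$, where $\eta$ is the learning rate and $L$ is the loss.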
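
A minimal sketch of gradient descent on a toy problem: fitting a line $\hat y = wx + b$ by minimizing the MSE loss. The data, the learning rate, and the number of steps are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (hypothetical example values).
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0        # initial weight and bias
learning_rate = 0.1    # assumed value

for _ in range(500):
    y_hat = w * x + b
    error = y_hat - y
    # Gradients of the MSE loss mean((y_hat - y)**2) with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Gradient descent update: step against the gradient, scaled by the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should end close to the true slope 2 and intercept 1
```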


## Components of a Neural Network 
1. Neurons: Basic units (nodes) of a neural network that receive input, process it, and pass it on to other neurons. Each neuron has weights and biases that are adjusted during training to minimize errors. The neurons are organized in layers.

2. Layers: There are three types of layers:
    - Input Layer: The first layer that receives the input data.
    - Hidden Layers: Intermediate layers that process inputs from the input layer. There can be one or more hidden layers in a neural network.
    - Output Layer: The final layer that produces the output predictions.
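
To make the three layer types concrete, here is a hedged sketch of a single forward pass through a tiny network with one hidden layer. The layer sizes, the random weights, and the choice of ReLU and softmax are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_values = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=-1, keepdims=True)

# Hypothetical sizes: 4 input features, 8 hidden neurons, 3 output classes.
W_hidden, b_hidden = rng.normal(size=(4, 8)), np.zeros(8)
W_output, b_output = rng.normal(size=(8, 3)), np.zeros(3)

inputs = rng.normal(size=(5, 4))                  # input layer: 5 samples with 4 features
hidden = relu(inputs @ W_hidden + b_hidden)       # hidden layer: weighted sum, then ReLU
outputs = softmax(hidden @ W_output + b_output)   # output layer: class probabilities
print(outputs.shape, outputs.sum(axis=1))
```

Each row of `outputs` is a probability distribution over the three classes, so every row sums to 1.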
