<h1>Building a Neural Network from Scratch</h1>
<p>We will use only NumPy for matrix calculations and matplotlib.pyplot for plotting the results.</p>

In [20]:
import numpy as np
import matplotlib.pyplot as plt

<h4>ReLU:</h4>
<p>ReLU sets all negative inputs to zero and leaves positive inputs unchanged.<br>
Its simplicity makes it computationally efficient, especially in large neural networks.<br>
ReLU introduces non-linearity into the network, allowing it to learn complex patterns.<br>
For z > 0 the gradient is constant (dReLU(z)/dz = 1), which helps avoid the vanishing gradient problem that is common with sigmoid and tanh.<br>
It also sets some activations exactly to zero; these sparse activations can help computational efficiency and may reduce the risk of overfitting.</p>

In [21]:
def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return Z > 0
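
<p>As a small illustration of the two functions above, the sketch below applies them to a hand-picked array. The values of Z and the upstream gradient are hypothetical and chosen only to show that the boolean mask from relu_derivative lets gradients through where Z > 0 and blocks them elsewhere.</p>

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return Z > 0

# Hypothetical pre-activations: one negative, one zero, one positive value.
Z = np.array([[-2.0, 0.0, 3.0]])

A = relu(Z)                 # negative entries are clipped to 0
mask = relu_derivative(Z)   # boolean mask, True only where Z > 0

# Hypothetical upstream gradient arriving from the next layer.
upstream = np.array([[10.0, 10.0, 10.0]])
dZ = upstream * mask        # the boolean mask acts as 0/1 in the product

print(A)
print(mask)
print(dZ)
```

<p>Returning a boolean array works because NumPy treats True as 1 and False as 0 in arithmetic, so no explicit cast is needed in this sketch.</p>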

<h4>Softmax Function:</h4>
<p>Softmax converts raw scores z (logits) into probabilities and ensures that the probabilities sum to 1 across all classes.<br>
Exponentiating the logits makes all values positive and amplifies larger logits more than smaller ones.<br>
Dividing by the sum of all exponentiated logits normalizes the outputs so that they form a probability distribution. Subtracting max(Z) before exponentiating prevents numerical overflow for large logits and does not change the result, because the common factor cancels in the ratio.</p>
<p>In the code below, axis=0 computes the maximum and the sum down each column, so softmax is applied separately to each column (that is, to each example, with one row per class).<br>
keepdims=True keeps the reduced axis as a dimension of size 1, so the column-wise maximum and sum have shape (1, m), where m is the number of examples, and broadcast cleanly against Z and expZ.</p>
<h5>The formula is:</h5>

$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}$$


In [22]:
def softmax(Z):
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)
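
<p>To see why the max subtraction matters, here is a minimal sketch that compares the function above with a naive version on hypothetical logits. The second column uses deliberately large values; the layout (one row per class, one column per example) is assumed from the axis=0 convention.</p>

```python
import numpy as np

def softmax(Z):
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

# Hypothetical logits: 3 classes (rows) x 2 examples (columns).
# The second column is the first one shifted by 999.
Z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])

P = softmax(Z)
col_sums = P.sum(axis=0)   # each column should sum to 1

# Naive softmax without the max subtraction: exp(1000) overflows to inf,
# and inf / inf gives nan. Warnings are silenced only for this demo.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)

print(P)
print(col_sums)
print(naive[:, 1])
```

<p>Both columns of P come out identical, since shifting all logits of an example by a constant does not change its softmax output.</p>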

<h4>Cross entropy loss function</h4>
<p>This function computes the cross-entropy loss, which measures the difference between the predicted probabilities (y_pred) and the true labels (y_true). It is commonly used in classification tasks, especially for multi-class classification with softmax outputs.</p>
<h5>The formula for this loss function is:</h5>

$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)})$$

If $y_k = 1$ (k is the correct class), the loss for that sample reduces to:

$$L = -\log(\hat{y}_k)$$
Minimizing this loss pushes $\hat{y}_k$ (the predicted probability for the correct class) toward 1.<br>
In the code, the small value 1e-8 is added to $\hat{y}_k$ to avoid taking the logarithm of 0, which would produce an infinite loss.

<h5>When Combined with Softmax Output:</h5>
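<p>After simplifying, the gradient of the per-example loss L<sup>(i)</sup> with respect to the logits, for softmax combined with cross-entropy, is:<br></p>

$$\frac{\partial L^{(i)}}{\partial z_k^{(i)}} = \hat{y}_k^{(i)} - y_k^{(i)}$$
<p>Stacking these per-example derivatives as columns for all m examples gives the matrix below. Since J averages the per-example losses, the gradient of J itself is this matrix divided by m:</p>

$$\frac{\partial L}{\partial Z} = \hat{Y} - Y$$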


In [23]:
def cross_entropy_loss(y_true, y_pred):
    m = y_true.shape[1]
    loss = -np.sum(y_true * np.log(y_pred + 1e-8)) / m
    return loss

def cross_entropy_derivative(y_true, y_pred):
    return y_pred - y_true
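
<p>The softmax plus cross-entropy gradient can be checked numerically. The sketch below compares the analytic expression with central finite differences of the loss on small random data. The sizes K and m, the random generator and the step eps are hypothetical choices. Because the loss averages over m examples, the analytic gradient of J is taken here as (Ŷ - Y) / m, whereas cross_entropy_derivative returns the unscaled difference.</p>

```python
import numpy as np

def softmax(Z):
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def cross_entropy_loss(y_true, y_pred):
    m = y_true.shape[1]
    return -np.sum(y_true * np.log(y_pred + 1e-8)) / m

rng = np.random.default_rng(0)
K, m = 3, 4                                    # hypothetical: 3 classes, 4 examples
Z = rng.normal(size=(K, m))                    # random logits
Y = np.eye(K)[:, rng.integers(0, K, size=m)]   # one-hot labels, one column per example

# Analytic gradient of J with respect to Z (includes the 1/m from the average).
analytic = (softmax(Z) - Y) / m

# Central finite differences, perturbing one logit at a time.
eps = 1e-6
numeric = np.zeros_like(Z)
for k in range(K):
    for i in range(m):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[k, i] += eps
        Zm[k, i] -= eps
        loss_plus = cross_entropy_loss(Y, softmax(Zp))
        loss_minus = cross_entropy_loss(Y, softmax(Zm))
        numeric[k, i] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)   # expected to be tiny; the 1e-8 inside the log adds a small bias
```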

<h4>Initializing the weights and biases</h4>
<p>Proper weight initialization is critical for training the network effectively, as it affects convergence speed and overall performance.<br>
Poor initialization (e.g., very large or very small weights) can cause gradients to shrink (vanish) or grow (explode) exponentially as they propagate through the layers.<br>
We set a seed for NumPy's random number generator to ensure reproducibility.<br>
Using the same seed produces the same random numbers every time the code runs, which is useful for debugging and for comparing results.<br>
For the weights we use He initialization.<br>
Each weight is drawn from a standard normal distribution N(0, 1) and scaled by np.sqrt(2 / layer_dims[l-1]), where layer_dims[l-1] is the number of inputs to layer l.<br>
With ReLU activations, He initialization keeps the variance of the activations roughly constant as the signal passes through the layers, which reduces the risk of vanishing or exploding gradients.<br></p>
<p>We initialize the biases to zeros with np.zeros().<br>
Zero biases are safe because the random weights already break the symmetry between neurons; initializing the weights to zeros, by contrast, would cause all neurons in a layer to learn identical features.</p>



In [24]:
def initialize_parameters(layer_dims):
    np.random.seed(0)
    parameters = {}
    L = len(layer_dims)
    
    for l in range(1, L):
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l-1]) * np.sqrt(2 / layer_dims[l-1])
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters
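
<p>A quick way to sanity-check the initializer is to look at the parameter shapes and at the empirical standard deviation of the weights. The layer sizes below are hypothetical and only serve the illustration; for He initialization the standard deviation of W1 should be close to sqrt(2 / n_in).</p>

```python
import numpy as np

def initialize_parameters(layer_dims):
    np.random.seed(0)
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l-1]) * np.sqrt(2 / layer_dims[l-1])
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return parameters

# Hypothetical architecture: 784 inputs, 128 hidden units, 10 classes.
layer_dims = [784, 128, 10]
params = initialize_parameters(layer_dims)

# W{l} has shape (units in layer l, units in layer l-1); b{l} is a column vector.
for name, value in params.items():
    print(name, value.shape)

w1_std = params["W1"].std()        # empirical standard deviation of W1
target_std = np.sqrt(2 / 784)      # standard deviation He initialization aims for
print(w1_std, target_std)
```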

<h4>Dropout</h4>
<p>Dropout is a regularization technique where a random subset of neurons is "dropped" (set to 0) during training. This prevents the network from becoming too reliant on specific neurons and helps generalization.<br>
The process works as follows when dropout is enabled (keep_prob < 1.0):<br>
We create a random boolean matrix D with the same shape as A, where each element is True with probability keep_prob and False otherwise.<br>
We drop neurons by multiplying A and D element-wise, then scale the remaining activations by dividing by keep_prob. This scaling (inverted dropout) keeps the expected value of the activations unchanged.<br>
Finally, we store the dropout mask D in the cache so the same mask can be reused during backpropagation.</p>
<h5>Benefits of this regularization technique:</h5>
<p>By randomly deactivating neurons, dropout prevents the network from relying too much on specific neurons.<br>
It also forces the network to learn redundant representations, since no single neuron can dominate.<br></p>
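
<p>The masking and rescaling steps described above can be sketched in isolation. This is a minimal illustration, not the notebook's own cell: the activation matrix of ones, the value of keep_prob and the use of np.random.default_rng are assumptions made so that the effect on the mean is easy to read.</p>

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed generator, used here for a reproducible demo
keep_prob = 0.8                  # hypothetical keep probability

# Hypothetical activations: all ones, so the mean is easy to interpret.
A = np.ones((100, 1000))

D = rng.random(A.shape) < keep_prob   # True with probability keep_prob
A_dropped = (A * D) / keep_prob       # drop, then rescale the survivors

kept_fraction = D.mean()              # close to keep_prob
mean_after = A_dropped.mean()         # close to the original mean of 1.0

print(kept_fraction, mean_after)
```

<p>Surviving entries become 1 / keep_prob and dropped entries become 0, so the average activation stays near its original value. With keep_prob=1.0 the mask step is skipped entirely, which is the setting to use at prediction time.</p>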


<h4>Forward propagation</h4>
<h5>This function calculates the outputs of each layer in the neural network, from the input layer to the output layer.</h5>
<p>L is the total number of layers in the network (excluding the input layer). Each layer has a weight matrix and a bias vector, so parameters contains 2 * L keys.<br>
A is initially set to the input matrix X, and cache stores the intermediate values (activations A and pre-activations Z) for each layer, which are needed during backward propagation.<br>
For each hidden layer l we compute the pre-activation Z using the formula:<br>

$$Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}$$

where W[l] is the weight matrix for layer l,<br> A[l−1] is the activation from the previous layer (or the input X for the first layer), <br>and b[l] is the bias vector for layer l.<br>
Then we compute the activation A using the ReLU activation function:
$$A^{[l]} = \text{ReLU}(Z^{[l]})$$
When keep_prob < 1.0, we also apply dropout to the hidden-layer activations as described in the Dropout section.<br></p>
<p>In the final layer:</p>
<p>We compute the pre-activation ZL using the same formula as above,
then compute the activation AL using the softmax function.<br>
Finally, we store ZL and AL in the cache; AL is the final output of the network.</p>
<p>In a nutshell:<br>
    We apply the weight and bias transformations (Z) layer by layer, starting from the input layer and moving toward the output layer.<br>
    We use non-linear activation functions (ReLU in the hidden layers, softmax at the output) so the network can model complex relationships.<br>
    We optionally use dropout to reduce overfitting by randomly deactivating neurons during training.<br>
    Finally, we store all intermediate values in cache so that the data needed for backpropagation is available.<br></p>

In [25]:
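def forward_propagation(X, parameters, keep_prob=1.0):
    L = len(parameters) // 2  # each layer contributes one W and one b
    A = X
    cache = {"A0": A}

    # Hidden layers 1 .. L-1: linear step followed by ReLU
    for l in range(1, L):
        Z = np.dot(parameters[f"W{l}"], A) + parameters[f"b{l}"]
        A = relu(Z)

        # Inverted dropout, applied only when keep_prob < 1.0
        if keep_prob < 1.0:
            D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob  # boolean keep mask
            A = A * D          # zero out the dropped neurons
            A = A / keep_prob  # rescale to preserve the expected activation
            cache[f"D{l}"] = D

        cache[f"Z{l}"] = Z
        cache[f"A{l}"] = A

    # Output layer L: linear step followed by softmax (no dropout)
    ZL = np.dot(parameters[f"W{L}"], A) + parameters[f"b{L}"]
    AL = softmax(ZL)

    cache[f"Z{L}"] = ZL
    cache[f"A{L}"] = AL
    return AL, cache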
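
<p>To make the shapes in the forward pass concrete, here is a hand-unrolled sketch of the same computation for a network with one hidden layer and no dropout. All sizes and the random data are hypothetical; the point is only to show how the bias column vectors broadcast across examples and that each output column is a probability distribution.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 4, 5, 3, 6      # hypothetical: inputs, hidden units, classes, examples

X = rng.normal(size=(n_x, m))      # one column per example
W1 = rng.normal(size=(n_h, n_x)) * np.sqrt(2 / n_x)
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)) * np.sqrt(2 / n_h)
b2 = np.zeros((n_y, 1))

# Hidden layer: Z1 has shape (n_h, m); b1 of shape (n_h, 1) broadcasts over columns.
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)             # ReLU

# Output layer: softmax over the class axis (rows), separately for each example.
Z2 = W2 @ A1 + b2
expZ = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = expZ / expZ.sum(axis=0, keepdims=True)

predictions = A2.argmax(axis=0)    # most probable class for each example

print(Z1.shape, A2.shape)
print(A2.sum(axis=0))
print(predictions)
```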

<h4>L2 regularization</h4>
<h5>L2 regularization, also known as weight decay, is a technique used in machine learning to prevent overfitting by discouraging large weights in the model. It works by adding a penalty term to the cost function, which is proportional to the sum of the squared weights of the model. This encourages the model to keep the weights small, leading to a simpler and more generalizable model.</h5>
<p>Cost Function with L2 Regularization:<br>
The cost function J is modified to include a regularization term:

$$J_{\text{regularized}} = J_{\text{original}} + \frac{\lambda}{2m} \sum_{l=1}^{L} \|W^{[l]}\|_F^2$$

where J_original is the original cost function (e.g. cross-entropy loss),<br>
λ is the regularization parameter that controls the strength of the regularization,<br>
m is the number of training examples,<br>
and $$\|W^{[l]}\|_F^2$$ is the squared Frobenius norm (the sum of the squared elements) of the weight matrix W[l] for layer l.</p>
<p>Large weights in a model can lead to overfitting, as the model may memorize the training data instead of learning general patterns. L2 regularization discourages large weights.<br>
By penalizing large weights, L2 regularization encourages the model to rely on more features with small contributions rather than focusing heavily on a few.<br>
A model with smaller weights is often better at making predictions on unseen data.</p>
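
<p>The penalty term in the regularized cost can be sketched as follows. The helper names l2_penalty and l2_gradient, the argument name lambd and the tiny weight matrices are hypothetical and not part of the notebook; the parameter dictionary is assumed to use the same "W1", "b1", ... keys as above. Differentiating the penalty with respect to W[l] gives (λ / m) · W[l], which is the extra term added to each weight gradient.</p>

```python
import numpy as np

def l2_penalty(parameters, lambd, m):
    """Return (lambd / (2m)) * sum of squared weights over all layers (biases excluded)."""
    L = len(parameters) // 2
    squared = sum(np.sum(np.square(parameters[f"W{l}"])) for l in range(1, L + 1))
    return lambd / (2 * m) * squared

def l2_gradient(W, lambd, m):
    """Gradient of the penalty with respect to one weight matrix W."""
    return (lambd / m) * W

# Hypothetical two-layer parameters, small enough to verify by hand.
params = {
    "W1": np.array([[1.0, -2.0], [0.5, 0.0]]),
    "b1": np.zeros((2, 1)),
    "W2": np.array([[3.0, 1.0]]),
    "b2": np.zeros((1, 1)),
}

# Sum of squares: (1 + 4 + 0.25 + 0) + (9 + 1) = 15.25
penalty = l2_penalty(params, lambd=0.1, m=10)   # 0.1 / 20 * 15.25, about 0.07625
dW2_reg = l2_gradient(params["W2"], lambd=0.1, m=10)

print(penalty)
print(dW2_reg)
```

<p>In this sketch the regularized cost would be the cross-entropy loss plus penalty, and only the weights are penalized, matching the formula, which sums over W[l] and leaves the biases out.</p>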
