
#**3.7: Neural Networks**

**Definition 3.7.1**

Neural Networks are computational models inspired by the human brain's structure and functionality. They consist of layers of interconnected nodes, known as neurons, which process and transmit information. Each neuron receives input, applies a transformation through an activation function, and passes the output to subsequent layers. Neural networks are designed to recognize patterns, learn from data, and make intelligent decisions or predictions.

**Definition 3.7.2**

Structure of Neural Networks:
* Input Layer: Receives the initial data.
* Hidden Layers: Intermediate layers that transform the input using weights, biases, and activation functions.
* Output Layer: Produces the final prediction or classification.


##**3.7.1: Mathematical Formulation**

**Definition 3.7.3**

In a neural network, each neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function to produce its output. For layer $l$, this is:
* $\begin{aligned}
\mathbf{Z}^{(l)} &= \mathbf{W}^{(l)} \mathbf{A}^{(l-1)} + \mathbf{b}^{(l)} \\
\mathbf{A}^{(l)} &= \sigma(\mathbf{Z}^{(l)})
\end{aligned}$

Where,
* $\mathbf{W}^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l-1)}}$ is the weight matrix for layer $l$
* $\mathbf{b}^{(l)} \in \mathbb{R}^{n^{(l)} \times 1}$ is the bias vector for layer $l$, added to every column of $\mathbf{W}^{(l)} \mathbf{A}^{(l-1)}$
* $\mathbf{A}^{(l-1)} \in \mathbb{R}^{n^{(l-1)} \times m}$ is the activation matrix from the previous layer, with $\mathbf{A}^{(0)} = \mathbf{X}$ the input data
* $\mathbf{Z}^{(l)} \in \mathbb{R}^{n^{(l)} \times m}$ is the linear combination of inputs and weights
* $\mathbf{A}^{(l)} \in \mathbb{R}^{n^{(l)} \times m}$ is the activation output.
* $\sigma$ is the activation function, applied elementwise.
* $m$ is the number of training examples.
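
The shapes listed above can be checked with a single layer step. This is a minimal sketch: the sizes (3 inputs, 4 units, 5 examples) and the use of ReLU for $\sigma$ are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration: 3 inputs, 4 units, 5 examples
n_prev, n_curr, m = 3, 4, 5

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((n_prev, m))   # A^(l-1), shape (n^(l-1), m)
W = rng.standard_normal((n_curr, n_prev))   # W^(l),   shape (n^(l), n^(l-1))
b = np.zeros((n_curr, 1))                   # b^(l),   shape (n^(l), 1)

Z = W @ A_prev + b      # b is broadcast across the m columns
A = np.maximum(0, Z)    # ReLU, used here as one possible choice of sigma

print(Z.shape, A.shape)  # (4, 5) (4, 5)
```

Each column of `Z` and `A` corresponds to one training example, which is why the bias can be stored once as a column vector.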

**Example 3.7.1: Simple Neural Network**

Objective: Initialization of parameters and forward propagation for a simple neural network

In [1]:
# Neural Networks: Initialization and Forward Propagation

import numpy as np

# Initialize parameters
def initialize_parameters(layer_dims):
    """
    Initialize weights and biases for each layer.

    Arguments:
    layer_dims -- List containing the dimensions of each layer.

    Returns:
    parameters -- Dictionary containing initialized weights and biases.
    """
    np.random.seed(42)
    parameters = {}
    L = len(layer_dims)  # Number of layers

    for l in range(1, L):
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))

    return parameters

# Forward propagation
def forward_propagation(X, parameters):
    """
    Implement forward propagation for the neural network.

    Arguments:
    X -- Input data of shape (n_x, m).
    parameters -- Dictionary containing weights and biases.

    Returns:
    activations -- Dictionary containing activations for each layer.
    """
    activations = {"A0": X}
    L = len(parameters) // 2  # Number of layers

    for l in range(1, L + 1):
        W = parameters[f"W{l}"]
        b = parameters[f"b{l}"]
        Z = np.dot(W, activations[f"A{l-1}"]) + b
        A = relu(Z)  # ReLU is applied at every layer, including the output layer
        activations[f"A{l}"] = A

    return activations

# ReLU Activation Function
def relu(x):
    """ReLU activation function."""
    return np.maximum(0, x)

# Example usage
if __name__ == "__main__":
    # Define layer dimensions
    layer_dims = [2, 4, 1]  # 2 input features, 4 neurons in hidden layer, 1 output neuron

    # Initialize parameters
    parameters = initialize_parameters(layer_dims)

    # Generate dummy input data (2 features, 3 examples)
    X = np.array([[1, 2, -1],
                  [3, -1, 2]])

    # Perform forward propagation
    activations = forward_propagation(X, parameters)

    # Print activations
    for key in activations:
        print(f"{key}:\n{activations[key]}\n")

A0:
[[ 1  2 -1]
 [ 3 -1  2]]

A1:
[[0.00081921 0.01131693 0.        ]
 [0.05216778 0.         0.02398371]
 [0.         0.         0.        ]
 [0.03881517 0.02390991 0.        ]]

A2:
[[9.84217472e-05 0.00000000e+00 1.30126037e-04]]



##**3.7.2: Activation Functions**

**Definition 3.7.4**

Activation functions introduce non-linearity into the neural network, enabling it to learn complex patterns. Below are four common activation functions used in neural networks.



**Definition 3.7.4**

*Step Function*

The Step Function activates the neuron when the input reaches a threshold (0 in the definition below). It is a binary function that returns 1 or 0.
* $\sigma(x) =
\begin{cases}
1 & \text{if } x \geq 0 \\
0 & \text{otherwise}
\end{cases}$

In [3]:
# Step Function Activation

def binary_step(x):
    """
    Binary Step Activation Function.

    Arguments:
    x -- Input array.

    Returns:
    Output array after applying binary step function.
    """
    return np.heaviside(x, 1)

In [4]:
# Example of Binary Step Function
x = np.array([-2, -1, 0, 1, 2])
print("Binary Step Output:", binary_step(x))

Binary Step Output: [0. 0. 1. 1. 1.]


## ReLU Function

**Definition 3.7.5**

The Rectified Linear Unit (ReLU) activation function outputs the input directly if it is positive; otherwise, it outputs zero.
* $\sigma(x) = \max(0, x)$

In [5]:
# ReLU Activation Function

def relu(x):
    """
    ReLU Activation Function.

    Arguments:
    x -- Input array.

    Returns:
    Output array after applying ReLU function.
    """
    return np.maximum(0, x)

In [6]:
# Example of ReLU Function
x = np.array([-2, -1, 0, 1, 2])
print("ReLU Output:", relu(x))

ReLU Output: [0 0 0 1 2]


##Sigmoid Function

**Definition 3.7.6**

The Sigmoid activation function maps any real-valued number into the range (0, 1), making it suitable for binary classification.
* $\sigma(x) = \frac{1}{1 + e^{-x}}$

In [7]:
# Sigmoid Activation Function

def sigmoid(x):
    """
    Sigmoid Activation Function.

    Arguments:
    x -- Input array.

    Returns:
    Output array after applying sigmoid function.
    """
    return 1 / (1 + np.exp(-x))

In [8]:
# Example of Sigmoid Function
x = np.array([-2, -1, 0, 1, 2])
print("Sigmoid Output:", sigmoid(x))

Sigmoid Output: [0.11920292 0.26894142 0.5        0.73105858 0.88079708]


##Softmax Function

**Definition 3.7.7**

The Softmax activation function generalizes the sigmoid function for multiclass classification, converting logits into probabilities that sum to 1.

* $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

In [9]:
# Softmax Activation Function

def softmax(x):
    """
    Softmax Activation Function.

    Arguments:
    x -- Input array of shape (n_classes, m).

    Returns:
    Output array after applying softmax function.
    """
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))  # Stability improvement
    return e_x / np.sum(e_x, axis=0, keepdims=True)

In [10]:
# Example of Softmax Function
x = np.array([[2.0, 1.0, 0.1],
              [1.0, 3.0, 0.2],
              [0.2, 0.5, 2.0]])
print("Softmax Output:\n", softmax(x))

Softmax Output:
 [[0.65223985 0.11116562 0.11375186]
 [0.23994563 0.82140902 0.12571524]
 [0.10781452 0.06742536 0.7605329 ]]
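
The statement that softmax generalizes the sigmoid can be checked numerically. The sketch below assumes a two-class setup in which the second logit is fixed at 0; in that case the softmax probability of the first class is $e^{z}/(e^{z}+1)$, which equals the sigmoid of $z$. The functions are redefined so the block runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Column-wise softmax, shifted by the column maximum for stability
    e_x = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e_x / np.sum(e_x, axis=0, keepdims=True)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Two-class logits: row 0 holds z, row 1 is fixed at 0
logits = np.vstack([z, np.zeros_like(z)])
p_class0 = softmax(logits)[0]

print(np.allclose(p_class0, sigmoid(z)))  # True
```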


##**3.7.3: Cost Function**

##Cross-Entropy Loss

**Definition 3.7.8**

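The binary Cross-Entropy Loss is commonly used for binary classification tasks. It measures how far the predicted probabilities are from the true labels: it is 0 when every prediction equals its label and grows without bound as a prediction approaches the wrong label.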
* $J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$

Where,
* $m$ is the number of training examples
* $y^{(i)}$ is the true label (0 or 1) for example $i$
* $\hat{y}^{(i)}$ is the predicted probability for example $i$

In [11]:
# Cross-Entropy Loss Function

def cost_function(AL, Y):
    """
    Compute the cross-entropy loss.

    Arguments:
    AL -- Probability vector corresponding to label predictions, shape (1, m)
    Y -- True "label" vector, shape (1, m)

    Returns:
    cost -- Cross-entropy cost
    """
    m = Y.shape[1]
    # To prevent log(0), clip AL to [1e-15, 1 - 1e-15]
    AL = np.clip(AL, 1e-15, 1 - 1e-15)
    cost = - (1/m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return np.squeeze(cost)

*Code Analysis*
1. Input Parameters:
 * AL: Predicted probabilities from the output layer (shape: (1, m)).
 * Y: True labels (shape: (1, m)).

2. Clipping AL:
 * Restricts every predicted probability to the closed interval [1e-15, 1 - 1e-15], so that neither log(AL) nor log(1 - AL) is evaluated at 0.
3. Cost Calculation:
 * Computes the average cross-entropy loss over all training examples.
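
The effect of the clipping step can be seen on predictions that are exactly 0 or 1. This is a small sketch, not part of the notebook's code: the helper `bce` and its `eps` argument are hypothetical names introduced here, and `np.errstate` is used only to silence the warnings that the unclipped case triggers.

```python
import numpy as np

def bce(AL, Y, eps=None):
    """Binary cross-entropy; clips AL to [eps, 1 - eps] when eps is given."""
    if eps is not None:
        AL = np.clip(AL, eps, 1 - eps)
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

# Perfect predictions: the true cost should be 0
AL = np.array([[1.0, 0.0]])
Y = np.array([[1, 0]])

unclipped = bce(AL, Y)           # 0 * log(0) evaluates to nan
clipped = bce(AL, Y, eps=1e-15)  # finite and very close to 0

print(np.isnan(unclipped), np.isfinite(clipped))  # True True
```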

In [12]:
# Example of Cross-Entropy Loss Function
# The raw values below are treated as pre-activation scores; sigmoid maps them to probabilities
AL = sigmoid(np.array([[0.9, 0.2, 0.8, 0.4]]))
Y = np.array([[1, 0, 1, 0]])
print("Cross-Entropy Loss:", cost_function(AL, Y))

Cross-Entropy Loss: 0.6058521656153525


##**3.7.4: Backpropagation**

**Definition 3.7.9**

Backpropagation is the cornerstone of training neural networks. It efficiently computes the gradients of the loss function with respect to each weight and bias by applying the chain rule of calculus. This process allows the network to update its parameters to minimize the loss.



**Mathematical Intuition 3.7.10**

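For each layer $l$ in the network, backpropagation involves:
1. Compute the Error Term at the Output Layer:
 * $\delta^{(L)} = \mathbf{A}^{(L)} - \mathbf{Y}$
 * where $L$ is the output layer. This is the gradient of each example's loss with respect to $\mathbf{Z}^{(L)}$, and it takes this form for a sigmoid output combined with the cross-entropy loss.
2. Propagate the Error Term Back Through Each Hidden Layer:
 * $\delta^{(l)} = \left( (\mathbf{W}^{(l+1)})^T \delta^{(l+1)} \right) \odot g'(\mathbf{Z}^{(l)})$
 * where $g'$ is the derivative of the activation function and $\odot$ denotes elementwise multiplication.
3. Compute Gradients with Respect to Weights and Biases:
 * $\frac{\partial J}{\partial \mathbf{W}^{(l)}} = \frac{1}{m} \delta^{(l)} (\mathbf{A}^{(l-1)})^T$
 * $\frac{\partial J}{\partial \mathbf{b}^{(l)}} = \frac{1}{m} \sum_{i=1}^{m} \delta^{(l)}_{:,i}$, where the sum runs over the $m$ columns (examples) of $\delta^{(l)}$.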
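
The output-layer expression $\mathbf{A}^{(L)} - \mathbf{Y}$ can be derived for a single example, assuming a sigmoid output and the cross-entropy loss of Definition 3.7.8. The scalars $z$, $a$, $y$, and $\ell$ are notation introduced here for one example's pre-activation, prediction, label, and loss.

$$
\begin{aligned}
a &= \frac{1}{1 + e^{-z}}, \qquad \ell = -\left[ y \log a + (1 - y) \log(1 - a) \right] \\
\frac{\partial \ell}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad \frac{\partial a}{\partial z} = a(1 - a) \\
\frac{\partial \ell}{\partial z} &= \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) a(1 - a) = -y(1 - a) + (1 - y)a = a - y
\end{aligned}
$$

Stacking this result over all $m$ examples gives $\mathbf{A}^{(L)} - \mathbf{Y}$; the factor $\frac{1}{m}$ from the cost enters when the weight and bias gradients are formed.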

In [13]:
# Backpropagation Implementation

def backward_propagation(parameters, activations, Y):
    """
    Implement backpropagation to compute gradients.

    Arguments:
    parameters -- Dictionary containing weights and biases.
    activations -- Dictionary containing activations from forward propagation.
    Y -- True labels, shape (1, m)

    Returns:
    grads -- Dictionary containing gradients with respect to each parameter.
    """
    grads = {}
    L = len(parameters) // 2  # Number of layers
    m = Y.shape[1]

    # Initialize backpropagation
    # Compute derivative for the output layer
    A_final = activations[f"A{L}"]
    dZ = A_final - Y  # For sigmoid activation and cross-entropy loss
    grads[f"dW{L}"] = (1/m) * np.dot(dZ, activations[f"A{L-1}"].T)
    grads[f"db{L}"] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

    # Loop from l=L-1 to l=1
    for l in reversed(range(1, L)):
        W_next = parameters[f"W{l+1}"]
        dA = np.dot(W_next.T, dZ)
        Z = np.dot(parameters[f"W{l}"], activations[f"A{l-1}"]) + parameters[f"b{l}"]
        dZ = dA * relu_derivative(Z)
        grads[f"dW{l}"] = (1/m) * np.dot(dZ, activations[f"A{l-1}"].T)
        grads[f"db{l}"] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

    return grads

def relu_derivative(Z):
    """
    Compute the derivative of ReLU activation function.

    Arguments:
    Z -- Linear combination input to the activation function.

    Returns:
    dZ -- Gradient of ReLU with respect to Z.
    """
    dZ = np.array(Z > 0, dtype=float)
    return dZ

*Code Analysis*
1. Initialization:
 * grads: Dictionary to store gradients.
 * L: Number of layers.
 * m: Number of training examples.

2. Output Layer Gradient (dZ):
 * For sigmoid activation combined with cross-entropy loss, the gradient simplifies to $\mathbf{A}^{(L)} - \mathbf{Y}$
3. Gradients for Output Layer:
 * dW and db are computed from dZ and the activations of the previous layer.
4. Backpropagation through Hidden Layers:
 * dA: Gradient with respect to the activation of layer l, propagated back from layer l+1.
 * dZ: Gradient with respect to $Z^{(l)}$, applying the derivative of the activation function (ReLU in this case).
 * grads["dW{l}"] and grads["db{l}"]: Gradients with respect to weights and biases.
5. ReLU Derivative (relu_derivative):
 * Returns 1 where Z > 0, else 0.
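
One way to gain confidence in the simplified output-layer gradient is a finite-difference check. The sketch below is an illustration under stated assumptions: the pre-activation values are arbitrary, `summed_loss` is a hypothetical helper, and the loss is summed rather than averaged so that its gradient with respect to Z is exactly A - Y.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def summed_loss(Z, Y):
    # Cross-entropy summed over examples (no 1/m factor)
    A = sigmoid(Z)
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

Z = np.array([[-0.13, 0.37, 0.08]])   # arbitrary pre-activations
Y = np.array([[1.0, 0.0, 1.0]])

analytic = sigmoid(Z) - Y

# Central finite differences, perturbing one entry of Z at a time
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(Z.shape[1]):
    Z_plus, Z_minus = Z.copy(), Z.copy()
    Z_plus[0, i] += eps
    Z_minus[0, i] -= eps
    numeric[0, i] = (summed_loss(Z_plus, Y) - summed_loss(Z_minus, Y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

The same idea extends to dW and db by perturbing individual weights, at the cost of one extra forward pass per perturbed parameter.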

In [17]:
# Example of Backpropagation

# Assume we have already performed forward propagation
parameters = {
    "W1": np.array([[0.1, -0.2],
                   [0.4, 0.5],
                   [-0.3, 0.2],
                   [0.1, -0.5]]),
    "b1": np.array([[0.0],
                   [0.0],
                   [0.0],
                   [0.0]]),
    "W2": np.array([[0.3, -0.1, 0.2, 0.4]]),
    "b2": np.array([[0.0]])
}

# Forward pass computed step by step (ReLU hidden layer, sigmoid output),
# so that no entry refers to the dictionary before it exists
A0 = np.array([[1, 2, -1],
               [3, -1, 2]])
A1 = relu(np.dot(parameters["W1"], A0) + parameters["b1"])
A2 = sigmoid(np.dot(parameters["W2"], A1) + parameters["b2"])
activations = {"A0": A0, "A1": A1, "A2": A2}

# True labels
Y = np.array([[1, 0, 1]])

# Perform backpropagation
grads = backward_propagation(parameters, activations, Y)

# Print gradients
for key in grads:
    print(f"{key}:\n{grads[key]}\n")

dW2:
[[ 0.0788612  -0.37407729 -0.16524792  0.13800709]]

db2:
[[-0.14033533]]

dW1:
[[ 0.1182918  -0.0591459 ]
 [-0.03768248  0.10496144]
 [-0.00349624 -0.17049228]
 [ 0.15772239 -0.0788612 ]]

db1:
[[ 0.0591459 ]
 [ 0.01403353]
 [-0.06749766]
 [ 0.0788612 ]]



##**3.7.5: Backpropagation Algorithm**

**Mathematical Intuition 3.7.11**

The Backpropagation Algorithm involves the following steps to update the network's parameters and minimize the loss:

1. Forward Propagation: Compute the linear combination $\mathbf{Z}^{(l)}$ and activation $\mathbf{A}^{(l)}$ for each layer.
2. Compute Loss: Calculate the cost using the cost function (e.g., cross-entropy loss).
3. Backward Propagation: Compute the gradients of the loss with respect to each parameter using the backpropagation process.
4. Update Parameters: Adjust the weights and biases using the computed gradients and the learning rate.
5. Repeat: Iterate through the forward and backward propagation steps for a predefined number of epochs or until convergence.


In [18]:
# Backpropagation Algorithm Implementation

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent.

    Arguments:
    parameters -- Dictionary containing weights and biases.
    grads -- Dictionary containing gradients.
    learning_rate -- Learning rate for parameter updates.

    Returns:
    parameters -- Dictionary with updated weights and biases.
    """
    L = len(parameters) // 2  # Number of layers

    for l in range(1, L + 1):
        parameters[f"W{l}"] -= learning_rate * grads[f"dW{l}"]
        parameters[f"b{l}"] -= learning_rate * grads[f"db{l}"]

    return parameters

# Example Usage of Backpropagation Algorithm
if __name__ == "__main__":
    # Initialize parameters
    layer_dims = [2, 4, 1]
    parameters = initialize_parameters(layer_dims)

    # Generate dummy input data
    X = np.array([[1, 2, -1],
                  [3, -1, 2]])

    # Define true labels
    Y = np.array([[1, 0, 1]])

    # Perform forward propagation
    activations = forward_propagation(X, parameters)

    # Compute cost
    AL = activations["A2"]
    cost = cost_function(AL, Y)
    print(f"Initial Cost: {cost}\n")

    # Perform backpropagation
    grads = backward_propagation(parameters, activations, Y)

    # Update parameters
    learning_rate = 0.1
    parameters = update_parameters(parameters, grads, learning_rate)

    # Perform forward propagation with updated parameters
    activations = forward_propagation(X, parameters)
    AL = activations["A2"]
    cost = cost_function(AL, Y)
    print(f"Updated Cost: {cost}\n")

Initial Cost: 6.057751944328228

Updated Cost: 1.8257783625976673



*Code Analysis*
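1. Parameter Update (update_parameters):
 * Adjusts each weight and bias by subtracting the product of the learning rate and the corresponding gradient.
 * Moves the parameters in the direction of steepest descent of the loss, which lowers the loss when the learning rate is small enough.

2. Example Usage:
 * Initializes a simple neural network.
 * Generates dummy input data and defines true labels.
 * Performs forward propagation to compute activations.
 * Calculates the initial cost.
 * Executes backpropagation to compute gradients.
 * Updates the parameters using the gradients.
 * Performs forward propagation again to observe the change in cost.
 * Note: forward_propagation from Example 3.7.1 applies ReLU at every layer, so A2 here is a ReLU output rather than a sigmoid probability, while backward_propagation assumes a sigmoid output. The near-zero ReLU outputs are clipped inside cost_function, which is why the initial cost is so large. Example 3.7.3 applies sigmoid at the output layer.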
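
Whether an update actually lowers the loss depends on the learning rate. The sketch below illustrates this on a one-parameter toy cost $J(w) = w^2$, which is an assumption made for illustration and not the network's cost; `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(w0, learning_rate, steps):
    """Minimize J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

w_small = gradient_descent(w0=1.0, learning_rate=0.1, steps=20)
w_large = gradient_descent(w0=1.0, learning_rate=1.1, steps=20)

print(abs(w_small) < 1.0)  # True: moved toward the minimum at w = 0
print(abs(w_large) > 1.0)  # True: each step overshoots and moves further away
```

With a learning rate of 0.1 each step multiplies `w` by 0.8, while a learning rate of 1.1 multiplies it by -1.2, so the iterates grow in magnitude.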

**Example 3.7.2: PyTorch Implementation**


In [19]:
# Neural Networks with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(2, 4)  # Input layer to hidden layer
        self.fc2 = nn.Linear(4, 1)  # Hidden layer to output layer
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

# Instantiate the model
model = SimpleNN()
print(model)

SimpleNN(
  (fc1): Linear(in_features=2, out_features=4, bias=True)
  (fc2): Linear(in_features=4, out_features=1, bias=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
)


*Code Analysis*
1. Model Architecture (SimpleNN):
 * fc1: Fully connected layer mapping 2 input features to 4 neurons in the hidden layer.
 * fc2: Fully connected layer mapping 4 neurons to 1 output neuron.
2. Activation Functions:
 * ReLU: Applied after the first layer.
 * Sigmoid: Applied after the second layer to output probabilities.
3. Model Instantiation:
 * Creates an instance of the SimpleNN class and prints its architecture.
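
The size of this architecture can be worked out from the layer widths alone. The following sketch uses a hypothetical helper, `count_parameters`, and assumes every layer is fully connected with one bias per output unit, as the printed model (`bias=True`) indicates.

```python
def count_parameters(layer_dims):
    """Count weights and biases of a fully connected network.

    layer_dims lists the layer widths, input first, e.g. [2, 4, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        total += n_in * n_out  # weight matrix of shape (n_out, n_in)
        total += n_out         # one bias per output unit
    return total

print(count_parameters([2, 4, 1]))  # 2*4 + 4 + 4*1 + 1 = 17
```

In PyTorch the same number can be obtained from a model instance with `sum(p.numel() for p in model.parameters())`.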

**Example 3.7.2 (continued): Training Example**


In [20]:
# Training a Simple Neural Network with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim

# Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(2, 4)  # Input layer to hidden layer
        self.fc2 = nn.Linear(4, 1)  # Hidden layer to output layer
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

# Instantiate the model
model = SimpleNN()
print(model)

# Define loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = optim.SGD(model.parameters(), lr=0.1)  # Stochastic Gradient Descent

# Dummy data
X = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [0.0, 0.0]])
Y = torch.tensor([[1.0],
                  [0.0],
                  [1.0],
                  [0.0]])

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, Y)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss every 100 epochs
    if (epoch+1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training complete.")

# Testing the model
with torch.no_grad():
    predicted = model(X)
    predicted = (predicted >= 0.5).float()
    print("Predicted labels:\n", predicted)
    print("True labels:\n", Y)

SimpleNN(
  (fc1): Linear(in_features=2, out_features=4, bias=True)
  (fc2): Linear(in_features=4, out_features=1, bias=True)
  (relu): ReLU()
  (sigmoid): Sigmoid()
)
Epoch [100/1000], Loss: 0.2867
Epoch [200/1000], Loss: 0.0585
Epoch [300/1000], Loss: 0.0247
Epoch [400/1000], Loss: 0.0144
Epoch [500/1000], Loss: 0.0098
Epoch [600/1000], Loss: 0.0072
Epoch [700/1000], Loss: 0.0057
Epoch [800/1000], Loss: 0.0046
Epoch [900/1000], Loss: 0.0039
Epoch [1000/1000], Loss: 0.0034
Training complete.
Predicted labels:
 tensor([[1.],
        [0.],
        [1.],
        [0.]])
True labels:
 tensor([[1.],
        [0.],
        [1.],
        [0.]])


*Code Analysis*

1. Model Definition:
 * Defines a simple neural network with one hidden layer using ReLU and sigmoid activation functions.
2. Loss Function and Optimizer:
 * BCELoss: Suitable for binary classification tasks.
 * SGD: Optimizer with a learning rate of 0.1.
3. Dummy Data:
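 * Four samples with two features each. The label equals the first feature, so the two classes are linearly separable; this is not the XOR pattern, which would label the four samples 1, 1, 0, 0.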
4. Training Loop:
 * Performs forward propagation, computes loss, backpropagates the gradients, and updates the model parameters.
 * Prints the loss every 100 epochs to monitor training progress.
5. Testing the Model:
 * After training, makes predictions on the training data.
 * Applies a threshold of 0.5 to convert probabilities to binary labels.
 * Prints the predicted and true labels for comparison.
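
The structure of the dummy data can be inspected directly. This short sketch copies the samples and labels from the training cell into NumPy arrays and compares the labels with the XOR of the two features and with the first feature alone.

```python
import numpy as np

X = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [0, 0]])
Y = np.array([1, 0, 1, 0])

xor_labels = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

print("XOR of the features:", xor_labels)                          # [1 1 0 0]
print("Labels match XOR:", np.array_equal(Y, xor_labels))           # False
print("Labels match first feature:", np.array_equal(Y, X[:, 0]))    # True
```

Because the label depends on a single feature, a linear decision boundary is enough to separate the classes, which is consistent with the loss dropping quickly during training.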

**Example 3.7.3: Complete Neural Network**

In [21]:
# Complete Neural Network Training with Backpropagation

import numpy as np

# Activation functions and their derivatives
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid function."""
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU function."""
    return (z > 0).astype(float)

# Initialize parameters
def initialize_parameters(layer_dims):
    """
    Initialize weights and biases.

    Arguments:
    layer_dims -- List containing the number of units in each layer.

    Returns:
    parameters -- Dictionary containing initialized weights and biases.
    """
    np.random.seed(42)
    parameters = {}
    L = len(layer_dims)  # Number of layers

    for l in range(1, L):
        parameters[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))

    return parameters

# Forward propagation
def forward_propagation(X, parameters):
    """
    Perform forward propagation.

    Arguments:
    X -- Input data, shape (n_x, m)
    parameters -- Dictionary containing weights and biases.

    Returns:
    activations -- Dictionary containing activations for each layer.
    Z_values -- Dictionary containing linear combinations for each layer.
    """
    activations = {"A0": X}
    Z_values = {}
    L = len(parameters) // 2  # Number of layers

    for l in range(1, L + 1):
        W = parameters[f"W{l}"]
        b = parameters[f"b{l}"]
        Z = np.dot(W, activations[f"A{l-1}"]) + b
        Z_values[f"Z{l}"] = Z

        if l != L:
            A = relu(Z)
        else:
            A = sigmoid(Z)
        activations[f"A{l}"] = A

    return activations, Z_values

# Compute cost
def compute_cost(AL, Y):
    """
    Compute the cross-entropy cost.

    Arguments:
    AL -- Probability vector corresponding to label predictions, shape (1, m)
    Y -- True "label" vector, shape (1, m)

    Returns:
    cost -- Cross-entropy cost
    """
    m = Y.shape[1]
    cost = - (1/m) * np.sum(Y * np.log(AL + 1e-15) + (1 - Y) * np.log(1 - AL + 1e-15))
    return np.squeeze(cost)

# Backward propagation
def backward_propagation(parameters, activations, Z_values, Y):
    """
    Perform backward propagation.

    Arguments:
    parameters -- Dictionary containing weights and biases.
    activations -- Dictionary containing activations from forward propagation.
    Z_values -- Dictionary containing linear combinations for each layer.
    Y -- True "label" vector, shape (1, m)

    Returns:
    grads -- Dictionary containing gradients with respect to each parameter.
    """
    grads = {}
    L = len(parameters) // 2  # Number of layers
    m = Y.shape[1]

    # Output layer: with a sigmoid output and cross-entropy cost,
    # the gradient with respect to Z simplifies to AL - Y
    AL = activations[f"A{L}"]
    dZL = AL - Y
    grads[f"dW{L}"] = (1/m) * np.dot(dZL, activations[f"A{L-1}"].T)
    grads[f"db{L}"] = (1/m) * np.sum(dZL, axis=1, keepdims=True)

    # Backprop through hidden layers
    for l in reversed(range(1, L)):
        dA = np.dot(parameters[f"W{l+1}"].T, dZL)
        dZ = dA * relu_derivative(Z_values[f"Z{l}"])
        grads[f"dW{l}"] = (1/m) * np.dot(dZ, activations[f"A{l-1}"].T)
        grads[f"db{l}"] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
        dZL = dZ  # Update dZL for next iteration

    return grads

# Update parameters
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent.

    Arguments:
    parameters -- Dictionary containing weights and biases.
    grads -- Dictionary containing gradients.
    learning_rate -- Learning rate for parameter updates.

    Returns:
    parameters -- Dictionary with updated weights and biases.
    """
    L = len(parameters) // 2  # Number of layers

    for l in range(1, L + 1):
        parameters[f"W{l}"] -= learning_rate * grads[f"dW{l}"]
        parameters[f"b{l}"] -= learning_rate * grads[f"db{l}"]

    return parameters

# Neural network model
def neural_network_model(X, Y, layer_dims, learning_rate=0.01, num_iterations=10000, print_cost=False):
    """
    Train a neural network.

    Arguments:
    X -- Input data, shape (n_x, m)
    Y -- True labels, shape (1, m)
    layer_dims -- List containing the dimensions of each layer.
    learning_rate -- Learning rate for gradient descent.
    num_iterations -- Number of iterations to train.
    print_cost -- If True, print the cost every 1000 iterations.

    Returns:
    parameters -- Trained weights and biases.
    """
    parameters = initialize_parameters(layer_dims)

    for i in range(1, num_iterations + 1):
        # Forward propagation
        activations, Z_values = forward_propagation(X, parameters)

        # Compute cost
        cost = compute_cost(activations[f"A{len(layer_dims)-1}"], Y)

        # Backward propagation
        grads = backward_propagation(parameters, activations, Z_values, Y)

        # Update parameters
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print cost
        if print_cost and i % 1000 == 0:
            print(f"Cost after iteration {i}: {cost:.6f}")

    return parameters

# Example Usage
if __name__ == "__main__":
    # Define layer dimensions
    layer_dims = [2, 4, 1]  # 2 input features, 4 neurons in hidden layer, 1 output neuron

    # Generate dummy input data (2 features, 3 examples)
    X = np.array([[1, 2, -1],
                  [3, -1, 2]])  # Shape: (2, 3)

    # Define true labels
    Y = np.array([[1, 0, 1]])  # Shape: (1, 3)

    # Train the neural network
    parameters = neural_network_model(X, Y, layer_dims, learning_rate=0.1, num_iterations=10000, print_cost=True)

    # Perform forward propagation with trained parameters
    activations, Z_values = forward_propagation(X, parameters)

    # Compute final cost
    final_cost = compute_cost(activations[f"A{len(layer_dims)-1}"], Y)
    print(f"\nFinal Cost: {final_cost}")

Cost after iteration 1000: 0.000804
Cost after iteration 2000: 0.000328
Cost after iteration 3000: 0.000199
Cost after iteration 4000: 0.000141
Cost after iteration 5000: 0.000108
Cost after iteration 6000: 0.000088
Cost after iteration 7000: 0.000073
Cost after iteration 8000: 0.000063
Cost after iteration 9000: 0.000055
Cost after iteration 10000: 0.000049

Final Cost: 4.853236898537035e-05


*Code Analysis*
1. Activation Functions and Derivatives:
 * Sigmoid: Used in the output layer for binary classification.
 * ReLU: Used in hidden layers for non-linear transformations.

2. Parameter Initialization (initialize_parameters):
 * Initializes weights with small random values and biases with zeros based on layer dimensions.
3. Forward Propagation (forward_propagation):
 * Computes activations and linear combinations for each layer.
 * Uses ReLU for hidden layers and sigmoid for the output layer.
4. Cost Calculation (compute_cost):
 * Computes cross-entropy loss between predicted probabilities and true labels.
5. Backward Propagation (backward_propagation):
 * Computes gradients of the loss with respect to weights and biases using the chain rule.
 * Applies derivatives of activation functions.
6. Parameter Update (update_parameters):
 * Updates weights and biases using gradient descent.
7. Neural Network Model (neural_network_model):
 * Trains the neural network over a specified number of iterations.
 * Optionally prints the cost at intervals to monitor training.
8. Example Usage:
 * Trains a simple neural network on dummy data.
 * Prints the final cost after training.
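
To turn output probabilities into class labels, a threshold of 0.5 can be applied, as in the PyTorch example. The sketch below is one possible helper, not part of the notebook: `predict` is a hypothetical name, and the parameters are the hand-set (untrained) values from the backpropagation example, so the predictions are not expected to match the labels.

```python
import numpy as np

def predict(X, parameters, threshold=0.5):
    """Forward pass (ReLU hidden layers, sigmoid output), then thresholding."""
    A = X
    L = len(parameters) // 2  # number of layers with parameters
    for l in range(1, L + 1):
        Z = parameters[f"W{l}"] @ A + parameters[f"b{l}"]
        # ReLU for hidden layers, sigmoid for the output layer
        A = np.maximum(0, Z) if l != L else 1 / (1 + np.exp(-Z))
    return (A >= threshold).astype(int)

# Hand-set parameters reused from the backpropagation example (untrained)
parameters = {
    "W1": np.array([[0.1, -0.2], [0.4, 0.5], [-0.3, 0.2], [0.1, -0.5]]),
    "b1": np.zeros((4, 1)),
    "W2": np.array([[0.3, -0.1, 0.2, 0.4]]),
    "b2": np.zeros((1, 1)),
}
X = np.array([[1, 2, -1],
              [3, -1, 2]])

print(predict(X, parameters))  # [[0 1 1]]
```

Passing the dictionary returned by `neural_network_model` instead would give predictions from the trained network.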