### 1️⃣ What is a Neural Network?

A neural network is a mathematical model inspired by the human brain. Just like our brain has neurons that process information, a neural network has artificial neurons (also called perceptrons) that learn patterns from data.

#### 🔍 Real-World Analogy: Learning to Recognize Cats

Imagine you’re teaching a child to recognize cats.

You show different pictures and say, “This is a cat” or “This is not a cat.” \
The child remembers patterns like whiskers, ears, and fur. \
Next time, when shown a new image, they use these patterns to decide if it's a cat. \
A neural network does exactly this—except it learns the patterns mathematically. 

#### 2️⃣ The Core Structure of a Neural Network

A neural network consists of three main layers:

1️⃣ Input Layer – Takes raw data (e.g., pixel values of an image). \
2️⃣ Hidden Layers – Extracts features & learns patterns. \
3️⃣ Output Layer – Produces predictions (e.g., “Cat” or “Not Cat”).

**🖼 Visual Representation:**

        Input Layer     Hidden Layer       Output Layer
      [ X1  X2  X3 ] → [ H1  H2  H3 ] →     [  Y  ]

### 3️⃣ The Building Blocks of a Neural Network

#### 🔹 Neurons (Perceptrons)

Each **neuron** takes input, processes it, and passes the result forward.

🔢 **Mathematical Representation:** Each neuron calculates:

$$Z = W_1X_1 + W_2X_2 + ... + W_nX_n + B$$

where:
* $X_i$ are inputs (features)
* $W_i$ are weights (importance of each input)
* $B$ is bias (adjusts output shift)

#### 🔹 Activation Functions (Adding Non-Linearity)

The **activation function** determines whether a neuron **fires** or not. Without it, the network is just a linear model.

🛠 **Common Activation Functions:**

✅ **Sigmoid** (Good for probabilities):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

✅ **ReLU** (Most commonly used in deep networks):

$$g(z) = max(0, z)$$

### What is Forward Propagation?

Forward propagation is the process of passing inputs through the network, layer by layer, to get the final output.

Example: Predicting House Prices 🏠

Imagine you want to predict the price of a house using:

* Size (square feet)
* Number of bedrooms
* Location quality (1-10 scale)
* Let’s assume a simple neural network with:

* 3 input neurons (one for each feature)
* 1 hidden layer (3 neurons)
* 1 output neuron (house price prediction)

#### The Mathematics of Forward Propagation

Each neuron in the **hidden layer** applies this formula:

$$Z = W_1X_1 + W_2X_2 + W_3X_3 + B$$

Then, it applies an **activation function** (e.g., ReLU or sigmoid) to introduce non-linearity:

$$A = g(Z)$$

The **output neuron** repeats this process to compute the final prediction.

#### **Forward Propagation in Python (OOP-based)**

In [2]:
import numpy as np

# Define the ReLU activation function
def relu(z):
    return np.maximum(0, z)

# Define a simple neural network class
class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases randomly
        self.W1 = np.random.randn(hidden_size, input_size)  # Weights for input to hidden layer
        self.b1 = np.zeros((hidden_size, 1))               # Bias for hidden layer
        self.W2 = np.random.randn(output_size, hidden_size) # Weights for hidden to output layer
        self.b2 = np.zeros((output_size, 1))               # Bias for output layer

    def forward(self, X):
        """
        Perform forward propagation.
        X: Input data (features)
        """
        # Compute the hidden layer activations
        Z1 = np.dot(self.W1, X) + self.b1  # Linear transformation
        A1 = relu(Z1)                      # Apply activation function
        
        # Compute the output layer
        Z2 = np.dot(self.W2, A1) + self.b2
        return Z2  # Output value (raw prediction)

# Example usage
np.random.seed(42)
nn = SimpleNeuralNetwork(input_size=3, hidden_size=3, output_size=1)

# Sample input (Size, Bedrooms, Location Quality)
X = np.array([[1200], [3], [8]])  # Input column vector

# Perform forward propagation
prediction = nn.forward(X)
print(f"Predicted house price: {prediction[0,0]:.2f}")

Predicted house price: -1401.69


### What is Backpropagation?

Backpropagation is the algorithm that allows a neural network to **adjust its weights and biases** based on the error in its predictions.

#### 🛠 Step-by-Step Process

1. **Compute the error** between the predicted output and the actual value.
2. **Calculate gradients** to see how much each weight contributed to the error.
3. **Update weights** using gradient descent to minimize future errors.

#### Understanding Loss & Gradient Descent

A neural network learns by minimizing a **loss function**, which measures the difference between predicted and actual values.

#### 🔹 Common Loss Functions

* **Mean Squared Error (MSE)** (for regression) 
$$L = \frac{1}{n} \sum (Y - \hat{Y})^2$$

* **Cross-Entropy Loss** (for classification) 
$$L = -\sum Y \log(\hat{Y})$$

The network **reduces this loss** using **gradient descent**, which updates weights in the direction that decreases the error.

#### The Mathematics of Backpropagation

For each weight $W$, we compute:

$$\frac{\partial L}{\partial W} = \text{how much the loss changes with respect to } W$$

Then, we update $W$ using **gradient descent**:

$$W = W - \alpha \cdot \frac{\partial L}{\partial W}$$

where $\alpha$ is the **learning rate** (controls how much we adjust weights per step).

In [3]:
import numpy as np

# Define activation functions and their derivatives
def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return (Z > 0).astype(float)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z):
    return sigmoid(Z) * (1 - sigmoid(Z))

class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        np.random.seed(42)
        self.learning_rate = learning_rate
        self.W1 = np.random.randn(hidden_size, input_size)
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size)
        self.b2 = np.zeros((output_size, 1))

    def forward(self, X):
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = relu(self.Z1)
        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, X, Y):
        m = X.shape[1]  # Number of samples
        
        # Compute gradients
        dZ2 = self.A2 - Y  # Error at output layer
        dW2 = (1/m) * np.dot(dZ2, self.A1.T)
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)

        dZ1 = np.dot(self.W2.T, dZ2) * relu_derivative(self.Z1)
        dW1 = (1/m) * np.dot(dZ1, X.T)
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)

        # Update weights using gradient descent
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2

    def train(self, X, Y, epochs=1000):
        for i in range(epochs):
            self.forward(X)
            self.backward(X, Y)
            if i % 100 == 0:
                loss = np.mean((self.A2 - Y) ** 2)
                print(f"Epoch {i}, Loss: {loss:.4f}")

# Example dataset (X: 3 features, Y: binary output)
X = np.array([[1200, 3, 8], [850, 2, 5], [1500, 4, 9]]).T / 1000
Y = np.array([[1, 0, 1]])  # Expected outputs (1 = expensive, 0 = cheap)

# Train neural network
nn = SimpleNeuralNetwork(input_size=3, hidden_size=3, output_size=1, learning_rate=0.1)
nn.train(X, Y, epochs=1000)

# Predict on new data
test_X = np.array([[1000], [3], [7]]) / 1000
prediction = nn.forward(test_X)
print(f"Predicted price category: {prediction[0,0]:.2f}")

Epoch 0, Loss: 0.4811
Epoch 100, Loss: 0.1614
Epoch 200, Loss: 0.1077
Epoch 300, Loss: 0.0603
Epoch 400, Loss: 0.0293
Epoch 500, Loss: 0.0139
Epoch 600, Loss: 0.0072
Epoch 700, Loss: 0.0046
Epoch 800, Loss: 0.0033
Epoch 900, Loss: 0.0023
Predicted price category: 0.50


#### 1️⃣ Key Challenges in Training Deep Networks

When training deep networks, we face several challenges:

❌ **Slow Learning** – Training can take a long time if optimization is inefficient. \
❌ **Vanishing/Exploding Gradients** – Gradients become too small or too large in deep networks. \
❌ **Overfitting** – The network memorizes training data instead of generalizing to new data. \
❌ **Poor Convergence** – The model gets stuck in bad local minima and stops improving.

We'll now explore **optimization techniques** and **regularization methods** to tackle these problems.

#### 2️⃣ Optimization: Improving Gradient Descent

#### 🔹 **Gradient Descent Recap**

Gradient descent updates weights using:

$$W = W - \alpha \cdot \frac{\partial L}{\partial W}$$

where $\alpha$ is the **learning rate**.

However, **vanilla gradient descent** has problems like slow convergence. Let's look at better optimization techniques.



#### 🔹 **Advanced Optimizers**

✅ **Stochastic Gradient Descent (SGD)**
* Instead of computing gradients over **all** training samples, it updates weights **after each batch**.
* Improves speed but introduces noise.

✅ **Momentum-based Gradient Descent**
* Adds a "velocity" term to the weight updates to smoothen training.
$$v_t = \beta v_{t-1} + (1 - \beta) \frac{\partial L}{\partial W}$$
$$W = W - \alpha v_t$$
* Helps escape bad local minima.

✅ **RMSprop (Root Mean Square Propagation)**
* Uses adaptive learning rates by normalizing gradients.
* Good for non-stationary data (e.g., time-series).

✅ **Adam (Adaptive Moment Estimation)**
* Combines **Momentum + RMSprop**
* Most widely used optimizer for deep learning.

#### 3️⃣ Regularization: Preventing Overfitting

If a neural network is too complex, it **memorizes** training data instead of generalizing. Regularization helps by **adding constraints** to the model.

#### 🔹 **L1 & L2 Regularization (Weight Decay)**

* **L1 (Lasso)**: Encourages sparsity (some weights become 0).
* **L2 (Ridge)**: Shrinks large weights, preventing overfitting.

$$L = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum W^2$$

where $\lambda$ is the regularization strength.

#### 🔹 **Dropout Regularization**

* **Randomly drops** neurons during training to prevent reliance on specific neurons.
* Example: If dropout = 0.5, each neuron has a **50% chance of being ignored** in each step.

#### 🔹 **Batch Normalization**

* Normalizes activations to **reduce internal covariate shift**.
* Speeds up training and improves stability.

#### **Implementing Optimizers & Regularization in Python (OOP-Based):**

In [5]:
import numpy as np

# Define activation functions and derivatives
def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return (Z > 0).astype(float)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z):
    return sigmoid(Z) * (1 - sigmoid(Z))

class OptimizedNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01, l2_lambda=0.01, dropout_rate=0.2):
        np.random.seed(42)
        self.learning_rate = learning_rate
        self.l2_lambda = l2_lambda
        self.dropout_rate = dropout_rate
        
        # Initialize weights
        self.W1 = np.random.randn(hidden_size, input_size) * 0.01
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * 0.01
        self.b2 = np.zeros((output_size, 1))

        # Adam Optimizer parameters
        self.vdW1, self.vdW2 = 0, 0
        self.sdW1, self.sdW2 = 0, 0
        self.beta1, self.beta2, self.epsilon = 0.9, 0.999, 1e-8

    def forward(self, X, training=True):
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = relu(self.Z1)
        
        # Dropout Regularization
        if training:
            self.dropout_mask = (np.random.rand(*self.A1.shape) > self.dropout_rate) / (1 - self.dropout_rate)
            self.A1 *= self.dropout_mask

        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, X, Y):
        m = X.shape[1]

        # Compute gradients
        dZ2 = self.A2 - Y
        dW2 = (1/m) * np.dot(dZ2, self.A1.T) + (self.l2_lambda/m) * self.W2  # L2 Regularization
        db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)

        dZ1 = np.dot(self.W2.T, dZ2) * relu_derivative(self.Z1)
        dW1 = (1/m) * np.dot(dZ1, X.T) + (self.l2_lambda/m) * self.W1  # L2 Regularization
        db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)

        # Adam Optimizer updates
        self.vdW1 = self.beta1 * self.vdW1 + (1 - self.beta1) * dW1
        self.vdW2 = self.beta1 * self.vdW2 + (1 - self.beta1) * dW2

        self.sdW1 = self.beta2 * self.sdW1 + (1 - self.beta2) * (dW1 ** 2)
        self.sdW2 = self.beta2 * self.sdW2 + (1 - self.beta2) * (dW2 ** 2)

        # Bias correction
        vdW1_corr = self.vdW1 / (1 - self.beta1)
        vdW2_corr = self.vdW2 / (1 - self.beta1)
        sdW1_corr = self.sdW1 / (1 - self.beta2)
        sdW2_corr = self.sdW2 / (1 - self.beta2)

        # Update weights
        self.W1 -= self.learning_rate * vdW1_corr / (np.sqrt(sdW1_corr) + self.epsilon)
        self.W2 -= self.learning_rate * vdW2_corr / (np.sqrt(sdW2_corr) + self.epsilon)
        self.b1 -= self.learning_rate * db1
        self.b2 -= self.learning_rate * db2

    def train(self, X, Y, epochs=1000):
        for i in range(epochs):
            self.forward(X)
            self.backward(X, Y)
            if i % 100 == 0:
                loss = np.mean((self.A2 - Y) ** 2)
                print(f"Epoch {i}, Loss: {loss:.4f}")

# Train the improved model
X = np.array([[1200, 3, 8], [850, 2, 5], [1500, 4, 9]]).T / 1000
Y = np.array([[1, 0, 1]])

nn = OptimizedNeuralNetwork(input_size=3, hidden_size=3, output_size=1, learning_rate=0.01)
nn.train(X, Y, epochs=1000)

Epoch 0, Loss: 0.2500
Epoch 100, Loss: 0.2104
Epoch 200, Loss: 0.2036
Epoch 300, Loss: 0.2036
Epoch 400, Loss: 0.2039
Epoch 500, Loss: 0.2163
Epoch 600, Loss: 0.1353
Epoch 700, Loss: 0.2127
Epoch 800, Loss: 0.1951
Epoch 900, Loss: 0.2088
