#### 1. Neural Network Architecture Diagram

```mermaid
graph TD
   subgraph "Input Layer"
       A["Feature 1"]
       B["Feature 2"]
       C["Feature 3"]
       D["..."]
       E["Feature n"]
   end
   
   subgraph "Hidden Layer (2 neurons)"
       F["Hidden 1<br/>sigmoid"]
       G["Hidden 2<br/>sigmoid"]
   end
   
   subgraph "Output Layer"
       H["Output<br/>sigmoid<br/>(Binary Classification)"]
   end
   
   A --> F
   A --> G
   B --> F
   B --> G
   C --> F
   C --> G
   D --> F
   D --> G
   E --> F
   E --> G
   
   F --> H
   G --> H
   
   style A fill:#E3F2FD
style B fill:#E1F5FE
style C fill:#B3E5FC
style D fill:#81D4FA
style E fill:#4FC3F7
style F fill:#29B6F6
style G fill:#03A9F4
style H fill:#039BE5
```

# 2. Numerical Example: Student Grade Prediction

**Problem**: Predict if a student passes (1) or fails (0) based on study hours and sleep hours.

**Training Data**: One sample
- Student: study_hours = 6, sleep_hours = 8, actual_grade = 1 (pass)

**Initial Parameters**:
```
Input features: x = [6, 8]
Target: y = 1

Initial Weights:
weights_input_hidden = [[0.2, 0.3],    # From input 1 to [hidden 1, hidden 2]
                        [0.4, 0.1]]     # From input 2 to [hidden 1, hidden 2]

weights_hidden_output = [0.5, -0.2]    # From [hidden 1, hidden 2] to output

Learning rate = 0.1
```

Different weights are essential for the neural network to learn effectively. Here's why:

###### Breaking Symmetry

**If all weights were identical:**
```
weights_input_hidden = [[0.2, 0.2],    # Same weights
                        [0.2, 0.2]]     # Same weights
```

**Problem**: Both hidden neurons would:
- Receive identical inputs: `0.2×input1 + 0.2×input2`
- Produce identical outputs
- Receive identical error gradients during backpropagation
- Update by identical amounts

**Result**: The two hidden neurons would remain functionally identical throughout training, effectively reducing your network to having only one hidden neuron.

###### Mathematical Proof of the Problem

**Forward pass with identical weights:**
```
hidden_input = [6, 8] · [[0.2, 0.2],
                         [0.2, 0.2]]
hidden_input = [4.8, 4.8]  # Identical values
hidden_output = [sigmoid(4.8), sigmoid(4.8)] = [0.992, 0.992]  # Identical
```

**Backpropagation with identical weights:**
Both neurons receive the same error signal and update identically, maintaining the symmetry forever.

###### Why Different Weights Enable Learning

**With different weights:**
```
weights_input_hidden = [[0.2, 0.3],    # Different weights
                        [0.4, 0.1]]     # Different weights
```

**Each neuron specializes:**
- **Hidden neuron 1**: Emphasizes input2 (weight 0.4 vs 0.2)
- **Hidden neuron 2**: Emphasizes input1 (weight 0.3 vs 0.1)

**This allows:**
- **Neuron 1** might learn to detect "high sleep hours" patterns
- **Neuron 2** might learn to detect "high study hours" patterns
- **Combined** they can capture complex relationships between both inputs

###### The Diversity Principle

Different initial weights create **functional diversity**:

**Weight Matrix Analysis:**
```
From input 1: [0.2, 0.3] → Different influence on each hidden neuron
From input 2: [0.4, 0.1] → Different influence on each hidden neuron
```

This creates two distinct "feature detectors" in the hidden layer, each capable of learning different aspects of the input patterns.

###### Random Initialization Strategy

The code uses random initialization specifically to ensure diversity:
```python
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
```

This guarantees that each connection starts with a different value, breaking symmetry from the beginning and enabling each neuron to develop its own specialized function during training.

Without different weights, you lose the representational power that comes from having multiple neurons work together to capture different aspects of the data patterns.

###### Forward Propagation

**Step 1: Input to Hidden Layer**
```
hidden_input = x · weights_input_hidden
hidden_input = [6, 8] · [[0.2, 0.3],
                         [0.4, 0.1]]
hidden_input = [6×0.2 + 8×0.4, 6×0.3 + 8×0.1]
hidden_input = [1.2 + 3.2, 1.8 + 0.8]
hidden_input = [4.4, 2.6]
```

**Step 2: Apply Sigmoid to Hidden Layer**
```
hidden_output = sigmoid([4.4, 2.6])
hidden_output = [1/(1+e^(-4.4)), 1/(1+e^(-2.6))]
hidden_output = [0.988, 0.931]
```

**Step 3: Hidden to Output Layer**
```
output_input = hidden_output · weights_hidden_output
output_input = [0.988, 0.931] · [0.5, -0.2]
output_input = 0.988×0.5 + 0.931×(-0.2)
output_input = 0.494 - 0.186 = 0.308
```

**Step 4: Apply Sigmoid to Output**
```
output = sigmoid(0.308) = 1/(1+e^(-0.308)) = 0.576
```

**Prediction**: 0.576 (57.6% chance of passing)

###### Backpropagation

**Step 1: Calculate Output Error**
```
error = y - output = 1 - 0.576 = 0.424
output_error_term = error × output × (1 - output)
output_error_term = 0.424 × 0.576 × (1 - 0.576)
output_error_term = 0.424 × 0.576 × 0.424 = 0.104
```

**Step 2: Propagate Error to Hidden Layer**
```
hidden_error = output_error_term × weights_hidden_output
hidden_error = 0.104 × [0.5, -0.2] = [0.052, -0.021]

hidden_error_term = hidden_error × hidden_output × (1 - hidden_output)
For hidden neuron 1: 0.052 × 0.988 × (1 - 0.988) = 0.052 × 0.988 × 0.012 = 0.0006
For hidden neuron 2: -0.021 × 0.931 × (1 - 0.931) = -0.021 × 0.931 × 0.069 = -0.001

hidden_error_term = [0.0006, -0.001]
```

**Step 3: Calculate Weight Updates**

**Update weights_hidden_output**:
```
Δweights_hidden_output = learning_rate × output_error_term × hidden_output
Δweights_hidden_output = 0.1 × 0.104 × [0.988, 0.931]
Δweights_hidden_output = [0.0103, 0.0097]

New weights_hidden_output = [0.5, -0.2] + [0.0103, 0.0097] = [0.5103, -0.1903]
```

**Update weights_input_hidden**:
```
Δweights_input_hidden = learning_rate × hidden_error_term × x[:, None]
Δweights_input_hidden = 0.1 × [0.0006, -0.001] × [[6], [8]]

For input feature 1: 0.1 × [0.0006, -0.001] × 6 = [0.00036, -0.0006]
For input feature 2: 0.1 × [0.0006, -0.001] × 8 = [0.00048, -0.0008]

New weights_input_hidden = [[0.2, 0.3],     + [[0.00036, -0.0006],
                           [0.4, 0.1]]        [0.00048, -0.0008]]
                         = [[0.20036, 0.2994],
                            [0.40048, 0.0992]]
```

###### Learning Progress

**After Weight Update**:
- **Hidden layer weights** slightly adjusted to better capture input patterns
- **Output weights** increased for hidden neuron 1 (positive contribution) and decreased for hidden neuron 2 (negative contribution)
- **Next iteration** would use these updated weights for improved prediction

**Key Learning**: The network learned that the current prediction (57.6%) was too low for a passing student, so it adjusted weights to increase future predictions for similar input patterns.

This process repeats for all training samples and multiple epochs until the network learns to accurately distinguish between passing and failing students based on study and sleep hours.

In [2]:
import numpy as np
from data_prep import features, targets, features_test, targets_test

"""
Neural Network Implementation: Two-Layer Feedforward Network for Binary Classification

This implementation demonstrates a complete neural network training process including:
- Forward propagation through hidden and output layers
- Backpropagation with gradient computation
- Weight updates using batch gradient descent
- Training loss monitoring and test accuracy evaluation

Network Architecture:
- Input layer: n_features neurons (determined by data)
- Hidden layer: 2 neurons with sigmoid activation
- Output layer: 1 neuron with sigmoid activation (binary classification)

Training Process:
- Uses batch gradient descent (accumulates gradients over all samples)
- Updates weights after processing entire training set each epoch
- Monitors training loss every 10% of epochs for convergence tracking
"""

np.random.seed(21)

def sigmoid(x):
   """
   Sigmoid activation function with numerical stability improvements.
   
   Applies element-wise sigmoid transformation: f(x) = 1 / (1 + e^(-x))
   Includes clipping to prevent overflow for extreme input values.
   
   Args:
       x: Input values (scalar, array, or tensor)
       
   Returns:
       Sigmoid-transformed values in range (0, 1)
   """
   x = np.asarray(x, dtype=np.float64)  # Force conversion and dtype
   return 1 / (1 + np.exp(-np.clip(x, -500, 500)))  # Prevent overflow

# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005
n_records, n_features = features.shape
last_loss = None

# Initialize weights using Xavier/Glorot initialization
# Scales initial weights by 1/sqrt(n_features) for stable training
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                       size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                        size=n_hidden)

# Training Loop: Batch Gradient Descent
for e in range(epochs):
   # Initialize gradient accumulation arrays
   del_w_input_hidden = np.zeros(weights_input_hidden.shape)
   del_w_hidden_output = np.zeros(weights_hidden_output.shape)
   
   # Process each training sample and accumulate gradients
   for x, y in zip(features, targets):
       ## Forward Propagation ##
       # Hidden layer computation: input → hidden
       hidden_input = np.dot(x, weights_input_hidden)
       hidden_output = sigmoid(np.array(hidden_input, dtype=np.float64))
       
       # Output layer computation: hidden → output
       output = sigmoid(np.array(np.dot(hidden_output, weights_hidden_output), dtype=np.float64))                        

       ## Backpropagation ##
       # Calculate prediction error
       error = y - output
       
       # Output layer error term (gradient of loss w.r.t. output weights)
       # For sigmoid: derivative = output * (1 - output)
       output_error_term = error * output * (1 - output)

       # Propagate error back to hidden layer
       # Hidden layer's contribution to output error
       hidden_error = np.dot(output_error_term, weights_hidden_output)
       
       # Hidden layer error term (gradient of loss w.r.t. hidden weights)
       hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)
       
       # Accumulate weight gradients (sum over all samples in batch)
       del_w_hidden_output += output_error_term * hidden_output
       del_w_input_hidden += hidden_error_term * np.array(x[:, None], dtype=np.float64)

   # Update weights using accumulated gradients
   # Divide by n_records to get average gradient over batch
   weights_input_hidden += learnrate * del_w_input_hidden / n_records
   weights_hidden_output += learnrate * del_w_hidden_output / n_records

   # Training Progress Monitoring
   # Print loss every 10% of total epochs
   if e % (epochs / 10) == 0:
       # Calculate current training loss on entire dataset
       hidden_output = sigmoid(np.dot(x, weights_input_hidden))
       out = sigmoid(np.dot(hidden_output, weights_hidden_output))
       loss = np.mean((out - targets) ** 2)
       
       # Warning if loss is increasing (possible overfitting or high learning rate)
       if last_loss and last_loss < loss:
           print("Train loss:", loss, "WARNING - Loss Increasing")
       else:
           print("Train loss:", loss)
       last_loss = loss

# Model Evaluation on Test Set
# Forward pass through trained network
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))

# Convert probabilities to binary predictions (threshold = 0.5)
predictions = out > 0.5

# Calculate classification accuracy
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Train loss: 0.2513572524259881
Train loss: 0.24996540718842905
Train loss: 0.24862005218904504
Train loss: 0.2473199321717981
Train loss: 0.24606380465584854
Train loss: 0.24485044179257037
Train loss: 0.243678632018683
Train loss: 0.24254718151769472
Train loss: 0.24145491550165454
Train loss: 0.24040067932493334
Prediction accuracy: 0.725
