# Day 50: Evaluating and Tuning Neural Network Performance

Congratulations on reaching Day 50 of the 100 Days of Machine Learning Challenge! You've made it halfway through this comprehensive journey. By now, you've built a solid foundation in neural networks, including understanding perceptrons, feedforward networks, backpropagation, and training techniques. Today, we take the next crucial step: learning how to evaluate and tune neural network performance to build models that generalize well to unseen data.

## Introduction

Building a neural network is only half the battle. The true challenge lies in ensuring that your model performs well not just on training data, but also on new, unseen data. This is where evaluation metrics, regularization techniques, and hyperparameter tuning come into play.

In this lesson, we will explore:
- How to evaluate neural network performance using various metrics
- Understanding and preventing overfitting through regularization
- Techniques for hyperparameter tuning to optimize model performance
- Practical implementation using TensorFlow and Keras

### Why This Matters

In real-world applications, a model that simply memorizes training data is useless. Medical diagnosis systems, autonomous vehicles, and financial prediction models all require neural networks that can generalize from training examples to new situations. Understanding how to evaluate and tune your models is essential for deploying reliable machine learning systems.

### Learning Objectives

By the end of this lesson, you will be able to:
1. Evaluate neural network performance using appropriate metrics for classification and regression tasks
2. Understand and apply regularization techniques (L1, L2, dropout) to prevent overfitting
3. Implement early stopping and learning rate scheduling
4. Perform hyperparameter tuning to optimize model performance
5. Visualize training history to diagnose model behavior
6. Apply cross-validation techniques specific to neural networks

## Evaluation Metrics for Neural Networks

### Classification Metrics

For classification tasks, several metrics help us understand how well our neural network performs:

#### 1. Accuracy

Accuracy is the most intuitive metric, representing the proportion of correct predictions:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

While easy to understand, accuracy can be misleading for imbalanced datasets. For example, if 95% of emails are not spam, a model that always predicts "not spam" achieves 95% accuracy but is completely useless.
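
To make the spam example concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (95 "not spam", 5 "spam"), and the "model" is just a constant prediction.

```python
# Hypothetical toy labels: 95 "not spam" (0) and 5 "spam" (1).
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Count how many actual spam emails the model caught.
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.2f}")               # Accuracy: 0.95
print(f"Spam emails caught: {spam_caught} of 5")  # Spam emails caught: 0 of 5
```

The accuracy looks strong even though the model never identifies a single spam email, which is why the metrics below are needed for imbalanced data.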

#### 2. Precision and Recall

**Precision** measures the accuracy of positive predictions:

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

**Recall** (or Sensitivity) measures how many actual positive cases were correctly identified:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

#### 3. F1-Score

The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
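
The three formulas above can be computed directly from confusion counts. The sketch below uses a hypothetical helper name (`precision_recall_f1`) and made-up counts; returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts.

    Any ratio with a zero denominator is reported as 0.0 (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1: {f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than an arithmetic mean would.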

#### 4. ROC-AUC

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate at various threshold settings. The Area Under the Curve (AUC) provides a single number summarizing the model's ability to discriminate between classes. An AUC of 1.0 represents perfect discrimination, while 0.5 represents random guessing.
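
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs. The function name `auc_by_pairs` and the scores are illustrative, and this O(n²) loop is only suitable for small examples.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_pairs(y_true, scores)
print(f"AUC: {auc_value:.2f}")  # AUC: 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. A model that assigns every example the same score gets 0.5 under this definition.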

### Regression Metrics

For regression tasks, different metrics are appropriate:

#### 1. Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.

#### 2. Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

RMSE is in the same units as the target variable, making it more interpretable than MSE.

#### 3. Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

MAE is less sensitive to outliers than MSE/RMSE.

#### 4. R-squared (R²)

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

R² represents the proportion of variance in the dependent variable explained by the model. A perfect model scores 1, a model that always predicts the mean $\bar{y}$ scores 0, and a model that does worse than predicting the mean scores below 0.
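
The four regression metrics can be computed side by side on a small example. This is a sketch in plain Python with a hypothetical helper (`regression_metrics`) and made-up numbers; the last prediction is a deliberately large miss to show how MSE and RMSE react more strongly than MAE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # sum of squared residuals
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total squared deviation from the mean
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical targets and predictions; the last prediction is off by 4.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# A model that does worse than always predicting the mean gets a negative R-squared.
bad = regression_metrics(y_true, [0.0, 0.0, 0.0, 0.0])
print(f"R-squared of an all-zero predictor: {bad['r2']:.1f}")
```

Here MAE is 1.5 while RMSE is about 2.12: the single large error dominates the squared metrics.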

## Loss Functions in Neural Networks

Loss functions quantify how far the model's predictions are from the true values. During training, the neural network adjusts its weights to minimize this loss.

### Binary Cross-Entropy Loss

For binary classification tasks (two classes), binary cross-entropy is the standard loss function:

$$\mathcal{L}_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

where $y_i \in \{0, 1\}$ is the true label and $\hat{y}_i \in [0, 1]$ is the predicted probability.

### Categorical Cross-Entropy Loss

For multi-class classification, categorical cross-entropy extends binary cross-entropy:

$$\mathcal{L}_{CCE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C}y_{i,c}\log(\hat{y}_{i,c})$$

where $C$ is the number of classes, $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability for class $c$.
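
As a quick check of the formula, the sketch below evaluates categorical cross-entropy on two hand-written samples. It is plain Python with a hypothetical function name; the `eps` floor that avoids `log(0)` is an implementation assumption, not part of the definition.

```python
import math

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-15):
    """Mean over samples of -sum_c y[i][c] * log(p[i][c])."""
    total = 0.0
    for labels, probs in zip(y_onehot, y_prob):
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))
    return total / len(y_onehot)

# Two hypothetical samples with three classes each.
y_onehot = [[1, 0, 0],
            [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1]]

loss = categorical_cross_entropy(y_onehot, y_prob)
print(f"Categorical cross-entropy: {loss:.4f}")
```

Because the labels are one-hot, only the predicted probability of the true class contributes: the result is the average of $-\log(0.7)$ and $-\log(0.8)$.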

### Mean Squared Error Loss

For regression tasks, MSE is commonly used as the loss function:

$$\mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

### Choosing the Right Loss Function

The choice of loss function depends on your task:
- **Binary classification**: Binary cross-entropy
- **Multi-class classification**: Categorical cross-entropy
- **Regression**: MSE, MAE, or Huber loss (robust to outliers)
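
Huber loss is mentioned above without a definition. A minimal sketch, assuming the standard form (quadratic for small residuals, linear beyond a threshold `delta`), shows why it is less sensitive to outliers than squared error. The function name and the sample residuals are illustrative.

```python
def huber(residual, delta=1.0):
    """Huber loss for a single residual: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

# Compare against half the squared error on small and large residuals.
for r in [0.5, 1.0, 10.0]:
    print(f"residual={r:5.1f}  half-squared={0.5 * r * r:6.3f}  huber={huber(r):6.3f}")
```

For the residual of 10, half the squared error is 50 while the Huber loss is 9.5, so a single outlier pulls far less on the total.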

## Understanding Overfitting and Underfitting

### The Bias-Variance Tradeoff

**Overfitting** occurs when a model learns the training data too well, including its noise and outliers. The model performs excellently on training data but poorly on new data. This indicates high variance and low bias.

**Underfitting** occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both training and test data, indicating high bias and low variance.

The goal is to find the sweet spot where the model generalizes well to new data.

### Signs of Overfitting

1. Training accuracy is much higher than validation accuracy
2. Training loss continues to decrease while validation loss increases
3. The model performs well on training data but poorly on test data

### Signs of Underfitting

1. Both training and validation accuracy are low
2. The model cannot capture the complexity of the data
3. Training and validation losses are both high and similar
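
These signs can be turned into a rough check on final accuracies. The sketch below is only a heuristic: the function name and both thresholds (`low_acc`, `max_gap`) are arbitrary values picked for illustration and would need adjusting for any real task.

```python
def diagnose_fit(train_acc, val_acc, low_acc=0.70, max_gap=0.10):
    """Return a rough label based on final accuracies.

    low_acc and max_gap are arbitrary illustrative thresholds, not standard values.
    """
    gap = train_acc - val_acc
    if train_acc < low_acc and val_acc < low_acc:
        return "possible underfitting"
    if gap > max_gap:
        return "possible overfitting"
    return "no obvious problem"

print(diagnose_fit(0.99, 0.75))  # possible overfitting
print(diagnose_fit(0.62, 0.60))  # possible underfitting
print(diagnose_fit(0.90, 0.88))  # no obvious problem
```

In practice, the full training curves are more informative than a single pair of numbers, which is why the plots later in this lesson matter.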

## Regularization Techniques

Regularization methods add constraints or penalties to prevent overfitting.

### L1 and L2 Regularization

**L2 Regularization** (Ridge or Weight Decay) adds a penalty proportional to the square of the weights:

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\sum_{i=1}^{n}w_i^2$$

where $\lambda$ is the regularization parameter and $w_i$ are the model weights. L2 regularization encourages smaller weights, leading to smoother models.

**L1 Regularization** (Lasso) adds a penalty proportional to the absolute value of weights:

$$\mathcal{L}_{total} = \mathcal{L}_{original} + \lambda\sum_{i=1}^{n}|w_i|$$

L1 regularization can lead to sparse models where some weights become exactly zero, effectively performing feature selection.
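
To see why L1 tends to produce zeros while L2 only shrinks weights, it helps to compare the gradients of the two penalties. The sketch below follows the formulas above (no factor of 1/2); the helper names and weights are made up, and the L1 "gradient" at exactly zero is taken to be 0, which is one valid subgradient choice.

```python
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def l2_grad(weights, lam):
    # d/dw of lam * w^2 is 2 * lam * w: the pull shrinks as the weight shrinks.
    return [2 * lam * w for w in weights]

def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l1_grad(weights, lam):
    # d/dw of lam * |w| is lam * sign(w): a constant pull regardless of magnitude.
    return [lam * ((w > 0) - (w < 0)) for w in weights]

weights = [0.5, -2.0, 0.0]   # hypothetical weights
lam = 0.1

print("L2 penalty:", l2_penalty(weights, lam), "gradient:", l2_grad(weights, lam))
print("L1 penalty:", l1_penalty(weights, lam), "gradient:", l1_grad(weights, lam))
```

The L2 gradient on a small weight is itself small, so the weight approaches zero without reaching it. The L1 gradient keeps the same magnitude `lam` however small the weight gets, which is what drives weights all the way to zero.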

### Dropout

Dropout is a powerful regularization technique specific to neural networks. During training, dropout randomly "drops out" (sets to zero) a proportion $p$ of neurons in a layer. This prevents neurons from co-adapting too much and forces the network to learn more robust features.

Mathematically, for a layer with activations $\mathbf{h}$:

$$\mathbf{h}_{dropout} = \mathbf{h} \odot \mathbf{m}$$

where $\mathbf{m}$ is a binary mask with each element drawn from $\text{Bernoulli}(1-p)$, and $\odot$ denotes element-wise multiplication.

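In the original formulation, all neurons are active during inference (testing), and their outputs are scaled by $(1-p)$ so that expected activations match those seen during training. Most modern implementations, including Keras and the NumPy code later in this lesson, use **inverted dropout** instead: the surviving activations are divided by $(1-p)$ during training, so no scaling is needed at inference.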
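
There are two common ways to keep activations on the same scale between training and inference: scale at test time by $(1-p)$, or scale the surviving activations by $1/(1-p)$ during training ("inverted" dropout). The sketch below compares them in plain Python. The function names are hypothetical and the input is a constant vector chosen so the averages are easy to read.

```python
import random

def dropout_train(h, p, rng, inverted=True):
    """Zero each activation with probability p; optionally rescale survivors by 1/(1-p)."""
    keep = 1.0 - p
    out = []
    for x in h:
        mask = 1.0 if rng.random() < keep else 0.0   # one draw from Bernoulli(1 - p)
        out.append(x * mask / keep if inverted else x * mask)
    return out

def dropout_inference(h, p, inverted=True):
    """Inverted dropout is a no-op at inference; classic dropout scales by (1 - p)."""
    return list(h) if inverted else [x * (1.0 - p) for x in h]

rng = random.Random(0)
h = [1.0] * 10000          # constant activations, for illustration only
p = 0.3

mean_inverted = sum(dropout_train(h, p, rng, inverted=True)) / len(h)
mean_classic = sum(dropout_train(h, p, rng, inverted=False)) / len(h)

print(f"Inverted dropout, training mean: {mean_inverted:.3f} (inference mean: 1.000)")
print(f"Classic dropout,  training mean: {mean_classic:.3f} (inference mean: {1 - p:.3f})")
```

In both variants the average activation during training matches the average at inference, which is the point of the scaling.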

### Early Stopping

Early stopping monitors the validation loss during training and stops once it has failed to improve for a set number of epochs, a sign that the model is starting to overfit. That number is the patience parameter; the weights from the best epoch are usually restored afterwards.
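
The patience rule can be sketched as a small function over an already-recorded list of validation losses. The function name and the loss values are hypothetical; a real training loop would apply the same counter epoch by epoch while training.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Stops at the first epoch where the loss has not improved for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7]
print(early_stopping_epoch(losses, patience=3))   # (6, 3)
print(early_stopping_epoch(losses, patience=10))  # (7, 3)
```

A small patience stops sooner but risks reacting to noise; a larger one trains longer before giving up.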

### Batch Normalization

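Batch normalization normalizes the inputs of each layer over the current mini-batch. It was originally motivated as a way to reduce internal covariate shift, and in practice it often stabilizes training and allows higher learning rates. For a batch of activations $\mathbf{x}$: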

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma\hat{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.
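
The two equations translate almost line for line into code. The sketch below handles a single feature across one batch in plain Python and covers only the training-time forward pass; the function name and batch values are illustrative. At inference, frameworks typically substitute running averages of the mean and variance, which is omitted here.

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then apply the learnable scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n          # batch (population) variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

batch = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical activations for one neuron

normalized = batch_norm(batch)
mean_out = sum(normalized) / len(normalized)
var_out = sum((v - mean_out) ** 2 for v in normalized) / len(normalized)
print(f"mean = {mean_out:.4f}, variance = {var_out:.4f}")

# gamma and beta let the network undo or reshape the normalization if that helps.
shifted = batch_norm(batch, gamma=2.0, beta=3.0)
print(f"mean with gamma=2, beta=3: {sum(shifted) / len(shifted):.4f}")
```

With the defaults, the output has mean close to 0 and variance close to 1. With `gamma=2` and `beta=3`, the mean moves to about 3.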

## Hyperparameter Tuning

Hyperparameters are parameters that are not learned during training but must be set before training begins. Key hyperparameters for neural networks include:

### Learning Rate

The learning rate $\alpha$ controls how much the weights are updated during training:

$$w_{new} = w_{old} - \alpha \nabla \mathcal{L}$$

- **Too high**: The model may overshoot the minimum and diverge
- **Too low**: Training will be very slow and may get stuck in local minima
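
The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update rule above to $f(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name, starting point, and step count are arbitrary choices for illustration.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w^2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w^2
        w = w - lr * grad
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr:<5}  w after 50 steps = {gradient_descent(lr):.6g}")
```

With `lr=0.1` each step multiplies `w` by 0.8, so it converges quickly. With `lr=0.001` the factor is 0.998 and `w` has barely moved after 50 steps. With `lr=1.1` the factor is -1.2, so `w` flips sign and grows on every step.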

**Learning Rate Schedules** adjust the learning rate during training:
- **Step decay**: Reduce learning rate by a factor every few epochs
- **Exponential decay**: $\alpha_t = \alpha_0 e^{-kt}$
- **Cosine annealing**: Gradually decrease using a cosine function
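
Each schedule is a small function of the epoch number. The sketch below is one possible parameterization: the function names, the halving factor, the decay constant `k`, and the minimum rate are all assumptions chosen for illustration.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return alpha0 * drop ** (epoch // every)

def exponential_decay(alpha0, epoch, k=0.05):
    """alpha_t = alpha_0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * epoch)

def cosine_annealing(alpha0, epoch, total_epochs, alpha_min=0.0):
    """Follow half a cosine wave from alpha0 down to alpha_min over total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + cosine)

for epoch in (0, 25, 50, 100):
    print(f"epoch {epoch:3d}: "
          f"step={step_decay(0.1, epoch):.5f}  "
          f"exp={exponential_decay(0.1, epoch):.5f}  "
          f"cosine={cosine_annealing(0.1, epoch, total_epochs=100):.5f}")
```

All three start at the same rate and end near zero, but they spend their time differently: step decay holds the rate flat between drops, exponential decay falls fastest early on, and cosine annealing changes slowly at both ends.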

### Batch Size

Batch size determines how many samples are processed before updating weights:
- **Small batches**: More frequent but noisier updates; often generalize better, though each epoch takes longer
- **Large batches**: Fewer, more stable updates that use hardware efficiently; may generalize worse without further tuning
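
The link between batch size and the number of weight updates per epoch can be sketched with a simple mini-batch generator. The function name and the dataset size of 10 are illustrative; the generator yields index lists that a training loop would use to slice its data.

```python
import random

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, one list per mini-batch."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]   # the last batch may be smaller

rng = random.Random(42)
for batch_size in (2, 4, 10):
    batches = list(iterate_minibatches(10, batch_size, rng))
    sizes = [len(b) for b in batches]
    print(f"batch_size={batch_size:2d}: {len(batches)} updates per epoch, sizes {sizes}")
```

Note that the NumPy networks later in this lesson use the whole training set for every update (full-batch gradient descent), which corresponds to the `batch_size=10` case here.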

### Number of Epochs

An epoch is one complete pass through the training data. Too few epochs lead to underfitting; too many can lead to overfitting unless regularization or early stopping is used.

### Network Architecture

- **Number of layers**: Deeper networks can learn more complex patterns but are harder to train
- **Number of neurons per layer**: More neurons increase capacity but also the risk of overfitting
- **Activation functions**: ReLU, tanh, sigmoid, etc., each with different properties

### Optimization Algorithm

Different optimizers have different convergence properties:
- **SGD**: Simple but can be slow
- **Momentum**: Accelerates SGD by accumulating velocity
- **Adam**: Adaptive learning rates, often works well with default parameters
- **RMSprop**: Scales each parameter's step by a running average of squared gradients (Adam adds momentum to the same idea); often used for recurrent neural networks
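
To show how these optimizers differ from plain SGD, the sketch below writes out a single update step for momentum and for Adam on one scalar weight. The function names are hypothetical; the momentum form shown is one common variant, and the Adam defaults are the commonly cited ones.

```python
import math

def sgd_momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """One momentum update: velocity accumulates a decaying sum of past gradients."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step on f(w) = w^2 from w = 1.0, where the gradient is 2.0.
w_mom, vel = sgd_momentum_step(w=1.0, velocity=0.0, grad=2.0)
w_adam, m, v = adam_step(w=1.0, m=0.0, v=0.0, grad=2.0, t=1)

print(f"Momentum: w = {w_mom:.4f}, velocity = {vel:.4f}")
print(f"Adam:     w = {w_adam:.6f}")
```

On its first step Adam moves by almost exactly `lr` regardless of how large the gradient is, because the gradient is divided by the square root of its own squared average. That normalization is what "adaptive learning rate" refers to.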

## Practical Implementation

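Now let's put these concepts into practice. The examples in this section build small networks from scratch in NumPy so that each mechanism stays visible; TensorFlow and Keras are imported where available. We'll build, train, and evaluate neural networks with different configurations.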

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

In [None]:
# Try to import TensorFlow; if it is not available, print installation instructions
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers
    from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
    print(f"TensorFlow version: {tf.__version__}")
    print("TensorFlow imported successfully!")
    tf.random.set_seed(42)
except ImportError:
    print("TensorFlow not available.")
    print("Note: In a real environment, run: pip install tensorflow")
    print("For this lesson, we'll demonstrate with conceptual code.")

### Example 1: Binary Classification with Evaluation Metrics

Let's create a binary classification dataset and build a neural network to classify it. We'll evaluate the model using various metrics.

In [None]:
# Generate synthetic binary classification dataset
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Imbalanced classes
    random_state=42
)

# Split the data (stratified so each split keeps roughly the same 70/30 class ratio)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Validation set shape: {X_val_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
print(f"\nClass distribution in training set:")
print(f"  Class 0: {np.sum(y_train == 0)} ({np.sum(y_train == 0)/len(y_train)*100:.1f}%)")
print(f"  Class 1: {np.sum(y_train == 1)} ({np.sum(y_train == 1)/len(y_train)*100:.1f}%)")

### Building a Simple Neural Network

We'll create a baseline neural network without regularization to see how it performs.

In [None]:
# Simple implementation using NumPy to demonstrate concepts
# In practice, use TensorFlow/Keras or PyTorch

def sigmoid(x):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def relu(x):
    """ReLU activation function"""
    return np.maximum(0, x)

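def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy loss"""
    epsilon = 1e-15
    # Flatten both arrays: labels have shape (n,) but network outputs have shape (n, 1),
    # and multiplying them directly would broadcast to an (n, n) matrix
    y_true = np.asarray(y_true).reshape(-1)
    y_pred = np.clip(np.asarray(y_pred).reshape(-1), epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))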

class SimpleNeuralNetwork:
    """A simple 2-layer neural network for demonstration"""
    
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        # Initialize weights with small random values
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate
        
    def forward(self, X):
        """Forward pass"""
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y):
        """Backward pass"""
        m = X.shape[0]
        
        # Output layer gradients
        dz2 = self.a2 - y.reshape(-1, 1)
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        
        # Hidden layer gradients
        dz1 = np.dot(dz2, self.W2.T) * (self.z1 > 0)  # ReLU derivative
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        
        # Update weights
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
    
    def train(self, X, y, X_val, y_val, epochs=100):
        """Train the network"""
        history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
        
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            
            # Backward pass
            self.backward(X, y)
            
            # Calculate metrics
            train_loss = binary_cross_entropy(y, y_pred)
            train_acc = np.mean((y_pred.flatten() > 0.5) == y)
            
            # Validation metrics
            y_val_pred = self.forward(X_val)
            val_loss = binary_cross_entropy(y_val, y_val_pred)
            val_acc = np.mean((y_val_pred.flatten() > 0.5) == y_val)
            
            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)
            
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}/{epochs} - "
                      f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} - "
                      f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
        
        return history
    
    def predict(self, X):
        """Make predictions"""
        return (self.forward(X).flatten() > 0.5).astype(int)

# Create and train the network
print("Training a simple neural network...\n")
model = SimpleNeuralNetwork(input_size=20, hidden_size=32, output_size=1, learning_rate=0.1)
history = model.train(X_train_scaled, y_train, X_val_scaled, y_val, epochs=100)

### Visualizing Training History

Training curves help us understand how the model learns over time and diagnose issues like overfitting.

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
axes[0].plot(history['train_loss'], label='Training Loss', linewidth=2)
axes[0].plot(history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Model Loss Over Time', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Accuracy plot
axes[1].plot(history['train_acc'], label='Training Accuracy', linewidth=2)
axes[1].plot(history['val_acc'], label='Validation Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Model Accuracy Over Time', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("- If training loss continues to decrease while validation loss increases, the model is overfitting")
print("- If both losses remain high, the model may be underfitting")
print("- Ideally, both losses should decrease together and plateau")

### Evaluating Model Performance

Let's calculate various evaluation metrics on the test set.

In [None]:
# Make predictions on test set
y_test_pred = model.predict(X_test_scaled)
y_test_pred_proba = model.forward(X_test_scaled).flatten()

# Calculate metrics
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

print("="*50)
print("Test Set Performance Metrics")
print("="*50)
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print("="*50)

print("\nMetric Interpretations:")
print(f"- Accuracy: {accuracy*100:.2f}% of all predictions are correct")
print(f"- Precision: {precision*100:.2f}% of positive predictions are actually positive")
print(f"- Recall: {recall*100:.2f}% of actual positives were correctly identified")
print(f"- F1-Score: Harmonic mean of precision and recall")

### Confusion Matrix

A confusion matrix provides a detailed breakdown of correct and incorrect predictions.

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True, 
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

# Calculate additional insights
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (TN):  {tn}")
print(f"  False Positives (FP): {fp}")
print(f"  False Negatives (FN): {fn}")
print(f"  True Positives (TP):  {tp}")

### ROC Curve and AUC

The ROC curve shows the tradeoff between true positive rate and false positive rate at different classification thresholds.

In [None]:
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', linewidth=2, 
         label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', linewidth=2, linestyle='--', 
         label='Random Classifier (AUC = 0.5)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nROC-AUC Score: {roc_auc:.4f}")
print("\nInterpretation:")
print("- AUC = 1.0: Perfect classifier")
print("- AUC = 0.5: Random classifier (no better than coin flip)")
print("- AUC > 0.7: Generally considered acceptable")
print("- AUC > 0.8: Considered good")
print("- AUC > 0.9: Considered excellent")

### Example 2: Demonstrating Overfitting

Let's intentionally create an overfitting scenario to see how it manifests in practice.

In [None]:
# Create a smaller dataset to encourage overfitting
X_small, y_small = make_classification(
    n_samples=200,  # Much smaller dataset
    n_features=20,
    n_informative=10,
    n_redundant=10,
    n_classes=2,
    random_state=42
)

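# Note: the examples below reuse this held-out split as the validation set,
# so its scores are not an unbiased estimate of test performance
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_small, y_small, test_size=0.3, random_state=42
)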

# Standardize
scaler_s = StandardScaler()
X_train_s_scaled = scaler_s.fit_transform(X_train_s)
X_test_s_scaled = scaler_s.transform(X_test_s)

# Train an overly complex model
print("Training an overly complex model on a small dataset...\n")
overfit_model = SimpleNeuralNetwork(
    input_size=20, 
    hidden_size=128,  # Very large hidden layer
    output_size=1, 
    learning_rate=0.1
)
overfit_history = overfit_model.train(
    X_train_s_scaled, y_train_s, 
    X_test_s_scaled, y_test_s, 
    epochs=200
)

In [None]:
# Visualize overfitting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot showing overfitting
axes[0].plot(overfit_history['train_loss'], label='Training Loss', linewidth=2)
axes[0].plot(overfit_history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Overfitting: Diverging Loss Curves', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Add annotation at the epoch where validation loss is lowest (if it rises afterwards)
max_epoch = len(overfit_history['train_loss'])
best_epoch = int(np.argmin(overfit_history['val_loss']))
if overfit_history['val_loss'][-1] > overfit_history['val_loss'][best_epoch]:
    axes[0].annotate('Validation loss starts increasing', 
                    xy=(best_epoch, overfit_history['val_loss'][best_epoch]), 
                    xytext=(max_epoch//3, max(overfit_history['val_loss'])*0.7),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2),
                    fontsize=10, color='red')

# Accuracy plot
axes[1].plot(overfit_history['train_acc'], label='Training Accuracy', linewidth=2)
axes[1].plot(overfit_history['val_acc'], label='Validation Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Overfitting: Diverging Accuracy Curves', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nSigns of Overfitting Observed:")
print(f"- Final training accuracy: {overfit_history['train_acc'][-1]:.4f}")
print(f"- Final validation accuracy: {overfit_history['val_acc'][-1]:.4f}")
print(f"- Gap: {overfit_history['train_acc'][-1] - overfit_history['val_acc'][-1]:.4f}")
print("\nA large positive gap indicates the model has memorized the training data")
print("but fails to generalize to new data.")

### Example 3: Applying Regularization

Now let's implement L2 regularization and dropout to combat overfitting.

In [None]:
class RegularizedNeuralNetwork:
    """Neural network with L2 regularization and dropout"""
    
    def __init__(self, input_size, hidden_size, output_size, 
                 learning_rate=0.01, l2_lambda=0.01, dropout_rate=0.0):
        # Initialize weights
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate
        self.l2_lambda = l2_lambda
        self.dropout_rate = dropout_rate
        
    def forward(self, X, training=True):
        """Forward pass with optional dropout"""
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = relu(self.z1)
        
        # Apply dropout during training
        if training and self.dropout_rate > 0:
            self.dropout_mask = np.random.binomial(1, 1 - self.dropout_rate, 
                                                   size=self.a1.shape) / (1 - self.dropout_rate)
            self.a1 *= self.dropout_mask
        
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y):
        """Backward pass with L2 regularization"""
        m = X.shape[0]
        
        # Output layer gradients
        dz2 = self.a2 - y.reshape(-1, 1)
        dW2 = (np.dot(self.a1.T, dz2) + self.l2_lambda * self.W2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        
        # Hidden layer gradients
        da1 = np.dot(dz2, self.W2.T)
        if self.dropout_rate > 0:
            da1 *= self.dropout_mask
        dz1 = da1 * (self.z1 > 0)
        dW1 = (np.dot(X.T, dz1) + self.l2_lambda * self.W1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        
        # Update weights
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
    
    def compute_loss(self, X, y, training=True):
        """Compute loss with L2 regularization"""
        y_pred = self.forward(X, training=training)
        data_loss = binary_cross_entropy(y, y_pred)
        
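        # Add L2 regularization term, scaled by 1/m so that it matches the
        # (l2_lambda / m) * W gradient used in backward()
        m = X.shape[0]
        l2_loss = (self.l2_lambda / (2 * m)) * (np.sum(self.W1**2) + np.sum(self.W2**2))
        total_loss = data_loss + l2_loss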
        
        return total_loss
    
    def train(self, X, y, X_val, y_val, epochs=100):
        """Train the network"""
        history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
        
        for epoch in range(epochs):
            # Forward pass with dropout
            y_pred = self.forward(X, training=True)
            
            # Backward pass
            self.backward(X, y)
            
            # Calculate metrics
            train_loss = self.compute_loss(X, y, training=False)
            train_acc = np.mean((y_pred.flatten() > 0.5) == y)
            
            # Validation metrics (no dropout)
            val_loss = self.compute_loss(X_val, y_val, training=False)
            y_val_pred = self.forward(X_val, training=False)
            val_acc = np.mean((y_val_pred.flatten() > 0.5) == y_val)
            
            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)
            
            if (epoch + 1) % 40 == 0:
                print(f"Epoch {epoch+1}/{epochs} - "
                      f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} - "
                      f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
        
        return history
    
    def predict(self, X):
        """Make predictions"""
        return (self.forward(X, training=False).flatten() > 0.5).astype(int)

# Train regularized model on the small dataset
print("Training a regularized model with L2 regularization and dropout...\n")
reg_model = RegularizedNeuralNetwork(
    input_size=20, 
    hidden_size=128, 
    output_size=1, 
    learning_rate=0.1,
    l2_lambda=0.01,  # L2 regularization
    dropout_rate=0.3  # 30% dropout
)
reg_history = reg_model.train(
    X_train_s_scaled, y_train_s, 
    X_test_s_scaled, y_test_s, 
    epochs=200
)

In [None]:
# Compare regularized vs unregularized models
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Unregularized model
axes[0, 0].plot(overfit_history['train_loss'], label='Training Loss', linewidth=2)
axes[0, 0].plot(overfit_history['val_loss'], label='Validation Loss', linewidth=2)
axes[0, 0].set_xlabel('Epoch', fontsize=11)
axes[0, 0].set_ylabel('Loss', fontsize=11)
axes[0, 0].set_title('Without Regularization: Loss', fontsize=12, fontweight='bold')
axes[0, 0].legend(fontsize=9)
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].plot(overfit_history['train_acc'], label='Training Accuracy', linewidth=2)
axes[0, 1].plot(overfit_history['val_acc'], label='Validation Accuracy', linewidth=2)
axes[0, 1].set_xlabel('Epoch', fontsize=11)
axes[0, 1].set_ylabel('Accuracy', fontsize=11)
axes[0, 1].set_title('Without Regularization: Accuracy', fontsize=12, fontweight='bold')
axes[0, 1].legend(fontsize=9)
axes[0, 1].grid(True, alpha=0.3)

# Regularized model
axes[1, 0].plot(reg_history['train_loss'], label='Training Loss', linewidth=2)
axes[1, 0].plot(reg_history['val_loss'], label='Validation Loss', linewidth=2)
axes[1, 0].set_xlabel('Epoch', fontsize=11)
axes[1, 0].set_ylabel('Loss', fontsize=11)
axes[1, 0].set_title('With Regularization: Loss', fontsize=12, fontweight='bold')
axes[1, 0].legend(fontsize=9)
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(reg_history['train_acc'], label='Training Accuracy', linewidth=2)
axes[1, 1].plot(reg_history['val_acc'], label='Validation Accuracy', linewidth=2)
axes[1, 1].set_xlabel('Epoch', fontsize=11)
axes[1, 1].set_ylabel('Accuracy', fontsize=11)
axes[1, 1].set_title('With Regularization: Accuracy', fontsize=12, fontweight='bold')
axes[1, 1].legend(fontsize=9)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nComparison Summary:")
print("\nWithout Regularization:")
print(f"  Final Train Accuracy: {overfit_history['train_acc'][-1]:.4f}")
print(f"  Final Val Accuracy:   {overfit_history['val_acc'][-1]:.4f}")
print(f"  Gap:                  {overfit_history['train_acc'][-1] - overfit_history['val_acc'][-1]:.4f}")

print("\nWith Regularization (L2 + Dropout):")
print(f"  Final Train Accuracy: {reg_history['train_acc'][-1]:.4f}")
print(f"  Final Val Accuracy:   {reg_history['val_acc'][-1]:.4f}")
print(f"  Gap:                  {reg_history['train_acc'][-1] - reg_history['val_acc'][-1]:.4f}")

print("\nA smaller gap with regularization means it is reducing overfitting as intended.")

### Example 4: Early Stopping

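Early stopping monitors the validation loss and stops training once it has failed to improve for `patience` consecutive epochs, then restores the weights from the best epoch. This prevents the model from training further into overfitting.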

In [None]:
class EarlyStoppingNeuralNetwork(RegularizedNeuralNetwork):
    """Neural network with early stopping capability"""
    
    def train_with_early_stopping(self, X, y, X_val, y_val, 
                                  epochs=200, patience=10):
        """Train with early stopping"""
        history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
        best_val_loss = float('inf')
        patience_counter = 0
        best_weights = None
        stopped_epoch = 0
        
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X, training=True)
            
            # Backward pass
            self.backward(X, y)
            
            # Calculate metrics
            train_loss = self.compute_loss(X, y, training=False)
            train_acc = np.mean((y_pred.flatten() > 0.5) == y)
            
            # Validation metrics
            val_loss = self.compute_loss(X_val, y_val, training=False)
            y_val_pred = self.forward(X_val, training=False)
            val_acc = np.mean((y_val_pred.flatten() > 0.5) == y_val)
            
            history['train_loss'].append(train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)
            
            # Early stopping logic
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
                # Save best weights
                best_weights = (self.W1.copy(), self.b1.copy(), 
                              self.W2.copy(), self.b2.copy())
            else:
                patience_counter += 1
                
            if patience_counter >= patience:
                print(f"\nEarly stopping triggered at epoch {epoch+1}")
                print(f"Best validation loss: {best_val_loss:.4f} at epoch {epoch+1-patience}")
                stopped_epoch = epoch + 1
                # Restore best weights
                self.W1, self.b1, self.W2, self.b2 = best_weights
                break
            
            if (epoch + 1) % 40 == 0:
                print(f"Epoch {epoch+1}/{epochs} - "
                      f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f} - "
                      f"Patience: {patience_counter}/{patience}")
        
        history['stopped_epoch'] = stopped_epoch if stopped_epoch > 0 else epochs
        return history

# Train with early stopping
print("Training with early stopping...\n")
es_model = EarlyStoppingNeuralNetwork(
    input_size=20, 
    hidden_size=128, 
    output_size=1, 
    learning_rate=0.1,
    l2_lambda=0.001,
    dropout_rate=0.2
)
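# Note: for simplicity, the test split doubles as the validation set here.
# In practice, monitor a separate validation split so the test set stays unseen.
es_history = es_model.train_with_early_stopping(
    X_train_s_scaled, y_train_s,
    X_test_s_scaled, y_test_s,
    epochs=200,
    patience=15
)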

In [None]:
# Visualize early stopping
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

stopped_epoch = es_history['stopped_epoch']

# Loss plot
axes[0].plot(es_history['train_loss'], label='Training Loss', linewidth=2)
axes[0].plot(es_history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].axvline(x=stopped_epoch-1, color='red', linestyle='--', linewidth=2, 
               label=f'Stopped at epoch {stopped_epoch}')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Early Stopping: Loss Curves', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Accuracy plot
axes[1].plot(es_history['train_acc'], label='Training Accuracy', linewidth=2)
axes[1].plot(es_history['val_acc'], label='Validation Accuracy', linewidth=2)
axes[1].axvline(x=stopped_epoch-1, color='red', linestyle='--', linewidth=2,
               label=f'Stopped at epoch {stopped_epoch}')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Early Stopping: Accuracy Curves', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

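print(f"\nTraining stopped at epoch {stopped_epoch} out of 200")
if stopped_epoch < 200:
    print("Early stopping prevented unnecessary training and potential overfitting.")
else:
    print("Early stopping was not triggered; training ran for all 200 epochs.")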
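
The patience logic used above can be easier to follow in isolation. The following is a minimal sketch, not part of the classes in this lesson: the helper name `early_stopping_epoch` and the list of validation losses are made up for illustration. It replays a sequence of validation losses and reports which epoch was best and at which epoch training would have stopped.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) as 0-based indices.

    stop_epoch is None when patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Hypothetical validation losses: they improve, then drift upward
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best, stopped = early_stopping_epoch(losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
# → best epoch: 3, stopped at epoch: 6
```

Note that the stop always happens exactly `patience` epochs after the best epoch, which is why restoring the saved best weights matters: the weights at the stopping epoch are already `patience` updates past the optimum.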

### Example 5: Hyperparameter Tuning - Learning Rate

Let's experiment with different learning rates to see their impact on training.

In [None]:
# Compare different learning rates
learning_rates = [0.001, 0.01, 0.1, 0.5]
lr_histories = {}

print("Training models with different learning rates...\n")

for lr in learning_rates:
    print(f"Training with learning rate = {lr}")
    model_lr = SimpleNeuralNetwork(
        input_size=20, 
        hidden_size=32, 
        output_size=1, 
        learning_rate=lr
    )
    history = model_lr.train(
        X_train_scaled, y_train, 
        X_val_scaled, y_val, 
        epochs=100
    )
    lr_histories[lr] = history
    print()

print("Training complete!")

In [None]:
# Visualize learning rate comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss comparison
for lr, history in lr_histories.items():
    axes[0].plot(history['val_loss'], label=f'LR = {lr}', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Validation Loss', fontsize=12)
axes[0].set_title('Impact of Learning Rate on Loss', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Accuracy comparison
for lr, history in lr_histories.items():
    axes[1].plot(history['val_acc'], label=f'LR = {lr}', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Validation Accuracy', fontsize=12)
axes[1].set_title('Impact of Learning Rate on Accuracy', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final performance for each learning rate
print("\nFinal Validation Performance:")
print("="*50)
for lr, history in lr_histories.items():
    print(f"Learning Rate {lr:6.3f}: "
          f"Loss = {history['val_loss'][-1]:.4f}, "
          f"Accuracy = {history['val_acc'][-1]:.4f}")
print("="*50)

print("\nObservations:")
print("- Too small: Slow convergence, may not reach optimal performance")
print("- Too large: Unstable training, may diverge or oscillate")
print("- Optimal: Fast convergence with stable, good performance")
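
The three regimes listed above can also be seen on a problem far simpler than a neural network. As a rough sketch (a swapped-in toy problem, not the `SimpleNeuralNetwork` from this lesson), the code below runs plain gradient descent on $f(w) = w^2$. The function name `gradient_descent` and the extra learning rate of 1.1 are chosen here purely for illustration.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard gradient descent update
    return abs(w)


for lr in [0.001, 0.01, 0.1, 0.5, 1.1]:
    print(f"lr = {lr:<5}: |w| after 50 steps = {gradient_descent(lr):.3e}")
```

On this quadratic each update multiplies `w` by `(1 - 2 * lr)`, so the distance to the minimum shrinks only when `0 < lr < 1`. Very small rates barely move, moderate rates converge quickly, and `lr = 1.1` makes `|w|` grow at every step. Real loss surfaces are not this simple, but the same qualitative pattern tends to show up in the validation curves plotted above.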

## Hands-On Exercise

Now it's your turn to apply what you've learned! In this exercise, you'll experiment with different regularization techniques and hyperparameters.

### Exercise Tasks:

1. **Create a regression dataset** using `make_regression` from sklearn
2. **Build and train three neural networks**:
   - Model A: No regularization
   - Model B: With L2 regularization
   - Model C: With dropout
3. **Compare their performance** using MSE, RMSE, and R² scores
4. **Visualize the training curves** for all three models
5. **Experiment with different hyperparameters**:
   - Try different hidden layer sizes (16, 32, 64, 128)
   - Try different learning rates (0.001, 0.01, 0.1)
   - Try different regularization strengths

### Starter Code:

In [None]:
# Step 1: Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=500,
    n_features=10,
    n_informative=8,
    noise=10.0,
    random_state=42
)

# Split the data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Standardize
scaler_r = StandardScaler()
X_train_r_scaled = scaler_r.fit_transform(X_train_r)
X_test_r_scaled = scaler_r.transform(X_test_r)

# Standardize targets for better training
y_scaler = StandardScaler()
y_train_r_scaled = y_scaler.fit_transform(y_train_r.reshape(-1, 1)).flatten()
y_test_r_scaled = y_scaler.transform(y_test_r.reshape(-1, 1)).flatten()

print("Regression dataset created!")
print(f"Features: {X_train_r_scaled.shape[1]}")
print(f"Training samples: {X_train_r_scaled.shape[0]}")
print(f"Test samples: {X_test_r_scaled.shape[0]}")

# TODO: Build and train three models with different configurations
# TODO: Evaluate using MSE, RMSE, and R²
# TODO: Create visualizations comparing the models
# TODO: Experiment with hyperparameters

print("\n--- Exercise: Complete the TODOs above ---")
print("Hint: Modify the classes we created earlier for regression tasks")
print("(Use linear activation for output layer, MSE loss instead of BCE)")
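
For the evaluation step, the three requested metrics can be computed directly with NumPy. This is one possible sketch: the helper name `regression_metrics` and the four-point example are assumptions made for illustration, and `sklearn.metrics` offers equivalent functions if you prefer them.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R² for 1-D arrays of targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": float(mse), "rmse": float(rmse), "r2": float(r2)}


# Tiny made-up example to check the helper by hand
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # mse = 0.125, r2 = 0.975
```

If you train on the standardized targets from the starter code, MSE and RMSE will be in standardized units. To report them in the original units, map predictions back with `y_scaler.inverse_transform` before calling the helper.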

## Key Takeaways

Let's summarize the essential concepts from this lesson:

### 1. Evaluation Metrics Matter
- **Classification**: Use accuracy, precision, recall, F1-score, and ROC-AUC
- **Regression**: Use MSE, RMSE, MAE, and R²
- Choose metrics appropriate for your problem (e.g., F1-score for imbalanced data)

### 2. Overfitting is a Primary Challenge
- Occurs when models memorize training data rather than learning general patterns
- Identified by: high training accuracy but low validation/test accuracy
- Monitor validation loss during training to detect overfitting early

### 3. Regularization Techniques
- **L1/L2 Regularization**: Penalizes large weights, encourages simpler models
- **Dropout**: Randomly drops neurons during training, prevents co-adaptation
- **Early Stopping**: Stops training when validation performance stops improving
- **Batch Normalization**: Normalizes layer inputs to stabilize training; any regularizing effect is a secondary benefit

### 4. Hyperparameter Tuning is Critical
- **Learning Rate**: Often the most important hyperparameter; it controls both convergence speed and stability
- **Batch Size**: Tradeoff between training speed and generalization
- **Network Architecture**: Number of layers and neurons per layer
- Use systematic approaches: grid search, random search, or Bayesian optimization
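
To make the difference between grid search and random search concrete, here is a small sketch. The function `validation_loss` is a hypothetical stand-in for "train a model with these settings and return its validation loss"; its formula is invented so the example runs instantly and has a known minimum at `lr = 0.01`, `hidden_size = 64`.

```python
import itertools
import math
import random


def validation_loss(lr, hidden_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(lr) + 2) ** 2 + ((hidden_size - 64) / 64) ** 2


# Grid search: evaluate every combination of the listed values
grid = {"lr": [0.001, 0.01, 0.1], "hidden_size": [16, 32, 64, 128]}
grid_results = [
    ((lr, h), validation_loss(lr, h))
    for lr, h in itertools.product(grid["lr"], grid["hidden_size"])
]
best_grid = min(grid_results, key=lambda r: r[1])

# Random search: same budget (12 trials), learning rate sampled on a log scale
rng = random.Random(42)
random_results = []
for _ in range(12):
    lr = 10 ** rng.uniform(-3, -1)
    h = rng.choice([16, 32, 64, 128])
    random_results.append(((lr, h), validation_loss(lr, h)))
best_random = min(random_results, key=lambda r: r[1])

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

Grid search only ever tries the listed values, while random search can land on learning rates between them. Sampling the learning rate on a log scale is a common choice because useful values usually span several orders of magnitude.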

### 5. Visualization Aids Understanding
- Training curves reveal overfitting, underfitting, and convergence issues
- ROC curves show classification performance across thresholds
- Confusion matrices provide detailed prediction breakdowns

### 6. Best Practices
- Always use train/validation/test splits
- Start with simple models, add complexity gradually
- Monitor both training and validation metrics
- Use appropriate evaluation metrics for your task
- Apply regularization to improve generalization
- Document hyperparameter choices and results

### Mathematical Insights

The core principle of regularization can be expressed as:

$$\min_{\theta} \left[\mathcal{L}(\theta) + \lambda R(\theta)\right]$$

where:
- $\mathcal{L}(\theta)$ is the data loss (e.g., cross-entropy)
- $R(\theta)$ is the regularization term
- $\lambda$ controls the regularization strength
- $\theta$ represents all model parameters

This formulation shows that we're balancing two objectives: fitting the training data well (minimizing $\mathcal{L}$) while keeping the model simple (minimizing $R$).
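
The trade-off can be checked numerically. As a sketch, the code below swaps the neural network for a linear model with an L2 penalty (ridge regression), because that objective has a closed-form minimizer. The synthetic data, the `ridge_fit` helper, and the chosen values of $\lambda$ are all assumptions made for illustration.

```python
import numpy as np

# Synthetic linear data (made up for this illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)


def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


data_losses, penalties = [], []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, lam)
    data_loss = float(np.sum((X @ w - y) ** 2))  # L(theta)
    penalty = float(np.sum(w ** 2))              # R(theta)
    data_losses.append(data_loss)
    penalties.append(penalty)
    print(f"lambda = {lam:6.1f}  data loss = {data_loss:9.3f}  ||w||^2 = {penalty:7.3f}")
```

As $\lambda$ grows, the penalty term $R(\theta)$ shrinks while the data loss $\mathcal{L}(\theta)$ rises: the optimizer gives up some fit to the training data in exchange for smaller weights. The same tension is at work when `l2_lambda` is increased in the networks from this lesson, although there the minimum has to be found by gradient descent rather than in closed form.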

## Further Resources

To deepen your understanding of neural network evaluation and tuning, explore these resources:

### Online Courses and Tutorials
1. **Deep Learning Specialization** by Andrew Ng (Coursera) - Comprehensive coverage of neural network optimization
2. **Fast.ai Practical Deep Learning** - Practical approaches to training neural networks
3. **TensorFlow Documentation** - https://www.tensorflow.org/guide - Official guides and tutorials

### Books
1. **"Deep Learning" by Goodfellow, Bengio, and Courville** - Chapter 7 (Regularization) and Chapter 8 (Optimization)
2. **"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron** - Practical implementations
3. **"Neural Networks and Deep Learning" by Michael Nielsen** - Free online book with interactive examples

### Research Papers
1. **"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"** by Srivastava et al. (2014)
2. **"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"** by Ioffe and Szegedy (2015)
3. **"Adam: A Method for Stochastic Optimization"** by Kingma and Ba (2015)

### Interactive Tools
1. **TensorFlow Playground** - http://playground.tensorflow.org/ - Visualize neural network training
2. **Netron** - https://netron.app/ - Visualize neural network architectures
3. **Weights & Biases** - https://wandb.ai/ - Experiment tracking and visualization

### Documentation
1. **Keras API Documentation** - https://keras.io/api/ - Comprehensive API reference
2. **PyTorch Tutorials** - https://pytorch.org/tutorials/ - Alternative deep learning framework
3. **Scikit-learn Metrics** - https://scikit-learn.org/stable/modules/model_evaluation.html - Evaluation metrics guide

### Recommended Next Steps
- Experiment with real-world datasets (Kaggle, UCI ML Repository)
- Implement neural networks from scratch to understand internals
- Participate in ML competitions to apply tuning techniques
- Read recent papers on arxiv.org for cutting-edge techniques
- Join ML communities (Reddit r/MachineLearning, ML Discord servers)

## Conclusion

Congratulations on completing Day 50! You've reached the halfway point in your 100 Days of Machine Learning journey, and you've gained crucial skills in evaluating and tuning neural networks.

Today you learned:
- How to evaluate neural networks using appropriate metrics
- The importance of regularization in preventing overfitting
- Techniques for hyperparameter tuning
- How to visualize and diagnose model behavior

These skills form the foundation for building robust, production-ready machine learning systems. As you continue your journey, you'll apply these techniques to increasingly complex architectures like CNNs, RNNs, and Transformers.

Keep experimenting, stay curious, and remember: **the best model is not always the most complex one, but the one that generalizes best to unseen data.**

See you on Day 51, where we'll dive into Convolutional Neural Networks (CNNs) for image processing!

---

*"The goal is to turn data into information, and information into insight."* - Carly Fiorina