
# 0.8 Model Training, Evaluation, and Regularization

This notebook introduces the complete workflow of training a simple neural network model, evaluating its performance, and applying regularization techniques to improve generalization.




## Introduction to a Simple Classification Dataset

A **classification dataset** consists of input features and corresponding class labels.

Typical examples:
- Email spam detection (spam / not spam)
- Handwritten digit recognition (0–9)
- Disease prediction (yes / no)

**Mathematically:**  
- Input matrix: $X \in \mathbb{R}^{m \times n}$ ($m$ samples, $n$ features)  
- Label vector: $y$, with one class label per sample ($m$ entries)



**Goal:** Learn a mapping from inputs to class labels that generalizes well.
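
As a rough illustration of these shapes, the sketch below builds a tiny synthetic binary classification dataset. The sizes, the random seed, and the labeling rule are all hypothetical choices made for this example, not part of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the example is repeatable

m, n = 6, 2                      # m samples, n features (hypothetical sizes)
X = rng.normal(size=(m, n))      # input matrix of shape (m, n)

# Hypothetical labeling rule: class 1 when the two features sum to a positive value.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

print(X.shape, y.shape)          # (6, 2) (6,)
```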



## Data Preparation and Preprocessing

Data preparation and preprocessing are critical steps in the machine learning pipeline. Real-world data is often incomplete, noisy, or inconsistent, so it must be cleaned and transformed before it can be used to train a neural network.

Common steps:
- Handling missing values  
- Feature scaling (Normalization / Standardization)  
- Encoding categorical variables  
- Train–Validation–Test split  

Good preprocessing typically leads to **faster convergence** and **more stable learning**.


In [None]:

# Example: min-max normalization (rescales values to the range [0, 1])
import numpy as np

X = np.array([10, 20, 30, 40])
X_norm = (X - X.min()) / (X.max() - X.min())
X_norm
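
Two other steps from the list above, standardization and a train/validation/test split, can be sketched in the same way. The data here is randomly generated, and the 70/15/15 split ratio is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical raw features
y_raw = rng.integers(0, 2, size=100)                     # hypothetical binary labels

# Shuffle the sample indices once, then cut them into 70% / 15% / 15%.
idx = rng.permutation(len(X_raw))
n_train, n_val = 70, 15
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# Compute the scaling statistics on the training split only,
# so that nothing about the validation or test data leaks into training.
mean = X_raw[train_idx].mean(axis=0)
std = X_raw[train_idx].std(axis=0)

X_train = (X_raw[train_idx] - mean) / std
X_val = (X_raw[val_idx] - mean) / std
X_test = (X_raw[test_idx] - mean) / std
y_train, y_val, y_test = y_raw[train_idx], y_raw[val_idx], y_raw[test_idx]
```

After this step each training feature has mean 0 and standard deviation 1, while the validation and test features are only approximately standardized because they reuse the training statistics.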



## Designing a Simple MLP Model

A **Multi-Layer Perceptron (MLP)** is a feedforward neural network in which every neuron in one layer is connected to every neuron in the next layer through weighted connections. It consists of:
- Input layer
- One or more hidden layers
- Output layer

For classification, the output-layer activation depends on the task:
- Sigmoid → Binary classification
- Softmax → Multi-class classification
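
One possible way to express such a design in NumPy is shown below. The helper `init_mlp`, the layer sizes, and the small random initialization are hypothetical choices for this sketch; deep learning frameworks provide their own layer abstractions.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Return a list of (W, b) pairs, one per layer after the input layer."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(scale=0.1, size=(n_in, n_out))  # small random weights
        b = np.zeros(n_out)                            # biases start at zero
        params.append((W, b))
    return params

def sigmoid(z):
    """Output activation for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Output activation for multi-class classification."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

binary_mlp = init_mlp([4, 8, 1])      # 4 inputs, 8 hidden units, 1 sigmoid output
multiclass_mlp = init_mlp([4, 8, 3])  # same hidden layer, 3 softmax outputs
```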



### MLP structure
![mlp](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Artificial_neural_network.svg/640px-Artificial_neural_network.svg.png)



## Forward Pass and Prediction

The forward pass propagates input data through the network, layer by layer, to generate an output.

Each neuron computes:

$$
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
$$


The result $z$ is then passed through an **activation function** to produce the neuron’s output.


In [None]:

# Simple forward pass example: pre-activation z of a single neuron
x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
b = 0.1

z = np.dot(w, x) + b
z
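
To go from the weighted sum to a prediction, an activation function and a decision rule are still needed. The sketch below reuses the same inputs, assumes a sigmoid activation, and uses a threshold of 0.5, which is a common but not mandatory choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
b = 0.1

z = np.dot(w, x) + b          # weighted sum plus bias: about -1.4
a = sigmoid(z)                # activation: about 0.198
prediction = int(a >= 0.5)    # assumed decision threshold of 0.5, giving class 0
```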



## Model Evaluation Using Accuracy

Once a model has been trained, its performance must be evaluated using appropriate metrics. Accuracy is one of the simplest and most commonly used evaluation metrics for classification problems.

Accuracy is the fraction of predictions that are correct:

$$
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
$$


Accuracy alone can be misleading for imbalanced datasets, where a model that always predicts the majority class still scores highly.


In [None]:

# Accuracy example
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])

accuracy = np.mean(y_true == y_pred)
accuracy
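
The limitation on imbalanced data is easy to reproduce. In the sketch below, the labels are hypothetical (95 negatives and 5 positives) and the "model" simply predicts class 0 for every sample. Recall on the positive class is included only as one example of a complementary metric.

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # hypothetical labels: 95 negatives, 5 positives
y_pred = np.zeros_like(y_true)          # a "model" that always predicts class 0

accuracy = np.mean(y_true == y_pred)            # 0.95
recall_pos = np.mean(y_pred[y_true == 1] == 1)  # 0.0: no positive sample is ever found
```

The accuracy looks strong even though the model never detects a single positive case.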



## Underfitting vs Overfitting

Underfitting and overfitting are two fundamental challenges in model training.

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. Such a model performs poorly on both training and unseen data, indicating insufficient learning capacity.

Overfitting occurs when a model is excessively complex and learns not only the true patterns but also the noise present in the training data. An overfitted model shows excellent performance on training data but fails to generalize to new data.


- **Underfitting:** Model too simple → poor performance on both training and unseen data  
- **Overfitting:** Model too complex → memorizes training data and generalizes poorly  

The primary objective of model training is to achieve a balance between these two extremes, resulting in good generalization.
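
A small curve-fitting experiment can make the two failure modes concrete. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the target function, noise level, and degrees are hypothetical choices, and polynomial degree stands in for model complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(np.pi * x)   # hypothetical underlying pattern

# Small noisy training set and a separate noisy test set.
x_train = np.linspace(-1, 1, 15)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = true_fn(x_test) + rng.normal(scale=0.2, size=x_test.shape)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

Training error can only go down as the degree grows. A degree-1 fit typically shows high error on both sets (underfitting), while a high-degree fit tends to show a widening gap between training and test error (overfitting).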



### Visual comparison
![overfitting](https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/640px-Overfitting.svg.png)



## Regularization Techniques

Regularization refers to a collection of methods designed to reduce overfitting by limiting model complexity. These techniques impose constraints or penalties that discourage the model from learning overly complex representations.

Techniques include:
- L1 / L2 Regularization
- Dropout
- Data Augmentation
- Early Stopping




## L1 and L2 Regularization

L1 and L2 regularization are weight-based regularization techniques that modify the loss function by adding a penalty term on the weights, scaled by a regularization strength $\lambda \ge 0$.

L1 regularization adds the sum of the absolute values of the model weights to the loss function. This encourages sparsity in the model by driving some weights to exactly zero, effectively performing feature selection.

L2 regularization adds the sum of the squared values of the weights to the loss function. This discourages large weight values and leads to smoother, more stable models without eliminating features entirely.

**L1 (Lasso):**
$$ \lambda \sum_{j=1}^{n} |w_j| $$
→ Produces sparse models

**L2 (Ridge):**
$$ \lambda \sum_{j=1}^{n} w_j^2 $$
→ Penalizes large weights


Both methods reduce overfitting by preventing the model from relying too heavily on any single feature.

Let the original (unregularized) loss over $m$ training samples, with per-sample loss $\ell$, be:
$$
\mathcal{L}_{\text{original}} = \frac{1}{m}\sum_{i=1}^{m} \ell(y_i, \hat{y}_i)
$$

**L1 Regularization (Lasso):**
$$
\mathcal{L}_{\text{L1}} = \mathcal{L}_{\text{original}} + \lambda \sum_{j=1}^{n} |w_j|
$$

**L2 Regularization (Ridge):**
$$
\mathcal{L}_{\text{L2}} = \mathcal{L}_{\text{original}} + \lambda \sum_{j=1}^{n} w_j^2
$$
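
The two penalties can be computed directly from a weight vector. In the sketch below, the weights, the value of $\lambda$ (`lam`), and the unregularized loss are hypothetical numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def l1_penalty(w, lam):
    """lambda times the sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """lambda times the sum of squared weights."""
    return lam * np.sum(w ** 2)

w = np.array([0.5, -1.0, 0.0, 2.0])  # hypothetical weights
lam = 0.01                            # hypothetical regularization strength
base_loss = 0.30                      # hypothetical unregularized loss

loss_l1 = base_loss + l1_penalty(w, lam)  # 0.30 + 0.01 * 3.5  = 0.335
loss_l2 = base_loss + l2_penalty(w, lam)  # 0.30 + 0.01 * 5.25 = 0.3525
```

Note how the largest weight (2.0) contributes 4.0 of the 5.25 in the L2 sum but only 2.0 of the 3.5 in the L1 sum, which is why L2 penalizes large weights more heavily.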




## Dropout

Dropout is a regularization technique in which a fraction of neurons is randomly deactivated during each training iteration. This forces the network to learn redundant representations and prevents neurons from becoming overly dependent on one another.

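By randomly removing neurons during training, dropout simulates training an ensemble of smaller networks. During inference, all neurons are active. To keep the expected activations consistent between training and inference, either the outputs are scaled by the keep probability at inference (standard dropout) or the scaling is applied during training instead (inverted dropout).

Dropout often improves robustness and generalization, particularly in deep neural networks.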


For a neuron activation $h$ and keep probability $p$ (the probability that the neuron stays active):

$$
r \sim \text{Bernoulli}(p)
$$

$$
\tilde{h} = r \cdot h
$$

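**Inverted Dropout** divides by the keep probability $p$ during training, so that $\mathbb{E}[\tilde{h}] = h$ and no rescaling is needed at inference: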
$$
\tilde{h} = \frac{r}{p} \cdot h
$$
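
A minimal NumPy sketch of inverted dropout follows, assuming $p$ is the probability of keeping a neuron. The function name `inverted_dropout`, the `training` flag, and the keep probability of 0.8 are hypothetical choices for this example.

```python
import numpy as np

def inverted_dropout(h, keep_prob, rng, training=True):
    """Apply inverted dropout to activations h."""
    if not training:
        return h                               # inference: all neurons active, no scaling
    mask = rng.random(h.shape) < keep_prob     # r ~ Bernoulli(keep_prob)
    return mask * h / keep_prob                # rescale so the expected value stays h

rng = np.random.default_rng(0)
h = np.ones(10_000)                            # hypothetical activations, all equal to 1

h_train = inverted_dropout(h, keep_prob=0.8, rng=rng)
h_eval = inverted_dropout(h, keep_prob=0.8, rng=rng, training=False)

print(h_train.mean())   # close to 1.0, since dropped units are offset by the 1/keep_prob scaling
```

During training each activation is either 0 or scaled up to 1.25, while at inference the activations pass through unchanged.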




## Data Augmentation (Introductory)

Data augmentation is a regularization strategy that increases the effective size of the training dataset by creating modified versions of existing data samples. These modifications preserve the original label while introducing variation.

Common augmentation techniques include geometric transformations, noise injection, and scaling. Data augmentation exposes the model to a wider range of input patterns, reducing overfitting and improving generalization.

This technique is especially useful when the available training data is limited.


Given a training sample:
$$
(x, y)
$$

Augmented sample, where $T$ is a label-preserving transformation:
$$
(\tilde{x}, y), \quad \tilde{x} = T(x)
$$

Noise injection:
$$
\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
$$
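
Both forms of augmentation can be sketched in a few lines of NumPy. The 3×4 array standing in for an image, the noise level `sigma`, and the helper names are hypothetical. Whether a flip actually preserves the label depends on the task (it would not for many handwritten digits, for example).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, sigma, rng):
    """Noise injection: x + eps with eps ~ N(0, sigma^2)."""
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

def horizontal_flip(image):
    """Geometric transformation: reverse the column order of a 2-D array."""
    return image[:, ::-1]

x = np.arange(12, dtype=float).reshape(3, 4)  # hypothetical 3x4 "image"
y = 1                                         # its class label

# Each augmented sample keeps the original label y.
augmented = [
    (x, y),
    (add_gaussian_noise(x, sigma=0.1, rng=rng), y),
    (horizontal_flip(x), y),
]
```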




## Early Stopping

Early stopping is a regularization technique that halts training when model performance on a validation dataset stops improving. While training loss may continue to decrease, validation performance may begin to degrade, signaling the onset of overfitting.

By stopping training near the point of lowest validation loss, early stopping prevents the model from memorizing the training data and reduces unnecessary computational cost.

Early stopping is simple to implement and is widely used in practical neural network training.


Validation loss at epoch $t$:
$$
\mathcal{L}_{\text{val}}(t)
$$

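Training stops once the validation loss has failed to improve on its best value for $k$ consecutive epochs, where $t$ is the best epoch so far and $k$ is the patience:
$$
\mathcal{L}_{\text{val}}(t+i) \ge \mathcal{L}_{\text{val}}(t) \quad \text{for all } i = 1, \dots, k
$$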

Optimal epoch (the model checkpoint from this epoch is typically the one kept):
$$
t^* = \arg\min_t \mathcal{L}_{\text{val}}(t)
$$
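
The stopping rule can be sketched as a loop over recorded validation losses. This assumes a patience-based criterion, in which training halts after a fixed number of epochs without improvement. The function name, the loss values, and the patience of 3 are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch as the best so far.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1   # never triggered: ran all epochs

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.51, 0.56]  # hypothetical values
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)   # 3 6
```

Here the lowest validation loss occurs at epoch 3 (counting from 0), and training stops at epoch 6 after three epochs without improvement. In a real training loop, the model weights from the best epoch would usually be saved and restored.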




## Summary

In this notebook, we explored:
- Data preparation
- Model design
- Forward pass & prediction
- Evaluation using accuracy
- Overfitting vs underfitting
- Regularization techniques

These ideas form the **foundation of deep learning training workflows**.



## Task for the Reader

1. Explain the role of data preprocessing in training a classification model.

2. Design a simple MLP architecture for a binary classification problem.

3. Compute the forward pass output for a given set of weights and inputs.

4. Calculate model accuracy manually using true and predicted labels.

5. Identify whether a given model is underfitting or overfitting from its behavior.

6. Compare L1 and L2 regularization in terms of their effect on model weights.

7. Explain how dropout helps in reducing overfitting.

8. Give examples of data augmentation techniques for classification tasks.

9. Describe the early stopping criterion using validation loss.