
# Loss Functions and Evaluation Metrics

Loss functions, optimization techniques, and evaluation metrics form the
mathematical and conceptual foundation of training neural networks.
This chapter provides a detailed and rigorous explanation of these concepts,
focusing on intuition, mathematical meaning, and practical relevance.



## Concept of Loss Functions

A loss function is a mathematical function that quantifies how well a model’s
predictions match the true target values. It transforms the difference between
prediction and ground truth into a single numerical value.

In supervised learning, each prediction can be compared with its true target,
and the difference between the two is the error. The loss function aggregates
these errors into a measurable signal of how incorrect the model is. This signal
is essential because learning algorithms require a numeric objective to optimize.

Loss functions serve as the **bridge between prediction and learning**.
Without a loss function, the model would have no direction for improvement.
During training, model parameters are adjusted to minimize the loss value,
thereby improving prediction quality over time.

Different tasks require different loss functions. For example:
- Regression tasks use losses that measure numerical distance.
- Classification tasks use losses that compare probabilities.
- Ranking and structured tasks require specialized losses.

Choosing an appropriate loss function is critical, as it directly influences
how the model learns and what types of errors it prioritizes.



**MSE penalizes large errors more aggressively** than MAE, because squaring
magnifies large deviations. This explains why MSE is sensitive to outliers
while MAE is more robust.
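
A small numerical comparison makes the same point. The sketch below uses made-up targets and predictions (the arrays and helper names are illustrative, not taken from the chapter) and changes a single prediction into an outlier.

```python
import numpy as np

# Hypothetical targets and two sets of predictions; the second set
# contains one prediction that is off by 10 (an outlier).
targets = np.array([10.0, 12.0, 14.0, 16.0])
preds_clean = np.array([9.0, 13.0, 15.0, 15.0])
preds_outlier = np.array([9.0, 13.0, 15.0, 6.0])

def mse(y, y_hat):
    """Mean of squared errors."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean of absolute errors."""
    return np.mean(np.abs(y - y_hat))

print("clean:   MSE =", mse(targets, preds_clean), " MAE =", mae(targets, preds_clean))
print("outlier: MSE =", mse(targets, preds_outlier), " MAE =", mae(targets, preds_outlier))
```

With these numbers, the single outlier raises MSE from 1.0 to 25.75, while MAE rises only from 1.0 to 3.25.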



## Regression Losses

Regression problems involve predicting continuous numerical values.
Regression loss functions measure how far predicted values deviate
from actual target values.

These losses focus on **magnitude of error**, rather than correctness
of class assignment. The most commonly used regression losses are
Mean Squared Error (MSE) and Mean Absolute Error (MAE).



### Mean Squared Error (MSE)

Mean Squared Error computes the average of the squared differences
between predicted values and actual values.

MSE = (1/n) Σ (y − ŷ)²

Squaring the error has two important effects. First, it ensures that
all errors are positive. Second, it penalizes larger errors much more
than smaller ones. As a result, MSE strongly discourages large deviations
between predictions and targets.

MSE is widely used because it is smooth and differentiable, making it
well-suited for gradient-based optimization. However, its sensitivity
to outliers can be a disadvantage when data contains extreme values.


```python
import numpy as np

# Example targets and predictions
y_true = np.array([10, 12, 14, 16])
y_pred = np.array([9, 13, 15, 14])

# Mean of the squared errors: (1 + 1 + 1 + 4) / 4
np.mean((y_true - y_pred) ** 2)
```



### Mean Absolute Error (MAE)

Mean Absolute Error computes the average of the absolute differences
between predicted and actual values.

MAE = (1/n) Σ |y − ŷ|

Unlike MSE, MAE penalizes errors in proportion to their magnitude: an error
twice as large contributes twice as much to the loss, not four times as much.
This makes MAE more robust to outliers and easier to interpret, as the
loss is expressed in the same units as the target variable.

However, the absolute value is not differentiable where the error is exactly
zero, which can make optimization slightly more challenging for some algorithms.
Despite this, MAE is often preferred when robustness is more important than
sensitivity to large errors.


```python
# Mean of the absolute errors, using y_true and y_pred from the MSE example
np.mean(np.abs(y_true - y_pred))
```



## Classification Loss

Classification tasks involve predicting discrete class labels.
Instead of measuring numerical distance, classification losses
measure how well predicted probabilities align with true labels.

These losses evaluate not only whether a prediction is correct,
but also how confident the model is in its prediction.



### Cross-Entropy Loss

Cross-entropy loss measures the difference between the true class
distribution and the predicted probability distribution.

For binary classification, it is defined as:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

Cross-entropy heavily penalizes confident but incorrect predictions.
If the model assigns high probability to the wrong class, the loss
becomes very large. This encourages the model to produce well-calibrated
probabilities rather than just correct class labels.

Cross-entropy is derived from probability theory and maximum likelihood
estimation, making it theoretically well-founded and widely adopted
in classification models.


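```python
# True binary labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1])
y_pred_prob = np.array([0.9, 0.1, 0.6])

# Binary cross-entropy averaged over the samples
-np.mean(
    y_true * np.log(y_pred_prob) +
    (1 - y_true) * np.log(1 - y_pred_prob)
)
```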
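
To see how strongly cross-entropy punishes confident mistakes, the sketch below evaluates the loss for a single positive example at several predicted probabilities. The function name and the clipping constant `eps` are assumptions made for this illustration; clipping simply keeps `log` away from zero.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over samples.

    eps is an illustrative choice: probabilities are clipped to
    [eps, 1 - eps] so that log() never receives exactly 0.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# One positive example (y = 1), predicted with decreasing confidence.
label = np.array([1.0])
for p in (0.9, 0.6, 0.1, 0.01):
    loss = binary_cross_entropy(label, np.array([p]))
    print(f"predicted P(y=1) = {p:<4}  loss = {loss:.3f}")
```

The loss grows slowly while the prediction is on the correct side and rises sharply as the model becomes confident in the wrong class.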



## Objective of Optimization

The objective of optimization is to find model parameters that minimize
the loss function. Optimization transforms learning into a mathematical
problem of finding the minimum of a function.

In neural networks, optimization operates in a high-dimensional parameter
space, often containing millions of weights and biases. The optimization
process searches this space iteratively, improving parameters step by step.

A good optimization strategy balances:
- Speed of convergence
- Stability of updates
- Ability to escape poor local solutions

The effectiveness of a learning algorithm depends heavily on how well
this optimization objective is defined and solved.



## Gradient Descent Intuition

Gradient descent is an iterative optimization algorithm used to minimize
the loss function. It works by computing the gradient (slope) of the loss
with respect to model parameters and updating parameters in the opposite
direction of the gradient.

Intuitively, gradient descent can be visualized as moving downhill on
a surface defined by the loss function. Each step moves the parameters
closer to a minimum.

The size and direction of each step are determined by:
- The gradient
- The learning rate

Gradient descent forms the backbone of neural network training.
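
As a minimal sketch of this update rule, consider a one-parameter loss chosen purely for illustration, `(w - 3)**2`, whose minimum is at `w = 3`. The starting point, learning rate, and step count below are arbitrary assumptions.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (chosen for illustration)."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting point
learning_rate = 0.1   # arbitrary step size
history = [w]

for step in range(50):
    # Move against the gradient, scaled by the learning rate.
    w = w - learning_rate * grad(w)
    history.append(w)

print(f"final w = {w:.4f}, loss = {loss(w):.6f}")
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is why the loss decreases at every step.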



## Conceptual Illustration: Loss Minimization

```
Loss
 ^
 |        *
 |      *
 |    *
 |  *
 |*
 +------------------> Model Parameters
```

The objective of training is to move the model parameters toward
the region where the loss value is minimized.



## Conceptual Illustration: Gradient Descent Movement

```
Start  --->  --->  --->  Minimum
  *      *      *      *
```

Each step taken by gradient descent moves the parameters closer
to a minimum of the loss surface.



## Confusion Matrix (Conceptual View)

```
                Predicted
              Positive  Negative
Actual Positive    TP        FN
Actual Negative    FP        TN
```

Evaluation metrics such as precision, recall, and F1-score are
derived from these four values.
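
The four cells can be counted directly from label arrays. The sketch below uses hypothetical labels and predictions (1 = positive, 0 = negative); the variable names are illustrative.

```python
import numpy as np

# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((actual == 1) & (predicted == 1)))  # positives found
fn = int(np.sum((actual == 1) & (predicted == 0)))  # positives missed
fp = int(np.sum((actual == 0) & (predicted == 1)))  # false alarms
tn = int(np.sum((actual == 0) & (predicted == 0)))  # negatives correctly rejected

print(f"TP={tp}  FN={fn}")
print(f"FP={fp}  TN={tn}")
```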


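The following interactive cell illustrates, conceptually, how the learning rate
affects convergence. It assumes a Jupyter environment with `ipywidgets` installed,
and the thresholds in the code are illustrative rather than universal.

```python
# Interactive example: effect of learning rate on convergence (conceptual)
import ipywidgets as widgets
from IPython.display import display

def learning_rate_effect(lr):
    # The cut-offs 0.01 and 0.1 are illustrative; good values depend on the problem.
    print(f"Learning Rate selected: {lr}")
    if lr < 0.01:
        print("Slow convergence, very stable updates")
    elif lr < 0.1:
        print("Good balance between speed and stability")
    else:
        print("Risk of divergence or unstable training")

slider = widgets.FloatSlider(
    value=0.05,
    min=0.001,
    max=1.0,
    step=0.01,
    description='Learning Rate',
)

# Re-run learning_rate_effect whenever the slider moves and show both widgets.
output = widgets.interactive_output(learning_rate_effect, {'lr': slider})
display(slider, output)
```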



## Types of Gradient Descent

Gradient descent algorithms differ mainly in **how much data is used**
to compute the gradient at each update step. This choice strongly affects
training speed, stability, and memory usage.



### Batch Gradient Descent

Batch Gradient Descent computes the gradient of the loss function using
the **entire training dataset** before updating model parameters.

Characteristics:
- Produces smooth and stable updates
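- Converges to the global minimum on convex problems, given a suitably small learning rate
- Computationally expensive for large datasets
- Requires a full pass over the data for every update, which strains memory on large datasets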

Batch gradient descent is mainly used for:
- Small datasets
- Analytical studies
- Convex optimization problems



### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) updates model parameters using
**a single training example at a time**.

Characteristics:
- Very cheap individual updates
- Noisy parameter updates
- Noise can help escape shallow local minima
- Lower memory requirements

SGD introduces randomness into training, which often improves
generalization performance despite noisy updates.



### Mini-batch Gradient Descent

Mini-batch Gradient Descent is a compromise between batch and stochastic methods.
It computes gradients using a **small subset (batch) of data**.

Characteristics:
- Efficient computation using vectorization
- More stable than SGD
- Faster than batch gradient descent
- Most widely used in practice

Typical mini-batch sizes range from 16 to 256 samples.
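
The three variants differ only in how many samples feed each update, so one training loop with a `batch_size` parameter can cover all of them. The sketch below fits a line to synthetic data; the data, the `train` helper, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2x + 1 plus a little noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.05, size=200)

def train(batch_size, epochs=100, lr=0.1):
    """Fit y ~ w*x + b by gradient descent on the MSE of each batch."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]  # prediction error on this batch
            # Gradients of the batch MSE with respect to w and b.
            grad_w = 2.0 * np.mean(err * X[idx])
            grad_b = 2.0 * np.mean(err)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

results = {}
for name, size in [("batch", len(X)), ("mini-batch", 32), ("stochastic", 1)]:
    results[name] = train(size)
    w, b = results[name]
    print(f"{name:<11} batch_size={size:<3}  w={w:.3f}  b={b:.3f}")
```

All three settings should land near `w = 2` and `b = 1` on this easy problem; they differ in how many updates they perform per epoch and in how noisy each update is.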



## Learning Rate and Its Effect

The learning rate determines how far model parameters move during each
gradient descent update. It directly controls the speed and stability
of the learning process.



### Effects of Different Learning Rates

- **Very small learning rate**  
  Slow convergence, long training time

- **Very large learning rate**  
  Overshooting minima, unstable training

- **Well-chosen learning rate**  
  Fast convergence and stable learning
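
These three regimes can be reproduced on a toy problem. The sketch below minimizes `f(w) = w**2` (an illustrative choice, with gradient `2*w`) for a fixed number of steps; the specific learning rates are assumptions picked to show each regime.

```python
def run_gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final |w|."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return abs(w)

# Illustrative learning rates: too small, well chosen, too large.
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr:<4}  |w| after 20 steps = {run_gradient_descent(lr):.4g}")
```

On this function each step multiplies `w` by `1 - 2 * learning_rate`, so a rate of 0.01 barely moves, 0.4 converges quickly, and 1.1 overshoots by more each step and diverges.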



Learning rate selection is often performed using:
- Learning rate schedules
- Adaptive optimizers
- Empirical experimentation

In practice, learning rate is one of the most critical hyperparameters.



## High-level Intuition of Backpropagation

Backpropagation is the algorithm that enables neural networks to learn.
It efficiently computes how each parameter contributes to the final loss.



### Error Signal Propagation

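- The output layer's prediction is compared with the target to compute the loss
- The gradient of the loss is propagated backward, layer by layer
- Each layer uses the gradient it receives to compute the gradients of its own parameters
- Parameters are updated using these gradients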



Backpropagation relies on the **chain rule of calculus**, allowing gradients
to be computed efficiently even in very deep networks.
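
The chain rule can be traced by hand on the smallest possible network: one sigmoid neuron with a squared-error loss. This setup, and every name and value in the sketch, is an assumption chosen for illustration. The analytic gradient is checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    """Forward pass: z = w*x + b, a = sigmoid(z), loss = (a - y)**2."""
    z = w * x + b
    a = sigmoid(z)
    return z, a, (a - y) ** 2

def backward(w, b, x, y):
    """Backward pass: multiply local derivatives along the chain."""
    _, a, _ = forward(w, b, x, y)
    dloss_da = 2.0 * (a - y)       # d(loss)/d(a)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz    # chain rule through the activation
    return dloss_dz * x, dloss_dz  # d(loss)/d(w), d(loss)/d(b)

# Arbitrary illustrative values for the parameters and one training example.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = backward(w, b, x, y)

# Finite-difference check of d(loss)/d(w).
eps = 1e-6
num_grad_w = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
print(f"analytic dL/dw = {grad_w:.6f}, numerical dL/dw = {num_grad_w:.6f}")
```

In a deeper network the same pattern repeats: each layer multiplies the gradient arriving from the layer above by its own local derivative and passes the result further back.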



## Evaluation Metrics

Evaluation metrics provide insight into model performance beyond loss values.
They are especially important when class distributions are uneven.



### Accuracy

Accuracy measures the proportion of correct predictions.

Strengths:
- Easy to interpret
- Useful for balanced datasets

Limitations:
- Misleading for imbalanced data



### Precision

Precision measures how many predicted positive samples are actually positive.

Precision is important when:
- False positives are costly
- Quality of positive predictions matters



### Recall

Recall measures how many actual positive samples are correctly identified.

Recall is important when:
- Missing positive cases is costly
- Detection coverage matters



### F1-score

F1-score balances precision and recall using their harmonic mean.

F1-score is useful when:
- Data is imbalanced
- Both false positives and false negatives matter
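
The four metrics follow directly from the confusion-matrix counts. The sketch below uses a hypothetical helper and made-up counts for an imbalanced dataset (5 positives, 95 negatives); returning 0.0 when a ratio is undefined is a convention adopted here, not something prescribed by the chapter.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical model A: predicts "negative" for every sample.
print("always negative:", classification_metrics(tp=0, fn=5, fp=0, tn=95))

# Hypothetical model B: finds 4 of the 5 positives, with 10 false alarms.
print("detector:       ", classification_metrics(tp=4, fn=1, fp=10, tn=85))
```

Model A reaches 95% accuracy without finding a single positive, while model B has lower accuracy (89%) but a recall of 0.8. This is the situation in which accuracy alone is misleading.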


## Context and Motivation

By this stage in our study, we have seen how data flows through a neural network, how tensors
represent this data, and how activation functions introduce non-linearity into the model.
We have also worked through a simple classification example and examined how real-world data
must be prepared before it can be used for learning.

With these components in place, a neural network is capable of producing predictions. However,
prediction alone does not imply learning. A learning system must be able to evaluate its own
performance and adjust itself accordingly. This requirement motivates the introduction of
loss functions and evaluation metrics.


## Learning as Feedback

Learning in neural networks relies on feedback. After the network produces an output, this
output must be compared with the true target. The result of this comparison provides a signal
that indicates how the model should change.

Loss functions formalize this idea of feedback by converting prediction errors into numerical
values. These values guide the optimization process that updates the network’s parameters.


## Why Different Tasks Need Different Loss Functions

The nature of the prediction task strongly influences the choice of loss function. In
regression problems, errors are measured as numerical deviations, while in classification
problems, predictions are evaluated in terms of class membership and probability estimates.

As a result, no single loss function is suitable for all tasks. The choice of loss function
reflects assumptions about the data, the noise present in the observations, and the desired
behavior of the model.


## Connecting Loss Functions to Optimization

Loss functions are not chosen arbitrarily. In order to support efficient training, they must
exhibit mathematical properties such as continuity and differentiability. These properties
allow gradient-based optimization algorithms to compute meaningful parameter updates.

This connection between loss functions and optimization explains why some intuitive performance
measures are unsuitable as training objectives, even though they may be useful for evaluation.
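
Accuracy is a typical example of such a measure. In the sketch below, a one-parameter logistic model is evaluated on a tiny hypothetical dataset (the data, the model, and all names are illustrative assumptions): as the weight changes, accuracy stays flat and gives no gradient to follow, while cross-entropy keeps changing smoothly.

```python
import math

# Hypothetical 1-D dataset: feature x and binary label y.
xs = [-2.0, -1.0, 0.5, 1.5]
ys = [0, 0, 1, 1]

def predict_prob(w, x):
    """Logistic model with a single weight and no bias."""
    return 1.0 / (1.0 + math.exp(-w * x))

def accuracy(w):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((predict_prob(w, x) >= 0.5) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_entropy(w):
    """Mean binary cross-entropy of the model on the dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_prob(w, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(xs)

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: accuracy={accuracy(w):.2f}, cross-entropy={cross_entropy(w):.4f}")
```

Every positive weight classifies this dataset perfectly, so accuracy cannot distinguish between them, whereas cross-entropy still rewards the more confident correct predictions.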


## Interpreting Evaluation Metrics

Evaluation metrics translate model performance into interpretable quantities. While loss
functions operate internally during training, metrics are used to communicate results and to
compare different models.

Choosing appropriate evaluation metrics is especially important when dealing with imbalanced
datasets or asymmetric costs of errors, where simple accuracy may provide a misleading picture.


## Perspective

Loss functions and evaluation metrics complete the learning pipeline introduced throughout the
previous chapters. They connect model predictions back to data-driven objectives and provide
the basis for systematic improvement.

In later chapters, these concepts will be revisited in the context of more advanced models and
training strategies, reinforcing their foundational role in machine learning.



## Task for the reader

1. Compare MSE and MAE on datasets with outliers  
2. Analyze the effect of learning rate changes  
3. Explain why cross-entropy is preferred for classification  
4. Compute evaluation metrics manually from predictions  
