# Chapter 32: Evaluation Metrics for Classification

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the fundamental metrics used to evaluate classification models
- Compute and interpret accuracy, precision, recall, and F1‑score for binary direction predictions
- Construct and analyze a confusion matrix to understand error types
- Explain the trade‑off between precision and recall and how to adjust decision thresholds
- Generate ROC curves and calculate AUC to assess model discrimination
- Use precision‑recall curves for imbalanced datasets
- Apply log loss to evaluate probabilistic predictions
- Handle class imbalance appropriately when choosing metrics
- Extend binary metrics to multi‑class classification problems
- Select the most suitable metrics for the NEPSE direction prediction task

---

## **32.1 Introduction to Classification Metrics**

In Chapter 19, we defined a binary classification target for the NEPSE prediction system: predicting whether tomorrow's return will be positive (up) or non‑positive (down). After training a classifier (e.g., logistic regression, random forest, or neural network), we need to evaluate its performance. Classification metrics quantify how well the model distinguishes between classes.

Unlike regression metrics (MAE, RMSE), classification metrics focus on counts of correct and incorrect predictions. They also allow us to assess different types of errors separately – for example, predicting an up move when the market actually goes down (false positive) versus predicting down when it goes up (false negative). Depending on the trading strategy, one error type may be more costly than the other.

We will explore the most common metrics and illustrate them using predictions from a simple classifier on NEPSE data.

---

## **32.2 Accuracy and Its Limitations**

Accuracy is the simplest metric: the proportion of correct predictions among all predictions.

**Formula:**  
`Accuracy = (TP + TN) / (TP + TN + FP + FN)`

where:
- TP = True Positives (correctly predicted up)
- TN = True Negatives (correctly predicted down)
- FP = False Positives (predicted up, actual down)
- FN = False Negatives (predicted down, actual up)

Accuracy is intuitive and widely used. However, it can be misleading when classes are imbalanced. For example, if the market goes up 70% of the time, a model that always predicts "up" would achieve 70% accuracy without any predictive skill. Therefore, we must always consider accuracy in the context of the **baseline** (the proportion of the majority class).
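The 70% example can be reproduced in a few lines. This is a minimal sketch on made-up labels (the 700/300 split and the variable names are assumptions for illustration, not NEPSE data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 700 up days (1) and 300 down days (0)
y_true_imb = np.array([1] * 700 + [0] * 300)

# A "model" that ignores its inputs and always predicts up
y_pred_always_up = np.ones_like(y_true_imb)

acc_always_up = accuracy_score(y_true_imb, y_pred_always_up)
print(f"Accuracy of always predicting up: {acc_always_up:.2f}")
```

The score matches the share of up days exactly, which is why the majority-class share is the number any real model has to beat.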

### **32.2.1 Computing Accuracy for NEPSE Direction Predictions**

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Assume we have prepared data X (features) and y (binary target) from previous chapters
# For demonstration, we'll create synthetic predictions
np.random.seed(42)
y_true = np.random.randint(0, 2, size=1000)  # actual directions
y_pred = np.random.randint(0, 2, size=1000)  # random predictions

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Baseline accuracy (predict majority class)
baseline = max(y_true.mean(), 1 - y_true.mean())
print(f"Baseline accuracy (always predict majority): {baseline:.4f}")
```

**Explanation:**  
`accuracy_score` from scikit‑learn computes the fraction of matching labels. The baseline gives context: if our model's accuracy is only slightly above the baseline, it may not be truly predictive. For NEPSE, if up days constitute 52% of the test period, a model with 53% accuracy is only marginally better than guessing.

---

## **32.3 Confusion Matrix**

A confusion matrix provides a tabular summary of correct and incorrect predictions, broken down by class. It is the foundation for many other metrics.

For binary classification, it is a 2×2 matrix:

|                | Predicted Down | Predicted Up |
|----------------|----------------|--------------|
| Actual Down    | TN             | FP           |
| Actual Up      | FN             | TP           |

### **32.3.1 Generating a Confusion Matrix in Python**

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Down', 'Up'])
disp.plot()
plt.show()
```

**Explanation:**  
The confusion matrix shows, for example, how many actual up days were correctly predicted (TP) and how many were misclassified as down (FN). This breakdown is essential for understanding where the model fails. In trading, we might care more about avoiding false positives (predicting up when it actually goes down) if we are going long, or false negatives if we are shorting.
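If the four counts are needed as separate numbers, they can be unpacked from the matrix. The sketch below uses a small hand-made example (the labels are illustrative, not real predictions) and passes `labels=[0, 1]` explicitly so the cell order is unambiguous:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small hand-made example (1 = up, 0 = down); values are illustrative only
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Accuracy recomputed from the four counts
accuracy_from_counts = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy from counts: {accuracy_from_counts:.2f}")
```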

---

## **32.4 Precision and Recall**

Precision and recall focus on the positive class (usually the class of interest, e.g., "up").

**Precision** (also called Positive Predictive Value) measures how many of the predicted positive cases are actually positive:  
`Precision = TP / (TP + FP)`

A high precision means that when the model predicts an up move, it is very likely to be correct. This is important if false positives are costly (e.g., entering a long position that loses money).

**Recall** (also called Sensitivity or True Positive Rate) measures how many of the actual positive cases the model captured:  
`Recall = TP / (TP + FN)`

A high recall means the model catches most of the up moves, missing few. This is important if missing a profit opportunity is costly.

### **32.4.1 Computing Precision and Recall**

```python
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
```

**Explanation:**  
These metrics give a more nuanced view than accuracy. For NEPSE, if precision is high but recall is low, the model is very selective: it only predicts up when it is very sure, but it misses many up moves. Conversely, high recall and low precision means it predicts up often, but many of those predictions are wrong.
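How selective the model is depends on the decision threshold applied to its predicted probabilities. The following sketch sweeps three thresholds over synthetic scores (the score-generating recipe and all variable names are assumptions chosen only to make the scores mildly informative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true_thr = rng.integers(0, 2, size=2000)

# Synthetic scores: noisy but informative, clipped to [0, 1]
scores_thr = np.clip(
    0.5 + 0.15 * (2 * y_true_thr - 1) + rng.normal(0, 0.2, size=2000), 0, 1
)

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict "up" only when the score reaches the threshold
    y_pred_thr = (scores_thr >= threshold).astype(int)
    p = precision_score(y_true_thr, y_pred_thr, zero_division=0)
    r = recall_score(y_true_thr, y_pred_thr)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only shrink the set of predicted up days, so recall never increases; precision usually rises in exchange, although that is not guaranteed for every dataset.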

---

## **32.5 F1‑Score and F‑Beta Score**

F1‑score is the harmonic mean of precision and recall, providing a single metric that balances both.

**Formula:**  
`F1 = 2 * (precision * recall) / (precision + recall)`

It ranges from 0 to 1, with 1 being perfect precision and recall. F1 is useful when you want to balance precision and recall, especially when classes are imbalanced.

**F‑Beta score** generalizes F1 by allowing different weights for precision and recall:  
`Fβ = (1 + β²) * (precision * recall) / (β² * precision + recall)`

- β < 1 weighs precision more (e.g., F0.5)
- β > 1 weighs recall more (e.g., F2)

### **32.5.1 Computing F1 and F‑Beta**

```python
from sklearn.metrics import f1_score, fbeta_score

f1 = f1_score(y_true, y_pred)
f05 = fbeta_score(y_true, y_pred, beta=0.5)
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f"F1 Score: {f1:.4f}")
print(f"F0.5 (precision-focused): {f05:.4f}")
print(f"F2 (recall-focused): {f2:.4f}")
```

**Explanation:**  
If your trading strategy is more sensitive to false positives (e.g., you lose money on wrong long positions), you might prefer F0.5. If missing a profit opportunity hurts more, F2 could be more appropriate.
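As a sanity check on the formula, the sketch below computes Fβ by hand and compares it with `fbeta_score` on a tiny hand-made example (the labels and the helper `fbeta_manual` are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Illustrative labels: 3 TP, 1 FP, 2 FN, 4 TN
y_true_fb = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred_fb = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 0])

p = precision_score(y_true_fb, y_pred_fb)  # 3 / (3 + 1)
r = recall_score(y_true_fb, y_pred_fb)     # 3 / (3 + 2)

def fbeta_manual(precision, recall, beta):
    """F-beta from the formula above; assumes precision + recall > 0."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

manual_scores = {}
for beta in (0.5, 1.0, 2.0):
    manual_scores[beta] = fbeta_manual(p, r, beta)
    library = fbeta_score(y_true_fb, y_pred_fb, beta=beta)
    print(f"beta={beta}: manual={manual_scores[beta]:.4f}, sklearn={library:.4f}")
```

Because precision (0.75) exceeds recall (0.60) in this example, the precision-weighted F0.5 comes out highest and the recall-weighted F2 lowest.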

---

## **32.6 ROC Curve and AUC**

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (recall) against the False Positive Rate (FPR = FP / (FP + TN)) at various threshold settings. The Area Under the ROC Curve (AUC) summarizes the curve: AUC = 1 for a perfect model, 0.5 for a random model.

ROC curves are useful for comparing models and for choosing a threshold that balances TPR and FPR according to business needs.
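One common heuristic for picking such a threshold is Youden's J statistic, which selects the point where `TPR - FPR` is largest. This chapter does not prescribe it; it is shown here only as a sketch on synthetic scores (the data recipe and variable names are assumptions), and a cost-based rule tied to the trading strategy may be more appropriate in practice:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true_roc = rng.integers(0, 2, size=2000)

# Synthetic, mildly informative scores clipped to [0, 1]
scores_roc = np.clip(
    0.5 + 0.1 * (2 * y_true_roc - 1) + rng.normal(0, 0.2, size=2000), 0, 1
)

fpr_demo, tpr_demo, thr_demo = roc_curve(y_true_roc, scores_roc)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_stat = tpr_demo - fpr_demo
best = int(np.argmax(j_stat))

print(
    f"Best threshold by Youden's J: {thr_demo[best]:.3f} "
    f"(TPR={tpr_demo[best]:.3f}, FPR={fpr_demo[best]:.3f})"
)
```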

### **32.6.1 Plotting ROC Curve and Computing AUC**

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# For ROC, we need predicted probabilities (not just classes)
# Assuming we have a model that can output probabilities
# For demonstration, we'll use random probabilities
y_scores = np.random.rand(1000)  # simulated probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

print(f"AUC: {auc:.4f}")
```

**Explanation:**  
The ROC curve shows the trade‑off: as we lower the threshold, TPR increases but so does FPR. AUC gives the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one. For NEPSE, an AUC > 0.6 might indicate some predictive power.
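The ranking interpretation of AUC can be checked directly by comparing every positive score with every negative score. The sketch below does this on synthetic data (the scores and variable names are assumptions) and compares the result with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true_auc = rng.integers(0, 2, size=500)

# Synthetic scores with a small shift for the positive class
scores_auc = 0.3 * y_true_auc + rng.normal(0, 1, size=500)

pos = scores_auc[y_true_auc == 1]
neg = scores_auc[y_true_auc == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sk_auc = roc_auc_score(y_true_auc, scores_auc)
print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {sk_auc:.4f}")
```

The two numbers agree, which is why AUC is often described as a measure of ranking quality rather than of any single threshold.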

---

## **32.7 Precision‑Recall Curve**

When classes are imbalanced, the ROC curve can be overly optimistic because FPR remains low due to the large number of true negatives. The Precision‑Recall (PR) curve focuses on the positive class, plotting precision against recall at different thresholds. The Area Under the PR curve (AUPRC) is a useful metric.

### **32.7.1 Plotting PR Curve**

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

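# Use distinct names so the scalar precision/recall computed earlier are not overwritten
pr_precision, pr_recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

plt.figure()
plt.plot(pr_recall, pr_precision, label=f'PR curve (AP = {ap:.2f})')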
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()

print(f"Average Precision: {ap:.4f}")
```

**Explanation:**  
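Average Precision (AP) summarizes the PR curve as a weighted mean of the precision at each threshold, where each weight is the increase in recall from the previous threshold. For imbalanced datasets (e.g., if up days are rare), PR curves are more informative than ROC.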
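A useful reference point when reading AP is the share of positives in the data: scores that carry no information land near an AUC of 0.5 but near an AP equal to that share. The sketch below illustrates this on synthetic labels with roughly 10% positives (the 10% figure and all names are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(3)

# Hypothetical imbalanced labels: roughly 10% positives
y_true_rare = (rng.random(5000) < 0.10).astype(int)

# Uninformative scores, generated independently of the labels
scores_rare = rng.random(5000)

prevalence = y_true_rare.mean()
ap_rare = average_precision_score(y_true_rare, scores_rare)
auc_rare = roc_auc_score(y_true_rare, scores_rare)

print(f"Positive share:    {prevalence:.3f}")
print(f"Average precision: {ap_rare:.3f}")
print(f"ROC AUC:           {auc_rare:.3f}")
```

An AP of 0.30 would therefore be strong on this data even though the same number would be poor on a balanced dataset.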

---

## **32.8 Log Loss (Cross‑Entropy)**

Log loss, or cross‑entropy loss, evaluates a classifier's predicted probabilities rather than its hard class labels. Each prediction is scored by the negative log of the probability assigned to the true class, so a correct but hesitant prediction incurs a small penalty and a confident wrong one incurs a large penalty. The formula for binary log loss is:

`LogLoss = - (1/n) Σ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]`

where `pᵢ` is the predicted probability of the positive class.

Log loss is sensitive to how confident the model is: being wrong with high confidence is penalized more than being wrong with low confidence.
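The effect of confidence can be seen on a four-sample example in which two models make the same two mistakes with different conviction. This is a sketch with made-up probabilities; `binary_log_loss` is a hypothetical helper that implements the formula above directly:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true_ll = np.array([1, 1, 0, 0])

# Same two mistakes (2nd and 4th samples), but with different confidence
p_cautious = np.array([0.6, 0.4, 0.4, 0.6])
p_overconfident = np.array([0.99, 0.01, 0.01, 0.99])

def binary_log_loss(y, p):
    """Direct implementation of the formula above; assumes 0 < p < 1."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

ll_cautious = binary_log_loss(y_true_ll, p_cautious)
ll_overconfident = binary_log_loss(y_true_ll, p_overconfident)
ll_coin_flip = binary_log_loss(y_true_ll, np.full(4, 0.5))

print(f"Cautious model:      {ll_cautious:.4f}")
print(f"Overconfident model: {ll_overconfident:.4f}")
print(f"Always 0.5:          {ll_coin_flip:.4f}")
print(f"sklearn (cautious):  {log_loss(y_true_ll, p_cautious):.4f}")
```

Both models have 50% accuracy, yet the overconfident one scores far worse, and neither beats a constant prediction of 0.5.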

### **32.8.1 Computing Log Loss**

```python
from sklearn.metrics import log_loss

# y_scores are predicted probabilities for the positive class
logloss = log_loss(y_true, y_scores)
print(f"Log Loss: {logloss:.4f}")
```

**Explanation:**  
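A perfect model would have log loss close to 0. A model that always predicts 0.5 scores exactly ln 2 ≈ 0.693 whatever the labels are, which makes that value a natural reference point when the classes are roughly balanced. Uniform random probabilities like those used above average a log loss of about 1.0, worse than the constant 0.5, because some of them are confidently wrong. For NEPSE, log loss can be used to compare probability‑based models.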

---

## **32.9 Metrics for Imbalanced Data**

In many financial datasets, the classes may be imbalanced (e.g., up days 55%, down days 45%, or more extreme). Standard accuracy can be misleading. We should:

- Use precision, recall, F1, and PR curves.
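- Consider the **balanced accuracy**, which averages recall on each class: `(TPR + TNR) / 2`, where `TNR = TN / (TN + FP)` is the recall of the down class.
- Use **Cohen's Kappa**, which measures agreement between predictions and actual labels beyond what chance alone would produce (0 = chance level, 1 = perfect agreement).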
- For probabilistic outputs, use **Brier score** (mean squared error of probabilities).
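The Brier score mentioned in the last bullet is available in scikit‑learn as `brier_score_loss`. A minimal sketch on made-up probabilities (the values and variable names are illustrative), checked against the definition:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Illustrative labels (1 = up) and predicted probabilities of "up"
y_true_bs = np.array([1, 0, 1, 1, 0])
p_up = np.array([0.8, 0.3, 0.6, 0.4, 0.1])

# Definition: mean squared difference between probability and outcome
brier_manual = np.mean((p_up - y_true_bs) ** 2)
brier_sklearn = brier_score_loss(y_true_bs, p_up)

print(f"Brier score (manual):  {brier_manual:.4f}")
print(f"Brier score (sklearn): {brier_sklearn:.4f}")
```

Lower is better; a constant prediction of 0.5 always scores 0.25.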

### **32.9.1 Balanced Accuracy and Cohen's Kappa**

```python
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

balanced_acc = balanced_accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

print(f"Balanced Accuracy: {balanced_acc:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")
```

**Explanation:**  
Balanced accuracy gives equal weight to both classes, so it is not inflated by majority class bias. Kappa measures how much better the model is than random guessing, accounting for class imbalance.
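To see the difference from plain accuracy, consider a model that always predicts the majority class. The sketch below uses hypothetical labels with a 70/30 split (an assumption for illustration only):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    cohen_kappa_score,
)

# Hypothetical labels: 70% up (1), 30% down (0)
y_true_maj = np.array([1] * 700 + [0] * 300)

# A model with no skill that always predicts the majority class
y_pred_maj = np.ones_like(y_true_maj)

acc_maj = accuracy_score(y_true_maj, y_pred_maj)
bal_maj = balanced_accuracy_score(y_true_maj, y_pred_maj)
kappa_maj = cohen_kappa_score(y_true_maj, y_pred_maj)

print(f"Accuracy:          {acc_maj:.2f}")
print(f"Balanced accuracy: {bal_maj:.2f}")
print(f"Cohen's kappa:     {kappa_maj:.2f}")
```

Accuracy looks respectable at 0.70, while balanced accuracy (0.50) and kappa (0.00) both report that the model has learned nothing.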

---

## **32.10 Multi‑Class Classification Metrics**

If we extend to multi‑class (e.g., strong up, weak up, weak down, strong down), we need metrics that handle multiple classes.

- **Macro‑averaging:** Compute metric for each class independently and average (unweighted).
- **Micro‑averaging:** Aggregate TP, FP, FN across classes and then compute metric.
- **Weighted averaging:** Average weighted by the number of true instances per class.
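In scikit‑learn these three schemes are selected with the `average` argument. The sketch below compares them for F1 on synthetic four-class labels (the class proportions, the roughly 60% agreement rate, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(4)

# Hypothetical labels: 0=strong down, 1=weak down, 2=weak up, 3=strong up
y_true_mc = rng.choice(4, size=1000, p=[0.1, 0.4, 0.4, 0.1])

# Keep the true label about 60% of the time, otherwise guess at random
y_pred_mc = np.where(rng.random(1000) < 0.6, y_true_mc, rng.choice(4, size=1000))

per_class_f1 = f1_score(y_true_mc, y_pred_mc, average=None)
macro_f1 = f1_score(y_true_mc, y_pred_mc, average="macro")
micro_f1 = f1_score(y_true_mc, y_pred_mc, average="micro")
weighted_f1 = f1_score(y_true_mc, y_pred_mc, average="weighted")

print("Per-class F1:", np.round(per_class_f1, 3))
print(f"Macro F1:    {macro_f1:.3f}")
print(f"Micro F1:    {micro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
print(f"Accuracy:    {accuracy_score(y_true_mc, y_pred_mc):.3f}")
```

Macro F1 is the plain mean of the per-class values, so rare classes count as much as common ones. For single-label problems like this one, micro F1 equals accuracy.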

Scikit‑learn provides `classification_report` that includes precision, recall, F1 for each class and their averages.

```python
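import numpy as np
from sklearn.metrics import classification_report

# Synthetic 4-class labels for illustration (0 = strong down ... 3 = strong up)
rng = np.random.default_rng(42)
y_true_multi = rng.integers(0, 4, size=1000)
y_pred_multi = rng.integers(0, 4, size=1000)

report = classification_report(y_true_multi, y_pred_multi)
print(report)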
```

---

## **32.11 Choosing the Right Metric for NEPSE**

For the NEPSE direction prediction task, we recommend a combination:

- **Accuracy** – quick baseline, but always compare to majority class baseline.
- **Confusion matrix** – to see where errors occur.
- **Precision and recall** – depending on the trading strategy:
  - If you are long‑only, you want high precision (avoid false positives that lose money).
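  - If you are short‑only, the mirror image applies: treat "down" as the positive class and aim for high precision on it, since a false "down" signal (predicted down, actual up) is the error that loses money. Measured on the up class, that error is a false negative, so it appears as low recall for up.
  - If you trade both sides, both error types cost money, so a balanced metric such as F1 or balanced accuracy is likely the better guide.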
- **F1‑score** – a balanced view.
- **ROC‑AUC** – overall discriminative power.
- **Log loss** – if you need well‑calibrated probabilities for position sizing.

Always compute these on a temporally separated test set to avoid look‑ahead bias.
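One way to obtain temporally separated folds is scikit‑learn's `TimeSeriesSplit`, where every test fold comes strictly after its training data. The sketch below uses placeholder random features and a logistic regression purely for illustration; the data, the model choice, and all variable names are assumptions, and a real run would substitute the NEPSE features in chronological order:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(5)

# Placeholder data, assumed to be sorted by date (row 0 = oldest)
X_ts = rng.normal(size=(600, 3))
y_ts = (X_ts[:, 0] + rng.normal(0, 2.0, size=600) > 0).astype(int)

fold_scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_ts):
    # Fit only on the past, evaluate only on the following block
    model = LogisticRegression().fit(X_ts[train_idx], y_ts[train_idx])
    proba_up = model.predict_proba(X_ts[test_idx])[:, 1]
    fold_scores.append({
        "train_end": int(train_idx[-1]),
        "test_start": int(test_idx[0]),
        "accuracy": accuracy_score(y_ts[test_idx], (proba_up >= 0.5).astype(int)),
        "auc": roc_auc_score(y_ts[test_idx], proba_up),
    })

for s in fold_scores:
    print(
        f"train up to row {s['train_end']}, test from row {s['test_start']}: "
        f"accuracy={s['accuracy']:.3f}, AUC={s['auc']:.3f}"
    )
```

Reporting the metrics per fold, rather than only their average, also shows how stable the model is across different market periods.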

---

## **32.12 Chapter Summary**

In this chapter, we covered the essential classification metrics and applied them to the NEPSE direction prediction problem.

- **Accuracy** is simple but can be misleading with imbalanced data.
- **Confusion matrix** provides a detailed breakdown of errors.
- **Precision** and **recall** focus on the positive class and reveal different aspects of performance.
- **F1‑score** balances precision and recall.
- **ROC curve and AUC** assess discrimination across thresholds.
- **Precision‑recall curve** is better for imbalanced datasets.
- **Log loss** evaluates probabilistic predictions.
- For imbalanced data, use **balanced accuracy**, **Cohen's kappa**, and PR curves.
- Multi‑class extensions (macro/micro averaging) allow evaluation of more than two classes.

### **Practical Takeaways for the NEPSE System:**

- For a binary up/down classifier, report precision, recall, and F1 alongside accuracy.
- Always compute the majority class baseline to put accuracy in context.
- Use ROC‑AUC to compare models; a value above 0.6 may indicate predictive value.
- If you plan to use predicted probabilities for position sizing, monitor log loss and calibration.
- Tailor metric choice to the trading strategy: if false positives are costly, focus on precision.

In the next chapter, **Chapter 33: Time‑Series Specific Evaluation**, we will extend these concepts to the unique aspects of evaluating forecasts over time, including cumulative errors, directional accuracy, hit rate, and economic metrics.

---

**End of Chapter 32**