This guide explains common machine learning evaluation metrics, grouped by model type, and gives the mathematical formula for each.
# 1. Classification Metrics

## Binary & Multiclass Classification

### **Accuracy**  
**What it tells:**  
The overall proportion of correct predictions. It measures the rate at which the model predicts the correct class.

**Consideration:**  
May be misleading for imbalanced datasets where one class dominates.

**Formula:**
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$  
where:  
- \( TP \) = True Positives  
- \( TN \) = True Negatives  
- \( FP \) = False Positives  
- \( FN \) = False Negatives

---

### **Precision**  
**What it tells:**  
The proportion of positive predictions that are actually correct. It answers: "When the model predicts positive, how often is it right?"

**Consideration:**  
High precision means few false positives.

**Formula:**
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

---

### **Recall (Sensitivity, True Positive Rate - TPR)**  
**What it tells:**  
The proportion of actual positives that the model correctly identified. It answers: "How many of the actual positive cases did the model capture?"

**Consideration:**  
High recall means few false negatives.

**Formula:**
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

---

### **F1 Score**  
**What it tells:**  
The harmonic mean of precision and recall. It provides a balance between the two, especially useful when you need a single metric for imbalanced datasets.

**Consideration:**  
A high F1 score indicates both high precision and recall.

**Formula:**
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
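
The four formulas above can be checked on a small example. The sketch below is illustrative only: the function name `confusion_metrics` and the counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not part of the definitions.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 10 positives among 1000 samples.
m = confusion_metrics(tp=5, tn=985, fp=5, fn=5)
print(m)
```

With these counts accuracy is 0.99 while precision, recall, and F1 are all 0.5, which illustrates why accuracy alone can mislead on imbalanced data.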

---

### **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**  
**What it tells:**  
Measures the model’s ability to distinguish between classes by plotting the true positive rate against the false positive rate across classification thresholds. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.

**Consideration:**  
A higher AUC indicates better performance across different classification thresholds.

**Formula:**  
There is no single closed-form expression. It is computed as the area under the ROC curve, where:
$$
\text{TPR} = \frac{TP}{TP + FN} \quad \text{and} \quad \text{FPR} = \frac{FP}{FP + TN}
$$  
Numerical integration (e.g., using the trapezoidal rule) is typically used.
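
As a minimal sketch of that numerical integration, the function below sweeps the threshold from the highest score downward and adds one trapezoid per distinct score. The name `roc_auc` is hypothetical, labels are assumed to be 0 or 1, and tied scores are assumed to move the curve in a single step.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via the trapezoidal rule (labels must be 0/1)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")

    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a curve point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # one trapezoid
        prev_tpr, prev_fpr = tpr, fpr
    return area


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```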

---

### **PR-AUC (Precision-Recall AUC)**  
**What it tells:**  
Summarizes the trade-off between precision and recall across different thresholds, especially informative when dealing with imbalanced classes.

**Consideration:**  
More sensitive to the performance on the minority class than ROC-AUC.

**Formula:**  
Like ROC-AUC, PR-AUC is the area under a curve, here the precision-recall curve:
$$
\text{PR-AUC} = \int_{0}^{1} P(r) \, dr
$$  
where \( P(r) \) is the precision at recall level \( r \). In practice it is computed numerically.

---

### **Log Loss (Cross-Entropy Loss)**  
**What it tells:**  
Measures how closely the predicted probabilities match the true labels, penalizing confident but wrong predictions much more heavily than hesitant ones. Lower log loss indicates more accurate, better-calibrated probabilities.

**Consideration:**  
It’s sensitive to how well the predicted probabilities reflect true likelihoods.

**Formula (for binary classification):**
$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i) \right]
$$  
where:  
- \( y_i \) is the true label (0 or 1)  
- \( \hat{p}_i \) is the predicted probability for the positive class  
- \( N \) is the number of samples
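
A small sketch of the binary formula follows. The function name `log_loss` is hypothetical, and clipping probabilities to `[eps, 1 - eps]` is an assumption added here to avoid `log(0)`; it is not part of the formula.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # assumed safeguard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# A confident wrong prediction costs far more than a hesitant one.
print(log_loss([1], [0.4]))   # hesitant and wrong-leaning
print(log_loss([1], [0.01]))  # confident and wrong
```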

---

## Multilabel Classification

### **Hamming Loss**  
**What it tells:**  
The fraction of labels that are incorrectly predicted. It penalizes each misclassified label equally.

**Consideration:**  
Lower values indicate better performance.

**Formula:**
$$
\text{Hamming Loss} = \frac{1}{N \times L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbf{1}(y_{ij} \neq \hat{y}_{ij})
$$  
where:  
- \( N \) is the number of samples  
- \( L \) is the number of candidate labels (each sample has one 0/1 entry per label)  
- \( \mathbf{1}(\cdot) \) is the indicator function

---

### **Jaccard Similarity (Intersection over Union)**  
**What it tells:**  
Measures the similarity between the predicted set of labels and the true set of labels. It is the size of the intersection divided by the size of the union of the label sets.

**Consideration:**  
Higher values indicate better overlap between predictions and true labels.

**Formula:**
$$
\text{Jaccard Similarity} = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|}
$$  
where:  
- \( Y \) is the set of true labels  
- \( \hat{Y} \) is the set of predicted labels
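
The two multilabel metrics in this section can be compared on a toy example. This is a sketch under stated assumptions: the function names are hypothetical, Hamming loss takes 0/1 indicator rows, Jaccard takes Python sets, and treating two empty label sets as a perfect match is an assumed convention.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of the N x L label slots that are predicted incorrectly."""
    n, num_labels = len(y_true), len(y_true[0])
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (n * num_labels)


def jaccard(true_set, pred_set):
    """Size of the intersection divided by size of the union of two label sets."""
    union = true_set | pred_set
    if not union:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return len(true_set & pred_set) / len(union)


print(hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]]))
print(jaccard({"a", "c"}, {"a", "b", "c"}))
```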

---

## Imbalanced Classification

### **Balanced Accuracy**  
**What it tells:**  
The average recall obtained on each class, which helps when classes are imbalanced.

**Consideration:**  
It adjusts for the class imbalance by taking the average of the per-class recall values.

**Formula (for \( C \) classes):**
$$
\text{Balanced Accuracy} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}
$$  
where \( TP_c \) and \( FN_c \) are the true positives and false negatives for class \( c \).
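
To make the contrast with plain accuracy concrete, here is a minimal sketch (the function name and data are hypothetical) that averages per-class recall over the classes present in the true labels.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    support = Counter(y_true)  # TP_c + FN_c for each class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP_c
    recalls = [correct[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Hypothetical data: a model that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))
```

Plain accuracy on this data would be 0.9, while balanced accuracy is 0.5 because the minority class has a recall of zero.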

---

### **Matthews Correlation Coefficient (MCC)**  
**What it tells:**  
A correlation coefficient between the observed and predicted classifications. It takes into account true and false positives and negatives and is regarded as a balanced measure even if the classes are of very different sizes.

**Consideration:**  
Ranges from -1 (total disagreement) through 0 (no better than random prediction) to +1 (perfect prediction).

**Formula (binary case):**
$$
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
$$
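
A direct transcription of the binary formula is sketched below. The function name `mcc` is hypothetical, and returning 0.0 when the denominator is zero is a commonly used convention that is assumed here rather than taken from the formula.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


print(mcc(5, 5, 0, 0))  # every prediction correct
print(mcc(0, 0, 5, 5))  # every prediction wrong
print(mcc(0, 9, 0, 1))  # always predicts the majority class
```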

---

### **F1 Score (macro, weighted, or micro)**  
**What it tells:**  
Variants of the F1 score adjust for class imbalance:  
- **Macro F1:** Averages F1 scores per class without weighting, treating all classes equally.  
- **Weighted F1:** Averages F1 scores per class, weighted by the number of instances in each class.  
- **Micro F1:** Pools the true positives, false positives, and false negatives of all classes and computes a single F1 score from the totals.

**Consideration:**  
The choice depends on whether you want to give equal importance to all classes or weigh them by frequency.

**Formula:**  
The base F1 formula is:
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$  
The aggregation method (macro, weighted, micro) determines how the scores are averaged over the classes.
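
The three aggregation methods can be sketched as follows. The function name and data are hypothetical; the code uses the identity \( F1 = 2TP / (2TP + FP + FN) \), which is algebraically equivalent to the harmonic-mean form, and assumes a class with no true or predicted instances gets an F1 of 0.0.

```python
def f1_variants(y_true, y_pred):
    """Return (macro, weighted, micro) F1 for single-label multiclass predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1, supports = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)  # assumed 0.0 if undefined
        supports.append(tp + fn)  # number of true instances of class c
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    macro = sum(per_class_f1) / len(classes)
    weighted = sum(f * s for f, s in zip(per_class_f1, supports)) / sum(supports)
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    return macro, weighted, micro


macro, weighted, micro = f1_variants([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1])
print(macro, weighted, micro)
```

On this imbalanced toy data the macro score is the lowest of the three, since it gives the weaker minority class the same weight as the majority class.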

---

# 2. Regression Metrics

### **Mean Absolute Error (MAE)**  
**What it tells:**  
The average absolute difference between the predicted and actual values. It gives a straightforward measure of prediction error in the same units as the output.

**Consideration:**  
Each error contributes in proportion to its magnitude, so large errors are not penalized disproportionately.

**Formula:**
$$
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
$$

---

### **Mean Squared Error (MSE)**  
**What it tells:**  
The average of the squared differences between predicted and actual values. Squaring errors penalizes larger errors more significantly.

**Consideration:**  
Sensitive to outliers due to the squaring of differences.

**Formula:**
$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

---

### **Root Mean Squared Error (RMSE)**  
**What it tells:**  
The square root of MSE, which converts the error metric back to the original units of the output.

**Consideration:**  
Like MSE, it penalizes large errors, making it useful when large errors are particularly undesirable.

**Formula:**
$$
RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
$$
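
The difference between MAE, MSE, and RMSE is easiest to see when the same total absolute error is spread evenly versus concentrated in one prediction. The sketch below uses hypothetical function names and made-up data.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


y_true = [10.0, 12.0, 14.0, 16.0]
spread = [11.0, 13.0, 15.0, 17.0]   # every prediction is off by 1
outlier = [10.0, 12.0, 14.0, 20.0]  # one prediction is off by 4

for name, pred in [("spread", spread), ("outlier", outlier)]:
    print(name, mae(y_true, pred), mse(y_true, pred), rmse(y_true, pred))
```

Both prediction sets have an MAE of 1.0, but the set with the single large error has an MSE of 4.0 and an RMSE of 2.0, compared with 1.0 for the evenly spread errors.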

---

### **R-squared (\(R^2\))**  
**What it tells:**  
The proportion of the variance in the dependent variable that is predictable from the independent variables. It is at most 1 and usually falls between 0 and 1; it becomes negative when the model fits worse than always predicting the mean of the actual values.

**Consideration:**  
Higher values indicate a better fit of the model to the data.

**Formula:**
$$
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
$$  
where \( \bar{y} \) is the mean of the actual values.
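
The following sketch evaluates the formula for three hypothetical prediction sets: a perfect model, a model that always predicts the mean, and a model that is worse than the mean. The function name is illustrative, and the check for constant targets is an added safeguard.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect fit
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicts the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```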

---

### **Mean Absolute Percentage Error (MAPE)**  
**What it tells:**  
The average absolute percentage difference between predicted and actual values, providing a relative measure of error.

**Consideration:**  
Can be problematic when actual values are close to zero.

**Formula:**
$$
MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$
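
A short sketch (hypothetical function name and data) shows how an actual value near zero inflates MAPE even when the absolute error is modest.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


print(mape([100.0, 200.0], [110.0, 180.0]))  # each prediction is 10% off
print(mape([0.01], [1.01]))                  # absolute error of 1 on a tiny actual value
```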

---

### **Huber Loss**  
**What it tells:**  
A combination of MAE and MSE that is less sensitive to outliers than MSE while still differentiable at zero error.

**Consideration:**  
Useful in scenarios where outliers may otherwise skew the error metric.

**Formula:**  
For a given threshold \( \delta \) and error \( a = y_i - \hat{y}_i \):
$$
L_\delta(a) =
\begin{cases}
\frac{1}{2}a^2, & \text{if } |a| \le \delta \\
\delta \left(|a| - \frac{1}{2}\delta\right), & \text{otherwise}
\end{cases}
$$
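
The piecewise definition translates directly into code. In this sketch the function name `huber` and the default \( \delta = 1 \) are illustrative choices, not values prescribed by the text.

```python
def huber(a: float, delta: float = 1.0) -> float:
    """Huber loss for a single error a = y - y_hat."""
    if abs(a) <= delta:
        return 0.5 * a * a                   # quadratic near zero, like MSE
    return delta * (abs(a) - 0.5 * delta)    # linear in the tails, like MAE


print(huber(0.5))  # inside the quadratic region
print(huber(3.0))  # outside it; half the squared error would be 4.5
```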

---

# 3. Clustering Metrics

### **Silhouette Score**  
**What it tells:**  
Measures how similar an object is to its own cluster compared to other clusters. Scores range from -1 (poor clustering) to +1 (well-clustered), with values around 0 indicating overlapping clusters.

**Consideration:**  
Requires no ground-truth labels, so it is useful for choosing the number of clusters when the true number is unknown.

**Formula (for a single sample \( i \)):**
$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$  
where:  
- \( a(i) \) is the average distance from sample \( i \) to the other points in its own cluster.  
- \( b(i) \) is the average distance from sample \( i \) to the points of the nearest cluster it does not belong to.
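
The per-sample formula can be sketched for one-dimensional points, using the absolute difference as the distance. The function name is hypothetical; the code assumes at least two clusters and no cluster of identical points, and scoring singleton clusters as 0.0 is an assumed convention.

```python
def silhouette_samples(points, labels):
    """Per-sample silhouette values for 1-D points (distance = absolute difference)."""
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)

    scores = []
    for x, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster.
        # The sum includes the zero distance from x to itself, hence len(own) - 1.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(x - y) for y in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [0.0, 1.0, 10.0, 11.0]
print(silhouette_samples(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(silhouette_samples(points, [0, 1, 0, 1]))  # clusters that interleave
```

Averaging the per-sample values gives the overall Silhouette Score for a clustering.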

---

### **Davies-Bouldin Index**  
**What it tells:**  
Evaluates intra-cluster similarity (how close data points are within the same cluster) and inter-cluster separation (how distinct the clusters are from each other). Lower values indicate better clustering.

**Consideration:**  
It is sensitive to the number of clusters chosen.

**Formula (for \( K \) clusters):**
$$
DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{S_i + S_j}{M_{ij}} \right)
$$  
where:  
- \( S_i \) is the average distance of all points in cluster \( i \) to its centroid, and  
- \( M_{ij} \) is the distance between the centroids of clusters \( i \) and \( j \).

---

### **Dunn Index**  
**What it tells:**  
The ratio between the smallest distance between observations not in the same cluster (inter-cluster distance) and the largest intra-cluster distance. Higher values indicate better clustering quality.

**Consideration:**  
It is used to identify compact and well-separated clusters.

**Formula:**
$$
\text{Dunn Index} = \frac{\min_{1 \leq i < j \leq K} \delta(C_i, C_j)}{\max_{1 \leq k \leq K} \Delta(C_k)}
$$  
where:  
- \( \delta(C_i, C_j) \) is the inter-cluster distance between clusters \( C_i \) and \( C_j \), and  
- \( \Delta(C_k) \) is the intra-cluster distance (diameter) of cluster \( C_k \).

---

### **Adjusted Rand Index (ARI)**  
**What it tells:**  
Measures the similarity between the clustering result and the ground truth (if available), adjusted for chance. It is at most 1 (perfect agreement), is close to 0 for random labelings, and can be negative when agreement is worse than expected by chance.

**Consideration:**  
Especially useful when comparing different clustering algorithms or parameter settings.

**Formula (simplified representation):**
$$
ARI = \frac{RI - \text{Expected } RI}{\text{Max } RI - \text{Expected } RI}
$$  
where \( RI \) (Rand Index) is computed based on pair-counting between clusters and ground truth labels.
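
One common way to evaluate this expression is with raw pair counts from the contingency table instead of the normalized Rand Index; the ratio is the same because the normalizing factor cancels. The sketch below takes that route. The function name is hypothetical, it assumes at least two samples, and returning 1.0 in the degenerate case where the denominator is zero is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts: (index - expected) / (max - expected)."""
    n = len(labels_true)
    # Pairs of samples that share a cluster in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # assumed convention for degenerate labelings
    return (index - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # every true pair is split
```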

---

### **Normalized Mutual Information (NMI)**  
**What it tells:**  
Measures the amount of shared information between the predicted clusters and the true clusters, normalized to scale between 0 and 1.

**Consideration:**  
A higher NMI means a better agreement between the cluster assignments and the actual labels.

**Formula:**
$$
NMI = \frac{I(U;V)}{\sqrt{H(U) \, H(V)}}
$$  
where:  
- \( I(U;V) \) is the mutual information between the clustering \( U \) and the ground truth \( V \), and  
- \( H(U) \) and \( H(V) \) are the entropies of \( U \) and \( V \), respectively.
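
A minimal sketch of this formula with natural logarithms follows. The function names are hypothetical, and the handling of a labeling with zero entropy (everything in one cluster) is an assumed convention.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    n = len(u)
    joint = Counter(zip(u, v))
    count_u, count_v = Counter(u), Counter(v)
    # Sum over non-empty cells of p(a, b) * log(p(a, b) / (p(a) * p(b))).
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nmi(u, v):
    """Mutual information normalized by the geometric mean of the entropies."""
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0 or h_v == 0:
        return 1.0 if h_u == h_v else 0.0  # assumed convention for trivial labelings
    return mutual_information(u, v) / math.sqrt(h_u * h_v)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent labelings
```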

---

# 4. Anomaly Detection Metrics

### **Precision, Recall, and F1-score**  
**What they tell:**  
Similar to classification, these metrics focus on the correct detection of anomalies (the minority class).  
- **Precision:** How many detected anomalies are true anomalies.  
- **Recall:** How many true anomalies were detected.  
- **F1-score:** Balances precision and recall.

**Consideration:**  
Crucial for applications where false negatives (missed anomalies) or false positives (false alarms) have significant consequences.

**Formulas:**  
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

---

### **ROC-AUC / PR-AUC**  
**What they tell:**  
Evaluate the performance of anomaly detection models by examining the trade-off between true positives and false positives (ROC-AUC) or precision and recall (PR-AUC), which is particularly important for imbalanced datasets.

**Consideration:**  
PR-AUC is often more informative in cases with a heavy imbalance.

**Formula:**  
Computed as the area under the respective curves (see Classification Metrics for details), typically using numerical integration.

---

### **Mean Squared Error (for reconstruction-based methods like Autoencoders)**  
**What it tells:**  
In models that reconstruct input data (e.g., autoencoders), the reconstruction error (often measured by MSE) can be used to identify anomalies. A higher error may indicate an anomaly.

**Consideration:**  
The threshold for anomaly detection must be chosen carefully.

**Formula:**
$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2
$$  
where:  
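- \( x_i \) is the \( i \)th component of the original input, \( \hat{x}_i \) is the corresponding component of its reconstruction, and \( N \) is the number of components, so each input receives its own error score.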

---

### **Z-score / Mahalanobis Distance**  
**What they tell:**  
These metrics measure how far a data point is from the mean (or the expected distribution) of the normal data.

**Consideration:**  
Useful in statistical anomaly detection where anomalies are assumed to be far from the mean or outside a certain distribution.

**Z-score Formula:**
$$
z = \frac{x - \mu}{\sigma}
$$  
where:  
- \( x \) is the value,  
- \( \mu \) is the mean, and  
- \( \sigma \) is the standard deviation.

**Mahalanobis Distance Formula:**
$$
D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
$$  
where:  
- \( x \) is the data point,  
- \( \mu \) is the mean vector, and  
- \( \Sigma \) is the covariance matrix.
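
Both distances are sketched below. To stay dependency-free, the Mahalanobis example is restricted to two dimensions so the covariance matrix can be inverted by hand; the function names and numbers are hypothetical, and the covariance matrix is assumed to be invertible.

```python
import math


def z_score(x: float, mu: float, sigma: float) -> float:
    return (x - mu) / sigma


def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # assumed non-zero (invertible covariance)
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(quad)


print(z_score(130.0, 100.0, 15.0))
# With an identity covariance the result reduces to Euclidean distance.
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]))
# A larger variance along the first axis shrinks distances in that direction.
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```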

---

# 5. Ranking & Recommendation Metrics

### **Mean Reciprocal Rank (MRR)**  
**What it tells:**  
The average of the reciprocal ranks of the first relevant item. It reflects how far down the list you have to go to find a relevant item on average.

**Consideration:**  
Useful when a single relevant result is enough, such as in search queries.

**Formula:**
$$
MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
$$  
where:  
- \( |Q| \) is the number of queries, and  
- \( \text{rank}_i \) is the rank position of the first relevant item for the \( i \)th query.
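
As a quick sketch, MRR can be computed from the rank of the first relevant item per query. The function name is hypothetical, ranks are 1-based, and scoring a query with no relevant result as 0 (passed as `None`) is an assumed convention.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries; None marks a query with no relevant result."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


# Three hypothetical queries whose first relevant item is at rank 1, 2 and 4.
print(mean_reciprocal_rank([1, 2, 4]))
```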

---

### **Normalized Discounted Cumulative Gain (NDCG)**  
**What it tells:**  
Evaluates the quality of the ranking by giving higher scores to relevant items appearing higher in the list, with a discount for items lower down.

**Consideration:**  
Effective when relevance is graded (not just binary).

**Formula:**
$$
NDCG@k = \frac{DCG@k}{IDCG@k}
$$  
where:
$$
DCG@k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}
$$  
and \( IDCG@k \) is the ideal (maximum possible) DCG up to rank \( k \).
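
The sketch below implements the DCG form given above and normalizes by the DCG of the same relevance grades sorted in descending order. The function names are hypothetical, and the code assumes the input list contains the grades of all judged items for the query, so that sorting it yields the ideal ranking.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain 2^rel - 1."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


print(ndcg_at_k([3, 2, 1, 0], 4))  # already in ideal order
print(ndcg_at_k([0, 1, 2, 3], 4))  # best items ranked last
```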

---

### **Hit Rate**  
**What it tells:**  
Measures whether at least one relevant item appears in the recommendation list for a user. It is usually expressed as a percentage.

**Consideration:**  
Does not consider the rank order of the hits.

**Formula:**  
It is commonly defined as:
$$
\text{Hit Rate} = \frac{\text{Number of users with at least one hit}}{\text{Total number of users}} \times 100\%
$$

---

### **Mean Average Precision (MAP)**  
**What it tells:**  
Averages the precision scores after each relevant item is retrieved, providing a single number summary of ranking quality across multiple queries.

**Consideration:**  
Sensitive to the ranking order and number of relevant items per query.

**Formula:**
$$
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)
$$  
with
$$
AP(q) = \frac{1}{N_q} \sum_{k=1}^{n} P(k) \times rel(k)
$$  
where:  
- \( N_q \) is the number of relevant items for query \( q \),  
- \( n \) is the number of retrieved items for query \( q \),  
- \( P(k) \) is the precision at rank \( k \), and  
- \( rel(k) \) is an indicator function (1 if the item at rank \( k \) is relevant, 0 otherwise).
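
A minimal sketch of AP and MAP over ranked 0/1 relevance lists follows. The function names are hypothetical, and by default the code assumes every relevant item appears somewhere in the ranked list, so \( N_q \) is taken as the number of ones in it; pass `num_relevant` to override that.

```python
def average_precision(relevance, num_relevant=None):
    """AP for one query; relevance is a list of 0/1 flags in ranked order."""
    if num_relevant is None:
        num_relevant = sum(relevance)  # assumes all relevant items were retrieved
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at rank k, counted only at relevant ranks
    return total / num_relevant


def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)


print(average_precision([1, 0, 1]))
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```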

---

# 6. Generative Model Metrics

### **Frechet Inception Distance (FID)**  
**What it tells:**  
Compares the distribution of generated images to real images by measuring the distance between feature representations (usually using a pretrained network). Lower FID indicates generated images that are more similar to real ones.

**Consideration:**  
Widely used in evaluating GANs and other generative models.

**Formula:**
$$
FID = \|\mu_r - \mu_g\|^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{\frac{1}{2}}\right)
$$  
where:  
- \( \mu_r, \Sigma_r \) are the mean and covariance of the real images' feature representations, and  
- \( \mu_g, \Sigma_g \) are the mean and covariance of the generated images' feature representations.

---

### **Inception Score (IS)**  
**What it tells:**  
Evaluates both the quality and diversity of generated images by assessing the confidence of a pretrained classifier on generated images and the variety of classes represented.

**Consideration:**  
Higher scores indicate better performance, but the score can miss mode collapse within a class because it only looks at predicted class labels.

**Formula:**
$$
IS = \exp\left(\mathbb{E}_{x}\left[ D_{KL}(p(y|x) \,\|\, p(y)) \right]\right)
$$  
where:  
- \( p(y|x) \) is the conditional label distribution given image \( x \) from a pretrained classifier, and  
- \( p(y) \) is the marginal distribution over all generated images.
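
The formula can be sketched on tiny hand-made class distributions standing in for classifier outputs. The function name and the probabilities are hypothetical; real use would take \( p(y|x) \) from a pretrained classifier over many generated images.

```python
import math


def inception_score(cond_probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    marginal = [sum(row[c] for row in cond_probs) / n for c in range(num_classes)]
    mean_kl = sum(
        sum(p * math.log(p / marginal[c]) for c, p in enumerate(row) if p > 0)
        for row in cond_probs
    ) / n
    return math.exp(mean_kl)


print(inception_score([[1.0, 0.0], [0.0, 1.0]]))  # confident and diverse
print(inception_score([[1.0, 0.0], [1.0, 0.0]]))  # confident but a single class
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))  # classifier is unsure
```

With two classes the score is bounded by 2, which the confident and diverse case reaches; the other two cases give the minimum of 1.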

---

### **Perplexity (for text generation)**  
**What it tells:**  
Measures how well a probability model predicts a sample. Lower perplexity means the model assigns higher probability to the tokens that actually occur in a sequence.

**Consideration:**  
Commonly used in language modeling; a lower perplexity suggests a better model fit.

**Formula:**
$$
\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)
$$  
or equivalently,
$$
\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i \mid w_{<i})}
$$  
where \( p(w_i \mid w_{<i}) \) is the predicted probability of the \( i \)th token given the tokens before it, and \( N \) is the number of tokens.
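
A short sketch, assuming the probabilities the model assigned to each observed token are already available as a list (the function name and values are hypothetical):

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# A model that spreads probability uniformly over 4 choices at every step.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A model that is always certain and always right.
print(perplexity([1.0, 1.0]))
```

The uniform case gives a perplexity of 4, matching the intuition that perplexity is the effective number of equally likely choices per token.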

---

# Summary of Insights by Model Type

- **Classification Metrics:**  
  Focus on the correctness of discrete predictions using measures such as accuracy, precision, recall, F1 score, ROC-AUC, PR-AUC, and log loss.

- **Regression Metrics:**  
  Measure the magnitude of errors using MAE, MSE, RMSE, \(R^2\), MAPE, and Huber Loss.

- **Clustering Metrics:**  
  Assess the cohesiveness and separation of data groups with metrics like Silhouette Score, Davies-Bouldin Index, Dunn Index, Adjusted Rand Index, and Normalized Mutual Information.

- **Anomaly Detection Metrics:**  
  Gauge the model’s ability to identify rare or unusual instances using adapted classification metrics (precision, recall, F1), ROC-AUC/PR-AUC, reconstruction error (MSE), and statistical distances (Z-score, Mahalanobis Distance).

- **Ranking & Recommendation Metrics:**  
  Evaluate the ordering and relevance of results using Mean Reciprocal Rank, NDCG, Hit Rate, and Mean Average Precision.

- **Generative Model Metrics:**  
  Measure the fidelity and diversity of generated content through metrics like FID, Inception Score, and Perplexity.

---
