# Metrics for Evaluating Performance

## Distance Metrics

1. Manhattan Distance: $\sum_{i=1}^{n}|x_i - y_i|$
2. Euclidean Distance: $\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
3. Minkowski Distance: $\sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p}$
4. Hamming Distance: $\sum_{i=1}^{n}I(x_i \neq y_i)$
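
The four formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a library API: the function names are assumptions made for this example, and inputs are taken to be equal-length sequences (numbers for the first three, any comparable items for Hamming).

```python
import math


def _check_lengths(x, y):
    # All four distances are only defined for sequences of equal length.
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    _check_lengths(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    _check_lengths(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, p):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    _check_lengths(x, y)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def hamming(x, y):
    """Number of positions at which the two sequences differ."""
    _check_lengths(x, y)
    return sum(a != b for a, b in zip(x, y))


x, y = [1, 2, 3], [4, 6, 3]
print(manhattan(x, y))                # 7
print(euclidean(x, y))                # 5.0
print(hamming("karolin", "kathrin"))  # 3
```

Note that Manhattan and Euclidean distance are the special cases $p = 1$ and $p = 2$ of Minkowski distance.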

## Classification Evaluation Metrics

- **Accuracy**: The accuracy of a model is the ratio of the number of correct predictions to the total number of predictions, i.e. the fraction of examples the model classifies correctly.
  - When to use accuracy? When the classes are roughly balanced. For example, **accuracy is a poor metric for predicting whether an asteroid is going to hit the Earth**: impacts are so rare that a model that always predicts "no impact" is almost always correct.

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}$$
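
To see why accuracy misleads on imbalanced classes, here is a small sketch of the formula above applied to made-up counts. The numbers (10,000 observations, 10 real impacts) are hypothetical and chosen only for illustration, as is the function name.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical data: 10,000 observations, only 10 of them real impacts.
# A model that always predicts "no impact" finds none of the positives...
always_negative = accuracy(tp=0, tn=9990, fp=0, fn=10)

# ...yet still scores 99.9% accuracy.
print(f"{always_negative:.3f}")  # 0.999
```

The score looks excellent even though the model never detects a single positive case, which is why precision and recall are preferred in such settings.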

- **Precision**: The precision of a model is the ratio of the number of true positives to all the predicted positives. When a model with high precision predicts the positive class, that prediction is likely to be correct.
  - When to use precision? When the classes are imbalanced and we want to be very sure of each positive prediction, i.e. false positives are costly. For example, **precision is a good metric for deciding whether an email is spam or not**, because flagging a legitimate email as spam is expensive.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

- **Recall**: The recall of a model is the ratio of the number of true positives to the total number of true positives and false negatives. It measures the proportion of actual positives that are correctly classified. A model with high recall rarely misses an example that is actually positive.
  - When to use recall? When we want to find as many of the actual positives as possible, i.e. false negatives are costly. For example, **recall is a good metric for detecting whether a patient has a disease or not.**

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

- **F1 score**: The F1 score of a classification model is the harmonic mean of its precision and recall. It is a good measure to use if you have an uneven class distribution (e.g. far more negative samples than positive samples), because it does not depend on the number of true negatives. It reaches its maximum of 1 only when precision and recall are both 1; for a fixed arithmetic mean of the two, it is highest when precision equals recall.
  - When to use F1 score? When false positives and false negatives are equally costly and the number of true negatives is large.

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
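
The precision, recall, and F1 formulas can be sketched directly from confusion-matrix counts. The function names and the example counts below are assumptions for illustration; returning `0.0` when a denominator is zero is one common convention, not the only one.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    predicted_positive = tp + fp
    # Convention (assumption): 0.0 when the model predicts no positives.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    actual_positive = tp + fn
    # Convention (assumption): 0.0 when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=8)      # 0.5
print(p, r, round(f1_score(p, r), 4))

# Same arithmetic mean (0.5), very different F1:
print(round(f1_score(0.5, 0.5), 2))  # 0.5
print(round(f1_score(0.9, 0.1), 2))  # 0.18
```

The last two lines show how the harmonic mean punishes an imbalance between precision and recall that an arithmetic mean would hide.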

- **ROC and AUC**
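  - The ROC curve is a plot of the true positive rate (TPR, the same quantity as recall) against the false positive rate (FPR, the fraction of actual negatives predicted as positive) at various threshold settings.
  - The AUC is the area under the ROC curve.
  - The larger the AUC, the better: 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives. AUC is useful as a single-number summary of classifier performance.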
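
As a rough sketch of how the ROC curve and AUC are obtained, the code below sweeps the decision threshold over the predicted scores and integrates the resulting curve with the trapezoidal rule. The function names and the toy labels and scores are assumptions for illustration; labels are taken to be `1` (positive) or `0` (negative).

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) lists, one point per distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Lowering the threshold step by step = walking scores in descending order.
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Tied scores cross the threshold together and yield a single point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve(labels, scores)
print(auc(fpr, tpr))  # 0.75
```

In this toy example one negative (score 0.4) is ranked above one positive (score 0.35), so three of the four positive-negative pairs are ordered correctly and the AUC is 0.75.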

## Natural Language Generation Evaluation Metrics

- **BLEU**: BLEU is a metric for evaluating a generated sentence against a reference sentence. It is calculated by **comparing the n-grams of the generated sentence to the n-grams of the reference sentence**. The higher the BLEU score, the closer the generated sentence is to the reference.
  - $\text{BLEU} = \text{BP} \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$, where $p_n$ is the modified (clipped) precision of $n$-grams and $\text{BP}$ is the brevity penalty: $\text{BP} = 1$ if the generated length $c$ is greater than the reference length $r$, and $\text{BP} = e^{1 - r/c}$ otherwise.
  - Disadvantage of BLEU: It is **overly dependent on the reference** and is not a good metric for evaluating the **fluency/grammar** of the generated sentence.
- METEOR:
- CIDEr:
- ChrF++:
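
The BLEU computation described above can be sketched for a single tokenized candidate and a single reference. This is a simplified, unsmoothed illustration rather than a reference implementation: the function names and the `use_brevity_penalty` flag are assumptions made for this example. It computes the geometric mean of the clipped n-gram precisions and, by default, multiplies by the standard brevity penalty ($1$ if the candidate is longer than the reference, $e^{1 - r/c}$ otherwise).

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4, use_brevity_penalty=True):
    """Unsmoothed sentence-level BLEU against a single reference."""
    precisions = [
        modified_precision(candidate, reference, n) for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; without smoothing the score is 0
    score = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if use_brevity_penalty and len(candidate) < len(reference):
        score *= math.exp(1 - len(reference) / len(candidate))
    return score


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(bleu(candidate, reference, max_n=2), 4))  # 0.7071
```

Clipping is what stops a degenerate candidate such as "the the the the the the" from scoring a perfect unigram precision: only two of its six tokens are credited, because "the" occurs twice in the reference.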
