# metrics for NLP tasks

## perplexity

**perplexity**

- Perplexity measures how well a language model predicts a sequence of words. It can be read as the **effective number of equally likely choices the model faces, on average, when predicting each word of the sequence**.

- The lower the perplexity, the better the model: a lower perplexity means the model is more certain about the sequence. A good model assigns high probability to the ground-truth sequence.

**Definition**

The perplexity of a language model parametrized by $\theta$ is defined as the inverse of the $N$-th root of the likelihood (joint probability) of a sequence of words $w_1, w_2, ..., w_N$ under that model:

$$
\text{Perplexity}(\theta)=\sqrt[N]{\prod _{i=1}^N \frac{1}{p_{\theta} (w_i | w_{1:i-1})}}=\left(\prod _{i=1}^N p_{\theta} (w_i | w_{1:i-1})\right)^{-\frac{1}{N}}
$$

where $p_{\theta}(w_i | w_{1:i-1})$ is the conditional probability of observing word $w_i$ given the previous words $w_{1:i-1}$ (the context), according to the language model parametrized by $\theta$.
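As a minimal sketch of this definition, the snippet below computes perplexity directly from a list of per-word conditional probabilities. It assumes the usual convention that perplexity is the inverse $N$-th root of the likelihood, which matches the $-\frac{1}{N}$ exponent used in the interpretation below. The function name `perplexity_from_probs` and the probability values are hypothetical, chosen only for illustration.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity as the inverse N-th root of the sequence likelihood."""
    n = len(token_probs)
    # Joint probability of the sequence via the chain rule.
    likelihood = math.prod(token_probs)
    return likelihood ** (-1.0 / n)

# Hypothetical conditional probabilities p(w_i | w_{1:i-1}) for a 4-word sequence.
probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity_from_probs(probs))
```

Multiplying many probabilities underflows for long sequences, which is one practical reason to work with log-probabilities instead.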

**interpretation**

- log perplexity is the average negative log-likelihood per word

    $$
    \log(\text{Perplexity}(\theta))=\log \left ( \prod _{i=1}^N p_{\theta} (w_i | w_{1:i-1}) \right )^{-\frac{1}{N}}=-\frac{1}{N}\log \left ( \prod _{i=1}^N p_{\theta} (w_i | w_{1:i-1}) \right )
    $$

- perplexity is exponential of average cross-entropy loss

    $$
    \begin{aligned}
    \text{Perplexity}(\theta) &= \exp \left[\log(\text{Perplexity}(\theta))\right]\\[1em]
    &= \exp\left[-\frac{1}{N}\log \left ( \prod_{i=1}^N p_{\theta} (w_i | w_{1:i-1}) \right )\right]\\[1em]
    &= \exp\left[\frac{1}{N}\sum_{i=1}^N - \log p_{\theta} (w_i | w_{1:i-1})\right]
    \end{aligned}
    $$

- log base 2 of perplexity is the **average number of bits needed to encode each word in the sequence**

    $$
    \text{\# bits}=-\frac{1}{N}\sum_{i=1}^N\log_2 p_{\theta} (w_i | w_{1:i-1})
    $$

    so, changing the base of the logarithm from $e$ to 2,

    $$
    \begin{aligned}
    \text{Perplexity}(\theta) = 2^{-\frac{1}{N}\sum_{i=1}^N \log_2 p_{\theta} (w_i | w_{1:i-1})}
    \end{aligned}
    $$


- e.g. if perplexity is 100, the model is as uncertain about the next word in a sequence as it would be if there were 100 equally likely words to pick from. For instance, if the model assigns probability $\frac{1}{100}$ to every word of a 100-word sequence:

    $$
    \text{Perplexity}(\theta)=\left ( \prod _{i=1}^{100} p_{\theta} (w_i | w_{1:i-1}) \right )^{-\frac{1}{100}}=\left ( \left(\frac{1}{100}\right)^{100}  \right )^{-\frac{1}{100}}=100
    $$
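The interpretations above can be checked numerically. The sketch below computes the average cross-entropy in nats and in bits, then recovers perplexity from each. The function names are hypothetical, and the input is the uniform case where every word gets probability 1/100.

```python
import math

def cross_entropy_nats(token_probs):
    """Average negative natural-log probability per word."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def bits_per_word(token_probs):
    """Average negative base-2 log probability per word."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity as the exponential of the average cross-entropy."""
    return math.exp(cross_entropy_nats(token_probs))

# Uniform case: each of 100 words is assigned probability 1/100.
uniform = [1 / 100] * 100
print(perplexity(uniform))           # close to 100, up to floating-point rounding
print(2 ** bits_per_word(uniform))   # same value via the base-2 form
```

Working in the log domain avoids the underflow that a direct product of probabilities would hit on long sequences.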

## BLEU

BLEU (BiLingual Evaluation Understudy) is an n-gram precision measure used in machine translation.

It is called an "understudy" (a stand-in) because it is not a perfect replacement for human evaluation but rather serves as a proxy for it.

It measures the similarity between the machine-generated translation and **one or more** human-generated reference translations.

$$
\begin{aligned}
\text{BLEU}&=\min \left(1, \exp\left(1-\frac{\text{len(ref)}}{\text{len(pred)}}\right)\right) \left(\prod_{n=1}^4 \text{n-gram precision}\right)^{1/4}
\\[1em]
\log \text{BLEU}&=\min \left(0, 1-\frac{\text{len(ref)}}{\text{len(pred)}} \right) +\text{mean log precision}
\end{aligned}
$$

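$\min\left(1, \exp\left(1-\frac{\text{len(ref)}}{\text{len(pred)}}\right)\right)$ is the brevity (length) penalty: it equals 1 when the prediction is at least as long as the reference and shrinks toward 0 for translations that are too short.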
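A minimal sentence-level sketch of the formula above, assuming a single reference and whitespace-tokenized input. Two choices here are assumptions rather than something stated in these notes: n-gram counts are clipped by their count in the reference (as in standard BLEU), and the score is 0 whenever any n-gram precision is 0 (no smoothing). The helper names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(pred, ref, n):
    """Fraction of predicted n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(pred, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
    return matched / total

def bleu(pred, ref, max_n=4):
    """exp(log brevity penalty + mean log n-gram precision) for one reference."""
    precisions = [clipped_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    log_bp = min(0.0, 1.0 - len(ref) / len(pred))
    mean_log_p = sum(math.log(p) for p in precisions) / max_n
    return math.exp(log_bp + mean_log_p)

ref = "the cat sat on the mat".split()
print(bleu("the cat sat on".split(), ref))  # all precisions are 1, only the brevity penalty applies
```

Corpus-level BLEU aggregates n-gram counts over all sentences before taking the ratio, so averaging this sentence-level score over a corpus is not equivalent.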


BLEU scores do not correlate well with human judgments when comparing human and automatic translations, because of several limitations:

- semantics: BLEU only considers surface-level n-gram matches. It can't differentiate between translations that have the same n-gram matches but different meanings, and it penalizes translations that use different n-grams to convey the same meaning.

- recall: BLEU measures precision (how many n-grams in the machine-generated translation appear in the reference translation) but doesn't consider recall (how many n-grams in the reference translation appear in the machine-generated translation).

- fluency: BLEU only captures some aspects of fluency, such as local word order. Human translations tend to be more fluent and idiomatic than machine translations, which n-gram overlap does not reward.

## word error rate

Word Error Rate (WER): the minimum number of **word-level** operations (insertions, deletions, and substitutions) required to transform a system-generated transcription (hypothesis) into a reference transcription (ground truth), divided by the number of words in the reference.

WER = (Insertions + Deletions + Substitutions) / Total Words in Reference

WER can be seen as a word-level edit distance normalized by the reference length. Because of insertions, it can exceed 1.

limitation: WER can't capture the semantic similarity between the reference and the hypothesis.
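A minimal sketch of WER, assuming whitespace tokenization and no text normalization (casing and punctuation are left as-is). The function names are hypothetical. The distance is computed with a two-row dynamic program over words.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word insertions, deletions, and substitutions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)

# One substitution (sat -> sit) and one missing word (the): 2 errors over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```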

## edit cost

$$
D(i, j) = 
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0 \\
i, & \text{if } j = 0 \text{ and } i > 0 \\
j, & \text{if } i = 0 \text{ and } j > 0 \\
D(i-1, j-1), & \text{if } A_i = B_j \\
\min \begin{cases}
D(i-1, j) + 1 \\
D(i, j-1) + 1 \\
D(i-1, j-1) + 1
\end{cases}, & \text{otherwise}
\end{cases}
$$

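Edit cost (edit distance, Levenshtein distance): a measure of how dissimilar two strings are. It is used in spell checking, DNA sequence alignment, and speech recognition.

definition: the minimum number of edit operations (insertions, deletions, or substitutions) required to transform one string into another. In the classic formulation the operations are **character-level**. In the recurrence above, $D(i, j)$ is the edit cost between the first $i$ elements of $A$ and the first $j$ elements of $B$.

The edit cost can be calculated at different levels: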

- Word level: The strings A and B consist of words, and the edit operations are performed on words.

- Character level: The strings A and B consist of characters, and the edit operations are performed on characters.

- Minute level: The strings A and B consist of time-stamped events or actions, and the edit operations are performed on these time-based units.

algorithm

1. Initialize a matrix D with dimensions `(len(A) + 1) x (len(B) + 1)`, where len(A) and len(B) are the lengths of the strings A and B, respectively.

2. Set the first row and column of the matrix D to their respective index values, `D[i][0] = i` and `D[0][j] = j` for all i and j.

3. Iterate through the matrix D from the second row and second column, using the following formula for each cell D[i][j]:

    - If `A[i-1] == B[j-1]`, then `D[i][j] = D[i-1][j-1]`

    - Otherwise, `D[i][j] = min(D[i-1][j], D[i][j-1], D[i-1][j-1]) + 1`

4. The edit distance between A and B is the value in the bottom-right cell of the matrix D, `D[len(A)][len(B)]`.
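The four steps above can be sketched as follows. The function name `edit_distance` is hypothetical. It accepts any two indexable sequences, so the same code covers the character level (strings) and the word level (lists of words).

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b using the full DP matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: matrix of size (len(a) + 1) x (len(b) + 1).
    d = [[0] * cols for _ in range(rows)]
    # Step 2: first column and first row hold their index values.
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    # Step 3: fill the remaining cells row by row.
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],      # deletion
                                  d[i][j - 1],      # insertion
                                  d[i - 1][j - 1])  # substitution
    # Step 4: the answer is the bottom-right cell.
    return d[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("the cat sat".split(), "the cat sits down".split()))  # 2
```

The matrix needs O(len(a) * len(b)) memory. Since each row depends only on the previous one, keeping two rows is enough when only the distance is needed.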

## ROUGE 

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a family of metrics used for text summarization and machine translation tasks.

It measures the overlap of n-grams (ROUGE-N), longest common subsequences (ROUGE-L), or skip-bigrams (ROUGE-S) between the system-generated summary or translation and the reference summary or translation.
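A minimal sketch of two of these variants, reporting recall only and assuming a single reference and pre-tokenized input. Full ROUGE implementations also report precision and F-scores and may apply stemming. The function names are hypothetical.

```python
from collections import Counter

def rouge_n_recall(pred, ref, n=1):
    """Fraction of reference n-grams that also occur in the prediction (clipped counts)."""
    pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, pred_counts[g]) for g, c in ref_counts.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(pred, ref):
    """LCS length divided by the reference length."""
    return lcs_length(pred, ref) / len(ref) if ref else 0.0

ref = "the cat sat on the mat".split()
pred = "the cat lay on the mat".split()
print(rouge_n_recall(pred, ref, 1), rouge_n_recall(pred, ref, 2), rouge_l_recall(pred, ref))
```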

## METEOR 

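METEOR (Metric for Evaluation of Translation with Explicit ORdering): a metric for machine translation.

It scores a translation by aligning its unigrams to the reference (matching exact words, stems, synonyms, and paraphrases), taking a harmonic mean of unigram precision and recall that weights recall more heavily, and applying a penalty for fragmented alignments to account for word order.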
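For reference, the scoring in the original 2005 formulation of METEOR can be written as below. Here $m$ is the number of matched unigrams and a chunk is a maximal run of matched unigrams that are adjacent and in the same order in both sentences. Later versions of the metric replace the constants with tunable parameters, so treat these values as those of the original version only.

$$
\begin{aligned}
P &= \frac{m}{\text{len(pred)}}, \qquad R = \frac{m}{\text{len(ref)}}\\[1em]
F_{\text{mean}} &= \frac{10PR}{R+9P}\\[1em]
\text{Penalty} &= 0.5\left(\frac{\text{\# chunks}}{m}\right)^3\\[1em]
\text{METEOR} &= F_{\text{mean}}\,(1-\text{Penalty})
\end{aligned}
$$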

## TER 

TER (Translation Edit Rate) is another evaluation metric for machine translation. It measures the minimum number of edits (insertions, deletions, substitutions, and shifts of word sequences) required to transform a machine-generated translation into one of the reference translations, normalized by the length of the reference translation. The lower the TER score, the closer the machine-generated translation is to the reference translations, indicating better translation quality.
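Written as a formula, with a hypothetical worked case in which a translation needs 3 edits to match a 12-word reference:

$$
\text{TER}=\frac{\text{\# edits}}{\text{\# words in reference}}=\frac{3}{12}=0.25
$$

With several references, the usual convention is to count edits against the closest reference and divide by the average reference length.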