- Dimension: Generated Answer <-> GroundTruth Answer
- Reference: Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark
- Type: Token-wise Accuracy
F1-recall measures the overlap between model-generated responses and ground truth, focusing on the model's ability to reproduce key elements from the reference.
- Tokenization: Both the generated text and ground truth are segmented into token lists using word segmentation tools.
- Calculation: Count the tokens that appear in both the model's output and the ground truth token list, then divide that count by the total number of ground truth tokens.
- Formula: F1-recall = (Number of common tokens) / (Total number of tokens in ground truth)
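
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation from the cited benchmark. It makes two assumptions that the description does not settle. First, the function name `token_recall` is hypothetical, and the example uses whitespace splitting as a stand-in for a real word segmentation tool. Second, a token repeated in both lists is counted up to the smaller of its two counts.

```python
from collections import Counter


def token_recall(generated_tokens, reference_tokens):
    """Return the share of ground truth tokens that also occur in the output.

    Assumption: repeated tokens are matched as a multiset, so each token
    contributes min(count in output, count in ground truth) common tokens.
    """
    if not reference_tokens:
        # Assumption: define the score as 0.0 when the ground truth is empty.
        return 0.0
    common = Counter(generated_tokens) & Counter(reference_tokens)
    return sum(common.values()) / len(reference_tokens)


# Whitespace splitting stands in for a word segmentation tool here.
generated = "the cat sat on the mat".split()
reference = "the cat lay on the rug".split()

score = token_recall(generated, reference)
print(f"F1-recall = {score:.3f}")  # 4 common tokens / 6 ground truth tokens
```

If duplicates should count only once, replacing the two `Counter` objects with sets and dividing by the number of distinct ground truth tokens gives a set-based variant.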