# A quick tour
https://huggingface.co/docs/evaluate/en/a_quick_tour  
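Summary:  
1. There are two ways to evaluate: take the model's outputs and compare them directly with the ground truth, or use the built-in evaluator.
2. The evaluator takes three main arguments: model, data, and metric. You can also specify which column of the data is the model input and which column is the ground truth.
3. All evaluation modules fall into three categories: metric, comparison, and measurement. Each category has its own 'space', where you can browse the available modules together with their definitions and examples.
4. The `features` attribute of an evaluation module shows the expected input data types.
5. One-shot, incremental, and distributed evaluation are supported.
6. Multiple metrics can be combined in a single evaluation.
7. Evaluation results can be visualized.
8. Evaluation results can be saved as a JSON file.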
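
Point 5 distinguishes one-shot from incremental evaluation. The sketch below illustrates the difference in plain Python. `RunningAccuracy` is a hypothetical class, not part of `evaluate`; only its `add_batch`/`compute` method names are borrowed from the library's interface.

```python
class RunningAccuracy:
    """Hypothetical accumulator; illustrates incremental evaluation only."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, references, predictions):
        # Accumulate counts for one batch without computing the score yet.
        for ref, pred in zip(references, predictions):
            self.correct += int(ref == pred)
            self.total += 1

    def compute(self):
        # Final score over everything accumulated so far.
        if self.total == 0:
            raise ValueError("no examples were added")
        return {"accuracy": self.correct / self.total}


# Incremental: feed the data in two batches, then compute once.
metric = RunningAccuracy()
metric.add_batch(references=[0, 1], predictions=[0, 1])
metric.add_batch(references=[1, 0], predictions=[0, 0])
incremental = metric.compute()

# One-shot: the same data passed in a single call.
one_shot = RunningAccuracy()
one_shot.add_batch(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
one_shot_result = one_shot.compute()

print(incremental, one_shot_result)
```

Both routes give the same score; the incremental form is useful when predictions are produced batch by batch and should not all be held in memory.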

There are two common types of question answering:

- extractive: given a question and some context, the answer is a span of text from the context the model must extract  
- abstractive: given a question and some context, the answer is generated from the context; this approach is handled by the Text2TextGenerationPipeline instead of the QuestionAnsweringPipeline shown below

In [2]:
from transformers import pipeline
# Extractive QA:
question_answerer = pipeline(task="question-answering")
preds = question_answerer(
    question="What is the name of the repository?",
    context="The name of the repository is huggingface/transformers",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


score: 0.9327, start: 30, end: 54, answer: huggingface/transformers


In [None]:
from transformers import pipeline
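# Abstractive QA (handled by Text2TextGenerationPipeline):
# Caveat: judging by its name, this checkpoint is fine-tuned for question generation,
# so it may return a question rather than an answer to the prompt below.
generator = pipeline(model="mrm8488/t5-base-finetuned-question-generation-ap")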
generator(
    "The name of the repository is huggingface/transformers. What is the name of the repository?"
)

https://huggingface.co/docs/evaluate/en/base_evaluator


In [None]:
from evaluate import evaluator
from datasets import load_dataset
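# conll2003 + seqeval + a CoNLL-03 NER checkpoint is a token classification setup,
# so the evaluator task must be "token-classification", not "text2text-generation".
task_evaluator = evaluator("token-classification")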
data = load_dataset("conll2003", split="validation[:2]")
results = task_evaluator.compute(
    model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
    data=data,
    metric="seqeval",
)
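
Conceptually, an evaluator ties together the three arguments (model, data, metric). The sketch below is a plain-Python illustration of that idea, not the library's implementation: `simple_evaluate`, `toy_model`, and `toy_accuracy` are hypothetical names, and only the `input_column`/`label_column` defaults are borrowed from the evaluator's signature.

```python
def simple_evaluate(model, data, metric, input_column="text", label_column="label"):
    """Hypothetical helper: run `model` on one column and score against another."""
    inputs = [row[input_column] for row in data]
    references = [row[label_column] for row in data]
    predictions = [model(x) for x in inputs]
    return metric(predictions=predictions, references=references)


def toy_model(text):
    # Stand-in classifier: label 1 if the text contains "good", else 0.
    return int("good" in text)


def toy_accuracy(predictions, references):
    # Share of predictions that equal their reference label.
    matches = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": matches / len(references)}


rows = [
    {"sentence": "a good film", "target": 1},
    {"sentence": "a dull film", "target": 0},
    {"sentence": "not good at all", "target": 0},
]

# The column names differ from the defaults, so they are passed explicitly.
results = simple_evaluate(
    toy_model, rows, toy_accuracy, input_column="sentence", label_column="target"
)
print(results)
```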

In [5]:
import evaluate 
exact_match = evaluate.load("exact_match", module_type="comparison")
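print(exact_match.features)  # inspect the expected input types
# results = exact_match.compute(predictions1="test1", predictions2="test")  # no wonder strings cannot be compared: features are int64
# print(results)
print(exact_match.description)  # show the module description
# Other module attributes: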
# https://huggingface.co/docs/evaluate/en/a_quick_tour Module attributes
# Attribute	Description
# description	A short description of the evaluation module.
# citation	A BibTex string for citation when available.
# features	A Features object defining the input format.
# inputs_description	This is equivalent to the modules docstring.
# homepage	The homepage of the module.
# license	The license of the module.
# codebase_urls	Link to the code behind the module.
# reference_urls	Additional reference URLs.

# features describes the type of a single basic element; in practice collections such as lists are accepted.
# Note that features always describe the type of a single input element. 
# In general we will add lists of elements so you can always think of a list around the types in features. 
# Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and 
# converts them to an appropriate format for storage and computation.

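'''
How does the evaluator know which column is the input and which is the label? Its parameter list has defaults:
    input_column: str = "text",
    label_column: str = "label",

So you can set these yourself, but input_column takes a single column name. With more than one input column, as in your case, you still cannot use the evaluator directly.

You can also specify the tokenizer, among other options.

'''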

{'predictions1': Value(dtype='int64', id=None), 'predictions2': Value(dtype='int64', id=None)}

Returns the rate at which the predictions of one model exactly match those of another model.
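
The description above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming a hypothetical `exact_match_rate` helper; it does not call the `evaluate` module and may differ from it in edge cases. Because it relies on ordinary equality, it works for strings as well as integers.

```python
def exact_match_rate(predictions1, predictions2):
    """Fraction of positions where two models produced the same prediction."""
    if len(predictions1) != len(predictions2):
        raise ValueError("both prediction lists must have the same length")
    if not predictions1:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(predictions1, predictions2))
    return matches / len(predictions1)


# Integer predictions from two models: three of four positions agree.
int_rate = exact_match_rate([0, 1, 1, 0], [0, 1, 0, 0])

# String predictions: only the second position agrees.
str_rate = exact_match_rate(["test1", "cat"], ["test", "cat"])

print(int_rate, str_rate)
```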



# Choosing a metric for your task

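There are generally three ways to choose an evaluation method:
1. Generic metrics: general-purpose evaluation methods
2. Task-specific metrics: evaluation methods chosen according to the task
3. Dataset-specific metrics: evaluation methods chosen according to the dataset, since some datasets come with a dedicated evaluation metric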

Generic metrics: accuracy and precision  
accuracy: the percentage of samples, positive and negative, that are predicted correctly. Only numeric labels are supported.
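
Both generic metrics named above can be computed from simple counts. The sketch below is a plain-Python illustration for binary labels, assuming a hypothetical `accuracy_and_precision` helper and the convention that precision is 0.0 when nothing is predicted positive; it is not the `evaluate` implementation.

```python
def accuracy_and_precision(references, predictions, positive_label=1):
    """Compute accuracy and precision for binary labels from raw counts."""
    if len(references) != len(predictions) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(references, predictions))
    # True positives: predicted positive and actually positive.
    tp = sum(r == positive_label and p == positive_label for r, p in pairs)
    # False positives: predicted positive but actually negative.
    fp = sum(r != positive_label and p == positive_label for r, p in pairs)
    correct = sum(r == p for r, p in pairs)
    accuracy = correct / len(pairs)
    # Precision is undefined with no positive predictions; 0.0 is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "precision": precision}


scores = accuracy_and_precision(
    references=[0, 1, 1, 0, 1],
    predictions=[0, 1, 0, 1, 1],
)
print(scores)
```

Accuracy counts every correct prediction, while precision only looks at the samples predicted positive, so the two can diverge on the same data.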


Task-specific metrics:    
There are many of these; the 'metrics' section of the corresponding task page lists the evaluation methods for that task. https://huggingface.co/tasks   

https://medium.com/@sharathhebbar24/text-generation-v-s-text2text-generation-3a2b235ac19b  
Text Generation, also known as Causal Language Modeling, is the process of generating text that closely resembles human writing.  
Text-to-Text Generation, also known as Sequence-to-Sequence Modeling, is the process of converting one piece of text into another. See examples from T5.