# MCQ (Multiple Choice Question) Evaluation Tutorial

## MCQDataset

In this tutorial, we load a multiple-choice dataset from the HuggingFace Hub, evaluate a model on it, and go through the process up to re-uploading the results.

### 1. Loading the dataset
First, let's look at how to load a dataset from the HuggingFace Hub:

In [1]:
from langmetrics.llmdataset import MCQDataset
from langmetrics.llmtestcase import MCQTestCase
from datasets import load_dataset
import pandas as pd

In [2]:
dataset = load_dataset('sickgpt/001_MedQA_raw')

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'expected_output', 'choices'],
        num_rows: 10178
    })
    test: Dataset({
        features: ['question', 'expected_output', 'choices'],
        num_rows: 1273
    })
})

Now let's load it with MCQDataset.

MCQTestCase always takes three fields: input, choices, and expected_output. In the Dataset above, however, the input is stored in a column named question, so we use the field_mapping argument to map the columns.

In [4]:
MCQTestCase.__annotations__

{'input': str,
 'choices': typing.List[str],
 'expected_output': typing.Union[int, str],
 'output': typing.Optional[str],
 'reasoning': typing.Optional[str]}

In [5]:
# Example usage
field_mapping = {
    'input': 'question',  # map the dataset's 'question' field to 'input'
    'expected_output': 'expected_output',
    'choices': 'choices'
}
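
To make the direction of the mapping concrete, here is a minimal sketch of what a field mapping of this shape amounts to: each key is an MCQTestCase field name and each value is the dataset column it is read from. The helper apply_field_mapping and the sample row are hypothetical and only illustrate the idea; they are not langmetrics' actual implementation.

```python
from typing import Any, Dict


def apply_field_mapping(row: Dict[str, Any], field_mapping: Dict[str, str]) -> Dict[str, Any]:
    """Return a new dict keyed by target field name, read from the mapped source column."""
    return {target: row[source] for target, source in field_mapping.items()}


# Hypothetical row shaped like the dataset above (question / expected_output / choices)
row = {
    "question": "Which vitamin deficiency causes scurvy?",
    "expected_output": "C",
    "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
}
field_mapping = {
    "input": "question",
    "expected_output": "expected_output",
    "choices": "choices",
}

mapped = apply_field_mapping(row, field_mapping)
print(sorted(mapped))  # ['choices', 'expected_output', 'input']
```

Columns whose names already match the target fields still appear in the mapping here, mirroring the cell above.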

In [6]:
dataset = MCQDataset.from_huggingface_hub('sickgpt/001_MedQA_raw', field_mapping=field_mapping)

In [7]:
print(len(dataset))

11451

This count matches 10178 + 1273, which suggests that both the train and test splits are loaded when split is not given. To load only the test split, pass split='test':


In [8]:
test_dataset = MCQDataset.from_huggingface_hub('sickgpt/001_MedQA_raw', field_mapping=field_mapping, split='test')

In [9]:
len(test_dataset)

1273

Now let's run the evaluation.

In [10]:
from langmetrics.llmfactory import LLMFactory

In [None]:
LLMFactory.get_model_list()

['gpt-4o',
 'gpt-4o-mini',
 'deepseek-v3',
 'deepseek-reasoner',
 'claude-3.5-sonnet',
 'claude-3.5-haiku',
 'naver']

In [1]:
from langmetrics.llmfactory import LLMFactory
# Create the LLM model
deepseek_llm = LLMFactory.create_llm('deepseek-v3')

In [2]:
deepseek_r1 = LLMFactory.create_llm('deepseek-reasoner', temperature=None)

In [3]:
from langmetrics.metrics import MCQMetric
metric = MCQMetric(
    answer_model=deepseek_llm,
    template_language='en',  # 'ko' or 'en'
    generate_template_type='reasoning'  # 'reasoning' or 'only_answer'
)

In [None]:
from langmetrics.metrics import MCQMetric
r1_metric = MCQMetric(
    answer_model=deepseek_r1,
    template_language='en',  # 'ko' or 'en'
    generate_template_type='reasoning', # 'reasoning' or 'only_answer'
    verbose_mode=False
)

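We will run inference asynchronously to make it fast. Since Jupyter already runs an event loop, nest_asyncio.apply() is called first so that nested use of the event loop does not raise an error.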

In [16]:
import nest_asyncio
nest_asyncio.apply()
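
Asynchronous inference helps because each LLM request spends most of its time waiting on the network, so many requests can be in flight at once. The following is a minimal sketch of that pattern, not how MCQMetric.ameasure is implemented: fake_llm_call, answer_all, and the max_concurrency limit are all hypothetical names, and the model call is simulated with a short sleep.

```python
import asyncio
from typing import List


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a network-bound LLM request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {prompt}"


async def answer_all(prompts: List[str], max_concurrency: int = 50) -> List[str]:
    """Send all prompts concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in prompts))


prompts = [f"question {i}" for i in range(200)]
answers = asyncio.run(answer_all(prompts))
print(len(answers))  # 200
```

In a notebook cell, `await answer_all(prompts)` would replace the asyncio.run call. The semaphore is there because real APIs usually enforce rate limits.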

In [17]:
print(test_dataset[0])

MCQTestCase(input='A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?', choices=['Disclose the error to the patient and put it in the operative report', 'Tell the attending that he cannot fail to disclose this mistake', 'Report the physician to the ethics committee', 'Refuse to dictate the operative report'], expected_output='B', output=None, reasoning=None)
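
The printed test case stores the answer as a letter (expected_output='B') while choices is a plain list. The sketch below shows one way letters and list positions could line up, assuming 'A' refers to choices[0]; for this example that makes 'B' the second choice. The helpers format_choices and letter_to_choice are hypothetical and are not langmetrics' prompt template.

```python
import string
from typing import List


def format_choices(choices: List[str]) -> str:
    """Label each choice with A, B, C, ... on its own line."""
    return "\n".join(
        f"{letter}. {text}" for letter, text in zip(string.ascii_uppercase, choices)
    )


def letter_to_choice(letter: str, choices: List[str]) -> str:
    """Map an answer letter to the choice text, assuming 'A' is index 0."""
    return choices[string.ascii_uppercase.index(letter.strip().upper())]


choices = [
    "Disclose the error to the patient and put it in the operative report",
    "Tell the attending that he cannot fail to disclose this mistake",
    "Report the physician to the ethics committee",
    "Refuse to dictate the operative report",
]
print(format_choices(choices))
print(letter_to_choice("B", choices))
```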


In [None]:
results = await metric.ameasure(test_dataset)

In [None]:
r1_results = await r1_metric.ameasure(test_dataset[:10])

You can see that the roughly 1,200 test cases (1,273, to be exact) were all inferred in just 30 seconds!

In [27]:
scores = sum([i.score for i in results]) / len(results)

In [None]:
print(scores)

0.8499607227022781
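
Because scores is the mean of the per-item score values, it can be read as accuracy if each score is 1 for a correct answer and 0 otherwise. That is an assumption about MCQMetric's results, not something the output above confirms. Under that assumption, the printed value corresponds to 1082 correct answers out of 1273, which this sketch with a hypothetical FakeResult class checks:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResult:
    """Hypothetical stand-in for one evaluation result; only `score` is modeled."""
    score: int  # assumed: 1 if correct, 0 otherwise


def accuracy(results: List[FakeResult]) -> float:
    """Mean of per-item scores; rejects an empty result list."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.score for r in results) / len(results)


# Rebuild the reported figure: 1273 test items, mean score 0.8499607227022781
n_items = 1273
n_correct = round(0.8499607227022781 * n_items)
fake_results = [FakeResult(1)] * n_correct + [FakeResult(0)] * (n_items - n_correct)
print(n_correct, accuracy(fake_results))
```

The empty-list guard matters in practice: if the evaluation cell fails or returns nothing, dividing by len(results) would raise a ZeroDivisionError.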


In [23]:
r1_scores = sum([i.score for i in r1_results]) / len(r1_results)

In [24]:
r1_scores

0.8

In [None]:
data = {
    "question": [result.question for result in results],
    "predicted": [result.predicted for result in results],
    "language": [result.language for result in results],
    "score": [result.score for result in results],
    "ground_truth": [result.ground_truth for result in results],
    "choice": [", ".join(result.choice) for result in results],  # convert the list to a string
    "reasoning": [result.reasoning for result in results],
    "token_usage": [str(result.token_usage) for result in results]  # convert the dict to a string
}


In [None]:
df = pd.DataFrame(data)
df.to_csv('deepseek_medqa_result.csv', index=False)
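
One thing to watch when flattening results for CSV: joining choices with ", " cannot be reversed reliably if a choice itself contains a comma, and the str() form of a dict is not JSON. Below is a minimal sketch of an alternative that serializes both columns with json.dumps. It uses the standard csv module and an in-memory buffer; the column names mirror the cell above, and the row values are made up for illustration.

```python
import csv
import io
import json

# Hypothetical row; note the comma inside the first choice
rows = [
    {
        "question": "Example question?",
        "choice": ["Aspirin, low dose", "Heparin", "Warfarin"],
        "token_usage": {"prompt_tokens": 120, "completion_tokens": 45},
    }
]

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=["question", "choice", "token_usage"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "question": row["question"],
        "choice": json.dumps(row["choice"]),            # list -> JSON string
        "token_usage": json.dumps(row["token_usage"]),  # dict -> JSON string
    })

# Read the CSV back and decode the JSON columns
buffer.seek(0)
restored = [
    {
        "question": r["question"],
        "choice": json.loads(r["choice"]),
        "token_usage": json.loads(r["token_usage"]),
    }
    for r in csv.DictReader(buffer)
]
print(restored[0]["choice"])
```

With pandas the same idea applies: store json.dumps(...) strings in the column before to_csv, and apply json.loads to that column after pd.read_csv.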