# MCQ (Multiple Choice Question) Evaluation Tutorial

## MCQDataset

In this tutorial, we load a multiple-choice dataset from the Hugging Face Hub, evaluate it, and walk through the steps up to re-uploading the results.

### 1. Loading the dataset
First, let's see how to load a dataset from the Hugging Face Hub:

In [1]:
from langmetrics.llmdataset import LLMDataset
from langmetrics.llmtestcase import LLMTestCase
from datasets import load_dataset
import pandas as pd
from dotenv import load_dotenv

In [2]:
load_dotenv(override=True)

True

In [3]:
dataset = load_dataset('sickgpt/001_MedQA_raw')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'expected_output', 'choices'],
        num_rows: 10178
    })
    test: Dataset({
        features: ['question', 'expected_output', 'choices'],
        num_rows: 1273
    })
})

Now let's load it with LLMDataset.

LLMTestCase expects fixed field names: input, choices, and expected_output. In the Dataset above, however, the input is stored in a column named question, so we use the field_mapping argument to map the columns.

In [5]:
LLMTestCase.__annotations__

{'input': str,
 'output': typing.Optional[str],
 'expected_output': str,
 'context': typing.Optional[typing.List[str]],
 'retrieval_context': typing.Optional[typing.List[str]],
 'reasoning': typing.Optional[str],
 'choices': typing.Optional[str]}

In [6]:
# Example usage
field_mapping = {
    'input': 'question',  # map the dataset's 'question' field to 'input'
    'expected_output': 'expected_output',
    'choices': 'choices'
}
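
To make the direction of the mapping concrete, here is a minimal plain-Python sketch of what such a mapping does to a single record. The helper `apply_field_mapping` and the sample record are hypothetical and exist only for illustration; this is not how langmetrics implements it internally.

```python
def apply_field_mapping(record, field_mapping):
    """Rename dataset columns to test-case field names.

    Keys of field_mapping are target field names; values are source columns.
    (Hypothetical helper for illustration, not part of langmetrics.)
    """
    return {target: record[source] for target, source in field_mapping.items()}


# A made-up record shaped like one row of the dataset above.
record = {
    "question": "Which vitamin deficiency causes scurvy?",
    "expected_output": "C",
    "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
}
field_mapping = {
    "input": "question",
    "expected_output": "expected_output",
    "choices": "choices",
}

mapped = apply_field_mapping(record, field_mapping)
print(sorted(mapped))  # field names after mapping
```

The mapping is read as target field on the left and source column on the right, which matches the `'input': 'question'` entry above.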

In [7]:
dataset = LLMDataset.from_huggingface_hub('sickgpt/001_MedQA_raw', field_mapping=field_mapping)

In [8]:
print(len(dataset))

10178


In [9]:
test_dataset = LLMDataset.from_huggingface_hub('sickgpt/001_MedQA_raw', field_mapping=field_mapping, split='test')

In [10]:
len(test_dataset)

1273

Now let's run the evaluation.

In [11]:
from langmetrics.llmfactory import LLMFactory
from langmetrics.config import ModelConfig

In [12]:
LLMFactory.get_model_list()

['gpt-4o',
 'gpt-4o-mini',
 'deepseek-v3',
 'deepseek-reasoner',
 'claude-3.7-sonnet',
 'claude-3.5-sonnet',
 'claude-3.5-haiku',
 'naver',
 'gemini-2.0-flash']

In [13]:
# Create a custom model configuration
custom_config = ModelConfig(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    api_base="http://qwen3b:8000/v1",
    api_key='EMPTY',
    max_tokens=32000,
    seed=66,
    provider="openai"
)

In [14]:
# A local LLM runs its server locally, so allow some time for it to start up.
custom_llm = LLMFactory.create_llm(custom_config, temperature=1.0)

In [15]:
from langmetrics.llmfactory import LLMFactory
# Create the LLM model
gpt_4o_mini = LLMFactory.create_llm('gpt-4o-mini')

In [16]:
from langmetrics.metrics import MCQMetric
metric = MCQMetric(
    answer_model=custom_llm,
    template_language='en',  # 'ko' or 'en'
    generate_template_type='reasoning'  # 'reasoning' or 'only_answer'
)

We will run inference asynchronously to speed it up.

In [17]:
import nest_asyncio
nest_asyncio.apply()
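
The speed-up from async comes from keeping many network-bound requests in flight at once. The sketch below shows that general pattern with `asyncio.gather` and a semaphore. The names `fake_llm_call` and `run_all` are hypothetical, and the sleep stands in for a real model request; how `ameasure` schedules its calls internally is not shown in this tutorial.

```python
import asyncio
import time


async def fake_llm_call(prompt: str) -> str:
    """Stand-in for a network-bound model request (hypothetical)."""
    await asyncio.sleep(0.05)  # simulate waiting on the server
    return f"answer to {prompt}"


async def run_all(prompts, max_concurrency=8):
    """Run all calls concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(prompt):
        async with semaphore:
            return await fake_llm_call(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(limited(p) for p in prompts))


prompts = [f"question {i}" for i in range(16)]
start = time.perf_counter()
answers = asyncio.run(run_all(prompts))
elapsed = time.perf_counter() - start
print(len(answers), f"{elapsed:.2f}s")
```

Inside a notebook, where an event loop is already running, `await run_all(prompts)` can be used in place of `asyncio.run(...)`.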

In [18]:
print(test_dataset[0])

LLMTestCase(input='A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?', output=None, expected_output='B', context=None, retrieval_context=None, reasoning=None, choices=['Disclose the error to the patient and put it in the operative report', 'Tell the attending that he cannot fail to disclose this mistake', 'Report the physician to the ethics committee', 'Refuse to dictate the operative report'])
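
The test case stores `choices` as a list and `expected_output` as a letter. The sketch below assumes that letters index the list in order (A is the first choice, B the second, and so on); the tutorial does not state this explicitly, and the helpers `format_mcq_prompt` and `choice_for_label` are hypothetical, not the template that MCQMetric uses.

```python
from string import ascii_uppercase


def format_mcq_prompt(question, choices):
    """Render a question followed by lettered choices (A., B., ...)."""
    lines = [question, ""]
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def choice_for_label(label, choices):
    """Return the choice text for a letter label, assuming A = first choice."""
    return choices[ascii_uppercase.index(label.strip().upper())]


choices = [
    "Disclose the error to the patient and put it in the operative report",
    "Tell the attending that he cannot fail to disclose this mistake",
    "Report the physician to the ethics committee",
    "Refuse to dictate the operative report",
]
prompt = format_mcq_prompt(
    "Which of the following is the correct next action for the resident to take?",
    choices,
)
print(prompt)
print(choice_for_label("B", choices))
```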


In [19]:
results = await metric.ameasure(test_dataset[512:1000])

In [20]:
results

LLMDataset(Pandas DataFrame with 488 rows)

In [21]:
# r1_results = await r1_metric.ameasure(test_dataset[:10])

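Thanks to async execution, inference is fast: the original run reported about 30 seconds for roughly 1,200 test cases. (The cell above evaluates only the 488-case slice test_dataset[512:1000].)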

In [22]:
results.df['student_answer'][0]

'{\n    "reasoning": "<Pregnant patients with hyperthyroidism are typically managed by discontinuing methimazole as it can cross the placenta and potentially harm the fetus. Given that this patient\'s TSH is 2.0 μU/mL, which is mildly elevated, they are still within a range that suggests hyperthyroidism but is not dangerously high. Thyroid-stimulating hormone (TSH) should be closely monitored in pregnancy. However, continuation of treatment with a safer alternative, such as methimazole or propylthiouracil, is usually preferred. In conclusion, the most appropriate next step is to discontinue methimazole and start a more suitable antithyroid medication.>",\n    "answer": "A"\n}'
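
Since `student_answer` is a JSON string with `reasoning` and `answer` keys, a score can be derived by parsing it and comparing the answer letter to `expected_output`. The following is a minimal sketch under that assumption; `score_mcq_answer` is a hypothetical helper, not MCQMetric's actual scoring code, and it treats unparseable output as incorrect.

```python
import json


def score_mcq_answer(student_answer: str, expected_output: str) -> int:
    """Return 1 if the parsed answer letter matches expected_output, else 0."""
    try:
        parsed = json.loads(student_answer)
    except json.JSONDecodeError:
        return 0  # malformed model output counts as a wrong answer
    if not isinstance(parsed, dict):
        return 0
    answer = str(parsed.get("answer", "")).strip().upper()
    return int(answer == expected_output.strip().upper())


# Shortened stand-in for the raw output shown above.
raw = '{\n    "reasoning": "<short reasoning>",\n    "answer": "A"\n}'
print(score_mcq_answer(raw, "B"))
print(score_mcq_answer(raw, "A"))
```

This agrees with the first row of the results: the model answered A, the expected output is B, and the score is 0.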

In [23]:
results.df

| | input | student_answer | teacher_answer | expected_output | context | retrieval_context | reasoning | choices | score | metadata |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A 35-year-old woman presents to her primary ca... | `{\n "reasoning": "<Pregnant patients with h...` | | B | | | `<Pregnant patients with hyperthyroidism are ty...` | `[Continue methimazole, Discontinue methimazole...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 1 | A 65-year-old man presents to the emergency de... | `{\n "reasoning": "<Presents a scenario with...` | | D | | | `<Presents a scenario with symptoms of back pai...` | `[Compression fracture, Herniated nucleus pulpo...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 2 | A 3-year-old girl is brought to the physician ... | `{\n "reasoning": "<Pear-shaped multi-flagel...` | | A | | | `<Pear-shaped multi-flagellated organisms in st...` | `[Anaphylactic transfusion reactions, Cutaneous...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 3 | An 8-year-old boy who recently immigrated to t... | `{\n "reasoning": "The patient presents with...` | | D | | | `The patient presents with a pink, ring-like ra...` | `[Atypical lymphocytes on peripheral blood smea...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 4 | A 59-year-old man presents to general medical ... | `{\n "reasoning": "<The patient's cough is m...` | | D | | | `<The patient's cough is most likely due to the...` | `[Change lisinopril to propanolol, Change lisin...` | 1 | `{'student_template_language': 'en', 'student_m...` |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 483 | A 36-year-old woman, gravida 3, para 2, at 42 ... | `{\n "reasoning": "<Pregnant patients with a...` | | C | | | `<Pregnant patients with a high BMI (>30 kg/m²)...` | `[Polyhydramnios, Acute respiratory distress sy...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 484 | The only immunoglobulin found as a dimer has w... | `{\n "reasoning": "The key in identifying th...` | | C | | | `The key in identifying the correct answer is u...` | `[Protect against invasive helminth infection, ...` | 0 | `{'student_template_language': 'en', 'student_m...` |
| 485 | A 47-year-old woman is brought to the emergenc... | `{\n "reasoning": "<Pupil constriction, shal...` | | D | | | `<Pupil constriction, shallow breathing, decrea...` | `[Diabetic ketoacidosis, Diuretic overdose, Hyp...` | 1 | `{'student_template_language': 'en', 'student_m...` |
| 486 | A 72-year-old woman with a 40 pack-year histor... | `{\n "reasoning": "The most appropriate init...` | | B | | | `The most appropriate initial statement should ...` | `["Have you ever heard of pancreatic cancer?", ...` | 1 | `{'student_template_language': 'en', 'student_m...` |


In [24]:
print(len(results))

488


In [25]:
print([i.score for i in results])

[0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 

In [26]:
scores = sum([i.score for i in results]) / len(results)

In [30]:
print(scores)

0.47540983606557374
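
The value above is the plain accuracy, the mean of the 0/1 scores (232 correct out of 488). With a sample of this size it can help to report an uncertainty estimate alongside it. The sketch below computes accuracy with a normal-approximation 95% confidence interval; `accuracy_with_ci` is a hypothetical helper, and the sample scores are just the first ten values from the list printed earlier.

```python
import math


def accuracy_with_ci(scores, z=1.96):
    """Return (accuracy, lower, upper) using a normal-approximation interval."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    p = sum(scores) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


sample_scores = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
acc, low, high = accuracy_with_ci(sample_scores)
print(f"accuracy={acc:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

For the full run (n = 488, accuracy about 0.475) the same formula gives a half-width of roughly 0.04. Because the results expose a `score` column in `results.df`, `results.df['score'].mean()` should give the same accuracy, assuming the column holds the same 0/1 values.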


In [28]:
print(results[0])

LLMResult(input='A 35-year-old woman presents to her primary care provider concerned that she may be pregnant. She has a history of regular menstruation every 4 weeks that lasts about 4 days with mild to moderate bleeding, but she missed her last period 2 weeks ago. A home pregnancy test was positive. She has a 6-year history of hyperthyroidism that is well-controlled with daily methimazole. She is currently asymptomatic and has no complaints or concerns. A blood specimen is taken and confirms the diagnosis. Additionally, her thyroid-stimulating hormone (TSH) is 2.0 μU/mL. Which of the following is the next best step in the management of this patient?', student_answer='{\n    "reasoning": "<Pregnant patients with hyperthyroidism are typically managed by discontinuing methimazole as it can cross the placenta and potentially harm the fetus. Given that this patient\'s TSH is 2.0 μU/mL, which is mildly elevated, they are still within a range that suggests hyperthyroidism but is not dangero

In [29]:
results.df['metadata']

0      {'student_template_language': 'en', 'student_m...
1      {'student_template_language': 'en', 'student_m...
2      {'student_template_language': 'en', 'student_m...
3      {'student_template_language': 'en', 'student_m...
4      {'student_template_language': 'en', 'student_m...
                             ...                        
483    {'student_template_language': 'en', 'student_m...
484    {'student_template_language': 'en', 'student_m...
485    {'student_template_language': 'en', 'student_m...
486    {'student_template_language': 'en', 'student_m...
487    {'student_template_language': 'en', 'student_m...
Name: metadata, Length: 488, dtype: object
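
Each `metadata` entry is a dict, so it can be expanded into separate columns for filtering or grouping. The sketch below uses `pandas.json_normalize` on a small made-up frame. Only `student_template_language` is a key seen in the output above; `hypothetical_key` is invented for illustration because the remaining keys are truncated in the display.

```python
import pandas as pd

# Toy frame shaped like results.df, with a dict-valued metadata column.
df = pd.DataFrame(
    {
        "score": [0, 1],
        "metadata": [
            {"student_template_language": "en", "hypothetical_key": "x"},
            {"student_template_language": "en", "hypothetical_key": "y"},
        ],
    }
)

# Flatten the dicts into columns, then align on the original index.
meta = pd.json_normalize(df["metadata"].tolist())
meta.index = df.index
expanded = pd.concat([df.drop(columns="metadata"), meta], axis=1)
print(expanded.columns.tolist())
```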