## Summary

### Project summary

This project builds a response classification system based on a **Student–Teacher Framework**:

- **Student**: several binary classifiers, each labeling the response on one narrow criterion (for example yes/no, known/unknown, supported/contradictory).
- **Teacher**: a general three-label model (no, intrinsic, extrinsic).

How it works:

- If the Student models agree with one another → use the Student result.
- If the Student models disagree → use the Teacher result.

This approach aims to improve accuracy by combining the strengths of the Students (fine-grained, specialized) and the Teacher (general, stable).

### Student → Teacher label mapping

Each Student classifies one narrow aspect, and its output is then mapped onto the Teacher's final labels (three labels: **no, intrinsic, extrinsic**):

- **Student 1 (yes/no):**

  - `yes` → **extrinsic** or **intrinsic**
  - `no` → **no**

- **Student 2 (known/unknown):**

  - `known` → **no** or **intrinsic**
  - `unknown` → **extrinsic**

- **Student 3 (supported/contradictory):**
  - `supported` → **no** or **extrinsic**
  - `contradictory` → **intrinsic**

The system will:

- Prefer the Student result when **the Students agree**, that is, when their three outputs are compatible with exactly one Teacher label.
- Fall back to the **Teacher model** when **the Students contradict one another**.
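
One way to read the mapping is as a set intersection: each Student output narrows the set of possible Teacher labels, and a consensus exists when exactly one label survives. The sketch below illustrates that reading; the names `STUDENT_TO_TEACHER` and `resolve` are hypothetical and are not part of the notebook, which encodes the same three consensus cases with explicit `if` branches.

```python
# Candidate Teacher labels implied by each Student output (from the mapping above).
STUDENT_TO_TEACHER = {
    "yes": {"intrinsic", "extrinsic"},
    "no": {"no"},
    "known": {"no", "intrinsic"},
    "unknown": {"extrinsic"},
    "supported": {"no", "extrinsic"},
    "contradictory": {"intrinsic"},
}


def resolve(student_labels, teacher_label):
    """Return the Students' consensus label, or the Teacher's label if there is none."""
    candidates = {"no", "intrinsic", "extrinsic"}
    for label in student_labels:
        candidates &= STUDENT_TO_TEACHER[label]
    # Consensus means exactly one Teacher label survives the intersection.
    if len(candidates) == 1:
        return next(iter(candidates))
    return teacher_label


print(resolve(["no", "known", "supported"], "intrinsic"))   # no
print(resolve(["yes", "known", "supported"], "intrinsic"))  # intrinsic (Teacher fallback)
```

Of the eight possible Student combinations, only three leave a single candidate; the other five leave an empty set and defer to the Teacher.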

## Global variables

In [1]:
FULL_LABEL_MODEL_PATH = "/kaggle/input/prime-dsc-models/full_label/full_label"
KNOWN_UNKNOWN_MODEL_PATH = "/kaggle/input/prime-dsc-models/known_unknown/known_unknown"
SUPPORTED_CONTRADICTORY_MODEL_PATH = "/kaggle/input/prime-dsc-models/supported_contradictory/supported_contradictory"
TRUTH_HALLUCINATION_MODEL_PATH = "/kaggle/input/prime-dsc-models/truth_hallucination/truth_hallucination"
PRIVATE_TEST_DATASET_PATH = "/kaggle/input/dsc-2025-llm-hallucination/vihallu-private-test.csv"
SUBMITTED_FILE_PATH = "/kaggle/input/dsc-2025-llm-hallucination/submited_file.csv"

In [2]:
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"

## Install and import necessary libraries

In [3]:
!pip install -q pandas==2.2.3 torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 tqdm==4.67.1

In [4]:
!pip install -q transformers==4.57.0 datasets==4.1.1 accelerate==1.10.1 evaluate==0.4.6 peft==0.17.1 bitsandbytes==0.48.1

In [5]:
import re 
import gc

from tqdm import tqdm
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch



## Base model and utilities

In [6]:
df = pd.read_csv(PRIVATE_TEST_DATASET_PATH)

In [7]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=quant_config,
    device_map="auto"
).eval()

tokenizer.pad_token = tokenizer.eos_token


In [8]:
def clean_label(text: str) -> str:
    # Remove every opening/closing tag of the form <...> or </...>
    text = re.sub(r"<[^>]+>", "", text)

    # Strip leading and trailing whitespace
    text = text.strip()
    return text
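
A quick illustration of what the helper does to typical model output, plus a stricter variant. `normalize_label` is a hypothetical helper that is not used anywhere in this notebook; it only sketches how casing and trailing punctuation could be tolerated before the label check.

```python
import re


def clean_label(text: str) -> str:
    # Same logic as the notebook's helper: drop <...> tags, then trim whitespace.
    return re.sub(r"<[^>]+>", "", text).strip()


def normalize_label(text: str, valid_labels):
    # Hypothetical stricter variant: lowercase, drop trailing periods,
    # and return None when the result is not an allowed label.
    label = clean_label(text).lower().rstrip(".")
    return label if label in valid_labels else None


print(clean_label("<label>intrinsic</label>\n"))  # intrinsic
print(normalize_label(" No. ", ["no", "yes"]))    # no
print(normalize_label("maybe", ["no", "yes"]))    # None
```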

In [9]:
@torch.no_grad()
def predict_one(messages, model, max_new_tokens: int = 10):
    # Build the prompt with the chat template and append the assistant turn marker
    prompt_ids = tokenizer.apply_chat_template(
        conversation=messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    out = model.generate(
        input_ids=prompt_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
        temperature=0.0,  # has no effect when do_sample=False
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=[tokenizer.eos_token_id]
    )

    gen_ids = out[0, prompt_ids.shape[-1]:]  # keep only the newly generated tokens
    text = tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
    text = clean_label(text)
    return text

## TRUTH VS HALLUCINATION MODEL (no vs extrinsic / intrinsic)

In [10]:
# Load the adapter from the local checkpoint
truth_hallucination_model = PeftModel.from_pretrained(
    base_model,
    TRUTH_HALLUCINATION_MODEL_PATH
)

# Setup pad token
truth_hallucination_model.config.pad_token_id = tokenizer.pad_token_id
truth_hallucination_model.generation_config.pad_token_id = tokenizer.pad_token_id

In [11]:
truth_hallucination_model_system_prompt = """You are an expert AI assistant specializing in detecting hallucinations in Vietnamese language model outputs. Your task is to analyze a given Context and Response to determine if the Response contains any hallucinations relative to the Context.

First, think step-by-step. Carefully analyze the Response and compare it sentence-by-sentence against the information provided in the Context. Identify any contradictions, distortions, or new information that cannot be inferred from the source.

After your step-by-step analysis, you must classify the Response into one of two categories:

1.  **no**: The Response is fully consistent with and factually supported by the information in the Context. It does not introduce any information that cannot be directly inferred from the source text.
2.  **yes**: The Response contains a hallucination. This means it either:
    * Directly contradicts or distorts information present in the Context.
    * Introduces additional information that is NOT present in the Context and cannot be inferred from it.

Based on your conclusion, you must output ONLY the label: <label>. Do not provide any explanations or additional text.
"""

In [12]:
def truth_hallucination_build_messages(context: str, response: str):
    prompt = f"""### Bối cảnh: {context} 
                
### Phản hồi: {response}
    
**Nhiệm vụ:** Dựa vào **Bối cảnh**, hãy xác định xem **Phản hồi** có chứa ảo giác hay không và phân loại nó. Chỉ trả lời bằng một trong hai nhãn sau: `no`, `yes`.
"""
    return [
        {"role": "system", "content": truth_hallucination_model_system_prompt},
        {"role": "user", "content": prompt}
    ]

In [13]:
truth_hallucination_preds = []
truth_hallucination_valid_labels = ["no", "yes"]

for _, row in tqdm(df.iterrows(), total=len(df)):
    messages = truth_hallucination_build_messages(row["context"], row["response"])
    label = predict_one(messages, truth_hallucination_model)
    if label in truth_hallucination_valid_labels:
        truth_hallucination_preds.append({"id": row["id"], "predict_label": label})
    else:
        err_id = row["id"]
        print(f"ERROR: {err_id} - Response: {label}")

truth_hallucination_df = pd.DataFrame(truth_hallucination_preds)

  0%|          | 0/2000 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 2000/2000 [35:34<00:00,  1.07s/it]


In [14]:
truth_hallucination_df['predict_label'].value_counts()

predict_label
yes    1341
no      659
Name: count, dtype: int64
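
The inference loop skips any row whose generated text is not a valid label, so that `id` would be missing from the prediction frame and would later drop out of the inner merges. In this run the counts add up to 2000, so nothing was lost. The sketch below shows one way to keep such rows visible instead; `collect_predictions` and the toy stand-ins for the prompt builder and model call are assumptions for illustration, not code from the notebook.

```python
def collect_predictions(rows, build_messages, predict, valid_labels):
    """Run `predict` on every row and keep one record per id, valid or not."""
    records, invalid_ids = [], []
    for row in rows:
        messages = build_messages(row["context"], row["response"])
        label = predict(messages)
        if label not in valid_labels:
            invalid_ids.append(row["id"])
            label = None  # keep the row; the caller decides how to fall back
        records.append({"id": row["id"], "predict_label": label})
    return records, invalid_ids


# Toy stand-ins for the prompt builder and the model call (illustrative only).
rows = [
    {"id": "a", "context": "c1", "response": "r1"},
    {"id": "b", "context": "c2", "response": "r2"},
]
fake_outputs = {"r1": "yes", "r2": "yes, because"}
records, invalid_ids = collect_predictions(
    rows,
    build_messages=lambda context, response: response,
    predict=lambda messages: fake_outputs[messages],
    valid_labels=["no", "yes"],
)
print(invalid_ids)  # ['b']
```

Because the same loop is repeated for all four models, a helper of this shape could be reused with each model's prompt builder and label list.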

In [15]:
del truth_hallucination_model

In [16]:
gc.collect()

64

## KNOWN VS UNKNOWN (extrinsic vs no / intrinsic) 

In [17]:
# Load the adapter from the local checkpoint
known_unknown_model = PeftModel.from_pretrained(
    base_model,
    KNOWN_UNKNOWN_MODEL_PATH
)

# Setup pad token
known_unknown_model.config.pad_token_id = tokenizer.pad_token_id
known_unknown_model.generation_config.pad_token_id = tokenizer.pad_token_id



In [18]:
known_unknown_model_system_prompt = """You are an expert AI assistant specializing in information verifiability. Your task is to determine if a given Response can be fully judged (either confirmed or contradicted) based **solely** on the information provided in a Context.

First, think step-by-step. Carefully read the Response and compare it to the information presented in the Context. Your goal is to assess whether the Context contains enough information to make a definitive judgment about the Response's truthfulness.

After your step-by-step analysis, you must classify the Response into one of two categories:

1.  **known**: The truthfulness of the Response (whether it is true or false) can be determined entirely by using the information within the Context. All necessary information is **intrinsic** to the provided text. This applies when the Response is either fully supported by or directly contradicts the Context.
2.  **unknown**: It is impossible to determine whether the Response is true or false based only on the Context. The Response contains information that is not mentioned or cannot be inferred from the Context, requiring **extrinsic** or external knowledge for verification.

Based on your conclusion, you must output ONLY the label: `known` or `unknown`. Do not provide any explanations or additional text.
"""

In [19]:
def known_unknown_build_messages(context: str, response: str):
    prompt = f"""### Bối cảnh: {context}

### Phản hồi: {response}

**Nhiệm vụ:** Dựa vào **Bối cảnh**, hãy xác định xem tính đúng sai của **Phản hồi** có thể được kiểm chứng hoàn toàn từ thông tin có sẵn hay không.

Chỉ trả lời bằng một trong hai nhãn sau: `known` hoặc `unknown`.
"""
    return [
        {"role": "system", "content": known_unknown_model_system_prompt},
        {"role": "user", "content": prompt}
    ]

In [20]:
known_unknown_preds = []
known_unknown_valid_labels = ["known", "unknown"]

for _, row in tqdm(df.iterrows(), total=len(df)):
    messages = known_unknown_build_messages(row["context"], row["response"])
    label = predict_one(messages, known_unknown_model)
    if label in known_unknown_valid_labels:
        known_unknown_preds.append({"id": row["id"], "predict_label": label})
    else:
        err_id = row["id"]
        print(f"ERROR: {err_id} - Response: {label}")

known_unknown_df = pd.DataFrame(known_unknown_preds)

100%|██████████| 2000/2000 [37:23<00:00,  1.12s/it]


In [21]:
known_unknown_df['predict_label'].value_counts()

predict_label
known      1358
unknown     642
Name: count, dtype: int64

In [22]:
del known_unknown_model

In [23]:
gc.collect()

0

## SUPPORTED VS CONTRADICTORY (intrinsic vs no / extrinsic)

In [24]:
# Load the adapter from the local checkpoint
supported_contradictory_model = PeftModel.from_pretrained(
    base_model,
    SUPPORTED_CONTRADICTORY_MODEL_PATH
)

# Setup pad token
supported_contradictory_model.config.pad_token_id = tokenizer.pad_token_id
supported_contradictory_model.generation_config.pad_token_id = tokenizer.pad_token_id



In [25]:
supported_contradictory_model_system_prompt = """You are an expert AI assistant specializing in evaluating the factual consistency between a Vietnamese response and its source context. Your task is to analyze a given Context and Response to determine if the Response contradicts the information provided in the Context.

First, think step-by-step. Carefully compare the claims made in the Response against the information available in the Context. Your primary goal is to identify any direct contradictions or misrepresentations.

After your step-by-step analysis, you must classify the Response into one of two categories:

1.  **supported**: The Response is consistent with the Context. It either (a) only contains information present in the Context, or (b) introduces new information that does **not** contradict the Context.
2.  **contradictory**: The Response contains information that directly contradicts or misrepresents information explicitly stated in the Context.

Based on your conclusion, you must output ONLY the label: `supported` or `contradictory`. Do not provide any explanations or additional text.
"""

In [26]:
def supported_contradictory_build_messages(context: str, response: str):
    prompt = f"""### Bối cảnh: {context}

### Phản hồi: {response}

**Nhiệm vụ:** Dựa vào **Bối cảnh**, hãy xác định xem **Phản hồi** có **mâu thuẫn** với thông tin trong Bối cảnh hay không. Chỉ trả lời bằng một trong hai nhãn sau: `supported` hoặc `contradictory`.
"""
    return [
        {"role": "system", "content": supported_contradictory_model_system_prompt},
        {"role": "user", "content": prompt}
    ]

In [27]:
supported_contradictory_preds = []
supported_contradictory_valid_labels = ["supported", "contradictory"]

for _, row in tqdm(df.iterrows(), total=len(df)):
    messages = supported_contradictory_build_messages(row["context"], row["response"])
    label = predict_one(messages, supported_contradictory_model)
    if label in supported_contradictory_valid_labels:
        supported_contradictory_preds.append({"id": row["id"], "predict_label": label})
    else:
        err_id = row["id"]
        print(f"ERROR: {err_id} - Response: {label}")

supported_contradictory_df = pd.DataFrame(supported_contradictory_preds)

100%|██████████| 2000/2000 [38:11<00:00,  1.15s/it]


In [28]:
supported_contradictory_df['predict_label'].value_counts()

predict_label
supported        1325
contradictory     675
Name: count, dtype: int64

In [29]:
del supported_contradictory_model

In [30]:
gc.collect()

0

## FULL 3 LABELS

In [31]:
# Load the adapter from the local checkpoint
full_model = PeftModel.from_pretrained(
    base_model,
    FULL_LABEL_MODEL_PATH
)

# Setup pad token
full_model.config.pad_token_id = tokenizer.pad_token_id
full_model.generation_config.pad_token_id = tokenizer.pad_token_id



In [32]:
full_model_system_prompt = """You are an expert AI assistant specializing in detecting hallucinations in Vietnamese language model outputs. Your task is to analyze a given Context and Response to determine if the Response contains hallucinations relative to the Context.

First, think step-by-step. Carefully analyze the Response and compare it sentence-by-sentence against the information provided in the Context. Identify any contradictions, distortions, or new information that cannot be inferred from the source.

After your step-by-step analysis, you must classify the Response into one of three categories:

1.  **no**: The Response is fully consistent with and factually supported by the information in the Context. It does not introduce any information that cannot be directly inferred from the source text.
2.  **intrinsic**: The Response directly contradicts or distorts information that is explicitly present in the Context. The hallucinated content is based on entities or concepts from the context but presents them inaccurately.
3.  **extrinsic**: The Response introduces additional information that is NOT present in the Context and cannot be inferred from it. This information might be true in the real world, but it is not supported by the provided source text.

Based on your conclusion, you must output ONLY the label: <label>. Do not provide any explanations or additional text.
"""

In [33]:
def full_build_messages(context: str, response: str):
    prompt = f"""### Bối cảnh: {context} 
                
### Phản hồi: {response}
    
**Nhiệm vụ:** Dựa vào **Bối cảnh**, hãy xác định xem **Phản hồi** có chứa ảo giác hay không và phân loại nó. Chỉ trả lời bằng một trong ba nhãn sau: `no`, `intrinsic`, `extrinsic`.
"""
    return [
        {"role": "system", "content": full_model_system_prompt},
        {"role": "user", "content": prompt}
    ]

In [34]:
full_preds = []
full_valid_labels = ["no", "intrinsic", "extrinsic"]

for _, row in tqdm(df.iterrows(), total=len(df)):
    messages = full_build_messages(row["context"], row["response"])
    label = predict_one(messages, full_model)
    if label in full_valid_labels:
        full_preds.append({"id": row["id"], "predict_label": label})
    else:
        err_id = row["id"]
        print(f"ERROR: {err_id} - Response: {label}")

full_df = pd.DataFrame(full_preds)

100%|██████████| 2000/2000 [1:01:04<00:00,  1.83s/it]


In [35]:
full_df['predict_label'].value_counts()

predict_label
no           690
intrinsic    668
extrinsic    642
Name: count, dtype: int64

In [36]:
del full_model

In [37]:
gc.collect()

0

## FINAL DECISIONS

In [38]:
predict_df = truth_hallucination_df.merge(known_unknown_df, on="id", suffixes=("_m1", "_m2"))
predict_df = predict_df.merge(supported_contradictory_df, on="id")
predict_df = predict_df.rename(columns={"predict_label": "predict_label_m3"})
predict_df = predict_df.merge(full_df, on="id")
predict_df = predict_df.rename(columns={"predict_label": "predict_label_m4"})

In [39]:
predict_df.head()

| | id | predict_label_m1 | predict_label_m2 | predict_label_m3 | predict_label_m4 |
|---|---|---|---|---|---|
| 0 | ef35e7a1-766f-4455-b349-09084a1f56ba | no | known | supported | no |
| 1 | 85aac4aa-e53c-4d01-bdcf-e8b27a2cd9bc | yes | known | supported | intrinsic |
| 2 | cd056e1b-51f6-4adf-939c-e742645437bf | no | known | supported | intrinsic |
| 3 | 6cc80aa4-44db-4366-8cb6-16453aa8ecbf | yes | known | contradictory | intrinsic |
| 4 | 948849cb-0c83-4dc1-b5ae-aaa585a8d528 | yes | known | contradictory | intrinsic |
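
The chained merges rely on suffixes and follow-up renames to produce the `_m1` to `_m4` columns. An alternative sketch, using toy frames rather than the real predictions, renames each frame first and asks pandas to check that every `id` appears once per frame. The frame names and the `frames` dictionary are illustrative assumptions.

```python
import pandas as pd

# Toy per-model outputs with the same shape as the notebook's prediction frames.
m1 = pd.DataFrame({"id": ["a", "b"], "predict_label": ["no", "yes"]})
m2 = pd.DataFrame({"id": ["a", "b"], "predict_label": ["known", "unknown"]})

# Rename before merging so no suffix bookkeeping is needed afterwards.
frames = {"m1": m1, "m2": m2}
renamed = [
    frame.rename(columns={"predict_label": f"predict_label_{name}"})
    for name, frame in frames.items()
]

merged = renamed[0]
for frame in renamed[1:]:
    # validate="one_to_one" raises pandas.errors.MergeError on duplicate ids.
    merged = merged.merge(frame, on="id", how="inner", validate="one_to_one")

print(list(merged.columns))  # ['id', 'predict_label_m1', 'predict_label_m2']
print(len(merged))           # 2
```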


In [40]:
def final_label(row):
    m1, m2, m3 = row["predict_label_m1"], row["predict_label_m2"], row["predict_label_m3"]
    # The three Students agree on exactly one Teacher label
    if m1 == "no" and m2 == "known" and m3 == "supported":
        return "no"
    elif m1 == "yes" and m2 == "unknown" and m3 == "supported":
        return "extrinsic"
    elif m1 == "yes" and m2 == "known" and m3 == "contradictory":
        return "intrinsic"
    # No consensus: fall back to the Teacher (full 3-label model)
    return row["predict_label_m4"]

In [41]:
predict_df["predict_label"] = predict_df.apply(final_label, axis=1)

In [42]:
predict_df.head()

| | id | predict_label_m1 | predict_label_m2 | predict_label_m3 | predict_label_m4 | predict_label |
|---|---|---|---|---|---|---|
| 0 | ef35e7a1-766f-4455-b349-09084a1f56ba | no | known | supported | no | no |
| 1 | 85aac4aa-e53c-4d01-bdcf-e8b27a2cd9bc | yes | known | supported | intrinsic | intrinsic |
| 2 | cd056e1b-51f6-4adf-939c-e742645437bf | no | known | supported | intrinsic | no |
| 3 | 6cc80aa4-44db-4366-8cb6-16453aa8ecbf | yes | known | contradictory | intrinsic | intrinsic |
| 4 | 948849cb-0c83-4dc1-b5ae-aaa585a8d528 | yes | known | contradictory | intrinsic | intrinsic |


In [43]:
# Copy the slice so the later .loc assignment does not write to a view of predict_df
conflict_df = predict_df[predict_df["predict_label_m4"] != predict_df["predict_label"]].copy()
print("Number of mismatched predicted labels: ", len(conflict_df))

Number of mismatched predicted labels:  52


In [44]:
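# Override: where the Students' consensus is `no` but the Teacher says `extrinsic`, keep `extrinsic`
condition = (conflict_df['predict_label_m4'] == 'extrinsic') & (conflict_df['predict_label'] == 'no')
conflict_df.loc[condition, 'predict_label'] = 'extrinsic'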

In [45]:
# Count the rows that match the condition
num_rows = condition.sum()
print("Number of rows changed after analyzing the results:", num_rows)

Number of rows changed after analyzing the results: 5
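
Taken together, the consensus rule, the Teacher fallback, and the `extrinsic` override form a single decision per row. The sketch below folds them into one function working on plain strings; `decide` is a hypothetical name, and the notebook itself applies these steps through `final_label` and a later `.loc` update.

```python
def decide(m1: str, m2: str, m3: str, m4: str) -> str:
    """Student consensus first, Teacher fallback, then the extrinsic override."""
    consensus = {
        ("no", "known", "supported"): "no",
        ("yes", "unknown", "supported"): "extrinsic",
        ("yes", "known", "contradictory"): "intrinsic",
    }
    label = consensus.get((m1, m2, m3), m4)
    # Override: a Student consensus of "no" yields to a Teacher "extrinsic".
    if label == "no" and m4 == "extrinsic":
        return "extrinsic"
    return label


print(decide("no", "known", "supported", "intrinsic"))  # no
print(decide("no", "known", "supported", "extrinsic"))  # extrinsic
```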


In [46]:
df_to_update = predict_df.set_index('id')
df_with_updates = conflict_df.set_index('id')
df_to_update.update(df_with_updates)
final_df = df_to_update.reset_index()

final_df.drop(columns=["predict_label_m1", "predict_label_m2", "predict_label_m3", "predict_label_m4"], inplace=True)
final_df.to_csv("submit.csv", index=False)

## CHECKING WITH SUBMITTED FILE

In [47]:
submitted_df = pd.read_csv(SUBMITTED_FILE_PATH)

In [48]:
isSame = final_df.equals(submitted_df)
print("Does the reproduced file match the submitted file: ", isSame)

Does the reproduced file match the submitted file:  False


In [49]:
if not isSame:
    print("---Differing rows---")
    mask = final_df["predict_label"] != submitted_df["predict_label"]
    
    diff = pd.concat(
        [final_df[mask].add_suffix("_df1"), submitted_df[mask].add_suffix("_df2")],
        axis=1
    )
    
    print(diff)

---Differing rows---
                                   id_df1 predict_label_df1  \
230  c5f3c48b-80f9-437c-839e-b188430a63f8         intrinsic   

                                   id_df2 predict_label_df2  
230  c5f3c48b-80f9-437c-839e-b188430a63f8         extrinsic  


Note: only **1 of 2000 rows changed**. The cause may be randomness or inherent characteristics of the LLM. Even so, a difference of just **1 row out of 2000** (0.05%) indicates that the system is **highly stable and consistent** across runs.
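
For the run above, 1999 of 2000 labels match, an agreement rate of 99.95%. The sketch below shows how such a rate could be computed by `id` rather than by row position, which stays correct even if the two files are ordered differently. The `agreement` helper and the toy data are assumptions for illustration and do not use the competition files.

```python
def agreement(reproduced: dict, submitted: dict):
    """Compare two {id: label} mappings and list the ids whose labels differ."""
    shared_ids = reproduced.keys() & submitted.keys()
    if not shared_ids:
        raise ValueError("the two mappings share no ids")
    differing = sorted(i for i in shared_ids if reproduced[i] != submitted[i])
    rate = 1 - len(differing) / len(shared_ids)
    return rate, differing


# Toy data: 4 ids, one changed label.
reproduced = {"a": "no", "b": "intrinsic", "c": "extrinsic", "d": "no"}
submitted = {"a": "no", "b": "extrinsic", "c": "extrinsic", "d": "no"}
rate, differing = agreement(reproduced, submitted)
print(rate)       # 0.75
print(differing)  # ['b']
```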