<a href="https://colab.research.google.com/github/j-chim/QMUL-Thesis-Draft/blob/main/cl_synthetic_data_evaluation_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About
This notebook contains example intrinsic evaluation code described in our paper: [Evaluating Synthetic Data Generation from User Generated Text (Chim et al., CL 2024)](https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00540/124625).

# Setup
The notebook is split by evaluation aspect - meaning, style, divergence.

Most sections can be directly ran with little setup. However, for style evaluation, you will need to use the idiolect model trained by [Zhu and Jurgen, 2021](https://aclanthology.org/2021.emnlp-main.25/) or obtain alternative style-sensitive embeddings. Our paper uses the following weights, re-saved from Zhu and Jurgen's model for software version compatibility: https://drive.google.com/file/d/1SXSlp4K9sM5EOhiwkP-XjkUZIzc9worB/view?usp=sharing. Ensure you save this in your drive (if running directly on colab) or download it for offline use.



## Kết nối Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd drive/MyDrive/Thesis/Evaluate_test_dataset

/content/drive/MyDrive/Thesis/Evaluate_test_dataset


| Tiêu chí | Question | Answers |
|-------|-------| -------|
| Mean preservation | 0.84798 | 0.8497 |
| Style preservation (idiolect) | 0.98423 | 0.9877 |
| POS based (Word POS Score) | 0.78835 | 0.81667 |
| POS based (Triagram POS Score) | 0.29438 | 0.36346 |
| Divergence (BLEU) | 5.5367 | 20.3583 |
| Fréchet Distance | 0 | 0.00333 |
| Compare individual POS tags at distribution level | 0.0826 | 0.096467 |
| Compare POS trigrams at distribution level | 0.0710 | 0.0603 |
| Compute divergence of character n-grams | 0.0914 | 0.0659 |


## Đọc dữ liệu

In [3]:
import pandas as pd

# Đọc dữ liệu từ file Excel
df_all = pd.read_excel("./mental_health_data_official.xlsx", engine='openpyxl')
df_part1 = df_all.iloc[0:2000]
df_part2 = df_all.iloc[9000:]

# Gộp lại thành một DataFrame mới
original_df = pd.concat([df_part1, df_part2], ignore_index=True)

test_df = pd.read_excel("./generated_test_dataset_question_answers_best_answer.xlsx", engine='openpyxl')

In [4]:
original_df = original_df[0:6000]
# original_df = original_df[7000:7010]

In [5]:
original_df

Unnamed: 0,question,answers,labels,best_answer
0,"Cậu có tin rằng, dù là ai, cũng đều xứng đáng ...","['Có chứ, chỉ là đôi khi, người ta quên mất đi...",Cảm xúc,"Có chứ, chỉ là đôi khi, người ta quên mất điều..."
1,"Mình không biết ngày mai sẽ ra sao, chỉ thấy t...",['Không trở thành ngôi sao hay mặt trời trong ...,Chữa lành,Không trở thành ngôi sao hay mặt trời trong mắ...
2,Có lúc tớ cảm thấy mình vô hình trong thế giới...,"['Tớ biết dạo gần đây cậu cũng rất mệt, nhưng ...",Chữa lành,"Tớ biết dạo gần đây cậu cũng rất mệt, nhưng mà..."
3,Em luôn cảm thấy mình không đủ xinh đẹp để đượ...,['Em đừng tự ti vì bản thân mình không xinh đẹ...,Cảm xúc,Em đừng tự ti vì bản thân mình không xinh đẹp ...
4,Tớ không thể chịu đựng thêm được nữa. Tớ chỉ m...,"['Gửi cậu, người vừa có ý định rời đi...\nKhi ...",Cảm xúc,"Gửi cậu, người vừa có ý định rời đi...\nKhi nỗ..."
...,...,...,...,...
5995,Bạn có tin vào tình yêu thật sự có xuất hiện t...,['T thường xem trên phim ảnh về tình yêu đích ...,Tình yêu và hôn nhân,Bạn có biết chuyện tình của nữ hoàng Anh không...
5996,Được phụ nữ khen dễ thương\nTối nay em đi trên...,"['Hông biết với những bạn khác thì sao, nhưng ...",Cảm xúc,À đầy người hiền lành nhưng cũng chả thấy ai k...
5997,Làm Thế Nào Để Tiếp Cận Được Anh Trai Khối Trê...,['Có duyên sẽ gặp. Cứ chuẩn bị cho mình một tâ...,Tình yêu và hôn nhân,Thích thì viết một phong thơ! Nhờ đường bưu đi...
5998,Năm cấp 3 mình có gặp một anh tên Khôi. Là ảnh...,['Mình nghĩ nó không phải là thích hay yêu đâu...,Tình yêu và hôn nhân,Mình nghĩ nó không phải là thích hay yêu đâu b...


In [None]:
test_df

Unnamed: 0,question,answers,best_answer
0,Mình mới bị từ chối công việc mơ ước. Cảm giác...,"['Ôi, mình hiểu cảm giác của bạn. Bị từ chối c...","Ôi, mình hiểu cảm giác của bạn. Bị từ chối côn..."
1,"Mình mới bị bồ đá, cảm thấy như cả thế giới sụ...",['Thật sự chia buồn với bạn. Mất đi một người ...,Thật sự chia buồn với bạn. Mất đi một người mì...
2,Dạo này mình hay bị overthinking về những chuy...,['Tớ hiểu cảm giác đó. Overthinking nó cứ như ...,Tớ hiểu cảm giác đó. Overthinking nó cứ như mộ...
3,Mình mới bị sếp mắng vì một lỗi rất ngớ ngẩn t...,"['Ôi, ai mà chẳng có lúc mắc lỗi phải không bạ...","Ôi, ai mà chẳng có lúc mắc lỗi phải không bạn?..."
4,"Mình mới bị công ty sa thải, về nhà chẳng dám ...","['Chào bạn, mình hiểu cảm giác của bạn lúc này...","Chào bạn, mình hiểu cảm giác của bạn lúc này. ..."
...,...,...,...
5995,"Mình mới ra trường, xin được một công việc khá...","['Có thể bạn đang trải qua giai đoạn ""khủng ho...","Tớ hiểu cảm giác của bạn. Sau khi ra trường, t..."
5996,Mình vừa cãi nhau to với mẹ. Chuyện là mình mu...,"['Mẹ nào rồi cũng thương con thôi, có thể mẹ b...",Tớ hiểu cảm giác của bạn. Đôi khi khoảng cách ...
5997,Mình vừa trải qua một buổi phỏng vấn xin việc ...,['Đừng quá khắt khe với bản thân. Ai cũng có n...,Đừng tự trách mình quá nhiều nhé. Ai trong chú...
5998,Mình vừa bị sếp mắng vì cái báo cáo mình làm s...,['Đừng vội quy chụp cho mình là vô dụng nhé. A...,Có thể bạn đang bị stress hoặc thiếu ngủ đấy. ...


# **Lưu ý**: chỗ ghi **Eng ver** là code của paper

Dữ liệu mẫu của code English version

In [None]:
# # Example synthetic texts from a single source, varying in style and meaning similarity

# synthetic_texts = [
#     "The nimble brown fox hops across the sleepy dog.",
#     "the lazy canine was lying around when it got jumped over by a quick-moving brown fox!",
#     "despite heavy rain yesterday evening, remember to water those plants!"
# ]

# original_texts = [
#     "The quick brown fox jumps over the lazy dog."
# ] * len(synthetic_texts)


# Meaning

In [None]:
# the main reported metric is BERTScore, which is conveniently run using the official implementation
%%capture
!pip install -q bert-score

#### Eng ver

In [None]:
# from bert_score import BERTScorer

# scorer = BERTScorer(model_type="roberta-large", lang="vi")
# # we report the mean (F.mean()) in our paper
# _, _, F = scorer.score(synthetic_texts, original_texts)

# print("\nBERTScore of each example text:")
# for score, synthetic_text in zip(F, synthetic_texts):
#     print(f"{synthetic_text} (score: {score.item():.2f})")

#### Vietnamese ver

model `xlm-roberta-large` là phiên bản multilingual của model `roberta-large` (model ban đầu của code English ver)

BERTScore là chỉ số đánh giá mức độ bảo toàn ngữ nghĩa (semantic similarity) giữa hai câu văn, được tính bằng cách sử dụng mô hình ngôn ngữ BERT

In [None]:
from bert_score import BERTScorer
import torch

# Khởi tạo BERTScorer cho tiếng Việt
scorer = BERTScorer(model_type="xlm-roberta-large", lang="vi", rescale_with_baseline=False)

# Chuyển sang list để dễ chia chunk
original_questions = original_df["question"].astype(str).tolist()
test_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
test_answers = test_df["answers"].astype(str).tolist()

# Cài đặt chunk size
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size  # Làm tròn lên

# Xử lý theo từng chunk
for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    chunk_orig_q = original_questions[start:end]
    chunk_test_q = test_questions[start:end]

    chunk_orig_a = original_answers[start:end]
    chunk_test_a = test_answers[start:end]

    # Tính BERTScore cho question và answer
    _, _, F_q = scorer.score(chunk_test_q, chunk_orig_q)
    _, _, F_a = scorer.score(chunk_test_a, chunk_orig_a)

    # In điểm trung bình của chunk này
    print(f"Chunk {i+1}/{num_chunks} — Question Score: {F_q.mean().item():.4f}, Answer Score: {F_a.mean().item():.4f}")

Chunk 1/6 — Question Score: 0.8496, Answer Score: 0.8453
Chunk 2/6 — Question Score: 0.8453, Answer Score: 0.8487
Chunk 3/6 — Question Score: 0.8491, Answer Score: 0.8535
Chunk 4/6 — Question Score: 0.8523, Answer Score: 0.8548
Chunk 5/6 — Question Score: 0.8449, Answer Score: 0.8460
Chunk 6/6 — Question Score: 0.8467, Answer Score: 0.8499


Mean **question** score: 0.84798, Mean **answers** score: 0.8497

# Style

### Embedding-based

In [None]:
%%capture
!pip install transformers==4.30.2 # needed to load style embeddings

#### English ver

In [None]:
# import torch
# from transformers import RobertaConfig, RobertaModel, AutoTokenizer
# import torch.nn.functional as F
# import numpy as np
# from scipy.spatial.distance import cosine

# # Adapted from: https://github.com/lingjzhu/idiolect

# class AttentionPooling(torch.nn.Module):
#     """
#     Implementation of SelfAttentionPooling
#     Original Paper: Self-Attention Encoding and Pooling for Speaker Recognition
#     https://arxiv.org/pdf/2008.01077v1.pdf
#     """

#     def __init__(self, input_dim):
#         super(AttentionPooling, self).__init__()
#         self.W = torch.nn.Linear(input_dim, 1)
#         self.softmax = torch.nn.functional.softmax

#     def forward(self, batch_rep, att_mask=None):
#         """
#         N: batch size, T: sequence length, H: Hidden dimension
#         input:
#             batch_rep : size (N, T, H)
#         attention_weight:
#             att_w : size (N, T, 1)
#         return:
#             utter_rep: size (N, H)
#         """
#         att_logits = self.W(batch_rep).squeeze(-1)
#         if att_mask is not None:
#             att_logits = att_mask + att_logits
#         att_w = self.softmax(att_logits, dim=-1).unsqueeze(-1)
#         utter_rep = torch.sum(batch_rep * att_w, dim=1)

#         return utter_rep


# class DNNSelfAttention(torch.nn.Module):
#     def __init__(self, hidden_dim, **kwargs):
#         super(DNNSelfAttention, self).__init__()
#         self.pooling = AttentionPooling(hidden_dim)
#         self.out_layer = torch.nn.Sequential(
#             torch.nn.Linear(hidden_dim, hidden_dim),
#             torch.nn.ReLU(),
#             torch.nn.Linear(hidden_dim, hidden_dim),
#         )

#     def forward(self, features, att_mask):
#         out = self.pooling(features, att_mask).squeeze(-1)
#         predicted = self.out_layer(out)
#         return predicted


# class SRoberta(torch.nn.Module):
#     def __init__(self, model_name="roberta-base"):
#         super().__init__()
#         config = RobertaConfig.from_pretrained(model_name, return_dict=True)
#         config.output_hidden_states = True
#         self.roberta = RobertaModel.from_pretrained(model_name, config=config)

#         self.pooler = DNNSelfAttention(768)

#     def forward(self, input_ids, att_mask=None):
#         out = self.roberta(input_ids, att_mask)
#         out = out.last_hidden_state
#         out = self.pooler(out, att_mask)
#         return out


# def batch_embed(texts, model, tokenizer, max_length=512):
#     inputs = tokenizer(
#         texts,
#         add_special_tokens=True,
#         max_length=max_length,
#         padding=True,
#         truncation=True,
#         return_tensors="pt"
#         )
#     with torch.no_grad():
#         hidden = model(
#             inputs['input_ids'].to(device),
#             inputs['attention_mask'].to(device)
#         )
#         hidden = F.normalize(hidden, dim=-1).cpu().detach()
#     return hidden

In [None]:
# # 1. Load model
# # The following implementation assumes you are loading directly from google drive

# from google.colab import drive
# drive.mount("/content/MyDrive/")

# style_model_path = "/content/MyDrive/MyDrive/experiments/sroberta_model-4_reddit_resave.bin" # replace with your path
# device = "cuda" if torch.cuda.is_available() else "cpu"
# MODEL_NAME = "roberta-base"
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# model = SRoberta()
# if torch.cuda.is_available():
#     model.load_state_dict(torch.load(style_model_path))
# else:
#     model.load_state_dict(
#         torch.load(
#             style_model_path,
#             map_location=torch.device('cpu')
#         )
#     )
# _ = model.to(device)

# # 2. Extract embeddings
# original_embeds = batch_embed(original_texts, model, tokenizer)
# synthetic_embeds = batch_embed(synthetic_texts, model, tokenizer)

# # 3. Compute embedding similarity
# scores = np.array([1 - cosine(a, b) for a, b in zip(original_embeds, synthetic_embeds)])
# # if all synthetic texts are from the same system, we can report scores.mean()
# print("\nIdiolect embedding scores for each example text:")
# for score, synthetic_text in zip(scores, synthetic_texts):
#     print(f"{synthetic_text} (score: {score.item():.2f})")

#### Vietnamese ver

Theo paper, đánh giá phong cách cá nhân tổng thể (`idiolect`) bằng sự tương đồng trong không gian nhúng phong cách (`style embedding similarity`) và nhắm đến phong cách cú pháp bằng các điểm số dựa trên từ loại (`POS-based scores`).

In [12]:
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch.nn.functional as F
import numpy as np
from scipy.spatial.distance import cosine


# Attention pooling module
class AttentionPooling(torch.nn.Module):
    def __init__(self, input_dim):
        super(AttentionPooling, self).__init__()
        self.W = torch.nn.Linear(input_dim, 1)
        self.softmax = torch.nn.functional.softmax

    def forward(self, batch_rep, att_mask=None):
        att_logits = self.W(batch_rep).squeeze(-1)
        if att_mask is not None:
            att_logits = att_mask + att_logits
        att_w = self.softmax(att_logits, dim=-1).unsqueeze(-1)
        utter_rep = torch.sum(batch_rep * att_w, dim=1)
        return utter_rep


# DNN layer after pooling
class DNNSelfAttention(torch.nn.Module):
    def __init__(self, hidden_dim, **kwargs):
        super(DNNSelfAttention, self).__init__()
        self.pooling = AttentionPooling(hidden_dim)
        self.out_layer = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, features, att_mask):
        out = self.pooling(features, att_mask).squeeze(-1)
        predicted = self.out_layer(out)
        return predicted


# Main model: XLM-Roberta + Attention pooling
class SRoberta(torch.nn.Module):
    def __init__(self, model_name="xlm-roberta-base"):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name)
        config.output_hidden_states = True
        self.encoder = AutoModel.from_pretrained(model_name, config=config)
        hidden_size = config.hidden_size
        self.pooler = DNNSelfAttention(hidden_size)

    def forward(self, input_ids, att_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=att_mask)
        last_hidden = out.last_hidden_state
        pooled = self.pooler(last_hidden, att_mask)
        return pooled


# Embedding function
# def batch_embed(texts, model, tokenizer, max_length=512):
#     inputs = tokenizer(
#         texts,
#         add_special_tokens=True,
#         max_length=max_length,
#         padding=True,
#         truncation=True,
#         return_tensors="pt"
#     )
#     with torch.no_grad():
#         hidden = model(
#             inputs['input_ids'].to(device),
#             inputs['attention_mask'].to(device)
#         )
#         hidden = F.normalize(hidden, dim=-1).cpu().detach()
#     return hidden
def batch_embed(texts, model, tokenizer, batch_size=32, max_length=512):
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(
            batch_texts,
            add_special_tokens=True,
            max_length=max_length,
            padding=True,
            truncation=True,
            return_tensors="pt"
        ).to(device)

        with torch.no_grad():
            hidden = model(
                inputs['input_ids'],
                inputs['attention_mask']
            )
            hidden = F.normalize(hidden, dim=-1).cpu().detach().numpy()
            embeddings.append(hidden)

        # Giải phóng bộ nhớ GPU sau mỗi batch
        del inputs, hidden
        torch.cuda.empty_cache()

    return np.vstack(embeddings)

In [13]:
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = SRoberta(model_name=MODEL_NAME).to(device)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

In [14]:
# Tạo dữ liệu đầu vào
original_questions = original_df["question"].astype(str).tolist()
synthetic_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
synthetic_answers = test_df["answers"].astype(str).tolist()

# Cài đặt chunk size
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size
BATCH_SIZE = 32  # Bạn có thể giảm xuống 16 nếu vẫn hết bộ nhớ

# Chạy từng chunk
for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Cắt dữ liệu theo chunk
    chunk_orig_q = original_questions[start:end]
    chunk_syn_q = synthetic_questions[start:end]
    chunk_orig_a = original_answers[start:end]
    chunk_syn_a = synthetic_answers[start:end]

    # Embedding cho Question
    q_orig_embeds = batch_embed(chunk_orig_q, model, tokenizer, batch_size=BATCH_SIZE)
    q_syn_embeds = batch_embed(chunk_syn_q, model, tokenizer, batch_size=BATCH_SIZE)
    q_scores = np.array([1 - cosine(a, b) for a, b in zip(q_orig_embeds, q_syn_embeds)])

    # Embedding cho Answer
    a_orig_embeds = batch_embed(chunk_orig_a, model, tokenizer, batch_size=BATCH_SIZE)
    a_syn_embeds = batch_embed(chunk_syn_a, model, tokenizer, batch_size=BATCH_SIZE)
    a_scores = np.array([1 - cosine(a, b) for a, b in zip(a_orig_embeds, a_syn_embeds)])

    # In kết quả trung bình cho chunk này
    print(f"📦 Chunk {i+1}/{num_chunks} — "
          f"❓ Question Score: {q_scores.mean():.4f}, "
          f"✅ Answer Score: {a_scores.mean():.4f}")
# # Chạy từng chunk
# for i in range(num_chunks):
#     start = i * chunk_size
#     end = min((i + 1) * chunk_size, len(original_questions))

#     # Cắt dữ liệu theo chunk
#     chunk_orig_q = original_questions[start:end]
#     chunk_syn_q = synthetic_questions[start:end]

#     chunk_orig_a = original_answers[start:end]
#     chunk_syn_a = synthetic_answers[start:end]

#     # Embedding cho Question
#     q_orig_embeds = batch_embed(chunk_orig_q, model, tokenizer)
#     q_syn_embeds = batch_embed(chunk_syn_q, model, tokenizer)
#     q_scores = np.array([1 - cosine(a, b) for a, b in zip(q_orig_embeds, q_syn_embeds)])

#     # Embedding cho Answer
#     a_orig_embeds = batch_embed(chunk_orig_a, model, tokenizer)
#     a_syn_embeds = batch_embed(chunk_syn_a, model, tokenizer)
#     a_scores = np.array([1 - cosine(a, b) for a, b in zip(a_orig_embeds, a_syn_embeds)])

#     # In kết quả trung bình cho chunk này
#     print(f"Chunk {i+1}/{num_chunks} — Question Score: {q_scores.mean():.4f}, Answer Score: {a_scores.mean():.4f}")

📦 Chunk 1/6 — ❓ Question Score: 0.9803, ✅ Answer Score: 0.9911
📦 Chunk 2/6 — ❓ Question Score: 0.9861, ✅ Answer Score: 0.9880
📦 Chunk 3/6 — ❓ Question Score: 0.9886, ✅ Answer Score: 0.9846
📦 Chunk 4/6 — ❓ Question Score: 0.9805, ✅ Answer Score: 0.9839
📦 Chunk 5/6 — ❓ Question Score: 0.9864, ✅ Answer Score: 0.9915
📦 Chunk 6/6 — ❓ Question Score: 0.9838, ✅ Answer Score: 0.9873


Mean question score: 0.98423, Mean answer score: 0.9877

### POS-based

#### Eng ver

In [None]:
# from collections import Counter
# import numpy as np
# import spacy
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# nltk_stopwords = stopwords.words('english')

# nlp = spacy.load("en_core_web_sm")

# class POSStyleSimilarityScorer:
#     def __init__(self):
#         # Ireland and Pennebaker, 2010 captures writing styles
#         # by examining POS tag occurences across categories:
#         # 0) adv, 1) adj, 2) conj, 3) det, 4) noun, 5) pron, 6) preposition, 7) punct
#         self._VALID_UPOS = {
#             "ADV",
#             "ADJ",
#             # NO AUX,
#             "CCONJ",
#             "SCONJ",
#             "DET",
#             # NO INTJ
#             "NOUN",
#             "PROPN",
#             # NO NUM
#             "PRON",
#             "ADP",
#             "PART",
#             "PUNCT",
#             # NO SYMB
#             # NO VERB,
#             # NO X
#         }
#         self.VALID_UPOS = sorted(self.map_tag(t) for t in self._VALID_UPOS)

#     def map_tag(self, tag):
#         # collapse UPOS tagset to categories
#         mapper = {
#             "CCONJ": "CONJ",
#             "SCONJ": "CONJ",
#             "PROPN": "NOUN",
#             "ADP": "PREP",
#             "PART": "PREP",
#             # BNC2014
#             "SUBST": "NOUN",
#             "ART": "DET",
#             "INTERJ": "INTJ",
#         }
#         return mapper.get(tag, tag)

#     def compute_jaccard_similarity(self, list1, list2):
#         set1, set2 = set(list1), set(list2)
#         intersection = list(set1.intersection(set2))
#         intersection_length = len(list(set1.intersection(set2)))
#         union_length = (len(set1) + len(set2)) - intersection_length
#         if union_length == 0:
#             return union_length
#         return float(intersection_length) / union_length

#     def tag_and_filter(self, text):
#         doc = nlp(text)
#         return [self.map_tag(t.pos_) for t in doc if t.pos_ in self._VALID_UPOS], len(
#             doc
#         )

#     def word_pos_score(self, pos1, pos2, len1, len2):
#         """
#         Calculate POS similarity (Ireland and Pennbaker 2010) over UPOS tags.
#             1. for each POS category, get its count in proportion to total sentence length
#             2. calculate similarity score wrt each category
#             3. average to get total POS similarity score
#         """
#         pos_counts1 = Counter(pos1)
#         pos_counts2 = Counter(pos2)

#         category_scores = []
#         for t in self.VALID_UPOS:
#             cat1 = pos_counts1.get(t, 0) / len1
#             cat2 = pos_counts2.get(t, 0) / len2
#             if cat1 == 0 and cat2 == 0:
#                 score = 1
#             else:
#                 score = 1 - (abs(cat1 - cat2) / (cat1 + cat2))
#             category_scores.append(score)

#         return np.mean(category_scores)

#     def trigram_pos_score(self, pos1, pos2):
#         # note that this will return 0 for shorter texts
#         pos1 = self._make_ngrams(pos1, n=3)
#         pos2 = self._make_ngrams(pos2, n=3)
#         return self.compute_jaccard_similarity(pos1, pos2)

#     def get_trigram_pos_score(self, text1, text2):
#         pos1, _ = self.tag_and_filter(text1)
#         pos2, _ = self.tag_and_filter(text2)
#         return self.trigram_pos_score(pos1, pos2)

#     def get_mean_trigram_pos_scores(self, src, targets, **kwargs):
#         return np.mean([self.get_trigram_pos_score(t, src) for t in targets]).item()

#     def _make_ngrams(self, l, n=3):
#         return ["".join(l[i : i + n]) for i in range(len(l) - n + 1)]

# scorer = POSStyleSimilarityScorer()

In [None]:
# trigram_pos_scores, word_pos_scores = [], []
# for orig_text, syn_text in zip(original_texts, synthetic_texts):
#     orig_pos, orig_len = scorer.tag_and_filter(orig_text)
#     syn_pos, syn_len = scorer.tag_and_filter(syn_text)
#     trigram_pos_score = scorer.get_trigram_pos_score(orig_text, syn_text)
#     word_pos_score = scorer.word_pos_score(orig_pos, syn_pos, orig_len, syn_len)
#     trigram_pos_scores.append(trigram_pos_score)
#     word_pos_scores.append(word_pos_score)

# print("\nPOS-based scores for each example text:")
# for i, synthetic_text in enumerate(synthetic_texts):
#     print(f"{synthetic_text} (Word: {word_pos_scores[i]:.2f}; Trigram: {trigram_pos_scores[i]:.2f})")

#### Vietnamese ver

**POS (Part-of-Speech)** là phương pháp đánh giá dựa trên cấu trúc ngữ pháp dùng để so sánh mức độ giữ nguyên **phong cách cú pháp**/**ngữ pháp** giữa các cặp câu (gốc và được sinh ra). Đây là một hướng đánh giá về **hình thức** thay vì chỉ **ngữ nghĩa**.

In [None]:
!pip install -q spacy spacy-udpipe

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/936.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m931.8/936.8 kB[0m [31m28.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m936.8/936.8 kB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
import spacy_udpipe
import numpy as np
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# --- Tải và load model UDPipe tiếng Việt ---
spacy_udpipe.download("vi")  # Chạy 1 lần để tải model tiếng Việt
nlp = spacy_udpipe.load("vi")

class POSStyleSimilarityScorer:
    def __init__(self):
        self._VALID_UPOS = {
            "ADV",
            "ADJ",
            "CCONJ",
            "SCONJ",
            "DET",
            "NOUN",
            "PROPN",
            "PRON",
            "ADP",
            "PART",
            "PUNCT",
        }
        self.VALID_UPOS = sorted(self.map_tag(t) for t in self._VALID_UPOS)

    def map_tag(self, tag):
        mapper = {
            "CCONJ": "CONJ",
            "SCONJ": "CONJ",
            "PROPN": "NOUN",
            "ADP": "PREP",
            "PART": "PREP",
        }
        return mapper.get(tag, tag)

    def compute_jaccard_similarity(self, list1, list2):
        set1, set2 = set(list1), set(list2)
        intersection_length = len(set1.intersection(set2))
        union_length = len(set1.union(set2))
        if union_length == 0:
            return 0.0
        return float(intersection_length) / union_length

    def tag_and_filter(self, text):
        doc = nlp(text)
        return [self.map_tag(t.pos_) for t in doc if t.pos_ in self._VALID_UPOS], len(doc)

    def word_pos_score(self, pos1, pos2, len1, len2):
        pos_counts1 = Counter(pos1)
        pos_counts2 = Counter(pos2)

        category_scores = []
        for t in self.VALID_UPOS:
            cat1 = pos_counts1.get(t, 0) / len1 if len1 > 0 else 0
            cat2 = pos_counts2.get(t, 0) / len2 if len2 > 0 else 0
            if cat1 == 0 and cat2 == 0:
                score = 1
            else:
                score = 1 - (abs(cat1 - cat2) / (cat1 + cat2))
            category_scores.append(score)

        return np.mean(category_scores)

    def trigram_pos_score(self, pos1, pos2):
        pos1_ngrams = self._make_ngrams(pos1, n=3)
        pos2_ngrams = self._make_ngrams(pos2, n=3)
        return self.compute_jaccard_similarity(pos1_ngrams, pos2_ngrams)

    def get_trigram_pos_score(self, text1, text2):
        pos1, _ = self.tag_and_filter(text1)
        pos2, _ = self.tag_and_filter(text2)
        return self.trigram_pos_score(pos1, pos2)

    def _make_ngrams(self, lst, n=3):
        return ["".join(lst[i:i+n]) for i in range(len(lst) - n + 1)]

# Khởi tạo scorer
scorer = POSStyleSimilarityScorer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Downloaded pre-trained UDPipe model for 'vi' language


In [None]:
# Tách dữ liệu đầu vào
original_questions = original_df["question"].astype(str).tolist()
test_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
test_answers = test_df["answers"].astype(str).tolist()

# Thiết lập chunk size
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Cắt theo chunk
    chunk_orig_q = original_questions[start:end]
    chunk_test_q = test_questions[start:end]

    chunk_orig_a = original_answers[start:end]
    chunk_test_a = test_answers[start:end]

    # Tính điểm cho Questions
    word_pos_scores_q = []
    trigram_pos_scores_q = []

    for orig, syn in zip(chunk_orig_q, chunk_test_q):
        orig_pos, orig_len = scorer.tag_and_filter(orig)
        syn_pos, syn_len = scorer.tag_and_filter(syn)
        word_pos_scores_q.append(scorer.word_pos_score(orig_pos, syn_pos, orig_len, syn_len))
        trigram_pos_scores_q.append(scorer.get_trigram_pos_score(orig, syn))

    # Tính điểm cho Answers
    word_pos_scores_a = []
    trigram_pos_scores_a = []

    for orig, syn in zip(chunk_orig_a, chunk_test_a):
        orig_pos, orig_len = scorer.tag_and_filter(orig)
        syn_pos, syn_len = scorer.tag_and_filter(syn)
        word_pos_scores_a.append(scorer.word_pos_score(orig_pos, syn_pos, orig_len, syn_len))
        trigram_pos_scores_a.append(scorer.get_trigram_pos_score(orig, syn))

    # In kết quả trung bình của chunk hiện tại
    print(f"\n📦 Chunk {i+1}/{num_chunks}")
    print(f"Mean Word POS Score - Questions: {np.mean(word_pos_scores_q):.4f}")
    print(f"Mean Trigram POS Score - Questions: {np.mean(trigram_pos_scores_q):.4f}")
    print(f"Mean Word POS Score - Answers: {np.mean(word_pos_scores_a):.4f}")
    print(f"Mean Trigram POS Score - Answers: {np.mean(trigram_pos_scores_a):.4f}")


📦 Chunk 1/6
Mean Word POS Score - Questions: 0.7843
Mean Trigram POS Score - Questions: 0.2866
Mean Word POS Score - Answers: 0.8460
Mean Trigram POS Score - Answers: 0.4043

📦 Chunk 2/6
Mean Word POS Score - Questions: 0.7943
Mean Trigram POS Score - Questions: 0.3000
Mean Word POS Score - Answers: 0.8131
Mean Trigram POS Score - Answers: 0.3656

📦 Chunk 3/6
Mean Word POS Score - Questions: 0.7987
Mean Trigram POS Score - Questions: 0.3120
Mean Word POS Score - Answers: 0.7949
Mean Trigram POS Score - Answers: 0.3336

📦 Chunk 4/6
Mean Word POS Score - Questions: 0.7707
Mean Trigram POS Score - Questions: 0.2794
Mean Word POS Score - Answers: 0.7976
Mean Trigram POS Score - Answers: 0.3270

📦 Chunk 5/6
Mean Word POS Score - Questions: 0.7903
Mean Trigram POS Score - Questions: 0.2958
Mean Word POS Score - Answers: 0.8359
Mean Trigram POS Score - Answers: 0.3993

📦 Chunk 6/6
Mean Word POS Score - Questions: 0.7918
Mean Trigram POS Score - Questions: 0.2925
Mean Word POS Score - Answers

# Divergence

In [9]:
%%capture
!pip install evaluate
!pip install sacrebleu

#### Eng ver

In [None]:
# import evaluate

# sacrebleu = evaluate.load("sacrebleu")

# results = sacrebleu.compute(
#     predictions=synthetic_texts,
#     references=original_texts,
#     use_effective_order=True,
#     smooth_method='floor',
#     force=True
# ) # use sentence-level

# # Aggregated BLEU between synthetic and original texts
# round(100 - results["score"], 4)

#### Vietnamese

**BLEU (Bilingual Evaluation Understudy)** là chỉ số đánh giá mức độ **giống nhau** về mặt **từ ngữ** giữa **một câu sinh ra** và **một hoặc nhiều câu gốc**. Nó không xét ngữ nghĩa mà tập trung vào các n-gram.

 **sacrebleu** là một thư viện chuẩn hóa việc tính BLEU

In [8]:
!pip install evaluate underthesea sacrebleu --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.9/20.9 MB[0m [31m65.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m657.8/657.8 kB[0m [31m33.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m54.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
import evaluate
from underthesea import word_tokenize

# Load sacreBLEU
sacrebleu = evaluate.load("sacrebleu")

# Hàm tokenize tiếng Việt
def tokenize_vietnamese_texts(texts):
    return [" ".join(word_tokenize(text, format="text")) for text in texts]

# Hàm tính BLEU
def compute_bleu(predictions, references):
    results = sacrebleu.compute(
        predictions=predictions,
        references=[[ref] for ref in references],
        use_effective_order=True,
        smooth_method="floor",
        force=True
    )
    return results["score"]

# Tách dữ liệu
original_questions = original_df["question"].astype(str).tolist()
synthetic_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
synthetic_answers = test_df["answers"].astype(str).tolist()

# Thiết lập chunk size
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Cắt chunk
    o_q_chunk = original_questions[start:end]
    s_q_chunk = synthetic_questions[start:end]
    o_a_chunk = original_answers[start:end]
    s_a_chunk = synthetic_answers[start:end]

    # Tokenize chunk
    o_q_tok = tokenize_vietnamese_texts(o_q_chunk)
    s_q_tok = tokenize_vietnamese_texts(s_q_chunk)
    o_a_tok = tokenize_vietnamese_texts(o_a_chunk)
    s_a_tok = tokenize_vietnamese_texts(s_a_chunk)

    # Tính BLEU
    bleu_q = compute_bleu(s_q_tok, o_q_tok)
    bleu_a = compute_bleu(s_a_tok, o_a_tok)

    combined_syn = s_q_chunk + s_a_chunk
    combined_orig = o_q_chunk + o_a_chunk
    bleu_combined = compute_bleu(
        tokenize_vietnamese_texts(combined_syn),
        tokenize_vietnamese_texts(combined_orig)
    )

    # In kết quả
    print(f"\n📦 Chunk {i+1}/{num_chunks}")
    print(f"📘 BLEU score (Questions): {bleu_q:.2f}")
    print(f"📗 BLEU score (Answers):   {bleu_a:.2f}")
    print(f"📊 Combined BLEU score:    {bleu_combined:.2f}")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]


📦 Chunk 1/6
📘 BLEU score (Questions): 7.51
📗 BLEU score (Answers):   13.74
📊 Combined BLEU score:    11.84

📦 Chunk 2/6
📘 BLEU score (Questions): 3.33
📗 BLEU score (Answers):   20.71
📊 Combined BLEU score:    12.04

📦 Chunk 3/6
📘 BLEU score (Questions): 6.64
📗 BLEU score (Answers):   22.49
📊 Combined BLEU score:    16.87

📦 Chunk 4/6
📘 BLEU score (Questions): 8.50
📗 BLEU score (Answers):   21.07
📊 Combined BLEU score:    19.29

📦 Chunk 5/6
📘 BLEU score (Questions): 3.45
📗 BLEU score (Answers):   19.39
📊 Combined BLEU score:    11.98

📦 Chunk 6/6
📘 BLEU score (Questions): 3.79
📗 BLEU score (Answers):   24.75
📊 Combined BLEU score:    14.33


# Other

## Distribution-level metrics

### Fréchet Distance:

#### Eng ver

In [None]:
# # Example usage: compare embedding distribution distances

# import numpy as np
# import scipy
# from scipy.stats import entropy
# import torch
# from transformers import AutoModel, AutoTokenizer
# import torch.nn.functional as F

# # Fréchet distance code adapted from: https://github.com/mchong6/FID_IS_infinity/blob/master/score_infinity.py

# def calculate_feature_statistics(feats):
#     """Calculation of the statistics used by the FID.
#     Params:
#     -- feats       : tensor of features with the shape [N, D]
#     Returns:
#     -- mu    : The mean over samples of the activations of the pool_3 layer of
#                the inception model.
#     -- sigma : The covariance matrix of the activations of the pool_3 layer of
#                the inception model.
#     """
#     mu = np.mean(feats, axis=0) # (N, D)
#     sigma = np.cov(feats, rowvar=False)
#     return mu, sigma


# def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
#     """Numpy implementation of the Frechet Distance.
#     The Frechet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
#     and X_2 ~ N(mu_2, C_2) is
#             d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
#     Stable version by Dougal J. Sutherland.
#     Params:
#     -- mu1   : Numpy array containing the activations of a layer of the
#                inception net (like returned by the function 'get_predictions')
#                for generated samples.
#     -- mu2   : The sample mean over activations, precalculated on an
#                representative data set.
#     -- sigma1: The covariance matrix over activations for generated samples.
#     -- sigma2: The covariance matrix over activations, precalculated on an
#                representative data set.
#     Returns:
#     --   : The Frechet Distance.
#     """

#     mu1 = np.atleast_1d(mu1)
#     mu2 = np.atleast_1d(mu2)

#     sigma1 = np.atleast_2d(sigma1)
#     sigma2 = np.atleast_2d(sigma2)

#     assert mu1.shape == mu2.shape, \
#         'Training and test mean vectors have different lengths'
#     assert sigma1.shape == sigma2.shape, \
#         'Training and test covariances have different dimensions'

#     diff = mu1 - mu2

#     # Product might be almost singular
#     covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
#     if not np.isfinite(covmean).all():
#         msg = ('fid calculation produces singular product; '
#                'adding %s to diagonal of cov estimates') % eps
#         print(msg)
#         offset = np.eye(sigma1.shape[0]) * eps
#         covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))

#     # Numerical error might give slight imaginary component
#     if np.iscomplexobj(covmean):
#         if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
#             m = np.max(np.abs(covmean.imag))
#             raise ValueError('Imaginary component {}'.format(m))
#         covmean = covmean.real

#     tr_covmean = np.trace(covmean)

#     return (diff.dot(diff) + np.trace(sigma1)
#             + np.trace(sigma2) - 2 * tr_covmean)

# # Example usage - compare meaning distributions
# # (note this method is best used when there are larger numbers of examples)

# device = "cuda" if torch.cuda.is_available() else "cpu"

# model = AutoModel.from_pretrained("roberta-large")
# model.eval()
# model.to(device)

# # if comparing style embedding distributions,
# # load the styleroberta (or other authorship-related models) instead
# tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# with torch.no_grad():
#     original_embeds = model(
#         **tokenizer(
#             original_texts,
#             padding=True,
#             truncation=True,
#             return_tensors="pt"
#             )
#         ).pooler_output
#     synthetic_embeds = model(
#         **tokenizer(
#             synthetic_texts,
#             padding=True,
#             truncation=True,
#             return_tensors="pt"
#             )
#         ).pooler_output
#     original_embeds = F.normalize(original_embeds, dim=-1).numpy()
#     synthetic_embeds = F.normalize(synthetic_embeds, dim=-1).numpy()

# orig_mu, orig_sigma = calculate_feature_statistics(original_embeds)
# syn_mu, syn_sigma = calculate_feature_statistics(synthetic_embeds)
# distance = calculate_frechet_distance(orig_mu, orig_sigma, syn_mu, syn_sigma, eps=1e-6)

# print(f"\nFréchet distance: {distance:.2f}")

#### Vietnamese ver

Fréchet Distance (FD) là chỉ số dùng để đo khoảng cách giữa hai phân phối xác suất. Nó thường được dùng để so sánh **mức độ giống nhau giữa hai tập hợp vector đặc trưng** (embeddings).
- FD ≈ 0 → hai tập gần như giống hệt nhau về mặt phân phối ngữ nghĩa (embedding).

- FD cao → khác biệt lớn.

In [None]:
# import numpy as np
# import scipy
# from scipy.stats import entropy
# import torch
# import torch.nn.functional as F
# from transformers import AutoTokenizer, AutoModel
# from underthesea import word_tokenize  # Dùng để tách từ tiếng Việt

# def calculate_feature_statistics(feats):
#     mu = np.mean(feats, axis=0)
#     sigma = np.cov(feats, rowvar=False)
#     return mu, sigma

# def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
#     mu1 = np.atleast_1d(mu1)
#     mu2 = np.atleast_1d(mu2)
#     sigma1 = np.atleast_2d(sigma1)
#     sigma2 = np.atleast_2d(sigma2)

#     assert mu1.shape == mu2.shape
#     assert sigma1.shape == sigma2.shape

#     diff = mu1 - mu2
#     covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
#     if not np.isfinite(covmean).all():
#         offset = np.eye(sigma1.shape[0]) * eps
#         covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
#     if np.iscomplexobj(covmean):
#         covmean = covmean.real
#     tr_covmean = np.trace(covmean)
#     return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean

# # ---- Load model hỗ trợ tiếng Việt ----
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model_name = "xlm-roberta-large"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)
# model.to(device)
# model.eval()

# # ---- Hàm xử lý embedding ----
# def embed_texts(texts):
#     inputs = tokenizer(
#         texts, padding=True, truncation=True, return_tensors="pt"
#     ).to(device)
#     with torch.no_grad():
#         outputs = model(**inputs)
#     # Lấy hidden states cuối cùng, trung bình theo chiều token
#     embeddings = outputs.last_hidden_state.mean(dim=1)
#     return F.normalize(embeddings, dim=-1).cpu().numpy()

# # ---- Dữ liệu đầu vào: Danh sách câu hỏi & trả lời ----
# # (Giả sử đã có sẵn: original_questions, original_answers, synthetic_questions, synthetic_answers)
# def vi_tokenize(texts):
#     return [" ".join(word_tokenize(text)) for text in texts]

# # ---- Tách từ cho tiếng Việt ----
# original_questions_tok = vi_tokenize(original_questions)
# synthetic_questions_tok = vi_tokenize(synthetic_questions)

# original_answers_tok = vi_tokenize(original_answers)
# synthetic_answers_tok = vi_tokenize(synthetic_answers)

# # ---- Embedding ----
# orig_q_embeds = embed_texts(original_questions_tok)
# syn_q_embeds = embed_texts(synthetic_questions_tok)
# orig_a_embeds = embed_texts(original_answers_tok)
# syn_a_embeds = embed_texts(synthetic_answers_tok)

# # ---- Tính Fréchet Distance ----
# # Với câu hỏi
# mu_q_orig, sigma_q_orig = calculate_feature_statistics(orig_q_embeds)
# mu_q_syn, sigma_q_syn = calculate_feature_statistics(syn_q_embeds)
# frechet_q = calculate_frechet_distance(mu_q_orig, sigma_q_orig, mu_q_syn, sigma_q_syn)

# # Với câu trả lời
# mu_a_orig, sigma_a_orig = calculate_feature_statistics(orig_a_embeds)
# mu_a_syn, sigma_a_syn = calculate_feature_statistics(syn_a_embeds)
# frechet_a = calculate_frechet_distance(mu_a_orig, sigma_a_orig, mu_a_syn, sigma_a_syn)

# # ---- Kết quả ----
# print(f"\n📊 Fréchet Distance:")
# print(f"   ❓ Question: {frechet_q:.2f}")
# print(f"   ✅ Answer:   {frechet_a:.2f}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 7.81 GiB. GPU 0 has a total capacity of 14.74 GiB of which 6.64 GiB is free. Process 5140 has 8.10 GiB memory in use. Of the allocated memory 6.98 GiB is allocated by PyTorch, and 1016.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [6]:
original_questions = original_df["question"].astype(str).tolist()
synthetic_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
synthetic_answers = test_df["answers"].astype(str).tolist()
# original_questions = original_questions[0:6000]
# synthetic_questions= synthetic_questions[0:6000]
# original_answers= original_answers[0:6000]
# synthetic_answers= synthetic_answers[0:6000]

In [11]:
import numpy as np
import scipy
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from underthesea import word_tokenize

# ---- Hàm hỗ trợ ----
def calculate_feature_statistics(feats):
    mu = np.mean(feats, axis=0)
    sigma = np.cov(feats, rowvar=False)
    return mu, sigma

def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    mu1 = np.atleast_1d(mu1)
    mu2 = np.atleast_1d(mu2)
    sigma1 = np.atleast_2d(sigma1)
    sigma2 = np.atleast_2d(sigma2)

    assert mu1.shape == mu2.shape
    assert sigma1.shape == sigma2.shape

    diff = mu1 - mu2
    covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    tr_covmean = np.trace(covmean)
    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean

def vi_tokenize(texts):
    return [" ".join(word_tokenize(text)) for text in texts]

# ---- Load model ----
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

def embed_texts(texts, batch_size=32):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            batch_embeddings = outputs.last_hidden_state.mean(dim=1)
            batch_embeddings = F.normalize(batch_embeddings, dim=-1).cpu().numpy()
            embeddings.append(batch_embeddings)
        # Giải phóng bộ nhớ
        del inputs, outputs, batch_embeddings
        torch.cuda.empty_cache()
    return np.vstack(embeddings)

# ---- Chunking ----
chunk_size = 1000
batch_size = 32
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Cắt chunk
    o_q_chunk = vi_tokenize(original_questions[start:end])
    s_q_chunk = vi_tokenize(synthetic_questions[start:end])
    o_a_chunk = vi_tokenize(original_answers[start:end])
    s_a_chunk = vi_tokenize(synthetic_answers[start:end])

    # Embedding theo batch nhỏ
    o_q_embeds = embed_texts(o_q_chunk, batch_size)
    s_q_embeds = embed_texts(s_q_chunk, batch_size)
    o_a_embeds = embed_texts(o_a_chunk, batch_size)
    s_a_embeds = embed_texts(s_a_chunk, batch_size)

    # Tính Fréchet Distance
    mu_q_o, sigma_q_o = calculate_feature_statistics(o_q_embeds)
    mu_q_s, sigma_q_s = calculate_feature_statistics(s_q_embeds)
    frechet_q = calculate_frechet_distance(mu_q_o, sigma_q_o, mu_q_s, sigma_q_s)

    mu_a_o, sigma_a_o = calculate_feature_statistics(o_a_embeds)
    mu_a_s, sigma_a_s = calculate_feature_statistics(s_a_embeds)
    frechet_a = calculate_frechet_distance(mu_a_o, sigma_a_o, mu_a_s, sigma_a_s)

    # In kết quả cho chunk
    print(f"\n📦 Chunk {i+1}/{num_chunks}")
    print(f"   ❓ Fréchet Distance - Questions: {frechet_q:.2f}")
    print(f"   ✅ Fréchet Distance - Answers:   {frechet_a:.2f}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]


📦 Chunk 1/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.00

📦 Chunk 2/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.00

📦 Chunk 3/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.01

📦 Chunk 4/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.01

📦 Chunk 5/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.00

📦 Chunk 6/6
   ❓ Fréchet Distance - Questions: 0.00
   ✅ Fréchet Distance - Answers:   0.00


### Compare individual POS tags at distribution level

#### Eng ver

In [None]:
# # Example usage: compare individual POS tags at distribution level
# import spacy
# from scipy.spatial.distance import jensenshannon

# def calculate_js_divergence(data1, data2):
#     # Convert datasets into probability distributions
#     max_val = max(max(data1), max(data2)) + 1
#     prob_dist1 = np.zeros(max_val)
#     prob_dist2 = np.zeros(max_val)

#     for val in data1:
#         prob_dist1[val] += 1
#     for val in data2:
#         prob_dist2[val] += 1

#     prob_dist1 /= np.sum(prob_dist1)
#     prob_dist2 /= np.sum(prob_dist2)

#     # Calculate JS divergence
#     js_divergence = jensenshannon(prob_dist1, prob_dist2, base=2)

#     return js_divergence

# nlp = spacy.load("en_core_web_sm")

# # get tag mappings from the POS similarity scorer defined in `style'
# scorer = POSStyleSimilarityScorer()
# upos_mapper = {t:i for i,t in enumerate(scorer.VALID_UPOS)}

# # map texts to POS tag (IDs)
# original_tags, synthetic_tags = [], []
# for text in original_texts:
#     tags, _ = scorer.tag_and_filter(text)
#     original_tags.extend([upos_mapper[t] for t in tags])

# for text in synthetic_texts:
#     tags, _ = scorer.tag_and_filter(text)
#     synthetic_tags.extend([upos_mapper[t] for t in tags])

# print(f"JS Divergence: {calculate_js_divergence(original_tags, synthetic_tags):.2f}")

#### Vietnamese ver

**Compare individual POS tags at distribution level**

So sánh sự phân bố (tần suất) của từng loại từ loại (POS tag) giữa hai tập văn bản.

Cụ thể hơn:

- POS (Part-of-Speech) là từ loại: danh từ (NOUN), động từ (VERB), tính từ (ADJ), trạng từ (ADV), v.v.

- At distribution level nghĩa là không chỉ đếm từng loại, mà so sánh phân bố xác suất giữa 2 tập văn bản: gốc và sinh.

In [None]:
import spacy_udpipe
import numpy as np
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# --- Tải và load model UDPipe tiếng Việt ---
spacy_udpipe.download("vi")  # Chạy 1 lần để tải model tiếng Việt
nlp = spacy_udpipe.load("vi")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Already downloaded a model for the 'vi' language


In [None]:
original_questions = original_df["question"].astype(str).tolist()
synthetic_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
synthetic_answers = test_df["answers"].astype(str).tolist()

In [None]:
import numpy as np
from scipy.spatial.distance import jensenshannon
import spacy_udpipe

# Load mô hình tiếng Việt
nlp = spacy_udpipe.load("vi")

VALID_UPOS = [
    'NOUN', 'VERB', 'ADJ', 'ADV', 'PRON', 'PROPN', 'DET', 'ADP',
    'AUX', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'PUNCT', 'INTJ', 'X', 'SYM'
]
upos_mapper = {t: i for i, t in enumerate(VALID_UPOS)}

def extract_pos_ids(texts):
    """Trích xuất POS tag và ánh xạ sang ID"""
    pos_ids = []
    for doc in nlp.pipe(texts):
        for token in doc:
            if token.pos_ in upos_mapper:
                pos_ids.append(upos_mapper[token.pos_])
    return pos_ids

def calculate_js_divergence(data1, data2):
    max_val = max(max(data1, default=0), max(data2, default=0)) + 1
    prob_dist1 = np.zeros(max_val)
    prob_dist2 = np.zeros(max_val)
    for val in data1:
        prob_dist1[val] += 1
    for val in data2:
        prob_dist2[val] += 1
    if np.sum(prob_dist1) == 0 or np.sum(prob_dist2) == 0:
        return float('nan')
    prob_dist1 /= np.sum(prob_dist1)
    prob_dist2 /= np.sum(prob_dist2)
    return jensenshannon(prob_dist1, prob_dist2, base=2)

# ---- Chunking ----
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Lấy từng chunk
    o_q_chunk = original_questions[start:end]
    s_q_chunk = synthetic_questions[start:end]
    o_a_chunk = original_answers[start:end]
    s_a_chunk = synthetic_answers[start:end]

    # POS tagging
    orig_q_tags = extract_pos_ids(o_q_chunk)
    syn_q_tags = extract_pos_ids(s_q_chunk)
    orig_a_tags = extract_pos_ids(o_a_chunk)
    syn_a_tags = extract_pos_ids(s_a_chunk)

    # Tính JS Divergence
    js_question = calculate_js_divergence(orig_q_tags, syn_q_tags)
    js_answer = calculate_js_divergence(orig_a_tags, syn_a_tags)

    # In kết quả cho chunk
    print(f"\n📦 Chunk {i+1}/{num_chunks}")
    print(f"   ❓ JS Divergence - Questions: {js_question:.4f}")
    print(f"   ✅ JS Divergence - Answers:   {js_answer:.4f}")


📦 Chunk 1/6
   ❓ JS Divergence - Questions: 0.0783
   ✅ JS Divergence - Answers:   0.0966

📦 Chunk 2/6
   ❓ JS Divergence - Questions: 0.0769
   ✅ JS Divergence - Answers:   0.1071

📦 Chunk 3/6
   ❓ JS Divergence - Questions: 0.0933
   ✅ JS Divergence - Answers:   0.0875

📦 Chunk 4/6
   ❓ JS Divergence - Questions: 0.0811
   ✅ JS Divergence - Answers:   0.0833

📦 Chunk 5/6
   ❓ JS Divergence - Questions: 0.0830
   ✅ JS Divergence - Answers:   0.1096

📦 Chunk 6/6
   ❓ JS Divergence - Questions: 0.0831
   ✅ JS Divergence - Answers:   0.0947


Mean JS Divergence - Questions: 0.0826, Mean JS Divergence - Answers: 0.096467

### Compare POS trigrams at distribution level

#### Eng ver

In [None]:
# # Example usage: compare POS trigrams at distribution level
# import numpy as np
# import spacy
# from nltk.util import trigrams
# from collections import Counter


# def generate_pos_trigram_distribution(nlp, text):
#     tokens = []
#     for doc in nlp.pipe(
#         text,
#         disable=["ner"]
#         ):
#         tokens.extend([token.pos_ for token in doc])
#     trigrams_generated = trigrams(tokens)
#     trigram_counts = Counter(trigrams_generated)
#     # normalize counts to create a distribution
#     total_count = sum(trigram_counts.values())
#     trigram_distribution = {trigram: count / total_count for trigram, count in trigram_counts.items()}

#     return trigram_distribution

# def kl_divergence(p, q):
#     kl_div = 0
#     for key in p:
#         p_val = p[key]
#         q_val = q.get(key, 0)  # default to 0 if key is not in q

#         # only consider non-zero p values
#         if p_val > 0:
#             if q_val > 0:
#                 kl_div += p_val * np.log2(p_val / q_val)
#             else:
#                 kl_div += p_val * np.log2(p_val / (q_val + 1e-10))  # avoid division by zero

#     return kl_div

# def js_divergence(distr1, distr2):
#     avg_distr = {k: (distr1.get(k, 0) + distr2.get(k, 0)) / 2 for k in set(distr1) | set(distr2)}
#     kl_div1 = kl_divergence(distr1, avg_distr)
#     kl_div2 = kl_divergence(distr2, avg_distr)
#     return (kl_div1 + kl_div2) / 2


# nlp = spacy.load("en_core_web_sm")

# trigram_distribution1 = generate_pos_trigram_distribution(nlp, original_texts)
# trigram_distribution2 = generate_pos_trigram_distribution(nlp, synthetic_texts)
# print(trigram_distribution1)
# print(trigram_distribution2)

# js_div = js_divergence(trigram_distribution1, trigram_distribution2)
# print("JS Divergence:", js_div)

#### Vietnamese ver

**Compare POS trigrams at distribution level**

- POS Trigram là chuỗi 3 từ loại liên tiếp trong một câu.
- Ví dụ câu: "Tôi thích ăn phở bò." → POS: PRON VERB VERB NOUN NOUN
→ POS Trigrams:

    - ('PRON', 'VERB', 'VERB')

    - ('VERB', 'VERB', 'NOUN')

    - ('VERB', 'NOUN', 'NOUN')

- Nghĩa là so sánh phân bố tần suất của các POS trigrams giữa hai tập văn bản. Mỗi POS trigram có tần suất xuất hiện riêng → tạo thành một phân phối xác suất. Ta so sánh hai phân phối đó bằng Jensen-Shannon divergence (JS divergence).





In [10]:
original_questions = original_df["question"].astype(str).tolist()
synthetic_questions = test_df["question"].astype(str).tolist()

original_answers = original_df["answers"].astype(str).tolist()
synthetic_answers = test_df["answers"].astype(str).tolist()

In [None]:
import numpy as np
from nltk.util import trigrams
from collections import Counter
import spacy_udpipe

# Load mô hình tiếng Việt
nlp = spacy_udpipe.load("vi")

def generate_pos_trigram_distribution(nlp, texts):
    tokens = []
    for doc in nlp.pipe(texts, disable=["ner"]):
        tokens.extend([token.pos_ for token in doc])
    trigrams_generated = trigrams(tokens)
    trigram_counts = Counter(trigrams_generated)
    total_count = sum(trigram_counts.values())
    trigram_distribution = {
        trigram: count / total_count for trigram, count in trigram_counts.items()
    }
    return trigram_distribution

def kl_divergence(p, q):
    kl_div = 0
    for key in p:
        p_val = p[key]
        q_val = q.get(key, 0)
        if p_val > 0:
            if q_val > 0:
                kl_div += p_val * np.log2(p_val / q_val)
            else:
                kl_div += p_val * np.log2(p_val / (q_val + 1e-10))
    return kl_div

def js_divergence(distr1, distr2):
    keys = set(distr1.keys()) | set(distr2.keys())
    avg_distr = {k: (distr1.get(k, 0) + distr2.get(k, 0)) / 2 for k in keys}
    kl1 = kl_divergence(distr1, avg_distr)
    kl2 = kl_divergence(distr2, avg_distr)
    return (kl1 + kl2) / 2

# ---- Chunking setup ----
chunk_size = 1000
total_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

# Để lưu tổng kết quả
js_q_scores = []
js_a_scores = []

# ---- Process từng chunk ----
for i in range(total_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Lấy dữ liệu cho chunk hiện tại
    o_q_chunk = original_questions[start:end]
    s_q_chunk = synthetic_questions[start:end]
    o_a_chunk = original_answers[start:end]
    s_a_chunk = synthetic_answers[start:end]

    # Tính phân phối POS trigrams
    q_orig_dist = generate_pos_trigram_distribution(nlp, o_q_chunk)
    q_syn_dist = generate_pos_trigram_distribution(nlp, s_q_chunk)
    a_orig_dist = generate_pos_trigram_distribution(nlp, o_a_chunk)
    a_syn_dist = generate_pos_trigram_distribution(nlp, s_a_chunk)

    # Tính JS Divergence
    js_q = js_divergence(q_orig_dist, q_syn_dist)
    js_a = js_divergence(a_orig_dist, a_syn_dist)

    js_q_scores.append(js_q)
    js_a_scores.append(js_a)

    # In kết quả từng chunk
    print(f"\n📦 Chunk {i+1}/{total_chunks}")
    print(f"   ❓ JS Divergence - Questions: {js_q:.4f}")
    print(f"   ✅ JS Divergence - Answers:   {js_a:.4f}")

# ---- In kết quả trung bình sau tất cả các chunk ----
print("\n📊 Trung bình toàn bộ:")
print(f"   ❓ Avg JS Divergence - Questions: {np.mean(js_q_scores):.4f}")
print(f"   ✅ Avg JS Divergence - Answers:   {np.mean(js_a_scores):.4f}")


📦 Chunk 1/6
   ❓ JS Divergence - Questions: 0.0683
   ✅ JS Divergence - Answers:   0.0582

📦 Chunk 2/6
   ❓ JS Divergence - Questions: 0.0661
   ✅ JS Divergence - Answers:   0.0664

📦 Chunk 3/6
   ❓ JS Divergence - Questions: 0.0744
   ✅ JS Divergence - Answers:   0.0572

📦 Chunk 4/6
   ❓ JS Divergence - Questions: 0.0704
   ✅ JS Divergence - Answers:   0.0531

📦 Chunk 5/6
   ❓ JS Divergence - Questions: 0.0733
   ✅ JS Divergence - Answers:   0.0690

📦 Chunk 6/6
   ❓ JS Divergence - Questions: 0.0735
   ✅ JS Divergence - Answers:   0.0578

📊 Trung bình toàn bộ:
   ❓ Avg JS Divergence - Questions: 0.0710
   ✅ Avg JS Divergence - Answers:   0.0603


Mean JS Divergence - Questions: , Mean JS Divergence - Answers:

### Compute divergence of character n-grams

#### Eng ver

In [None]:
# # Example: compute divergence of character n-grams
# import json
# import numpy as np
# from scipy.stats import entropy

# def generate_n_grams(texts, n, pad_token='|'):
#     """Generate n-grams from the given list of texts, padding the end if necessary."""
#     all_n_grams = []
#     for text in texts:
#         # determine the padding required to complete the last n-gram
#         padding_required = (n - len(text) % n) % n
#         padded_text = text + pad_token * padding_required
#         # generate n-grams from the padded text
#         n_grams = [padded_text[i:i+n] for i in range(0, len(padded_text), n)]
#         all_n_grams.extend(n_grams)
#     return all_n_grams

# def update_mapping(n_grams, mapping):
#     max_value = max(mapping.values(), default=0)
#     for n_gram in n_grams:
#         if n_gram not in mapping:
#             max_value += 1
#             mapping[n_gram] = max_value
#     return mapping

# def process_texts(texts, n=3, mapping={}):
#     n_grams = generate_n_grams(texts, n)
#     mapping = update_mapping(n_grams, mapping)
#     processed_texts = [[mapping[n_gram] for n_gram in generate_n_grams([text], n)] for text in texts]
#     return processed_texts, mapping

# def compute_freq_dist(mapping, processed_texts):
#     """Compute frequency distribution of n-grams."""
#     freq_dist = np.zeros(max(mapping.values()) + 1)
#     for text in processed_texts:
#         for n_gram_idx in text:
#             freq_dist[n_gram_idx] += 1
#     return freq_dist

# def normalize_dist(freq_dist):
#     """Convert frequency distribution to probability distribution."""
#     total_count = np.sum(freq_dist)
#     return freq_dist / total_count if total_count > 0 else np.zeros_like(freq_dist)

# def js_divergence(p, q):
#     m = 0.5 * (p + q)
#     p = np.where(p == 0, 1e-10, p)  # avoid log(0)
#     q = np.where(q == 0, 1e-10, q)
#     m = np.where(m == 0, 1e-10, m)
#     return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

# processed_texts_original, ngram_to_id = process_texts(original_texts)
# processed_texts_synthetic, ngram_to_id = process_texts(synthetic_texts, mapping=ngram_to_id)
# freq_dist_original = compute_freq_dist(ngram_to_id, processed_texts_original)
# freq_dist_synthetic = compute_freq_dist(ngram_to_id, processed_texts_synthetic)

# # Convert to probability distributions and ensure they have the same length
# prob_dist_original = normalize_dist(freq_dist_original)
# prob_dist_synthetic = normalize_dist(freq_dist_synthetic)
# length = max(len(prob_dist_original), len(prob_dist_synthetic))
# prob_dist_original = np.pad(prob_dist_original, (0, length - len(prob_dist_original)), 'constant')
# prob_dist_synthetic = np.pad(prob_dist_synthetic, (0, length - len(prob_dist_synthetic)), 'constant')

# js_div = js_divergence(prob_dist_original, prob_dist_synthetic)

# print(f"JS Divergence: {js_div:.2f}")

#### Vietnamese ver

**Compute divergence of character n-grams**

Character n-gram là các chuỗi n ký tự liên tiếp trong một văn bản.

Việc tính JS divergence trên n-grams ký tự giúp:

- Đánh giá mức độ giống nhau về mặt hình thức (character-level) giữa văn bản gốc và văn bản sinh.

- Nhận biết xem mô hình sinh có bắt chước đúng mẫu ngôn ngữ, dạng từ, chính tả của dữ liệu gốc hay không.

In [None]:
import numpy as np
from scipy.stats import entropy

def generate_n_grams(texts, n, pad_token='|'):
    all_n_grams = []
    for text in texts:
        padding_required = (n - len(text) % n) % n
        padded_text = text + pad_token * padding_required
        n_grams = [padded_text[i:i+n] for i in range(0, len(padded_text), n)]
        all_n_grams.extend(n_grams)
    return all_n_grams

def update_mapping(n_grams, mapping):
    max_value = max(mapping.values(), default=0)
    for n_gram in n_grams:
        if n_gram not in mapping:
            max_value += 1
            mapping[n_gram] = max_value
    return mapping

def process_texts(texts, n=3, mapping=None):
    if mapping is None:
        mapping = {}
    n_grams = generate_n_grams(texts, n)
    mapping = update_mapping(n_grams, mapping)
    processed_texts = [
        [mapping[n_gram] for n_gram in generate_n_grams([text], n)]
        for text in texts
    ]
    return processed_texts, mapping

def compute_freq_dist(mapping, processed_texts):
    freq_dist = np.zeros(max(mapping.values()) + 1)
    for text in processed_texts:
        for idx in text:
            freq_dist[idx] += 1
    return freq_dist

def normalize_dist(freq_dist):
    total = np.sum(freq_dist)
    return freq_dist / total if total > 0 else np.zeros_like(freq_dist)

def js_divergence(p, q):
    m = 0.5 * (p + q)
    p = np.where(p == 0, 1e-10, p)
    q = np.where(q == 0, 1e-10, q)
    m = np.where(m == 0, 1e-10, m)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

# ---- Chunked processing ----
chunk_size = 1000
num_chunks = (len(original_questions) + chunk_size - 1) // chunk_size

js_q_scores = []
js_a_scores = []

for i in range(num_chunks):
    start = i * chunk_size
    end = min((i + 1) * chunk_size, len(original_questions))

    # Questions
    q_orig_chunk = original_questions[start:end]
    q_syn_chunk = synthetic_questions[start:end]

    proc_q_orig, map_q = process_texts(q_orig_chunk)
    proc_q_syn, _ = process_texts(q_syn_chunk, mapping=map_q)

    freq_q_orig = compute_freq_dist(map_q, proc_q_orig)
    freq_q_syn = compute_freq_dist(map_q, proc_q_syn)

    prob_q_orig = normalize_dist(freq_q_orig)
    prob_q_syn = normalize_dist(freq_q_syn)

    length_q = max(len(prob_q_orig), len(prob_q_syn))
    prob_q_orig = np.pad(prob_q_orig, (0, length_q - len(prob_q_orig)))
    prob_q_syn = np.pad(prob_q_syn, (0, length_q - len(prob_q_syn)))

    js_q = js_divergence(prob_q_orig, prob_q_syn)
    js_q_scores.append(js_q)

    # Answers
    a_orig_chunk = original_answers[start:end]
    a_syn_chunk = synthetic_answers[start:end]

    proc_a_orig, map_a = process_texts(a_orig_chunk)
    proc_a_syn, _ = process_texts(a_syn_chunk, mapping=map_a)

    freq_a_orig = compute_freq_dist(map_a, proc_a_orig)
    freq_a_syn = compute_freq_dist(map_a, proc_a_syn)

    prob_a_orig = normalize_dist(freq_a_orig)
    prob_a_syn = normalize_dist(freq_a_syn)

    length_a = max(len(prob_a_orig), len(prob_a_syn))
    prob_a_orig = np.pad(prob_a_orig, (0, length_a - len(prob_a_orig)))
    prob_a_syn = np.pad(prob_a_syn, (0, length_a - len(prob_a_syn)))

    js_a = js_divergence(prob_a_orig, prob_a_syn)
    js_a_scores.append(js_a)

    # Print per-chunk result
    print(f"\n📦 Chunk {i + 1}/{num_chunks}")
    print(f"   ❓ JS Divergence - Questions: {js_q:.4f}")
    print(f"   ✅ JS Divergence - Answers:   {js_a:.4f}")

# ---- Print average result ----
print("\n📊 Trung bình toàn bộ:")
print(f"   ❓ Avg JS Divergence - Questions: {np.mean(js_q_scores):.4f}")
print(f"   ✅ Avg JS Divergence - Answers:   {np.mean(js_a_scores):.4f}")


📦 Chunk 1/6
   ❓ JS Divergence - Questions: 0.0906
   ✅ JS Divergence - Answers:   0.0610

📦 Chunk 2/6
   ❓ JS Divergence - Questions: 0.0885
   ✅ JS Divergence - Answers:   0.0706

📦 Chunk 3/6
   ❓ JS Divergence - Questions: 0.0884
   ✅ JS Divergence - Answers:   0.0660

📦 Chunk 4/6
   ❓ JS Divergence - Questions: 0.0877
   ✅ JS Divergence - Answers:   0.0629

📦 Chunk 5/6
   ❓ JS Divergence - Questions: 0.0983
   ✅ JS Divergence - Answers:   0.0676

📦 Chunk 6/6
   ❓ JS Divergence - Questions: 0.0949
   ✅ JS Divergence - Answers:   0.0673

📊 Trung bình toàn bộ:
   ❓ Avg JS Divergence - Questions: 0.0914
   ✅ Avg JS Divergence - Answers:   0.0659
