<a href="https://colab.research.google.com/github/j-chim/QMUL-Thesis-Draft/blob/main/cl_synthetic_data_evaluation_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About
This notebook contains example code for the intrinsic evaluation described in our paper: [Evaluating Synthetic Data Generation from User Generated Text (Chim et al., CL 2024)](https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00540/124625).

## Setup
The notebook is split by evaluation aspect: meaning, style, and divergence.

Most sections can be run directly with little setup. However, for style evaluation, you will need the idiolect model trained by [Zhu and Jurgens, 2021](https://aclanthology.org/2021.emnlp-main.25/) or alternative style-sensitive embeddings. Our paper uses the following weights, re-saved from Zhu and Jurgens's model for software version compatibility: https://drive.google.com/file/d/1SXSlp4K9sM5EOhiwkP-XjkUZIzc9worB/view?usp=sharing. Save the file to your Google Drive (if running directly on Colab) or download it for offline use.



In [1]:
# Example synthetic texts from a single source, varying in style and meaning similarity

synthetic_texts = [
    "The nimble brown fox hops across the sleepy dog.",
    "the lazy canine was lying around when it got jumped over by a quick-moving brown fox!",
    "despite heavy rain yesterday evening, remember to water those plants!"
]

original_texts = [
    "The quick brown fox jumps over the lazy dog."
] * len(synthetic_texts)

# Meaning

In [5]:
%%capture
# the main reported metric is BERTScore, which is conveniently run using the official implementation
!pip install bert-score

In [23]:
from bert_score import BERTScorer

scorer = BERTScorer(model_type="roberta-large", lang="en")
# we report the mean (f1.mean()) in our paper
# (named f1 rather than F so it does not shadow torch.nn.functional, imported as F below)
_, _, f1 = scorer.score(synthetic_texts, original_texts)

print("\nBERTScore of each example text:")
for score, synthetic_text in zip(f1, synthetic_texts):
    print(f"{synthetic_text} (score: {score.item():.2f})")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERTScore of each example text:
The nimble brown fox hops across the sleepy dog. (score: 0.96)
the lazy canine was lying around when it got jumped over by a quick-moving brown fox! (score: 0.91)
despite heavy rain yesterday evening, remember to water those plants! (score: 0.84)
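
For intuition about what the scores above measure, BERTScore greedily matches each token embedding in one text to its most similar token embedding in the other. The block below is a minimal sketch of that matching step only, using hypothetical 2-d vectors in place of real contextual embeddings. It omits IDF weighting and baseline rescaling, and the helper names (`cosine`, `greedy_f1`) are illustrative rather than part of the `bert-score` package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_f1(candidate_vecs, reference_vecs):
    """BERTScore-style F1 from greedy token matching (no IDF, no rescaling)."""
    # precision: each candidate token is matched to its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # recall: each reference token is matched to its most similar candidate token
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    return 2 * precision * recall / (precision + recall)

# hypothetical 2-d "token embeddings", for illustration only
reference = [[1.0, 0.0], [0.0, 1.0]]
reordered = [[0.0, 1.0], [1.0, 0.0]]   # same tokens in a different order
partial = [[1.0, 0.0], [1.0, 0.0]]     # covers only one reference token

f1_reordered = greedy_f1(reordered, reference)
f1_partial = greedy_f1(partial, reference)
print(f"reordered: {f1_reordered:.2f}, partial: {f1_partial:.2f}")
```

Because matching is greedy and order-free, the reordered candidate scores a perfect F1, while the candidate that misses a reference token is penalized through recall.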


# Style

### Embedding-based

In [7]:
%%capture
!pip install transformers==4.30.2 # needed to load style embeddings

In [None]:
import torch
from transformers import RobertaConfig, RobertaModel, AutoTokenizer
import torch.nn.functional as F
import numpy as np
from scipy.spatial.distance import cosine

# Adapted from: https://github.com/lingjzhu/idiolect

class AttentionPooling(torch.nn.Module):
    """
    Implementation of SelfAttentionPooling
    Original Paper: Self-Attention Encoding and Pooling for Speaker Recognition
    https://arxiv.org/pdf/2008.01077v1.pdf
    """

    def __init__(self, input_dim):
        super(AttentionPooling, self).__init__()
        self.W = torch.nn.Linear(input_dim, 1)
        self.softmax = torch.nn.functional.softmax

    def forward(self, batch_rep, att_mask=None):
        """
        N: batch size, T: sequence length, H: Hidden dimension
        input:
            batch_rep : size (N, T, H)
        attention_weight:
            att_w : size (N, T, 1)
        return:
            utter_rep: size (N, H)
        """
        att_logits = self.W(batch_rep).squeeze(-1)
        if att_mask is not None:
            att_logits = att_mask + att_logits
        att_w = self.softmax(att_logits, dim=-1).unsqueeze(-1)
        utter_rep = torch.sum(batch_rep * att_w, dim=1)

        return utter_rep


class DNNSelfAttention(torch.nn.Module):
    def __init__(self, hidden_dim, **kwargs):
        super(DNNSelfAttention, self).__init__()
        self.pooling = AttentionPooling(hidden_dim)
        self.out_layer = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, features, att_mask):
        out = self.pooling(features, att_mask).squeeze(-1)
        predicted = self.out_layer(out)
        return predicted


class SRoberta(torch.nn.Module):
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        config = RobertaConfig.from_pretrained(model_name, return_dict=True)
        config.output_hidden_states = True
        self.roberta = RobertaModel.from_pretrained(model_name, config=config)

        self.pooler = DNNSelfAttention(768)

    def forward(self, input_ids, att_mask=None):
        out = self.roberta(input_ids, att_mask)
        out = out.last_hidden_state
        out = self.pooler(out, att_mask)
        return out


def batch_embed(texts, model, tokenizer, max_length=512):
    inputs = tokenizer(
        texts,
        add_special_tokens=True,
        max_length=max_length,
        padding=True,
        truncation=True,
        return_tensors="pt"
        )
    with torch.no_grad():
        hidden = model(
            inputs['input_ids'].to(device),
            inputs['attention_mask'].to(device)
        )
        hidden = F.normalize(hidden, dim=-1).cpu().detach()
    return hidden

In [34]:
# 1. Load model
# The following implementation assumes you are loading the weights directly from Google Drive

from google.colab import drive
drive.mount("/content/MyDrive/")

style_model_path = "/content/MyDrive/MyDrive/experiments/sroberta_model-4_reddit_resave.bin" # replace with your path
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

model = SRoberta()
# map_location places the weights on whichever device is available (GPU or CPU)
model.load_state_dict(torch.load(style_model_path, map_location=device))
model.eval()  # inference only
_ = model.to(device)

# 2. Extract embeddings
original_embeds = batch_embed(original_texts, model, tokenizer)
synthetic_embeds = batch_embed(synthetic_texts, model, tokenizer)

# 3. Compute embedding similarity
scores = np.array([1 - cosine(a, b) for a, b in zip(original_embeds, synthetic_embeds)])
# if all synthetic texts are from the same system, we can report scores.mean()
print("\nIdiolect embedding scores for each example text:")
for score, synthetic_text in zip(scores, synthetic_texts):
    print(f"{synthetic_text} (score: {score.item():.2f})")

Drive already mounted at /content/MyDrive/; to attempt to forcibly remount, call drive.mount("/content/MyDrive/", force_remount=True).


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Idiolect embedding scores for each example text:
The nimble brown fox hops across the sleepy dog. (score: 0.90)
the lazy canine was lying around when it got jumped over by a quick-moving brown fox! (score: 0.69)
despite heavy rain yesterday evening, remember to water those plants! (score: 0.56)
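
When the synthetic texts come from more than one generation system, the per-text scores can be averaged per system rather than overall. The block below is a minimal sketch of that aggregation. The scores are copied from the output above, while the system labels and the helper `mean_score_by_system` are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_system(scores, systems):
    """Average per-text similarity scores within each generation system."""
    grouped = defaultdict(list)
    for score, system in zip(scores, systems):
        grouped[system].append(score)
    return {system: mean(values) for system, values in grouped.items()}

# scores from the example output above; system labels are hypothetical
example_scores = [0.90, 0.69, 0.56]
example_systems = ["system_a", "system_a", "system_b"]

system_means = mean_score_by_system(example_scores, example_systems)
for system, value in system_means.items():
    print(f"{system}: {value:.3f}")
```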


### POS-based

In [5]:
from collections import Counter
import numpy as np
import spacy
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words('english')

nlp = spacy.load("en_core_web_sm")

class POSStyleSimilarityScorer:
    def __init__(self):
        # Ireland and Pennebaker (2010) capture writing style
        # by examining POS tag occurrences across categories:
        # 0) adv, 1) adj, 2) conj, 3) det, 4) noun, 5) pron, 6) preposition, 7) punct
        self._VALID_UPOS = {
            "ADV",
            "ADJ",
            # NO AUX,
            "CCONJ",
            "SCONJ",
            "DET",
            # NO INTJ
            "NOUN",
            "PROPN",
            # NO NUM
            "PRON",
            "ADP",
            "PART",
            "PUNCT",
            # NO SYM
            # NO VERB,
            # NO X
        }
        self.VALID_UPOS = sorted(self.map_tag(t) for t in self._VALID_UPOS)

    def map_tag(self, tag):
        # collapse UPOS tagset to categories
        mapper = {
            "CCONJ": "CONJ",
            "SCONJ": "CONJ",
            "PROPN": "NOUN",
            "ADP": "PREP",
            "PART": "PREP",
            # BNC2014
            "SUBST": "NOUN",
            "ART": "DET",
            "INTERJ": "INTJ",
        }
        return mapper.get(tag, tag)

    def compute_jaccard_similarity(self, list1, list2):
        # Jaccard similarity: |intersection| / |union| over the unique items
        set1, set2 = set(list1), set(list2)
        union_length = len(set1 | set2)
        if union_length == 0:
            return 0.0
        return len(set1 & set2) / union_length

    def tag_and_filter(self, text):
        doc = nlp(text)
        return [self.map_tag(t.pos_) for t in doc if t.pos_ in self._VALID_UPOS], len(
            doc
        )

    def word_pos_score(self, pos1, pos2, len1, len2):
        """
        Calculate POS similarity (Ireland and Pennebaker 2010) over UPOS tags.
            1. for each POS category, get its count in proportion to total sentence length
            2. calculate similarity score wrt each category
            3. average to get total POS similarity score
        """
        pos_counts1 = Counter(pos1)
        pos_counts2 = Counter(pos2)

        category_scores = []
        for t in self.VALID_UPOS:
            cat1 = pos_counts1.get(t, 0) / len1
            cat2 = pos_counts2.get(t, 0) / len2
            if cat1 == 0 and cat2 == 0:
                score = 1
            else:
                score = 1 - (abs(cat1 - cat2) / (cat1 + cat2))
            category_scores.append(score)

        return np.mean(category_scores)

    def trigram_pos_score(self, pos1, pos2):
        # note that this returns 0 when either text has fewer than three retained POS tags
        pos1 = self._make_ngrams(pos1, n=3)
        pos2 = self._make_ngrams(pos2, n=3)
        return self.compute_jaccard_similarity(pos1, pos2)

    def get_trigram_pos_score(self, text1, text2):
        pos1, _ = self.tag_and_filter(text1)
        pos2, _ = self.tag_and_filter(text2)
        return self.trigram_pos_score(pos1, pos2)

    def get_mean_trigram_pos_scores(self, src, targets, **kwargs):
        return np.mean([self.get_trigram_pos_score(t, src) for t in targets]).item()

    def _make_ngrams(self, l, n=3):
        return ["".join(l[i : i + n]) for i in range(len(l) - n + 1)]

scorer = POSStyleSimilarityScorer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
trigram_pos_scores, word_pos_scores = [], []
for orig_text, syn_text in zip(original_texts, synthetic_texts):
    orig_pos, orig_len = scorer.tag_and_filter(orig_text)
    syn_pos, syn_len = scorer.tag_and_filter(syn_text)
    trigram_pos_score = scorer.get_trigram_pos_score(orig_text, syn_text)
    word_pos_score = scorer.word_pos_score(orig_pos, syn_pos, orig_len, syn_len)
    trigram_pos_scores.append(trigram_pos_score)
    word_pos_scores.append(word_pos_score)

print("\nPOS-based scores for each example text:")
for i, synthetic_text in enumerate(synthetic_texts):
    print(f"{synthetic_text} (Word: {word_pos_scores[i]:.2f}; Trigram: {trigram_pos_scores[i]:.2f})")


POS-based scores for each example text:
The nimble brown fox hops across the sleepy dog. (Word: 1.00; Trigram: 1.00)
the lazy canine was lying around when it got jumped over by a quick-moving brown fox! (Word: 0.52; Trigram: 0.19)
despite heavy rain yesterday evening, remember to water those plants! (Word: 0.64; Trigram: 0.00)
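
The word-level score above averages a per-category similarity over the POS categories. As a worked example of the per-category step, the sketch below restates the formula from `word_pos_score` as a standalone function and applies it to made-up counts. The function name `category_similarity` and the counts are illustrative assumptions.

```python
def category_similarity(count1, len1, count2, len2):
    """Similarity for one POS category, from its relative frequency in each text."""
    p1, p2 = count1 / len1, count2 / len2
    if p1 == 0 and p2 == 0:
        return 1.0  # a category absent from both texts counts as a perfect match
    return 1 - abs(p1 - p2) / (p1 + p2)

# made-up counts: 2 determiners in a 10-token text vs 1 determiner in a 10-token text
det_score = category_similarity(2, 10, 1, 10)
# a category used in only one of the two texts scores 0
adv_score = category_similarity(0, 10, 3, 10)

print(f"DET: {det_score:.2f}, ADV: {adv_score:.2f}")
```

A category that appears in only one text contributes 0 to the average, which partly explains why the unrelated third example does not score far below the paraphrase on the word-level metric: categories missing from both texts still count as matches.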


# Divergence

In [1]:
%%capture
!pip install evaluate
!pip install sacrebleu

In [3]:
import evaluate

sacrebleu = evaluate.load("sacrebleu")

results = sacrebleu.compute(
    predictions=synthetic_texts,
    references=original_texts,
    use_effective_order=True,
    smooth_method='floor',
    force=True
)  # use_effective_order=True is the setting intended for short, sentence-level inputs

# Divergence, reported as 100 - BLEU, aggregated over all synthetic/original pairs
round(100 - results["score"], 4)

97.628
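
BLEU is built on clipped n-gram precision, so a paraphrase that keeps the meaning but changes the wording quickly loses higher-order overlap and ends up with a high divergence. The sketch below illustrates only that building block, on lowercased, whitespace-tokenized versions of the example sentences. It is a simplification and not sacreBLEU itself: tokenization, the brevity penalty, smoothing, and the geometric mean over n-gram orders are all left out, and the helper names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams also found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # each candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

reference = "the quick brown fox jumps over the lazy dog"
candidate = "the nimble brown fox hops across the sleepy dog"

unigram_precision = clipped_precision(candidate, reference, 1)
bigram_precision = clipped_precision(candidate, reference, 2)
print(f"1-gram: {unigram_precision:.3f}, 2-gram: {bigram_precision:.3f}")
```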

# Other

## Distribution-level metrics

In [9]:
# Example usage: compare embedding distribution distances

import numpy as np
import scipy.linalg  # explicit submodule import for scipy.linalg.sqrtm
from scipy.stats import entropy
import torch
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

# Fréchet distance code adapted from: https://github.com/mchong6/FID_IS_infinity/blob/master/score_infinity.py

def calculate_feature_statistics(feats):
    """Calculation of the statistics used by the Fréchet distance.
    Params:
    -- feats : array of features (here, text embeddings) with the shape [N, D]
    Returns:
    -- mu    : The mean of the features over samples, shape [D].
    -- sigma : The covariance matrix of the features, shape [D, D].
    """
    mu = np.mean(feats, axis=0)  # feats: (N, D) -> mu: (D,)
    sigma = np.cov(feats, rowvar=False)
    return mu, sigma


def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """Numpy implementation of the Fréchet distance.
    The Fréchet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
    and X_2 ~ N(mu_2, C_2) is
            d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
    Stable version by Dougal J. Sutherland.
    Params:
    -- mu1   : Mean embedding vector of the first set of texts, shape [D].
    -- mu2   : Mean embedding vector of the second set of texts, shape [D].
    -- sigma1: Covariance matrix of the first set's embeddings, shape [D, D].
    -- sigma2: Covariance matrix of the second set's embeddings, shape [D, D].
    Returns:
    --   : The Fréchet distance (the value d^2 above).
    """

    mu1 = np.atleast_1d(mu1)
    mu2 = np.atleast_1d(mu2)

    sigma1 = np.atleast_2d(sigma1)
    sigma2 = np.atleast_2d(sigma2)

    assert mu1.shape == mu2.shape, \
        'Training and test mean vectors have different lengths'
    assert sigma1.shape == sigma2.shape, \
        'Training and test covariances have different dimensions'

    diff = mu1 - mu2

    # Product might be almost singular
    covmean, _ = scipy.linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        msg = ('fid calculation produces singular product; '
               'adding %s to diagonal of cov estimates') % eps
        print(msg)
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = scipy.linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))

    # Numerical error might give slight imaginary component
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
            m = np.max(np.abs(covmean.imag))
            raise ValueError('Imaginary component {}'.format(m))
        covmean = covmean.real

    tr_covmean = np.trace(covmean)

    return (diff.dot(diff) + np.trace(sigma1)
            + np.trace(sigma2) - 2 * tr_covmean)

# Example usage - compare meaning distributions
# (note: this method is best suited to larger numbers of examples)

device = "cuda" if torch.cuda.is_available() else "cpu"

# if comparing style embedding distributions,
# load the SRoberta style model (or another authorship-related model) instead
meaning_model_name = "roberta-large"
# load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained(meaning_model_name)
model = AutoModel.from_pretrained(meaning_model_name)
model.eval()
model.to(device)

def embed(texts):
    # tokenize and move the inputs to the same device as the model
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # note: roberta-large has no trained pooler weights (see the warning in the output),
        # so pooler_output passes through a randomly initialized layer and values can differ
        # between runs; pooling last_hidden_state instead may give reproducible values
        embeds = model(**inputs).pooler_output
    # L2-normalize, then move back to CPU before converting to numpy
    return F.normalize(embeds, dim=-1).cpu().numpy()

original_embeds = embed(original_texts)
synthetic_embeds = embed(synthetic_texts)

orig_mu, orig_sigma = calculate_feature_statistics(original_embeds)
syn_mu, syn_sigma = calculate_feature_statistics(synthetic_embeds)
distance = calculate_frechet_distance(orig_mu, orig_sigma, syn_mu, syn_sigma, eps=1e-6)

print(f"\nFréchet distance: {distance:.2f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Fréchet distance: 0.03
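
To see what the two terms of the Fréchet distance contribute, it helps to look at the special case where both covariance matrices are diagonal. There the matrix square root reduces to element-wise square roots, and the formula becomes a sum over dimensions. The sketch below implements only this special case in plain Python. The function name `frechet_distance_diagonal` and the inputs are illustrative assumptions, and it is not a replacement for the full-covariance implementation above.

```python
import math

def frechet_distance_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance (d^2) between two Gaussians with diagonal covariances.

    mu1, mu2: per-dimension means; var1, var2: per-dimension variances.
    """
    # ||mu_1 - mu_2||^2
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)) for diagonal C_1, C_2
    cov_term = sum(v1 + v2 - 2 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# identical distributions: distance 0
same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# shifted mean, same spread: only the mean term contributes (3^2 + 4^2)
shifted = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
# same mean, different spread: only the covariance term contributes (1 + 4 - 2*2)
wider = frechet_distance_diagonal([0.0], [1.0], [0.0], [4.0])

print(same, shifted, wider)
```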


In [6]:
# Example usage: compare individual POS tags at distribution level
import spacy
from scipy.spatial.distance import jensenshannon

def calculate_js_divergence(data1, data2):
    # Convert datasets into probability distributions
    max_val = max(max(data1), max(data2)) + 1
    prob_dist1 = np.zeros(max_val)
    prob_dist2 = np.zeros(max_val)

    for val in data1:
        prob_dist1[val] += 1
    for val in data2:
        prob_dist2[val] += 1

    prob_dist1 /= np.sum(prob_dist1)
    prob_dist2 /= np.sum(prob_dist2)

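    # Note: scipy's `jensenshannon` returns the Jensen-Shannon *distance*,
    # i.e. the square root of the JS divergence; square it to obtain the divergence
    js_distance = jensenshannon(prob_dist1, prob_dist2, base=2)

    return js_distance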

nlp = spacy.load("en_core_web_sm")

# get tag mappings from the POS similarity scorer defined in the Style section
scorer = POSStyleSimilarityScorer()
upos_mapper = {t:i for i,t in enumerate(scorer.VALID_UPOS)}

# map texts to POS tag (IDs)
original_tags, synthetic_tags = [], []
for text in original_texts:
    tags, _ = scorer.tag_and_filter(text)
    original_tags.extend([upos_mapper[t] for t in tags])

for text in synthetic_texts:
    tags, _ = scorer.tag_and_filter(text)
    synthetic_tags.extend([upos_mapper[t] for t in tags])

print(f"JS Divergence: {calculate_js_divergence(original_tags, synthetic_tags):.2f}")

JS Divergence: 0.28
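
For reference when reading values like the one above, the sketch below computes the Jensen-Shannon divergence with base-2 logarithms in plain Python, along with the Jensen-Shannon distance, which is its square root. With base 2 both quantities lie between 0 (identical distributions) and 1 (no shared support). The function names are illustrative, and the toy distributions are made up.

```python
import math

def kl_divergence_base2(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence_base2(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence_base2(p, m) + 0.5 * kl_divergence_base2(q, m)

def js_distance_base2(p, q):
    """Jensen-Shannon distance: the square root of the divergence."""
    return math.sqrt(js_divergence_base2(p, q))

identical = js_divergence_base2([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence_base2([1.0, 0.0], [0.0, 1.0])
partial_div = js_divergence_base2([0.5, 0.5], [1.0, 0.0])
partial_dist = js_distance_base2([0.5, 0.5], [1.0, 0.0])

print(f"divergence: {partial_div:.3f}, distance: {partial_dist:.3f}")
```

Since the distance is the square root of a value in [0, 1], it is never smaller than the divergence, so it matters which of the two is being reported when comparing numbers across tools.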


In [3]:
# Example usage: compare POS trigrams at distribution level
import numpy as np
import spacy
from nltk.util import trigrams
from collections import Counter


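def generate_pos_trigram_distribution(nlp, texts):
    # `texts` is a list of strings; tags from all texts are concatenated,
    # so trigrams can span the boundary between consecutive texts
    tokens = []
    for doc in nlp.pipe(
        texts,
        disable=["ner"]
        ):
        tokens.extend([token.pos_ for token in doc])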
    trigrams_generated = trigrams(tokens)
    trigram_counts = Counter(trigrams_generated)
    # normalize counts to create a distribution
    total_count = sum(trigram_counts.values())
    trigram_distribution = {trigram: count / total_count for trigram, count in trigram_counts.items()}

    return trigram_distribution

def kl_divergence(p, q):
    kl_div = 0
    for key in p:
        p_val = p[key]
        q_val = q.get(key, 0)  # default to 0 if key is not in q

        # only consider non-zero p values
        if p_val > 0:
            if q_val > 0:
                kl_div += p_val * np.log2(p_val / q_val)
            else:
                kl_div += p_val * np.log2(p_val / (q_val + 1e-10))  # avoid division by zero

    return kl_div

def js_divergence(distr1, distr2):
    avg_distr = {k: (distr1.get(k, 0) + distr2.get(k, 0)) / 2 for k in set(distr1) | set(distr2)}
    kl_div1 = kl_divergence(distr1, avg_distr)
    kl_div2 = kl_divergence(distr2, avg_distr)
    return (kl_div1 + kl_div2) / 2


nlp = spacy.load("en_core_web_sm")

trigram_distribution1 = generate_pos_trigram_distribution(nlp, original_texts)
trigram_distribution2 = generate_pos_trigram_distribution(nlp, synthetic_texts)
print(trigram_distribution1)
print(trigram_distribution2)

js_div = js_divergence(trigram_distribution1, trigram_distribution2)
print("JS Divergence:", js_div)



{('DET', 'ADJ', 'ADJ'): 0.10714285714285714, ('ADJ', 'ADJ', 'NOUN'): 0.10714285714285714, ('ADJ', 'NOUN', 'VERB'): 0.10714285714285714, ('NOUN', 'VERB', 'ADP'): 0.10714285714285714, ('VERB', 'ADP', 'DET'): 0.10714285714285714, ('ADP', 'DET', 'ADJ'): 0.10714285714285714, ('DET', 'ADJ', 'NOUN'): 0.10714285714285714, ('ADJ', 'NOUN', 'PUNCT'): 0.10714285714285714, ('NOUN', 'PUNCT', 'DET'): 0.07142857142857142, ('PUNCT', 'DET', 'ADJ'): 0.07142857142857142}
{('DET', 'ADJ', 'ADJ'): 0.02564102564102564, ('ADJ', 'ADJ', 'NOUN'): 0.02564102564102564, ('ADJ', 'NOUN', 'VERB'): 0.02564102564102564, ('NOUN', 'VERB', 'ADP'): 0.02564102564102564, ('VERB', 'ADP', 'DET'): 0.02564102564102564, ('ADP', 'DET', 'ADJ'): 0.05128205128205128, ('DET', 'ADJ', 'NOUN'): 0.05128205128205128, ('ADJ', 'NOUN', 'PUNCT'): 0.05128205128205128, ('NOUN', 'PUNCT', 'DET'): 0.02564102564102564, ('PUNCT', 'DET', 'ADJ'): 0.02564102564102564, ('ADJ', 'NOUN', 'AUX'): 0.02564102564102564, ('NOUN', 'AUX', 'VERB'): 0.0256410256410256

In [15]:
# Example: compute divergence of character n-grams
import json
import numpy as np
from scipy.stats import entropy

def generate_n_grams(texts, n, pad_token='|'):
    """Generate non-overlapping character n-grams (consecutive chunks of length n) from the given list of texts, padding the end if necessary."""
    all_n_grams = []
    for text in texts:
        # determine the padding required to complete the last n-gram
        padding_required = (n - len(text) % n) % n
        padded_text = text + pad_token * padding_required
        # generate n-grams from the padded text
        n_grams = [padded_text[i:i+n] for i in range(0, len(padded_text), n)]
        all_n_grams.extend(n_grams)
    return all_n_grams

def update_mapping(n_grams, mapping):
    max_value = max(mapping.values(), default=0)
    for n_gram in n_grams:
        if n_gram not in mapping:
            max_value += 1
            mapping[n_gram] = max_value
    return mapping

def process_texts(texts, n=3, mapping=None):
    # avoid a mutable default argument: start from a fresh mapping unless one is passed in
    if mapping is None:
        mapping = {}
    n_grams = generate_n_grams(texts, n)
    mapping = update_mapping(n_grams, mapping)
    processed_texts = [[mapping[n_gram] for n_gram in generate_n_grams([text], n)] for text in texts]
    return processed_texts, mapping

def compute_freq_dist(mapping, processed_texts):
    """Compute frequency distribution of n-grams."""
    freq_dist = np.zeros(max(mapping.values()) + 1)
    for text in processed_texts:
        for n_gram_idx in text:
            freq_dist[n_gram_idx] += 1
    return freq_dist

def normalize_dist(freq_dist):
    """Convert frequency distribution to probability distribution."""
    total_count = np.sum(freq_dist)
    return freq_dist / total_count if total_count > 0 else np.zeros_like(freq_dist)

def js_divergence(p, q):
    m = 0.5 * (p + q)
    p = np.where(p == 0, 1e-10, p)  # avoid log(0)
    q = np.where(q == 0, 1e-10, q)
    m = np.where(m == 0, 1e-10, m)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

processed_texts_original, ngram_to_id = process_texts(original_texts)
processed_texts_synthetic, ngram_to_id = process_texts(synthetic_texts, mapping=ngram_to_id)
freq_dist_original = compute_freq_dist(ngram_to_id, processed_texts_original)
freq_dist_synthetic = compute_freq_dist(ngram_to_id, processed_texts_synthetic)

# Convert to probability distributions and ensure they have the same length
prob_dist_original = normalize_dist(freq_dist_original)
prob_dist_synthetic = normalize_dist(freq_dist_synthetic)
length = max(len(prob_dist_original), len(prob_dist_synthetic))
prob_dist_original = np.pad(prob_dist_original, (0, length - len(prob_dist_original)), 'constant')
prob_dist_synthetic = np.pad(prob_dist_synthetic, (0, length - len(prob_dist_synthetic)), 'constant')

js_div = js_divergence(prob_dist_original, prob_dist_synthetic)

print(f"JS Divergence: {js_div:.2f}")

JS Divergence: 0.67
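
The character n-grams above are consecutive, non-overlapping chunks, with the last chunk padded to length `n`. The sketch below shows what that chunking produces for a short string and contrasts it with a sliding-window variant. The sliding-window function is shown only for comparison and is not what this notebook uses; both function names are illustrative.

```python
def chunk_text(text, n, pad_token="|"):
    """Split text into consecutive, non-overlapping chunks of length n, padding the last one."""
    padding_required = (n - len(text) % n) % n
    padded_text = text + pad_token * padding_required
    return [padded_text[i:i + n] for i in range(0, len(padded_text), n)]

def sliding_ngrams(text, n):
    """Overlapping character n-grams (for comparison only; not used in this notebook)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

chunks = chunk_text("abcde", 3)
windows = sliding_ngrams("abcde", 3)
print(chunks)
print(windows)
```

With chunking, the resulting units depend on where each text starts, so inserting a single character at the beginning of a text shifts every later chunk. This is worth keeping in mind when interpreting the divergence for short texts.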
