# Definition-to-Neologism Generation

This project is inspired by the research of my professor, Paul Lerner, as outlined in his paper [Towards Machine Translation of Scientific Neologisms](https://aclanthology.org/2024.jeptalnrecital-taln.17/). While the paper itself is in French, an abstract in English provides insight into its objectives:

> Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge in French often requires translating these terms, to avoid multiplying anglicisms that are less easily understood by the general public. We propose to explore this task using two thesauri, exploiting the definition of the term to translate it more accurately. To this end, we explore the capabilities of two large multilingual models, BLOOM and CroissantLLM, which can translate scientific terms to some extent. In particular, we show that they often use appropriate morphological procedures, but are limited by the segmentation into sub-lexical units. They are also biased by the frequency of term occurrences and surface similarities between English and French.

For my task, I am focusing on the "DEF" setting, which simplifies the problem as follows: given a definition, the goal is to generate the term that corresponds to it. I will evaluate the generated outputs using Exact Match, meaning the generated term must exactly match the reference.

For example:
- **Input**: "Having to do with the ability to transmit data in either direction."
- **Expected Output**: "bidirectional"

In this case, "bidirectional" is formed by prefixing "bi-" to "directional," itself derived by suffixing "-al" to "direction," which is present in the input definition. This project emphasizes understanding and modeling such morphological and linguistic patterns to achieve accurate term generation.
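
To make the strictness of Exact Match concrete, here is a minimal sketch of the comparison. The helper name `exact_match` and the choice to trim surrounding whitespace are my assumptions for illustration; no other normalization (case, hyphens, punctuation) is applied.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only when the two strings are identical after trimming whitespace."""
    return prediction.strip() == reference.strip()

# The reference term from the example above
reference = "bidirectional"

print(exact_match("bidirectional", reference))    # True: identical strings
print(exact_match("bi-directional", reference))   # False: hyphenated variant
print(exact_match("bidirectional.", reference))   # False: trailing period
```

Near misses such as a hyphenated variant count as errors under this metric, which is why the segmentation of the generated term matters so much.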


# Installation and imports


In [None]:
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')


In [None]:
!nvidia-smi

Mon Nov  4 01:29:04 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [None]:
import torch

In [None]:
assert torch.cuda.is_available(), "Connect to GPU and try again"

# Data

I will restrict myself to the TERMIUM dataset, which provides definitions in both English and French.  
I will use only the English definitions so that I can judge the generations myself, without needing to speak French.  
Therefore, the numbers will not be directly comparable to those in the paper, although they can give a rough idea.

In [None]:
!wget https://github.com/ANR-MaTOS/termium/raw/refs/heads/main/termium.json.zip

--2024-11-04 01:29:09--  https://github.com/ANR-MaTOS/termium/raw/refs/heads/main/termium.json.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ANR-MaTOS/termium/refs/heads/main/termium.json.zip [following]
--2024-11-04 01:29:10--  https://raw.githubusercontent.com/ANR-MaTOS/termium/refs/heads/main/termium.json.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75145939 (72M) [application/zip]
Saving to: 'termium.json.zip'


2024-11-04 01:29:10 (237 MB/s) - 'termium.json.zip' saved [75145939/75145939]



In [None]:
!unzip termium.json.zip

Archive:  termium.json.zip
  inflating: termium.json            


In [None]:
import json

In [None]:
with open("termium.json","rt") as file:
    data = json.load(file)

The dataset has three subsets. Make sure to use:
- the train set to fine-tune your models
- the dev set for any hyperparameter tuning, e.g. how long do you fine-tune
- the test set only for final evaluation

In [None]:
for k, v in data.items():
    print(k, len(v))

dev 5000
train 1158299
test 5001


In [None]:
# Sample data inspection
item = data["train"][1000]



The two fields I am interested in are the English definition (input) and the English term (target).

In [None]:
item["en"]["def"]["text"]   # Definition text

'The inadvertent and irrecoverable loss of nuclear material in an accident.'

In [None]:
item["en"]["text"]    # Target term

'accidental loss'

Note that most examples in the training set do not provide a definition. Make sure to filter them! You should end up with 200K definitions or so.

In [None]:
data["train"][0]["en"]["def"]

{'text': None}
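
The output above shows that a missing definition is stored as `{'text': None}` rather than as an absent key. The sketch below, on two made-up records that mimic this structure, illustrates why testing for the key alone would not filter anything. The helper name `has_definition` and the toy records are hypothetical.

```python
# Toy records mimicking the structure shown above (the second term is made up)
toy_entries = [
    {"en": {"text": "accidental loss",
            "def": {"text": "The inadvertent and irrecoverable loss of nuclear material in an accident."}}},
    {"en": {"text": "some term", "def": {"text": None}}},
]

def has_definition(entry):
    """True only if the English definition text is present and non-empty."""
    definition = entry["en"].get("def") or {}
    return bool(definition.get("text"))

# Checking only for the key keeps both records
key_present = [e for e in toy_entries if "def" in e["en"]]
# Checking the text itself keeps only the record that really has a definition
with_text = [e for e in toy_entries if has_definition(e)]

print(len(key_present), len(with_text))  # 2 1
```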

# Data Cleaning: Filter Entries with Definitions

In [None]:
# Filter entries with definitions for train, dev, and test sets.
# Every entry has a "def" key, but its "text" is None when there is no definition,
# so the filter must test the text itself rather than the presence of the key.
def has_definition(entry):
    return bool((entry["en"].get("def") or {}).get("text"))

filtered_train_data = [entry for entry in data["train"] if has_definition(entry)]
filtered_dev_data = [entry for entry in data["dev"] if has_definition(entry)]
filtered_test_data = [entry for entry in data["test"] if has_definition(entry)]

print("Filtered Train Set Size:", len(filtered_train_data))
print("Filtered Dev Set Size:", len(filtered_dev_data))
print("Filtered Test Set Size:", len(filtered_test_data))


# Models Compared

I will compare two models in this project:
- [mT5](https://aclanthology.org/2021.naacl-main.41/), an encoder-decoder trained on a multilingual corpus, which uses subword tokenization (a SentencePiece vocabulary)
- [ByT5](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00461/110049/ByT5-Towards-a-Token-Free-Future-with-Pre-trained), the same architecture and corpus, except that it is a *byte-level model* (i.e. a *character-level model* for ASCII text, which covers most English words)



## Load Tokenizers for mT5 and ByT5 Models

In [None]:
# from transformers import T5Model, MT5Model, T5TokenizerFast, AutoTokenizer
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq
)


In [None]:
mt5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", legacy=False)



Notice that, with subword tokenization, adding a prefix degrades the segmentation.


"bidirectional" is segmented as `'▁bi', 'direction', 'al'` and therefore does not share a representation with its base, `'▁direction'` (which is different from `'direction'`, an intra-word token).

See the reference paper (if you can read French) or https://aclanthology.org/2021.acl-long.279/

In [None]:
[mt5_tokenizer.tokenize(token) for token in ["bidirectional", "directional", "direction"]]

[['▁bi', 'direction', 'al'], ['▁direction', 'al'], ['▁direction']]

This does not mean that the segmentation of suffixes is perfect either!

For example, "generalization" does not share any representation with its base "generalize", because the two words are segmented differently.

In [None]:
[mt5_tokenizer.tokenize(token) for token in ["generalize", "generalization"]]

[['▁generaliz', 'e'], ['▁general', 'ization']]
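
The lack of shared representation can be quantified by intersecting the token sets of a derived word and its base. The sketch below hard-codes the segmentations copied from the two outputs above, so it runs without loading the tokenizer; the helper name `shared_tokens` is hypothetical.

```python
# Segmentations copied from the mT5 tokenizer outputs shown above
segmentations = {
    "bidirectional": ["▁bi", "direction", "al"],
    "directional": ["▁direction", "al"],
    "direction": ["▁direction"],
    "generalize": ["▁generaliz", "e"],
    "generalization": ["▁general", "ization"],
}

def shared_tokens(word, base):
    """Return the set of tokens that a word and its base have in common."""
    return set(segmentations[word]) & set(segmentations[base])

print(shared_tokens("bidirectional", "directional"))   # {'al'}: only the suffix is shared
print(shared_tokens("directional", "direction"))       # {'▁direction'}: suffixation keeps the base
print(shared_tokens("generalization", "generalize"))   # set(): nothing is shared
```

In these examples the prefixed word keeps only the suffix token in common with its base, while the suffixed word keeps the base token only when the base happens to be segmented the same way.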

This is one of the main questions of this project: will a byte-level model outperform a subword-based model? If so, is it because the byte-level model is better at modeling morphology?

In [None]:
byt5_tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", legacy=False)



With a byte-level model, "bidirectional" shares all of its letters with its base "directional". This approach has two main drawbacks, though: what are they?

In [None]:
[byt5_tokenizer.tokenize(token) for token in ["bidirectional", "directional", "direction"]]

[['b', 'i', 'd', 'i', 'r', 'e', 'c', 't', 'i', 'o', 'n', 'a', 'l'],
 ['d', 'i', 'r', 'e', 'c', 't', 'i', 'o', 'n', 'a', 'l'],
 ['d', 'i', 'r', 'e', 'c', 't', 'i', 'o', 'n']]
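
One cost of byte-level segmentation that is easy to check is sequence length. The sketch below compares the subword segmentation shown earlier with a plain UTF-8 byte count, assuming one token per byte as in the output above (special tokens are ignored). The French word is only an illustration of a non-ASCII case.

```python
word = "bidirectional"

# mT5 segmentation copied from the tokenizer output shown earlier
subword_tokens = ["▁bi", "direction", "al"]
# A byte-level model sees one token per UTF-8 byte
byte_tokens = list(word.encode("utf-8"))

print(len(subword_tokens), len(byte_tokens))  # 3 13

# Outside ASCII, a single character can take several bytes
french = "généralisation"
print(len(french), len(french.encode("utf-8")))  # 14 16
```

Longer sequences mean more encoder and decoder steps for the same text, which matters both for training time and for the `max_length` used during tokenization.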

### Tokenize Data for Training

In [None]:
# Prepare training data
inputs_mt5 = []
targets_mt5 = []
inputs_byt5 = []
targets_byt5 = []

# Keep only entries whose definition text is present (it is None when missing)
for item in data["train"]:
    definition = (item["en"].get("def") or {}).get("text")
    if definition:
        term = item["en"]["text"]

        inputs_mt5.append(definition)
        targets_mt5.append(term)
        inputs_byt5.append(definition)
        targets_byt5.append(term)

In [None]:
# torch and the transformers classes are already imported above
from torch.utils.data import Dataset


In [None]:
class DefinitionTermDataset(Dataset):
    def __init__(self, inputs, targets, tokenizer, max_length=512):
        self.inputs = inputs
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Add a prefix to make it clear this is a definition-to-term task
        input_text = f"Generate term: {self.inputs[idx]}"
        target_text = self.targets[idx]

        # Tokenize inputs and targets
        model_inputs = self.tokenizer(
            input_text,
            max_length=self.max_length,
            padding=False,  # Let the data collator handle padding
            truncation=True,
        )

        # Tokenize targets (text_target replaces the deprecated as_target_tokenizer context manager)
        labels = self.tokenizer(
            text_target=target_text,
            max_length=self.max_length,
            padding=False,  # Let the data collator handle padding
            truncation=True,
        )

        model_inputs['labels'] = labels['input_ids']
        return model_inputs

In [None]:
# Create dataset objects
train_dataset_mt5 = DefinitionTermDataset(inputs_mt5, targets_mt5, mt5_tokenizer)
train_dataset_byt5 = DefinitionTermDataset(inputs_byt5, targets_byt5, byt5_tokenizer)

# Models

In [None]:
# Initialize models and tokenizers
mt5_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")



In [None]:
# Create data collators
mt5_data_collator = DataCollatorForSeq2Seq(
    tokenizer=mt5_tokenizer,
    model=mt5_model,
    padding=True
)
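
Because the dataset returns unpadded examples, padding happens per batch in the collator. The sketch below is a simplified pure-Python illustration of that idea, not the library's implementation: inputs are padded to the longest sequence in the batch, and label padding uses -100 so that the loss ignores those positions. The function name `pad_batch`, the pad id of 0, and the toy token ids are assumptions for illustration.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad a list of tokenized examples to the longest sequence in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input_len - len(f["input_ids"])
        label_pad = max_label_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        # Attention mask is 1 on real tokens and 0 on padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        # Padded label positions use -100 so that the loss ignores them
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch

# Made-up token ids, for illustration only
toy_features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [8, 9, 9, 1]},
]
batch = pad_batch(toy_features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [5, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[9, 1, -100, -100], [8, 9, 9, 1]]
```

Padding per batch rather than to `max_length=512` keeps batches short, since definitions and especially terms are usually far shorter than the maximum.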

In [None]:
mt5_model

MT5ForConditionalGeneration(
  (shared): Embedding(250112, 512)
  (encoder): MT5Stack(
    (embed_tokens): Embedding(250112, 512)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
          

In [None]:

byt5_model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")



In [None]:
byt5_data_collator = DataCollatorForSeq2Seq(
    tokenizer=byt5_tokenizer,
    model=byt5_model,
    padding=True
)


In [None]:
byt5_model

T5ForConditionalGeneration(
  (shared): Embedding(384, 1472)
  (encoder): T5Stack(
    (embed_tokens): Embedding(384, 1472)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1472, out_features=384, bias=False)
              (k): Linear(in_features=1472, out_features=384, bias=False)
              (v): Linear(in_features=1472, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=1472, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1472, out_features=3584, bias=False)
              (wi_1): Linear(in_features=1472, out_features=3584, bias=False)
              (w

# Define Training Arguments with Memory Optimization

In [None]:
# Training arguments for mT5 with memory optimization
training_args_mt5 = TrainingArguments(
    output_dir="./mt5-results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,        # Reduce batch size to minimize memory usage
    gradient_accumulation_steps=1,        # No gradient accumulation
    num_train_epochs=1,                   # Fewer epochs to reduce training time
    save_strategy="epoch",
    fp16=True,                            # Enable mixed-precision training (float16)
    gradient_checkpointing=False,         # Explicitly disable gradient checkpointing
    logging_dir="./logs",
)


In [None]:
# Training arguments for ByT5 with memory optimization
training_args_byt5 = TrainingArguments(
    output_dir="./byt5-results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,        # Reduce batch size to minimize memory usage
    gradient_accumulation_steps=1,        # No gradient accumulation
    num_train_epochs=1,                   # Fewer epochs to reduce training time
    save_strategy="epoch",
    fp16=True,                            # Enable mixed-precision training (float16)
    gradient_checkpointing=False,         # Explicitly disable gradient checkpointing
    logging_dir="./logs",
)
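
To get a feel for what a batch size of 1 implies, the number of optimizer steps per epoch can be approximated with simple arithmetic. This is a rough sketch: the figure of 200,000 examples is the approximate size mentioned earlier, the helper name `optimizer_steps_per_epoch` is hypothetical, and the second configuration is only an illustrative alternative, not a setting used in this notebook.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              num_devices=1, grad_accum_steps=1):
    """Approximate number of optimizer updates needed to see every example once."""
    effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_examples / effective_batch_size)

# Settings used above: batch size 1, no gradient accumulation, one device assumed
print(optimizer_steps_per_epoch(200_000, 1))                      # 200000
# Illustrative alternative: batch size 8 with 4 accumulation steps
print(optimizer_steps_per_epoch(200_000, 8, grad_accum_steps=4))  # 6250
```

Gradient accumulation raises the effective batch size without raising peak memory per step, which is one way to trade update frequency against memory on a small GPU.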

# Model Training

In [None]:
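# Build the dev split so the per-epoch evaluation requested in training_args_mt5 has data
dev_pairs = [(e["en"]["def"]["text"], e["en"]["text"]) for e in data["dev"]
             if e["en"].get("def") and e["en"]["def"].get("text")]
dev_inputs, dev_targets = zip(*dev_pairs)

trainer_mt5 = Trainer(
    model=mt5_model,
    args=training_args_mt5,
    train_dataset=train_dataset_mt5,
    eval_dataset=DefinitionTermDataset(dev_inputs, dev_targets, mt5_tokenizer),
    data_collator=mt5_data_collator,
)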


In [None]:
trainer_mt5.train()

Epoch,Training Loss,Validation Loss


In [None]:

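# Build the dev split so the per-epoch evaluation requested in training_args_byt5 has data
dev_pairs = [(e["en"]["def"]["text"], e["en"]["text"]) for e in data["dev"]
             if e["en"].get("def") and e["en"]["def"].get("text")]
dev_inputs, dev_targets = zip(*dev_pairs)

trainer_byt5 = Trainer(
    model=byt5_model,
    args=training_args_byt5,
    train_dataset=train_dataset_byt5,
    eval_dataset=DefinitionTermDataset(dev_inputs, dev_targets, byt5_tokenizer),
    data_collator=byt5_data_collator,
)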

In [None]:
trainer_byt5.train()

#  Hyperparameter Tuning

In [None]:
import copy

from sklearn.model_selection import ParameterGrid
from tqdm import tqdm
import numpy as np

# Hyperparameter tuning: grid search scored by the loss on the dev set
def tune_hyperparameters(model_name, tokenizer, base_model, train_data, dev_data):
    param_grid = {
        'learning_rate': [1e-4, 3e-4, 5e-4],
        'batch_size': [4, 8],
        'num_epochs': [2, 3]
    }

    best_score = float("inf")  # eval_loss: lower is better
    best_params = None

    for params in ParameterGrid(param_grid):
        print(f"\nTrying parameters: {params}")

        # Start every configuration from the same weights so that runs are comparable
        model = copy.deepcopy(base_model)

        training_args = TrainingArguments(
            output_dir=f"./{model_name}-tuning",
            learning_rate=params['learning_rate'],
            per_device_train_batch_size=params['batch_size'],
            num_train_epochs=params['num_epochs'],
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            fp16=True,
        )

        # Create trainer with current parameters
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_data,
            eval_dataset=dev_data,
            data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
        )

        # Train and evaluate
        trainer.train()
        eval_results = trainer.evaluate()
        current_score = eval_results['eval_loss']

        # Keep the configuration with the lowest dev loss
        if current_score < best_score:
            best_score = current_score
            best_params = params

    return best_params
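
Before launching the search, it helps to count how many training runs the grid implies. The sketch below enumerates the same grid with the standard library, which should give the same combinations as `ParameterGrid` for a single dictionary of lists. Counting the total number of epochs is my own rough cost estimate, not something the tuning function reports.

```python
from itertools import product

# Same grid as in tune_hyperparameters above
param_grid = {
    "learning_rate": [1e-4, 3e-4, 5e-4],
    "batch_size": [4, 8],
    "num_epochs": [2, 3],
}

# Cartesian product of all value lists, one dict per configuration
configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(len(configs))                            # 12 full training runs
print(sum(c["num_epochs"] for c in configs))   # 30 epochs over the training set in total
```

With around 200K training definitions, 30 epochs is a large budget for a single T4, so tuning on a subsample of the training set may be a more realistic option.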


In [None]:
# Prepare test data
def prepare_test_data(data, tokenizer):
    test_inputs = []
    test_targets = []

    for item in data["test"]:
        # The "def" key is always present; its "text" is None when there is no definition
        definition = (item["en"].get("def") or {}).get("text")
        if definition:
            test_inputs.append(definition)
            test_targets.append(item["en"]["text"])

    return DefinitionTermDataset(test_inputs, test_targets, tokenizer)

# Define Exact Match calculation
def calculate_exact_match(predictions, targets):
    matches = sum(1 for pred, target in zip(predictions, targets) if pred.strip() == target.strip())
    return matches / len(targets) if len(targets) > 0 else 0

# Generate predictions and evaluate
def evaluate_model(model, tokenizer, test_dataset):
    model.eval()
    predictions = []
    targets = []

    for item in tqdm(test_dataset):
        # The dataset returns plain Python lists, so convert them to batched tensors
        input_ids = torch.tensor(item['input_ids']).unsqueeze(0).to(model.device)
        attention_mask = torch.tensor(item['attention_mask']).unsqueeze(0).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=50,
                num_beams=4,
                early_stopping=True
            )

        pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
        target = tokenizer.decode(item['labels'], skip_special_tokens=True)

        predictions.append(pred)
        targets.append(target)

    exact_match = calculate_exact_match(predictions, targets)
    return exact_match, predictions


In [None]:


# Main evaluation pipeline
def run_evaluation():
    # Prepare test data (used only for the final evaluation)
    test_dataset_mt5 = prepare_test_data(data, mt5_tokenizer)
    test_dataset_byt5 = prepare_test_data(data, byt5_tokenizer)

    # Prepare dev data for tuning. prepare_test_data reads the "test" key,
    # so the dev split is passed under that key to reuse the same filtering.
    dev_dataset_mt5 = prepare_test_data({"test": data["dev"]}, mt5_tokenizer)
    dev_dataset_byt5 = prepare_test_data({"test": data["dev"]}, byt5_tokenizer)

    # Tune hyperparameters using dev set
    print("Tuning mT5...")
    best_params_mt5 = tune_hyperparameters(
        'mt5',
        mt5_tokenizer,
        mt5_model,
        train_dataset_mt5,
        dev_dataset_mt5
    )

    print("Tuning ByT5...")
    best_params_byt5 = tune_hyperparameters(
        'byt5',
        byt5_tokenizer,
        byt5_model,
        train_dataset_byt5,
        dev_dataset_byt5
    )

    # Evaluate on test set
    print("\nEvaluating models on test set...")
    exact_match_mt5, predictions_mt5 = evaluate_model(mt5_model, mt5_tokenizer, test_dataset_mt5)
    exact_match_byt5, predictions_byt5 = evaluate_model(byt5_model, byt5_tokenizer, test_dataset_byt5)

    # Print results
    print(f"\nResults:")
    print(f"mT5 Exact Match: {exact_match_mt5:.4f}")
    print(f"ByT5 Exact Match: {exact_match_byt5:.4f}")

    # Analysis
    if exact_match_byt5 > exact_match_mt5:
        print("\nByT5 outperforms mT5 on morphological generation tasks.")
        print("This suggests that character-level modeling is more effective for this task.")
    else:
        print("\nmT5 outperforms ByT5 or performs similarly.")
        print("This suggests that subword tokenization is sufficient for this task.")

    # Error analysis: show decoded text rather than raw token ids
    print("\nError Analysis:")
    for i in range(min(5, len(predictions_mt5))):
        example = test_dataset_mt5[i]
        print(f"\nExample {i+1}:")
        print(f"Input: {mt5_tokenizer.decode(example['input_ids'], skip_special_tokens=True)}")
        print(f"Target: {mt5_tokenizer.decode(example['labels'], skip_special_tokens=True)}")
        print(f"mT5 prediction: {predictions_mt5[i]}")
        print(f"ByT5 prediction: {predictions_byt5[i]}")


In [None]:

# Run the evaluation
run_evaluation()