In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# File structure
'''mt5_HuggingFace/
├── clean/
│   ├── es-CL
│   ├── es-CL
│   └── ...
└── mt5/
    ├── mt5_finetune.ipynb
    └── my_saved_mt5_model/'''

# Instructions to Run This Notebook (Using Pre-trained Model)

These instructions will guide you through running the notebook to use the *already saved* pre-trained MT5 model for translation, skipping the training steps to save 15+ minutes.

### 1. Data and Notebook Access
*   **Share the Saved Model:** Ensure the `mt5` folder containing the saved model and notebook, this is also the default folder to save the model.
*   **Original Data (Optional):** The original data folder (`clean`) are only needed if you intend to run trainning model. __And you also need to modify the data path__.

### 2. Everything Runs in Google Colab

### 3. Mount Google Drive

### 4. Verify Model Path
*   Ensure that the `model_save_path` variable points to the intended location of saved model in Google Drive. Based on __my__ steps, this is `/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model`.

### 5. Install Required Libraries

### 6. Set Up GPU Runtime

### 7. Run Necessary Cells in Order
*   Since you're using a pre-trained model, you will skip the entire training process.
*   **Minimum cells to run:**
    *   **Mount Drive**
    *   **Load Model & Tokenizer:** This cell should look something like this in my path:
        ```python
        model_path = "/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model"
        tokenizer = T5TokenizerFast.from_pretrained(model_path)
        model = MT5ForConditionalGeneration.from_pretrained(model_path)
        print(f"Model and tokenizer loaded from: {model_path}")
        ```
    *   **Define `translate_mt5` function**
    *   **Define decoding configs**
    *   **Run translation examples**


### 8. View Output
*   The translation outputs will be printed directly below the relevant cells.

## Overall Notebook Logic and Process Flow

Fine-tuning a pre-trained mT5 model for English to Spanish machine translation, specifically for dialectal variations found in Gnome project data

1.  **Data Loading and Preparation**:
    *   **Source Data**: It loads into Hugging Face `Dataset` objects.
    *   **Dataset Addition**: Multiple dialectal datasets can be loaded and concatenated into a single `all_pairs`.
    *   **Train/Validation Split**

2.  **Model and Tokenizer Initialization**:
    *   **Base Model**: A pre-trained `google/mt5-small` model and its corresponding `T5TokenizerFast` are loaded from Hugging Face Hub. mT5 (Massive Text-to-Text Transfer Transformer) is a multilingual encoder-decoder model suitable for translation tasks.
    *   **Task Prefix**: A `task_prefix` ("translate English to Spanish: ") is defined.
    *   **Tokenization**: A `preprocess_batch` function is defined to tokenize both the English source and Spanish target sentences. __It also adds the task prefix to the English input.__

3.  **Model Training**:
    *   **Data Collator**: `DataCollatorForSeq2Seq`
    *   **Training Arguments**: `Seq2SeqTrainingArguments`
    *   **Trainer Setup**: A `Seq2SeqTrainer`
    *   **Training Execution**: `trainer.train()`

4.  **Model saved to a specified directory on Google Drive**

5.  **Inference and Decoding Strategies**:
    *   **`translate_mt5` Function**: performs translations. It takes an English text, the model, and tokenizer, along with various decoding parameters.
        *   **Greedy Decoding**: Selects the most probable token at each step.
        *   **Beam Search**: Keeps track of multiple probable sequences to find a globally better translation.
        *   **Length Penalty**: Adjusts the likelihood of longer or shorter sequences.


云端硬盘挂载成功后，请提供您要加载的数据文件的完整路径（例如，`/content/drive/My Drive/your_folder/your_file.csv`），我将帮助您将其加载到 pandas DataFrame 中。

# Data Retrieval

In [1]:
import os
from datasets import Dataset
from transformers import MT5ForConditionalGeneration, T5TokenizerFast
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import os
import shutil # Import shutil for directory deletion


In [2]:
data_path = "/content/drive/MyDrive/CS4120/clean"
print("Folders:", os.listdir(data_path))

Folders: ['es-CL', 'es-AR', 'std_es', 'es-VE', 'es-PA', 'es-PR', 'es-UY', 'es-CR', 'es-HN', 'es-CO', 'es-EC', 'es-DO', 'es-SV', 'es-PE', 'es-NI']


In [3]:
region_data = {}

if os.path.exists(data_path):
    sub_folders = sorted(os.listdir(data_path))

    for folder_name in sub_folders:
        folder_full_path = os.path.join(data_path, folder_name)

        # Only process folders like es-AR, es-CO, ...
        if os.path.isdir(folder_full_path) and folder_name.startswith("es-"):

            path_en = os.path.join(folder_full_path, "all.en")
            path_es = os.path.join(folder_full_path, "all.es")

            if os.path.exists(path_en) and os.path.exists(path_es):
                with open(path_en, "r", encoding="utf-8") as f:
                    lines_en = f.read().strip().split("\n")

                with open(path_es, "r", encoding="utf-8") as f:
                    lines_es = f.read().strip().split("\n")

                current_pairs = []
                if len(lines_en) == len(lines_es):
                    for en, es in zip(lines_en, lines_es):
                        if en.strip() and es.strip():
                            current_pairs.append({"en": en.strip(), "es": es.strip()})

                region_data[folder_name] = current_pairs

print("Loaded regions:", list(region_data.keys()))

Loaded regions: ['es-AR', 'es-CL', 'es-CO', 'es-CR', 'es-DO', 'es-EC', 'es-HN', 'es-NI', 'es-PA', 'es-PE', 'es-PR', 'es-SV', 'es-UY', 'es-VE']


In [4]:
all_pairs = []

for region, pairs in region_data.items():
    for p in pairs:
        all_pairs.append({
            "input_text": p["es"],     # Spanish dialect sentence
            "target_text": p["en"],    # English sentence
            "region": region           # Show the region
        })

print("Total training pairs:", len(all_pairs))

Total training pairs: 12860


In [18]:
dataset = Dataset.from_list(all_pairs)

dataset = dataset.train_test_split(test_size=0.1, shuffle=True)

train_ds = dataset["train"]
val_ds = dataset["test"]

train_ds, test_ds

(Dataset({
     features: ['input_text', 'target_text', 'region'],
     num_rows: 11574
 }),
 Dataset({
     features: ['input_text', 'target_text', 'region'],
     num_rows: 1286
 }))

In [11]:
# translation is like a text-to-text problem
# input: en_sentence
# output: es_sentence
# import multilingual translation model and the tool needed to prepare text
# the trainning process is to maximize the log-likelihood of the target sequence tokens (cross-entropy).
model_name = "google/mt5-small"

tokenizer = T5TokenizerFast.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

In [19]:
# raw text -> token IDs for subword tokenization and SentencePiece
# Rather than BoW or fixed-length vectors, the model sees a sequence of (subword) indices;
# the transformer turns them into contextual embeddings via self-attention.

max_source_length = 128
max_target_length = 128
task_prefix = "translate English to Spanish: "

def preprocess_batch(batch):
    # 1. Build the input (source) text
    inputs = [task_prefix + s for s in batch["input_text"]]
    targets = batch["target_text"]

    # 2. Tokenize inputs
    # to convert both English inputs and Spanish targets into numerical token IDs
    model_inputs = tokenizer(
        inputs,
        max_length=max_source_length,
        truncation=True,
    )

    # 3. Tokenize targets (labels)
    # It sets the tokenized Spanish sentences as labels for the model to learn from
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets,
            max_length=max_target_length,
            truncation=True,
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# use batch and remove the original text col
train_tokenized = train_ds.map(
    preprocess_batch,
    batched=True,
    remove_columns=train_ds.column_names,
)

val_tokenized = val_ds.map(
    preprocess_batch,
    batched=True,
    remove_columns=val_ds.column_names,
)

Map:   0%|          | 0/11574 [00:00<?, ? examples/s]



Map:   0%|          | 0/1286 [00:00<?, ? examples/s]

In [20]:
# use Seq2SeqTrainer for encoder-decoder models
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Seq2SeqTrainingArguments defines all the hyperparameters and strategies for training.
# These arguments control various aspects of the training loop.
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-gnome-en-es",   # Directory where model checkpoints and logs will be saved.
    per_device_train_batch_size=4,  # Batch size for training on each device (GPU/CPU).
    per_device_eval_batch_size=4,   # Batch size for evaluation on each device.
    learning_rate=3e-4,             # The initial learning rate for the optimizer.
    num_train_epochs=3,             # Total number of training epochs to perform.
    logging_steps=100,              # Number of update steps between two logs.
    eval_strategy="epoch",          # Evaluate the model at the end of each epoch.
    save_strategy="epoch",          # Save the model checkpoint at the end of each epoch.
    predict_with_generate=True,     # Whether to use generate to calculate metrics (useful for sequence generation tasks).
    fp16=False,                     # Whether to use mixed precision training (float16). Set to True for performance on compatible GPUs.
)

In [21]:
# apply LR and neural LMs, and minimize cross-entropy
# coder–decoder transformer that learns a conditional distribution P(Spanish token | English tokens)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

  trainer = Seq2SeqTrainer(


In [22]:
# API key will be required here
trainer.train()

  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mczzzttttt1[0m ([33mczzzttttt1-northeastern-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss
1,0.4803,0.186883
2,0.1743,0.043479
3,0.0823,0.02628


TrainOutput(global_step=8682, training_loss=0.9239993244237928, metrics={'train_runtime': 1835.5689, 'train_samples_per_second': 18.916, 'train_steps_per_second': 4.73, 'total_flos': 846625330237440.0, 'train_loss': 0.9239993244237928, 'epoch': 3.0})

In [23]:


# Define the Google Drive path where to save the model
model_save_path = "/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model"

# Ensure the parent directory exists
os.makedirs(os.path.dirname(model_save_path), exist_ok=True)

# Check if the directory already exists and delete it to ensure a clean save
if os.path.exists(model_save_path):
    print(f"Deleting existing directory: '{model_save_path}' to ensure a clean save.")
    shutil.rmtree(model_save_path)

# Recreate the directory after deletion
os.makedirs(model_save_path, exist_ok=True)

# Save the trained model and tokenizer to the specified path
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"Model and tokenizer saved to: {model_save_path}")

Deleting existing directory: '/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model' to ensure a clean save.
Model and tokenizer saved to: /content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model


# **Start to run from the below if you DON'T want to retrain the model(otherwise, it may take at least 15 minutes).**

In [24]:
# If there's error to load the saved model, you may need to downgrade the colab.
# This is for compatibility issue
'''import transformers
print(transformers.__version__)

!pip install -q "transformers==4.57.1"'''

'import transformers\nprint(transformers.__version__)\n\n!pip install -q "transformers==4.57.1"'

In [30]:
from transformers import MT5ForConditionalGeneration, T5TokenizerFast
import os

# Load the tokenizer and model from the saved directory
model_path = "/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model"

# Diagnostic code to check model_path contents with error checking
print(f"Attempting to load model from: {model_path}")
if not os.path.exists(model_path):
    print(f"Error: Model path '{model_path}' does not exist. Please ensure the model was saved correctly and Google Drive is mounted.")
elif not os.listdir(model_path):
    print(f"Error: Model path '{model_path}' is empty. Please ensure the model was saved completely.")
else:
    print(f"Contents of '{model_path}':")
    for item in os.listdir(model_path):
        print(f"  - {item}")

# Load model and tokenizer with error checking
# You're expected to see "Model and tokenizer loaded successfully from:..."
try:
    tokenizer = T5TokenizerFast.from_pretrained(model_path)
    model = MT5ForConditionalGeneration.from_pretrained(model_path)
    print(f"Model and tokenizer loaded successfully from: {model_path}")
except AttributeError as e:
    print(f"\nAn AttributeError occurred during model/tokenizer loading: {e}")
    print("This often happens if the configuration files (e.g., config.json, tokenizer_config.json) are missing or corrupted in the saved directory.")
    print("Please ensure the model was saved completely and correctly to the specified path, and try re-running the save cell first.")
except Exception as e:
    print(f"\nAn unexpected error occurred during model/tokenizer loading: {e}")


Attempting to load model from: /content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model
Contents of '/content/drive/MyDrive/CS4120/mt5/my_saved_mt5_model':
  - config.json
  - generation_config.json
  - model.safetensors
  - tokenizer_config.json
  - special_tokens_map.json
  - spiece.model
  - tokenizer.json
  - training_args.bin

An AttributeError occurred during model/tokenizer loading: 'dict' object has no attribute 'model_type'
This often happens if the configuration files (e.g., config.json, tokenizer_config.json) are missing or corrupted in the saved directory.
Please ensure the model was saved completely and correctly to the specified path, and try re-running the save cell first.


In [31]:
# decoding process to find y_hat = argmaxP(y|x)
#	So use heuristics:
#	Greedy: at each step take the most probable next token.
#	Beam search: keep the top k partial sequences (beam size), expand each, keep top k again.
#	Add length penalties to avoid over-favoring short sequences.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate_mt5(
    text_en, # The English text to be translated
    model,   # The MT5 model used for translation
    tokenizer, # The tokenizer corresponding to the MT5 model
    num_beams=1, # Number of beams for beam search. 1 means greedy decoding.
    do_sample=False, # Whether to use sampling; False for deterministic decoding (beam search/greedy)
    max_length=128, # Maximum length of the generated target sequence
    length_penalty=1, # Penalty for generating longer sequences
    temperature=1, # Controls randomness in sampling. Lower values make output more deterministic.
    top_p=None, # Top-p (nucleus) sampling parameter
):
    # Prepare the input text with the task prefix
    input_text = task_prefix + text_en
    # Tokenize the input text and move it to the appropriate device (CPU/GPU)
    inputs = tokenizer(
        input_text,
        return_tensors="pt", # Return PyTorch tensors
        truncation=True,     # Truncate sequences longer than max_source_length
        max_length=max_source_length,
    ).to(device)

    # Define generation arguments
    gen_kwargs = {
        "max_length": max_length,
        "num_beams": num_beams,
        "length_penalty": length_penalty,
        "do_sample": do_sample,
        "temperature": temperature,
    }

    # Add top_p to generation arguments if specified
    if top_p is not None:
        gen_kwargs["top_p"] = top_p

    # Generate the output sequence (translated text token IDs)
    output_ids = model.generate(**inputs, **gen_kwargs)
    # Decode the generated token IDs back into human-readable text, skipping special tokens
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [32]:
decoding_configs = [
    {"name": "greedy",        "num_beams": 1, "do_sample": False, "length_penalty": 1.0},
    {"name": "beam_4",        "num_beams": 4, "do_sample": False, "length_penalty": 1.0},
    {"name": "beam_8",        "num_beams": 8, "do_sample": False, "length_penalty": 1.0},
    {"name": "beam_4_lp_0.6", "num_beams": 4, "do_sample": False, "length_penalty": 0.6},
    {"name": "beam_4_lp_1.4", "num_beams": 4, "do_sample": False, "length_penalty": 1.4},
    # Optional:
    # {"name": "top_p_0.9", "num_beams": 1, "do_sample": True,  "top_p": 0.9, "temperature": 0.7},
]

In [33]:
# Defind those variable again in here(if you want to use the existing saved model rather than retrainning the model)
max_source_length = 128
max_target_length = 128
task_prefix = "translate English to Spanish: "

In [34]:
# Some test sentences to be translated
test_examples = [
    "Keyboard Accessibility Preferences",
    "Shows the status of keyboard accessibility features",
    "There was an error launching the help viewer.",
]

# Apply the model and get the translation with customized parameters(beam_# ...)
for text in test_examples:
    print(f"\nSOURCE: {text}")
    for cfg in decoding_configs:
        out = translate_mt5(
            text_en=text,
            model=model,
            tokenizer=tokenizer,
            num_beams=cfg.get("num_beams", 1),
            do_sample=cfg.get("do_sample", False),
            length_penalty=cfg.get("length_penalty", 1.0),
            temperature=cfg.get("temperature", 1.0),
            top_p=cfg.get("top_p", None),
        )
        print(f"[{cfg['name']}] {out}")


SOURCE: Keyboard Accessibility Preferences
[greedy] Default Keyboard
[beam_4] Default Keyboard
[beam_8] Default Keyboard
[beam_4_lp_0.6] Default Keyboard
[beam_4_lp_1.4] Default Keyboard

SOURCE: Shows the status of keyboard accessibility features
[greedy] Shows the status of keyboard accessibility features
[beam_4] Shows the status of keyboard accessibility features
[beam_8] Shows the status of keyboard accessibility features
[beam_4_lp_0.6] Shows the status of keyboard accessibility features
[beam_4_lp_1.4] Shows the status of keyboard accessibility features

SOURCE: There was an error launching the help viewer.
[greedy] There was an error launching the help viewer.
[beam_4] There was an error launching the help viewer.
[beam_8] There was an error launching the help viewer.
[beam_4_lp_0.6] There was an error launching the help viewer.
[beam_4_lp_1.4] There was an error launching the help viewer.


Output explanation:

1. Greedy: the model simply picks the word with the highest probability as the next word in the sequence. It can be suboptimal because a locally optimal choice at one step might lead to a globally bad translation later on.
2. Beam: Instead of just picking the single best word at each step, beam search keeps track of the num_beams (e.g., 4 or 8) most probable partial translations. Therefore, __it's less likely to get stuck in local optima. Increasing num_beams usually leads to better quality, up to a point.__
3. lp_#: This parameter is used with beam search to influence the length of the generated translation. Models sometimes have a bias towards generating shorter sequences. A higher number will encourage to generate longer sequences. However, __if the outputs are the same, it means the length penalties doesn't alter the most probable sequence for this model__.

# End of code