# Minimal MarianMT Baseline (Phase 1: 3.1.1)

This notebook implements a minimal Kriol → English baseline using MarianMT with its default tokenizer.

- Keeps the built-in Marian tokenizer (no custom tokenizer swap in Phase 1)
- Fine-tunes for 1 epoch to validate the end-to-end pipeline
- Saves a `.pth` checkpoint and Hugging Face artifacts for reproducibility

References:
- Hugging Face MarianMT documentation: https://huggingface.co/docs/transformers/en/model_doc/marian
- Example walkthrough (training & inference): https://www.kaggle.com/code/suraj520/marianmt-know-train-infer


### Environment & Imports

In [13]:
import os
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

import torch
from torch.utils.data import Dataset

import pandas as pd
from sklearn.model_selection import train_test_split

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

print(torch.__version__)
print(torch.cuda.is_available())



2.8.0+cu129
True


### Config

In [14]:

class CFG:
    MODEL_NAME = "Helsinki-NLP/opus-mt-mul-en"  # many → English (MarianMT)
    DATA_FILE = "../data/train_data.xlsx"
    SRC_COL = "kriol"
    TGT_COL = "english"
    OUTPUT_DIR = "../model/minimal_marianmt"
    NUM_EPOCHS = 1
    BATCH_SIZE = 8
    LR = 5e-5
    MAX_LEN = 128
    VAL_SIZE = 0.1
    SEED = 42

os.makedirs(CFG.OUTPUT_DIR, exist_ok=True)

torch.manual_seed(CFG.SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(CFG.SEED)


### Data Loading (CSV or XLSX)

In [15]:

if CFG.DATA_FILE.endswith(".csv"):
    df = pd.read_csv(CFG.DATA_FILE)
elif CFG.DATA_FILE.endswith((".xlsx", ".xls")):
    df = pd.read_excel(CFG.DATA_FILE)
else:
    raise ValueError("Unsupported data file format: use .csv or .xlsx")

assert CFG.SRC_COL in df.columns and CFG.TGT_COL in df.columns, f"Columns not found: {CFG.SRC_COL}, {CFG.TGT_COL}"

# Basic cleaning
df = df[[CFG.SRC_COL, CFG.TGT_COL]].dropna()
df[CFG.SRC_COL] = df[CFG.SRC_COL].astype(str).str.strip()
df[CFG.TGT_COL] = df[CFG.TGT_COL].astype(str).str.strip()
df = df[(df[CFG.SRC_COL] != "") & (df[CFG.TGT_COL] != "")]

train_df, val_df = train_test_split(df, test_size=CFG.VAL_SIZE, random_state=CFG.SEED)
print(len(train_df), len(val_df))


21096 2345


### 3.1.3 Data Preprocessing & Normalization (Kriol → English)

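This step prepares pairs for Kriol→English only:
- Lowercase both sides and strip leading/trailing whitespace (stripping was already applied in the loading cell; lowercasing is new here)
- Deduplicate exact (source, target) pairs so the same pair cannot land in both train and val
- Length filter: at most 128 whitespace-separated words on each side (a word count, not Marian subword tokens)
- Length ratio filter: src/tgt and tgt/src ≤ 3.0, measured in words
- Optional: language ID filter for English targets (off by default; needs the `langid` package)

Note: These filters run before the split to ensure clean train/val sets. The split in this cell replaces the `train_df`/`val_df` created in the loading cell.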
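
To make the length and ratio rules concrete, here is a minimal standalone sketch of the same checks on plain strings. `passes_length_rules` is a hypothetical helper name used only for illustration, and the second example pair is made-up toy data. It counts whitespace-separated words, mirroring the notebook's filter, rather than Marian subword tokens.

```python
def passes_length_rules(src: str, tgt: str, max_tokens: int = 128, len_ratio: float = 3.0) -> bool:
    """Return True if a pair survives the word-count and length-ratio checks."""
    s_len = len(src.split())
    t_len = len(tgt.split())
    if s_len == 0 or t_len == 0:
        return False  # empty side
    if s_len > max_tokens or t_len > max_tokens:
        return False  # too long on either side
    # Reject pairs where one side is more than len_ratio times longer than the other
    return max(s_len / t_len, t_len / s_len) <= len_ratio


examples = [
    ("ai bin go long shop", "i went to the store."),                    # 5 vs 5 words
    ("ok", "this is a much longer english sentence than the source"),  # 1 vs 10 words (toy pair)
]
for src, tgt in examples:
    print(passes_length_rules(src, tgt), "|", src, "->", tgt)
```

A pair whose ratio is exactly 3.0 is kept, since only ratios strictly above the threshold are rejected.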


In [16]:
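# 3.1.3 Filters (length and ratio filters are always applied; LID is off by default)
APPLY_ENGLISH_LID = False  # set True to enable English LID on targets
MAX_TOKENS = 128  # whitespace-separated words, not subword tokens
LEN_RATIO = 3.0

try:
    import langid  # optional dependency, only used when APPLY_ENGLISH_LID is True
except ImportError:
    langid = None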


def simple_token_count(text: str) -> int:
    return len(str(text).split())


def passes_filters(row) -> bool:
    src = str(row[CFG.SRC_COL]).strip().lower()
    tgt = str(row[CFG.TGT_COL]).strip().lower()
    if src == "" or tgt == "":
        return False
    # length tokens
    s_len = simple_token_count(src)
    t_len = simple_token_count(tgt)
    if s_len > MAX_TOKENS or t_len > MAX_TOKENS:
        return False
    # ratio
    if s_len > 0 and t_len > 0:
        if s_len / t_len > LEN_RATIO or t_len / s_len > LEN_RATIO:
            return False
    # optional English LID on target
    if APPLY_ENGLISH_LID and langid is not None:
        lid, _ = langid.classify(tgt)
        if lid != "en":
            return False
    return True

# Apply filters before split (dedup + filters)
df_filtered = df.copy()
df_filtered[CFG.SRC_COL] = df_filtered[CFG.SRC_COL].astype(str).str.strip().str.lower()
df_filtered[CFG.TGT_COL] = df_filtered[CFG.TGT_COL].astype(str).str.strip().str.lower()

# Deduplicate full pairs
df_filtered = df_filtered.drop_duplicates(subset=[CFG.SRC_COL, CFG.TGT_COL])

# Row-wise filter
df_filtered = df_filtered[df_filtered.apply(passes_filters, axis=1)]

train_df, val_df = train_test_split(df_filtered, test_size=CFG.VAL_SIZE, random_state=CFG.SEED)
print("After filters:", len(train_df), len(val_df))



After filters: 20509 2279
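
Deduplicating exact pairs removes identical (source, target) rows, but the same source sentence with a different reference can still appear in both splits. The sketch below is one way to check for that after splitting. `split_overlap` is a hypothetical helper, not part of the notebook, and the pairs are toy data.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def split_overlap(train_pairs: Iterable[Pair], val_pairs: Iterable[Pair]) -> Tuple[Set[Pair], Set[str]]:
    """Return (pairs found in both splits, source sentences found in both splits)."""
    train_pairs = list(train_pairs)
    val_pairs = list(val_pairs)
    shared_pairs = set(train_pairs) & set(val_pairs)
    shared_sources = {s for s, _ in train_pairs} & {s for s, _ in val_pairs}
    return shared_pairs, shared_sources


# Toy data: the same source appears in both splits with different targets.
train = [("ai bin go long shop", "i went to the shop"), ("a", "b")]
val = [("ai bin go long shop", "i went to the store")]
pairs, sources = split_overlap(train, val)
print(len(pairs), len(sources))  # 0 1
```

With the notebook's DataFrames, the inputs could be built as `zip(train_df[CFG.SRC_COL], train_df[CFG.TGT_COL])` and the matching expression for `val_df`.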


⚠️ DO NOT RUN NOW (enable in Phase 3.3.5 – Tokenizer Adoption)

### 3.1.2 Tokenizer Training Plan (Design – no-op)

- Objective: Prepare a shared SentencePiece tokenizer plan for Kriol↔English without changing current training.
- Corpus: All training sentences from both Kriol and English sides combined.
- Preprocessing: lowercase, normalize whitespace/punctuation, keep dialectal spellings, drop empties, deduplicate.
- Hyperparameters: vocab_size=8000 (up to 12000 if OOV>1.5%), character_coverage=0.9995, model_type=unigram; special tokens: <pad>, <s>, </s>, <unk>.
- Training flags (indicative):
  - --model_type=unigram --vocab_size=8000 --character_coverage=0.9995
  - --shuffle_input_sentence=true --max_sentence_length=2048 --num_threads=[CPU cores]
- Artifacts: spm_kriol_en_v1.model, spm_kriol_en_v1.vocab + a JSON with training config.
- Adoption criteria: Switch only if corpus grows >30% or OOV >1.5% and A/B shows COMET improvement.

Evaluation protocol (A/B): Train/score with Marian default vs SentencePiece on same split; report COMET delta and per-segment examples.
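
The adoption criteria above can be sketched as a small decision helper. Two assumptions are made here and should be checked against the project's intent: "OOV" is treated as the share of tokens the tokenizer maps to its unknown id, and the rule is read as "(corpus growth > 30% OR OOV > 1.5%) AND COMET improves". Both function names are hypothetical.

```python
from typing import Sequence


def unk_rate(encoded: Sequence[Sequence[int]], unk_id: int) -> float:
    """Fraction of token ids equal to unk_id across all encoded sentences."""
    total = sum(len(ids) for ids in encoded)
    if total == 0:
        return 0.0
    unk = sum(1 for ids in encoded for tok in ids if tok == unk_id)
    return unk / total


def should_adopt_spm(corpus_growth: float, oov_rate: float, comet_delta: float) -> bool:
    """Assumed reading: (growth > 30% OR OOV > 1.5%) AND COMET delta is positive."""
    trigger = corpus_growth > 0.30 or oov_rate > 0.015
    return trigger and comet_delta > 0.0


# Toy ids where 0 stands in for the unknown-token id.
rate = unk_rate([[5, 9, 0, 7], [0, 3]], unk_id=0)
print(round(rate, 3))                                   # 0.333
print(should_adopt_spm(0.10, rate, comet_delta=0.01))   # True
print(should_adopt_spm(0.50, 0.0, comet_delta=-0.01))   # False
```

If the intended reading is instead "growth > 30%, OR (OOV > 1.5% AND COMET improves)", only the body of `should_adopt_spm` needs to change.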



In [None]:
# 3.1.2 Tokenizer scaffolding (no-op; does not alter training unless enabled)
USE_SPM = False  # flip to True when adopting SentencePiece in later phase
SPM_DIR = "../outputs/tokenizers/spm_kriol_en_v1"  # where spm.model/spm.vocab would live


def prepare_tokenizer_corpus(df, src_col: str, tgt_col: str):
    """Return a list of cleaned lines for SPM training (lowercase, strip, dedup)."""
    src = df[src_col].astype(str).str.lower().str.strip()
    tgt = df[tgt_col].astype(str).str.lower().str.strip()
    lines = pd.concat([src, tgt], ignore_index=True)
    lines = lines[lines != ""].drop_duplicates()
    return lines.tolist()


def train_sentencepiece_corpus(lines, model_prefix: str, vocab_size: int = 8000):
    """Sketch only: real training will be added later. No side effects here."""
    # import sentencepiece as spm
    # spm.SentencePieceTrainer.Train(
    #     input=data_path,
    #     model_prefix=model_prefix,
    #     vocab_size=vocab_size,
    #     character_coverage=0.9995,
    #     model_type="unigram",
    # )
    pass


def load_tokenizer(marian_model_name: str, use_spm: bool = USE_SPM):
    """Return tokenizer, preferring SPM dir if enabled, else Marian default."""
    if use_spm and os.path.isdir(SPM_DIR):
        return AutoTokenizer.from_pretrained(SPM_DIR)
    return AutoTokenizer.from_pretrained(marian_model_name)

# Note: current notebook flow continues to use Marian tokenizer by default.



### Tokenizer & Model

In [17]:

# Adjust if your chosen checkpoint expects language prefixes (see Marian docs)

tokenizer = AutoTokenizer.from_pretrained(CFG.MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(CFG.MODEL_NAME)

assert torch.cuda.is_available(), "CUDA is not available. Please check your GPU drivers and PyTorch install."
device = torch.device("cuda")
model.to(device)
print(device, torch.cuda.get_device_name(0))


cuda NVIDIA GeForce RTX 5060 Laptop GPU


### Dataset

In [18]:

class PairedTextDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer: AutoTokenizer, max_len: int):
        self.src = df[CFG.SRC_COL].tolist()
        self.tgt = df[CFG.TGT_COL].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx: int):
        src_text = str(self.src[idx])
        tgt_text = str(self.tgt[idx])
        # `text_target` tokenizes the target with Marian's target-side vocabulary and
        # stores it under "labels"; it replaces the deprecated `as_target_tokenizer()`.
        # Plain Python lists are returned so DataCollatorForSeq2Seq does the padding
        # and tensor conversion per batch.
        encoded = self.tokenizer(
            src_text,
            text_target=tgt_text,
            max_length=self.max_len,
            truncation=True,
            padding=False,
        )
        return dict(encoded)

train_ds = PairedTextDataset(train_df, tokenizer, CFG.MAX_LEN)
val_ds = PairedTextDataset(val_df, tokenizer, CFG.MAX_LEN)
len(train_ds), len(val_ds)


(20509, 2279)

### Trainer Setup

In [19]:

label_pad_token_id = -100
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, label_pad_token_id=label_pad_token_id)

args = TrainingArguments(
    output_dir=CFG.OUTPUT_DIR,
    num_train_epochs=CFG.NUM_EPOCHS,
    per_device_train_batch_size=CFG.BATCH_SIZE,
    per_device_eval_batch_size=CFG.BATCH_SIZE,
    learning_rate=CFG.LR,
    warmup_steps=500,
    gradient_accumulation_steps=2,  # effective batch ~= 16
    label_smoothing_factor=0.1,
    optim="adamw_torch",
    logging_steps=50,
    fp16=True,
    report_to=["tensorboard"],
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    processing_class=tokenizer,
    data_collator=collator,
)
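
As a sanity check on these settings, the arithmetic below estimates how many optimizer steps one epoch takes and how much of it the 500 warmup steps cover. It is a back-of-the-envelope sketch for a single GPU using the post-filter training size of 20509. `optimizer_steps_per_epoch` is a hypothetical helper, and the exact rounding inside `Trainer` may differ between versions.

```python
import math


def optimizer_steps_per_epoch(num_examples: int, batch_size: int, grad_accum: int) -> int:
    """Batches per epoch, grouped into gradient-accumulation windows (single device)."""
    batches = math.ceil(num_examples / batch_size)  # last partial batch is kept
    return max(math.ceil(batches / grad_accum), 1)


steps = optimizer_steps_per_epoch(num_examples=20509, batch_size=8, grad_accum=2)
print(steps)                  # 1282
print(round(500 / steps, 2))  # 0.39
```

With a single epoch, roughly 39% of training runs under learning-rate warmup, which is worth keeping in mind when comparing against longer runs.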


### Train (1 epoch)

In [20]:

trainer.train()

# Launch TensorBoard from notebook
%load_ext tensorboard
%tensorboard --logdir "../model/minimal_marianmt"


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


Step,Training Loss
50,12.0954
100,4.9147
150,4.1627
200,3.8405
250,3.6348
300,3.4982
350,3.3689
400,3.2897
450,3.2275
500,3.1622




### Save HF artifacts and a .pth checkpoint

In [21]:

final_dir = os.path.join(CFG.OUTPUT_DIR, "final")
os.makedirs(final_dir, exist_ok=True)
trainer.save_model(final_dir)
model_path = os.path.join(final_dir, "model_state.pth")
torch.save(model.state_dict(), model_path)
print(f"Saved .pth to: {model_path}")


Saved .pth to: ../model/minimal_marianmt\final\model_state.pth


### Simple inference helper

In [22]:

# Reload fine-tuned artifacts from final folder for inference
final_dir = os.path.join(CFG.OUTPUT_DIR, "final")
tokenizer = AutoTokenizer.from_pretrained(final_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(final_dir)
model.to(device)

@torch.no_grad()
def translate_kriol_to_english(texts: List[str], max_new_tokens: int = 64, num_beams: int = 4) -> List[str]:
    model.eval()
    # For Kriol → English with Marian (mul-en), no language tags are needed
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=CFG.MAX_LEN)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example: Kriol → English (replace with a real Kriol sentence)
translate_kriol_to_english(["Ai bin go long shop"], num_beams=4)[0]


'i went to the store.'

### COMET Evaluation (Phase 1)
Scores the validation translations with Unbabel COMET (`Unbabel/wmt22-comet-da`) if the package is installed; otherwise prints a note (Python 3.13 may lack prebuilt wheels).


In [23]:
# COMET scoring on validation set
try:
    from comet import download_model, load_from_checkpoint

    # Generate hypotheses for val set
    BATCH = 16
    refs = val_df[CFG.TGT_COL].tolist()
    hyps = []
    srcs = val_df[CFG.SRC_COL].tolist()
    for i in range(0, len(val_df), BATCH):
        hyps.extend(translate_kriol_to_english(srcs[i:i+BATCH], num_beams=4))

    # Prepare data for COMET (src, mt, ref)
    data = [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(srcs, hyps, refs)
    ]

    # Load COMET model and score
    model_path = download_model("Unbabel/wmt22-comet-da")
    comet_model = load_from_checkpoint(model_path)
    output = comet_model.predict(data, batch_size=16, gpus=1 if torch.cuda.is_available() else 0)
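    # Handle COMET 2.x dict-like output; explicit None checks keep a valid 0.0 score from being skipped
    if isinstance(output, dict):
        sys_score = next((output[k] for k in ("system_score", "score", "mean_score") if output.get(k) is not None), None)
        seg_scores = next((output[k] for k in ("scores", "segments_scores", "seg_scores") if output.get(k) is not None), None)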
    else:
        # Fallback for older APIs that may return tuple
        try:
            seg_scores, sys_score = output
        except Exception:
            sys_score = output
            seg_scores = None
    try:
        print(f"COMET system score: {float(sys_score):.4f}")
    except Exception:
        print(f"COMET system score (raw): {sys_score}")
except Exception as e:
    print("COMET evaluation unavailable:", e)
    print("Tip: use Python < 3.13 or install prebuilt wheels, then `pip install unbabel-comet`.")



Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.5.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\TARIK\.cache\huggingface\hub\models--Unbabel--wmt22-comet-da\snapshots\2760a223ac957f30acfb18c8aa649b01cf1d75f2\checkpoints\model.ckpt`
Encoder model frozen.
c:\Users\TARIK\Desktop\Charles Darwin University\4 - Year 1 - Semester 2\IT CODE FAIR\AI Challenge\venv\Lib\site-packages\pytorch_lightning\core\saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

COMET system score: 0.5601
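
The result handling in the cell above can be factored into a small helper that is easy to test without loading COMET. This is a sketch under assumptions: the key names `system_score` and `scores` are what COMET 2.x is expected to expose on its dict-like result, and older releases are assumed to return a `(segment_scores, system_score)` tuple. Verify both against the installed version. `extract_comet_scores` is a hypothetical name.

```python
from typing import Any, List, Optional, Tuple


def extract_comet_scores(output: Any) -> Tuple[Optional[float], Optional[List[float]]]:
    """Return (system_score, segment_scores) from a dict-like or tuple result."""
    if isinstance(output, dict):
        sys_score = output.get("system_score")
        seg_scores = output.get("scores")
    elif isinstance(output, (tuple, list)) and len(output) == 2:
        seg_scores, sys_score = output
    else:
        return None, None
    # Fall back to the mean of segment scores if no system score was provided
    if sys_score is None and seg_scores:
        sys_score = sum(seg_scores) / len(seg_scores)
    return (
        float(sys_score) if sys_score is not None else None,
        list(seg_scores) if seg_scores is not None else None,
    )


# Mock results standing in for real COMET output.
print(extract_comet_scores({"scores": [0.5, 0.7], "system_score": 0.0}))  # (0.0, [0.5, 0.7])
print(extract_comet_scores(([0.2, 0.4], 0.3)))                            # (0.3, [0.2, 0.4])
```

Using `is None` checks rather than chained `or` keeps a legitimate score of 0.0 from being treated as missing.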
