# Translation Fine-tuning - Exploration & Testing

This notebook provides:
- Dataset exploration (sample NLLB-200 data)
- Model testing (generate translations)
- Quality analysis (compare base vs fine-tuned)
- Visualization (text lengths, loss curves, length ratios)

**Usage**: Use this notebook to:
1. Explore the NLLB-200 dataset before training
2. Test fine-tuned models interactively
3. Visualize training results
4. Compare different models

## Setup

In [1]:
# Imports
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # Specify which GPU(s) to use
import sys
import yaml
import torch
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from tqdm.auto import tqdm

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

✓ Imports successful
PyTorch version: 2.5.1+cu121
CUDA available: True
GPU: NVIDIA GeForce RTX 3090


In [2]:
# Configuration
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
HF_CACHE = "/home/orrz/gpufs/hf/.cache/huggingface"

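# Set environment
# Note: the Hugging Face libraries resolve HF_HOME at import time, so setting it
# after the imports in the first cell may have no effect. Move this assignment
# above those imports if the cache location is not picked up.
os.environ['HF_HOME'] = HF_CACHE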

print(f"Device: {DEVICE}")
print(f"HF Cache: {HF_CACHE}")

Device: cuda
HF Cache: /home/orrz/gpufs/hf/.cache/huggingface


## 1. Dataset Exploration

Sample and visualize the NLLB-200 dataset (English-Hebrew).

### Load NLLB-200 Dataset

In [5]:
# Load English-Hebrew dataset from NLLB-200
print("Loading NLLB-200 (English-Hebrew)...")

dataset = load_dataset(
    "allenai/nllb",
    "eng_Latn-heb_Hebr",
    split="train",
    trust_remote_code=True,
    streaming=True
)

# print(f"✓ Dataset loaded: {len(dataset):,} examples")
print(f"\nDataset structure:")
print(dataset)

Loading NLLB-200 (English-Hebrew)...


Repo card metadata block was not found. Setting CardData to empty.



Dataset structure:
IterableDataset({
    features: ['translation', 'laser_score'],
    num_shards: 1
})
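
The `laser_score` feature listed above can be used to drop weakly aligned pairs before sampling. The following is a minimal sketch, assuming that higher scores indicate better alignment and that records have the same shape as the streamed dataset. The helper name `filter_by_laser_score`, the threshold, and the toy records (including their scores) are made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def filter_by_laser_score(records: Iterable[Dict], min_score: float) -> Iterator[Dict]:
    """Yield only records whose 'laser_score' is at least min_score."""
    for record in records:
        score = record.get("laser_score")
        if score is not None and score >= min_score:
            yield record


# Made-up records that mimic the dataset structure (scores are illustrative).
toy_records = [
    {"translation": {"eng_Latn": "Good morning!", "heb_Hebr": "בוקר טוב!"}, "laser_score": 1.25},
    {"translation": {"eng_Latn": "Thank you.", "heb_Hebr": "תודה."}, "laser_score": 1.02},
    {"translation": {"eng_Latn": "Where is the library?", "heb_Hebr": "איפה הספרייה?"}, "laser_score": 1.10},
]

kept = list(filter_by_laser_score(toy_records, min_score=1.05))
print(f"Kept {len(kept)} of {len(toy_records)} records")
```

Because the helper is a generator, it works on a stream without loading it into memory. On the streamed dataset itself, `dataset.filter(lambda ex: ex["laser_score"] >= min_score)` should achieve the same effect.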


### Sample 5 English-Hebrew Translation Pairs

In [6]:
# Sample 5 examples from a shuffled stream
num_samples = 5

# A streaming dataset has no length or random access, so shuffle the stream
# (buffer-based, seeded for reproducibility) and take the first few examples
samples = list(dataset.shuffle(seed=42).take(num_samples))

print(f"\nSampled {num_samples} English-Hebrew pairs:\n")
print("=" * 80)

for i, sample in enumerate(samples, 1):
    english = sample['translation']['eng_Latn']
    hebrew = sample['translation']['heb_Hebr']
    
    print(f"\nExample {i}:")
    print(f"  English: {english}")
    print(f"  Hebrew:  {hebrew}")
    print(f"  Length:  {len(english)} chars (EN) | {len(hebrew)} chars (HE)")
    print("-" * 80)


Sampled 5 English-Hebrew pairs:


Example 1:
  English: That is God's design. that is God's desire.
  Hebrew:  זהו מעמדו של אלוהים, זוהי זהותו של אלוהים.
  Length:  43 chars (EN) | 42 chars (HE)
--------------------------------------------------------------------------------

Example 2:
  English: Still, Pharaoh’s heart is hardened and he refuses.
  Hebrew:  פרעה מקשה את לבו ומסרב, הוא דורש נס.
  Length:  50 chars (EN) | 36 chars (HE)
--------------------------------------------------------------------------------

Example 3:
  English: As you know, God shows up in the most unexpected places.”
  Hebrew:  "אני מניח שאלוהים נוטה להופיע במקומות הכי פחות צפויים."
  Length:  57 chars (EN) | 55 chars (HE)
--------------------------------------------------------------------------------

Example 4:
  English: Alas, he wants to be successful even in his adventure with God."
  Hebrew:  אולם למרבה הצער הוא רוצה להצליח אפילו בהרפתקתו עם האלוהים.
  Length:  64 chars (EN) | 58 chars (HE)
----------

### Create DataFrame for Better Visualization

In [None]:
# Create pandas DataFrame
df_samples = pd.DataFrame([
    {
        'ID': i,
        'English': sample['translation']['eng_Latn'],
        'Hebrew': sample['translation']['heb_Hebr'],
        'EN_Length': len(sample['translation']['eng_Latn']),
        'HE_Length': len(sample['translation']['heb_Hebr'])
    }
    for i, sample in enumerate(samples, 1)
])

print("\nDataFrame of samples:")
df_samples

### Visualize Text Lengths

In [None]:
# Plot character lengths
fig, ax = plt.subplots(figsize=(10, 5))

x = df_samples['ID']
width = 0.35

ax.bar(x - width/2, df_samples['EN_Length'], width, label='English', color='steelblue')
ax.bar(x + width/2, df_samples['HE_Length'], width, label='Hebrew', color='coral')

ax.set_xlabel('Sample ID')
ax.set_ylabel('Character Length')
ax.set_title('Text Length Comparison (English vs Hebrew)')
ax.set_xticks(x)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### Explore Dataset Statistics

In [None]:
# Sample more examples for statistics
sample_size = 1000
print(f"Analyzing {sample_size} random samples...\n")

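# Streaming datasets do not support .select(); take from the shuffled stream instead.
# Materialize as a list so the two passes below do not re-stream the data.
sampled_dataset = list(dataset.shuffle(seed=42).take(sample_size))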

# Extract lengths
en_lengths = [len(ex['translation']['eng_Latn']) for ex in sampled_dataset]
he_lengths = [len(ex['translation']['heb_Hebr']) for ex in sampled_dataset]

# Statistics
stats_df = pd.DataFrame({
    'Metric': ['Mean', 'Median', 'Std Dev', 'Min', 'Max'],
    'English': [
        f"{pd.Series(en_lengths).mean():.1f}",
        f"{pd.Series(en_lengths).median():.1f}",
        f"{pd.Series(en_lengths).std():.1f}",
        f"{min(en_lengths)}",
        f"{max(en_lengths)}"
    ],
    'Hebrew': [
        f"{pd.Series(he_lengths).mean():.1f}",
        f"{pd.Series(he_lengths).median():.1f}",
        f"{pd.Series(he_lengths).std():.1f}",
        f"{min(he_lengths)}",
        f"{max(he_lengths)}"
    ]
})

print("Character Length Statistics:")
print(stats_df.to_string(index=False))
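
The same summary statistics can be computed from any iterable of records with the standard library alone. The sketch below caps the sample with `itertools.islice`, which also works on a stream. The helper name `length_stats` and the toy records are hypothetical.

```python
import statistics
from itertools import islice
from typing import Dict, Iterable


def length_stats(records: Iterable[Dict], lang_key: str, limit: int) -> Dict[str, float]:
    """Summarize character lengths of the first `limit` records for one language."""
    lengths = [len(r["translation"][lang_key]) for r in islice(records, limit)]
    if not lengths:
        raise ValueError("no records to summarize")
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Sample standard deviation needs at least two values.
        "std": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "min": min(lengths),
        "max": max(lengths),
    }


# Made-up records for illustration only.
toy_records = [
    {"translation": {"eng_Latn": "Hi", "heb_Hebr": "היי"}},
    {"translation": {"eng_Latn": "Good morning!", "heb_Hebr": "בוקר טוב!"}},
    {"translation": {"eng_Latn": "Thank you very much.", "heb_Hebr": "תודה רבה."}},
]

en_stats = length_stats(iter(toy_records), "eng_Latn", limit=1000)
print(en_stats)
```

`statistics.stdev` is the sample standard deviation, which matches the default of `pd.Series.std()`.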

In [None]:
# Distribution plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# English distribution
axes[0].hist(en_lengths, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axvline(pd.Series(en_lengths).mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0].axvline(pd.Series(en_lengths).median(), color='green', linestyle='--', linewidth=2, label='Median')
axes[0].set_xlabel('Character Length')
axes[0].set_ylabel('Frequency')
axes[0].set_title('English Text Length Distribution')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Hebrew distribution
axes[1].hist(he_lengths, bins=50, color='coral', alpha=0.7, edgecolor='black')
axes[1].axvline(pd.Series(he_lengths).mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[1].axvline(pd.Series(he_lengths).median(), color='green', linestyle='--', linewidth=2, label='Median')
axes[1].set_xlabel('Character Length')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Hebrew Text Length Distribution')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 2. Test Model Inference

Load a model and generate translations for the sampled examples

### Load Model & Tokenizer

In [None]:
# Configuration - EDIT THESE
BASE_MODEL = "google/gemma-3-1b-it"  # Or path to your fine-tuned model
LORA_ADAPTER = None  # Path to LoRA adapter, or None for base model
# LORA_ADAPTER = "/home/orrz/gpufs/projects/gemma3/outputs/translation/gemma3-1b_en-he_20241029/final_model"

print(f"Loading model: {BASE_MODEL}")
if LORA_ADAPTER:
    print(f"With LoRA adapter: {LORA_ADAPTER}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True
)

# Load LoRA adapter if specified
if LORA_ADAPTER:
    print("Loading LoRA adapter...")
    model = PeftModel.from_pretrained(model, LORA_ADAPTER)
    print("✓ LoRA adapter loaded")

model.eval()
print(f"✓ Model loaded on {model.device}")

### Define Translation Function

In [None]:
def translate(text, source_lang="English", target_lang="Hebrew", max_new_tokens=256):
    """
    Translate text using the loaded model
    
    Args:
        text: Source text to translate
        source_lang: Source language name
        target_lang: Target language name
        max_new_tokens: Maximum tokens to generate
    
    Returns:
        Translated text
    """
    # Create prompt (matches training format)
    prompt = f"Translate from {source_lang} to {target_lang}: {text}"
    
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.3,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens: generate() returns the prompt
    # tokens followed by the continuation, so slice by prompt length instead
    # of matching the prompt string in the decoded text
    prompt_length = inputs["input_ids"].shape[1]
    translation = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    
    return translation

print("✓ Translation function defined")
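
For a decoder-only model, the sequence returned by `generate` starts with the prompt tokens, so the continuation can be isolated by position. This tends to be more robust than searching for the prompt string in the decoded text, since decoding may not reproduce the prompt character for character. The sketch below uses plain lists in place of token id tensors; the helper name and the ids are made up.

```python
from typing import List, Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the tokens generated after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds output length")
    return list(output_ids[prompt_length:])


prompt_ids = [2, 101, 102, 103]          # made-up prompt token ids
output_ids = prompt_ids + [201, 202, 1]  # prompt followed by made-up new tokens
new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

With real tensors the equivalent slice is `outputs[0][inputs["input_ids"].shape[1]:]`, applied before calling `tokenizer.decode`.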

### Generate Translations for Sampled Examples

In [None]:
# Generate translations for the 5 sampled examples
print("Generating translations...\n")
print("=" * 80)

translations = []

for i, sample in enumerate(samples, 1):
    english = sample['translation']['eng_Latn']
    hebrew_reference = sample['translation']['heb_Hebr']
    
    # Generate translation
    hebrew_generated = translate(english, "English", "Hebrew")
    
    translations.append({
        'id': i,
        'english': english,
        'hebrew_reference': hebrew_reference,
        'hebrew_generated': hebrew_generated
    })
    
    print(f"\nExample {i}:")
    print(f"  English:    {english}")
    print(f"  Reference:  {hebrew_reference}")
    print(f"  Generated:  {hebrew_generated}")
    print("-" * 80)

print("\n✓ Translations complete")

### Create Comparison DataFrame

In [None]:
# Create DataFrame for easy comparison
df_translations = pd.DataFrame(translations)

print("\nTranslation Results:")
df_translations

### Calculate Simple Similarity Metric

In [None]:
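# Simple character-set overlap (very rough quality indicator)
def calculate_overlap(ref, gen):
    """Jaccard overlap (%) between the sets of unique characters in ref and gen.

    Character order and frequency are ignored, so this is only a coarse signal.
    """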
    ref_chars = set(ref)
    gen_chars = set(gen)
    overlap = len(ref_chars & gen_chars)
    total = len(ref_chars | gen_chars)
    return (overlap / total * 100) if total > 0 else 0

# Calculate for each example
df_translations['overlap_pct'] = df_translations.apply(
    lambda row: calculate_overlap(row['hebrew_reference'], row['hebrew_generated']),
    axis=1
)

print("\nCharacter Overlap (rough quality indicator):")
print(df_translations[['id', 'overlap_pct']].to_string(index=False))
print(f"\nAverage overlap: {df_translations['overlap_pct'].mean():.1f}%")
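
Two strings that use the same letters get a high set-based overlap regardless of meaning. A character n-gram F-score is somewhat stricter because it is sensitive to local order. The sketch below is a simplified illustration in the spirit of chrF, not the official metric: it uses a single n-gram order, and the function names and defaults are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f_score(reference: str, hypothesis: str, n: int = 3, beta: float = 2.0) -> float:
    """F-beta score over character n-grams, between 0.0 and 1.0."""
    ref_counts = char_ngrams(reference, n)
    hyp_counts = char_ngrams(hypothesis, n)
    # Clipped matches: an n-gram counts at most as often as it occurs in both strings.
    matches = sum((ref_counts & hyp_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


reference = "שלום עולם"
same = ngram_f_score(reference, reference)
# Reversing keeps the exact same characters but destroys their order.
scrambled = ngram_f_score(reference, reference[::-1])
print(f"identical: {same:.2f}, scrambled: {scrambled:.2f}")
```

The reversed string would score 100% under `calculate_overlap`, while its trigram score drops to zero. For reported results, an established implementation of BLEU or chrF is still the better choice.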

## 3. Interactive Testing

Try translating your own examples

In [None]:
# Test with custom input
test_text = "Hello, how are you today?"  # EDIT THIS

print(f"Translating: {test_text}\n")
translation = translate(test_text, "English", "Hebrew")
print(f"Translation: {translation}")

In [None]:
# Test multiple sentences
test_sentences = [
    "Good morning!",
    "Thank you very much.",
    "Where is the library?",
    "I would like a cup of coffee.",
    "The weather is beautiful today."
]

print("Translating multiple sentences...\n")
print("=" * 80)

for i, sentence in enumerate(test_sentences, 1):
    translation = translate(sentence, "English", "Hebrew")
    print(f"\n{i}. EN: {sentence}")
    print(f"   HE: {translation}")
    print("-" * 80)

## 4. Visualizations

Additional analysis and plots

### Visualize Training Loss (if available)

In [None]:
# Load training logs (if available)
# EDIT THIS PATH to point to your training output
TRAINING_OUTPUT_DIR = None
# TRAINING_OUTPUT_DIR = "/home/orrz/gpufs/projects/gemma3/outputs/translation/gemma3-1b_en-he_20241029/"

if TRAINING_OUTPUT_DIR and os.path.exists(TRAINING_OUTPUT_DIR):
    # Try to find trainer_state.json
    trainer_state_path = os.path.join(TRAINING_OUTPUT_DIR, "trainer_state.json")
    
    if os.path.exists(trainer_state_path):
        import json
        
        with open(trainer_state_path) as f:
            trainer_state = json.load(f)
        
        # Extract loss history
        log_history = trainer_state['log_history']
        
        train_loss = [entry['loss'] for entry in log_history if 'loss' in entry]
        train_steps = [entry['step'] for entry in log_history if 'loss' in entry]
        eval_loss = [entry['eval_loss'] for entry in log_history if 'eval_loss' in entry]
        eval_steps = [entry['step'] for entry in log_history if 'eval_loss' in entry]
        
        # Plot both curves against the training step so they share one x-axis
        fig, ax = plt.subplots(figsize=(12, 6))
        
        if train_loss:
            ax.plot(train_steps, train_loss, marker='o', label='Training Loss', linewidth=2)
        if eval_loss:
            ax.plot(eval_steps, eval_loss, marker='s', label='Eval Loss', linewidth=2)
        
        ax.set_xlabel('Step')
        ax.set_ylabel('Loss')
        ax.set_title('Training & Evaluation Loss')
        ax.legend()
        ax.grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        print(f"✓ Loaded training logs from: {TRAINING_OUTPUT_DIR}")
    else:
        print(f"⚠ trainer_state.json not found in: {TRAINING_OUTPUT_DIR}")
else:
    print("ℹ Set TRAINING_OUTPUT_DIR to visualize training logs")
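
The cell above relies on `log_history` being a list of dicts in which training entries carry `loss`, evaluation entries carry `eval_loss`, and each carries a `step`. A minimal sketch of extracting aligned (step, value) pairs for either key follows; the helper name `extract_curve` and the log entries are made up for illustration.

```python
from typing import Dict, List, Tuple


def extract_curve(log_history: List[Dict], key: str) -> Tuple[List[int], List[float]]:
    """Return (steps, values) for every log entry that contains `key`."""
    steps, values = [], []
    for entry in log_history:
        if key in entry and "step" in entry:
            steps.append(entry["step"])
            values.append(entry[key])
    return steps, values


# Made-up log entries for illustration only.
toy_history = [
    {"step": 10, "loss": 2.5},
    {"step": 20, "loss": 2.1},
    {"step": 20, "eval_loss": 2.3},
    {"step": 30, "loss": 1.9},
    {"step": 30, "train_runtime": 12.0},  # summary entry without a loss value
]

train_steps, train_loss = extract_curve(toy_history, "loss")
eval_steps, eval_loss = extract_curve(toy_history, "eval_loss")
print(train_steps, train_loss)
print(eval_steps, eval_loss)
```

Pairing each value with its own step keeps the training and evaluation curves on a common x-axis even when they are logged at different intervals.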

### Compare Text Length Ratios

In [None]:
# Length ratios (Hebrew/English)
if 'df_translations' in locals():
    df_translations['ref_length'] = df_translations['hebrew_reference'].str.len()
    df_translations['gen_length'] = df_translations['hebrew_generated'].str.len()
    df_translations['en_length'] = df_translations['english'].str.len()
    
    df_translations['ref_ratio'] = df_translations['ref_length'] / df_translations['en_length']
    df_translations['gen_ratio'] = df_translations['gen_length'] / df_translations['en_length']
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 5))
    
    x = df_translations['id']
    width = 0.35
    
    ax.bar(x - width/2, df_translations['ref_ratio'], width, label='Reference', color='steelblue')
    ax.bar(x + width/2, df_translations['gen_ratio'], width, label='Generated', color='coral')
    
    ax.set_xlabel('Sample ID')
    ax.set_ylabel('Length Ratio (Hebrew/English)')
    ax.set_title('Translation Length Ratio Comparison')
    ax.set_xticks(x)
    ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='1:1 ratio')
    ax.legend()  # called after axhline so the 1:1 line appears in the legend
    ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"Average reference ratio: {df_translations['ref_ratio'].mean():.2f}")
    print(f"Average generated ratio: {df_translations['gen_ratio'].mean():.2f}")
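
A generated translation that is much shorter or longer than its reference can point to truncated or runaway generation. The sketch below flags such rows, assuming the same column names as `df_translations`. The helper name and the bounds `0.5` and `2.0` are hypothetical and would need tuning on real data.

```python
from typing import Dict, List


def flag_length_outliers(rows: List[Dict], low: float = 0.5, high: float = 2.0) -> List[int]:
    """Return ids of rows whose generated/reference length ratio is outside [low, high]."""
    flagged = []
    for row in rows:
        ref_len = len(row["hebrew_reference"])
        if ref_len == 0:
            continue  # ratio is undefined for an empty reference
        ratio = len(row["hebrew_generated"]) / ref_len
        if ratio < low or ratio > high:
            flagged.append(row["id"])
    return flagged


# Made-up rows: only the lengths matter here, so one letter is repeated.
toy_rows = [
    {"id": 1, "hebrew_reference": "א" * 40, "hebrew_generated": "א" * 38},   # similar length
    {"id": 2, "hebrew_reference": "א" * 40, "hebrew_generated": "א" * 5},    # likely truncated
    {"id": 3, "hebrew_reference": "א" * 40, "hebrew_generated": "א" * 200},  # likely runaway
]
print(flag_length_outliers(toy_rows))
```

It could be applied to the results with `flag_length_outliers(df_translations.to_dict("records"))`.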

## 5. Export Results

Save results for further analysis

In [None]:
# Export translations to CSV
if 'df_translations' in locals():
    output_file = "translation_results.csv"
    df_translations.to_csv(output_file, index=False, encoding='utf-8')
    print(f"✓ Results saved to: {output_file}")
    print(f"  Columns: {list(df_translations.columns)}")
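
Hebrew text survives the CSV export as long as the file is read back with the same encoding. The round trip below is a small sketch using only the standard library and a temporary directory; the row contents are made up.

```python
import csv
import os
import tempfile

# Made-up row for illustration; csv reads every field back as a string.
rows = [
    {"id": "1", "english": "Good morning!", "hebrew_generated": "בוקר טוב!"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "translation_results.csv")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    with open(path, encoding="utf-8", newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded == rows)
```

If the file is meant to be opened in a spreadsheet application, writing with `encoding="utf-8-sig"` may help the application detect the encoding.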

## Summary

This notebook demonstrated:
1. ✅ Loading and sampling the NLLB-200 dataset (5 English-Hebrew pairs)
2. ✅ Exploring dataset statistics and distributions
3. ✅ Loading model (base or fine-tuned) and generating translations
4. ✅ Comparing generated translations with references
5. ✅ Visualizing results (lengths, ratios, quality indicators)
6. ✅ Interactive testing with custom inputs

**Next steps**:
- Fine-tune the model: `python train.py --config configs/train_config.yaml`
- Update `LORA_ADAPTER` path above to test your fine-tuned model
- Use the EVALUATION pipeline for comprehensive metrics (BLEU, chrF)