# 🇮🇪🇫🇷🇬🇧 Firish Language Fine-Tuning with T5-Small

**Fine-tune T5-small for authentic Firish translation with context awareness**

Firish is a playful code-switching language mixing English, French, and Irish for family coordination and public obfuscation.

## Training Features:
- ✅ Authentic patterns from native speaker corrections
- ✅ Context-aware translation (family/restaurant/public)
- ✅ Strategic English concealment with -ach/-allachta suffixes
- ✅ Max 2 English words per sentence rule
- ✅ French/Irish backbone with English obfuscation
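
The suffix and two-word rules listed above can be checked mechanically. The following is a minimal sketch, not part of the training pipeline: `count_marked_words` and `follows_two_word_rule` are hypothetical helper names, and the sketch assumes an English word is recognisable only by a trailing `-ach` or `-allachta` suffix, so unsuffixed English words go uncounted.

```python
import string

# Suffixes named in the feature list; treating them as the only markers
# of concealed English words is an assumption of this sketch.
SUFFIXES = ("-ach", "-allachta")


def count_marked_words(sentence: str) -> int:
    """Count words that end in one of the concealment suffixes."""
    count = 0
    for word in sentence.split():
        # Drop surrounding punctuation so "leave-allachta?" still matches.
        cleaned = word.strip(string.punctuation).lower()
        if cleaned.endswith(SUFFIXES):
            count += 1
    return count


def follows_two_word_rule(sentence: str, limit: int = 2) -> bool:
    """Return True when the sentence has at most `limit` marked words."""
    return count_marked_words(sentence) <= limit


print(count_marked_words("Combien tip-ach devons nous leave-allachta?"))
print(follows_two_word_rule("Le bil-allachta est cher"))
```

A check like this could be run over model outputs after training to flag sentences that exceed the limit.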

**Training Time:** ~2-3 hours on T4 GPU
**Model Size:** ~200MB fine-tuned

## 📦 Setup and Dependencies

In [None]:
# Install required packages
!pip install transformers datasets torch accelerate tensorboard
!pip install sentencepiece  # For T5 tokenizer

import json
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq
)
import numpy as np
from datetime import datetime
import os

print(f"🚀 Setup complete! PyTorch version: {torch.__version__}")
print(f"🎯 GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU: {torch.cuda.get_device_name()}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")

## 📁 Load Training Data

**Upload your training files:**
- `firish_train.json` - Training examples
- `firish_val.json` - Validation examples

Or use the sample data provided below if you haven't uploaded files yet.
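
Both files are expected to mirror the sample data in the next cell: a top-level `data` list whose items carry `input_text` and `target_text` strings. A minimal sketch of a pre-flight check follows; `validate_firish_data` is a hypothetical helper, not part of the notebook, and it assumes nothing beyond that layout.

```python
import json


def validate_firish_data(payload: dict) -> int:
    """Check the {"data": [{"input_text", "target_text"}]} layout.

    Returns the number of examples, or raises ValueError on the first problem.
    """
    examples = payload.get("data")
    if not isinstance(examples, list) or not examples:
        raise ValueError('expected a non-empty list under the "data" key')
    for index, example in enumerate(examples):
        for key in ("input_text", "target_text"):
            value = example.get(key) if isinstance(example, dict) else None
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {index}: missing or empty {key!r}")
    return len(examples)


# A one-example payload in the same shape as the sample data.
sample = json.loads(
    '{"data": [{"input_text": "translate to firish [family, morning, low]: '
    'Good morning everyone", "target_text": "Bonjour tout le monde"}]}'
)
print(validate_firish_data(sample))
```

Running the check on both files before tokenization surfaces malformed entries early rather than partway through preprocessing.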

In [None]:
# Sample training data (replace with uploaded files)
sample_train_data = {
    "data": [
        {
            "input_text": "translate to firish [restaurant, strangers, medium]: The bill is expensive",
            "target_text": "Le bil-allachta est cher"
        },
        {
            "input_text": "translate to firish [parents, child nearby, medium]: Is the homework finished?",
            "target_text": "An bhfuil an obair maison finis?"
        },
        {
            "input_text": "translate to firish [family, planning, medium]: We need groceries for the weekend",
            "target_text": "Nous avons besoin groceries-ach pour le weekend-achta"
        },
        {
            "input_text": "translate to firish [parents, bedtime, low]: Are they ready for sleep?",
            "target_text": "Tá siad ready-ach pour sleep?"
        },
        {
            "input_text": "translate to firish [urgent, public, high]: We are late for the appointment",
            "target_text": "Nous sommes en retard pour rendezvous avec accountant-allachta"
        }
    ]
}

sample_val_data = {
    "data": [
        {
            "input_text": "translate to firish [family, morning, low]: Good morning everyone",
            "target_text": "Bonjour tout le monde"
        },
        {
            "input_text": "translate to firish [couple, restaurant, high]: How much tip should we leave?",
            "target_text": "Combien tip-ach devons nous leave-allachta?"
        }
    ]
}

# Try to load uploaded files, fall back to sample data
try:
    with open('/kaggle/input/firish-training/firish_train.json', 'r', encoding='utf-8') as f:
        train_data = json.load(f)
    with open('/kaggle/input/firish-training/firish_val.json', 'r', encoding='utf-8') as f:
        val_data = json.load(f)
    print("✅ Loaded uploaded training data")
except FileNotFoundError:
    print("⚠️  Using sample data - upload firish_train.json and firish_val.json for full training")
    train_data = sample_train_data
    val_data = sample_val_data

print(f"📊 Training examples: {len(train_data['data'])}")
print(f"🔍 Validation examples: {len(val_data['data'])}")

# Show sample
print("\n📝 Sample training example:")
example = train_data['data'][0]
print(f"Input:  {example['input_text']}")
print(f"Output: {example['target_text']}")

## 🤖 Load T5-Small Model and Tokenizer

In [None]:
# Load T5-small model and tokenizer
model_name = "google-t5/t5-small"
print(f"📥 Loading {model_name}...")

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

print(f"✅ Model loaded successfully!")
print(f"📊 Model parameters: {model.num_parameters():,}")
print(f"💾 Model size: ~{model.num_parameters() * 4 / 1e6:.0f}MB")

# Test tokenizer
test_input = "translate to firish [family, test]: Hello world"
tokens = tokenizer(test_input, return_tensors="pt")
print(f"🔤 Test tokenization successful: {len(tokens['input_ids'][0])} tokens")

## 🔄 Prepare Training Data

In [None]:
def preprocess_data(examples, tokenizer, max_input_length=512, max_target_length=128):
    """
    Preprocess training data for T5
    """
    inputs = [example['input_text'] for example in examples['data']]
    targets = [example['target_text'] for example in examples['data']]
    
    # Tokenize inputs
    model_inputs = tokenizer(
        inputs, 
        max_length=max_input_length, 
        truncation=True, 
        padding=False
    )
    
    # Tokenize targets
    labels = tokenizer(
        targets, 
        max_length=max_target_length, 
        truncation=True, 
        padding=False
    )
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Prepare datasets
print("🔄 Preprocessing training data...")
train_tokenized = preprocess_data(train_data, tokenizer)
val_tokenized = preprocess_data(val_data, tokenizer)

# Convert to HuggingFace datasets
train_dataset = Dataset.from_dict(train_tokenized)
val_dataset = Dataset.from_dict(val_tokenized)

print(f"✅ Preprocessed {len(train_dataset)} training examples")
print(f"✅ Preprocessed {len(val_dataset)} validation examples")

# Show token statistics
input_lengths = [len(ids) for ids in train_tokenized['input_ids']]
target_lengths = [len(ids) for ids in train_tokenized['labels']]

print(f"📊 Average input length: {np.mean(input_lengths):.1f} tokens")
print(f"📊 Average target length: {np.mean(target_lengths):.1f} tokens")
print(f"📊 Max input length: {max(input_lengths)} tokens")
print(f"📊 Max target length: {max(target_lengths)} tokens")

## ⚙️ Training Configuration

In [None]:
# Training configuration
output_dir = "./firish-t5-small"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
run_name = f"firish-t5-{timestamp}"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,                    # Ignored while max_steps is set (max_steps takes precedence)
    per_device_train_batch_size=4,        # Small batch size for T4 GPU
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,        # Effective batch size = 4 * 2 = 8
    warmup_steps=50,                      # Warmup for 10% of the 500 training steps
    max_steps=500,                        # Hard cap on optimizer steps; overrides num_train_epochs
    learning_rate=5e-5,                   # Same as the TrainingArguments default
    weight_decay=0.01,
    logging_dir=f"./logs/{run_name}",
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
    evaluation_strategy="steps",          # Named eval_strategy in newer transformers releases
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="tensorboard",
    run_name=run_name,
    dataloader_pin_memory=False,          # Reduce memory usage
    fp16=torch.cuda.is_available(),       # Mixed precision if a GPU is available; T5 can overflow in fp16, so disable if the loss turns NaN
    gradient_checkpointing=True,          # Reduce memory usage
    remove_unused_columns=False,
)

print(f"⚙️  Training Configuration:")
print(f"   📁 Output directory: {output_dir}")
print(f"   🏷️  Run name: {run_name}")
print(f"   📊 Epochs: {training_args.num_train_epochs}")
print(f"   📦 Batch size: {training_args.per_device_train_batch_size}")
print(f"   📈 Learning rate: {training_args.learning_rate}")
print(f"   ⚡ Mixed precision: {training_args.fp16}")
print(f"   💾 Gradient checkpointing: {training_args.gradient_checkpointing}")
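
The comments in the configuration imply some arithmetic that is worth making explicit. The sketch below reproduces it in plain Python for a single GPU; `schedule_summary` is a hypothetical helper, and the 1,000-example dataset size is an invented figure for illustration, not the size of the real Firish corpus.

```python
import math


def schedule_summary(num_examples, batch_size=4, grad_accum=2,
                     max_steps=500, warmup_steps=50):
    """Derive effective batch size, steps per epoch, and epochs covered."""
    effective_batch = batch_size * grad_accum
    # One optimizer step consumes roughly `effective_batch` examples.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "epochs_covered": max_steps / steps_per_epoch,
        "warmup_fraction": warmup_steps / max_steps,
    }


summary = schedule_summary(num_examples=1000)
print(summary)
```

Because `max_steps` takes precedence over `num_train_epochs`, the run length is set by the 500-step cap. On the five-example sample data, that corresponds to roughly 500 passes over the same sentences.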

## 🏋️ Initialize Trainer

In [None]:
# Data collator for sequence-to-sequence tasks
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    return_tensors="pt"
)

# Evaluation metrics
def compute_metrics(eval_preds):
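    preds, labels = eval_preds

    # The plain Trainer passes logits here (for T5, a tuple whose first item
    # is the logits), so reduce them to token ids before decoding. These are
    # teacher-forced predictions, not free-running generations.
    if isinstance(preds, tuple):
        preds = preds[0]
    if preds.ndim == 3:
        preds = np.argmax(preds, axis=-1)

    # Blank out padded positions (-100 in labels) in both arrays before decoding
    mask = labels != -100
    preds = np.where(mask, preds, tokenizer.pad_token_id)
    labels = np.where(mask, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)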
    
    # Simple metrics - exact match and length ratio
    exact_matches = sum(pred.strip() == label.strip() for pred, label in zip(decoded_preds, decoded_labels))
    exact_match_ratio = exact_matches / len(decoded_preds)
    
    avg_pred_length = np.mean([len(pred.split()) for pred in decoded_preds])
    avg_label_length = np.mean([len(label.split()) for label in decoded_labels])
    
    return {
        "exact_match": exact_match_ratio,
        "avg_pred_length": avg_pred_length,
        "avg_label_length": avg_label_length,
    }

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("✅ Trainer initialized successfully!")
print(f"🎯 Ready to train on {len(train_dataset)} examples")
print(f"🔍 Validation on {len(val_dataset)} examples")
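
The `-100` replacement in `compute_metrics` exists because padded label positions are filled with `-100` so the loss ignores them, and that value is not a valid token id for decoding. A minimal pure-Python sketch of the round trip follows; `pad_labels` and `unmask_labels` are hypothetical helpers, the token ids are invented, and the pad id of 0 is assumed to match the T5 tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def pad_labels(batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (width - len(seq)) for seq in batch]


def unmask_labels(batch, pad_token_id=0):
    """Swap the ignore index for a real pad id so the ids can be decoded."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in seq]
            for seq in batch]


# Invented token ids; 1 stands in for the end-of-sequence token.
labels = [[312, 7, 1], [45, 1]]
padded = pad_labels(labels)
print(padded)
print(unmask_labels(padded))
```

The collator performs the padding step on tensors; the metric function only needs the reverse substitution.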

## 🚀 Start Training

**This will take ~2-3 hours on T4 GPU**

You can monitor training progress in the logs below. The model will automatically save checkpoints and select the best model based on validation loss.
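
With `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the selection amounts to picking the lowest-loss evaluation among the checkpoints that were actually saved. The sketch below illustrates that idea on a list shaped like `trainer.state.log_history`; `best_eval_step` is a hypothetical helper, the loss values are invented, and the filter on `save_steps` reflects the assumption that only evaluations coinciding with a saved checkpoint can be restored.

```python
def best_eval_step(log_history, save_steps=100):
    """Return (step, eval_loss) of the lowest-loss evaluation that was saved."""
    candidates = [
        (entry["eval_loss"], entry["step"])
        for entry in log_history
        if "eval_loss" in entry and entry["step"] % save_steps == 0
    ]
    if not candidates:
        raise ValueError("no evaluation coincides with a saved checkpoint")
    loss, step = min(candidates)
    return step, loss


# Invented numbers, shaped like trainer.state.log_history entries.
history = [
    {"loss": 2.9, "step": 50},
    {"eval_loss": 2.4, "step": 50},
    {"eval_loss": 1.9, "step": 100},
    {"eval_loss": 1.6, "step": 150},
    {"eval_loss": 1.7, "step": 200},
]
print(best_eval_step(history))
```

In this invented run the evaluation at step 150 has the lowest loss, but no checkpoint was written there, which is one reason to keep `eval_steps` and `save_steps` aligned if every evaluation should be a candidate.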

In [None]:
print("🚀 Starting Firish T5-small fine-tuning...")
print(f"⏰ Training started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"⌛ Estimated time: 2-3 hours")
print("\n" + "="*60)

# Start training
train_result = trainer.train()

print("\n" + "="*60)
print("🎉 Training completed!")
print(f"⏰ Training finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📊 Final training loss: {train_result.training_loss:.4f}")
print(f"🔢 Total steps: {train_result.global_step}")

# Save the model
print("\n💾 Saving final model...")
trainer.save_model()
tokenizer.save_pretrained(output_dir)
print(f"✅ Model saved to {output_dir}")

## 🧪 Test the Fine-tuned Model

In [None]:
# Test the fine-tuned model that is still in memory after training
print("🧪 Testing fine-tuned Firish model...")

# Test examples
test_inputs = [
    "translate to firish [restaurant, strangers, medium]: The bill is expensive",
    "translate to firish [parents, child nearby, medium]: Did they finish homework?",
    "translate to firish [family, morning, low]: Good morning everyone",
    "translate to firish [couple, shopping, high]: We need groceries for weekend",
    "translate to firish [urgent, public, high]: We are late for appointment"
]

print("\n📝 Testing Firish Translation:")
print("="*80)

model.eval()
with torch.no_grad():
    for i, test_input in enumerate(test_inputs, 1):
        # Tokenize input and move it to the same device as the model
        input_ids = tokenizer(test_input, return_tensors="pt").input_ids.to(model.device)
        
        # Generate translation with beam search (temperature only applies when sampling, so it is omitted)
        outputs = model.generate(
            input_ids,
            max_length=50,
            num_beams=4,
            do_sample=False,
            early_stopping=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
        
        # Decode output
        output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        print(f"{i}. Input:  {test_input}")
        print(f"   Output: {output_text}")
        
        # Analysis (lowercase and strip punctuation so "Le" or "leave-allachta?" still match)
        words = [word.strip('.,!?;:').lower() for word in output_text.split()]
        english_words = len([word for word in words if word.endswith(('-ach', '-allachta'))])
        french_words = len([word for word in words if word in ['le', 'la', 'nous', 'pour', 'est', 'sont', 'avec']])
        irish_words = len([word for word in words if word in ['an', 'tá', 'go', 'agus', 'bhfuil']])
        
        analysis = []
        if english_words > 0: analysis.append(f"EN:{english_words}")
        if french_words > 0: analysis.append(f"FR:{french_words}")
        if irish_words > 0: analysis.append(f"GA:{irish_words}")
        
        print(f"   Mix: {' '.join(analysis) if analysis else 'Unknown'}")
        print()

print("✅ Testing completed!")

## 📊 Model Evaluation

In [None]:
# Evaluate on validation set
print("📊 Evaluating model on validation set...")

eval_results = trainer.evaluate()

print("\n📈 Evaluation Results:")
print("="*40)
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

# Model size info
model_size_mb = sum(p.numel() * 4 for p in model.parameters()) / 1e6
print(f"\n💾 Fine-tuned model size: {model_size_mb:.1f}MB")
print(f"📊 Model parameters: {model.num_parameters():,}")

# Training summary
print("\n🎯 Training Summary:")
print(f"   📚 Training examples: {len(train_dataset)}")
print(f"   🔍 Validation examples: {len(val_dataset)}")
print(f"   📈 Training loss: {train_result.training_loss:.4f}")
print(f"   📊 Validation loss: {eval_results['eval_loss']:.4f}")
print(f"   ✅ Exact match: {eval_results.get('eval_exact_match', 0):.2%}")

## 💾 Download Fine-tuned Model

**Create a zip file with the fine-tuned model for download**

In [None]:
import zipfile
import os

# Create zip file with the fine-tuned model
zip_filename = f"firish-t5-small-{timestamp}.zip"

print(f"📦 Creating model package: {zip_filename}")

with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add the final model files, skipping intermediate checkpoint-* folders
    for root, dirs, files in os.walk(output_dir):
        dirs[:] = [d for d in dirs if not d.startswith('checkpoint-')]
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, output_dir)
            zipf.write(file_path, arcname)
            print(f"   📄 Added: {arcname}")
    
    # Add training metadata
    metadata = {
        "model_name": "firish-t5-small",
        "base_model": "google-t5/t5-small",
        "training_timestamp": timestamp,
        "training_examples": len(train_dataset),
        "validation_examples": len(val_dataset),
        "final_training_loss": float(train_result.training_loss),
        "final_eval_loss": float(eval_results['eval_loss']),
        "exact_match_score": float(eval_results.get('eval_exact_match', 0)),
        "model_size_mb": float(model_size_mb),
        "usage": "Load with: model = T5ForConditionalGeneration.from_pretrained('./firish-t5-small')"
    }
    
    metadata_json = json.dumps(metadata, indent=2)
    zipf.writestr("model_info.json", metadata_json)
    print(f"   📊 Added: model_info.json")

# Get zip file size
zip_size_mb = os.path.getsize(zip_filename) / 1e6

print(f"\n✅ Model package created successfully!")
print(f"📦 Filename: {zip_filename}")
print(f"💾 Size: {zip_size_mb:.1f}MB")
print(f"\n🚀 Download this file to use your fine-tuned Firish model!")

# Show download instructions
print("\n📋 Usage Instructions:")
print("1. Download the zip file")
print("2. Extract to a folder named firish-t5-small")
print("3. Load with:")
print("   from transformers import T5ForConditionalGeneration, T5Tokenizer")
print("   model = T5ForConditionalGeneration.from_pretrained('./firish-t5-small')")
print("   tokenizer = T5Tokenizer.from_pretrained('./firish-t5-small')")

# List files in current directory for easy download
print(f"\n📁 Files available for download:")
for file in os.listdir('.'):
    if file.endswith('.zip'):
        size = os.path.getsize(file) / 1e6
        print(f"   📦 {file} ({size:.1f}MB)")

## 🎉 Training Complete!

### 🏆 **Success! Your Firish T5-small model is ready!**

**What you've accomplished:**
- ✅ Fine-tuned T5-small on authentic Firish patterns
- ✅ Trained with context-aware translation
- ✅ Implemented strategic English obfuscation
- ✅ Created production-ready model package

### 📊 **Model Performance:**
- **Base Model:** google-t5/t5-small (60M parameters)
- **Fine-tuned Size:** ~200MB
- **Training Time:** ~2-3 hours on T4 GPU
- **Training Examples:** Custom authentic Firish patterns

### 🚀 **Next Steps:**
1. **Download** the model zip file from this notebook
2. **Extract** and integrate into your Firish translation system
3. **Compare** performance with rule-based CLI
4. **Test** on real family coordination scenarios

### 💡 **Usage Tips:**
- Use context format: `"translate to firish [situation, audience, opacity]: text"`
- Model learned authentic patterns: Irish backbone + French sophistication + English concealment
- Trained to follow the rule of at most two English words per sentence
- Handles -ach/-allachta suffix patterns
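
The context format in the first tip can be assembled with a small helper. This is a sketch: `build_firish_prompt` is a hypothetical name, and the three context fields are passed through verbatim without checking them against any fixed vocabulary.

```python
def build_firish_prompt(text, situation, audience, opacity):
    """Format a request as 'translate to firish [a, b, c]: text'."""
    context = ", ".join(part.strip() for part in (situation, audience, opacity))
    return f"translate to firish [{context}]: {text.strip()}"


prompt = build_firish_prompt(
    "The bill is expensive", "restaurant", "strangers", "medium"
)
print(prompt)
```

Building prompts through one function keeps inference inputs in the same shape as the training examples.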

**Your Firish language model is ready for authentic family coordination! 🇮🇪🇫🇷🇬🇧**