
## **English-to-German Machine Translation using Falcon-LoRA**

**Task**: Text-to-text

In this project, I fine-tuned **Falcon-RW-1B** (`tiiuae/falcon-rw-1b`, developed by TII, Abu Dhabi) with **LoRA** (Low-Rank Adaptation) to build an **English-to-German translation system**. The implementation combines **parameter-efficient fine-tuning** with 8-bit quantization to keep GPU memory requirements low, making it suitable for environments with limited computational resources.

**Data Source**: [English-German](https://www.kaggle.com/datasets/kaushal2896/english-to-german)

## Key Steps Overview:

1. **Data Preparation**: Curate a parallel English-German corpus.
2. **Tokenizer and Model Loading**: Load the Falcon-RW-1B model and tokenizer from Hugging Face, using **8-bit quantization** to reduce memory usage.
3. **LoRA Configuration**: Apply **LoRA** (Low-Rank Adaptation) to inject trainable adapters into Falcon’s attention layers.
4. **Training**: Fine-tune the model on a single GPU (e.g., NVIDIA T4 or A10G) using Hugging Face's Transformers library.
5. **Evaluation**: Evaluate translation quality using the **BLEU score**.


In [1]:
# Install required libraries
!pip install -q transformers datasets accelerate peft bitsandbytes

In [None]:
# Import libraries
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from datasets import Dataset
from peft import get_peft_model, LoraConfig, TaskType

**Tokenizer and Model Loading**:
   - The **Falcon-RW-1B model** (`tiiuae/falcon-rw-1b`) and its **tokenizer** are loaded from the Hugging Face model hub.
   - The model is loaded with **8-bit quantization** (via bitsandbytes) to reduce memory usage, so it fits on GPUs with limited memory (e.g., NVIDIA T4/A10G).
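
To put the memory saving in rough numbers, the sketch below estimates the storage needed for the weights alone, using the total parameter count that `print_trainable_parameters()` reports later in this notebook. It ignores activations, optimizer state, and any layers the quantization backend keeps in higher precision, so the figures are only a lower bound. `weight_memory_gib` is an illustrative helper, not a library function.

```python
# Rough estimate of the memory needed just to store the model weights.
TOTAL_PARAMS = 1_313_198_080  # "all params" reported by print_trainable_parameters()

def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Return the storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    print(f"{name:>8}: {weight_memory_gib(TOTAL_PARAMS, bits):.2f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why 8-bit loading leaves more room on a 16 GB T4 for activations and the LoRA optimizer state.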

In [None]:
# Quantization configuration for reduced memory usage (helpful for Colab)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Enable 8-bit quantization
    llm_int8_threshold=6.0
)

In [None]:
# ✅ 2. Configure model and tokenizer
model_name = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token

In [None]:
# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",  # Use automatic device placement (e.g., GPU if available)
    torch_dtype=torch.float16
)

**Data Preparation**:
- I used a curated parallel **English-German corpus** for training the translation model. This dataset consists of pairs of English sentences and their German translations, formatted into a tab-separated text file.

In [None]:
# ✅ 3. Prepare and split dataset
def load_and_prepare_data(file_path):
    raw_lines = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) >= 2:
                raw_lines.append({'input': parts[0], 'output': parts[1]})
    return Dataset.from_list(raw_lines)

In [None]:
# Load and split data (80% train, 20% validation)
dataset = load_and_prepare_data('deu.txt').train_test_split(test_size=0.2, seed=42)

In [None]:
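# Formatting function: build the full training text (prompt + target translation).
# A causal LM learns from the tokens in "text", so the German output must be part of it.
def format_dataset(example):
    prompt = f"Translate English to German:\nEnglish: {example['input']}\nGerman:"
    return {
        "text": f"{prompt} {example['output']}{tokenizer.eos_token}",
        "labels": example['output']
    }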

train_dataset = dataset['train'].map(format_dataset)
val_dataset = dataset['test'].map(format_dataset)

In [None]:
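# ✅ 4. Tokenization with proper padding/truncation
def tokenize_function(examples):
    # Tokenize the formatted text; map() stores plain lists, so no tensors are requested
    model_inputs = tokenizer(
        examples["text"],
        max_length=256,
        padding="max_length",
        truncation=True
    )

    # For causal language modeling the labels are the input IDs themselves
    # (the model shifts them internally). The deprecated as_target_tokenizer()
    # context is not needed for a decoder-only model.
    # Note: DataCollatorForLanguageModeling(mlm=False) rebuilds the labels per batch
    # and masks pad positions with -100 (including EOS, since pad_token is eos_token here).
    model_inputs["labels"] = [ids.copy() for ids in model_inputs["input_ids"]]
    return model_inputs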

In [None]:
# Apply tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

In [None]:
# Remove unnecessary columns
tokenized_train = tokenized_train.remove_columns(["text", "input", "output"])
tokenized_val = tokenized_val.remove_columns(["text", "input", "output"])

**LoRA Configuration**:
   - To make the model more parameter-efficient, I applied **LoRA** (Low-Rank Adaptation). LoRA injects trainable low-rank adapters into the attention layers, allowing for fine-tuning with fewer parameters and reduced memory consumption.
   - This method reduces computational cost while maintaining high performance for the task.
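
LoRA keeps each pretrained weight matrix frozen and learns a rank-`r` update made of two small matrices, `A` (`r × in_features`) and `B` (`out_features × r`). The sketch below counts the parameters this adds when only the fused `query_key_value` projection is targeted. The hidden size (2048), the layer count (24), and the 2048 → 3 × 2048 shape of that projection are assumptions about the Falcon-RW-1B architecture and are not stated in this notebook. `lora_params` is an illustrative helper.

```python
# Count the trainable parameters LoRA adds to one linear layer, then to the whole model.
# ASSUMPTIONS (not stated in this notebook): hidden size 2048, 24 decoder layers,
# and a fused query_key_value projection mapping 2048 -> 3 * 2048 features.
HIDDEN_SIZE = 2048
NUM_LAYERS = 24
RANK = 8  # r in LoraConfig

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * in_features + out_features * rank

per_layer = lora_params(HIDDEN_SIZE, 3 * HIDDEN_SIZE, RANK)
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the total comes to 1,572,864, which matches the trainable parameter count printed by `model.print_trainable_parameters()` in the training output.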


In [None]:
# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query_key_value"]  # Specific to Falcon architecture
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

In [None]:
# Training setup
training_args = TrainingArguments(
    output_dir="lora-falcon-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch",
    fp16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_dir="logs",
    report_to="none",
    remove_unused_columns=False,
    warmup_steps=100,
    dataloader_num_workers=2  # To avoid issues related to dataloader
)
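
As a quick sanity check on these settings, the following sketch works out the effective batch size and the approximate number of optimizer steps. It uses the training-split size shown in the notebook output (177,226 examples) and assumes a single GPU. The exact step count that `Trainer` reports may differ slightly because of how it rounds the last partial batch.

```python
import math

# Values taken from the TrainingArguments above and the training-split size.
NUM_TRAIN_EXAMPLES = 177_226
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_EPOCHS = 3
NUM_GPUS = 1  # assumption: single-GPU run

# One optimizer step consumes this many examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * NUM_EPOCHS

print(f"effective batch size: {effective_batch}")
print(f"approx. steps per epoch: {steps_per_epoch:,}")
print(f"approx. total steps: {total_steps:,}")
```

With `eval_steps=500` and an effective batch of 8, each evaluation and checkpoint happens after roughly 4,000 training examples.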

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)


In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
# Start training
print("Starting training...")
trainer.train()


Dataset sizes after the split: 177,226 training examples, 44,307 validation examples.

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


trainable params: 1,572,864 || all params: 1,313,198,080 || trainable%: 0.1198
Starting training...




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.0739        | 1.090117        |
| 1000 | 1.1016        | 1.080178        |




In [None]:
# Save the model
model.save_pretrained("lora-falcon-finetuned")
tokenizer.save_pretrained("lora-falcon-finetuned")

**Inference**:
   - Finally, I used the trained model to perform **inference** on a few sample sentences. The model generates the German translation for an English input, which can be evaluated by comparing it to the ground truth translation.

In [None]:
# Inference function
def generate_translation(model, tokenizer, english_text):
    prompt = f"Translate English to German:\nEnglish: {english_text}\nGerman:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # Beam search without sampling is deterministic, so no temperature is passed
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            num_beams=5,
            early_stopping=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens (everything after the prompt)
    prompt_length = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # Keep the first line in case the model continues past the translation
    stripped = generated.strip()
    return stripped.splitlines()[0] if stripped else ""

In [None]:
# Test translation
test_text = "Where is the train station?"
translation = generate_translation(model, tokenizer, test_text)
print(f"\nTest Translation:\nEnglish: {test_text}\nGerman: {translation}")
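
**Evaluation**:
   - The overview lists BLEU as the evaluation metric. The sketch below is a minimal, dependency-free corpus-level BLEU (clipped n-gram precision up to 4-grams plus a brevity penalty) with one reference per sentence. It is not the notebook's original evaluation code. It uses plain whitespace tokenization and no smoothing, so its scores are not comparable to standardized tools such as sacreBLEU. The sentence pairs are hypothetical placeholders; in practice the hypotheses would come from `generate_translation()` and the references from the validation split.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each n-gram count by its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # any zero precision makes the geometric mean zero

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Hypothetical example pairs, for illustration only.
hyps = ["Wo ist der Bahnhof ?", "Ich habe einen Hund ."]
refs = ["Wo ist der Bahnhof ?", "Ich habe eine Katze ."]
print(f"BLEU: {corpus_bleu(hyps, refs):.2f}")
```

For reported results, a standardized implementation with a fixed tokenizer is the safer choice, since BLEU is sensitive to tokenization.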