# 🌍 Neural Machine Translation Tool with Hugging Face

## 📌 Project Overview

This project implements a **Neural Machine Translation (NMT)** system using
pretrained transformer models from **Hugging Face**.  
The application translates text between languages using the MarianMT architecture.

The goal of this project is to demonstrate:
- Practical use of pretrained NLP models
- Text preprocessing and tokenization
- Model inference and decoding
- Real-world AI application design


## 🌍 Real-World Use Case

Language translation systems are widely used in:
- Customer support chatbots
- International e-commerce platforms
- Content localization
- Educational and accessibility tools

This project simulates how a production-ready translation service
could be integrated into web or mobile applications.


## Dependencies
* `transformers`: For pre-trained models.
* `sentencepiece` & `sacremoses`: For tokenization and text processing.
* `torch`: PyTorch backend for deep learning operations.

In [12]:
# Install necessary libraries
# 'transformers' provides the model architecture
# 'sentencepiece' is required by the MarianMT tokenizer; 'sacremoses' is recommended for punctuation normalization
!pip install transformers sentencepiece sacremoses torch
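
After installing, it can help to confirm that each dependency is importable before loading a model. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_packages(names: Iterable[str]) -> Dict[str, bool]:
    """Report, for each top-level package name, whether Python can locate it."""
    # find_spec returns None when a top-level package is not installed
    return {name: find_spec(name) is not None for name in names}


status = check_packages(["transformers", "sentencepiece", "sacremoses", "torch"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

A `MISSING` entry usually means the `pip install` cell ran in a different environment than the notebook kernel.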



## Imports and Device Setup

In [2]:
import torch
from transformers import MarianMTModel, MarianTokenizer
from typing import List, Union

# Set up device: use a GPU if available (CUDA for NVIDIA, MPS for Apple silicon), otherwise the CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device.upper()}")

Using device: CPU
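
The priority order in the one-line device expression above (CUDA first, then MPS, then CPU) can also be written as a small pure function, which makes it easy to test without any hardware. This is a sketch only; `select_device` is a hypothetical helper name, and the commented usage assumes `torch` is installed.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Usage with PyTorch (assumption: torch is installed). The getattr guard
# covers older PyTorch builds that do not expose torch.backends.mps.
#
#   import torch
#   mps = getattr(torch.backends, "mps", None)
#   device = select_device(
#       torch.cuda.is_available(),
#       mps is not None and mps.is_available(),
#   )

print(select_device(False, False))  # cpu
```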


## The Translator Class

In [3]:
class LanguageTranslator:
    def __init__(self, source_lang: str, target_lang: str):
        """
        Initializes the translator by loading the specific MarianMT model.
        
        Args:
            source_lang (str): Source language code (e.g., 'en').
            target_lang (str): Target language code (e.g., 'fr', 'ar', 'es').
        """
        self.model_name = f'Helsinki-NLP/opus-mt-{source_lang}-{target_lang}'
        print(f"⏳ Loading model: {self.model_name}...")
        
        try:
            # Load tokenizer and model
            self.tokenizer = MarianTokenizer.from_pretrained(self.model_name)
            self.model = MarianMTModel.from_pretrained(self.model_name).to(device)
            print(f"✅ Model loaded successfully on {device.upper()}.")
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            print("Check if the language pair exists in Hugging Face Hub.")
            # Re-raise so callers never receive a translator without a model
            raise

    def translate(self, text: Union[str, List[str]], **kwargs) -> Union[str, List[str]]:
        """
        Translates text or a list of texts.
        
        Args:
            text (str or List[str]): Input text(s) to translate.
            **kwargs: Additional arguments for model.generate() (e.g., num_beams).
            
        Returns:
            str or List[str]: Translated text(s).
        """
        if not text:
            # Preserve the input type: "" for a string, [] for a list
            return "" if isinstance(text, str) else []

        # Prepare input batch
        inputs = self.tokenizer(
            text, 
            return_tensors="pt", 
            padding=True, 
            truncation=True
        ).to(device)

        # Generate translation; the defaults below can be overridden through **kwargs
        # num_beams=4 usually gives better quality than greedy search
        generation_args = {"max_length": 200, "num_beams": 4, "early_stopping": True}
        generation_args.update(kwargs)
        translated_ids = self.model.generate(**inputs, **generation_args)

        # Decode generated IDs back to text
        translated_text = self.tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

        # Return string if input was string, else list
        return translated_text[0] if isinstance(text, str) else translated_text
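
The `padding=True` argument in `translate` brings every sequence in a batch to the same length so the batch can be stacked into one tensor, and the tokenizer also returns an attention mask that marks which positions are real tokens. The sketch below imitates that result in plain Python. The token IDs and the pad ID of `0` are made up for illustration, and right-side padding is assumed; the real tokenizer uses its own vocabulary and pad token.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token ID lists to equal length and build attention masks."""
    longest = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 8, 2], [7, 2]])
print(ids)   # [[5, 8, 2], [7, 2, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```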

## 🚀 Usage Examples

### 1. English to Arabic Translation
We will initialize the translator for `en` (English) to `ar` (Arabic) and translate a sample sentence.

In [4]:
# Initialize translator for English to Arabic
en_to_ar = LanguageTranslator(source_lang="en", target_lang="ar")

# Single sentence translation
text_ar = "Artificial Intelligence is transforming the world."
translation_ar = en_to_ar.translate(text_ar)

print(f"\nOriginal: {text_ar}")
print(f"Arabic: {translation_ar}")

⏳ Loading model: Helsinki-NLP/opus-mt-en-ar...
✅ Model loaded successfully on CPU.

Original: Artificial Intelligence is transforming the world.
Arabic: الإستخبارات الإصطناعية تحول العالم


### 2. Batch Translation (English to French)
Passing a list of sentences in one call is usually faster than translating them one by one in a `for` loop, because the model processes the whole padded batch in a single forward pass. The gain is largest on a GPU.

In [5]:
# Initialize translator for English to French
en_to_fr = LanguageTranslator(source_lang="en", target_lang="fr")

# Batch of sentences
batch_texts = [
    "Hello, how are you?",
    "The weather is beautiful today.",
    "Machine learning models are fascinating."
]

# Translate batch
translations_fr = en_to_fr.translate(batch_texts)

# Display results
print("\n--- Batch Translation Results ---")
for original, translated in zip(batch_texts, translations_fr):
    print(f"🇺🇸: {original} \n🇫🇷: {translated}\n")

⏳ Loading model: Helsinki-NLP/opus-mt-en-fr...
✅ Model loaded successfully on CPU.

--- Batch Translation Results ---
🇺🇸: Hello, how are you? 
🇫🇷: Bonjour, comment ça va ?

🇺🇸: The weather is beautiful today. 
🇫🇷: Le temps est beau aujourd'hui.

🇺🇸: Machine learning models are fascinating. 
🇫🇷: Les modèles d'apprentissage automatique sont fascinants.



## 🧠 Technical Note: Generation Parameters

When calling `.translate()`, we use `model.generate()`. Here is what the parameters mean:

* **`num_beams`**: Enables **Beam Search**. Instead of picking the single most likely next token at each step (Greedy Search), it keeps the `num_beams` most likely partial sequences. More beams often improve quality but make generation slower.
* **`early_stopping`**: With `True`, beam search stops as soon as `num_beams` finished candidates (sequences that end in the end-of-sequence token) have been collected, rather than continuing to look for better ones.
* **`max_length`**: Caps the length of the generated sequence in tokens, so generation always terminates and outputs cannot grow excessively long.
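
To make the difference between greedy search and beam search concrete, here is a toy beam search in plain Python. It is a simplified sketch, not the algorithm as implemented in `transformers`: the probability table `TOY_MODEL` is invented for illustration, scores are summed log-probabilities without length normalization, and the stopping rule only approximates `early_stopping=True`.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # end-of-sequence marker used by this toy example
Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, summed log-probability)

# Hypothetical next-token probabilities, keyed by the tokens generated so far
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"bird": 0.9, "fish": 0.1},
    ("the", "cat"): {EOS: 1.0},
    ("the", "dog"): {EOS: 1.0},
    ("a", "bird"): {EOS: 1.0},
    ("a", "fish"): {EOS: 1.0},
}


def beam_search(
    next_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    num_beams: int,
    max_length: int,
) -> Hypothesis:
    """Return the highest-scoring hypothesis found with `num_beams` beams."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        # Extend every open beam by every possible next token
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, prob in next_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        # Keep the best num_beams candidates; set finished ones aside
        beams = []
        for hyp in candidates[:num_beams]:
            (finished if hyp[0][-1] == EOS else beams).append(hyp)
        # Simplified early stopping: enough finished hypotheses collected
        if not beams or len(finished) >= num_beams:
            break
    return max(finished or beams, key=lambda hyp: hyp[1])


greedy = beam_search(TOY_MODEL.__getitem__, num_beams=1, max_length=5)
beam = beam_search(TOY_MODEL.__getitem__, num_beams=2, max_length=5)
print(" ".join(greedy[0]), f"{math.exp(greedy[1]):.2f}")  # the cat </s> 0.33
print(" ".join(beam[0]), f"{math.exp(beam[1]):.2f}")      # a bird </s> 0.36
```

Greedy search commits to "the" because it is the most likely first token, and ends with a total probability of 0.33. With two beams, the less likely first token "a" stays alive long enough to reach "bird", giving a better overall sequence at 0.36.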

## 🏁 Conclusion & Future Scope

In this notebook, we successfully built a modular **Neural Machine Translation (NMT)** tool capable of translating between multiple languages using the **MarianMT** architecture. By leveraging Hugging Face Transformers, we achieved high-quality translations without the need for training a model from scratch.



### **Key Achievements**
* **Abstraction:** Encapsulated complex logic into a reusable `LanguageTranslator` class.
* **Performance:** Implemented automatic hardware acceleration (GPU detection) and batch processing for high-throughput inference.
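* **Flexibility:** Changing the `source_lang` and `target_lang` arguments switches between the hundreds of language pairs available in the OPUS-MT collection; each pair is a separate model that is downloaded and loaded on first use.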
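
Because each language pair is its own model, switching pairs repeatedly can be made cheaper by keeping already loaded translators in memory. The sketch below shows one possible approach with a small cache keyed by language pair. `TranslatorCache` is a hypothetical name, and a stand-in factory is used so the example runs without downloading any model.

```python
from typing import Any, Callable, Dict, Tuple


class TranslatorCache:
    """Create one translator per language pair and reuse it afterwards."""

    def __init__(self, factory: Callable[[str, str], Any]):
        # factory builds a translator from (source_lang, target_lang)
        self._factory = factory
        self._loaded: Dict[Tuple[str, str], Any] = {}

    def get(self, source_lang: str, target_lang: str) -> Any:
        key = (source_lang, target_lang)
        if key not in self._loaded:
            self._loaded[key] = self._factory(source_lang, target_lang)
        return self._loaded[key]


# Stand-in factory; with this notebook it would be the LanguageTranslator class
load_count = 0

def fake_factory(source_lang: str, target_lang: str) -> str:
    global load_count
    load_count += 1
    return f"translator({source_lang}->{target_lang})"


cache = TranslatorCache(fake_factory)
cache.get("en", "fr")
cache.get("en", "fr")  # served from the cache, no second load
cache.get("en", "ar")
print(load_count)  # 2
```

In the notebook, `TranslatorCache(LanguageTranslator).get("en", "fr").translate("Hello")` would load the English-to-French model once and reuse it on later calls. Keeping many models in memory has a cost, so a real service would likely cap the cache size.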

### **🔮 Future Improvements**
To elevate this project from a script to a production-grade application, the following enhancements are recommended:

1.  **User Interface (GUI):** Wrap the class in a **Streamlit** or **Gradio** application to provide a web interface for non-technical users.
2.  **Model Quantization:** Export the model to **ONNX** format and/or apply 8-bit quantization to reduce memory usage and, in many cases, improve latency on CPUs.
3.  **API Deployment:** Serve the model via a **FastAPI** endpoint to allow other software services to request translations programmatically.
4.  **Domain Adaptation:** Fine-tune the model on specific datasets (e.g., medical or legal documents) to improve accuracy for specialized terminology.