# Multilingual Language Translation System

An AI-based system for translating text between multiple languages using a pretrained Transformer model.

## 1. Problem Definition & Objective

### a. Selected Project Track
Natural Language Processing (NLP) – Multilingual Machine Translation

### b. Problem Statement
Language barriers limit effective communication between people speaking different languages. Manual translation is time-consuming and requires language expertise.

### c. Real-World Relevance & Motivation
Multilingual translation systems are widely used in education, government services, customer support, and international communication platforms. An automated AI-based translation system improves accessibility and inclusivity.

## 2. Data Understanding & Preparation

### a. Dataset Source
This project does not train on a dataset of its own. It uses `facebook/nllb-200-distilled-600M`, a pretrained multilingual translation model released by Meta AI under the NLLB (No Language Left Behind) project and trained on large-scale parallel corpora.

### b. Data Loading & Exploration
Instead of loading a static dataset, the system performs real-time inference on user-provided text input.

### c. Preprocessing
- Text normalization
- Tokenization using SentencePiece
- Language code mapping
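
The language code mapping step can be sketched as a small lookup from human-readable names to the codes NLLB expects (such as `eng_Latn`). The names `LANGUAGE_CODES` and `to_nllb_code` are hypothetical and used only for illustration; the table lists just the three codes that appear in this notebook.

```python
# Hypothetical lookup table: display name -> NLLB language code.
# Only the codes used in this notebook are listed; extend as needed.
LANGUAGE_CODES = {
    "english": "eng_Latn",
    "hindi": "hin_Deva",
    "gujarati": "guj_Gujr",
}


def to_nllb_code(language: str) -> str:
    """Return the NLLB code for a language name (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"Unsupported language: {language!r}")
    return LANGUAGE_CODES[key]
```

A caller could then pass `to_nllb_code("Hindi")` wherever a target code is required, which keeps user-facing names separate from model-specific codes.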

### d. Handling Noise or Missing Values
Empty or invalid inputs are handled by input validation before inference.
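
The notebook does not show the validation code itself, so the following is only a minimal sketch of what such a check might look like. The function name `validate_input` and the `max_chars` limit are assumptions, not part of the original implementation.

```python
def validate_input(text, max_chars=1000):
    """Return cleaned text, or raise ValueError for unusable input.

    max_chars is an assumed limit chosen for illustration.
    """
    if not isinstance(text, str):
        raise ValueError("Input must be a string.")
    # Collapse runs of whitespace and trim the ends.
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("Input text is empty.")
    if len(cleaned) > max_chars:
        raise ValueError(f"Input exceeds {max_chars} characters.")
    return cleaned
```

Running this check before inference means the model is never called on empty or non-text input.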

## 3. Model / System Design

### a. AI Technique Used
Deep Learning – Transformer-based Neural Machine Translation (NLP)

### b. Architecture / Pipeline
Input Text → Tokenization → Transformer Encoder-Decoder → Target Language Output

### c. Justification of Design Choices
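The NLLB-200 model covers 200 languages with a single unified architecture, so one set of weights serves every language pair instead of one model per pair. The distilled 600M-parameter checkpoint used here keeps memory and compute requirements modest.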

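## 4. Core Implementation

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Distilled 600M-parameter NLLB-200 checkpoint from the Hugging Face Hub.
model_name = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text, src_lang, tgt_lang):
    """Translate text between two languages given as NLLB codes (e.g. "eng_Latn")."""
    # Tell the tokenizer which source-language tag to attach to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Look up the token ID of the target-language tag.
    tgt_lang_id = tokenizer.convert_tokens_to_ids(tgt_lang)

    # Force the decoder to start with the target-language tag so that
    # generation happens in that language.
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tgt_lang_id,
        max_length=256,
    )

    # Decode the single generated sequence and drop special tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```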



```python
translate("How are you?", "eng_Latn", "hin_Deva")
```

    'आप कैसे हैं?'

```python
translate("India is a diverse country.", "eng_Latn", "guj_Gujr")
```

    'ભારત એક વૈવિધ્યસભર દેશ છે.'



## 5. Evaluation & Analysis

### a. Metrics Used
Evaluation was qualitative: sample translations were inspected manually for accuracy and fluency. No automatic metric such as BLEU was computed.

### b. Sample Outputs
For the samples tested (English to Hindi and English to Gujarati), the translated outputs were contextually accurate and grammatically correct.

### c. Performance Analysis & Limitations
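- High accuracy for major languages, judged qualitatively
- Translation quality depends on model size (this project uses the distilled 600M-parameter checkpoint)
- Slower inference on CPU than on GPU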
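
To put a number on the CPU inference cost noted above, a simple timing helper could be used. This is a sketch, not part of the original notebook: `time_call` and its `repeats` parameter are hypothetical names, and the helper works with any callable.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best
```

For example, `time_call(translate, "How are you?", "eng_Latn", "hin_Deva")` would return the translation together with the fastest of three runs, which helps separate steady-state latency from one-off warm-up cost.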

## 6. Ethical Considerations & Responsible AI

### a. Bias & Fairness
The model may inherit biases present in training data.

### b. Dataset Limitations
Low-resource languages may have lower translation quality.

### c. Responsible AI Usage
The system should not be used for legal or medical translation without human verification.

## 7. Conclusion & Future Scope

### a. Conclusion
A multilingual translation system was successfully implemented using a Transformer-based pretrained model.

### b. Future Scope
- Add speech-to-text translation
- Improve UI
- Add BLEU score evaluation
- Deploy on cloud
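
As a starting point for the BLEU item above, the following sketch computes a simplified sentence-level BLEU score. It assumes whitespace tokenization and add-one smoothing, both chosen here for illustration; `simple_bleu` and `ngram_counts` are hypothetical names. A real evaluation would more likely rely on an established tool such as sacreBLEU, and whitespace splitting is a crude approximation for Indic scripts.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on short sentences.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

Scores range from 0 to 1, and an output identical to the reference scores 1. Comparing `translate(...)` outputs against human reference translations with a function like this would turn the qualitative evaluation into a repeatable measurement.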