# Project: Translate-Slate (AI Language Translation)
**Track:** AI Applications - Natural Language Processing (NLP)
**Student Name:** SAURAV
**Student ID:** iitrpr_ai_25010164

## 1. Problem Definition & Objective
**a. Problem Statement:**
A language is not only a mode of communication; it is a roadmap of a culture. It is a way of transferring knowledge and information, an integral part of the history of life itself, and a means of controlling the world. Without proper translation, ideas can be restricted [cite: 8-10].

Effective translation requires identifying the dominant function of a text and adopting the appropriate strategy. Current translation tools often prioritize the literal transfer of factual meaning while stripping away feelings and emotions [cite: 11-13].

**b. Objective:**
The objective is to develop a Many-to-Many Neural Machine Translation (NMT) system that translates text accurately between six major languages: English, Hindi, Japanese, Chinese, Russian, and French [cite: 14].

**c. Real-world Relevance:**
This system can be deployed in educational tools, travel assistants, or business communication platforms to bridge linguistic gaps and ensure ideas are not restricted by language barriers.

## 2. Data Understanding & Preparation
**a. Dataset Source (Pre-trained Model Context):**
Because this project uses a pre-trained Transformer model (`facebook/m2m100_418M`), we do not train on a raw dataset from scratch. The underlying model was trained on large-scale many-to-many parallel data mined from CommonCrawl, covering 100 languages.

**b. Input Data:**
The "data" for this application consists of text strings entered by the user at run time.

**c. Preprocessing & Feature Engineering:**
The system processes input using a specialized tokenizer (`M2M100Tokenizer`)[cite: 19]. The preprocessing pipeline involves:
1.  **Tokenization:** Converting raw text strings into subword token IDs, prefixed with a token for the source language set via `tokenizer.src_lang`.
2.  **Language Codes:** Passing the target-language token ID as `forced_bos_token_id` so that the model begins generation in the correct target language during inference [cite: 45].
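
The language-code step can be sketched as a small lookup. The names `LANG_CODES` and `resolve_lang_code` are hypothetical and not part of the notebook. The two-letter codes are the ones assumed here for the six project languages, matching the `'en'` and `'hi'` examples used in the implementation.

```python
# Hypothetical mapping from display names to M2M100 language codes.
LANG_CODES = {
    "English": "en",
    "Hindi": "hi",
    "Japanese": "ja",
    "Chinese": "zh",
    "Russian": "ru",
    "French": "fr",
}


def resolve_lang_code(name: str) -> str:
    """Return the language code for a display name or an already-valid code."""
    cleaned = name.strip().lower()
    # Accept a code such as 'hi' directly.
    if cleaned in LANG_CODES.values():
        return cleaned
    # Otherwise match the display name case-insensitively.
    for display_name, code in LANG_CODES.items():
        if display_name.lower() == cleaned:
            return code
    raise ValueError(f"Unsupported language: {name!r}")
```

The returned code could then be assigned to `tokenizer.src_lang` or passed to `tokenizer.get_lang_id`, so that unsupported languages are rejected before the model is called.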

## 3. Model / System Design
**a. AI Technique:**
We apply **Natural Language Processing (NLP)** with a **Transformer**-based model, specifically the **M2M100** (Many-to-Many) architecture [cite: 22].

**b. Architecture Explanation:**
The pipeline loads a pre-trained `M2M100ForConditionalGeneration` model. The input text is tokenized, fed into the transformer encoder-decoder structure, and decoded into the target language string.

**c. Justification of Design Choices:**
Traditional approaches require a separate model for every language pair and direction (e.g., En-Fr, Fr-En). M2M100 is a single model that can translate directly between any of its supported languages. This reduces computational overhead and avoids the "English-centric" bottleneck, in which text is first translated into English and only then into the target language [cite: 22, 23, 44].
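
To make the comparison concrete, the short sketch below counts how many directed pair-specific models six languages would need, and how many translation passes an English pivot would take. The function names are hypothetical and used only for illustration.

```python
def directed_pairs(num_languages: int) -> int:
    """Number of ordered (source, target) pairs among the languages."""
    return num_languages * (num_languages - 1)


def pivot_hops(src: str, tgt: str, pivot: str = "en") -> int:
    """Translation passes needed when routing through a pivot language."""
    if src == tgt:
        return 0
    # A single pass suffices when the pivot is already one endpoint.
    if pivot in (src, tgt):
        return 1
    return 2


print(directed_pairs(6))       # 30 pair-specific models vs. one M2M100 model
print(pivot_hops("hi", "fr"))  # 2 passes via English; a direct model needs 1
```

For the six project languages this gives 30 directed pairs, all of which are served by one set of weights.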

## 4. Core Implementation

```python
# a. Model training / inference logic
# Note: We are performing inference using pre-trained weights.

import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

def ai_translate(text, src_lang, tgt_lang):
    """
    Translates text using the M2M100 Transformer model.
    Params:
        text (str): Input text
        src_lang (str): Source language code (e.g., 'en')
        tgt_lang (str): Target language code (e.g., 'hi')
    Returns:
        str: The translated text
    """
    # Tell the tokenizer which source-language token to prepend
    tokenizer.src_lang = src_lang

    # Encode input as PyTorch tensors
    encoded_input = tokenizer(text, return_tensors="pt")

    # Generate tokens (inference only, so no gradients are tracked)
    with torch.no_grad():
        generated_tokens = model.generate(
            **encoded_input,
            # Force the first generated token to be the target-language code
            forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        )

    # Decode output, dropping language and end-of-sequence tokens
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print("Model Loaded and Ready for Inference.")
```
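
As an illustration of the many-to-many use case, `ai_translate` could be wrapped to fan one input out to every other project language. This is a sketch and not part of the submitted notebook: `translate_to_all`, `PROJECT_LANGS`, and the `translate_fn` parameter are hypothetical names. The translator is passed in as an argument so that the helper can be tried with a stub before the model is loaded.

```python
from typing import Callable, Dict

# Assumed codes for the six project languages.
PROJECT_LANGS = ["en", "hi", "ja", "zh", "ru", "fr"]


def translate_to_all(
    text: str,
    src_lang: str,
    translate_fn: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Translate text from src_lang into each other project language."""
    if src_lang not in PROJECT_LANGS:
        raise ValueError(f"Unsupported source language: {src_lang!r}")
    # One call per target language; the source language itself is skipped.
    return {
        tgt: translate_fn(text, src_lang, tgt)
        for tgt in PROJECT_LANGS
        if tgt != src_lang
    }
```

With the model loaded, a call such as `translate_to_all("hello", "en", ai_translate)` would return a dictionary keyed by target-language code.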

## 5. Evaluation & Analysis
**a. Metrics Used:**
For this prototype, we relied on **Qualitative Analysis**: we tested the model on sentences involving technical terms and varied grammatical structures and inspected the output manually.
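
A quantitative check could complement the manual inspection. The sketch below computes a simple unigram-overlap F1 between a model output and a reference translation. It was not used in this project; `token_f1` is a hypothetical helper, and reference translations would have to be supplied separately. Whitespace splitting is an assumption that does not hold for Japanese or Chinese, so established metrics such as BLEU or chrF would be preferable for a real evaluation.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between whitespace-split candidate and reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the two strings contain the same words, ignoring order and case, and 0.0 means they share none.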

**b. Sample Outputs:**
- *Input (English):* "hello, this is my first ai project."
- *Output (Hindi):* "हैलो, यह मेरी पहली परियोजना है" [cite: 36]
- *Input (English):* "Can you tell me where is eiffel tower?."
- *Output (French):* "Pouvez-vous me dire où est la tour Eiffel ?." [cite: 37, 38]

**c. Performance Analysis:**
In the samples above, the model produces grammatical output, although the Hindi translation omits the term "ai" from the input. As a 418M-parameter model, it may also struggle with highly nuanced or poetic text compared to larger (1.2B+) variants [cite: 50].

## 6. Ethical Considerations & Responsible AI
**a. Bias and Fairness:**
Translation models often exhibit gender bias (e.g., assuming "Doctor" is male). We identified that pre-trained models carry inherent data bias from their training sets (CommonCrawl)[cite: 47].
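
One way such bias could be probed is to translate sentences into English and count which pronouns the model chooses. The sketch below covers only the counting step and assumes the English translations have already been collected as a list of strings. The pronoun sets and the function name are hypothetical and deliberately minimal.

```python
import re

# Hypothetical pronoun lists for an English-language probe.
MASCULINE = {"he", "him", "his"}
FEMININE = {"she", "her", "hers"}


def count_gendered_pronouns(sentences):
    """Count masculine and feminine English pronouns across sentences."""
    counts = {"masculine": 0, "feminine": 0}
    for sentence in sentences:
        # Lowercase and split into alphabetic words.
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word in MASCULINE:
                counts["masculine"] += 1
            elif word in FEMININE:
                counts["feminine"] += 1
    return counts
```

A strong imbalance for inputs that do not specify gender would be one signal of the bias described above.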

**b. Responsible Use:**
To mitigate these risks, the application includes clear usage guidelines. The tool should not be used for legal or medical translations, where errors can have serious or life-threatening consequences.

## 7. Conclusion & Future Scope
**a. Summary of Results:**
We developed a working prototype, "Translate-Slate", that accepts text and outputs translations across six major languages using a scalable AI architecture [cite: 29].

**b. Future Improvements:**
1.  **Speech Integration:** Add Speech-to-Text capabilities so users can speak directly into the app[cite: 49].
2.  **Model Scaling:** Upgrade to the 1.2B parameter model for higher accuracy in nuance-heavy translations[cite: 50].
3.  **Cloud Hosting:** Deploy the backend to a cloud server to allow access from any device[cite: 51].
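
For the model-scaling item, the upgrade could be reduced to a one-line configuration change. This sketch assumes the Hugging Face model IDs shown below; `MODEL_VARIANTS` and `select_model_name` are hypothetical names.

```python
# Assumed Hugging Face model IDs for the two M2M100 sizes discussed above.
MODEL_VARIANTS = {
    "418M": "facebook/m2m100_418M",
    "1.2B": "facebook/m2m100_1.2B",
}


def select_model_name(size: str = "418M") -> str:
    """Return the model ID for a given size label."""
    try:
        return MODEL_VARIANTS[size]
    except KeyError:
        valid = ", ".join(MODEL_VARIANTS)
        raise ValueError(f"Unknown model size {size!r}; choose one of: {valid}")
```

The returned string could replace `model_name` in the implementation, leaving the rest of the inference code unchanged.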