# BharatCodeMix: Dialect & Code-Mixed Translation System
**Project Track**: Language Translation Systems / Natural Language Processing
**Objective**: Build a robust, local-first translation system for Indian languages that handles code-mixing (Hinglish), dialects, and transliteration.

## Problem Statement
Standard translation systems (like Google Translate) often fail on:
 1. **Code-Mixed Text (Hinglish)**: Informal mixing of Hindi and English (e.g., "Kya haal hai bro?").
 2. **Dialects & Slang**: Words like "bindass" and "jugaad" are often mistranslated.
 3. **Privacy**: Most high-quality translators require cloud APIs, so the text has to leave the user's machine.

**BharatCodeMix** addresses this with a pipeline that combines slang normalization, transliteration, and specialized open-source models (MarianMT), all running entirely on the CPU.

## Real-world Relevance
- **Social Media Analysis**: Understanding Hinglish tweets/comments.
- **Chatbots**: Supporting vernacular users who mix languages.
- **Content Moderation**: Detecting toxicity in code-mixed text.



# 2. Data Understanding & Preparation
Since we are using pre-trained models (`Helsinki-NLP/opus-mt`), our "data" preparation focuses on **Normalization Maps** (Slang dictionary) and **Glossaries**.

## Sources
- **Slang Map**: Curated JSON file mapping Hinglish/slang terms to formal language.
- **Glossary**: CSV file for domain-specific constraints (e.g., keeping "AI" as "AI" instead of translating it).
- **Evaluation Set**: A small curated set of Hinglish sentences for testing.
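
One common way to enforce a glossary constraint such as keeping "AI" untranslated is to mask protected terms with placeholders before translation and restore them afterwards. The sketch below illustrates that idea only. The function names `protect_terms` and `restore_terms`, the placeholder format, and the sample terms are assumptions for illustration, not the project's actual glossary logic, and a real NMT model is not guaranteed to pass placeholders through unchanged.

```python
import re

def protect_terms(text, terms):
    """Replace each glossary term with a numbered placeholder.

    Returns the masked text and a mapping of placeholder -> original term.
    """
    mapping = {}
    masked = text
    # Longest terms first so a longer term is masked before any shorter one it contains
    for i, term in enumerate(sorted(terms, key=len, reverse=True)):
        placeholder = f"__TERM{i}__"
        # Match the term only when it is not embedded in a larger word
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)")
        if pattern.search(masked):
            masked = pattern.sub(placeholder, masked)
            mapping[placeholder] = term
    return masked, mapping

def restore_terms(text, mapping):
    """Put the original glossary terms back in place of their placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text

masked, mapping = protect_terms("AI is changing the API landscape", ["AI", "API"])
print(masked)
print(restore_terms(masked, mapping))
```

In a full pipeline the masked text would be sent through the translator between the two calls.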



In [None]:
import json
import pandas as pd
import os

# Display Slang Map
slang_map_path = "data/slang_map.json"
if os.path.exists(slang_map_path):
    with open(slang_map_path, "r", encoding="utf-8") as f:
        print("Slang Map Sample:", json.load(f))

# Display Glossary
glossary_path = "data/glossary_example.csv"
if os.path.exists(glossary_path):
    df_glossary = pd.read_csv(glossary_path)
    print("\nGlossary Sample:")
    print(df_glossary.head())



# 3. Model / System Design
**Technique**: Hybrid NLP Pipeline combining Rule-based Normalization and Neural Machine Translation (NMT).

## Pipeline Architecture
1.  **Language Detection**: Identify whether the input is English, Hindi, or Hinglish (Code-Mixed).
2.  **Normalization**:
    *   **Slang Replacement**: `bindass` -> `carefree`.
    *   **Transliteration**: Convert Latin script Hindi (`tum`) to Devanagari (`तुम`).
3.  **Translation**: Use `Helsinki-NLP/opus-mt-en-hi` or `opus-mt-hi-en` based on direction.
4.  **Quality Check**: Verify confidence using back-translation or probability scores.
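
As a rough illustration of the detection step, the sketch below classifies the script of an input by counting letters in the Devanagari Unicode block (U+0900 to U+097F) versus ASCII letters. It detects script only: romanized Hindi and English both come out as Latin, so telling Hinglish apart from English would need an extra lexicon or classifier. The function name `detect_script` and its labels are hypothetical and not taken from the project's code.

```python
def detect_script(text):
    """Classify text as 'devanagari', 'latin', 'mixed', or 'unknown' by counting letters."""
    devanagari = 0
    latin = 0
    for ch in text:
        if "\u0900" <= ch <= "\u097F":
            # Character falls inside the Devanagari Unicode block
            devanagari += 1
        elif ch.isascii() and ch.isalpha():
            latin += 1
    if devanagari == 0 and latin == 0:
        return "unknown"
    if devanagari == 0:
        return "latin"
    if latin == 0:
        return "devanagari"
    return "mixed"

for sample in ["Kya haal hai bro?", "तुम कैसे हो?", "मुझे coffee चाहिए"]:
    print(sample, "->", detect_script(sample))
```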

## Justification
*   **Why MarianMT?**: Lightweight, fast on CPU, good baseline for Indic languages.
*   **Why Rule-based Norm?**: NMT models struggle with slang; explicit normalization fixes this before the model sees it.
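
A minimal sketch of rule-based slang replacement, assuming the slang map is a flat dictionary from slang term to formal replacement. The `bindass` entry comes from the pipeline description; the `bro` entry and the function name `normalize_slang` are hypothetical. Matching is done on whole words so that a key such as `bro` does not fire inside `broken`.

```python
import re

# Hypothetical entries; the real slang map lives in data/slang_map.json
SLANG_MAP = {
    "bindass": "carefree",
    "bro": "brother",
}

def normalize_slang(text, slang_map):
    """Replace whole-word slang terms with their formal form, ignoring case."""
    if not slang_map:
        return text
    # Longest keys first so multi-word entries win over their sub-words
    keys = sorted(slang_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in slang_map.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

print(normalize_slang("Life is bindass, bro", SLANG_MAP))
```

Running normalization before translation means the NMT model only sees vocabulary it is more likely to have encountered during training.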



# 4. Core Implementation
The following code initializes the pipeline and demonstrates the core translation logic.



In [None]:
import sys
import os

# Add current directory to path so we can import 'core'
sys.path.append(os.getcwd())

from core.pipeline import TranslationPipeline

# Initialize Pipeline (Drivers, Models, Normalizers)
print("Initializing Pipeline... (this may take a moment to load models)")
pipeline = TranslationPipeline()
print("Pipeline Initialized.")
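
The internals of `TranslationPipeline` are not shown in this notebook. As a rough idea of what the translation step could look like, the sketch below calls MarianMT directly through Hugging Face `transformers`. The helper names `model_name_for`, `load_model`, and `translate_batch` are hypothetical. It assumes `transformers`, `torch`, and `sentencepiece` are installed, and the first call downloads the model weights, so it needs network access once.

```python
# Model identifiers named in the pipeline design
MODEL_NAMES = {
    ("en", "hi"): "Helsinki-NLP/opus-mt-en-hi",
    ("hi", "en"): "Helsinki-NLP/opus-mt-hi-en",
}

def model_name_for(src, tgt):
    """Pick the MarianMT checkpoint for a translation direction."""
    try:
        return MODEL_NAMES[(src, tgt)]
    except KeyError:
        raise ValueError(f"Unsupported direction: {src} -> {tgt}") from None

def load_model(name):
    """Load tokenizer and model; imported lazily so this file imports without transformers."""
    from transformers import MarianMTModel, MarianTokenizer
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    return tokenizer, model

def translate_batch(texts, src="en", tgt="hi"):
    """Translate a list of sentences in one batch on the CPU."""
    tokenizer, model = load_model(model_name_for(src, tgt))
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Example (downloads the model on first use):
# print(translate_batch(["The server needs a reboot."], src="en", tgt="hi"))
```

This version reloads the model on every call for simplicity; a real pipeline would load each model once and reuse it.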



## Demonstration
We will run the pipeline on three inputs, one per category:
1.  **Hinglish (Code-Mixed)**: "Kya haal hai bro?"
2.  **Slang**: "Yeh movie bahot bindass hai"
3.  **Glossary Term**: "The server needs a reboot." (where domain terms might be preserved)



In [None]:
def run_demo(text, target="Hindi"):
    result = pipeline.translate(text, target_lang=target)
    print(f"\n--- Input: {text} ({target}) ---")
    print(f"Detected: {result['logs'].get('detected_lang')} ({result['logs'].get('detected_script')})")
    print(f"Normalized: {result['normalized']}")
    print(f"Translation: {result['translation']}")
    print(f"Confidence: {result['confidence']:.2f}")

# Demo Cases
run_demo("Kya haal hai bro?", target="Hindi")
run_demo("Yeh movie bahot bindass hai", target="English")
run_demo("The server needs a reboot.", target="Hindi")



# 5. Evaluation & Analysis
We evaluate the system on a small test set using qualitative inspection and confidence scores.



In [None]:
# Load Evaluation Set
eval_file = "data/eval_set.jsonl"
results = []

if os.path.exists(eval_file):
    with open(eval_file, "r", encoding="utf-8") as f:
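        for line in f:
            # Skip blank lines (e.g. a trailing newline) so json.loads does not fail
            if not line.strip():
                continue
            data = json.loads(line)
            src = data.get("source")
            tgt = data.get("target_lang", "Hindi")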
            
            res = pipeline.translate(src, target_lang=tgt)
            results.append({
                "input": src,
                "output": res["translation"],
                "confidence": res["confidence"]
            })

    # Show Results DataFrame
    df_res = pd.DataFrame(results)
    print(df_res)
    
    print(f"\nAverage Confidence: {df_res['confidence'].mean():.2f}")
else:
    print("Evaluation set not found.")
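
The pipeline's quality-check step mentions back-translation as a confidence signal. A minimal sketch of that idea: translate forward, translate the result back, and compare the round trip to the original with a string-similarity ratio. The `forward` and `backward` callables are placeholders for real translation functions, and the toy dictionary translators exist only so the example runs offline. Character-level similarity is a crude proxy and is not the project's actual confidence score.

```python
from difflib import SequenceMatcher

def round_trip_score(text, forward, backward):
    """Return (translation, back_translation, similarity in [0, 1])."""
    translation = forward(text)
    back = backward(translation)
    similarity = SequenceMatcher(None, text.lower(), back.lower()).ratio()
    return translation, back, similarity

# Toy stand-ins for real translators, so the sketch runs without any model
EN_HI = {"how are you": "आप कैसे हैं"}
HI_EN = {v: k for k, v in EN_HI.items()}

def toy_forward(text):
    return EN_HI.get(text.lower(), text)

def toy_backward(text):
    return HI_EN.get(text, text)

translation, back, score = round_trip_score("How are you", toy_forward, toy_backward)
print(translation, "|", back, "|", f"{score:.2f}")
```

With real models, a low round-trip similarity flags sentences worth inspecting by hand, although a high score does not prove the forward translation is correct.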



## Performance Analysis
*   **Strengths**: In qualitative inspection, common Hinglish phrases are handled well, which we attribute to the normalization layer.
*   **Limitations**: Intra-sentence code-switching whose meaning depends on wider context may still be mistranslated.
*   **Latency**: The first run is slow because the models have to be loaded; subsequent runs are faster.
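
The gap between the first and later runs comes from loading model weights. One way to make sure that cost is paid only once per process is to cache the loader. The sketch below uses `functools.lru_cache` with a stand-in loader (`load_model_stub` just sleeps instead of reading weights). Whether `TranslationPipeline` already caches its models internally is not shown in this notebook, so treat this as an illustration of the effect rather than a description of the implementation.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model_stub(name):
    """Stand-in for an expensive model load; sleeps instead of reading weights."""
    time.sleep(0.2)
    return {"name": name}

def timed_load(name):
    """Load (or fetch from cache) and report how long it took in seconds."""
    start = time.perf_counter()
    model = load_model_stub(name)
    return model, time.perf_counter() - start

_, first = timed_load("Helsinki-NLP/opus-mt-en-hi")
_, second = timed_load("Helsinki-NLP/opus-mt-en-hi")
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```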



# 6. Ethical Considerations & Responsible AI
*   **Bias**: The underlying MarianMT models are trained on OPUS parallel corpora, which are drawn from sources such as official documents, software localization files, and subtitles, and may not reflect colloquial Hinglish well. This is why we add a Normalizer.
*   **Fairness**: We explicitly include dialect support to ensure speakers of non-standard Hindi are understood.
*   **Responsible Use**: The tool is designed to bridge communication gaps, but users should be aware it is an automated machine translation and requires human oversight for critical tasks.



# 7. Conclusion & Future Scope
**Summary**: BharatCodeMix successfully demonstrates a local-first approach to Indian language translation, tackling the specific challenge of Code-Mixing.

**Future Scope**:
1.  **Fine-tuning**: Fine-tune the NMT model itself on Hinglish datasets instead of relying only on normalization.
2.  **More Languages**: Expansion to Tamil, Telugu, and Bengali.
3.  **Voice-to-Voice**: Improving the latency of the STT -> TTS pipeline for real-time conversation.

