## Data Preprocessing for Multilingual NMT

In this section, we prepare the raw English–Telugu bilingual dataset for Neural Machine Translation (NMT).  
The input is a CSV file of **parallel sentences**, with English in one column and Telugu in another.

### Steps Performed:
1. **Data Extraction:**  
   We read the CSV file and extracted the English and Telugu sentence pairs that form the parallel corpus.

2. **Language Tagging:**  
   Since our model is **multilingual**, we prepended a **target-language and domain tag** to each source sentence.  
   - Tag format: `<2{target_language}-{domain}>`, with spaces in the domain replaced by underscores  
   - Example: `<2te-news>` requests Telugu output in the *news* domain.  
   These tags tell the model which target language and domain to produce during translation.

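3. **Data Formatting:**  
   Each side of every sentence pair was written to its own text file. The file name prefix gives the translation direction:
   - `en-te.en` → tagged English source for the English → Telugu direction  
   - `en-te.te` → Telugu target for the English → Telugu direction  
   - `te-en.en` → tagged Telugu source for the Telugu → English direction  
   - `te-en.te` → English target for the Telugu → English direction  

   Note that in the `te-en` files the extension does not match the language of the content.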

4. **Output Preparation:**  
   The processed and tagged sentences were saved in plain text format, which serves as the input for the next stages:  
   **Subword segmentation** and **model training**.
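
The tag format from step 2 can be sketched on a single sentence. This is a minimal illustration, not the notebook's own code: `make_tag` and `tag_source` are hypothetical helper names, and the underscore replacement is assumed to match how multi-word domains are handled.

```python
def make_tag(target_lang: str, domain: str) -> str:
    """Build a tag such as <2te-news> from a language code and a domain name."""
    # Spaces are replaced so the tag stays a single whitespace-delimited token.
    return f"<2{target_lang}-{domain.replace(' ', '_')}>"


def tag_source(sentence: str, target_lang: str, domain: str) -> str:
    """Prepend the tag to a source sentence, separated by one space."""
    return f"{make_tag(target_lang, domain)} {sentence}"


print(make_tag("te", "news"))
print(tag_source("The three faces inside are shaded.", "te", "Computer science"))
```

Only the source side carries the tag; the target sentence is left unchanged.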



## Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
%cd /content/drive/MyDrive/Colab Notebooks/LLM/workflow/dataset/raw

/content/drive/MyDrive/Colab Notebooks/LLM/workflow/dataset/raw


In [None]:
df = pd.read_csv("en-te.csv")

In [None]:
df.shape

(133742, 5)
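
Before extracting the pairs, it may be worth checking the `English` and `Telugu` columns for missing, empty, or duplicated rows, since each such row would become an unusable line in the output files. The sketch below is not part of the original workflow: `summarize_pairs` is a hypothetical helper, and the toy DataFrame stands in for `df`.

```python
import pandas as pd


def summarize_pairs(frame: pd.DataFrame, src: str = "English", tgt: str = "Telugu") -> dict:
    """Count rows that would produce unusable sentence pairs."""
    pairs = frame[[src, tgt]]
    # A row is "missing" if either side is NaN/None.
    missing = pairs.isna().any(axis=1)
    # A row is "empty" if either side is whitespace-only (missing rows excluded).
    empty = pairs.fillna("").apply(lambda col: col.str.strip() == "").any(axis=1) & ~missing
    # A row is "duplicated" if the same (source, target) pair appeared earlier.
    duplicated = frame.duplicated(subset=[src, tgt], keep="first")
    return {
        "rows": len(frame),
        "missing": int(missing.sum()),
        "empty": int(empty.sum()),
        "duplicated": int(duplicated.sum()),
    }


# Toy data for illustration only.
toy = pd.DataFrame({
    "English": ["Hello.", None, "  ", "Hello."],
    "Telugu": ["నమస్కారం.", "ఏదో.", "ఖాళీ.", "నమస్కారం."],
})
summary = summarize_pairs(toy)
print(summary)
```

Whether to drop or repair the flagged rows depends on the corpus; the function only reports counts.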

## 1. **Data Extraction:**  

In [None]:
df.head()

Unnamed: 0,Domain,Source Language,Target Language,English,Telugu
0,Computer science,eng_Latn,tel_Telu,Main function of a graphic design is to enhanc...,"ఇమేజెస్‌ను టైపోగ్రాఫిక్, లేదా విజువల్ లేదా రెం..."
1,Computer science,eng_Latn,tel_Telu,Cyber vulnerabilities occur during CPS and the...,CPS మరియు బాహ్య ప్రపంచం వారు కమ్యూనికేట్ చేయడం...
2,Computer science,eng_Latn,tel_Telu,The three faces inside are shaded.,లోపల మూడు ఫేస్‌లు షేడ్‌ చేయబడి ఉంటాయి.
3,Computer science,eng_Latn,tel_Telu,"Before moving on, we should understand what is...","ముందుకు వెళ్లే ముందు, క్రిప్టోసిస్టమ్స్ అంటే ఏ..."
4,Computer science,eng_Latn,tel_Telu,"Instead of just 1 or 2 desk surfaces, like in ...",ఒక క్యూబికల్‌లో వలె కేవలం 1 లేదా 2 డెస్క్ లు బ...


In [None]:
df.drop(['Source Language', 'Target Language'], axis=1, inplace=True)


## 2. **Language Tagging:**

In [None]:
te_Tags = set()  # maintain a set of unique tags for Telugu

def add_te_tag(row):
    # Replace spaces in the domain with underscores so the tag stays a single token
    tag_clean = row['Domain'].replace(" ", "_")
    tag = f"<2te-{tag_clean}>"
    te_Tags.add(tag)
    return f"{tag} {row['English']}"

df['en-te.en'] = df.apply(add_te_tag, axis=1)
df['en-te.te'] = df['Telugu']  # target language column is Telugu

In [None]:
te_Tags

{'<2te-Computer_science>', '<2te-Mathematics>'}

In [None]:
en_Tags = set()  # maintain a set of unique tags for English

def add_en_tag(row):
    # Replace spaces in the domain with underscores so the tag stays a single token
    tag_clean = row['Domain'].replace(" ", "_")
    tag = f"<2en-{tag_clean}>"
    en_Tags.add(tag)
    return f"{tag} {row['Telugu']}"

df['te-en.en'] = df.apply(add_en_tag, axis=1)
df['te-en.te'] = df['English']  # target language column is English

In [None]:
en_Tags

{'<2en-Computer_science>', '<2en-Mathematics>'}
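
`add_te_tag` and `add_en_tag` differ only in the language code and the source column, and `df.apply(..., axis=1)` calls them once per row. As a possible alternative, the same tagging could be done with vectorized string operations. The sketch below assumes the `Domain` column contains only strings; `add_tags` is a hypothetical helper and the toy DataFrame is for illustration only.

```python
import pandas as pd


def add_tags(frame: pd.DataFrame, target_lang: str, source_col: str, domain_col: str = "Domain"):
    """Return (tagged source sentences, set of unique tags) for one direction."""
    # Build one tag per row, e.g. "<2te-Computer_science>".
    tags = "<2" + target_lang + "-" + frame[domain_col].str.replace(" ", "_", regex=False) + ">"
    # Prepend each tag to its sentence, separated by a single space.
    tagged = tags + " " + frame[source_col]
    return tagged, set(tags.unique())


# Toy data for illustration only.
toy = pd.DataFrame({
    "Domain": ["Computer science", "Mathematics", "Computer science"],
    "English": [
        "The three faces inside are shaded.",
        "Two plus two equals four.",
        "Before moving on, read the notes.",
    ],
})
tagged, unique_tags = add_tags(toy, "te", "English")
print(tagged.tolist())
print(unique_tags)
```

Calling `add_tags(df, "en", "Telugu")` would cover the reverse direction with the same function.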

## 3. **Data Formatting:**  

In [None]:
df[['en-te.en', 'en-te.te', 'te-en.te', 'te-en.en']].head(3)

Unnamed: 0,en-te.en,en-te.te,te-en.te,te-en.en
0,<2te-Computer_science> Main function of a grap...,"ఇమేజెస్‌ను టైపోగ్రాఫిక్, లేదా విజువల్ లేదా రెం...",Main function of a graphic design is to enhanc...,<2en-Computer_science> ఇమేజెస్‌ను టైపోగ్రాఫిక్...
1,<2te-Computer_science> Cyber vulnerabilities o...,CPS మరియు బాహ్య ప్రపంచం వారు కమ్యూనికేట్ చేయడం...,Cyber vulnerabilities occur during CPS and the...,<2en-Computer_science> CPS మరియు బాహ్య ప్రపంచం...
2,<2te-Computer_science> The three faces inside ...,లోపల మూడు ఫేస్‌లు షేడ్‌ చేయబడి ఉంటాయి.,The three faces inside are shaded.,<2en-Computer_science> లోపల మూడు ఫేస్‌లు షేడ్‌...


## 4. Saving each column to a separate file

In [None]:
import os

# Define the output directory
output_dir = "parallel_copora"

# Create directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

parallel_corpus = df[['en-te.en', 'en-te.te', 'te-en.te', 'te-en.en']]
for col in parallel_corpus.columns:
    filename = os.path.join(output_dir, col)
    # Write one sentence per line as plain text, without CSV escaping of commas or quotes
    with open(filename, "w", encoding="utf-8") as f:
        for sentence in parallel_corpus[col].astype(str):
            # Flatten embedded line breaks so source and target files stay line-aligned
            f.write(" ".join(sentence.splitlines()) + "\n")
    print(f"Saved {col} -> {filename}")


Saved en-te.en -> parallel_copora/en-te.en
Saved en-te.te -> parallel_copora/en-te.te
Saved te-en.te -> parallel_copora/te-en.te
Saved te-en.en -> parallel_copora/te-en.en
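
Because the source and target files are paired line by line, a quick check that both files of each direction have the same number of lines can catch alignment problems before subword segmentation. The sketch below is an assumption about how such a check might look, not part of the original notebook: `count_lines` and `check_alignment` are hypothetical helpers, and the demonstration runs on throwaway files rather than the real corpus.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_alignment(directory: str, pairs) -> dict:
    """Return {(src, tgt): (n_src, n_tgt)} for every pair whose line counts differ."""
    mismatches = {}
    for src, tgt in pairs:
        n_src = count_lines(os.path.join(directory, src))
        n_tgt = count_lines(os.path.join(directory, tgt))
        if n_src != n_tgt:
            mismatches[(src, tgt)] = (n_src, n_tgt)
    return mismatches


# Demonstration on temporary files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "en-te.en": ["<2te-news> first sentence", "<2te-news> second sentence"],
        "en-te.te": ["target one", "target two"],
    }
    for name, lines in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    result = check_alignment(tmp, [("en-te.en", "en-te.te")])

print(result)
```

An empty dictionary means every pair is aligned. On the real output, the call would be `check_alignment("parallel_copora", [("en-te.en", "en-te.te"), ("te-en.en", "te-en.te")])`.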
