<a href="https://colab.research.google.com/github/Mayur-S-Gaidhane/test/blob/master/BPCC_Devanagari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install required libraries in Colab

In [None]:
!pip install -q datasets pandas pyarrow tqdm

Log in to Hugging Face in Colab

In [None]:
from huggingface_hub import login

login()


Explore BPCC configs

In [None]:
from datasets import get_dataset_config_names

configs = get_dataset_config_names("ai4bharat/BPCC")
print("Available BPCC configs:")
for c in configs:
    print("-", c)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Available BPCC configs:
- bpcc-seed-latest
- bpcc-seed-v1
- daily
- comparable
- ilci
- nllb-seed
- massive
- nllb-filtered
- samanantar-filtered
- samanantar++


Load a BPCC subset

In [None]:
# Load the bpcc-seed-v1 config of BPCC (one split per target language)
from datasets import load_dataset

config_name = "bpcc-seed-v1"

# Load all language splits into a DatasetDict
ds_dict = load_dataset("ai4bharat/BPCC", config_name)

# Print dataset structure to view available splits
print("Dataset Loaded Successfully!")
print(ds_dict)


Dataset Loaded Successfully!
DatasetDict({
    asm_Beng: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 44727
    })
    ben_Beng: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 47994
    })
    brx_Deva: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 22712
    })
    doi_Deva: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 18657
    })
    gom_Deva: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 18284
    })
    guj_Gujr: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 25047
    })
    hin_Deva: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 40270
    })
    kan_Knda: Dataset({
        features: ['src_lang', 'tgt_lang', 'src', 'tgt'],
        num_rows: 32193
    })
    kas_Arab: Dataset({
        features: ['src_lang'

In [None]:
# Find only Devanagari datasets (ending with _Deva)
deva_splits = [name for name in ds_dict.keys() if name.endswith("_Deva")]
print("Devanagari Splits Found:", deva_splits)

total = 0
for name in deva_splits:
    print(name, "->", len(ds_dict[name]))
    total += len(ds_dict[name])

print("Total Devanagari Records:", total)


Devanagari Splits Found: ['brx_Deva', 'doi_Deva', 'gom_Deva', 'hin_Deva', 'mai_Deva', 'mar_Deva', 'npi_Deva', 'san_Deva', 'snd_Deva']
brx_Deva -> 22712
doi_Deva -> 18657
gom_Deva -> 18284
hin_Deva -> 40270
mai_Deva -> 24410
mar_Deva -> 54348
npi_Deva -> 45866
san_Deva -> 27744
snd_Deva -> 10502
Total Devanagari Records: 262793
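The split names appear to follow a `<language>_<script>` pattern (for example `hin_Deva`), which is why the `endswith("_Deva")` check is sufficient here. The following is a minimal sketch, using only the standard library and a handful of split names copied from the output above, that groups codes by their script suffix. The helper name `group_by_script` is hypothetical and not part of the `datasets` library.

```python
from collections import defaultdict


def group_by_script(codes):
    """Group codes such as 'hin_Deva' by the script part after the underscore."""
    groups = defaultdict(list)
    for code in codes:
        # partition splits on the first underscore: 'hin_Deva' -> ('hin', '_', 'Deva')
        lang, _, script = code.partition("_")
        groups[script].append(lang)
    return dict(groups)


# A few split names taken from the bpcc-seed-v1 output above
sample_codes = ["asm_Beng", "ben_Beng", "brx_Deva", "hin_Deva", "kas_Arab", "mar_Deva"]
by_script = group_by_script(sample_codes)
print(by_script["Deva"])
```

The same grouping could be used to pick out any other script family (for example `Beng` or `Arab`) without hard-coding a suffix.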


List which Devanagari languages are present

In [None]:
from collections import Counter

# Collect all src_lang and tgt_lang values from every split in ds_dict
src_lang_counts = Counter()
tgt_lang_counts = Counter()

for split_name, dataset in ds_dict.items():
    src_lang_counts.update(dataset["src_lang"])
    tgt_lang_counts.update(dataset["tgt_lang"])

print("Unique source languages:", list(src_lang_counts.keys()))
print("Unique target languages:", list(tgt_lang_counts.keys()))

# Extract Devanagari languages (end with _Deva)
all_langs = set(src_lang_counts.keys()) | set(tgt_lang_counts.keys())
deva_langs = sorted([lang for lang in all_langs if lang.endswith("_Deva")])

print("\nDevanagari languages found in BPCC:", deva_langs)


Unique source languages: ['eng_Latn']
Unique target languages: ['asm_Beng', 'ben_Beng', 'brx_Deva', 'doi_Deva', 'gom_Deva', 'guj_Gujr', 'hin_Deva', 'kan_Knda', 'kas_Arab', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'mni_Mtei', 'npi_Deva', 'ory_Orya', 'pan_Guru', 'san_Deva', 'snd_Deva', 'tam_Taml', 'tel_Telu', 'urd_Arab']

Devanagari languages found in BPCC: ['brx_Deva', 'doi_Deva', 'gom_Deva', 'hin_Deva', 'mai_Deva', 'mar_Deva', 'npi_Deva', 'san_Deva', 'snd_Deva']


Extract all Devanagari records from BPCC

In [None]:
from datasets import concatenate_datasets

deva_datasets = []

for split_name, dataset in ds_dict.items():
    # keep only examples where either side is Devanagari
    filtered = dataset.filter(
        lambda ex: ex["src_lang"] in deva_langs or ex["tgt_lang"] in deva_langs
    )
    if len(filtered) > 0:
        print(f"{split_name}: {len(filtered)} Devanagari examples")
        deva_datasets.append(filtered)

# Combine all Devanagari-containing examples into one big dataset
if deva_datasets:
    bpcc_deva = concatenate_datasets(deva_datasets)
    print("Combined Devanagari dataset size:", len(bpcc_deva))
else:
    bpcc_deva = None
    print("No Devanagari examples found (this should NOT happen for bpcc-seed-v1).")


brx_Deva: 22712 Devanagari examples
doi_Deva: 18657 Devanagari examples
gom_Deva: 18284 Devanagari examples
hin_Deva: 40270 Devanagari examples
mai_Deva: 24410 Devanagari examples
mar_Deva: 54348 Devanagari examples
npi_Deva: 45866 Devanagari examples
san_Deva: 27744 Devanagari examples
snd_Deva: 10502 Devanagari examples

Combined Devanagari dataset size: 262793
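In this config every split seems to hold a single target language, so the row-level filter keeps each `_Deva` split whole and drops the other splits entirely. A quick sanity check, with the row counts copied from the output above, shows that the combined size is just the sum of the nine split sizes:

```python
# Row counts copied from the filter output above (bpcc-seed-v1)
deva_counts = {
    "brx_Deva": 22712,
    "doi_Deva": 18657,
    "gom_Deva": 18284,
    "hin_Deva": 40270,
    "mai_Deva": 24410,
    "mar_Deva": 54348,
    "npi_Deva": 45866,
    "san_Deva": 27744,
    "snd_Deva": 10502,
}

total = sum(deva_counts.values())
print("Sum of Devanagari split sizes:", total)
```

If the sum and the combined dataset size ever disagree for another config, that would suggest a split mixes several languages and the row-level filter is doing real work.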


Normalize English–Devanagari pairs around the English pivot

In [None]:
from datasets import concatenate_datasets

pivot_lang = "eng_Latn"   # English language identifier in BPCC

def keep_en_deva(example):
    """Keep rows where one side is English and the other is a Devanagari language."""
    return (
        (example["src_lang"] == pivot_lang and example["tgt_lang"] in deva_langs) or
        (example["tgt_lang"] == pivot_lang and example["src_lang"] in deva_langs)
    )

en_deva_parts = []

for split_name, dataset in ds_dict.items():
    # 1) Filter to English–Devanagari rows
    filtered = dataset.filter(keep_en_deva)
    if len(filtered) == 0:
        continue

    print(f"{split_name}: {len(filtered)} English–Devanagari rows")

    # 2) Identify which columns hold the actual text
    cols = filtered.column_names
    lang_cols = ["src_lang", "tgt_lang"]
    meta_cols = set(lang_cols + ["score", "id"])
    text_cols = [c for c in cols if c not in meta_cols]

    if len(text_cols) != 2:
        raise ValueError(
            f"Expected 2 text columns, got {len(text_cols)} ({text_cols}) in {split_name}"
        )

    src_text_col, tgt_text_col = text_cols

    def normalize_en_deva(example):
        src_lang = example["src_lang"]
        tgt_lang = example["tgt_lang"]
        src_text = example[src_text_col]
        tgt_text = example[tgt_text_col]

        if src_lang == pivot_lang and tgt_lang in deva_langs:
            return {
                "eng": src_text,
                "deva_lang": tgt_lang,
                "deva_text": tgt_text,
            }
        elif tgt_lang == pivot_lang and src_lang in deva_langs:
            return {
                "eng": tgt_text,
                "deva_lang": src_lang,
                "deva_text": src_text,
            }
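        else:
            # Unreachable after keep_en_deva. Fail loudly instead of returning None:
            # map() never drops rows, so None would not filter anything out.
            raise ValueError(f"Unexpected language pair: {src_lang} -> {tgt_lang}")

    normalized = filtered.map(
        normalize_en_deva,
        remove_columns=filtered.column_names,
    )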
    en_deva_parts.append(normalized)

# 3) Combine all splits
if en_deva_parts:
    en_deva_norm = concatenate_datasets(en_deva_parts)
    print("\nFinal English–Devanagari dataset created!")
    print("Total rows:", len(en_deva_norm))
    print("Columns:", en_deva_norm.column_names)
    print("Example row:", en_deva_norm[0])
else:
    en_deva_norm = None
    print("No English–Devanagari rows found.")


brx_Deva: 22712 English–Devanagari rows
doi_Deva: 18657 English–Devanagari rows
gom_Deva: 18284 English–Devanagari rows
hin_Deva: 40270 English–Devanagari rows
mai_Deva: 24410 English–Devanagari rows
mar_Deva: 54348 English–Devanagari rows
npi_Deva: 45866 English–Devanagari rows
san_Deva: 27744 English–Devanagari rows
snd_Deva: 10502 English–Devanagari rows


Final English–Devanagari dataset created!
Total rows: 262793
Columns: ['eng', 'deva_lang', 'deva_text']
Example row: {'eng': 'In Punjab, the Child Rights Cell of the Department of Social Welfare Punjab collaborates with UNICEF to celebrate this day.', 'deva_lang': 'brx_Deva', 'deva_text': 'पंजाब में समाज कल्याण विभाग के बाल अधिकार प्रकोष्ठ ने यूनिसेफ के साथ मिलकर यह दिवस मनाया।'}
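The normalization step puts English in a fixed column regardless of which side of the original row it was on. Below is a small sketch of that idea on plain dictionaries, independent of the `datasets` library. The two rows are made-up illustrations rather than BPCC data, and the names `PIVOT`, `DEVA`, and `normalize_row` are assumptions introduced for this example.

```python
PIVOT = "eng_Latn"
DEVA = {"hin_Deva", "mar_Deva"}  # small illustrative subset of the Devanagari codes


def normalize_row(row):
    """Return {'eng', 'deva_lang', 'deva_text'} whichever side holds English."""
    if row["src_lang"] == PIVOT and row["tgt_lang"] in DEVA:
        return {"eng": row["src"], "deva_lang": row["tgt_lang"], "deva_text": row["tgt"]}
    if row["tgt_lang"] == PIVOT and row["src_lang"] in DEVA:
        return {"eng": row["tgt"], "deva_lang": row["src_lang"], "deva_text": row["src"]}
    raise ValueError("not an English-Devanagari row")


# Made-up rows: the same sentence pair stored in both directions
forward = {"src_lang": "eng_Latn", "tgt_lang": "hin_Deva", "src": "Hello", "tgt": "नमस्ते"}
reverse = {"src_lang": "hin_Deva", "tgt_lang": "eng_Latn", "src": "नमस्ते", "tgt": "Hello"}

# Both directions collapse to the same normalized record
print(normalize_row(forward) == normalize_row(reverse))
```

In bpcc-seed-v1 the source side is always `eng_Latn` according to the output earlier in the notebook, so only the first branch is exercised there; the second branch matters only for data stored in the reverse direction.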


Create Devanagari ↔ Devanagari pairs (Hindi–Marathi, etc.)

In [None]:
import pandas as pd
from itertools import permutations
from datasets import Dataset

# Convert to pandas for grouping
df = en_deva_norm.to_pandas()
print("Rows in en_deva_norm:", len(df))
print(df.head())

rows = []

# Group by English sentence (pivot)
for eng_sentence, group in df.groupby("eng"):
    if len(group) < 2:
        # We need at least 2 Devanagari translations to form a pair
        continue

    records = group.to_dict("records")

    # For every ordered pair (src, tgt) of Devanagari translations
    for src_rec, tgt_rec in permutations(records, 2):
        rows.append({
            "src_lang": src_rec["deva_lang"],
            "tgt_lang": tgt_rec["deva_lang"],
            "src_text": src_rec["deva_text"],
            "tgt_text": tgt_rec["deva_text"],
            "pivot_eng": eng_sentence,
            "lang_pair": f'{src_rec["deva_lang"]}-{tgt_rec["deva_lang"]}',
        })

print("Total Devanagari↔Devanagari examples created:", len(rows))

# Convert back to a HF dataset
indic_indic_deva = Dataset.from_pandas(pd.DataFrame(rows))
print(indic_indic_deva)
print("Example pair:", indic_indic_deva[0])


Rows in en_deva_norm: 262793
                                                 eng deva_lang  \
0  In Punjab, the Child Rights Cell of the Depart...  brx_Deva   
1  In recent years this style of wrestling has al...  brx_Deva   
2  The different jathis are tisra, chathusra, kha...  brx_Deva   
3  The action for counting includes a tap or clap...  brx_Deva   
4  Its action includes a tap or clap, followed by...  brx_Deva   

                                           deva_text  
0  पंजाब में समाज कल्याण विभाग के बाल अधिकार प्रक...  
1  बावैसो बोसोरफोराव खमलायनायनि बे रोखोमा गुबुन ह...  
2  बायदिरोखोमनि जाठिआव तिसरा, चथुसरा, खंडा, मिश्र...  
3  सान्नाय खामानियाव मोनसे टेप एबा आखाइ खबनाय थाय...  
4  बेखौ खालामनायाव मोनसे टेप एबा आखाइ खबनाय थायो,...  
Total Devanagari↔Devanagari examples created: 1465316
Dataset({
    features: ['src_lang', 'tgt_lang', 'src_text', 'tgt_text', 'pivot_eng', 'lang_pair'],
    num_rows: 1465316
})
Example pair: {'src_lang': 'brx_Deva', 'tgt_lang': 'gom_Deva', 's
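The pairing logic can be checked on a toy example: an English sentence with n Devanagari translations yields n × (n − 1) ordered pairs, and a sentence with a single translation yields none. The sketch below uses only the standard library and made-up rows (not BPCC data). It also skips pairs whose two sides share a language, which is an optional variation and not what the cell above does.

```python
from collections import defaultdict
from itertools import permutations

# Made-up rows in the normalized (eng, deva_lang, deva_text) layout
rows = [
    {"eng": "Good morning", "deva_lang": "hin_Deva", "deva_text": "सुप्रभात"},
    {"eng": "Good morning", "deva_lang": "mar_Deva", "deva_text": "शुभ सकाळ"},
    {"eng": "Good morning", "deva_lang": "npi_Deva", "deva_text": "शुभ प्रभात"},
    {"eng": "Thank you", "deva_lang": "hin_Deva", "deva_text": "धन्यवाद"},
]

# Group translations by their English pivot sentence
by_pivot = defaultdict(list)
for row in rows:
    by_pivot[row["eng"]].append(row)

pairs = []
for eng, group in by_pivot.items():
    # permutations(group, 2) yields every ordered (src, tgt) pair within the group
    for src, tgt in permutations(group, 2):
        if src["deva_lang"] == tgt["deva_lang"]:
            continue  # optional variation: drop same-language pairs
        pairs.append((src["deva_lang"], tgt["deva_lang"], eng))

print(len(pairs))
```

Here "Good morning" has three translations and contributes 3 × 2 = 6 ordered pairs, while "Thank you" contributes none. Whether same-language pairs should be kept depends on how duplicate English sentences within one split are meant to be treated.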

Save the final dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

output_csv = "/content/drive/MyDrive/bpcc_devanagari_indic_indic_pairs.csv"
indic_indic_deva.to_csv(output_csv, index=False)
print("Saved CSV to:", output_csv)

# Also save as Parquet (more compact than CSV)
output_parquet = "/content/drive/MyDrive/bpcc_devanagari_indic_indic_pairs.parquet"
indic_indic_deva.to_parquet(output_parquet)
print("Saved Parquet to:", output_parquet)


Mounted at /content/drive


Saved CSV to: /content/drive/MyDrive/bpcc_devanagari_indic_indic_pairs.csv


Saved Parquet to: /content/drive/MyDrive/bpcc_devanagari_indic_indic_pairs.parquet
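When the CSV is later read by other tools, it helps to confirm that Devanagari text and embedded commas survive a round trip. The sketch below illustrates such a check with Python's built-in `csv` module on one made-up row written to a temporary directory. It does not use the `datasets` writer or the Drive paths above, and the file name `pairs_sample.csv` is hypothetical.

```python
import csv
import os
import tempfile

fieldnames = ["src_lang", "tgt_lang", "src_text", "tgt_text"]

# One made-up row containing commas and Devanagari characters
sample = [
    {
        "src_lang": "hin_Deva",
        "tgt_lang": "mar_Deva",
        "src_text": "नमस्ते, दुनिया",
        "tgt_text": "नमस्कार, जग",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pairs_sample.csv")

    # Write with explicit UTF-8; newline="" lets the csv module manage line endings
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sample)

    # Read back with the same encoding and compare against the original rows
    with open(path, encoding="utf-8", newline="") as f:
        restored = list(csv.DictReader(f))

print(restored == sample)
```

Reading the file with a different default encoding is a common cause of garbled Devanagari, so passing `encoding="utf-8"` explicitly is the safer habit.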
