In [1]:
import pandas as pd
import pyarrow.parquet as pq

## Load the Wikipedia Dataset and view contents
Note: The original wiki_en_data.parquet is a ~10 GB file. A sample Parquet file with 1000 entries is provided for demo purposes.

In [10]:
df = pd.read_parquet("sample_data/sample_wiki_en_data.parquet")

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      1000 non-null   object
 1   body       1000 non-null   object
 2   source     1000 non-null   object
 3   url        1000 non-null   object
 4   langCode   1000 non-null   object
 5   timestamp  1000 non-null   object
dtypes: object(6)
memory usage: 47.0+ KB


In [12]:
df.head()

|   | title | body | source | url | langCode | timestamp |
|---|-------|------|--------|-----|----------|-----------|
| 0 | Boycott (album) | \nBoycott (album)\n\n\n\n | wiki | https://en.wikipedia.org/wiki?curid=63405427 | en | 24/07/23 15:06 |
| 1 | Javier Gallo | \nJavier Gallo\n\nJavier Gallo González (born ... | wiki | https://en.wikipedia.org/wiki?curid=31637473 | en | 24/07/23 14:49 |
| 2 | R. M. Tristram | \nR. M. Tristram\n\n\n\n | wiki | https://en.wikipedia.org/wiki?curid=56333678 | en | 24/07/23 15:02 |
| 3 | T. J. Carter (defensive back) | \nT. J. Carter (defensive back)\n\nT. J. Carte... | wiki | https://en.wikipedia.org/wiki?curid=72605589 | en | 24/07/23 15:11 |
| 4 | Wrestling at the 2015 Pan American Games - Men... | \nWrestling at the 2015 Pan American Games - M... | wiki | https://en.wikipedia.org/wiki?curid=49831530 | en | 24/07/23 14:59 |
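
The `timestamp` column is stored as plain strings (dtype `object`). Below is a minimal sketch of parsing one value with the standard library. It assumes the format is day/month/two-digit year followed by 24-hour time; that ordering is inferred from the sample rows and is not documented by the dataset. `parse_timestamp` is a hypothetical helper.

```python
from datetime import datetime

# Assumed format: DD/MM/YY HH:MM (an assumption inferred from the sample rows).
ASSUMED_TIMESTAMP_FORMAT = "%d/%m/%y %H:%M"


def parse_timestamp(raw: str) -> datetime:
    """Parse a timestamp string such as '24/07/23 15:06'."""
    return datetime.strptime(raw, ASSUMED_TIMESTAMP_FORMAT)


parsed = parse_timestamp("24/07/23 15:06")
print(parsed.isoformat())  # 2023-07-24T15:06:00
```

If the real ordering turns out to be year/month/day, only the format string needs to change.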


## First, perform the templating stage on the dataset

Note: We pass `--split "train[:100]"` so that only the first 100 documents are processed, since the full dataset is about 9.5 GB.

In [118]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/perform_templating.py \
    --glob_path "/home/$USER/setu-translate/examples/sample_data/sample_wiki_en_data.parquet" \
    --cache_dir_for_original_data "/home/$USER/setu-translate/examples/cache" \
    --base_save_path "/home/$USER/setu-translate/examples/output/wiki_en/doc_csvs" \
    --save_path "/home/$USER/setu-translate/examples/output/wiki_en/templated" \
    --text_col body \
    --url_col url \
    --timestamp_col timestamp \
    --source_type wiki_en \
    --translation_type sentence \
    --use_cache False \
    --split "train[:100]"

Setting num_proc from 64 back to 1 for the train split to disable multiprocessing as it only contains one shard.
Generating train split: 16266212 examples [00:59, 273449.17 examples/s]
Loaded Dataset from path - /home/shanks/setu-translate/examples/sample_data/wiki_en_data.parquet
Map (num_proc=64): 100%|██████████████| 100/100 [00:00<00:00, 181.82 examples/s]
Performed `templating`
Filter (num_proc=64): 100%|███████████| 100/100 [00:00<00:00, 273.45 examples/s]
Filtered `null` text docs
Map (num_proc=64): 100%|██████████████| 100/100 [00:00<00:00, 160.18 examples/s]
Saving the dataset (64/64 shards): 100%|█| 100/100 [00:00<00:00, 139.40 examples
Saved `templated` dataset to /home/shanks/setu-translate/examples/output/wiki_en/templated
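
The command above spells out every path in full. As a rough sketch, the same argument list could be derived from a single repository root so that only one location has to change per machine. `build_templating_cmd` and `repo_root` are hypothetical names; the flags and values are the ones used in the cell above.

```python
from pathlib import Path


def build_templating_cmd(repo_root: Path, split: str = "train[:100]") -> list[str]:
    """Assemble the perform_templating.py invocation from one repository root."""
    examples = repo_root / "examples"
    out = examples / "output" / "wiki_en"
    return [
        "python", (repo_root / "stages" / "perform_templating.py").as_posix(),
        "--glob_path", (examples / "sample_data" / "sample_wiki_en_data.parquet").as_posix(),
        "--cache_dir_for_original_data", (examples / "cache").as_posix(),
        "--base_save_path", (out / "doc_csvs").as_posix(),
        "--save_path", (out / "templated").as_posix(),
        "--text_col", "body",
        "--url_col", "url",
        "--timestamp_col", "timestamp",
        "--source_type", "wiki_en",
        "--translation_type", "sentence",
        "--use_cache", "False",
        "--split", split,
    ]


# Assumes the repository is checked out under the current user's home directory.
cmd = build_templating_cmd(Path.home() / "setu-translate")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run`, with `HF_DATASETS_CACHE` set through its `env` argument.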


In [119]:
from datasets import Dataset

ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/templated/data-00000-of-00064.arrow")

#### View the templated dataset output

In [120]:
ds[0]

{'source': 'en_wikipedia',
 'url': 'https://en.wikipedia.org/wiki?curid=63405427',
 'timestamp': '24/07/23 15:06',
 'doc_id': '7299c62f59ec33baca764b3cc9f6aa529a64ca0e784d963b48649db5500c0b96',
 'text': '\nboycott (album)\n\n\n\n',
 'sub_strs': '["boycott (album)"]',
 'sids': '["545490bdf181f0d46f6c7bf3a1d2ee08d8266c11fd40c96d3dc6f2238387fffc"]',
 'tlt_folder': '/home/shanks/setu-translate/examples/output/wiki_en/doc_csvs/923682bea6d517dc178d480c88e129e485ed902f4fa024866666658cd4ea6836/7299c62f59ec33baca764b3cc9f6aa529a64ca0e784d963b48649db5500c0b96'}
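
In the record above, the text is lowercased, `sub_strs` and `sids` are JSON-encoded lists stored as strings, and the ids are 64-character hex strings, which is consistent with SHA-256 digests. The sketch below shows how such a record could be produced. It assumes each id is the SHA-256 of the corresponding text and that every non-empty line is one sub-string; the real inputs to the hash and the real sentence splitter in `perform_templating.py` are not shown here, so the digests will likely differ from those above. `template_document` is a hypothetical helper.

```python
import hashlib
import json


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def template_document(text: str) -> dict:
    """Build a templated record: lowercased text plus JSON-encoded sub-strings and ids."""
    lowered = text.lower()
    # Simplification: treat each non-empty line as one sub-string.
    sub_strs = [line.strip() for line in lowered.split("\n") if line.strip()]
    return {
        "doc_id": sha256_hex(lowered),  # assumption: id derived from the document text
        "text": lowered,
        "sub_strs": json.dumps(sub_strs),
        "sids": json.dumps([sha256_hex(s) for s in sub_strs]),
    }


record = template_document("\nBoycott (album)\n\n\n\n")
print(record["sub_strs"])  # ["boycott (album)"]
```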

In [121]:
ds[0:10]['text']

['\nboycott (album)\n\n\n\n',
 '\njavier gallo\n\njavier gallo gonzález (born 6 august 1983) is a mexican professional boxer.\nprofessional career.\nin may 2011, gallo lost a majority decision to former world champion rodel mayol on showtime\'s televised portion of the pacquiao vs. mosley undercard.\non september 9, 2011 at the "war at woodland hills 5", gallo won with a tko over jason rorie.\n\n']

## Create the Global Sentence Level Dataset

In [122]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/create_global_ds.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --global_sent_ds_path "/home/$USER/setu-translate/examples/output/wiki_en/sentences"

Resolving data files: 100%|█████████████████| 64/64 [00:00<00:00, 418123.76it/s]
Generating train split: 100 examples [00:00, 164.45 examples/s]
Loading dataset shards: 100%|█████████████████| 64/64 [00:00<00:00, 7853.12it/s]
Map (num_proc=64): 100%|██████████████| 100/100 [00:00<00:00, 216.38 examples/s]
Map (num_proc=64): 100%|█████████████| 612/612 [00:00<00:00, 1367.55 examples/s]
Saving the dataset (64/64 shards): 100%|█| 612/612 [00:00<00:00, 1160.19 example


In [123]:
ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/sentences/data-00000-of-00064.arrow")

In [124]:
ds[0:10]["sub_strs"]

['boycott (album)',
 'javier gallo',
 'javier gallo gonzález (born 6 august 1983) is a mexican professional boxer.',
 'professional career.',
 "in may 2011, gallo lost a majority decision to former world champion rodel mayol on showtime's televised portion of the pacquiao vs.",
 'mosley undercard.',
 'on september 9, 2011 at the "war at woodland hills 5", gallo won with a tko over jason rorie.',
 'r.',
 'm.',
 'tristram']
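
The global dataset holds one row per sentence, which is why the 100 templated documents become 612 rows in the log above. A minimal sketch of that flattening step follows, assuming each document stores `sub_strs` and `sids` as JSON-encoded lists as in the templated output. `flatten_documents` and the short placeholder ids are hypothetical.

```python
import json


def flatten_documents(docs: list[dict]) -> list[dict]:
    """Turn one row per document into one row per sentence."""
    rows = []
    for doc in docs:
        sub_strs = json.loads(doc["sub_strs"])
        sids = json.loads(doc["sids"])
        for sid, sub_str in zip(sids, sub_strs):
            rows.append({"doc_id": doc["doc_id"], "sids": sid, "sub_strs": sub_str})
    return rows


# Placeholder ids; the real ones are 64-character hex strings.
docs = [
    {"doc_id": "doc-a", "sub_strs": '["boycott (album)"]', "sids": '["s1"]'},
    {"doc_id": "doc-b", "sub_strs": '["javier gallo", "professional career."]', "sids": '["s2", "s3"]'},
]
rows = flatten_documents(docs)
print(len(rows))  # 3
```

Note that in the output above, "pacquiao vs." and "r. m. tristram" are split at abbreviation periods, so the sentence boundaries are approximate.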

## Now Binarize the Sentence Level Dataset

In [125]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/binarize.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --binarized_dir "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences" \
    --batch_size 2048 \
    --total_procs 1 \
    --padding max_length \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva \
    --return_format pt

Resolving data files: 100%|█████████████████| 64/64 [00:00<00:00, 418123.76it/s]
Generating train split: 612 examples [00:00, 32582.49 examples/s]
Loaded Dataset....
Map: 100%|███████████████████████████| 612/612 [00:00<00:00, 1738.87 examples/s]
Saving the dataset (1/1 shards): 100%|█| 612/612 [00:00<00:00, 128313.62 example


In [126]:
ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/binarized_sentences/data-00000-of-00001.arrow")

In [127]:
ds[0:10]["sub_strs"]

['norwegian holocaust center',
 "azerbaijan's construction in areas gained in the 2020 nagorno-karabakh war",
 'following the 2020 nagorno-karabakh war and in accordance with 2020 nagorno-karabakh ceasefire agreement, republic of azerbaijan re-established authority on the part of the territories, previously de facto controlled by the breakaway republic of artsakh, which allowed azerbaijan to begin construction projects and rehabilitation in areas of the karabakh, many of which had been practically leveled since azerbaijan lost control of them in the 1990s.',
 'post-conflict condition and reconstruction.',
 'azerbaijan recovered many of its territories during and after the 2020 nagorno-karabakh war, which culminated by the ceasefire deal on 9 november 2020. the ceasefire allowed rehabilitation to begin in places where azerbaijan re-established authority, many of which had been practically leveled since azerbaijan lost control of them in the 1990s.',
 'government-sponsored sources presen

In [128]:
for input_id in ds[0:10]["input_ids"]:
    print(str(input_id))

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 15, 2563, 27445, 1307, 14660, 26766, 2150, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
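
The `input_ids` printed above are padded on the left with token id 1 and end with token id 2, which matches the `--padding max_length` option passed to the binarize stage. Here is a small sketch of left-padding to a fixed length. The pad id is read off the output above; `left_pad` is a hypothetical helper and `max_length=16` is only an illustrative value.

```python
def left_pad(token_ids: list[int], max_length: int, pad_id: int = 1) -> tuple[list[int], list[int]]:
    """Left-pad token ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids
    attention_mask = [0] * n_pad + [1] * len(token_ids)
    return input_ids, attention_mask


# Non-padding tokens copied from the first row printed above.
tokens = [4, 15, 2563, 27445, 1307, 14660, 26766, 2150, 2]
input_ids, attention_mask = left_pad(tokens, max_length=16)
print(input_ids)
# [1, 1, 1, 1, 1, 1, 1, 4, 15, 2563, 27445, 1307, 14660, 26766, 2150, 2]
```

Padding every row to the same length lets a whole batch be stacked into one tensor, at the cost of storing many pad tokens for short sentences.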

## Now perform translation on the binarized dataset

In [129]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/tlt_pipelines/translate_joblib.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/binarized_sentences/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --base_save_dir "/home/$USER/setu-translate/examples/output/wiki_en/model_out" \
    --joblib_temp_folder "/home/$USER/setu-translate/tmp" \
    --batch_size 64 \
    --total_procs 1 \
    --devices "0"

Generating train split: 612 examples [00:00, 322638.77 examples/s]
100%|████████████████████████████| 10/10 [00:28<00:00,  2.82s/ba: 64 samples/ba]
Saving the dataset (1/1 shards): 100%|█| 612/612 [00:00<00:00, 94064.06 examples
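
The progress bar reports 10 batches for 612 sentences at `--batch_size 64`. The arithmetic behind that count can be checked with a short sketch; `batch_sizes` is a hypothetical helper.

```python
import math


def batch_sizes(n_examples: int, batch_size: int) -> list[int]:
    """Return the size of each batch when n_examples are split into fixed-size batches."""
    n_batches = math.ceil(n_examples / batch_size)
    return [min(batch_size, n_examples - i * batch_size) for i in range(n_batches)]


sizes = batch_sizes(612, 64)
print(len(sizes), sizes[-1])  # 10 36
```

Nine full batches cover 576 sentences, and the tenth holds the remaining 36.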


In [130]:
ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/model_out/rank_0-device_cuda:0/data-00000-of-00001.arrow")

In [131]:
for input_id in ds[0:10]["translated_input_ids"]:
    print(str(input_id))

[2, 43257, 31, 53526, 606, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 1064, 9, 6953, 3669, 184, 13, 515, 47359, 392, 979, 10, 907, 3736, 10, 213, 12707, 12405, 12221, 31, 797, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[2, 1064, 9, 6953, 3669, 184, 13, 515, 47359, 392, 979, 9, 144, 17, 1064, 9, 6953, 3669, 184, 13, 515, 47359, 392, 4152, 15761, 10878, 9, 937, 5, 213, 12707, 12405, 12221, 19695, 26, 3736, 12, 858, 20, 822, 23, 503, 20, 3134, 78, 5, 85, 395, 41982, 655, 318, 6249, 964,
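
Unlike the inputs, the `translated_input_ids` above are padded on the right with token id 1, and each row starts and ends with token id 2. The sketch below trims those ids before decoding. It assumes 1 is the pad id and 2 marks the sequence boundaries; both are inferred from the printed rows rather than from the tokenizer configuration. `strip_special` is a hypothetical helper.

```python
def strip_special(token_ids: list[int], pad_id: int = 1, boundary_id: int = 2) -> list[int]:
    """Drop padding and boundary tokens, keeping only content tokens."""
    return [t for t in token_ids if t not in (pad_id, boundary_id)]


# Start of the first translated row printed above, with shortened padding.
row = [2, 43257, 31, 53526, 606, 2, 1, 1, 1]
print(strip_special(row))  # [43257, 31, 53526, 606]
```

With a Hugging Face tokenizer, passing `skip_special_tokens=True` to `decode` or `batch_decode` has a similar effect; whether `decode.py` does so is not shown here.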

## Now let's decode the translated inputs

In [132]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/decode.py \
    --root_dir "/home/$USER/setu-translate" \
    --data_files "/home/$USER/setu-translate/examples/output/wiki_en/model_out/*/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --decode_dir "/home/$USER/setu-translate/examples/output/wiki_en/decode" \
    --batch_size 64 \
    --total_procs 1 \
    --src_lang eng_Latn \
    --tgt_lang hin_Deva 

Generating train split: 612 examples [00:00, 181433.00 examples/s]
Loaded Dataset....
Map: 100%|████████████████████████████| 612/612 [00:01<00:00, 456.12 examples/s]
Map: 100%|███████████████████████████| 612/612 [00:00<00:00, 6849.01 examples/s]
Saving the dataset (1/1 shards): 100%|█| 612/612 [00:00<00:00, 112087.42 example


In [133]:
ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/decode/data-00000-of-00001.arrow")

In [134]:
ds[0:10]["translated"]

['नॉर्वे का नरसंहार केंद्र',
 '2020 के नागोर्नो-काराबाख युद्ध में प्राप्त क्षेत्रों में अज़रबैजान का निर्माण',
 '2020 के नागोर्नो-काराबाख युद्ध के बाद और 2020 के नागोर्नो-काराबाख संघर्ष विराम समझौते के अनुसार, अज़रबैजान गणराज्य ने क्षेत्रों की ओर से अधिकार को फिर से स्थापित किया, जो पहले वस्तुतः अलग हुए आर्टसाख गणराज्य द्वारा नियंत्रित था, जिसने अज़रबैजान को काराबाख के क्षेत्रों में निर्माण परियोजनाएं और पुनर्वास शुरू करने की अनुमति दी, जिनमें से कई को 1990 के दशक में अज़रबैजान द्वारा उन पर नियंत्रण खोने के बाद से व्यावहारिक रूप से समतल कर दिया गया था।',
 'संघर्ष के बाद की स्थिति और पुनर्निर्माण।',
 'अज़रबैजान ने 2020 के नागोर्नो-काराबाख युद्ध के दौरान और उसके बाद अपने कई क्षेत्रों को फिर से हासिल कर लिया, जो 9 नवंबर 2020 को संघर्ष विराम समझौते के साथ समाप्त हुआ. संघर्ष विराम ने उन स्थानों पर पुनर्वास शुरू करने की अनुमति दी जहां अज़रबैजान ने अधिकार फिर से स्थापित किया, जिनमें से कई को व्यावहारिक रूप से समतल कर दिया गया था क्योंकि 1990 के दशक में अज़रबैजान ने उन पर नियंत्रण खो दिया था।'

## Now replace the text with translations in the original templated dataset

In [135]:
!HF_DATASETS_CACHE=/home/$USER/tmp python /home/$USER/setu-translate/stages/replace.py \
    --paths_data "/home/$USER/setu-translate/examples/output/wiki_en/templated/*.arrow" \
    --cache_dir "/home/$USER/setu-translate/examples/cache" \
    --batch_size 64 \
    --num_procs 1 \
    --translated_save_path "/home/$USER/setu-translate/examples/output/wiki_en/translated"

Resolving data files: 100%|█████████████████| 64/64 [00:00<00:00, 325376.31it/s]
Generating train split: 100 examples [00:00, 5610.96 examples/s]
Map: 100%|████████████████████████████████| 4/4 [00:00<00:00, 419.78 examples/s]
Saving the dataset (1/1 shards): 100%|████| 4/4 [00:00<00:00, 754.44 examples/s]


In [136]:
ds = Dataset.from_file("/home/shanks/setu-translate/examples/output/wiki_en/translated/data-00000-of-00001.arrow")

In [137]:
ds[0:10]["translated"]

['\nबोन मैसन एयरड्रोम\n\n\n\n',
 '\nजेसन स्पेंसर\n\n\n\n',
 '\nसैमुएल टेफेरा\n\nसैमुएल टेफेरा (born 23 october 1999) is an ethiopian middle distance runner who specialises in the 1500 metres. 18 साल की उम्र में, वह 2018 के विश्व इनडोर चैंपियन बने, और 2022 की विश्व इनडोर चैंपियनशिप में अपने खिताब का बचाव करते हुए इस प्रक्रिया में चैंपियनशिप रिकॉर्ड स्थापित किया। टेफेरा 1500 मीटर के लिए अफ्रीकी इनडोर रिकॉर्ड धारक है।\nउन्होंने तीन साल तक विश्व इनडोर 1500 मीटर का रिकॉर्ड बनाया, जिसमें उनका निशान वर्तमान में संबंधित विश्व सर्वकालिक सूची में दूसरा सबसे तेज है।\nकैरियर।\nसैमुएल टेफेरा made her first major appearance in 2017, when the then-17-year-old represented ethiopia in the 1500 metres at the world championships in london, not advancing from the heats.\nमार्च 2018 में, उन्होंने बर्मिंघम में आयोजित विश्व इनडोर चैंपियनशिप में स्वर्ण पदक जीता। उन्होंने मार्सिन लेवांडोव्स्की (3:58.39) और अब्दालाती इगुइडर (3:58.43) को हराने के लिए 3:58.19 हासिल किया।\nफरवरी 2019 में, अभी भी 19, टेफेरा ने तीन 
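
The replace stage substitutes each source sub-string in the templated text with its translation. The mixed English and Hindi text in the third document above is what one would expect if a short sub-string, such as the title, is replaced before a longer sentence that contains it, because the longer sentence then no longer matches. Whether `replace.py` works this way is not shown here. The sketch below only illustrates the effect and a longest-first ordering that avoids it; `replace_sub_strs`, the sample text, and the `<T1>`/`<T2>` placeholders are all made up for illustration.

```python
def replace_sub_strs(text: str, translations: dict[str, str], longest_first: bool = True) -> str:
    """Replace each source sub-string in text with its translation."""
    if longest_first:
        keys = sorted(translations, key=len, reverse=True)
    else:
        keys = list(translations)  # dictionary insertion order
    for src in keys:
        text = text.replace(src, translations[src])
    return text


text = "\nsamuel tefera\n\nsamuel tefera is a runner.\n"
translations = {
    "samuel tefera": "<T1>",               # placeholder for the translated title
    "samuel tefera is a runner.": "<T2>",  # placeholder for the translated sentence
}

print(repr(replace_sub_strs(text, translations, longest_first=False)))
# '\n<T1>\n\n<T1> is a runner.\n'
print(repr(replace_sub_strs(text, translations)))
# '\n<T1>\n\n<T2>\n'
```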