This notebook contains:

1. A simple usage example of the IndicTrans model using the HuggingFace library
2. An introduction to the libraries and datasets mentioned in the presentation

In [1]:
# To learn more about the packages:
# transformers - https://huggingface.co/docs/transformers/index
# datasets - https://huggingface.co/docs/datasets/index
# sacremoses - https://pypi.org/project/sacremoses/
# sentencepiece - https://github.com/google/sentencepiece/
# indic-nlp-library - https://anoopkunchukuttan.github.io/indic_nlp_library/

!pip install transformers datasets
!pip install sacremoses indic-nlp-library sentencepiece
!pip install mosestokenizer

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected pa

### IndicTrans2 translation - inference

#### from source

In [None]:
!git clone https://github.com/AI4Bharat/IndicTrans2
# %cd IndicTrans2

In [None]:
# Install fairseq from source
!git clone https://github.com/pytorch/fairseq.git
%cd fairseq
!pip install ./
%cd ..

In [None]:
# download IndicTrans2 model
# downloading the indic-en model
!wget https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/indic-en-preprint.zip
!unzip indic-en-preprint.zip -d ./models

# downloading the en-indic model
# !wget https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip
# !unzip en-indic-preprint.zip
%cd IndicTrans2

In [None]:
from inference.engine import Model

indic2en_model = Model("/content/models/indic-en-preprint/fairseq_model", model_type="fairseq")

Initializing sentencepiece model for SRC and TGT
Initializing model for translation


In [None]:
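# NOTE: indic2en_model was loaded from the indic-en checkpoint, but this call asks for
# eng_Latn -> kan_Knda. English-to-Indic translation needs the en-indic checkpoint
# (its download is commented out above).
trans_text = indic2en_model.translate_paragraph("""Here's a story. Long long ago there lived a lion king in the jungle.
He had a friend by name Amar who was a cunning fox.""", "eng_Latn", "kan_Knda")
print(trans_text)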

In [None]:
# ta_sents = ['இதோ ஒரு கதை. நீண்ட காலத்திற்கு முன்பு காட்டில் ஒரு சிங்க மன்னர் வசித்து வந்தார். ',
#             'அவருக்கு அமர் என்ற ஒரு நண்பர் இருந்தார்',
#             'அவர் ஒரு தந்திரமான நரி']

# indic2en_model.batch_translate(ta_sents, 'tam_Taml', 'eng_Latn')

For inference with the original IndicTrans model, see https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb

#### from HuggingFace

In [2]:
# Using the HuggingFace library for the translation task. For more info: https://huggingface.co/docs/transformers/tasks/translation#inference
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [3]:
text = "Lets translate from English to all Indian languages!"

In [5]:
# method1 : using the hf model & encode-decode functions

input_ids = tokenizer.encode(text, return_tensors="pt", padding=True)
outputs = model.generate(input_ids)
decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

decoded_text

'अंग्रेज़ी से सभी भारतीय भाषाओं में अनुवाद करते हैं!'

In [6]:
# method2 : using hf pipeline and passing model & tokenizer as arguments

translator = pipeline("translation", model=model, tokenizer=tokenizer)
translator(text)

[{'translation_text': 'अंग्रेज़ी से सभी भारतीय भाषाओं में अनुवाद करते हैं!'}]

In [7]:
# example of an idiom: note that the model translates it word for word

text = "Break a leg on your performance tonight"
translator = pipeline("translation", model=model, tokenizer=tokenizer)
translator(text)

[{'translation_text': 'आज रात अपने प्रदर्शन पर एक पैर तोड़'}]

In [8]:
text = "The ambitious project eventually bit the dust"
translator = pipeline("translation", model=model, tokenizer=tokenizer)
translator(text)

[{'translation_text': 'हरगिज़ नहीं ।'}]

### indic-trans transliteration - inference

In [None]:
# clone original repo and install dependencies
!git clone https://github.com/irshadbhat/indic-trans.git
# or
#!git clone https://github.com/libindic/indic-trans.git

%cd indic-trans
!pip install -r requirements.txt
!pip install .

%cd

Cloning into 'indic-trans'...
remote: Enumerating objects: 2188, done.
remote: Total 2188 (delta 0), reused 0 (delta 0), pack-reused 2188
Receiving objects: 100% (2188/2188), 516.49 MiB | 14.49 MiB/s, done.
Resolving deltas: 100% (1089/1089), done.
Updating files: 100% (717/717), done.
/content/indic-trans
Collecting pbr (from -r requirements.txt (line 1))
  Downloading pbr-6.0.0-py2.py3-none-any.whl (107 kB)
Installing collected packages: pbr
Successfully installed pbr-6.0.0
Processing /content/indic-trans
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: indictrans
  Building wheel for indictrans (setup.py) ... done
  Created wheel for indictrans: filename=indictrans-1.2.3-cp310-cp310-linux_x86_64.whl size=337805493 sha256=98ce4a9f797b322406c4435d131db6762f233244590daba2f9a5897c0bdfc2a3
  Stored in direc

In [None]:
# # usage instructions and optional parameters
# !indictrans --h

usage: indictrans [-h] [-v] [-s] [-t] [-b] [-m | -r] [-i] [-o]

Transliterator for Indian Languages including English

options:
  -h, --help          show this help message and exit
  -v, --version       show program's version number and exit
  -s , --source       select language (3 letter ISO-639 code) {hin, guj, pan, ben, mal, kan, tam,
                      tel, ori, eng, mar, nep, bod, kok, asm, urd}
  -t , --target       select language (3 letter ISO-639 code) {hin, guj, pan, ben, mal, kan, tam,
                      tel, ori, eng, mar, nep, bod, kok, asm, urd}
  -b, --build-lookup  build lookup to fasten transliteration
  -m, --ml            use ML system for transliteration
  -r, --rb            use rule-based system for transliteration
  -i , --input        <input-file>
  -o , --output       <output-file>


In [None]:
# build_lookup=True saves time on a large corpus. Transliterate Hindi text into English (Latin script)
from indictrans import Transliterator
trn = Transliterator(source='hin', target='eng', build_lookup=True)

In [None]:
hindi_source = """प्रतिदिन समाचार–पत्रों में ऐसी घटनाओं के समाचार प्रकाशित होते रहते हैं। आवश्यक और अनावश्यक माँगों को लेकर उनका आक्रोश बढ़ता ही रहता है।
यदि छात्रों की इस शक्ति को सृजनात्मक कार्य में लगा दिया जाए तो देश का कायापलट हो सकता है"""

In [None]:
eng_target = trn.transform(hindi_source)
print(eng_target)

pratidin samachar–patron main aisi ghatnaon ke samachar prakashit hote rahete hai. aavashyak or anaavashyak maangon ko lekar unka aakrosh badhata hi rahata he.
yadi chaatro kii is shakti ko srujanaatmak kaary main laga diya jaae to desh kaa kayapalat ho saktaa he


In [None]:
# back-transliterate from English to Hindi
trn = Transliterator(source='eng', target='hin')

In [None]:
hindi_verify = trn.transform(eng_target)
print(hindi_verify)

प्रतिदिन समाचार–पत्रों में ऐसी घटनाओं के समाचार प्रकाषित होते रहते हैं. आवश्यक और अनावश्यक मांगों को लेकर उनका आक्रोश बढ़ता ही रहता है.
यदि छात्रो की इस शक्ति को सृजनात्मक कार्य में लगा दिया जाए तो देश का क्यापलट हो सकता है
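
The round trip above is lossy: a few words come back with a different spelling. A minimal sketch for locating such words is given here, assuming that whitespace tokenization is good enough and that both texts contain the same number of tokens. The helper name `roundtrip_mismatches` is made up for this illustration and is not part of indic-trans.

```python
def roundtrip_mismatches(original, restored):
    """Return (original_word, restored_word) pairs that differ after a round trip.

    Assumes both texts split into the same number of whitespace-separated tokens.
    """
    orig_words = original.split()
    rest_words = restored.split()
    if len(orig_words) != len(rest_words):
        raise ValueError("token counts differ; align the texts first")
    return [(o, r) for o, r in zip(orig_words, rest_words) if o != r]


# Last clause of the example sentence, before and after the round trip
original = "देश का कायापलट हो सकता है"
restored = "देश का क्यापलट हो सकता है"
print(roundtrip_mismatches(original, restored))
```

Running the same helper on the full `hindi_source` and `hindi_verify` strings would also flag punctuation changes, since tokens such as the final word of a sentence carry the sentence-ending mark.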


### Dakshina dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Download link for the Dakshina dataset:
# https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
# Download the dataset, store it in Google Drive, and pass that folder path when extracting

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [None]:
# extract the downloaded tar file into a new folder named IndicDatasets
import tarfile
# the context manager closes the archive even if extraction fails
with tarfile.open('/content/gdrive/MyDrive/dakshina_dataset_v1.0.tar') as my_tar:
    my_tar.extractall('/content/gdrive/MyDrive/IndicDatasets/')
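
The archive holds every language, while the cells below only use the `kn` and `hi` folders. As a sketch, the extraction could be limited to one language by filtering archive members on a path prefix. This assumes the archive has a top-level `dakshina_dataset_v1.0/` folder with one subfolder per language, as the `%cd` paths below suggest; `extract_language` is a hypothetical helper, not part of any library.

```python
import tarfile


def extract_language(tar_path, dest_dir, lang_prefix):
    """Extract only the archive members whose path starts with lang_prefix.

    Returns the names of the extracted members.
    """
    with tarfile.open(tar_path) as tar:
        wanted = [m for m in tar.getmembers() if m.name.startswith(lang_prefix)]
        tar.extractall(dest_dir, members=wanted)
    return [m.name for m in wanted]


# Example call (paths follow the Drive layout used in this notebook):
# extract_language('/content/gdrive/MyDrive/dakshina_dataset_v1.0.tar',
#                  '/content/gdrive/MyDrive/IndicDatasets/',
#                  'dakshina_dataset_v1.0/kn/')
```

The trailing slash in the prefix keeps the filter from also matching a sibling folder whose name merely begins with the same letters.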

In [None]:
# !unzip /content/gdrive/MyDrive/dakshina_sample.zip -d /content/gdrive/MyDrive/IndicDatasets/

#### kn dataset

In [None]:
# romanized folder, Kannada language
%cd /content/gdrive/MyDrive/IndicDatasets/dakshina_dataset_v1.0/kn/romanized
%ls

/content/gdrive/MyDrive/IndicDatasets/kn/romanized
kn.romanized.rejoined.aligned.cased_nopunct.tsv
kn.romanized.rejoined.aligned.tsv
kn.romanized.rejoined.dev.native.txt
kn.romanized.rejoined.dev.roman.txt
kn.romanized.rejoined.test.native.txt
kn.romanized.rejoined.test.roman.txt
kn.romanized.rejoined.tsv
kn.romanized.split.tsv
kn.romanized.split.validation.edits.txt
kn.romanized.split.validation.native.txt


In [None]:
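# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them
rom_df = pd.read_csv('kn.romanized.split.tsv', header=None, sep='\t', on_bad_lines='warn')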
rom_df.head(10)



b'Skipping line 655: expected 5 fields, saw 7\n'


Unnamed: 0,0,1,2,3,4
0,ಎಲಿಜಬೆತ್ ಕವಿಗಳು ತಮ್ಮ ಕಲೆಯನ್ನು ಬಹುಮಟ್ಟಿಗೆ ಅಧ್ಯಯ...,24,Elizabeth kavigalu tamma kaleyannu bahumattige...,24.0,
1,ಬ್ಯಾಟರಿಗಳು ಕಡಿಮೆ ಉಷ್ಣಾಂಶದಲ್ಲಿ ಸಂಗ್ರಹಿಸಿಡಲ್ಪಟ್ಟ...,13,Byaatarigalu kadime ushnaanshadalli samgrahisi...,13.0,
2,ವಿಷ್ಣುವು ಕ್ಷೀರಸಾಗರದಲ್ಲಿ ಶೇಷನಾಗನ ಬೆನ್ನಿನ ಮೇಲೆ ನ...,8,Vishnuvu ksheerasaagaradalli sheshanaagana ben...,8.0,
3,ಮೊದಲ ಟರ್ನರ್ಸ್ ಗುಂಪು ರಚನೆಯಾಯಿತು ಲಂಡನ್ ಸಿನ್ಸಿನ್ನ...,21,Modala turners gumpu rachhaneyaayitu London Ci...,21.0,ಮೊದಲ ಟರ್ನರ್ಸ್ ಗುಂಪು ರಚನೆಯಾಯಿತು ಲಂಡನ್ ಸಿನ್ಸಿನ್ನ...
4,ಅಂದಿನಿಂದ ಅವರ ಮನೆತನದವರು ಆ ಮಠಕ್ಕೆ ನಿಷ್ಠೆಯಿಂದ ನಡೆ...,8,Andininda avara manetanadavaru aa mathakke nis...,8.0,
5,ಆರ್ಯಭಟ ಪ್ರಶಸ್ತಿ,2,aaryabhata prashasti,2.0,
6,ಈ ಸಮಯದಲ್ಲಿಯೇ ಅವರು ಮಾಲ್ಟಾದಲ್ಲಿ ಒಂದು ಸಂಕೇತ ತರಬೇತ...,16,Ee samayadalliye avaru maltadalli ondu sanketa...,16.0,
7,"ಇವರು ರಾಯ್, ಲಿಂಬು, ಮತ್ತು ಗುರುಂಗ್ ಬುಡಕಟ್ಟಿಗೆ ಸೇರ...",10,"Ivaru Roy, limbu, mattu gurung budakattige ser...",10.0,
8,"ಹೂವುಗಳು ಚಿಕ್ಕದಾಗಿದ್ದು, ಬಿಳಿ ಬಣ್ಣವಾಗಿರುತ್ತದೆ ಮತ...",6,"Hoovugalu chhikkadaagiddu, bili bannavaagirutt...",6.0,
9,ಅಮೆರಿಕನ್‌ ಮೊಲ ಸಾಕಣೆಗಾರರ ಸಂಘದ ಅನುಷಂಗವಾದ ಅಮೆರಿಕನ...,19,American mola saakanegaarara sanghada anushang...,19.0,
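
The message `Skipping line 655: expected 5 fields, saw 7` means that one row has more tab-separated fields than the rest. A rough way to find such rows before loading the file is sketched below with the standard `csv` module. It assumes quoting can be ignored, so it may not match the pandas parser exactly; `find_irregular_rows` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def find_irregular_rows(lines, delimiter="\t"):
    """Return (expected_field_count, [(line_number, field_count), ...]).

    The expected count is the most common one; line numbers are 1-based.
    Quoting is disabled, so every delimiter counts as a field boundary.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    counts = [len(row) for row in reader]
    if not counts:
        return 0, []
    expected = Counter(counts).most_common(1)[0][0]
    irregular = [(i, n) for i, n in enumerate(counts, start=1) if n != expected]
    return expected, irregular


# Tiny made-up sample: the third line has two extra tab-separated fields
sample = io.StringIO("a\t1\tb\t1\t\nc\t2\td\t2\t\ne\t3\tf\t3\t\tx\ty\n")
print(find_irregular_rows(sample))  # (5, [(3, 7)])
```

For the real file, an open file handle can be passed instead of the `StringIO` object, for example `open('kn.romanized.split.tsv', encoding='utf-8', newline='')`.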


In [None]:
# removing the last column and renaming the columns appropriately
rom_df.drop(rom_df.columns[-1], axis=1, inplace=True)
rom_df.rename(columns={0:'Kannada_text', 1:'kn_length', 2:'English_text', 3:'en_length'}, inplace=True)
rom_df

Unnamed: 0,Kannada_text,kn_length,English_text,en_length
0,ಎಲಿಜಬೆತ್ ಕವಿಗಳು ತಮ್ಮ ಕಲೆಯನ್ನು ಬಹುಮಟ್ಟಿಗೆ ಅಧ್ಯಯ...,24,Elizabeth kavigalu tamma kaleyannu bahumattige...,24.0
1,ಬ್ಯಾಟರಿಗಳು ಕಡಿಮೆ ಉಷ್ಣಾಂಶದಲ್ಲಿ ಸಂಗ್ರಹಿಸಿಡಲ್ಪಟ್ಟ...,13,Byaatarigalu kadime ushnaanshadalli samgrahisi...,13.0
2,ವಿಷ್ಣುವು ಕ್ಷೀರಸಾಗರದಲ್ಲಿ ಶೇಷನಾಗನ ಬೆನ್ನಿನ ಮೇಲೆ ನ...,8,Vishnuvu ksheerasaagaradalli sheshanaagana ben...,8.0
3,ಮೊದಲ ಟರ್ನರ್ಸ್ ಗುಂಪು ರಚನೆಯಾಯಿತು ಲಂಡನ್ ಸಿನ್ಸಿನ್ನ...,21,Modala turners gumpu rachhaneyaayitu London Ci...,21.0
4,ಅಂದಿನಿಂದ ಅವರ ಮನೆತನದವರು ಆ ಮಠಕ್ಕೆ ನಿಷ್ಠೆಯಿಂದ ನಡೆ...,8,Andininda avara manetanadavaru aa mathakke nis...,8.0
...,...,...,...,...
10189,"ಆಪರೇಟಿವ್ ಬ್ಯಾಂಕ್ ನ ಕಾರ್ಯಾಧ್ಯಕ್ಷ, ವಾಸುದೇವ ಆರ್.",6,"Operative Bank na karyadhyaksa, vasudeva R.",6.0
10190,"ಆದರೆ ‘ಯಾವುದೇ ಕಾರಣಕ್ಕೂ ತಡೆಗೋಡೆ ಬೇಡ, ಎಲ್ಲ 22 ಹಳ್...",18,"Adare yavude karanakka tadegode beda, ella 22 ...",18.0
10191,ಗೋಪುರ ಸೇತುವೆ ಸನಿಹದಲ್ಲಿರುವ ಐತಿಹಾಸಿಕ ಸ್ಥಳಗಳು,5,Gopura setuve sanihadalliruva aitihasika sthal...,5.0
10192,ಆರೋಗ್ಯವು ಅತ್ಯಂತ ಅಮೂಲ್ಯವಾದ ಲಾಭ ಮತ್ತು ತೃಪ್ತಿಯು ಅ...,12,Arogyavu atyanta amulyavada labha mattu trptiy...,12.0


In [None]:
rom_df['Kannada_text'][0]

"ಎಲಿಜಬೆತ್ ಕವಿಗಳು ತಮ್ಮ ಕಲೆಯನ್ನು ಬಹುಮಟ್ಟಿಗೆ ಅಧ್ಯಯನ ಮಾಡಿದ್ದು ಇಂತಹ ಸಂಕಲನಗಳಲ್ಲಿ.೧೫೬೩ರಲ್ಲಿ ಥಾಮಸ್ ಸ್ಟಾಕ್ ವಿಲ್ ಪ್ರಕಟಿಸಿದ 'ಇಂಡಕ್ಷನ್' ಕ್ರುತಿಯು ಕವಿಗಳು ತಮ್ಮ ತಂತ್ರ &ಭಾಷೆಯನ್ನು ,ಛಂದಸನ್ನು ಕುರಿತು ಚರ್ಚಿಸಲು ಅನುವು ನೀಡಿತು."

In [None]:
# transliterate the first Kannada sentence using the indic-trans package and check whether the result is similar to the transliterated English text in the dataset
trn = Transliterator(source='kan', target='eng')
kn_text = trn.transform(rom_df['Kannada_text'][0])
similarities = len([i for i in kn_text.split(' ') if i in rom_df['English_text'][0].split(' ')])

print("Transliterated sentence in dataset: ", rom_df['English_text'][0])
print("Manually obtained transliteration: ", kn_text )

print("No of common words is: ", similarities)

Transliterated sentence in dataset:  Elizabeth kavigalu tamma kaleyannu bahumattige adhyayana maadiddu intaha sankalanagalalli.1563ralli Thomus shtochhk Vil prakatisida 'induchhtion' krutiyu kavigalu tamma tamtra &bhaasheyannu ,chhandasannu kuritu chharchhisalu anuvu needitu.
Manually obtained transliteration:  elizabeth kavigalu tamm kaleyannu bahumattige adhyayan madiddu intah sankalanagamalli.1563ralli thomas stack vill prakatisid 'indukshan' kruthiyu kavigalu tamm tantra &bhasheyannu ,chhandasannu kuritu charchisalu anuvu niditu.
No of common words is:  7
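
The count above uses exact string matches, so `Elizabeth` and `elizabeth` are treated as different words. A minimal sketch of a more forgiving comparison, which lowercases the words and strips surrounding ASCII punctuation, follows. The helper `normalized_overlap` is an assumption made for this illustration and is not part of the notebook's packages.

```python
import string


def normalized_overlap(reference, candidate):
    """Count candidate words that also occur in the reference,
    ignoring case and leading/trailing ASCII punctuation."""
    def normalize(text):
        words = (w.strip(string.punctuation).lower() for w in text.split())
        return [w for w in words if w]

    reference_words = set(normalize(reference))
    return sum(1 for w in normalize(candidate) if w in reference_words)


# Opening words of the dataset entry and of the indic-trans output shown above
dataset_text = "Elizabeth kavigalu tamma kaleyannu bahumattige"
model_text = "elizabeth kavigalu tamm kaleyannu bahumattige"
print(normalized_overlap(dataset_text, model_text))  # 4
```

Only `tamm` fails to match here, because it differs from `tamma` by a letter rather than by case or punctuation.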


#### hi dataset

In [None]:
# romanized folder, Hindi language
%cd /content/gdrive/MyDrive/IndicDatasets/dakshina_dataset_v1.0/hi/romanized
%ls

/content/gdrive/MyDrive/IndicDatasets/hi/romanized
hi.romanized.rejoined.aligned.cased_nopunct.tsv
hi.romanized.rejoined.aligned.tsv
hi.romanized.rejoined.dev.native.txt
hi.romanized.rejoined.dev.roman.txt
hi.romanized.rejoined.test.native.txt
hi.romanized.rejoined.test.roman.txt
hi.romanized.rejoined.tsv
hi.romanized.split.tsv
hi.romanized.split.validation.edits.txt
hi.romanized.split.validation.native.txt


In [None]:
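# error_bad_lines was removed in pandas 2.0; on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them
rom_df = pd.read_csv('hi.romanized.split.tsv', header=None, sep='\t', on_bad_lines='warn')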
rom_df.head(10)





Unnamed: 0,0,1,2,3,4
0,जबकि यह जैनों से कम है।,6,Jabki yah Jainon se km hai.,6.0,
1,वर्ष 2000 में वेंकटरामन ने प्रयोगशाला में राइब...,24,Varsh 2000 men Venkatraman ne prayogshala men ...,24.0,
2,इस वृक्ष की लंबाई तकरीबन ५० से ६० फीट के आसपास...,14,Is vriksha ki lambai takriban 50 se 60 feet ke...,14.0,
3,"इन अनुसंधान कार्यक्रमों में तारा रचना, तारकीय ...",19,"In anusandhan karyakramon men tara rachana, ta...",19.0,
4,और स्टारबर्स्ट और सक्रिय गांगेय नाभिक जैसे पिं...,18,aur starburst aur sakriy gangeya nabhik jaise ...,18.0,
5,१ नबम्बर १९१९ को मजिस्ट्रेट बी॰ एस॰ क्रिस ने म...,15,1 November 1919 ko magistrate B. S. Kris ne Ma...,15.0,
6,अमरीश पुरी,2,Amrish Puri,2.0,
7,ग्रामीण क्षेत्रों में लोग अक्सर अपशिष्ट का स्थ...,28,Gramin Kshetron men log aksar apshisht ka stha...,28.0,
8,ये आंध्र प्रदेश से हैं।,5,Ye Andhra Pradesh se hain.,5.0,
9,सुशील सिद्धार्थ ये वो सहर तो नहीं हिंदी कथा जग...,16,Sushil Siddharth ye wo sahar to nahi Hindi kat...,16.0,


In [None]:
# removing the last column and renaming the columns appropriately
rom_df.drop(rom_df.columns[-1], axis=1, inplace=True)
rom_df.rename(columns={0:'Hindi_text', 1:'hi_text_length', 2:'English_text', 3:'en_length'}, inplace=True)
rom_df

Unnamed: 0,Hindi_text,hi_text_length,English_text,en_length
0,जबकि यह जैनों से कम है।,6,Jabki yah Jainon se km hai.,6.0
1,वर्ष 2000 में वेंकटरामन ने प्रयोगशाला में राइब...,24,Varsh 2000 men Venkatraman ne prayogshala men ...,24.0
2,इस वृक्ष की लंबाई तकरीबन ५० से ६० फीट के आसपास...,14,Is vriksha ki lambai takriban 50 se 60 feet ke...,14.0
3,"इन अनुसंधान कार्यक्रमों में तारा रचना, तारकीय ...",19,"In anusandhan karyakramon men tara rachana, ta...",19.0
4,और स्टारबर्स्ट और सक्रिय गांगेय नाभिक जैसे पिं...,18,aur starburst aur sakriy gangeya nabhik jaise ...,18.0
...,...,...,...,...
11502,कॉस्टयूम में दृश्यों को फिल्माना पड़ता है और क...,21,costume me drishyo ko filmana padta hai aur ka...,21.0
11503,(६) बोली का प्रयोग अपने क्षेत्र तक सीमित रहता ...,26,(6) boli ka prayog apne kshetra tak simit reht...,26.0
11504,"भोजन बनाना, बच्चों की देख रेख और घर की सफाई कर...",26,"bhojan banana, baccho ki dekh rekh aur ghar ki...",26.0
11505,शिकागो: यूनिवर्सिटी ऑफ़ शिकागो प्रेस.,5,chicago: university of chicago press.,5.0


In [None]:
# transliterate the first Hindi sentence using the indic-trans package and check whether the result is similar to the transliterated English text in the dataset
trn = Transliterator(source='hin', target='eng')
hi_text = trn.transform(rom_df['Hindi_text'][0])


In [None]:
# similarities between dataset's entry and the obtained transliteration
print("Transliterated sentence in dataset: ", rom_df['English_text'][0])
print("Manually obtained transliteration: ", hi_text )

similarities = len([i for i in hi_text.split(' ') if i in rom_df['English_text'][0].split(' ')])
print("No of common words is: ", similarities)

Transliterated sentence in dataset:  Jabki yah Jainon se km hai.
Manually obtained transliteration:  jabaki yah jainon se kam he.
No of common words is:  2
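
Exact word matching reports only 2 common words, even though several of the remaining pairs differ by a single vowel or by letter case. A character-level score gives partial credit for such near misses. The sketch below uses `difflib.SequenceMatcher` from the standard library; `word_similarities` is a hypothetical helper, and it assumes the two sentences have the same number of words so that they can be paired by position.

```python
from difflib import SequenceMatcher


def word_similarities(reference, candidate):
    """Pair up words by position and score each pair in [0, 1]
    with difflib's ratio (case-insensitive).

    zip() stops at the shorter sentence, so extra words are ignored.
    """
    pairs = zip(reference.split(), candidate.split())
    return [
        (ref, cand, round(SequenceMatcher(None, ref.lower(), cand.lower()).ratio(), 2))
        for ref, cand in pairs
    ]


dataset_text = "Jabki yah Jainon se km hai."
model_text = "jabaki yah jainon se kam he."
for ref, cand, score in word_similarities(dataset_text, model_text):
    print(f"{ref:10} {cand:10} {score}")
```

A score of 1.0 marks an identical pair once case is ignored; lower scores show how far apart the two spellings are.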


### Samanantar dataset

In [9]:
# load the dataset using the name listed on the HuggingFace datasets page https://huggingface.co/datasets
# load the Samanantar dataset for Assamese (as); only the first 100 entries of the 'train' split, as the dataset is big
from datasets import load_dataset

samanantar_ds = load_dataset('ai4bharat/samanantar', 'as', split='train[:100]')

# to load the entire dataset for language Hindi (hi), use the following line
# samanantar_ds = load_dataset('ai4bharat/samanantar', 'hi')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [10]:
# features in the dataset,
# idx : index
# src : sentence in English language
# tgt : sentence translated into Indic language
samanantar_ds

Dataset({
    features: ['idx', 'src', 'tgt'],
    num_rows: 100
})

In [11]:
samanantar_ds[:5]

{'idx': [0, 1, 2, 3, 4],
 'src': ['Tie up long hair.',
  'Nevertheless, he gave this assurance: He that has endured to the end is the one that will be saved.',
  'David wrote: Many things you yourself have done, O Jehovah my God, even your wonderful works...',
  'To many people today, a martyr is more or less the equivalent of a fanatic, an extremist.',
  'After protests were conducted over the decision of Gauhati University to conduct the final semester exams in the online and offline modes, the university finally decided to conduct the Post Graduate exams in the online mode while the Under Graduate exams will be conducted in both online and offline mode'],
 'tgt': ['মেলি থোৱা দীঘল চুলি।',
  'যিয়েই নহওঁক, যিসকলে এই কাৰ্য্য প্ৰাণপণে কৰা চেষ্টা কৰিব, তেওঁলোকক আশ্বাস দি যীচুৱে এইদৰে কৈছিল: “যি জনে শেষলৈকে সহি থাকে, সেই জনেই পৰিত্ৰাণ পাব । ”',
  'ইয়োব ৩৮ :\u2060 ৪ - \u200b ৬ পদত উল্লেখ কৰা লিখনীৰ পৰা আমি কি জনা উচিত?',
  'এই সন্দৰ্ভত তেওঁ পীলাতক এইদৰে কৈছিল যে, “সত্যতাৰ পক্ষে সাক্ষ্য দিবল

In [None]:
# example - add another feature to the Samanantar dataset with transliterated text
# transliterate the Indic-language text (Assamese here) into Latin script using the indic-trans package
trn = Transliterator(source='asm', target='eng')

transliterated_to_english = [trn.transform(txt) for txt in samanantar_ds['tgt']]
transliterated_to_english[:5]

['meli thoba dighal chuli.',
 'yie nahok, yisakle ei carboya pranapane kara cheshta karib, teonlokk aaswas di yichuve aidre kaichhil: “yi jane sheshlike sahi thake, sei janei paritran pav . ”',
 'yob 38 : 4 - \u200b 6 padat ullekh kara likhni para aami ki jana uchit?',
 'ei sandarvat teon pilatak aidre kaichhil ye, “satyatar pakshe saakshya dibli ” teon aahil .',
 "k'r'naar vibhishika majato guvahati vishwavidyalay vishwavidyalyakhanar adhinar collejasamuhar chudanta shanmasikar chhatra-chatri bave aflainat pariksha anushtit karaar siddhanth grahan karich ৷ yak li chhatra-chatriskalar majat tibra pratikrear srushti haiche."]

In [None]:
# add the transliterated data as a feature to the dataset
samanantar_ds = samanantar_ds.add_column("tgt_english",transliterated_to_english)
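
The transliterated strings shown above still carry invisible characters from the source, such as `\u200b` (zero-width space), and the `tgt` field also contains `\u2060` (word joiner). Before storing such a column it may be worth removing them. A minimal sketch, with `strip_invisible` as a hypothetical helper:

```python
# Invisible characters that carry no linguistic content.
# U+200C (ZWNJ) and U+200D (ZWJ) are deliberately left alone because
# they can affect how Indic scripts are rendered.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))


def strip_invisible(text):
    """Remove zero-width space, word joiner and BOM characters."""
    return text.translate(INVISIBLE)


print(repr(strip_invisible("yob 38 : 4 - \u200b 6 padat")))
```

The helper could be applied to each string before calling `trn.transform`, or to the transliterated list afterwards. Note that removing the character leaves the spaces on either side of it in place.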

In [None]:
samanantar_ds[:3]

{'idx': [0, 1, 2],
 'src': ['Tie up long hair.',
  'Nevertheless, he gave this assurance: He that has endured to the end is the one that will be saved.',
  'David wrote: Many things you yourself have done, O Jehovah my God, even your wonderful works...'],
 'tgt': ['মেলি থোৱা দীঘল চুলি।',
  'যিয়েই নহওঁক, যিসকলে এই কাৰ্য্য প্ৰাণপণে কৰা চেষ্টা কৰিব, তেওঁলোকক আশ্বাস দি যীচুৱে এইদৰে কৈছিল: “যি জনে শেষলৈকে সহি থাকে, সেই জনেই পৰিত্ৰাণ পাব । ”',
  'ইয়োব ৩৮ :\u2060 ৪ - \u200b ৬ পদত উল্লেখ কৰা লিখনীৰ পৰা আমি কি জনা উচিত?'],
 'tgt_english': ['meli thoba dighal chuli.',
  'yie nahok, yisakle ei carboya pranapane kara cheshta karib, teonlokk aaswas di yichuve aidre kaichhil: “yi jane sheshlike sahi thake, sei janei paritran pav . ”',
  'yob 38 : 4 - \u200b 6 padat ullekh kara likhni para aami ki jana uchit?']}
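
Slicing the dataset as above returns a dict of columns, with one list per feature. When it is more convenient to work row by row, the columns can be zipped back together. The sketch below operates on a plain dict, so it does not need the `datasets` package; `columns_to_rows` is a hypothetical helper.

```python
def columns_to_rows(batch):
    """Turn {'col': [v0, v1, ...], ...} into [{'col': v0, ...}, {'col': v1, ...}, ...]."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


# Rows 0 and 2 of the slice shown above, with the 'tgt' column left out
batch = {
    "idx": [0, 2],
    "src": [
        "Tie up long hair.",
        "David wrote: Many things you yourself have done, O Jehovah my God, even your wonderful works...",
    ],
    "tgt_english": [
        "meli thoba dighal chuli.",
        "yob 38 : 4 - \u200b 6 padat ullekh kara likhni para aami ki jana uchit?",
    ],
}
for row in columns_to_rows(batch):
    print(row["idx"], row["src"], "->", row["tgt_english"])
```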

### References



[From English to Indic: Leveraging indicTrans2 for NLP and Indic LLM](https://medium.com/@raju.kandasamy/from-english-to-indic-leveraging-indictrans2-for-nlp-and-indic-llm-2b6164457de1)

[Dakshina dataset](https://github.com/google-research-datasets/dakshina)

[Huggingface datasets](https://huggingface.co/docs/datasets/index)

[OPUS - open source parallel corpus](https://opus.nlpl.eu/)

[MuRIL](https://huggingface.co/google/muril-base-cased)