In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Change to the project directory and list its contents
%cd /content/drive/MyDrive/Bn-Hi_translation_project/
!ls

/content/drive/MyDrive/Bn-Hi_translation_project
dataset  nmt  notebook	source.model  source.vocab  target.model  target.vocab


In [7]:
# Create a directory and clone the GitHub MT-Preparation repository
!mkdir -p nmt
%cd nmt
!git clone https://github.com/ymoslem/MT-Preparation.git

/content/drive/MyDrive/Bn-Hi_translation_project/nmt
Cloning into 'MT-Preparation'...
remote: Enumerating objects: 305, done.
remote: Counting objects: 100% (131/131), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 305 (delta 66), reused 114 (delta 58), pack-reused 174 (from 1)
Receiving objects: 100% (305/305), 84.51 KiB | 2.35 MiB/s, done.
Resolving deltas: 100% (149/149), done.


In [3]:
# Install the requirements
!pip3 install -r \
/content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/requirements.txt



In [4]:
# Unzip the dataset archive (already stored in Drive) into the dataset directory
!unzip /content/drive/MyDrive/Bn-Hi_translation_project/dataset/bn-hi.txt.zip \
-d /content/drive/MyDrive/Bn-Hi_translation_project/dataset

Archive:  /content/drive/MyDrive/Bn-Hi_translation_project/dataset/bn-hi.txt.zip
  inflating: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/README  
  inflating: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/LICENSE  
  inflating: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/GNOME.bn-hi.bn  
  inflating: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/GNOME.bn-hi.hi  
  inflating: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/GNOME.bn-hi.ids  


#Rename unzipped files to source and target

In [5]:
import os

# Specify the current file name and the new file name
current_file_name = '/content/drive/MyDrive/Bn-Hi_translation_project/dataset/GNOME.bn-hi.bn'  # Replace with your current file name
new_file_name = '/content/drive/MyDrive/Bn-Hi_translation_project/dataset/source.txt'     # Replace with the desired new file name

# Rename the file
try:
    os.rename(current_file_name, new_file_name)
    print(f"File renamed successfully to {new_file_name}")
except FileNotFoundError:
    print(f"File {current_file_name} not found!")
except Exception as e:
    print(f"An error occurred: {e}")


File renamed successfully to /content/drive/MyDrive/Bn-Hi_translation_project/dataset/source.txt


In [6]:
import os

# Specify the current file name and the new file name
current_file_name = '/content/drive/MyDrive/Bn-Hi_translation_project/dataset/GNOME.bn-hi.hi'  # Replace with your current file name
new_file_name = '/content/drive/MyDrive/Bn-Hi_translation_project/dataset/target.txt'     # Replace with the desired new file name

# Rename the file
try:
    os.rename(current_file_name, new_file_name)
    print(f"File renamed successfully to {new_file_name}")
except FileNotFoundError:
    print(f"File {current_file_name} not found!")
except Exception as e:
    print(f"An error occurred: {e}")


File renamed successfully to /content/drive/MyDrive/Bn-Hi_translation_project/dataset/target.txt


#Data Filtering
Filtering out low-quality segments can help improve the translation quality of the output MT model. Such segments include misalignments, empty segments, and duplicates, among other issues.
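To make the idea concrete, here is a minimal sketch of this kind of filtering in plain Python. It is not the implementation of the filtering script used in this notebook; the function name `filter_pairs` and the thresholds `max_len` and `max_ratio` are assumptions chosen for illustration, and the length ratio is only a rough proxy for misalignment. The sample segments are taken from this notebook's dataset.

```python
def filter_pairs(sources, targets, max_len=200, max_ratio=3.0):
    """Return (source, target) pairs that pass a few basic quality checks.

    max_len and max_ratio are illustrative thresholds, not values taken
    from the MT-Preparation scripts.
    """
    seen = set()
    kept = []
    for src, tgt in zip(sources, targets):
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is just a copy of the source.
        if src == tgt:
            continue
        # Drop exact duplicates of a pair that was already kept.
        if (src, tgt) in seen:
            continue
        # Drop overly long segments and pairs whose lengths differ a lot,
        # which is a rough hint that the two sides may be misaligned.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > max_len:
            continue
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


sources = [
    "অধিবৃত্তিক সাইন [k]",
    "অধিবৃত্তিক সাইন [k]",              # duplicate pair
    "GNOME",                            # copied to the target unchanged
    "",                                 # empty source
    "ব্রাসেরো - চিত্র তৈরি করা হচ্ছে",
]
targets = [
    "अतिपरवलयिक साइन",
    "अतिपरवलयिक साइन",
    "GNOME",
    "ब्रासेरो - छवि बना रहा है",
    "ब्रासेरो - छवि बना रहा है",
]

kept = filter_pairs(sources, targets)
print(len(kept))  # 2
```

Only the first and last pairs survive: the duplicate, the source-copied pair, and the pair with an empty source are dropped.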

In [8]:
%cd /content/drive/MyDrive/Bn-Hi_translation_project/dataset
!ls

/content/drive/MyDrive/Bn-Hi_translation_project/dataset
bn-hi.txt.zip  GNOME.bn-hi.ids	LICENSE  README  source.txt  target.txt


In [9]:
# Filter the dataset
# Arguments as passed here: source file, target file
!python3 \
/content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/filtering/filter_v1.py \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/source.txt \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/target.txt

Dataframe shape (rows, columns): (142794, 2)
--- Rows with Empty Cells Deleted	--> Rows: 142794
--- Duplicates Deleted			--> Rows: 21414
--- Source-Copied Rows Deleted		--> Rows: 21151
--- Too Long Source/Target Deleted	--> Rows: 20464
--- HTML Removed			--> Rows: 20464
--- Rows will remain true-cased		--> Rows: 20464
--- Rows with Empty Cells Deleted	--> Rows: 20463
--- Rows Shuffled			--> Rows: 20463
--- Source Saved: src_filter.txt
--- Target Saved: tgt_filter.txt


#Tokenization / Sub-wording
To build a vocabulary for any NLP model, you have to tokenize (i.e. split) sentences into smaller units. Word-based tokenization used to be the standard approach, with each word treated as one token. However, an MT model can only learn a limited number of vocabulary tokens because of hardware constraints. To address this, sub-words are used instead of whole words. At translation time, when the model sees a new word that shares sub-words with tokens in its vocabulary, it can still attempt a translation instead of marking the word as “unknown” (“unk”).
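The following toy sketch illustrates why sub-words help with unseen words. It uses a greedy longest-match split over a tiny hand-written vocabulary, which is an assumption made for illustration only: it is not the unigram algorithm that SentencePiece trains in this notebook, and the function `segment` and the English vocabulary are hypothetical.

```python
def segment(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts at this position: emit an unknown
            # marker for this character and move on.
            pieces.append("<unk>")
            start += 1
    return pieces


# A hypothetical sub-word vocabulary; neither full word below is in it.
vocab = {"translat", "ion", "ing", "or", "s"}

print(segment("translations", vocab))  # ['translat', 'ion', 's']
print(segment("translators", vocab))   # ['translat', 'or', 's']
```

Neither "translations" nor "translators" is a vocabulary entry, yet both are fully covered by known pieces, so no "unk" token is produced.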

There are a few approaches to sub-wording, such as BPE and the unigram model. SentencePiece is a well-known toolkit that implements the most common approaches. Note that you first have to train a sub-wording model and then apply it to your data. After translation, you have to “desubword” (or “decode”) the output back into plain text using the same SentencePiece model.
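As a rough illustration of what “desubwording” does, the sketch below reverses the `▁` (U+2581) word-boundary marker by hand, using a subworded line of the kind this notebook produces. The helper `desubword` is hypothetical and assumes the default SentencePiece marker convention; in practice you would decode with the trained SentencePiece model itself rather than with string replacement.

```python
def desubword(line):
    """Rebuild plain text from a line of space-separated sub-word pieces."""
    pieces = line.split()
    # Glue all pieces together, then turn each word-boundary marker
    # (U+2581) back into an ordinary space.
    return "".join(pieces).replace("\u2581", " ").strip()


# A Hindi line in subworded form, as it appears in tgt_subword.txt.
subworded = "\u2581ब्रा सेरो \u2581- \u2581छवि \u2581बना \u2581रहा \u2581है"
print(desubword(subworded))
```

Pieces without a leading marker, such as `सेरो` here, are attached to the previous piece, which is how the original word is restored.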

In [10]:
!ls /content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/subwording/

1-train_bpe.py		  1-train_unigram.py  3-desubword.py
1-train_unigram_joint.py  2-subword.py	      spm_to_vocab.py


In [11]:
# Train a SentencePiece model for subword tokenization
!python3 \
/content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/subwording/1-train_unigram.py \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_filter.txt

sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt --model_prefix=source --vocab_size=50000 --hard_vocab_limit=false --split_digits=true
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  tr

In [12]:
!ls

bn-hi.txt.zip	 LICENSE  source.model	source.vocab	target.model  target.vocab
GNOME.bn-hi.ids  README   source.txt	src_filter.txt	target.txt    tgt_filter.txt


In [15]:
# Subword the dataset
!python3 \
/content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/subwording/2-subword_v1.py \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/source.model \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/target.model \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_filter.txt


Source Model: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/source.model
Target Model: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/target.model
Source Dataset: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt
Target Dataset: /content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_filter.txt
Done subwording the source file! Output: src_subword.txt
Done subwording the target file! Output: tgt_subword.txt


In [16]:
# First 3 lines before subwording
!head -n 3 /content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_filter.txt \
&& echo "-----" && \
head -n 3 /content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_filter.txt


ডিস্কে কোনো ট্র্যাক তথ্য (লেখক, শিরোনাম, ...) লেখা হবেনা।
অধিবৃত্তিক সাইন [k]
ব্রাসেরো - চিত্র তৈরি করা হচ্ছে
-----
कोई ट्रैक सूचना नहीं (कलाकार, शीर्षक, ...) डिस्क में नहीं लिखी जाएगी.
अतिपरवलयिक साइन
ब्रासेरो - छवि बना रहा है


In [18]:
# First 3 lines after subwording
!head -n 3 /content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_subword.txt \
&& echo "---" && \
head -n 3 /content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_subword.txt


▁ডিস্কে ▁ কোনো ▁ট্র্যাক ▁তথ্য ▁( লেখক , ▁শিরোনাম , ▁...) ▁লেখা ▁হবেনা ।
▁অধি বৃত্তি ক ▁সাইন ▁[ k ]
▁ব্রাসেরো ▁- ▁চিত্র ▁তৈরি ▁করা ▁হচ্ছে
---
▁को ई ▁ट्रैक ▁सूचना ▁नहीं ▁( कलाकार , ▁शीर्षक , ▁...) ▁डिस्क ▁में ▁नहीं ▁लिख ी ▁जाएगी .
▁अति पर वल यिक ▁साइन
▁ब्रा सेरो ▁- ▁छवि ▁बना ▁रहा ▁है


#Data Splitting
We usually split our dataset into 3 portions:

training dataset - used for training the model;
development dataset - used to run regular validations during the training to help improve the model parameters; and
testing dataset - a holdout dataset used after the model finishes training to finally evaluate the model on unseen data.
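A minimal sketch of such a three-way split is shown below. It is not necessarily how the splitting script used in this notebook works (for example, it may not shuffle in the same way); the function `split_parallel` and the placeholder pairs are assumptions for illustration. The sizes mirror this notebook: 20463 filtered segments with 3000 each for development and testing.

```python
import random


def split_parallel(pairs, dev_size, test_size, seed=0):
    """Split (source, target) pairs into train, dev, and test lists."""
    if dev_size + test_size >= len(pairs):
        raise ValueError("dev and test sets must leave room for training data")
    # Shuffle a copy so the caller's list keeps its original order.
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# Placeholder pairs standing in for the real subworded segments.
pairs = [(f"src {i}", f"tgt {i}") for i in range(20463)]
train, dev, test = split_parallel(pairs, dev_size=3000, test_size=3000)
print(len(train), len(dev), len(test))  # 14463 3000 3000
```

Splitting pairs rather than two separate lists keeps every source line attached to its target line, so the three portions stay aligned.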

In [19]:
# Split the dataset into training set, development set, and test set
# Development and test sets should be between 1000 and 5000 segments (here we chose 3000)
!python3 /content/drive/MyDrive/Bn-Hi_translation_project/nmt/MT-Preparation/train_dev_split/train_dev_test_split.py \
3000 3000 \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_subword.txt \
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_subword.txt


Dataframe shape: (20463, 2)
--- Empty Cells Deleted --> Rows: 20463
--- Wrote Files
Done!
Output files
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_subword.txt.train
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_subword.txt.train
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_subword.txt.dev
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_subword.txt.dev
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/src_subword.txt.test
/content/drive/MyDrive/Bn-Hi_translation_project/dataset/tgt_subword.txt.test


In [20]:
%cd /content/drive/MyDrive/Bn-Hi_translation_project/dataset
!ls

/content/drive/MyDrive/Bn-Hi_translation_project/dataset
bn-hi.txt.zip	 source.model	 src_subword.txt	target.model	tgt_subword.txt
GNOME.bn-hi.ids  source.txt	 src_subword.txt.dev	target.txt	tgt_subword.txt.dev
LICENSE		 source.vocab	 src_subword.txt.test	target.vocab	tgt_subword.txt.test
README		 src_filter.txt  src_subword.txt.train	tgt_filter.txt	tgt_subword.txt.train


In [22]:
# Line count for the subworded train, dev, and test datasets
!wc -l *_subword.*

  20463 src_subword.txt
   3000 src_subword.txt.dev
   3000 src_subword.txt.test
  14463 src_subword.txt.train
  20463 tgt_subword.txt
   3000 tgt_subword.txt.dev
   3000 tgt_subword.txt.test
  14463 tgt_subword.txt.train
  81852 total


In [23]:
# Check the first and last line from each dataset

# -------------------------------------------
# Change this cell to print your name
!echo -e "My name is: FirstName SecondName \n"
# -------------------------------------------

!echo "---First line---"
!head -n 1 *.{train,dev,test}

!echo -e "\n---Last line---"
!tail -n 1 *.{train,dev,test}

My name is: FirstName SecondName 

---First line---
==> src_subword.txt.train <==
▁ডিস্কে ▁ কোনো ▁ট্র্যাক ▁তথ্য ▁( লেখক , ▁শিরোনাম , ▁...) ▁লেখা ▁হবেনা ।

==> tgt_subword.txt.train <==
▁को ई ▁ट्रैक ▁सूचना ▁नहीं ▁( कलाकार , ▁शीर्षक , ▁...) ▁डिस्क ▁में ▁नहीं ▁लिख ी ▁जाएगी .

==> src_subword.txt.dev <==
▁পরিসেবা ▁উপলব্ধ ▁নয়

==> tgt_subword.txt.dev <==
▁सेवा ▁अनुपलब्ध

==> src_subword.txt.test <==
▁ডিফল্ট ▁মান ▁ব্যবহার ▁করা ▁হবে ▁(_ U )

==> tgt_subword.txt.test <==
▁लोकेल ▁तयशुदा ▁का ▁उपयोग ▁करें

---Last line---
==> src_subword.txt.train <==
▁ফোল্ডার ▁`% s '- র ▁শেষে ▁বার্তা ▁যোগ ▁করা ▁সম্ভব ▁নয় : ▁অজানা ▁সমস্যা

==> tgt_subword.txt.train <==
▁'% s ' ▁फ़ोल्डर ▁में ▁संदेश ▁जोड़ ▁नहीं ▁सकता ▁है : ▁अज्ञात ▁त्रुटि

==> src_subword.txt.dev <==
▁কানাডা

==> tgt_subword.txt.dev <==
▁कनाडा

==> src_subword.txt.test <==
▁প্রথম

==> tgt_subword.txt.test <==
▁प्रथम
