## Subword Tokenization and Dataset Preparation

With preprocessing complete, the next step prepares the dataset for Neural Machine Translation (NMT) by applying **subword segmentation**.  
Subwording helps the model handle **rare words**, **morphological variations**, and **unknown tokens** by breaking text into smaller, learnable units.
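
To make the idea concrete, here is a toy sketch of splitting an unseen word into known pieces. It uses greedy longest-match over a hand-made vocabulary, which is an assumption chosen for readability; it is not the SentencePiece unigram algorithm used later in this notebook, and `segment` and `toy_vocab` are hypothetical names.

```python
# Toy illustration only: greedy longest-match segmentation over a hand-made
# vocabulary. A unigram model instead scores candidate segmentations with
# learned piece probabilities.

def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabulary, picked by hand for the example.
toy_vocab = {"token", "iz", "ation", "un", "translat", "able"}

print(segment("tokenization", toy_vocab))    # ['token', 'iz', 'ation']
print(segment("untranslatable", toy_vocab))  # ['un', 'translat', 'able']
```

Even though neither full word is in the vocabulary, both are covered by pieces the model could have seen in other words.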

*By the end of this stage, the dataset is fully subworded, split, and organized, ready for model training in the next phase.*


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/LLM/workflow/

/content/drive/MyDrive/Colab Notebooks/LLM/workflow


### Step 1: Setting Up the Environment
We create a new working directory and clone the **MT-Preparation** repository from GitHub.  
This repository provides standardized scripts for performing subword tokenization, data cleaning, and dataset splitting.  
It ensures consistent preprocessing across different language pairs and simplifies the subwording process.


In [None]:

# Create a directory and clone the GitHub MT-Preparation repository
!mkdir -p nmt
%cd nmt
!git clone https://github.com/ymoslem/MT-Preparation.git

/home/prashanth/project/NMT/nmt
Cloning into 'MT-Preparation'...
remote: Enumerating objects: 323, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 323 (delta 35), reused 21 (delta 20), pack-reused 268 (from 2)
Receiving objects: 100% (323/323), 94.95 KiB | 382.00 KiB/s, done.
Resolving deltas: 100% (156/156), done.


In [None]:
# Install the requirements from the cloned repository
!pip3 install -r MT-Preparation/requirements.txt




### Step 2: Training the Subword Model
We train two **SentencePiece unigram models**, one on the English side and one on the Telugu side of the parallel data.  
During training:
- Each model learns a vocabulary of subword units for its language; the source-side log below shows `vocab_size: 50000`.  
- Multilingual tags (e.g., `<2te-Computer_science>`, `<2en-Mathematics>`) are passed as user-defined symbols so that each tag is kept as a single token.  
- The output of this step is two model files: `source.model` for the source language (English) and `target.model` for the target language (Telugu).

This step is crucial because it defines how words and subwords will be represented during model training and inference.
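
As a rough sketch, the same training could be expressed through the `sentencepiece` Python package directly. The values below are copied from the source-side `trainer_spec` log in this notebook. Assumptions: the target model uses the same settings with the `target` prefix, and the helper names (`parse_tags`, `build_trainer_args`, `train_model`) are hypothetical. This is not a reproduction of what `1-train_unigram.py` does internally.

```python
def parse_tags(tag_string):
    """Split a comma-separated tag string and trim surrounding whitespace."""
    return [tag.strip() for tag in tag_string.split(",") if tag.strip()]


def build_trainer_args(input_path, model_prefix, tags, vocab_size=50000):
    """Collect keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": input_path,
        "model_prefix": model_prefix,   # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": vocab_size,
        "character_coverage": 0.9995,
        "split_digits": True,
        "user_defined_symbols": tags,   # each tag stays a single piece
    }


def train_model(args):
    """Train one model; needs the `sentencepiece` package and the input file."""
    import sentencepiece as spm  # imported lazily so the helpers run without it
    spm.SentencePieceTrainer.train(**args)


tags = parse_tags(
    "<2te-Computer_science>, <2te-Mathematics>, "
    "<2en-Computer_science>, <2en-Mathematics>"
)
source_args = build_trainer_args("dataset/parallel_copora/en-te.en", "source", tags)
target_args = build_trainer_args("dataset/parallel_copora/en-te.te", "target", tags)
print(source_args["user_defined_symbols"])
```

Calling `train_model(source_args)` and `train_model(target_args)` would then start the two training runs. Trimming the tags in `parse_tags` keeps stray spaces out of the registered symbols.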

In [None]:
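# Train the SentencePiece models for subword tokenization
# Caution: the printed list below shows that the tag string is split on commas
# without trimming, so the space after each comma becomes part of the symbol.
# Writing the tags without spaces after the commas registers the bare tags.
!python3 MT-Preparation/subwording/1-train_unigram.py dataset/parallel_copora/en-te.en dataset/parallel_copora/en-te.te \
    "<2te-Computer_science>, <2te-Mathematics>, <2en-Computer_science>, <2en-Mathematics>"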

['<2te-Computer_science>', ' <2te-Mathematics>', ' <2en-Computer_science>', ' <2en-Mathematics>']
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: dataset/parallel_copora/en-te.en
  input_format: 
  model_prefix: source
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: <2te-Computer_science>
  user_defined_symbols:  <2te-Mathematics>
  user_defined_symbols:  <2en-Computer_science>
  user_defined_symbols:  <2en-Mathematics>
  required_chars: 
  byte_fallback: 0
  vocabulary_output_pi

### Step 3: Applying Subword Tokenization
Once the models are trained, the raw parallel sentences are tokenized into subword units.  
Each sentence is segmented according to the learned vocabulary, producing parallel subworded files such as:
- English source file (e.g., `en-te.en.subword`)  
- Telugu target file (e.g., `en-te.te.subword`)  

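The script writes each subworded file next to its input (here under `dataset/parallel_copora/`). Step 4 reads the files from `dataset/subword_corpora/`, so they need to be moved or copied there before splitting.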
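
The line-by-line encoding can be sketched as follows. This is a minimal illustration, not the code in `2-subword.py`: `subword_file`, `load_sentencepiece_encoder`, and `toy_encode` are hypothetical helpers, and the demo uses a stand-in encoder so that it runs without a trained model.

```python
import os
import tempfile


def subword_file(encode, in_path, out_path):
    """Write one space-joined piece sequence per input line; return line count."""
    count = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            pieces = encode(line.strip())
            dst.write(" ".join(pieces) + "\n")
            count += 1
    return count


def load_sentencepiece_encoder(model_path):
    """Return an encoder backed by a trained model (needs `sentencepiece`)."""
    import sentencepiece as spm
    processor = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: processor.encode(text, out_type=str)


# Stand-in encoder for the demo: marks each word start with U+2581,
# the word-boundary symbol SentencePiece uses.
def toy_encode(text):
    return ["\u2581" + word for word in text.split()]


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "sample.en")
    out_path = in_path + ".subword"
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("hello world\nsecond line\n")
    n_lines = subword_file(toy_encode, in_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        demo_output = f.read().splitlines()

print(n_lines, demo_output)
```

With the real models, `subword_file(load_sentencepiece_encoder("source.model"), "dataset/parallel_copora/en-te.en", "dataset/parallel_copora/en-te.en.subword")` would play the same role for the English side. Writing exactly one output line per input line keeps the source and target files aligned.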

In [None]:
# Subword the dataset
!python3 MT-Preparation/subwording/2-subword.py source.model target.model dataset/parallel_copora/en-te.en dataset/parallel_copora/en-te.te

Source Model: source.model
Target Model: target.model
Source Dataset: dataset/parallel_copora/en-te.en
Target Dataset: dataset/parallel_copora/en-te.te
Done subwording the source file! Output: dataset/parallel_copora/en-te.en.subword
Done subwording the target file! Output: dataset/parallel_copora/en-te.te.subword


### Step 4: Splitting the Dataset
The tokenized dataset is divided into **training**, **development (validation)**, and **testing** subsets.  
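A fixed number of segments (in this case, 2000 each) is reserved for development and testing, so the two held-out sets are the same size.  
The remaining data is used for training the NMT model. With the 133,742 segment pairs reported in the output below, that leaves 129,742 pairs for training, assuming exactly 2000 pairs go to each held-out set.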

This ensures that:
- The model is trained on a large, diverse dataset.  
- Performance is validated and tested on unseen data for fair evaluation.
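
The splitting logic can be sketched in plain Python as below. This is an illustration under assumptions, not the implementation in `train_dev_test_split.py` (whose output suggests it works on a dataframe): the shuffle, the fixed seed, and the name `split_parallel` are all choices made for this example.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=1):
    """Shuffle aligned pairs, then hold out dev and test sets."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    if dev_size + test_size >= len(pairs):
        raise ValueError("held-out sets leave no training data")
    # Shuffle pairs together so every source line keeps its translation.
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


# Synthetic stand-in data: 100 aligned segment pairs.
src = [f"src {i}" for i in range(100)]
tgt = [f"tgt {i}" for i in range(100)]
train, dev, test = split_parallel(src, tgt, dev_size=10, test_size=10)
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling the pairs as a unit, rather than each file separately, is what keeps the three subsets parallel.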

In [None]:
# Split the dataset into training set, development set, and test set
# Development and test sets should be between 1000 and 5000 segments (here we chose 2000)
!python3 MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000  dataset/subword_corpora/en-te.en.subword dataset/subword_corpora/en-te.te.subword

Dataframe shape: (133742, 2)
--- Empty Cells Deleted --> Rows: 133742
--- Wrote Files
Done!
Output files
dataset/subword_corpora/en-te.en.subword.train
dataset/subword_corpora/en-te.te.subword.train
dataset/subword_corpora/en-te.en.subword.dev
dataset/subword_corpora/en-te.te.subword.dev
dataset/subword_corpora/en-te.en.subword.test
dataset/subword_corpora/en-te.te.subword.test
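
Before training, it can be worth confirming that each source/target pair of output files still has matching line counts. The check below is a small sketch; `count_lines` and `check_aligned` are hypothetical helpers, and the demo runs on temporary files, not on the real outputs.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_aligned(src_path, tgt_path):
    """Return the shared line count, or raise if the two files differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}"
        )
    return n_src


# Demo on temporary files standing in for a .dev pair.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy.en.subword.dev")
    tgt_path = os.path.join(tmp, "toy.te.subword.dev")
    for path in (src_path, tgt_path):
        with open(path, "w", encoding="utf-8") as f:
            f.write("line one\nline two\nline three\n")
    aligned_count = check_aligned(src_path, tgt_path)

print(aligned_count)  # 3
```

On the real data, the same call could be made for each `.train`, `.dev`, and `.test` pair listed in the output above.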
