## Model Configuration, Training, and Export

With preprocessing and subwording complete, this section defines, trains, and exports a lightweight **Neural Machine Translation (NMT)** model using **OpenNMT-py** and **CTranslate2**.  
The purpose of this phase is to perform a **dummy training run** that confirms the entire NMT workflow, from data preparation to model export, functions correctly on GPU before scaling up.


## Change the Python version to 3.11

In [None]:
    !sudo apt-get update -y
    !sudo apt-get install -y python3.11 python3.11-distutils

Hit:1 https://cli.github.com/packages stable InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,827 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/mai

In [None]:
    !sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1

In [None]:
!sudo update-alternatives --config python3

There are 3 choices for the alternative python3 (providing /usr/bin/python3).

  Selection    Path                 Priority   Status
------------------------------------------------------------
  0            /usr/bin/python3.12   2         auto mode
* 1            /usr/bin/python3.10   1         manual mode
  2            /usr/bin/python3.11   1         manual mode
  3            /usr/bin/python3.12   2         manual mode

Press <enter> to keep the current choice[*], or type selection number: 
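
Note that `update-alternatives` only changes what the shell's `python3` resolves to; the notebook kernel keeps running the interpreter it was started with. The sketch below is one way to compare the two, assuming `python3` is on the `PATH`. The helper name `interpreter_versions` is illustrative and not part of any library.

```python
import shutil
import subprocess
import sys


def interpreter_versions():
    """Return the kernel's Python version and the one `python3` resolves to."""
    kernel = ".".join(str(part) for part in sys.version_info[:3])
    shell = None
    python3 = shutil.which("python3")
    if python3 is not None:
        # `python3 --version` prints a line such as "Python 3.11.9".
        result = subprocess.run(
            [python3, "--version"], capture_output=True, text=True, check=False
        )
        output = (result.stdout or result.stderr).strip()
        shell = output.replace("Python", "").strip() or None
    return {"kernel": kernel, "shell_python3": shell}


versions = interpreter_versions()
print(versions)
if versions["shell_python3"] and versions["shell_python3"] != versions["kernel"]:
    print("Shell `python3` and the notebook kernel use different interpreters.")
```

If the two versions differ, packages installed with `!pip3` may land in a different environment from the one the kernel imports from.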

In [None]:
!sudo apt install python3-pip

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  python3-pkg-resources python3-setuptools python3-wheel
Suggested packages:
  python-setuptools-doc
The following NEW packages will be installed:
  python3-pip python3-setuptools python3-wheel
The following packages will be upgraded:
  python3-pkg-resources
1 upgraded, 3 newly installed, 0 to remove and 38 not upgraded.
Need to get 2,019 kB of archives.
After this operation, 9,616 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-wheel all 0.37.1-2ubuntu0.22.04.1 [32.0 kB]
Get:2 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy/main amd64 python3-pkg-resources all 68.1.2-2~jammy3 [216 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip all 22.0.2+dfsg-1ubuntu0.7 [1,306 kB]
Get:4 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubun

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/LLM/workflow/

/content/drive/MyDrive/Colab Notebooks/LLM/workflow


In [None]:
!pip3 install OpenNMT-py

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.5.1-py3-none-any.whl (262 kB)
Collecting fasttext-wheel
  Downloading fasttext_wheel-0.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Collecting configargparse
  Downloading configargparse-1.7.1-py3-none-any.whl (25 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (106 kB)

---

### Step 1: Creating the Model Configuration (config.yaml)
A YAML configuration file (`config.yaml`) is created to define all parameters necessary for model training.

**Key components include:**
- **Data Paths:** Specifies the locations of subworded training and validation files.  
- **Subword Models:** References the previously trained SentencePiece models for English and Telugu.  
- **Vocabularies:** Built directly from the subworded datasets to ensure token alignment.  
- **Logging and Checkpoints:** Defines where model checkpoints and training logs are saved.  
- **Training Parameters:** Controls training behavior, including:
  - Number of training steps and validation intervals  
  - Early stopping criteria  
  - Optimizer (Adam), learning rate scheduling, and warmup steps  
- **Model Architecture:**  
  A small Transformer model used for debugging and pipeline validation:
  - 2 encoder layers and 2 decoder layers  
  - 4 attention heads  
  - Hidden and embedding size of 256  
  - Feed-forward network size of 1024  
  - Dropout of 0.2 for regularization  
- **Hardware Setup:**  
  Configured to run on **GPU** (`gpu_ranks: [0]`) for faster computation and validation of GPU-based training.

This configuration provides a minimal but complete setup suitable for verifying the correctness of the data, model, and training pipeline.
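
To give a feel for how small this architecture is, the following sketch works through two pieces of arithmetic: the per-head dimension and a rough weight count for the encoder and decoder layers. It is an estimate under stated assumptions (biases, layer norms, and embeddings are ignored), not the parameter count OpenNMT-py reports. The `TinyTransformerSpec` class is a hypothetical helper introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyTransformerSpec:
    """Hypothetical container for the hyperparameters listed above."""

    enc_layers: int = 2
    dec_layers: int = 2
    heads: int = 4
    hidden_size: int = 256
    transformer_ff: int = 1024

    def head_dim(self) -> int:
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.heads != 0:
            raise ValueError("hidden_size must be divisible by heads")
        return self.hidden_size // self.heads

    def approx_layer_params(self) -> int:
        """Rough weight count for all layers (no biases, norms, or embeddings)."""
        d, ff = self.hidden_size, self.transformer_ff
        attention = 4 * d * d      # Q, K, V, and output projections
        feed_forward = 2 * d * ff  # two linear maps: d -> ff -> d
        encoder_layer = attention + feed_forward
        decoder_layer = 2 * attention + feed_forward  # self- plus cross-attention
        return self.enc_layers * encoder_layer + self.dec_layers * decoder_layer


spec = TinyTransformerSpec()
print(spec.head_dim())             # 64
print(spec.approx_layer_params())  # 3670016
```

The embedding tables come on top of this and depend on the vocabulary sizes, which are not fixed in this section.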


In [None]:
config = '''# config.yaml

## Where the samples will be written
save_data: NMT_small

# Training files
data:
    corpus_1:
        path_src: dataset/train_test_split/en-te.en.subword.train
        path_tgt: dataset/train_test_split/en-te.te.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: dataset/train_test_split/en-te.en.subword.dev
        path_tgt: dataset/train_test_split/en-te.te.subword.dev
        transforms: [filtertoolong]

# Vocabulary files (you can still include .vocab files if already built)
src_vocab: models/SW_model/source.vocab
tgt_vocab: models/SW_model/target.vocab

# Subword models (SentencePiece)
src_subword_model: models/SW_model/source.model
tgt_subword_model: models/SW_model/target.model

# Logging and output
log_file: train_small.log
save_model: models/small_model.en-te

# Early stopping
early_stopping: 2

# Training parameters
train_steps: 1000
valid_steps: 500
save_checkpoint_steps: 500
report_every: 50

seed: 3435

# Device setup (single GPU)
world_size: 1
gpu_ranks: [0]   # GPU 0; leave empty ([]) to run on CPU
model_dtype: "fp32"

# Batching (small batches for the dummy run)
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 512
num_workers: 0

# Optimization
optim: "adam"
learning_rate: 2.0
decay_method: "noam"
warmup_steps: 400
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model (lightweight Transformer)
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 2
dec_layers: 2
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 1024
dropout: [0.2]
attention_dropout: [0.2]
self_attn_type: scaled-dot
# Filter out long sequences
src_seq_length: 100
tgt_seq_length: 100
'''
# Write the configuration to disk so the OpenNMT-py CLI tools can read it
with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(config)
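
The combination of `learning_rate: 2.0`, `decay_method: "noam"`, and `warmup_steps: 400` can look surprising, because 2.0 is a scale factor rather than the effective learning rate. The sketch below assumes the standard Noam formula (linear warmup followed by inverse-square-root decay, scaled by the model size) and uses the values from this config as defaults. The function name `noam_lr` is illustrative.

```python
def noam_lr(step, learning_rate=2.0, hidden_size=256, warmup_steps=400):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    scale = hidden_size ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 100, 400, 1000):
    print(step, f"{noam_lr(step):.6f}")
```

Under these assumptions the rate peaks at step 400 at 2.0 × 256^-0.5 × 400^-0.5 = 0.00625 and decays afterwards, so most of the 1000-step dummy run happens near the peak.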


### Step 2: Building the Vocabulary
Before training begins, OpenNMT-py builds the source and target vocabularies from the subworded training data.  
This step:
- Collects the unique subword tokens in the English and Telugu training corpora.  
- Maps them to numerical indices for the model's embedding layers.  
- Writes vocabulary files whose entries match the SentencePiece segmentation.

A consistent vocabulary ensures that subword tokens used during training and inference match correctly.
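
Conceptually, the step above amounts to counting whitespace-separated subword tokens and assigning each one an index. The following is a minimal sketch of that idea, not OpenNMT-py's implementation. The `build_vocab` helper, the sample sentences, and the list of special tokens are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(lines, specials=("<unk>", "<blank>", "<s>", "</s>")):
    """Count subword tokens and map them to indices (special tokens first)."""
    counts = Counter(token for line in lines for token in line.split())
    # Most frequent first; ties are broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    token_to_id = {token: index for index, token in enumerate(specials)}
    for token, _ in ordered:
        token_to_id.setdefault(token, len(token_to_id))
    return counts, token_to_id


# Two made-up, already-subworded sentences ("▁" marks a word start).
sample = ["▁the ▁cat ▁s at", "▁the ▁dog"]
counts, token_to_id = build_vocab(sample)
print(counts.most_common(1))
print(token_to_id["▁the"])
```

Because the tokens come straight from the subworded files, every piece the SentencePiece model can emit for this data has an index, which is the consistency the paragraph above refers to.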

In [None]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build the vocabulary from all segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 5, in <module>
    from onmt.bin.build_vocab import main
  File "/usr/local/lib/python3.10/dist-packages/onmt/__init__.py", line 2, in <module>
    import onmt.inputters
  File "/usr/local/lib/python3.10/dist-packages/onmt/inputters/__init__.py", line 7, in <module>
    from onmt.inputters.text_utils import text_sort_key, process, numericalize, tensorify
  File "/usr/local/lib/python3.10/dist-packages/onmt/inputters/text_utils.py", line 1, in <module>
    import torch
  File "
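
The message above appears when an extension module compiled against NumPy 1.x is imported under NumPy 2.x, and the message itself suggests downgrading to `numpy<2`. A minimal pre-flight check along those lines is sketched below. The helper names are illustrative, and whether a pin is actually required depends on the installed builds of the other packages.

```python
from importlib import metadata


def numpy_major(version):
    """Extract the major version number from a string such as '2.2.6'."""
    return int(version.split(".")[0])


def needs_numpy_pin(version):
    """Return True if this NumPy version is 2.x or newer."""
    return numpy_major(version) >= 2


try:
    installed = metadata.version("numpy")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("NumPy is not installed")
elif needs_numpy_pin(installed):
    print(f"NumPy {installed} detected; consider: pip install 'numpy<2'")
else:
    print(f"NumPy {installed} is a 1.x release")
```

After changing the NumPy version in Colab, the runtime usually has to be restarted before the new version is picked up.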

In [None]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-dc8b03ec-63fc-f820-42f0-7fc2850acf58)
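
The same check can be done from Python, which makes it easy to fail early when no GPU is attached. This sketch parses the `nvidia-smi -L` output format shown above. The helper names are illustrative, and it returns an empty list when `nvidia-smi` is not available.

```python
import re
import shutil
import subprocess

# Matches lines such as "GPU 0: Tesla T4 (UUID: GPU-...)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)$")


def parse_gpu_line(line):
    """Return (index, name, uuid) for one `nvidia-smi -L` line, or None."""
    match = GPU_LINE.match(line.strip())
    if match is None:
        return None
    index, name, uuid = match.groups()
    return int(index), name, uuid


def list_gpus():
    """List visible GPUs; empty if `nvidia-smi` is missing or reports none."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    parsed = (parse_gpu_line(line) for line in result.stdout.splitlines())
    return [gpu for gpu in parsed if gpu is not None]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected")
```

The index in the first field is the one that `gpu_ranks: [0]` in `config.yaml` refers to.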


### Step 3: Training the Dummy Model on GPU
The model is then trained using the configuration file on a **GPU-enabled environment**.  
This small-scale “dummy” training serves to:
- Validate the correctness of data paths, vocabularies, and subword models.  
- Ensure GPU resources are being utilized correctly.  
- Confirm that the Transformer model initializes and updates as expected.

During training:
- Periodic validation checks monitor model performance.  
- Logs and checkpoints are created at defined intervals (e.g., every 500 steps).  
- The run ends after 1000 steps (`train_steps: 1000`), which is enough to confirm that all components function as intended.

This process verifies the end-to-end workflow, ensuring the setup is ready for full-scale training with more training steps and larger datasets.
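
One quick way to confirm that the run produced what it should is to list the checkpoint files it is expected to write. The sketch below assumes the `<save_model>_step_<N>.pt` naming seen in the checkpoint name used in Step 4, together with the `train_steps` and `save_checkpoint_steps` values from `config.yaml`. The helper `expected_checkpoints` is illustrative.

```python
from pathlib import Path


def expected_checkpoints(save_model, train_steps, save_checkpoint_steps):
    """Checkpoint paths a run should produce, one per save interval."""
    steps = range(save_checkpoint_steps, train_steps + 1, save_checkpoint_steps)
    return [f"{save_model}_step_{step}.pt" for step in steps]


checkpoints = expected_checkpoints("models/small_model.en-te", 1000, 500)
for path in checkpoints:
    status = "found" if Path(path).exists() else "missing"
    print(f"{path}: {status}")
```

A missing final checkpoint usually means training stopped early or never started, in which case `train_small.log` is the first place to look.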


In [None]:
!onmt_train -config config.yaml



A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 5, in <module>
    from onmt.bin.train import main
  File "/usr/local/lib/python3.10/dist-packages/onmt/__init__.py", line 2, in <module>
    import onmt.inputters
  File "/usr/local/lib/python3.10/dist-packages/onmt/inputters/__init__.py", line 7, in <module>
    from onmt.inputters.text_utils import text_sort_key, process, numericalize, tensorify
  File "/usr/local/lib/python3.10/dist-packages/onmt/inputters/text_utils.py", line 1, in <module>
    import torch
  File "/usr/local/l

### Step 4: Exporting the Model with CTranslate2
Once the model is trained successfully, it is exported to **CTranslate2 format** for optimized inference.

**CTranslate2** is a fast inference engine for Transformer models, including those trained with OpenNMT-py, and supports both CPU and GPU inference.  
It enables:
- Faster translation speeds  
- Lower memory usage  
- Easy deployment in production environments  

The export process involves:
1. Installing the `ctranslate2` package (if not already installed).  
2. Converting the trained OpenNMT checkpoint (`small_model.en-te_step_1000.pt`) into a deployable CTranslate2 model directory.  

The converted model is stored in `models/ctranslate2_model/`, ready for efficient translation inference using GPU acceleration.


In [None]:
# Install CTranslate2 if it is not already installed
!pip install ctranslate2

# Convert the trained OpenNMT-py checkpoint to the CTranslate2 format
!ct2-opennmt-py-converter \
    --model_path models/small_model.en-te_step_1000.pt \
    --output_dir models/ctranslate2_model


In [None]:
!pip install ctranslate2 sentencepiece
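
With both packages installed, the exported model can be used for translation roughly as follows. This is a sketch under stated assumptions: it uses the model directory and SentencePiece paths named earlier in this notebook, the helper names `translate` and `load_pipeline` are illustrative, and the beam size is an arbitrary choice. Since the model was trained on pre-subworded text, the input is segmented explicitly before translation and the output is merged back afterwards.

```python
def translate(texts, translator, sp_source, sp_target, beam_size=4):
    """Translate raw sentences with a CTranslate2 model and SentencePiece models."""
    # Segment each sentence into subword pieces, as was done for the training data.
    batch = [sp_source.encode(text, out_type=str) for text in texts]
    results = translator.translate_batch(batch, beam_size=beam_size)
    # Keep the best hypothesis for each input and merge the pieces back into text.
    return [sp_target.decode(result.hypotheses[0]) for result in results]


def load_pipeline(model_dir="models/ctranslate2_model",
                  source_model="models/SW_model/source.model",
                  target_model="models/SW_model/target.model",
                  device="cuda"):
    """Load the converted model and both SentencePiece models."""
    # Imported lazily so `translate` can be reused without these packages.
    import ctranslate2
    import sentencepiece as spm

    translator = ctranslate2.Translator(model_dir, device=device)
    sp_source = spm.SentencePieceProcessor(model_file=source_model)
    sp_target = spm.SentencePieceProcessor(model_file=target_model)
    return translator, sp_source, sp_target


# Example usage (requires the converted model and a GPU runtime):
# translator, sp_source, sp_target = load_pipeline()
# print(translate(["Hello, how are you?"], translator, sp_source, sp_target))
```

Passing `device="cpu"` to `load_pipeline` runs the same code without a GPU, which can be handy for a quick smoke test.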
