<a href="https://colab.research.google.com/github/Karabi-codehub/Machine-Translation-project-En_to_BN-/blob/main/Capstone_Project_Machine_Translation(English_to_Bangla)_using_Pretrained_Model_with_code_drescription.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<span style="color:green; font-size:18px; font-weight:500;">
Objectives:

1. End-to-end machine translation training pipeline   
2. Fine-tune a pre-trained model for the custom dataset
 </span>

<span style="color:#dc2626; font-size:24px; font-weight:700;">
Project 3 : Machine Translation using Pretrained Model
 </span>



<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Fine-Tuning
 </span>

**Fine-tuning** means taking a **pre-trained model** (already trained on a large general dataset) and then training it a little more on your **custom dataset** so it can adapt to your specific task.  

👉 **Example:**  
A model trained on millions of English–French sentences can already *translate*.  
If you fine-tune it on **medical text translations**, it will become better at translating **medical terms**.  

**✅ Why We Use Fine-Tuning**
- **Saves time & resources** – Training from scratch needs huge data and computation. Fine-tuning reuses the pre-trained model’s knowledge.  
- **Better accuracy** – The model already understands general language patterns; fine-tuning adapts it to your domain (e.g., legal, medical, tech).  
- **Works with smaller datasets** – You don’t need millions of samples; a smaller domain-specific dataset is enough.  
- **Custom specialization** – Makes the model good at your *specific* task (translation, sentiment, Q&A, etc.) instead of general tasks only.  

👉 **In short:**  
**Fine-tuning = faster, cheaper, and more accurate way to make AI models work for your needs.**


<span style="color:Navy; font-size:20px; font-weight:600;">
Package installation
</span>

In [None]:
!pip install -q pytorch-lightning > /dev/null 2>&1

In [None]:
"""
-For type hints
-Any means a variable or return type can be any data type (string, int, tensor, etc.).
"""
from typing import Any

"""
# Install MLflow (if not already installed or to upgrade to latest)
# The exclamation mark (!) runs a shell command from the notebook cell
"""
!pip install -U mlflow


"""
# Import MLflow into Python
"""
import mlflow

"""
# Import the infer_signature utility from MLflow
# This is useful for automatically inferring input/output schema
# when logging models
"""
from mlflow.models import infer_signature



"""
-Defines step output types for Lightning training
-In PyTorch Lightning, some functions (like training_step, validation_step, test_step) return outputs that can be different types
-Lightning gives you a ready-made type alias called STEP_OUTPUT.
           -STEP_OUTPUT = shorthand type hint for “whatever the training/validation/test step is allowed to return in Lightning.”
           -PyTorch Lightning makes writing and training deep learning models easier and cleaner.
           -It handles things like training loops, validation, logging, GPU/TPU support, etc., so you don’t have to write them manually.
"""
!pip install pytorch-lightning
from pytorch_lightning.utilities.types import STEP_OUTPUT


"""
-High-level training framework to simplify PyTorch code
"""
import pytorch_lightning as pl


"""
-Core PyTorch library for tensors and computations
"""
import torch


"""
-Build neural network layers and models
"""
import torch.nn as nn


"""
-Handle custom datasets and batch loading
"""
from torch.utils.data import Dataset, DataLoader


"""
-Load and preprocess dataset (CSV, Excel, etc.)
"""
import pandas as pd


"""
-BLEUScore measures how good a machine translation is by comparing it to reference translations.
-BLEU (Bilingual Evaluation Understudy): a metric that checks how closely your model’s translations match the correct/reference translations.
-torchmetrics.text.BLEUScore: makes it easy to calculate BLEU in PyTorch projects.
-In short: BLEU tells you how closely your translations match the references (higher is better).
"""
from torchmetrics.text import BLEUScore


"""
-AutoTokenizer: Converts text into numbers (tokens) the model can understand.
-AutoModelForSeq2SeqLM: Pretrained sequence-to-sequence model for tasks like translation or summarization.
-Tokenizer prepares text, Model translates or generates text.
-Install NLP-related libraries silently (transformers, datasets, sentencepiece)
- %pip is the preferred magic in Jupyter notebooks (keeps it scoped to the current kernel)
"""
%pip install -q transformers datasets sentencepiece

"""
-Import the tokenizer and model classes from Hugging Face Transformers
-(AutoTokenizer is used to load the tokenizer below; T5Tokenizer is used again in the Gradio cell)
"""
from transformers import AutoTokenizer, T5Tokenizer, AutoModelForSeq2SeqLM

"""
 Install/update additional tools:
- mlflow: for experiment tracking
- pyngrok: to expose local web apps to the internet (useful for Gradio)
- pytorch-lightning: for easier training loops
- gradio: to create quick web interfaces
- sacremoses: tokenizer/preprocessing library often used with Hugging Face models
"""
!pip install -U mlflow pyngrok pytorch-lightning gradio sacremoses


<span style="color:Navy; font-size:20px; font-weight:600;">
CPU to GPU Transform
</span>

**Why GPU is preferred:**

- CPU is slower for deep learning, especially with large models or datasets.
- Neural network training and large tensor operations are much faster on GPU.



In [None]:
"""
-torch.cuda.is_available() → checks if your computer has a GPU ready.
-"cuda" → runs computations on GPU (faster).
-"cpu" → runs on the processor if no GPU is available.
-torch.device(...) → tells PyTorch where to run your model and tensors.
"""
# Chooses GPU (cuda) if available, otherwise CPU, for running your PyTorch model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

<span style="color:Navy; font-size:20px; font-weight:600;">
Task: English to Bangla
</span>

<span style="color:blue; font-size:20px; font-weight:600;">
👉 Pretrained model initialization
</span>

In [None]:
mt_pretrained_model_name = "csebuetnlp/banglat5_nmt_en_bn"

<span style="color:blue; font-size:20px; font-weight:600;">
👉 Pretrained tokenizer and model initialization
</span>

<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Tokenizer & Model in NLP
 </span>

For NLP tasks, we usually need **two main parts**:

**1️⃣ Tokenizer**

* **What it does:** Converts text (words/sentences) into numbers (tokens) the model can understand.
* **Example:**

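  * Sentence: `"I love AI"`
  * Tokenizer → `[101, 1045, 2293, 9937, 102]` (illustrative IDs; the actual values depend on the tokenizer's vocabulary)
  * The words map to the IDs in the middle. Spaces do not get their own IDs, and many tokenizers also add special marker tokens (for example, start and end markers, as with the first and last IDs here).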

👉 Think of it as a **translator from text → numbers**.

**2️⃣ Model**

* **What it does:** Takes those numbers (tokens) and predicts or generates new tokens (output).
* **Example (Translation task):**

  * Input tokens = `"I love AI"`
  * Model output tokens = `"J'adore l'IA"` (French translation)

👉 Think of it as the **brain that learns and generates answers**.

* `tokenizer` → prepares the text.
* `model` → does the actual task (e.g., translation, summarization, etc.).

**Summary:**

* **Tokenizer = Text → Numbers**
* **Model = Numbers → Predictions/Text**
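
The text → numbers → text round trip can be sketched with a toy word-level tokenizer. The vocabulary, IDs, and function names below are made up for illustration. The pretrained tokenizer loaded in this notebook works on subword pieces with a much larger vocabulary, so its IDs will differ.

```python
# Toy word-level tokenizer (hypothetical vocabulary, for illustration only).
TOY_VOCAB = {"<pad>": 0, "<unk>": 1, "i": 2, "love": 3, "ai": 4}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def toy_encode(text):
    """Lowercase, split on whitespace, and map each word to its ID."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]

def toy_decode(ids):
    """Map IDs back to words, skipping padding."""
    return " ".join(ID_TO_TOKEN[i] for i in ids if i != TOY_VOCAB["<pad>"])

ids = toy_encode("I love AI")
print(ids)              # [2, 3, 4]
print(toy_decode(ids))  # i love ai
```

Words missing from the vocabulary fall back to the `<unk>` ID, which is one reason real tokenizers prefer subword units.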



In [None]:
tokenizer = AutoTokenizer.from_pretrained(mt_pretrained_model_name)
mt_pretrained_model = AutoModelForSeq2SeqLM.from_pretrained(mt_pretrained_model_name)

"""
✅ AutoModelForSeq2SeqLM = Automatic Model for Sequence-to-Sequence Language Modeling

Let’s break it down:

AutoModel → “Auto” means it automatically picks the right pretrained model architecture (you just give the model name,
             e.g., "t5-small", "facebook/mbart-large-50", etc.).
Seq2Seq → stands for Sequence-to-Sequence (input sequence → output sequence).
          Used in tasks like translation, summarization, and text generation.
"""

<span style="color:blue; font-size:20px; font-weight:700;">👉 Data Pipeline </span>


<span style="color:Tomato; font-size:18px; font-weight:700;">Class 1 (Dataset):</span> Custom **PyTorch Dataset** Class for Machine Translation (English → Bangla).  

<span style="color:Tomato; font-size:18px; font-weight:700;">Class 2 (Module):</span> **PyTorch Lightning DataModule** for Machine Translation (English → Bangla).  


<span style="color:Tomato; font-size:18px; font-weight:700;">Tokenizer:</span>
Converts text into **token IDs** (numbers).  

<span style="color:Tomato; font-size:18px; font-weight:700;">Dataset class (MTDataset):</span> Uses the tokenizer to prepare **samples (English → Bangla)**.  

<span style="color:Tomato; font-size:18px; font-weight:700;">DataModule class (MTDataModule):</span> Wraps the **train/val/test datasets + DataLoaders**.  
  


<span style="color:darkslateblue; font-size:18px; font-weight:600;">
    
✔️ Class 1 (Dataset):Custom PyTorch Dataset Class for Machine Translation (English → Bangla).  
→ Reads CSV and uses **Tokenizer** to convert text → tokens.

</span>


**Sentence: How are you, dude?**

**Tokens:** 'How', 'are', 'you', 'dude?'

**ids:** 125, 14, 145, 78

**max_length =** 3

**ids: [125, 14, 145]**

**MTDataset**
- Loads data from a **CSV file** (your custom dataset).
- Each row has:  
  - **en** → English text (source sentence)  
  - **bn** → Bangla text (target sentence)  
- Uses a **tokenizer** to convert text → tokens (numbers).  
- Returns:
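  - `src_input_ids` → token IDs for the English sentence  
  - `src_attention_mask` → marks which English positions are real tokens (1) and which are padding (0)  
  - `tgt_input_ids` → token IDs for the Bangla sentence  
  - `tgt_attention_mask` → the same kind of mask for the Bangla sentence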
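
What `max_length`, `padding`, and `truncation` do to one sentence can be sketched in plain Python. This is only an illustration of the idea, not the Hugging Face implementation; `pad_or_truncate` is a hypothetical helper, and a padding ID of 0 is an assumption.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # number of pads needed
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids = [125, 14, 145, 78]  # "How are you, dude?" with the illustrative IDs used above
print(pad_or_truncate(ids, max_length=3))  # ([125, 14, 145], [1, 1, 1])
print(pad_or_truncate(ids, max_length=7))  # ([125, 14, 145, 78, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0])
```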


In [None]:
"""A custom dataset class for Machine Translation (MT)."""
#MTDataset → Loads data from a CSV file (your custom dataset).
MAX_LENGTH = 128
class MTDataset(Dataset):
    def __init__(self, csv_file): # __init__: called when we create the dataset object.
        self.data=pd.read_csv(csv_file) #loads the CSV file
    def __len__(self): # total number of samples in the dataset.
        return len(self.data)
    def __getitem__(self,idx): # __getitem__: fetches one sample (source + target) by index.
        src_text = str(self.data.iloc[idx]['en']) # Source text (English) from column 'en'
        tgt_text = str(self.data.iloc[idx]['bn']) # Target text (Bangla) from column 'bn'
        src_encoding = tokenizer(
            src_text,               # Input sentence (English)
            max_length=MAX_LENGTH,  # Maximum number of tokens (fixed-size input)
            padding='max_length',   # Pad shorter sentences up to MAX_LENGTH
            truncation=True,        # Cut off text longer than MAX_LENGTH
            return_tensors='pt'     # Return PyTorch tensors instead of plain lists
        )

        tgt_encoding = tokenizer(
            tgt_text,               # Target sentence (Bangla) that the model should generate
            max_length=MAX_LENGTH,  # Fixed sequence length (here, 128 tokens)
            padding='max_length',   # Add pad tokens until the sequence is MAX_LENGTH long
            truncation=True,        # Cut off extra tokens beyond MAX_LENGTH
            return_tensors='pt'     # Return PyTorch tensors, ready for training
        )
        return {
            'src_input_ids': src_encoding['input_ids'].squeeze(0),            # Token IDs for source (English) sentence
            'src_attention_mask': src_encoding['attention_mask'].squeeze(0),  # Mask for source (1 = real token, 0 = padding)
            'tgt_input_ids': tgt_encoding['input_ids'].squeeze(0),            # Token IDs for target (Bangla) sentence
            'tgt_attention_mask': tgt_encoding['attention_mask'].squeeze(0)   # Mask for target
        }

"""
example: How are you, dude?
input_ids: 125, 14, 145, 78
max_length = 7
input_ids: [125, 14, 145, 78, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 0, 0, 0]  (the mask tells the model to ignore the [PAD] positions)

"""


<span style="color:blue; font-size:18px; font-weight:600;">
✔️ Class 2 (DataModule):PyTorch Lightning DataModule for Machine Translation (English → Bangla).  
→ Organizes train/val/test datasets and provides DataLoaders (no tokenization here).

</span>

<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 What the DataModule does here </span>

* Think of it as a **data manager**  for PyTorch Lightning.
* It organizes your data into **train**, **validation**, and **test** sets.
* It tells PyTorch **how to load the data** in batches.

**In your case (`MTDataModule`):**

1. **setup()** → Reads CSV files and creates datasets (`MTDataset`).
2. **train\_dataloader()** → Gives batches of training data (shuffled).
3. **val\_dataloader()** → Gives batches of validation data (not shuffled).
4. **test\_dataloader()** → Gives batches of test data (not shuffled).

👉 In short:
**DataModule = One place that prepares and serves the data for training, validation, and testing.** 🚀



**MTDataModule**
- Wraps **datasets** + **dataloaders** together (so Lightning can use them easily).  

Functions:
- `setup()` → creates train, validation, and test datasets.  
- `train_dataloader()` → returns training data in batches (shuffled).  
- `val_dataloader()` → returns validation data in batches (not shuffled).  
- `test_dataloader()` → returns test data in batches (not shuffled).

**data_module = MTDataModule(...)**
- Creates a **DataModule object** with:
  - train CSV  
  - val CSV  
  - test CSV  
  - batch size (e.g., 32 samples per batch)
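
How a loader turns a dataset into batches can be sketched without PyTorch. `make_batches` is a hypothetical helper that only mimics the grouping and shuffling behaviour; the real `DataLoader` also stacks tensors and can load data in parallel.

```python
import math
import random

def make_batches(num_samples, batch_size, shuffle, seed=0):
    """Group sample indices into batches, optionally shuffling first."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]

val_batches = make_batches(num_samples=10, batch_size=4, shuffle=False)
train_batches = make_batches(num_samples=10, batch_size=4, shuffle=True)

print(val_batches)                                 # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(train_batches) == math.ceil(10 / 4))     # True: 3 batches, the last one smaller
```

Shuffling changes which samples share a batch, but every sample still appears exactly once per epoch.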

In [None]:
# DataModule definition
class MTDataModule(pl.LightningDataModule):
    def __init__(self, train_csv, val_csv, test_csv, batch_size=32):
        super().__init__() # Call parent LightningDataModule __init__
        # Save CSV file paths
        self.train_csv = train_csv
        self.val_csv = val_csv
        self.test_csv = test_csv
        # Save batch size (how many samples per batch)
        self.batch_size = batch_size

      # This method prepares datasets for train, val, test
    def setup(self, stage=None):
        # Create dataset objects using the CSV paths
        self.train_dataset = MTDataset(self.train_csv)
        self.val_dataset = MTDataset(self.val_csv)
        self.test_dataset = MTDataset(self.test_csv)

    # DataLoader for training (shuffle=True so model sees random data order every epoch)
    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True
        )

    # DataLoader for validation (shuffle=False so order is fixed)
    def val_dataloader(self):
        return DataLoader(
            self.val_dataset,
            batch_size=self.batch_size,
            shuffle=False
        )

    # DataLoader for testing (also no shuffle)
    def test_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False
        )

#  Create DataModule object
data_module = MTDataModule(
    train_csv='train.csv',   # path to training data CSV
    val_csv='val.csv',       # path to validation data CSV
    test_csv='test.csv',     # path to testing data CSV
    batch_size=32            # how many samples per batch
)

<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Machine Translation Data Flow (with PyTorch Lightning)
</span>

**1. CSV File (train.csv, val.csv, test.csv)**
- **Used for:** Storing raw bilingual text (English + Bangla).
- **Why:** Keeps data organized and easy to load.

**2. `MTDataset`**
- **Used for:** Reading a CSV file and preparing samples.
- **Why:** Converts each row (English → Bangla) into tokenized input/output tensors so the model can understand.

**3. `MTDataModule`**
- **Used for:** Combining all datasets (train/val/test) + making DataLoaders.
- **Why:** PyTorch Lightning expects a `DataModule` to organize and feed data consistently.

**4. ⚡ `setup()`**
- **Used for:** Creating train, validation, and test datasets.
- **Why:** Prepares data once so loaders can fetch it anytime.

**5. `train_dataloader()`**
- **Used for:** Returning batches of training data.
- **Why:** The model learns from shuffled mini-batches to generalize better.

**6. `val_dataloader()`**
- **Used for:** Returning validation data batches.
- **Why:** Check how well the model is performing on unseen data (no shuffle).

**7. `test_dataloader()`**
- **Used for:** Returning test data batches.
- **Why:** Evaluate final model performance on completely unseen data.

**8. `data_module = MTDataModule(...)`**
- **Used for:** Creating the actual data module object with paths + batch size.
- **Why:** This object is given to the Lightning `Trainer` so it knows **where and how to get the data**.

✅ In short:
- **MTDataset** = how to read one CSV file into tokenized samples.  
- **MTDataModule** = organizes multiple datasets + loaders for training/validation/testing.  
- **data_module** = the actual object you will give to your Lightning Trainer.
  
✅ **CSV → Dataset → DataModule → DataLoader → Model.**

<span style="color:blue; font-size:20px; font-weight:700;">👉 Model </span>

<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Flowchart: Working Process of MTModel
</span>

A[Start Training / Validation / Testing] --> B[Load Pretrained Seq2Seq Model + Tokenizer]

B --> C[Input English Sentence → Tokenizer → src_input_ids, src_attention_mask]

C --> D[Input Bangla Sentence (Target) → Tokenizer → tgt_input_ids, tgt_attention_mask]


D --> E[Forward Pass (self.forward)]

E --> F[Encoder Processes English Input]

F --> G[Decoder Processes Bangla Input (Teacher Forcing: tgt_input_ids shifted right)]


G --> H[Model Outputs Logits (Predicted Token Probabilities)]

H --> I[Compute CrossEntropy Loss (ignore PAD tokens)]


I --> J{Stage?}

J -->|train| K[Log Train Loss → Backpropagation → Optimizer Update (AdamW)]

J -->|val/test| L[Compute Predictions → Decode to Text → BLEU Score]

L --> M[Log Validation/Test Loss + BLEU Score]


K --> N[Learning Rate Scheduler (Cosine Annealing)]

M --> N

N --> O[Repeat for Next Batch/Epoch]

O --> P[End Training/Validation/Testing]
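
The teacher-forcing shift and the PAD-aware loss from the flowchart can be sketched in plain Python. The token IDs, the 8-token vocabulary, and both helper functions are made up for illustration. Note that `nn.CrossEntropyLoss` takes raw logits, whereas this sketch takes probabilities directly to keep the arithmetic visible.

```python
import math

PAD_ID = 0  # assumed padding ID for this sketch

def shift_for_teacher_forcing(tgt_ids):
    """Decoder sees tokens 0..n-2 and must predict tokens 1..n-1."""
    return tgt_ids[:-1], tgt_ids[1:]

def masked_cross_entropy(step_probs, labels, pad_id=PAD_ID):
    """Average -log p(correct token) over non-padding positions only."""
    losses = [-math.log(probs[label])
              for probs, label in zip(step_probs, labels)
              if label != pad_id]
    return sum(losses) / len(losses)

tgt_ids = [7, 3, 5, 0, 0]  # made-up target IDs, padded to length 5
decoder_input, labels = shift_for_teacher_forcing(tgt_ids)
print(decoder_input)  # [7, 3, 5, 0]
print(labels)         # [3, 5, 0, 0]

# Made-up predicted distributions over an 8-token vocabulary, one per position.
uniform = [1 / 8] * 8
step_probs = [uniform, uniform, uniform, uniform]
loss = masked_cross_entropy(step_probs, labels)
print(round(loss, 4))  # 2.0794, i.e. ln(8); the two PAD labels are skipped
```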


In [None]:
# Machine Translation Model
class MTModel(pl.LightningModule):
    def __init__(self, learning_rate=2e-5):
        super().__init__()

        # Load a pretrained Seq2Seq model
        self.model = AutoModelForSeq2SeqLM.from_pretrained(mt_pretrained_model_name)

        # Load the tokenizer for the same model
        # This converts text ↔ tokens (numbers).
        self.tokenizer = AutoTokenizer.from_pretrained(mt_pretrained_model_name)

        # Learning rate (small by default because we are fine-tuning a pretrained model);
        # it is a constructor argument so that MTModel(learning_rate=...) works
        self.learning_rate = learning_rate

        # Define loss function (CrossEntropyLoss)
        # "ignore_index=pad_token_id" means padding tokens won't be counted in loss.
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=self.tokenizer.pad_token_id)

        # Define evaluation metric (BLEU Score)
        # BLEU checks how close translations are to the target sentences.
        self.bleu = BLEUScore()

    # Forward pass: how the model processes one batch of data
    def forward(self,
                src_input_ids,        # English tokens
                src_attention_mask,   # Mask for English (ignore PAD tokens)
                tgt_input_ids,        # Bangla tokens
                tgt_attention_mask    # Mask for Bangla
        ):
        #Call the underlying HuggingFace seq2seq model
        outputs = self.model(
            input_ids=src_input_ids,                # Source sentence (English)
            attention_mask=src_attention_mask,      # Mask for English
            decoder_input_ids=tgt_input_ids[:, :-1],# Decoder input: target without its last token (teacher forcing)
            decoder_attention_mask=tgt_attention_mask[:, :-1] # Mask for Bangla
        )
        return outputs   # Contains logits (unnormalized scores for each vocabulary token)

    # Training step: runs for every batch during training
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch, batch_idx, 'train')   # Compute loss
        self.log('train_loss', loss, prog_bar=True)           # Log train loss on progress bar
        return loss

    # Validation step: runs for every batch of validation data
    def validation_step(self, batch, batch_idx):
        loss = self.compute_loss(batch, batch_idx, 'val')     # Compute validation loss
        self.log('val_loss', loss, prog_bar=True)             # Log validation loss
        return loss

    # Test step: runs for every batch of test data
    def test_step(self, batch, batch_idx):
        loss = self.compute_loss(batch, batch_idx, 'test')    # Compute test loss
        self.log('test_loss', loss, prog_bar=True)            # Log test loss
        return loss

    # Optimizer + Scheduler setup
    def configure_optimizers(self):

        # Use AdamW optimizer (works well with transformers)
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)

        # Use learning rate scheduler (Cosine Annealing: decreases LR smoothly)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=10   # Number of scheduler steps (epochs by default) for the LR to decay to its minimum
        )

        return {'optimizer': optimizer, 'lr_scheduler': scheduler}

    # ⚡ Compute loss + BLEU (shared by train/val/test)
    def compute_loss(self, batch, batch_idx, stage):
        # Unpack batch
        src_input_ids = batch['src_input_ids']
        src_attention_mask = batch['src_attention_mask']
        tgt_input_ids = batch['tgt_input_ids']
        tgt_attention_mask = batch['tgt_attention_mask']

        # Forward pass through model
        outputs = self(
            src_input_ids,
            src_attention_mask,
            tgt_input_ids,
            tgt_attention_mask
        )

        # Get the predicted token logits (unnormalized scores before softmax)
        logits = outputs.logits

        # Compute CrossEntropy loss
        # Position i of the logits is scored against target token i+1 (next-token prediction)
        loss = self.loss_fn(
            logits.view(-1, logits.size(-1)), # Predictions: flatten over all positions
            tgt_input_ids[:, 1:].contiguous().view(-1) # Targets: the target sequence without its first token
        )

        #If validation/test → also compute BLEU score
        if stage == 'val' or stage == 'test':
            preds = torch.argmax(logits, dim=-1)   # Pick highest probability tokens
            pred_texts = self.tokenizer.batch_decode(preds, skip_special_tokens=True)   # Convert to text
            tgt_texts = self.tokenizer.batch_decode(tgt_input_ids[:, 1:], skip_special_tokens=True)

            # Compute BLEU score (higher = better translation quality)
            bleu_score = self.bleu(pred_texts, [[tgt] for tgt in tgt_texts])

            # Log BLEU score to progress bar
            self.log(f'{stage}_bleu', bleu_score, prog_bar=True)

        return loss



<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Class Model (What & Why)</span>

#### Model & Tokenizer
- **AutoModelForSeq2SeqLM** → Pretrained translation model (already knows how to translate).
- **AutoTokenizer** → Converts text ↔ tokens (needed for model input/output).

#### Loss & Metrics
- **Loss function (CrossEntropyLoss)** → Tells model how wrong predictions are (ignores PAD tokens).
- **BLEU Score** → Measures how good the translations are (translation quality metric).
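
To make the BLEU idea concrete, here is a simplified sketch: clipped n-gram precision up to bigrams, a single reference, a brevity penalty, and no smoothing. It is not the `torchmetrics` implementation (which defaults to n-grams up to 4), and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of the sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of the n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(simple_bleu(reference, reference))            # 1.0
print(round(simple_bleu(candidate, reference), 4))  # 0.7071
```

In the second case 5 of 6 unigrams and 3 of 5 bigrams match, so the score is sqrt(5/6 × 3/5) ≈ 0.7071.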

#### Training Logic
- **training_step / validation_step / test_step** → Defines what happens in training, validation, and testing.
- **compute_loss** → Central function where **loss and BLEU** are calculated together.

#### Optimization
- **Optimizer (AdamW)** → Updates model weights during training.
- **Scheduler (CosineAnnealingLR)** → Smoothly adjusts learning rate (helps stable training).
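
The shape of the cosine schedule can be checked with its closed-form expression. The sketch below uses the notebook's default learning rate (2e-5) and `T_max=10`, and assumes a minimum learning rate of 0. PyTorch computes the schedule step by step, so treat this as an illustration of the curve rather than the library's exact code path.

```python
import math

def cosine_annealing_lr(step, base_lr=2e-5, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

for step in (0, 5, 10):
    print(step, cosine_annealing_lr(step))
# Step 0 gives the full 2e-5, step 5 gives about half of it (1e-5), and step 10 reaches the minimum.
```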


In [None]:
# Initialize Machine Translation model
model = MTModel()

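**Add Checkpoint and Early-Stopping Callbacks**

Why: To keep the best checkpoint and stop training once validation loss stops improving. MLflow tracking of parameters, metrics, and the model is set up in a later cell.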
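
The callbacks defined in the following cell follow a simple rule: remember the lowest validation loss seen so far, and stop after `patience` consecutive validation checks without improvement. The sketch below is a hypothetical plain-Python version of that rule with made-up loss values; it ignores options such as `min_delta` that the Lightning callbacks also support.

```python
def run_early_stopping(val_losses, patience=2):
    """Return (index of last check run, best loss, index of best check)."""
    best_loss = float("inf")
    best_idx = -1
    bad_checks = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:              # lower is better for a loss
            best_loss, best_idx = loss, idx
            bad_checks = 0                # an improvement resets the counter
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return idx, best_loss, best_idx   # stop here
    return len(val_losses) - 1, best_loss, best_idx

history = [2.1, 1.7, 1.5, 1.6, 1.55, 1.4]  # made-up validation losses
print(run_early_stopping(history))  # (4, 1.5, 2)
```

Training stops at check 4, so the later value 1.4 is never reached. The best checkpoint is the one from check 2.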

In [None]:
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
import os

# Stop training early if validation loss doesn't improve for 2 consecutive validation checks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=2,
    verbose=True
)

# Save only the best model: the checkpoint with the lowest val_loss
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    save_top_k=1,
    mode='min'  # lower loss is better; 'max' would keep the worst checkpoint
)

# Create a path to save the best model checkpoint
checkpoint_path = os.path.join(os.getcwd(), "checkpoints", "best_model.pth")

# Make sure the directory exists (exist_ok=True makes this safe to re-run)
os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)


<span style="color:blue; font-size:20px; font-weight:700;">👉 Train </span>

In [None]:
trainer = pl.Trainer(   # PyTorch Lightning Trainer: controls training/validation/testing loop
    max_epochs=3,       # Train the model for at most 3 full passes over the dataset
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
                        # Use GPU if available, otherwise fallback to CPU
    devices=1,          # Number of devices (GPUs or CPUs) to use (here = 1 device)
    precision=32,       # Full 32-bit precision (16 would enable mixed precision: faster, less memory on supported GPUs)
    log_every_n_steps=10,   # Log training metrics (loss etc.) every 10 steps
    val_check_interval=0.25, # Run validation 4 times per epoch (after every 25% of the data)
    callbacks = [checkpoint_callback, early_stopping]
)


<span style="color:Tomato; font-size:18px; font-weight:700;">
🔄 Trainer Parameters (What & Why) </span>

- **Trainer** → Automates the whole training loop (no need to manually write epochs, batches, backprop).

#### Training Duration
- **max_epochs** → How long to train (number of full dataset passes).

#### Hardware
- **accelerator + devices** → Pick GPU if available, else CPU (controls where training runs).

#### Precision
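- **precision** → Numeric precision. The Trainer above uses `precision=32` (full precision); `precision=16` would enable mixed precision (faster training, less memory usage on supported GPUs).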

#### Logging & Validation
- **log_every_n_steps** → How often logs are recorded.
- **val_check_interval** → How often validation runs (can be multiple times per epoch).
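
With a fractional `val_check_interval`, validation runs after a fixed share of the training batches. The helper below is a hypothetical approximation of that arithmetic, not Lightning's own code, and the batch count of 100 is made up.

```python
def validation_points(num_train_batches, val_check_interval):
    """Return the 1-based batch counts after which validation would run in one epoch."""
    every = max(1, int(num_train_batches * val_check_interval))
    return list(range(every, num_train_batches + 1, every))

print(validation_points(num_train_batches=100, val_check_interval=0.25))  # [25, 50, 75, 100]
print(validation_points(num_train_batches=100, val_check_interval=1.0))   # [100]
```

Because early stopping counts validation checks, `val_check_interval=0.25` with `patience=2` can stop training after half an epoch without improvement.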


In [None]:
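import torch

# Save the model weights. Run this cell after training (trainer.fit in the MLflow cell below);
# before training, the weights are still those of the pretrained model.
torch.save(model.state_dict(), 'mt_model_weights.pt')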

In [None]:
from google.colab import files
files.download('mt_model_weights.pt')

In [None]:
import mlflow
#MLflow Tracking

EPOCHS = 3  # Logged to MLflow; matches max_epochs=3 passed to the Trainer above
BATCH_SIZE = 16
LEARNING_RATE = 1e-5

mlflow.set_experiment("English-Bangla-Translation")




In [None]:
data_module = MTDataModule("train.csv", "val.csv", "test.csv", batch_size=BATCH_SIZE)
model = MTModel(learning_rate=LEARNING_RATE)

In [None]:
import mlflow
import mlflow.pytorch
from mlflow.models import infer_signature
from pyngrok import ngrok
import torch
import numpy as np

# ---------------- TRAIN + LOG ---------------- #
with mlflow.start_run() as run:
    mlflow.log_param("batch_size", BATCH_SIZE)
    mlflow.log_param("learning_rate", LEARNING_RATE)
    mlflow.log_param("epochs", EPOCHS)

    # Train & test
    trainer.fit(model=model, datamodule=data_module)
    evaluation_score = trainer.test(model=model, dataloaders=data_module.test_dataloader())
    mlflow.log_metric("test_loss", evaluation_score[0]['test_loss'])

    # Make sample input & output
    sample_batch = next(iter(data_module.test_dataloader()))
    sample_input = {
        'src_input_ids': sample_batch['src_input_ids'],
        'src_attention_mask': sample_batch['src_attention_mask']
    }

    with torch.no_grad():
        model.eval()
        sample_output = model(
            sample_input['src_input_ids'].to(model.device),
            sample_input['src_attention_mask'].to(model.device),
            sample_batch['tgt_input_ids'].to(model.device),
            sample_batch['tgt_attention_mask'].to(model.device)
        ).logits
        model.train()

    sample_input_np = {k: v.cpu().numpy().tolist() for k, v in sample_input.items()}
    sample_output_np = sample_output.cpu().numpy().tolist()

    # Save model to MLflow
    signature = infer_signature(sample_input_np, sample_output_np)
    mlflow.pytorch.log_model(
        pytorch_model=model,
        artifact_path="mt_model",
        signature=signature,
        input_example=sample_input_np
    )

    RUN_ID = run.info.run_id
    print("✅ Your MLflow run ID:", RUN_ID)



In [None]:
# ---------------- START MLFLOW UI ---------------- #
!mlflow ui --port 5000 &>/dev/null &



In [None]:
from pyngrok import ngrok
public_url = ngrok.connect(5000)
print("🔗 MLflow UI URL:", public_url.public_url)

In [None]:
!pip install gradio


In [None]:
import torch
import gradio as gr
import mlflow.pytorch
from transformers import T5Tokenizer

# ---------------- CONFIG ---------------- #
mt_pretrained_model_name = "csebuetnlp/banglat5_nmt_en_bn"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_LENGTH = 128

# ---------------- LOAD TOKENIZER ---------------- #
tokenizer = T5Tokenizer.from_pretrained(mt_pretrained_model_name)

# ---------------- LOAD MODEL FROM MLFLOW ---------------- #
# Replace this with the run ID printed by the training cell above
RUN_ID = "f5ec797e21654b46a383cfc582813440"
logged_model_uri = f"runs:/{RUN_ID}/mt_model"
print("Loading model from MLflow:", logged_model_uri)

model = mlflow.pytorch.load_model(logged_model_uri)
model.to(device)
model.eval()

# ---------------- TRANSLATION FUNCTION ---------------- #
def translate_english_to_bangla(sentence: str) -> str:
    # Tokenize a single sentence; no padding is needed for a batch of one
    encoded = tokenizer(
        sentence,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_LENGTH
    ).to(device)

    with torch.no_grad():
        output_tokens = model.model.generate(
            input_ids=encoded.input_ids,
            attention_mask=encoded.attention_mask,  # marks which positions are real tokens
            max_length=MAX_LENGTH,
            num_beams=4,
            early_stopping=True
        )

    return tokenizer.decode(output_tokens[0], skip_special_tokens=True)

# ---------------- GRADIO INTERFACE ---------------- #
gr_interface = gr.Interface(
    fn=translate_english_to_bangla,
    inputs=gr.Textbox(lines=3, placeholder="Enter English sentence here...", label="English Text"),
    outputs=gr.Textbox(label="Bangla Translation"),
    title="English → Bangla Translator (from MLflow)",
    description="Translates English into Bangla using model loaded directly from MLflow."
)

# ---------------- LAUNCH ---------------- #
gr_interface.launch(share=True)
