<a href="https://colab.research.google.com/github/MishraShardendu22/Transformers/blob/main/Translate_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip uninstall torch -y
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Found existing installation: torch 2.5.1+cu121
Uninstalling torch-2.5.1+cu121:
  Successfully uninstalled torch-2.5.1+cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Using cached https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
Installing collected packages: torch
Successfully installed torch-2.5.1+cu121


In [2]:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
Tesla T4


In [3]:
!pip install transformers datasets sentencepiece accelerate evaluate



# Explanation (AI help)

```python
dataset = dataset.train_test_split(test_size=0.1)
```

### What it does

It splits your dataset into two parts:

* **90% → training set**
* **10% → validation (test) set**

Since you selected **1,000,000 samples**:

* 900,000 → used to train the model
* 100,000 → used to evaluate model performance
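
The arithmetic behind the split can be sketched in plain Python. This is only an illustration, not the `datasets` implementation: `split_sizes` is a hypothetical helper, and it assumes the test share is rounded to a whole number of rows with the remainder going to training.

```python
def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size.

    Hypothetical helper: assumes the test share is rounded to the
    nearest whole row and everything else is used for training.
    """
    test_rows = round(num_rows * test_size)
    return num_rows - test_rows, test_rows


# The loading cell in this notebook selects 1,000,000 rows before splitting.
train_rows, test_rows = split_sizes(1_000_000, 0.1)
print(train_rows, test_rows)  # 900000 100000
```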

---

### Why this is required

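During training:

* Model learns on the **train set**
* At regular intervals (every `eval_steps` steps in this notebook), performance is checked on the **validation set**
* Helps you detect overfitting: validation loss rising while training loss keeps falling
* Lets you measure BLEU score on data the model has not trained on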
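
As a rough sketch of how the two loss curves can be compared, the helper below flags the case where training loss falls while validation loss rises. `validation_is_diverging` is a hypothetical name and a deliberately crude heuristic (it only looks at the first and last logged values); the numbers are the two evaluation rows logged in this notebook's training output.

```python
def validation_is_diverging(train_losses, val_losses):
    """True if training loss fell while validation loss rose between
    the first and the last logged evaluation (a crude heuristic)."""
    if len(train_losses) < 2 or len(val_losses) < 2:
        return False  # not enough points to compare
    train_fell = train_losses[-1] < train_losses[0]
    val_rose = val_losses[-1] > val_losses[0]
    return train_fell and val_rose


# Values logged at steps 5000 and 10000 in this notebook.
train_losses = [2.264737, 2.222866]
val_losses = [7.357111, 8.060638]
print(validation_is_diverging(train_losses, val_losses))  # True
```

A flag like this is only a prompt to investigate; it does not say why the curves diverge.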

---

### What `print(dataset)` shows

You will see something like:

```
DatasetDict({
    train: Dataset({
        features: ...
        num_rows: 900000
    })
    test: Dataset({
        features: ...
        num_rows: 100000
    })
})
```

In [4]:
from datasets import load_dataset

# Correct dataset name
dataset = load_dataset("cfilt/iitb-english-hindi")

# Shuffle and take 1,000,000 samples
dataset = dataset["train"].shuffle(seed=42).select(range(1_000_000))

# Train-validation split
dataset = dataset.train_test_split(test_size=0.1)

print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 900000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 100000
    })
})


In [5]:
from transformers import (
    AutoTokenizer,
    EncoderDecoderConfig,
    EncoderDecoderModel,
    BertConfig
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

encoder_config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)

decoder_config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    is_decoder=True,
    add_cross_attention=True,
    max_position_embeddings=512,
)

config = EncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config,
    decoder_config
)

model = EncoderDecoderModel(config=config)

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id



In [None]:
def preprocess_function(examples):
    inputs = [x["hi"] for x in examples["translation"]]
    targets = [x["en"] for x in examples["translation"]]

    model_inputs = tokenizer(
        inputs,
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    labels = tokenizer(
        targets,
        max_length=128,
        padding="max_length",
        truncation=True,
    )

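    # Replace padding in the labels with -100 so the loss ignores those positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs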


tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    num_proc=6
)

print(tokenized_dataset)

Map (num_proc=6):   0%|          | 0/900000 [00:00<?, ? examples/s]

In [53]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./scratch-hi-en",
    eval_strategy="steps", # Changed from evaluation_strategy
    save_strategy="steps",
    logging_steps=1000,
    save_steps=5000,
    eval_steps=5000,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=3e-4,
    weight_decay=0.01,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    save_total_limit=2,
    load_best_model_at_end=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
)

print("Trainer ready")

Trainer ready


In [54]:
trainer.train()



| Step | Training Loss | Validation Loss |
|---|---|---|
| 5000 | 2.264737 | 7.357111 |
| 10000 | 2.222866 | 8.060638 |


Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]




There were missing keys in the checkpoint model loaded: ['decoder.cls.predictions.decoder.weight', 'decoder.cls.predictions.decoder.bias'].


TrainOutput(global_step=14063, training_loss=2.2573528982858386, metrics={'train_runtime': 6133.0321, 'train_samples_per_second': 146.746, 'train_steps_per_second': 2.293, 'total_flos': 1.54781891712e+16, 'train_loss': 2.2573528982858386, 'epoch': 1.0})
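
The `global_step=14063` at `epoch: 1.0` can be checked with a short calculation. This sketch assumes a single GPU, no gradient accumulation, and that the final partial batch is kept; `steps_per_epoch` is a hypothetical helper, not a `transformers` function.

```python
import math


def steps_per_epoch(num_rows: int, batch_size: int) -> int:
    """Optimizer steps in one pass over the data, assuming one device,
    no gradient accumulation, and a final partial batch that is kept."""
    return math.ceil(num_rows / batch_size)


# 900,000 training rows, per_device_train_batch_size=64
print(steps_per_epoch(900_000, 64))  # 14063
```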

# What is the use of these lines?

Those lines define **special tokens required for generation** in an encoder–decoder model.

Your scratch model does not automatically know:

* where decoding starts
* where it ends
* what the padding token is

So you set them manually.

---

## 1️⃣ `decoder_start_token_id`

```python
model.config.decoder_start_token_id = tokenizer.cls_token_id
```

Tells the decoder:

> Start generation with this token.

Without it (and with no `bos_token_id` to fall back on) → `generate()` raises an error.

It is the first token fed into the decoder at time step 0.
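
During training, the same start token is placed in front of the labels to form the decoder inputs. The following is a plain-Python sketch of that right-shift, written to mirror the idea rather than reproduce the library code; `shift_right_sketch` is a hypothetical name and the token IDs 7, 8, 9 are made up.

```python
def shift_right_sketch(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token, drop the
    last label, and swap ignored (-100) positions for the padding ID."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    return [pad_token_id if tok == -100 else tok for tok in shifted]


CLS, SEP, PAD = 101, 102, 0
labels = [7, 8, 9, SEP, -100, -100]  # 7, 8, 9 are made-up token IDs
print(shift_right_sketch(labels, PAD, CLS))
# [101, 7, 8, 9, 102, 0]
```

At each position the decoder sees the tokens up to that point and is trained to predict the label at the same index.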

---

## 2️⃣ `bos_token_id`

```python
model.config.bos_token_id = tokenizer.cls_token_id
```

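BOS = Beginning Of Sentence.

Used by the generation utilities, for example as a fallback start token when `decoder_start_token_id` is not set.

The BERT tokenizer defines no BOS token of its own, so:

* `cls_token_id` works as BOS.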

---

## 3️⃣ `eos_token_id`

```python
model.config.eos_token_id = tokenizer.sep_token_id
```

EOS = End Of Sentence.

When the model predicts this token:
→ generation stops.

Without EOS → generation only stops when it reaches `max_length`.
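
The stopping rule can be illustrated with a toy greedy loop. This is a sketch under stated assumptions, not how `generate()` is implemented: `greedy_decode`, `toy_model`, and `never_stops` are hypothetical stand-ins, and the token IDs 7 and 8 are made up.

```python
def greedy_decode(next_token_fn, start_id, eos_id, max_length):
    """Append one token at a time until EOS appears or max_length is hit."""
    tokens = [start_id]
    while len(tokens) < max_length:
        nxt = next_token_fn(tokens)
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens


CLS, SEP = 101, 102


def toy_model(tokens):
    # Hypothetical stand-in for a model: emits 7, then 8, then [SEP].
    script = [7, 8, SEP]
    return script[min(len(tokens) - 1, len(script) - 1)]


def never_stops(tokens):
    # Hypothetical model that never predicts [SEP].
    return 7


print(greedy_decode(toy_model, CLS, SEP, max_length=10))   # [101, 7, 8, 102]
print(greedy_decode(never_stops, CLS, SEP, max_length=5))  # [101, 7, 7, 7, 7]
```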

---

## 4️⃣ `pad_token_id`

```python
model.config.pad_token_id = tokenizer.pad_token_id
```

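Used for:

* Padding batches to a common length
* Filling the `-100` label positions when the labels are shifted right to build the decoder inputs
* Padding sequences that finish early during batched generation and beam search

Setting `pad_token_id` does not by itself exclude padding from the loss; only label positions set to `-100` are ignored.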
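
To make the loss point concrete, here is a small plain-Python sketch. It assumes the usual convention that label positions equal to `-100` are skipped when averaging the loss; `mask_padding` and `mean_loss` are hypothetical helpers, and the per-token loss values are made up.

```python
IGNORE_INDEX = -100


def mask_padding(label_ids, pad_token_id):
    """Replace padding IDs in the labels with the ignore index."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


def mean_loss(per_token_loss, labels):
    """Average the per-token loss over positions that are not ignored."""
    kept = [loss for loss, lab in zip(per_token_loss, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


PAD = 0
labels = [7, 8, 102, PAD, PAD]             # 7 and 8 are made-up token IDs
per_token_loss = [1.0, 2.0, 3.0, 0.0, 0.0]  # made-up values; padding is trivially easy

print(mean_loss(per_token_loss, labels))                     # 1.2 (padding counted)
print(mean_loss(per_token_loss, mask_padding(labels, PAD)))  # 2.0 (padding ignored)
```

When padding is counted, easy padded positions dilute the average, so the reported loss looks better than the model's performance on real tokens.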

---

## Why this was necessary

Because you built the model **from a scratch config**, not from a pretrained checkpoint.

Pretrained models already contain these IDs.
A scratch config does not.

Without setting these:

* `generate()` raises errors
* Decoding behaves incorrectly

---

## 1️⃣ `tokenizer.cls_token_id`

This is the ID of the `[CLS]` token.

Example (mBERT):

```python
tokenizer.cls_token        # "[CLS]"
tokenizer.cls_token_id     # 101
```

Meaning:

* `[CLS]` is stored in the vocabulary
* It has a fixed integer ID
* That integer is used inside tensors

You used it as:

* decoder start token
* beginning-of-sentence token

---

## 2️⃣ `tokenizer.sep_token_id`

This is the ID of `[SEP]`.

Example:

```python
tokenizer.sep_token        # "[SEP]"
tokenizer.sep_token_id     # 102
```

Used as:

* end-of-sentence marker

When the model predicts ID 102 → generation stops.

---

## 3️⃣ `tokenizer.pad_token_id`

This is the ID of `[PAD]`.

Example:

```python
tokenizer.pad_token        # "[PAD]"
tokenizer.pad_token_id     # 0
```

Used to:

* fill shorter sequences up to the batch length
* mark the positions that the attention mask should hide
* identify label positions to replace with `-100` so the loss skips them
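
Padding and the attention mask go together, which a plain-Python sketch can show. `pad_batch` is a hypothetical helper, not the tokenizer's implementation: it truncates naively by cutting the tail, and the token IDs 7, 8, 9 are made up.

```python
def pad_batch(sequences, pad_token_id, max_length):
    """Pad (or naively truncate) each sequence to max_length and build a
    matching attention mask: 1 for real tokens, 0 for padding."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]  # naive truncation: just cut the tail
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[101, 7, 102], [101, 7, 8, 9, 102]]  # 7, 8, 9 are made-up token IDs
ids, mask = pad_batch(batch, pad_token_id=0, max_length=5)
print(ids)   # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```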

---

## Why use tokenizer IDs?

Because:

* Model works with integers, not strings.
* Vocabulary mapping is defined inside tokenizer.
* These IDs must match tokenizer vocabulary exactly.

If mismatched → decoding breaks.
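
A toy vocabulary makes the string-to-integer mapping visible. This sketch is not the real tokenizer: `VOCAB`, `encode`, and `decode` are hypothetical, the special-token IDs follow the mBERT values quoted above, and the IDs for `hello` and `world` are made up.

```python
# Special-token IDs follow the values quoted above; word IDs are made up.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "hello": 7, "world": 8}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}
SPECIAL_TOKENS = {"[PAD]", "[CLS]", "[SEP]"}


def encode(words):
    """Map words to IDs and wrap them in [CLS] ... [SEP]."""
    return [VOCAB["[CLS]"]] + [VOCAB[w] for w in words] + [VOCAB["[SEP]"]]


def decode(ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping the special tokens."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


ids = encode(["hello", "world"])
print(ids)          # [101, 7, 8, 102]
print(decode(ids))  # hello world
```

If the model were configured with an EOS ID other than 102, it would stop on the wrong token even though the tokenizer itself is unchanged.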


In [58]:
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.bos_token_id = tokenizer.cls_token_id

In [66]:
text = "तुम कौन हो? तुम यहाँ क्या कर रहे हो? तुम्हारे यहाँ होने का क्या कारण है?"

inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=64,
    truncation=True
).to(model.device)

outputs = model.generate(
    **inputs,
    max_length=64,
    decoder_start_token_id=tokenizer.cls_token_id,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translation:", translation)

Translation: the the
