# Dataset link
**Link:** https://huggingface.co/datasets/cfilt/iitb-english-hindi

# Pretrained model link
**Link:** https://huggingface.co/Helsinki-NLP/opus-mt-en-hi

In [1]:
# Check whether a GPU is available
!nvidia-smi

Sun Jan 21 17:36:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |  

In [2]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [3]:
import os
import sys
import transformers
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from transformers import AdamWeightDecay



In [4]:
import numpy as np
import pandas as pd

In [5]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-hi"

In [6]:
raw_dataset = load_dataset("cfilt/iitb-english-hindi")

Downloading and preparing dataset json/default (download: 181.38 MiB, generated: 427.93 MiB, post-processed: Unknown size, total: 609.31 MiB) to /root/.cache/huggingface/datasets/parquet/cfilt--iitb-english-hindi-2cfae92395f2614b/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/cfilt--iitb-english-hindi-2cfae92395f2614b/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.



In [7]:
raw_dataset

DatasetDict({
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

In [8]:
raw_dataset['train'][4]

{'translation': {'en': 'A list of plugins that are disabled by default',
  'hi': 'उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से निष्क्रिय किया गया है'}}

In [9]:
source_lang = "en"
target_lang = "hi"

# Print the last English/Hindi pair of the validation split
last_pair = raw_dataset['validation'][-1]['translation']
inp = last_pair[source_lang]
out = last_pair[target_lang]

print(inp,"\n", out)

In this way he wants to turn Guajrat into an impenetrable fortress. 
 ऐसे में वह गुजरात के किले को पूरी तरह से अभेद बना देना चाहते हैं।


In [10]:
# Download the pretrained model's tokenizer from the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



#### Checking the tokenizer on input (English) text

In [11]:
tokenizer("hello how are you")

{'input_ids': [39915, 287, 54, 27, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [12]:
tokenizer(["hello how are you", "where are you from"])

{'input_ids': [[39915, 287, 54, 27, 0], [573, 54, 27, 72, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

#### Checking the tokenizer on target (Hindi) text

In [13]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से निष्क्रिय किया गया है"]))

{'input_ids': [[141, 10076, 69, 38232, 15, 342, 1058, 22433, 246, 12, 2709, 78, 115, 5, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}




#### Converting text data into a numerical representation

In [14]:
max_input_len = 128
max_target_len = 128

source_lang = "en"
target_lang = "hi"

def preprocess_func(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]] 
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_len, truncation=True)
    
    # setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_len, truncation=True)
        
    model_inputs["labels"] = labels["input_ids"]
    
    return model_inputs

In [15]:
# Returns the tokenized English inputs (input_ids, attention_mask) and the tokenized Hindi targets (labels)
preprocess_func(raw_dataset['train'][:2])

{'input_ids': [[3872, 85, 2501, 132, 15441, 36398, 0], [32643, 28541, 36253, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]], 'labels': [[63, 2025, 18, 16155, 346, 20311, 24, 2279, 679, 0], [26618, 16155, 346, 33383, 0]]}

In [16]:
tokenized_data = raw_dataset.map(preprocess_func, batched=True)

## Downloading the pretrained model

In [17]:
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-hi.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


#### Training

In [18]:
# The data collator groups tokenized examples into batches: it pads the inputs
# and labels of each batch to a common length and returns TensorFlow tensors,
# so the data can be passed to the model batch by batch.

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors='tf')
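
To make the padding step concrete, here is a minimal pure-Python sketch of what a seq2seq collator does to one batch. It is not the library's implementation: `pad_batch` is a hypothetical helper, the pad id `61949` is an assumption based on the ids seen in the generation output later in this notebook, and `-100` is the default label padding value of `DataCollatorForSeq2Seq` (positions with that value are ignored by the loss). The token ids are the two examples printed by `preprocess_func` above.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Pad every example in a batch to the longest input and longest label."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        # Inputs are padded with the pad token and masked out with 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        # Labels are padded with -100 so the loss skips those positions
        batch["labels"].append(f["labels"] + [label_pad_token_id] * lab_pad)
    return batch


PAD_ID = 61949  # assumption: pad token id of this checkpoint

features = [
    {"input_ids": [3872, 85, 2501, 132, 15441, 36398, 0],
     "attention_mask": [1] * 7,
     "labels": [63, 2025, 18, 16155, 346, 20311, 24, 2279, 679, 0]},
    {"input_ids": [32643, 28541, 36253, 0],
     "attention_mask": [1] * 4,
     "labels": [26618, 16155, 346, 33383, 0]},
]

batch = pad_batch(features, PAD_ID)
print(batch["input_ids"][1])       # [32643, 28541, 36253, 0, 61949, 61949, 61949]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0]
print(batch["labels"][1])          # [26618, 16155, 346, 33383, 0, -100, -100, -100, -100, -100]
```

When a model is passed to it, the real collator can additionally prepare decoder inputs from the labels, which this sketch leaves out.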

In [19]:
generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors='tf', 
                                                  pad_to_multiple_of=128)

In [20]:
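# Workaround: the np.object alias was removed in NumPy 1.24, but some older
# library code still refers to it, so restore it as the built-in object type
np.object = object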

In [21]:
batch_size = 20
learning_rate = 2e-5
weight_decay = 0.01


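# NOTE: this trains on the small "test" split (2,507 rows), not on the
# 1,659,083-row "train" split; use tokenized_data["train"] for full fine-tuning
train_dataset = model.prepare_tf_dataset(
    tokenized_data["test"],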
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    tokenized_data["validation"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)


# Use the collator that pads to a multiple of 128, defined above for generation
generation_dataset = model.prepare_tf_dataset(
    tokenized_data["validation"],
    batch_size=10,
    shuffle=False,
    collate_fn=generation_data_collator,
)
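
As a rough check on how much work one epoch is, the number of batches per epoch can be worked out from the split sizes and batch sizes used above. This is a back-of-the-envelope sketch, not output from the notebook: `steps_per_epoch` is a hypothetical helper, and it assumes that `prepare_tf_dataset` drops the last incomplete batch when `shuffle=True` and keeps it when `shuffle=False`.

```python
import math


def steps_per_epoch(num_rows, batch_size, drop_remainder):
    """Number of batches produced from num_rows examples."""
    if drop_remainder:
        # The last incomplete batch is discarded
        return num_rows // batch_size
    # The last incomplete batch is kept
    return math.ceil(num_rows / batch_size)


# Split sizes come from the DatasetDict printed earlier in the notebook
train_steps = steps_per_epoch(2507, 20, drop_remainder=True)  # "test" split, shuffled
val_steps = steps_per_epoch(520, 20, drop_remainder=False)    # "validation" split
gen_steps = steps_per_epoch(520, 10, drop_remainder=False)    # generation batches

print(train_steps, val_steps, gen_steps)  # 125 26 52
```

Under these assumptions, 50 epochs amount to 6,250 optimizer steps on the 2,507-row split.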

In [22]:
# Use the AdamWeightDecay optimizer; no loss is passed to compile(),
# so the model falls back to its own internal loss

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)

model.compile(optimizer=optimizer)

In [25]:
model.fit(train_dataset, validation_data=validation_dataset, epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7e26e6b106d0>

In [26]:
model.save_pretrained("translation_tf_model")

## Model testing

In [27]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

model = TFAutoModelForSeq2SeqLM.from_pretrained("/kaggle/working/translation_tf_model")

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at /kaggle/working/translation_tf_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [28]:
input_text = "Hey! how are you"

tokenized = tokenizer([input_text], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

print(out)

tf.Tensor([[61949   707  6001     2   118   280    28    22     0 61949]], shape=(1, 10), dtype=int32)


In [29]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

हे भगवान, आप कैसे हैं?
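
The generated tensor contains more ids than the decoded sentence has tokens because it includes special tokens. The sketch below imitates what `skip_special_tokens=True` does before the ids are turned back into text. It is only an illustration: `strip_special_tokens` is a hypothetical helper, the end-of-sequence id `0` is read off the tokenizer outputs above (every sequence ends in `0`), and treating `61949` as the padding / decoder start id is an assumption.

```python
EOS_ID = 0      # every tokenized sequence above ends with 0
PAD_ID = 61949  # assumption: pad / decoder start token id


def strip_special_tokens(ids, special_ids=(EOS_ID, PAD_ID)):
    """Drop special token ids, keeping only the ids that carry text."""
    return [i for i in ids if i not in special_ids]


# Ids copied from the tf.Tensor printed for "Hey! how are you"
generated = [61949, 707, 6001, 2, 118, 280, 28, 22, 0, 61949]

content_ids = strip_special_tokens(generated)
print(content_ids)  # [707, 6001, 2, 118, 280, 28, 22]
```

The real tokenizer then maps the remaining ids to SentencePiece pieces and joins them into the final string.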




### Note: increase the number of training epochs so that the model translates more accurately.
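
To judge whether more training actually helps, it is useful to score model outputs against reference translations instead of reading them by eye. The sketch below is a simplified character n-gram F-score in pure Python. It is not the official BLEU or chrF metric (the `sacrebleu` package installed at the top of the notebook provides those), `ngram_f1` is a hypothetical helper, and the reference sentence is an assumed translation written only for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_f1(hypothesis, reference, n=2):
    """F1 score over character n-grams shared by hypothesis and reference."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


hypothesis = "हाय मेरा नाम नीशेद है"    # model output for "Hi my name is Narender"
reference = "हाय मेरा नाम नरेंद्र है"   # assumed reference, for illustration only

score = ngram_f1(hypothesis, reference)
print(f"character bigram F1: {score:.2f}")
```

Averaging such a score over the validation split before and after training gives a rough signal of whether additional epochs improve the translations.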

In [30]:
input_text = "Hi my name is Narender"

tokenized = tokenizer([input_text], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

print(out)

tf.Tensor([[61949  5201   500   179    67   130 10916  3130     5     0]], shape=(1, 10), dtype=int32)


In [31]:
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(out[0], skip_special_tokens=True))

हाय मेरा नाम नीशेद है
