## Install essential libraries

In [1]:
!pip install transformers[sentencepiece] sacrebleu datasets -q

## Check for GPU

In [2]:
!nvidia-smi

Wed Sep 13 09:56:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
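
The same check can be done from Python, which is handy when the notebook should branch on whether a GPU is present. This is a minimal sketch that only uses the standard library and the `nvidia-smi` query flags `--query-gpu=name --format=csv,noheader`; the helper name `gpu_names` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed, so assume no NVIDIA GPU
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No GPU detected; training will fall back to CPU.")
```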

## Log in to Hugging Face if required

In [3]:
#!huggingface-cli login

## Import the essential libraries

In [4]:
import tensorflow as tf
import transformers
from datasets import load_dataset
from transformers import (
    AdamWeightDecay,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    TFAutoModelForSeq2SeqLM,
)



### Model we are going to use

In [5]:
model_checkpoint = "t5-small"

### Load the dataset

In [6]:
raw_datasets = load_dataset("harish03/english_hinglist_sentences",split='train')

dataset = raw_datasets.train_test_split(test_size=0.3)

Downloading and preparing dataset json/harish03--english_hinglist_sentences to /root/.cache/huggingface/datasets/json/harish03--english_hinglist_sentences-470ce68d4c21b2e7/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/27.1M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/harish03--english_hinglist_sentences-470ce68d4c21b2e7/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.
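
The call to `train_test_split(test_size=0.3)` above shuffles the rows and holds out 30% of them for evaluation. Because no `seed` argument is passed, the split can differ from run to run; `train_test_split` accepts a `seed` parameter to make it reproducible. The following is a rough pure-Python sketch of the idea, not the library's implementation: the helper `split_indices` and its rounding rule are assumptions made for illustration.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split every run
    n_test = round(n_rows * test_size)    # assumed rounding rule, for illustration only
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=10, test_size=0.3)
print(len(train_idx), len(test_idx))  # 7 3
```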


### Load the tokenizer for t5-small model

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

### Assign some parameters and define the preprocessing function

In [8]:
max_input_length = 128 # Max input length
max_target_length = 128 # Max output length

source_lang = "en" # Key of the input (English) text in each record
target_lang = "hi_ng" # Key of the output (Hinglish) text in each record


def preprocess_function(examples):
    inputs = [ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets; text_target replaces the deprecated as_target_tokenizer() context
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
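
To make the data flow of `preprocess_function` concrete, here is a toy walk-through. With `batched=True`, `map` passes a dict whose `"translation"` column is a list of `{"en": ..., "hi_ng": ...}` records. The tokenizer below, `toy_tokenize`, is a hypothetical stand-in that assigns one fake id per word, and the two example sentences are made up for illustration; only the shape of the result mirrors the real function.

```python
def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for the real tokenizer: one fake id per word, truncated."""
    vocab = {}
    encoded = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        encoded.append(ids[:max_length])  # mimic truncation=True
    return {"input_ids": encoded}


# A batch as passed by map(batched=True): column name -> list of row values
examples = {
    "translation": [
        {"en": "I was waiting for my bag", "hi_ng": "main apne bag ka wait kar raha tha"},
        {"en": "Share your feedback", "hi_ng": "apna feedback share karo"},
    ]
}

inputs = [ex["en"] for ex in examples["translation"]]
targets = [ex["hi_ng"] for ex in examples["translation"]]

# A small max_length makes the truncation visible
model_inputs = toy_tokenize(inputs, max_length=4)
model_inputs["labels"] = toy_tokenize(targets, max_length=4)["input_ids"]

print(model_inputs["input_ids"])
print(model_inputs["labels"])
```

Note that the rows are not padded at this stage, so they can have different lengths; padding happens later, per batch, in the data collator.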

## Preprocess (tokenize the dataset)

In [9]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)

  0%|          | 0/133 [00:00<?, ?ba/s]



  0%|          | 0/57 [00:00<?, ?ba/s]

## Load the model (TensorFlow, initialized from the PyTorch weights)

In [10]:
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


### Assign some more parameters

In [11]:
batch_size = 16
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 2

In [12]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [13]:
generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)
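
The two collators differ only in how far they pad. The first pads each batch to its longest sequence; the second, with `pad_to_multiple_of=128`, rounds that length up to a multiple of 128. In both cases `DataCollatorForSeq2Seq` pads labels with `-100` by default so that padded positions are ignored by the loss, while inputs are padded with the tokenizer's pad token (id 0 for T5). The sketch below imitates that behavior in plain Python; `pad_batch` is a hypothetical helper, and the real collator does more (for example, it can also build decoder inputs when a model is passed).

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100, pad_to_multiple_of=None):
    """Pad input_ids and labels to a common length, roughly as a seq2seq collator does."""
    def target_length(rows):
        longest = max(len(row) for row in rows)
        if pad_to_multiple_of:
            # Round up to the next multiple, e.g. 5 -> 8 for a multiple of 8
            longest = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
        return longest

    in_len = target_length([f["input_ids"] for f in features])
    lab_len = target_length([f["labels"] for f in features])
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = in_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (lab_len - len(f["labels"])))
    return batch


# Made-up token ids, for illustration only
features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 22, 1]},
    {"input_ids": [14, 1], "labels": [23, 24, 25, 26, 1]},
]
dynamic = pad_batch(features)                      # pad to the longest row in the batch
fixed = pad_batch(features, pad_to_multiple_of=8)  # pad to a multiple of 8

print(dynamic["labels"])
print(len(fixed["input_ids"][0]))  # 8
```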

## Generate the dataset

In [14]:
train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [15]:
validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets["test"],
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)
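
The number of steps Keras reports per epoch follows from the dataset size and `batch_size`. Whether the final partial batch counts depends on the `drop_remainder` argument of `prepare_tf_dataset`, which the calls above leave unset. A small sketch of the arithmetic, using a hypothetical dataset size rather than the real one:

```python
import math


def steps_per_epoch(n_examples, batch_size, drop_remainder):
    """Number of batches one pass over the data yields."""
    if drop_remainder:
        return n_examples // batch_size        # final partial batch is discarded
    return math.ceil(n_examples / batch_size)  # final partial batch is kept


n_examples = 1000  # hypothetical size, not the real dataset's
print(steps_per_epoch(n_examples, batch_size=16, drop_remainder=True))   # 62
print(steps_per_epoch(n_examples, batch_size=16, drop_remainder=False))  # 63
```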

## Initialize the optimizer

In [16]:
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)
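
`AdamWeightDecay` is Adam with decoupled weight decay: the decay shrinks each weight directly instead of being added to the gradient. The sketch below shows one such update for a single scalar parameter, using the `learning_rate` and `weight_decay` values from this notebook. The `beta1`, `beta2`, and `eps` values are assumed typical defaults, and the exact ordering of operations inside the library may differ; `adamw_step` is a hypothetical helper for illustration.

```python
import math


def adamw_step(param, grad, m, v, step, lr=2e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-7):
    """One decoupled-weight-decay Adam update for a single scalar parameter."""
    # Weight decay shrinks the weight directly; it is not added to the gradient
    param = param - lr * weight_decay * param
    # Standard Adam moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(param)

# With a zero gradient only the decay term moves the weight
decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
print(decayed)
```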

## Fit the model

In [17]:
model.fit(train_dataset, validation_data=validation_dataset, epochs=num_train_epochs)
#model.fit(train_dataset,epochs=num_train_epochs)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7d2db439f190>

## Save the fine-tuned model

In [18]:
model.save_pretrained("model/")

### Load the `tokenizer` and `model`

In [19]:
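# The tokenizer was not modified during fine-tuning, so reload it from the base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load the fine-tuned weights from the directory written by save_pretrained above
model = TFAutoModelForSeq2SeqLM.from_pretrained("model/")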

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at /kaggle/working/tf_model2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Test the model

In [20]:
def generate_output(input_text):
    # Tokenize to TensorFlow tensors, since the model is a TF model
    tokenized = tokenizer([input_text], return_tensors="tf")
    out = model.generate(**tokenized, max_length=128)
    # Decoding does not need a target-tokenizer context
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [21]:
texts = [
    "Definitely share your feedback in the comment section",
    "So even it's a big video, I will clearly mention all the products",
    "I was waiting for my bag",
]

for input_text in texts:
    print(generate_output(input_text))



ye comment section me feedback share kare
hua ye ek big video ki ye sabhi products ko mention kare
mujhe mere bag kab tak waiting karne ke liye
