#**T5 stands** for Text-To-Text Transfer Transformer.

🔹 What it is:
T5 is a transformer-based NLP model from Google. Its main idea is: convert every NLP problem into a text-to-text format.

Input: always text

Output: always text

build of encoder and decoder both.


**1. Text Summarization**

In [1]:
from transformers import pipeline

summerizer = pipeline("summarization", model= "google-t5/t5-small")
text = """ Hi welcome to the world of Ai. Ai is transforming everything almost
 into automation and cutting the human efforts on large scale."""
summary = summerizer(text, max_length=50, min_length=10, do_sample=False)
print(summary[0]['summary_text'])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0
Your max_length is set to 50, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Ai is transforming everything almost into automation and cutting the human efforts .


**2. AI Content Rewriter / Paraphraser**

In [2]:
from transformers import pipeline
rewriter = pipeline("text2text-generation", model="google/flan-t5-base")
text = """Hi welcome to the world of Ai. Ai is transforming everything almost
 into automation and cutting the human efforts on large scale."""
result =  rewriter(text)
print(result[0]['generated_text'])

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0


Ai is transforming everything almost into automation and cutting human efforts on large scale.


**3. Q&A Bot (FAQ Automation)**

In [3]:
from transformers import pipeline

qa_model = pipeline("text2text-generation", model="t5-base")

context = "T5 is a transformer model created by Google for NLP tasks."
question = "Who created T5?"

output = qa_model("question: " + question + " context: " + context)
print(output[0]['generated_text'])


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Device set to use cuda:0


Google


**4. Multi-Language Translator**

In [4]:
from transformers import pipeline

translator = pipeline("text2text-generation", model="t5-base")

sentence = "How are you?"
output = translator("translate English to German: " + sentence)
print(output[0]['generated_text'])

Device set to use cuda:0


Wie sind Sie?


In [5]:
!pip install --upgrade transformers



In [6]:
# Finetune Google/T5-Small on Summarization
!pip install -q datasets transformers accelerate transformers[sentencepiece] sacrebleu rouge_score py7zr



[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.7/69.7 kB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m97.0/97.0 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.7/51.7 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m142.7/142.7 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m413.6/413.6 kB[0m [31m37.5 MB/s[0m eta [36m0

In [7]:
from datasets import load_dataset  # Load dataset
import torch  # PyTorch tensors & GPU
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # Automatically picks the right tokenizer & model
from transformers import DataCollatorForSeq2Seq  # Dynamic padding and batching
from transformers import TrainingArguments, Trainer  # Training setup & loop
from transformers import pipeline  # High-level API for easy inference
import warnings  # Handle warnings
warnings.filterwarnings("ignore")  # Suppress warnings

In [8]:
dataset = load_dataset("knkarthick/samsum")
dataset

README.md: 0.00B [00:00, ?B/s]

train.csv: 0.00B [00:00, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/14731 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})

In [9]:
model_checkpoint = "t5-small"  # ✅ You can also use "google/flan-t5-base", "facebook/bart-base", etc.

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [10]:
# Tokenize the Dataset

def tokenize_content(data):
    dialogues = data["dialogue"]
    summaries = data["summary"]

    inputs = ["summarize: " + d if d else "summarize: " for d in dialogues]
    targets = [s if s else "" for s in summaries]

    input_encoding = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")
    with tokenizer.as_target_tokenizer():
        target_encoding = tokenizer(targets, max_length=128, truncation=True, padding="max_length")

    return {
        "input_ids": input_encoding["input_ids"],
        "attention_mask": input_encoding["attention_mask"],
        "labels": target_encoding["input_ids"],
    }

tokenized_dataset = dataset.map(tokenize_content, batched=True)

Map:   0%|          | 0/14731 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

In [11]:
# Setup Data Collator
seq2seq_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define Training Arguments
training_args = TrainingArguments(
    output_dir="t5-samsum-model",              # Where to save the model
    num_train_epochs=1,                        # Number of training passes over data
    per_device_train_batch_size=1,             # Samples per GPU during training
    per_device_eval_batch_size=1,              # Samples per GPU during evaluation
    warmup_steps=500,                          # Gradually increase LR for first 500 steps
    weight_decay=0.01,                         # Regularization to prevent overfitting
    logging_steps=10,                          # Log training metrics every 10 steps
    eval_steps=500,                            # Run evaluation every 500 steps
    save_steps=1e6,                            # Disable auto-saving during training
    gradient_accumulation_steps=16,            # Accumulate gradients for larger batch effect
    report_to="none"                           # Disable logging to external tools
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=seq2seq_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

In [12]:
# Train the Model
trainer.train()

Step,Training Loss
10,12.0507
20,11.7816
30,11.847
40,11.1383
50,10.4226
60,9.9007
70,8.7986
80,8.1546
90,7.0604
100,5.4981


TrainOutput(global_step=921, training_loss=1.6627551675748877, metrics={'train_runtime': 1597.8277, 'train_samples_per_second': 9.219, 'train_steps_per_second': 0.576, 'total_flos': 3987440154968064.0, 'train_loss': 1.6627551675748877, 'epoch': 1.0})

In [13]:
#  Save Model & Tokenizer
model.save_pretrained("t5_samsum_finetuned_model")
tokenizer.save_pretrained("t5_samsum_tokenizer")

('t5_samsum_tokenizer/tokenizer_config.json',
 't5_samsum_tokenizer/special_tokens_map.json',
 't5_samsum_tokenizer/spiece.model',
 't5_samsum_tokenizer/added_tokens.json',
 't5_samsum_tokenizer/tokenizer.json')

In [14]:
#  Reload & Setup for Inference
tokenizer = AutoTokenizer.from_pretrained("t5_samsum_tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("t5_samsum_finetuned_model")
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
# No manual tokenization, no manual model.generate() — it abstracts all that under the hood.

Device set to use cuda:0


In [15]:
#  Test Sample Dialogue (Luffy & Naruto)
sample_text = '''Luffy: Naruto! You won the ramen eating contest again?! That’s your fifth win this month!

Naruto: Believe it, Luffy! Ichiraku’s secret menu is my new training ground. Gotta keep up the chakra and the appetite!

Luffy: Haha! I like that! I trained by eating 20 meat-on-the-bone last night. Zoro thought I was insane.

Naruto: Bro, I’ve fought Akatsuki, and even I think that’s dangerous. What’s next? Competing with Goku?

Luffy: Maybe! But first I wanna become the Pirate King. Then I’ll eat ramen on the moon!

Naruto: You sure talk big, rubber boy. But I respect that. Becoming Hokage wasn’t easy either.

Luffy: We’re kinda the same, huh? Chasing dreams, fighting crazy villains, making loyal friends.

Naruto: True that. Though I don’t have a reindeer doctor or a skeleton with an afro.

Luffy: And I don’t have a giant fox inside me. We’re even!

Naruto: Hey, wanna team up for a mission? I heard there’s a lost treasure in the Hidden Mist village.

Luffy: Treasure?! I’m in! Let’s go find it, and maybe snack along the way.

Naruto: Deal. I’ll bring the kunai, you bring the appetite.

Luffy: This is gonna be epic! Let's GO!!!

Naruto: Dattebayo!!!'''

In [16]:
#  Show the Summary Output
from IPython.display import Markdown, display
result = summarizer(sample_text, max_length=100, min_length=30, do_sample=False) ## do_sampilng = False means Use greedy decoding (no randomness); always returns same result
display(Markdown(f"**Summary:** {result[0]['summary_text']}"))
# result format -->> [{'summary_text': 'Here is the generated summary.'}]


Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


**Summary:** Luffy won the ramen eating contest again this month. Luffy has fought Akatsuki, and he will compete with Goku. Naruto will bring the kunai.

In [17]:
import os
import nbformat

# Get current notebook name (Colab workaround)
from google.colab import drive
drive.mount('/content/drive')  # Optional if you're working from Drive

# Replace with your actual notebook path
notebook_path = '/content/drive/MyDrive/Colab Notebooks/imdb sentiment classification using RNN,LSTM,GRU.ipynb'  # Update this
cleaned_path = '/content/imdb_sentiment_classification_using_RNN_LSTM_GRU.ipynb'

# Load and clean notebook
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

if 'widgets' in nb['metadata']:
    print("Removing metadata.widgets...")
    del nb['metadata']['widgets']

# Save cleaned notebook
with open(cleaned_path, 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)

print(f"Cleaned notebook saved to: {cleaned_path}")

Mounted at /content/drive
