# T5: Text-To-Text Transfer Transformer


T5 is a transformer model from Google that is trained end to end, taking text as input and producing modified text as output.

The Hugging Face Transformers library provides thousands of pre-trained models that can be used for text summarization and for a wide variety of other tasks, such as text classification, question answering, translation, speech recognition, and optical character recognition.

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, with each task converted into a text-to-text format.
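
As a rough illustration of the text-to-text format, each task can be expressed by prepending a short task prefix to the input string. The helper name `to_text_to_text` below is hypothetical and only formats strings; it does not call the model. The prefixes `summarize` and `translate English to German` are among those used with the public T5 checkpoints.

```python
# Hypothetical helper: build a T5-style input by prepending a task prefix.
def to_text_to_text(task_prefix: str, text: str) -> str:
    return f"{task_prefix}: {text.strip()}"

examples = [
    to_text_to_text("summarize", "The quick brown fox jumped over the lazy dog. "),
    to_text_to_text("translate English to German", "That is good."),
]

for prompt in examples:
    print(prompt)
```

Every task then goes through the same encoder-decoder; only the prefix and the expected output text change.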

# T5

## Import

In [None]:
!pip install transformers sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [None]:
import pandas as pd
import numpy as np

import re
import nltk
import rouge

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import torch
import json
import transformers 
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
from tqdm import tqdm
tqdm.pandas()

In [None]:
device = torch.device('cpu')

In [None]:
testd = pd.read_csv("/content/drive/MyDrive/miniproject/testd.csv",index_col="Unnamed: 0")
teste = pd.read_csv("/content/drive/MyDrive/miniproject/teste.csv",index_col="Unnamed: 0")

## T5

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-base')

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)

In [None]:
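# Alternative input: read the text from a local file.
# file_name = "test.txt"
# with open(file_name, "r") as file:
#     text = file.read()

# Alternative input: a hard-coded sample article (kept as an unused string literal).
'''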
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.

The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.

At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors.

"We'll be the comeback kids, all of us," he said. "We want to get our country back."

The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
'''
text = testd['text'][0]

In [None]:
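# Note: replacing "\n" with "" fuses words across line breaks; " " would keep them apart.
preprocess_text = text.strip().replace("\n","")
# The "summarize: " prefix tells T5 which task to perform.
t5_prepared_Text = "summarize: "+preprocess_text
print("original text preprocessed: \n", preprocess_text)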

original text preprocessed: 
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK bo

In [None]:
# Encode to token ids, truncating inputs longer than 1024 tokens.
tokenized_text = tokenizer.encode(t5_prepared_Text, max_length=1024, return_tensors="pt", truncation=True).to(device)

In [None]:
# Summarize with beam search (4 beams); no 2-gram may appear twice in the output.
summary_ids = model.generate(tokenized_text,num_beams=4,no_repeat_ngram_size=2,min_length=30,max_length=512,early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text: \n",output)



Summarized text: 
 "i don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection," he says. "the things I like buying are things that cost about 10 pounds -- books and CDs and DVDs" at 18, Radcliffe will be able to gamble in casino, buy drink in pub or see horror film "Hostel: Part II"


In [None]:
# Summarize again with no_repeat_ngram_size=1, which forbids any token from repeating.
summary_ids = model.generate(tokenized_text,num_beams=4,no_repeat_ngram_size=1,min_length=30,max_length=512,early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text: \n",output)



Summarized text: 
 "i don't plan to be one of those people who buy themselves a massive sports car collection" at 18, Harry Potter star Daniel Radcliffe will have no plans for party, casino or other extravagant things. his latest outing as the boy wizard is breaking records on both sides; last two films are due later this year and my Boy Jack in 2012.


In [None]:
text = teste['text'][7]

In [None]:
# Remove line breaks and add the "summarize: " task prefix.
preprocess_text = text.strip().replace("\n","")
t5_prepared_Text = "summarize: "+preprocess_text
print("original text preprocessed: \n", preprocess_text)

original text preprocessed: 
 Presenter: Good afternoon, everyone. Today we're going to talk about the basics of machine learning. Machine learning is a method of teaching computers to learn from data, without being explicitly programmed. There are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.Slide 1: Types of Machine LearningSupervised Learning: learning from labeled dataUnsupervised Learning: learning from unlabeled dataReinforcement Learning: learning from reward or punishmentPresenter: In supervised learning, the computer is trained on labeled data, which means the input data is labeled with the correct output. The computer then uses this training data to make predictions on new, unseen data.Slide 2: Supervised LearningLabeled data: input data labeled with correct outputComputer uses labeled data to make predictions on unseen dataPresenter: In unsupervised learning, the computer is trained on unlabeled data, which means the
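
The preprocessed text above contains fused words such as "learning.Slide" and "LearningSupervised", which is consistent with line breaks being replaced by an empty string. A minimal sketch of an alternative, assuming the goal is simply to flatten the text onto one line: collapse each run of whitespace into a single space. The function name `normalize_whitespace` and the sample string are illustrative and are not taken from the dataset.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Illustrative sample with line breaks between slide elements (an assumption).
sample = "reinforcement learning.\nSlide 1: Types of Machine Learning\nSupervised Learning"

print(sample.strip().replace("\n", ""))  # words on either side of a break are fused
print(normalize_whitespace(sample))      # word boundaries are preserved
```

Keeping word boundaries intact matters because the tokenizer would otherwise see fused strings as unfamiliar words.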

In [None]:
tokenized_text = tokenizer.encode(t5_prepared_Text,  max_length=1024, return_tensors="pt", truncation=True).to(device)

In [None]:
# Summarize with beam search (4 beams); no 2-gram may appear twice in the output.
summary_ids = model.generate(tokenized_text,num_beams=4,no_repeat_ngram_size=2,min_length=30,max_length=512,early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text: \n",output)



Summarized text: 
 machine learning is a method of teaching computers to learn from data, without being explicitly programmed. in supervised learning, the computer is trained on labeled data and makes predictions on new, unseen data.


In [None]:
# Summarize again with no_repeat_ngram_size=1, which forbids any token from repeating.
summary_ids = model.generate(tokenized_text,num_beams=4,no_repeat_ngram_size=1,min_length=30,max_length=512,early_stopping=True)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n\nSummarized text: \n",output)



Summarized text: 
 machine learning is a method of teaching computers to learn from data, without being explicitly programmed. it's used in three different ways: supervised and unsupervised; reinforcing or rewarding the computer for its performance after receiving feedback on work performed by other machines with similar results as previously described (see below).
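
The paired runs above differ only in `no_repeat_ngram_size`. The sketch below illustrates what that constraint forbids, using whole words as stand-ins for tokens (the real model works on subword ids, and the function name `banned_next_tokens` is hypothetical, not part of the library). Given the tokens generated so far, it returns the tokens that would complete an n-gram already present.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would repeat an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

generated = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(generated, 2))  # only "cat": it would repeat the 2-gram "the cat"
print(banned_next_tokens(generated, 1))  # every token seen so far
```

With a size of 1, no token can appear twice, so the model has to keep choosing words it has not used yet. That may explain why the second summary in each pair drifts away from the source text.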


# Summarization function

In [None]:
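# Load the pre-trained model and tokenizer once and reuse them for every call.
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)
device = torch.device('cpu')

def t5suma(text):
    """Return a T5 summary of `text` using beam search."""
    # Remove line breaks and add the "summarize: " task prefix.
    preprocess_text = text.strip().replace("\n","")
    t5_prepared_Text = "summarize: "+preprocess_text
    # Encode to token ids, truncating inputs longer than 1024 tokens.
    tokenized_text = tokenizer.encode(t5_prepared_Text, max_length=1024, return_tensors="pt", truncation=True).to(device)
    # Beam search with 4 beams; no 2-gram may repeat; output of 30 to 512 tokens.
    summary_ids = model.generate(tokenized_text,num_beams=4,no_repeat_ngram_size=2,min_length=30,max_length=512,early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output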

In [None]:
from tqdm import tqdm
tqdm.pandas()
# Summarize every row of testd, showing a progress bar.
testd['t5pred'] = testd['text'].progress_apply(t5suma)

  0%|          | 0/17 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 17/17 [07:29<00:00, 26.44s/it]


In [None]:
from tqdm import tqdm
tqdm.pandas()
# Summarize every row of teste, showing a progress bar.
teste['t5pred'] = teste['text'].progress_apply(t5suma)

100%|██████████| 11/11 [02:02<00:00, 11.09s/it]


# ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores a generated summary by its n-gram overlap with a reference summary.
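
To make the metric concrete, here is a simplified sketch of ROUGE-1 computed by hand from unigram counts. It assumes lowercasing and whitespace tokenization, so its numbers may not match the `rouge` package exactly; the function name `rouge1` is illustrative.

```python
from collections import Counter

def rouge1(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap precision, recall and F1."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Precision is the share of hypothesis words found in the reference, recall is the share of reference words found in the hypothesis, and F1 combines the two. Swapping hypothesis and reference swaps precision and recall but leaves F1 unchanged.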


In [None]:
import rouge
def evaluate_summary(y_test, predicted):
    """Print the ROUGE-1, ROUGE-2 and ROUGE-L F-scores and return their mean."""
    rouge_score = rouge.Rouge()
    # get_scores expects the hypothesis (prediction) first, then the reference.
    scores = rouge_score.get_scores(predicted, y_test, avg=True)
    score_1 = round(scores['rouge-1']['f'], 2)
    score_2 = round(scores['rouge-2']['f'], 2)
    score_L = round(scores['rouge-l']['f'], 2)
    print("rouge1:", score_1, "| rouge2:", score_2, "| rougeL:", score_L, "--> avg rouge:", round(np.mean([score_1,score_2,score_L]), 2))
    return np.mean([score_1,score_2,score_L])

In [None]:
for i in testd.index:
    evaluate_summary(testd['y'][i], testd['t5pred'][i])

rouge1: 0.13 | rouge2: 0.0 | rougeL: 0.13 --> avg rouge: 0.09
rouge1: 0.37 | rouge2: 0.2 | rougeL: 0.37 --> avg rouge: 0.31
rouge1: 0.35 | rouge2: 0.14 | rougeL: 0.3 --> avg rouge: 0.26
rouge1: 0.26 | rouge2: 0.11 | rougeL: 0.26 --> avg rouge: 0.21
rouge1: 0.21 | rouge2: 0.04 | rougeL: 0.21 --> avg rouge: 0.15
rouge1: 0.23 | rouge2: 0.12 | rougeL: 0.23 --> avg rouge: 0.19
rouge1: 0.39 | rouge2: 0.2 | rougeL: 0.36 --> avg rouge: 0.32
rouge1: 0.41 | rouge2: 0.14 | rougeL: 0.38 --> avg rouge: 0.31
rouge1: 0.19 | rouge2: 0.0 | rougeL: 0.19 --> avg rouge: 0.13
rouge1: 0.24 | rouge2: 0.09 | rougeL: 0.19 --> avg rouge: 0.17
rouge1: 0.29 | rouge2: 0.17 | rougeL: 0.29 --> avg rouge: 0.25
rouge1: 0.06 | rouge2: 0.0 | rougeL: 0.06 --> avg rouge: 0.04
rouge1: 0.18 | rouge2: 0.1 | rougeL: 0.18 --> avg rouge: 0.15
rouge1: 0.41 | rouge2: 0.25 | rougeL: 0.41 --> avg rouge: 0.36
rouge1: 0.29 | rouge2: 0.04 | rougeL: 0.18 --> avg rouge: 0.17
rouge1: 0.57 | rouge2: 0.33 | rougeL: 0.53 --> avg rouge: 0.48

In [None]:
for i in teste.index:
    evaluate_summary(teste['y'][i], teste['t5pred'][i])

rouge1: 0.31 | rouge2: 0.13 | rougeL: 0.31 --> avg rouge: 0.25
rouge1: 0.44 | rouge2: 0.15 | rougeL: 0.31 --> avg rouge: 0.3
rouge1: 0.51 | rouge2: 0.23 | rougeL: 0.51 --> avg rouge: 0.42
rouge1: 0.28 | rouge2: 0.09 | rougeL: 0.26 --> avg rouge: 0.21
rouge1: 0.34 | rouge2: 0.22 | rougeL: 0.32 --> avg rouge: 0.29
rouge1: 0.34 | rouge2: 0.22 | rougeL: 0.32 --> avg rouge: 0.29
rouge1: 0.5 | rouge2: 0.31 | rougeL: 0.48 --> avg rouge: 0.43
rouge1: 0.5 | rouge2: 0.27 | rougeL: 0.45 --> avg rouge: 0.41
rouge1: 0.54 | rouge2: 0.35 | rougeL: 0.51 --> avg rouge: 0.47
rouge1: 0.47 | rouge2: 0.21 | rougeL: 0.47 --> avg rouge: 0.38
rouge1: 0.4 | rouge2: 0.09 | rougeL: 0.37 --> avg rouge: 0.29


In [None]:
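# Collect the average ROUGE score of every row in both datasets.
# Note: the testd loop scores the 'nlpred' column against 'ey', not the T5 predictions in 't5pred'.
lst5 = []
for i in testd.index:
    x = evaluate_summary(testd['ey'][i], testd['nlpred'][i])
    lst5.append(x)
print("teste")
for i in teste.index:
    x = evaluate_summary(teste['y'][i], teste['t5pred'][i])
    lst5.append(x)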

rouge1: 0.68 | rouge2: 0.62 | rougeL: 0.67 --> avg rouge: 0.66
rouge1: 0.77 | rouge2: 0.7 | rougeL: 0.77 --> avg rouge: 0.75
rouge1: 0.67 | rouge2: 0.61 | rougeL: 0.67 --> avg rouge: 0.65
rouge1: 0.56 | rouge2: 0.44 | rougeL: 0.56 --> avg rouge: 0.52
rouge1: 0.79 | rouge2: 0.74 | rougeL: 0.79 --> avg rouge: 0.77
rouge1: 0.67 | rouge2: 0.56 | rougeL: 0.67 --> avg rouge: 0.63
rouge1: 0.57 | rouge2: 0.45 | rougeL: 0.56 --> avg rouge: 0.53
rouge1: 0.58 | rouge2: 0.48 | rougeL: 0.58 --> avg rouge: 0.55
rouge1: 0.76 | rouge2: 0.7 | rougeL: 0.76 --> avg rouge: 0.74
rouge1: 0.61 | rouge2: 0.53 | rougeL: 0.6 --> avg rouge: 0.58
rouge1: 0.84 | rouge2: 0.77 | rougeL: 0.84 --> avg rouge: 0.82
rouge1: 0.79 | rouge2: 0.76 | rougeL: 0.78 --> avg rouge: 0.78
rouge1: 0.75 | rouge2: 0.68 | rougeL: 0.75 --> avg rouge: 0.73
rouge1: 0.78 | rouge2: 0.76 | rougeL: 0.78 --> avg rouge: 0.77
rouge1: 0.76 | rouge2: 0.63 | rougeL: 0.76 --> avg rouge: 0.72
rouge1: 0.48 | rouge2: 0.32 | rougeL: 0.44 --> avg rouge: 

In [None]:
testd.to_csv("/content/drive/MyDrive/miniproject/testd.csv")
teste.to_csv("/content/drive/MyDrive/miniproject/teste.csv")