Mounting Google Drive


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


T5 is an encoder-decoder model that casts every NLP problem, including translation, summarization, text generation, and question answering, as a text-to-text task: both the input and the output are plain text strings.
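To make the text-to-text framing concrete, here is a minimal sketch of how different tasks reduce to plain input strings. The prefixes are ones used with the original T5 checkpoints; the helper name `to_t5_input` and the task keys are illustrative and not part of any library.

```python
# Task prefixes used with the original T5 checkpoints. The model reads the
# prefix as ordinary text and learns which behaviour it signals.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "en_to_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def to_t5_input(task: str, text: str) -> str:
    """Return the single input string T5 would see for the given task."""
    return TASK_PREFIXES[task] + text.strip()


print(to_t5_input("summarization", "New Zealand defeated India by 8 wickets."))
print(to_t5_input("en_to_de", "That is good."))
```

Whatever the task, the model's answer comes back as text too: a summary, a translation, or a label written out as a word.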


In [2]:
pip install simplet5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simplet5
  Downloading simplet5-0.1.4.tar.gz (7.3 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting transformers==4.16.2
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting pytorch-lightning==1.5.10
  Downloading pytorch_lightning-1.5.10-py3-none-any.whl (527 kB)
Collecting future>=0.17.1
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting pyDeprecate==0.3.1
  Downloading pyDeprecate-0.3.1-py3-none-any.whl (10 kB)
Collecting setuptools==59.5.0
  Downloading setuptools-59.5.0-py3-none-any.whl (952 kB)

Fine-tuning with simpleT5

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
path = '/content/drive/MyDrive/NLP/Summarization/New_sumarry_Data/news_summary_more.csv'

In [3]:
df = pd.read_csv(path)

In [4]:
df.head()

Unnamed: 0,headlines,text
0,upGrad learner switches to career in ML & Al w...,"Saurav Kant, an alumnus of upGrad and IIIT-B's..."
1,Delhi techie wins free food from Swiggy for on...,Kunal Shah's credit card bill payment platform...
2,New Zealand end Rohit Sharma-led India's 12-ma...,New Zealand defeated India by 8 wickets in the...
3,Aegon life iTerm insurance plan helps customer...,"With Aegon Life iTerm Insurance plan, customer..."
4,"Have known Hirani for yrs, what if MeToo claim...",Speaking about the sexual harassment allegatio...


simpleT5 expects a pandas DataFrame with two columns: source_text and target_text. Since we are summarizing news articles, the model should learn to map the full news text (the text column) to a one-line summary (the headlines column). So source_text will be the text column and target_text will be the headlines column.

T5 also expects a task prefix that identifies the task to perform on each input. Let's add "summarize: " as a prefix to source_text.

In [5]:
# Rename the columns and keep them in (source, target) order
df = df.rename(columns={"headlines": "target_text", "text": "source_text"})
df = df[['source_text', 'target_text']]
df


Unnamed: 0,source_text,target_text
0,"Saurav Kant, an alumnus of upGrad and IIIT-B's...",upGrad learner switches to career in ML & Al w...
1,Kunal Shah's credit card bill payment platform...,Delhi techie wins free food from Swiggy for on...
2,New Zealand defeated India by 8 wickets in the...,New Zealand end Rohit Sharma-led India's 12-ma...
3,"With Aegon Life iTerm Insurance plan, customer...",Aegon life iTerm insurance plan helps customer...
4,Speaking about the sexual harassment allegatio...,"Have known Hirani for yrs, what if MeToo claim..."
...,...,...
98396,A CRPF jawan was on Tuesday axed to death with...,CRPF jawan axed to death by Maoists in Chhatti...
98397,"'Uff Yeh', the first song from the Sonakshi Si...",First song from Sonakshi Sinha's 'Noor' titled...
98398,"According to reports, a new version of the 199...",'The Matrix' film to get a reboot: Reports
98399,A new music video shows rapper Snoop Dogg aimi...,Snoop Dogg aims gun at clown dressed as Trump ...


In [6]:
# Add the "summarize: " prefix to every source_text
df['source_text'] = 'summarize: ' + df['source_text']
df

Unnamed: 0,source_text,target_text
0,"summarize: Saurav Kant, an alumnus of upGrad a...",upGrad learner switches to career in ML & Al w...
1,summarize: Kunal Shah's credit card bill payme...,Delhi techie wins free food from Swiggy for on...
2,summarize: New Zealand defeated India by 8 wic...,New Zealand end Rohit Sharma-led India's 12-ma...
3,summarize: With Aegon Life iTerm Insurance pla...,Aegon life iTerm insurance plan helps customer...
4,summarize: Speaking about the sexual harassmen...,"Have known Hirani for yrs, what if MeToo claim..."
...,...,...
98396,summarize: A CRPF jawan was on Tuesday axed to...,CRPF jawan axed to death by Maoists in Chhatti...
98397,"summarize: 'Uff Yeh', the first song from the ...",First song from Sonakshi Sinha's 'Noor' titled...
98398,"summarize: According to reports, a new version...",'The Matrix' film to get a reboot: Reports
98399,summarize: A new music video shows rapper Snoo...,Snoop Dogg aims gun at clown dressed as Trump ...


In [7]:
# Split into train and test sets
train_df, test_df = train_test_split(df, test_size=0.3)
train_df.shape,test_df.shape

((68880, 2), (29521, 2))
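The shapes above follow from how a fractional `test_size` is rounded: the test set takes the ceiling of 0.3 × 98,401 rows and the training set gets the remainder. The sketch below reproduces that arithmetic with a seeded shuffle in plain Python. It only illustrates the idea and is not scikit-learn's implementation; `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows: int, test_size: float, seed: int = 42):
    """Shuffle row indices with a fixed seed and cut them into train/test."""
    n_test = math.ceil(test_size * n_rows)   # test share is rounded up
    n_train = n_rows - n_test                # training set gets the remainder
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # same seed, same order every run
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(98401, 0.3)
print(len(train_idx), len(test_idx))  # 68880 29521
```

In the notebook itself, passing `random_state` to `train_test_split` has the same effect of making the split repeatable across runs.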

We will import the SimpleT5 class, download a pre-trained T5 model, and then train it on our dataset, using train_df for training and test_df for evaluation. We can also specify optional training arguments such as source_max_token_len, target_max_token_len, batch_size, the number of epochs, and early stopping.
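When choosing source_max_token_len and target_max_token_len, it helps to look at how long the texts actually are. The sketch below uses whitespace word counts as a rough stand-in for token counts. T5's SentencePiece tokenizer usually produces more tokens than words, so the numbers are best read as a lower bound. The sample strings are shortened toy inputs and `length_percentile` is a hypothetical helper.

```python
import math


def length_percentile(texts, pct: float) -> int:
    """Word count that at least `pct` percent of the texts do not exceed (nearest rank)."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


# Toy inputs; in the notebook this would be df['source_text'] or df['target_text'].
samples = [
    "summarize: New Zealand defeated India by 8 wickets",
    "summarize: A CRPF jawan was on Tuesday axed to death",
    "summarize: According to reports, a new version",
]
print(length_percentile(samples, 50), length_percentile(samples, 100))  # 8 10
```

A limit near a high percentile, with some headroom for subword splitting, truncates only a small share of the examples.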



In [8]:
# Download the pre-trained model
from simplet5 import SimpleT5

model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")


INFO:pytorch_lightning.utilities.seed:Global seed set to 42


Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

In [10]:
# Train on the first 5,000 training rows and evaluate on the first 100 test rows
model.train(train_df=train_df[:5000],
            eval_df=test_df[:100],
            source_max_token_len=128,
            target_max_token_len=50,
            batch_size=8,
            max_epochs=5,
            use_gpu=True
           )

INFO:pytorch_lightning.utilities.distributed:GPU available: True, used: True
INFO:pytorch_lightning.utilities.distributed:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.distributed:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.seed:Global seed set to 42


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
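Two numbers in this run can be checked by hand. With 5,000 training rows and a batch size of 8, each epoch takes 625 steps (assuming no gradient accumulation), so 5 epochs come to 3,125 steps. The reported 891.614 MB is also consistent with storing each parameter as a 4-byte float32, which works out to roughly 222.9 million parameters and matches the "222 M" in the model summary.

```python
import math

train_rows, batch_size, epochs = 5000, 8, 5
steps_per_epoch = math.ceil(train_rows / batch_size)  # a partial last batch still counts
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 625 3125

size_mb = 891.614        # "Total estimated model params size (MB)" from the log
bytes_per_param = 4      # assumption: float32 weights
params_millions = size_mb / bytes_per_param
print(round(params_millions, 1))  # 222.9
```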

In [11]:
# List the checkpoints saved after each epoch
! ( cd outputs; ls )

simplet5-epoch-0-train-loss-1.5733-val-loss-1.2471
simplet5-epoch-1-train-loss-1.1727-val-loss-1.2556
simplet5-epoch-2-train-loss-0.9456-val-loss-1.2857
simplet5-epoch-3-train-loss-0.7719-val-loss-1.3449
simplet5-epoch-4-train-loss-0.6386-val-loss-1.4571
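The checkpoint names encode both losses. Here the validation loss is lowest after the first epoch and rises afterwards while the training loss keeps falling, a pattern that usually points to overfitting. One option is to pick a checkpoint by validation loss rather than by epoch number. The sketch below parses the directory names listed above; `best_checkpoint` is a hypothetical helper, not part of simpleT5.

```python
import re

CHECKPOINTS = [
    "simplet5-epoch-0-train-loss-1.5733-val-loss-1.2471",
    "simplet5-epoch-1-train-loss-1.1727-val-loss-1.2556",
    "simplet5-epoch-2-train-loss-0.9456-val-loss-1.2857",
    "simplet5-epoch-3-train-loss-0.7719-val-loss-1.3449",
    "simplet5-epoch-4-train-loss-0.6386-val-loss-1.4571",
]

PATTERN = re.compile(r"epoch-(\d+)-train-loss-([\d.]+)-val-loss-([\d.]+)$")


def best_checkpoint(names):
    """Return the checkpoint name with the lowest validation loss."""
    def val_loss(name):
        match = PATTERN.search(name)
        if match is None:
            raise ValueError(f"unexpected checkpoint name: {name}")
        return float(match.group(3))
    return min(names, key=val_loss)


print(best_checkpoint(CHECKPOINTS))
# simplet5-epoch-0-train-loss-1.5733-val-loss-1.2471
```

In the notebook the names could come from `os.listdir("outputs")` instead of a hard-coded list.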


In [None]:
ls

simplet5-epoch-0-train-loss-1.5618-val-loss-1.3309/
simplet5-epoch-1-train-loss-1.1642-val-loss-1.3212/
simplet5-epoch-2-train-loss-0.9392-val-loss-1.3456/
simplet5-epoch-3-train-loss-0.7632-val-loss-1.4009/
simplet5-epoch-4-train-loss-0.6284-val-loss-1.4596/


In [13]:
# Load the checkpoint saved after the final epoch
model.load_model("t5","/content/outputs/simplet5-epoch-4-train-loss-0.6386-val-loss-1.4571", use_gpu=True)


In [25]:
# News text (input)
text_to_summarize_1 = "Priyanka Chopra and Nick Jonas' sangeet ceremony which took place on Friday at Umaid Bhawan Palace in Jodhpur was attended by Mukesh Ambani, Nita Ambani, Isha Ambani, Anant Ambani and Radhika Merchant. Priyanka, who performed a dance act at the sangeet ceremony dedicated to Nick, turned emotional when Nick performed a special act for her, as per reports."

In [32]:
# Predicted summary (output)
pred = model.predict(text_to_summarize_1)
print(pred[0])

Priyanka, Nick's sangeet ceremony was attended by Mukesh Ambani, Nita Ambani, Anant Ambani, Radhika Merchant
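The training inputs all began with "summarize: ", while the text passed to predict above does not. It may be worth keeping inference inputs in the same form as the training inputs. The helper below is a small sketch that adds the prefix only when it is missing; `with_task_prefix` is a hypothetical name, and whether the prefix changes this model's output has not been checked here.

```python
PREFIX = "summarize: "


def with_task_prefix(text: str, prefix: str = PREFIX) -> str:
    """Prepend the task prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


print(with_task_prefix("Priyanka Chopra and Nick Jonas' sangeet ceremony ..."))
print(with_task_prefix("summarize: already prefixed"))
```

The result could then be passed to the model as `model.predict(with_task_prefix(text))`.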


In [33]:
text_to_summarize_2 = """Mary Kom, who recently became the first female boxer to win six world championship titles, has said she wants to win world championship again and also wants to win a gold medal in Olympics. The 35-year-old, who won bronze in 2012 Olympics, added, "The government...gave me an extra responsibility by naming me member of parliament but I never stopped training."""

In [36]:
pred = model.predict(text_to_summarize_2)
print(pred[0])

I want to win world championship again and also Olympic gold, says Mary Kom
