## **Text Summarization using Flan-T5, Dolly-V2**

### OPTIONAL

#### System Information

In [37]:
import psutil
import torch

ram = psutil.virtual_memory()
ram_total = ram.total / (1024 ** 3)
# Apply the format string with % so the value is rounded to two decimals
print("MemTotal: %.2f GB" % ram_total)

print("=============GPU INFO=============")

if torch.cuda.is_available():
    !/opt/bin/nvidia-smi || true
else:
    print("GPU NOT available")

MemTotal: 15.85 GB
GPU NOT available


### Getting the Essential Task done

#### Installing the required packages

In [38]:
!pip install -U -q openllm datasets matplotlib transformers pandas numpy nltk rouge_score




#### Initializing the libraries

In [115]:
from transformers import pipeline, set_seed
import openllm
import matplotlib.pyplot as plt
from datasets import load_metric
import pandas as pd
import numpy as np
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to C:\Users\Aryan
[nltk_data]     Mohanty\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Loading the CNN/DailyMail dataset for sample text
An important aspect of the dataset is that the summaries are abstractive and not extractive, which means that they consist of new sentences instead of simple excerpts.

In [3]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

print(f"Features in cnn_dailymail : {dataset['train'].column_names}")

Features in cnn_dailymail : ['article', 'highlights', 'id']


In [8]:
sample = dataset["train"][0]
print(f"Article (excerpt of 500 characters, total length: {len(sample['article'])}) : \n")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 2527) : 

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

Summary (length: 217):
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


This is the sample text (the first 1000 characters of the article) that we will use to compare the models:

```
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how...
```

In [62]:
sample_text = dataset["train"][0]["article"][:1000]
summaries_openllm = {}
summaries_huggingface = {}

### Flan-T5 Base Model 

To start the OpenLLM server for the Flan-T5 Base model, run the following command in a terminal.

```
openllm start flan-t5 --model-id google/flan-t5-base
```

#### Accessing the server through client instance

In [71]:
client = openllm.client.HTTPClient("http://localhost:3000")
query_text = sample_text + "\nTL;DR:\n"
response = client.query(query_text)
summaries_openllm["Flan-T5"] = response.responses[0]

Here is the summary produced by the Flan-T5 Base model:

In [72]:
summaries_openllm["Flan-T5"]

"Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him."

### OPT 350M Model

To start the OpenLLM server for the OPT 350M model, run the following command in a terminal.

```
openllm start opt --model-id facebook/opt-350m --backend pt
```

#### Accessing the server through client instance

In [75]:
client = openllm.client.HTTPClient("http://localhost:3000")
query_text = sample_text + "\nTL;DR:\n"
response = client.query(query_text)
summaries_openllm["OPT"] = response.responses[0][len(query_text):]

Here is the summary produced by the OPT 350M model:

In [76]:
summaries_openllm["OPT"]

'Radcliffe\'s fortune is said to be worth around £20 million, although his agent says he\'s likely to make a modest fortune on the back of the success of "Harry Potter and the Order of the Phoenix." The star of the "Harry Potter" movie franchise has not publicly disclosed his fortune. In a new interview with "The Independent," Radcliffe, who turns 18 on Monday, said he would never be tempted to spend money on extravagances, but he admitted that money is important and he could use it to improve his life. "I don\'t want to spend a lot of money on things that I know are going to fail," he said. "But I do want to do things that I think are going to help me get out of the hole I\'ve got myself in." "Harry Potter and the Order of the Phoenix" was released on Monday night, the day after Radcliffe\'s 18th birthday.'
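
The two client cells above treat the response differently: the Flan-T5 output is stored as-is, while the OPT output is sliced with `[len(query_text):]`. A plausible reason is that decoder-only models such as OPT return the prompt followed by its continuation, whereas encoder-decoder models such as Flan-T5 return only the generated sequence. The sketch below shows a slightly safer way to handle both cases; `build_prompt` and `strip_prompt` are hypothetical helpers, not part of OpenLLM.

```python
TLDR_SUFFIX = "\nTL;DR:\n"

def build_prompt(text):
    """Append the TL;DR cue that asks the model for a short version."""
    return text + TLDR_SUFFIX

def strip_prompt(prompt, generated):
    """Remove the echoed prompt, but only if the output really starts with it."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated

prompt = build_prompt("Some article text.")
echoed = prompt + "A short summary."  # hypothetical decoder-only style output
fresh = "A short summary."            # hypothetical encoder-decoder style output

print(strip_prompt(prompt, echoed))
print(strip_prompt(prompt, fresh))
```

Checking with `startswith` avoids cutting into the summary when a model does not echo the prompt.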

### Dolly-V2 3B Model 

To start the OpenLLM server for the Dolly-V2 3B model, run the following command in a terminal.

```
openllm start dolly-v2 --model-id databricks/dolly-v2-3b --backend pt
```

#### Accessing the server through client instance

**NOTICE**: This model could not produce output on my machine because of hardware limits, even though Dolly-V2 3B is the third most lightweight model in the OpenLLM library. The code below is valid and should produce a summary on a more capable machine.

In [None]:
client = openllm.client.HTTPClient("http://localhost:3000")
query_text = sample_text + "\nTL;DR:\n"
response = client.query(query_text)
summaries_openllm["Dolly-V2"] = response.responses[0]

Here is the summarized article by Dolly-V2 3B Model

In [None]:
summaries_openllm["Dolly-V2"]

### ROUGE Metrics for the Models Processed by OPENLLM

#### Comparison of the summaries produced by the models

In [101]:
print("GROUND TRUTH")

print(dataset['train'][0]['highlights'])

for model_name in summaries_openllm:
    print(f"\n{model_name.upper()}")
    print(summaries_openllm[model_name])

GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

OPT
Radcliffe's fortune is said to be worth around £20 million, although his agent says he's likely to make a modest fortune on the back of the success of "Harry Potter and the Order of the Phoenix." The star of the "Harry Potter" movie franchise has not publicly disclosed his fortune. In a new interview with "The Independent," Radcliffe, who turns 18 on Monday, said he would never be tempted to spend money on extravagances, but he admitted that money is important and he could use it to improve his life. "I don't want to spend a lot of money on things that I know are going to fail," he said. "But I do want to do things that I think are going to help me get out of the hole I've got myself in." "Harry Potter and the Order of the Phoenix" was released on Monday

The ROUGE score was specifically developed for applications like summarization where high recall is more important than just precision.
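
To see how precision and recall interact, here is a minimal sketch of ROUGE-1 computed by hand from clipped unigram counts. It is a simplification: the `rouge1` function below is hypothetical and uses a plain regex tokenizer, so its numbers will not match the `rouge_score` package exactly.

```python
import re
from collections import Counter

def rouge1(prediction, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    pred = Counter(re.findall(r"\w+", prediction.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((pred & ref).values())  # per-word minimum of the two counts
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat"
long_prediction = (
    "the cat sat on the mat and then the cat left the room quietly today"
)

precision, recall, f1 = rouge1(long_prediction, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.40 recall=1.00 f1=0.57
```

The long prediction covers the whole reference (recall 1.0) but is penalized on precision. The cells in this notebook report `fmeasure`, which balances the two, so very long outputs score lower even when they contain the key facts.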

In [93]:
rouge_metric = load_metric('rouge')
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

reference = dataset['train'][0]['highlights']

records = []

for model_name in summaries_openllm:
    rouge_metric.add(prediction = summaries_openllm[model_name], reference = reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names)
    records.append(rouge_dict)

pd.DataFrame.from_records(records, index = summaries_openllm.keys())

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| OPT | 0.182741 | 0.041026 | 0.121827 | 0.162437 |
| Flan-T5 | 0.371429 | 0.235294 | 0.342857 | 0.342857 |


### Hugging Face Pipeline for Text Summarization

Because OpenLLM is still a young library built mainly around BentoML services, it is not the most capable tool for text summarization. As an alternative, we look at how Hugging Face models, accessed through the `pipeline` interface, perform on the same task.

Defining a baseline summary (the first three sentences of the sample text) to compare against the models

In [105]:
summaries_huggingface['baseline'] = "\n".join(sent_tokenize(sample_text)[:3])

summaries_huggingface['baseline']

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.'
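
The baseline simply takes the first three sentences of the text. Here is a minimal, dependency-free sketch of the same idea; `lead_n` is a hypothetical helper, and the regex splitter is only a rough stand-in for `sent_tokenize`, which handles abbreviations and quotations more carefully.

```python
import re

def lead_n(text, n=3):
    """Return the first `n` sentences of `text`, one per line."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences[:n])

text = "First point. Second point! Third point? Fourth point."
print(lead_n(text))
# First point.
# Second point!
# Third point?
```

News articles tend to put the key facts first, which is why such a simple baseline is worth comparing against.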

#### GPT2 Medium Model

We can use GPT-2 to generate summaries by simply appending “TL;DR” to the end of the input text.

The expression “TL;DR” (too long; didn’t read) is often used on platforms like Reddit to indicate a short version of a long post. We will start our summarization experiment by re-creating the procedure of the original paper with the pipeline() function from Transformers.

Creating Pipeline for GPT-2 Medium Model

In [135]:
set_seed(42)

pipeline_GPT2 = pipeline('text-generation', model = 'gpt2-medium' )

query_text = sample_text + "\nTL;DR:\n"

response = pipeline_GPT2(query_text, max_length = 512, clean_up_tokenization_spaces = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [136]:
summaries_huggingface['GPT-2'] = "\n".join(sent_tokenize(response[0]["generated_text"][len(query_text):]))

In [153]:
summaries_huggingface

{'baseline': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.\nDaniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.\n"I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.',
 'GPT-2': 'Daniel Radcliffe has made a fortune as a celebrity, which means he doesn\'t want to sell out his dream of becoming the next big thing, he told a young Australian interviewer last month.\nAt 18 he will be able to gamble in a casino, buy a drink in a pub\nDaniel Radcliffe looks on as Harry Potter star and actress Sa

#### T5 Small Model

T5 (Text-To-Text Transfer Transformer) is a transformer model that is trained in an end-to-end manner with text as input and modified text as output, in contrast to BERT-style models that can only output either a class label or a span of the input. This text-to-text formatting makes the T5 model fit for multiple NLP tasks like Summarization, Question-Answering, Machine Translation, and Classification problems.
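
The text-to-text framing means the task is selected by a plain-text prefix on the input. The sketch below builds such inputs; the prefixes are the ones described in the T5 paper, while `TASK_PREFIXES` and `to_t5_input` are hypothetical names used only for illustration. As far as I know, the `summarization` pipeline applies the summarization prefix for `t5-small` automatically from the model configuration, so this is needed only when calling the model directly.

```python
# Task prefixes described in the T5 paper; the dictionary itself is illustrative.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}

def to_t5_input(task, text):
    """Prepend the task prefix so a single model can serve several tasks."""
    return TASK_PREFIXES[task] + text

print(to_t5_input("summarization", "Daniel Radcliffe turns 18 on Monday."))
print(to_t5_input("translation_en_to_de", "That is good."))
```

In every case both the input and the output are strings, which is what lets one set of weights handle different tasks.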

Creating Pipeline for the T5 Small Model

In [144]:
pipeline_T5 = pipeline('summarization', model = 't5-small' )

response = pipeline_T5(sample_text)

In [147]:
summaries_huggingface['T5'] = '\n'.join(sent_tokenize(response[0]["summary_text"]))

#### PEGASUS Model

PEGASUS is pre-trained on a task that closely resembles summarization. Several whole sentences, chosen for their importance, are removed from each document and replaced with a mask, and the model must then generate them. The input is the document with the missing sentences, and the output is those sentences concatenated into a single sequence, which reads much like a summary. The advantage of this self-supervision is that every document yields a training example without human annotation, which is often the bottleneck in purely supervised systems.
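
The pre-training objective described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual PEGASUS implementation: sentence importance is approximated by word overlap with the rest of the document, the `<mask_1>` placeholder is assumed, and all function names are hypothetical.

```python
import re

MASK = "<mask_1>"  # assumed placeholder for a removed sentence

def split_sentences(text):
    """Naive sentence splitter on '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def overlap_score(sentence, others):
    """Share of the sentence's words that also appear in the other sentences."""
    words = set(re.findall(r"\w+", sentence.lower()))
    rest = set(re.findall(r"\w+", " ".join(others).lower()))
    return len(words & rest) / max(len(words), 1)

def make_gap_example(document, n_masked=1):
    """Mask the highest-scoring sentences and return (source, target)."""
    sentences = split_sentences(document)
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = sorted(ranked[:n_masked])  # keep document order in the target
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return source, target

doc = (
    "Radcliffe turns 18 on Monday. "
    "Radcliffe gains access to a fortune on Monday. "
    "He likes books."
)
source, target = make_gap_example(doc)
print(source)  # <mask_1> Radcliffe gains access to a fortune on Monday. He likes books.
print(target)  # Radcliffe turns 18 on Monday.
```

Each document produces a (source, target) pair with no human labelling, which is the property the paragraph above highlights.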

Creating Pipeline for the PEGASUS Model

In [149]:
pipeline_PEGASUS = pipeline('summarization', model="google/pegasus-cnn_dailymail")

response = pipeline_PEGASUS(sample_text)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [150]:
summaries_huggingface["PEGASUS"] = response[0]["summary_text"].replace(".<n>", ".\n")

### ROUGE Metrics for the Models Processed by HuggingFace

#### Comparison of the summaries produced by the models

In [152]:
print("GROUND TRUTH")

print(dataset['train'][0]['highlights'])

for model_name in summaries_huggingface:
    print(f"\n{model_name.upper()}")
    print(summaries_huggingface[model_name])

GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

BASELINE
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him.
Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.
"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month.

GPT-2
Daniel Radcliffe has made a fortune as a celebrity, which means he doesn't want to sell out his dr

The ROUGE score was specifically developed for applications like summarization where high recall is more important than just precision.

In [154]:
rouge_metric = load_metric('rouge')
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

reference = dataset['train'][0]['highlights']

records = []

for model_name in summaries_huggingface:
    rouge_metric.add(prediction = summaries_huggingface[model_name], reference = reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names)
    records.append(rouge_dict)

pd.DataFrame.from_records(records, index = summaries_huggingface.keys())

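| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.335484 | 0.248366 | 0.296774 | 0.335484 |
| GPT-2 | 0.147059 | 0.059259 | 0.102941 | 0.125 |
| T5 | 0.35 | 0.282051 | 0.325 | 0.35 |
| PEGASUS | 0.35 | 0.282051 | 0.325 | 0.35 |

Note: the PEGASUS row matches the T5 row exactly because, in the recorded run, the PEGASUS output was assigned to a misspelled variable (`repsonse`), so the T5 `response` was scored a second time. These values are therefore not PEGASUS scores.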
