# Single-Document Summarization with Pegasus
**Pegasus** is a Transformer-based sequence-to-sequence model developed by Google and pre-trained specifically for abstractive summarization. This notebook applies the `google/pegasus-xsum` checkpoint to a single news article and scores the result with ROUGE and BLEU. Here are some pros and cons of using Pegasus for single-document summarization:

## Pros:

* High-quality summaries: Pegasus summaries generally capture the main ideas and key points of the input document. Its pre-training objective, gap-sentence generation, removes important sentences from a document and trains the model to regenerate them, which closely resembles the summarization task itself.

* Generalization: Pegasus can generalize to unseen data, which means that it can summarize documents on topics it has not encountered during training. This makes it suitable for summarizing a wide range of texts, including those outside its training domain.

* Abstractive summarization: Pegasus uses abstractive summarization, which means that it can generate summaries that are not just a selection of sentences from the input document, but rather a rephrasing of the content in a more concise and coherent manner. This makes Pegasus summaries more readable and engaging.

* Speed: On a GPU, Pegasus generates a short summary quickly enough for near-real-time applications such as news summarization.
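
To make the abstractive/extractive distinction concrete, here is a minimal sketch of an extractive baseline. It is not part of Pegasus or of this notebook's pipeline: the function name `extractive_summary` and the word-frequency scoring are illustrative assumptions. It can only copy sentences verbatim from the input, whereas an abstractive model such as Pegasus generates new wording.

```python
import re
from collections import Counter


def extractive_summary(document: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, copied verbatim from the document."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"\w+", document.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentences by score, keep the top ones, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Calling `extractive_summary(text, 2)` on an article returns two of its sentences unchanged, which is a useful point of comparison when judging how much an abstractive summary actually rephrases.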

## Cons:

* Large computational requirements: Pegasus requires significant computational resources. The `google/pegasus-xsum` checkpoint used here is a 2.28 GB download, and inference benefits from a powerful GPU and a large amount of memory. This can make it challenging to run on low-end hardware or in resource-constrained environments.

* Limited control: Called with default settings, as in this notebook, Pegasus generates a summary without any guidance from the user. Decoding parameters can bound the length, but there is no direct way to steer which content the summary includes.

* Limited transparency: Because Pegasus is a black-box model, it can be challenging to understand how it generates summaries. This can make it difficult to diagnose errors or to fine-tune the model for specific tasks.

* Domain-specific summarization: Pegasus is trained on a diverse corpus of texts, but it may not perform as well on documents that are highly specialized or technical in nature. In such cases, a domain-specific summarization model may be more appropriate.
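
Some control over summary length is available through decoding parameters. The sketch below wraps the tokenize, generate, and decode steps used later in this notebook in a hypothetical helper, `summarize`, and passes `max_length`, `min_length`, `num_beams`, and `length_penalty` to `generate`. The default values are arbitrary assumptions, not tuned settings.

```python
def summarize(text, model, tokenizer, max_length=60, min_length=10,
              num_beams=4, length_penalty=1.0):
    """Summarize `text`, passing explicit decoding settings to `generate`."""
    # Same tokenizer call as in the notebook cells.
    tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    output_ids = model.generate(
        **tokens,
        max_length=max_length,          # upper bound on generated tokens
        min_length=min_length,          # lower bound on generated tokens
        num_beams=num_beams,            # beam search width
        length_penalty=length_penalty,  # larger values favor longer outputs in beam search
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the `model` and `tokenizer` loaded in the cells that follow, `summarize(text, model, tokenizer, max_length=30)` would cap the summary at 30 tokens. These settings shape length and search behavior only; they do not select which facts appear.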

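These are the scores we achieved on the example article below. Both metrics compare the generated summary with the full source article, because no reference summary is available:

    ROUGE-1 Score:
    Precision: 0.583
    Recall: 0.046
    F1-Score: 0.086

    BLEU Score: 0.000 (6.57e-155 before rounding)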

## References

Here are the original Pegasus paper and some related work:

1. "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization" by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. This is the original paper that introduced Pegasus and describes the pre-training method used to train the model.

1. "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" by Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. This paper studies how the random seeds for weight initialization and training-data order affect fine-tuning results for BERT, and how early stopping can save compute. It does not evaluate Pegasus, but its findings are relevant to fine-tuning pretrained models in general.

1. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. This paper introduces T5, a unified text-to-text Transformer, and evaluates it on a range of natural language processing tasks, including summarization. T5 is a separate model from Pegasus.

1. "Text Summarization with Pretrained Encoders" by Yang Liu and Mirella Lapata. This paper applies a pretrained BERT encoder to extractive and abstractive summarization and compares the results to earlier summarization models. It predates Pegasus.

1. "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. This paper presents CheckList, a methodology for behavioral testing of natural language processing models on specific linguistic capabilities. It does not cover Pegasus, but the approach is a useful complement to aggregate metrics such as ROUGE and BLEU.





In [None]:
# Install PyTorch
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html

In [None]:
# Install transformers
!pip install transformers
!pip install sentencepiece


In [None]:
# Importing dependencies from transformers
from transformers import PegasusForConditionalGeneration, AutoTokenizer

In [None]:
# Load tokenizer 
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

In [None]:
# Load model 
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

In [None]:
text = """
India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India."""

In [None]:
# Tokenize: convert the text into integer token IDs, truncated to the model's maximum input length
tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")

In [None]:
# Input tokens
tokens

{'input_ids': tensor([[ 1144,   131,   116,  1300,  4674,   148,  1487,   120,   109,   531,
           131,   116,  4585, 44078, 11545, 19138,   919,   138,   239,   129,
          4617,   112,   444,   200,   204,   109,   779,   113,  1790,   111,
           274,   204,  3125,   122,  1229,   121, 88392,   107,   139,   696,
           117,  1214,   112,   885,   142,   853, 21086,   604,   200,   108,
           395,   126,   156,   113,   109,  1368, 19138,  4930,   115,   109,
           278,   107,   159,  1057,   140,   784,   244,   114,   988,   113,
           109,   765, 10708,  1260,   124, 38664,  4396,   118,  4585, 44078,
         11545,   143, 14068,  1064, 59034,   312,   162,  2087,   109,  3847,
           113,   109, 19138,   431,   107,   139,  9350,  1064, 59034,   163,
          3498,   120,   808,  5518,   218,   129,  1608,   112, 15079,   109,
         10733,   108,  1670,   109,   703,   113,   136,   127,   610,   112,
           129, 20511,   107, 14080,  

In [None]:
# Summarize 
summary = model.generate(**tokens)
summary[0]



tensor([    0, 25803,   154,   200,   127,   112,   129, 33507,   464,   109,
        10011,  4585, 11545,  5807,   107,     1])

In [None]:
# Decode summary
tokenizer.decode(summary[0], skip_special_tokens=True)

'Millions more people are to be vaccinated against the deadly CO-19 virus.'

In [None]:
# Keep the decoded summary for evaluation
summ = tokenizer.decode(summary[0], skip_special_tokens=True)

In [None]:
summ

'Millions more people are to be vaccinated against the deadly CO-19 virus.'

In [None]:
# Install and import the evaluation dependencies
!pip install rouge
!pip install nltk
import nltk
nltk.download('punkt')  # tokenizer data used by sent_tokenize and word_tokenize
from rouge import Rouge

In [None]:
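rouge = Rouge()
# get_scores(hypothesis, reference): the source article stands in for a reference
# summary here, so recall is measured against the whole article.
scores = rouge.get_scores(summ, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))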

ROUGE Score:
Precision: 0.583
Recall: 0.046
F1-Score: 0.086
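
To show where these three numbers come from, here is a minimal from-scratch sketch of ROUGE-1 using whitespace tokens and clipped unigram counts. The function `rouge1` is a hypothetical illustration, not the `rouge` package's implementation, and its tokenization and counting may differ from that package, so the values need not match exactly.

```python
from collections import Counter


def rouge1(hypothesis: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two texts."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)  # share of summary words found in the reference
    recall = overlap / max(sum(ref.values()), 1)     # share of reference words found in the summary
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}
```

Recall divides by the length of the reference. Because the reference here is the entire article rather than a short gold summary, a one-sentence summary can only reach a small recall, which in turn pulls the F1-score down.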


In [None]:
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import sent_tokenize, word_tokenize

def text_to_sentence_tokens(text):
    """Split text into sentences, then tokenize each sentence into words."""
    return [word_tokenize(sentence) for sentence in sent_tokenize(text)]

# sentence_bleu expects a list of tokenized references and one tokenized hypothesis.
# No gold summary is available, so each sentence of the source article serves as a reference.
references = text_to_sentence_tokens(text)
hypothesis = word_tokenize(summ)

# Without smoothing, a missing higher-order n-gram match drives the score toward zero.
score = sentence_bleu(references, hypothesis)
print(score)

6.572888952391001e-155


In [None]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.000
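
BLEU combines modified n-gram precisions for n = 1 to 4 in a geometric mean, so a single order with no matches pulls the whole score toward zero. The sketch below computes the modified precision for one n by hand; `ngrams` and `modified_precision` are hypothetical helper names, and this is an illustration of the definition rather than NLTK's code. The unsmoothed value of 6.57e-155 printed above is what one would expect when at least one higher-order precision is zero.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against several references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each hypothesis count by that maximum, then normalize.
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())
```

Running `modified_precision(references, hypothesis, n)` for n from 1 to 4 on tokenized inputs shows which order drops to zero. An abstractive summary rarely shares four consecutive words with the source, which makes unsmoothed sentence-level BLEU a harsh measure for this setup.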
