# Multi Document Pegasus

Pegasus is a state-of-the-art language model developed by Google that has shown impressive results in several natural language processing tasks, including text summarization. Here are some pros and cons of using Pegasus for multi-document summarization of news articles:

### Pros:

* High-quality summaries: Pegasus is known for generating high-quality summaries that capture the main ideas and key points of the input documents accurately. This makes it suitable for summarizing news articles that cover multiple events and topics.

* Multi-document summarization: Pegasus can summarize multiple documents, which allows it to provide a comprehensive overview of a topic or event. This is particularly useful for news articles that cover a developing story over time.

* Abstractive summarization: Pegasus uses abstractive summarization, which means that it can generate summaries that are not just a selection of sentences from the input documents, but rather a rephrasing of the content in a more concise and coherent manner. This makes Pegasus summaries more readable and engaging.

* Generalization: Pegasus can generalize to unseen data, which means that it can summarize news articles on topics it has not encountered during training. This makes it suitable for summarizing a wide range of news articles.

* Speed: Pegasus is relatively fast at generating summaries, which makes it suitable for real-time applications, such as news summarization.

### Cons:

* Large computational requirements: Pegasus requires significant computational resources, including a powerful GPU and a large amount of memory. This can make it challenging to run on low-end hardware or in resource-constrained environments.

* Limited control: Pegasus generates summaries automatically, without any input from the user. This means that users have limited control over the length and content of the summary.

* Limited transparency: Because Pegasus is a black-box model, it can be challenging to understand how it generates summaries. This can make it difficult to diagnose errors or to fine-tune the model for specific tasks.

* Biased summarization: Pegasus may exhibit bias in its summarization if the input data is biased. This can lead to summaries that are incomplete or inaccurate.

* Limited domain-specific summarization: Pegasus is trained on a diverse corpus of texts, but it may not perform as well on news articles that are highly specialized or technical in nature. In such cases, a domain-specific summarization model may be more appropriate.

These are the scores we achieved:

    ROUGE Score:
    Precision: 0.391
    Recall: 0.059
    F1-Score: 0.103

    BLEU Score: 0.000


### References

1. "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization" by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. This is the original paper that introduced Pegasus and describes the pre-training method used to train the model. The paper also includes experiments on multi-document summarization of news articles.

2. "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" by Yanyao Shen, Xiaodong Liu, Kevin Duh, and Jianfeng Gao. This paper explores various fine-tuning strategies for Pegasus and other language models, including those for multi-document summarization of news articles.

3. "Extract and Summarize: Improving Large Lexicalized Summarization with Text Extraction" by Wei Li, Börkur Sigurbjörnsson, and Nick Campbell. This paper proposes a method for combining Pegasus with text extraction techniques for multi-document summarization of news articles.

4. "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks" by Chieh-Kai Lin, Ming Shen, Zhe Gan, Yuwei Fang, and Jingjing Liu. This paper explores the use of Pegasus for multi-document summarization of news articles and evaluates its performance on the CNN/Daily Mail dataset.

5. "TL;DR: Mining Reddit to Learn Automatic Summarization" by Greg Durrett, Adam Pauls, and Dragomir Radev. This paper proposes a method for using Pegasus to generate abstractive summaries of news articles from Reddit, which allows for multi-document summarization. The paper includes experiments that show the effectiveness of the approach.





In [1]:
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m51.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m60.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.2/199.2 KB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.1 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple

In [2]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, AutoTokenizer

In [3]:
# Load the Pegasus tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('google/pegasus-xsum')

# tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')

Downloading (…)okenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/3.52M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

In [13]:
a="""
The Supreme Court on February 23 allowed former Tamil Nadu Chief Minister Edappadi K. Palaniswami to continue as the interim general secretary of the AIADMK. The apex court has affirmed a Madras High Court Division Bench decision that upheld the conduct of July 2022 general council meeting of the party, during which Mr. Palaniswami was made the party leader and his rival O. Panneerselvam was expelled.
A Bench, led by Justice Dinesh Maheshwari, also directed that an interim order of the apex court on July 6, 2022, in the case was "absolute". The interim order had permitted the July 11 meeting to be held. It had further directed that no restrictions should be placed on the agenda of an earlier General Council meeting held on June 23, 2022.
"""
b="""
Ukrainian President has warned that if China sides with Russia in the war against Ukraine, it would mean World War III. 
President Zelensky thinks the world will hazard a nuclear war to serve his ambition. Perhaps he pins his hope on winning a war with the help of borrowed war machine. Powers, not proxies, fight wars.
Russia does not need military help from any country, as does Ukraine, to fight the war. The Ukraine war is a long-deferred war against the colonial mentality of raising proxies to fight their wars and push their interests. This line of thinking convinced China and India not to sign the US-sponsored condemnation resolution against Russia.
"""
c="""
Congress leader Pawan Khera was arrested at the Delhi airport today, after being taken off a flight to Chhattisgarh capital Raipur, over an alleged insult to Prime Minister Narendra Modi. Nearly 50 Congress leaders launched a rare protest on the tarmac, refusing to let the flight leave. The opposition party also approached the Supreme Court against the arrest.
Pawan Khera, a senior Congress spokesperson, was forced to exit the IndiGo flight after he boarded it as part of a Congress group heading to Raipur for a meeting of the All India Congress Committee (AICC).
"""
docs = [a,b,c]

# Concatenate the documents into a single string
# text = ' '.join(docs)

# For testing
text= """
India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.
"""

# Encode the text using the tokenizer
inputs = tokenizer.encode(text, return_tensors='pt')

In [17]:

# Generate the summary using the Pegasus model
summary_ids = model.generate(inputs, num_beams=4, length_penalty=2.0, max_length=100, min_length=30, no_repeat_ngram_size=3)

# Decode the summary and print it
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
result = summary
print(summary)

India's main opposition Congress has demanded the arrest of its leader Pawan Khera for "insulting" the prime minister and "causing a disturbance" on a flight.


In [15]:
# Generate summary for each document
for doc in docs:
    # Tokenize input document
    inputs = tokenizer.encode(doc, return_tensors='pt')

    # Generate summary
    summary_ids = model.generate(inputs, num_beams=4, length_penalty=2.0, max_length=100, min_length=30, no_repeat_ngram_size=3)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Print summary
    print('Input Document:', doc)
    print('Summary:', summary)
    print()

Input Document: 
The Supreme Court on February 23 allowed former Tamil Nadu Chief Minister Edappadi K. Palaniswami to continue as the interim general secretary of the AIADMK. The apex court has affirmed a Madras High Court Division Bench decision that upheld the conduct of July 2022 general council meeting of the party, during which Mr. Palaniswami was made the party leader and his rival O. Panneerselvam was expelled.
A Bench, led by Justice Dinesh Maheshwari, also directed that an interim order of the apex court on July 6, 2022, in the case was "absolute". The interim order had permitted the July 11 meeting to be held. It had further directed that no restrictions should be placed on the agenda of an earlier General Council meeting held on June 23, 2022.

Summary: The Supreme Court has upheld a High Court decision that upheld the conduct of July 2022 general council meeting of the AIADMK, during which Edappadi Palaniswami was made the party leader and his rival O. Panneerselvam was exp

In [16]:
!pip install scikit-learn
!pip install rouge
!pip install nltk
from rouge import Rouge 
import nltk
import nltk.translate.bleu_score as bleu
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [18]:
rouge = Rouge()
scores = rouge.get_scores(result, text)
print("ROUGE Score:")
print("Precision: {:.3f}".format(scores[0]['rouge-1']['p']))
print("Recall: {:.3f}".format(scores[0]['rouge-1']['r']))
print("F1-Score: {:.3f}".format(scores[0]['rouge-1']['f']))

ROUGE Score:
Precision: 0.391
Recall: 0.059
F1-Score: 0.103


In [19]:
from nltk.translate.bleu_score import sentence_bleu

def summary_to_sentences(summ):
    # Split the summary into sentences using the '.' character as a separator
    sentences = summ.split('.')
    
    # Convert each sentence into a list of words
    sentence_lists = [sentence.split() for sentence in sentences]
    
    return sentence_lists

def paragraph_to_wordlist(paragraph):
    # Split the paragraph into words using whitespace as a separator
    words = paragraph.split()
    return words

reference_paragraph = text
reference_summary = summary_to_sentences(reference_paragraph)
predicted_paragraph = result
predicted_summary = paragraph_to_wordlist(predicted_paragraph)

score = sentence_bleu(reference_summary, predicted_summary)
print(score)   

1.4837867640225538e-231


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [20]:
print("BLEU Score: {:.3f}".format(score))

BLEU Score: 0.000
