In [1]:
! pip install sentence-splitter

Collecting sentence-splitter
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Installing collected packages: sentence-splitter
Successfully installed sentence-splitter-1.4


In [2]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.19.0-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.6.0 py

In [3]:
! pip install SentencePiece

Collecting SentencePiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: SentencePiece
Successfully installed SentencePiece-0.1.96


In [4]:
# https://huggingface.co/tuner007/pegasus_paraphrase

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_paraphrase'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

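def get_response(input_text, num_return_sequences):
  # Tokenize the sentence; inputs longer than 60 tokens are truncated
  batch = tokenizer.prepare_seq2seq_batch([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
  # Beam search with 10 beams, returning the top num_return_sequences candidates
  translated = model.generate(**batch, max_length=60, num_beams=10, num_return_sequences=num_return_sequences, temperature=1.5)
  # Decode the generated token ids back to text, dropping special tokens
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
  return tgt_text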


In [5]:
text = "Both GSG and MLM are applied simultaneously to this example as pre-training objectives."

In [6]:
get_response(text, 1)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



['This example has both GSG and MLM applied at the same time.']
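
The deprecation warning above recommends calling the tokenizer directly instead of `prepare_seq2seq_batch`. A minimal sketch of the same function written that way follows. It assumes the `tokenizer`, `model` and `torch_device` objects created in In [4] and takes them as explicit arguments; the name `get_response_v2` is hypothetical. `temperature` is left out on the assumption that it only influences sampling and not plain beam search.

```python
def get_response_v2(input_text, num_return_sequences, tokenizer, model, device):
    """Paraphrase one sentence using the tokenizer's __call__ method."""
    # Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch.
    batch = tokenizer(
        [input_text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt",
    ).to(device)
    # Beam search: keep the top `num_return_sequences` of 10 beams.
    generated = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_return_sequences,
    )
    # Convert token ids back to plain text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the notebook's objects this would be called as `get_response_v2(text, 1, tokenizer, model, torch_device)`.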

In [7]:
# Paragraph of text
context = "Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Text summarization aims at generating accurate and concise summaries from input document(s). In contrast to extractive summarization which merely copies informative fragments from the input, abstractive summarization may generate novel words. A good abstractive summary covers principal information in the input and is linguistically fluent."
print(context)

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Text summarization aims at generating accurate and concise summaries from input document(s). In contrast to extractive summarization which merely copies informative fragments from the input, abstractive summarization may generate novel words. A good abstractive summary covers principal information in the input and is linguistically fluent.


In [8]:
# Takes the input paragraph and splits it into a list of sentences
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')

sentence_list = splitter.split(context)
sentence_list

['Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models.',
 'Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary.',
 'Text summarization aims at generating accurate and concise summaries from input document(s).',
 'In contrast to extractive summarization which merely copies informative fragments from the input, abstractive summarization may generate novel words.',
 'A good abstractive summary covers principal information in the input and is linguistically fluent.']
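
A list like this could also be tokenized and passed to `model.generate` as one batch rather than one sentence per call. In that case `batch_decode` returns a single flat list in which the candidates for each input sentence sit next to each other, `num_return_sequences` at a time. The sketch below only shows the regrouping step; `group_candidates` is a hypothetical helper, not part of the notebook or of any library.

```python
def group_candidates(flat_outputs, num_return_sequences):
    """Split a flat list of decoded outputs into one list per input sentence."""
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    if len(flat_outputs) % num_return_sequences != 0:
        raise ValueError("output count is not a multiple of num_return_sequences")
    n = num_return_sequences
    # Consecutive chunks of size n belong to the same input sentence.
    return [flat_outputs[i:i + n] for i in range(0, len(flat_outputs), n)]


# Two input sentences with two candidates each.
grouped = group_candidates(["a1", "a2", "b1", "b2"], 2)
print(grouped)  # [['a1', 'a2'], ['b1', 'b2']]
```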

In [9]:
# Paraphrase each sentence in the list, requesting two candidates per sentence
paraphrase = []

for sentence in sentence_list:
  candidates = get_response(sentence, 2)
  paraphrase.append(candidates)


In [10]:
# This is the paraphrased text
paraphrase

[['Pre-training with Extracted Gap-sentences.',
  'Pre-training with extracted gap-sentences.'],
 ['Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary.',
  'Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences.'],
 ['Text summarization can be used to generate accurate and concise summaries.',
  'Text summarization tries to get accurate and concise summaries from the input document.'],
 ['In contrast to extractive summarization, abstractive summarization can generate novel words.',
  'In contrast to extractive summarization, abstractive summarization may generate novel words.'],
 ['A good abstractive summary covers most of the information in the input.',
  'A good summary covers the main information in the input.']]
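
In the output above, the first candidate for the second sentence is identical to the input sentence, so joining every candidate repeats content. One option, sketched below, is to keep a single candidate per sentence and prefer the first one that differs from the original. `pick_candidates` is a hypothetical helper, not part of the notebook; it compares sentences case-insensitively after stripping surrounding whitespace.

```python
def pick_candidates(sentences, candidates_per_sentence):
    """Keep one paraphrase per sentence, preferring one that differs from the source."""
    chosen = []
    for source, candidates in zip(sentences, candidates_per_sentence):
        normalized = source.strip().lower()
        # Candidates that are not a verbatim copy of the source sentence.
        different = [c for c in candidates if c.strip().lower() != normalized]
        if different:
            chosen.append(different[0])
        elif candidates:
            # Every candidate matches the source; keep the first one.
            chosen.append(candidates[0])
        else:
            # No candidates at all; fall back to the source sentence.
            chosen.append(source)
    return chosen
```

With the notebook's variables, `' '.join(pick_candidates(sentence_list, paraphrase))` would give a paragraph with one paraphrase per original sentence.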

In [11]:
paraphrase2 = [' '.join(x) for x in paraphrase]
paraphrase2

['Pre-training with Extracted Gap-sentences. Pre-training with extracted gap-sentences.',
 'Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences.',
 'Text summarization can be used to generate accurate and concise summaries. Text summarization tries to get accurate and concise summaries from the input document.',
 'In contrast to extractive summarization, abstractive summarization can generate novel words. In contrast to extractive summarization, abstractive summarization may generate novel words.',
 'A good abstractive summary covers most of the information in the input. A good summary covers the main information in the input.']

In [12]:
# Combines the above list into a single paragraph
paraphrased_text = ' '.join(paraphrase2)
paraphrased_text

'Pre-training with Extracted Gap-sentences. Pre-training with extracted gap-sentences. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences. Text summarization can be used to generate accurate and concise summaries. Text summarization tries to get accurate and concise summaries from the input document. In contrast to extractive summarization, abstractive summarization can generate novel words. In contrast to extractive summarization, abstractive summarization may generate novel words. A good abstractive summary covers most of the information in the input. A good summary covers the main information in the input.'

In [13]:
print(context)
print(paraphrased_text)

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Text summarization aims at generating accurate and concise summaries from input document(s). In contrast to extractive summarization which merely copies informative fragments from the input, abstractive summarization may generate novel words. A good abstractive summary covers principal information in the input and is linguistically fluent.
Pre-training with Extracted Gap-sentences. Pre-training with extracted gap-sentences. Important sentences from an input text are removed/masked in PEGASUS, and the remaining sentences are formed as one output sequence from the remaining sentences, similar to an extractive summary. Important sentences from an input text are removed/masked in PEGASUS, and the r