First, we install all the dependencies and import the classes we need.

In [2]:
!pip install torch
!pip install transformers
!pip install sentencepiece
!pip install accelerate -U
!pip install transformers[torch]
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch



If GPU is available, we use the GPU.

In [3]:
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

We use the fine-tuned "tuner007/pegasus_paraphrase" model for our paraphrasing tasks.

In [4]:
model = 'tuner007/pegasus_paraphrase'

Create a PegasusTokenizer from the same checkpoint, so that text is tokenized with the SentencePiece vocabulary the Pegasus model expects.

In [5]:
tokenizer = PegasusTokenizer.from_pretrained(model)


Load the fine-tuned Pegasus model and move it to the selected device.

In [7]:
finetuned_model = PegasusForConditionalGeneration.from_pretrained(model).to(torch_device)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
input_example = "rain atlanta till nine"

Prepare a batch of input sequences.

In [32]:
batch = tokenizer.prepare_seq2seq_batch(input_example,
                                        truncation=True,  # Truncate inputs longer than max_length. Our keyword inputs are far shorter than that, so no keyword is dropped.
                                        padding='longest',  # Pad every sequence to the length of the longest sequence in the batch.
                                        max_length=200,  # Maximum input length in tokens.
                                        return_tensors="pt").to(torch_device)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.
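
The warning above names the tokenizer's `__call__` method as the replacement for `prepare_seq2seq_batch`. A minimal sketch of that change, assuming the `tokenizer`, `input_example` and `torch_device` objects created in the earlier cells; the helper name `encode_inputs` is hypothetical and not part of the library:

```python
def encode_inputs(tokenizer, texts, device="cpu", max_length=200):
    """Tokenize source texts by calling the tokenizer directly.

    `encode_inputs` is a hypothetical helper name used only for illustration.
    """
    if isinstance(texts, str):
        texts = [texts]  # wrap a single string so the result is always a batch
    encoded = tokenizer(
        texts,
        truncation=True,       # cut inputs longer than max_length tokens
        padding="longest",     # pad to the longest sequence in the batch
        max_length=max_length,
        return_tensors="pt",   # return PyTorch tensors
    )
    # The returned encoding exposes .to(), which moves its tensors to the device.
    return encoded.to(device)


# Usage with the objects defined in the earlier cells:
# batch = encode_inputs(tokenizer, input_example, torch_device)
```

The result can be unpacked into `generate` with `**batch` in the same way as before.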



Generate output token IDs with the fine-tuned model. They are decoded into text in the next cell.

In [33]:
batch_output = finetuned_model.generate(**batch, max_length=200,
                                        num_beams=3,  # A small beam width keeps the outputs standard and uniform. The best value depends on the kind of inputs provided.
                                        num_return_sequences=1,  # We call this sentence by sentence, so we want only one output sentence for each input sequence of keywords.
                                        temperature=0.5)  # Only takes effect when sampling is enabled (do_sample=True).
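
The call above runs beam search, which returns the same output on every run; `temperature` only changes the result once sampling is switched on with `do_sample=True`. A small sketch of the two configurations as keyword dictionaries for `generate`. The helper name `generation_kwargs` and the `top_p` value are illustrative assumptions, not settings taken from this notebook:

```python
def generation_kwargs(sample=False, num_return_sequences=1, max_length=200):
    """Build keyword arguments for `generate` (hypothetical helper)."""
    kwargs = {
        "max_length": max_length,
        "num_return_sequences": num_return_sequences,
    }
    if sample:
        # Sampling: temperature rescales the token distribution before drawing.
        # top_p=0.95 is an assumed value chosen only for illustration.
        kwargs.update(do_sample=True, temperature=0.5, top_p=0.95)
    else:
        # Beam search: deterministic. The beam width must be at least
        # as large as the number of sequences requested.
        kwargs.update(do_sample=False, num_beams=max(3, num_return_sequences))
    return kwargs


# Usage with the objects defined in the earlier cells:
# batch_output = finetuned_model.generate(**batch, **generation_kwargs())
```

For keyword-to-sentence conversion where uniform output matters, the deterministic branch matches the intent described above; the sampling branch is shown only for contrast.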



In [34]:
output = tokenizer.batch_decode(batch_output, skip_special_tokens=True)


In [35]:
print(output)

['The rain is in Atlanta until nine.']


Tidying up and putting the paraphrasing in one function.

In [36]:
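def paraphrase(input_text, num_return_sequences):
    # Tokenize with the tokenizer's __call__ method, since prepare_seq2seq_batch is deprecated.
    batch = tokenizer(input_text,
                      truncation=True,
                      padding='longest',
                      max_length=200, return_tensors="pt").to(torch_device)
    # Use the loaded model object (finetuned_model), not the checkpoint name stored in `model`.
    output_ids = finetuned_model.generate(**batch, max_length=200,
                                          num_beams=max(3, num_return_sequences),  # Beam width must be at least num_return_sequences.
                                          num_return_sequences=num_return_sequences,
                                          temperature=0.5)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)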
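
When several keyword strings are passed at once and `num_return_sequences` is greater than one, `batch_decode` returns a single flat list in which the candidates for each input sit next to each other. A minimal sketch that groups them per input, assuming a tokenizer and a model with the same interface as the objects loaded earlier; the name `paraphrase_many` is hypothetical:

```python
def paraphrase_many(texts, tokenizer, model, device="cpu",
                    num_return_sequences=2, max_length=200):
    """Paraphrase a list of inputs and group the candidates per input.

    `paraphrase_many` is a hypothetical helper used only for illustration.
    """
    encoded = tokenizer(
        texts,
        truncation=True,
        padding="longest",
        max_length=max_length,
        return_tensors="pt",
    ).to(device)
    output_ids = model.generate(
        **encoded,
        max_length=max_length,
        # Beam width must be at least the number of sequences requested.
        num_beams=max(3, num_return_sequences),
        num_return_sequences=num_return_sequences,
    )
    decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    n = num_return_sequences
    # Slice the flat list into one chunk of n candidates per input text.
    return [decoded[i * n:(i + 1) * n] for i in range(len(texts))]


# Usage with the objects defined in the earlier cells:
# paraphrase_many(["rain atlanta till nine"], tokenizer, finetuned_model, torch_device)
```

Passing the tokenizer and model as arguments keeps the helper independent of notebook globals, which also makes it easy to check with stand-in objects.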

