# How to paraphrase text using transformers in Python

Notes and References:

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is a sequence-to-sequence model developed by Google Research. This notebook uses the PEGASUS implementation from the Hugging Face Transformers library. For more details, see the original paper, "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization".


Install Libraries

In [8]:
!pip install sentence-splitter
!pip install transformers
!pip install SentencePiece




Download the pretrained Pegasus paraphrase model (tuner007/pegasus_paraphrase, a PEGASUS checkpoint fine-tuned for paraphrasing) and its tokenizer

In [9]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [10]:
tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
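
The warning above names only the encoder and decoder position-embedding weights. As far as I know, Pegasus uses fixed sinusoidal position embeddings that are computed from a formula rather than learned, which would explain why they are missing from the checkpoint and why the model still produces sensible paraphrases without further training. The sketch below illustrates the idea with the standard interleaved sine/cosine formula. The function name `sinusoidal_positions` is hypothetical, and the exact layout used inside the library may differ.

```python
import math


def sinusoidal_positions(num_positions, dim):
    """Build a fixed (non-learned) table of sinusoidal position embeddings.

    Illustrative only: this uses the interleaved sin/cos layout from the
    original Transformer paper, not necessarily the library's layout.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of dimensions shares one frequency
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# The table depends only on the sizes, so it can be rebuilt at load time
table = sinusoidal_positions(num_positions=3, dim=4)
```

Because the table is a pure function of `num_positions` and `dim`, nothing needs to be stored in the checkpoint for it.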


# Tokenization

In [12]:
text = "Machine Learning is Fast-Growing Field in Artificial Intelligence"

batch = tokenizer([text], padding=True, truncation=True, max_length=60, return_tensors='pt')  # return PyTorch tensors
output = model.generate(**batch, max_length=60, num_beams=5, num_return_sequences=5)

results = tokenizer.batch_decode(output, skip_special_tokens=True)


In [13]:
results

['Machine Learning is a fast-growing field.',
 'Machine learning is growing fast.',
 'Machine learning is a fast-growing field.',
 'Machine Learning is growing fast.',
 'The field of machine learning is growing fast.']
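
Several of the candidates above differ only in capitalization ("Machine Learning" versus "Machine learning"). A minimal sketch of one way to filter such near-duplicates is shown below. The helper `unique_paraphrases` is hypothetical and is not part of the Transformers library; it compares candidates after lowercasing and collapsing whitespace, and keeps the first occurrence of each.

```python
def unique_paraphrases(candidates, source=None):
    """Drop candidates that repeat an earlier one, ignoring case and spacing.

    If `source` is given, candidates identical to the input text are
    dropped as well.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    seen = set()
    if source is not None:
        seen.add(normalize(source))

    unique = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique


# Candidates copied from the output above
results = [
    'Machine Learning is a fast-growing field.',
    'Machine learning is growing fast.',
    'Machine learning is a fast-growing field.',
    'Machine Learning is growing fast.',
    'The field of machine learning is growing fast.',
]
filtered = unique_paraphrases(results)
# filtered keeps 3 of the 5 candidates
```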

# Predictive System (Generate Paraphrase)

In [14]:
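def get_response(input_text, num_return_sequences, num_beams):
    """Return `num_return_sequences` paraphrases of `input_text` using beam search."""
    # Tokenize the input, truncating it to at most 60 tokens
    batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors='pt')
    # Generate candidate paraphrases with beam search
    generated = model.generate(**batch, max_length=60, num_beams=num_beams, num_return_sequences=num_return_sequences)
    # Convert the generated token IDs back to text
    return tokenizer.batch_decode(generated, skip_special_tokens=True)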
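
With beam search, `generate` can return at most one sequence per beam, so `num_return_sequences` must not exceed `num_beams`; otherwise the library raises an error during generation. The sketch below checks the two values up front so the problem surfaces with a clear message. `check_beam_settings` is a hypothetical helper, not a library function.

```python
def check_beam_settings(num_beams, num_return_sequences):
    """Hypothetical helper: validate beam-search settings before generating."""
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences ({num_return_sequences}) cannot exceed "
            f"num_beams ({num_beams}) when using beam search"
        )
    return num_beams, num_return_sequences


# Valid: one returned sequence per beam
settings = check_beam_settings(num_beams=10, num_return_sequences=10)
```

It could be called at the top of `get_response` before the model is invoked.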

In [16]:
num_beams = 10
num_return_sequences = 10
text = "If you want, I can give you a few other options"
get_response(text, num_return_sequences, num_beams)


['I can give you other options if you want.',
 'I can give you other options if you want to.',
 'I can give you other options if you want them.',
 'I can give you a few other options if you want.',
 'I can give you a few other options.',
 'I can give you a few other options if you want to.',
 'I can give you a few other options if you want them.',
 'I can give you more options if you want.',
 'If you want, I can give you other options.',
 'I can give you more options if you want to.']
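
`get_response` truncates its input to 60 tokens, so a longer paragraph is better paraphrased one sentence at a time. The sketch below shows that pattern under a few assumptions: `split_sentences` is a naive regex-based splitter, `paraphrase_paragraph` is a hypothetical helper, and `paraphrase_fn` stands for any function that takes a sentence and returns a list of candidates. The sentence-splitter package installed earlier could presumably replace the regex for more robust splitting.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def paraphrase_paragraph(paragraph, paraphrase_fn):
    """Paraphrase each sentence separately and join the results.

    `paraphrase_fn` takes one sentence and returns a list of candidates,
    best first. If it returns nothing, the original sentence is kept.
    """
    rewritten = []
    for sentence in split_sentences(paragraph):
        candidates = paraphrase_fn(sentence)
        rewritten.append(candidates[0] if candidates else sentence)
    return " ".join(rewritten)


# Stand-in paraphraser so the example runs without the model
demo = paraphrase_paragraph("Hello there. How are you?", lambda s: [s.upper()])
```

With the model loaded, the stand-in could be swapped for `lambda s: get_response(s, 1, 10)`.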

# Save model and tokenizer

In [None]:
model.save_pretrained('/content/drive/MyDrive/Ai class diu Project/model300')
tokenizer.save_pretrained('/content/drive/MyDrive/Ai class diu Project/tokenizer300')



('/content/drive/MyDrive/Ai class diu Project/tokenizer3/tokenizer_config.json',
 '/content/drive/MyDrive/Ai class diu Project/tokenizer3/special_tokens_map.json',
 '/content/drive/MyDrive/Ai class diu Project/tokenizer3/spiece.model',
 '/content/drive/MyDrive/Ai class diu Project/tokenizer3/added_tokens.json')
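
The tuple returned by `tokenizer.save_pretrained` lists the files it wrote. Before reloading a saved tokenizer by passing its directory to `from_pretrained`, it can help to confirm those files are actually present. The sketch below does that with the standard library only. The helper `missing_files` is hypothetical, and the expected file names are taken from the output above; other library versions may write a different set.

```python
from pathlib import Path

# File names taken from the save_pretrained output shown above
EXPECTED_TOKENIZER_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "spiece.model",
    "added_tokens.json",
)


def missing_files(directory, expected=EXPECTED_TOKENIZER_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file was found, so the directory is a reasonable candidate to load from.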