<a href="https://colab.research.google.com/github/Mustafa017/paraphrasing_tool/blob/main/paraphrasing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q -U watermark
!pip install -qq transformers

In [5]:
%reload_ext watermark
%watermark -v -p transformers

Python implementation: CPython
Python version       : 3.7.15
IPython version      : 7.9.0

transformers: 4.24.0



## Constructing a PEGASUS tokenizer (based on SentencePiece)
A tokenizer is in charge of preparing the inputs for a model. Its responsibilities include:

* Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).

* Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).

* Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.
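
The responsibilities listed above can be illustrated with a minimal, pure-Python sketch. It is not the `PegasusTokenizer` implementation: the vocabulary, the greedy longest-match rule, and the helper names `tokenize`, `encode`, and `decode` are all hypothetical and chosen only to show the token/id round trip and the handling of special tokens.

```python
# Hypothetical toy vocabulary; a real PEGASUS vocabulary is learned by SentencePiece.
VOCAB = ["<pad>", "</s>", "<unk>", "<mask>", "learn", "ing", "is", "fun"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
SPECIAL_TOKENS = {"<pad>", "</s>", "<unk>", "<mask>"}


def tokenize(text):
    """Split text into sub-word tokens, never splitting a special token."""
    tokens = []
    for word in text.split():
        if word in SPECIAL_TOKENS:
            tokens.append(word)  # special tokens are kept whole
            continue
        start = 0
        while start < len(word):
            # Greedy longest match against the toy vocabulary (an assumption,
            # not how SentencePiece actually segments text).
            for end in range(len(word), start, -1):
                if word[start:end] in TOKEN_TO_ID:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                tokens.append("<unk>")  # no vocabulary entry matches here
                start += 1
    return tokens


def encode(text):
    """Tokenize, convert tokens to ids, and append the end-of-sequence id."""
    return [TOKEN_TO_ID[t] for t in tokenize(text)] + [TOKEN_TO_ID["</s>"]]


def decode(ids, skip_special_tokens=True):
    """Convert ids back to tokens, optionally dropping special tokens."""
    tokens = [VOCAB[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return tokens


print(tokenize("learning is <mask>"))  # ['learn', 'ing', 'is', '<mask>']
print(encode("learning is fun"))       # [4, 5, 6, 7, 1]
print(decode([4, 5, 6, 7, 1]))         # ['learn', 'ing', 'is', 'fun']
```

Note that this toy `decode` returns tokens rather than a string, because the word boundaries were lost when splitting on whitespace.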

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems in which the vocabulary size is fixed before the model is trained.

SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
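
One reason SentencePiece needs no language-specific pre/postprocessing is that it treats whitespace as an ordinary symbol, written as the meta character `▁` (U+2581), so the original text can be recovered from the pieces alone. The sketch below imitates only that idea: the piece inventory and the greedy longest-match segmentation are hypothetical simplifications, whereas real SentencePiece models learn their pieces from raw text with a unigram language model or BPE.

```python
META = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace

# Hypothetical toy piece inventory; a trained model would learn this from data.
PIECES = {META + "learn", "ing", META + "is", META + "fun", META}


def encode_as_pieces(text):
    """Replace spaces with the meta symbol, then segment greedily."""
    normalized = META + text.replace(" ", META)
    pieces, start = [], 0
    while start < len(normalized):
        for end in range(len(normalized), start, -1):
            if normalized[start:end] in PIECES:
                pieces.append(normalized[start:end])
                start = end
                break
        else:
            # Fall back to a single character so no input is ever dropped.
            pieces.append(normalized[start])
            start += 1
    return pieces


def decode_pieces(pieces):
    """Detokenize: join the pieces, restore spaces, drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]


pieces = encode_as_pieces("learning is fun")
print(pieces)                 # ['▁learn', 'ing', '▁is', '▁fun']
print(decode_pieces(pieces))  # learning is fun
```

Because the whitespace is stored inside the pieces, detokenization is a plain string join with no language-specific rules.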

In [4]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# 'tuner007/pegasus_paraphrase' is a PEGASUS checkpoint fine-tuned for paraphrasing.
model_name = 'tuner007/pegasus_paraphrase'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def get_response(input_text, num_return_sequences, num_beams):
  # Tokenize the input, truncating it to at most 60 tokens.
  batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
  # Generate with beam search; num_return_sequences must not exceed num_beams.
  # temperature only takes effect when sampling (do_sample=True); it is kept from the original call.
  translated = model.generate(**batch, max_length=60, num_beams=num_beams, num_return_sequences=num_return_sequences, temperature=1.5)
  # Convert the generated token ids back to text.
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
  return tgt_text
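
To make the roles of `num_beams` and `num_return_sequences` in `get_response` more concrete, here is a heavily simplified beam search over a hypothetical next-token table. It is only a sketch: `TOY_MODEL` and `toy_beam_search` are invented for illustration, and the real `generate` method additionally applies length penalties, early-stopping rules, and batching that are omitted here.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"learning": 0.6, "the": 0.4},
    "learning": {"is": 0.7, "involves": 0.3},
    "the": {"process": 1.0},
    "process": {"is": 1.0},
    "is": {"</s>": 1.0},
    "involves": {"</s>": 1.0},
}


def toy_beam_search(num_beams, num_return_sequences, max_length=6):
    """Keep the num_beams best partial sequences at every step."""
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep only the best num_beams candidates; set finished ones aside.
        beams = []
        for candidate in candidates[:num_beams]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start and end markers before returning the text.
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:num_return_sequences]]


print(toy_beam_search(num_beams=3, num_return_sequences=3))
# ['learning is', 'the process is', 'learning involves']
print(toy_beam_search(num_beams=1, num_return_sequences=1))
# ['learning is']
```

With a single beam the search is greedy and can return only one sequence, which is why asking for more sequences than beams is rejected.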



In [7]:
sentence = "Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values, attitudes, and preferences."
get_response(sentence, num_return_sequences=10, num_beams=10)

['Learning involves the acquisition of new understanding, knowledge, behaviors, skills, values, attitudes, and preferences.',
 'Learning is the acquisition of new understanding, knowledge, behaviors, skills, values, attitudes, and preferences.',
 'The process of learning is the acquisition of new understanding, knowledge, behaviors, skills, values, attitudes, and preferences.',
 'Gaining new understanding, knowledge, behaviors, skills, values, attitudes, and preferences is the process of learning.',
 'New understanding, knowledge, behaviors, skills, values, attitudes, and preferences are acquired through learning.',
 'Learning is the acquisition of new understanding, knowledge, behaviors, skills, values, attitudes and preferences.',
 'The process of learning is the acquisition of new understanding, knowledge, behaviors, skills, values, attitudes and preferences.',
 'New understanding, knowledge, behaviors, skills, values, attitudes, and preferences can be acquired through learning.',
 

In [9]:
sentence2 = 'Earlier in the day in Group C, Saudi Arabia pulled off the most unlikely of upsets over heavy favorite Argentina. If this stalemate holds between Mexico and Poland, the Saudis would end the day atop the group with 3 points.'
get_response(sentence2, num_return_sequences=10, num_beams=10)

['Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day and would end the day atop the group with 3 points.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day, so they would end the day atop the group with 3 points.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina in Group C.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day in Group C.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day, and they would end the day atop the group with 3 points.',
 'Saudi Arabia pulled off the most unlikely of upsets against Argentina in Group C.',
 'In Group C, Saudi Arabia pulled off the most unlikely of upsets over Argentina.',
 'Saudi Arabia pulled off the most unlikely of upsets over Argentina earlier in the day, so they would end the day at