##**Installing Libraries**

In [1]:
!pip install transformers
!pip install sentence-splitter
!pip install SentencePiece

Collecting sentence-splitter
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Installing collected packages: sentence-splitter
Successfully installed sentence-splitter-1.4


###**Downloading Pre-Trained Google Pegasus Paraphrase Model and its tokenizer**

In [2]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [3]:
model = PegasusForConditionalGeneration.from_pretrained('tuner007/pegasus_paraphrase')
tokenizer = PegasusTokenizer.from_pretrained('tuner007/pegasus_paraphrase')

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


###**Tokenization and Generation**

In [5]:
text = "The ultimate test of your knowledge is your capacity to convey it to another"

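# Encode the sentence as a batch of PyTorch tensors, truncating inputs longer than 60 tokens
batch = tokenizer([text], padding=True, truncation=True, max_length=60, return_tensors='pt')

# Beam search with 5 beams, returning all 5 candidates.
# No `temperature` is passed: it only takes effect when sampling (do_sample=True).
output = model.generate(**batch, max_length=60, num_beams=5, num_return_sequences=5)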
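
A note on the two beam-search arguments used in the call above: with beam search, `generate` is expected to reject a `num_return_sequences` larger than `num_beams`, since each returned sequence comes from one beam. The sketch below collects the arguments in one place and checks that constraint before the model is called. `build_generate_kwargs` is a hypothetical helper name, not part of transformers, and its defaults simply mirror the values used in this notebook.

```python
def build_generate_kwargs(num_return_sequences=5, num_beams=None, max_length=60):
    """Return keyword arguments for model.generate (hypothetical helper)."""
    # Assumption: default to one beam per requested candidate.
    if num_beams is None:
        num_beams = num_return_sequences
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    # Beam search can return at most one sequence per beam.
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences ({num_return_sequences}) cannot exceed "
            f"num_beams ({num_beams})"
        )
    return {
        "max_length": max_length,
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
    }
```

With the model and tokenizer loaded, this could be used as `model.generate(**batch, **build_generate_kwargs(5, 5))`.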



In [7]:
results = tokenizer.batch_decode(output, skip_special_tokens = True)
print(results)

['The test of your knowledge is your ability to convey it.', 'Your capacity to convey your knowledge is the ultimate test of it.', 'The ability to convey your knowledge is the ultimate test of your knowledge.', 'The test of your knowledge is your ability to communicate it.', 'Your capacity to convey your knowledge is the ultimate test.']
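
Beam search can return candidates that differ only in casing or punctuation, or that simply repeat the input. A minimal sketch of a post-processing step that filters those out is shown below. The names `unique_paraphrases` and `_normalize` are hypothetical, and the normalization rule (lowercase, strip ASCII punctuation, collapse whitespace) is an assumption chosen for illustration rather than anything the model or library prescribes.

```python
import string


def _normalize(sentence):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def unique_paraphrases(source, candidates):
    """Keep candidates that differ from the source and from earlier candidates."""
    seen = {_normalize(source)}
    kept = []
    for candidate in candidates:
        key = _normalize(candidate)
        # Skip empty strings and anything already seen after normalization.
        if key and key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept
```

For the cell above, this would be called as `unique_paraphrases(text, results)`. The original order is preserved, so the highest-ranked beam stays first.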


###**Saving model & tokenizer**

In [9]:
# Save the pre-trained model and tokenizer to Google Drive (assumes Drive is mounted at /content/drive)
model.save_pretrained('/content/drive/MyDrive/model')
tokenizer.save_pretrained('/content/drive/MyDrive/tokenizer')

Non-default generation parameters: {'max_length': 60, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


('/content/drive/MyDrive/tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/tokenizer/spiece.model',
 '/content/drive/MyDrive/tokenizer/added_tokens.json')
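
Once saved, the model and tokenizer can be reloaded from those directories instead of being downloaded again, because `from_pretrained` also accepts a local directory path. The sketch below is one way to wrap that. `load_saved_paraphraser` is a hypothetical helper name, and the up-front directory check is an added convenience, not something the library requires.

```python
from pathlib import Path


def load_saved_paraphraser(model_dir, tokenizer_dir):
    """Reload a Pegasus model and tokenizer written by save_pretrained."""
    # Fail early with a clear message if either directory is missing.
    for directory in (model_dir, tokenizer_dir):
        if not Path(directory).is_dir():
            raise FileNotFoundError(f"Expected a saved directory at {directory}")

    # Imported here so the path check above runs before the heavy import.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    model = PegasusForConditionalGeneration.from_pretrained(model_dir)
    tokenizer = PegasusTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer
```

In this notebook, the call would be `load_saved_paraphraser('/content/drive/MyDrive/model', '/content/drive/MyDrive/tokenizer')`.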

###**Predictive System (Generate Paraphrase)**

In [14]:
def predict(input_text, num_return_sequences = 5, num_beams = 5):
  # Tokenize a single sentence, truncating inputs longer than 60 tokens
  batch = tokenizer([input_text], padding = True, truncation = True, max_length = 60, return_tensors = 'pt')

  # Beam search using the caller's settings; num_return_sequences must not exceed num_beams
  translated = model.generate(**batch, max_length = 60, num_beams = num_beams, num_return_sequences = num_return_sequences)

  # Decode the generated token ids back into text
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens = True)

  return tgt_text

In [15]:
num_beams = 5
num_return_sequences = 5
input_text = "Data is getting dangerous and taking over the world"
predict(input_text, num_return_sequences, num_beams)

['Data is taking over the world.',
 'Data is taking over the world in dangerous ways.',
 'The world is being taken over by data.',
 'Data is taking over the world in a dangerous way.',
 'Data is taking over the world']
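
`predict` truncates its input at 60 tokens, so a longer passage is better handled one sentence at a time. The notebook installs sentence-splitter but does not use it, and splitting a paragraph before paraphrasing is one plausible use. The sketch below assumes that workflow. `paraphrase_paragraph` and `naive_split` are hypothetical names, and the regex splitter is only a rough stand-in so that the example runs without extra packages.

```python
import re


def naive_split(text):
    """Rough sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def paraphrase_paragraph(text, paraphrase_fn, split_fn=naive_split):
    """Paraphrase each sentence and join the top candidate for each one."""
    rewritten = []
    for sentence in split_fn(text):
        candidates = paraphrase_fn(sentence)
        # Fall back to the original sentence if no candidate comes back.
        rewritten.append(candidates[0] if candidates else sentence)
    return " ".join(rewritten)


# Assumed wiring with the notebook's predict() and the sentence-splitter package:
# from sentence_splitter import SentenceSplitter
# splitter = SentenceSplitter(language="en")
# paraphrase_paragraph(paragraph, predict, split_fn=splitter.split)
```

Taking only the first candidate per sentence keeps the output the same length as the input. Picking a different candidate per sentence is an easy variation.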