In [1]:
# https://github.com/google-research/pegasus
# https://arxiv.org/abs/2212.06933 (Paraphrase Identification with Deep Learning: A Review of Datasets and Methods)

''' Reference '''

' Reference '

In [2]:
''' Requirements '''
# absl-py
# mock
# numpy
# rouge-score
# sacrebleu
# sentencepiece
# tensorflow-text==1.15.0rc0
# tensor2tensor==1.15.0
# tensorflow-datasets==2.1.0
# tensorflow-gpu==1.15.2

' Requirements '

In [3]:
! pip install sentence-splitter

# https://pypi.org/project/sentence-splitter/

''' This module allows splitting of text paragraphs into sentences. '''


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-splitter
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Installing collected packages: sentence-splitter
Successfully installed sentence-splitter-1.4


' This module allows splitting of text paragraphs into sentences. '

In [4]:
! pip install transformers

# ‘Attention Is All You Need’
# https://arxiv.org/abs/1706.03762

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


In [5]:
! pip install SentencePiece
# https://pypi.org/project/sentencepiece/

''' 
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems 

'''


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting SentencePiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: SentencePiece
Successfully installed SentencePiece-0.1.99


' \nSentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems \n\n'

In [6]:
# https://huggingface.co/tuner007/pegasus_paraphrase
# https://github.com/google-research/pegasus

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_paraphrase'  # pre-trained paraphrase model

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = PegasusTokenizer.from_pretrained(model_name)

model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def get_response(input_text, num_return_sequences):
    # prepare_seq2seq_batch prepares model inputs for translation.
    # For best performance, translate one sentence at a time.
    batch = tokenizer.prepare_seq2seq_batch([input_text],
                                            truncation=True,
                                            padding='longest',
                                            max_length=60,
                                            return_tensors="pt").to(torch_device)

    # Beam search with 10 beams; return the requested number of candidates.
    translated = model.generate(**batch, max_length=60,
                                num_beams=10,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)

    # Decode the generated token IDs back into text.
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return tgt_text

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/86.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

### **Processing a single sentence**


In [7]:
text = "Artificial Intelligence is the science and engineering of making intelligent machines, especially intelligent computer programs." 
print(len(text.split(" ")))


15


In [8]:
get_response(text,10)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



['Artificial intelligence is the science and engineering of making machines.',
 'Artificial Intelligence is the science and engineering of making machines.',
 'Artificial intelligence is the science and engineering of computers.',
 'Artificial intelligence is the science and engineering of computer programs.',
 'Artificial Intelligence is the science and engineering of computers.',
 'Artificial intelligence is the science and engineering of machines.',
 'Artificial intelligence is the science and engineering of making machines and computer programs.',
 'Artificial intelligence is the science and engineering of creating machines.',
 'Artificial Intelligence is the science and engineering of making machines and computer programs.',
 'Artificial intelligence is the science and engineering of making machines that are smart.']
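
The warning above says `prepare_seq2seq_batch` is deprecated and recommends calling the tokenizer directly. The following is a minimal sketch of that change, not the notebook's original code. The name `get_response_v2` and its extra `tokenizer`, `model`, and `device` parameters are hypothetical; they stand in for the objects created in cell 6. `temperature` is left out on the assumption that it only takes effect when sampling is enabled (`do_sample=True`), which this beam-search call does not do.

```python
def get_response_v2(input_text, num_return_sequences, tokenizer, model, device="cpu"):
    """Paraphrase one sentence using the tokenizer's __call__ method."""
    # Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch.
    batch = tokenizer(
        [input_text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt",
    ).to(device)

    # Beam search: request num_return_sequences candidates out of 10 beams.
    generated = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_return_sequences,
    )

    # Decode the generated token IDs back into text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the objects from cell 6, the call would look like `get_response_v2(text, 10, tokenizer, model, torch_device)`.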

In [9]:
get_response("The top AI and ML instructor is Dr.Niladri Chatterjee.",10)


['The top trainer of artificial intelligence and machine learning is Dr. Niladri Chatterjee.',
 'The top artificial intelligence and machine learning instructor is Dr. Niladri Chatterjee.',
 'The top artificial intelligence and machine learning instructor is Dr.Niladri Chatterjee.',
 'The top artificial intelligence and machine learning teacher is Dr. Niladri Chatterjee.',
 'The top trainer of artificial intelligence and machine learning is Dr.Niladri Chatterjee.',
 'The top artificial intelligence and machine learning teacher is Dr.Niladri Chatterjee.',
 'The top instructor for artificial intelligence and machine learning is Dr. Niladri Chatterjee.',
 'The top instructor in artificial intelligence and machine learning is Dr. Niladri Chatterjee.',
 'The top instructor in artificial intelligence and machine learning is Dr.Niladri Chatterjee.',
 'The top artificial intelligence and machine learning instructor is Dr.Niladri.']
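
Several candidates in the two outputs above differ only in capitalization or spacing (for example `Dr.Niladri` versus `Dr. Niladri`). A small post-processing sketch that keeps the first candidate of each such group is shown below. `dedupe_candidates` is a hypothetical helper, not part of the original notebook or of `transformers`.

```python
def dedupe_candidates(candidates):
    """Keep the first of any candidates that differ only in letter case or spacing."""
    seen = set()
    unique = []
    for candidate in candidates:
        # Normalize by lowercasing and removing all whitespace.
        key = "".join(candidate.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique


candidates = [
    "The top trainer of artificial intelligence and machine learning is Dr. Niladri Chatterjee.",
    "The top trainer of artificial intelligence and machine learning is Dr.Niladri Chatterjee.",
    "The top artificial intelligence and machine learning teacher is Dr. Niladri Chatterjee.",
]
print(len(dedupe_candidates(candidates)))  # 2
```

The normalization is deliberately crude; candidates that differ in punctuation or wording are treated as distinct.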

### **Processing a paragraph of text**

In [10]:
# Paragraph of text
context = "Artificial Intelligence (AI) and Machine Learning (ML) are two closely related but distinct fields within the broader field of computer science. AI is a discipline that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing. It involves the development of algorithms and systems that can reason, learn, and make decisions based on input data."
print(context)

Artificial Intelligence (AI) and Machine Learning (ML) are two closely related but distinct fields within the broader field of computer science. AI is a discipline that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing. It involves the development of algorithms and systems that can reason, learn, and make decisions based on input data.


In [11]:
# Split the input paragraph into a list of sentences
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')

sentence_list = splitter.split(context)
sentence_list

['Artificial Intelligence (AI) and Machine Learning (ML) are two closely related but distinct fields within the broader field of computer science.',
 'AI is a discipline that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing.',
 'It involves the development of algorithms and systems that can reason, learn, and make decisions based on input data.']

In [12]:
# Loop over the sentences and paraphrase each one
paraphrase = []

for sentence in sentence_list:
    candidates = get_response(sentence, 1)  # one paraphrase, returned as a one-item list
    paraphrase.append(candidates)

In [13]:
# The paraphrased sentences (a list of one-item lists)
paraphrase

[['Both Artificial Intelligence and Machine Learning are related in some way to the broader field of computer science.'],
 ['Artificial intelligence focuses on creating machines that can perform tasks that require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing.'],
 ['The development of systems that can reason, learn, and make decisions based on input data is involved.']]

In [14]:
paraphrase2 = [' '.join(x) for x in paraphrase]
paraphrase2

['Both Artificial Intelligence and Machine Learning are related in some way to the broader field of computer science.',
 'Artificial intelligence focuses on creating machines that can perform tasks that require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing.',
 'The development of systems that can reason, learn, and make decisions based on input data is involved.']

In [15]:
# Combine the list of sentences into a single paragraph
paraphrased_text = ' '.join(paraphrase2)
paraphrased_text

'Both Artificial Intelligence and Machine Learning are related in some way to the broader field of computer science. Artificial intelligence focuses on creating machines that can perform tasks that require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing. The development of systems that can reason, learn, and make decisions based on input data is involved.'
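
The paragraph steps in cells 11 to 15 (split, paraphrase each sentence, rejoin) can be collected into one function. This is a sketch under stated assumptions: `paraphrase_paragraph`, `split_fn`, and `paraphrase_fn` are hypothetical names, and the two callables are passed in so the function does not depend on a loaded model.

```python
def paraphrase_paragraph(paragraph, split_fn, paraphrase_fn):
    """Split a paragraph into sentences, paraphrase each, and rejoin them."""
    paraphrased = []
    for sentence in split_fn(paragraph):
        candidates = paraphrase_fn(sentence)
        # Fall back to the original sentence if no candidate is returned.
        paraphrased.append(candidates[0] if candidates else sentence)
    return " ".join(paraphrased)
```

With the objects defined earlier in the notebook, the call would look like `paraphrase_paragraph(context, splitter.split, lambda s: get_response(s, 1))`.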

In [16]:
# Comparison of the original (context variable) and the paraphrased version (paraphrased_text variable)

print(context)
print()
print(paraphrased_text)

Artificial Intelligence (AI) and Machine Learning (ML) are two closely related but distinct fields within the broader field of computer science. AI is a discipline that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing. It involves the development of algorithms and systems that can reason, learn, and make decisions based on input data.

Both Artificial Intelligence and Machine Learning are related in some way to the broader field of computer science. Artificial intelligence focuses on creating machines that can perform tasks that require human intelligence, such as visual perception, speech recognition, decision-making, and natural language processing. The development of systems that can reason, learn, and make decisions based on input data is involved.
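
Beyond reading the two paragraphs side by side, one rough way to quantify the comparison is to count words and compute a word-level similarity ratio with the standard library's `difflib`. `compare_texts` is a hypothetical helper, and the ratio measures surface word overlap only, not whether the meaning was preserved.

```python
from difflib import SequenceMatcher


def compare_texts(original, paraphrased):
    """Return word counts and a rough word-level similarity ratio in [0, 1]."""
    original_words = original.split()
    paraphrased_words = paraphrased.split()
    # SequenceMatcher compares the two word sequences; 1.0 means identical.
    ratio = SequenceMatcher(None, original_words, paraphrased_words).ratio()
    return {
        "original_words": len(original_words),
        "paraphrased_words": len(paraphrased_words),
        "similarity": ratio,
    }
```

Calling `compare_texts(context, paraphrased_text)` on the notebook's variables would give the two word counts and the ratio for the paragraphs printed above.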
