# Existing Paraphrasing Models
 
Authors: Ruslan Mammadov \<ruslanmammadov48@gmail.com\>

Copyright (C) 2021 Ruslan Mammadov and DynaGroup i.T. GmbH

## Important
Only a few cherry-picked models are described here. They are in no way representative of all existing paraphrasing models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from tqdm import tqdm
import pandas as pd
import torch

# Load parabank 2 - First 100k sentences

In [None]:
# Load the first 100,000 lines of ParaBank 2.
take_first_n = 100000
parabank2 = []

with open("drive/MyDrive/Paraphrasing API/datasets/Machine Made Datasets/parabank2.tsv", "r") as file:
  for i, line in enumerate(file):
    if i >= take_first_n:  # stop after exactly take_first_n lines
      break
    parabank2.append(line.strip().split("\t"))

In [None]:
parabank2 = parabank2[10000:] # Remove the first 10k rows, which are just names of organizations

In [None]:
df_parabank2 = pd.DataFrame(parabank2, columns=["quality", "text", "ref1", "ref2", "ref3","ref4", "ref5"])
df_parabank2.head(10)

| | quality | text | ref1 | ref2 | ref3 | ref4 | ref5 |
|---|---|---|---|---|---|---|---|
| 0 | 0.7753040530907698 | He grew up in Poland. | Grew up in Poland. | Raised in Poland. | | | |
| 1 | 0.7753040530907698 | You look ridiculous. | You look ludicrous. | You look laughable. | You seem ludicrous. | Looking ridiculous. | How ridiculous you look. |
| 2 | 0.7753040530907698 | Welcome to Uppsala. | Hello. -Welcome to Uppsala. | Welcome to the Uppala. | | | |
| 3 | 0.7753040530907698 | Welcome to Silicon Valley. | Welcome to silicone valley. | And welcome to Silycon Valley. | | | |
| 4 | 0.7753040530907698 | I saw him yesterday afternoon. | I saw him last night in the afternoon. | Saw it yesterday afternoon. | Saw him last night afternoon. | I saw this guy last afternoon. | I see that guy yesterday afternoon. |
| 5 | 0.7753040530907698 | The war in Europe is over. | The war is over in Europe. | Europe's war is over. | The war was over in Europe. | This war is over in Europe. | |
| 6 | 0.7753040530907698 | The role of a free press should be to put pres... | The role of the free press should be to press ... | The role for free pressing should be to exert ... | It would be the task for free press to press p... | Free press action should play a role in puttin... | It should be a function of a free press to pre... |
| 7 | 0.7753040530907698 | The markets we need are closed to us. | The markets we need are foreclosed to us. | Markets that we need are shut down for us. | The markets we need are foreclosed for us. | The markets that are needed are closed for us. | We're closed to the markets we need. |
| 8 | 0.7753040530907698 | This fundamental right is expressly confirmed ... | This fundamental right is explicitly confirmed... | This fundamental right is explicitly confirmed... | Such a fundamental right is explicitly confirm... | That fundamental right is explicitly affirmed ... | That principle of freedom of appeal has been e... |
| 9 | 0.7753040530907698 | Thomas looks like a wild animal. | Thomas looks like an animal of the wild. | Thomas seems like some sort of wild animal. | Thomas looks a feral animal. | Thomas looks like he's some kind of wild animal. | Thomas seems a wild creature. |


# Load the first model - ProtAugment
* TLDR: The paraphrases are poor (near-copies of the input or truncated nonsense)
* Interesting things:
  * Diverse Beam Search - Beam search with an extra diversity penalty
  * Constrained Beam Search - Forbids the beam search from using bigrams or randomly chosen unigrams of the input sentence
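
As a rough illustration of the diversity penalty: diverse beam search splits the beams into groups, and a token that an earlier group already chose at the current step gets its score lowered for the later groups. The sketch below is a toy version with made-up scores, not ProtAugment's implementation; `apply_diversity_penalty` and the example scores are hypothetical.

```python
from collections import Counter

def apply_diversity_penalty(log_probs, tokens_taken_by_earlier_groups, diversity_penalty=0.5):
    """Lower the score of every token that earlier beam groups already picked at this step."""
    counts = Counter(tokens_taken_by_earlier_groups)
    return {
        token: score - diversity_penalty * counts[token]
        for token, score in log_probs.items()
    }

# Toy next-token scores (log-probabilities) for the second beam group.
scores = {"capacity": -0.2, "ability": -0.9, "power": -1.5}
# Assume the first group already chose "capacity" on two of its beams.
penalized = apply_diversity_penalty(scores, ["capacity", "capacity"], diversity_penalty=0.5)
best = max(penalized, key=penalized.get)
print(best)  # "ability" now outranks the penalized "capacity"
```

A larger `diversity_penalty` pushes the groups further apart, at the risk of selecting less likely tokens.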





In [None]:
# Load the models from github
# !git clone https://github.com/tdopierre/ProtAugment
# !cp -r ProtAugment drive/MyDrive/Paraphrasing\ API/models/ProtAugment

In [None]:
# Copy the model to the current directory
!cp drive/MyDrive/Paraphrasing\ API/models/ProtAugment ProtAugment -r

In [None]:
# Warning: this takes about 10 minutes!
!cat ProtAugment/requirements.txt | xargs -n 1 pip install > /dev/null

In [None]:
from ProtAugment.utils.python import set_seeds
from ProtAugment.paraphrase.modeling import UnigramRandomDropParaphraseBatchPreparer, BigramDropParaphraseBatchPreparer, BaseParaphraseBatchPreparer, DBSParaphraseModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
tokenizer = AutoTokenizer.from_pretrained("tdopierre/ProtAugment-ParaphraseGenerator")
fine_tuned_bart = AutoModelForSeq2SeqLM.from_pretrained("tdopierre/ProtAugment-ParaphraseGenerator")

In [None]:
paraphrase_model = DBSParaphraseModel(
    model_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    tok_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    num_beams=15,
    beam_group_size=3,
    diversity_penalty=0.5,
    filtering_strategy="bleu",
    # BaseParaphraseBatchPreparer
    # UnigramRandomDropParaphraseBatchPreparer
    # BigramDropParaphraseBatchPreparer
    paraphrase_batch_preparer=BaseParaphraseBatchPreparer(tokenizer=tokenizer, device=device),
    device=device
)

# Bad rephrases from ProtAugment

In [None]:
original = df_parabank2.text[100]
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('In the third phase, lasting three to five years, aid supports the first phase of post-war economic development, including restoration of schools, clinics, farms, factories, and ports.',
 ['In the third phase, lasting three to five years, aid supports the first phase of post-war economic development, including restoration of schools, clinics, farms, factories and ports.',
  'In the third phase, lasting three to five years, aid supports the first phase of post-war economic development, including the restoration of schools, clinics, farms, factories, and',
  'In the third phase, lasting three to five years, aid supports the first phase of post-war economic development, including restoration of schools, clinics, farms, factories and ports,',
  'In the third phase of aid, lasting three to five years, aid supports the first phase of post-war economic development, including restoration of schools, clinics, farms, factories,',
  'In the third phase of aid, lasting three to five years, suppor

In [None]:
original = df_parabank2.text[200]
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('What about my family and friends?',
 ['What do I mean by family the',
  'What do I mean by family of',
  'What does it feel like to to',
  'What does it feel like to and',
  'What about my family and friends.'])

In [None]:
original = "The ultimate test of your knowledge is your capacity to convey it to another."
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('The ultimate test of your knowledge is your capacity to convey it to another.',
 ['The ultimate test of your knowledge is your capacity to convey it to another',
  'The ultimate test of knowledge is the capacity to convey it to another.',
  '"The ultimate test of your knowledge is your capacity to convey knowledge to',
  'The ultimate test of your knowledge is your capacity to convey it to another',
  'The test of knowledge is your capacity to convey knowledge to another.'])

#### Using the bigram and unigram paraphrasers improves the diversity but creates nonsense

In [None]:
paraphrase_model = DBSParaphraseModel(
    model_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    tok_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    num_beams=15,
    beam_group_size=3,
    diversity_penalty=0.5,
    filtering_strategy="bleu",
    # BaseParaphraseBatchPreparer
    # UnigramRandomDropParaphraseBatchPreparer
    # BigramDropParaphraseBatchPreparer
    paraphrase_batch_preparer=BigramDropParaphraseBatchPreparer(tokenizer=tokenizer, device=device),
    device=device
)

In [None]:
original = "The ultimate test of your knowledge is your capacity to convey it to another."
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('The ultimate test of your knowledge is your capacity to convey it to another.',
 ['If you have knowledge, what is it that you can convey to others',
  'If you have knowledge, what is it that you can convey to one',
  "The final test is the capacity of one's knowledge, the ability to",
  'As a human being, the final test is the capacity of the capacity',
  "The final test is the capacity of one's knowledge, the capacity for"])

In [None]:
paraphrase_model = DBSParaphraseModel(
    model_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    tok_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    num_beams=15,
    beam_group_size=3,
    diversity_penalty=0.5,
    filtering_strategy="bleu",
    # BaseParaphraseBatchPreparer
    # UnigramRandomDropParaphraseBatchPreparer
    # BigramDropParaphraseBatchPreparer
    paraphrase_batch_preparer=UnigramRandomDropParaphraseBatchPreparer(auc=0.1, 
                                                                       drop_chance_speed="flat", 
                                                                       tokenizer=tokenizer, 
                                                                       device=device),
    device=device
)

In [None]:
original = "The ultimate test of your knowledge is your capacity to convey it to another."
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('The ultimate test of your knowledge is your capacity to convey it to another.',
 ['The ultimate test is your capacity to convey knowledge to another.',
  'The ultimate test is your capacity to convey knowledge to others.',
  'The ultimate test in knowledge is your capacity to convey knowledge to another person',
  'The ultimate test is the capacity to convey knowledge to another.',
  'The ultimate test is your capacity to convey your knowledge to another, without'])

In [None]:
paraphrase_model = DBSParaphraseModel(
    model_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    tok_name_or_path="tdopierre/ProtAugment-ParaphraseGenerator",
    num_beams=15,
    beam_group_size=3,
    diversity_penalty=0.5,
    filtering_strategy="bleu",
    # BaseParaphraseBatchPreparer
    # UnigramRandomDropParaphraseBatchPreparer
    # BigramDropParaphraseBatchPreparer
    paraphrase_batch_preparer=UnigramRandomDropParaphraseBatchPreparer(auc=0.5, 
                                                                       drop_chance_speed="flat", 
                                                                       tokenizer=tokenizer, 
                                                                       device=device),
    device=device
)

In [None]:
original = "The ultimate test of your knowledge is your capacity to convey it to another."
rephrases = paraphrase_model.paraphrase(original)
original, rephrases[0]

('The ultimate test of your knowledge is your capacity to convey it to another.',
 ['The final test of knowledge is the capacity of the person who can convey',
  'The final test of knowledge is the capacity of the person who has knowledge',
  'What is the final test of knowledge, and what is the capacity of',
  'In the final test of knowledge, the capacity of the human mind is',
  'In the final test of knowledge, the capacity of the knowledge is that'])
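
The idea behind the bigram and unigram preparers used above can be sketched as follows: collect n-grams from the source sentence and reject any candidate that reuses them. This is a simplified, standalone illustration, not ProtAugment's actual code. The helper names and `drop_prob` are hypothetical, and the real `auc` and `drop_chance_speed` parameters are not modeled here.

```python
import random

def source_bigrams(tokens):
    """All adjacent token pairs of the source sentence."""
    return set(zip(tokens, tokens[1:]))

def random_unigrams(tokens, drop_prob, rng):
    """Forbid each source token independently with probability drop_prob."""
    return {tok for tok in tokens if rng.random() < drop_prob}

def violates(candidate, forbidden_unigrams=frozenset(), forbidden_bigrams=frozenset()):
    """True if the candidate reuses a forbidden token or a forbidden bigram."""
    has_unigram = any(tok in forbidden_unigrams for tok in candidate)
    has_bigram = any(pair in forbidden_bigrams for pair in zip(candidate, candidate[1:]))
    return has_unigram or has_bigram

source = "the ultimate test of your knowledge".split()
bigrams = source_bigrams(source)

# Reuses the source bigram ("test", "of"), so it is rejected.
print(violates("the final test of knowledge".split(), forbidden_bigrams=bigrams))
# Shares no bigram with the source, so it is allowed.
print(violates("a final exam for what you know".split(), forbidden_bigrams=bigrams))

# Randomly forbid roughly half of the source tokens.
print(random_unigrams(source, drop_prob=0.5, rng=random.Random(0)))
```

Forbidding more of the source forces the model further away from the input, which matches the observation above: more diversity, but also more nonsense.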

# Let's find another paraphraser
# BART paraphraser from eugenesiow
TLDR:
1. No code or information about the model is available
2. Diversity is too low
3. Output is too short

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BartForConditionalGeneration.from_pretrained('eugenesiow/bart-paraphrase').to(device)
tokenizer = BartTokenizer.from_pretrained('eugenesiow/bart-paraphrase')

In [None]:
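def rephrase(input_sentence):
  # Tokenize and move the inputs to the same device as the model
  batch = tokenizer(input_sentence, return_tensors='pt').to(device)
  generated_ids = model.generate(batch['input_ids'], attention_mask=batch['attention_mask'])
  return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

def rephrase_batch(batch):
  # Pad to a common length; the attention mask tells the model to ignore the padding
  batch = tokenizer(batch, return_tensors='pt', padding=True, truncation=True).to(device)
  generated_ids = model.generate(batch['input_ids'], attention_mask=batch['attention_mask'])
  return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)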

## Bad paraphraser: too low diversity

In [None]:
source = "The ultimate test of your knowledge is your capacity to convey it to another."
source, rephrase(source)

('The ultimate test of your knowledge is your capacity to convey it to another.',
 ['The ultimate test of your knowledge is your ability to convey it to another.'])

In [None]:
for i in range(100, 106):
  print(df_parabank2.text[i])
  print(rephrase(df_parabank2.text[i]))
  print()

In the third phase, lasting three to five years, aid supports the first phase of post-war economic development, including restoration of schools, clinics, farms, factories, and ports.
['In the third phase, lasting three to five years, aid supports the first phase of post']

I believe in love.
['I believe in love.']

The public consultation ended on 31 May 2011.
['The public consultation ended on May 31, 2011.']

Your mother was shot.
['Your mother was shot and killed. What did you do?']

The resolution was adopted unanimously without amendments.
['The resolution was adopted unanimously without amendments.']

They will also save on investments in new roads, power plants, schools, and other public services.
['They will also save on new roads, power plants, schools and other public services.']
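
The outputs above that stop mid-sentence are probably cut by the generation length limit rather than by the model itself: when no limit is passed, `generate` falls back to a short default. This is an assumption, since the model's generation config was not inspected here. A minimal sketch of a variant with an explicit budget follows; `rephrase_with_budget` is a hypothetical helper that expects the `model` and `tokenizer` objects loaded earlier.

```python
def rephrase_with_budget(model, tokenizer, sentence, max_new_tokens=128, num_beams=4):
    """Paraphrase one sentence with an explicit limit on the number of generated tokens."""
    batch = tokenizer(sentence, return_tensors="pt", truncation=True)
    generated_ids = model.generate(
        batch["input_ids"].to(model.device),
        attention_mask=batch["attention_mask"].to(model.device),
        max_new_tokens=max_new_tokens,  # output budget, independent of the input length
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

In the notebook it would be called as `rephrase_with_budget(model, tokenizer, df_parabank2.text[100])`. Whether this fixes the truncation was not verified.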



In [None]:
source = "When you are shopping for vintage jewelry, one way to ensure that you are not buying fake vintage, is to ask the seller about the history of a particular item. If the piece is actually vintage the seller should be able to explain how they came across the item. For instance, it may have been passed down through the family, purchased at an estate sale or auction, or found while antique hunting., Most vintage jewelry was marked by the jewelry maker, either with initials or small emblems. Use a magnifying glass to examine the jewelry for marks before purchasing. If you notice any discrepancies between marks then the piece is likely a fake or replica.Search online for pictures of well-known vintage jewelers' marks.<n>There may be instances when some older items of jewelry were not marked. For example, early pieces of Chanel jewelry were unmarked and different markings were used during different periods.If you can’t locate any markings, then ask the seller about the history of the piece.<n> It is also important to carefully examine the condition of an item of jewelry before purchasing it. Although most vintage jewelry will have some minor signs of wear and tear, you want to make sure they are minimal. For instance, check for broken clasps, missing gems or jewels, as well as major scratches. All of these blemishes will decrease the value of the piece. Try and find gently used pieces that only have minor signs of wear.Most importantly check for good craftsmanship, which includes straight lines, and the symmetrical placement of stones.<n>Be wary of any jewelry marketed as vintage but that appears in mint condition.<n>In these instances ask the seller if the piece has been recently restored. This can decrease the value of the jewelry.<n> When buying vintage jewelry ask the retailer to provide you with documentation concerning the origin of the piece. This documentation can add value to the item, making it more authentic. 
This will also help to ensure that you are buying legitimate vintage jewelry, instead of mass produced new jewelry designed to look like vintage jewelry. Different types of documentation and authentication include:Certificate of authentication from a professional.<n>Original receipts from when the jewelry was purchased that include the purchasers name.<n>A photograph showing the piece being worn.<n>Handwritten notes from previous owners.<n>Other documents showing the items history.<n> You should always consider the price when you are shopping for vintage jewelry. Items that contain real diamonds and are made of gold will be pricey. If an item is being sold as a designer piece of gold jewelry, but is priced reasonably, it is likely fake. That being said, you do not have to break the bank to buy vintage jewelry. You can find very unique and beautiful pieces of vintage and antique costume jewelry that is reasonable priced.<n>Take into consideration the type of piece you want and make sure that you truly love the piece before purchasing it."
rephrase(source)

['When you are shopping for vintage jewelry, check for broken clasps, missing gems or']

In [None]:
source_batch = source.split("<n>")[0].split(". ")
result = rephrase_batch(source_batch)
for original, paraphrase in zip(source_batch, result):
  print(original)
  print(paraphrase)
  print()

When you are shopping for vintage jewelry, one way to ensure that you are not buying fake vintage, is to ask the seller about the history of a particular item
When you are shopping for vintage jewelry, one way to ensure that you are not buying fake

If the piece is actually vintage the seller should be able to explain how they came across the item
If the piece is actually vintage, the seller should be able to explain how they came across

For instance, it may have been passed down through the family, purchased at an estate sale or auction, or found while antique hunting., Most vintage jewelry was marked by the jewelry maker, either with initials or small emblems
For instance, it may have been passed down through the family, purchased at an estate sale

Use a magnifying glass to examine the jewelry for marks before purchasing
Use a magnifying glass to examine the jewelry for marks before purchasing.

If you notice any discrepancies between marks then the piece is likely a fake or replic
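
Splitting on `". "` as in the cell above drops the final period of each sentence and only covers the first `<n>` paragraph. A slightly more careful splitter could look like the sketch below. `split_sentences` is a hypothetical helper, and the regex is still naive: it will split after abbreviations such as "e.g." as well.

```python
import re

def split_sentences(text, paragraph_marker="<n>"):
    """Split on the paragraph marker and after sentence-ending punctuation, keeping the punctuation."""
    sentences = []
    for paragraph in text.split(paragraph_marker):
        # Split at whitespace that follows '.', '!' or '?'.
        parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        sentences.extend(part for part in parts if part)
    return sentences

sample = "Use a magnifying glass. Check for marks!<n>Ask the seller about the history."
print(split_sentences(sample))
```

The resulting list can be passed directly to `rephrase_batch`.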

# Notes on a third paraphraser
# A Russian paraphraser based on T5 and one based on translation
1. There is a blog post about them; according to it, the quality is not great but acceptable.
2. Metrics:
  1. Perplexity for fluency.
  2. LaBSE (Language-Agnostic BERT Sentence Embedding) for adequacy.
  3. BLEU and n-gram overlap for diversity.
  4. New metrics:
3. They also used Constrained Beam Search.
4. They also set num_beams and bad_words_ids:
  1. num_beams - how wide the search for the best output is
  2. num_beams is a generation parameter of transformers models
  3. Bigger num_beams -> slower generation, but better text
  4. bad_words_ids -> these words must not appear in the output; also a generation parameter of transformers models
5. The translation-based paraphraser had similar quality to the T5-based one
6. T5 was fine-tuned for 5 days in Google Colab; the notebook was reopened every day
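
To make the n-gram overlap idea from the metrics list concrete, here is a minimal sketch. It uses whitespace tokenization and is not the blog's exact definition; `ngrams` and `ngram_overlap` are hypothetical helpers. A value near 1 means the paraphrase mostly copies the source; a value near 0 means it is lexically novel.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(source, paraphrase, n=2):
    """Share of the paraphrase's distinct n-grams that also occur in the source."""
    src = ngrams(source.lower().split(), n)
    par = ngrams(paraphrase.lower().split(), n)
    if not par:
        return 0.0
    return len(src & par) / len(par)

source = "The ultimate test of your knowledge is your capacity to convey it to another."
near_copy = "The ultimate test of your knowledge is your ability to convey it to another."

print(ngram_overlap(source, source))     # identical text: full overlap
print(ngram_overlap(source, near_copy))  # one word changed: still very high overlap
```

Low overlap alone does not make a good paraphrase, which is why it is paired with fluency and adequacy metrics in the list above.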