In [30]:
import os

import nlpaug.augmenter.word as naw
import pandas as pd
from IPython.display import Markdown

In [2]:
original_en = [
    "A picture is worth a thousand words.",
    "The pen is mightier than the sword.",
    "You can't judge a book by its cover.",
    "Two wrongs don't make a right.",
    "The grass is always greener on the other side.",
    "The best way to predict the future is to invent it.",
    "It's not a bug, it's a feature.",
    "Any sufficiently advanced technology is indistinguishable from magic.",
    "Technology is a useful servant but a dangerous master.",
    "The advance of technology is based on making it fit in so that you don't really even notice it, so it's part of everyday life.",
]

In [3]:
aug = naw.SynonymAug(aug_src="wordnet")

[nltk_data] Downloading package wordnet to /home/prajwal/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/prajwal/nltk_data...


In [4]:
results = []
for sentence in original_en:
    augmented = aug.augment(sentence)
    results.append({"Original": sentence, "Augmented": augmented})

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/prajwal/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [5]:
df = pd.DataFrame(results)

In [6]:
Markdown(df.to_markdown(index=False))

| Original                                                                                                                       | Augmented                                                                                                                                                         |
|:-------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A picture is worth a thousand words.                                                                                           | ['A mental picture comprise worth a 1000 words.']                                                                                                                 |
| The pen is mightier than the sword.                                                                                            | ['The penitentiary personify mightier than the sword.']                                                                                                           |
| You can't judge a book by its cover.                                                                                           | ["You canful ' t judge a christian bible by its blanket."]                                                                                                        |
| Two wrongs don't make a right.                                                                                                 | ["Deuce wrongs wear ' t progress to a right."]                                                                                                                    |
| The grass is always greener on the other side.                                                                                 | ['The grass is always greener on the other slope.']                                                                                                               |
| The best way to predict the future is to invent it.                                                                            | ['The best way to forecast the futurity is to forge it.']                                                                                                         |
| It's not a bug, it's a feature.                                                                                                | ["Information technology ' s not a hemipteran, it ' s a feature article."]                                                                                        |
| Any sufficiently advanced technology is indistinguishable from magic.                                                          | ['Any sufficiently advance engineering is undistinguishable from magic.']                                                                                         |
| Technology is a useful servant but a dangerous master.                                                                         | ['Applied science is a useful servant but a severe victor.']                                                                                                      |
| The advance of technology is based on making it fit in so that you don't really even notice it, so it's part of everyday life. | ["The advance of technology is found on making information technology fit in thence that you don ' t rattling even observe it, so it ' s part of everyday life."] |
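
Each entry in the Augmented column above is a one-element list, and contractions come back split apart (`can ' t`, `it ' s`). A minimal post-processing sketch follows; `clean_augmented` is a hypothetical helper, not part of nlpaug, and it assumes a list result holds a single sentence and that a spaced apostrophe between two word characters is always a broken contraction.

```python
import re


def clean_augmented(augmented):
    """Return one plain string from an augmenter result (str or list of str)."""
    # Assumption: a list result holds a single augmented sentence,
    # as in the table above; an empty list becomes an empty string.
    if isinstance(augmented, (list, tuple)):
        augmented = augmented[0] if augmented else ""
    # Rejoin contractions that were split into "can ' t" or "it ' s".
    text = re.sub(r"(\w)\s+'\s+(\w)", r"\1'\2", augmented)
    # Collapse doubled spaces left behind by the augmenter.
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_augmented(["It ' s not a bug, it ' s a feature article."]))
```

This only tidies the formatting. It does not fix a poor synonym choice such as "hemipteran" for "bug".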

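The WordNet results above are not very accurate (for example, "pen" becomes "penitentiary" and "bug" becomes "hemipteran"), so the next cells try `aug_src="ppdb"` instead of `"wordnet"`.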

In [11]:
# PPDB 2.0 download page: http://paraphrase.org/#/download
# Using the small (S) size of the "all" package
!wget http://nlpgrid.seas.upenn.edu/PPDB/eng/ppdb-2.0-s-all.gz

--2024-06-30 15:59:28--  http://nlpgrid.seas.upenn.edu/PPDB/eng/ppdb-2.0-s-all.gz
Resolving nlpgrid.seas.upenn.edu (nlpgrid.seas.upenn.edu)... 158.130.57.54
Connecting to nlpgrid.seas.upenn.edu (nlpgrid.seas.upenn.edu)|158.130.57.54|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 567671280 (541M) [application/x-gzip]
Saving to: ‘ppdb-2.0-s-all.gz’


2024-06-30 16:03:04 (2.51 MB/s) - ‘ppdb-2.0-s-all.gz’ saved [567671280/567671280]



In [12]:
!gunzip ppdb-2.0-s-all.gz

In [20]:
def get_file_size(file_path):
    # Get the size of the file in bytes
    size_in_bytes = os.path.getsize(file_path)

    # Convert the size into GB
    size_in_gb = size_in_bytes / (1024 * 1024 * 1024)

    return f"{size_in_gb:.2f} GB"

In [31]:
# The uncompressed file is huge (the download was 541M compressed)
get_file_size("ppdb-2.0-s-all")

'3.72 GB'

In [33]:
!head ppdb-2.0-s-all

[NN] ||| transplant ||| transplantation ||| PPDB2.0Score=5.24981 PPDB1.0Score=3.295900 -logp(LHS|e1)=0.18597 -logp(LHS|e2)=0.14031 -logp(e1|LHS)=11.83583 -logp(e1|e2)=1.80507 -logp(e1|e2,LHS)=1.46728 -logp(e2|LHS)=11.47593 -logp(e2|e1)=1.49083 -logp(e2|e1,LHS)=1.10738 AGigaSim=0.63439 Abstract=0 Adjacent=0 CharCountDiff=5 CharLogCR=0.40547 ContainsX=0 Equivalence=0.371472 Exclusion=0.000344 GlueRule=0 GoogleNgramSim=0.03067 Identity=0 Independent=0.078161 Lex(e1|e2)=9.64663 Lex(e2|e1)=59.48919 Lexical=1 LogCount=4.67283 MVLSASim=NA Monotonic=1 OtherRelated=0.372735 PhrasePenalty=1 RarityPenalty=0 ForwardEntailment=0.177287 SourceTerminalsButNoTarget=0 SourceWords=1 TargetComplexity=0.98821 TargetFormality=0.98464 TargetTerminalsButNoSource=0 TargetWords=1 UnalignedSource=0 UnalignedTarget=0 WordCountDiff=0 WordLenDiff=5.00000 WordLogCR=0 ||| 0-0 ||| OtherRelated
[JJ] ||| <www.un.org/depts/dgacm/docs/crp/aconf212crp1/russian.pdf> ||| <www.un.org/depts/dgacm/docs/crp/aconf212crp1/arabic.

In [34]:
!tail ppdb-2.0-s-all

[SBAR] ||| [SBAR/NP,1] [NP/NP,2] this sort ||| [SBAR/NP,1] [NP/NP,2] this kind ||| PPDB2.0Score=5.19773 PPDB1.0Score=4.701110 -logp(LHS|e1)=0.45676 -logp(LHS|e2)=0.28594 -logp(e1|LHS)=16.05067 -logp(e1|e2)=3.62267 -logp(e1|e2,LHS)=3.36208 -logp(e2|LHS)=13.33563 -logp(e2|e1)=1.07844 -logp(e2|e1,LHS)=0.64704 AGigaSim=0.99027 Abstract=0 Adjacent=1 CharCountDiff=0 CharLogCR=0 ContainsX=0 Equivalence=0.296712 Exclusion=0.114243 GlueRule=0 GoogleNgramSim=0.33946 Identity=0 Independent=0.222087 Lex(e1|e2)=62.61371 Lex(e2|e1)=62.61371 Lexical=0 LogCount=0.69315 MVLSASim=NA Monotonic=1 OtherRelated=0.173756 PhrasePenalty=1 RarityPenalty=0 ReverseEntailment=0.193202 SourceTerminalsButNoTarget=0 SourceWords=2 TargetTerminalsButNoSource=0 TargetWords=2 UnalignedSource=0 UnalignedTarget=0 WordCountDiff=0 WordLenDiff=0 WordLogCR=0 ||| 0-0 1-1 2-2 3-3 ||| Equivalence
[SBAR/PP] ||| [SBAR/VP,1] to be [VP/PP,2] ||| [SBAR/VP,1] become [VP/PP,2] ||| PPDB2.0Score=5.19773 PPDB1.0Score=15.811930 -logp(LHS|e1
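
The `head` and `tail` output above shows that each PPDB line is a set of fields separated by `|||`: a syntactic label, the phrase, its paraphrase, a list of `name=value` features, a word alignment, and an entailment label. Since the file is several gigabytes, it may be easier to inspect it by streaming than by loading it whole. The sketch below assumes that six-field layout; `parse_ppdb_line` and `iter_ppdb` are hypothetical helper names, and only the `PPDB2.0Score` feature is extracted.

```python
def parse_ppdb_line(line):
    """Split one PPDB line into its main fields, or return None if malformed."""
    parts = [part.strip() for part in line.split("|||")]
    if len(parts) != 6:
        # Skip truncated or unexpected lines instead of guessing.
        return None
    lhs, phrase, paraphrase, features, _alignment, entailment = parts

    # Features are space-separated "name=value" tokens; keep only the 2.0 score.
    score = None
    for token in features.split():
        key, _, value = token.partition("=")
        if key == "PPDB2.0Score":
            try:
                score = float(value)
            except ValueError:
                score = None
            break

    return {
        "lhs": lhs,
        "phrase": phrase,
        "paraphrase": paraphrase,
        "score": score,
        "entailment": entailment,
    }


def iter_ppdb(path, limit=None):
    """Yield parsed records one line at a time, never loading the whole file."""
    count = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            record = parse_ppdb_line(line)
            if record is None:
                continue
            yield record
            count += 1
            if limit is not None and count >= limit:
                break


# Example usage against the file downloaded above:
# for record in iter_ppdb("ppdb-2.0-s-all", limit=5):
#     print(record)
```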

In [23]:
aug = naw.SynonymAug(aug_src="ppdb", model_path="ppdb-2.0-s-all")

In [24]:
results = []
for sentence in original_en:
    augmented = aug.augment(sentence)
    results.append({"Original": sentence, "Augmented": augmented})

In [25]:
df = pd.DataFrame(results)

In [26]:
Markdown(df.to_markdown(index=False))

| Original                                                                                                                       | Augmented                                                                                                                                                             |
|:-------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A picture is worth a thousand words.                                                                                           | ['A perceptions embodies worth a thousand worlds.']                                                                                                                   |
| The pen is mightier than the sword.                                                                                            | ['The pencil facilitates mightier than the sword.']                                                                                                                   |
| You can't judge a book by its cover.                                                                                           | ["You can ' t magistrate a book by its coverage."]                                                                                                                    |
| Two wrongs don't make a right.                                                                                                 | ["Two wrongdoings donated ' t make a right."]                                                                                                                         |
| The grass is always greener on the other side.                                                                                 | ['The grass strengthens always greener on the other sidelines.']                                                                                                      |
| The best way to predict the future is to invent it.                                                                            | ['The bestest way to forecast the future ceases to reinvent it.']                                                                                                     |
| It's not a bug, it's a feature.                                                                                                | ["It ' seconds not a bug, it ' proposing a feature."]                                                                                                                 |
| Any sufficiently advanced technology is indistinguishable from magic.                                                          | ['Any adequately advanced telecommunications characterizes indistinguishable from magic.']                                                                            |
| Technology is a useful servant but a dangerous master.                                                                         | ['Technology guarantees a useful servant but a hazardous masters.']                                                                                                   |
| The advance of technology is based on making it fit in so that you don't really even notice it, so it's part of everyday life. | ["The developments of telecommunications contributes  based on making it fitting in so that you donated ' t real even ignored it, so it ' s part of daily lifetime."] |

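Not sure the PPDB results are any better 🤷‍♂️: contractions are still split apart ("can ' t"), and "is" gets replaced by unrelated verbs such as "facilitates" and "strengthens".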
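
To put a rough number on the comparison between the two tables, one option is to count how many whitespace-separated tokens each augmenter changed per sentence. This is only a sketch: `changed_token_count` is a hypothetical helper, it expects plain strings (so pass `augmented[0]` for the list results above), and it measures how much changed, not whether the replacement is a good synonym.

```python
from difflib import SequenceMatcher


def changed_token_count(original, augmented):
    """Count tokens that differ between two sentences (whitespace tokens)."""
    a, b = original.split(), augmented.split()
    changed = 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted, or deleted run counts by its longer side.
            changed += max(i2 - i1, j2 - j1)
    return changed


print(
    changed_token_count(
        "The pen is mightier than the sword.",
        "The pencil facilitates mightier than the sword.",
    )
)  # → 2
```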