# Data Augmentation

Enhance the dataset using paraphrasing techniques to improve model robustness.

## Paraphrasing Techniques for Dataset Enhancement
Paraphrasing techniques involve rewording or restructuring sentences while preserving their original meaning. In machine learning and NLP, paraphrasing is often used to augment datasets, improve model robustness, and increase linguistic diversity.

### Common Paraphrasing Techniques
1. Synonym Replacement
- Replace words with their synonyms.
- Example: "The quick brown fox" → "The fast brown fox"

2. Back-Translation
- Translate a sentence into another language and then translate it back to the original language.
- Example (English → French → English): "She loves playing soccer" → "She enjoys playing football"

3. Sentence Restructuring
- Change the sentence structure without altering the meaning.
- Example: "I went to the store because I needed milk" → "Because I needed milk, I went to the store"

4. Active to Passive Voice (and Vice Versa)
- Switch between active and passive forms.
- Example: "The cat chased the mouse" → "The mouse was chased by the cat"

5. Use of Different Phrasing
- Express the same idea using different expressions or idiomatic language.
- Example: "It's raining heavily" → "It's pouring down"

6. Rule-based Template Paraphrasing
- Apply predefined grammatical or syntactical rules to transform sentences.
- Example: Change "I want to [verb]" to "I'd like to [verb]"

7. Using Paraphrase Models
- Leverage pre-trained NLP models (like T5, Pegasus, or GPT) to automatically generate paraphrases.
- Example: Feed the model a sentence and request several reworded versions.
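
The two rule-based techniques in the list (synonym replacement and template paraphrasing) can be sketched in a few lines of plain Python. The synonym table and the single rewrite rule below are hypothetical and only mirror the examples given above; a real pipeline would presumably draw synonyms from a lexical resource and use many more rules. Punctuation attached to words is not handled in this sketch.

```python
import re

# Hypothetical, hand-written synonym table (illustration only).
SYNONYMS = {"quick": "fast"}

# Hypothetical template rule mirroring the example in the list:
# "I want to [verb]" -> "I'd like to [verb]".
TEMPLATE_RULES = [
    (re.compile(r"\bI want to\b"), "I'd like to"),
]


def replace_synonyms(sentence: str) -> str:
    """Swap each word that has an entry in SYNONYMS; leave the rest untouched."""
    words = sentence.split()
    return " ".join(SYNONYMS.get(word.lower(), word) for word in words)


def apply_templates(sentence: str) -> str:
    """Apply every template rule in order."""
    for pattern, replacement in TEMPLATE_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(replace_synonyms("The quick brown fox"))   # The fast brown fox
print(apply_templates("I want to play soccer"))  # I'd like to play soccer
```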

## Benefits of Paraphrasing for Datasets
- Increases linguistic diversity without collecting new data.
- Can improve model generalization and robustness to rewording.
- Helps balance classes in classification tasks by generating variations of under-represented examples.
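
To make the class-balancing point concrete, the sketch below counts how many paraphrased variations each class would need in order to match the largest class. The labels are made-up illustration data, not taken from the dataset used in this notebook.

```python
from collections import Counter


def paraphrases_needed(labels):
    """Return, per class, how many extra examples would bring it up to the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Made-up labels for illustration only.
labels = ["positive"] * 5 + ["negative"] * 2
print(paraphrases_needed(labels))  # {'positive': 0, 'negative': 3}
```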

In [10]:
import pandas as pd 

In [11]:
df = pd.read_csv('/Users/dionnespaltman/Desktop/Luiss /Machine Learning/Project/stopword_removal_dataframe.csv')

display(df.head())   

| Unnamed: 0 | sentence1 | sentence2 | score | lang1 | lang2 | processed_language1 | processed_language2 |
|---|---|---|---|---|---|---|---|
| 0 | ein flugzeug hebt gerade ab | an air plane is taking off | 5.0 | de | en | Flugzeug heben | air plane |
| 1 | ein flugzeug hebt gerade ab | un avión está despegando | 5.0 | de | es | Flugzeug heben | avión despegar |
| 2 | ein flugzeug hebt gerade ab | un avion est en train de décoller | 5.0 | de | fr | Flugzeug heben | avion train décoller |
| 3 | ein flugzeug hebt gerade ab | un aereo sta decollando | 5.0 | de | it | Flugzeug heben | aereo stare decollare |
| 4 | ein flugzeug hebt gerade ab | 飛行機が離陸します | 5.0 | de | ja | Flugzeug heben | 飛行機 離陸 |


I will follow the approach from this GitHub repository: https://github.com/Vamsi995/Paraphrase-Generator

In [None]:
# ! pip install transformers


In [None]:
# ! pip install --upgrade tensorflow


In [17]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm

In [21]:
print(df.shape)

(949080, 7)


In [20]:
english_sentences = df[df['lang2'] == 'en']['processed_language2']

display(english_sentences)

0                                                 air plane
21                                                air plane
31                                                air plane
41                                                air plane
51                                                air plane
                                ...                        
949031    north korea delegation meet south korean official
949041    north korea delegation meet south korean official
949051    north korea delegation meet south korean official
949061    north korea delegation meet south korean official
949071    north korea delegation meet south korean official
Name: processed_language2, Length: 86280, dtype: object
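
The output above shows the same processed sentence repeated across many rows, so paraphrasing all 86,280 rows would repeat work. One possible approach, sketched here in plain Python with a short made-up list standing in for `english_sentences`, is to paraphrase each unique sentence once and map the results back to the rows. On the pandas Series itself, `english_sentences.drop_duplicates()` would give the unique sentences. The upper-casing step is only a placeholder for a real model call.

```python
def unique_in_order(sentences):
    """Return the distinct sentences, keeping first-seen order."""
    return list(dict.fromkeys(sentences))


def expand(sentences, paraphrase_of):
    """Map each original row to the paraphrase computed for its unique sentence."""
    return [paraphrase_of[s] for s in sentences]


# Made-up stand-in for english_sentences (illustration only).
rows = ["air plane", "air plane", "man play guitar", "air plane"]
unique = unique_in_order(rows)
print(unique)  # ['air plane', 'man play guitar']

# Placeholder "paraphraser": upper-casing stands in for a real model call.
paraphrase_of = {s: s.upper() for s in unique}
print(expand(rows, paraphrase_of))
```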

In [23]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load Pegasus paraphrase model
tokenizer = AutoTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = AutoModelForSeq2SeqLM.from_pretrained("tuner007/pegasus_paraphrase")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print("Pegasus model loaded successfully!")


ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']
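
This error is often reported when the `sentencepiece` package, which the Pegasus tokenizer relies on, is not installed in the active environment. That is an assumption about this setup, not something the traceback confirms. A minimal check using only the standard library could look like the sketch below; the package list is likewise an assumption.

```python
import importlib.util


def missing_packages(names):
    """Return the packages from `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed dependency of the Pegasus tokenizer; adjust the list as needed.
missing = missing_packages(["sentencepiece"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

If `sentencepiece` is reported missing, installing it (for example with `pip install sentencepiece`) and restarting the kernel before reloading the tokenizer is a common next step.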