# Data Augmentation
The following techniques will be applied to every minority-class text entry:
- If the text contains multiple sentences, the sentence order will be shuffled
- If the text contains only one sentence, the following EDA techniques will be applied:
  - Synonym replacement
  - Random insertion
  - Random swap
  - Random deletion
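
The plan above can be sketched with the standard library alone. This is a minimal illustration, not the code used later in this notebook: the function names (`split_sentences`, `shuffle_sentences`, `random_swap`, `random_deletion`, `augment_text`) are hypothetical, the sentence splitter is deliberately naive, and synonym replacement and random insertion are left out because they need a thesaurus such as WordNet.

```python
import random
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def shuffle_sentences(text, rng):
    # Reorder the sentences of a multi-sentence text.
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return " ".join(sentences)


def random_swap(words, rng, n_swaps=1):
    # Swap the positions of two randomly chosen words, n_swaps times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, rng, p=0.1):
    # Drop each word with probability p, but never return an empty text.
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


def augment_text(text, rng):
    # Multi-sentence text: shuffle sentences. Single sentence: one EDA operation.
    if len(split_sentences(text)) > 1:
        return shuffle_sentences(text, rng)
    words = text.split()
    if not words:
        return text
    operation = rng.choice([random_swap, random_deletion])
    return " ".join(operation(words, rng))


rng = random.Random(42)  # fixed seed so the demo is repeatable
print(augment_text("Trump claims X. Let's fact check. The Post disagrees.", rng))
print(augment_text("humans are emitting carbon dioxide into the atmosphere", rng))
```

Passing a `random.Random` instance instead of using the module-level functions keeps the augmentation reproducible across runs.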

*From my experience, the most commonly used and effective technique is synonym replacement via word embeddings*
https://neptune.ai/blog/data-augmentation-nlp

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install numpy requests nlpaug
!pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [None]:
import pandas as pd
import nlpaug.augmenter.word as naw

In [None]:
# Get train data
# train_df = pd.read_csv("/content/drive/MyDrive/CSI4900/Datasets/ProcessedTrainingDatasets/MergedDatasets/training_df.csv") # og - upsampled
train_df = pd.read_csv("/content/drive/MyDrive/CSI4900/Datasets/ProcessedTrainingDatasets/MergedDatasets/LiarRemoved/training_df.csv") # liar
display(train_df.head())
display(train_df["domain"].value_counts())

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
0,30966,30966,Jennifer Aniston and Justin Theroux Double-Dat...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
1,17410,17410,Kim Kardashian West on Her New Beauty Line and...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
2,23715,23715,Ruby Rose Admits That Being Mean Doesn’t Suit ...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
3,30383,30383,Kourtney Kardashian moves on from Younes Bendj...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
4,57496,57496,(Reuters) - The United States is in an economi...,0,"[{'article': None, 'author': None, 'date': 'Au...",POLITICS


POLITICS    52558
SOCIAL      17712
HEALTH       8565
SCIENCE       726
CRIME         643
Name: domain, dtype: int64

## Running sample to augment

In [None]:
sample = train_df.iloc[1]["text"][:275]
sample

'Trump claims that Clinton’s policy on Syria would lead to World War 3. \nLet’s fact check … \nThe Washington Post points out that a vote for Clinton is a vote for escalating military confrontation in Syria and elsewhere: \nIn the rarefied world of the Washington foreign policy '

In [None]:
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="insert")

# Insert words predicted by BERT contextual embeddings (action="insert")
augmented_text = aug.augment(sample)

In [None]:
print("before:",sample)
print("after:", augmented_text)

before: Trump claims that Clinton’s policy on Syria would lead to World War 3. 
Let’s fact check … 
The Washington Post points out that a vote for Clinton is a vote for escalating military confrontation in Syria and elsewhere: 
In the rarefied world of the Washington foreign policy 
after: ['trump claims that clinton de ’ will s policy on syria would likely lead to world a war... 3. let ’ s fact not check this … the washington post points out that a vote for clinton is a vote for escalating military confrontation in syria and elsewhere : in the obama rarefied world of politics the washington current foreign policy']


In [None]:
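# aug_p sets the share of words to replace (about 10% here);
# the `n` argument of augment() is the number of outputs, not a word fraction
aug = naw.SynonymAug(aug_src='wordnet', aug_p=0.10)

# Substitute words with WordNet synonyms
augmented_text2 = aug.augment(sample)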

In [None]:
aug.augment(sample)

['Trumpet call that Clinton ’ s policy on Syrian arab republic would lead to Cosmos War triad. Let ’ s fact check … The Washington Place points out that a vote for Clinton is a vote for escalating military confrontation in Syria and elsewhere: In the rarefied world of the Washington extraneous policy']

In [None]:
print("before:",sample)
print("after:", augmented_text2)

before: Trump claims that Clinton’s policy on Syria would lead to World War 3. 
Let’s fact check … 
The Washington Post points out that a vote for Clinton is a vote for escalating military confrontation in Syria and elsewhere: 
In the rarefied world of the Washington foreign policy 
after: ['Trump call that Clinton ’ s policy on Syrian arab republic would lead to World War 3. Let ’ s fact check … The Washington Post full point stunned that a vote for Dewitt clinton is a vote for escalating military opposition in Syria and elsewhere: In the rarefied world of the Washington foreign policy']


In [None]:
aug.augment(sample)

['Trump claim that Clinton ’ s policy on Syrian arab republic would precede to World War 3. Let ’ s fact check … The Washington Post points proscribed that a vote for Hilary rodham clinton is a vote for escalating military confrontation in Syria and elsewhere: In the rarefied world of the Washington alien policy']

I think WordNet synonym replacement works better than BERT contextual-embedding insertion, so I will proceed with the former.

## Normal Upsampling

In [None]:
max_count = max(train_df['domain'].value_counts())
max_count

52558

In [None]:
# Duplicate a domain's rows until it has (about) max_count entries
def add_copies_of_rows(train_df, domain_name, max_count):
  num_entries = train_df['domain'].value_counts()[domain_name]
  domain_subset = train_df[train_df['domain'] == domain_name].copy(deep=True)

  # e.g. 2.97 means the original rows, 1 extra full copy, and 97% of another
  required_count = max_count / num_entries
  print(f"require #{required_count} copies for domain {domain_name}")

  train_df_pass1 = train_df
  # print("original length:", len(train_df_pass1))

  # The original rows count as copy 1, so add int(required_count) - 1 full copies
  for i in range(1, int(required_count)):
    train_df_pass1 = pd.concat([train_df_pass1, domain_subset])
    # print("...adding, new length:", len(train_df_pass1))

  # Top up with the fractional remainder; float truncation here can leave a
  # domain one row short (SCIENCE ends at 52557 in the output below)
  final_values = int(num_entries * (required_count - int(required_count)))
  # print("...adding, final concat for extra # entries:", final_values)
  train_df_pass1 = pd.concat([train_df_pass1, domain_subset.head(final_values)])

  return train_df_pass1
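
The SCIENCE count of 52557 in the output below comes from computing the fractional top-up in floating point. Here is a minimal sketch of the same bookkeeping with integer arithmetic, assuming the goal is exactly `max_count` rows per domain. `copies_needed` is a hypothetical helper, not part of this notebook; the counts are copied from the `value_counts()` output above.

```python
def copies_needed(num_entries, max_count):
    """Return (extra_full_copies, extra_rows) needed to reach max_count rows."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    if num_entries >= max_count:
        return 0, 0
    full, remainder = divmod(max_count, num_entries)
    # The rows already in the frame count as the first full copy.
    return full - 1, remainder


# Domain sizes from the value_counts() output above.
domain_counts = {
    "POLITICS": 52558,
    "SOCIAL": 17712,
    "HEALTH": 8565,
    "SCIENCE": 726,
    "CRIME": 643,
}
max_count = max(domain_counts.values())

for domain, n in domain_counts.items():
    extra_copies, extra_rows = copies_needed(n, max_count)
    total = n + extra_copies * n + extra_rows
    print(domain, extra_copies, extra_rows, total)
```

In `add_copies_of_rows`, `extra_copies` would set the number of loop iterations and `extra_rows` would take the place of `final_values`.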

In [None]:
train_df_upsampled = train_df.copy(deep=True)
for domain_name in train_df['domain'].unique():
  train_df_upsampled = add_copies_of_rows(train_df_upsampled, domain_name, max_count)

display(train_df_upsampled.head())
display(train_df_upsampled['domain'].value_counts())

require #2.9673667570009035 copies for domain SOCIAL
require #1.0 copies for domain POLITICS
require #6.1363689433741975 copies for domain HEALTH
require #81.73872472783826 copies for domain CRIME
require #72.39393939393939 copies for domain SCIENCE


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
0,30966,30966,Jennifer Aniston and Justin Theroux Double-Dat...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
1,17410,17410,Kim Kardashian West on Her New Beauty Line and...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
2,23715,23715,Ruby Rose Admits That Being Mean Doesn’t Suit ...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
3,30383,30383,Kourtney Kardashian moves on from Younes Bendj...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
4,57496,57496,(Reuters) - The United States is in an economi...,0,"[{'article': None, 'author': None, 'date': 'Au...",POLITICS


SOCIAL      52558
POLITICS    52558
HEALTH      52558
CRIME       52558
SCIENCE     52557
Name: domain, dtype: int64

In [None]:
train_df_upsampled[train_df_upsampled['Unnamed: 0'] == 1366].head() # checking duplicates created

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
19296,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE
19296,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE
19296,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE
19296,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE
19296,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE


In [None]:
# Save the plain (duplicate-based) upsampled set; index=False avoids adding another "Unnamed: 0" column on reload
train_df_upsampled.to_csv("/content/drive/MyDrive/CSI4900/Datasets/ProcessedTrainingDatasets/MergedDatasets/LiarRemoved/Upsampled/training_df.csv", index=False)

## Data Augmented Upsampling

In [None]:
# Substitute words with WordNet synonyms
aug = naw.SynonymAug(aug_src='wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
def augment_df(df):
  new_df = df.copy(deep=True)
  new_df["text"] = df["text"].apply(lambda x: aug.augment(x)[0])
  return new_df
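
`augment_df` indexes the augmenter result with `[0]`, which assumes a non-empty list. As far as I know, older nlpaug releases return a plain string for a single input, and an empty or missing text may produce an empty result. The sketch below is a more defensive wrapper under those assumptions. `first_augmented`, `safe_augment`, and `UpperCaseStub` are hypothetical names; the stub only stands in for a real augmenter so the example runs without nlpaug.

```python
def first_augmented(result, fallback):
    """Return one augmented string from an augmenter result, or the fallback."""
    if isinstance(result, str):
        # Some versions may return a bare string for a single input.
        return result or fallback
    if result:
        # Non-empty list: take the first augmented variant.
        return result[0]
    return fallback


def safe_augment(text, augmenter):
    """Augment text, leaving missing or blank values untouched."""
    if not isinstance(text, str) or not text.strip():
        return text
    return first_augmented(augmenter.augment(text), fallback=text)


class UpperCaseStub:
    """Stand-in for an nlpaug augmenter (hypothetical, for demonstration only)."""

    def augment(self, text):
        return [text.upper()]


stub = UpperCaseStub()
print(safe_augment("humans are emitting carbon dioxide", stub))
print(safe_augment("", stub))
```

With the real augmenter, the `apply` call would become `df["text"].apply(lambda x: safe_augment(x, aug))`.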

In [None]:
# TESTING

# train_df_extra = train_df.copy(deep=True)
# subset = train_df_extra[train_df_extra["domain"] == 'HEALTH'].head(2)
# display(subset)
# subset2 = train_df_extra[train_df_extra["domain"] == 'SCIENCE'].head(2)
# subset = pd.concat([subset, subset2])
# display(subset)
# train_df_extra = augment_df(subset)
# display(train_df_extra)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
0,5168,5168,#COVID-19 adds to the woes of #Telangana's jai...,0,"[{'article': None, 'author': None, 'context': ...",HEALTH
2,4971,4971,Opinion: Quarantining cities isn't needed. But...,0,"[{'article': None, 'author': None, 'context': ...",HEALTH


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
0,5168,5168,#COVID-19 adds to the woes of #Telangana's jai...,0,"[{'article': None, 'author': None, 'context': ...",HEALTH
2,4971,4971,Opinion: Quarantining cities isn't needed. But...,0,"[{'article': None, 'author': None, 'context': ...",HEALTH
54,1366,1366,"Currently, humans are emitting around 29 billi...",0,"[{'article': 'Carbon dioxide', 'author': None,...",SCIENCE
98,927,927,"‘While volcanic eruptions are natural events, ...",0,"[{'article': '1257 Samalas eruption', 'author'...",SCIENCE


In [None]:
max_count = max(train_df['domain'].value_counts())
max_count

52558

In [None]:
# Add augmented copies of a domain's rows until it has (about) max_count entries
def add_copies_of_rows_aug(train_df, domain_name, max_count):
  num_entries = train_df['domain'].value_counts()[domain_name]
  domain_subset = train_df[train_df['domain'] == domain_name].copy(deep=True)

  required_count = max_count / num_entries
  print(f"require #{required_count} copies for domain {domain_name}")

  train_df_pass1 = train_df
  # print("original length:", len(train_df_pass1))

  # Each extra full copy is re-augmented, so repeated rows get different synonyms
  for i in range(1, int(required_count)):
    train_df_pass1 = pd.concat([train_df_pass1, augment_df(domain_subset)])
    # print("...adding, new length:", len(train_df_pass1))

  # Top up with an augmented partial copy; float truncation can leave a domain one row short
  final_values = int(num_entries * (required_count - int(required_count)))
  # print("...adding, final concat for extra # entries:", final_values)
  train_df_pass1 = pd.concat([train_df_pass1, augment_df(domain_subset.head(final_values))])

  return train_df_pass1

In [None]:
train_df_upsampled_aug = train_df.copy(deep=True)
for domain_name in train_df['domain'].unique():
  train_df_upsampled_aug = add_copies_of_rows_aug(train_df_upsampled_aug, domain_name, max_count)

display(train_df_upsampled_aug.head())
display(train_df_upsampled_aug['domain'].value_counts())

require #2.9673667570009035 copies for domain SOCIAL


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


require #1.0 copies for domain POLITICS
require #6.1363689433741975 copies for domain HEALTH
require #81.73872472783826 copies for domain CRIME
require #72.39393939393939 copies for domain SCIENCE


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
0,30966,30966,Jennifer Aniston and Justin Theroux Double-Dat...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
1,17410,17410,Kim Kardashian West on Her New Beauty Line and...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
2,23715,23715,Ruby Rose Admits That Being Mean Doesn’t Suit ...,0,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
3,30383,30383,Kourtney Kardashian moves on from Younes Bendj...,1,"[{'article': None, 'author': None, 'date': Non...",SOCIAL
4,57496,57496,(Reuters) - The United States is in an economi...,0,"[{'article': None, 'author': None, 'date': 'Au...",POLITICS


SOCIAL      52558
POLITICS    52558
HEALTH      52558
CRIME       52558
SCIENCE     52557
Name: domain, dtype: int64

In [None]:
display(train_df_upsampled_aug[train_df_upsampled_aug["domain"] == 'CRIME'].head())

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,label,metadata,domain
26,83,83,ion 4 April 2017 At least 19 people were kille...,1,"[{'article': None, 'author': None, 'date': '4/...",CRIME
81,678,678,ed 1329 20.07.2016) Get short URL 0 154 The Sy...,0,"[{'article': None, 'author': None, 'date': '7/...",CRIME
329,207,207,it Qaeda in Syria as well as IS Monitor AFP Tu...,1,"[{'article': None, 'author': None, 'date': '10...",CRIME
372,300,300,Jul 012016 BEIRUT At least 70 regime and rebel...,0,"[{'article': None, 'author': None, 'date': '7/...",CRIME
458,324,324,east 10 people were killed on Sunday when barr...,0,"[{'article': None, 'author': None, 'date': '9/...",CRIME


In [None]:
two = train_df_upsampled_aug[train_df_upsampled_aug['Unnamed: 0'] == 1366].head(4) # checking duplicates created
display(two["text"].iloc[0])
display(two["text"].iloc[1])
display(two["text"].iloc[2])
display(two["text"].iloc[3])

'Currently, humans are emitting around 29 billion tonnes of carbon dioxide into the atmosphere per year.'

'Presently, humans constitute emitting around xxix billion tonnes of c dioxide into the atmosphere per year.'

'Currently, humans make up let loose around 29 trillion tonnes of carbon copy dioxide into the atmosphere per year.'

'Currently, mankind are emitting around xxix trillion tonnes of carbon dioxide into the atmosphere per yr.'

Output:
- Currently, humans are emitting around 29 billion tonnes of carbon dioxide into the atmosphere per year.
- Currently, humans equal emit around twenty nine billion tonnes of carbon paper dioxide into the atmosphere per year.
- Currently, man are let loose around 29 billion metric ton of carbon dioxide into the standard atmosphere per year.
- Currently, humans be pass off around 29 billion tonnes of carbon dioxide into the standard pressure per yr.

In [None]:
# Save the augmented upsampled set; index=False avoids adding another "Unnamed: 0" column on reload
train_df_upsampled_aug.to_csv("/content/drive/MyDrive/CSI4900/Datasets/ProcessedTrainingDatasets/MergedDatasets/LiarRemoved/Upsampled_Augmented/training_df.csv", index=False)