<a href="https://colab.research.google.com/github/Dutta-SD/AggDetect/blob/master/core/Data_Augmentation_Aggression_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation for Aggression and Misogyny Detection using BERT
# © Sandip Dutta, 2021
---
Since we did not own a GPU, we ran the BERT data augmentation pipeline in Google Colab.
We then downloaded the resulting `.csv` files from Colab and used them in the prediction pipeline.

We did not refactor the code into `.py` files because we could not test them without a GPU.
The IPython Notebook format is therefore retained.


# Downloads and Imports 

In [None]:
# ENABLE GPU BEFORE PROCEEDING WITH NOTEBOOK
# Transformers library -- contains BERT based models
! pip3 install -qq transformers
# Import Libraries
import transformers
import numpy as np
import nltk
import pandas as pd
from google.colab import files
from tqdm import tqdm
import warnings

# Data Specific Parameters
The data has two target columns:

*   Sub-task A - Aggression classification into 3 classes (`NAG`, `CAG`, `OAG`)
*   Sub-task B - Misogyny detection into 2 classes (`NGEN`, `GEN`)

In each run we keep one task for augmentation and drop the column of the other task.
We then repeat the run for the other task by swapping `aug_map` and `non_aug_map`.

Running one task at a time keeps the Colab environment from crashing because of its limited resources.
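
The swap between runs can be sketched as a small helper that selects the task by name instead of editing the assignment by hand. This is only an illustration: `TASKS` and `select_tasks` are hypothetical names that do not appear in the notebook, and the label maps simply mirror the ones defined in the setup cell further down.

```python
# Hypothetical helper: choose which sub-task to augment in this run.
TASKS = {
    "Sub-task A": {"map": {"NAG": 0, "CAG": 1, "OAG": 2}, "low": ("CAG", "OAG")},
    "Sub-task B": {"map": {"NGEN": 0, "GEN": 1}, "low": ("GEN",)},
}

def select_tasks(task_to_augment: str):
    """Return (aug_map, non_aug_map) for the chosen task name."""
    if task_to_augment not in TASKS:
        raise KeyError(f"Unknown task: {task_to_augment}")
    aug = {"task": task_to_augment, **TASKS[task_to_augment]}
    # With two sub-tasks, the remaining one is the task to discard.
    other = next(name for name in TASKS if name != task_to_augment)
    non_aug = {"task": other, **TASKS[other]}
    return aug, non_aug

aug_map, non_aug_map = select_tasks("Sub-task B")
print("augmenting:", aug_map["task"], "minority labels:", aug_map["low"])
print("dropping:", non_aug_map["task"])
```

Calling `select_tasks("Sub-task A")` on the next run would reverse the two roles.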






In [None]:
!pip freeze | grep transformers

transformers==4.9.2


In [None]:
# nltk setup
nltk.download('stopwords')
# Set random seed
RANDOM_SEED = 0
transformers.trainer_utils.set_seed(RANDOM_SEED)

# Number of times to augment
NUM_TIMES_TO_AUGMENT = 2

# Stop words for filtering
stop_words = set(nltk.corpus.stopwords.words('english'))

# Train Data URL
train_data_url = 'https://raw.githubusercontent.com/Dutta-SD/NLP/master/Aggression_Detection/trac2_eng_train.csv'

# Map of tasks
T1 = {
    'task' : 'Sub-task B',
    'map' : {
        'NGEN' : 0,
        'GEN' : 1,
    },
    'low' : ('GEN',)
}

T2 = {
    'task' : 'Sub-task A',
    'map' : {
        'NAG' : 0,
        'CAG' : 1,
        'OAG' : 2,
    },
    'low' : ('CAG', 'OAG')
}

# Set the values
# aug_map : task to be augmented
# non_aug_map : task to be discarded
aug_map, non_aug_map = T1, T2

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Fetch the Data and Remove one Task

In [None]:
# Fetch the data and remove irrelevant columns
train = pd.read_csv(train_data_url)
train.drop(['ID', non_aug_map['task']], axis = 1, inplace = True)

# Preprocess and Augmentation Training
Define functions for preprocessing and Augmentation Training


In [None]:
def create_aug_pipeline(model_name : str):
    """
    Returns the BERT masked-language-model ("fill-mask") pipeline. This
    is the main model that does the augmentation for us.
    It replaces each [MASK] token with the most likely words,
    which gives us additional data. Some noise might be present
    in the generated data, which helps reduce overfitting.

    Args:
        model_name (str) : Name of the model to use for
        augmentation. See the Transformers library for more details.

    Returns:
        transformers.Pipeline object
    """
    
    # device=0 places the pipeline on the first GPU
    return transformers.pipeline("fill-mask", model_name, device=0)
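
To show the shape of what a fill-mask pipeline returns for a single `[MASK]`, the sketch below uses a stand-in function so that no model has to be downloaded. `fake_fill_mask` is hypothetical, and its candidate words, scores and token ids are made up. Only the structure matches the real pipeline output: a list of dictionaries ordered by descending score, each with `sequence`, `score`, `token` and `token_str` keys.

```python
def fake_fill_mask(text: str, mask_token: str = "[MASK]"):
    """Hypothetical stand-in for a fill-mask pipeline (no model is loaded)."""
    # Made-up candidates and scores, purely for illustration.
    candidates = [("good", 0.61), ("great", 0.27), ("nice", 0.12)]
    return [
        {
            "sequence": text.replace(mask_token, word, 1),
            "score": score,
            "token": idx,  # placeholder vocabulary id
            "token_str": word,
        }
        for idx, (word, score) in enumerate(candidates)
    ]

preds = fake_fill_mask("movie really [MASK]")
top_k = 2
# The augmentation code keeps only the filled-in sentences of the best top_k predictions
augmented = [p["sequence"] for p in preds[:top_k]]
print(augmented)
```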

In [None]:
def stringCleanerMasker(
    ip_string : str, 
    stop_words : set, 
    num_mask_per_str : int = 1, 
    mask_delim : str = '[MASK]',
    max_valid_length : int = 512,
    max_mask_delim_replace : int = 1,
    invalid_string : str = 'INVALID'
    ):
    """
    Cleans the string, removes stopwords and masks one word.
    The chosen word is replaced with a special mask token. The BERT augmentation
    pipeline then predicts the masked word, which gives us augmented data.

    Args:
        ip_string (str) : input string to mask
        stop_words (set) : set of stopwords to filter from data
        num_mask_per_str (int, Optional, default = 1) : number of candidate tokens to sample;
        only the first sampled token is masked
        mask_delim (str, Optional, default = '[MASK]') : the token that replaces the masked word
        max_valid_length (int, Optional, default = 512) : maximum number of whitespace-separated
        tokens allowed after stopword removal
        max_mask_delim_replace (int, Optional, default = 1) : maximum number of occurrences of
        the chosen word to replace with the mask
        invalid_string (str, Optional, default = 'INVALID') : string returned when the input is
        too long or cannot be masked

    Returns:
        str, cleaned string with masked token
    """   

    # Remove Stop Words
    ip_list = [tok for tok in ip_string.split() if tok not in stop_words]
    length = len(ip_list)

    if length > max_valid_length:
        return invalid_string

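    try:
        # Sample candidate tokens; only the first one is used as the mask target
        mask_token = np.random.choice(ip_list, num_mask_per_str)[0]
        finalString = ' '.join(ip_list)
        # Replace the first `max_mask_delim_replace` occurrences of the chosen token
        finalString = finalString.replace(str(mask_token), mask_delim, max_mask_delim_replace)
        return finalString
    except Exception:
        # For example, no tokens are left after stopword removal
        return invalid_string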
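
Because `str.replace` matches substrings, the function above can place the mask inside a longer word (for example, choosing `man` when `woman` occurs earlier in the text). A minimal sketch of an alternative that masks by token position is given below. It is a variant for illustration, not the function used in this notebook: `mask_one_token` and its `seed` parameter are hypothetical, and it uses the standard-library `random` module instead of NumPy.

```python
import random

def mask_one_token(text, stop_words, mask_delim="[MASK]",
                   invalid_string="INVALID", seed=0):
    """Drop stop words, then replace one randomly chosen token with the mask."""
    tokens = [tok for tok in text.split() if tok not in stop_words]
    if not tokens:
        # Nothing is left to mask after stop-word removal
        return invalid_string
    rng = random.Random(seed)  # local generator, so the choice is repeatable
    idx = rng.randrange(len(tokens))
    tokens[idx] = mask_delim  # whole-token replacement, never a substring
    return " ".join(tokens)

sample_stop_words = {"the", "is", "a"}
masked = mask_one_token("the woman is a strong man", sample_stop_words)
print(masked)
print(mask_one_token("the is a", sample_stop_words))
```

Masking by index also guarantees exactly one mask per string, even when the chosen word occurs more than once.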

In [None]:
def appendAugDataToDataFrame(
    train : pd.DataFrame,
    aug_pipe_model_name : str, 
    stopwords : set, 
    target_col : str,
    target_label : list,
    invalid_string : str = 'INVALID',
    top_k : int = 2,
    ):
    """
    Creates augmented data using the augmentation pipeline,
    collects it in a list, and appends it to the original
    data frame.

    Args:
        train (DataFrame) : The dataframe to append data to
        aug_pipe_model_name (str) : BERT model name string
        stopwords (set) : set of stopwords
        target_col (str) : name of the label column used to select rows to augment
        target_label (list) : target labels to augment. Labels with a low count in the data
        are listed here
        invalid_string (str, default = 'INVALID') : sentinel string that signals a text could
        not be masked
        top_k (int, default = 2) : The number of top predictions to append per
        masked string.

    Returns:
        pandas.DataFrame - DataFrame containing the original and augmented data
    """
    
    # which data to augment
    subset = train[target_col].isin(target_label)

    _data = train[subset]

    _text, _labels = _data['Text'], _data[target_col]

    # Augmentation pipeline
    aug_pipe = create_aug_pipeline(aug_pipe_model_name)
    
    # Augmented Data
    aug_data = []
    
    def append_string(string, target_label, stopwords, aug_container):
        """Appends string to given container"""
        clean_string = stringCleanerMasker(string, stop_words = stopwords)
        # If masking failed, add the original string unchanged
        if clean_string == invalid_string:
            aug_container.append((string, target_label))
            return aug_container

        # Generate augmentations
        aug_preds = aug_pipe(clean_string)

        # Keep the top_k predictions
        for item in aug_preds[:top_k]:
            sentence = item['sequence']
            aug_container.append((sentence, target_label))

        return aug_container

    for x, y in tqdm(zip(_text, _labels), desc = 'Augmenting... '):
        aug_data.extend(append_string(x, y, stopwords, []))

    aug_data = pd.DataFrame(aug_data, columns = train.columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    df = pd.concat([train, aug_data], ignore_index = True)
    return df
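
The data flow of the function above (select minority-class rows, mask, predict, keep `top_k` sentences, fall back to the original text when masking fails) can be sketched without pandas or a model. Everything here is a simplified stand-in: `augment_rows`, `stub_mask` and `stub_predict` are hypothetical names, and the stub predictions are made up.

```python
def augment_rows(rows, minority_labels, mask_fn, predict_fn,
                 top_k=2, invalid_string="INVALID"):
    """Return rows plus augmented copies of rows whose label is in minority_labels."""
    new_rows = []
    for text, label in rows:
        if label not in minority_labels:
            continue  # majority-class rows are left as they are
        masked = mask_fn(text)
        if masked == invalid_string:
            # Masking failed: fall back to re-adding the original text
            new_rows.append((text, label))
            continue
        for pred in predict_fn(masked)[:top_k]:
            new_rows.append((pred["sequence"], label))
    return rows + new_rows

# Hypothetical stand-ins so the sketch runs without a model
def stub_mask(text):
    words = text.split()
    if not words:
        return "INVALID"
    words[-1] = "[MASK]"
    return " ".join(words)

def stub_predict(masked):
    return [{"sequence": masked.replace("[MASK]", w)} for w in ("alpha", "beta", "gamma")]

rows = [("some text here", "GEN"), ("other text", "NGEN"), ("", "GEN")]
out = augment_rows(rows, {"GEN"}, stub_mask, stub_predict)
for text, label in out:
    print(repr(text), label)
```

Each minority-class row contributes `top_k` new rows, or a single copy of itself when it cannot be masked, so the minority class grows by at most a factor of `1 + top_k` per pass.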

# Training Augmentation Pipeline

In [None]:
# Augment the specified number of times

for _ in range(NUM_TIMES_TO_AUGMENT):
    train = appendAugDataToDataFrame(
        train = train,
        aug_pipe_model_name = "bert-base-multilingual-cased",
        stopwords = stop_words,
        target_col = aug_map['task'],
        target_label = aug_map['low']
    )
# The warning below is expected and can be ignored
# The cell can be run again for additional data
# Run time increases as more data accumulates
# TASK B time ~ 2 minutes with default settings on a GPU
# TASK A time ~ 2 minutes with default settings on a GPU

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Augmenting... : 309it [00:10, 30.53it/s]

# Validation that the Augmentation model worked fine

In [None]:
train[aug_map['task']].value_counts()

NGEN    3954
GEN     2780
Name: Sub-task B, dtype: int64
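
As a rough arithmetic check on these counts: the progress bar of the first pass shows 309 rows being augmented, and each row adds at most `top_k = 2` new rows per pass, so the minority class can at most triple with every pass. The helper below, `max_rows_after`, is a hypothetical name used only to spell out that bound for `NUM_TIMES_TO_AUGMENT = 2`.

```python
def max_rows_after(initial_rows: int, passes: int, top_k: int = 2) -> int:
    """Upper bound on minority-class rows after repeated augmentation passes."""
    rows = initial_rows
    for _ in range(passes):
        # Every existing row contributes at most top_k new rows
        rows += rows * top_k
    return rows

print(max_rows_after(309, 1))  # 927
print(max_rows_after(309, 2))  # 2781
```

The observed 2780 `GEN` rows sit one below this bound, which is consistent with a row that could not be masked and was therefore added once instead of twice. The `NGEN` count is unchanged because only the labels listed under `low` are augmented.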

# Download the data

In [None]:
# This data is added to the prediction pipeline.
# We download it and add it to the data folder of the prediction pipeline
file_name = f"Final_AUG_{aug_map['task']}_ENGLISH.csv"

# To write directly to the input folder instead, uncomment the lines below

"""
# Assuming this file is in '/core' folder
path = '../input/'
file_name = path + file_name
"""
train.to_csv(file_name, index = False)

# When storing to the input folder, remove this line
files.download(file_name)
