# Contextual Augmentation

For this experiment, we use contextual augmentation. This method randomly selects words in a text and replaces each of them with a word that a language model predicts from the surrounding words.
We apply it to the toxic comments in the training data and generate additional toxic comments until the two classes are roughly balanced.
Afterwards, we train an XLM-RoBERTa model on this augmented dataset and evaluate it on the test set.
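
To make the idea concrete, here is a minimal toy sketch of contextual substitution. It is not the nlpaug implementation: the lookup table `TOY_PREDICTIONS` and the function `contextual_substitute` are hypothetical stand-ins for a masked language model such as BERT, which would score replacement candidates from the full sentence rather than from the two neighbouring words.

```python
import random

# Hypothetical stand-in for a masked language model: maps the
# (left neighbour, right neighbour) of a masked word to plausible fill-ins.
TOY_PREDICTIONS = {
    ("this", "is"): ["comment", "post", "reply"],
    ("is", "rude"): ["really", "quite", "so"],
}


def contextual_substitute(text, aug_p=0.3, seed=0):
    """Replace randomly chosen words with words predicted from their context."""
    rng = random.Random(seed)
    tokens = text.split()
    augmented = list(tokens)
    # Skip the first and last token because they lack one neighbour
    for i in range(1, len(tokens) - 1):
        # Each word is selected for replacement with probability aug_p
        if rng.random() >= aug_p:
            continue
        # "Mask" position i and look at the original surrounding words
        context = (tokens[i - 1].lower(), tokens[i + 1].lower())
        candidates = [
            word for word in TOY_PREDICTIONS.get(context, [])
            if word != tokens[i].lower()
        ]
        if candidates:
            augmented[i] = rng.choice(candidates)
    return " ".join(augmented)


print(contextual_substitute("this article is very rude", aug_p=1.0))
```

With `aug_p=1.0` every word that has a known context is replaced, so the sentence keeps its length and structure while individual words change. The label of the comment is assumed to stay the same, which is what allows the augmented texts to be added as extra toxic samples.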


## Prepare datasets

In [1]:
# Install required packages
!pip install -q transformers[torch] datasets nlpaug
import os
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)
import nlpaug.augmenter.word as naw

# Load data
train_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')[:50000]
train_df = train_df[['comment_text', 'toxic']].dropna()

# Class distribution analysis
toxic_df = train_df[train_df['toxic'] == 1]
non_toxic_df = train_df[train_df['toxic'] == 0]

print(f"Original counts - Toxic: {len(toxic_df)}, Non-toxic: {len(non_toxic_df)}")

Original counts - Toxic: 4883, Non-toxic: 45117


In [2]:
# Original counts
toxic_count = len(toxic_df)
non_toxic_count = len(non_toxic_df)
required_toxic = non_toxic_count - toxic_count

# Setup for augmentation
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    action="substitute",
    aug_max=10,
    aug_p=0.6,
    batch_size=256,
    device='cuda'
)

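# Batch processing function
def batch_augment(texts, aug, num_variants):
    """Generate num_variants augmented copies of every text in the batch"""
    results = []
    for text in texts:
        for _ in range(num_variants):
            out = aug.augment(text)
            # Depending on the nlpaug version, augment() returns a string or a
            # list of strings; flatten so that results only contains strings
            results.extend(out if isinstance(out, list) else [out])
    return results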

# Number of augmented samples that still have to be generated
remaining = required_toxic
# Prepare list for augmented texts
augmented_toxic = []
# Number of original toxic comments processed per loop iteration
batch_size = 512

# Perform augmentation
for i in range(0, len(toxic_df), batch_size):
    batch_texts = toxic_df['comment_text'].iloc[i:i+batch_size].tolist()
    
    # Variants per comment for this batch: at least 1, at most 8
    # (the inner max() avoids a division by zero for fewer than batch_size comments)
    variants_needed = min(remaining // max(len(toxic_df) // batch_size, 1), 8)
    variants_needed = max(variants_needed, 1)
    
    # Augment batch
    try:
        augmented_batch = batch_augment(batch_texts, aug, variants_needed)
        augmented_toxic.extend(augmented_batch)
        remaining -= len(augmented_batch)
        
        print(f"Generated {len(augmented_batch)} samples | Remaining: {remaining}")
        
        if remaining <= 0:
            break
            
    # Exception handling        
    except Exception as e:
        print(f"Error in batch {i}: {str(e)}")
        continue

# Final balanced dataset: 

# 1. Convert original toxic comments to list and concatenate with list of augmented texts, afterwards, convert to dataframe
balanced_toxic = pd.DataFrame({
    'comment_text': toxic_df['comment_text'].tolist() + augmented_toxic[:required_toxic],
    'toxic': 1
})

# 2. Concatenate with non-toxic samples and shuffle the new dataset
balanced_df = pd.concat([balanced_toxic, non_toxic_df]).sample(frac=1, random_state=42)

# 3. Convert datatype of column "comment_text" to string
balanced_df['comment_text'] = balanced_df['comment_text'].astype(str)

Generated 4096 samples | Remaining: 36138
Generated 4096 samples | Remaining: 32042
Generated 4096 samples | Remaining: 27946
Generated 4096 samples | Remaining: 23850
Generated 4096 samples | Remaining: 19754
Generated 4096 samples | Remaining: 15658
Generated 4096 samples | Remaining: 11562
Generated 4096 samples | Remaining: 7466
Generated 4096 samples | Remaining: 3370
Generated 2200 samples | Remaining: 1170
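
The log above can be reproduced from the class counts alone, which also shows how close to balanced the final dataset is. The following sketch replays only the bookkeeping of the loop without running the augmenter; `simulate_generation` is a hypothetical helper introduced for illustration.

```python
toxic_count, non_toxic_count = 4883, 45117  # counts printed by the first cell
batch_size, max_variants = 512, 8           # values used in the loop above


def simulate_generation(toxic_count, non_toxic_count, batch_size, max_variants):
    """Replay the counting logic of the augmentation loop."""
    remaining = non_toxic_count - toxic_count
    full_batches = toxic_count // batch_size
    log = []
    for start in range(0, toxic_count, batch_size):
        texts_in_batch = min(batch_size, toxic_count - start)
        variants = max(min(remaining // full_batches, max_variants), 1)
        generated = texts_in_batch * variants
        remaining -= generated
        log.append((generated, remaining))
        if remaining <= 0:
            break
    return log


log = simulate_generation(toxic_count, non_toxic_count, batch_size, max_variants)
for generated, remaining in log:
    print(f"Generated {generated} samples | Remaining: {remaining}")

total_generated = sum(generated for generated, _ in log)
print(f"Augmented: {total_generated}, toxic after augmentation: {toxic_count + total_generated}")
```

Because each comment yields at most 8 variants, the loop produces 39064 augmented comments instead of the 40234 that full balance would require. The training set therefore ends up with 43947 toxic and 45117 non-toxic comments.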


In [3]:
# Use new dataframe as training set
train_df = balanced_df

# Import validation set from Kaggle and extract relevant columns
val_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
val_df = val_df[['comment_text', 'toxic']]

# Import test set from Kaggle
test_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
# Import labels of test set from Kaggle
test_df_labels = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test_labels.csv')

# Add labels to test set (assumes both files list the comments in the same order)
test_df['toxic'] = test_df_labels['toxic']
# Rename the text column to match the training data
test_df = test_df.rename(columns={'content': 'comment_text'})
# Extract relevant columns from dataframe
test_df = test_df[['comment_text', 'toxic']]

# Print sizes of the datasets
print(f"Train size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}")

Train size: 89064, Val size: 8000, Test size: 63812


## Training

In [4]:
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
from sklearn.metrics import roc_auc_score, accuracy_score

# Initialize tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Tokenization function
def tokenize_function(batch):
    tokenized = tokenizer(
        batch['comment_text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

    tokenized['labels'] = batch['toxic']
    return tokenized

# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df[['comment_text', 'toxic']])
val_dataset = Dataset.from_pandas(val_df[['comment_text', 'toxic']])
test_dataset = Dataset.from_pandas(test_df[['comment_text', 'toxic']])

train_dataset = train_dataset.map(tokenize_function, batched=True, batch_size=1024)
val_dataset = val_dataset.map(tokenize_function, batched=True, batch_size=1024)
test_dataset = test_dataset.map(tokenize_function, batched=True, batch_size=1024)

# Metric computation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()
    preds = torch.argmax(torch.tensor(logits), dim=1).numpy()
    roc_auc = roc_auc_score(labels, probs)
    accuracy = accuracy_score(labels, preds)
    
    return {
        'roc_auc': roc_auc,
        'accuracy': accuracy
    }

# Model configuration
model = XLMRobertaForSequenceClassification.from_pretrained(
    'xlm-roberta-base',
    num_labels=2
).to('cuda' if torch.cuda.is_available() else 'cpu')

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    save_strategy='no',
    load_best_model_at_end=False,
    metric_for_best_model='roc_auc',
    greater_is_better=True,
    fp16=True,
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | ROC AUC | Accuracy |
|---|---|---|---|---|
| 1 | 0.199 | 1.787323 | 0.903857 | 0.870375 |
| 2 | 0.1801 | 1.429178 | 0.915215 | 0.881375 |
| 3 | 0.1495 | 1.606748 | 0.914398 | 0.885875 |




TrainOutput(global_step=8352, training_loss=0.17253897062206633, metrics={'train_runtime': 5794.8364, 'train_samples_per_second': 46.109, 'train_steps_per_second': 1.441, 'total_flos': 1.757529227593728e+16, 'train_loss': 0.17253897062206633, 'epoch': 3.0})
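
The `global_step=8352` in the output can be sanity-checked against the training arguments. The sketch below assumes that the run used two GPUs; this is an assumption, since the notebook does not state the hardware, but it is the device count for which the arithmetic matches the reported step count.

```python
import math

train_size = 89064       # from the "Train size" output
per_device_batch = 16    # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
num_devices = 2          # assumption: not stated in the notebook

effective_batch = per_device_batch * num_devices
steps_per_epoch = math.ceil(train_size / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Under this assumption each optimizer step processes 32 comments, which gives 2784 steps per epoch and 8352 steps in total. With a single device the same settings would give 16701 steps.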

## Evaluation

In [5]:
# Evaluate on test dataset
test_results = trainer.evaluate(eval_dataset=test_dataset)

# Print results
print(f"Test ROC-AUC: {test_results['eval_roc_auc']:.4f}")
print(f"Test Accuracy: {test_results['eval_accuracy']:.4f}")



Test ROC-AUC: 0.8934
Test Accuracy: 0.8386


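The result is good. ROC AUC ranges from 0.5 for random guessing to 1.0 (100%) for a perfect ranking of toxic above non-toxic comments. With a test ROC AUC of 0.8934 (89.34%), our result is quite high.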
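
To illustrate what the two metrics measure, here is a small pure-Python sketch that mirrors the steps of `compute_metrics`: it turns logits into a probability for the toxic class and then computes ROC AUC and accuracy. The logits and labels are made-up values for illustration, and `softmax_positive_prob` and `roc_auc` are hypothetical helpers, not the scikit-learn implementation.

```python
import math


def softmax_positive_prob(logits):
    """Probability of class 1 (toxic) from a pair of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def roc_auc(labels, scores):
    """Share of (toxic, non-toxic) pairs in which the toxic sample scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a correctly ordered pair
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up logits [non-toxic, toxic] and labels, for illustration only
logits = [[2.0, -1.0], [0.5, 0.2], [0.1, 0.3], [-1.0, 2.5], [1.0, 0.8]]
labels = [0, 0, 1, 1, 1]

probs = [softmax_positive_prob(pair) for pair in logits]
preds = [int(p > 0.5) for p in probs]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

print(f"ROC AUC: {roc_auc(labels, probs):.4f}")  # 1.0000
print(f"Accuracy: {accuracy:.4f}")                # 0.8000
```

In this toy case every toxic comment receives a higher score than every non-toxic one, so ROC AUC is 1.0, yet one toxic comment falls below the 0.5 threshold and accuracy is only 0.8. ROC AUC evaluates the ranking of the scores, while accuracy depends on the decision threshold, which is why the two test metrics above differ.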