# Sarcasm Detection on Twitter Using DistilBERT

### Introduction


Sarcasm detection represents one of the more nuanced challenges in natural language processing, requiring models to understand subtle contextual cues, implied meanings, and cultural references. Our project focuses on automatically detecting sarcasm in Twitter data, which has significant applications in sentiment analysis, customer feedback interpretation, and social media monitoring. Accurate sarcasm detection can prevent misinterpretation of user intent and improve downstream NLP tasks. Our goal was to build an efficient model capable of classifying tweets as either sarcastic or non-sarcastic with high accuracy while using minimal computational resources.


### Workflow

The workflow is organized into the following major phases:

1. Dataset Preparation: Load, split, and preprocess the Twitter sarcasm dataset.
2. Data Analysis: Inspect the class distribution and preserve it across splits through stratification.
3. Model Setup: Load DistilBERT and configure it for binary classification.
4. Training: Fine-tune the model with cross-entropy loss; class weights can be added if the classes are imbalanced.
5. Validation: Monitor model performance on the validation set after each epoch.
6. Testing & Metrics: Evaluate final performance using accuracy, F1, and ROC-AUC.
7. Visualization: Plot the confusion matrix and metrics to analyze results.


In [None]:
# Sarcasm Detection in Twitter Posts Using NLP and Deep Learning

!pip install -q transformers datasets scikit-learn wandb tqdm

import os
import random

import numpy as np
import pandas as pd
import torch
import wandb
# load_dataset and DatasetDict are used when loading and rebuilding the splits
from datasets import Dataset, DatasetDict, load_dataset
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# DataLoader and tqdm are used by the batching and training cells
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Set random seed
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

## Dataset Overview
The dataset used in this project is the Automatic Sarcasm Detection in Twitter dataset, available on Hugging Face as `shiv213/Automatic-Sarcasm-Detection-Twitter`. Each example pairs a tweet (the `response` column) with a label, SARCASM (1) or NOT_SARCASM (0), which makes this a binary classification task.

## Dataset Statistics

The 5,000 labeled tweets were divided 70/15/15 into 3,500 training, 750 validation, and 750 test examples. The split is stratified on the label, so each subset keeps the same proportion of sarcastic and non-sarcastic tweets as the full dataset.

Each sample is a tweet paired with its sarcasm label, which makes the data well suited to training models that identify sarcasm in informal, real-world text.
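One way to confirm that a stratified split behaves as described is to compare the class shares in each subset. The following is a minimal sketch on made-up data; `toy_df` and its 200 rows are assumptions for illustration, and only the column names `response` and `label` come from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 200 tweets with an even label split
toy_df = pd.DataFrame({
    "response": [f"tweet {i}" for i in range(200)],
    "label": ["SARCASM", "NOT_SARCASM"] * 100,
})

# 70% train, then split the remaining 30% evenly into validation and test
train_df, temp_df = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df["label"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42
)

# Share of each class per split; stratification should keep these close
split_shares = {
    name: part["label"].value_counts(normalize=True).to_dict()
    for name, part in [("train", train_df), ("validation", val_df), ("test", test_df)]
}
print(split_shares)
```

Running the same `value_counts(normalize=True)` check on the real splits would show whether the dataset itself is balanced or skewed.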

## Data Preprocessing
The string labels SARCASM and NOT_SARCASM were mapped to the integers 1 and 0. A stratified split preserved the class proportions in each subset, and the text was encoded with the DistilBERT tokenizer to prepare it for model training.




In [None]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 4. Load and preprocess dataset
dataset = load_dataset("shiv213/Automatic-Sarcasm-Detection-Twitter")
df = dataset["train"].to_pandas()

# Train/val/test split
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df["response"], df["label"], test_size=0.3, stratify=df["label"], random_state=42
)
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42
)

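# Rebuild into Hugging Face Datasets; preserve_index=False keeps the shuffled
# pandas index from being stored as an extra "__index_level_0__" column
dataset_split = DatasetDict({
    'train': Dataset.from_pandas(pd.DataFrame({'tweet': train_texts, 'label': train_labels}), preserve_index=False),
    'validation': Dataset.from_pandas(pd.DataFrame({'tweet': val_texts, 'label': val_labels}), preserve_index=False),
    'test': Dataset.from_pandas(pd.DataFrame({'tweet': test_texts, 'label': test_labels}), preserve_index=False)
})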



## Tokenization and Label Preparation
### Model and Tokenizer Selection
We use the `distilbert-base-uncased` checkpoint and load its tokenizer with `AutoTokenizer`.

### Tokenization
Each tweet is tokenized, truncated to at most 128 tokens, and padded to exactly 128 tokens so that all inputs have the same length.

### Label Alignment
The string in the `label` field is converted to an integer and stored under `labels`, the argument name Hugging Face models expect.

### Dataset Mapping
`Dataset.map()` applies the tokenization and label mapping to every example in each split.

## Why DistilBERT?

Large models such as BERT, RoBERTa, or GPT-3 are powerful, but they require heavy compute and are slow at inference, which makes them impractical for many low-resource or real-time applications. According to its authors, DistilBERT:

- retains 97% of BERT's language understanding capabilities,
- is 40% smaller and 60% faster at inference, and
- is trained by knowledge distillation from BERT, which preserves key linguistic knowledge.


## Model Initialization
Load the pre-trained DistilBERT base model with a sequence classification head.

Set `num_labels=2` so that the head outputs two logits, one per class.

Move the model to the GPU if one is available; otherwise it runs on the CPU.
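To make the shape of the classification head concrete, here is a minimal sketch that builds a tiny, randomly initialized DistilBERT classifier from a configuration rather than from the pre-trained checkpoint. The sizes in `toy_config` are assumptions chosen so the example runs quickly without a download; they are not the dimensions of `distilbert-base-uncased`.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Toy configuration (hypothetical sizes) so the sketch runs without a download
toy_config = DistilBertConfig(
    vocab_size=1000, dim=64, hidden_dim=128, n_layers=1, n_heads=2, num_labels=2
)
toy_classifier = DistilBertForSequenceClassification(toy_config)
toy_classifier.eval()

# A batch of 4 sequences, each 128 token ids long, with no padding
input_ids = torch.randint(0, toy_config.vocab_size, (4, 128))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    logits = toy_classifier(input_ids=input_ids, attention_mask=attention_mask).logits

logits_shape = tuple(logits.shape)                          # (batch, num_labels)
head_out_features = toy_classifier.classifier.out_features  # equals num_labels
print(logits_shape, head_out_features)
```

Whatever the encoder size, `num_labels=2` fixes the output at two logits per tweet, and `argmax` over them gives the predicted class.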

In [None]:

# 5. Tokenization
label_mapping = {'SARCASM': 1, 'NOT_SARCASM': 0}
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return {
        **tokenizer(example["tweet"], padding="max_length", truncation=True, max_length=128),
        "labels": label_mapping[example["label"]]
    }

tokenized_datasets = dataset_split.map(tokenize_function)
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

Map:   0%|          | 0/3500 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]

In [None]:

# 6. Create DataLoaders
batch_size = 16
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=batch_size)
test_dataloader = DataLoader(tokenized_datasets["test"], batch_size=batch_size)

## Loss Function and Optimizer
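Use CrossEntropyLoss as the training objective. The cell below instantiates it without class weights; a `weight` tensor can be passed to address class imbalance.

Optimize with AdamW, a common choice for transformer fine-tuning, at a learning rate of 2e-5.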
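If class weighting is wanted, one common approach is inverse-frequency ("balanced") weights passed to `torch.nn.CrossEntropyLoss`. The sketch below assumes integer labels; the helper name `make_weighted_loss` and the toy label counts are hypothetical and not part of this notebook.

```python
import numpy as np
import torch

def make_weighted_loss(train_labels, num_classes=2):
    """Build CrossEntropyLoss with inverse-frequency class weights.

    Assumes every class appears at least once in train_labels.
    """
    counts = np.bincount(np.asarray(train_labels), minlength=num_classes)
    # "Balanced" weighting: n_samples / (n_classes * count_per_class)
    weights = counts.sum() / (num_classes * counts)
    return torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

# Toy, imbalanced labels (hypothetical): 8 non-sarcastic, 2 sarcastic
imbalanced_labels = [0] * 8 + [1] * 2
weighted_loss_fn = make_weighted_loss(imbalanced_labels)
class_weights = weighted_loss_fn.weight

# The rarer class contributes more to the loss per example
example_logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])
example_targets = torch.tensor([0, 1])
example_loss = weighted_loss_fn(example_logits, example_targets)
print(class_weights, example_loss.item())
```

To take effect in training, the loss would have to be computed as `weighted_loss_fn(outputs.logits, labels)` rather than read from `outputs.loss`. When the classes are evenly split, every weight equals 1 and the result matches the unweighted loss.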



In [None]:


# 7. Model and optimizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Model Training and Validation
### Training Loop Overview
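We fine-tune the DistilBERT model for sarcasm detection for 10 epochs. The loss is the cross-entropy that the model computes internally when `labels` are passed (`outputs.loss`); this loss is unweighted. Each epoch is capped at `max_batches_per_epoch = 5` batches of 16 tweets, so the model sees only 80 training examples per epoch. Gradient accumulation is set to 1, meaning the optimizer steps after every batch. After each epoch the model is evaluated on the validation set, the metrics are printed to the notebook output, and the checkpoint with the lowest validation loss so far is saved. Although `wandb` is imported, the loop as written does not log to Weights & Biases (W&B).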
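Gradient accumulation only changes behavior when the step count is greater than 1. The sketch below uses a toy linear model (an assumption standing in for DistilBERT) to show that accumulating gradients over equal-sized micro-batches, with each loss divided by the number of steps, reproduces the gradient of one large batch.

```python
import torch

torch.manual_seed(0)
toy_model = torch.nn.Linear(4, 2)      # toy stand-in for the classifier
toy_loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# Reference: gradient from one full batch of 8 examples
toy_model.zero_grad()
toy_loss_fn(toy_model(x), y).backward()
full_batch_grad = toy_model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2, each loss scaled by 1/4
accumulation_steps = 4
toy_model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    micro_loss = toy_loss_fn(toy_model(xb), yb) / accumulation_steps
    micro_loss.backward()              # gradients add up in .grad
accumulated_grad = toy_model.weight.grad.clone()

grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
print(grads_match)
```

Without the division by `accumulation_steps`, the accumulated gradient would be four times larger, which acts like an unintended increase in the learning rate.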

### Training Loop Code Details
Epochs: 10.

Loss Computation: Cross-entropy returned by the model (`outputs.loss`), without class weights.

Optimizer: AdamW with a learning rate of 2e-5.

Metrics: Training loss, validation loss, accuracy, and weighted F1 score are printed every epoch.

Validation Method: `evaluate()`
The model is evaluated on the validation set after each epoch without updating the weights.

### Metrics Computed
Training Loss: Average loss over the training batches processed in the epoch.

Validation Loss: Average loss over the validation batches.

Validation Accuracy: Accuracy on the validation set.

Validation Weighted F1: F1 score averaged over classes, weighted by class support.

Validation Weighted Precision/Recall: Not computed by `evaluate()` as written; they can be added with scikit-learn.
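The weighted metrics listed above can be computed together with scikit-learn. The sketch below is one possible helper; the name `summarize_metrics` and the toy labels are hypothetical, and `zero_division=0` is an assumption about how to score a class that is never predicted.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize_metrics(labels, preds):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy case (hypothetical data): a balanced set where every prediction is class 0
balanced_labels = [0, 1] * 5
single_class_preds = [0] * 10
toy_metrics = summarize_metrics(balanced_labels, single_class_preds)
print(toy_metrics)
```

In this toy case the accuracy is 0.5 and the weighted F1 is about 0.3333. The same pair appears in the validation lines for epochs 1 and 2 of the training log, which is consistent with an evenly split validation set on which the model initially predicts a single class.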


In [None]:





# 9. Training loop (run the cell that defines evaluate() first)
num_epochs = 10
max_batches_per_epoch = 5        # cap each epoch at 5 batches to keep runs short
gradient_accumulation_steps = 1  # 1 = optimizer step after every batch
best_val_loss = float("inf")

for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    total_loss = 0.0
    steps = 0

    for step, batch in enumerate(tqdm(train_dataloader, desc=f"Epoch {epoch+1}")):
        if step >= max_batches_per_epoch:
            break

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Passing labels makes the model return its built-in cross-entropy loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Scale so gradients accumulated over several batches are averaged
        (loss / gradient_accumulation_steps).backward()

        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

        total_loss += loss.item()
        steps += 1

    avg_train_loss = total_loss / steps
    print(f"Epoch {epoch+1}: Avg Train Loss = {avg_train_loss:.4f}")

    # Validate after each epoch; evaluate() prints metrics and returns the loss
    val_loss = evaluate(model, val_dataloader, loss_fn, device)

    # Save a checkpoint whenever the validation loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save_pretrained("best_model")
        tokenizer.save_pretrained("best_model")
        print("✅ Best model saved.")

Epoch 1:   2%|▏         | 5/219 [00:59<42:25, 11.90s/it]


Epoch 1: Avg Train Loss = 0.6745
Validation Loss: 0.6922 | Accuracy: 0.5000 | F1 (weighted): 0.3333
✅ Best model saved.


Epoch 2:   2%|▏         | 5/219 [00:54<39:08, 10.98s/it]


Epoch 2: Avg Train Loss = 0.6652
Validation Loss: 0.6738 | Accuracy: 0.5000 | F1 (weighted): 0.3333
✅ Best model saved.


Epoch 3:   2%|▏         | 5/219 [00:54<38:47, 10.88s/it]


Epoch 3: Avg Train Loss = 0.6699
Validation Loss: 0.6507 | Accuracy: 0.5560 | F1 (weighted): 0.4708
✅ Best model saved.


Epoch 4:   2%|▏         | 5/219 [00:54<39:09, 10.98s/it]


Epoch 4: Avg Train Loss = 0.6439
Validation Loss: 0.6250 | Accuracy: 0.7147 | F1 (weighted): 0.7145
✅ Best model saved.


Epoch 5:   2%|▏         | 5/219 [00:54<39:08, 10.97s/it]


Epoch 5: Avg Train Loss = 0.6076
Validation Loss: 0.5922 | Accuracy: 0.7040 | F1 (weighted): 0.7040
✅ Best model saved.


Epoch 6:   2%|▏         | 5/219 [00:54<39:07, 10.97s/it]


Epoch 6: Avg Train Loss = 0.5920
Validation Loss: 0.5709 | Accuracy: 0.7120 | F1 (weighted): 0.7103
✅ Best model saved.


Epoch 7:   2%|▏         | 5/219 [00:54<39:10, 10.98s/it]


Epoch 7: Avg Train Loss = 0.5253
Validation Loss: 0.5574 | Accuracy: 0.7200 | F1 (weighted): 0.7136
✅ Best model saved.


Epoch 8:   2%|▏         | 5/219 [01:07<48:06, 13.49s/it]


Epoch 8: Avg Train Loss = 0.5670
Validation Loss: 0.5601 | Accuracy: 0.7147 | F1 (weighted): 0.7146


Epoch 9:   2%|▏         | 5/219 [01:09<49:27, 13.87s/it]


Epoch 9: Avg Train Loss = 0.5896
Validation Loss: 0.5315 | Accuracy: 0.7387 | F1 (weighted): 0.7335
✅ Best model saved.


Epoch 10:   2%|▏         | 5/219 [01:04<46:02, 12.91s/it]


Epoch 10: Avg Train Loss = 0.6251
Validation Loss: 0.5264 | Accuracy: 0.7480 | F1 (weighted): 0.7439
✅ Best model saved.




#### Validation Accuracy and F1 Score

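Accuracy: ~75% after the final epoch (0.7480).

F1 Score: The weighted F1 (0.7439 after the final epoch) tracks accuracy closely from epoch 4 onward, which indicates that neither class dominates the predictions.

### Overall Learning Behavior

Validation loss fell from 0.6922 to 0.5264 over the 10 epochs and was still decreasing at the last epoch, so the model had not yet converged. Training loss was noisier (between 0.5253 and 0.6745) because each epoch averages over only five batches.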

### Test Set Evaluation
Model Evaluation: The model is evaluated on the held-out test set by computing the average loss and collecting the predictions.

Metrics: Accuracy, F1 score, precision, and recall.

Classification Report & Confusion Matrix: Generated with scikit-learn and visualized.
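The report and confusion matrix can be produced with the scikit-learn functions already imported in the first cell. The following is a minimal sketch on made-up predictions; `y_true` and `y_pred` are hypothetical values, not results from this model.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical predictions for eight test tweets (0 = NOT_SARCASM, 1 = SARCASM)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
report = classification_report(
    y_true, y_pred, labels=[0, 1],
    target_names=["NOT_SARCASM", "SARCASM"], zero_division=0,
)
print(cm)
print(report)
```

In the notebook, `y_true` and `y_pred` would be the labels and predictions collected over `test_dataloader`. Since `evaluate()` returns only the average loss, it would need to return those two lists as well.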


In [None]:
# 8. Validation function (define this before running the training loop)
def evaluate(model, dataloader, loss_fn, device, phase="Validation"):
    """Print average loss, accuracy, and weighted F1; return the average loss.

    loss_fn is accepted for interface compatibility but is not used: the loss
    comes from the model's built-in cross-entropy (outputs.loss).
    """
    model.eval()
    total_loss = 0.0
    preds, labels = [], []

    with torch.no_grad():  # no gradients are needed during evaluation
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            label_tensor = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=label_tensor)
            loss = outputs.loss
            total_loss += loss.item()

            # Predicted class is the index of the larger of the two logits
            logits = outputs.logits
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            labels.extend(label_tensor.cpu().numpy())

    avg_loss = total_loss / len(dataloader)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')

    print(f"{phase} Loss: {avg_loss:.4f} | Accuracy: {acc:.4f} | F1 (weighted): {f1:.4f}")
    return avg_loss

## Test Set Evaluation for Sarcasm Detection
### Confusion Matrix Insights
The confusion matrix shows that most errors fall on the sarcastic class.

Non-sarcastic tweets are classified more accurately than sarcastic ones.

Some sarcastic tweets are misclassified as non-sarcastic because of subtle cues or ambiguous phrasing, which is common in real-world sarcasm detection.

Class-wise F1:

Non-sarcastic: Higher F1 (~74%).

Sarcastic: Lower F1 (~33%).

### Future Improvements
- Use larger models such as BERT or RoBERTa for richer language understanding.
- Apply early stopping to avoid overfitting when training for more epochs.
- Augment sarcastic examples if the class distribution turns out to be skewed.
- Explore context-aware features, such as user history or emoji use.
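Early stopping can be added with a small helper that tracks the best validation loss. The sketch below is one possible form; the class `EarlyStopping` and its `patience` and `min_delta` parameters are hypothetical names, not part of this notebook or of a library API.

```python
class EarlyStopping:
    """Signal a stop when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Validation losses copied from the training log above
val_losses = [0.6922, 0.6738, 0.6507, 0.6250, 0.5922,
              0.5709, 0.5574, 0.5601, 0.5315, 0.5264]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

With a patience of 2, the logged run would not have stopped early: epoch 8 is the only epoch without an improvement, and epoch 9 improves again. In the training loop, `stopper.step(val_loss)` would be called right after `evaluate()`, followed by a `break` when it returns `True`.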

## Conclusion

DistilBERT, fine-tuned on stratified splits of the Twitter sarcasm dataset, reached about 75% validation accuracy (weighted F1 of 0.74) after 10 short epochs. The results suggest that it can serve as a fast, lightweight baseline for real-world sarcasm detection, although sarcastic tweets remain harder to classify than non-sarcastic ones.
 







