# Sarcasm Detection in Albanian News

#### Objective

- Develop a machine learning model (BERT-based) to detect sarcasm in Albanian news articles.
- Perform binary classification:  
  **Sarcastic (1)** vs **Not Sarcastic (0)**.

---

--- Challenges

- No pre-annotated sarcasm labels exist for Albanian news.
- Sarcasm detection requires contextual and semantic understanding.
- The dataset is large (~4GB), requiring efficient sampling and preprocessing.
- Sarcasm is naturally rare and may lead to class imbalance.

---

--- Approach

--- 1. Data Sampling

- Extract a manageable subset (1,500–3,000 articles) for manual annotation.
- Apply:
  - **Stratified sampling** across categories and sources.
  - **Keyword-based filtering** to identify potential sarcasm candidates.
  - Include articles from satire domains (e.g., Kungulli) as sarcasm candidates.

---

--- 2. Annotation Process

- Two annotators manually label the selected articles.
- Labels:
  - `1 = Sarcastic`
  - `0 = Not Sarcastic`
  - `? = Unsure` (for later review)

- Create clear annotation guidelines to ensure consistency.
- Perform initial calibration:
  - Both annotators label the same 100 samples.
  - Compare results and refine guidelines.
- Resolve disagreements through discussion.

---

--- 3. Active Learning (Optional Optimization)

- Train a preliminary classifier on early labeled data.
- Identify uncertain samples (probability close to 0.5).
- Prioritize these samples for annotation.
- Iteratively improve dataset quality and model performance.

---

--- 4. Model Training

- Fine-tune a multilingual transformer model:
  - **XLM-R**
  - or **Multilingual BERT**

- Compare against baseline models:
  - Logistic Regression
  - LinearSVC
  - Multinomial Naive Bayes

- Use standard NLP preprocessing and tokenization.

---

--- 5. Evaluation Strategy

- Split dataset into:
  - 70% Training
  - 15% Validation
  - 15% Test (held-out set)

- Apply stratified splitting to maintain class balance.
- Avoid data leakage.
- Perform cross-validation during development.

- Evaluate using:
  - **Precision**
  - **Recall**
  - **F1-score (Primary Metric)**
  - Confusion Matrix
  - Accuracy

---

--- Expected Outcome

- A trained sarcasm detection model for Albanian news.
- The first manually annotated sarcasm dataset in Albanian news domain.
- Performance comparison between:
  - Classical machine learning models
  - Transformer-based deep learning models
- A reproducible research pipeline for future sarcasm detection studies.

---

--- Project Summary

This project aims to build the first sarcasm detection system for Albanian news articles by constructing a manually annotated dataset and applying transformer-based classification methods. The study evaluates both classical machine learning approaches and deep learning architectures to determine the most effective method for detecting sarcasm in low-resource languages.


## BERT

In [2]:
# ============================================================
# Lightweight BERT fine-tuning for Sarcasm Detection (Albanian) - MPS friendly
# Input CSV columns: content , is_sarcasm (0/1)
# ============================================================

from pathlib import Path
import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed,
)

set_seed(42)

# ---------------------------
# PATHS
# ---------------------------
REPO_ROOT = Path.cwd()
while REPO_ROOT != REPO_ROOT.parent and not (REPO_ROOT / "data").exists():
    REPO_ROOT = REPO_ROOT.parent

DATA_DIR = REPO_ROOT / "data"
LABELED_FILE = DATA_DIR / "sarcasm_detection_dataset_v1.csv"

OUT_DIR = REPO_ROOT / "models" / "sarcasm_bert_v1"
OUT_DIR.mkdir(parents=True, exist_ok=True)

PRED_FILE = DATA_DIR / "sarcasm_labeled_predictions_v1.csv"

# ---------------------------
# DEVICE
# ---------------------------
use_mps = torch.backends.mps.is_available()
use_cuda = torch.cuda.is_available()
device = "mps" if use_mps else ("cuda" if use_cuda else "cpu")
print("Device:", device)

# ---------------------------
# MODEL (smaller than xlm-roberta-base)
# ---------------------------
MODEL_NAME = "distilbert-base-multilingual-cased"  # ✅ much smaller, MPS-friendly
MAX_LEN = 96
BATCH_SIZE = 1
EPOCHS = 3
LR = 2e-5

# ---------------------------
# LOAD DATA
# ---------------------------
df = pd.read_csv(LABELED_FILE, engine="python", on_bad_lines="skip", dtype={"content": str})
df.columns = [c.strip().lower() for c in df.columns]

df["content"] = df["content"].fillna("").astype(str).str.strip()
df = df[df["content"] != ""].copy()

df["is_sarcasm"] = df["is_sarcasm"].fillna("").astype(str).str.strip()
df = df[df["is_sarcasm"].isin(["0", "1"])].copy()
df["labels"] = df["is_sarcasm"].astype(int)

print("Loaded labeled rows:", len(df))
print(df["labels"].value_counts())

train_df, test_df = train_test_split(
    df[["content", "labels"]],
    test_size=0.2,
    random_state=42,
    stratify=df["labels"],
)

train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
test_ds  = Dataset.from_pandas(test_df.reset_index(drop=True))

# ---------------------------
# TOKENIZE
# ---------------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_batch(batch):
    return tokenizer(batch["content"], truncation=True, max_length=MAX_LEN)

train_ds = train_ds.map(tokenize_batch, batched=True)
test_ds  = test_ds.map(tokenize_batch, batched=True)

keep = {"input_ids", "attention_mask", "labels"}
train_ds = train_ds.remove_columns([c for c in train_ds.column_names if c not in keep])
test_ds  = test_ds.remove_columns([c for c in test_ds.column_names if c not in keep])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ---------------------------
# MODEL LOAD (fp16 on MPS helps memory)
# ---------------------------
torch_dtype = torch.float16 if device == "mps" else torch.float32
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    torch_dtype=torch_dtype,
)

# Enable checkpointing (saves memory during training)
try:
    model.gradient_checkpointing_enable()
except Exception:
    pass

# ---------------------------
# METRICS
# ---------------------------
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary", zero_division=0)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

# ---------------------------
# TRAINING ARGS
# ---------------------------
args = TrainingArguments(
    output_dir=str(OUT_DIR),
    eval_strategy="epoch",
    save_strategy="no",
    logging_strategy="steps",
    logging_steps=25,
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=16,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    report_to="none",
    fp16=False,   # leave False on MPS
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# ---------------------------
# TRAIN
# ---------------------------
trainer.train()

# ---------------------------
# EVAL
# ---------------------------
pred_out = trainer.predict(test_ds)
logits = pred_out.predictions
y_true = pred_out.label_ids
y_pred = np.argmax(logits, axis=1)

print("\n=== Metrics (test) ===")
print(trainer.evaluate(test_ds))

print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_true, y_pred))

print("\n=== Classification Report ===")
print(classification_report(y_true, y_pred, digits=4))

# ---------------------------
# SAVE
# ---------------------------
trainer.save_model(str(OUT_DIR))
tokenizer.save_pretrained(str(OUT_DIR))
print("\nSaved model to:", OUT_DIR)

probs = torch.softmax(torch.tensor(logits), dim=1).numpy()
conf = probs.max(axis=1)

pred_df = test_df.copy()
pred_df["pred_label"] = y_pred
pred_df["confidence"] = conf
pred_df.to_csv(PRED_FILE, index=False, encoding="utf-8")
print("Saved predictions CSV to:", PRED_FILE)

Device: mps
Loaded labeled rows: 3000
labels
0    2259
1     741
Name: count, dtype: int64


Map: 100%|██████████| 2400/2400 [00:00<00:00, 6742.62 examples/s]
Map: 100%|██████████| 600/600 [00:00<00:00, 6286.29 examples/s]
Loading weights: 100%|██████████| 100/100 [00:00<00:00, 197.43it/s, Materializing param=distilbert.transformer.layer.5.sa_layer_norm.weight]  
[1mDistilBertForSequenceClassification LOAD REPORT[0m from: distilbert-base-multilingual-cased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.weight | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
classifier.bias         | MISSING    | 
classifier.weight       | MISSING    | 
pre_classifier.bias     | MISSING    | 
pre_classifier.weight   | MISSING    | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING[3m	:those params were newly initialized because mi

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0,,0.753333,0.0,0.0,0.0
2,0.0,,0.753333,0.0,0.0,0.0
3,0.0,,0.753333,0.0,0.0,0.0


  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)



=== Metrics (test) ===


  super().__init__(loader)


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


{'eval_loss': nan, 'eval_accuracy': 0.7533333333333333, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_f1': 0.0, 'eval_runtime': 8.8221, 'eval_samples_per_second': 68.011, 'eval_steps_per_second': 68.011, 'epoch': 3.0}

=== Confusion Matrix ===
[[452   0]
 [148   0]]

=== Classification Report ===
              precision    recall  f1-score   support

           0     0.7533    1.0000    0.8593       452
           1     0.0000    0.0000    0.0000       148

    accuracy                         0.7533       600
   macro avg     0.3767    0.5000    0.4297       600
weighted avg     0.5675    0.7533    0.6474       600



Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  3.01it/s]


Saved model to: /Users/bleronaidrizi/Sources/Master_Tema_e_Diplomes/Punimi/Sarcasm-Detection-Albanian-News-Dataset/models/sarcasm_bert_v1
Saved predictions CSV to: /Users/bleronaidrizi/Sources/Master_Tema_e_Diplomes/Punimi/Sarcasm-Detection-Albanian-News-Dataset/data/sarcasm_labeled_predictions_v1.csv



