# Sarcasm Detection in Albanian News

#### Objective

- Develop a machine learning model (BERT-based) to detect sarcasm in Albanian news articles.
- Perform binary classification:  
  **Sarcastic (1)** vs **Not Sarcastic (0)**.

---

#### Challenges

- No pre-annotated sarcasm labels exist for Albanian news.
- Sarcasm detection requires contextual and semantic understanding.
- The dataset is large (~4 GB), so sampling and preprocessing must be efficient.
- Sarcasm is rare in news text, so the labeled data is likely to be class-imbalanced.

---

#### Approach

##### 1. Data Sampling

- Extract a manageable subset (1,500–3,000 articles) for manual annotation.
- Apply:
  - **Stratified sampling** across categories and sources.
  - **Keyword-based filtering** to identify potential sarcasm candidates.
  - **Satire-domain inclusion**: add articles from satire outlets (e.g., Kungulli) as sarcasm candidates.
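
The sampling steps above can be sketched as follows. This is a minimal illustration, assuming a DataFrame with `source`, `category`, and `text` columns. The helpers `stratified_sample` and `flag_keyword_candidates` and the cue words in `SARCASM_HINTS` are hypothetical placeholders, not the project's actual code or keyword list.

```python
import re

import pandas as pd

# Illustrative cue words only (assumption); the real list should come from the annotators.
SARCASM_HINTS = ["gjoja", "kinse"]


def stratified_sample(df, by, n_total, seed=42):
    """Draw about n_total rows, spread evenly over the groups defined by `by`."""
    n_groups = max(1, df.groupby(by).ngroups)
    per_group = max(1, n_total // n_groups)
    # Shuffle once, then keep the first per_group rows of every group.
    # Groups smaller than per_group are kept whole.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(by).head(per_group)


def flag_keyword_candidates(df, text_col="text", keywords=SARCASM_HINTS):
    """Return a boolean Series marking rows whose text contains any cue word."""
    pattern = "|".join(re.escape(k) for k in keywords)
    return df[text_col].fillna("").str.contains(pattern, case=False, regex=True)
```

Keyword hits only nominate candidates for annotation; they are not labels, and the annotators still decide every case.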

---

##### 2. Annotation Process

- Two annotators manually label the selected articles.
- Labels:
  - `1 = Sarcastic`
  - `0 = Not Sarcastic`
  - `? = Unsure` (for later review)

- Create clear annotation guidelines to ensure consistency.
- Perform initial calibration:
  - Both annotators label the same 100 samples.
  - Compare results and refine guidelines.
- Resolve disagreements through discussion.
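
One common way to quantify agreement on the 100 shared calibration samples is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The plan above does not name an agreement metric, so the choice of kappa and the function name `cohen_kappa` are assumptions. A minimal sketch:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Labels can be any hashable values, so `"?"` can be passed as a third category or filtered out before the call.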

---

##### 3. Active Learning (Optional Optimization)

- Train a preliminary classifier on early labeled data.
- Identify uncertain samples (probability close to 0.5).
- Prioritize these samples for annotation.
- Iteratively improve dataset quality and model performance.
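
The selection step can be sketched as plain uncertainty sampling: rank unlabeled articles by how close the classifier's predicted probability of "sarcastic" is to 0.5. The function name `most_uncertain` and the example probabilities are hypothetical.

```python
def most_uncertain(probabilities, k):
    """Return the indices of the k samples whose P(sarcastic) is closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Made-up probabilities from a preliminary classifier (illustration only).
example_probs = [0.9, 0.52, 0.1, 0.45, 0.7]
next_batch = most_uncertain(example_probs, k=2)
```

The returned indices point at the articles to send to the annotators next; after labeling, the classifier is retrained and the ranking repeated.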

---

##### 4. Model Training

- Fine-tune a multilingual transformer model:
  - **XLM-R**
  - or **Multilingual BERT**

- Compare against baseline models:
  - Logistic Regression
  - LinearSVC
  - Multinomial Naive Bayes

- Tokenize with each transformer's own subword tokenizer; for the classical baselines, use bag-of-words features such as TF-IDF.
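
As a hedged sketch of one classical baseline, the snippet below chains TF-IDF features into a logistic regression with scikit-learn. The feature settings, the `class_weight="balanced"` choice, and the helper name `build_baseline` are assumptions rather than settings the project has fixed, and the toy texts are placeholders, not real annotations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    """TF-IDF word unigrams and bigrams feeding a class-weighted logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # class_weight="balanced" reweights classes inversely to their frequency,
        # which helps when sarcastic examples are rare.
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


# Placeholder toy data, NOT real annotations.
texts = [
    "placeholder sarcastic text one",
    "placeholder sarcastic text two",
    "placeholder neutral report one",
    "placeholder neutral report two",
]
labels = [1, 1, 0, 0]

model = build_baseline()
model.fit(texts, labels)
predictions = model.predict(texts)
```

Swapping `LogisticRegression` for `LinearSVC` or `MultinomialNB` inside the same pipeline gives the other two baselines with identical features.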

---

##### 5. Evaluation Strategy

- Split dataset into:
  - 70% Training
  - 15% Validation
  - 15% Test (held-out set)

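- Apply stratified splitting so each split preserves the sarcastic/non-sarcastic ratio.
- Avoid data leakage: remove duplicate articles before splitting and fit all preprocessing on the training split only.
- Perform cross-validation on the training split during development, keeping the test set untouched until the final evaluation.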
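
A minimal sketch of the 70/15/15 stratified split, assuming the annotated data sits in a DataFrame with `text` and `label` columns (both column names and the helper `split_70_15_15` are hypothetical). It uses scikit-learn's `train_test_split` twice and drops exact duplicate texts first, so the same article cannot land in two splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_70_15_15(df, text_col="text", label_col="label", seed=42):
    """Return (train, val, test) with the label ratio preserved in each split."""
    # Exact duplicates across splits would leak test information into training.
    df = df.drop_duplicates(subset=text_col)
    train, rest = train_test_split(
        df, test_size=0.30, stratify=df[label_col], random_state=seed
    )
    # Split the remaining 30% in half: 15% validation, 15% test.
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest[label_col], random_state=seed
    )
    return train, val, test
```

Stratification needs at least two examples of every class in each stage, so `?` rows should be resolved or removed before splitting.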

- Evaluate using:
  - **Precision**
  - **Recall**
  - **F1-score (Primary Metric)**
  - Confusion Matrix
  - Accuracy
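
The metrics above can be computed directly from the confusion-matrix counts. The sketch below uses the hypothetical helper `binary_metrics` and made-up labels. It also shows why F1 rather than accuracy is the primary metric when sarcasm is rare.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for the positive (sarcastic) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up labels: a model that never predicts "sarcastic" on a 10% positive set.
always_negative = binary_metrics([1] + [0] * 9, [0] * 10)
```

Here `always_negative["accuracy"]` is high while `always_negative["f1"]` is zero, which is the failure mode accuracy alone would hide.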

---

#### Expected Outcome

- A trained sarcasm detection model for Albanian news.
- The first manually annotated sarcasm dataset for the Albanian news domain.
- Performance comparison between:
  - Classical machine learning models
  - Transformer-based deep learning models
- A reproducible research pipeline for future sarcasm detection studies.

---

#### Project Summary

This project aims to build the first sarcasm detection system for Albanian news articles by constructing a manually annotated dataset and fine-tuning transformer-based classifiers on it. The study compares classical machine learning baselines with transformer models to determine which approach detects sarcasm most effectively in Albanian, a low-resource language.

In [9]:
# Used Libraries

import os
import re

import pandas as pd
from IPython.display import display  # explicit import; display() is used in print_dataset

In [10]:
# Helper methods

def print_dataset(text, df):
    """Print a caption, then render the first rows of df."""
    print("\n" + text + ":")
    display(df.head())

def read_dataset(path):
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)

### Constants

In [None]:
DF_COLUMNS = ['content', 'date', 'title', 'category', 'author', 'source']
DF_PATH = "../data/kosovo_news.csv"
PREPROCESSED_DF_PATH = "../data/preprocessed_kosovo_news.csv"
SCFA_OUT_FILE  = "sarcasm_candidates_for_annotation.csv"
TITLE_COL    = "title"
TEXT_COL     = "text"
CATEGORY_COL = "category"
SOURCE_COL   = "source"

In [20]:
df = read_dataset(PREPROCESSED_DF_PATH)
df.head()

|   | title | category | source | text |
|---|-------|----------|--------|------|
| 0 | As Kate as Meghan; ja cila është princesha më ... | Fun;Argëtim | Lajmi | as kate as meghan; ja cila është princesha më ... |
| 1 | I kapen 10 kg substanca narkotike në BMW X5, a... | Lajme;Nacionale | Lajmi | i kapen 10 kg substanca narkotike në bmw x5, a... |
| 2 | E fundit, Mbappe mund të zyrtarizohet nesër te... | La Liga;Lajme futbolli;Sport | Lajmi | e fundit, mbappe mund të zyrtarizohet nesër te... |
| 3 | Enca e quan jetë pushimin në plazh me poza në ... | nan;Entertainment | Lajmi | enca e quan jetë pushimin në plazh me poza në ... |
| 4 | Gurët në veshka – Kurat natyrale dhe si t’i pë... | Lifestyle;Shëndeti | Lajmi | gurët në veshka – kurat natyrale dhe si t’i pë... |



## Approach

### 1. Data Sampling

In [27]:
import os
import re
import pandas as pd

DATA_PATH = "../data/preprocessed_kosovo_news.csv"
OUT_FILE  = "annotation_dataset_balanced.csv"

TARGET_N = 3000
CHUNKSIZE = 50_000
RANDOM_STATE = 42

EXCLUDE_SOURCES = ["kallxo", "kallxo.com"]
TOP_SOURCES = 12

def get_output_path(data_path: str, out_file: str) -> str:
    folder = os.path.dirname(os.path.abspath(data_path))
    return os.path.join(folder, out_file)

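def is_only_timestamp(x) -> bool:
    """True if the value is missing or is just an ISO date/datetime string."""
    if pd.isna(x):
        return True
    s = str(x).strip()
    return bool(
        re.match(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z)?$", s) or
        re.match(r"^\d{4}-\d{2}-\d{2}$", s)
    )

def is_junk_title(x) -> bool:
    """True if the title is missing, a null-like placeholder, or too short."""
    if pd.isna(x):
        return True
    s = str(x).strip().lower()
    if s in {"unknown", "none", "null", "nan", ""}:
        return True
    # Titles with fewer than 4 words carry too little context to annotate.
    return len(s.split()) < 4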

def exclude_source(x) -> bool:
    if pd.isna(x):
        return False
    s = str(x).strip().lower()
    return any(bad in s for bad in EXCLUDE_SOURCES)

def read_chunks():
    return pd.read_csv(
        DATA_PATH,
        chunksize=CHUNKSIZE,
        sep=",",
        engine="python",
        on_bad_lines="skip"
    )

def clean_chunk(chunk: pd.DataFrame, chunk_idx: int) -> pd.DataFrame:
    chunk.columns = chunk.columns.astype(str).str.strip().str.lower()

    required = ["title", "text", "category", "source"]
    if not set(required).issubset(set(chunk.columns)):
        print(f"⚠️ Skipping malformed chunk #{chunk_idx}. Columns:", list(chunk.columns)[:20])
        return pd.DataFrame(columns=required)

    chunk = chunk[required].copy()
    chunk = chunk[~chunk["source"].apply(exclude_source)]
    chunk = chunk[~chunk["title"].apply(is_only_timestamp)]
    chunk = chunk[~chunk["title"].apply(is_junk_title)]
    return chunk

def count_sources():
    source_counts = {}
    for i, chunk in enumerate(read_chunks()):
        chunk = clean_chunk(chunk, i)
        if len(chunk) == 0:
            continue

        vc = chunk["source"].value_counts()
        for src, cnt in vc.items():
            source_counts[src] = source_counts.get(src, 0) + int(cnt)

    counts = pd.Series(source_counts).sort_values(ascending=False)
    print("✅ Unique sources after cleaning:", len(counts))
    print("Top sources:\n", counts.head(20))
    return counts

def sample_balanced_by_source(counts: pd.Series) -> pd.DataFrame:
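    top = counts.head(TOP_SOURCES)
    if top.empty:
        # Without this guard the quota computation below divides by zero.
        raise ValueError("No sources left after cleaning; check the input columns and filters.")
    base_quota = max(1, TARGET_N // len(top))
    quota = {src: min(int(cnt), base_quota) for src, cnt in top.items()}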

    leftover = TARGET_N - sum(quota.values())
    if leftover > 0:
        for src, cnt in top.items():
            cap = int(cnt) - quota[src]
            if cap <= 0:
                continue
            add = min(cap, leftover)
            quota[src] += add
            leftover -= add
            if leftover == 0:
                break

    collected = {src: [] for src in quota.keys()}
    collected_n = {src: 0 for src in quota.keys()}

    for i, chunk in enumerate(read_chunks()):
        chunk = clean_chunk(chunk, i)
        if len(chunk) == 0:
            continue

        chunk = chunk[chunk["source"].isin(quota.keys())]
        if len(chunk) == 0:
            continue

        for src in quota.keys():
            need = quota[src] - collected_n[src]
            if need <= 0:
                continue

            sub = chunk[chunk["source"] == src]
            if len(sub) == 0:
                continue

            take = min(need, len(sub))
            picked = sub.sample(take, random_state=RANDOM_STATE)
            collected[src].append(picked)
            collected_n[src] += take

        if all(collected_n[s] >= quota[s] for s in quota.keys()):
            break

    df_out = pd.concat(
        [pd.concat(v, ignore_index=True) for v in collected.values() if len(v) > 0],
        ignore_index=True
    ).drop_duplicates()

    df_out = df_out.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
    df_out = df_out.head(TARGET_N)

    # Empty column for the annotators to fill in with 1, 0, or ?.
    df_out["is_sarcasm(1|0|?)"] = ""
    return df_out

def run():
    # sanity check header
    df0 = pd.read_csv(DATA_PATH, nrows=1)
    print("Header:", list(df0.columns))

    counts = count_sources()
    df_ann = sample_balanced_by_source(counts)

    out_path = get_output_path(DATA_PATH, OUT_FILE)
    df_ann.to_csv(out_path, index=False, encoding="utf-8")

    print(f"\n✅ Saved: {out_path}")
    print("Rows:", len(df_ann))
    print("Sources in output:\n", df_ann["source"].value_counts().head(20))

run()

Header: ['title', 'category', 'source', 'text']


KeyError: 'title'

In [25]:
import pandas as pd
df0 = pd.read_csv(DATA_PATH, nrows=5)
print(list(df0.columns))

['title', 'category', 'source', 'text']
