# Assignment 1
**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface



# Contact
For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it

Professor:
- Paolo Torroni -> p.torroni@unibo.it

# Introduction
You are asked to address the [EXIST 2023 Task 2](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition

The task is to categorize sexist messages according to the author's intention, using one of the following categories: (i) direct sexist message, (ii) reported sexist message, and (iii) judgemental message.

### Examples:

#### DIRECT 
The intention was to write a message that is sexist by itself or incites others to be sexist, as in:

''*A woman needs love, to fill the fridge, if a man can give this to her in return for her services (housework, cooking, etc), I don’t see what else she needs.*''

#### REPORTED
The intention is to report and share a sexist situation suffered by a woman or women in first or third person, as in:

''*Today, one of my year 1 class pupils could not believe he’d lost a race against a girl.*''

#### JUDGEMENTAL
The intention was to judge, since the tweet describes sexist situations or behaviours with the aim of condemning them, as in:

''*As usual, the woman was the one quitting her job for the family’s welfare…*''

# [Task 1 - 1.0 points] Corpus

We have prepared a small version of the EXIST dataset in our dedicated [GitHub repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2025-2026/Assignment%201/data).

Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.


### Dataset Description
- The dataset contains tweets in both English and Spanish.
- There are labels for multiple tasks, but we are focusing on **Task 2**.
- For Task 2, labels are assigned by six annotators.
- The labels for Task 2 represent whether the tweet is non-sexist ('-') or its sexist intention ('DIRECT', 'REPORTED', 'JUDGEMENTAL').







### Example

```
    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
          ["STEREOTYPING-DOMINANCE"],
          ["OBJECTIFICATION"],
          ["SEXUAL-VIOLENCE"],
          ["-"],
          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
          ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
      }
    }
```

### Instructions
1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as ``pandas.DataFrame``.
3. **Aggregate labels** for Task 2 using majority voting and store them in a new dataframe column called `label`. Items without a clear majority will be removed from the dataset.
4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.
5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `label`.
6. **Encode the `label` column**: Use the following mapping

```
{
    '-': 0,
    'DIRECT': 1,
    'JUDGEMENTAL': 2,
    'REPORTED': 3
}
```
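
To make the "clear majority" rule of step 3 concrete, here is a minimal sketch that assumes a *strict* majority (more than half of the annotators agree). The helper name `strict_majority` and the second annotation list are hypothetical; a plurality rule would be another defensible reading of the instruction. Applied to the `labels_task2` list of the example tweet above, the rule finds no winner, so that tweet would be dropped.

```python
from collections import Counter

LABEL_MAP = {"-": 0, "DIRECT": 1, "JUDGEMENTAL": 2, "REPORTED": 3}

def strict_majority(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Labels of the example tweet: 2x DIRECT, 2x REPORTED, 1x '-', 1x JUDGEMENTAL
example = ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"]
print(strict_majority(example))          # None: no label gets more than 3 of 6 votes

# Hypothetical annotation where four of six annotators agree
agreed = ["-", "-", "-", "-", "DIRECT", "REPORTED"]
winner = strict_majority(agreed)
print(winner, LABEL_MAP[winner])         # - 0
```

Note that a 3-3 split also returns `None` under this rule, since three votes are not more than half of six.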

In [1]:
# === TASK 1 (lean) — EXIST / Task 2 ===
from pathlib import Path
from collections import Counter
import json
import pandas as pd

DATA_DIR = Path("data")
FILES = {
    "train":      DATA_DIR / "training.json",
    "validation": DATA_DIR / "validation.json",
    "test":       DATA_DIR / "test.json",
}

# Label mapping required by the assignment
LABEL_MAP = {"-": 0, "DIRECT": 1, "JUDGEMENTAL": 2, "REPORTED": 3}
ALLOWED_LABELS = set(LABEL_MAP.keys())  # ignore anything else (e.g., UNKNOWN)

def read_json_df(path: Path) -> pd.DataFrame:
    """Load EXIST split stored as dict-of-dicts and convert to DataFrame."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    df = pd.DataFrame.from_dict(data, orient="index").reset_index(drop=True)
    if "id_EXIST" not in df.columns:
        # If the payload does not carry the id, fall back to the original dict keys
        df["id_EXIST"] = list(data.keys())
    return df


def majority_vote(labels):
    """
    Strict majority on Task 2 labels:
    - keep only labels in ALLOWED_LABELS
    - return None if no clear winner (> 50%)
    """
    labels_norm = [str(x).strip().upper() for x in labels if isinstance(x, str)]
    labels_norm = [x for x in labels_norm if x in ALLOWED_LABELS]
    if not labels_norm:
        # No valid annotation left: treat as "no majority" instead of raising IndexError
        return None
    lab, cnt = Counter(labels_norm).most_common(1)[0]
    return lab if cnt > len(labels_norm) / 2 else None

def process_split(df: pd.DataFrame) -> pd.DataFrame:
    # 1) majority vote on Task 2
    df["label_str"] = df["labels_task2"].apply(majority_vote)

    # 2) drop rows without clear majority
    before = len(df)
    df = df[df["label_str"].notna()].copy()
    dropped_majority = before - len(df)

    # 3) keep English only (case-insensitive exact match on 'en')
    before = len(df)
    df = df[df["lang"].astype(str).str.lower() == "en"].copy()
    dropped_lang = before - len(df)

    # 4) keep and encode required columns
    df = df[["id_EXIST", "lang", "tweet", "label_str"]].rename(columns={"label_str": "label"})
    df["label"] = df["label"].map(LABEL_MAP).astype("Int64")

    # quick report
    print(f"- Dropped (no majority): {dropped_majority}")
    print(f"- Dropped (non-EN): {dropped_lang}")
    print("Label distribution (encoded):")
    print(df["label"].value_counts().sort_index())

    return df

# ---- Run and save ----
df_train_raw = read_json_df(FILES["train"])
df_val_raw   = read_json_df(FILES["validation"])
df_test_raw  = read_json_df(FILES["test"])

print("\n[TRAIN]")
df_train = process_split(df_train_raw)
print("\n[VALIDATION]")
df_val   = process_split(df_val_raw)
print("\n[TEST]")
df_test  = process_split(df_test_raw)

out_dir = Path("processed")
out_dir.mkdir(parents=True, exist_ok=True)
df_train.to_csv(out_dir / "train.csv", index=False)
df_val.to_csv(out_dir / "validation.csv", index=False)
df_test.to_csv(out_dir / "test.csv", index=False)
print(f"\nSaved to: {out_dir.resolve()}")



[TRAIN]
- Dropped (no majority): 2375
- Dropped (non-EN): 2333
Label distribution (encoded):
label
0    1735
1     340
2      43
3      94
Name: count, dtype: Int64

[VALIDATION]
- Dropped (no majority): 259
- Dropped (non-EN): 352
Label distribution (encoded):
label
0    90
1    14
2     7
3     4
Name: count, dtype: Int64

[TEST]
- Dropped (no majority): 95
- Dropped (non-EN): 0
Label distribution (encoded):
label
0    160
1     42
2      5
3     10
Name: count, dtype: Int64

Saved to: C:\CODING\NLP-Project\A1\processed


In [2]:
df_train.head()

      id_EXIST lang                                              tweet  label
3661    200002   en  Writing a uni essay in my local pub with a cof...      3
3665    200006   en  According to a customer I have plenty of time ...      3
3667    200008   en  New to the shelves this week - looking forward...      0
3669    200010   en  I guess that’s fairly normal for a Neanderthal...      0
3670    200011   en  #EverydaySexism means women usually end up in ...      2


# [Task 2 - 0.5 points] Data Cleaning
In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.



### Instructions
- **Remove emojis** from the tweets.
- **Remove hashtags** (e.g., `#example`).
- **Remove mentions** such as `@user`.
- **Remove URLs** from the tweets.
- **Remove special characters and symbols**.
- **Remove specific quote characters** (e.g., curly quotes).
- **Perform lemmatization** to reduce words to their base form.
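
The order of these removal steps matters: if special characters are stripped first, URLs, mentions and hashtags lose their `:`, `/`, `@` and `#` markers and can no longer be recognized. The sketch below illustrates this on a made-up tweet with simplified, hypothetical regular expressions (emoji removal and lemmatization are left out); it is an illustration rather than a reference solution.

```python
import re

# Simplified patterns, assumed for this example only
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

def clean_in_order(text):
    """Remove URLs, mentions and hashtags BEFORE stripping special characters."""
    text = text.lower()
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, SPECIAL_RE):
        text = pattern.sub(" ", text)
    return squeeze(text)

def clean_wrong_order(text):
    """Strip special characters first: the other patterns no longer match."""
    text = SPECIAL_RE.sub(" ", text.lower())
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return squeeze(text)

tweet = "@user Women can't drive?! #nonsense https://t.co/abc123"
print(clean_in_order(tweet))     # women can t drive
print(clean_wrong_order(tweet))  # user women can t drive nonsense https t co abc123
```

The example also shows a side effect of replacing apostrophes with spaces: contractions such as `can't` are split into `can t`.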

In [3]:
# === TASK 2 — CLEANING & LEMMATIZATION ===
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

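# Map each NLTK resource to the path that nltk.data.find() expects,
# so that already-installed resources are detected instead of re-downloaded.
NLTK_RESOURCES = {
    "punkt": "tokenizers/punkt",
    "punkt_tab": "tokenizers/punkt_tab",
    "wordnet": "corpora/wordnet",
    "omw-1.4": "corpora/omw-1.4",
}
for resource, resource_path in NLTK_RESOURCES.items():
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource, quiet=True)

# The POS tagger resource name depends on the NLTK version: try the newer name first.
pos_tagger_found = False
for pos_resource in ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"]:
    if pos_tagger_found:
        break
    try:
        nltk.data.find(f"taggers/{pos_resource}")
        pos_tagger_found = True
    except LookupError:
        # nltk.download returns False (it does not raise) when the download fails
        pos_tagger_found = bool(nltk.download(pos_resource, quiet=True))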

lemmatizer = WordNetLemmatizer()

# ------------------------------------------------------------------
# Regex patterns for text cleaning
# ------------------------------------------------------------------
URL_RE = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002700-\U000027BF"  # dingbats
    "\U0001F900-\U0001F9FF"  # supplemental symbols
    "\U0001FA70-\U0001FAFF"  # extended pictographs
    "\U00002600-\U000026FF"  # misc symbols
    "]+",
    flags=re.UNICODE,
)
FANCY_QUOTES_RE = re.compile(r"[“”‘’´`]")
SPECIAL_CHARS_RE = re.compile(r"[^a-zA-Z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse multiple spaces/newlines to a single space and trim."""
    return WHITESPACE_RE.sub(" ", text).strip()

# ------------------------------------------------------------------
# Helper functions for POS tagging
# ------------------------------------------------------------------
def penn_to_wordnet_pos(tag: str):
    """
    Convert Penn Treebank POS tags
    into the POS format expected by WordNetLemmatizer.
    If the tag is unknown, default to noun.
    """
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("N"):
        return wordnet.NOUN
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN


def lemmatize_tokens_pos(tokens):
    """
    Lemmatize each token using its predicted POS tag.
    """
    # POS tagging on tokens
    tagged = nltk.pos_tag(tokens)

    lemmas = []
    for word, pos_tag in tagged:
        wn_pos = penn_to_wordnet_pos(pos_tag)
        lemmas.append(lemmatizer.lemmatize(word, pos=wn_pos))
    return " ".join(lemmas)


# ------------------------------------------------------------------
# Full cleaning pipeline for a single tweet
# ------------------------------------------------------------------
def clean_tweet(text: str) -> str:
    t = text.lower()

    # remove urls, mentions, hashtags, emojis
    t = URL_RE.sub(" ", t)
    t = MENTION_RE.sub(" ", t)
    t = HASHTAG_RE.sub(" ", t)
    t = EMOJI_RE.sub(" ", t)

    # remove fancy quotes
    t = FANCY_QUOTES_RE.sub(" ", t)

    # remove remaining non-alphanumeric chars / punctuation (keep only [a-z0-9 space])
    t = SPECIAL_CHARS_RE.sub(" ", t)

    # normalize whitespace before tokenizing
    t = normalize_whitespace(t)

    # tokenize
    tokens = nltk.word_tokenize(t)

    # POS-aware lemmatization
    t = lemmatize_tokens_pos(tokens)

    # final whitespace cleanup
    t = normalize_whitespace(t)
    return t


# ------------------------------------------------------------------
# Apply cleaning to entire DataFrame
# ------------------------------------------------------------------
def apply_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["tweet"] = df["tweet"].astype(str).apply(clean_tweet)
    return df


# ------------------------------------------------------------------
# Load and clean CSVs (train / validation / test)
# ------------------------------------------------------------------
in_dir = out_dir = Path("processed")
out_dir.mkdir(parents=True, exist_ok=True)

splits = ["train", "validation", "test"]

for split in splits:
    in_path = in_dir / f"{split}.csv"
    print(f"\n[{split.upper()}] reading {in_path} ...")
    df_split = pd.read_csv(in_path)

    # clean
    df_clean = apply_cleaning(df_split)

    # preview
    print(df_clean.head(2))

    # save
    out_path = out_dir / f"{split}.csv"
    df_clean.to_csv(out_path, index=False)
    print(f" -> saved cleaned split to {out_path} ({len(df_clean)} rows)")



[TRAIN] reading processed\train.csv ...
   id_EXIST lang                                              tweet  label
0    200002   en  write a uni essay in my local pub with a coffe...      3
1    200006   en  accord to a customer i have plenty of time to ...      3
 -> saved cleaned split to processed\train.csv (2212 rows)

[VALIDATION] reading processed\validation.csv ...
   id_EXIST lang                                              tweet  label
0    400001   en  you should smile more love just pretend you re...      0
1    400003   en  some man move my suitcase in the overhead lugg...      3
 -> saved cleaned split to processed\validation.csv (115 rows)

[TEST] reading processed\test.csv ...
   id_EXIST lang                                              tweet  label
0    400178   en  1st day at the pool on a beautiful sunday in n...      0
1    400180   en  same though the angst just come and go lonely ...      0
 -> saved cleaned split to processed\test.csv (217 rows)


# [Task 3 - 0.5 points] Text Encoding
To train a neural sexism classifier, you first need to encode text into numerical format.




### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.





### What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., ``<UNK>``) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)



### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in the train set; or
* Contain the union of the tokens in the train set and in GloVe $\rightarrow$ we make use of existing knowledge!
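
As an illustration of these rules, the following sketch builds a vocabulary and an embedding table from a toy stand-in for GloVe, using the train-only vocabulary option. All names and values (`toy_glove`, the three-dimensional vectors, the uniform random initialization of missing tokens, the mean vector for `<UNK>`) are assumptions chosen for the example; any other static strategy is equally valid.

```python
import random

EMB_DIM = 3
toy_glove = {              # hypothetical stand-in for real GloVe vectors
    "woman": [0.1, 0.2, 0.3],
    "drive": [0.4, 0.5, 0.6],
    "car":   [0.7, 0.8, 0.9],
}
train_tokens = ["woman", "drive", "lol"]        # "lol" is missing from toy_glove
test_tokens = ["woman", "drive", "car", "omg"]  # "car" and "omg" are unseen in train

rng = random.Random(13)  # fixed seed so the custom embeddings are reproducible

# <UNK> gets a static embedding: here, the mean of the toy GloVe vectors
unk_vector = [
    sum(vec[d] for vec in toy_glove.values()) / len(toy_glove)
    for d in range(EMB_DIM)
]

vocab = {"<PAD>": 0, "<UNK>": 1}
embeddings = [[0.0] * EMB_DIM, unk_vector]

# Every train token enters the vocabulary, with or without a GloVe vector
for token in sorted(set(train_tokens)):
    vocab[token] = len(vocab)
    if token in toy_glove:
        embeddings.append(toy_glove[token])
    else:
        embeddings.append([rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)])

def encode(tokens):
    """Map tokens to ids; anything outside the vocabulary becomes <UNK>."""
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

print(vocab)                  # {'<PAD>': 0, '<UNK>': 1, 'drive': 2, 'lol': 3, 'woman': 4}
print(encode(train_tokens))   # [4, 2, 3]
print(encode(test_tokens))    # [4, 2, 1, 1]
```

With the train-only vocabulary, `car` is mapped to `<UNK>` even though the toy GloVe table has a vector for it. Under the union option it would keep its own id and pre-trained vector, which is the "existing knowledge" mentioned above.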

In [4]:
# === TASK 3 — GloVe embeddings with explicit OOV handling ===
from pathlib import Path
import json
import numpy as np
import pandas as pd
import gensim.downloader as gloader  # we use gensim's pre-trained GloVe 

# ----------------- Configuration -----------------
PROCESSED_DIR = Path("processed")
# We assume Task 2 already produced these cleaned CSVs with columns: id_EXIST, lang, tweet, label
TRAIN_CSV = PROCESSED_DIR / "train.csv"
VAL_CSV   = PROCESSED_DIR / "validation.csv"
TEST_CSV  = PROCESSED_DIR / "test.csv"

# Pick a GloVe model and make sure EMB_DIM matches it
MODEL_NAME = "glove-wiki-gigaword-100"  # available: 50 | 100 | 200 | 300
EMB_DIM = 100                           # must equal the dimension of MODEL_NAME

# Fixed RNG for reproducibility (important for random OOV initialization)
SEED = 13
rng = np.random.default_rng(SEED)

# Reserve two special tokens:
# - <PAD> (index 0) to pad sequences to a fixed length; embedding is all-zeros
# - <UNK> (index 1) for tokens unseen in TRAIN but appearing in VAL/TEST; embedding is static
SPECIAL_TOKENS = {"<PAD>": 0, "<UNK>": 1}
PAD_IDX = SPECIAL_TOKENS["<PAD>"]
UNK_IDX = SPECIAL_TOKENS["<UNK>"]

# ----------------- Load cleaned splits (Task 2 outputs) -----------------
# We only cast the label to int; tweets stay as strings
df_tr = pd.read_csv(TRAIN_CSV)
df_va = pd.read_csv(VAL_CSV)
df_te = pd.read_csv(TEST_CSV)

y_tr = df_tr["label"].astype(int).to_numpy()
y_va = df_va["label"].astype(int).to_numpy()
y_te = df_te["label"].astype(int).to_numpy()

# Simple tokenizer:
# After Task 2 cleaning (URLs, mentions, punctuation removed and lemmatized),
# whitespace split is sufficient and fast.
def tokenize(s: str):
    return s.strip().split()

X_tr_tokens = [tokenize(t) for t in df_tr["tweet"].astype(str)]
X_va_tokens = [tokenize(t) for t in df_va["tweet"].astype(str)]
X_te_tokens = [tokenize(t) for t in df_te["tweet"].astype(str)]

# ----------------- Load GloVe vectors via gensim -----------------
# gensim will download once to ~/gensim-data and then cache the model;
# using this avoids shipping a large .txt with the project.
print(f"[GloVe] Loading {MODEL_NAME} (cached in ~/gensim-data)...")
kv = gloader.load(MODEL_NAME)  # KeyedVectors object with token -> vector lookups

# We compute global mean/std of the pre-trained space to:
# - initialize TRAIN OOV tokens with random vectors ~ N(mean, std)
# - define <UNK> as the mean vector (static)
glove_mean = kv.vectors.mean(axis=0).astype(np.float32)
glove_std  = kv.vectors.std(axis=0).astype(np.float32)

# ----------------- Build vocabulary from TRAIN only -----------------
# Rationale:
# - The assignment requires: include ALL TRAIN tokens in the vocabulary.
# - If a TRAIN token exists in GloVe, use its pre-trained vector.
# - If it does not, assign a static random vector drawn from N(glove_mean, glove_std).
# - VAL/TEST tokens that are not in TRAIN's vocab are mapped to <UNK>.
train_vocab = set()
for toks in X_tr_tokens:
    train_vocab.update(toks)

# token_to_idx starts with special tokens; indices must align with the embedding matrix rows.
token_to_idx = dict(SPECIAL_TOKENS)

# Embedding matrix rows will be stacked in 'embeds' in the same order as indices.
#  - Row 0 (<PAD>) is all zeros so it does not affect the model.
#  - Row 1 (<UNK>) is a static "generic" embedding; we set it to the mean GloVe vector.
embeds = [
    np.zeros(EMB_DIM, dtype=np.float32),  # <PAD>
    glove_mean.copy(),                     # <UNK>
]

oov_train = 0     # counter for TRAIN tokens missing in GloVe
next_idx = len(SPECIAL_TOKENS)

# Deterministic insertion order (sorted) to make experiments reproducible
for tok in sorted(train_vocab):
    token_to_idx[tok] = next_idx

    if tok in kv.key_to_index:
        # Known by GloVe: use the pre-trained vector
        vec = kv.get_vector(tok)
    else:
        # OOV in TRAIN: initialize a STATIC vector matching GloVe's global distribution
        # Motivation: keeps scale and variance consistent with the pre-trained space
        vec = rng.normal(loc=glove_mean, scale=glove_std).astype(np.float32)
        oov_train += 1

    embeds.append(vec)
    next_idx += 1

# Final embedding matrix: shape = [vocab_size, EMB_DIM]
embedding_matrix = np.vstack(embeds).astype(np.float32)

# ----------------- Convert text to indices -----------------
# TRAIN: every token should be in vocab (no <UNK> expected).
# VAL/TEST: unseen tokens map to <UNK>.
def tokens_to_ids(tokens, token2idx, unk_idx=UNK_IDX):
    return [token2idx.get(t, unk_idx) for t in tokens]

Xtr_ids = [tokens_to_ids(toks, token_to_idx) for toks in X_tr_tokens]
Xva_ids = [tokens_to_ids(toks, token_to_idx) for toks in X_va_tokens]
Xte_ids = [tokens_to_ids(toks, token_to_idx) for toks in X_te_tokens]

# Sanity check: if TRAIN contains <UNK>, something went wrong (e.g., tokenization mismatch)
unk_in_train = sum(UNK_IDX in seq for seq in Xtr_ids)

# ----------------- Pad / truncate sequences to a fixed length -----------------
# We choose MAX_LEN as the 95th percentile of TRAIN lengths:
# - Motivation: keeps almost all content while preventing extreme outliers from bloating tensors.
def infer_max_len(seqs, q=95):
    lens = np.array([len(s) for s in seqs], dtype=np.int32)
    return max(1, int(np.percentile(lens, q))) if len(lens) else 1

MAX_LEN = infer_max_len(Xtr_ids, q=95)

# Left-align and pad with <PAD> up to MAX_LEN; truncate longer sequences.
def pad_truncate(seqs, max_len, pad_idx=PAD_IDX):
    out = np.full((len(seqs), max_len), pad_idx, dtype=np.int32)
    for i, s in enumerate(seqs):
        out[i, :min(len(s), max_len)] = s[:max_len]
    return out

Xtr_mat = pad_truncate(Xtr_ids, MAX_LEN, PAD_IDX)
Xva_mat = pad_truncate(Xva_ids, MAX_LEN, PAD_IDX)
Xte_mat = pad_truncate(Xte_ids, MAX_LEN, PAD_IDX)

# ----------------- Save artifacts for Task 4 -----------------
# We persist exactly what the next tasks need:
# - vocab.json: token -> index mapping (to tokenize new texts consistently)
# - embeddings.npy: embedding matrix aligned with indices (row i = vector for index i)
# - dataset_indices.npz: padded index matrices + labels + minimal metadata
OUT_DIR = Path("embeddings")
OUT_DIR.mkdir(parents=True, exist_ok=True)

with open(OUT_DIR / "vocab.json", "w", encoding="utf-8") as f:
    json.dump(token_to_idx, f, ensure_ascii=False)

np.save(OUT_DIR / "embeddings.npy", embedding_matrix)

np.savez_compressed(
    OUT_DIR / "dataset_indices.npz",
    X_train=Xtr_mat, y_train=y_tr,
    X_val=Xva_mat,  y_val=y_va,
    X_test=Xte_mat, y_test=y_te,
    pad_idx=PAD_IDX, unk_idx=UNK_IDX, max_len=MAX_LEN
)

# ----------------- Textual summary (quick sanity-check) -----------------
print("\n=== Task 3 summary ===")
print(f"Vocabulary size (incl. PAD/UNK): {embedding_matrix.shape[0]}")
print(f"Embedding dimension: {embedding_matrix.shape[1]}")
print(f"TRAIN tokens OOV in GloVe (random static vectors assigned): {oov_train}")
print(f"Number of TRAIN sequences containing <UNK>: {unk_in_train}  (expected: 0)")
print(f"MAX_LEN (95th percentile on TRAIN): {MAX_LEN}")
print("Shapes -> X_train, X_val, X_test:", Xtr_mat.shape, Xva_mat.shape, Xte_mat.shape)
print("Embeddings matrix shape:", embedding_matrix.shape)
print(f"Saved to: {OUT_DIR.resolve()}")


[GloVe] Loading glove-wiki-gigaword-100 (cached in ~/gensim-data)...

=== Task 3 summary ===
Vocabulary size (incl. PAD/UNK): 7842
Embedding dimension: 100
TRAIN tokens OOV in GloVe (random static vectors assigned): 718
Number of TRAIN sequences containing <UNK>: 0  (expected: 0)
MAX_LEN (95th percentile on TRAIN): 51
Shapes -> X_train, X_val, X_test: (2212, 51) (115, 51) (217, 51)
Embeddings matrix shape: (7842, 100)
Saved to: C:\CODING\NLP-Project\A1\embeddings


In [5]:
# === Inspect artifacts from Task 3 (np.load) ===
import numpy as np, json, pandas as pd
from pathlib import Path

ART_DIR = Path("embeddings")

# load artifacts
E = np.load(ART_DIR / "embeddings.npy")                       # (vocab_size, emb_dim)
D = np.load(ART_DIR / "dataset_indices.npz", allow_pickle=False)
with open(ART_DIR / "vocab.json", "r", encoding="utf-8") as f:
    tok2idx = json.load(f)
idx2tok = {int(v): k for k, v in tok2idx.items()}

# unpack arrays + meta
Xtr, ytr = D["X_train"], D["y_train"]
Xva, yva = D["X_val"],   D["y_val"]
Xte, yte = D["X_test"],  D["y_test"]
pad_idx, unk_idx, max_len = int(D["pad_idx"]), int(D["unk_idx"]), int(D["max_len"])

# (optional) original CSVs for id/text preview, if available
df_tr = pd.read_csv("processed/train.csv") if Path("processed/train.csv").exists() else None
df_va = pd.read_csv("processed/validation.csv") if Path("processed/validation.csv").exists() else None
df_te = pd.read_csv("processed/test.csv") if Path("processed/test.csv").exists() else None
dfs = {"train": df_tr, "val": df_va, "test": df_te}

def decode(seq, drop_pad=True):
    """indices -> tokens (stop at first PAD for readability)."""
    toks = []
    for i in seq:
        i = int(i)
        if drop_pad and i == pad_idx:
            break
        toks.append(idx2tok.get(i, "<NA>"))
    return toks

def show_example(split="train", i=0, show_embed=True, preview_chars=160):
    X, y = {"train": (Xtr, ytr), "val": (Xva, yva), "test": (Xte, yte)}[split]
    seq = X[i]
    no_pad_len = int((seq != pad_idx).sum())
    print(f"\n[{split.upper()}] row={i} | label={int(y[i])} | len(no-pad)={no_pad_len}/{len(seq)}")
    print("indices:", seq[:min(30, len(seq))].tolist(), "...")
    toks = decode(seq)
    print("tokens :", toks[:30], "...")
    if dfs[split] is not None:
        row = dfs[split].iloc[i]
        print(f"id_EXIST={row['id_EXIST']} | text: {str(row['tweet'])[:preview_chars]}{'...' if len(str(row['tweet']))>preview_chars else ''}")
    if show_embed and no_pad_len > 0:
        # pick first non-pad token index
        j = int(next(x for x in seq if int(x) != pad_idx))
        print(f"embedding for token '{idx2tok.get(j)}' (idx={j}) -> first 8 dims:\n", E[j][:8])

# summary
print("Embeddings:", E.shape, "| X_train/X_val/X_test:", Xtr.shape, Xva.shape, Xte.shape)
print("pad_idx:", pad_idx, "unk_idx:", unk_idx, "max_len:", max_len)
print("PAD vec (first 8):", E[pad_idx][:8])
print("UNK vec (first 8):", E[unk_idx][:8])

# show a few examples
show_example("train", 0)
show_example("val", 0)
show_example("test", 0)


Embeddings: (7842, 100) | X_train/X_val/X_test: (2212, 51) (115, 51) (217, 51)
pad_idx: 0 unk_idx: 1 max_len: 51
PAD vec (first 8): [0. 0. 0. 0. 0. 0. 0. 0.]
UNK vec (first 8): [ 0.05209883 -0.09711445 -0.1380765   0.11075345 -0.02722748 -0.00326409
  0.03176443 -0.05076874]

[TRAIN] row=0 | label=3 | len(no-pad)=47/51
indices: [7746, 197, 7302, 2480, 3530, 4661, 4170, 5568, 7680, 197, 1473, 5670, 4938, 4302, 3901, 623, 4402, 2238, 5621, 7609, 3437, 4249, 7185, 7051, 1574, 461, 2395, 7680, 3054, 4226] ...
tokens : ['write', 'a', 'uni', 'essay', 'in', 'my', 'local', 'pub', 'with', 'a', 'coffee', 'random', 'old', 'man', 'keep', 'ask', 'me', 'drunk', 'question', 'when', 'i', 'm', 'try', 'to', 'concentrate', 'amp', 'end', 'with', 'good', 'luck'] ...
id_EXIST=200002 | text: write a uni essay in my local pub with a coffee random old man keep ask me drunk question when i m try to concentrate amp end with good luck but you ll just end...
embedding for token 'write' (idx=7746) -> first 8 dims:


# [Task 4 - 1.0 points] Model definition

You are now tasked to define your sexism classifier.




### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.

* **Stacked**: add an additional Bidirectional LSTM layer to the Baseline model.

**Note**: You are **free** to experiment with hyper-parameters.

### Token to embedding mapping

You can follow two approaches for encoding tokens in your classifier.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape ``(batch_size, # tokens, embedding_dim)`` to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define an Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not

# Model Definitions (BiLSTM)

Artifacts from Task 3:
- `embeddings/embeddings.npy` — embedding matrix (shape: `[vocab_size, embedding_dim]`)
- `embeddings/dataset_indices.npz` — indexed & padded datasets and meta (`max_len`, `pad_idx`, `unk_idx`)

We implement **two minimal architectures** for each approach:
- **Baseline:** 1×Bidirectional LSTM → Dense(softmax)
- **Stacked:**  2×Bidirectional LSTM → Dense(softmax)

Approaches (as in the notebook order):
- **Approach A — Work directly with embeddings** (inputs are `(batch, max_len, embedding_dim)`)
- **Approach B — Work with Embedding layer** (inputs are token ids; first layer is `Embedding`)

In [14]:
from pathlib import Path
import numpy as np

EMB_DIR = Path("embeddings")

# Load embedding matrix and dataset meta (for shapes only)
E = np.load(EMB_DIR / "embeddings.npy")                      # (vocab_size, embedding_dim)
D = np.load(EMB_DIR / "dataset_indices.npz", allow_pickle=False)

X_train, X_val, X_test = D["X_train"], D["X_val"], D["X_test"]
y_train, y_val, y_test = D["y_train"], D["y_val"], D["y_test"]
pad_idx, unk_idx, max_len = int(D["pad_idx"]), int(D["unk_idx"]), int(D["max_len"])

vocab_size    = int(E.shape[0])
embedding_dim = int(E.shape[1])
num_classes   = int(np.max(y_train)) + 1  # expected 4 for this task

print("Embedding matrix:", E.shape)
print("X_train/X_val/X_test:", X_train.shape, X_val.shape, X_test.shape)
print("max_len:", max_len, "| pad_idx:", pad_idx, "| unk_idx:", unk_idx, "| num_classes:", num_classes)


Embedding matrix: (7842, 100)
X_train/X_val/X_test: (2212, 51) (115, 51) (217, 51)
max_len: 51 | pad_idx: 0 | unk_idx: 1 | num_classes: 4


### Approach A — Work directly with embeddings (inputs = precomputed embeddings)

- Inputs have shape `(batch, max_len, embedding_dim)`; each time-step is already an embedding vector.
- We **do not** use an Embedding layer in this approach.
- Padding rows are the **all-zero** vector by construction (from Task 3).
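
For this approach the token ids must be turned into embedding vectors before they reach the model. A minimal sketch of that lookup with NumPy integer-array indexing is shown here on a tiny, made-up matrix `E_toy` and id batch `X_toy` (both hypothetical); with the real artifacts the same expression would be `E[X_train]`.

```python
import numpy as np

# Toy embedding matrix: 4 tokens, 2 dimensions; row 0 is <PAD> (all zeros)
E_toy = np.array([[0.0, 0.0],
                  [0.5, 0.5],
                  [1.0, 2.0],
                  [3.0, 4.0]], dtype=np.float32)

# Toy batch of padded token ids: 2 sequences of length 3
X_toy = np.array([[2, 3, 0],
                  [1, 0, 0]], dtype=np.int32)

# Indexing a 2-D matrix with a 2-D integer array returns one row per id
X_emb = E_toy[X_toy]
print(X_emb.shape)     # (2, 3, 2) = (batch, max_len, embedding_dim)
print(X_emb[0, 2])     # [0. 0.] -> the padding position is an all-zero vector
```

With the shapes reported above, the precomputed training tensor would be `(2212, 51, 100)` in `float32`, roughly 45 MB. Since this approach has no `mask_zero`, the LSTM also processes the padded positions unless something like a `Masking(mask_value=0.0)` layer is placed before it; whether that changes the results is worth checking empirically.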



In [16]:
import tensorflow as tf
from tensorflow.keras import layers, models

# Hyperparameters
BILSTM_UNITS_BASE = 128   # Baseline single BiLSTM
BILSTM_UNITS_1    = 128   # Stacked: first layer
BILSTM_UNITS_2    = 64    # Stacked: second layer

# Baseline: (batch, max_len, emb_dim) -> 1×BiLSTM -> Dense
def build_baseline_direct():
    inputs = layers.Input(shape=(max_len, embedding_dim), dtype="float32", name="embeddings")
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_BASE, return_sequences=False), name="bilstm")(inputs)
    outputs = layers.Dense(num_classes, activation="softmax", name="classifier")(x)
    return models.Model(inputs, outputs, name="baseline_bilstm_direct")

# Stacked (minimal): (batch, max_len, emb_dim) -> 2×BiLSTM -> Dense
def build_stacked_direct():
    inputs = layers.Input(shape=(max_len, embedding_dim), dtype="float32", name="embeddings")
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_1, return_sequences=True),  name="bilstm_1")(inputs)
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_2, return_sequences=False), name="bilstm_2")(x)
    outputs = layers.Dense(num_classes, activation="softmax", name="classifier")(x)
    return models.Model(inputs, outputs, name="stacked_bilstm_direct")

baseline_direct = build_baseline_direct()
stacked_direct  = build_stacked_direct()

baseline_direct.summary()
stacked_direct.summary()


### Approach B — Work with Embedding layer (inputs = token ids)

- Inputs are integer **token ids** from `dataset_indices.npz`.
- First layer is `Embedding`, initialized with Task 3 matrix:
  `Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[E], mask_zero=True)`
- `mask_zero=True` makes LSTMs ignore padding tokens (requires `pad_idx == 0`).


In [17]:
from tensorflow.keras import layers, models

# Safety: masking expects PAD index 0
assert pad_idx == 0, "mask_zero=True requires PAD index 0."

# Toggle in Task 5 if you want to freeze or fine-tune embeddings
EMB_TRAINABLE = True

def make_embedding_layer():
    """
    Create a fresh Embedding layer initialized with the Task 3 matrix.
    Each model gets its own layer, so fine-tuning the embeddings of one
    model does not silently change the embeddings of the other.
    """
    return layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        weights=[E],            # initialize with Task 3 weights
        mask_zero=True,         # automatically ignores PAD tokens
        name="encoder_embedding",
        trainable=EMB_TRAINABLE,
    )

# Baseline: Embedding -> 1×BiLSTM -> Dense
def build_baseline_embedding():
    inputs = layers.Input(shape=(max_len,), dtype="int32", name="token_ids")
    x = make_embedding_layer()(inputs)  # (batch, max_len, embedding_dim) with mask
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_BASE, return_sequences=False), name="bilstm")(x)
    outputs = layers.Dense(num_classes, activation="softmax", name="classifier")(x)
    return models.Model(inputs, outputs, name="baseline_bilstm_embedding")

# Stacked: Embedding -> 2×BiLSTM -> Dense
def build_stacked_embedding():
    inputs = layers.Input(shape=(max_len,), dtype="int32", name="token_ids")
    x = make_embedding_layer()(inputs)
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_1, return_sequences=True),  name="bilstm_1")(x)
    x = layers.Bidirectional(layers.LSTM(BILSTM_UNITS_2, return_sequences=False), name="bilstm_2")(x)
    outputs = layers.Dense(num_classes, activation="softmax", name="classifier")(x)
    return models.Model(inputs, outputs, name="stacked_bilstm_embedding")

baseline_emb = build_baseline_embedding()
stacked_emb  = build_stacked_embedding()

baseline_emb.summary()
stacked_emb.summary()


# [Task 5 - 1.0 points] Training and Evaluation

Your task now is to train and evaluate the Baseline and Stacked models.



### Instructions

* Pick **at least** three seeds for robust estimation.
* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute macro F1-score, precision, and recall metrics on the validation set.
* Report average and standard deviation measures over seeds for each metric.
* Pick the **best** performing model according to the observed validation set performance (use macro F1-score).
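
To make the metrics above concrete, here is a minimal pure-Python sketch of macro-averaged precision, recall and F1: each metric is computed per class and then averaged with equal weight per class. The function name `macro_prf` and the toy labels are illustrative assumptions, not part of the assignment. In practice, a library routine such as scikit-learn's `precision_recall_fscore_support(..., average="macro")` is the usual choice.

```python
def macro_prf(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1) over `labels`."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against empty denominators (class never predicted / never present).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Macro average: every class counts equally, regardless of its frequency.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three classes (hypothetical labels, not real results).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_precision, toy_recall, toy_f1 = macro_prf(toy_true, toy_pred, labels=[0, 1, 2])
```

Because every class has the same weight, a rare class that is poorly predicted lowers the macro score as much as a frequent one, which is why macro F1 suits imbalanced label distributions.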

# [Task 6 - 1.0 points] Transformers

In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).




### Relevant Material
- Tutorial 3

### Instructions
- **Load the Tokenizer and Model**

- **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

- **Train the Model**:
   Use the `Trainer` to train the model on your training data.

- **Evaluate the Model on the Test Set** using the same metrics used for LSTM-based models.
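
The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference solution: the column names `tweet` and `label`, the helper names, the output directory and all hyper-parameter values are placeholders to adapt. It also assumes the checkpoint's classification head does not match your number of classes, which is why `ignore_mismatched_sizes=True` is passed so that a fresh head is initialized. The `transformers` imports are placed inside the functions only so that the light helpers can be used without loading the library.

```python
MODEL_NAME = "cardiffnlp/twitter-roberta-base-hate"


def build_label_maps(labels):
    """Map label strings to contiguous integer ids, and back."""
    label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_tokenizer_and_model(label2id, id2label):
    """Load the tokenizer and a sequence classifier with a resized head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # assumption: head size differs from ours
    )
    return tokenizer, model


def tokenize_batch(batch, tokenizer, text_column="tweet", max_length=128):
    """Tokenize a batch of texts; padding is left to the data collator."""
    return tokenizer(batch[text_column], truncation=True, max_length=max_length)


def build_trainer(model, tokenizer, train_ds, val_ds, compute_metrics,
                  output_dir="roberta_hate_out", seed=42):
    """Create a Trainer; hyper-parameters here are placeholders to tune."""
    from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        seed=seed,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding
        compute_metrics=compute_metrics,
    )
```

With Hugging Face `datasets` objects, tokenization would typically be applied as `train_ds.map(lambda b: tokenize_batch(b, tokenizer), batched=True)`, with integer labels stored in a `label` column. After `trainer.train()`, calling `trainer.predict(test_ds)` returns an object whose `predictions` (logits) and `label_ids` can be fed to the same metric code used for the LSTM-based models.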

# [Task 7 - 0.5 points] Error Analysis

After evaluating the model, perform a brief error analysis:

### Instructions

 - Review the results and identify common errors.

 - Summarize your findings regarding the errors and their impact on performance. Possible angles include, but are not limited to, Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer.
 - Suggest possible solutions to address the identified errors.
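
Two of the angles mentioned above, OOV words and data imbalance, can be quantified with a few lines of code. The sketch below is one possible way to do it; the helper names are hypothetical, and it assumes you have tokenized texts, gold labels, predictions and the embedding vocabulary available as plain Python objects.

```python
from collections import Counter


def oov_rate(tokens, vocabulary):
    """Fraction of tokens in one text that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocabulary) / len(tokens)


def mean_oov_by_outcome(tokenized_texts, y_true, y_pred, vocabulary):
    """Average OOV rate for correctly vs. wrongly classified samples."""
    buckets = {"correct": [], "wrong": []}
    for tokens, t, p in zip(tokenized_texts, y_true, y_pred):
        key = "correct" if t == p else "wrong"
        buckets[key].append(oov_rate(tokens, vocabulary))
    return {key: (sum(vals) / len(vals) if vals else 0.0)
            for key, vals in buckets.items()}


def class_distribution(labels):
    """Relative frequency of each label, to expose data imbalance."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}
```

A clearly higher OOV rate among the misclassified samples would hint that vocabulary coverage contributes to the errors; a similar rate in both groups suggests looking elsewhere.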

# [Task 8 - 0.5 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is **not a copy-paste** of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.


# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented-out sections, tests, etc.
* You can upload **model weights** in a cloud repository and report the link in the report.

## Bonus Points
Bonus points are assigned at our discretion for significant contributions such as:
- Outstanding error analysis
- Masterclass code organization
- Suitable extensions

**Note**: bonus points are only assigned if all task points are attributed (i.e., 6/6).

**Possible Suggestions for Bonus Points:**
- **Try other preprocessing strategies**: for example, explore techniques tailored to tweets or methods commonly applied to social media text.
- **Experiment with other custom architectures or models from HuggingFace**
- **Explore Spanish tweets**: for example, use multilingual models to process Spanish tweets and compare their performance with monolingual models.
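
As an illustration of tweet-specific preprocessing, the following sketch normalizes URLs, user mentions and hashtags with regular expressions. The function name and the regex patterns are assumptions made for this example. The `@user` and `http` placeholders follow the convention we believe is described on the Cardiff NLP model cards, so check the model card before relying on them.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalize_tweet(text):
    """Replace URLs and mentions with placeholders and unwrap hashtags."""
    text = URL_RE.sub("http", text)        # any link -> "http"
    text = MENTION_RE.sub("@user", text)   # any handle -> "@user"
    text = HASHTAG_RE.sub(r"\1", text)     # "#NotOk" -> "NotOk"
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


example = normalize_tweet("@Anna check   https://t.co/abc #NotOk")
# example == "@user check http NotOk"
```

Note that the mention pattern is deliberately simple and would also rewrite the domain part of e-mail addresses; whether that matters depends on the data.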

# FAQ

Please check these frequently asked questions before contacting us.

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.


### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc.).

### Robust Evaluation

Each model must be trained with at least 3 random seeds.

Task 5 requires you to compute the average performance over all seeds and the corresponding standard deviation.
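
A minimal sketch of this aggregation, assuming the per-seed metrics have already been collected in a dictionary. The seed values and scores below are made-up placeholders, not real results. `statistics.stdev` computes the sample standard deviation (dividing by n-1); use `statistics.pstdev` if you prefer the population version, and state which one you report.

```python
from statistics import mean, stdev


def summarize_over_seeds(per_seed_metrics):
    """Map each metric name to (mean, sample std) across seeds.

    per_seed_metrics: dict of seed -> dict of metric name -> value.
    """
    runs = list(per_seed_metrics.values())
    metric_names = runs[0].keys()
    return {
        name: (mean([run[name] for run in runs]),
               stdev([run[name] for run in runs]))
        for name in metric_names
    }


# Placeholder numbers for illustration only.
placeholder_runs = {
    13:   {"f1": 0.30, "precision": 0.35, "recall": 0.31},
    42:   {"f1": 0.34, "precision": 0.37, "recall": 0.33},
    1337: {"f1": 0.32, "precision": 0.36, "recall": 0.32},
}
summary = summarize_over_seeds(placeholder_runs)
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.3f} ± {std:.3f}")
```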

### Expected Results

The Task 2 leaderboard reports F1-scores of around 40-50.
Note, however, that the leaderboard setting involves hierarchical classification, so those numbers are not directly comparable.

That said, results around 30-40 F1-score are **expected** given the task's complexity.

### Model Selection for Analysis

To carry out the error analysis, you are **free** to either:

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.
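
Majority voting over the per-seed predictions can be sketched as below. The function name is hypothetical, and the tie-breaking rule (the label voted by the earliest model in the list wins) is just one possible choice.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions into one prediction per sample.

    predictions_per_model: list of equal-length lists, one per seed run.
    Ties go to the label that appears first among the votes.
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the first-seen label among equally frequent ones.
        label, _ = Counter(votes).most_common(1)[0]
        ensembled.append(label)
    return ensembled


# Three hypothetical seed runs predicting labels for three samples.
seed_predictions = [
    [0, 1, 2],
    [0, 2, 2],
    [1, 1, 0],
]
ensemble_labels = majority_vote(seed_predictions)
```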

### Error Analysis

Some topics for discussion include:
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.


# The End

Feel free to reach out for questions/doubts!