# Assignment 1
**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface



# Contact
For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it

Professor:
- Paolo Torroni -> p.torroni@unibo.it

# Introduction
You are asked to address the [EXIST 2023 Task 1](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition
The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).

### Examples:

**Text**: *Can’t go a day without women womening*

**Label**: Sexist

**Text**: *''Society's set norms! Happy men's day though!#weareequal''*

**Label**: Not sexist

In [3]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from tqdm import tqdm
import gensim
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet, stopwords
from functools import reduce
from nltk import OrderedDict
import gensim.downloader as gloader
import tensorflow as tf
from tensorflow.keras.utils import Sequence
import keras
from keras import Input
from keras.layers import Bidirectional, LSTM, Dense, Embedding, Concatenate

In [4]:
seed = 42
np.random.seed(seed)
tf.keras.utils.set_random_seed(seed)

# [Task 1 - 1.0 points] Corpus

We have preparared a small version of EXIST dataset in our dedicated [Github repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2024-2025/Assignment%201/data).

Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.

The three sets are slightly unbalanced, with a bias toward the `Non-sexist` class.



### Dataset Description
- The dataset contains tweets in both English and Spanish.
- There are labels for multiple tasks, but we are focusing on **Task 1**.
- For Task 1, soft labels are assigned by six annotators.
- The labels for Task 1 represent whether the tweet is sexist ("YES") or not ("NO").







### Example


    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
          ["STEREOTYPING-DOMINANCE"],
          ["OBJECTIFICATION"],
          ["SEXUAL-VIOLENCE"],
          ["-"],
          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
          ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
      }
    }

### Instructions
1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as pandas dataframes.
3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.
4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.
5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.
6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".

---

1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as pandas dataframes.

In [7]:
df_train = pd.read_json("./data/training.json").transpose().set_index("id_EXIST")
df_test = pd.read_json("./data/test.json").transpose().set_index("id_EXIST")
df_val = pd.read_json("./data/validation.json").transpose().set_index("id_EXIST")

In [8]:
df_train.head()

Unnamed: 0_level_0,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100001,es,"@TheChiflis Ignora al otro, es un capullo.El p...",6,"[Annotator_1, Annotator_2, Annotator_3, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, YES, NO, YES, YES, YES]","[REPORTED, JUDGEMENTAL, -, REPORTED, JUDGEMENT...","[[OBJECTIFICATION], [OBJECTIFICATION, SEXUAL-V...",TRAIN_ES
100002,es,@ultimonomada_ Si comicsgate se parece en algo...,6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, YES, NO]","[-, -, -, -, DIRECT, -]","[[-], [-], [-], [-], [OBJECTIFICATION], [-]]",TRAIN_ES
100003,es,"@Steven2897 Lee sobre Gamergate, y como eso ha...",6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",TRAIN_ES
100004,es,@Lunariita7 Un retraso social bastante lamenta...,6,"[Annotator_13, Annotator_14, Annotator_15, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, YES, NO, YES, YES]","[-, -, DIRECT, -, REPORTED, REPORTED]","[[-], [-], [IDEOLOGICAL-INEQUALITY], [-], [IDE...",TRAIN_ES
100005,es,@novadragon21 @icep4ck @TvDannyZ Entonces como...,6,"[Annotator_19, Annotator_20, Annotator_21, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, NO, YES, NO, YES, YES]","[REPORTED, -, JUDGEMENTAL, -, JUDGEMENTAL, DIR...","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION], [-...",TRAIN_ES


In [9]:
df_test.head()

Unnamed: 0_level_0,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
400178,en,1st day at the pool on a beautiful Sunday in N...,6,"[Annotator_764, Annotator_765, Annotator_766, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN
400179,en,“I like your outfit too except when i dress up...,6,"[Annotator_805, Annotator_426, Annotator_806, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, NO]","[JUDGEMENTAL, DIRECT, REPORTED, DIRECT, REPORT...","[[OBJECTIFICATION], [OBJECTIFICATION, MISOGYNY...",DEV_EN
400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN
400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, DIRECT, JUDGEMENTAL, DIRECT, DIRECT, DIRECT]","[[-], [IDEOLOGICAL-INEQUALITY, MISOGYNY-NON-SE...",DEV_EN
400182,en,@ultshunnie u gotta say some shit like “i’ll f...,6,"[Annotator_770, Annotator_771, Annotator_772, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, YES]","[DIRECT, REPORTED, DIRECT, DIRECT, JUDGEMENTAL...","[[OBJECTIFICATION, SEXUAL-VIOLENCE], [SEXUAL-V...",DEV_EN


In [10]:
df_val.head()

Unnamed: 0_level_0,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
300001,es,@Fichinescu La comunidad gamer es un antro de ...,6,"[Annotator_726, Annotator_727, Annotator_357, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, NO, YES, NO]","[-, JUDGEMENTAL, JUDGEMENTAL, -, REPORTED, -]","[[-], [MISOGYNY-NON-SEXUAL-VIOLENCE], [MISOGYN...",DEV_ES
300002,es,@anacaotica88 @MordorLivin No me acuerdo de lo...,6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, NO, YES, YES, YES]","[JUDGEMENTAL, REPORTED, -, JUDGEMENTAL, JUDGEM...","[[IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINAN...",DEV_ES
300003,es,@cosmicJunkBot lo digo cada pocos dias y lo re...,6,"[Annotator_735, Annotator_736, Annotator_345, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_ES
300004,es,Also mientras les decia eso la señalaba y deci...,6,"[Annotator_259, Annotator_739, Annotator_291, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, REPORTED, REPORTED, REPORTED, JUDGEMENTAL,...","[[-], [SEXUAL-VIOLENCE], [SEXUAL-VIOLENCE], [S...",DEV_ES
300005,es,"And all people killed, attacked, harassed by ...",6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, NO, NO, NO, NO]","[-, DIRECT, -, -, -, -]","[[-], [STEREOTYPING-DOMINANCE], [-], [-], [-],...",DEV_ES


In [11]:
print("- Training dataset shape:", df_train.shape)
print("- Test dataset shape:", df_test.shape)
print("- Validation dataset shape:", df_val.shape)

- Training dataset shape: (6920, 10)
- Test dataset shape: (312, 10)
- Validation dataset shape: (726, 10)


3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.

In [12]:
df_train_T1 = df_train.copy()
df_test_T1 = df_test.copy()
df_val_T1 = df_val.copy()

df_train_T1['hard_labels_task1'] = df_train_T1['labels_task1'].apply(
    lambda x: 'YES' if x.count('YES') > x.count('NO') else ('NO' if x.count('NO') > x.count('YES') else np.NAN)
)
df_test_T1['hard_labels_task1'] = df_test_T1['labels_task1'].apply(
    lambda x: 'YES' if x.count('YES') > x.count('NO') else ('NO' if x.count('NO') > x.count('YES') else np.NAN)
)
df_val_T1['hard_labels_task1'] = df_val_T1['labels_task1'].apply(
    lambda x: 'YES' if x.count('YES') > x.count('NO') else ('NO' if x.count('NO') > x.count('YES') else np.NAN)
)

df_train_T1.dropna(inplace=True)
df_test_T1.dropna(inplace=True)
df_val_T1.dropna(inplace=True)

4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.

In [13]:
df_train_T1 = df_train_T1[df_train_T1["lang"] == "en"]
df_test_T1 = df_test_T1[df_test_T1["lang"] == "en"]
df_val_T1 = df_val_T1[df_val_T1["lang"] == "en"]

5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.

In [14]:
df_train_T1 = df_train_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_test_T1 = df_test_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_val_T1 = df_val_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)

6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".

In [15]:
df_train_T1['hard_labels_task1'] = df_train_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_test_T1['hard_labels_task1'] = df_test_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_val_T1['hard_labels_task1'] = df_val_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)

In [16]:
df_train_T1.head()

Unnamed: 0_level_0,lang,tweet,hard_labels_task1
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
200002,en,Writing a uni essay in my local pub with a cof...,1
200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,1
200006,en,According to a customer I have plenty of time ...,1
200007,en,"So only 'blokes' drink beer? Sorry, but if you...",1
200008,en,New to the shelves this week - looking forward...,0


In [17]:
df_test_T1.head()

Unnamed: 0_level_0,lang,tweet,hard_labels_task1
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
400178,en,1st day at the pool on a beautiful Sunday in N...,0
400179,en,“I like your outfit too except when i dress up...,1
400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",0
400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,1
400182,en,@ultshunnie u gotta say some shit like “i’ll f...,1


In [18]:
df_val_T1.head()

Unnamed: 0_level_0,lang,tweet,hard_labels_task1
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
400001,en,"@Mike_Fabricant “You should smile more, love. ...",0
400002,en,@BBCWomansHour @LabWomenDec @EverydaySexism Sh...,1
400003,en,#everydaysexism Some man moving my suitcase in...,1
400004,en,@KolHue @OliverJia1014 lol gamergate the go to...,0
400005,en,@ShelfStoriesGBL To me this has the same negat...,0


In [19]:
print("- Training dataset shape:", df_train_T1.shape)
print("- Test dataset shape:", df_test_T1.shape)
print("- Validation dataset shape:", df_val_T1.shape)

- Training dataset shape: (2870, 3)
- Test dataset shape: (286, 3)
- Validation dataset shape: (158, 3)


# [Task2 - 0.5 points] Data Cleaning
In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.



### Instructions
- **Remove emojis** from the tweets.
- **Remove hashtags** (e.g., `#example`).
- **Remove mentions** such as `@user`.
- **Remove URLs** from the tweets.
- **Remove special characters and symbols**.
- **Remove specific quote characters** (e.g., curly quotes).
- **Perform lemmatization** to reduce words to their base form.

---

In [20]:
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [21]:
##### REVIEW #####

lemmatizer = WordNetLemmatizer()
try:
    STOPWORDS = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    STOPWORDS = set(stopwords.words('english'))

def get_wordnet_key(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return 'n'

def lem_text(text: str):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    words = [lemmatizer.lemmatize(token, get_wordnet_key(tag)) for token, tag in tagged]
    return " ".join(words)

def strip_emoji(text):
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)

def strip_tags(text):
    entity_prefixes = ['@','#']
    for separator in  string.punctuation:
        if separator not in entity_prefixes :
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)

def strip_links(text):
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def remove_links(text):
    """
    Removes all links (URLs) from the given string.

    Args:
        text (str): The input string containing links.

    Returns:
        str: The string with all links removed.
    """
    # Regular expression pattern to match URLs
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)

def special_ch_sym(text):
    RE_ch_sym = re.compile(u'[^a-z A-Z 0-9]')
    return RE_ch_sym.sub(r'',text)

def replace_br(text: str) -> str:
    """
    Replaces br characters
    """
    return text.replace('br', '')

def remove_stopwords(text: str) -> str:
    return ' '.join([x for x in text.split() if x and x not in STOPWORDS])

def text_cleaning(text):
    return lem_text(remove_stopwords(replace_br(special_ch_sym(strip_tags(remove_links(strip_emoji(text.lower().strip())))))))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [22]:
df_train_T2 = df_train_T1.copy()
df_train_T2['tweet'] = df_train_T2['tweet'].apply(text_cleaning)

df_test_T2 = df_test_T1.copy()
df_test_T2['tweet'] = df_test_T2['tweet'].apply(text_cleaning)

df_val_T2 = df_val_T1.copy()
df_val_T2['tweet'] = df_val_T2['tweet'].apply(text_cleaning)

df_train_T2.head()

Unnamed: 0_level_0,lang,tweet,hard_labels_task1
id_EXIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
200002,en,write uni essay local pub coffee random old ma...,1
200003,en,2021 1921 dont appreciate two ride team member...,1
200006,en,accord customer plenty time go spent stirling ...,1
200007,en,bloke drink beer sorry bloke drink wine appare...,1
200008,en,new shelf week look forward read book,0


# [Task 3 - 0.5 points] Text Encoding
To train a neural sexism classifier, you first need to encode text into numerical format.




### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.





### Note : What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)



### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in train set; or
* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!

---

1. **Building a vocabulary**

In [23]:
def build_vocabulary(df):
    idx_to_word = OrderedDict()
    word_to_idx = OrderedDict()

    curr_idx = 0
    word_to_idx['<pad>'] = curr_idx
    idx_to_word[curr_idx] = '<pad>'
    curr_idx += 1

    for sentence in tqdm(df.tweet.values):
        tokens = word_tokenize(sentence)
        for token in tokens:
            if token not in word_to_idx:
                word_to_idx[token] = curr_idx
                idx_to_word[curr_idx] = token
                curr_idx += 1
    word_to_idx['<unk>'] = curr_idx
    idx_to_word[curr_idx] = '<unk>'

    word_listing = list(idx_to_word.values())
    return idx_to_word, word_to_idx, word_listing

In [24]:
idx_to_word, word_to_idx, word_listing = build_vocabulary(df_train_T2)
print()
print(f'[Debug] Index -> Word vocabulary size: {len(idx_to_word)}')
print(f'[Debug] Word -> Index vocabulary size: {len(word_to_idx)}')
print(f'[Debug] Some words: {[(idx_to_word[idx], idx) for idx in np.arange(9365, 9377)]}')

100%|██████████| 2870/2870 [00:01<00:00, 1988.03it/s]


[Debug] Index -> Word vocabulary size: 9377
[Debug] Word -> Index vocabulary size: 9377
[Debug] Some words: [('pleasant', 9365), ('drachen', 9366), ('armor', 9367), ('dragoon', 9368), ('estinen', 9369), ('nike', 9370), ('coochie', 9371), ('mutuals', 9372), ('grabs', 9373), ('lh', 9374), ('mandy', 9375), ('<unk>', 9376)]





In [25]:
def evaluate_vocabulary(idx_to_word, word_to_idx,
                        word_listing, df, check_default_size: bool = False):
    print("[Vocabulary Evaluation] Size checking...")
    assert len(idx_to_word) == len(word_to_idx)
    assert len(idx_to_word) == len(word_listing)

    print("[Vocabulary Evaluation] Content checking...")
    for i in tqdm(range(0, len(idx_to_word))):
        assert idx_to_word[i] in word_to_idx
        assert word_to_idx[idx_to_word[i]] == i

    print("[Vocabulary Evaluation] Consistency checking...")
    _, _, first_word_listing = build_vocabulary(df)
    _, _, second_word_listing = build_vocabulary(df)
    assert first_word_listing == second_word_listing

    print("[Vocabulary Evaluation] Toy example checking...")
    toy_df = pd.DataFrame.from_dict({
        'tweet': ["all that glitters is not gold", "all in all i like this assignment"]
    })
    _, _, toy_word_listing = build_vocabulary(toy_df)
    toy_valid_vocabulary = set(' '.join(toy_df.tweet.values).split())
    toy_valid_vocabulary.add("<unk>")
    toy_valid_vocabulary.add("<pad>")
    assert set(toy_word_listing) == toy_valid_vocabulary

In [26]:
print("Vocabulary evaluation...")
evaluate_vocabulary(idx_to_word, word_to_idx, word_listing, df_train_T2)
print("Evaluation completed!")

Vocabulary evaluation...
[Vocabulary Evaluation] Size checking...
[Vocabulary Evaluation] Content checking...


100%|██████████| 9377/9377 [00:00<00:00, 1144211.70it/s]


[Vocabulary Evaluation] Consistency checking...


100%|██████████| 2870/2870 [00:01<00:00, 2052.25it/s]
100%|██████████| 2870/2870 [00:03<00:00, 748.56it/s]


[Vocabulary Evaluation] Toy example checking...


100%|██████████| 2/2 [00:00<00:00, 4990.25it/s]

Evaluation completed!





2. **Embedding**

In [27]:
import gensim
import gensim.downloader as gloader

def load_embedding_model(model_type: str, embedding_dimension: int = 50) -> gensim.models.keyedvectors.KeyedVectors:

    download_path = ""
    if model_type.strip().lower() == 'word2vec':
        download_path = "word2vec-google-news-300"
    elif model_type.strip().lower() == 'glove':
        download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
    elif model_type.strip().lower() == 'fasttext':
        download_path = "fasttext-wiki-news-subwords-300"
    else:
        raise AttributeError("Unsupported embedding model type! Available ones: word2vec, glove, fasttext")

    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding model name! Check the embedding dimension:")
        print("Word2Vec: 300")
        print("Glove: 50, 100, 200, 300")
        print('FastText: 300')
        raise e

    return emb_model

In [28]:
# Modify these variables as you wish!
# Glove -> 50, 100, 200, 300
# Word2Vec -> 300
# Fasttext -> 300
embedding_dimension = 50
embedding_model = load_embedding_model(model_type="glove", embedding_dimension=embedding_dimension)



3. **Out of vocabulary (OOV) words**

In [29]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors, word_listing):
    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)

In [30]:
oov_terms = check_OOV_terms(embedding_model, word_listing)
oov_percentage = float(len(oov_terms)) * 100 / len(word_listing)
print(f"Total OOV terms: {len(oov_terms)} ({oov_percentage:.2f}%)")

Total OOV terms: 1231 (13.13%)


In [31]:
def build_embedding_matrix(embedding_model, embedding_dimension, word_to_idx, vocab_size, oov_terms):

    embedding_matrix = np.zeros((vocab_size, embedding_dimension), dtype=np.float32)
    for word, idx in tqdm(word_to_idx.items()):
        try:
            embedding_vector = embedding_model[word]
        except (KeyError, TypeError):
            if word == "<unk>":
                embedding_vector = np.zeros(embedding_dimension)
            else:
                embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)

        embedding_matrix[idx] = embedding_vector

    return embedding_matrix

In [32]:
embedding_matrix = build_embedding_matrix(embedding_model,
                                          embedding_dimension,
                                          word_to_idx,
                                          len(word_to_idx),
                                          oov_terms)
print(f"\nEmbedding matrix shape: {embedding_matrix.shape}")
vocab_size = embedding_matrix.shape[0]

9377it [00:00, 315903.52it/s]


Embedding matrix shape: (9377, 50)





# [Task 4 - 1.0 points] Model definition

You are now tasked to define your sexism classifier.




### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.

### Token to embedding mapping

You can follow two approaches for encoding tokens in your classifier.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define a Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not

### Padding

Pay attention to padding tokens!

Your model **should not** be penalized on those tokens.

#### How to?

There are two main ways.

However, their implementation depends on the neural library you are using.

- Embedding layer
- Custom loss to compute average cross-entropy on non-padding tokens only

**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.

---

In [33]:
### Hyperparameters ###

num_classes = 2
hidden_dim = 64
max_len = max(df_train_T2['tweet'].apply(lambda x: len(word_tokenize(x))))

In [34]:
class Bidirectional_LSTM_Baseline (tf.keras.Model):
    def __init__(self, input_dim, output_dim, hidden_dim):
        super(Bidirectional_LSTM_Baseline, self).__init__()

        self.input_layer = Input(shape=(39,))
        self.embed_layer = Embedding(input_dim=vocab_size,
                                      output_dim=embedding_dimension,
                                      weights=[embedding_matrix],
                                      mask_zero=True,              # automatically masks padding tokens
                                      name='encoder_embedding')
        self.forward_layer = LSTM(hidden_dim)
        self.backward_layer = LSTM(hidden_dim,
                                   go_backwards=True)
        self.bidir_layer = Bidirectional(self.forward_layer,
                                         backward_layer=self.backward_layer)
        self.dense_layer = Dense(output_dim,
                                 activation='softmax')

    def call(self, input):
        x = self.input_layer(input)
        x = self.embed_layer(x)
        x = self.bidir_layer(x)
        output = self.dense_layer(x)
        return output

class Bidirectional_LSTM_Model_1 (Bidirectional_LSTM_Baseline):
    def __init__(self, input_dim, output_dim, hidden_dim):
        super(Bidirectional_LSTM_Model_1, self).__init__()

    def call(self, input):
        x = self.input_layer(input)
        x = self.embed_layer(x)
        x = self.bidir_layer(x)
        x = self.bidir_layer(x)
        output = self.dense_layer(x)
        return keras.Model(input, output, name="model_1_LSTM")

In [35]:
def Bidir_LSTM_Baseline (input_dim, output_dim, hidden_dim):
    inputs = Input(shape=(None,))
    embed_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dimension,
                            weights=[embedding_matrix],
                            mask_zero=True,              # automatically masks padding tokens
                            name='encoder_embedding')(inputs)
    forward_layer = LSTM(hidden_dim)
    backward_layer = LSTM(hidden_dim,
                          go_backwards=True)
    bidir_layer = Bidirectional(forward_layer,
                                backward_layer=backward_layer)(embed_layer)
    dense_layer = Dense(output_dim,
                        activation='tanh')(bidir_layer)

    return keras.Model(inputs, dense_layer, name="baseline_LSTM")

In [36]:
def Bidir_LSTM_Model_1 (input_dim, output_dim, hidden_dim):
    inputs = Input(shape=(None,))
    embed_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dimension,
                            weights=[embedding_matrix],
                            mask_zero=True,              # automatically masks padding tokens
                            name='encoder_embedding')(inputs)
    #forward_layer = LSTM(hidden_dim, return_sequences=True)
    #backward_layer = LSTM(hidden_dim, go_backwards=True, return_sequences=True)
    bidir_layer_1 = Bidirectional(LSTM(hidden_dim, return_sequences=True),
                                backward_layer=LSTM(hidden_dim, go_backwards=True, return_sequences=True))(embed_layer)
    bidir_layer_2 = Bidirectional(LSTM(hidden_dim),
                                backward_layer= LSTM(hidden_dim, go_backwards=True))(bidir_layer_1)
    #bidir_layer = Concatenate()([bidir_layer_1, bidir_layer_2])
    dense_layer = Dense(output_dim,
                        activation='tanh')(bidir_layer_2)

    return keras.Model(inputs, dense_layer, name="baseline_LSTM")

In [37]:
baseline_LSTM = Bidir_LSTM_Baseline(input_dim=max_len, output_dim=1, hidden_dim=hidden_dim)
baseline_LSTM.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
baseline_LSTM.summary()

In [38]:
model_1_LSTM = Bidir_LSTM_Model_1(input_dim=max_len, output_dim=1, hidden_dim=hidden_dim)
model_1_LSTM.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_1_LSTM.summary()

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline and Model 1.



### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
* Evaluate your models using macro F1-score.

---

In [39]:
class DataGenerator(Sequence):
    def __init__(self, data, word_to_idx, batch_size=32, shuffle=True, seed=seed):
        super().__init__()
        self.data = data
        self.tweet = data["tweet"].to_numpy()
        self.hard_labels_task1 = data["hard_labels_task1"]
        self.word_to_idx = word_to_idx
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()
        self._prepare_data()

    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size: (index+1)*self.batch_size]
        data_batch = np.array([self.tweet[k] for k in indexes])
        target_batch = np.array([self.hard_labels_task1.to_list()[k] for k in indexes])

        return (data_batch, target_batch)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)

    def _prepare_data(self):
        self.tweet = [word_tokenize(sentence) + ['<pad>']*(max_len - len(word_tokenize(sentence))) for sentence in self.tweet]
        #self.tweet = self.tweet.apply(lambda x: x.split() + ['<pad>']*(max_len - len(x.split())))
        self.tweet = [[self.word_to_idx[word] if word in word_listing else self.word_to_idx["<unk>"] for word in sentence] for sentence in self.tweet]
        #self.tweet = self.tweet.apply(lambda x: [self.word_to_idx[word] for word in x])

In [40]:
batch_size = 8

train_gen = DataGenerator(df_train_T2, word_to_idx, batch_size=batch_size, shuffle=True, seed=seed)
validation_gen = DataGenerator(df_val_T2, word_to_idx, batch_size=batch_size, shuffle=False, seed=seed)

In [41]:
baseline_LSTM.fit(train_gen, validation_data=validation_gen, batch_size=batch_size, epochs=10)

Epoch 1/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 8ms/step - accuracy: 0.6110 - loss: 1.0200 - val_accuracy: 0.6250 - val_loss: 0.6415
Epoch 2/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.7076 - loss: 0.5815 - val_accuracy: 0.6974 - val_loss: 0.6017
Epoch 3/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 10ms/step - accuracy: 0.8104 - loss: 0.4453 - val_accuracy: 0.7237 - val_loss: 1.1532
Epoch 4/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.8786 - loss: 0.3051 - val_accuracy: 0.8026 - val_loss: 0.8934
Epoch 5/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - accuracy: 0.9379 - loss: 0.2270 - val_accuracy: 0.7697 - val_loss: 1.2867
Epoch 6/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 13ms/step - accuracy: 0.9664 - loss: 0.1484 - val_accuracy: 0.7829 - val_loss: 2.0316
Epoch 7/10
[1m358/358[0

<keras.src.callbacks.history.History at 0x7965f2d09fc0>

In [42]:
model_1_LSTM.fit(train_gen, validation_data=validation_gen, batch_size=batch_size, epochs=10)

Epoch 1/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 16ms/step - accuracy: 0.6178 - loss: 0.7105 - val_accuracy: 0.7237 - val_loss: 0.5332
Epoch 2/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 11ms/step - accuracy: 0.7291 - loss: 0.5618 - val_accuracy: 0.7039 - val_loss: 0.5395
Epoch 3/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 11ms/step - accuracy: 0.7920 - loss: 0.5033 - val_accuracy: 0.7697 - val_loss: 0.6572
Epoch 4/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 12ms/step - accuracy: 0.8883 - loss: 0.3358 - val_accuracy: 0.8026 - val_loss: 0.7714
Epoch 5/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 11ms/step - accuracy: 0.9390 - loss: 0.2146 - val_accuracy: 0.7961 - val_loss: 1.6184
Epoch 6/10
[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 12ms/step - accuracy: 0.9740 - loss: 0.1370 - val_accuracy: 0.8026 - val_loss: 1.1683
Epoch 7/10
[1m358/358

<keras.src.callbacks.history.History at 0x7965f2d0bb80>

# [Task 6 - 1.0 points] Transformers

In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).




### Relevant Material
- Tutorial 3

### Instructions
1. **Load the Tokenizer and Model**

2. **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

   **Note**: You have to use the plain text of the dataset and not the version that you tokenized before, as you need to tokenize the cleaned text obtained after the initial cleaning process.

3. **Train the Model**:
   Use the `Trainer` to train the model on your training data.

4. **Evaluate the Model on the Test Set** using F1-macro.

---

In [74]:
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, AutoTokenizer, DataCollatorWithPadding, Trainer
from sklearn.metrics import f1_score, accuracy_score
from scipy.special import softmax
import evaluate

1. **Load the Tokenizer and Model**

In [51]:
model_card = 'cardiffnlp/twitter-roberta-base-hate'

tokenizer = AutoTokenizer.from_pretrained(model_card)

model = AutoModelForSequenceClassification.from_pretrained(model_card,
                                                           num_labels=2,
                                                           id2label={0: 'NEG', 1: 'POS'},
                                                           label2id={'NEG': 0, 'POS': 1})

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print(model)


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

2. **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

   **Note**: You have to use the plain text of the dataset and not the version that you tokenized before, as you need to tokenize the cleaned text obtained after the initial cleaning process.

In [56]:
df_train_T6 = df_train_T2.copy()
df_test_T6 = df_test_T2.copy()

train_data = Dataset.from_pandas(df_train_T6)
test_data = Dataset.from_pandas(df_test_T6)

def preprocess_text(texts):
    return tokenizer(texts['tweet'], truncation=True)

train_data = train_data.map(preprocess_text, batched=True)
test_data = test_data.map(preprocess_text, batched=True)

Map:   0%|          | 0/2870 [00:00<?, ? examples/s]

Map:   0%|          | 0/286 [00:00<?, ? examples/s]

In [57]:
print(train_data)
print(test_data)

Dataset({
    features: ['lang', 'tweet', 'hard_labels_task1', 'id_EXIST', 'input_ids', 'attention_mask'],
    num_rows: 2870
})
Dataset({
    features: ['lang', 'tweet', 'hard_labels_task1', 'id_EXIST', 'input_ids', 'attention_mask'],
    num_rows: 286
})


In [58]:
print(train_data['input_ids'][50])
print(train_data['attention_mask'][50])

[0, 6460, 761, 1812, 693, 213, 291, 604, 8861, 1079, 10170, 23725, 155, 1248, 3798, 45365, 2400, 477, 604, 3988, 693, 269, 1049, 7681, 11, 9736, 45676, 2249, 17844, 90, 1722, 11, 9736, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [60]:
original_text = train_data['tweet'][50]
decoded_text = tokenizer.decode(train_data['input_ids'][50])

print(original_text)
print()
print()
print(decoded_text)

get kind 80 woman go 20 men ignore rest accord statistic 3 date apps thats pain point men hat woman really main component incel whats difference mgtow incel


<s>get kind 80 woman go 20 men ignore rest accord statistic 3 date apps thats pain point men hat woman really main component incel whats difference mgtow incel</s>


3. **Train the Model**:
   Use the `Trainer` to train the model on your training data.

In [66]:
acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)

    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}


train_data = train_data.rename_column('hard_labels_task1', 'label')
test_data = test_data.rename_column('hard_labels_task1', 'label')

In [83]:
training_args = TrainingArguments(
    output_dir="test_dir",                 # where to save model
    learning_rate=2e-5,
    per_device_train_batch_size=8,         # accelerate defines distributed training
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    eval_strategy="epoch",           # when to report evaluation metrics/losses
    save_strategy="epoch",                 # when to save checkpoint
    load_best_model_at_end=True,
    report_to='none'                       # disabling wandb (default)
)

In [84]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [85]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1,Accuracy
1,No log,1.588907,0.785071,0.786713
2,0.061900,1.353542,0.817611,0.818182
3,0.103400,1.418771,0.809362,0.811189
4,0.103400,1.428322,0.802564,0.804196


TrainOutput(global_step=1436, training_loss=0.07049983324778777, metrics={'train_runtime': 233.961, 'train_samples_per_second': 49.068, 'train_steps_per_second': 6.138, 'total_flos': 202577012631120.0, 'train_loss': 0.07049983324778777, 'epoch': 4.0})

4. **Evaluate the Model on the Test Set** using F1-macro.

In [81]:
test_prediction_info = trainer.predict(test_data)
test_predictions, test_labels = test_prediction_info.predictions, test_prediction_info.label_ids

print(test_predictions.shape)
print(test_labels.shape)

(286, 2)
(286,)


In [82]:
test_metrics = compute_metrics([test_predictions, test_labels])
print(test_metrics)

{'f1': 0.7789392322131842, 'accuracy': 0.7797202797202797}


# [Task 7 - 0.5 points] Error Analysis

### Instructions

After evaluating the model, perform a brief error analysis:

 - Review the results and identify common errors.

 - Summarize your findings regarding the errors and their impact on performance (e.g. but not limited to Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer...)
 - Suggest possible solutions to address the identified errors.



---

In [None]:
### CODE HERE ###

# [Task 8 - 0.5 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.


# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check this frequently asked questions before contacting us

### Execution Order

You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).
However, you are **free** to play with their hyper-parameters.


### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.

### Error Analysis

Some topics for discussion include:
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Bonus Points
Bonus points are arbitrarily assigned based on significant contributions such as:
- Outstanding error analysis
- Masterclass code organization
- Suitable extensions
Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

**Possible Extensions/Explorations for Bonus Points:**
- **Try other preprocessing strategies**: e.g., but not limited to, explore techniques tailored specifically for tweets or  methods that are common in social media text.
- **Experiment with other custom architectures or models from HuggingFace**
- **Explore Spanish tweets**: e.g., but not limited to, leverage multilingual models to process Spanish tweets and assess their performance compared to monolingual models.







# The End