# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni



**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface




# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:



Teaching Assistants:



- Federico Ruggeri -> federico.ruggeri6@unibo.it

- Eleonora Mancini -> e.mancini@unibo.it



Professor:

- Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are asked to address the [EXIST 2023 Task 1](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition

The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).



### Examples:



**Text**: *Can’t go a day without women womening*



**Label**: Sexist



**Text**: *''Society's set norms! Happy men's day though!#weareequal''*



**Label**: Not sexist

In [3]:
!pip install evaluate
!mkdir -p ./data
!wget https://raw.githubusercontent.com/nlp-unibo/nlp-course-material/72c93981fe38020285a16e1558d5d4a314bb1b82/2024-2025/Assignment%201/data/test.json -O data/test.json
!wget https://raw.githubusercontent.com/nlp-unibo/nlp-course-material/72c93981fe38020285a16e1558d5d4a314bb1b82/2024-2025/Assignment%201/data/training.json -O data/training.json
!wget https://raw.githubusercontent.com/nlp-unibo/nlp-course-material/72c93981fe38020285a16e1558d5d4a314bb1b82/2024-2025/Assignment%201/data/validation.json -O data/validation.json
!wget https://zenodo.org/record/3234051/files/embeddings-m-model.vec?download=1 -O spanish_embed.vec
!python -m spacy download es_core_news_sm


In [4]:
import pandas as pd
import numpy as np
import gensim
import gensim.downloader as gloader
import tensorflow as tf
import re
import nltk
import keras
import evaluate
import spacy
import matplotlib.pyplot as plt
from datasets import Dataset
#from functools import reduce
from keras import Input
from keras.layers import Bidirectional, LSTM, Dense, Embedding
from keras.callbacks import Callback, ModelCheckpoint
from keras.optimizers import Nadam
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet, stopwords
from sklearn.metrics import f1_score, accuracy_score, precision_recall_curve, ConfusionMatrixDisplay, PrecisionRecallDisplay
from sklearn.utils.class_weight import compute_class_weight
from tqdm import tqdm
from tensorflow.keras.utils import Sequence
from transformers import AutoModelForSequenceClassification, TrainingArguments, AutoTokenizer, DataCollatorWithPadding, Trainer

#### Set an initial seed for reproducibility

In [6]:
seed = 42
np.random.seed(seed)

# [Task 1 - 1.0 points] Corpus



We have prepared a small version of the EXIST dataset in our dedicated [GitHub repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2024-2025/Assignment%201/data).



Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.



The three sets are slightly unbalanced, with a bias toward the `Non-sexist` class.
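
One common way to account for such an imbalance is to weight each class inversely to its frequency. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`; the helper name `balanced_class_weights` and the toy label list are illustrative assumptions and do not reflect the actual dataset statistics.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (not the real data): 6 non-sexist (0) and 4 sexist (1) items
toy_labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight, so its errors would count more if the weights were passed to a loss function.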




### Dataset Description

- The dataset contains tweets in both English and Spanish.

- There are labels for multiple tasks, but we are focusing on **Task 1**.

- For Task 1, soft labels are assigned by six annotators.

- The labels for Task 1 represent whether the tweet is sexist ("YES") or not ("NO").












### Example





    "203260": {

        "id_EXIST": "203260",

        "lang": "en",

        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",

        "number_annotators": 6,

        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],

        "gender_annotators": ["F", "F", "M", "M", "M", "F"],

        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],

        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],

        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],

        "labels_task3": [

          ["STEREOTYPING-DOMINANCE"],

          ["OBJECTIFICATION"],

          ["SEXUAL-VIOLENCE"],

          ["-"],

          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],

          ["OBJECTIFICATION"]

        ],

        "split": "TRAIN_EN"

      }

    }
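
As a minimal sketch of what a soft label could look like, the `labels_task1` votes of the example above can be turned into the fraction of "YES" votes. Reading a soft label as a vote fraction is an assumption made here for illustration, and the helper name `soft_label` is hypothetical.

```python
def soft_label(votes):
    """Return the fraction of annotators that voted "YES"."""
    if not votes:
        raise ValueError("at least one annotation is required")
    return votes.count("YES") / len(votes)


# Votes taken from the "labels_task1" field of the example item
votes = ["YES", "YES", "YES", "NO", "YES", "YES"]
print(round(soft_label(votes), 3))
```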

### Instructions

1. **Download** the `A1/data` folder.

2. **Load** the three JSON files and encode them as pandas dataframes.

3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.

4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.

5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.

6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".
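
Steps 3 and 6 can be sketched on plain Python lists before moving to dataframes. The names `hard_label` and `LABEL_TO_ID` are hypothetical and used only for this illustration; ties are returned as `None` so that they can be filtered out.

```python
from collections import Counter


def hard_label(votes):
    """Return "YES" or "NO" by strict majority, or None when there is a tie."""
    counts = Counter(votes)
    if counts["YES"] > counts["NO"]:
        return "YES"
    if counts["NO"] > counts["YES"]:
        return "NO"
    return None  # no clear majority: the item would be removed


LABEL_TO_ID = {"YES": 1, "NO": 0}

examples = [
    ["YES", "YES", "YES", "NO", "YES", "YES"],  # clear majority for YES
    ["NO", "NO", "YES", "NO", "YES", "YES"],    # 3 vs 3: tie
    ["NO", "NO", "NO", "NO", "YES", "NO"],      # clear majority for NO
]

hard = [hard_label(v) for v in examples]
encoded = [LABEL_TO_ID[h] for h in hard if h is not None]
print(hard)
print(encoded)
```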

---

1. **Download** the `A1/data` folder.

2. **Load** the three JSON files and encode them as pandas dataframes.

In [7]:
df_train = pd.read_json("./data/training.json").transpose().set_index("id_EXIST")
df_test = pd.read_json("./data/test.json").transpose().set_index("id_EXIST")
df_val = pd.read_json("./data/validation.json").transpose().set_index("id_EXIST")

In [8]:
df_train.head()

| id_EXIST | lang | tweet | number_annotators | annotators | gender_annotators | age_annotators | labels_task1 | labels_task2 | labels_task3 | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 100001 | es | @TheChiflis Ignora al otro, es un capullo.El p... | 6 | [Annotator_1, Annotator_2, Annotator_3, Annota... | [F, F, F, M, M, M] | [18-22, 23-45, 46+, 46+, 23-45, 18-22] | [YES, YES, NO, YES, YES, YES] | [REPORTED, JUDGEMENTAL, -, REPORTED, JUDGEMENT... | [[OBJECTIFICATION], [OBJECTIFICATION, SEXUAL-V... | TRAIN_ES |
| 100002 | es | @ultimonomada_ Si comicsgate se parece en algo... | 6 | [Annotator_7, Annotator_8, Annotator_9, Annota... | [F, F, F, M, M, M] | [18-22, 23-45, 46+, 46+, 23-45, 18-22] | [NO, NO, NO, NO, YES, NO] | [-, -, -, -, DIRECT, -] | [[-], [-], [-], [-], [OBJECTIFICATION], [-]] | TRAIN_ES |
| 100003 | es | @Steven2897 Lee sobre Gamergate, y como eso ha... | 6 | [Annotator_7, Annotator_8, Annotator_9, Annota... | [F, F, F, M, M, M] | [18-22, 23-45, 46+, 46+, 23-45, 18-22] | [NO, NO, NO, NO, NO, NO] | [-, -, -, -, -, -] | [[-], [-], [-], [-], [-], [-]] | TRAIN_ES |
| 100004 | es | @Lunariita7 Un retraso social bastante lamenta... | 6 | [Annotator_13, Annotator_14, Annotator_15, Ann... | [F, F, F, M, M, M] | [18-22, 23-45, 46+, 46+, 23-45, 18-22] | [NO, NO, YES, NO, YES, YES] | [-, -, DIRECT, -, REPORTED, REPORTED] | [[-], [-], [IDEOLOGICAL-INEQUALITY], [-], [IDE... | TRAIN_ES |
| 100005 | es | @novadragon21 @icep4ck @TvDannyZ Entonces como... | 6 | [Annotator_19, Annotator_20, Annotator_21, Ann... | [F, F, F, M, M, M] | [18-22, 23-45, 46+, 46+, 23-45, 18-22] | [YES, NO, YES, NO, YES, YES] | [REPORTED, -, JUDGEMENTAL, -, JUDGEMENTAL, DIR... | [[STEREOTYPING-DOMINANCE, OBJECTIFICATION], [-... | TRAIN_ES |


In [9]:
print("- Training dataset shape:", df_train.shape)
print("- Test dataset shape:", df_test.shape)
print("- Validation dataset shape:", df_val.shape)

- Training dataset shape: (6920, 10)
- Test dataset shape: (312, 10)
- Validation dataset shape: (726, 10)


3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.

In [10]:
df_train_cp = df_train.copy()
df_test_cp = df_test.copy()
df_val_cp = df_val.copy()

# Majority voting over the annotators' Task 1 labels.
# The result is stored, row by row, in the new "hard_labels_task1" column;
# rows without a clear majority (ties) are marked with NaN.
def majority_vote(labels):
    n_yes, n_no = labels.count('YES'), labels.count('NO')
    if n_yes > n_no:
        return 'YES'
    if n_no > n_yes:
        return 'NO'
    return np.nan  # np.nan is the canonical NumPy spelling

df_train_cp['hard_labels_task1'] = df_train_cp['labels_task1'].apply(majority_vote)
df_test_cp['hard_labels_task1'] = df_test_cp['labels_task1'].apply(majority_vote)
df_val_cp['hard_labels_task1'] = df_val_cp['labels_task1'].apply(majority_vote)


# Since for those rows without a clear majority vote the "hard_labels_task1" column was set to "NaN",
# this removes all those rows with the "NaN" value using df.dropna().

df_train_cp.dropna(inplace=True)
df_test_cp.dropna(inplace=True)
df_val_cp.dropna(inplace=True)

4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.

   **Expansion:** Spanish tweets are also explored, using the same preprocessing as for the English text.

In [11]:
# English tweets
df_train_T1 = df_train_cp[df_train_cp["lang"] == "en"]
df_test_T1 = df_test_cp[df_test_cp["lang"] == "en"]
df_val_T1 = df_val_cp[df_val_cp["lang"] == "en"]

# Spanish tweets
df_train_es = df_train_cp[df_train_cp["lang"] == "es"]
df_test_es = df_test_cp[df_test_cp["lang"] == "es"]
df_val_es = df_val_cp[df_val_cp["lang"] == "es"]

5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.

In [12]:
# English tweets
df_train_T1 = df_train_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_test_T1 = df_test_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_val_T1 = df_val_T1.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)

# Spanish tweets
df_train_es = df_train_es.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_test_es = df_test_es.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)
df_val_es = df_val_es.drop(
    ["number_annotators", "annotators","gender_annotators","age_annotators","labels_task1","labels_task2","labels_task3","split"],
    axis=1
)

6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".

In [13]:
df_train_T1['hard_labels_task1'] = df_train_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_test_T1['hard_labels_task1'] = df_test_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_val_T1['hard_labels_task1'] = df_val_T1['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)


df_train_es['hard_labels_task1'] = df_train_es['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_test_es['hard_labels_task1'] = df_test_es['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)
df_val_es['hard_labels_task1'] = df_val_es['hard_labels_task1'].apply(lambda x: 1 if x == 'YES' else 0)

#### Visualisation

In [14]:
df_train_T1.head()

| id_EXIST | lang | tweet | hard_labels_task1 |
|---|---|---|---|
| 200002 | en | Writing a uni essay in my local pub with a cof... | 1 |
| 200003 | en | @UniversalORL it is 2021 not 1921. I dont appr... | 1 |
| 200006 | en | According to a customer I have plenty of time ... | 1 |
| 200007 | en | So only 'blokes' drink beer? Sorry, but if you... | 1 |
| 200008 | en | New to the shelves this week - looking forward... | 0 |


In [15]:
print("English tweets \n")
print("- Training dataset shape:", df_train_T1.shape)
print("- Test dataset shape:", df_test_T1.shape)
print("- Validation dataset shape:", df_val_T1.shape)

English tweets 

- Training dataset shape: (2870, 3)
- Test dataset shape: (286, 3)
- Validation dataset shape: (158, 3)


In [16]:
df_train_es.head()

| id_EXIST | lang | tweet | hard_labels_task1 |
|---|---|---|---|
| 100001 | es | @TheChiflis Ignora al otro, es un capullo.El p... | 1 |
| 100002 | es | @ultimonomada_ Si comicsgate se parece en algo... | 0 |
| 100003 | es | @Steven2897 Lee sobre Gamergate, y como eso ha... | 0 |
| 100005 | es | @novadragon21 @icep4ck @TvDannyZ Entonces como... | 1 |
| 100006 | es | @yonkykong Aaah sí. Andrew Dobson. El que se d... | 0 |


In [17]:
print("Spanish tweets \n")
print("- Training dataset shape:", df_train_es.shape)
print("- Test dataset shape:", df_test_es.shape)
print("- Validation dataset shape:", df_val_es.shape)

Spanish tweets 

- Training dataset shape: (3194, 3)
- Test dataset shape: (0, 3)
- Validation dataset shape: (490, 3)


As the previous output shows, **the test set contains no Spanish tweets**. Therefore, in the following computations and evaluations the validation set is used to test the models, and the two languages are compared on their respective validation sets.

# [Task 2 - 0.5 points] Data Cleaning

In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.




### Instructions

- **Remove emojis** from the tweets.

- **Remove hashtags** (e.g., `#example`).

- **Remove mentions** such as `@user`.

- **Remove URLs** from the tweets.

- **Remove special characters and symbols**.

- **Remove specific quote characters** (e.g., curly quotes).

- **Perform lemmatization** to reduce words to their base form.
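
The first steps of this list can be sketched with a few regular expressions. This is a simplified illustration rather than the notebook's own cleaning functions: the name `quick_clean` and the patterns are assumptions, only ASCII letters and digits are kept (so accented Spanish characters would be lost), and lemmatization is left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")        # URLs
TAG_RE = re.compile(r"[@#]\w+")             # mentions and hashtags
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")   # emojis, quotes, punctuation, symbols


def quick_clean(text):
    """Lowercase the text, then strip URLs, mentions, hashtags and symbols."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


sample = "@user Happy men's day though! #weareequal https://t.co/abc \U0001F600"
print(quick_clean(sample))
```

URLs are removed before hashtags so that a `#` inside a link is not treated as a hashtag.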

---

The same cleaning procedure is applied to both languages. For Spanish tweets, lemmatization relies on a dedicated model from the [**spaCy library**](https://spacy.io/models/es), which provides language-specific pipelines.



In [18]:
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()
tagger_es = spacy.load("es_core_news_sm")

# This attempts to import a set of English and Spanish stopwords
try:
    STOPWORDS_EN = set(stopwords.words('english'))
    STOPWORDS_ES = set(stopwords.words('spanish'))

# If the stopwords resource is not found, it is downloaded.
except LookupError:
    nltk.download('stopwords')
    STOPWORDS_EN = set(stopwords.words('english'))
    STOPWORDS_ES = set(stopwords.words('spanish'))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
# This function maps POS tags to WordNet POS tag
# This is needed for using the WordNetLemmatizer()
def get_wordnet_key(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ

    elif pos_tag.startswith('V'):
        return wordnet.VERB

    elif pos_tag.startswith('N'):
        return wordnet.NOUN

    elif pos_tag.startswith('R'):
        return wordnet.ADV

    else:
        return wordnet.NOUN



# This function lemmatizes text using WordNet POS tagging
def lem_text(text, lang='english'):
    if lang=='english':
        tokens = word_tokenize(text, language=lang)
        tagged = pos_tag(tokens)
        words = [lemmatizer.lemmatize(token, get_wordnet_key(tag)) for token, tag in tagged]
    elif lang=='spanish':
        tagged = tagger_es(text)
        words = [token.lemma_ for token in tagged]
    return " ".join(words)


# To remove emojis
def strip_emoji(text):
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)


# To remove mentions and hashtags: words starting with '@' or '#' are dropped,
# while words containing one of them are cut at that character.
def strip_tags(text):
    entity_prefixes = ['@','#']
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                if entity_prefixes[0] in word:
                    idx = word.find(entity_prefixes[0])
                    words.append(word[:idx])
                elif entity_prefixes[1] in word:
                    idx = word.find(entity_prefixes[1])
                    words.append(word[:idx])
                else:
                    words.append(word)
    return ' '.join(words)


# To remove all links (URLs)
def remove_links(text):
    # Regular expression pattern to match URLs
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)


# To remove special characters and symbols
def special_ch_sym(text):
    RE_ch_sym = re.compile(u'[^a-z A-Z 0-9 À-ú]')
    return RE_ch_sym.sub(r' ',text)


# To remove "br" characters
def replace_br(text):
    return text.replace('<br>', ' ')


# To remove stopwords
def remove_stopwords(text, stopwords):
    return ' '.join([x for x in text.split() if x and x not in stopwords])


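# All the functions are then applied to the text, in sequence.
# Only the stopword list and the lemmatizer depend on the language.
# Note: replace_br runs after special_ch_sym, which has already replaced '<' and '>'
# with spaces, so literal "<br>" tags are no longer present at that point.
def _clean_text(text, stopword_set, lang):
    text = text.lower().strip()
    text = strip_emoji(text)
    text = remove_links(text)
    text = strip_tags(text)
    text = special_ch_sym(text)
    text = replace_br(text)
    text = remove_stopwords(text, stopword_set)
    return lem_text(text, lang)

def text_cleaning_en(text):
    return _clean_text(text, STOPWORDS_EN, 'english')

def text_cleaning_es(text):
    return _clean_text(text, STOPWORDS_ES, 'spanish')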

In [20]:
# df_train_T1 -> It contains the original tweets
# df_train_T2 -> It contains the cleaned tweets

df_train_T2 = df_train_T1.copy()
df_train_T2['tweet'] = df_train_T2['tweet'].apply(text_cleaning_en)

df_test_T2 = df_test_T1.copy()
df_test_T2['tweet'] = df_test_T2['tweet'].apply(text_cleaning_en)

df_val_T2 = df_val_T1.copy()
df_val_T2['tweet'] = df_val_T2['tweet'].apply(text_cleaning_en)

df_train_T2.head()

| id_EXIST | lang | tweet | hard_labels_task1 |
|---|---|---|---|
| 200002 | en | write uni essay local pub coffee random old ma... | 1 |
| 200003 | en | 2021 1921 dont appreciate two ride team member... | 1 |
| 200006 | en | accord customer plenty time go spent stirling ... | 1 |
| 200007 | en | bloke drink beer sorry bloke drink wine appare... | 1 |
| 200008 | en | new shelf week look forward read book | 0 |


In [21]:
df_train_clean_es = df_train_es.copy()
df_train_clean_es['tweet'] = df_train_clean_es['tweet'].apply(text_cleaning_es)

df_val_clean_es = df_val_es.copy()
df_val_clean_es['tweet'] = df_val_clean_es['tweet'].apply(text_cleaning_es)

df_train_clean_es.head()

| id_EXIST | lang | tweet | hard_labels_task1 |
|---|---|---|---|
| 100001 | es | ignorar capullo problema youtuber denunciar ac... | 1 |
| 100002 | es | si comicsgate parecer gamergate pues bien acos... | 0 |
| 100003 | es | leer gamergate cambiado manera comunicar inter... | 0 |
| 100005 | es | entonces así mercado mejor hacer cambiar él se... | 1 |
| 100006 | es | aaah andrew dobson dedicar echar mierdo gamerg... | 0 |


# [Task 3 - 0.5 points] Text Encoding

To train a neural sexism classifier, you first need to encode text into numerical format.






### Instructions



* Embed words using **GloVe embeddings**.

* You are **free** to pick any embedding dimension.








### Note : What about OOV tokens?

   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.

   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.

   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)




### More about OOV



For a given token:



* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).

* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.



Your vocabulary **should**:



* Contain all tokens in train set; or

* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!
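
The OOV rule above can be sketched as follows: the vocabulary is built from the training set only, and any token met later that is not in it falls back to the special `<unk>` id. The function names `build_vocab` and `encode`, the whitespace tokenization and the toy sentences are assumptions made for this illustration.

```python
def build_vocab(train_sentences):
    """Map every training token to an id; reserve 0 for padding, last id for unknowns."""
    vocab = {"<pad>": 0}
    for sentence in train_sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    vocab["<unk>"] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to token ids, using the <unk> id for unseen tokens."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in sentence.split()]


train = ["new shelf week", "look forward read book"]
vocab = build_vocab(train)

# "magazine" never appears in the training sentences, so it maps to <unk>
ids = encode("read new magazine", vocab)
print(ids)
```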

---

1. **Building a vocabulary**

In [22]:
def build_vocabulary(df, lang='english'):
    idx_to_word = OrderedDict()
    word_to_idx = OrderedDict()
    curr_idx = 0

    # "<pad>" is the first element of the dictionary
    word_to_idx['<pad>'] = curr_idx
    idx_to_word[curr_idx] = '<pad>'
    curr_idx += 1

    for sentence in df.tweet.values:
        tokens = word_tokenize(sentence, language=lang)
        for token in tokens:
            if token not in word_to_idx:
                word_to_idx[token] = curr_idx
                idx_to_word[curr_idx] = token
                curr_idx += 1

    # "<unk>" is the last element of the dictionary
    word_to_idx['<unk>'] = curr_idx
    idx_to_word[curr_idx] = '<unk>'

    word_listing = list(idx_to_word.values())

    return idx_to_word, word_to_idx, word_listing

In [23]:
idx_to_word_en, word_to_idx_en, word_listing_en = build_vocabulary(df_train_T2)
vocab_size_en = len(idx_to_word_en)

print('English Vocabulary\n')
print(f'Index -> Word vocabulary size: {len(idx_to_word_en)}')
print(f'Word -> Index vocabulary size: {len(word_to_idx_en)}')
print(f'Some words: {[(idx_to_word_en[idx], idx) for idx in np.arange(0, 10)]}')

English Vocabulary

Index -> Word vocabulary size: 9009
Word -> Index vocabulary size: 9009
Some words: [('<pad>', 0), ('write', 1), ('uni', 2), ('essay', 3), ('local', 4), ('pub', 5), ('coffee', 6), ('random', 7), ('old', 8), ('man', 9)]


In [24]:
idx_to_word_es, word_to_idx_es, word_listing_es = build_vocabulary(df_train_clean_es, 'spanish')
vocab_size_es = len(idx_to_word_es)

print('Spanish Vocabulary \n')
print(f'Index -> Word vocabulary size: {len(idx_to_word_es)}')
print(f'Word -> Index vocabulary size: {len(word_to_idx_es)}')
print(f'Some words: {[(idx_to_word_es[idx], idx) for idx in np.arange(0, 10)]}')

Spanish Vocabulary 

Index -> Word vocabulary size: 11587
Word -> Index vocabulary size: 11587
Some words: [('<pad>', 0), ('ignorar', 1), ('capullo', 2), ('problema', 3), ('youtuber', 4), ('denunciar', 5), ('acoso', 6), ('afectar', 7), ('gente', 8), ('izquierdo', 9)]


In [None]:
def evaluate_vocabulary(idx_to_word, word_to_idx, word_listing, df, check_default_size: bool = False):

    print("  [Vocabulary Evaluation] Size checking...")
    assert len(idx_to_word) == len(word_to_idx)
    assert len(idx_to_word) == len(word_listing)

    print("  [Vocabulary Evaluation] Content checking...")
    for i in range(0, len(idx_to_word)):
        assert idx_to_word[i] in word_to_idx
        assert word_to_idx[idx_to_word[i]] == i

    print("  [Vocabulary Evaluation] Consistency checking...")
    _, _, first_word_listing = build_vocabulary(df)
    _, _, second_word_listing = build_vocabulary(df)
    assert first_word_listing == second_word_listing

In [None]:
print("Vocabulary evaluation...")
evaluate_vocabulary(idx_to_word_en, word_to_idx_en, word_listing_en, df_train_T2)
print("\nEvaluation completed!")

Vocabulary evaluation...
  [Vocabulary Evaluation] Size checking...
  [Vocabulary Evaluation] Content checking...
  [Vocabulary Evaluation] Consistency checking...

Evaluation completed!


In [None]:
print("Vocabulary evaluation...")
evaluate_vocabulary(idx_to_word_es, word_to_idx_es, word_listing_es, df_train_clean_es)
print("\nEvaluation completed!")

Vocabulary evaluation...
  [Vocabulary Evaluation] Size checking...
  [Vocabulary Evaluation] Content checking...
  [Vocabulary Evaluation] Consistency checking...

Evaluation completed!


2. **Embedding**

For the English tweets, a GloVe model pretrained on **Twitter** data and distributed through the `gensim` downloader is used [*(glove-twitter-100)*](https://github.com/piskvorky/gensim-data). Fewer pretrained models are available for Spanish than for English, although several valid alternatives exist. In this case a FastText model with embedding dimension 100, trained on the **Spanish Unannotated Corpora (SUC)**, is used [*(spanish-word-embeddings)*](https://github.com/dccuchile/spanish-word-embeddings/blob/master/emb-from-suc.md).

In [25]:
embedding_dimension = 100
glove_model = gloader.load("glove-twitter-{}".format(embedding_dimension))
spanish_model = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format("spanish_embed.vec")



3. **Out of vocabulary (OOV) words**

In [26]:
# This returns the list of vocabulary words that are missing from the given embedding model
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors, word_listing):

    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)

    return list(oov)

In [27]:
oov_terms_glove = check_OOV_terms(glove_model, word_listing_en)
oov_percentage_glove = float(len(oov_terms_glove)) * 100 / len(word_listing_en)

print(f"OOV terms (English GloVe): {len(oov_terms_glove)} ({oov_percentage_glove:.2f}%)")

OOV terms (English GloVe): 1046 (11.61%)


In [28]:
oov_terms_es = check_OOV_terms(spanish_model, word_listing_es)
oov_percentage_es = float(len(oov_terms_es)) * 100 / len(word_listing_es)

print(f"OOV terms (Spanish FastText): {len(oov_terms_es)} ({oov_percentage_es:.2f}%)")

OOV terms (Spanish FastText): 1633 (14.09%)


4. **Building the Embedding Matrix**

In [29]:
def build_embedding_matrix(embedding_model, embedding_dimension, word_to_idx, vocab_size):

    embedding_matrix = np.zeros((vocab_size, embedding_dimension), dtype=np.float32)
    for word, idx in word_to_idx.items():

        # For each word in the vocabulary, its pretrained vector is looked up in the embedding model
        try:
            embedding_vector = embedding_model[word]

        # If the word is not in the embedding model (OOV), a custom vector is assigned
        except (KeyError, TypeError):
            if word == "<unk>":
                # "<unk>" is mapped to a static vector of all zeros
                embedding_vector = np.zeros(embedding_dimension)

            else:
                # Every other OOV word is mapped to a small random vector
                embedding_vector = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)

        embedding_matrix[idx] = embedding_vector

    return embedding_matrix

An **embedding matrix** is built for each embedding model, starting from the *English* and *Spanish vocabularies* defined previously.
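
The same construction can be sketched on a toy vocabulary with plain Python lists. A dictionary stands in for the pretrained model, and `toy_embedding_matrix`, the three-dimensional vectors and the seed are assumptions for illustration only.

```python
import random


def toy_embedding_matrix(pretrained, word_to_idx, dim, seed=42):
    """Build one row per vocabulary entry: pretrained, zeros for <unk>, random otherwise."""
    rng = random.Random(seed)  # local generator, so the sketch is reproducible
    matrix = [[0.0] * dim for _ in range(len(word_to_idx))]
    for word, idx in word_to_idx.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        elif word == "<unk>":
            matrix[idx] = [0.0] * dim
        else:
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix


# Hypothetical stand-in for a pretrained model: only "book" is known
pretrained = {"book": [0.1, 0.2, 0.3]}
word_to_idx = {"<pad>": 0, "book": 1, "womening": 2, "<unk>": 3}

matrix = toy_embedding_matrix(pretrained, word_to_idx, dim=3)
for word, idx in word_to_idx.items():
    print(word, matrix[idx])
```

Row `i` of the matrix is the vector of the token with id `i`, which is what allows the matrix to initialise an Embedding layer directly.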

In [30]:
embed_matrix_glove = build_embedding_matrix(glove_model,
                                          embedding_dimension,
                                          word_to_idx_en,
                                          vocab_size_en)

print(f"\nEmbedding matrix shape: {embed_matrix_glove.shape}")


Embedding matrix shape: (9009, 100)


In [31]:
embed_matrix_spanish = build_embedding_matrix(spanish_model,
                                              embedding_dimension,
                                              word_to_idx_es,
                                              vocab_size_es)

print(f"\nEmbedding matrix shape: {embed_matrix_spanish.shape}")


Embedding matrix shape: (11587, 100)


## References:

- Bojanowski, Piotr, et al. "Enriching word vectors with subword information." *Transactions of the association for computational linguistics* 5 (2017): 135-146.

# [Task 4 - 1.0 points] Model definition



You are now tasked to define your sexism classifier.






### Instructions



* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.

* You are **free** to experiment with hyper-parameters to define the baseline model.



* **Model 1**: add an additional LSTM layer to the Baseline model.

### Token to embedding mapping



You can follow two approaches for encoding tokens in your classifier.



### Work directly with embeddings



- Compute the embedding of each input token

- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model



### Work with Embedding layer



- Encode input tokens to token ids

- Define an Embedding layer as the first layer of your model

- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)

- Initialize the Embedding layer with the computed embedding matrix

- You are **free** to set the Embedding layer trainable or not
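
Before token ids reach an Embedding layer, the sequences of a mini-batch have to share one length. The sketch below right-pads (or truncates) lists of ids with a padding id of 0 and derives the matching mask; `pad_to_length` and the toy ids are hypothetical names and values.

```python
def pad_to_length(sequences, max_len, pad_id=0):
    """Truncate or right-pad each list of token ids to exactly max_len entries."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]
        padded.append(seq + [pad_id] * (max_len - len(seq)))
    return padded


batch = pad_to_length([[6, 1, 8], [4, 5], [1, 2, 3, 4, 5]], max_len=4)

# True marks real tokens, False marks padding positions
mask = [[token_id != 0 for token_id in row] for row in batch]

print(batch)
print(mask)
```

Reserving id 0 for padding only works if no real token uses that id.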

### Padding



Pay attention to padding tokens!



Your model **should not** be penalized on those tokens.



#### How to?



There are two main ways.



However, their implementation depends on the neural library you are using.



- Embedding layer

- Custom loss to compute average cross-entropy on non-padding tokens only



**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.
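
The second option can be sketched in plain Python: the cross-entropy is accumulated only where the mask marks a real token, then averaged over those positions. The function name `masked_mean_cross_entropy` and the toy probabilities are assumptions; an actual model would compute this with the tensors of its neural library.

```python
import math


def masked_mean_cross_entropy(probs, targets, mask):
    """Average -log p(target) over the positions where mask is True."""
    total, count = 0.0, 0
    for position_probs, target, keep in zip(probs, targets, mask):
        if keep:
            total += -math.log(position_probs[target])
            count += 1
    return total / count if count else 0.0


# Three positions, two classes; the last position is padding
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
targets = [0, 1, 0]
mask = [True, True, False]

loss = masked_mean_cross_entropy(probs, targets, mask)
print(round(loss, 4))
```

The padded position contributes nothing to the loss, whatever the model predicts there.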

---

In [32]:
# ----- Hyperparameters ----- #

hidden_dim = 64
num_classes = 2
max_len_en = max(df_train_T2['tweet'].apply(lambda x: len(word_tokenize(x))))
# The cleaned Spanish tweets are used here, consistently with the English case
max_len_es = max(df_train_clean_es['tweet'].apply(lambda x: len(word_tokenize(x, language='spanish'))))
seeds = [42, 347, 1337]
batch_size = 16
epochs = 10

In [None]:
@keras.saving.register_keras_serializable()

class Bidirectional_LSTM (tf.keras.Model):

    # input_dim: Lenght of the input tweets.
    # hidden_dim: Dimensionality of the hidden layers in the LSTMs.
    # num_layers: Number of stacked bidirectional LSTM layers. Default value is 1.
        # 1 --> Baseline LSTM
        # 2 --> Model 1 LSTM

    def __init__(self, input_dim, output_dim, hidden_dim, vocab_size, embedding_matrix, num_layers = 1, name=None, **kwargs):
        super(Bidirectional_LSTM, self).__init__(**kwargs)

        self.name = name
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.input_dim = input_dim
        self.vocab_size = vocab_size
        self.embedding_matrix = embedding_matrix

        # To map integer tokens to dense vectors
        self.embed_layer = Embedding(input_dim=self.vocab_size,
                                      output_dim=embedding_dimension,
                                      weights=[self.embedding_matrix],
                                      mask_zero=True,   # automatically masks padding tokens
                                      name='encoder_embedding')

        # First bidirectional LSTM layer - Baseline
        self.bidir_layer_1 = Bidirectional(LSTM(hidden_dim), backward_layer=LSTM(hidden_dim, go_backwards=True))
        self.bidir_layers_2 = []

        # Additional bidirectional LSTM layers - Model 1
        for i in range(num_layers-1):
            self.bidir_layers_2.append(Bidirectional(LSTM(hidden_dim, return_sequences=True),
                                                     backward_layer=LSTM(hidden_dim, go_backwards=True, return_sequences=True)))

        # Dense output layer
        self.dense_layer = Dense(output_dim, activation='softmax')



    def build(self, input_shape=None):
        # Call the model once on a dummy batch of token indices, shape (1, input_dim), to create its weights
        self.call(keras.ops.ones((1, self.input_dim), dtype="int32"))
        self.built = True



    # Forward pass of the model
    def call(self, input):
        x = self.embed_layer(input)
        for idx, layer in enumerate(self.bidir_layers_2):
            x = layer(x)
        x = self.bidir_layer_1(x)
        output = self.dense_layer(x)
        return output



    # Method for the serialization of the model
    def get_config(self):
        config = super().get_config()
        config.update({
            "input_dim" : self.input_dim,
            "output_dim" : self.output_dim,
            "hidden_dim" : self.hidden_dim,
            "num_layers" : self.num_layers,
            "name" : self.name,
        })
        return config

#### Visualisation of the number of parameters

In [None]:
baseline_LSTM = Bidirectional_LSTM(input_dim=max_len_en,
                                   output_dim=num_classes,
                                   hidden_dim=hidden_dim,
                                   vocab_size=vocab_size_en,
                                   embedding_matrix=embed_matrix_glove,
                                   num_layers=1,
                                   name="Baseline_LSTM")
baseline_LSTM.compile(optimizer=Nadam(ema_momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
baseline_LSTM.build()
baseline_LSTM.summary()

In [None]:
model_1_LSTM = Bidirectional_LSTM(input_dim=max_len_en,
                                  output_dim=num_classes,
                                  hidden_dim=hidden_dim,
                                  vocab_size=vocab_size_en,
                                  embedding_matrix=embed_matrix_glove,
                                  num_layers=2,
                                  name="Model_1_LSTM")
model_1_LSTM.compile(optimizer=Nadam(ema_momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model_1_LSTM.build()
model_1_LSTM.summary()

## References:

* Zeyer, Albert, et al. "A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition." *2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).* IEEE, 2017.

# [Task 5 - 1.0 points] Training and Evaluation



You are now tasked to train and evaluate the Baseline and Model 1.




### Instructions



* Train **all** models on the train set.

* Evaluate **all** models on the validation set.

* Compute metrics on the validation set.

* Pick **at least** three seeds for robust estimation.

* Pick the **best** performing model according to the observed validation set performance.

* Evaluate your models using macro F1-score.
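
The evaluation protocol above can be sketched in plain Python as follows. This is only an illustration of the arithmetic: `macro_f1` is a hand-written helper, and the labels and per-seed predictions are hypothetical values, not results from this notebook. The standard deviation is the population one, which is also what `np.std` computes by default.

```python
from statistics import mean, pstdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


# Hypothetical validation labels and predictions obtained with three seeds.
y_val = [0, 0, 0, 1, 1, 0]
predictions_by_seed = {
    42: [0, 0, 1, 1, 1, 0],
    347: [0, 0, 0, 1, 0, 0],
    1337: [0, 0, 0, 1, 1, 0],
}

f1_by_seed = {seed: macro_f1(y_val, pred) for seed, pred in predictions_by_seed.items()}
scores = list(f1_by_seed.values())
best_seed = max(f1_by_seed, key=f1_by_seed.get)

print("mean:", round(mean(scores), 4), "std:", round(pstdev(scores), 4))
print("best seed:", best_seed)
```

Reporting mean and standard deviation over the seeds describes the architecture, while the best seed identifies the single instance to carry forward.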

---

## Training

Both the `Baseline LSTM` and the `Model 1 LSTM` models have been trained and evaluated using three different random seeds. For each model, we picked the best instance (and its seed) according to the validation F1 score and then evaluated it on the _test set_, both per model and for the overall best model.

In [None]:
# Callback to compute the F1 score at the end of each epoch.

class F1ScoreCallback(Callback):

    # it takes the DataGenerator for validation data
    def __init__(self, validation_generator):
        super(F1ScoreCallback, self).__init__()

        self.validation_generator = validation_generator


    def on_epoch_end(self, epoch, logs=None):
        # Accumulate predictions and true labels across all validation batches
        all_predictions = []
        all_true_labels = []

        for i in range(len(self.validation_generator)):

            # Fetch the next batch of validation data
            val_inputs, val_labels = self.validation_generator[i]

            # Predict on the batch
            batch_predictions = np.argmax(self.model.predict(val_inputs, verbose=0), axis=-1)
            val_labels = np.argmax(val_labels, axis=1)

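            # Collect predictions and true labels as flat lists
            all_predictions.extend(batch_predictions)
            all_true_labels.extend(val_labels)

        # Compute the macro F1 score over the whole validation set
        f1 = f1_score(all_true_labels, all_predictions, average='macro')

        # Log the F1 score so that other callbacks (e.g. ModelCheckpoint) can monitor it
        if logs is not None:
            logs['f1_score'] = f1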

In [None]:
class DataGenerator(Sequence):

    def __init__(self, data, word_to_idx, word_listing, max_len, lang='english', batch_size=16, shuffle=True, seed=None):
        super().__init__()

        self.data = data
        self.tweet = data["tweet"].to_numpy()
        self.hard_labels_task1 = data["hard_labels_task1"]
        self.word_to_idx = word_to_idx
        self.word_listing = word_listing
        self.max_len = max_len
        self.lang = lang
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()
        self._prepare_data()



    # Number of batches in the dataset
    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))



    # It returns a batch of data and its corresponding target labels
    def __getitem__(self, index):

        indexes = self.indexes[index*self.batch_size: (index+1)*self.batch_size]
        data_batch = np.array([self.tweet[k] for k in indexes])
        labels = self.hard_labels_task1.to_numpy()  # positional access, converted once per batch
        target_batch = np.array([[1, 0] if labels[k] == 0 else [0, 1] for k in indexes])

        return (data_batch, target_batch)



    # It resets data indexes for shuffling at the end of each epoch
    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)



    # It preprocesses tweet data for model input.
    def _prepare_data(self):

        # Tweets are first tokenized (once), truncated to "max_len" tokens and then padded with '<pad>' up to "max_len"
        tokenized = [word_tokenize(sentence, self.lang)[:self.max_len] for sentence in self.tweet]
        self.tweet = [tokens + ['<pad>'] * (self.max_len - len(tokens)) for tokens in tokenized]

        # Words are converted to their corresponding indices using `word_to_idx`, and unknown words are replaced with the '<unk>' index.
        self.tweet = [[self.word_to_idx[word] if word in self.word_listing else self.word_to_idx["<unk>"] for word in sentence] for sentence in self.tweet]
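
The preprocessing performed by the generator can be illustrated on a toy example. The sketch below is a minimal, library-free version of the same idea; the helper `encode_and_pad` and the vocabulary `toy_vocab` are hypothetical, and reserving index 0 for padding is an assumption made so that the padding can be masked by an embedding layer.

```python
def encode_and_pad(tokens, word_to_idx, max_len, pad_token="<pad>", unk_token="<unk>"):
    """Map tokens to indices, truncate to max_len and right-pad with the padding index."""
    unk_idx = word_to_idx[unk_token]
    ids = [word_to_idx.get(token, unk_idx) for token in tokens[:max_len]]
    return ids + [word_to_idx[pad_token]] * (max_len - len(ids))


# Hypothetical toy vocabulary; index 0 is reserved for padding.
toy_vocab = {"<pad>": 0, "<unk>": 1, "happy": 2, "day": 3}

short_ids = encode_and_pad(["happy", "men", "day"], toy_vocab, max_len=5)
long_ids = encode_and_pad(["happy", "day", "happy", "day"], toy_vocab, max_len=3)

print(short_ids)  # [2, 1, 3, 0, 0]
print(long_ids)   # [2, 3, 2]
```

Every output has exactly `max_len` entries, so batches can be stacked into a rectangular array regardless of the original tweet lengths.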

### Trainings for English models

In [None]:
GL_baseline_f1_scores = []
GL_baseline_urls = []
GL_model_1_f1_scores = []
GL_model_1_urls = []

for seed in seeds:

    # Seed initialization
    tf.random.set_seed(seed)
    np.random.seed(seed)

    # Data generation
    train_gen = DataGenerator(df_train_T2, word_to_idx_en, word_listing_en, max_len_en, batch_size=batch_size, shuffle=True, seed=seed)
    validation_gen = DataGenerator(df_val_T2, word_to_idx_en, word_listing_en, max_len_en, batch_size=1, shuffle=False, seed=seed)



    # ----- BASELINE LSTM ----- #
    baseline_LSTM = Bidirectional_LSTM(input_dim=max_len_en,
                                       output_dim=num_classes,
                                       hidden_dim=hidden_dim,
                                       vocab_size=vocab_size_en,
                                       embedding_matrix=embed_matrix_glove,
                                       num_layers=1,
                                       name=f"Baseline_LSTM_{seed}")
    baseline_LSTM.compile(optimizer=Nadam(ema_momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])

    # Callbacks definition
    f1_call = F1ScoreCallback(validation_gen)
    model_check = ModelCheckpoint(f'./GloVe/baseline_instances/baseline_{seed}.keras', monitor='f1_score', mode='max', verbose=0, save_best_only=True)

    # Training
    print(f'\nTraining model {baseline_LSTM.name}...')
    # The batch size is defined by the generator, so it is not passed to fit
    baseline_LSTM.fit(train_gen, validation_data=validation_gen, epochs=epochs, callbacks=[f1_call, model_check], verbose=0)
    GL_baseline_urls.append(f'./GloVe/baseline_instances/baseline_{seed}.keras')

    # Macro F1 score computation on the validation set (best checkpoint)
    baseline_LSTM = keras.saving.load_model(f'./GloVe/baseline_instances/baseline_{seed}.keras')
    pred = baseline_LSTM.predict(validation_gen, verbose=0)
    f1 = f1_score(df_val_T2['hard_labels_task1'].to_list(), np.argmax(pred, axis=-1), average='macro')
    GL_baseline_f1_scores.append(f1)
    print('   [COMPLETE]')



    # ----- MODEL 1 LSTM ----- #
    model_1_LSTM = Bidirectional_LSTM(input_dim=max_len_en,
                                     output_dim=num_classes,
                                     hidden_dim=hidden_dim,
                                     vocab_size=vocab_size_en,
                                     embedding_matrix=embed_matrix_glove,
                                     num_layers=2,
                                     name=f"Model_1_LSTM_{seed}")
    model_1_LSTM.compile(optimizer=Nadam(ema_momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])

    # Callbacks definition
    f1_call = F1ScoreCallback(validation_gen)
    model_check = ModelCheckpoint(f'./GloVe/model_1_instances/model_1_{seed}.keras', monitor='f1_score', mode='max', verbose=0, save_best_only=True)

    # Training
    print(f'\nTraining model {model_1_LSTM.name}...')
    # The batch size is defined by the generator, so it is not passed to fit
    model_1_LSTM.fit(train_gen, validation_data=validation_gen, epochs=epochs, callbacks=[f1_call, model_check], verbose=0)
    GL_model_1_urls.append(f'./GloVe/model_1_instances/model_1_{seed}.keras')

    # Macro F1 score computation on the validation set (best checkpoint)
    model_1_LSTM = keras.saving.load_model(f'./GloVe/model_1_instances/model_1_{seed}.keras')
    pred = model_1_LSTM.predict(validation_gen, verbose=0)
    f1 = f1_score(df_val_T2['hard_labels_task1'].to_list(), np.argmax(pred, axis=-1), average='macro')
    GL_model_1_f1_scores.append(f1)
    print('   [COMPLETE]')


Training model Baseline_LSTM_42...


ValueError: 'Baseline_LSTM_42(English)_Baseline_LSTM_42(English)_encoder_embedding_embeddings_momentum' is not a valid scope name. A scope name has to match the following pattern: ^[A-Za-z0-9_.\\/>-]*$

### Trainings for Spanish models

In [None]:
################################
# -- SPANISH TRAININGS HERE -- #
################################

## Evaluation

### Baseline performances evaluation (English)

In [None]:
average = np.average(GL_baseline_f1_scores)
std = np.std(GL_baseline_f1_scores)

idx_base = np.argmax(GL_baseline_f1_scores)
best_f1_base = GL_baseline_f1_scores[idx_base]
best_url_base = GL_baseline_urls[idx_base]
best_seed_base = seeds[idx_base]

print("Baseline LSTM average performances on the validation set")
print("  - Mean:", average)
print("  - Standard deviation:", std)


# Test set performances
baseline_LSTM = keras.saving.load_model(best_url_base)
test_gen = DataGenerator(df_test_T2, word_to_idx_en, word_listing_en, max_len_en, batch_size=1, shuffle=False, seed=best_seed_base)
pred_base = baseline_LSTM.predict(test_gen, verbose=0)
f1_base = f1_score(df_test_T2['hard_labels_task1'].to_list(), np.argmax(pred_base, axis=-1), average='macro')

print('\nBaseline LSTM test performances')
print('  - Seed:', best_seed_base)
print('  - F1 score =', f1_base)

### Model 1 performances evaluation (English)

In [None]:
average = np.average(GL_model_1_f1_scores)
std = np.std(GL_model_1_f1_scores)

idx_model_1 = np.argmax(GL_model_1_f1_scores)
best_f1_model_1 = GL_model_1_f1_scores[idx_model_1]
best_url_model_1 = GL_model_1_urls[idx_model_1]
best_seed_model_1 = seeds[idx_model_1]

print("Model 1 LSTM average performances on the validation set")
print("  - Mean:", average)
print("  - Standard deviation:", std)


# Test set performances
model_1_LSTM = keras.saving.load_model(best_url_model_1)
test_gen = DataGenerator(df_test_T2, word_to_idx_en, word_listing_en, max_len_en, lang='english', batch_size=1, shuffle=False, seed=best_seed_model_1)
pred_model_1 = model_1_LSTM.predict(test_gen, verbose=0)
f1_model_1 = f1_score(df_test_T2['hard_labels_task1'].to_list(), np.argmax(pred_model_1, axis=-1), average='macro')

print('\nModel 1 LSTM test performances')
print('  - Seed:', best_seed_model_1)
print('  - F1 score =', f1_model_1)

### Best model evaluation (English)

In [None]:
if best_f1_base > best_f1_model_1:
    best_LSTM = baseline_LSTM
    best_f1_val = best_f1_base
    best_seed = best_seed_base
else:
    best_LSTM = model_1_LSTM
    best_f1_val = best_f1_model_1
    best_seed = best_seed_model_1


test_gen = DataGenerator(df_test_T2, word_to_idx_en, word_listing_en, max_len_en, batch_size=1, shuffle=False, seed=best_seed)
pred_LSTM = best_LSTM.predict(test_gen, verbose=0)
f1_LSTM = f1_score(df_test_T2['hard_labels_task1'].to_list(), np.argmax(pred_LSTM, axis=-1), average='macro')

print('\nBest Model (GloVe):', best_LSTM.name)
print('  - Seed:', best_seed)
print('  - Val F1 score =', best_f1_val)
print('  - F1 score =', f1_LSTM)

In [None]:
df_test_T5 = df_test_T2.copy()
df_test_T5['predictions'] = np.argmax(pred_LSTM, axis=-1)
df_test_T5.to_csv("df_test_LSTM.csv")

### Best model evaluation (Spanish)

In [None]:
############################
# -- SPANISH EVALUATION -- #
############################

# [Task 6 - 1.0 points] Transformers



In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).






### Relevant Material

- Tutorial 3

### Instructions

1. **Load the Tokenizer and Model**



2. **Preprocess the Dataset**:

   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.



   **Note**: You have to use the plain text of the dataset and not the version that you tokenized before, as you need to tokenize the cleaned text obtained after the initial cleaning process.



3. **Train the Model**:

   Use the `Trainer` to train the model on your training data.



4. **Evaluate the Model on the Test Set** using F1-macro.
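
Regarding the formatting mentioned in step 2: tokenized tweets have different lengths, so each batch is usually padded to its longest sequence and paired with an attention mask. The plain-Python sketch below illustrates that idea only. The helper `pad_batch`, the token ids and the padding id are hypothetical, and this is not the code of the Hugging Face collator.

```python
def pad_batch(batch_input_ids, pad_id=1):
    """Right-pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized tweets of different lengths.
batch = pad_batch([[0, 713, 16, 2], [0, 2]])

print(batch["input_ids"])       # [[0, 713, 16, 2], [0, 2, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, and the attention mask tells the model which positions to ignore.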

---

1. **Load the Tokenizer and Model**

In [None]:
model_card_en = 'cardiffnlp/twitter-roberta-base-hate'
tokenizer_en = AutoTokenizer.from_pretrained(model_card_en)
model_en = AutoModelForSequenceClassification.from_pretrained(model_card_en,
                                                              num_labels=num_classes,
                                                              id2label={0: 'NEG', 1: 'POS'},
                                                              label2id={'NEG': 0, 'POS': 1})
data_collator_en = DataCollatorWithPadding(tokenizer=tokenizer_en)

In [None]:
model_card_es = 'cardiffnlp/twitter-xlm-roberta-base-hate-spanish'
tokenizer_es = AutoTokenizer.from_pretrained(model_card_es)
model_es = AutoModelForSequenceClassification.from_pretrained(model_card_es,
                                                              num_labels=num_classes,
                                                              id2label={0: 'NEG', 1: 'POS'},
                                                              label2id={'NEG': 0, 'POS': 1})
data_collator_es = DataCollatorWithPadding(tokenizer=tokenizer_es)

2. **Preprocess the Dataset**:

In [None]:
# df_train_T2 -> It contains the English tweets after the cleaning step of Task 2
# df_train_clean_es -> It contains the Spanish tweets after the cleaning step of Task 2

train_data_en = Dataset.from_pandas(df_train_T2)
val_data_en = Dataset.from_pandas(df_val_T2)
test_data_en = Dataset.from_pandas(df_test_T2)

train_data_es = Dataset.from_pandas(df_train_clean_es)
val_data_es = Dataset.from_pandas(df_val_clean_es)



# Data are preprocessed through a tokenizer
def preprocess_text_en(texts):
    return tokenizer_en(texts['tweet'], truncation=True)

def preprocess_text_es(texts):
    return tokenizer_es(texts['tweet'], truncation=True)

# This applies the preprocessing function to the training, validation and test data in batches (batched=True)
train_data_en = train_data_en.map(preprocess_text_en, batched=True)
val_data_en = val_data_en.map(preprocess_text_en, batched=True)
test_data_en = test_data_en.map(preprocess_text_en, batched=True)

train_data_es = train_data_es.map(preprocess_text_es, batched=True)
val_data_es = val_data_es.map(preprocess_text_es, batched=True)



train_data_en = train_data_en.rename_column('hard_labels_task1', 'label')
val_data_en = val_data_en.rename_column('hard_labels_task1', 'label')
test_data_en = test_data_en.rename_column('hard_labels_task1', 'label')

train_data_es = train_data_es.rename_column('hard_labels_task1', 'label')
val_data_es = val_data_es.rename_column('hard_labels_task1', 'label')

Map:   0%|          | 0/2870 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/158 [00:00<?, ? examples/s]

Map:   0%|          | 0/286 [00:00<?, ? examples/s]

Map:   0%|          | 0/3194 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/490 [00:00<?, ? examples/s]

In [None]:
acc_metric = evaluate.load('accuracy')

# This function computes accuracy and F1 metrics
def compute_metrics(output_info):

    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)
    f1 = f1_score(labels, predictions, average="macro")
    acc = acc_metric.compute(predictions=predictions, references=labels)

    return {"f1": f1,**acc}

In [None]:
training_args = TrainingArguments(

    output_dir="test_dir",
    logging_first_step=True,
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    adam_epsilon=1e-8,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to='none',
    metric_for_best_model="f1",
    greater_is_better=True

)

#### English dataset training

In [None]:
trainer_en = Trainer(

    model=model_en,
    args=training_args,
    train_dataset=train_data_en,
    eval_dataset=val_data_en,
    processing_class=tokenizer_en,
    data_collator=data_collator_en,
    compute_metrics=compute_metrics,

)

In [None]:
trainer_en.train()

| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 1 | 0.603 | 0.417969 | 0.835437 | 0.841772 |
| 2 | 0.603 | 0.389795 | 0.829289 | 0.835443 |
| 3 | 0.3899 | 0.411744 | 0.849355 | 0.85443 |
| 4 | 0.3899 | 0.465224 | 0.828404 | 0.835443 |
| 5 | 0.3899 | 0.540898 | 0.819199 | 0.829114 |
| 6 | 0.2572 | 0.565269 | 0.821319 | 0.829114 |
| 7 | 0.2572 | 0.683183 | 0.824174 | 0.835443 |
| 8 | 0.2572 | 0.721598 | 0.82534 | 0.835443 |
| 9 | 0.1788 | 0.704318 | 0.833607 | 0.841772 |
| 10 | 0.1788 | 0.719538 | 0.839784 | 0.848101 |


TrainOutput(global_step=1800, training_loss=0.2536724606818623, metrics={'train_runtime': 508.182, 'train_samples_per_second': 56.476, 'train_steps_per_second': 3.542, 'total_flos': 542326357307640.0, 'train_loss': 0.2536724606818623, 'epoch': 10.0})

4. **Evaluate the Model on the Test Set** using F1-macro.

In [None]:
val_prediction_info = trainer_en.predict(val_data_en)
val_metrics_en = compute_metrics([val_prediction_info.predictions, val_prediction_info.label_ids])
print("Evaluation metrics (Val):\n", val_metrics_en, '\n')

test_prediction_info = trainer_en.predict(test_data_en)
test_metrics_en = compute_metrics([test_prediction_info.predictions, test_prediction_info.label_ids])
print("Evaluation metrics (Test):\n", test_metrics_en)

Evaluation metrics (Val):
 {'f1': 0.8493553869750861, 'accuracy': 0.8544303797468354} 



Evaluation metrics (Test):
 {'f1': 0.8132584297347575, 'accuracy': 0.8146853146853147}


In [None]:
df_test_T6 = df_test_T2.copy()
df_test_T6['predictions'] = np.argmax(test_prediction_info.predictions, axis=-1)
df_test_T6.to_csv("df_test_Transformer.csv")

#### Spanish dataset training

In [None]:
trainer_es = Trainer(

    model=model_es,
    args=training_args,
    train_dataset=train_data_es,
    eval_dataset=val_data_es,
    processing_class=tokenizer_es,
    data_collator=data_collator_es,
    compute_metrics=compute_metrics,

)

In [None]:
trainer_es.train()

| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 1 | 1.3563 | 0.445474 | 0.800681 | 0.804082 |
| 2 | 1.3563 | 0.45389 | 0.80307 | 0.806122 |
| 3 | 0.4166 | 0.534807 | 0.804886 | 0.806122 |
| 4 | 0.4166 | 0.589545 | 0.804322 | 0.806122 |
| 5 | 0.2172 | 0.660414 | 0.81602 | 0.816327 |
| 6 | 0.2172 | 0.754679 | 0.814006 | 0.814286 |
| 7 | 0.2172 | 0.877563 | 0.821538 | 0.822449 |
| 8 | 0.1039 | 0.965715 | 0.813801 | 0.814286 |
| 9 | 0.1039 | 1.045527 | 0.809709 | 0.810204 |
| 10 | 0.077 | 1.068711 | 0.817637 | 0.818367 |


TrainOutput(global_step=2000, training_loss=0.20415427029132843, metrics={'train_runtime': 952.1618, 'train_samples_per_second': 33.545, 'train_steps_per_second': 2.1, 'total_flos': 700558879335000.0, 'train_loss': 0.20415427029132843, 'epoch': 10.0})

In [None]:
val_prediction_info = trainer_es.predict(val_data_es)
val_metrics_es = compute_metrics([val_prediction_info.predictions, val_prediction_info.label_ids])
print("Spanish model evaluation metrics (Val):\n", val_metrics_es)

Spanish model evaluation metrics (Val):
 {'f1': 0.8215384615384616, 'accuracy': 0.8224489795918367}


# [Task 7 - 0.5 points] Error Analysis



### Instructions



After evaluating the model, perform a brief error analysis:



 - Review the results and identify common errors.



 - Summarize your findings regarding the errors and their impact on performance (e.g. but not limited to Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer...)

 - Suggest possible solutions to address the identified errors.




---

## Class Imbalance
One key challenge in this task is the **class imbalance** present in the training set. The majority of tweets are labeled as _"non-sexist" (0)_, which introduces a bias in the inference phase. This bias skews the model's predictions toward the majority class, as demonstrated by the class distributions shown below. Such imbalance can affect the model's ability to correctly identify the minority class _"sexist" (1)_, which is critical in this application.

The imbalance is further highlighted by differences in class distributions between the training and test datasets. These disparities can cause the model to generalize poorly, as it encounters a different data distribution during evaluation.

- **TRAINING SET**: labels' distribution of _sexist_ (**1**) and _non-sexist_ (**0**) tweets.

In [None]:
df_train_T2['hard_labels_task1'].value_counts(normalize=True)

- **TEST SET**: labels' distribution of _sexist_ (**1**) and _non-sexist_ (**0**) tweets.

In [None]:
df_test_T2['hard_labels_task1'].value_counts(normalize=True)
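
One possible way to counteract the imbalance, sketched here under stated assumptions, is to weight each class inversely to its frequency during training. The helper `balanced_class_weights` and the toy label list are hypothetical. The formula `n_samples / (n_classes * class_count)` is a common heuristic, not something used elsewhere in this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label list with a 3:1 imbalance towards the non-sexist class (0).
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
weights = balanced_class_weights(toy_labels)

print(weights)
```

A dictionary of this form could, for instance, be passed as the `class_weight` argument of the Keras `fit` method, so that errors on the minority class count more in the loss.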

## Labeling evaluation
To better understand the factors influencing the performance of the model, the following quantities have been analysed and grouped:

- **Tweet Length**
- **Number of Unknown Tokens**
- **Number of Out-Of-Vocabulary (OOV) Terms**

The related values are compared between _correct_ and _wrong predictions_ to assess their impact on the labeling accuracy. The analysis is conducted on both the best LSTM model and the Transformer.

In [None]:
df_err_LSTM = pd.read_csv("df_test_LSTM.csv", index_col="id_EXIST")
df_err_LSTM['tweet_length'] = df_err_LSTM['tweet'].apply(lambda x: len(word_tokenize(x)))
df_err_LSTM["unknowns_number"] = df_err_LSTM['tweet'].apply(lambda x: len([word for word in word_tokenize(x) if word not in word_listing]))
df_err_LSTM["oov_terms_number"] = df_err_LSTM['tweet'].apply(lambda x: len([word for word in word_tokenize(x) if word in oov_terms]))

In [None]:
df_correct_LSTM = df_err_LSTM[df_err_LSTM['hard_labels_task1'] == df_err_LSTM['predictions']]
df_errors_LSTM = df_err_LSTM[df_err_LSTM['hard_labels_task1'] != df_err_LSTM['predictions']]

In [None]:
df_correct_LSTM.describe()

In [None]:
df_errors_LSTM.describe()

In [None]:
len_correct = df_correct_LSTM.shape[0]
len_errors = df_errors_LSTM.shape[0]

fig, axes = plt.subplots(3, 1, figsize=(8, 12))
axes[0].hist(df_correct_LSTM['tweet_length'], bins=range(max_len), label=f'correct - mean: {np.mean(df_correct_LSTM.tweet_length):.2f}')
axes[0].hist(df_errors_LSTM['tweet_length'], bins=range(max_len), label=f'errors - mean: {np.mean(df_errors_LSTM.tweet_length):.2f}')
axes[0].set_title('Tweet length')
axes[0].legend()

bin_n = max(len(df_correct_LSTM['unknowns_number'].unique()), len(df_errors_LSTM['unknowns_number'].unique()))
axes[1].hist(df_correct_LSTM['unknowns_number'], bins=range(bin_n), label=f'correct - mean: {np.mean(df_correct_LSTM.unknowns_number):.2f}')
axes[1].hist(df_errors_LSTM['unknowns_number'], bins=range(bin_n), label=f'errors - mean: {np.mean(df_errors_LSTM.unknowns_number):.2f}')
axes[1].set_title('Unknowns number')
axes[1].legend()

bin_n = max(len(df_correct_LSTM['oov_terms_number'].unique()), len(df_errors_LSTM['oov_terms_number'].unique()))
axes[2].hist(df_correct_LSTM['oov_terms_number'], bins=range(bin_n), label=f'correct - mean: {np.mean(df_correct_LSTM.oov_terms_number):.2f}')
axes[2].hist(df_errors_LSTM['oov_terms_number'], bins=range(bin_n), label=f'errors - mean: {np.mean(df_errors_LSTM.oov_terms_number):.2f}')
axes[2].set_title('OOV terms number')
axes[2].legend()

fig.show()

The following can be inferred for the LSTM:

- **The LSTM classifier appears to handle Unknown and OOV terms relatively well**, as these factors do not drastically affect the model performance in general. On the other hand, **the tweet length seems to play a bigger role in separating correct from incorrect predictions.**

- Tweets that are classified correctly tend to be slightly longer on average (mean: 14.95) than incorrectly classified tweets (mean: 13.06). In addition, looking at the distributions, shorter tweets are more frequent among the errors, suggesting that shorter tweets might be harder for the classifier to handle. <br>
One reason might be that **the model relies on the contextual information present in longer tweets for correct predictions, whereas shorter ones may lack sufficient information, leading to errors**.

In [None]:
df_err_transformer = pd.read_csv("df_test_Transformer.csv", index_col="id_EXIST")
df_err_transformer['tweet_length'] = df_err_transformer['tweet'].apply(lambda x: len(word_tokenize(x)))
df_err_transformer["unknowns_number"] = df_err_transformer['tweet'].apply(lambda x: len([word for word in word_tokenize(x) if word not in word_listing]))
df_err_transformer["oov_terms_number"] = df_err_transformer['tweet'].apply(lambda x: len([word for word in word_tokenize(x) if word in oov_terms]))

In [None]:
df_correct_transformer = df_err_transformer[df_err_transformer['hard_labels_task1'] == df_err_transformer['predictions']]
df_errors_transformer = df_err_transformer[df_err_transformer['hard_labels_task1'] != df_err_transformer['predictions']]

In [None]:
df_correct_transformer.describe()

In [None]:
df_errors_transformer.describe()

In [None]:
len_correct = df_correct_transformer.shape[0]
len_errors = df_errors_transformer.shape[0]

fig, axes = plt.subplots(3, 1, figsize=(8, 12))
axes[0].hist(df_correct_transformer['tweet_length'], bins=range(max_len), label=f'correct - mean: {np.mean(df_correct_transformer.tweet_length):.2f}')
axes[0].hist(df_errors_transformer['tweet_length'], bins=range(max_len), label=f'errors - mean: {np.mean(df_errors_transformer.tweet_length):.2f}')
axes[0].set_title('Tweet length')
axes[0].legend()

bin_n = max(len(df_correct_transformer['unknowns_number'].unique()), len(df_errors_transformer['unknowns_number'].unique()))
axes[1].hist(df_correct_transformer['unknowns_number'], bins=range(bin_n), label=f'correct - mean: {np.mean(df_correct_transformer.unknowns_number):.2f}')
axes[1].hist(df_errors_transformer['unknowns_number'], bins=range(bin_n), label=f'errors - mean: {np.mean(df_errors_transformer.unknowns_number):.2f}')
axes[1].set_title('Unknowns number')
axes[1].legend()

bin_n = max(len(df_correct_transformer['oov_terms_number'].unique()), len(df_errors_transformer['oov_terms_number'].unique()))
axes[2].hist(df_correct_transformer['oov_terms_number'], bins=range(bin_n), label=f'correct - mean: {np.mean(df_correct_transformer.oov_terms_number):.2f}')
axes[2].hist(df_errors_transformer['oov_terms_number'], bins=range(bin_n), label=f'errors - mean: {np.mean(df_errors_transformer.oov_terms_number):.2f}')
axes[2].set_title('OOV terms number')
axes[2].legend()

fig.show()

The analysis for the Transformer is largely the same as for the LSTM: the model suffers from the same problem on shorter tweets. However, the performance is generally better, suggesting that **the attention mechanism provides better generalisation across all lengths**, mitigating the previous problem. This is also visible in the F1 scores: on the validation set the Transformer is only slightly better than the LSTM, whereas on the test set the RoBERTa architecture shows a considerable improvement.

## Word Frequency Analysis

The aim of this section is to provide a specific view of word occurrences in the tweets. In general, **_Word Frequency Analysis_** is about counting how often each word appears in a given collection of text data in order to identify keywords, common themes and anomalies.

In this case, the analysis takes into account the most frequent words in the dataset for all the examples predicted as 1 (sexist) and 0 (not sexist). A comparison between the most frequent words in both classes is provided, with a focus on the misclassified samples, in order to detect possible similarities. This helped to gain more valuable insights into the textual information and structure of the dataset. The objective is to highlight whether there is a predominance of words that drives the predictions towards one class label rather than the other.

The same type of analysis has been carried out for both the LSTM and the Transformer, starting with the best performing model, with the following results.

### Transformer

In [None]:
df_err_transformer[df_err_transformer["predictions"] == 1]["tweet"].str.split().explode().value_counts(normalize=True).head(10)

In [None]:
df_err_transformer[df_err_transformer["predictions"] == 0]["tweet"].str.split().explode().value_counts(normalize=True).head(10)

In [None]:
df_errors_transformer["tweet"][df_errors_transformer["predictions"] == 1].str.split().explode().value_counts(normalize=True).head(10)

It can be deduced that word occurrences differ between the two classes. In particular, the word distribution in the tweets classified as _non-sexist_ is quite uniform across tokens, as shown by the frequencies. On the other hand, for the _sexist_ samples, some words predominate, which is also reflected in the false positive predictions.

As a consequence, the main deduction of this analysis is that **the Transformer model tends to classify a tweet as _sexist_ whenever it finds gender-related words such as "women" and "men", or words such as "look", even if they are used in a _non-sexist_ context**.

### LSTM

In [None]:
df_err_LSTM[df_err_LSTM["predictions"] == 1]["tweet"].str.split().explode().value_counts(normalize=True).head(10)

In [None]:
df_err_LSTM[df_err_LSTM["predictions"] == 0]["tweet"].str.split().explode().value_counts(normalize=True).head(10)

In [None]:
df_errors_LSTM[df_errors_LSTM["predictions"] == 1]["tweet"].str.split().explode().value_counts(normalize=True).head(10)

The deduction made for the Transformer also holds for the LSTM, with the difference that in this case **the words under analysis are mostly "woman", "like" and "men"**.

## Confusion Matrices
Looking at the ***Confusion Matrices***, we can evaluate how well our binary classification models discriminate positive from negative tweets.
The following can be inferred from the data below:

* **True Positives**
  * The LSTM has some trouble classifying positive tweets correctly: it identifies only 64% of the _sexist_ tweets, against 83% for the Transformer.
* **False Negatives**
  * As a consequence, since the matrices are normalized over the true labels, the false negative rate of the LSTM is much higher than that of the Transformer (36% against 17%). Failing to identify a _sexist_ tweet could be a serious problem, which shows the need for some way to manage the issue and reduce this percentage.
* **True Negatives**
  * On the true negatives, the two models perform very similarly (84% for the LSTM, 80% for the Transformer). Since identifying a _sexist_ text is more important than avoiding the misclassification of a _non-sexist_ tweet, and given the class imbalance shown before, these values are not very encouraging for the LSTM.
* **False Positives**
  * Correspondingly, the false positive rates are 16% for the LSTM and 20% for the Transformer, which are fairly good results. If we suppose that a human supervisor is integrated into the overall system, this kind of misclassification might not be a major issue.
  
In general, the Transformer model seems to be more robust to the class imbalance and the implied biases, as its **False Negative** and **False Positive** percentages are closer to each other. On the other hand, given also the analysis of the word frequencies, the LSTM appears to be more affected by that bias. Further considerations can be made by looking at the **_Precision-Recall Curve_**.

Considering the difference between **False Negatives** and **False Positives**, and having previously seen the class imbalance in the training set, we could infer that the imbalance makes our model more inclined to assign a 0 label to a tweet.


In [None]:
ConfusionMatrixDisplay.from_predictions(df_err_LSTM['hard_labels_task1'].to_list(), df_err_LSTM['predictions'].to_list(), labels=[0,1], normalize="true");

In [None]:
ConfusionMatrixDisplay.from_predictions(df_err_transformer['hard_labels_task1'].to_list(), df_err_transformer['predictions'].to_list(), labels=[0,1], normalize="true");
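With `normalize="true"` every row of the matrix sums to 1, so the first row holds the true negative and false positive rates, and the second row holds the false negative and true positive rates. A minimal sketch on made-up labels (not the notebook's data) shows how the four rates can be read off numerically:

```python
from sklearn.metrics import confusion_matrix

# Made-up gold labels and predictions: 4 non-sexist (0) and 5 sexist (1) tweets.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes; each row is normalized to sum to 1.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")

tn_rate, fp_rate = cm[0]  # among truly non-sexist tweets
fn_rate, tp_rate = cm[1]  # among truly sexist tweets

print(f"TN rate={tn_rate:.2f}  FP rate={fp_rate:.2f}")
print(f"FN rate={fn_rate:.2f}  TP rate={tp_rate:.2f}")
```

Because of the row normalization, the true positive and false negative rates are complementary, and so are the true negative and false positive rates.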

## Precision-Recall Curves

In order to better understand the performance, strengths and weaknesses of each model, it is important to analyse the **_Precision-Recall Curves_** of the two models. They provide valuable insights into how the Transformer and LSTM models perform across varying thresholds, especially when balancing **precision** _(accuracy of positive predictions)_ and **recall** _(coverage of true positives)_.

Unlike ***accuracy*** or ***ROC Curves***, the ***Precision-Recall Curves*** avoid overemphasizing true negatives, making them more reliable in scenarios where the positive class is costly to misclassify. Since in our task misclassifying a _sexist_ tweet as _non-sexist_ might be a problem, using this curve is more reasonable.

In [None]:
fig, ax = plt.subplots()
PrecisionRecallDisplay.from_predictions(df_err_transformer['hard_labels_task1'], df_err_transformer['predictions'], name="Transformer", ax=ax)
PrecisionRecallDisplay.from_predictions(df_err_LSTM['hard_labels_task1'], df_err_LSTM['predictions'], name="LSTM", ax=ax, plot_chance_level=True)
ax.set_title("Precision-Recall Curve");
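A precision-recall curve is traced by sweeping a decision threshold over continuous scores; with hard 0/1 predictions there is only one operating point besides the trivial endpoints. The sketch below illustrates the difference on toy data. It assumes a per-tweet probability of the _sexist_ class is available: the `y_score` values are made up, and no such column is confirmed to exist in the DataFrames of this notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical P(sexist) per tweet
y_hard = (y_score >= 0.5).astype(int)      # hard labels at a fixed 0.5 threshold

# Continuous scores: one (precision, recall) point per distinct score value.
prec_s, rec_s, thr_s = precision_recall_curve(y_true, y_score)

# Hard labels: only two distinct "scores" (0 and 1), hence only two thresholds.
prec_h, rec_h, thr_h = precision_recall_curve(y_true, y_hard)

ap = average_precision_score(y_true, y_score)
print("thresholds from scores:     ", thr_s)
print("thresholds from hard labels:", thr_h)
print(f"average precision from scores: {ap:.3f}")
```

If probabilities are stored alongside the predictions, they can be passed to `PrecisionRecallDisplay.from_predictions` as the second argument in place of the hard labels.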

### Transformer
The **Transformer** model demonstrates **strong performance, with an average precision (AP) of 0.71**, which is higher than that of the LSTM, as expected. In general, the curve shows that:
- at **_higher thresholds_**, that is at _low recall values (left side)_, the Transformer achieves high precision, starting around **0.8**, and maintains this level up to a recall of approximately **0.8**. This means that, for much of the recall range, the Transformer is able to make accurate predictions.
- at **_lower thresholds_**, as recall increases beyond 0.8 _(right side)_, there is a _steady drop in precision_. This means that when the model predicts more positives to increase recall, it also starts to misclassify more negative samples, introducing false positives.

Nevertheless, **the Transformer's overall ability to maintain high precision at moderate recall levels makes it a strong choice**. In addition, it maintains better precision across thresholds than the LSTM, confirming its superior balance between precision and recall.

### LSTM
The **LSTM** model, with an **average precision (AP) of 0.65**, shows reasonably good performance, with a behaviour very similar to that of the Transformer. In this case, it can be said that:
- at **_higher thresholds_**, its _precision starts at the same value as the Transformer_, at around **0.8**, and remains relatively stable up to a recall of **0.6**. However, it is clear that the LSTM consistently **struggles to match the precision-recall balance achieved by the Transformer across all the recall levels**.
- at **_lower thresholds_**, beyond a recall value of around **0.6**, the behaviour of the LSTM model is the same as that of the Transformer, reaching the precision levels of a chance predictor.

In general, even if the LSTM still performs well in certain regions, **its lower recall at lower thresholds suggests that it may be more prone to false negatives than the Transformer, as also confirmed by the confusion matrix**. This gap highlights areas where the LSTM could be further optimized, for example through appropriate threshold tuning.
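Threshold tuning can be sketched as a search over candidate thresholds on a held-out set, keeping the one that maximizes a chosen metric. The example below uses F1 as the criterion, which is an assumption; `best_f1_threshold` is a hypothetical helper, and the validation scores are made up.

```python
import numpy as np

def best_f1_threshold(y_true, y_score, candidates=None):
    """Return (threshold, F1) for the candidate threshold with the highest F1.
    A tweet is predicted sexist when its score is >= threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if candidates is None:
        candidates = np.unique(y_score)  # every distinct score is a candidate
    best_thr, best_f1 = 0.5, -1.0
    for thr in candidates:
        y_pred = (y_score >= thr).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict inequality keeps the lowest threshold on ties
            best_thr, best_f1 = float(thr), f1
    return best_thr, best_f1

# Made-up validation labels and hypothetical P(sexist) scores.
y_val = [0, 0, 1, 1, 1]
scores_val = [0.2, 0.45, 0.4, 0.6, 0.9]
thr, f1 = best_f1_threshold(y_val, scores_val)
print(f"best threshold={thr}, F1={f1:.3f}")
```

The threshold should be selected on validation data and then applied unchanged to the test set; if false negatives are considered more costly, a recall-weighted criterion could replace F1.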

## Conclusions and Possible Solutions
What emerges from the analysis just presented is that many of the findings are correlated. The _Precision-Recall Curves_ and the _Confusion Matrices_ reflect the same kind of analysis, as said before. The information they provide is also influenced by the _class imbalance_, which is one of the main causes of poor performance in terms of the shape of the PR Curve and the AP, especially for the LSTM model.

Several solutions could address these issues, some acting on the data and some on the models. Regarding the data:

- **rebalancing the training process by applying _class weights_** during the computation of the loss, so as to help the model give equal importance to both classes.

- **data augmentation**, since the dataset is fairly small; it can be done by gathering new data from other sources or by reshuffling the tokens of existing tweets as a pre-processing operation.

- changing the **batch creation strategy** by mixing inputs, so as to always have a similar amount of data for each length range. This can be done by keeping the mean tweet length equal, or at least similar, across all the batches in each epoch.
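As an illustration of the first option, the sketch below computes inverse-frequency class weights and applies them to a binary cross-entropy loss. It is written in plain NumPy to stay framework-agnostic; `balanced_class_weights` and `weighted_bce` are hypothetical helpers, the labels are made up, and the weighting formula `n_samples / (n_classes * count)` is one common choice rather than the only one.

```python
import numpy as np

def balanced_class_weights(labels, n_classes=2):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class).
    Assumes every class appears at least once in `labels`."""
    counts = np.bincount(np.asarray(labels), minlength=n_classes)
    return len(labels) / (n_classes * counts)

def weighted_bce(y_true, p_pred, weights, eps=1e-7):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    sample_w = np.where(y_true == 1, weights[1], weights[0])
    return float(np.mean(sample_w * per_sample))

# Made-up imbalanced labels: three non-sexist tweets, one sexist tweet.
y_train = [0, 0, 0, 1]
w = balanced_class_weights(y_train)
loss = weighted_bce(y_train, [0.5, 0.5, 0.5, 0.5], w)
print("class weights:", w)
print("weighted loss:", loss)
```

With these weights the minority class contributes as much to the loss as the majority class overall, while the average sample weight stays at 1, so the loss scale remains comparable to the unweighted case.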

Regarding the models:

- one way to improve the LSTM is to **increase the number of bidirectional blocks** and add **skip connections** to mitigate the vanishing gradient problem.
- for the Transformer, a possible way to obtain better performance is to **change the fine-tuning strategy, for example by freezing part of the network and updating only the remaining weights**.

# [Task 8 - 0.5 points] Report



Wrap up your experiment in a short report (up to 2 pages).

### Instructions



* Use the NLP course report template.

* Summarize each task in the report following the provided template.

### Recommendations



The report is not a copy-paste of graphs, tables, and command outputs.



* Summarize classification performance in Table format.

* **Do not** report command outputs or screenshots.

* Report learning curves in Figure format.

* The error analysis section should summarize your findings.


# Submission



* **Submit** your report in PDF format.

* **Submit** your python notebook.

* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...

* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ



Please check these frequently asked questions before contacting us.

### Execution Order



You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings



You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture



You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.


### Neural Libraries



You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer



If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation



Each model is trained with at least 3 random seeds.



Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.
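A minimal sketch of this aggregation is shown below. The metric values are placeholders, not results, and the use of the sample standard deviation (`ddof=1`) is an assumption: the population version (`ddof=0`) is equally defensible as long as the choice is reported.

```python
import numpy as np

# Placeholder F1 scores for three seed runs (made-up values, not actual results).
f1_per_seed = {"seed_1": 0.70, "seed_2": 0.74, "seed_3": 0.72}

scores = np.array(list(f1_per_seed.values()))
mean_f1 = scores.mean()
std_f1 = scores.std(ddof=1)  # sample standard deviation over the seeds

print(f"F1 = {mean_f1:.3f} ± {std_f1:.3f}")
```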

### Model Selection for Analysis



To carry out the error analysis you are **free** to either



* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)

* Perform ensembling via, for instance, majority voting to obtain a single model.
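For the second option, majority voting over binary predictions can be sketched as follows. The prediction matrix is made up, and the sketch assumes an odd number of seed runs so that ties cannot occur.

```python
import numpy as np

# Made-up hard predictions: one row per seed run, one column per tweet.
preds_per_seed = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# A tweet is labelled sexist when more than half of the runs predict 1.
votes_for_sexist = preds_per_seed.sum(axis=0)
ensemble = (votes_for_sexist * 2 > preds_per_seed.shape[0]).astype(int)

print(ensemble)
```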

### Error Analysis



Some topics for discussion include:

   * Precision/Recall curves.

   * Confusion matrices.

   * Specific misclassified samples.

### Bonus Points

Bonus points are arbitrarily assigned based on significant contributions such as:

- Outstanding error analysis

- Masterclass code organization

- Suitable extensions

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).



**Possible Extensions/Explorations for Bonus Points:**

- **Try other preprocessing strategies**: e.g., but not limited to, explore techniques tailored specifically for tweets or  methods that are common in social media text.

- **Experiment with other custom architectures or models from HuggingFace**

- **Explore Spanish tweets**: e.g., but not limited to, leverage multilingual models to process Spanish tweets and assess their performance compared to monolingual models.












# The End