# Sexism Detection and Categorization in Tweets Using Machine Learning and NLP

This project aims to detect sexism in tweets and classify it into specific categories using machine learning models and NLP techniques. The task is divided into two subtasks:

* Sexism Detection – Identifying whether a tweet contains sexist content.

* Sexism Categorization – Classifying detected sexism into four categories: 'JUDGEMENTAL', 'REPORTED', 'DIRECT', and 'UNKNOWN'.

Two feature extraction methods are compared:
* Contextual embeddings using RoBERTa
* LSA based on TF-IDF of words (100 singular values)

Three classifiers are trained and evaluated on each representation to identify sexist content:
* Logistic regression - L2 penalty, liblinear solver and 200 iterations
* Decision tree - default hyperparameters. I tried several values of max_depth, min_samples_split and min_samples_leaf, but the defaults worked best
* Multilayer perceptron - 2 hidden layers (256, 128), ReLU activation, up to 1500 iterations, lbfgs solver (I tried adam, but lbfgs works better). Early stopping and a low learning rate of 0.0005 are also set, but scikit-learn applies them only with the sgd and adam solvers, so they have no effect with lbfgs

The dataset consists of English and Spanish tweets labeled for different levels of sexism.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
from transformers import RobertaTokenizer, RobertaModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
spanish_stopwords = stopwords.words('spanish')

In [None]:
import sys
sys.path.append('/content/drive/MyDrive')


from readerEXIST2025 import EXISTReader

# reader_train = EXISTReader("EXIST2025_training.json")
# reader_dev = EXISTReader("EXIST2025_dev.json")
reader_train = EXISTReader("drive/MyDrive/EXIST2025_training.json")
reader_dev = EXISTReader("drive/MyDrive/EXIST2025_dev.json")

EnTrainTask1, EnDevTask1 = reader_train.get(lang="EN", subtask="1"), reader_dev.get(lang="EN", subtask="1")
EnTrainTask2, EnDevTask2 = reader_train.get(lang="EN", subtask="2"), reader_dev.get(lang="EN", subtask="2")

SpTrainTask1, SpDevTask1 = reader_train.get(lang="ES", subtask="1"), reader_dev.get(lang="ES", subtask="1")
SpTrainTask2, SpDevTask2 = reader_train.get(lang="ES", subtask="2"), reader_dev.get(lang="ES", subtask="2")


# ENGLISH

## Preprocessing

In [None]:
import re
web_re = re.compile(r"https?:\/\/[^\s]+", re.U)
user_re = re.compile(r"(@\w+\-?(?:\w+)?)", re.U)
hashtag_re = re.compile(r"(#\w+\-?(?:\w+)?)", re.U)

mapLabelToId = {"task1": {'NO': 0, 'YES': 1, "AMBIGUOUS": 2},
                "task2": {'-': 4, 'JUDGEMENTAL': 0, 'REPORTED': 1, 'DIRECT': 2, 'UNKNOWN': 3, "AMBIGUOUS": 5},
                "task3": {'OBJECTIFICATION': 0, 'STEREOTYPING-DOMINANCE': 1, 'MISOGYNY-NON-SEXUAL-VIOLENCE': 2,
                          'IDEOLOGICAL-INEQUALITY': 3, 'SEXUAL-VIOLENCE': 4, 'UNKNOWN': 5, '-': 6,
                          "AMBIGUOUS": 7}}

mapIdToLabel = {"task1": {0: 'NO', 1: 'YES', 2: "AMBIGUOUS"},
                "task2": {4: '-', 0: 'JUDGEMENTAL', 1: 'REPORTED', 2: 'DIRECT', 3: 'UNKNOWN', 5: "AMBIGUOUS"},
                "task3": {0: 'OBJECTIFICATION', 1: 'STEREOTYPING-DOMINANCE', 2: 'MISOGYNY-NON-SEXUAL-VIOLENCE',
                          3: 'IDEOLOGICAL-INEQUALITY', 4: 'SEXUAL-VIOLENCE', 5: 'UNKNOWN', 6: '-',
                          7: "AMBIGUOUS"}}


def standard_preprocession(text):
    text = web_re.sub("", text)
    text = user_re.sub("", text)
    text = hashtag_re.sub("", text)
    text = text.lower()

    return text


def no_preprocession(text):
    return text

def unpack(data, task):
    # data is a (ids, texts, labels) triple of pandas Series
    ids, texts, labels = data
    ids = ids.tolist()
    sptext = [standard_preprocession(t) for t in texts]

    label = [mapLabelToId[task][l] for l in labels]

    return {"id": ids, "sptext": sptext, "label": label}
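
To make the effect of the preprocessing concrete, here is a minimal standalone sketch that applies the same three regular expressions to one tweet. The sample tweet is invented for illustration and is not taken from the dataset.

```python
import re

# Same patterns as in the preprocessing cell
web_re = re.compile(r"https?:\/\/[^\s]+", re.U)
user_re = re.compile(r"(@\w+\-?(?:\w+)?)", re.U)
hashtag_re = re.compile(r"(#\w+\-?(?:\w+)?)", re.U)


def standard_preprocession(text):
    text = web_re.sub("", text)      # drop URLs
    text = user_re.sub("", text)     # drop @mentions
    text = hashtag_re.sub("", text)  # drop #hashtags
    return text.lower()


# Hypothetical tweet, invented for illustration only
sample = "@some_user Women CAN code! #WomenInTech https://example.com/post"
cleaned = standard_preprocession(sample)
print(repr(cleaned))

# The removed tokens leave their surrounding spaces behind;
# splitting and re-joining collapses them if that matters downstream
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

Note that the function removes hashtags entirely, including their text, so any signal carried by the hashtag words is lost before feature extraction.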

## Tweet representations (Feature extraction)

In [None]:
# Select the device used to compute the contextual embeddings
if torch.backends.mps.is_available():  # Apple Silicon GPU
    device = torch.device("mps")
elif torch.cuda.is_available():  # Nvidia GPU
    device = torch.device("cuda")
else:  # CPU
    device = torch.device("cpu")
print(device)


In [None]:
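def get_contextual_embeddings(text, model_name, lang=None):
    # `lang` is unused; it is accepted because get_repre calls every
    # representation method as method(text, model_name, lang)
    batch_size = 16
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # e.g. "roberta-base"
    model = AutoModel.from_pretrained(model_name)
    model.to(device)  # move the model once, not on every batch
    model.eval()

    tensor_list = []
    for i in range(0, len(text), batch_size):
        batch = text[i:i+batch_size]

        inputs = tokenizer(batch, padding="max_length", max_length=100, truncation=True, return_tensors="pt")
        inputs = inputs.to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            encoded_layers = outputs[0]  # last hidden state: (batch, tokens, hidden)
            cls_vector = encoded_layers[:, 0, :]  # embedding of the first token of each tweet

        tensor_list.append(cls_vector.cpu())  # keep results on the CPU to free GPU memory
    # Return a NumPy array so scikit-learn can consume it directly
    return torch.cat(tensor_list).numpy()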


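# LSA based on TF-IDF of words (up to 100 singular values)
def LSA_TF_IDF_repre(data, model_name, lang, test_data=None):
    # `model_name` is unused; it keeps the signature shared with get_contextual_embeddings
    if lang == "english":
        stop_words = "english"
    elif lang == "spanish":
        stop_words = stopwords.words("spanish")
    else:
        stop_words = None

    tfidf_vectorizer = TfidfVectorizer(stop_words = stop_words, binary=False, use_idf=True, preprocessor=None)
    tfidf_matrix = tfidf_vectorizer.fit_transform(data)

    num_features = tfidf_matrix.shape[1]
    n_components = min(100, num_features)

    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd_matrix = svd.fit_transform(tfidf_matrix)

    if test_data is None:
        return svd_matrix

    # Project the test texts with the vocabulary, IDF weights and SVD basis
    # learned on `data`, so that both matrices share the same feature space
    test_matrix = svd.transform(tfidf_vectorizer.transform(test_data))
    return svd_matrix, test_matrix

In [None]:
def get_repre(train, test, method, model_name, task, lang, sample_size = -1):
    train_data1 = unpack(train, task)
    test_data1 = unpack(test, task)

    if sample_size != -1:
        for data in [train_data1, test_data1]:
            for key in data:
                data[key] = data[key][:sample_size]

    if method is LSA_TF_IDF_repre:
        # LSA is fitted on the training texts only and then applied to the test texts;
        # fitting it separately on each split would give incompatible features
        train_data1["repre"], test_data1["repre"] = method(
            train_data1["sptext"], model_name, lang, test_data=test_data1["sptext"])
    else:
        train_data1["repre"] = method(train_data1["sptext"], model_name, lang)
        test_data1["repre"] = method(test_data1["sptext"], model_name, lang)

    return train_data1, test_data1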
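
As a standalone illustration of the LSA representation, the sketch below chains `TfidfVectorizer` and `TruncatedSVD` in a scikit-learn pipeline, fits it on the training texts only, and reuses the fitted pipeline on the dev texts. The toy corpora and `n_components=2` are assumptions chosen to keep the example tiny; they are not the notebook's data or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpora, invented for illustration only
train_texts = ["women belong in tech", "she codes very well",
               "he codes very well", "tech needs more women"]
dev_texts = ["women code well", "tech is for everyone"]

# n_components=2 only because the toy vocabulary is tiny
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=42))

# Learn the vocabulary, IDF weights and SVD basis from the training texts
train_repre = lsa.fit_transform(train_texts)
# Reuse them on the dev texts; nothing is refitted here
dev_repre = lsa.transform(dev_texts)

print(train_repre.shape, dev_repre.shape)
```

Because the dev matrix is produced by `transform` rather than `fit_transform`, each of its columns refers to the same latent dimension as the corresponding column of the training matrix, which is what a classifier trained on the latter expects.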

In [None]:
train_con_embed, test_con_embed = get_repre(EnTrainTask1, EnDevTask1,
                                      get_contextual_embeddings, "roberta-base",
                                            "task1","english", -1)

In [None]:
train_lsa_repre, test_lsa_repre = get_repre(EnTrainTask1, EnDevTask1,
                                      LSA_TF_IDF_repre, "",
                                      "task1","english", -1)

## Learning Models - Subtask 1

In [None]:
def log_reg(x_train, y_train, x_dev, y_dev):
    clf1 = LogisticRegression(
      penalty='l2',
      C=1.0,
      solver='liblinear', #'saga' 'l1'
      max_iter=200
    )
    clf1.fit(x_train, y_train)
    predicted1 = clf1.predict(x_dev)

    f1_positive = f1_score(y_dev, predicted1, pos_label=1)
    print(f"F1-score (Positive Class): {f1_positive}")

    report = classification_report(y_dev,predicted1, digits=4)
    print(report)

def decision_tree_sub1(X_train, y_train, X_dev, y_dev):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_dev)

    f1_positive = f1_score(y_dev, predicted, pos_label=1)
    print(f"F1-score (Positive Class): {f1_positive}")

    report = classification_report(y_dev,predicted, digits=4)
    print(report)

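def MLP_sub1(X_train, y_train, X_dev, y_dev):
    # Note: scikit-learn uses learning_rate_init and early_stopping only with the
    # 'sgd' and 'adam' solvers, so they have no effect with solver='lbfgs'
    clf = MLPClassifier(random_state = 1,
                        hidden_layer_sizes = (256, 128),
                        activation='relu',
                        max_iter = 1500,
                        learning_rate_init = 0.0005,
                        alpha=0.0001,
                        early_stopping=True,
                        solver='lbfgs') # adam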

    clf.fit(X_train, y_train)
    predicted = clf.predict(X_dev)

    f1_positive = f1_score(y_dev, predicted, pos_label=1)
    print(f"F1-score (Positive Class): {f1_positive}")

    report = classification_report(y_dev, predicted, digits=4)
    print(report)
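
For reference, the positive-class F1 printed by the three functions above can be reproduced by hand from the counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration on invented labels; the helper name `f1_positive` is hypothetical and is not part of scikit-learn.

```python
def f1_positive(y_true, y_pred, pos_label=1):
    """F1 of the positive class, computed from precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no correct positive prediction at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = sexist (YES), 0 = not sexist (NO)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

score = f1_positive(y_true, y_pred)
print(f"F1-score (Positive Class): {score:.4f}")
```

Here two of the three sexist tweets are found and one non-sexist tweet is flagged by mistake, so precision and recall are both 2/3 and so is their harmonic mean.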


In [None]:
def train(train_data, test_data, method):
    method(train_data["repre"], train_data["label"],
           test_data["repre"], test_data["label"])


## Subtask 1 - Results - English

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Logistic regression")
train(train_lsa_repre, test_lsa_repre, log_reg)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Logistic regression")
train(train_con_embed, test_con_embed, log_reg)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Decision Tree")
train(train_lsa_repre, test_lsa_repre, decision_tree_sub1)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Decision Tree")
train(train_con_embed, test_con_embed, decision_tree_sub1)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: MLP")
train(train_lsa_repre, test_lsa_repre, MLP_sub1)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: MLP")
train(train_con_embed, test_con_embed, MLP_sub1)
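
The six result cells above all follow the same pattern, so they could also be driven by a single loop over representations and classifiers. The sketch below shows one possible shape for such a loop; `run_grid` and `fake_clf` are hypothetical names, and the toy data stands in for the notebook's real representations so that the example runs on its own.

```python
from itertools import product


def run_grid(representations, classifiers, train_fn):
    """Call train_fn once per (representation, classifier) pair."""
    ran = []
    for (repre_name, (train_data, test_data)), (clf_name, clf_fn) in product(
            representations.items(), classifiers.items()):
        print(f"Representation: {repre_name}")
        print(f"Classifier: {clf_name}")
        train_fn(train_data, test_data, clf_fn)
        ran.append((repre_name, clf_name))
    return ran


# Stand-in classifier so the sketch is self-contained; in the notebook this
# role is played by log_reg, decision_tree_sub1 and MLP_sub1
def fake_clf(x_train, y_train, x_dev, y_dev):
    print(f"  trained on {len(x_train)} rows, evaluated on {len(x_dev)}")


# Same dispatch helper as in the notebook
def train(train_data, test_data, method):
    method(train_data["repre"], train_data["label"],
           test_data["repre"], test_data["label"])


# Toy (train, dev) pair, invented for illustration only
toy = ({"repre": [[0.1], [0.9]], "label": [0, 1]},
       {"repre": [[0.5]], "label": [1]})

representations = {"LSA based on TF-IDF": toy,
                   "Contextual embeddings using RoBERTa": toy}
classifiers = {"Logistic regression": fake_clf,
               "Decision tree": fake_clf,
               "MLP": fake_clf}

ran = run_grid(representations, classifiers, train)
```

Keeping one cell per combination, as the notebook does, has the advantage that each output stays next to its own label, so this loop is only a convenience when all results are wanted in one place.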

## Learning Models - Subtask 2 - English

In [None]:
train_con_embed2, test_con_embed2 = get_repre(EnTrainTask2, EnDevTask2,
                                      get_contextual_embeddings, "roberta-base",
                                            "task2", "english", -1)

In [None]:
train_lsa_repre2, test_lsa_repre2 = get_repre(EnTrainTask2, EnDevTask2,
                                      LSA_TF_IDF_repre, "",
                                      "task2", "english", -1)

In [None]:
def log_reg_sub2(x_train, y_train, x_dev, y_dev):
    clf = LogisticRegression(
      penalty='l2',
      C=1.0,
      solver='liblinear', #'saga' 'l1'
      max_iter=200
    )
    clf.fit(x_train, y_train)
    predicted = clf.predict(x_dev)

    f1_macro = f1_score(y_dev, predicted, average='macro')
    print(f"F1-score (Macro-Averaged): {f1_macro}")

    report = classification_report(y_dev,predicted, digits=4)
    print(report)


def decision_tree_sub2(X_train, y_train, X_dev, y_dev):
    clf = DecisionTreeClassifier(random_state=42)  # fixed seed, as in decision_tree_sub1, for reproducible results
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_dev)

    f1_macro = f1_score(y_dev, predicted, average='macro')
    print(f"F1-score (Macro-Averaged): {f1_macro}")

    report = classification_report(y_dev, predicted, digits=4)
    print(report)


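def MLP_sub2(X_train, y_train, X_dev, y_dev):
    # hidden_layer_sizes is left at the scikit-learn default of (100,) here.
    # learning_rate_init and early_stopping are ignored by the 'lbfgs' solver
    clf = MLPClassifier(random_state = 1,
                        max_iter = 1500,
                        learning_rate_init = 0.0005,
                        early_stopping=True,
                        solver='lbfgs')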
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_dev)

    f1_macro = f1_score(y_dev, predicted, average='macro')
    print(f"F1-score (Macro-Averaged): {f1_macro}")

    report = classification_report(y_dev, predicted, digits=4)
    print(report)
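
The macro-averaged F1 used for subtask 2 is the unweighted mean of the per-class F1 scores, so rare categories count as much as frequent ones. The following plain-Python sketch works this out on invented labels; the helper names `f1_for_class` and `macro_f1` are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """Per-class F1 written as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)


# Invented labels for three categories, with class 0 in the majority
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 1]

per_class = [f1_for_class(y_true, y_pred, l) for l in (0, 1, 2)]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class, score, accuracy)
```

In this toy case the accuracy is 0.75, but the macro F1 is lower (37/54, about 0.685) because the two minority classes, each with only two examples, pull the average down.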



## Subtask 2 - Results - English

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Logistic regression")
train(train_lsa_repre2, test_lsa_repre2, log_reg_sub2)


In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Logistic regression")
train(train_con_embed2, test_con_embed2, log_reg_sub2)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Decision Tree")
train(train_lsa_repre2, test_lsa_repre2, decision_tree_sub2)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Decision Tree")
train(train_con_embed2, test_con_embed2, decision_tree_sub2)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: MLP")
train(train_lsa_repre2, test_lsa_repre2, MLP_sub2)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: MLP")
train(train_con_embed2, test_con_embed2, MLP_sub2)

# SPANISH

## Tweet representations (Feature extraction)

In [None]:
train_con_embed, test_con_embed = get_repre(SpTrainTask1, SpDevTask1,
                                      get_contextual_embeddings, "PlanTL-GOB-ES/roberta-base-bne",
                                            "task1", "spanish", -1)

In [None]:
train_lsa_repre, test_lsa_repre = get_repre(SpTrainTask1, SpDevTask1,
                                      LSA_TF_IDF_repre, "",
                                      "task1", "spanish", -1)

## Subtask 1 - Spanish

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Logistic regression")
train(train_lsa_repre, test_lsa_repre, log_reg)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Logistic regression")
train(train_con_embed, test_con_embed, log_reg) #0

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Decision Tree")
train(train_lsa_repre, test_lsa_repre, decision_tree_sub1) #51

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Decision Tree")
train(train_con_embed, test_con_embed, decision_tree_sub1) #42

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: MLP")
train(train_lsa_repre, test_lsa_repre, MLP_sub1) #50 50

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: MLP")
train(train_con_embed, test_con_embed, MLP_sub1) #38 74


## Subtask 2 - Spanish

In [None]:
train_con_embed2_sp, test_con_embed2_sp = get_repre(SpTrainTask2, SpDevTask2,
                                      get_contextual_embeddings, "PlanTL-GOB-ES/roberta-base-bne",
                                            "task2", "spanish", -1)

In [None]:
train_lsa_repre2_sp, test_lsa_repre2_sp = get_repre(SpTrainTask2, SpDevTask2,
                                      LSA_TF_IDF_repre, "",
                                      "task2", "spanish", -1)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Logistic regression")
train(train_lsa_repre2_sp, test_lsa_repre2_sp, log_reg_sub2)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Logistic regression")
train(train_con_embed2_sp, test_con_embed2_sp, log_reg_sub2)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: Decision Tree")
train(train_lsa_repre2_sp, test_lsa_repre2_sp, decision_tree_sub2)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: Decision Tree")
train(train_con_embed2_sp, test_con_embed2_sp, decision_tree_sub2)

In [None]:
print("Representation: LSA based on TF-IDF with 100 components")
print("Classifier: MLP")
train(train_lsa_repre2_sp, test_lsa_repre2_sp, MLP_sub2)

In [None]:
print("Representation: Contextual embeddings using RoBERTa")
print("Classifier: MLP")
train(train_con_embed2_sp, test_con_embed2_sp, MLP_sub2)  #54

**Results**
* Subtask 1 English
  * Logistic regression is the best classifier, almost tied with MLP in second place. DT is much worse
  * Contextual embeddings perform much better than LSA based on TF-IDF. This is expected: RoBERTa embeddings capture word order and context that a bag-of-words representation discards, although they take much longer to compute

* Subtask 2 English
  * This subtask is much harder, with significantly lower F1 scores
  * Overall, MLP is the best classifier, followed by DT. LSA works best with DT, while contextual embeddings work best with MLP

* Subtask 1 Spanish
  * The conclusions from subtask 1 English also apply here
  * LR with contextual embeddings achieved the highest F1 score I obtained: 0.81

* Subtask 2 Spanish
  * MLP is again the best classifier, but the gap to LR is smaller
