# Final Project

This final project can be collaborative. The maximum group size is 3, and you may also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you will receive an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can freely determine every component of your workflow, including but not limited to:
-  **Preprocessing the input text**: You may decide how to clean or transform the text. For example, removing emojis or URLs, lowercasing, removing stopwords, applying stemming or lemmatization, correcting spelling, or performing tokenization and sentence segmentation.
-  **Feature extraction and encoding**: You can choose any method to convert text into numerical representations, such as TF-IDF, Bag-of-Words, N-grams, Word2Vec, GloVe, FastText, contextual embeddings (e.g., BERT, RoBERTa, or other transformer-based models), Part-of-Speech (POS) tagging, dependency-based features, sentiment or emotion features, readability metrics, or even embeddings or features generated by large language models (LLMs).
-  **Data augmentation and enrichment**: You may expand or balance your dataset by incorporating other related corpora or using techniques like synonym replacement, random deletion/insertion, or LLM-assisted augmentation (e.g., generating paraphrased or synthetic examples to improve model robustness).
-  **Model selection**: You are free to experiment with different models — from traditional machine learning algorithms (e.g., Logistic Regression, SVM, Random Forest, XGBoost) to deep learning architectures (e.g., CNNs, RNNs, Transformers), or even hybrid/ensemble approaches that combine multiple models or leverage LLM-generated predictions or reasoning.
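
As one illustration of the feature extraction options listed above, here is a minimal sketch of TF-IDF encoding with scikit-learn. The three toy posts are made up for this example and are not taken from the dataset; the point is only that a term appearing in every post receives a lower inverse document frequency than a term appearing in one post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy posts, not taken from the EDOS dataset.
toy_posts = [
    "the weather is nice today",
    "the game was great",
    "the new phone is great",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_posts)  # sparse matrix, one row per post

vocab = vectorizer.vocabulary_  # maps each term to its column index
idf = vectorizer.idf_           # inverse document frequency per column

print("matrix shape:", X.shape)
print("idf of 'the':    ", round(idf[vocab["the"]], 3))      # appears in every post
print("idf of 'weather':", round(idf[vocab["weather"]], 3))  # appears in one post
```

Any other encoding from the list (for example averaged word embeddings) can be swapped in as long as it produces one numeric row per post.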

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values. You may explore both traditional and AI-assisted techniques. Data augmentation is optional.
-  **Model selection**: implement at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision (P), Recall (R) and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report your best performance along with (1) the feature engineering method and (2) the model that achieved it. Here is an example illustrating how the experimental results table should be presented.

| Feature + Model | Sexist (P) | Sexist (R) | Sexist (F1) | Non-Sexist (P) | Non-Sexist (R) | Non-Sexist (F1) | Weighted (P) | Weighted (R) | Weighted (F1) |
|-----------------|:----------:|:----------:|:------------:|:---------------:|:---------------:|:----------------:|:-------------:|:--------------:|:---------------:|
| TF-IDF + Logistic Regression | ... | ... | ... | ... | ... | ... | ... | ... | ... |
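
One possible way to build such a table is to flatten the dictionary form of `classification_report` into one row per feature+model combination. The sketch below is only an illustration: the helper name `report_row` is hypothetical, the labels and predictions are made up, and the class names `"sexist"` and `"not sexist"` are assumed to match the label strings in the data.

```python
import pandas as pd
from sklearn.metrics import classification_report

def report_row(name, y_true, y_pred):
    """Flatten one classification_report into a single results-table row."""
    rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    row = {"Feature + Model": name}
    # Assumed class names; adjust if the label strings differ.
    for key, col in [("sexist", "Sexist"),
                     ("not sexist", "Non-Sexist"),
                     ("weighted avg", "Weighted")]:
        row[f"{col} (P)"] = rep[key]["precision"]
        row[f"{col} (R)"] = rep[key]["recall"]
        row[f"{col} (F1)"] = rep[key]["f1-score"]
    return row

# Hypothetical labels and predictions, for illustration only.
y_true = ["sexist", "not sexist", "not sexist", "sexist", "not sexist"]
y_pred = ["sexist", "not sexist", "sexist", "not sexist", "not sexist"]

results = pd.DataFrame([report_row("TF-IDF + Logistic Regression", y_true, y_pred)])
print(results.round(4).to_string(index=False))
```

Calling `report_row` once per experiment and collecting the rows in a list gives the full table with at least 6 rows.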

- **Format of the report**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary for each section:
    - Data Preprocessing
    - Feature Engineering
    - Model Selection and Architecture
    - Training and Validation
    - Evaluation and Results
    - Use of Generative AI (if used)

## Rules 
Violations will result in a grade of 0 points:
-   `Rule 1 - No test set leakage`: You must not use any instance from the test set during training, feature engineering, or model selection.
-   `Rule 2 - Responsible AI use`: You may use generative AI, but you must clearly document how it was used. If you have used genAI, include a section titled “Use of Generative AI” describing:
    -   What parts of the project you used AI for
    -   What was implemented manually vs. with AI assistance

## Grading

Performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and F1-score (use `classification_report` in sklearn).
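
A minimal sketch of this protocol, using made-up toy texts in place of the real posts: the data is split first, every fitted step (here the TF-IDF vocabulary and the classifier) sees only the training portion, and the test portion is used once for the final report. The toy data, the 25% test size and the choice of model are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real posts and labels.
texts = [f"harmless example post number {i}" for i in range(20)] + \
        [f"offensive example post number {i}" for i in range(20)]
labels = ["not sexist"] * 20 + ["sexist"] * 20

# Split before fitting anything, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on the training texts only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# The test set is touched once, for the final evaluation.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
```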

The total points are 10.0. Each team will compete with other teams in the class on their best performance. Points will be deducted if the requirements above are not followed.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

If your best performance reaches **0.82** or above (weighted F1-score) and follows all the requirements and rules, you will also get full points (10.0 points). 

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report, including:
- code and experimental results with details explained
- combined results table, report and best performance
- a summary at the end of the report (please follow the format above)

Missing any part of the above requirements will result in point deductions.

The due date is **Dec 11, Thursday by 11:59pm**.

## Experimental Results

(A table detailing model performance on the test set with at least 6 rows. Report the best performance.)


## Project Summary
### 1. Data Preprocessing


### 2. Feature Engineering
 

### 3. Model Selection and Architecture


### 4. Training and Validation


### 5. Evaluation and Results


### 6. Use of Generative AI (if used)

### Required Libraries/Imports


In [34]:
# Install the necessary libraries (uncomment if needed).
# Note: `re` is part of the standard library, and scikit-learn is installed as `scikit-learn`, not `sklearn`.
# %pip install pandas numpy nltk demoji matplotlib transformers datasets scikit-learn
%pip install torch

# Standard library
import re

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import hstack

# Text preprocessing
import demoji
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Classical ML (scikit-learn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import LinearSVC as SVC  # note: SVC here is an alias for LinearSVC

# Deep learning
import torch
from datasets import Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments

# NLTK resources used by the preprocessing step
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')





### 1. Data Preprocessing



In [35]:

# Read whole lines as raw strings (avoid using '\n' as a separator in read_csv)
with open("edos_labelled_data.csv", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
raw = pd.DataFrame({"raw": lines})

def parse(row):
    # Assumes each line has the form id,text,label,split and that only the text contains commas.
    line = row["raw"]
    text_part, label, split = line.rsplit(",", 2)  # take label and split from the right
    id_, text = text_part.split(",", 1)            # take the id from the left
    return pd.Series([id_, text, label, split])
# drop the first line (header)
raw = raw[1:]
df = raw.apply(parse, axis=1)
df.columns = ["id", "text", "label", "split"]

# split the data set into train and test sets (note the leading space in the values being matched)
train_df = df[df["split"] == " train"]
test_df = df[df["split"] == " test"]
# drop the split column (the id column is kept but not used as a feature)
train_df = train_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])

# process text data
# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # remove links (the dot in "www." is escaped so it matches a literal dot)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # remove mentions
    text = re.sub(r'@\S+', '', text)
    # remove hashtags
    text = re.sub(r'#\S+', '', text)
    # remove emojis
    text = demoji.replace(text, repl="")
    # remove numbers
    text = re.sub(r'\d+', '', text)
    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize: try NLTK tokenizer, but gracefully fall back to simple split if the punkt resource is missing
    try:
        tokens = word_tokenize(text)
    except LookupError:
        # fallback when punkt (or punkt_tab) is not available
        tokens = text.split()

    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    # Join back to string
    return " ".join(tokens)
train_df["text"] = train_df["text"].apply(preprocess_text)
test_df["text"] = test_df["text"].apply(preprocess_text)
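
To see why the loader above splits each line from both ends instead of splitting on every comma, here is a small standalone sketch of the same idea. The sample line is hypothetical and not copied from the dataset; it only assumes the `id,text,label,split` layout that `parse` relies on, where the label and split fields contain no commas themselves.

```python
def parse_line(line):
    """Split 'id,text,label,split' where only the text may contain commas."""
    text_part, label, split = line.rsplit(",", 2)  # peel label and split off the right
    id_, text = text_part.split(",", 1)            # peel the id off the left
    return id_, text, label, split

# Hypothetical line in the assumed layout, not copied from the dataset.
sample = "example-id-1,Well, that escalated quickly, didn't it?,not sexist,train"
print(parse_line(sample))
```

The commas inside the post stay in the text field, which a plain `line.split(",")` would not preserve.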


### Feature Engineering


In [36]:
# TF-IDF vectorizer: a common way to turn text into numeric features (it is mentioned in the spec).
# Each term is weighted by how informative it is: a word like 'the' appears in almost every post,
# so it gets a low weight, while a word like "bitch" appears in fewer posts and gets a higher weight.
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

X_train_tfidf = tfidf.fit_transform(train_df["text"])
X_test_tfidf = tfidf.transform(test_df["text"])

# Load GloVe embeddings (100d).
# GloVe provides pretrained word vectors in which semantically similar words have similar vectors,
# e.g. 'girl' and 'woman' lie close together.
# Averaging the vectors of the words in a sentence gives one fixed-length vector for its overall meaning.
glove = {}
with open("glove.6B.100d.txt", "r", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        glove[word] = vector

def sentence_to_vec(sentence, embeddings=glove, dim=100):
    words = sentence.split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

X_train_glove = np.vstack(train_df["text"].apply(sentence_to_vec)) # type: ignore
X_test_glove = np.vstack(test_df["text"].apply(sentence_to_vec)) # type: ignore
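
The averaging step in `sentence_to_vec` can be checked by hand on a tiny example. The sketch below uses made-up 3-dimensional vectors in place of the real 100-dimensional GloVe embeddings, and the helper name `mean_pool` is hypothetical; it shows that out-of-vocabulary words are skipped and that a sentence with no known words maps to the zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for real GloVe embeddings.
toy_embeddings = {
    "girl":  np.array([1.0, 0.0, 2.0], dtype="float32"),
    "woman": np.array([3.0, 0.0, 0.0], dtype="float32"),
}

def mean_pool(sentence, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim, dtype="float32")
    return np.mean(vectors, axis=0)

known = mean_pool("girl woman xyzzy", toy_embeddings, dim=3)  # 'xyzzy' is out of vocabulary
unknown = mean_pool("xyzzy", toy_embeddings, dim=3)           # no known words at all

print(known)    # average of the 'girl' and 'woman' vectors
print(unknown)  # zero vector
```

One consequence of plain averaging is that word order is lost, so two posts with the same words in a different order receive the same vector.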

### Models + Training with Training Data Set
#### Each model is trained on each of the feature sets

### Model 1: Logistic Regression

In [41]:
# -----------------------------
# 1. TF-IDF Feature Workflow
# -----------------------------
# Assume X_train_tfidf, X_test_tfidf, train_df["label"], test_df["label"] already exist

k = 1000  # top features for TF-IDF

# Feature selection using chi2 (non-negative)
selector_tfidf = SelectKBest(chi2, k=k)
X_train_selected = selector_tfidf.fit_transform(X_train_tfidf, train_df["label"])
X_test_selected  = selector_tfidf.transform(X_test_tfidf)

# GridSearchCV to find best C
C_values = np.logspace(-4, 4, 50)  # 50 values from 0.0001 to 10000
param_grid = {'C': C_values}

lr = LogisticRegression(max_iter=1000, random_state=42)
grid_tfidf = GridSearchCV(lr, param_grid, cv=5, scoring='f1_weighted')
grid_tfidf.fit(X_train_selected, train_df["label"])

print("TF-IDF Best C:", grid_tfidf.best_params_['C'])
print("TF-IDF Best CV F1:", grid_tfidf.best_score_)

# Train final Logistic Regression
best_lr_tfidf = LogisticRegression(C=grid_tfidf.best_params_['C'], max_iter=1000, random_state=42)
best_lr_tfidf.fit(X_train_selected, train_df["label"])

# Evaluate
y_pred_tfidf = best_lr_tfidf.predict(X_test_selected)
print("TF-IDF Logistic Regression Results:")
print(classification_report(test_df["label"], y_pred_tfidf, digits=4))


# -----------------------------
# 2. GloVe Feature Workflow
# -----------------------------
# Feature selection using L1 Logistic Regression
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0, max_iter=1000, random_state=42)
lr_l1.fit(X_train_glove, train_df["label"])

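# Use SelectFromModel instead of SelectKBest(chi2): chi2 requires non-negative features,
# and GloVe embeddings contain negative values.
# Note: max_features=1000 cannot bind on 100-dimensional vectors, so the L1 threshold alone decides.
selector_glove = SelectFromModel(lr_l1, prefit=True, max_features=1000)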
X_train_selected_glove = selector_glove.transform(X_train_glove)
X_test_selected_glove  = selector_glove.transform(X_test_glove)

# GridSearchCV to find best C for GloVe
grid_glove = GridSearchCV(lr, param_grid, cv=5, scoring='f1_weighted')
grid_glove.fit(X_train_selected_glove, train_df["label"])

print("GloVe Best C:", grid_glove.best_params_['C'])
print("GloVe Best CV F1:", grid_glove.best_score_)

# Train final Logistic Regression
best_lr_glove = LogisticRegression(C=grid_glove.best_params_['C'], max_iter=1000, random_state=42)
best_lr_glove.fit(X_train_selected_glove, train_df["label"])

# Evaluate
y_pred_glove = best_lr_glove.predict(X_test_selected_glove)
print("GloVe Logistic Regression Results:")
print(classification_report(test_df["label"], y_pred_glove, digits=4))


TF-IDF Best C: 35.564803062231285
TF-IDF Best CV F1: 0.8166562966603014
TF-IDF Logistic Regression Results:
              precision    recall  f1-score   support

  not sexist     0.8260    0.9024    0.8625       789
  sexist         0.6562    0.4949    0.5643       297

    accuracy                         0.7910      1086
   macro avg     0.7411    0.6987    0.7134      1086
weighted avg     0.7796    0.7910    0.7810      1086

GloVe Best C: 1526.4179671752302
GloVe Best CV F1: 0.6657087723692219
GloVe Logistic Regression Results:
              precision    recall  f1-score   support

  not sexist     0.7545    0.9037    0.8224       789
  sexist         0.4610    0.2189    0.2968       297

    accuracy                         0.7164      1086
   macro avg     0.6077    0.5613    0.5596      1086
weighted avg     0.6742    0.7164    0.6786      1086
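
In the Model 1 cell above, the TF-IDF vocabulary and `SelectKBest` are fitted on the whole training set before `GridSearchCV` runs, so each validation fold has already influenced which features were kept. This does not touch the test set, but it can make the cross-validation score somewhat optimistic. A sketch of one alternative, under the assumption that raw training texts are passed in, is to put the vectorizer, the selector and the classifier in a single `Pipeline` so that every fold refits all three. The toy texts, `k=5`, `cv=3` and the small `C` grid are illustrative choices, not the settings used above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy training data; in the notebook this would be the training split.
texts = [f"friendly remark about topic {i}" for i in range(15)] + \
        [f"hostile remark about topic {i}" for i in range(15)]
labels = ["not sexist"] * 15 + ["sexist"] * 15

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=5)),  # refitted inside every CV fold
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": np.logspace(-2, 2, 5)}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_weighted")
grid.fit(texts, labels)

print("Best C:", grid.best_params_["clf__C"])
print("Best CV F1:", round(grid.best_score_, 4))
```

After fitting, `grid.best_estimator_` is already retrained on all of the training data with the best `C`, so it can be used directly for the final test-set prediction.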



### Model 2: Linear SVM

In [None]:

from sklearn.metrics import classification_report

# --- TF-IDF Features ---
# Assume X_train_tfidf, X_test_tfidf, train_df["label"], test_df["label"] exist

# Create a simple linear SVM (SVC is the LinearSVC alias from the imports cell)
svm = SVC(random_state=42)

# Train on TF-IDF features
svm.fit(X_train_tfidf, train_df["label"])

# Predict on test set
y_pred = svm.predict(X_test_tfidf)

# Evaluate
print("TF-IDF SVM Results:")
print(classification_report(test_df["label"], y_pred, digits=4))


# --- GloVe Features ---
# Assume X_train_glove, X_test_glove exist

# Create SVM
svm_glove = SVC(random_state=42)

# Train on GloVe features
svm_glove.fit(X_train_glove, train_df["label"])

# Predict on test set
y_pred_glove = svm_glove.predict(X_test_glove)

# Evaluate
print("GloVe SVM Results:")
print(classification_report(test_df["label"], y_pred_glove, digits=4))


TF-IDF SVM Results:
              precision    recall  f1-score   support

  not sexist     0.8193    0.8733    0.8454       789
  sexist         0.5918    0.4882    0.5351       297

    accuracy                         0.7680      1086
   macro avg     0.7055    0.6807    0.6902      1086
weighted avg     0.7571    0.7680    0.7605      1086

GloVe SVM Results:
              precision    recall  f1-score   support

  not sexist     0.7510    0.9214    0.8275       789
  sexist         0.4746    0.1886    0.2699       297

    accuracy                         0.7210      1086
   macro avg     0.6128    0.5550    0.5487      1086
weighted avg     0.6754    0.7210    0.6750      1086



### Model 3:

Loading GloVe...
Loaded 400000 vectors.
Training...
Epoch 1 Loss: 0.3142
Epoch 2 Loss: 0.4607
Epoch 3 Loss: 0.8530
Epoch 4 Loss: 0.4531
Epoch 5 Loss: 0.1476

Sentence: all women are bad at math
Predicted label: 0 (1 = sexist, 0 = not sexist)


### Evaluation (Use models on Test set)