
# Final Project

This final project can be collaborative. The maximum group size is 3, and you can also work by yourself. Please respect academic integrity. **Remember: if you get caught cheating, you get an F.**

## An Introduction to the competition

<img src="https://github.com/Markechy/Data-Science-Final-Project/blob/master/news-sexisme-EN.jpg?raw=1" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder.

Different from our previous homework, this competition gives you great flexibility (and very few hints). You can freely determine every component of your workflow, including but not limited to:
-  **Preprocessing the input text**: You may decide how to clean or transform the text. For example, removing emojis or URLs, lowercasing, removing stopwords, applying stemming or lemmatization, correcting spelling, or performing tokenization and sentence segmentation.
-  **Feature extraction and encoding**: You can choose any method to convert text into numerical representations, such as TF-IDF, Bag-of-Words, N-grams, Word2Vec, GloVe, FastText, contextual embeddings (e.g., BERT, RoBERTa, or other transformer-based models), Part-of-Speech (POS) tagging, dependency-based features, sentiment or emotion features, readability metrics, or even embeddings or features generated by large language models (LLMs).
-  **Data augmentation and enrichment**: You may expand or balance your dataset by incorporating other related corpora or using techniques like synonym replacement, random deletion/insertion, or LLM-assisted augmentation (e.g., generating paraphrased or synthetic examples to improve model robustness).
-  **Model selection**: You are free to experiment with different models, from traditional machine learning algorithms (e.g., Logistic Regression, SVM, Random Forest, XGBoost) to deep learning architectures (e.g., CNNs, RNNs, Transformers), or even hybrid/ensemble approaches that combine multiple models or leverage LLM-generated predictions or reasoning.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values. You may explore both traditional and AI-assisted techniques. Data augmentation is optional.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision (P), Recall (R) and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. Here is an example illustrating how the experimental results table should be presented.

| Feature + Model | Sexist (P) | Sexist (R) | Sexist (F1) | Non-Sexist (P) | Non-Sexist (R) | Non-Sexist (F1) | Weighted (P) | Weighted (R) | Weighted (F1) |
|-----------------|:----------:|:----------:|:------------:|:---------------:|:---------------:|:----------------:|:-------------:|:--------------:|:---------------:|
| TF-IDF + Logistic Regression | ... | ... | ... | ... | ... | ... | ... | ... | ... |

- **Format of the report**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary for each section:
    - Data Preprocessing
    - Feature Engineering
    - Model Selection and Architecture
    - Training and Validation
    - Evaluation and Results
    - Use of Generative AI (if you use)
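The evaluation table described in the requirements could be assembled along the following lines. This is only a minimal sketch: `report_row`, `results_df`, and the toy labels are hypothetical names and values introduced for illustration, and it assumes the two class labels are the strings `sexist` and `not sexist`.

```python
import pandas as pd
from sklearn.metrics import classification_report


def report_row(name, y_true, y_pred):
    """Flatten one classification report into a single results-table row."""
    rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    row = {"Feature + Model": name}
    for label, prefix in [("sexist", "Sexist"),
                          ("not sexist", "Non-Sexist"),
                          ("weighted avg", "Weighted")]:
        row[f"{prefix} (P)"] = rep[label]["precision"]
        row[f"{prefix} (R)"] = rep[label]["recall"]
        row[f"{prefix} (F1)"] = rep[label]["f1-score"]
    return row


# Toy labels, for illustration only (these are not project results)
y_true = ["sexist", "not sexist", "not sexist", "sexist"]
y_pred = ["sexist", "not sexist", "sexist", "sexist"]

results_df = pd.DataFrame([report_row("toy features + toy model", y_true, y_pred)])
print(results_df.round(2).T)
```

Calling `report_row` once per feature and model combination and collecting the rows in a list would give one table row per experiment.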

## Rules
Violations will result in 0 points in the grade:
-   `Rule 1 - No test set leakage`: You must not use any instance from the test set during training, feature engineering, or model selection.
-   `Rule 2 - Responsible AI use`: You may use generative AI, but you must clearly document how it was used. If you have used genAI, include a section titled "Use of Generative AI" describing:
    -   What parts of the project you used AI for
    -   What was implemented manually vs. with AI assistance

## Grading

Performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and F1-score (use `classification_report` in sklearn).

The total points are 10.0. Each team will compete with other teams in the class on their best performance. Points will be deducted if not following the requirements above.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

If your best performance reaches **0.82** or above (weighted F1-score) and follows all the requirements and rules, you will also get full points (10.0 points).

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report, including:
- code and experimental results with details explained
- combined results table, report and best performance
- a summary at the end of the report (please follow the format above)

Missing any part of the above requirements will result in point deductions.

The due date is **Dec 11, Thursday by 11:59pm**.

In [1]:
#Rena Wang

In [2]:
#Marco Antonio Gonzalez Fernandez

## Experimental Results

(A table detailed model performance on the test set with at least 6 rows. Report the best performance.)


## Project Summary
### 1. Data Preprocessing


### 2. Feature Engineering


### 3. Model Selection and Architecture


### 4. Training and Validation


### 5. Evaluation and Results


### 6. Use of Generative AI (if you use)

# Necessary Libraries

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict
import re

# 1. Data Preprocessing

We first start by loading the data and clean up the text.

In [4]:
# Read the CSV file and create a DataFrame
edos_df = pd.read_csv("edos_labelled_data.csv")
edos_df

Unnamed: 0,rewire_id,text,label,split
0,sexism2022_english-9609,"In Nigeria, if you rape a woman, the men rape ...",not sexist,train
1,sexism2022_english-16993,"Then, she's a keeper. 😉",not sexist,train
2,sexism2022_english-13149,This is like the Metallica video where the poo...,not sexist,train
3,sexism2022_english-13021,woman?,not sexist,train
4,sexism2022_english-966,I bet she wished she had a gun,not sexist,train
...,...,...,...,...
5274,sexism2022_english-4599,Only if you make it clear you're not looking f...,not sexist,train
5275,sexism2022_english-1196,It was like a big sisterhood all stemming from...,sexist,test
5276,sexism2022_english-9772,It goes like this: I'm on the dance floor and ...,not sexist,test
5277,sexism2022_english-14511,It could be like for the ladies' corner of you...,sexist,test


In [5]:
# Helper function that cleans a single text entry

def clean_text(line):
    if pd.isna(line):
        return line

    #Lower case the text
    line = line.lower()

    #Remove URLs
    line = re.sub(r'http\S+|www\.\S+', '', line)

    #Remove @mentions
    line = re.sub(r'@\w+', '', line)

    #Remove [user] and [url]
    line = re.sub(r'\[user\]', '', line)
    line = re.sub(r'\[url\]', '', line)

    #Remove multiple spaces
    line = re.sub(r'\s+', ' ', line).strip()

    return line
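The cleaning steps can be checked on a made-up post. The sketch below repeats the same rules in a standalone function (`clean_text_sketch` and the sample string are hypothetical, introduced only for this check) and escapes the dot after `www` so that it matches a literal period only.

```python
import re


def clean_text_sketch(line):
    """Same steps as the cleaning function above, with the dot after `www` escaped."""
    line = line.lower()
    line = re.sub(r'http\S+|www\.\S+', '', line)   # URLs
    line = re.sub(r'@\w+', '', line)               # @mentions
    line = re.sub(r'\[user\]|\[url\]', '', line)   # dataset placeholders
    return re.sub(r'\s+', ' ', line).strip()       # collapse whitespace


sample = "Check THIS out http://example.com/page @someone [USER] awwwsome   day"
cleaned = clean_text_sketch(sample)
print(cleaned)
# check this out awwwsome day

# With an unescaped dot, `www.` also matches inside ordinary words:
loose = re.sub(r'www.\S+', '', "awwwsome day")
print(loose)
# a day
```

Lowercasing runs first, which is why the `[USER]` placeholder is still caught by the lowercase pattern.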

In [6]:
#Clean text column
edos_df['text'] = edos_df['text'].apply(clean_text)
edos_df


Unnamed: 0,rewire_id,text,label,split
0,sexism2022_english-9609,"in nigeria, if you rape a woman, the men rape ...",not sexist,train
1,sexism2022_english-16993,"then, she's a keeper. 😉",not sexist,train
2,sexism2022_english-13149,this is like the metallica video where the poo...,not sexist,train
3,sexism2022_english-13021,woman?,not sexist,train
4,sexism2022_english-966,i bet she wished she had a gun,not sexist,train
...,...,...,...,...
5274,sexism2022_english-4599,only if you make it clear you're not looking f...,not sexist,train
5275,sexism2022_english-1196,it was like a big sisterhood all stemming from...,sexist,test
5276,sexism2022_english-9772,it goes like this: i'm on the dance floor and ...,not sexist,test
5277,sexism2022_english-14511,it could be like for the ladies' corner of you...,sexist,test


The EDOS dataset already provides a predefined split for training and evaluation. Instead of randomly dividing the data ourselves, we follow the dataset's built-in labels to ensure consistency with the competition setup.

In [7]:
#Split original dataset into test set and train set
test_df  = edos_df[edos_df['split'] == 'test'].copy()
train_df = edos_df[edos_df['split'] == 'train'].copy()
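A quick sanity check on a split like this might look as follows. This is a sketch on a made-up frame (`toy_df` and its values are hypothetical); it only assumes the same `label` and `split` column names as the dataset.

```python
import pandas as pd

# Toy frame mimicking the `label` and `split` columns (values are made up)
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": ["sexist", "not sexist", "not sexist", "not sexist", "sexist", "not sexist"],
    "split": ["train", "train", "train", "train", "test", "test"],
})

toy_train = toy_df[toy_df["split"] == "train"]
toy_test = toy_df[toy_df["split"] == "test"]

# The two subsets should be disjoint and together cover every row
assert set(toy_train.index).isdisjoint(toy_test.index)
assert len(toy_train) + len(toy_test) == len(toy_df)

# Share of each label within each split, to spot class imbalance early
balance = pd.crosstab(toy_df["split"], toy_df["label"], normalize="index")
print(balance)
```

Knowing the label proportions up front helps when deciding whether options such as `class_weight="balanced"` are worth trying.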


In [8]:
#Verify test set
test_df

Unnamed: 0,rewire_id,text,label,split
21,sexism2022_english-845,fuck the niggers and the jews. both have a his...,not sexist,test
27,sexism2022_english-6629,well then good because someone has to knock he...,sexist,test
35,sexism2022_english-17573,#usa #texas #islam #muslims #islamization #sha...,not sexist,test
38,sexism2022_english-10268,"yes, normal women want to be dominated. social...",sexist,test
44,sexism2022_english-10735,she didn't have to be a bitch about it. she li...,sexist,test
...,...,...,...,...
5268,sexism2022_english-19421,so did you break it off with her then.,not sexist,test
5270,sexism2022_english-15150,in early middle school i was physically bullie...,sexist,test
5275,sexism2022_english-1196,it was like a big sisterhood all stemming from...,sexist,test
5276,sexism2022_english-9772,it goes like this: i'm on the dance floor and ...,not sexist,test


In [9]:
#Verify train set
train_df

Unnamed: 0,rewire_id,text,label,split
0,sexism2022_english-9609,"in nigeria, if you rape a woman, the men rape ...",not sexist,train
1,sexism2022_english-16993,"then, she's a keeper. 😉",not sexist,train
2,sexism2022_english-13149,this is like the metallica video where the poo...,not sexist,train
3,sexism2022_english-13021,woman?,not sexist,train
4,sexism2022_english-966,i bet she wished she had a gun,not sexist,train
...,...,...,...,...
5271,sexism2022_english-19863,supporting toxic men and glorifying toxic male...,sexist,train
5272,sexism2022_english-18722,find a girl with common beliefs. i have. they ...,not sexist,train
5273,sexism2022_english-2564,not to mention that she's an outright commie w...,not sexist,train
5274,sexism2022_english-4599,only if you make it clear you're not looking f...,not sexist,train


# 2. Feature Engineering

## Method 1: TF-IDF word n-grams (1–3)

Our first feature method aims to build a classical machine learning representation of the text, so we chose the word n-gram feature. This approach captures local linguistic patterns, such as keywords, short phrases, and recurring expressions that could help distinguish sexist from non-sexist language.

To create our word n-gram feature method, we first wrote a helper function to generate uni-grams, bi-grams, and tri-grams after removing English stopwords to reduce noise. We applied this function to every text in the training set and counted how often each n-gram appeared in the sexist and non-sexist classes, giving us an initial understanding of which phrases characterize each category. Then, we used a TF-IDF vectorizer with ngram_range=(1,3) and min_df=2 to convert the text into numerical features. The vectorizer was fit only on the training data to avoid data leakage, and then applied to the test set using the learned vocabulary.
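To make the n-gram helper concrete, here is a small standalone sketch of the same sliding-window idea. It uses a hypothetical four-word stopword set (`TOY_STOPWORDS`) instead of scikit-learn's list so that the output is easy to follow; `toy_ngrams` is likewise an illustrative name.

```python
# Hypothetical mini stopword list, used here only to keep the output predictable
TOY_STOPWORDS = {"the", "a", "is", "to"}


def toy_ngrams(text, n=1):
    """Drop stopwords, then slide a window of n words over what is left."""
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    # zip over n shifted copies of the word list yields consecutive n-word tuples
    return [" ".join(ng) for ng in zip(*[words[i:] for i in range(n)])]


sentence = "the model is easy to fool"
for n in (1, 2, 3):
    print(n, toy_ngrams(sentence, n))
# 1 ['model', 'easy', 'fool']
# 2 ['model easy', 'easy fool']
# 3 ['model easy fool']
```

Because stopwords are removed before the window is applied, a bigram such as `model easy` joins words that were not adjacent in the original sentence.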

In [10]:
# Feature extraction: N-grams
from collections import defaultdict
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Function to generate n-grams from text, filtering stopwords
def generate_ngrams(text, n=1):
    # Split text and remove stopwords
    words = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]
    # Generate n-grams
    ngrams_list = zip(*[words[i:] for i in range(n)])
    return [' '.join(ng) for ng in ngrams_list]

# Dictionaries to store counts for sexist and non-sexist classes
sexist_ngrams = defaultdict(int)
nonsexist_ngrams = defaultdict(int)

# Loop through training data and count n-grams
for text, label in zip(train_df['text'], train_df['label']):
    for n in [1, 2, 3]:
        for ng in generate_ngrams(text, n):
            if label == 'sexist':
                sexist_ngrams[ng] += 1
            else:
                nonsexist_ngrams[ng] += 1


#TF-IDF with n-grams 1 to 3
ngram_vectorizer = TfidfVectorizer(
    ngram_range=(1,3),
    stop_words='english',
    min_df=2
)

# Fit ONLY on training set (very important for Rule 1)
X_train_ngrams = ngram_vectorizer.fit_transform(train_df['text'])

# Transform test set using the train vocabulary
X_test_ngrams = ngram_vectorizer.transform(test_df['text'])

# Labels
y_train = train_df['label']
y_test = test_df['label']

print("N-gram feature matrix created!")
print("Train shape:", X_train_ngrams.shape)
print("Test shape:", X_test_ngrams.shape)

N-gram feature matrix created!
Train shape: (4193, 6472)
Test shape: (1086, 6472)
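The fit-on-train, transform-on-test pattern can be seen on a tiny made-up corpus. In this sketch the texts and variable names are hypothetical; the point is that the test matrix has the same columns as the training matrix and that words seen only at test time are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts, for illustration only
toy_train_texts = ["women are strong", "men are strong", "women and men work"]
toy_test_texts = ["women are unstoppable"]

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_tr = vec.fit_transform(toy_train_texts)   # vocabulary and IDF learned here only
X_te = vec.transform(toy_test_texts)        # reuses the training vocabulary

print(X_tr.shape, X_te.shape)
print("unstoppable" in vec.vocabulary_)     # word seen only in the test text
```

Both matrices share the same number of columns, which mirrors the matching second dimension in the shapes printed above.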


In [11]:
#Verify if the method was applied correctly
from collections import Counter

print("Top 10 n-grams for sexist class:")
print(Counter(sexist_ngrams).most_common(10))

print("\nTop 10 n-grams for non-sexist class:")
print(Counter(nonsexist_ngrams).most_common(10))


Top 10 n-grams for sexist class:
[('women', 386), ('like', 172), ('men', 145), ('just', 141), ('woman', 92), ('bitch', 87), ("don't", 86), ('want', 82), ("it's", 79), ('female', 71)]

Top 10 n-grams for non-sexist class:
[('just', 352), ('women', 322), ('like', 314), ("don't", 198), ("it's", 167), ('woman', 145), ('girl', 141), ('want', 135), ('people', 133), ("i'm", 128)]


Nithyashree. (2021, September 13). What are N-grams and how to implement them in Python? Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python

## Method 2: TF-IDF character n-grams (3–5)

For our second feature method, we created a character-level TF-IDF representation to capture subword patterns such as prefixes, suffixes, repeated characters, and partial word fragments that often appear in informal or misspelled online text. Using TfidfVectorizer with analyzer="char" and an n-gram range of 3–5 characters, we extracted overlapping character sequences from each text and computed their TF-IDF weights, ignoring extremely rare patterns with min_df=3. As with our word n-grams, the vectorizer was fit only on the training set to prevent data leakage and then applied to the test set using the learned character-level vocabulary.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

char_vectorizer = TfidfVectorizer(
    analyzer="char",
    ngram_range=(3, 5),
    min_df=3  # ignore super-rare patterns
)

X_train_char = char_vectorizer.fit_transform(train_df["text"])
X_test_char  = char_vectorizer.transform(test_df["text"])

y_train_char = train_df["label"]
y_test_char  = test_df["label"]

print("Char TF-IDF feature matrix created!")
print("Train shape:", X_train_char.shape)
print("Test shape:", X_test_char.shape)

Char TF-IDF feature matrix created!
Train shape: (4193, 53684)
Test shape: (1086, 53684)
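To see what the character analyzer extracts, the sketch below runs it on one word and on a stretched spelling of the same word. The two example words are made up; `build_analyzer` is scikit-learn's way of exposing the tokenization step without fitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the character-level vectorizer, no fitting needed
analyzer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5)).build_analyzer()

standard = set(analyzer("women"))
stretched = set(analyzer("wooomen"))

print(sorted(standard))
print(sorted(standard & stretched))   # fragments shared despite the odd spelling
```

The shared fragments are what allow a character-level model to relate informal spellings to their standard forms, which a word-level vocabulary would treat as unrelated tokens.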


In [13]:
import numpy as np

feature_names = char_vectorizer.get_feature_names_out()

# Boolean masks for each class
sex_mask = (y_train == "sexist")
nsex_mask = (y_train == "not sexist")

# If y_train is a Series, make sure these are numpy arrays
if hasattr(sex_mask, "values"):
    sex_mask = sex_mask.values
    nsex_mask = nsex_mask.values

# Average TF-IDF per n-gram, per class
avg_sex = X_train_char[sex_mask].mean(axis=0).A1
avg_non = X_train_char[nsex_mask].mean(axis=0).A1

# Class-specific scores (difference)
sex_score = avg_sex - avg_non      # positive => more associated with sexist
nonsex_score = avg_non - avg_sex   # positive => more associated with non-sexist

# Top 20 sexist-associated ngrams
top20_sex_idx = np.argsort(sex_score)[-20:]
top20_sex_ngrams = [(feature_names[i], sex_score[i], avg_sex[i], avg_non[i])
                    for i in reversed(top20_sex_idx)]

# Top 20 non-sexist-associated ngrams
top20_non_idx = np.argsort(nonsex_score)[-20:]
top20_non_ngrams = [(feature_names[i], nonsex_score[i], avg_sex[i], avg_non[i])
                    for i in reversed(top20_non_idx)]

print("Top 20 sexist-associated ngrams:")
for ng, score, a_s, a_n in top20_sex_ngrams:
    print(f"{ng!r}  diff={score:.4f}  avg_sex={a_s:.4f}  avg_non={a_n:.4f}")

print("\nTop 20 non-sexist-associated ngrams:")
for ng, score, a_s, a_n in top20_non_ngrams:
    print(f"{ng!r}  diff={score:.4f}  avg_sex={a_s:.4f}  avg_non={a_n:.4f}")

Top 20 sexist-associated ngrams:
'men'  diff=0.0077  avg_sex=0.0144  avg_non=0.0067
'men '  diff=0.0075  avg_sex=0.0119  avg_non=0.0044
'wom'  diff=0.0063  avg_sex=0.0123  avg_non=0.0060
' wom'  diff=0.0061  avg_sex=0.0118  avg_non=0.0058
'en '  diff=0.0060  avg_sex=0.0140  avg_non=0.0080
'women'  diff=0.0060  avg_sex=0.0106  avg_non=0.0046
'wome'  diff=0.0059  avg_sex=0.0106  avg_non=0.0046
'omen'  diff=0.0059  avg_sex=0.0105  avg_non=0.0046
'omen '  diff=0.0059  avg_sex=0.0093  avg_non=0.0034
'bitch'  diff=0.0056  avg_sex=0.0057  avg_non=0.0001
'bitc'  diff=0.0056  avg_sex=0.0057  avg_non=0.0001
' bitc'  diff=0.0055  avg_sex=0.0056  avg_non=0.0001
' wome'  diff=0.0055  avg_sex=0.0100  avg_non=0.0045
'bit'  diff=0.0055  avg_sex=0.0060  avg_non=0.0005
' bit'  diff=0.0054  avg_sex=0.0058  avg_non=0.0004
'itch'  diff=0.0054  avg_sex=0.0058  avg_non=0.0004
'itc'  diff=0.0053  avg_sex=0.0058  avg_non=0.0005
' wo'  diff=0.0050  avg_sex=0.0135  avg_non=0.0085
'tch'  diff=0.0050  avg_sex=0.00

# 3. Model Selection and Architecture

## Model 1 - Logistic Regression

For our first model, we trained a Logistic Regression classifier using both methods. Logistic Regression is a strong and interpretable baseline for text classification, and it performs well in high-dimensional sparse feature spaces like TF-IDF. We used the liblinear solver with a higher iteration limit (max_iter=5000) to ensure convergence, applied class_weight="balanced" to address class imbalance, and set C=0.5 to introduce moderate regularization that prevents overfitting. After fitting the model on the training n-gram matrix, we generated predictions on the test set and evaluated performance using accuracy and a detailed classification report.
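The effect of `class_weight="balanced"` can be made explicit with scikit-learn's `compute_class_weight`, which applies the formula `n_samples / (n_classes * count_per_class)`. The label counts below are made up for illustration and are not the training-set counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: three of one class, one of the other
toy_labels = np.array(["not sexist"] * 3 + ["sexist"])
classes = np.unique(toy_labels)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)
print(dict(zip(classes, weights.round(3))))
```

The rarer class receives the larger weight, so its errors count more in the loss, which tends to raise recall on that class at some cost in precision.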

## Feature Method 1

In [14]:
# MODEL 1: Logistic Regression (Method 1)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Create and train the model
log_model = LogisticRegression(
    max_iter=5000,
    solver="liblinear",
    class_weight="balanced",
    C=0.5
)
log_model.fit(X_train_ngrams, y_train)

# Predictions
y_pred = log_model.predict(X_test_ngrams)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}")

# Classification report
print("\nClassification Report:")
A = classification_report(y_test, y_pred)  # keep the report text for later use
print(A)


Test Accuracy: 0.7643

Classification Report:
              precision    recall  f1-score   support

  not sexist       0.87      0.80      0.83       789
      sexist       0.56      0.67      0.61       297

    accuracy                           0.76      1086
   macro avg       0.71      0.73      0.72      1086
weighted avg       0.78      0.76      0.77      1086
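The weighted-average row is the support-weighted mean of the per-class scores. The short check below redoes that arithmetic for F1 using the values printed in the report above; since those values are already rounded to two decimals, the result is approximate.

```python
# Per-class F1 and support as printed in the report above (already rounded)
f1 = {"not sexist": 0.83, "sexist": 0.61}
support = {"not sexist": 789, "sexist": 297}

total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(round(weighted_f1, 2))
# 0.77
```

Because the majority class carries about 73% of the weight, the weighted F1 moves much more with performance on `not sexist` than on `sexist`.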



## Feature Method 2

In [15]:
# MODEL 1: Logistic Regression (Method 2)

# Create and train the model
log_model = LogisticRegression(
    max_iter=5000,
    solver="liblinear",
    class_weight="balanced",
    C=0.5
)

log_model.fit(X_train_char, y_train)

# Predictions
y_pred = log_model.predict(X_test_char)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}")

# Classification report
print("\nClassification Report:")
B = classification_report(y_test, y_pred)  # keep the report text for later use
print(B)

Test Accuracy: 0.7781

Classification Report:
              precision    recall  f1-score   support

  not sexist       0.86      0.83      0.84       789
      sexist       0.59      0.64      0.61       297

    accuracy                           0.78      1086
   macro avg       0.72      0.73      0.73      1086
weighted avg       0.78      0.78      0.78      1086



## Combined Methods

To improve performance, we combined both feature representations by horizontally stacking the word n-gram and character n-gram matrices. This hybrid approach allows the model to capture both higher-level semantic phrases and fine-grained subword patterns. We then trained Logistic Regression on the combined features. The weighted F1 rose to 0.80, still short of our 0.82 target, so we implemented two more models to try to improve performance.

In [16]:
from scipy.sparse import hstack

# Combine WORD n-grams + CHAR n-grams
X_train_combined = hstack([X_train_ngrams, X_train_char])
X_test_combined  = hstack([X_test_ngrams, X_test_char])

print("Combined feature matrix created!")
print("Train shape:", X_train_combined.shape)
print("Test shape:", X_test_combined.shape)


Combined feature matrix created!
Train shape: (4193, 60156)
Test shape: (1086, 60156)
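The combined width is simply the sum of the two vocabularies (6472 + 53684 = 60156). The sketch below shows the same stacking on two tiny made-up sparse matrices; passing `format="csr"` is an optional choice made here so that the result supports row slicing.

```python
from scipy.sparse import csr_matrix, hstack

# Two made-up feature blocks for the same two documents
word_block = csr_matrix([[1.0, 0.0, 0.5], [0.0, 2.0, 0.0]])
char_block = csr_matrix([[0.0, 3.0], [4.0, 0.0]])

# Columns are concatenated; rows (documents) must line up in both blocks
combined = hstack([word_block, char_block], format="csr")
print(combined.shape)
print(combined.toarray())
```

The row order has to be identical in both blocks, which holds in the notebook because both vectorizers transform the same `train_df` and `test_df` columns.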


In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

log_model_combined = LogisticRegression(
    max_iter=5000,
    solver="liblinear",
    class_weight="balanced",
    C=0.5
)

log_model_combined.fit(X_train_combined, y_train)

y_pred_combined = log_model_combined.predict(X_test_combined)

print(f"Test Accuracy (Combined): {accuracy_score(y_test, y_pred_combined):.4f}")
print("\nClassification Report (Combined Logistic Regression):")
C = classification_report(y_test, y_pred_combined)  # keep the report text for later use
print(C)


Test Accuracy (Combined): 0.7956

Classification Report (Combined Logistic Regression):
              precision    recall  f1-score   support

  not sexist       0.88      0.84      0.86       789
      sexist       0.61      0.69      0.65       297

    accuracy                           0.80      1086
   macro avg       0.74      0.76      0.75      1086
weighted avg       0.80      0.80      0.80      1086



## Model 2 - Linear SVM

For our second classical model, we trained a Linear Support Vector Machine using the same feature methods. SVMs are particularly effective for high-dimensional sparse text data because they maximize the margin between classes, often outperforming Logistic Regression when decision boundaries are tight or overlap. However, the weighted F1 score did not improve: the best SVM configuration (character TF-IDF) reached 0.79, below the 0.80 of the combined Logistic Regression. We therefore moved on to the next model.
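One option for tuning the SVM without touching the test set is cross-validation on the training split only. The sketch below is not what this notebook does; it is a hedged illustration on a made-up toy corpus, and the grid values, labels, and variable names are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus with two neutral classes, for illustration only
texts = ["red apple pie", "green apple juice", "ripe banana bread",
         "sweet banana split", "fresh apple tart", "banana apple smoothie",
         "steel hammer nail", "iron hammer handle", "sharp saw blade",
         "rusty saw teeth", "heavy hammer blow", "saw hammer toolbox"]
labels = ["fruit"] * 6 + ["tool"] * 6

# Keeping the vectorizer inside the pipeline refits it on each training fold,
# so validation folds never influence the vocabulary
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                      scoring="f1_weighted", cv=3)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

With this setup the test set would be used once, for the final report, after the regularization strength has been chosen on training folds.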

## Feature Method 1

In [18]:
# MODEL 2: Linear SVM (Method 1)
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

# Create and train the SVM model
svm_model = LinearSVC()
svm_model.fit(X_train_ngrams, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_ngrams)

# Accuracy
acc_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Test Accuracy: {acc_svm:.4f}")

# Classification report
print("\nClassification Report (SVM):")
D = classification_report(y_test, y_pred_svm)  # keep the report text for later use
print(D)


SVM Test Accuracy: 0.7744

Classification Report (SVM):
              precision    recall  f1-score   support

  not sexist       0.83      0.87      0.85       789
      sexist       0.60      0.52      0.56       297

    accuracy                           0.77      1086
   macro avg       0.71      0.69      0.70      1086
weighted avg       0.77      0.77      0.77      1086



## Feature Method 2

In [19]:
# MODEL 2: Linear SVM (Method 2)
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

# Create and train the SVM model
svm_model_char = LinearSVC()
svm_model_char.fit(X_train_char, y_train)

# Predictions
y_pred_svm_char = svm_model_char.predict(X_test_char)

# Accuracy
acc_svm_char = accuracy_score(y_test, y_pred_svm_char)
print(f"SVM Test Accuracy (Char TF-IDF): {acc_svm_char:.4f}")

# Classification report
print("\nClassification Report (SVM - Char TF-IDF):")
E = classification_report(y_test, y_pred_svm_char)  # keep the report text for later use
print(E)


SVM Test Accuracy (Char TF-IDF): 0.8020

Classification Report (SVM - Char TF-IDF):
              precision    recall  f1-score   support

  not sexist       0.82      0.92      0.87       789
      sexist       0.70      0.48      0.57       297

    accuracy                           0.80      1086
   macro avg       0.76      0.70      0.72      1086
weighted avg       0.79      0.80      0.79      1086



## Combined Methods

In [20]:
from sklearn.svm import LinearSVC

svm_combined = LinearSVC(class_weight="balanced")
svm_combined.fit(X_train_combined, y_train)

y_pred_svm_comb = svm_combined.predict(X_test_combined)

print(f"SVM Test Accuracy (Combined): {accuracy_score(y_test, y_pred_svm_comb):.4f}")
print("\nClassification Report (Combined SVM):")
F = classification_report(y_test, y_pred_svm_comb)  # keep the report text for later use
print(F)


SVM Test Accuracy (Combined): 0.7781

Classification Report (Combined SVM):
              precision    recall  f1-score   support

  not sexist       0.85      0.85      0.85       789
      sexist       0.59      0.60      0.60       297

    accuracy                           0.78      1086
   macro avg       0.72      0.72      0.72      1086
weighted avg       0.78      0.78      0.78      1086



## Model 3 - XGBoost

For our last classical model, we experimented with XGBoost, a powerful gradient-boosted tree algorithm. Even though XGBoost is traditionally used for dense, low-dimensional tabular data, we adapted it to our high-dimensional sparse TF-IDF matrices by tuning parameters such as n_estimators, learning_rate, and max_depth (for example, n_estimators=400, learning_rate=0.07, and max_depth=6 for Method 1). We also used the hist tree method for faster training on large feature spaces. The best weighted F1 score we obtained was 0.81, still short of our 0.82 target, so the next step was to try a deep learning method.

In [None]:
from sklearn.preprocessing import LabelEncoder


# Create a LabelEncoder instance to convert string labels into numeric values
le = LabelEncoder()


y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

print("Label encoding completed:", le.classes_)


Label encoding completed: ['not sexist' 'sexist']
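`LabelEncoder` sorts the class names, so `not sexist` maps to 0 and `sexist` to 1. Reports computed on the encoded labels show only those integers; one way to keep them readable, sketched here with made-up predictions and a hypothetical `le_demo` encoder, is to pass the class names as `target_names` or to map predictions back with `inverse_transform`.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
y_true_enc = le_demo.fit_transform(["not sexist", "sexist", "not sexist", "sexist"])
y_pred_enc = [0, 1, 1, 1]  # made-up integer predictions, for illustration only

# Show class names instead of 0/1 in the report rows
report = classification_report(y_true_enc, y_pred_enc,
                               target_names=list(le_demo.classes_))
print(report)

# Map integer predictions back to the original string labels
decoded = le_demo.inverse_transform(y_pred_enc)
print(decoded)
```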


## Feature Method 1

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report


# Initialize an XGBoost classifier with tuned hyperparameters - (Method 1)
xgb_model_1 = XGBClassifier(
    n_estimators=400,
    learning_rate=0.07,
    max_depth=6,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric='logloss',
    tree_method='hist'
)

# Train the model using the TF-IDF n-gram features
xgb_model_1.fit(X_train_ngrams, y_train_enc)

# Predict the label of each test sample
y_pred_xgb_1 = xgb_model_1.predict(X_test_ngrams)

# Classification report
print("\n===== XGBOOST (METHOD 1 - WORD TF-IDF) =====")
print("Accuracy:", accuracy_score(y_test_enc, y_pred_xgb_1))
print("\nClassification Report:")
G = classification_report(y_test_enc, y_pred_xgb_1)  # keep the report text for later use
print(G)



===== XGBOOST (METHOD 1 - WORD TF-IDF) =====
Accuracy: 0.8130755064456722

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.93      0.88       789
           1       0.72      0.52      0.60       297

    accuracy                           0.81      1086
   macro avg       0.78      0.72      0.74      1086
weighted avg       0.80      0.81      0.80      1086



## Feature Method 2

In [None]:
# Initialize an XGBoost classifier with tuned hyperparameters - (Method 2)
xgb_model_2 = XGBClassifier(
    n_estimators=850,
    learning_rate=0.02,
    max_depth=6,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric='logloss',
    tree_method='hist'
)

# Train the model using the character-level TF-IDF features
xgb_model_2.fit(X_train_char, y_train_enc)

# Predict the label of each test sample
y_pred_xgb_2 = xgb_model_2.predict(X_test_char)

# Classification report
print("\n===== XGBOOST (METHOD 2 - CHAR TF-IDF) =====")
print("Accuracy:", accuracy_score(y_test_enc, y_pred_xgb_2))
print("\nClassification Report:")
H = classification_report(y_test_enc, y_pred_xgb_2)  # keep the report text for later use
print(H)



===== XGBOOST (METHOD 2 - CHAR TF-IDF) =====
Accuracy: 0.8195211786372008

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.94      0.88       789
           1       0.76      0.50      0.60       297

    accuracy                           0.82      1086
   macro avg       0.79      0.72      0.74      1086
weighted avg       0.81      0.82      0.81      1086



## Combined Methods

In [None]:
from scipy.sparse import hstack
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Combine WORD n-grams + CHAR n-grams
X_train_combined = hstack([X_train_ngrams, X_train_char])
X_test_combined = hstack([X_test_ngrams, X_test_char])

print("Combined feature matrix created!")
print("Train shape:", X_train_combined.shape)
print("Test shape:", X_test_combined.shape)

# Initialize an XGBoost classifier with tuned hyperparameters - (Combined method)
xgb_model_3 = XGBClassifier(
    n_estimators=850,
    learning_rate=0.02,
    max_depth=7,
    subsample=0.85,
    colsample_bytree=0.85,
    eval_metric='logloss',
    tree_method='hist',
    n_jobs=-1
)

# Train XGBoost using the combined features
xgb_model_3.fit(X_train_combined, y_train_enc)

# Predict the label of each test sample
y_pred_xgb_3 = xgb_model_3.predict(X_test_combined)

# Classification report
print("\n===== XGBOOST (COMBINED MODEL) =====")
print("Accuracy:", accuracy_score(y_test_enc, y_pred_xgb_3))
print("\nClassification Report:")
I = classification_report(y_test_enc, y_pred_xgb_3)  # keep the report text for later use
print(I)


Combined feature matrix created!
Train shape: (4193, 60156)
Test shape: (1086, 60156)

===== XGBOOST (COMBINED MODEL) =====
Accuracy: 0.8222836095764272

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.94      0.89       789
           1       0.77      0.51      0.61       297

    accuracy                           0.82      1086
   macro avg       0.80      0.72      0.75      1086
weighted avg       0.82      0.82      0.81      1086

