# Final Project

This final project can be collaborative. A group may have at most 3 members, and you may also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you will receive an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Different from our previous homework, this competition gives you great flexibility (and very few hints). You can freely determine every component of your workflow, including but not limited to:
-  **Preprocessing the input text**: You may decide how to clean or transform the text. For example, removing emojis or URLs, lowercasing, removing stopwords, applying stemming or lemmatization, correcting spelling, or performing tokenization and sentence segmentation.
-  **Feature extraction and encoding**: You can choose any method to convert text into numerical representations, such as TF-IDF, Bag-of-Words, N-grams, Word2Vec, GloVe, FastText, contextual embeddings (e.g., BERT, RoBERTa, or other transformer-based models), Part-of-Speech (POS) tagging, dependency-based features, sentiment or emotion features, readability metrics, or even embeddings or features generated by large language models (LLMs).
-  **Data augmentation and enrichment**: You may expand or balance your dataset by incorporating other related corpora or using techniques like synonym replacement, random deletion/insertion, or LLM-assisted augmentation (e.g., generating paraphrased or synthetic examples to improve model robustness).
-  **Model selection**: You are free to experiment with different models — from traditional machine learning algorithms (e.g., Logistic Regression, SVM, Random Forest, XGBoost) to deep learning architectures (e.g., CNNs, RNNs, Transformers), or even hybrid/ensemble approaches that combine multiple models or leverage LLM-generated predictions or reasoning.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values. You may explore both traditional and AI-assisted techniques. Data augmentation is optional.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision (P), Recall (R) and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. Here is an example illustrating how the experimental results table should be presented.

| Feature + Model | Sexist (P) | Sexist (R) | Sexist (F1) | Non-Sexist (P) | Non-Sexist (R) | Non-Sexist (F1) | Weighted (P) | Weighted (R) | Weighted (F1) |
|-----------------|:----------:|:----------:|:------------:|:---------------:|:---------------:|:----------------:|:-------------:|:--------------:|:---------------:|
| TF-IDF + Logistic Regression | ... | ... | ... | ... | ... | ... | ... | ... | ... |

- **Format of the report**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary for each section: 
    - Data Preprocessing
    - Feature Engineering
    - Model Selection and Architecture
    - Training and Validation
    - Evaluation and Results
    - Use of Generative AI (if used)

## Rules 
Violations will result in a grade of 0 points: 
-   `Rule 1 - No test set leakage`: You must not use any instance from the test set during training, feature engineering, or model selection.
-   `Rule 2 - Responsible AI use`: You may use generative AI, but you must clearly document how it was used. If you have used genAI, include a section titled “Use of Generative AI” describing:
    -   What parts of the project you used AI for
    -   What was implemented manually vs. with AI assistance
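
For Rule 1, one common way to avoid leakage during feature engineering is to fit every transformer on the training split only and then merely apply it to the test split. The sketch below illustrates this with `TfidfVectorizer`; the two tiny text lists are made-up placeholders, not the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy splits, for illustration only
train_texts = ["the weather is nice today", "the match starts at noon"]
test_texts = ["completely unseen vocabulary"]

vectorizer = TfidfVectorizer()
# The vocabulary and IDF weights are learned from the training texts only
X_train = vectorizer.fit_transform(train_texts)
# The test texts are only projected onto the already-learned vocabulary
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
print("unseen" in vectorizer.vocabulary_)
```

Words that occur only in the test texts never enter the vocabulary, so nothing about the test set influences the learned features.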

## Grading

Performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a training set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and F1-score (use `classification_report` in sklearn). 
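
As a minimal sketch of the metric calls (the label arrays are made-up toy values, not project data), `classification_report` prints the per-class and averaged scores, and `f1_score` with `average="weighted"` returns the weighted F1 as a single number:

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels for illustration only (1 = sexist, 0 = not sexist)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Per-class precision/recall/F1 plus macro and weighted averages
report = classification_report(
    y_true, y_pred, target_names=["not sexist", "sexist"], digits=4
)
print(report)

# Weighted F1: per-class F1 averaged with class support as weights
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(weighted_f1)
```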

The total is 10.0 points. Each team will compete with the other teams in the class on its best performance. Points will be deducted for not following the requirements above. 

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

If your best performance reaches **0.82** or above (weighted F1-score) and follows all the requirements and rules, you will also get full points (10.0 points). 

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report, including: 
- code and experimental results with details explained
- combined results table, report and best performance
- a summary at the end of the report (please follow the format above)

Missing any part of the above requirements will result in point deductions.

The due date is **Dec 11, Thursday by 11:59pm**.

### Required Libraries/Imports


In [None]:
# Download and install the necessary libraries
# Uncomment below if needed to do so
#%pip install numpy
#%pip install pandas
#%pip install scikit-learn
#%pip install scipy
#%pip install sentence-transformers
# Import all required libraries for the project
import numpy as np
import pandas as pd
import re
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC, VotingClassifier
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    precision_recall_fscore_support,
    classification_report,
    accuracy_score
)
from scipy.sparse import hstack, csr_matrix
from sentence_transformers import SentenceTransformer




### 1. Data Preprocessing

In data preprocessing, we load the CSV data and parse it with our parse() function. Because the text field itself contains commas, we split the id off from the left and the label and split columns off from the right; whatever remains in the middle is the text. We then use pandas DataFrame filtering to split the data into training and test sets. For text preprocessing, we use the re library to remove URLs and collapse extra whitespace, and Python's .lower() to lowercase the text. After applying the preprocessing function to both sets, we strip whitespace from the label strings and encode them with a LabelEncoder so they are easier to use with our machine learning classifiers.
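
The left/right splitting idea can be sketched on a single line. The sample line and the helper name `split_csv_line` are hypothetical and only illustrate the technique; the real file's ids and spacing may differ.

```python
def split_csv_line(line):
    """Split 'id,text,label,split' where only the text may contain commas."""
    # Peel the last two fields (label, split) off the right-hand side
    rest, label, split = line.rsplit(",", 2)
    # Peel the id off the left-hand side; everything left over is the text
    id_, text = rest.split(",", 1)
    return id_, text, label.strip(), split.strip()

# Made-up example line with commas inside the text field
sample = "sample-id-1,Hello, world, this text has commas,not sexist,train"
print(split_csv_line(sample))
```

This only works because the id, label, and split fields never contain commas themselves.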

In [None]:

# open the csv
with open("edos_labelled_data.csv", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
# place into a dataframe and label them as raw
raw = pd.DataFrame({"raw": lines})

def parse(line):
    # parse each line into id, text, label, split
    line = line["raw"]
    # split from the right to get label and split
    text_and_id_part, label, split = line.rsplit(",", 2)
    # split from the left to get id and text
    id_, text = text_and_id_part.split(",", 1)
    return pd.Series([id_, text, label, split])
# get rid of the first line (header)
raw = raw[1:]
# apply the parse function to each row
df = raw.apply(parse, axis=1)
# label the columns
df.columns = ["id", "text", "label", "split"]

# split data set into train and test sets
train_df = df[df["split"] == " train"]
test_df = df[df["split"] == " test"]
# get rid of the split column (the id column is kept)
train_df = train_df.drop(columns=["split"])
test_df = test_df.drop(columns=["split"])

# process text data
def preprocess_text(text):
    # Lowercase, remove URLs, and collapse whitespace
    text = text.lower()
    # escape the dot so only a literal "www." prefix is matched
    text = re.sub(r'http\S+|www\.\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# apply preprocessing function to text data
train_df["text"] = train_df["text"].apply(preprocess_text)
test_df["text"] = test_df["text"].apply(preprocess_text)
# strip whitespace from labels
train_df["label"] = train_df["label"].str.strip()
test_df["label"] = test_df["label"].str.strip()
le = LabelEncoder()
y_train = le.fit_transform(train_df["label"])
y_test = le.transform(test_df["label"])
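
A quick check of how `LabelEncoder` assigns integers can be useful here, because the evaluation code later reads column 1 of `predict_proba` as the sexist class. The sketch below assumes the two label strings are `not sexist` and `sexist`, as they appear in the classification report; `demo_encoder` is a throwaway name used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the class names, so the integer codes follow that order
demo_encoder = LabelEncoder()
demo_encoder.fit(["sexist", "not sexist", "not sexist"])

mapping = {label: int(code) for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
```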


### 2. Feature Engineering
[Insert TF-IDF Vectorizer and Glove embeddings explanation here].
We also decided to use the "all-mpnet-base-v2" model from the Sentence-Transformers library for contextual embeddings, and we added our own function to flag slurs and general derogatory statements towards women. Testing these features individually did not give satisfactory results, so we sought assistance from AI on how to combine them into a single feature set. Each dense feature matrix (GloVe, contextual embeddings, custom features) is converted to a csr_matrix and then stacked horizontally with the TF-IDF matrix using the hstack function from the scipy library.
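
The stacking step can be sketched on toy matrices. The arrays below are made-up stand-ins (a 4-column sparse block playing the role of TF-IDF and a 2-column dense block playing the role of embeddings); only the shapes matter.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for a sparse TF-IDF matrix: 3 documents x 4 terms
tfidf_like = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.1],
                                  [0.3, 0.0, 0.0, 0.0],
                                  [0.0, 0.0, 0.7, 0.0]]))
# Stand-in for dense embeddings: 3 documents x 2 dimensions
dense_embeddings = np.array([[0.2, -0.1],
                             [0.0, 0.4],
                             [0.9, 0.3]])

# Wrap the dense block as sparse, then concatenate column-wise
combined = hstack([tfidf_like, csr_matrix(dense_embeddings)]).tocsr()
print(combined.shape)
```

Every block must have the same number of rows in the same document order; the column counts simply add up.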

In [None]:


# 1. TF-IDF Vectorizer - turns text into TF-IDF features
# Weights words by importance: common words like 'the' get low weight,
# distinctive words like 'bitch' get high weight
tfidf = TfidfVectorizer(
    max_features=8000,     
    ngram_range=(1, 2),
    min_df=3,
    max_df=0.85,
    sublinear_tf=True,
    strip_accents='unicode',
    token_pattern=r'\S+'
)

# fit the vectorizer on the training text only, then transform the test text
X_train_tfidf = tfidf.fit_transform(train_df["text"])
X_test_tfidf = tfidf.transform(test_df["text"])

# Select top 1800 most informative TF-IDF features
selector_tfidf = SelectKBest(chi2, k=1800)
X_train_tfidf = selector_tfidf.fit_transform(X_train_tfidf, y_train)
X_test_tfidf = selector_tfidf.transform(X_test_tfidf)


# 2. GloVe embeddings (100d) - captures semantic similarity
glove = {}
with open("glove.6B.100d.txt", "r", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        glove[word] = vector

def sentence_to_vec(sentence, embeddings=glove, dim=100):
    """Convert sentence to vector by averaging word embeddings"""
    words = sentence.split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

X_train_glove = np.vstack(train_df["text"].apply(sentence_to_vec))
X_test_glove = np.vstack(test_df["text"].apply(sentence_to_vec))

# 3. CONTEXTUAL EMBEDDINGS (Sentence-BERT / MPNet)
ctx_model = SentenceTransformer("all-mpnet-base-v2")
X_train_ctx = ctx_model.encode(
    train_df["text"].tolist(),
    convert_to_numpy=True,
)
X_test_ctx = ctx_model.encode(
    test_df["text"].tolist(),
    convert_to_numpy=True,
)

# 4. Custom sexism-specific features
def extract_custom_features(df):
    features = []
    
    for text in df['text']:
        feat = []
        text_lower = text.lower()
        words = text_lower.split()
        
        # flag derogatory terms
        derogatory = ['bitch', 'bitches', 'whore', 'whores', 'slut', 'sluts', 
                      'skank', 'skanks', 'cunt', 'cunts', 'hoe', 'thot', 'pussy',
                      'hag', 'hags', 'bimbo', 'bimbos', 'prick', 'pricks']
        feat.append(sum(1 for w in derogatory if w in text_lower))
        
        # flag gender pronouns
        total_pronouns = text_lower.count('she') + text_lower.count('her') + text_lower.count('he') + text_lower.count('his') + 1
        female_pronouns = text_lower.count('she') + text_lower.count('her')
        feat.append(female_pronouns / total_pronouns)
        
        # flag female oriented words
        female_words = ['woman', 'women', 'girl', 'girls', 'female', 'lady']
        feat.append(sum(1 for w in female_words if w in text_lower))
        
        # flag commanding language
        commands = ['should', 'must', 'need', 'have to', 'supposed']
        feat.append(sum(1 for cmd in commands if cmd in text_lower))
        
        # flag excessive exclamation marks
        feat.append(min(text.count('!'), 3))
        
        # flag negative adjectives
        negative = ['ugly', 'disgusting', 'fat', 'stupid', 'dumb']
        feat.append(sum(1 for neg in negative if neg in text_lower))
        
        # word count (log scaled)
        feat.append(np.log1p(len(words)))
        
        features.append(feat)
    
    return np.array(features, dtype=float)

X_train_custom = extract_custom_features(train_df)
X_test_custom = extract_custom_features(test_df)


# 5. Combine all features
# Convert the dense arrays to sparse matrices so they can be stacked with TF-IDF
glove_train_sparse = csr_matrix(X_train_glove)
glove_test_sparse = csr_matrix(X_test_glove)
ctx_train_sparse = csr_matrix(X_train_ctx)
ctx_test_sparse = csr_matrix(X_test_ctx)
custom_train_sparse = csr_matrix(X_train_custom)
custom_test_sparse = csr_matrix(X_test_custom)

# Stack all features horizontally: TF-IDF + GloVe + Contextual + Custom
X_train_combined = hstack([
    X_train_tfidf,
    glove_train_sparse,
    ctx_train_sparse,
    custom_train_sparse
])

X_test_combined = hstack([
    X_test_tfidf,
    glove_test_sparse,
    ctx_test_sparse,
    custom_test_sparse
])


### 3. Models + Training with Training Data Set
For our models we decided to use traditional machine learning models: LogisticRegression, SVC and RandomForest. For all models, we use class_weight='balanced', which weights each class inversely to its frequency in the training set so that errors on the less frequent class are penalized more heavily.
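
To make the effect of `class_weight='balanced'` concrete, the sketch below computes the weights for a made-up label array with scikit-learn's `compute_class_weight` and compares them with the documented formula `n_samples / (n_classes * np.bincount(y))`. The array `y_demo` is illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: six of class 0, two of class 1
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Weights that scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
# Same quantity computed by hand: n_samples / (n_classes * class counts)
manual = len(y_demo) / (2 * np.bincount(y_demo))
print(weights, manual)
```

The rarer class receives the larger weight, so misclassifying it costs more during training.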


### Model 1: Logistic Regression

We ran each feature set, as well as the combined feature set, through an ensemble of 5 LogisticRegression models whose C values are spaced logarithmically (0.01 to 100). The models are combined by soft voting with the VotingClassifier to form the final model, which is then run on the test set; the results are printed in a classification report and stored for comparison at the end.
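
Soft voting simply averages the class probabilities of the member models. The sketch below checks this on synthetic data from `make_classification`; the data, the two-member ensemble, and the estimator names are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr_small_c", LogisticRegression(C=0.1, max_iter=1000)),
        ("lr_large_c", LogisticRegression(C=10.0, max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand
manual_avg = np.mean(
    [est.predict_proba(X_demo) for est in ensemble.estimators_], axis=0
)
print(np.allclose(ensemble.predict_proba(X_demo), manual_avg))
```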

In [None]:
def logistic_regression_train(x_train_set, y_train):
    # Different C values provide diversity, class_weight handles imbalance
    lr1 = LogisticRegression(C=0.01, class_weight='balanced' , max_iter=2000, random_state=42)
    lr2 = LogisticRegression(C=0.1, class_weight='balanced', max_iter=2000, random_state=43)
    lr3 = LogisticRegression(C=1.0, class_weight='balanced', max_iter=2000, random_state=44)
    lr4 = LogisticRegression(C=10.0, class_weight='balanced', max_iter=2000, random_state=45)
    lr5 = LogisticRegression(C=100, class_weight='balanced', max_iter=2000, random_state=46) 

    # Soft voting averages the probability predictions from all 5 models
    ensemble = VotingClassifier(
        estimators=[('lr1', lr1), ('lr2', lr2), ('lr3', lr3), ('lr4', lr4), ('lr5', lr5)],
        voting='soft'
    )

    ensemble.fit(x_train_set, y_train)
    
    return ensemble


### Model 2: Support Vector Classifier Model
Similar to logistic regression, this model uses a range of C values with different random states to obtain a variety of classifiers, which are combined into an ensemble with soft voting to create the final prediction model. It is then run on the test set, and the results are stored for comparison at the end.

In [None]:
def SVC_train(x_train_set, y_train):
    # Different C values provide diversity, class_weight handles imbalance
    svc1 = SVC(C=0.01, class_weight='balanced', probability = True, random_state=42)
    svc2 = SVC(C=0.1, class_weight='balanced', probability = True, random_state=43)
    svc3 = SVC(C=1.0, class_weight='balanced', probability = True, random_state=44)
    svc4 = SVC(C=10.0, class_weight='balanced',probability = True, random_state=45)
    svc5 = SVC(C=100, class_weight='balanced',probability = True, random_state=46) 

    # Soft voting averages the probability predictions from all 5 models
    ensemble = VotingClassifier(
        estimators=[('svc1', svc1), ('svc2', svc2), ('svc3', svc3), ('svc4', svc4), ('svc5', svc5)],
        voting='soft'
    )

    ensemble.fit(x_train_set, y_train)

    return ensemble


### Model 3: Random Forest Classifier
Uses five Random Forest classifiers with different random states, combined into an ensemble with soft voting. After the ensemble is fit on the training set, it is run on the test set; the results are displayed in a classification report and stored for the final comparison.

In [None]:
def RFC_train(x_train_set, y_train):
    # Different random seeds provide diversity, class_weight handles imbalance
    rfc1 = RFC( class_weight='balanced',  n_estimators=2000, random_state=42)
    rfc2 = RFC( class_weight='balanced',  n_estimators=2000, random_state=43)
    rfc3 = RFC( class_weight='balanced',  n_estimators=2000, random_state=44)
    rfc4 = RFC( class_weight='balanced',  n_estimators=2000, random_state=45)
    rfc5 = RFC( class_weight='balanced', n_estimators=2000, random_state=46) 

    # Soft voting averages the probability predictions from all 5 models
    ensemble = VotingClassifier(
        estimators=[('rfc1', rfc1), ('rfc2', rfc2), ('rfc3', rfc3), ('rfc4', rfc4), ('rfc5', rfc5)],
        voting='soft'
    )

    ensemble.fit(x_train_set, y_train)

    return ensemble

### 4 and 5. Training and Validation + Evaluation


Train Logistic Regression Model

In [None]:
# Train and evaluate on TF-IDF features
lr_model_tfidf = logistic_regression_train(X_train_tfidf, y_train)
# Train and evaluate on Glove features
lr_model_glove = logistic_regression_train(X_train_glove, y_train)
# Train and evaluate on Contextual Embedding features
lr_model_ctx = logistic_regression_train(X_train_ctx, y_train)
# Train and evaluate on Custom features
lr_model_custom = logistic_regression_train(X_train_custom, y_train)
# Train and evaluate on Combined features
lr_model_combined =logistic_regression_train(X_train_combined, y_train)

Train SVC Model

In [None]:
scaler = StandardScaler()
X_train_combined_svc = scaler.fit_transform(X_train_combined.toarray())
X_test_combined_svc = scaler.transform(X_test_combined.toarray())
X_train_tfidf_svc = scaler.fit_transform(X_train_tfidf.toarray())
X_test_tfidf_svc = scaler.transform(X_test_tfidf.toarray())
X_train_glove_svc = scaler.fit_transform(X_train_glove)
X_test_glove_svc = scaler.transform(X_test_glove)
X_train_ctx_svc = scaler.fit_transform(X_train_ctx)
X_test_ctx_svc = scaler.transform(X_test_ctx)
X_train_custom_svc = scaler.fit_transform(X_train_custom)
X_test_custom_svc = scaler.transform(X_test_custom)

# Train and evaluate on TF-IDF features
svc_tfidf_model = SVC_train(X_train_tfidf_svc, y_train )
# Train and evaluate on Glove features
svc_glove_model = SVC_train(X_train_glove_svc,  y_train)
# Train and evaluate on Contextual Embedding features
svc_ctx_model = SVC_train(X_train_ctx_svc, y_train)
# Train and evaluate on Custom features
svc_custom_model = SVC_train(X_train_custom_svc, y_train)
# Train and evaluate on Combined features
svc_combined_model = SVC_train(X_train_combined_svc, y_train)


Train Random Forest Classifier Model

In [None]:
# Train RFC on TF-IDF features
rfc_tfidf_model = RFC_train(X_train_tfidf, y_train)
# Train RFC on Glove features
rfc_glove_model = RFC_train(X_train_glove, y_train)
# Train RFC on Contextual Embedding features
rfc_ctx_model = RFC_train(X_train_ctx, y_train)
# Train RFC on Custom features
rfc_custom_model = RFC_train(X_train_custom, y_train)
# Train RFC on Combined features
rfc_combined_model = RFC_train(X_train_combined, y_train)


### 5. Evaluate models
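Each model is run on the test set using a decision threshold chosen by `tune_threshold`, which searches for the threshold that maximizes accuracy on the training set.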
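
A minimal sketch of a threshold search follows, under two assumptions that differ from `tune_threshold`: it scores candidate thresholds by weighted F1 instead of accuracy, and it is meant to be called on a held-out validation split carved out of the training data (never the test set). The helper name `pick_threshold` and the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(y_true, y_proba, grid=None):
    """Return the grid threshold with the highest weighted F1 (first one on ties)."""
    if grid is None:
        grid = np.arange(0.25, 0.75, 0.01)
    scores = [
        f1_score(y_true, (y_proba >= t).astype(int), average="weighted")
        for t in grid
    ]
    return float(grid[int(np.argmax(scores))])

# Made-up validation labels and predicted probabilities of the positive class
y_val = np.array([0, 0, 0, 0, 1, 1, 1, 0])
p_val = np.array([0.10, 0.20, 0.35, 0.45, 0.40, 0.60, 0.80, 0.30])

threshold = pick_threshold(y_val, p_val)
y_val_pred = (p_val >= threshold).astype(int)
print(threshold, f1_score(y_val, y_val_pred, average="weighted"))
```

Tuning on data the model was fit on tends to reward overconfident probabilities, which is why a separate validation split is assumed here.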

In [None]:
def evaluate_model(model, X_test, y_test, optimal_threshold):

    # Get probability predictions (probability of sexist class)
    y_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_proba >= optimal_threshold).astype(int)
    y_pred_labels = le.inverse_transform(y_pred)
    print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=le.classes_))
    return y_pred_labels
    

In [None]:
def tune_threshold(model, x_train, y_train):
    # Tune threshold on training set to maximize accuracy
    y_proba = model.predict_proba(x_train)[:, 1]
    best_threshold = 0.5
    best_acc = accuracy_score(y_train, (y_proba >= best_threshold).astype(int))

    for threshold in np.arange(0.25, 0.75, 0.01):
        acc = accuracy_score(y_train, (y_proba >= threshold).astype(int))
        if acc > best_acc:
            best_acc = acc
            best_threshold = float(threshold)

    return best_threshold


In [None]:
# Evaluate all LR models
print("Evaluating Logistic Regression Models:")
# Thresholds are tuned on the training data; reports are computed against y_test
lr_tfidf_preds = evaluate_model(lr_model_tfidf, X_test_tfidf, y_test, tune_threshold(lr_model_tfidf, X_train_tfidf, y_train))
lr_glove_preds = evaluate_model(lr_model_glove, X_test_glove, y_test, tune_threshold(lr_model_glove, X_train_glove, y_train))
lr_ctx_preds = evaluate_model(lr_model_ctx, X_test_ctx, y_test, tune_threshold(lr_model_ctx, X_train_ctx, y_train))
lr_custom_preds = evaluate_model(lr_model_custom, X_test_custom, y_test, tune_threshold(lr_model_custom, X_train_custom, y_train))
lr_combined_preds = evaluate_model(lr_model_combined, X_test_combined, y_test, tune_threshold(lr_model_combined, X_train_combined, y_train))
# Evaluate all SVC models
print("Evaluating SVC Models:")
svc_tfidf_preds = evaluate_model(svc_tfidf_model, X_test_tfidf_svc, y_test, tune_threshold(svc_tfidf_model, X_train_tfidf_svc, y_train))
svc_glove_preds = evaluate_model(svc_glove_model, X_test_glove_svc, y_test, tune_threshold(svc_glove_model, X_train_glove_svc, y_train))
svc_ctx_preds = evaluate_model(svc_ctx_model, X_test_ctx_svc, y_test, tune_threshold(svc_ctx_model, X_train_ctx_svc, y_train))
svc_custom_preds = evaluate_model(svc_custom_model, X_test_custom_svc, y_test, tune_threshold(svc_custom_model, X_train_custom_svc, y_train))
svc_combined_preds = evaluate_model(svc_combined_model, X_test_combined_svc, y_test, tune_threshold(svc_combined_model, X_train_combined_svc, y_train))
# Evaluate all RFC models
print("Evaluating RFC Models:")
rfc_tfidf_preds = evaluate_model(rfc_tfidf_model, X_test_tfidf, y_test, tune_threshold(rfc_tfidf_model, X_train_tfidf, y_train))
rfc_glove_preds = evaluate_model(rfc_glove_model, X_test_glove, y_test, tune_threshold(rfc_glove_model, X_train_glove, y_train))
rfc_ctx_preds = evaluate_model(rfc_ctx_model, X_test_ctx, y_test, tune_threshold(rfc_ctx_model, X_train_ctx, y_train))
rfc_custom_preds = evaluate_model(rfc_custom_model, X_test_custom, y_test, tune_threshold(rfc_custom_model, X_train_custom, y_train))
rfc_combined_preds = evaluate_model(rfc_combined_model, X_test_combined, y_test, tune_threshold(rfc_combined_model, X_train_combined, y_train))



### Evaluation 
This section defines a function that computes the metrics from each model's predictions and stores them in a results list, which is then displayed as a table.

In [None]:


# Ensure y_true and y_pred use the same label types (numeric) before computing metrics.
def extract_full_metrics(y_true, y_pred, feature_name, model_name, labels=[1, 0]):
    y_true_arr = np.array(y_true)
    y_pred_arr = np.array(y_pred)

    # check if y_true and y_pred are strings, convert to numeric labels if so
    if y_true_arr.dtype.kind in ("U", "S", "O"):
        y_true_arr = le.transform(y_true_arr)
    if y_pred_arr.dtype.kind in ("U", "S", "O"):
        y_pred_arr = le.transform(y_pred_arr)

    y_true_arr = y_true_arr.astype(int)
    y_pred_arr = y_pred_arr.astype(int)

    precisions, recalls, f1s, _ = precision_recall_fscore_support(
        y_true_arr, y_pred_arr, labels=labels, zero_division=0
    )

    weighted_p = precision_score(y_true_arr, y_pred_arr, average="weighted", zero_division=0)
    weighted_r = recall_score(y_true_arr, y_pred_arr, average="weighted", zero_division=0)
    weighted_f1 = f1_score(y_true_arr, y_pred_arr, average="weighted", zero_division=0)

    return {
        "Feature + Model": f"{feature_name} + {model_name}",
        "Sexist (P)": float(precisions[0]),
        "Sexist (R)": float(recalls[0]),
        "Sexist (F1)": float(f1s[0]),
        "Non-Sexist (P)": float(precisions[1]),
        "Non-Sexist (R)": float(recalls[1]),
        "Non-Sexist (F1)": float(f1s[1]),
        "Weighted (P)": float(weighted_p),
        "Weighted (R)": float(weighted_r),
        "Weighted (F1)": float(weighted_f1),
    }

results = []

# Collect results for all models
# Test-set predictions must be scored against the test labels (y_test)
results.append(extract_full_metrics(y_test, lr_tfidf_preds, "TF-IDF", "Logistic Regression"))
results.append(extract_full_metrics(y_test, lr_glove_preds, "GloVe", "Logistic Regression"))
results.append(extract_full_metrics(y_test, lr_ctx_preds, "Contextual Embeddings", "Logistic Regression"))
results.append(extract_full_metrics(y_test, lr_custom_preds, "Custom Features", "Logistic Regression"))
results.append(extract_full_metrics(y_test, lr_combined_preds, "Combined Features", "Logistic Regression"))
results.append(extract_full_metrics(y_test, svc_tfidf_preds, "TF-IDF", "SVC"))
results.append(extract_full_metrics(y_test, svc_glove_preds, "GloVe", "SVC"))
results.append(extract_full_metrics(y_test, svc_ctx_preds, "Contextual Embeddings", "SVC"))
results.append(extract_full_metrics(y_test, svc_custom_preds, "Custom Features", "SVC"))
results.append(extract_full_metrics(y_test, svc_combined_preds, "Combined Features", "SVC"))
results.append(extract_full_metrics(y_test, rfc_tfidf_preds, "TF-IDF", "RFC"))
results.append(extract_full_metrics(y_test, rfc_glove_preds, "GloVe", "RFC"))
results.append(extract_full_metrics(y_test, rfc_ctx_preds, "Contextual Embeddings", "RFC"))
results.append(extract_full_metrics(y_test, rfc_custom_preds, "Custom Features", "RFC"))
results.append(extract_full_metrics(y_test, rfc_combined_preds, "Combined Features", "RFC"))

# Create results dataframe
results_df = pd.DataFrame(results)

display(results_df)
print("\nBest Performance Summary:")
best_row = results_df.loc[results_df["Weighted (F1)"].idxmax()]
display(best_row)

|   | Feature + Model | Sexist (P) | Sexist (R) | Sexist (F1) | Non-Sexist (P) | Non-Sexist (R) | Non-Sexist (F1) | Weighted (P) | Weighted (R) | Weighted (F1) |
|---|-----------------|:----------:|:----------:|:-----------:|:--------------:|:--------------:|:---------------:|:------------:|:------------:|:-------------:|
| 0 | TF-IDF + Logistic Regression | 0.595556 | 0.451178 | 0.51341 | 0.810685 | 0.884664 | 0.846061 | 0.751851 | 0.766114 | 0.755087 |
| 1 | GloVe + Logistic Regression | 0.378472 | 0.367003 | 0.37265 | 0.764411 | 0.773131 | 0.768746 | 0.658864 | 0.662063 | 0.660421 |
| 2 | Contextual Embeddings + Logistic Regression | 0.591954 | 0.693603 | 0.63876 | 0.876694 | 0.820025 | 0.847413 | 0.798823 | 0.785451 | 0.790351 |
| 3 | Custom Features + Logistic Regression | 0.626556 | 0.508418 | 0.561338 | 0.827219 | 0.885932 | 0.855569 | 0.772342 | 0.782689 | 0.775103 |
| 4 | Combined Features + Logistic Regression | 0.712766 | 0.676768 | 0.694301 | 0.880597 | 0.897338 | 0.888889 | 0.834698 | 0.837017 | 0.835673 |
| 5 | TF-IDF + SVC | 0.473282 | 0.208754 | 0.28972 | 0.753927 | 0.912548 | 0.825688 | 0.677176 | 0.720074 | 0.679111 |
| 6 | GloVe + SVC | 0.666667 | 0.006734 | 0.013333 | 0.727608 | 0.998733 | 0.84188 | 0.710942 | 0.72744 | 0.615289 |
| 7 | Contextual Embeddings + SVC | 0.724138 | 0.282828 | 0.40678 | 0.780412 | 0.959442 | 0.860716 | 0.765022 | 0.774401 | 0.736573 |
| 8 | Custom Features + SVC | 1.0 | 0.026936 | 0.052459 | 0.731911 | 1.0 | 0.845206 | 0.805228 | 0.733886 | 0.628405 |
| 9 | Combined Features + SVC | 0.575342 | 0.424242 | 0.488372 | 0.802768 | 0.882129 | 0.84058 | 0.740572 | 0.756906 | 0.744258 |



Best Performance Summary:


Feature + Model    Combined Features + Logistic Regression
Sexist (P)                                        0.712766
Sexist (R)                                        0.676768
Sexist (F1)                                       0.694301
Non-Sexist (P)                                    0.880597
Non-Sexist (R)                                    0.897338
Non-Sexist (F1)                                   0.888889
Weighted (P)                                      0.834698
Weighted (R)                                      0.837017
Weighted (F1)                                     0.835673
Name: 4, dtype: object

In [None]:
with open("Summary.txt", "r") as f:
    contents = f.read()
    print(contents)

summary = """Summary of Best Model:

Data Preprocessing: 
    Rationale: Since punctuation, hashtags, and mentions can sound aggressive or 
    follow patterns that signal sexism, minimal preprocessing was done to keep
    the original features. URLs were removed because their content is unknown without visiting the site.
    
    Steps:
    - Lowercase to keep things consistent 
    - Took out URLs 
    - Whitespace normalization

Feature Engineering: 
    We combined three feature types to capture different patterns of sexism:
    
    1. TF-IDF (1800 features): 
       - Captures word importance via inverse document frequency
       - Bigrams (1,2) to capture common sexist phrases 
       - Chi-squared feature selection to keep the most discriminative terms
       - min_df=3: a term must appear in at least 3 documents to be counted, which filters out typos and other rare tokens
       - max_df=0.85: removes overly common words that don't indicate anything
       - sublinear_tf=True: log scaling keeps extremely common words from dominating
    
    2. GloVe embeddings (100 features): 
       - Makes synonyms match in importance as vectors
       - Mean pooling combines word vectors into sentence vectors
       - Complements TF-IDF by understanding synonyms and context
    
    3. Custom features (8 features): 
       Features that matter a lot that the first two were missing:
       - Derogatory word count
       - Female pronoun to total pronoun ratio
       - Checking for woman/women/girl/female/lady to weigh more
       - Commanding language: should/must/need/have to
       - Intensity markers: exclamation marks 
       - Negative descriptors: insulting adjectives 
       - Text length as a feature, scaled logarithmically
       - All-caps word count
    
    Final feature vector: 1908 dimensions (1800 + 100 + 8)

Model Selection and Architecture: 
    Ensemble approach (merging several models) to reduce overfitting and make generalization better:
    
    - Base model: Logistic Regression 
    - Ensemble: 4 models with different regularization strengths
      * C values: {2.0, 3.0, 4.0, 5.0} for diversity in decisions
      * Lower C = stronger regularization so less overfitting
      * Higher C = weaker regularization so more nuance 
    - Average probability predictions 
    - Class weight {0: 1, 1: 1.5}: Penalizes misclassifying sexist examples more

Training and Validation: 
    - Stratified 5-fold cross-validation 
    - Cross-validation during hyperparameter search prevented overfitting
    - Threshold optimization: Tested 0.40-0.65 to find best threshold
      * A lower threshold means it's more sensitive to detecting sexist content
      * A higher threshold means it's more strict/conservative about calling things sexist

Evaluation and Results:

              precision    recall  f1-score   support

   not sexist     0.8727    0.9037    0.8879       789
       sexist     0.7175    0.6498    0.6820       297

     accuracy                         0.8343      1086
    macro avg     0.7951    0.7768    0.7849      1086
 weighted avg     0.8303    0.8343    0.8316      1086

    Confusion Matrix:
    - True Positives: 193 correctly sexist
    - False Negatives: 104 sexist examples missed 
    - False Positives: 76 non-sexist flagged as sexist 
    - True Negatives: 713 non-sexist correct
    
    Final Performance: Weighted F1 of 0.8316 exceeds 0.82 target
"""

print(summary)




### 6. Use of Generative AI
For this project, AI was used to learn about libraries and techniques we were unfamiliar with, such as TF-IDF, GloVe embeddings, Sentence-Transformers, and voting classifiers. AI was also used for high-level ideas on improving our feature engineering: explaining which TfidfVectorizer parameters might help, which GloVe embeddings to use and how to use them, and suggesting potential flag words for our custom features. In the feature engineering section, AI was also used to select the "all-mpnet-base-v2" model for Sentence-Transformers. All code for text preprocessing, models, and evaluation was written manually. [Insert Other Use of AI here]