

**Group HOMEWORK**. This final project can be collaborative. A group may have at most 2 members. You can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you will get an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can determine:
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. 
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions: 
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs? 
    - Which model do you use and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two Rules**. Violations will result in a grade of 0 points: 
    - Not allowed to use test set in the training: You CANNOT use any of the instances from test set in the training process. 
    - Not allowed to use code from generative AI (e.g., ChatGPT). 

## Evaluation

The performance should be evaluated only on the test set (a total of 1086 instances). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and f1-score (use `classification_report` in sklearn). 
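
A minimal sketch of how these metrics could be read out of `classification_report`, assuming scikit-learn is installed. The labels and predictions are made up for illustration, and `y_true` and `y_pred` are hypothetical names rather than anything from the dataset.

```python
from sklearn.metrics import classification_report

# Toy labels: 1 = sexist, 0 = not sexist (made-up values, not from the dataset)
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

# output_dict=True returns the numbers of the printed report as a nested dict
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

weighted = report["weighted avg"]
print(f"accuracy:           {report['accuracy']:.3f}")
print(f"weighted precision: {weighted['precision']:.3f}")
print(f"weighted recall:    {weighted['recall']:.3f}")
print(f"weighted F1:        {weighted['f1-score']:.3f}")
```

The weighted average weights each class's score by its number of true instances, so the weighted recall always equals the accuracy.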

The total points are 10.0. Each team will compete with other teams in the class on their best performance. Points will be deducted if not following the requirements above.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report. 

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) a summary (you can add markdown cells). 

The due date is **Friday, December 8, by 11:59pm**.

In [211]:
# Import everything we need
import re
import string
import numpy as np
import pandas as pd

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/elliothagyard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/elliothagyard/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [212]:
# Make the train and test dataframes, then split each into X (text) and Y (label)
comments_df = pd.read_csv('edos_labelled_data.csv') 
le = preprocessing.LabelEncoder()
comments_df["label"] = le.fit_transform(comments_df["label"])
comments_train_df = comments_df[comments_df["split"] == "train"]
comments_test_df = comments_df[comments_df["split"] == "test"]
X_train = comments_train_df["text"]
Y_train = comments_train_df["label"]
X_test = comments_test_df["text"]
Y_test = comments_test_df["label"]
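
Later cells treat label `1` as "sexist", which depends on how `LabelEncoder` assigns codes. A small sketch of that behavior, assuming the label strings are `"sexist"` and `"not sexist"` (an assumption, since the CSV contents are not shown here):

```python
from sklearn import preprocessing

# Assumed label strings; the actual values in the CSV may differ
raw_labels = ["sexist", "not sexist", "not sexist", "sexist"]

le = preprocessing.LabelEncoder()
encoded = le.fit_transform(raw_labels)

# classes_ is sorted, so code 0 is "not sexist" and code 1 is "sexist"
label_map = {name: int(code) for code, name in enumerate(le.classes_)}
print(label_map)
print(encoded.tolist())
```

Printing `le.classes_` on the real data is a quick way to confirm which class each integer stands for.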

In [213]:
def clean(comments: list[str]) -> list[str]:
    # Drops non-ASCII characters, lowercases, removes hashtags and the
    # [user]/[url] placeholders, then strips punctuation.
    # Returns a list of "cleaned" strings
    comments_clean = [comment.encode("ascii", "ignore").decode() for comment in comments]
    comments_clean = [comment.lower() for comment in comments_clean]
    # Remove placeholders before punctuation, otherwise "[user]" would become "user"
    comments_clean = [re.sub(r'(#\w+|\[user\]|\[url\])', '', comment) for comment in comments_clean]
    translator = str.maketrans('', '', string.punctuation)
    comments_clean = [comment.translate(translator) for comment in comments_clean]

    return comments_clean
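
To see what the cleaning steps above do to one comment, here is a hedged single-string sketch. `clean_one` and the sample text are hypothetical and only mirror the steps of `clean`.

```python
import re
import string

def clean_one(comment: str) -> str:
    # Hypothetical single-string version of the cleaning steps above
    text = comment.encode("ascii", "ignore").decode()    # drop non-ASCII (emoji, accents)
    text = text.lower()
    text = re.sub(r'(#\w+|\[user\]|\[url\])', '', text)  # drop hashtags and placeholders
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "Hey [USER], check THIS out!! #wow \U0001F600 [URL]"
tokens = clean_one(sample).split()
print(tokens)
```

Lowercasing happens before the regex, so `[USER]` and `[URL]` in any casing are matched by the lowercase pattern.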

In [214]:
def get_wordnet_pos(treebank_tag):
    # Convert pos_tag output to a part of speech recognized by the lemmatizer
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def clean_and_lemmatize(x):
    clean_x = clean(x)
    # Lemmatize each cleaned comment, using POS tags so the lemmatizer picks the right lemma.
    lemmatizer = WordNetLemmatizer()
    out = []
    for sentence in clean_x:
        tokens = nltk.pos_tag(nltk.word_tokenize(sentence))
        tagged = list(map(
                lambda x : (x[0], get_wordnet_pos(x[1])),
                tokens
        ))
        # Remove all words that are not recognized parts of speech
        word_and_pos = list(filter(
            lambda x : x[1] != '', 
            tagged
        ))
        # Map the lemmatizer over each word in the sentence
        lemmatized_words = list(
            map(
                lambda x : lemmatizer.lemmatize(x[0], x[1]),
                word_and_pos
            )
        )
        # Join lemmatized words back into a single string
        lemmatized_sentence = " ".join(lemmatized_words)
        out.append(lemmatized_sentence)
    return out
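
The POS filter above drops every token whose tag does not map to an adjective, verb, noun or adverb. A rough sketch of that effect without nltk, using hand-written Penn Treebank tags (an assumption; nothing here calls `nltk.pos_tag`):

```python
# WordNet POS codes as single letters (the values behind wordnet.ADJ, VERB, NOUN, ADV)
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag: str) -> str:
    # Same prefix rule as get_wordnet_pos, without importing nltk
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], "")

# Hand-written tags for illustration only
tagged = [("she", "PRP"), ("was", "VBD"), ("running", "VBG"),
          ("to", "TO"), ("the", "DT"), ("stores", "NNS"), ("quickly", "RB")]

kept = [(word, to_wordnet_pos(tag)) for word, tag in tagged if to_wordnet_pos(tag)]
print(kept)
```

Pronouns, determiners and prepositions are removed, so the filter acts as a coarse stopword removal. It also discards gendered pronouns such as "she", which may or may not be desirable for this task.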



### Encoding 1
The first encoding method cleaned and lemmatized the data, then converted each sentence into the number of times each word occurred in that sentence using the CountVectorizer.

In [215]:
# Fit the vectorizer on the X_train data
vectorizer = CountVectorizer()
out = clean_and_lemmatize(X_train)
vectorizer.fit(out)

def word_frequency(x : pd.Series) -> pd.DataFrame:
    out = vectorizer.transform(clean_and_lemmatize(x))
    df = pd.DataFrame(out.toarray(), columns=vectorizer.get_feature_names_out())
    return df


train_freq = word_frequency(X_train)


### Encoding 2
The second encoding method treated the training comments labeled 'sexist' as a corpus on which we fit the TF-IDF vectorizer.

Conceptually, the idea was that TF-IDF would give weight to words that are distinctive within sexist comments and would encode this in the features of the data. Because the vocabulary comes only from sexist comments, words that never appear in them are ignored when other comments are transformed.

In [216]:
TF_IDF_VEC = TfidfVectorizer()
# Fit the TF-IDF vectorizer only on the training comments labeled sexist (label 1).
TF_IDF_VEC.fit(clean_and_lemmatize(X_train[Y_train==1]))

def TFIDF(x):
    # Clean, lemmatize and encode comments with the fitted TF-IDF vectorizer
    cleaned = clean_and_lemmatize(x)
    out = TF_IDF_VEC.transform(cleaned)
    df = pd.DataFrame(out.toarray(), columns=TF_IDF_VEC.get_feature_names_out())
    return df

train_tf_idf = TFIDF(X_train)


## Models


### Model 1: Word Filter

In [217]:
# The first model checks whether a comment contains a word in the bad_words list.
# If it does, the comment is classified as sexist (1); otherwise it is classified as non-sexist (0).
def model_1(sentence : str, bad_words):
    for word in sentence.split():
        if word in bad_words:
            return 1  
    return 0
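
A quick illustration of how this kind of word filter behaves, using a hypothetical helper and a placeholder word list rather than the learned `bad_words`:

```python
def contains_flagged_word(sentence: str, flagged_words: set) -> int:
    # Hypothetical helper mirroring model_1: 1 if any token is flagged, else 0
    return int(any(word in flagged_words for word in sentence.split()))

flagged = {"badword", "slur"}   # placeholder vocabulary, not learned from data

hit = contains_flagged_word("this has a badword in it", flagged)
miss = contains_flagged_word("badwords are a different token", flagged)
print(hit, miss)
```

Matching is on whole whitespace-separated tokens, so inflected forms only match if lemmatization has already reduced them to a flagged form.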


# Find the best cutoff for the bad_words list
def optimize_model_1(features : pd.DataFrame):
    freq_with_labels = features.copy()
    freq_with_labels['_label'] = Y_train.tolist()
    # sexist_comment_percent = the fraction of training comments that are sexist
    sexist_comment_percent = len(Y_train.loc[Y_train == 1]) / len(Y_train)
    # common_freq = column totals of all words whose total is at least 5
    totals = freq_with_labels.sum()
    common_freq = totals.loc[totals >= 5]
    # sexist = the same totals, counted over sexist comments only
    sexist = features[freq_with_labels["_label"] == 1].sum().loc[totals >= 5]
    # ratio = share of each common word's total that comes from sexist comments
    ratio = (sexist / common_freq).sort_values()

    # Clean each split once rather than once per cutoff
    train_comments = clean_and_lemmatize(X_train)
    test_comments = clean_and_lemmatize(X_test)

    def score(comments, labels, i):
        # Flag words whose sexism ratio exceeds the non-sexist comment fraction plus i / 100
        bad_words = set(ratio.loc[ratio > 1 - sexist_comment_percent + i / 100].index)
        predict = [model_1(comment, bad_words) for comment in comments]
        return f1_score(labels, predict, average = "weighted")

    # Weighted F1 on the training set for each cutoff offset
    f1_scores = {i: score(train_comments, Y_train, i) for i in range(-30, 30)}
    print(max(f1_scores.items(), key=lambda x:x[1]))
    # Weighted F1 on the test set for each cutoff offset (for reporting only;
    # the cutoff itself should be chosen from the training scores)
    f1_scores = {i: score(test_comments, Y_test, i) for i in range(-20, 20)}
    print(max(f1_scores.items(), key=lambda x:x[1]))
    print(f1_scores[-8])
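
The cutoff rule can be followed by hand on a toy corpus. This is a simplified sketch in plain Python with made-up comments. It omits the "total of at least 5" filter, and `flagged_words` is a hypothetical helper.

```python
from collections import Counter

# Toy cleaned comments with labels (1 = sexist, 0 = not sexist); made up for illustration
comments = [("aaa bbb", 1), ("aaa ccc", 1), ("bbb ccc", 0),
            ("ccc ddd", 0), ("ccc ddd", 0), ("aaa ddd", 0)]

total_counts, sexist_counts = Counter(), Counter()
for text, label in comments:
    for word in text.split():
        total_counts[word] += 1
        if label == 1:
            sexist_counts[word] += 1

# Share of each word's occurrences that fall in sexist comments
ratio = {word: sexist_counts[word] / total_counts[word] for word in total_counts}

sexist_rate = sum(label for _, label in comments) / len(comments)

def flagged_words(offset: float) -> set:
    # Same rule as the cutoff above: ratio > (1 - sexist_rate) + offset
    return {word for word, r in ratio.items() if r > 1 - sexist_rate + offset}

print(ratio)
print(flagged_words(-0.20))
print(flagged_words(-0.10))
print(flagged_words(0.10))
```

Lowering the offset flags more words, which raises recall at the cost of precision. Raising it has the opposite effect until no word is flagged at all.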



In [218]:
# Optimize our model for encoding 1
optimize_model_1(train_freq)

(-9, 0.7854088695014045)
(-7, 0.7928391893717034)
0.7904344225729465


In [219]:
# Optimize our model for encoding 2
optimize_model_1(train_tf_idf)

(-22, 0.7574617505587201)
(-16, 0.7812396269580991)
0.7582529740909114


### Model 2

### Model 3: Random Forest Classifier

In [231]:
def create_random_forest_classifier(training_features):
    # Copy the dataframe so the caller's features are never modified
    data = training_features.copy()
    labels = Y_train
    rand_forest = RandomForestClassifier(random_state=0)
    rand_forest.fit(data, labels)
    return rand_forest


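With default hyperparameters, the random forest already performs well: the weighted F1 on the test set is about 0.80 with word counts and 0.79 with TF-IDF (computed in the cells below).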

In [232]:
# Train baseline random forests (default hyperparameters) on each encoding
word_frequency_rand_forest = create_random_forest_classifier(train_freq)
tf_idf_rand_forest = create_random_forest_classifier(train_tf_idf)


In [233]:
f1_score(
    Y_test, 
    word_frequency_rand_forest.predict(word_frequency(X_test)), 
    average = "weighted"
)

0.8005719221181321

In [234]:
f1_score(
    Y_test, 
    tf_idf_rand_forest.predict(TFIDF(X_test)), 
    average = "weighted"
)

0.7911743716353477

There is still some tuning we can do to improve performance.

In [257]:
# Define the model parameters we want to optimize.
# Searching over more parameters can help, but it dramatically increases runtime;
# this grid takes about 3 minutes to run.
# Got this idea from https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
# Minimum number of samples required to split a node
split_samples = [2, 5]
# Minimum number of samples required at each leaf node
min_leaf = [1, 2, 4]
# Whether each tree is trained on a bootstrap sample (True) or the whole training set (False)
bootstrap = [True, False]
# Create the random grid
random_grid = {
    'min_samples_split': split_samples,
    'min_samples_leaf': min_leaf,
    'bootstrap': bootstrap
}

def find_best_random_forest(x_train):
    # Randomized search with 3-fold cross-validation on the training data.
    # The grid has only 12 combinations, so at most 12 candidates are tried
    # even though n_iter is 100.
    rf = RandomForestClassifier(random_state=0)
    rf_random = RandomizedSearchCV(
        estimator = rf, 
        param_distributions = random_grid, 
        n_iter = 100, 
        cv = 3, 
        verbose=2, 
        random_state=0, 
        n_jobs = -1
    )
    # Fit the random search model
    rf_random.fit(x_train, Y_train)
    return rf_random.best_estimator_
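
The size of the search space can be checked directly. This sketch uses `ParameterGrid` from scikit-learn on a copy of the grid (`toy_grid` is a hypothetical name) to count the distinct settings.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as random_grid above
toy_grid = {
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_combinations = len(ParameterGrid(toy_grid))
print(n_combinations)   # 2 * 3 * 2 distinct settings
```

With only 12 distinct settings, the randomized search ends up trying every combination, which matches the "12 candidates" reported in the fit log. Note also that the search ranks candidates by the estimator's default score (accuracy) unless a `scoring` argument such as `"f1_weighted"` is passed.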

In [243]:
word_frequency_rand_forest_improved = find_best_random_forest(train_freq)

Fitting 3 folds for each of 12 candidates, totalling 36 fits




In [256]:
tf_idf_rand_forest_improved = find_best_random_forest(train_tf_idf)



Fitting 3 folds for each of 12 candidates, totalling 36 fits


In [245]:
# Check the model performance
f1_score(
    Y_test, 
    word_frequency_rand_forest_improved.predict(word_frequency(X_test)), 
    average = "weighted"
)

0.8070724261772241

In [None]:
word_frequency_rand_forest_improved.get_params()

In [258]:
f1_score(Y_test, tf_idf_rand_forest_improved.predict(TFIDF(X_test)), average = "weighted")

0.7946202825334021

In [259]:
tf_idf_rand_forest_improved.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 0,
 'verbose': 0,
 'warm_start': False}

## Summary

1. What preprocessing steps do you follow?
   
   Your answer:
   
2. How do you select the features from the inputs?
   
   Your answer:
   
3. Which model do you use and what is the structure of your model?
   
   Your answer:
   
4. How do you train your model?
   
   Your answer:
   
5. What is the performance of your best model?
   
   Your answer:
   
6. What other models or feature engineering methods would you like to implement in the future?
   
   Your answer:
   