

**Group HOMEWORK**. This final project can be collaborative. A group can have at most 2 members. You can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you get an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can decide:
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report your best performance together with (1) the feature engineering method and (2) the model that produced it.
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions:
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs?
    - Which model do you use and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two Rules** (violating either one will result in 0 points for the project):
    - You may not use the test set in training: you CANNOT use any instance from the test set in the training process.
    - You may not use code from generative AI (e.g., ChatGPT).
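The results table asked for under **Evaluation** could be assembled along these lines. This is only a sketch under stated assumptions: the toy texts, the two encodings, and the three models (`MultinomialNB`, `LogisticRegression`, `LinearSVC`) are placeholders chosen for illustration, not choices prescribed by the assignment.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data, invented for this sketch (1 = positive class, 0 = negative class)
train_text = ["good kind post", "rude mean post", "kind helpful reply",
              "mean rude reply", "good helpful post", "rude awful post"]
train_labels = [0, 1, 0, 1, 0, 1]
test_text = ["kind good reply", "awful mean post"]
test_labels = [0, 1]

# Hypothetical feature encoders and models; each is created fresh per run
features = {"counts": CountVectorizer, "tfidf": TfidfVectorizer}
models = {"naive_bayes": MultinomialNB,
          "logistic_regression": LogisticRegression,
          "linear_svm": LinearSVC}

row_names, records = [], []
for feature_name, make_vectorizer in features.items():
    vectorizer = make_vectorizer()
    X_tr = vectorizer.fit_transform(train_text)  # fit on training text only
    X_te = vectorizer.transform(test_text)       # test text is only transformed
    for model_name, make_model in models.items():
        predictions = make_model().fit(X_tr, train_labels).predict(X_te)
        row_names.append(f"{feature_name}+{model_name}")
        records.append({
            "Precision": precision_score(test_labels, predictions,
                                         average="weighted", zero_division=0),
            "Accuracy": accuracy_score(test_labels, predictions),
            "F1-score": f1_score(test_labels, predictions,
                                 average="weighted", zero_division=0),
        })

# One row per feature+model combination, one column per metric
results = pd.DataFrame(records, index=row_names,
                       columns=["Precision", "Accuracy", "F1-score"])
print(results)
```

Sorting `results` by the `F1-score` column is one simple way to pick the combination to report as the best.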

## Evaluation

Performance should be evaluated only on the test set (1086 instances in total). Please split the original dataset into a train set and a test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and f1-score (use `classification_report` in sklearn).
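To make the "weighted average" concrete, the sketch below computes a weighted F1-score by hand: each class's F1 is weighted by the number of true instances of that class (its support). The function name `weighted_f1` is hypothetical; the quantity is the same one that scikit-learn reports for `f1_score(..., average="weighted")` and in the "weighted avg" row of `classification_report`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1-scores."""
    total = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        support = tp + fn  # number of true instances of this class
        total += f1 * support
    return total / len(y_true)


# Class 0: precision 1, recall 2/3, F1 0.8, support 3
# Class 1: precision 1/2, recall 1, F1 2/3, support 1
score = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
print(round(score, 4))  # (0.8 * 3 + 2/3 * 1) / 4
```

Because the classes are imbalanced, the weighted score sits closer to the majority class's F1 than a plain (macro) average would.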

The total is 10.0 points. Each team competes with the other teams in the class on its best performance. Points will be deducted if the requirements above are not followed.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report.

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) a summary (you can add markdown cells).

The due date is **December 8, Friday by 11:59pm**.

In [1]:
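# Import everything we need
# Note: word_tokenize, pos_tag and WordNetLemmatizer rely on NLTK data packages (install them with nltk.download)
import pandas as pd
import re
import nltk
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report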

In [2]:
# Build the train and test dataframes, then split each into X (text) and Y (label)
comments_df = pd.read_csv('edos_labelled_data.csv') 
le = preprocessing.LabelEncoder()
comments_df["label"] = le.fit_transform(comments_df["label"])
comments_train_df = comments_df[comments_df["split"] == "train"]
comments_test_df = comments_df[comments_df["split"] == "test"]
X_train = comments_train_df["text"]
Y_train = comments_train_df["label"]
X_test = comments_test_df["text"]
Y_test = comments_test_df["label"]

In [3]:
def clean(comments: list[str]) -> list[str]:
    # Drops non-ASCII characters, lowercases the text, removes hashtags and the
    # [user] / [url] placeholders, and strips punctuation
    # Returns a list of "cleaned" strings
    comments_clean = [comment.encode("ascii", "ignore").decode() for comment in comments]
    comments_clean = [comment.lower() for comment in comments_clean]
    comments_clean = [re.sub(r'(#\w+|\[user\]|\[url\])', '', comment) for comment in comments_clean]
    translator = str.maketrans('', '', string.punctuation)
    comments_clean = [comment.translate(translator) for comment in comments_clean]

    return comments_clean

In [4]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def clean2EvenCleaner(x):
    # Cleans the comments, keeps only adjectives, verbs, nouns and adverbs, and lemmatizes them
    clean_x = clean(x)
    lemmatizer = WordNetLemmatizer()
    out = []
    for sentence in clean_x:
        # Tag each token and map its Treebank tag to a WordNet POS ('' if there is no match)
        tagged = [(word, get_wordnet_pos(tag)) for word, tag in pos_tag(word_tokenize(sentence))]
        # Drop tokens without a WordNet POS, then lemmatize the rest
        lemmas = [lemmatizer.lemmatize(word, pos) for word, pos in tagged if pos != '']
        out.append(" ".join(lemmas))
    return out


In [5]:
def toWordFreqDF(x, y):
    # Builds a comment-by-word count table from the cleaned text,
    # with each comment's label stored in the '_label' column
    clean_x = clean(x)
    vectorizer = CountVectorizer()
    vec = vectorizer.fit_transform(clean_x)
    frequency_df = pd.DataFrame(vec.toarray(), columns=vectorizer.get_feature_names_out())
    frequency_df['_label'] = y.tolist()
    return frequency_df

train_freq = toWordFreqDF(X_train, Y_train)
test_freq = toWordFreqDF(X_test, Y_test)

In [6]:
# sexist_comment_percent = the fraction of training comments that are sexist (between 0 and 1)
sexist_comment_percent = len(Y_train.loc[Y_train == 1]) / len(Y_train)
# word_counts = total occurrences of each word in the training comments (label column excluded)
word_counts = train_freq.drop(columns="_label").sum()
# common_freq = all words that appear at least 5 times
common_freq = word_counts.loc[word_counts >= 5]
# sexist = occurrences of each common word within sexist comments
sexist = train_freq[train_freq["_label"] == 1].drop(columns="_label").sum().loc[word_counts >= 5]
# ratio = occurrences in sexist comments / total occurrences, for each word in common_freq
ratio = (sexist / common_freq).sort_values()
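The per-word ratio computed in the cell above can be checked on a tiny example. The sketch below recomputes the same quantity (occurrences of a word in positively labelled comments divided by its total occurrences) with plain Python counters. The helper name `label_ratio` and the toy comments are hypothetical, and whitespace splitting is a simplification of `CountVectorizer`'s tokenization.

```python
from collections import Counter


def label_ratio(comments, labels, min_count=5):
    """Share of each word's occurrences that fall in comments labelled 1."""
    total = Counter()
    positive = Counter()
    for comment, label in zip(comments, labels):
        words = comment.split()
        total.update(words)
        if label == 1:
            positive.update(words)
    # Keep only words seen at least min_count times, as in the notebook
    return {word: positive[word] / count
            for word, count in total.items() if count >= min_count}


# Toy comments, invented for this sketch
toy_comments = ["spam spam eggs", "spam ham", "eggs ham ham"]
toy_labels = [1, 1, 0]
toy_ratio = label_ratio(toy_comments, toy_labels, min_count=2)
print(toy_ratio)
```

Here "spam" occurs only in comments labelled 1, so its ratio is 1.0, while "ham" occurs mostly in the comment labelled 0 and gets a ratio of 1/3.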

In [13]:
# The first model checks whether a comment contains a word in the bad_words list
# If it does, the comment is classified as sexist; otherwise it is classified as non-sexist
def model_1(sentence : str, bad_words):
    for word in sentence.split():
        if word in bad_words:
            return 1  
    return 0


# Find the best cutoff for the bad_words list
def score_cutoffs(X, Y):
    # Weighted F1 of model_1 on (X, Y) for every cutoff offset i in [-20, 20)
    comments = clean(X)
    f1_scores = {}
    for i in range(-20, 20):
        # All words whose sexism ratio exceeds the non-sexist comment share plus i / 100
        bad_words = set(ratio.loc[ratio > 1 - sexist_comment_percent + i / 100].index)
        predict = [model_1(comment, bad_words) for comment in comments]
        f1_scores[i] = f1_score(Y, predict, average="weighted")
    return f1_scores


def optimize_model_1():
    # Select the cutoff on the training set only
    train_scores = score_cutoffs(X_train, Y_train)
    best_train = max(train_scores.items(), key=lambda x: x[1])
    print(best_train)
    test_scores = score_cutoffs(X_test, Y_test)
    # Best cutoff on the test set, printed for comparison only; it must not drive the selection
    print(max(test_scores.items(), key=lambda x: x[1]))
    # Test F1 at the cutoff selected on the training set (the number to report)
    print(test_scores[best_train[0]])

optimize_model_1()

(-8, 0.7872235271864106)
(-3, 0.7889672885458582)
0.7685862242158681


## Summary

1. What preprocessing steps do you follow?
   
   Your answer:
   
2. How do you select the features from the inputs?
   
   Your answer:
   
3. Which model do you use and what is the structure of your model?
   
   Your answer:
   
4. How do you train your model?
   
   Your answer:
   
5. What is the performance of your best model?
   
   Your answer:
   
6. What other models or feature engineering methods would you like to implement in the future?
   
   Your answer:
   