## Introduction to the Sexism Detection Dataset

### Data Format

-   **rewire_id:**  A unique identifier for each data point, for example `sexism2022_english-9609`.
-   **text:**  The text to be classified.
-   **label_sexist:**  A binary label, either `sexist` or `not sexist`.
-   **label_category:**  The type of sexism the text belongs to, or `none` when the text is not sexist.
-   **label_vector:**  A finer-grained label within the category, stored as a string (or `none`), not as a numeric array.
-   **split:**  The partition the row belongs to: `train`, `dev`, or `test`.

### Label Information

-   **label_sexist:**
    -   **not sexist:**  The text does not contain any sexist content.
    -   **sexist:**  The text contains sexist content.
-   **label_category:**
    -   The type of sexism for texts labelled `sexist`. Texts labelled `not sexist` carry the placeholder value `none`, as in the `data.head()` output further down.
-   **label_vector:**
    -   A finer-grained label within the category. It is stored as a string rather than a numeric array (the column has dtype `object`), and texts labelled `not sexist` again carry `none`.
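
Most classifiers expect numeric targets rather than the label strings listed above. A minimal sketch of one possible encoding follows; the mapping `LABEL_TO_ID` and the helper `encode_label` are illustrative names and are not part of the dataset or of this notebook.

```python
# Hypothetical mapping from the two label_sexist strings to integer targets.
LABEL_TO_ID = {"not sexist": 0, "sexist": 1}

def encode_label(label):
    """Return 0 or 1 for a label_sexist value; raise on anything unexpected."""
    try:
        return LABEL_TO_ID[label]
    except KeyError:
        raise ValueError(f"unexpected label_sexist value: {label!r}") from None

labels = ["not sexist", "sexist", "not sexist"]
encoded = [encode_label(label) for label in labels]
print(encoded)
```

With pandas, the same mapping could be applied to the whole column with `data["label_sexist"].map(LABEL_TO_ID)`.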

### Data Split

-   **split:**
    -   **dev:**  Development (validation) set, 2,000 rows.
    -   **train:**  Training set, 14,000 rows.
    -   **test:**  Test set, 4,000 rows.

### Potential Applications

-   Training machine learning models to identify and classify sexist text.
-   Developing tools and systems for detecting and mitigating sexism in online content.
-   Conducting research on the prevalence and patterns of sexism in language.
-   Studying the impact of sexist language on individuals and society.

### Limitations and Considerations

-   The dataset may contain biases or limitations inherent in the data collection process or the labeling methodology.
-   Rows labelled `not sexist` have `none` in  **label_category**  and  **label_vector**, so these two columns are only informative for the sexist subset.
-   The raw text needs cleaning before most classical NLP models can use it. This notebook removes punctuation and special characters, lowercases the text, removes stop words, and then applies stemming and lemmatization.


## Header

```
rewire_id	text	label_sexist	label_category	label_vector	split
```
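
To illustrate how a row maps onto these columns, the sketch below parses one sample row with Python's `csv` module. It assumes the file is comma-separated with double-quoted text fields, as the `.csv` extension and the default `pd.read_csv` call later in this notebook suggest; the in-memory string stands in for the real file.

```python
import csv
import io

# One header line plus one row, written as comma-separated values (an assumption).
sample = (
    "rewire_id,text,label_sexist,label_category,label_vector,split\n"
    'sexism2022_english-16993,"Then, she\'s a keeper. 😉",not sexist,none,none,train\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]

# The quoted text keeps its internal comma instead of being split into two fields.
print(first["text"])
print(first["label_sexist"], first["split"])
```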

## Import Libraries

In [160]:
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd
import nltk
import string
import re

## Preprocessing

In [161]:
data = pd.read_csv("./edos_labelled_aggregated.csv")

In [162]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   rewire_id       20000 non-null  object
 1   text            20000 non-null  object
 2   label_sexist    20000 non-null  object
 3   label_category  20000 non-null  object
 4   label_vector    20000 non-null  object
 5   split           20000 non-null  object
dtypes: object(6)
memory usage: 937.6+ KB


In [163]:
data.head()

Unnamed: 0,rewire_id,text,label_sexist,label_category,label_vector,split
0,sexism2022_english-9609,"In Nigeria, if you rape a woman, the men rape ...",not sexist,none,none,dev
1,sexism2022_english-16993,"Then, she's a keeper. 😉",not sexist,none,none,train
2,sexism2022_english-13149,This is like the Metallica video where the poo...,not sexist,none,none,train
3,sexism2022_english-13021,woman?,not sexist,none,none,train
4,sexism2022_english-966,I bet she wished she had a gun,not sexist,none,none,dev


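Drop unnecessary columns:

In [164]:
# Keep only the columns used below; .copy() avoids pandas' SettingWithCopyWarning
# when the text column is reassigned in later cells.
data = data[['split', 'text', 'label_sexist']].copy()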

Remove punctuation and special characters:

In [165]:
# Translation table that deletes every ASCII punctuation character; built once and reused.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
  return text.translate(PUNCTUATION_TABLE)

data["text"] = data["text"].apply(remove_punctuation)

In [166]:
def remove_special_characters(text):
  # Delete everything except ASCII letters, digits, and whitespace (this also drops emoji).
  pattern = r'[^a-zA-Z0-9\s]'
  return re.sub(pattern, '', text)

data["text"] = data["text"].apply(remove_special_characters)
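
Every character in `string.punctuation` also falls outside `[a-zA-Z0-9\s]`, so the regular expression on its own already removes punctuation. The sketch below checks this on one sample sentence from the data; `clean_two_pass` and `clean_one_pass` are illustrative names that do not appear elsewhere in the notebook.

```python
import re
import string

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)
NON_ALPHANUMERIC = re.compile(r"[^a-zA-Z0-9\s]")

def clean_two_pass(text):
    """Punctuation removal followed by the regex, as in the two cells above."""
    return NON_ALPHANUMERIC.sub("", text.translate(PUNCTUATION_TABLE))

def clean_one_pass(text):
    """The regex on its own."""
    return NON_ALPHANUMERIC.sub("", text)

sample = "Then, she's a keeper. 😉"
print(repr(clean_two_pass(sample)))
print(repr(clean_one_pass(sample)))
```

Both calls return the same string. The apostrophe is deleted rather than replaced by a space, so `she's` becomes `shes`, and the regex also discards emoji and any non-ASCII letters.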

Convert text to lowercase:

In [167]:
def to_lowercase(text):
  return text.lower()

data["text"] = data["text"].apply(to_lowercase)

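Remove stop words:

In [168]:
# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once beforehand.
# Build the set once instead of reloading the list for every row.
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(text):
  return ' '.join(word for word in text.split() if word not in STOP_WORDS)

data["text"] = data["text"].apply(remove_stop_words)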
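
One side effect of the order of these steps: punctuation is stripped before stop words are filtered, so a contraction such as `she's` has already become `shes` by the time the filter runs. The sketch below shows the effect with a tiny hand-written stop-word set, which is a stand-in for the NLTK list and not its actual contents; `filter_stop_words` is an illustrative name.

```python
# Tiny illustrative stop-word set; a stand-in for stopwords.words('english').
STOP_WORDS = {"then", "she", "she's", "a"}

def filter_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

# Apostrophe kept: the contraction is recognised and removed.
with_apostrophe = filter_stop_words("then she's a keeper")
# Apostrophe already stripped: "shes" slips through the filter.
without_apostrophe = filter_stop_words("then shes a keeper")

print(with_apostrophe)
print(without_apostrophe)
```

This is consistent with the `she keeper` row in the training sample further down, where the surviving `shes` appears to have been stemmed to `she`.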

Stemming:

In [None]:
# Create the stemmer once and reuse it for every row.
stemmer = PorterStemmer()

def stemming(text):
  return ' '.join(stemmer.stem(word) for word in text.split())

data["text"] = data["text"].apply(stemming)

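Lemmatization:

In [None]:
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.
# Note that the tokens reaching this step have already been stemmed.
lemmatizer = WordNetLemmatizer()

def lemmatization(text):
  return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

data["text"] = data["text"].apply(lemmatization)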
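
The cleaning steps above are applied one cell at a time. As a sketch of an alternative layout, the character-level steps can be chained into a single callable. `make_pipeline` is a hypothetical helper written for this example (not a library function), and the NLTK-based steps are left out so that the snippet runs without any downloaded corpora.

```python
import re
import string
from functools import reduce

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    return text.translate(PUNCTUATION_TABLE)

def remove_special_characters(text):
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

def make_pipeline(*steps):
    """Return a function that applies each step to the text, left to right."""
    def run(text):
        return reduce(lambda current, step: step(current), steps, text)
    return run

clean_text = make_pipeline(remove_punctuation, remove_special_characters, str.lower)
print(repr(clean_text("I bet she wished she had a gun")))
```

With pandas this would reduce to a single `data["text"].apply(clean_text)` call, and token-level steps such as stop-word removal or stemming could be appended as further arguments.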

## Splitting the data

In [None]:
train_mask = data['split'] == 'train'
test_mask = data['split'] == 'test'
val_mask = data['split'] == 'dev'

In [None]:
train_data = data[train_mask]
test_data = data[test_mask]
val_data = data[val_mask]
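
An equivalent way to obtain the three subsets is to group on the `split` column. The sketch below uses a four-row toy frame in place of the real data; the names `toy` and `splits` are illustrative.

```python
import pandas as pd

# Toy stand-in for the cleaned DataFrame.
toy = pd.DataFrame({
    "split": ["dev", "train", "train", "test"],
    "text": ["bet wish gun", "she keeper", "woman", "placeholder test text"],
    "label_sexist": ["not sexist", "not sexist", "not sexist", "not sexist"],
})

# One DataFrame per split value, with the now-redundant split column dropped.
splits = {
    name: frame.drop(columns=["split"])
    for name, frame in toy.groupby("split")
}

# Sanity check: the subsets together cover every row exactly once.
assert sum(len(frame) for frame in splits.values()) == len(toy)
print({name: len(frame) for name, frame in splits.items()})
```

The same check on the real data would confirm that the 14,000 training, 4,000 test, and 2,000 development rows add up to the 20,000 rows reported by `data.info()`.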

### Training data information

In [None]:
train_data.head(4)

Unnamed: 0,split,text,label_sexist
1,train,she keeper,not sexist
2,train,like metallica video poor mutil bastard say pl...,not sexist
3,train,woman,not sexist
5,train,unlicens day care worker reportedli tell cop w...,not sexist


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14000 entries, 1 to 19998
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   split         14000 non-null  object
 1   text          14000 non-null  object
 2   label_sexist  14000 non-null  object
dtypes: object(3)
memory usage: 437.5+ KB


In [None]:
train_data = train_data.drop(columns=["split"])

### Testing data information

In [None]:
test_data.head(4)

Unnamed: 0,split,text,label_sexist
21,test,fuck nigger jew histori eat white peopl fuck s...,not sexist
27,test,well good someon knock as back line act like w...,sexist
35,test,usa texa islam muslim islam sharialaw sharia t...,not sexist
38,test,ye normal woman want domin social scientist ca...,sexist


In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4000 entries, 21 to 19999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   split         4000 non-null   object
 1   text          4000 non-null   object
 2   label_sexist  4000 non-null   object
dtypes: object(3)
memory usage: 125.0+ KB


In [None]:
test_data = test_data.drop(columns=["split"])

### Validation data information

In [None]:
val_data.head(4)

Unnamed: 0,split,text,label_sexist
0,dev,nigeria rape woman men rape back nsfw nigeria ...,not sexist
4,dev,bet wish gun,not sexist
9,dev,agre time know well enough say cant love woman...,sexist
15,dev,democrat minnesota leftist muzzi shithol dumbe...,sexist


In [None]:
val_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 0 to 19974
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   split         2000 non-null   object
 1   text          2000 non-null   object
 2   label_sexist  2000 non-null   object
dtypes: object(3)
memory usage: 62.5+ KB


In [None]:
val_data = val_data.drop(columns=["split"])