## Introduction to the Sexism Detection Dataset



### Data Format

-   **rewire_id:**  A unique identifier for each data point.
-   **text:**  The actual text content.
-   **label_sexist:**  A binary label indicating whether the text is sexist or not.
-   **label_category:**  A categorical label indicating the type of sexism or other category the text belongs to (if applicable).
-   **label_vector:**  A numerical vector representation of the labels (if applicable).
-   **split:**  A column indicating the split of the data into training, development, or test sets.

### Label Information

-   **label_sexist:**
    -   **not sexist:**  The text does not contain any sexist content.
    -   **sexist:**  The text contains sexist content.
-   **label_category:**
    -   This column may contain various categories of sexism or other types of content. The specific categories and their meanings will depend on the context of the dataset.
-   **label_vector:**
    -   This column may contain a numerical vector representation of the labels. The specific format and interpretation of this vector will depend on the task and the model used.

### Data Split

-   **split:**
    -   **dev:**  Development set.
    -   **train:**  Training set.
    -   **test:**  Test set.
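
A quick way to confirm how rows are distributed across the three splits is to count the values of the `split` column. The sketch below is only illustrative: `toy` is a hypothetical hand-made DataFrame standing in for the real CSV, and only the column name `split` and its three values come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real file; only the `split` column matters here.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "split": ["train", "train", "train", "dev", "test"],
})

# Absolute row counts per split, then the same as fractions of the whole.
split_counts = toy["split"].value_counts()
split_fractions = toy["split"].value_counts(normalize=True)

print(split_counts)
print(split_fractions)
```

Run against the full file used in this notebook, the same call should report 14,000 train, 2,000 dev, and 4,000 test rows, in line with the `info()` outputs shown further down.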

### Potential Applications

-   Training machine learning models to identify and classify sexist text.
-   Developing tools and systems for detecting and mitigating sexism in online content.
-   Conducting research on the prevalence and patterns of sexism in language.
-   Studying the impact of sexist language on individuals and society.

### Limitations and Considerations

-   The dataset may contain biases or limitations inherent in the data collection process or the labeling methodology.
-   The specific categories of sexism or other types of content in the  **label_category**  column may vary depending on the context and purpose of the dataset.
-   The dataset may require additional preprocessing and feature engineering to be suitable for specific NLP tasks.


## Header

```
rewire_id	text	label_sexist	label_category	label_vector	split
```

## Import Libraries

In [136]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.layers import Embedding, LSTM, Dense
from keras.utils import pad_sequences
from nltk.stem import WordNetLemmatizer
from keras.models import Sequential
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import tensorflow as tf
import pandas as pd
import numpy as np
import nltk
import string
import re

## Preprocessing

In [103]:
data = pd.read_csv("./edos_labelled_aggregated.csv")

In [104]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   rewire_id       20000 non-null  object
 1   text            20000 non-null  object
 2   label_sexist    20000 non-null  object
 3   label_category  20000 non-null  object
 4   label_vector    20000 non-null  object
 5   split           20000 non-null  object
dtypes: object(6)
memory usage: 937.6+ KB


In [105]:
data.head()

Unnamed: 0,rewire_id,text,label_sexist,label_category,label_vector,split
0,sexism2022_english-9609,"In Nigeria, if you rape a woman, the men rape ...",not sexist,none,none,dev
1,sexism2022_english-16993,"Then, she's a keeper. 😉",not sexist,none,none,train
2,sexism2022_english-13149,This is like the Metallica video where the poo...,not sexist,none,none,train
3,sexism2022_english-13021,woman?,not sexist,none,none,train
4,sexism2022_english-966,I bet she wished she had a gun,not sexist,none,none,dev


Drop unnecessary columns:

In [106]:
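# Take an explicit copy so the later column assignments do not trigger SettingWithCopyWarning.
data = data[['split', 'text', 'label_sexist']].copy()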

Remove punctuation and special characters:

In [107]:
def remove_punctuation(text):
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

data["text"] = data["text"].apply(lambda x: remove_punctuation(x))

In [108]:
def remove_special_characters(text):
  pattern = r'[^a-zA-Z0-9\s]'
  return re.sub(pattern, '', text)

data["text"] = data["text"].apply(lambda x: remove_special_characters(x))
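
The two cleaning steps above overlap: the pattern in `remove_special_characters` already matches every character in `string.punctuation`, along with emoji and other non-ASCII symbols. The sketch below checks this on one of the sample rows. The helper name `clean` is hypothetical and not part of the notebook.

```python
import re
import string

# Same pattern as remove_special_characters above.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")

def clean(text: str) -> str:
    """Hypothetical helper: drop everything except ASCII letters, digits and whitespace."""
    return NON_ALNUM.sub("", text)

sample = "Then, she's a keeper. 😉"

# Two-step version from the notebook: strip punctuation first, then apply the regex.
two_step = NON_ALNUM.sub("", sample.translate(str.maketrans("", "", string.punctuation)))

print(repr(clean(sample)))
print(clean(sample) == two_step)
```

Note that this pattern also strips accented letters and non-Latin scripts, which may or may not be desirable depending on the text.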

Convert text to lowercase:

In [109]:
def to_lowercase(text):
  return text.lower()

data["text"] = data["text"].apply(lambda x: to_lowercase(x))

Remove stop words:

In [110]:
# Build the stop-word set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(text):
  return ' '.join([word for word in text.split() if word not in STOP_WORDS])

data["text"] = data["text"].apply(lambda x: remove_stop_words(x))

Stemming:

In [111]:
def stemming(text):
  stemmer = PorterStemmer()
  return ' '.join([stemmer.stem(word) for word in text.split()])

data["text"] = data["text"].apply(lambda x: stemming(x))

Lemmatization:

In [112]:
def lemmatization(text):
  lemmatizer = WordNetLemmatizer()
  return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

data["text"] = data["text"].apply(lambda x: lemmatization(x))

In [113]:
label_encoder = LabelEncoder()

In [114]:
data['label_sexist'] = label_encoder.fit_transform(data['label_sexist'])

In [115]:
num_classes = len(set(data['label_sexist']))

In [116]:
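# to_categorical returns an (n, 2) one-hot array, but a single column can hold only one of its columns.
# Keep column 0 explicitly: 1.0 then means "not sexist" and 0.0 means "sexist", as in the outputs below.
data['label_sexist'] = to_categorical(data['label_sexist'], num_classes=num_classes)[:, 0]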
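
To see what the encoding steps above produce, the following sketch reproduces them on five made-up labels. It uses NumPy stand-ins rather than the notebook's own calls, which is an assumption made to keep the example dependency-free: `np.unique(..., return_inverse=True)` mirrors the sorted class order that `LabelEncoder` uses, and indexing `np.eye` mirrors the one-hot rows that `to_categorical` returns.

```python
import numpy as np

labels = np.array(["not sexist", "sexist", "not sexist", "sexist", "sexist"])

# Sorted unique classes and integer codes, analogous to LabelEncoder.fit_transform.
classes, codes = np.unique(labels, return_inverse=True)

# One-hot rows, analogous to to_categorical(codes, num_classes=2).
one_hot = np.eye(len(classes), dtype="float32")[codes]

print(classes)        # class order that defines the integer codes
print(codes)          # one integer per row
print(one_hot.shape)  # two columns, one per class
print(one_hot[:, 0])  # column 0 is 1.0 where the label is "not sexist"
```

Because the model below ends in a single sigmoid unit, one label column is enough, and the column that is kept decides what `1.0` means. In this notebook's outputs, `1.0` lines up with rows that were originally labeled `not sexist`.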

## Splitting the data

In [117]:
train_mask = data['split'] == 'train'
test_mask = data['split'] == 'test'
val_mask = data['split'] == 'dev'

In [118]:
train_data = data[train_mask]
test_data = data[test_mask]
val_data = data[val_mask]

### Training data information

In [119]:
train_data.head(4)

Unnamed: 0,split,text,label_sexist
1,train,she keeper,1.0
2,train,like metallica video poor mutil bastard say pl...,1.0
3,train,woman,1.0
5,train,unlicens day care worker reportedli tell cop w...,1.0


In [120]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14000 entries, 1 to 19998
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   split         14000 non-null  object 
 1   text          14000 non-null  object 
 2   label_sexist  14000 non-null  float32
dtypes: float32(1), object(2)
memory usage: 382.8+ KB


In [121]:
X_train = train_data.drop(['label_sexist'] , axis=1)
y_train = train_data.label_sexist

### Testing data information

In [122]:
test_data.head(4)

Unnamed: 0,split,text,label_sexist
21,test,fuck nigger jew histori eat white peopl fuck s...,1.0
27,test,well good someon knock as back line act like w...,0.0
35,test,usa texa islam muslim islam sharialaw sharia t...,1.0
38,test,ye normal woman want domin social scientist ca...,0.0


In [123]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4000 entries, 21 to 19999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   split         4000 non-null   object 
 1   text          4000 non-null   object 
 2   label_sexist  4000 non-null   float32
dtypes: float32(1), object(2)
memory usage: 109.4+ KB


In [124]:
X_test = test_data.drop(['label_sexist'] , axis=1)
y_test = test_data.label_sexist

### Validation data information

In [125]:
val_data.head(4)

Unnamed: 0,split,text,label_sexist
0,dev,nigeria rape woman men rape back nsfw nigeria ...,1.0
4,dev,bet wish gun,1.0
9,dev,agre time know well enough say cant love woman...,0.0
15,dev,democrat minnesota leftist muzzi shithol dumbe...,0.0


In [126]:
val_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 0 to 19974
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   split         2000 non-null   object 
 1   text          2000 non-null   object 
 2   label_sexist  2000 non-null   float32
dtypes: float32(1), object(2)
memory usage: 54.7+ KB


In [127]:
X_val = val_data.drop(["label_sexist"] ,axis=1)
y_val = val_data.label_sexist

## Tokenize the data

In [128]:
tokenizer = Tokenizer()

Training data

In [129]:
texts_train = X_train['text'].tolist()
tokenizer.fit_on_texts(texts_train)
X_train_sequences = tokenizer.texts_to_sequences(texts_train)

Validation data

In [130]:
text_val = X_val['text'].tolist()
# Reuse the vocabulary fitted on the training data; refitting on validation texts would leak their vocabulary.
X_val_sequences = tokenizer.texts_to_sequences(text_val)

Testing data

In [133]:
text_test = X_test["text"].tolist()
# Reuse the vocabulary fitted on the training data; refitting on test texts would leak their vocabulary.
X_test_sequences = tokenizer.texts_to_sequences(text_test)
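
The usual practice is to build the vocabulary from the training texts only and map words first seen in validation or test data to a reserved out-of-vocabulary (OOV) index. The sketch below illustrates the idea with `ToyTokenizer`, a hypothetical pure-Python stand-in. It is not the Keras `Tokenizer` implementation, and the texts are made up.

```python
from collections import Counter

class ToyTokenizer:
    """Hypothetical, simplified stand-in for a word-index tokenizer."""

    OOV_INDEX = 1  # index 0 is left free for padding

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        counts = Counter(word for text in texts for word in text.split())
        # Most frequent words get the smallest indices, starting after OOV.
        for rank, (word, _) in enumerate(counts.most_common(), start=2):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        return [[self.word_index.get(w, self.OOV_INDEX) for w in t.split()] for t in texts]

train_texts = ["she keeper", "bet wish gun", "she wish"]
val_texts = ["she wish luck"]  # "luck" never appears in training

tok = ToyTokenizer()
tok.fit_on_texts(train_texts)  # fit once, on training texts only
train_seq = tok.texts_to_sequences(train_texts)
val_seq = tok.texts_to_sequences(val_texts)
print(val_seq)
```

With the real Keras `Tokenizer`, the `oov_token` constructor argument plays a similar role. Without it, unseen words are simply dropped from the sequence.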

Pad sequences to ensure they have the same length

In [134]:
max_sequence_length = 100
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_sequence_length)
X_val_padded = pad_sequences(X_val_sequences, maxlen=max_sequence_length)
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_sequence_length)
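
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), so the last tokens of a long text are the ones that survive. The NumPy sketch below reproduces that behavior for illustration. `pad_pre` is a hypothetical helper, not the library function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Hypothetical helper: left-pad with zeros and keep the last `maxlen` tokens."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        kept = seq[-maxlen:]          # truncate from the front
        out[row, -len(kept):] = kept  # right-align, zeros on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Front padding suits an LSTM that reads left to right, since the real tokens sit next to the final time step that feeds the output layer.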

## Fitting the model

Define the LSTM model architecture

In [95]:
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50, input_length=max_sequence_length))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
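
As a rough sanity check on the size of this architecture, the trainable parameter counts can be worked out by hand. The sketch below assumes the standard layer formulas with biases enabled. The vocabulary size is a made-up placeholder, since the real value depends on the fitted tokenizer.

```python
def embedding_params(vocab_size: int, output_dim: int) -> int:
    # One vector of length output_dim per vocabulary entry.
    return vocab_size * output_dim

def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size = 10_000  # hypothetical; the notebook uses len(tokenizer.word_index) + 1
total = embedding_params(vocab_size, 50) + lstm_params(50, 32) + dense_params(32, 1)
print(lstm_params(50, 32), dense_params(32, 1), total)
```

Under these assumptions, the embedding table accounts for nearly all of the parameters, so vocabulary size is the main lever on model size.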


In [96]:
model.fit(X_train_padded, y_train, epochs=10, batch_size=64, validation_data=(X_val_padded, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Testing

In [135]:
test_loss, test_accuracy = model.evaluate(X_test_padded, y_test)
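
Accuracy alone can be misleading if one class is much more common than the other, so it may help to also look at precision, recall, and F1. The sketch below computes them with NumPy on made-up labels and probabilities. `binary_report` is a hypothetical helper, and the metrics refer to whichever class is encoded as `1`.

```python
import numpy as np

def binary_report(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: precision, recall and F1 for the class encoded as 1."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels and sigmoid outputs, for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6]
report = binary_report(y_true, y_prob)
print(report)
```

With the trained model, `y_prob` would come from `model.predict(X_test_padded).ravel()` and `y_true` from `y_test`.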

