<a href="https://colab.research.google.com/github/PadmajaVB/Sexist-Statement-Detection/blob/main/Final_Preprocess_reddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing to select subtle misogyny from Reddit comments dataset

The Reddit misogynistic comments dataset uses a hierarchical, three-level labeling scheme. First, a binary distinction is made between **Misogynistic content** and **Non-misogynistic content**, which are mutually exclusive. Second, subtypes of Misogynistic and Non-misogynistic content are elaborated. For Misogynistic content, four categories are defined: (i) Misogynistic Pejoratives, (ii) descriptions of Misogynistic Treatment, (iii) acts of Misogynistic Derogation and (iv) Gendered Personal attacks against women. For Non-misogynistic content, three categories are defined: (i) Counter speech against misogyny, (ii) Non-misogynistic personal attacks and (iii) None of the categories. Third, additional flags were included for some of the second-level categories. Within both Misogynistic and Non-misogynistic content, the second-level categories are not mutually exclusive, so an entry can carry multiple labels. For instance, a Misogynistic entry could be assigned labels for both a Pejorative and Treatment.
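
As a rough illustration of the hierarchy described above, the first two levels can be written down as a small lookup table. The label strings follow the values found in the `level_1` and `level_2` columns of `final_labels.csv`; the dictionary layout and the `is_valid_entry` helper are hypothetical and only meant to show that level 1 is exclusive while level 2 allows several labels.

```python
# Label strings follow the level_1 / level_2 values in final_labels.csv.
# The TAXONOMY layout and is_valid_entry helper are illustrative assumptions.
TAXONOMY = {
    "Misogynistic": {
        "Misogynistic_pejorative",
        "Treatment",
        "Derogation",
        "Misogynistic_personal_attack",
    },
    "Nonmisogynistic": {
        "Counter_speech",
        "Nonmisogynistic_personal_attack",
        "None_of_the_categories",
    },
}


def is_valid_entry(level_1, level_2_labels):
    """Return True if every level-2 label belongs to the given level-1 class."""
    allowed = TAXONOMY.get(level_1, set())
    labels = set(level_2_labels)
    # At least one label is required, and none may come from the other class.
    return bool(labels) and labels <= allowed


# Several level-2 labels under one level-1 class are allowed.
print(is_valid_entry("Misogynistic", ["Misogynistic_pejorative", "Treatment"]))  # True
# Mixing a level-2 label from the other class is not.
print(is_valid_entry("Misogynistic", ["Counter_speech"]))  # False
```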

After reading the paper, we identified several implicit and seemingly benign categories of sexism that a classifier trained on extreme forms of hate speech is likely to label as not sexist. Using the label levels, we filtered out the extreme examples, leaving 3787 samples: 267 labeled as sexist and 3520 labeled as not sexist.


**For more information on the dataset, visit https://www.aclweb.org/anthology/2021.eacl-main.114/**

In [None]:
import pandas as pd

In [None]:
# Load only the columns that describe the text and its labels
df_2 = pd.read_csv("final_labels.csv", usecols=['body', 'level_1', 'level_2', 'level_3', 'strength', 'highlight'])

**The level columns give progressively finer-grained category labels, while `strength` indicates whether the abuse is explicit or implicit**

In [None]:
df_2.head() 

Unnamed: 0,body,level_1,level_2,level_3,strength,highlight
0,Do you have the skin of a 80 year old grandma?...,Nonmisogynistic,None_of_the_categories,,,
1,This is taking a grain of truth and extrapolat...,Nonmisogynistic,None_of_the_categories,,,
2,Honestly my favorite thing about this is that ...,Nonmisogynistic,None_of_the_categories,,,
3,Source? Doesnt sound right to me idk,Nonmisogynistic,None_of_the_categories,,,
4,"Damn, I saw a movie in which the old woman bat...",Misogynistic,Derogation,Moral_inferiority,Nature of the abuse is Implicit,old woman bathed in the blood if virgins


**Getting counts of each category in every label column**

In [None]:
df_2.level_1.value_counts()

Nonmisogynistic    5868
Misogynistic        699
Name: level_1, dtype: int64

In [None]:
df_2.level_2.value_counts()

None_of_the_categories             5815
Derogation                          285
Misogynistic_pejorative             276
Treatment                           103
Nonmisogynistic_personal_attack      43
Misogynistic_personal_attack         35
Counter_speech                       10
Name: level_2, dtype: int64

In [None]:
df_2.level_3.value_counts()

Moral_inferiority                               148
Other                                            79
Gender_of_recipient_is_Female                    64
Disrespectful_actions_Seduction_and_conquest     43
Intellectual_inferiority                         32
Sexual_or_physical_limitations                   26
Disrespectful_actions_Controlling                17
Disrespectful_actions_Manipulation               16
Threatening_Physical_violence                    13
Gender_of_recipient_is_Male                      10
Disrespectful_actions_Other                       9
Gender_of_recipient_is_Unknown                    4
Threatening_Sexual_violence                       3
Threatening_Privacy                               2
Name: level_3, dtype: int64

In [None]:
df_2.strength.value_counts()

Nature of the abuse is Implicit    267
Nature of the abuse is Explicit    121
Name: strength, dtype: int64

**Filtering the dataset to retain only benevolent misogyny**

Deleting rows:
1. Downsample the None label in the Non-misogynistic data to get a slightly more balanced dataset.
2. Remove Misogynistic Pejoratives.
3. Remove threatening labels from Misogynistic Treatment.
4. Remove explicit entries from Misogynistic Derogation.
5. Remove personal attacks from both major labels.
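
Steps 2 to 5 above can be sketched as one boolean mask. This is a minimal sketch on a made-up toy frame, not the notebook's own cells: the `keep_subtle` helper is hypothetical, it assumes each cell holds exactly one label string (so exact matching is enough), and, like the filtering cells further down, it drops explicit entries in every category rather than only in Derogation.

```python
import pandas as pd


def keep_subtle(df):
    """Drop pejoratives, threats, explicit abuse and personal attacks (steps 2-5)."""
    drop = (
        df["level_2"].eq("Misogynistic_pejorative")
        | df["level_3"].isin([
            "Threatening_Physical_violence",
            "Threatening_Sexual_violence",
            "Threatening_Privacy",
        ])
        | df["strength"].eq("Nature of the abuse is Explicit")
        | df["level_2"].isin([
            "Misogynistic_personal_attack",
            "Nonmisogynistic_personal_attack",
        ])
    )
    # Missing labels compare as False, so rows without level_3/strength are kept.
    return df[~drop].reset_index(drop=True)


# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "body": ["a", "b", "c", "d", "e"],
    "level_2": ["None_of_the_categories", "Misogynistic_pejorative",
                "Derogation", "Treatment", "Derogation"],
    "level_3": [None, None, "Moral_inferiority",
                "Threatening_Privacy", "Intellectual_inferiority"],
    "strength": [None, None, "Nature of the abuse is Implicit",
                 None, "Nature of the abuse is Explicit"],
})

kept = keep_subtle(toy)
print(kept["body"].tolist())  # ['a', 'c']
```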


**Downsampling the None label in the Non-misogynistic data to get a slightly more balanced dataset.**

In [None]:
# Split rows by whether level_2 is "None_of_the_categories"
is_none = df_2['level_2'].str.contains("None_of_the_categories")
df_downsamp = df_2[~is_none]
# Downsample by position: drop the first 2000 "None" rows and keep the rest
df_replace = df_2[is_none].iloc[2000:]
df_downsamp = pd.concat([df_downsamp, df_replace])
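
The cell above drops the first 2000 `None_of_the_categories` rows by position. If the row order in the CSV is not random, a seeded random sample may be a safer alternative. The sketch below is an assumption about how that could look, not what this notebook does; `downsample_none`, `n_drop` and `seed` are hypothetical names, and the toy frame is invented.

```python
import pandas as pd


def downsample_none(df, n_drop, seed=0):
    """Randomly drop n_drop rows labeled None_of_the_categories."""
    is_none = df["level_2"].eq("None_of_the_categories")
    n_keep = int(is_none.sum()) - n_drop
    # A fixed random_state keeps the sample reproducible across runs.
    kept_none = df[is_none].sample(n=n_keep, random_state=seed)
    return pd.concat([df[~is_none], kept_none]).sort_index()


# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "body": list("abcdef"),
    "level_2": ["None_of_the_categories"] * 4 + ["Derogation", "Treatment"],
})

out = downsample_none(toy, n_drop=2)
print(out["level_2"].value_counts().to_dict())
```

Running this on the real data would select a different subset than the positional slice, so the final sample counts reported at the top would only match if the number of dropped rows stays the same.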

In [None]:
nan=df_downsamp.index[df_downsamp['body'].isna()]
print(nan)

Int64Index([2777, 2976, 4493, 4747, 4794, 5433, 5943, 5975], dtype='int64')


In [None]:
print(df_downsamp.info())
df_downsamp.dropna(subset=['body'], inplace=True)
print(df_downsamp.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4567 entries, 4 to 6541
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   body       4559 non-null   object
 1   level_1    4567 non-null   object
 2   level_2    4567 non-null   object
 3   level_3    466 non-null    object
 4   strength   388 non-null    object
 5   highlight  751 non-null    object
dtypes: object(6)
memory usage: 249.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4559 entries, 4 to 6541
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   body       4559 non-null   object
 1   level_1    4559 non-null   object
 2   level_2    4559 non-null   object
 3   level_3    466 non-null    object
 4   strength   388 non-null    object
 5   highlight  751 non-null    object
dtypes: object(6)
memory usage: 249.3+ KB
None


**Filter out Misogynistic Pejoratives.**

In [None]:
df_downsamp= df_downsamp[~(df_downsamp.iloc[:]['level_2'].str.contains("Misogynistic_pejorative"))]

In [None]:
df_3=df_downsamp.reset_index().drop(['index'], axis=1)

In [None]:
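# Replace missing labels with the string 'nan' so the .str.contains filters below return plain booleans
df_3.fillna(value='nan', inplace=True)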

**Removing explicit/Threatening/Personal Attack Statements.**

In [None]:
df_3= df_3[~(df_3.iloc[:]['strength'].str.contains("Nature of the abuse is Explicit"))]
df_3= df_3[~(df_3.iloc[:]['level_3'].str.contains("Threatening_Sexual_violence"))]
df_3= df_3[~(df_3.iloc[:]['level_3'].str.contains("Threatening_Physical_violence"))]
df_3= df_3[~(df_3.iloc[:]['level_3'].str.contains("Threatening_Privacy"))]
df_3= df_3[~(df_3.iloc[:]['level_2'].str.contains("Nonmisogynistic_personal_attack"))]
df_3= df_3[~(df_3.iloc[:]['level_2'].str.contains("Misogynistic_personal_attack"))]
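
The earlier fill with the string `'nan'` is presumably there so that these string filters do not produce missing values in the mask. As a hedged alternative, pandas string methods accept an `na` argument that treats missing labels as non-matches, and the three threatening labels share a common prefix. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy data (invented for illustration only); None stands for a missing label.
toy = pd.DataFrame({
    "level_3": ["Threatening_Privacy", None,
                "Moral_inferiority", "Threatening_Sexual_violence"],
})

# na=False treats missing labels as "no match", so no placeholder fill is needed.
is_threat = toy["level_3"].str.startswith("Threatening_", na=False)
filtered = toy[~is_threat]
print(filtered.index.tolist())  # [1, 2]
```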

In [None]:
df_3=df_3.reset_index().drop(['index'], axis=1)

**Drop columns that we don't need**

In [None]:
df_3=df_3.drop(['level_2', 'level_3', 'strength', 'highlight'], axis=1)

**Encode Nonmisogynistic as 0 and Misogynistic as 1**

In [None]:
# str.contains is case-sensitive, so "Misogynistic" does not match "Nonmisogynistic"
ind_1 = df_3.index[df_3['level_1'].str.contains("Misogynistic")]
ind_2 = df_3.index[df_3['level_1'].str.contains("Nonmisogynistic")]

In [None]:
# Default every row to 1; Nonmisogynistic rows are reset to 0 in a later cell
df_3['Sexism'] = 1

In [None]:
df_3.head()

Unnamed: 0,body,level_1,Sexism
0,"Damn, I saw a movie in which the old woman bat...",Misogynistic,1
1,I would not say that by women getting rights i...,Nonmisogynistic,1
2,"I'm sorry, does women having rights mean that...",Nonmisogynistic,1
3,> The problem is that they removed the urinals...,Misogynistic,1
4,But using the urinals in front of girls that a...,Misogynistic,1


In [None]:
df_3.loc[ind_2,'Sexism']=0
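
The two-step encoding (set everything to 1, then reset the Nonmisogynistic rows to 0) can also be written as a single exact comparison. This is a minimal sketch on an invented toy frame, offered as an equivalent formulation rather than the notebook's own method.

```python
import pandas as pd

# Toy data (invented for illustration only).
toy = pd.DataFrame({"level_1": ["Misogynistic", "Nonmisogynistic", "Misogynistic"]})

# Exact comparison avoids any substring ambiguity between the two class names.
toy["Sexism"] = toy["level_1"].eq("Misogynistic").astype(int)
print(toy["Sexism"].tolist())  # [1, 0, 1]
```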

In [None]:
df_3.drop(['level_1'], axis=1, inplace=True)

**Saving the preprocessed data**

In [None]:
df_3.head()

Unnamed: 0,body,Sexism
0,"Damn, I saw a movie in which the old woman bat...",1
1,I would not say that by women getting rights i...,0
2,"I'm sorry, does women having rights mean that...",0
3,> The problem is that they removed the urinals...,1
4,But using the urinals in front of girls that a...,1


In [None]:
df_3.to_csv('Reddit (1).csv')
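
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below uses in-memory buffers and an invented toy frame to show the difference that `index=False` makes; whether to drop the index in the real output file is a choice left to the reader.

```python
import io

import pandas as pd

# Toy data (invented for illustration only).
toy = pd.DataFrame({"body": ["x", "y"], "Sexism": [1, 0]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)  # ['Unnamed: 0', 'body', 'Sexism']

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)  # ['body', 'Sexism']
```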