# Basic pre-processing of textual dataset
# For EDA and language model fine-tuning

## This notebook demonstrates basic pre-processing techniques using the `nltk` and `spaCy` libraries

This pre-processing prepares hate speech datasets for the Omdena Aswan Local Chapter project 'Detecting Hateful and Offensive Language using NLP', in which I co-lead the pre-processing and EDA tasks with Vishu Kalier.

Here, we will be using the hatecheck-data (https://github.com/paul-rottger/hatecheck-data) dataset.

## Guidelines

I shall complete the task by strictly following the guidelines outlined for the project. These are as follows.

### File structure

The output of the pre-processing has to be a `.csv` file, organised in 5 columns as follows:

| corpus_name | raw_sentence | label | clean_sentence_training | clean_sentence_EDA |
|---|---|---|---|---|
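
As a minimal sketch of the target layout, the snippet below builds a two-row frame with the five required columns and serialises it to CSV text. The rows and the `demo-corpus` name are made up for illustration and are not taken from any dataset.

```python
import pandas as pd

# Column names taken from the guidelines above.
COLUMNS = [
    "corpus_name",
    "raw_sentence",
    "label",
    "clean_sentence_training",
    "clean_sentence_EDA",
]

# Two made-up rows, for illustration only.
example = pd.DataFrame(
    [
        ["demo-corpus", "Example sentence 1!", "0", "Example sentence", "Example sentence"],
        ["demo-corpus", "Another example.", "2", "Another example", "Another example"],
    ],
    columns=COLUMNS,
)

# index=False keeps pandas from writing the row index as an extra, unnamed column.
csv_text = example.to_csv(index=False)
print(csv_text.splitlines()[0])
```

Passing a file name as the first argument of `to_csv` would write the same content to disk instead of returning it as a string.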

### Labels

The labels will be:

- '2' for RISKY sentences (e.g., 'hateful' or 'abusive');
- '1' for POTENTIALLY RISKY sentences (e.g., 'offensive');
- '0' for NON RISKY sentences.
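
One way to express this scheme in code is a plain lookup table. The sketch below is only an illustration: `RISK_LABELS` and `to_risk_label` are hypothetical names, and the source label strings used as keys are assumptions that would have to be adapted to each corpus.

```python
# Hypothetical mapping from source annotations to the three project labels.
# The keys are assumed source label names; adapt them to the corpus at hand.
RISK_LABELS = {
    "hateful": "2",
    "abusive": "2",
    "offensive": "1",
    "non-hateful": "0",
}


def to_risk_label(source_label: str) -> str:
    """Return the project label for a source annotation.

    Raises KeyError for an unknown annotation, so unexpected labels
    are noticed instead of being silently mislabelled.
    """
    return RISK_LABELS[source_label.strip().lower()]


print(to_risk_label("Offensive"))
```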

### Pre-processing

The pre-processing for the 'clean_sentence_training' column involves the following steps:

1- delete all rows containing null values;

2- stopwords removal;

3- removal of digits and words containing digits, punctuation and special characters, extra spaces, links;

4- lemmatization (NO stemming, as it can yield undesirable results).
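
Step 3 can be sketched with regular expressions from the standard library. This is one possible reading of the guideline, not the project's reference implementation: `strip_noise` is a hypothetical helper, "punctuation and special characters" is assumed to mean ASCII punctuation, and removed characters are replaced by a space so that neighbouring words do not merge.

```python
import re
import string

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # links
HAS_DIGIT_RE = re.compile(r"\S*\d\S*")           # digits and tokens containing digits
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")  # ASCII punctuation only
SPACES_RE = re.compile(r"\s+")                   # runs of whitespace


def strip_noise(text: str) -> str:
    """Remove links, digit-bearing tokens and ASCII punctuation, then collapse spaces.

    The case of the remaining words is left untouched.
    """
    text = URL_RE.sub(" ", text)        # links first, while they are still intact
    text = HAS_DIGIT_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()


print(strip_noise("Check https://example.com NOW!!  I own 2 c4ts and 3 dogs."))
```

Links are removed before punctuation on purpose: stripping punctuation first would break a URL into fragments that no longer match the link pattern.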

The case of the sentences shouldn't be changed, as the distinction between lowercase and uppercase can be meaningful in this context. The pre-processing for the 'clean_sentence_EDA' column involves all the steps above, plus the deletion of all emojis (if present).
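
The extra emoji step for the EDA column could look like the sketch below. The Unicode ranges are an assumption: they cover the most common emoji blocks but are not exhaustive, and they also catch some non-emoji symbols. `strip_emojis` is a hypothetical helper name.

```python
import re

# Assumed set of Unicode ranges covering common emoji; not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and later emoji blocks
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)


def strip_emojis(text: str) -> str:
    """Replace emoji runs with a space, then collapse the leftover whitespace."""
    without_emojis = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_emojis).strip()


print(strip_emojis("So angry 😡😡 today"))
```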

In [1]:
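# import necessary libraries
# don't forget to install any library that you don't have in your workspace yet
# make sure that the environment you install your libraries into is the same one you use to run your code
# NLTK also needs its data: run nltk.download('stopwords') and nltk.download('punkt') once
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')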

import numpy as np
import pandas as pd
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob

In [2]:
# import the spaCy standard model for English
# use the command 'python -m spacy download en_core_web_lg' if the model isn't already in your machine
# note that downloading the model can take some time
# refer to the spaCy documentation if you're working on languages other than English

nlp = spacy.load("en_core_web_lg")

In [3]:
# retrieve the chosen dataset
# display 10 random rows of the chosen dataset
# the dataset shape is displayed in the next cell
# beware: this notebook and the dataset you want to work on HAVE TO BE in the same directory

data = pd.read_csv("all_annotations.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,functionality,case_id,templ_id,test_case,label_gold,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,label_9,label_10,count_label_h,count_label_nh,label_annot_maj
1475,1475,ref_subs_clause_h,1476,376,"If immigrants like you refuse to fit in, they ...",hateful,,,hateful,hateful,hateful,hateful,hateful,,,,5,0,hateful
2013,2013,phrase_question_h,2014,453,What if we just killed all Muslims?,hateful,,hateful,,hateful,,hateful,,hateful,,hateful,5,0,hateful
2778,2778,counter_ref_nh,2779,558,Do you not realise how hateful you are being w...,non-hateful,,,non-hateful,non-hateful,non-hateful,non-hateful,non-hateful,,,,0,5,non-hateful
1306,1306,profanity_nh,1307,297,"Motherfucker, how is that even possible?",non-hateful,,,,,non-hateful,non-hateful,non-hateful,non-hateful,non-hateful,,0,5,non-hateful
1397,1397,ref_subs_clause_h,1398,365,I don't have anything against Muslims but most...,hateful,hateful,hateful,hateful,hateful,,,,,,hateful,5,0,hateful
2309,2309,ident_neutral_nh,2310,496,We have never had a female president.,non-hateful,non-hateful,,,non-hateful,non-hateful,,non-hateful,non-hateful,,,0,5,non-hateful
1644,1644,ref_subs_sent_h,1645,401,We're talking about women here. They can't hel...,hateful,,,,,,hateful,hateful,hateful,hateful,hateful,5,0,hateful
1532,1532,ref_subs_sent_h,1533,385,That's typical of women. They're absolutely de...,hateful,hateful,hateful,,hateful,hateful,,,hateful,,,5,0,hateful
452,452,derog_impl_h,453,65,Disabled people should be dumped in a landfill.,hateful,hateful,hateful,hateful,,,,,,hateful,hateful,5,0,hateful
3435,3435,spell_space_del_h,3436,814,You are nothing but common disabledtrash.,hateful,,hateful,hateful,,,hateful,,,hateful,hateful,5,0,hateful


In [4]:
# check number of columns and rows
# we will need this later to create our new file for clean data

data.shape

(3901, 19)

In [5]:
# now we create our new file
# as per guidelines, the first column will have to contain the name of the dataset
# we repeat the name once for every row in the original dataset

dataset_name = ['hatecheck-data'] * len(data)

In [6]:
# now we create a dataframe using pandas
# the dataframe names its first column 'corpus_name' and fills it with dataset_name as defined above

new_dataset = pd.DataFrame(dataset_name, columns=['corpus_name'])

In [7]:
# declare the remaining 4 columns
# assign a value to the 4 columns 
# one column is assigned to the sentence column (test_case) from the original file
# three columns are left empty because they will host our pre-processed sentences

new_dataset['raw_sentence'] = data['test_case']
new_dataset['label'] = ""
new_dataset['clean_sentence_training'] = ""
new_dataset['clean_sentence_EDA'] = ""

In [8]:
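# the guidelines ask us to delete rows containing null values; this dataset has none, so nothing is deleted
# at this point, the dataframe only contains values for the dataset name (first column) and the sentences (second column)
# we can double check our progress by displaying the dataframe
# note: `.head` without parentheses shows the whole frame; `new_dataset.head()` would show only the first 5 rows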

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.          
1     hatecheck-data                        I hate trans people.          
2     hatecheck-data                          I hate gay people.          
3     hatecheck-data                        I hate black people.          
4     hatecheck-data                     I hate disabled people.          
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.          
3897  hatecheck-data     Black people are all complete 4ssholes.          
3898  hatecheck-data  Disabled people are all complete 4ssholes.          
3899  hatecheck-data          Muslims are all complete 4ssholes.          
3900  hatecheck-data       Immigrants are all complete 4ssholes.          

     clean_sentence_training clean_sentence_EDA  
0                  
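
The absence of null values can be verified rather than assumed. The sketch below uses a made-up three-row frame (the name `toy` and its values are illustrative only) to show how missing values might be counted and, if present, dropped.

```python
import pandas as pd

# Toy frame with one missing sentence; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "raw_sentence": ["first sentence", None, "third sentence"],
        "label": ["0", "2", "2"],
    }
)

# Count missing values per column before deciding whether anything must be dropped.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with any null value and renumber the index,
# so that loops over range(len(...)) stay aligned with the row labels.
cleaned = toy.dropna().reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

Resetting the index matters here because the later cells address rows by position `i`; without it, a dropped row would leave a gap in the labels.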

In [9]:
# let us now populate the third column, the one dedicated to labels
# the original dataset has two labels: hateful, non-hateful (we use the gold standard here)
# here, hateful=2, non-hateful=0

label_map = {'non-hateful': "0", 'hateful': "2"}

# map each row's gold label individually
# (assigning new_dataset['label'] = "2" inside a loop would overwrite the whole column on every iteration)
new_dataset['label'] = data['label_gold'].map(label_map)

# gold labels missing from label_map become NaN and are not counted
label_count = int(new_dataset['label'].notna().sum())

print(f'Relabeling complete, {label_count} labels converted.')

Relabeling complete, 3901 labels converted.
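
A count that matches the number of rows does not by itself confirm that each row received the right label. As a hedged sketch, a cross-tabulation of source labels against converted labels makes the correspondence visible; `gold` and `mapped` below are toy stand-ins for `data['label_gold']` and `new_dataset['label']`, with made-up values.

```python
import pandas as pd

# Toy stand-ins for the source labels and the converted labels.
gold = pd.Series(["hateful", "non-hateful", "hateful", "non-hateful", "hateful"])
mapped = gold.map({"hateful": "2", "non-hateful": "0"})

# Each source label should land in exactly one target label,
# so every row of the table should have a single non-zero cell.
check = pd.crosstab(gold, mapped)
print(check)

# The per-label totals of the converted column should match the source column.
print(mapped.value_counts())
```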


In [10]:
# the number of relabelled sentences equals the number of all sentences
# we can therefore proceed, after checking our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

     clean_sentence_training clean_sentence_EDA  
0                  

In [11]:
# tokenize all sentences and remove stop words
# we have to set the language to English

stop_words = set(stopwords.words("english"))

# iterate through the entire column, one sentence at a time
for i in range(len(new_dataset)):
    # tokenize the sentence
    words = word_tokenize(new_dataset.at[i, 'raw_sentence'])

    # keep only the tokens that are not stopwords
    # note: the NLTK list is lowercase and this test is case-sensitive,
    # so capitalised stopwords (e.g. 'I') are kept
    kept = [word for word in words if word not in stop_words]

    # write with .at to avoid chained assignment, which may fail to update the dataframe
    new_dataset.at[i, 'clean_sentence_training'] = " ".join(kept)

In [12]:
# let us double check our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

                    clean_sentence_training clean_sentence_EDA  
0   
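
NLTK's English stopword list is lowercase, so a case-sensitive membership test leaves capitalised stopwords in place. If that is not wanted, one option is to compare the lowercase form of each token while keeping the original token in the output, which still respects the rule that case must not be changed. The sketch below uses a hypothetical miniature stopword set and plain `str.split` instead of NLTK, purely to stay self-contained.

```python
# Hypothetical miniature stopword list; the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"i", "the", "are", "all", "a", "is"}


def drop_stopwords(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stopword.

    Only the comparison is lowercased; surviving tokens keep their original case.
    """
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "The cats are ALL asleep".split()
print(" ".join(drop_stopwords(tokens, TOY_STOP_WORDS)))  # prints "cats asleep"
```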

In [13]:
# now that stopwords have been removed (notice the absence of verbs above), let us lemmatize what remains
# use command 'python -m textblob.download_corpora' to download all necessary corpora

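for i in range(len(new_dataset)):
    # take each sentence with its stopwords already removed
    sentence = new_dataset.at[i, 'clean_sentence_training']
    blob = TextBlob(sentence)
    # lemmatize every word and join the lemmas back into one string
    # (TextBlob's .words drops punctuation; lemmatize() treats each word as a noun by default)
    lemmatized = " ".join([word.lemmatize() for word in blob.words])
    # store the result in the dataframe; rebinding the local variable alone would discard it
    new_dataset.at[i, 'clean_sentence_training'] = lemmatized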

In [14]:
# let us check our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

                    clean_sentence_training clean_sentence_EDA  
0   