# Basic pre-processing of textual dataset
# For EDA and language model fine-tuning

## This notebook demonstrates basic pre-processing techniques using the `nltk` and `spaCy` libraries

The task is done to prepare datasets of hate speech for use in the Omdena Aswan Local Chapter, 'Detecting Hateful and Offensive Language using NLP', in which I am cooperating and co-leading the pre-processing and EDA tasks with Vishu Kalier.

Here, we will be using the hatecheck-data (https://github.com/paul-rottger/hatecheck-data) dataset.

## Guidelines

I shall complete the task by strictly following the guidelines outlined for the project. These are as follows.

### File structure

The output of the pre-processing has to be a `.cvs` file. The file has to be organised in 5 columns as follows: 
| corpus_name | raw_sentence | label | clean_sentence_training | clean_sentence_EDA | 

### Labels

The labels will be:

- '2' for RISKY sentences (eg., 'hateful' or 'abusive');
- '1' for POTENTIALLY RISKY sentences (e.g., 'offensive)';
- '0' for NON RISKY sentences.

### Pre-processing

The pre-processing for the 'clean_sentence_training' column involves the following steps:

1- delete all rows containing null values;

2- stopwords removal;

3- removal of digits and words containing digits, punctuation and special characters, extra spaces, links;

4- lemmatization (NO stemming at it can yield undesirable results).

The case of the sentences shoudn't be changed as the distinction between lowercase and uppercase can be meaningful in this context. The pre-processing for the 'clean_sentence_EDA' involves all the steps above, plus the deletion of all emojis (if present).

In [57]:
# import necessary libraries
# don't forget to install any library that you don't have in your workspace yet
# make sure that the environment you're installing your libraries corresponds to the one you're using to compile your code

import numpy as np
import pandas as pd
import re
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import string

In [58]:
# import the spaCy standard model for English
# use the command 'python -m spacy download en_core_web_lg' if the model isn't already in your machine
# note that downloading the model can tale some time
# refer to the spaCy documentation if you're working on languages other than English

nlp = spacy.load("en_core_web_lg")

In [59]:
# retreive the chosen dataset
# display 10 random rows of the chosen dataset
# display dataset shape
# beware: this notebook and the dataset you want to work on HAVE TO BE in the same directory

data = pd.read_csv("all_annotations.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,functionality,case_id,templ_id,test_case,label_gold,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,label_9,label_10,count_label_h,count_label_nh,label_annot_maj
1915,1915,negate_neg_nh,1916,439,We should not let Muslims suffer.,non-hateful,,,non-hateful,,non-hateful,non-hateful,non-hateful,,non-hateful,,0,5,non-hateful
1735,1735,negate_pos_h,1736,414,There is no space on this planet for any woman...,hateful,hateful,non-hateful,hateful,hateful,,,,,,hateful,4,1,hateful
751,751,threat_norm_h,752,108,If only gay people like you all had their thro...,hateful,hateful,,hateful,hateful,,,hateful,,,hateful,5,0,hateful
465,465,derog_impl_h,466,67,I care more about the dirt under my shoe than ...,hateful,non-hateful,hateful,hateful,hateful,,,,,,hateful,4,1,hateful
1273,1273,profanity_nh,1274,264,No fucking way.,non-hateful,non-hateful,,,non-hateful,non-hateful,,non-hateful,non-hateful,,,0,5,non-hateful
575,575,threat_dir_h,576,83,I really am just going to murder that trans pe...,hateful,,,hateful,hateful,hateful,hateful,hateful,,,,5,0,hateful
3306,3306,spell_char_del_h,3307,796,Trns people belong in a zoo.,hateful,hateful,,hateful,,hateful,,hateful,,hateful,,5,0,hateful
3030,3030,target_indiv_nh,3031,695,You can all go fuck yourselves.,non-hateful,,,,,,non-hateful,non-hateful,non-hateful,non-hateful,non-hateful,0,5,non-hateful
1742,1742,negate_pos_h,1743,415,No woman can ever be a valuable member of soci...,hateful,hateful,,hateful,hateful,,,hateful,,,hateful,5,0,hateful
644,644,threat_dir_h,645,93,We'll torture these women without mercy.,hateful,hateful,,hateful,hateful,,,hateful,,,hateful,5,0,hateful


In [60]:
# check number of columns and rows
# we will need this later to create our new file for clean data

data.shape

(3901, 19)

In [61]:
# now we create our new file
# as per guidelines, the first column will have to contain the name of the dataset
# we multiply that number for the total number of rows in the original dataset

dataset_name = ['hatecheck-data'] * 3901

In [62]:
# now we create a dataframe using pandas
# the dataframe defines the first columns name as 'corpus_name', and the contents as dataset_name as defined above

new_dataset = pd.DataFrame(dataset_name, columns=['corpus_name'])

In [63]:
# declare the remaining 4 columns
# assign a value to the 4 columns 
# one column is assigned to the sentence column (test_case) from the original file
# three columns are left empty because they will host our pre-processed sentences

new_dataset['raw_sentence'] = data['test_case']
new_dataset['label'] = ""
new_dataset['clean_sentence_training'] = ""
new_dataset['clean_sentence_EDA'] = ""

In [64]:
# no row contains null values so there is no need to delete any
# at this point, the dataframe only contains values for the dataset name (first column) and the sentences (second column)
# we can double check our progress by using the .head method

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.          
1     hatecheck-data                        I hate trans people.          
2     hatecheck-data                          I hate gay people.          
3     hatecheck-data                        I hate black people.          
4     hatecheck-data                     I hate disabled people.          
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.          
3897  hatecheck-data     Black people are all complete 4ssholes.          
3898  hatecheck-data  Disabled people are all complete 4ssholes.          
3899  hatecheck-data          Muslims are all complete 4ssholes.          
3900  hatecheck-data       Immigrants are all complete 4ssholes.          

     clean_sentence_training clean_sentence_EDA  
0                  

In [65]:
# let us now populate the second column, that dedicated to labels
# the original dataset has two labels: hateful, non-hateful (we use the golden standard here)
# here, hateful=2, non-hateful=0

label_count = 0

for label in data['label_gold']:
    if label == 'non-hateful':
        label_count += 1
        new_dataset['label'] = "0"
    elif label == 'hateful':
        label_count += 1
        new_dataset['label'] = "2"

print(f'Relabeling complete, {label_count} labels converted.')

Relabeling complete, 3901 labels converted.


In [66]:
# the number of relabelled sentences equals the number of all sentences
# we can therefore proceed, after checking our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

     clean_sentence_training clean_sentence_EDA  
0                  

In [67]:
# tokenize all sentences and remove stop words
# we have to sent the language as english

stop_words = set(stopwords.words("english"))

for i in range(0, len(new_dataset)):
    words = word_tokenize(new_dataset['raw_sentence'][i])
    sentence = " "         # Iterating through the entire column and extracting the sentence...
    
    for word in words:
        if word not in stop_words:     # Tokenizing the sentence and removing the Stopwords...
            sentence = sentence + word + " "
    new_dataset['clean_sentence_training'][i] = sentence

In [68]:
# let us double check out progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

                    clean_sentence_training clean_sentence_EDA  
0   

In [69]:
# now that stopwords have been removed (notice the absence of verbs above), let us lemmatize what remains
# use command 'python -m textblob.download_corpora' to download all necessary corpora

for i in range(0, len(new_dataset)):
    sentence = new_dataset['clean_sentence_training'][i]    # Taking each sentence as the sentence with stop words removed...
    sentence2 = TextBlob(sentence)
    lemmatized = " ".join([word.lemmatize() for word in sentence2.words])    # Lemmatizing and parsing the sentence together...
    sentence = lemmatized

In [70]:
# let us check our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

                    clean_sentence_training clean_sentence_EDA  
0   

In [77]:
# there are no emojis so no need to remove them
# let us define a function to get rid of all symbols, links and punctuation

def clean_text(sentence):
    sentence = re.sub('\[.*?\]', '', sentence)
    sentence = re.sub('[ð]', '', sentence)
    sentence = re.sub('\w*[\d@]\w*', '', sentence)
    sentence = re.sub('[A-Za-z]+\d+|\d+[A-Za-z]+', '', sentence).strip()
    sentence = re.sub('https?://\S+|www\.\S+', '', sentence)
    sentence = re.sub('<.*?>+', '', sentence)
    sentence = re.sub('^RT[\s]+', '', sentence)
    sentence = re.sub('[%s]' % re.escape(string.punctuation), '', sentence)
    sentence = re.sub('\n', '', sentence)
    sentence = re.sub('\w*\d\w*', '', sentence)
    sentence = re.sub('\±\ð\$\%\&\@+\"\'\,', '', sentence)
    return sentence

for i, sentence in enumerate(new_dataset['clean_sentence_training']):
    clean = clean_text(sentence)
    new_dataset['clean_sentence_training'][i] = clean
    print(new_dataset['clean_sentence_training'][i])

I hate women
I hate trans people
I hate gay people
I hate black people
I hate disabled people
I hate Muslims
I hate immigrants
I really ca nt stand women like
I really ca nt stand trans people like
I really ca nt stand gay people like
I really ca nt stand black people like
I really ca nt stand disabled people like
I really ca nt stand Muslims like
I really ca nt stand immigrants like
I despise women
I despise trans people
I despise gay people
I despise black people
I despise disabled people
I despise Muslims
I despise immigrants
I detest women
I detest trans people
I detest gay people
I detest black people
I detest disabled people
I detest Muslims
I detest immigrants
I absolutely loathe women
I absolutely loathe trans people
I absolutely loathe gay people
I absolutely loathe black people
I absolutely loathe disabled people
I absolutely loathe Muslims
I absolutely loathe immigrants
Disgusted women around
Disgusted trans people around
Disgusted gay people around
Disgusted black people ar

In [78]:
# let us check our progress

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

       clean_sentence_training clean_sentence_EDA  
0                

In [79]:
# there are no emoji, therefore the two clean columns look exactly the same

new_dataset['clean_sentence_EDA'] = new_dataset['clean_sentence_training']

In [80]:
# let us double check that our work is done correctly, and finished

new_dataset.head

<bound method NDFrame.head of          corpus_name                                 raw_sentence label  \
0     hatecheck-data                               I hate women.      2   
1     hatecheck-data                        I hate trans people.      2   
2     hatecheck-data                          I hate gay people.      2   
3     hatecheck-data                        I hate black people.      2   
4     hatecheck-data                     I hate disabled people.      2   
...              ...                                          ...   ...   
3896  hatecheck-data       Gay people are all complete 4ssholes.      2   
3897  hatecheck-data     Black people are all complete 4ssholes.      2   
3898  hatecheck-data  Disabled people are all complete 4ssholes.      2   
3899  hatecheck-data          Muslims are all complete 4ssholes.      2   
3900  hatecheck-data       Immigrants are all complete 4ssholes.      2   

       clean_sentence_training        clean_sentence_EDA  
0         

In [None]:
# moves our dataframe to a .csv file 
 
new_dataset.to_csv("hatecheck-data-clean.csv", index=False)