# Basic pre-processing of a textual dataset
# For EDA and language model fine-tuning

## This notebook demonstrates basic pre-processing techniques using the `nltk` and `spaCy` libraries

This work prepares hate-speech datasets for the Omdena Aswan Local Chapter project 'Detecting Hateful and Offensive Language using NLP', in which I co-lead the pre-processing and EDA tasks with Vishu Kalier.

Here, we will be using the hatecheck-data (https://github.com/paul-rottger/hatecheck-data) dataset.

## Guidelines

I shall complete the task by strictly following the guidelines outlined for the project. These are as follows.

### File structure

The output of the pre-processing has to be a `.csv` file organised in 5 columns, as follows:

| corpus_name | raw_sentence | label | clean_sentence_training | clean_sentence_EDA |
|---|---|---|---|---|

### Labels

The labels will be:

- '2' for RISKY sentences (e.g., 'hateful' or 'abusive');
- '1' for POTENTIALLY RISKY sentences (e.g., 'offensive');
- '0' for NON RISKY sentences.
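
As a minimal sketch of this scheme, the mapping can be written as a plain dictionary. The names `LABEL_MAP` and `to_risk_label` are hypothetical and not part of the project guidelines. Only the two labels found in hatecheck-data (`hateful` and `non-hateful`) are mapped, so label 1 is never produced for this dataset.

```python
# Hypothetical mapping from the hatecheck-data gold labels to the project labels.
# The dataset has only two classes, so '1' (POTENTIALLY RISKY) is never assigned.
LABEL_MAP = {"hateful": 2, "non-hateful": 0}


def to_risk_label(gold_label: str) -> int:
    """Return the project label for a hatecheck-data gold label."""
    try:
        return LABEL_MAP[gold_label]
    except KeyError:
        # Fail loudly instead of silently writing a wrong label
        raise ValueError(f"unexpected label: {gold_label!r}") from None


print([to_risk_label(label) for label in ["hateful", "non-hateful"]])
```

Raising an error on unknown labels is a design choice: it makes an unexpected class in another corpus visible immediately.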

### Pre-processing

The pre-processing for the 'clean_sentence_training' column involves the following steps:

1. delete all rows containing null values;

2. stopwords removal;

3. removal of digits and words containing digits, punctuation and special characters, extra spaces, links;

4. lemmatization (NO stemming, as it can yield undesirable results).

The case of the sentences shouldn't be changed, as the distinction between lowercase and uppercase can be meaningful in this context. The pre-processing for the 'clean_sentence_EDA' column involves all the steps above, plus the deletion of all emojis (if present).
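
Step 3 and the emoji rule can be sketched with the standard library alone. This is an illustrative sketch, not the project's reference implementation: the function name `basic_clean`, the regular expressions, and the emoji ranges are all assumptions, and the emoji ranges are only a rough approximation. Stopword removal and lemmatization (steps 2 and 4) are not covered here.

```python
import re
import string

# All patterns below are assumptions made for illustration.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")       # links
HAS_DIGIT_RE = re.compile(r"\S*\d\S*")                # digits and words containing digits
PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only
# Rough emoji ranges; not an exhaustive list
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def basic_clean(text: str, drop_emojis: bool = False) -> str:
    """Remove links, digit-bearing words, punctuation and extra spaces.

    The case of the text is left untouched. Emojis are removed only when
    drop_emojis is True (the 'clean_sentence_EDA' variant).
    """
    text = URL_RE.sub(" ", text)
    text = HAS_DIGIT_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    if drop_emojis:
        text = EMOJI_RE.sub(" ", text)
    # Collapse any run of whitespace into a single space
    return " ".join(text.split())


sample = "Check https://example.com NOW!!  h3llo and   2 cats 😀"
print(basic_clean(sample))                    # Check NOW and cats 😀
print(basic_clean(sample, drop_emojis=True))  # Check NOW and cats
```

Punctuation is deleted rather than replaced by a space, so `can't` becomes `cant`; swapping the translation table for one that maps to spaces would give `can t` instead.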

In [29]:
# import necessary libraries
# don't forget to install any library that you don't have in your workspace yet
# make sure that the environment you install your libraries into is the one you use to run your code
# note that the downloads from nltk might take a while

import numpy as np
import pandas as pd
import spacy
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/caterinabonan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [30]:
# load the standard spaCy model for English
# use the command 'python -m spacy download en_core_web_lg' if the model isn't already on your machine
# note that downloading the model can take some time
# refer to the spaCy documentation if you're working on languages other than English

nlp = spacy.load("en_core_web_lg")

In [31]:
# retrieve the chosen dataset
# display 10 random rows of the chosen dataset
# beware: this notebook and the dataset you want to work on HAVE TO BE in the same directory

data = pd.read_csv("all_annotations.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,functionality,case_id,templ_id,test_case,label_gold,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,label_9,label_10,count_label_h,count_label_nh,label_annot_maj
2257,2257,ident_neutral_nh,2258,488,Are there any disabled people on your course?,non-hateful,non-hateful,,,non-hateful,non-hateful,,non-hateful,non-hateful,,,0,5,non-hateful
2522,2522,ident_pos_nh,2523,526,I'll cherish any black person like you.,non-hateful,,,,,,non-hateful,non-hateful,non-hateful,non-hateful,non-hateful,0,5,non-hateful
1327,1327,profanity_nh,1328,318,He gave me some bullshit excuse.,non-hateful,non-hateful,,non-hateful,non-hateful,,,non-hateful,,,non-hateful,0,5,non-hateful
3431,3431,spell_space_del_h,3432,814,You are nothing but common femaletrash.,hateful,,,,,,hateful,hateful,hateful,hateful,hateful,5,0,hateful
1202,1202,profanity_h,1203,252,Typical of a trans person to be such a total s...,hateful,,,,,,hateful,hateful,hateful,hateful,hateful,5,0,hateful
384,384,derog_dehum_h,385,55,Immigrants are just a waste of space.,hateful,,hateful,hateful,hateful,,hateful,,,,hateful,5,0,hateful
3236,3236,spell_char_swap_h,3237,786,"If the idea comes from a trans person, it must...",hateful,hateful,hateful,hateful,hateful,,,,,,hateful,5,0,hateful
988,988,slur_h,989,129,Shemales just can't help themselves.,hateful,hateful,,hateful,hateful,hateful,,hateful,,,,5,0,hateful
1296,1296,profanity_nh,1297,287,The theatre usher just told me to shut the fuc...,non-hateful,non-hateful,non-hateful,,non-hateful,non-hateful,,,non-hateful,,,0,5,non-hateful
3742,3742,spell_leet_h,3743,849,W0men are the absolute worst.,hateful,,,,,,hateful,hateful,hateful,non-hateful,hateful,4,1,hateful


In [32]:
# check number of columns and rows
# we will need this later to create our new file for clean data

data.shape

(3901, 19)

In [33]:
# now we start building our new dataset
# as per guidelines, the first column has to contain the name of the dataset
# we repeat that name once for every row of the original dataset (3901 rows)

dataset_name = ['hatecheck-data'] * len(data)

In [34]:
# now we create a dataframe using pandas
# the first column is named 'corpus_name' and is filled with dataset_name, as defined above

new_dataset = pd.DataFrame(dataset_name, columns=['corpus_name'])

In [35]:
# declare the remaining 4 columns and give each an initial value
# 'raw_sentence' takes the sentence column (test_case) from the original file
# the other three columns are left empty for now: they will host the labels and our pre-processed sentences

new_dataset['raw_sentence'] = data['test_case']
new_dataset['label'] = ""
new_dataset['clean_sentence_training'] = ""
new_dataset['clean_sentence_EDA'] = ""
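
The three cells above could also be condensed into a single constructor call. The sketch below is only an alternative formulation, assuming pandas broadcasts scalar values to every row when at least one column is a Series. The small `data` frame is a stand-in for `all_annotations.csv`, with placeholder sentences, so that the example runs on its own.

```python
import pandas as pd

# Stand-in for the original file, with placeholder sentences (hypothetical data)
data = pd.DataFrame({"test_case": ["first example sentence", "second example sentence"]})

new_dataset = pd.DataFrame({
    "corpus_name": "hatecheck-data",      # scalar, repeated for every row
    "raw_sentence": data["test_case"],    # sets the number of rows
    "label": "",                          # filled in later
    "clean_sentence_training": "",        # filled in later
    "clean_sentence_EDA": "",             # filled in later
})

print(new_dataset.columns.tolist())
print(new_dataset.shape)
```

This avoids hard-coding the number of rows, which would otherwise have to be updated for every new corpus.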

In [36]:
# no row contains null values so there is no need to delete any
# at this point, only the first two columns (corpus_name and raw_sentence) contain values
# we can double check our progress by calling the .head() method

new_dataset.head()

      corpus_name             raw_sentence label clean_sentence_training clean_sentence_EDA
0  hatecheck-data            I hate women.
1  hatecheck-data     I hate trans people.
2  hatecheck-data       I hate gay people.
3  hatecheck-data     I hate black people.
4  hatecheck-data  I hate disabled people.
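
The claim that no row contains null values can be checked explicitly. The sketch below shows one way to do so, and how step 1 of the guidelines would be applied if nulls were present. The `demo` frame is hypothetical data, used only so that the example is self-contained.

```python
import pandas as pd

# Hypothetical frame with one missing sentence
demo = pd.DataFrame({"raw_sentence": ["a sentence", None, "another sentence"]})

# Count the missing values in the sentence column
n_missing = demo["raw_sentence"].isna().sum()
print(n_missing)

# Guideline step 1: drop rows with null values, then renumber the index
demo = demo.dropna(subset=["raw_sentence"]).reset_index(drop=True)
print(len(demo))
```

On the real dataset, the same count on `new_dataset['raw_sentence']` should be zero if the statement above holds.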

In [None]:
# let us now populate the 'label' column
# the original dataset has two labels: hateful, non-hateful
# here, hateful=2, non-hateful=0