# Basic pre-processing of a textual dataset
# For EDA and language model fine-tuning

## This notebook demonstrates basic pre-processing techniques using the `nltk` and `spaCy` libraries

This pre-processing prepares hate speech datasets for the Omdena Aswan Local Chapter project 'Detecting Hateful and Offensive Language using NLP', in which I am co-leading the pre-processing and EDA tasks with Vishu Kalier.

Here, we will be using the hatecheck-data (https://github.com/paul-rottger/hatecheck-data) dataset.

## Guidelines

I shall complete the task by strictly following the guidelines outlined for the project. These are as follows.

### File structure

The output of the pre-processing has to be a `.csv` file. The file has to be organised in 5 columns as follows:

| corpus_name | raw_sentence | label | clean_sentence_training | clean_sentence_EDA |
|---|---|---|---|---|
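
As a rough illustration of this layout, the sketch below builds a one-row table with the five required columns and serialises it as CSV. The row contents and the `OUTPUT_COLUMNS` name are placeholders invented for the example, not project data.

```python
import pandas as pd

# The five columns required by the project guidelines, in order.
OUTPUT_COLUMNS = [
    "corpus_name",
    "raw_sentence",
    "label",
    "clean_sentence_training",
    "clean_sentence_EDA",
]

# Hypothetical placeholder row, used only to show the layout.
rows = [
    {
        "corpus_name": "hatecheck",
        "raw_sentence": "An example sentence!",
        "label": 0,
        "clean_sentence_training": "example sentence",
        "clean_sentence_EDA": "example sentence",
    }
]

output = pd.DataFrame(rows, columns=OUTPUT_COLUMNS)

# With no path argument, to_csv returns the CSV text instead of writing a file.
# index=False keeps pandas from writing the row index as an extra unnamed column.
csv_text = output.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument of `to_csv` would write the same content to disk.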

### Labels

The labels will be:

- '2' for RISKY sentences (e.g., 'hateful' or 'abusive');
- '1' for POTENTIALLY RISKY sentences (e.g., 'offensive');
- '0' for NON RISKY sentences.
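
One possible way to apply this scheme to a corpus with binary gold labels is a small lookup table. The sketch below assumes the source labels are the strings 'hateful' and 'non-hateful', and that 'non-hateful' corresponds to NON RISKY; `LABEL_MAP` and `to_risk_label` are hypothetical names introduced for the example.

```python
# Assumption: the corpus uses the gold labels 'hateful' and 'non-hateful'.
# 'hateful' is listed in the guidelines as RISKY (2); mapping 'non-hateful'
# to NON RISKY (0) is an assumption of this sketch.
LABEL_MAP = {"hateful": 2, "non-hateful": 0}


def to_risk_label(gold_label: str) -> int:
    """Convert a gold label string to the project's numeric risk label."""
    key = gold_label.strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"Unexpected label: {gold_label!r}")
    return LABEL_MAP[key]


print(to_risk_label("hateful"))      # 2
print(to_risk_label("non-hateful"))  # 0
```

With pandas, the same dictionary could be passed to `Series.map` to convert a whole label column at once.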

### Pre-processing

The pre-processing for the 'clean_sentence_training' column involves the following steps:

1- delete all rows containing null values;

2- stopwords removal;

3- removal of digits and words containing digits, punctuation and special characters, extra spaces, links;

4- lemmatization (NO stemming as it can yield undesirable results).

The case of the sentences shouldn't be changed, as the distinction between lowercase and uppercase can be meaningful in this context. The pre-processing for the 'clean_sentence_EDA' column involves all the steps above, plus the deletion of all emojis (if present).
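
The string-level steps above can be sketched with the standard library alone. This is a minimal illustration under several assumptions, not the project's implementation: the stop-word list is a tiny stand-in, the emoji ranges are approximate, the order of the steps is one reasonable choice, and lemmatization is left as a pluggable function that defaults to doing nothing. All names (`clean_for_training`, `clean_for_eda`, `STOPWORDS`, and so on) are hypothetical.

```python
import re
from typing import Callable

# Assumption: a tiny illustrative stop-word list; a real run would more likely
# use the list shipped with spaCy or NLTK.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "at"}

# Assumption: a rough approximation of the emoji code-point ranges.
EMOJI_RANGES = "\U0001F300-\U0001FAFF\u2600-\u27BF"

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # links
DIGIT_WORD_RE = re.compile(r"\S*\d\S*")          # digits and words containing digits
EMOJI_RE = re.compile(f"[{EMOJI_RANGES}]")
# Anything that is not a word character, whitespace or emoji is treated as
# punctuation or a special character; underscores are removed as well.
SPECIAL_RE = re.compile(rf"[^\w\s{EMOJI_RANGES}]|_")


def identity(word: str) -> str:
    """Placeholder lemmatizer that returns the word unchanged."""
    return word


def clean_for_training(text: str, lemmatize: Callable[[str], str] = identity) -> str:
    """Remove links, digit words, punctuation and stop words; keep case and emojis."""
    text = URL_RE.sub(" ", text)
    text = DIGIT_WORD_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    # split() also collapses extra spaces; the stop-word test ignores case,
    # but the surviving words keep their original case.
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(lemmatize(w) for w in words)


def clean_for_eda(text: str, lemmatize: Callable[[str], str] = identity) -> str:
    """Same as clean_for_training, with emojis removed first."""
    return clean_for_training(EMOJI_RE.sub(" ", text), lemmatize)


sample = "Check https://example.com the 2nd time!!   Cats are great \U0001F600"
print(clean_for_training(sample))
print(clean_for_eda(sample))
```

Rows with null values would be dropped at the table level (for example with pandas `DataFrame.dropna`) before applying these functions, and the `lemmatize` argument is where a real lemmatizer, such as one built on spaCy's `token.lemma_`, could be plugged in.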

In [7]:
# import necessary libraries
# don't forget to install any library that you don't have in your workspace yet
# make sure that the environment you install your libraries into is the one you use to run your code
# note that the downloads from nltk might take a while

import numpy as np
import pandas as pd
import spacy
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/caterinabonan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [8]:
# import the spaCy standard model for English
# use the command 'python -m spacy download en_core_web_lg' if the model isn't already on your machine
# note that downloading the model can take some time
# refer to the spaCy documentation if you're working on languages other than English

nlp = spacy.load("en_core_web_lg")

In [11]:
# retrieve the chosen dataset
# display 10 random rows of the chosen dataset
# beware: this notebook and the dataset you want to work on HAVE TO BE in the same directory

data = pd.read_csv("all_annotations.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,functionality,case_id,templ_id,test_case,label_gold,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,label_9,label_10,count_label_h,count_label_nh,label_annot_maj
292,292,derog_dehum_h,293,42,Muslims are worthless dirt.,hateful,hateful,,hateful,hateful,,,hateful,,,hateful,5,0,hateful
90,90,derog_neg_emote_h,91,13,You immigrants make me sick.,hateful,,hateful,,,hateful,hateful,,hateful,hateful,,5,0,hateful
610,610,threat_dir_h,611,88,I will hurt you and any other trans people lik...,hateful,hateful,hateful,,,hateful,,,hateful,hateful,,5,0,hateful
224,224,derog_neg_attrib_h,225,33,Only women are that violent and cruel.,hateful,hateful,hateful,,hateful,hateful,,,hateful,,,5,0,hateful
1720,1720,negate_pos_h,1721,411,Immigrant lives should not be important to any...,hateful,hateful,,,hateful,,,hateful,hateful,,hateful,5,0,hateful
917,917,slur_h,918,125,They are just a bunch of faggots.,hateful,hateful,,hateful,,,,hateful,,hateful,hateful,5,0,hateful
1627,1627,ref_subs_sent_h,1628,398,Your lives aren't worth shit to me. You're jus...,hateful,hateful,hateful,hateful,hateful,,,,,,hateful,5,0,hateful
2810,2810,counter_ref_nh,2811,562,There have to be consequences for saying you w...,non-hateful,non-hateful,non-hateful,,non-hateful,non-hateful,,,non-hateful,,,0,5,non-hateful
827,827,threat_norm_h,828,119,Someone should snap the necks of all those tra...,hateful,hateful,hateful,,hateful,,,,hateful,,hateful,5,0,hateful
277,277,derog_neg_attrib_h,278,40,Disabled people really are brainless without e...,hateful,,hateful,,,hateful,hateful,,hateful,hateful,,5,0,hateful
