# Basic pre-processing of a textual dataset
# For EDA and language model fine-tuning

## This notebook demonstrates basic pre-processing techniques using the `nltk` and `spaCy` libraries

The goal is to prepare hate-speech datasets for the Omdena Aswan Local Chapter project 'Detecting Hateful and Offensive Language using NLP', in which I am taking part and co-leading the pre-processing and EDA tasks with Vishu Kalier.

Here, we will use the [hatecheck-data](https://github.com/paul-rottger/hatecheck-data) dataset.

## Guidelines

I shall complete the task by strictly following the guidelines outlined for the project. These are as follows.

### File structure

The output of the pre-processing has to be a `.csv` file. The file has to be organised in 5 columns as follows:

| corpus_name | raw_sentence | label | clean_sentence_training | clean_sentence_EDA |
| --- | --- | --- | --- | --- |
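
As a rough illustration of this layout, the sketch below builds a one-row table with the five required columns and serialises it as CSV. The row content and the `example-corpus` name are invented for the example, and the output goes to an in-memory buffer rather than to a real file.

```python
import io

import pandas as pd

# Column order required by the guidelines.
COLUMNS = [
    "corpus_name",
    "raw_sentence",
    "label",
    "clean_sentence_training",
    "clean_sentence_EDA",
]

# One made-up row, only to show the layout (not taken from any dataset).
example = pd.DataFrame(
    [["example-corpus", "Cats are great!", 0, "Cats great", "Cats great"]],
    columns=COLUMNS,
)

# Write to an in-memory buffer; pass a file path to to_csv instead to write
# to disk. index=False keeps the pandas row index out of the CSV.
buffer = io.StringIO()
example.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: without it, pandas adds an unnamed index column, and the file ends up with six columns instead of five.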

### Labels

The labels will be:

- '2' for RISKY sentences (e.g., 'hateful' or 'abusive');
- '1' for POTENTIALLY RISKY sentences (e.g., 'offensive');
- '0' for NON RISKY sentences.
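
One possible way to apply this scheme to a dataset with string labels is a plain dictionary lookup. The mapping below is an assumption rather than part of the guidelines: it treats 'hateful' as RISKY and 'non-hateful' as NON RISKY, which suits a two-class dataset where label '1' never occurs. The three sample labels are invented.

```python
import pandas as pd

# Assumed mapping from a binary hateful / non-hateful annotation to the
# project's numeric labels. Label 1 (POTENTIALLY RISKY) is never produced,
# because a two-class source has no 'offensive' category.
LABEL_MAP = {"hateful": 2, "non-hateful": 0}

# Invented sample of gold labels.
gold = pd.Series(["hateful", "non-hateful", "hateful"], name="label_gold")

# Series.map looks each value up in the dictionary; values missing from the
# dictionary would become NaN, which makes unexpected labels easy to spot.
labels = gold.map(LABEL_MAP)
print(labels.tolist())
```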

### Pre-processing

The pre-processing for the 'clean_sentence_training' column involves the following steps:

1- delete all rows containing null values;

2- stopwords removal;

3- removal of digits and words containing digits, punctuation and special characters, extra spaces, links;

4- lemmatization (NO stemming, as it can yield undesirable results).

The case of the sentences shouldn't be changed, as the distinction between lowercase and uppercase can be meaningful in this context. The pre-processing for the 'clean_sentence_EDA' column involves all the steps above, plus the deletion of all emojis (if present).
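
The steps above can be sketched with the standard library alone. This is a minimal illustration under several assumptions, not the project's reference implementation. The stopword list is a tiny made-up sample, the emoji ranges are approximate, and the function names `clean_for_training` and `clean_for_eda` are hypothetical. Step 3 runs before step 2 so that tokens such as `the,` are recognised as stopwords once the punctuation is gone. Lemmatization is left out so the sketch runs without extra downloads; in this notebook it would presumably rely on the spaCy model (for example through `token.lemma_`).

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use the
# stopword list shipped with spaCy or nltk.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "at"}

# Approximate emoji ranges (assumption): common pictographs and symbols.
EMOJI_CLASS = "\U0001F300-\U0001FAFF\u2600-\u27BF"

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
DIGIT_WORD_RE = re.compile(r"\S*\d\S*")
# Anything that is not a letter, whitespace, or emoji, plus the underscore.
PUNCT_RE = re.compile("[^\\w\\s" + EMOJI_CLASS + "]|_")
EMOJI_RE = re.compile("[" + EMOJI_CLASS + "]")


def clean_for_training(text):
    """Remove links, digit-bearing words, punctuation, and stopwords.

    Case is preserved and emojis are kept.
    """
    text = URL_RE.sub(" ", text)         # links
    text = DIGIT_WORD_RE.sub(" ", text)  # digits and words containing digits
    text = PUNCT_RE.sub(" ", text)       # punctuation and special characters
    # Compare in lowercase, but keep the original casing of surviving tokens.
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)              # joining also collapses extra spaces


def clean_for_eda(text):
    """Same as clean_for_training, with emojis removed as well."""
    without_emoji = EMOJI_RE.sub(" ", clean_for_training(text))
    return " ".join(without_emoji.split())


sample = "Check THIS out: the 2nd link is https://example.com 😀 !!"
print(clean_for_training(sample))
print(clean_for_eda(sample))
```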

In [23]:
# import necessary libraries
# don't forget to install any library that you don't have in your workspace yet
# make sure that the environment where you install your libraries is the one you use to run your code
# note that the downloads from nltk might take a while

import numpy as np
import pandas as pd
import spacy
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/caterinabonan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [24]:
# import the spaCy standard model for English
# use the command 'python -m spacy download en_core_web_lg' if the model isn't already on your machine
# note that downloading the model can take some time
# refer to the spaCy documentation if you're working on languages other than English

nlp = spacy.load("en_core_web_lg")

In [25]:
# retrieve the chosen dataset
# display 10 random rows of the chosen dataset
# beware: the relative path below assumes that this notebook and the dataset are in the same directory

data = pd.read_csv("all_annotations.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,functionality,case_id,templ_id,test_case,label_gold,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,label_9,label_10,count_label_h,count_label_nh,label_annot_maj
3127,3127,spell_char_swap_h,3128,770,Disabled people with any power are a thraet to...,hateful,,hateful,,hateful,,hateful,,hateful,,hateful,5,0,hateful
120,120,derog_neg_emote_h,121,18,You and all the other trans people in this cou...,hateful,hateful,,,hateful,,,hateful,hateful,,hateful,5,0,hateful
895,895,slur_h,896,124,Just another common slut.,hateful,,,,,,hateful,hateful,hateful,hateful,hateful,5,0,hateful
2170,2170,phrase_opinion_h,2171,476,"Personally, I think we should just hang all th...",hateful,hateful,,hateful,,hateful,,hateful,,hateful,,5,0,hateful
3246,3246,spell_char_del_h,3247,787,I have so much hatrd for you disabled people.,hateful,,,,,,hateful,hateful,hateful,hateful,hateful,5,0,hateful
1279,1279,profanity_nh,1280,270,Why the fuck can I not just skip this exam?,non-hateful,non-hateful,non-hateful,non-hateful,non-hateful,,,,,,non-hateful,0,5,non-hateful
2863,2863,counter_ref_nh,2864,567,You have to stop calling people muzzies.,non-hateful,non-hateful,non-hateful,,,non-hateful,,,non-hateful,non-hateful,,0,5,non-hateful
2592,2592,counter_quote_nh,2593,536,"I will not tolerate posts like ""black people a...",non-hateful,non-hateful,non-hateful,,non-hateful,non-hateful,,,non-hateful,,,0,5,non-hateful
441,441,derog_impl_h,442,64,That monkey is more intelligent than all of yo...,hateful,hateful,,hateful,,hateful,,hateful,,hateful,,5,0,hateful
1773,1773,negate_pos_h,1774,419,I would never lift a finger to save a black pe...,hateful,hateful,hateful,,,hateful,,,hateful,hateful,,5,0,hateful


In [26]:
# check number of columns and rows
# we will need this later to create our new file for clean data

data.shape

(3901, 19)

In [27]:
# now we start building our new dataset
# as per guidelines, the first column will have to contain the name of the dataset
# we repeat that name once for each row in the original dataset

dataset_name = ['hatecheck-data'] * data.shape[0]

In [28]:
# now we create a dataframe using pandas
# the dataframe names its first column 'corpus_name' and fills it with dataset_name as defined above

new_dataset = pd.DataFrame(dataset_name, columns=['corpus_name'])

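
A possible next step is to add the `raw_sentence` column while applying the first guideline (deleting rows with null values). The sketch below uses a small stand-in for `data` with invented rows; only the column names `test_case`, `label_gold` and `label_1` come from the output shown earlier, and the choice of which columns to check for nulls is an assumption. Because the per-annotator columns are sparse in that sample, a bare `dropna()` would discard every row, so the check is limited to the columns that are actually kept.

```python
import pandas as pd

# Stand-in for `data` with invented rows; column names follow the sample output.
data = pd.DataFrame({
    "test_case": ["First placeholder sentence.", None, "Third placeholder sentence."],
    "label_gold": ["non-hateful", "hateful", "hateful"],
    "label_1": [None, "hateful", None],  # per-annotator columns contain gaps
})

# Drop nulls only in the columns we keep. A bare data.dropna() would also
# remove rows whose only gaps are in the sparse annotator columns.
kept = data.dropna(subset=["test_case", "label_gold"]).reset_index(drop=True)

new_dataset = pd.DataFrame({
    "corpus_name": ["hatecheck-data"] * len(kept),  # one name per surviving row
    "raw_sentence": kept["test_case"],
})
print(new_dataset.shape)
```

Computing the repeat count from `len(kept)` instead of a hard-coded number keeps the columns aligned if rows are dropped or the input file changes.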