# Preprocessing data

### Functions:

Besides the function for each step, there are 2 pipeline functions:

* dataset_pipeline - Runs the pipeline on a whole dataset (useful for processing text before training/testing)
* text_pipeline - Runs the pipeline on a single text document (useful for processing text before prediction)

### These are all the functions required to preprocess the data
The data pipeline consists of the following steps:
1. Removing rows with null values from the original dataset
    * Only in the dataset_pipeline function
2. Transforming the text to lowercase
3. Collapsing spaces, newlines and tabs into single spaces
4. Removing punctuation
5. Tokenizing the text into a list of words
6. Removing stopwords from the text (using the NLTK English stopwords)
7. Lemmatizing the text (using the WordNet lemmatizer)
8. Resetting the tokens, returning the processed text to a string
9. Saving the dataset to a new file with a given name
    * The saving is only for the dataset_pipeline function
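
The order of the steps can be sketched as a list of functions applied one after another. This is only an illustration under simplifying assumptions: the stand-in steps below use a tiny hard-coded stopword set instead of the NLTK list and skip lemmatization, and the names `TINY_STOPWORDS`, `STEPS` and `run_steps` are hypothetical, not part of this notebook.

```python
import string
from functools import reduce

# Hypothetical stand-in for the NLTK English stopword list (assumption).
TINY_STOPWORDS = {"i", "am", "to", "you", "my"}

def lowercase(text):
    return text.lower()

def collapse_whitespace(text):
    return " ".join(text.split())

def strip_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    return text.split()

def drop_stopwords(tokens):
    return [t for t in tokens if t not in TINY_STOPWORDS]

def join_tokens(tokens):
    return " ".join(tokens)

# Same order as the step list above (lemmatization and saving omitted).
STEPS = [lowercase, collapse_whitespace, strip_punctuation,
         tokenize, drop_stopwords, join_tokens]

def run_steps(text, steps=STEPS):
    # Feed the output of each step into the next one, in order.
    return reduce(lambda value, step: step(value), steps, text)

print(run_steps("Hey,   \n I am going to show you my   awesome text!"))
# prints: hey going show awesome text
```

Keeping the steps in a list makes the order explicit and easy to change, which matters because some steps (such as punctuation removal and stopword removal) influence each other.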



In [16]:
import pandas as pd
import numpy as np
import nltk
import re
from os import path

# Build the path with path.join so it works on any operating system
BASE_DATASET_PATH = path.join("..", "datasets")
TRAIN_DATASET_PATH = path.join(BASE_DATASET_PATH, "train.csv")
TEST_DATASET_PATH = path.join(BASE_DATASET_PATH, "test.csv")
DATA_COLUMNS = ["comment_text"]
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


In [17]:
train_dataset = pd.read_csv(TRAIN_DATASET_PATH)
test_dataset = pd.read_csv(TEST_DATASET_PATH)

In [18]:
def remove_nulls(dataset):
    print("Dropping nulls...")
    print("Dataset size before - "+ str(len(dataset)))
    dataset = dataset.dropna()
    print("Dataset size after - "+ str(len(dataset)))
    return dataset

In [19]:
x_train = train_dataset[DATA_COLUMNS]
y_train = train_dataset[LABEL_COLUMNS]
x_train.head()

Unnamed: 0,comment_text
0,Explanation\nWhy the edits made under my usern...
1,D'aww! He matches this background colour I'm s...
2,"Hey man, I'm really not trying to edit war. It..."
3,"""\nMore\nI can't make any real suggestions on ..."
4,"You, sir, are my hero. Any chance you remember..."


In [20]:
def transform_lowecase(text):
    return text.lower()

In [21]:
def remove_spaces(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so the join already collapses spaces, tabs and newlines
    return ' '.join(text.split())
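
As a quick check of the whitespace step: `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace, so joining the pieces with a single space collapses every run. The sample string below is made up for illustration.

```python
import re

sample = "  Hey,\t\tthis   is\n\na   test  "

# Split on any whitespace run, then rejoin with single spaces.
collapsed = " ".join(sample.split())
print(repr(collapsed))  # prints: 'Hey, this is a test'

# A regex-based equivalent, shown only for comparison.
collapsed_regex = re.sub(r"\s+", " ", sample).strip()
print(collapsed == collapsed_regex)  # prints: True
```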

In [22]:
from nltk.corpus import stopwords
import string

def remove_punctuations(text):
    # Keep every character in the text that is not punctuation.
    # To rebuild the sentence, join the remaining characters without any separator.
    text = "".join([char for char in text if char not in string.punctuation])
    return re.sub(r'\s{2,}'," ",text)
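
An alternative sketch, not the method used in this notebook: `str.translate` with a table built by `str.maketrans` deletes the same characters in a single pass. The name `remove_punctuations_translate` is hypothetical. Two things are worth noticing with either approach: contractions lose their apostrophe ("don't" becomes "dont"), and `string.punctuation` only covers ASCII punctuation, which is consistent with the dash that remains at the end of row 2 in the processed test output of this notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations_translate(text):
    text = text.translate(_PUNCT_TABLE)
    # Removing a punctuation mark between two spaces leaves a double space.
    return re.sub(r"\s{2,}", " ", text)

print(remove_punctuations_translate("don't stop - believing!"))
# prints: dont stop believing

# Non-ASCII punctuation (here U+2014) is not in string.punctuation and survives.
print(remove_punctuations_translate("a \u2014 b") == "a \u2014 b")
# prints: True
```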

In [23]:
def tokenize_text(text):
    tokenized = text.split(' ')
    return [word for word in tokenized if word != ""]
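
The filter on empty strings matters because `split(' ')` with an explicit separator produces an empty string for every extra consecutive space. A small check with a made-up string shows this, and also that the filtered result matches what `split()` without arguments would return:

```python
text = "hey  going   show"

# An explicit separator keeps empty strings between consecutive spaces.
with_empties = text.split(" ")
print(with_empties)  # prints: ['hey', '', 'going', '', '', 'show']

# Filtering them out gives the same tokens as split() with no arguments.
filtered = [word for word in with_empties if word != ""]
print(filtered)              # prints: ['hey', 'going', 'show']
print(filtered == text.split())  # prints: True
```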

In [24]:

# Build the stopword set once; set lookups avoid rescanning a list for every word
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tokenized_text):
    return [word for word in tokenized_text if word not in STOPWORDS]
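
Because punctuation is stripped before this step, contractions have already lost their apostrophe and may no longer match stopword entries that keep it. That would be consistent with tokens such as `dont` and `youll` surviving in the processed output of this notebook. The sketch below illustrates the effect with a tiny made-up stopword set (`DEMO_STOPWORDS` is hypothetical, and it is an assumption that the real list stores contractions with apostrophes); it is not a recommendation to change the pipeline.

```python
import string

# Hypothetical, tiny stopword set; entries keep their apostrophes (assumption).
DEMO_STOPWORDS = {"don't", "you'll", "the"}

def strip_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

def drop_stopwords(tokens):
    return [t for t in tokens if t not in DEMO_STOPWORDS]

text = "don't edit the article"

# Order used in this notebook: punctuation first, then stopwords.
punct_first = drop_stopwords(strip_punctuation(text).split())
print(punct_first)  # prints: ['dont', 'edit', 'article']

# Alternative order: stopwords first, then punctuation.
stop_first = strip_punctuation(" ".join(drop_stopwords(text.split()))).split()
print(stop_first)   # prints: ['edit', 'article']
```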

In [25]:
wn = nltk.WordNetLemmatizer()

def lemmatize_text(tokenized_text):
    return [wn.lemmatize(word) for word in tokenized_text]

In [26]:
def reset_tokens(tokenized_text):
    # Join the tokens back into a single space-separated string
    return " ".join(tokenized_text)

In [30]:
def save_processed_data(filename,before,after):
    before[DATA_COLUMNS] = after
    before = before.reset_index(drop=True)
    before.to_csv(path.join(BASE_DATASET_PATH,filename),index=False)

In [31]:
def dataset_pipeline(dataset,filename):
    dataset = remove_nulls(dataset)
    xs = dataset[DATA_COLUMNS]
    # Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map
    print("Transforming to lowercase...")
    xs = xs.applymap(transform_lowecase)
    print("Removing spaces...")
    xs = xs.applymap(remove_spaces)
    print("Removing punctuations...")
    xs = xs.applymap(remove_punctuations)
    print("Tokenizing text...")
    xs = xs.applymap(tokenize_text)
    print("Removing stopwords...")
    xs = xs.applymap(remove_stopwords)
    print("Lemmatizing text...")
    xs = xs.applymap(lemmatize_text)
    print("Resetting tokens...")
    xs = xs.applymap(reset_tokens)
    print("Saving to new file "+filename)
    # save_processed_data also writes the processed text back into dataset
    save_processed_data(filename,dataset,xs)
    return dataset

dataset_pipeline(train_dataset,"train_processed.csv")
dataset_pipeline(test_dataset,"test_processed.csv")
    


Dropping nulls...
Dataset size before - 159571
Dataset size after - 159571
Transforming to lowercase...
Removing spaces...
Removing punctuations...
Tokenizing text...
Removing stopwords...
Lemmatizing text...
Resetting tokens...
Saving to new file train_processed.csv
Dropping nulls...
Dataset size before - 153164
Dataset size after - 153164
Transforming to lowercase...
Removing spaces...
Removing punctuations...
Tokenizing text...
Removing stopwords...
Lemmatizing text...
Resetting tokens...
Saving to new file test_processed.csv


Unnamed: 0,id,comment_text
0,00001cee341fdb12,yo bitch ja rule succesful youll ever whats ha...
1,0000247867823ef7,rfc title fine imo
2,00013b17ad220c46,source zawe ashton lapland —
3,00017563c3f7919a,look back source information updated correct f...
4,00017695ad8997eb,dont anonymously edit article
...,...,...
153159,fffcd0960ee309b5,totally agree stuff nothing toolongcrap
153160,fffd7a9a6eb32c16,throw field home plate get faster throwing cut...
153161,fffda9e8d6fafa9e,okinotorishima category see change agree corre...
153162,fffe8f1340a79fc2,one founding nation eu germany law return quit...


In [29]:
def text_pipeline(text):
    text = transform_lowecase(text)
    text = remove_spaces(text)
    text = remove_punctuations(text)
    text = tokenize_text(text)
    text = remove_stopwords(text)
    text = lemmatize_text(text)
    text = reset_tokens(text)
    return text

text_pipeline("Hey,   \n I am going to show you my   awesome text!")

'hey going show awesome text'