# Data preprocessing

## Setup

In [1]:
# import the usual suspects / basics
import pandas as pd
import numpy as np
import re
import pickle
import os

# tqdm
from tqdm import tqdm
tqdm.pandas()

# spaCy
import spacy
# download the English spaCy model (only needs to be run once)
#!python -m spacy download en_core_web_sm

# display all dataframe columns (default is 20)
pd.options.display.max_columns = None

# show all data in columns so that full comment is visible
pd.options.display.max_colwidth = None

## Load data

Source: https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification

In [2]:
# load only comments and toxicity scores from CSV
df = pd.read_csv('data/kaggle_original/all_data.csv',
                 usecols=['comment_text', 'toxicity'])

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999516 entries, 0 to 1999515
Data columns (total 2 columns):
 #   Column        Dtype  
---  ------        -----  
 0   comment_text  object 
 1   toxicity      float64
dtypes: float64(1), object(1)
memory usage: 30.5+ MB


## Optional: Create data sample to speed up things for testing & experimenting

TODO: A 'stratified' approach would be nice here, so that the sample has the same degree of imbalance as the full dataset.

In [4]:
sample_size = None

# comment out the next line to use the full dataset
sample_size = 10_000

if sample_size is not None:
    df = df.sample(sample_size, random_state=42)
    print(f'Using data sample: {df.shape[0]} rows.')
else:
    print(f'Using full data: {df.shape[0]} rows.')

Using data sample: 10000 rows.
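
One possible way to address the TODO above is to sample each class separately with the same fraction, so that the class shares stay the same. The following is only a sketch on toy data: the helper `stratified_sample` and the temporary `is_toxic` column are hypothetical names that are not part of this notebook, and the 0.5 threshold is borrowed from the target definition further down.

```python
import pandas as pd

def stratified_sample(data, label_col, n, random_state=42):
    """Draw roughly n rows while keeping the class shares of label_col."""
    frac = n / len(data)
    # sample the same fraction from every class
    return data.groupby(label_col).sample(frac=frac, random_state=random_state)

# toy data standing in for the real comments: 10 % toxic
toy = pd.DataFrame({
    'comment_text': [f'comment {i}' for i in range(1000)],
    'toxicity': [0.9 if i % 10 == 0 else 0.1 for i in range(1000)],
})

# temporary label, only used for stratification
toy['is_toxic'] = toy['toxicity'] >= 0.5

sample = stratified_sample(toy, 'is_toxic', n=100)
share_full = toy['is_toxic'].mean()
share_sample = sample['is_toxic'].mean()
print(f'Toxic share full: {share_full:.3f}, sample: {share_sample:.3f}')
```

Because the fraction is rounded per class, the resulting sample size can differ from `n` by a few rows on real data.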


## Create binary target column

In [5]:
df['toxic'] = (df['toxicity'] >= 0.5).astype('int')

# drop toxicity scores as they are no longer needed and shouldn't
# be passed into models
df.drop('toxicity', axis='columns', inplace=True)

## Check data for imbalance

In [6]:
# determine number of toxic and non-toxic comments
nontoxic_count = (df['toxic'] == 0).sum()
toxic_count = (df['toxic'] == 1).sum()

# calculate percentages
nontoxic_perc = round((nontoxic_count / len(df)) * 100, 1)
toxic_perc = round((toxic_count / len(df)) * 100, 1)

print(f'Nontoxic comments: {nontoxic_count} ({nontoxic_perc} %)')
print(f'Toxic comments: {toxic_count} ({toxic_perc} %)')

Nontoxic comments: 9159 (91.6 %)
Toxic comments: 841 (8.4 %)


The data is strongly imbalanced (roughly 92 % non-toxic in this sample) → apply resampling to the training data only, at a later stage.
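
As a rough illustration of what such a resampling step could look like, the sketch below randomly undersamples the majority class in the training part only and leaves the test part untouched. It is an assumption about the later stage, not the method actually used there: `undersample_majority` is a hypothetical helper, the data is a toy frame, and the simple positional split merely stands in for a proper train/test split.

```python
import pandas as pd

def undersample_majority(train, label_col='toxic', random_state=42):
    """Randomly reduce every class to the size of the smallest class."""
    minority_size = int(train[label_col].value_counts().min())
    balanced = train.groupby(label_col).sample(n=minority_size,
                                               random_state=random_state)
    # shuffle so that the classes are not grouped together
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# toy data: every 10th comment is toxic
toy = pd.DataFrame({
    'comment_text': [f'comment {i}' for i in range(100)],
    'toxic': [1 if i % 10 == 0 else 0 for i in range(100)],
})

# stand-in for a proper train/test split
train, test = toy.iloc[:80], toy.iloc[80:]

# resample the training data only; the test data keeps its imbalance
train_balanced = undersample_majority(train)
print(train_balanced['toxic'].value_counts())
```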

## Handle missing data

In [7]:
# check for NaN's
df.isna().sum()

comment_text    0
toxic           0
dtype: int64

In [8]:
# drop rows including NaN's
df.dropna(inplace=True)

In [9]:
df.isna().sum()

comment_text    0
toxic           0
dtype: int64

## Create corpus variable

In [10]:
corp = df['comment_text']

## Data cleaning

### Show data size before cleaning

In [11]:
# count 'words' (rough regex method)
num_words_before = corp.str.count(r'\S+').sum()

print(f'Number of words in corpus before cleaning: {num_words_before:,}')

Number of words in corpus before cleaning: 509,339


### Remove anchor HTML tags (\<a\>)

TODO: Do this with an HTML parser like Beautiful Soup.

In [12]:
regex = r'<a .*?>|</a>' # *? for non-greedy repetition

# count matches
print(corp.str.count(regex, flags=re.I).sum())

# show some rows containing the pattern
corp[corp.str.contains(regex, na=False, case=False)].head()

0


Series([], Name: comment_text, dtype: object)

In [13]:
# replace pattern
corp = corp.str.replace(regex, '', regex=True, case=False)

# count matches again, should be 0
print(corp.str.count(regex, flags=re.I).sum())

0
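
Regarding the TODO about using an HTML parser: the sketch below uses `html.parser` from the Python standard library instead of Beautiful Soup, purely to stay dependency-free. `TextExtractor` and `strip_tags` are hypothetical names. Note that, unlike the regex above, this removes all tags (not just anchors) and also converts character references such as `&amp;`.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # called for every piece of text between tags
        self.parts.append(data)

def strip_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)

example = 'See <a href="http://example.com">this link</a> for details.'
stripped = strip_tags(example)
print(stripped)
```

It could be applied to the corpus with something like `corp.apply(strip_tags)`, although comments containing stray `<` characters should be checked first.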


### Remove URLs

In [14]:
regex = r'https?://\S+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

306


1030589                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           "Rather than accept the theory that the purpose of government is to protect man’s natural rights, the Progressives put forward the notion that government’s primary purpose is to ensure fairness and economic equality."\n\nhttp://www.americanthinker.com/articles/2017/07/americas_long_march_toward_a_secular_socialist_democracy.html\n\nTeam Hobbes?  Or team Locke?  Hmmm.  I choose Locke.  What say you all?


In [15]:
corp = corp.str.replace(regex, '', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Remove whitespace except for spaces

\r causes an error when the saved CSV file is loaded with read_csv() using the C engine (the Python engine works).  
\u2028 is the Unicode line separator.

In [16]:
regex = r'[\t\n\r\f\v\u2028]'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

11133


1610665                                                                                                                 Naturally you can feel it in your urine. \nThat's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along.   ;)
1295718                                                                                                                                                                                               "The shortage of priests is not a shortage of vocations but a shortness of sight."\n\nExactly! Very well put!
867320                                                                                                                                                                                                             Trudeau Derangement Syndrome - every bit as bad as Harper Derangement Syndrom.\n\nGet some help.
374915                                                                      

In [17]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Remove numbers

In [18]:
regex = r'\d+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

5284


789650                                                             Kiz for some reason has been anti-Siemian from way back. Maybe it's because his ceiling isn't high enough as he is only 6'3" instead of 6'7".  Woody, on the other hand, just wrote a great column about Touchdown Trevor. It is worth reading and easy to find. That touch pass to Sanders at the back of the end zone was a thing of beauty.  I guess Kiz missed that play.
374915                                                                                                                                                                                                                                                                                                                                            Base price is $35K. With "options" $42K.  Wonder what the buyer doesn't get at the base price?
632685                                                                       Vote them all out. I will never again vote for anyone who

In [19]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Manually "unmask" most frequent swearwords, insults etc. (e.g. f*ck, cr@p)

Also correct some deliberate misspellings that mimic pronunciation, e.g. "huuuge", "stooopid".

TODO: Implement autocorrection.

In [20]:
# search patterns used to create list of replacements (see next
# code cell)

regex = r'\S*\*\S+'
#regex = r'\S*@\S+'
#regex = r'\S*#\S+'
#regex = r'\S*a{3,}\S*'
#regex = r'\S*e{3,}\S*'
#regex = r'\S*i{3,}\S*'
#regex = r'\S*o{3,}\S*'
#regex = r'\S*u{3,}\S*'

print(corp.str.count(regex, flags=re.I).sum())
all_matches = corp.str.findall(regex, flags=re.I).value_counts()
all_matches[all_matches > 5]

76


comment_text
[]    9942
Name: count, dtype: int64
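
The output above counts whole per-row result lists (mostly the empty list), not individual matches. To find the most frequent masked words, the lists probably need to be flattened first. The following sketch on a toy Series shows one way to do this with `explode()`; `toy_corp` is made-up data, and the lower-casing and punctuation stripping are assumptions about what counts as "the same" word.

```python
import re
import pandas as pd

toy_corp = pd.Series(['What the f*ck',
                      'Total cr*p, F*ck that',
                      'nothing masked here'])

regex = r'\S*\*\S+'

matches = (toy_corp.str.findall(regex, flags=re.I)
           .explode()              # one row per match, NaN for empty lists
           .dropna()
           .str.lower()            # treat 'F*ck' and 'f*ck' as the same word
           .str.strip('.,;:!?'))   # drop punctuation attached to the match

match_counts = matches.value_counts()
print(match_counts)
```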

In [21]:
# raw string so that \b stays a regex word boundary; 'soo+' is anchored with \b
# because it would otherwise also match inside words like "soon" (-> "son")
match_list = r'(?i)f*ck, (?i)sh*t, (?i)s**t, (?i)f***, (?i)p***y, (?i)b*tch, (?i)f**k, (?i)p*ssy, (?i)p****, (?i)s***, (?i)a**, (?i)h*ll, (?i)h***, (?i)sh*t, (?i)pu**y, (?i)sh**, (?i)cr*p, (?i)@ss, (?i)cr@p, (?i)b@lls, (?i)f@ck, (?i)waaay, (?i)waaaay, (?i)riiiight, (?i)\bsoo+\b, (?i)stooooopid, (?i)huu+ge, (?i)yuu+ge, (?i)suu+re'\
    .replace('*', r'\*').split(', ')
replace_list = 'fuck, shit, shit, fuck, pussy, bitch, fuck, pussy, pussy, shit, ass, hell, hell, shit, pussy, shit, crap, ass, crap, balls, fuck, way, way, right, so, stupid, huge, huge, sure'\
    .split(', ')

corp.replace(match_list, replace_list, regex=True, inplace=True)
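
The replacement above depends on two details: literal characters such as `*` must be escaped, and open-ended patterns such as `soo+` can also match inside ordinary words. The sketch below illustrates both points with the standard `re` module on a single string. The two small dictionaries and the `unmask` helper are hypothetical and only cover a few of the words from the lists above.

```python
import re

# masked words are literal strings, so they are escaped with re.escape
masked = {'f*ck': 'fuck', 'sh*t': 'shit', 'cr@p': 'crap'}

# elongated words are real regexes; \b keeps them from matching inside
# other words (e.g. 'soo+' inside 'soon')
elongated = {r'\bsoo+\b': 'so', r'\bhuu+ge\b': 'huge'}

patterns = [(re.compile(re.escape(word), re.I), repl)
            for word, repl in masked.items()]
patterns += [(re.compile(regex, re.I), repl)
             for regex, repl in elongated.items()]

def unmask(text):
    """Apply all replacements one after another."""
    for pattern, repl in patterns:
        text = pattern.sub(repl, text)
    return text

result = unmask('That is sooo much cr@p, see you soon. HUUUGE sh*t.')
print(result)
```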

### Remove multiple spaces

In [22]:
regex = r' {2,}'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

18209


1610665                                                                                                                                       Naturally you can feel it in your urine.  That's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along.   ;)
495126                                                                                                                                                                                                                                                                           Yum!  What's not to love: water+maple syrup - together.
431396     Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year.  The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.
1295718      

In [23]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Show data size after cleaning

In [24]:
num_words_after = corp.str.count(r'\S+', flags=re.I).sum()

print(f'Number of words in corpus after cleaning: {num_words_after:,} (before: {num_words_before:,})')

Number of words in corpus after cleaning: 508,166 (before: 509,339)


## Preprocess data with spaCy (based on Eric's pipeline)

See: https://realpython.com/natural-language-processing-spacy-python/

TODO: Check if this can also be done with NLTK. Faster?

In [25]:
# load English language model
nlp = spacy.load('en_core_web_sm')

### Tokenize, remove punctuation, make lower case, lemmatize, remove stop words

In [26]:
def preprocess(s):
    """Return three preprocessed variants of the string s as a Series."""
    doc = nlp(s)

    # drop punctuation once and reuse the remaining tokens
    tokens_no_punct = [token for token in doc if not token.is_punct]

    # lower-cased tokens
    tokens = [token.text.lower() for token in tokens_no_punct]

    # lower-cased lemmas
    tokens_lemma = [token.lemma_.lower() for token in tokens_no_punct]

    # lower-cased lemmas without stop words
    tokens_lemma_stop = [token.lemma_.lower()
                         for token in tokens_no_punct
                         if not token.is_stop]

    # convert lists to space-separated strings and return as Series
    return pd.Series([' '.join(tokens),
                      ' '.join(tokens_lemma),
                      ' '.join(tokens_lemma_stop)],
                     index=['clean_pp',
                            'clean_pp_lemma',
                            'clean_pp_lemma_stop'])

In [27]:
corp_pp = corp.progress_apply(preprocess)
corp_pp.head()

100%|██████████| 10000/10000 [03:12<00:00, 52.03it/s]


Unnamed: 0,clean_pp,clean_pp_lemma,clean_pp_lemma_stop
1610665,naturally you can feel it in your urine that 's one of the common expressions in the germanic languages and the hallmark of those who after the fact say they knew that all along,naturally you can feel it in your urine that be one of the common expression in the germanic language and the hallmark of those who after the fact say they know that all along,naturally feel urine common expression germanic language hallmark fact know
495126,yum what 's not to love water+maple syrup together,yum what be not to love water+maple syrup together,yum love water+maple syrup
431396,catou i will wager that mutual fund sales will be just as strong this rrsp season as last year the fact is that your typical mutual fund owner is n't going to suddenly discover diy investing in individual stocks and salesperson at the bank is n't going to suddenly abandon the sales pitch for lower cost etfs or gics,catou i will wager that mutual fund sale will be just as strong this rrsp season as last year the fact be that your typical mutual fund owner be not go to suddenly discover diy invest in individual stock and salesperson at the bank be not go to suddenly abandon the sale pitch for low cost etf or gic,catou wager mutual fund sale strong rrsp season year fact typical mutual fund owner go suddenly discover diy invest individual stock salesperson bank go suddenly abandon sale pitch low cost etf gic
1295718,the shortage of priests is not a shortage of vocations but a shortness of sight exactly very well put,the shortage of priest be not a shortage of vocation but a shortness of sight exactly very well put,shortage priest shortage vocation shortness sight exactly
859224,i do nt disagree with that it takes money to deal drugs street level drug sales people they nit really dealers are from many parts of the federal population,i do not disagree with that it take money to deal drug street level drug sale people they nit really dealer be from many part of the federal population,not disagree take money deal drug street level drug sale people nit dealer part federal population


## Create new df with raw + cleaned + preprocessed comments + target

In [28]:
df_new = pd.concat([df['comment_text'],
                    corp,
                    corp_pp['clean_pp'],
                    corp_pp['clean_pp_lemma'],
                    corp_pp['clean_pp_lemma_stop'],
                    df['toxic']], axis=1)

# column names
df_new.columns = ['raw',
                  'clean',
                  'clean_pp',
                  'clean_pp_lemma',
                  'clean_pp_lemma_stop',
                  'toxic']

df_new.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
1610665,"Naturally you can feel it in your urine. \nThat's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)","Naturally you can feel it in your urine. That's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)",naturally you can feel it in your urine that 's one of the common expressions in the germanic languages and the hallmark of those who after the fact say they knew that all along,naturally you can feel it in your urine that be one of the common expression in the germanic language and the hallmark of those who after the fact say they know that all along,naturally feel urine common expression germanic language hallmark fact know,0
495126,Yum! What's not to love: water+maple syrup - together.,Yum! What's not to love: water+maple syrup - together.,yum what 's not to love water+maple syrup together,yum what be not to love water+maple syrup together,yum love water+maple syrup,0
431396,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,catou i will wager that mutual fund sales will be just as strong this rrsp season as last year the fact is that your typical mutual fund owner is n't going to suddenly discover diy investing in individual stocks and salesperson at the bank is n't going to suddenly abandon the sales pitch for lower cost etfs or gics,catou i will wager that mutual fund sale will be just as strong this rrsp season as last year the fact be that your typical mutual fund owner be not go to suddenly discover diy invest in individual stock and salesperson at the bank be not go to suddenly abandon the sale pitch for low cost etf or gic,catou wager mutual fund sale strong rrsp season year fact typical mutual fund owner go suddenly discover diy invest individual stock salesperson bank go suddenly abandon sale pitch low cost etf gic,0
1295718,"""The shortage of priests is not a shortage of vocations but a shortness of sight.""\n\nExactly! Very well put!","""The shortage of priests is not a shortage of vocations but a shortness of sight."" Exactly! Very well put!",the shortage of priests is not a shortage of vocations but a shortness of sight exactly very well put,the shortage of priest be not a shortage of vocation but a shortness of sight exactly very well put,shortage priest shortage vocation shortness sight exactly,0
859224,"I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.","I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.",i do nt disagree with that it takes money to deal drugs street level drug sales people they nit really dealers are from many parts of the federal population,i do not disagree with that it take money to deal drug street level drug sale people they nit really dealer be from many part of the federal population,not disagree take money deal drug street level drug sale people nit dealer part federal population,0


## Drop rows with NaN's

This has to be done again because spaCy preprocessing can produce empty strings (e.g. for comments that consist only of punctuation or stop words).

In [29]:
# convert empty strings to NaN
df_new.replace('', np.nan, inplace=True)

In [30]:
df_new.isna().sum()
rows_before = df_new.shape[0]
print("Rows before dropping:", rows_before)
df_new.dropna(inplace=True)
df_new.reset_index(drop=True, inplace=True)
rows_after = df_new.shape[0]
print('Rows after dropping:', rows_after)
print('Rows dropped:', rows_before - rows_after)

Rows before dropping: 10000
Rows after dropping: 9964
Rows dropped: 36
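
Why empty strings matter here can be seen in a small round trip: with default settings, read_csv() reads an empty field back as NaN, so rows that look fine before saving would show up as missing after loading. The sketch below uses a toy DataFrame and an in-memory buffer instead of a file; the column contents are made up.

```python
import io
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'clean_pp_lemma_stop': ['shortage priest', '', 'yum love'],
    'toxic': [0, 0, 1],
})

# write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# the empty string comes back as NaN
nan_after_reload = int(reloaded['clean_pp_lemma_stop'].isna().sum())
print('NaNs after reload:', nan_after_reload)

# converting and dropping before saving avoids the surprise
cleaned = toy.replace('', np.nan).dropna().reset_index(drop=True)
print('Rows kept:', len(cleaned))
```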


In [31]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9964 entries, 0 to 9963
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   raw                  9964 non-null   object
 1   clean                9964 non-null   object
 2   clean_pp             9964 non-null   object
 3   clean_pp_lemma       9964 non-null   object
 4   clean_pp_lemma_stop  9964 non-null   object
 5   toxic                9964 non-null   int32 
dtypes: int32(1), object(5)
memory usage: 428.3+ KB


## Save CSV file

In [32]:
df_new.to_csv('data/lexiguard_data.csv', index=False)

In [33]:
df_check = pd.read_csv('data/lexiguard_data.csv')
df_check.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,"Naturally you can feel it in your urine. \nThat's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)","Naturally you can feel it in your urine. That's one of the common expressions in the Germanic languages and the hallmark of those who, after the fact, say they knew that all along. ;)",naturally you can feel it in your urine that 's one of the common expressions in the germanic languages and the hallmark of those who after the fact say they knew that all along,naturally you can feel it in your urine that be one of the common expression in the germanic language and the hallmark of those who after the fact say they know that all along,naturally feel urine common expression germanic language hallmark fact know,0
1,Yum! What's not to love: water+maple syrup - together.,Yum! What's not to love: water+maple syrup - together.,yum what 's not to love water+maple syrup together,yum what be not to love water+maple syrup together,yum love water+maple syrup,0
2,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,Catou I will wager that mutual fund sales will be just as strong this RRSP season as last year. The fact is that your typical mutual fund owner isn't going to suddenly discover DIY investing in individual stocks and salesperson at the bank isn't going to suddenly abandon the sales pitch for lower cost ETFs or GICs.,catou i will wager that mutual fund sales will be just as strong this rrsp season as last year the fact is that your typical mutual fund owner is n't going to suddenly discover diy investing in individual stocks and salesperson at the bank is n't going to suddenly abandon the sales pitch for lower cost etfs or gics,catou i will wager that mutual fund sale will be just as strong this rrsp season as last year the fact be that your typical mutual fund owner be not go to suddenly discover diy invest in individual stock and salesperson at the bank be not go to suddenly abandon the sale pitch for low cost etf or gic,catou wager mutual fund sale strong rrsp season year fact typical mutual fund owner go suddenly discover diy invest individual stock salesperson bank go suddenly abandon sale pitch low cost etf gic,0
3,"""The shortage of priests is not a shortage of vocations but a shortness of sight.""\n\nExactly! Very well put!","""The shortage of priests is not a shortage of vocations but a shortness of sight."" Exactly! Very well put!",the shortage of priests is not a shortage of vocations but a shortness of sight exactly very well put,the shortage of priest be not a shortage of vocation but a shortness of sight exactly very well put,shortage priest shortage vocation shortness sight exactly,0
4,"I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.","I dont disagree with that. It takes money to deal drugs. Street level drug sales people, they nit really dealers, are from many parts of the federal population.",i do nt disagree with that it takes money to deal drugs street level drug sales people they nit really dealers are from many parts of the federal population,i do not disagree with that it take money to deal drug street level drug sale people they nit really dealer be from many part of the federal population,not disagree take money deal drug street level drug sale people nit dealer part federal population,0


In [34]:
df_check.isna().sum()

raw                    0
clean                  0
clean_pp               0
clean_pp_lemma         0
clean_pp_lemma_stop    0
toxic                  0
dtype: int64