## Text Pre-processing

Preprocessing for NLP is highly task-dependent. The steps we take depend on the aims and methods we intend to apply at the modelling stage, so there may be more than one path to take.

As a reminder, the aim of our study is to train models that can effectively classify text comments as toxic while minimising bias. Key to this is a model that understands contextual relationships between words, and perhaps also the contextual impact of punctuation at a given point. To see why, consider how an exclamation point changes your reading of a sentence.

Standard NLP preprocessing workflows generally remove punctuation, because classic text representations such as TF-IDF gain nothing from punctuation marks beyond their counts. However, pre-trained word embeddings such as fastText, GloVe and word2vec often contain vector representations of punctuation marks, learned from their relationships to other words. As a result, we can skip the removal of punctuation when using them.
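
To make the point about punctuation concrete, here is a minimal sketch using only the standard library. The two example comments and the helper names (`strip_punctuation`, `split_keep_punctuation`) are illustrative assumptions and are not part of the project code.

```python
import re
import string

# Two hypothetical comments that differ only in punctuation.
calm = "stop doing that"
shouted = "stop doing that!!!"

def strip_punctuation(text):
    """Bag-of-words style cleaning: delete every ASCII punctuation mark."""
    return text.translate(str.maketrans('', '', string.punctuation)).split()

def split_keep_punctuation(text):
    """Keep each punctuation mark as its own token so that it could be
    looked up in a pre-trained embedding table later on."""
    return re.findall(r"\w+|[^\w\s]", text)

# After stripping, the two comments are indistinguishable.
print(strip_punctuation(calm) == strip_punctuation(shouted))
# Keeping punctuation preserves the emphasis as extra tokens.
print(split_keep_punctuation(shouted))
```

Whether the extra tokens help depends on the embedding set actually containing vectors for them, which is checked later in this notebook.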

### Our Method:

For this project we are following two paths. 

#### Traditional ML
We will be training a series of classic ML classifiers such as SVM, Logistic Regression and Random Forest. We will embed the words using TF-IDF and therefore we will use a traditional NLP workflow.

 1. Remove punctuation, convert to lower case
 2. Tokenization
 3. Removal of stop words
 4. Stemming/Lemmatization
 
We will use tools from NLTK/spaCy to achieve this, with adjustments for quirks in our data.
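
As a rough illustration of what TF-IDF weighting does with the cleaned tokens, the sketch below computes the textbook form (term frequency times the log of inverse document frequency) in plain Python. The three documents and the function name `tf_idf` are made up for the example. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalisation by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents (lower-cased, tokenized, stop words removed).
docs = [
    ["great", "comment", "thank"],
    ["terrible", "comment", "idiot"],
    ["thank", "share"],
]

def tf_idf(docs):
    """Textbook TF-IDF: tf = count / document length, idf = log(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
# "idiot" occurs in one document only, so it outweighs the more common "comment".
print(weights[1])
```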

#### Neural Networks with GloVe
We will also be training a neural network model which takes advantage of GloVe word embeddings. As such, we will not be carrying out the standard workflow. Instead, we will begin by comparing the tokens in our dataset vocabulary to the GloVe word embeddings. We will remove as many OOV (out-of-vocabulary) words/symbols as we can.

We will then attempt to correct issues such as misspellings that are causing words to not match with the word-embeddings. We will leave in punctuation that has a vector representation.




In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import contractions
from tqdm import tqdm
tqdm.pandas()
import gc
import operator



In [2]:
# Load in the cleaned train and cleaned+concatenated test datasets. 
train_df = pd.read_csv('data/train_clean.csv')
test_df = pd.read_csv('data/test_clean.csv')

In [3]:
# Filter out the first column
train_df = train_df.iloc[:,1:]
test_df = test_df.iloc[:,1:]

------------------------

### Text Pre-processing:

We will start with a baseline approach of:
* Casing to lower
* Expanding contractions
* Tokenizing - uses nltk.TweetTokenizer (with reduce_len=True)
* Removing stop words - uses the NLTK English stop word list
* Removing punctuation - uses str.translate with string.punctuation
* Lemmatizing - uses the NLTK WordNet lemmatizer on adverbs, adjectives, verbs and nouns

This is a standard workflow for pre-processing text. We will experiment with the results if we remove some of these steps.

Ultimately this is a difficult task given these are online comments. We will likely see numerous misspellings that cause the tools we use to miss certain issues. 

In [None]:
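# df is not defined earlier in this notebook; here we assume it is a copy of the training data
df = train_df.copy()
# to lower case.
df['cleaned_comment'] = df['comment_text'].apply(lambda x: x.lower())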

In [None]:
# use contractions library to expand contractions
# Contraction package code can be found here https://github.com/kootenpv/contractions, contractions list can be found below
# https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
# The package is very useful for expanding contractions and is good at doing so for slang 

# We could also implement this using a dictionary built from 
# https://en.wikipedia.org/wiki/Wikipedia%3aList_of_English_contractions

def contraction_expand(text):
    return contractions.fix(text)
#teststr = "hi i'm the coolest cat yall, y'all, there's isn't"

df['cleaned_comment'] = df['cleaned_comment'].apply(lambda x: contraction_expand(x))
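
The dictionary-based alternative mentioned in the comments could look roughly like the sketch below. The four-entry `CONTRACTION_MAP` and the function name `expand_with_dict` are hypothetical; a usable map would be much longer, and this version assumes the text has already been lower-cased.

```python
import re

# A tiny, hypothetical contraction map; a real one would be far longer.
CONTRACTION_MAP = {
    "i'm": "i am",
    "isn't": "is not",
    "there's": "there is",
    "y'all": "you all",
}

# Longest keys first so that longer contractions win over shorter ones.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b"
)

def expand_with_dict(text):
    """Replace each known contraction in lower-cased text with its expansion."""
    return _pattern.sub(lambda match: CONTRACTION_MAP[match.group(1)], text)

print(expand_with_dict("hi i'm sure there's a reason, isn't there y'all"))
```

A hand-built map gives full control over which forms are expanded, at the cost of missing anything that is not listed.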

In [None]:
# Quick check of some comments
print(df['comment_text'].loc[1])
print(df['cleaned_comment'].loc[1])
print('')
print(df['comment_text'].loc[204000])
print('')
print(df['cleaned_comment'].loc[204000])

print('')
print(df['comment_text'].loc[565934])
print('')
print(df['cleaned_comment'].loc[565934])

So far we can see that we have successfully converted the strings to lower case and appear to have expanded the contractions; certainly we have captured common cases such as "I'm". We will now progress to the next stage.

In [None]:
# TOKENIZATION

# We will use the TweetTokenizer from nltk.
# Our data has been taken from online comments, which likely share many traits with Tweets, such as hashtags and
# additional punctuation for emphasis. This tokenizer should deal with them more effectively.
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(reduce_len=True) # reduce_len=True caps runs of repeated characters at three, e.g. 'waaaaayyyy' becomes 'waaayyy'

#apply tokenizer using lambda function 
df['cleaned_comment'] = df['cleaned_comment'].apply(lambda x: tokenizer.tokenize(x))
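
NLTK documents `reduce_len` as replacing repeated character sequences of length three or more with sequences of length three. The sketch below imitates that behaviour with the standard library so the effect is easy to see; `shorten_repeats` is a hypothetical name and this is not NLTK's own code.

```python
import re

# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")

def shorten_repeats(token):
    """Cap every run of a repeated character at three occurrences."""
    return _REPEATS.sub(r"\1\1\1", token)

print(shorten_repeats("waaaaayyyy"))   # runs of 'a' and 'y' are capped at three
print(shorten_repeats("cool"))         # runs shorter than three are left alone
```

Because runs are capped at three rather than removed, a word like "waaay" still differs from "way" after this step.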

In [None]:
# remove stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes the membership test below much faster than a list

# remove stop words via lambda function and list comprehension
df['cleaned_comment'] = df['cleaned_comment'].apply(lambda x: [item for item in x if item not in stop_words])

In [None]:
# Clean Punctuation
# We are taking advantage of the translate method of Python strings.

# First we make a translation table. The first two arguments are empty, so no characters are remapped. The third
# argument lists characters which will be mapped to None (deleted) if found. By passing string.punctuation we create
# a table which deletes punctuation.
punc_table = str.maketrans('', '', string.punctuation)
df['cleaned_comment'] = df['cleaned_comment'].apply(lambda x: [item.translate(punc_table) for item in x])
# Tokens that consisted only of punctuation are now empty strings, so we drop them
df['cleaned_comment'] = df['cleaned_comment'].apply(lambda x: [item for item in x if item])
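
A small, self-contained illustration of the translation-table technique on an already tokenized comment. The token list is made up for the example.

```python
import string

# With two empty strings, the third argument lists the characters to delete.
punc_table = str.maketrans('', '', string.punctuation)

tokens = ["thank", "you", "!", "!", "well-known", "#maga"]
stripped = [tok.translate(punc_table) for tok in tokens]
print(stripped)

# Tokens made only of punctuation become empty strings; filter them out if unwanted.
non_empty = [tok for tok in stripped if tok]
print(non_empty)
```

Note that punctuation inside a token is deleted too, so "well-known" is fused into "wellknown" rather than split into two words.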


In [None]:
# quick check
print(df.loc[1,['comment_text', 'cleaned_comment']])

In [None]:
%%time
# lemmatize words using the WordNetLemmatizer
from nltk.stem import WordNetLemmatizer


# The lemmatizer takes a part-of-speech tag with each call, so we run it once for each of the four tags
lemmatizer = WordNetLemmatizer()
def lemmatize_text_noun(text):
    return [lemmatizer.lemmatize(w, pos='n') for w in text]

def lemmatize_text_verb(text):
    return [lemmatizer.lemmatize(w, pos='v') for w in text]
def lemmatize_text_adj(text):
    return [lemmatizer.lemmatize(w, pos='a') for w in text]
def lemmatize_text_adv(text):
    return [lemmatizer.lemmatize(w, pos='r') for w in text]

df['cleaned_comment'] = df['cleaned_comment'].apply(lemmatize_text_noun)
df['cleaned_comment'] = df['cleaned_comment'].apply(lemmatize_text_verb)
df['cleaned_comment'] = df['cleaned_comment'].apply(lemmatize_text_adj)
df['cleaned_comment'] = df['cleaned_comment'].apply(lemmatize_text_adv)

This is the end of our standard pipeline. We have put all the above steps together in a function below which we can apply to the text. This will also ensure that we treat the test set in exactly the same way as the training set.


#### PRE-PROCESS PIPELINE FUNC

In [None]:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
import contractions
import string

def text_cleaner(df, col_name, clean_col_name):
   

    # Lemmatize functions to be called later
    # Lemmatize nouns
    def lemmatize_text_noun(text):
        return [lemmatizer.lemmatize(w, pos='n') for w in text]
    
    # Lemmatize verbs
    def lemmatize_text_verb(text):
        return [lemmatizer.lemmatize(w, pos='v') for w in text]
    # Lemmatize adjectives
    def lemmatize_text_adj(text):
        return [lemmatizer.lemmatize(w, pos='a') for w in text]

    # Lemmatize adverbs
    def lemmatize_text_adv(text):
        return [lemmatizer.lemmatize(w, pos='r') for w in text]
    
    # Expand contraction method
    def contraction_expand(text):
        return contractions.fix(text)
    
    # To lower case.
    df[clean_col_name] = df[col_name].apply(lambda x: x.lower())
    
    # Expand contractions
    df[clean_col_name] = df[clean_col_name].apply(lambda x: contraction_expand(x))
    
    #Tokenize:
    tokenizer = TweetTokenizer(reduce_len=True)
    df[clean_col_name] = df[clean_col_name].apply(lambda x: tokenizer.tokenize(x))
   
    
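    #Remove Stop words (a set makes the membership test much faster than a list)
    stop_words = set(stopwords.words('english'))
    df[clean_col_name] = df[clean_col_name].apply(lambda x: [item for item in x if item not in stop_words])
    
    #Delete punctuation, then drop tokens that are left empty
    punc_table = str.maketrans('', '', string.punctuation)
    df[clean_col_name] = df[clean_col_name].apply(lambda x: [item.translate(punc_table) for item in x])
    df[clean_col_name] = df[clean_col_name].apply(lambda x: [item for item in x if item])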
    
    # LEMMATIZATION
    lemmatizer = WordNetLemmatizer()
    
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_noun)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_verb)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_adj)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_adv)
    
    
    return None

def detokenizer(df, col_name):
    detokenizer = TreebankWordDetokenizer()
    df[col_name+'_detokenize'] = df[col_name].apply(lambda x: detokenizer.detokenize(x))
    
    return None
    

In [None]:
%%time
# Run the cleaner func on train data
text_cleaner(train_df, 'comment_text', 'comment_text_clean')

In [None]:
%%time
# run text cleaner on testdf
text_cleaner(test_df, 'comment_text', 'comment_text_clean')

In [None]:
# detokenize
detokenizer(train_df,'comment_text_clean')
detokenizer(test_df,'comment_text_clean')

-----

## Preprocessing for Glove word embeddings

For our neural network models we will be using GloVe pre-trained word embeddings. Key to the success of the model will therefore be processing our text so that vocabulary coverage is as high as possible. This means we will have to correct as many misspellings and contractions as we can. We will not use the cleaning pipeline defined above for the classic ML models, as the GloVe word embeddings contain vector representations for many of the tokens that pipeline removes. Instead we will tailor our approach to maximise the percentage of our dataset vocabulary that we can match to a word embedding.

We will use the GloVe Common Crawl 840B 300-dimensional vectors. We believe this will achieve superior accuracy over the smaller 6B set, which was trained on Wikipedia and Gigaword newswire text. Remember, our comments come from various online sources, so they will likely use language differently from these more formal sources.

In [2]:
# Load in the cleaned train and cleaned+concatenated test datasets. 
train_df = pd.read_csv('data/train_clean.csv')
test_df = pd.read_csv('data/test_clean.csv')

In [3]:
# Filter out the first col
train_df = train_df.iloc[:,1:]
test_df = test_df.iloc[:,1:]

In [4]:
## Define functions used to read embeddings in and build vocab

from tqdm import tqdm
tqdm.pandas()
import gc
import operator


GLOVE_FILE = 'embeds/glove.840B.300d.txt'


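def load_embed(file):
    
    def get_coefs(word,*arr): 
        # NOTE: [:1] keeps only the first coefficient of each vector. That is enough for the
        # coverage checks in this notebook, which only use the keys, and it saves memory.
        # Remove the slice to load the full 300-dimensional vectors.
        return word, np.asarray(arr, dtype='float16')[:1]
    
    # Each line is a token followed by its coefficients, separated by single spaces
    with open(file, encoding='utf-8') as f:
        embedding_index = dict(get_coefs(*o.strip().split(" ")) for o in f)
        
    return embedding_index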

def build_vocab(texts, verbose =  True):
    """
    :param texts: pandas Series of strings; each string is split on whitespace into words
    :return: dictionary of words and their count
    """
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
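
For reference, the same counting can be written with `collections.Counter`. This is an equivalent sketch rather than the function used in the rest of the notebook; `build_vocab_counter` and the sample comments are hypothetical.

```python
from collections import Counter

def build_vocab_counter(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab

# Hypothetical comments; note that "Yes," and "Yes" are counted as different words.
sample = ["Yes, I agree", "Yes I do", "I agree"]
vocab = build_vocab_counter(sample)
print(vocab.most_common(3))
```

Splitting on whitespace alone keeps trailing punctuation attached to words, which matters for the coverage figures discussed below.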


In [5]:
import operator 
# This function checks our vocab against the keys of the embedding dictionary
def check_coverage(vocab,embeddings_index):
    a = {}      # words that have an embedding
    oov = {}    # out-of-vocabulary words and their counts
    k = 0       # number of tokens in the text that have an embedding
    i = 0       # number of tokens in the text that do not
    for word in vocab:
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print(f'Found embeddings for {len(a) / len(vocab):.2%} of vocab')
    print(f'Found embeddings for  {k / (k + i):.2%} of all text')
    # Sort the OOV words by count, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x


In [10]:
%%time
glove = load_embed(GLOVE_FILE)

CPU times: user 2min 16s, sys: 5.32 s, total: 2min 22s
Wall time: 2min 25s


In [None]:
%%time 
# Let's check the OOV words for the raw training comments
vocab = build_vocab(train_df['comment_text'])
glove_oov = check_coverage(vocab, glove)

As we can see, we were only able to find 15.82% of the words in our vocab within the embedding. Let us investigate the glove_oov list to see what kind of words we are missing.

In [29]:
glove_oov #output hidden 

[('Yes,', 19056),
 ('that,', 18289),
 ('(and', 16524),
 ('is,', 16305),
 ('"The', 16292),
 ('Trump.', 15692),
 ('However,', 13721),
 ('So,', 13048),
 ('them,', 12684),
 ('people,', 12450),
 ('Well,', 12328),
 ('it?', 12276),
 ('No,', 12099),
 ('time,', 12042),
 ('"I', 11621),
 ('course,', 11289),
 ('years,', 11245),
 ('Trump,', 10657),
 ('me,', 10600),
 ('said,', 10543),
 ('now,', 10281),
 ('way,', 10224),
 ('all,', 9671),
 ('this,', 9626),
 ('fact,', 9427),
 ('know,', 8904),
 ('again,', 8904),
 ('you?', 8787),
 ('(or', 8756),
 ('here,', 8750),
 ('"the', 8747),
 ('yes,', 8678),
 ('not,', 8668),
 ('right?', 8385),
 ('up,', 8375),
 ('say,', 8273),
 ('Also,', 8200),
 ('Oh,', 8122),
 ('that?', 8051),
 ('well,', 7834),
 ('right,', 7649),
 ('out,', 7497),
 ('Canada,', 7374),
 ('so,', 7150),
 ('him,', 6864),
 ('year,', 6735),
 ('Yeah,', 6714),
 ('and,', 6645),
 ('But,', 6620),
 ('there,', 6582),
 ('100%', 6567),
 ('money,', 6541),
 ('country,', 6482),
 ('And,', 6475),
 ('do,', 6437),
 ('examp

The first thing to notice is that a number of the most common missing vocab words are contractions or words with a possessive apostrophe. Naturally there are also a number of misspellings.

In addition, there appear to be a number of examples of incorrect grammar, emojis, and names. Interestingly, with the names we do not appear to be missing the name itself, e.g. Trump or Obama, but rather the possessive case, e.g. Trump's or Obama's. We could therefore see if we can lemmatize/stem these.

#### Dealing with Emojis and other symbols
Let us first deal with emojis and symbols. Emojis can contain useful information and we may have some contained in our text given this was taken from a common crawl. We will try to preserve the emojis we have GloVe embeddings for and delete the ones we don't. We will do the same for other symbols.

In [12]:
# First get all the characters from our vocabulary
# we can use our build_vocab method from before 
clean_vocab = build_vocab(train_df['comment_text'], verbose=False)

In [13]:
# Use a list comprehension to pull out the single characters.
# Instead of keeping a large list, we join the characters into one long string, which makes them easier to view.
# The comprehension keeps each token in our vocab dict whose length is 1 (i.e. a single character).
clean_vocab_chars = ''.join([char for char in clean_vocab if len(char) == 1])

In [14]:
# We will now make a filter to remove all regular letters and numbers

# We classify as a symbol anything which is not a Latin letter or a digit.
# The string package contains a list of common ascii characters and digits 
# https://docs.python.org/3/library/string.html

# string.ascii_letters includes lowercase and upper case. We define a filter below 
non_symbols = string.digits + string.ascii_letters

# We could also include characters from common Latin-based (western European) languages.
# This list was taken from Latin-1 charset table at: https://cs.stanford.edu/people/miles/iso8859.html#ISO
# There are likely more that we could exclude
latin_based_char = 'ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

# add this to filter
non_symbols = non_symbols + latin_based_char
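
An alternative to a hand-built character list, sketched here under the assumption that Unicode general categories are an acceptable definition of "letter or digit", is to ask `unicodedata` directly. This is not the filter used in this notebook: it also treats Greek and Cyrillic letters and fraction characters such as ½ as non-symbols, which the list above does not. The name `is_symbol` is hypothetical.

```python
import unicodedata

def is_symbol(char):
    """Treat a character as a symbol unless Unicode classes it as a
    letter (category L*) or a number (category N*)."""
    return unicodedata.category(char)[0] not in ("L", "N")

# a, e-acute, 7, !, grinning face emoji, one-half fraction
for char in ["a", "\u00e9", "7", "!", "\U0001F600", "\u00bd"]:
    print(repr(char), unicodedata.category(char), is_symbol(char))
```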

In [15]:
# Create a new string of symbols which has been filtered
clean_vocab_symbols = ''.join([char for char in clean_vocab_chars if char not in non_symbols])

In [16]:
clean_vocab_symbols

'.—-,(:&@!;$=–🐵>"?😑*…#\ue014~/)•≠+\uf04a\'🐶“§%·τα☺\uf0e0😜[]”|💖→❤_😂😄😀―{ᴀ★\x96}😎😱🚌🌟😊`👍💩💯⛽🚄😖♡\xad😈►◄💔½^\u200b🙄✬\ufeff\x7f£±<😉🙂😃😐💙😳☆¾😘😞😓😍‘🙏😡🎶🌺😁🤔😔🙃😏😕💚😢😟\\😭😆🔗¬🚽😴🤗♫👮ι🐟💎♥\x95⛲🍰ā›💛😠👎❧◞\x13▂▃▄▅▆▇😵😨’☹➡⬅😅«»🆕👅👥👶💕👲👤👄🔛🙈⇒\uf0b7⅓😋ġ⏺💰😲∙😇🙆👐†💫💤🍇\ue613🏡¶🌞🎲🤓🍕©💋🙀💀🎄🏆😒™🗑İ💃¢👿😧༼つÀ💜￼🐝✔🎅🍺●🎵🌎😬🎉🤣😣Сви✘😝🤢╪▶☭✭🌸ℐ\uf04c😺⏏🍔🐮🌹®🍀🔥\u200f👻🙁🤑☎🖒✨✰❆☙ˌ🎓🚪⚲⚭⚆⬭⬯○‣∎💥✈🎼⅛¼🚀☠Яя𝗮📺🐋💓с\x10😤👺😯🚶←𝙖🙌🤘ͦ☼⋆💞\uf0a7🚢👏⊂🤡🚂⅔\u200c‒Ｉ\u200dд🐒\x9f\u202a⏩ﷻ😮🦄😩🚗✊🐳😥🤤▸⦁⛷🤧σ⇌¿𝒶🦊☐☑🐽😫🦆😷♾⚠🎁̶✓‐🤠잘🔫\u202c☒▪↑😪𝖆𝕴∼☝💡🤥\uf005🦁🤐𝒂😦♪😌🌏👽🍽🐾✌😰🐄−🔨😛¸€о🍭🛳👣🐸☻🐀👊🎈👀🍻🍩μ😗🏂👳🍗💪❄👌🍒⠀🐑⏰◦⛓♬💊🤙😶𝑰∵∴𝗜🍊⤏🔹🍎🐴🐎𝘢☞☜▲↴↳💘🏀⊘▫⬇Ā🚴☮🖤🥘➕🚫♀✋🐻🤦🎃－ａ🤞📉🔭𝒙\uf10a🤯🐷💳🙇🔼'

Outside of punctuation, emojis make up the majority of symbols in our dataset vocabulary. In addition to this we have a number of characters from different encodings. We can now compare this set with what we have available in the word embeddings.  

In [17]:
# Do the same as above for the embeddings.
# The code below iterates through the keys of the glove dictionary and keeps those that are a single character
glove_char = ''.join([char for char in glove if len(char) == 1])

In [18]:
# Now let us filter out the symbols
# Similar list comprehension to the one above, with an if statement to check that the char is not in our filter
glove_symbols = ''.join([char for char in glove_char if char not in non_symbols])

In [19]:
glove_symbols

',.":)(-!?|;\'$&/[]>%=#*+\\•~@£·_{}©^®`<→°€™›♥←§″′█½…“★”–●►−¢²¬░¡¶↑±¿▾═¦║―¥▓—‹─▒：¼⊕▼▪†■’▀¨▄♫☆¯♦¤▲¸¾⋅‘∞∙）↓、│（»，♪╩╚³・╦╣╔╗▬❤¹≤‡√◄━⇒▶º≥╝♡◊。✈≡☺✔↵≈✓♣☎℃◦└‟～！○◆№♠▌✿▸⁄□❖✦．｜À┃／￥╠↩✭▐☼µ☻┐├«∼┌℉☮฿≦♬✧〉－⌂✖･◕※‖◀‰\x97↺∆œ┘┬╬،⌘š⊂ª＞〈⎙Å？☠⇐▫∗∈≠♀ƒ♔˚℗┗＊┼❀ı＆∩♂‿∑‣➜┛⇓☯⊖☀┳；∇⇑✰◇♯☞´ə↔┏｡◘∂✌♭┣┴┓✨ˈ˜❥┫℠✒ž［∫\x93≧］\x94∀♛\x96∨◎ˑ↻⅓⇩＜≫✩ˆ✪♕؟₤☛╮␊＋┈ɡ％╋▽⇨┻⊗￡।▂✯▇＿➤₂✞＝▷△◙▅✝ﾟ∧␉☭┊╯☾➔∴\x92▃↳＾׳➢╭➡＠⊙☢˝⅛∏ā„①๑∥❝Š☐▆Ÿ╱⋙๏☁⇔▔\x91②➚◡╰٠♢˙۞✘✮☑⋆ℓⓘ❒☣✉ē⌊➠∣❑⅔◢ⓒ\x80〒∕▮⦿✫✚⋯♩☂ˌ❞‗č܂☜ī‾✜╲∘⟩ō＼⟨·⅜✗Ă♚∅ⓔ◣͡‛❦⑨③◠✄❄１∃␣≪｢≅◯☽２İ∎｣⁰❧̅ǡⒶ↘⚓▣˘∪⇢✍ɛ⊥＃⅝⎯↠۩☰◥⊆✽ﬁ⚡↪ở❁☹ł◼☃◤❏Žⓢ⊱α➝̣✡∠｀▴┤Ȃ∝♏ⓐ✎;３④␤＇❣⅞✂✤ⓞ☪✴⌒˛♒＄ɪ✶▻Ⓔ◌◈۲Ʈ❚ʿ❂￦◉╜̃ťν✱╖❉₃ⓡℝ٤↗❶ʡ۰ˇⓣ♻➽۶₁ʃ׀✲Đʤ✬☉▉≒☥⌐♨✕ⓝ⊰❘＂⇧̵➪４▁βđ۱▏⊃ⓛ‚♰́✏⏑Œ̶٩Ⓢー⩾日￠❍≃⋰♋ɿ､̂ǿ❋✳ⓤ╤▕⌣✸℮⁺▨⑤╨Ⓥ♈❃☝Ā５✻⊇≻♘♞◂７✟Łū⌠✠☚✥ŋ❊ƂⒸŮ⌈❅Ⓡ♧Ⓞɑλ۵▭❱Ⓣ∟☕♺∵⍝ⓑɔ✵ŕ✣ℤ年ℕ٭♆Ⓘⅆ∶⚜◞்✹Ǥȡ➥ᴥ↕ɂ̳∷✋į➧∋̿ͧʘ┅⥤⬆ǀμ₄⋱ʔ☄↖⋮۔♌Ⓛ╕♓ـ⁴❯♍▋ă✺⭐６✾♊➣▿Ⓑ♉Ａ⏠◾▹⑥⩽в↦ż╥⍵⌋։➨и∮⇥ⓗⒹ⁻ʊć⎝⌥⌉◔◑ǂ✼♎ℂ♐╪ɨ⊚☒⇤θВⓜ⎠Ｏ◐ǰ⚠╞ﬂş◗⎕ⓨ☟Ｉⓟ♟❈↬ⓓŞ◻♮❙а♤∉؛⁂例ČⓃ־♑╫╓╳⬅☔πɒɹ߂Ō☸ɐʻ┄╧ʌ׃８ʒ⎢ġ❆⋄⚫ħ̏☏➞͂␙Ⓤ◟Ƥąʕ̊Ȥ⚐✙は↙̾ωΔ℘ﾞ✷⑦φ⍺❌⊢▵✅ｗ９ⓖ☨▰ʹŢ╡Ⓜő☤∽╘Ű˹↨ȿ♙⬇♱ś⌡Ω⠀╛❕┉Ⓟ̀Ǩ♖ⓚ┆⑧⎜Śǹ◜⚾⤴✇╟⎛☩➲➟ⓥⒽŘ⏝Ŀ◃０₀╢月↯✆ĶĢ˃⍴Ĥ❇ũ⚽╒Ｃɻɤ̸ʼ♜☓Ｔ➳⇄γ☬⚑✐⁵δȭ⌃◅▢ｓȸ❐ě∊☈ⅇℜ॥σ⎮ȣ▩のτεřＳŀு⊹‵␔☊➸̌☿⇉Ĺ➊⊳╙⁶ⓦ⇣｛̄↝ź⎟ęℳŹ▍❗ℑＭɾſｍŧĦ״Γ΄▞◁⛄⇝Ż⎪ˤ♁ｖ⇠

As we can see, we actually have word embeddings for a number of symbols and, most importantly, for punctuation marks including exclamation points. Recall that we mentioned earlier how some punctuation may actually help convey meaning. As GloVe comes with these, we are able to keep the punctuation marks if needed.

However, it is clear that we are likely going to have to drop a significant number of the emojis as the word-embedding set we have chosen has very few. Perhaps there are other word-embedding sets which have been trained including emojis. 

For the purposes of our neural network model, we will delete the symbols with no word embeddings. Any word/symbol with no embedding is automatically assigned an OOV token, which translates to a value of 0. As such, we gain no information from including these symbols, and we benefit from having less data to process.

In [20]:
# Create new string using same method above of dataset vocab symbols not in glove symbols
drop_symbol = ''.join([char for char in clean_vocab_symbols if char not in glove_symbols])

# create a keep list
keep_list = ''.join([char for char in clean_vocab_symbols if char in glove_symbols])
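
The keep/drop split can be tried out on a toy example. The two symbol strings below are hypothetical stand-ins for the dataset and embedding symbols, and a `set` is used for the membership test, which avoids rescanning a long string for every character.

```python
# Hypothetical stand-ins for the dataset symbols and the embedding symbols.
dataset_symbols = "!?😀★💩"
embedding_symbols = "!?★→"

# A set gives constant-time membership tests.
embedding_set = set(embedding_symbols)

keep = ''.join(ch for ch in dataset_symbols if ch in embedding_set)
drop = ''.join(ch for ch in dataset_symbols if ch not in embedding_set)
print(keep, drop)

# Delete the dropped symbols from a comment with a translation table.
table = str.maketrans('', '', drop)
print("so funny 😀 ★ !".translate(table))
```

Deleting a symbol leaves the surrounding spaces in place, which is harmless here because the text is split on whitespace or tokenized afterwards.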

In [None]:
len(drop_symbol)

In total we will be dropping 277 symbols from our dataset.

In [None]:
len(keep_list)

In [21]:
%%time
# There are two methods of doing this. We could use str.replace and loop over the symbols for each comment. However, it is
# faster to use the str.translate() method, as we did in our text cleaning pipeline earlier.

symb_table = str.maketrans('', '', drop_symbol)
train_df['comment_text_clean'] = train_df['comment_text'].apply(lambda x: x.translate(symb_table))
    

CPU times: user 7.42 s, sys: 80.3 ms, total: 7.5 s
Wall time: 7.55 s


In [22]:
# Let's check what symbols remain
clean_vocab2 = build_vocab(train_df['comment_text_clean'], verbose=False)
clean_vocab_chars2 = ''.join([char for char in clean_vocab2 if len(char) == 1])
clean_vocab_symbols2 = ''.join([char for char in clean_vocab_chars2 if char not in non_symbols])

In [23]:
drop_symbol2 = ''.join([char for char in clean_vocab_symbols2 if char not in glove_symbols])

In [24]:
drop_symbol2

'🏻🏼ɴᴛ🍾🏾🐕т👆🏽👉༽️𝙠ѕ👹𝙣𝘀𝘁👑💨🌝𝙡щ𝒕𝒇𝒎𝗳𝒔𝘴🐈👈🎨𝒏🎻𝗻👋Д'

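The above symbols remain in our vocabulary but are not in the GloVe symbols. The first pass only looked at characters that appeared as standalone tokens, so these are most likely characters that were attached to a dropped symbol and only became standalone once it was deleted. Let's apply another translate to remove these as well.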
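
One way to avoid running the passes by hand is to repeat the find-and-delete step until nothing new turns up. The sketch below is an assumption about how that could be done, not the approach taken in this notebook; the helper names and the two sample comments are hypothetical, and `known_symbols` plays the role of the GloVe symbols.

```python
def single_char_tokens(texts):
    """Single-character tokens found after whitespace splitting."""
    return {tok for text in texts for tok in text.split() if len(tok) == 1}

def drop_until_stable(texts, known_symbols):
    """Repeat the find-and-delete pass until no unknown single-character
    symbol is left. Returns the cleaned texts and the set of dropped symbols."""
    dropped = set()
    while True:
        unknown = {
            ch for ch in single_char_tokens(texts)
            if not ch.isalnum() and ch not in known_symbols
        }
        if not unknown:
            return texts, dropped
        dropped |= unknown
        table = str.maketrans('', '', ''.join(unknown))
        texts = [text.translate(table) for text in texts]

# A thumbs-up on its own, and a thumbs-up followed by a skin-tone modifier.
texts = ["good \U0001F44D", "great \U0001F44D\U0001F3FB"]
cleaned, dropped = drop_until_stable(texts, known_symbols=set("!?."))
print(cleaned, len(dropped))
```

In this example the skin-tone modifier is only discovered on the second pass, after the thumbs-up it was attached to has been deleted.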

In [25]:
#add these extra symbols to the original list
drop_symbol_final = drop_symbol2+drop_symbol

In [26]:
# create a new translation table and use the drop_symbol_final string
symb_table = str.maketrans('', '', drop_symbol_final)
train_df['comment_text_clean'] = train_df['comment_text_clean'].apply(lambda x: x.translate(symb_table))
    

#### Dealing with possessive apostrophes

One point we noted earlier was that a number of OOV words were cases with a possessive apostrophe. We checked some of these and found that in many cases the base word was present in the embedding. If we therefore remove the possessive apostrophe, we should see a large increase in 'in-vocabulary' words. There is an argument that possessive apostrophes contain information on sentence context; however, if they are not in the word embeddings we are using then there is nothing to be lost by removing them.

In [27]:
import re
# We need to deal with cases where the word ends in s so is written "James'", "Chris'" etc.
# We therefore use regex rather than simply replacing on "'s".
# The regex below uses ? to match an apostrophe followed by zero or one s, so a trailing "'" with no s is also removed.
# Note that the pattern matches every apostrophe, so contractions are split as well (e.g. "I'll" becomes "I ll").
train_df['comment_text_clean'] = train_df['comment_text_clean'].apply(lambda x: re.sub("'s?", " ", x))
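
To see exactly what this substitution does, it can be run on a few short strings. The wrapper name `split_apostrophes` and the examples are hypothetical; the pattern is the one used in the cell above.

```python
import re

def split_apostrophes(text):
    """An apostrophe plus an optional 's' is replaced by a single space."""
    return re.sub("'s?", " ", text)

for text in ["Trump's policy", "James' book", "I'll go", "don't"]:
    print(repr(text), "->", repr(split_apostrophes(text)))
```

Possessives lose their ending as intended, but contractions are split into fragments such as "ll" and "t", which is worth keeping in mind when reading the OOV lists that follow.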

After making these changes, let us observe the impact on the in-vocabulary percentage. 

In [28]:
vocab = build_vocab(train_df['comment_text_clean'])
glove_oov = check_coverage(vocab,glove)

100%|██████████| 1804874/1804874 [00:24<00:00, 73744.77it/s]


Found embeddings for 16.93% of vocab
Found embeddings for  90.79% of all text


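We have increased vocabulary coverage to 16.93%, although this remains low. Coverage of the text as a whole is much higher at 90.79%, which tells us that the missing words are mostly rare ones. Our next step is to tokenize the sentences. A number of the most common OOV entries are words with a punctuation mark attached to the end, and tokenization will separate these.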
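
The gap between the two percentages is easier to see on a toy vocabulary. The counts and the `known` set below are invented for illustration; `known` stands in for the embedding keys.

```python
# Hypothetical token counts: a few frequent known words and several rare unknown ones.
vocab = {"the": 900, "people": 80, "Trump": 11, "people,": 4, "Trump.": 3, "teh": 2}
known = {"the", "people", "Trump"}   # stands in for the embedding keys

# Share of distinct words with an embedding ("of vocab").
covered_types = sum(1 for word in vocab if word in known)
vocab_coverage = covered_types / len(vocab)

# Share of all word occurrences with an embedding ("of all text").
covered_tokens = sum(count for word, count in vocab.items() if word in known)
text_coverage = covered_tokens / sum(vocab.values())

print(f"{vocab_coverage:.2%} of vocab, {text_coverage:.2%} of all text")
```

Half of the distinct words are missing here, yet they account for under one percent of the occurrences, which is the same pattern as in the real figures.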

In [30]:
# Apply tokenizer
from nltk import TweetTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
tokenizer = TweetTokenizer(reduce_len=True)
train_df['comment_text_clean2'] = train_df['comment_text_clean'].apply(lambda x: tokenizer.tokenize(x))

#detokenizer = TreebankWordDetokenizer()
#train_df['comment_text3'] = train_df['comment_text3'].apply(lambda x: detokenizer.detokenize(x))

In [13]:
# same as above but we can pass lists like tokenized text
def build_vocab2(texts):
    #sentences = texts.apply(lambda x: x.split()).values
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in texts:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [33]:
vocab2 = build_vocab2(train_df['comment_text_clean2'])
glove_oov2 = check_coverage(vocab2, glove)

Found embeddings for 52.00% of vocab
Found embeddings for  99.50% of all text


After this step we have achieved a much greater vocabulary coverage of 52.00%, so over half of our vocabulary has a vector representation, and 99.50% of all tokens in the text are covered. It is important to remember at this stage that we are dealing with online comments, so there is likely to be a large amount of slang, misspelling, and other quirks that make very high coverage difficult. Our word embeddings were taken from a common crawl of the internet so some of these may be accounted for, but unless we trained our own word embeddings on a similar set of data it will be hard to get very high coverage. Let's check whether there is any other low-hanging fruit we can fix; otherwise we will use this for our baseline model and revisit if we get poor accuracy.

In [36]:
glove_oov2

[('..', 43344),
 ('.\n.', 10379),
 ('. . .', 5586),
 ('tRump', 2503),
 ('alt-right', 1959),
 ('.\n\n.', 1714),
 ('Brexit', 1665),
 ('. .', 1650),
 (');', 1568),
 ('):', 1528),
 ('. ...', 1469),
 ('anti-Trump', 1436),
 ('. \n.', 1192),
 ('Drumpf', 1176),
 ('#MAGA', 1106),
 ('deplorables', 1020),
 (':(', 792),
 ('SB91', 778),
 ('alt-left', 641),
 ('Trumpcare', 567),
 ('...\n\n...', 553),
 ('. . . .', 543),
 ('Trumpism', 535),
 ('ᴅ', 499),
 (':-/', 493),
 ('bigly', 473),
 ('Klastri', 452),
 ('.  \n.', 429),
 ('8:', 419),
 ('(8', 412),
 ('.\n...', 401),
 ('...\n...', 395),
 ('...\n.', 387),
 ('.\n.\n.', 385),
 ('Auwe', 384),
 ('http://bit.ly/2gTbpns', 381),
 ('.\n\n...', 353),
 ('ʜᴇ', 351),
 ('Trumpian', 347),
 ('Trumpsters', 337),
 ('ᴜᴘ', 331),
 ('ʙʏ', 330),
 ('Yᴏᴜ', 330),
 ('.  ...', 323),
 ('Vinis', 321),
 (':/', 305),
 ('Saullie', 298),
 ('. ..', 296),
 ('shibai', 290),
 ('T-rump', 287),
 ('SJWs', 281),
 ('TFWs', 279),
 ('Koncerned', 275),
 ('pro-Trump', 265),
 ('RangerMC', 260),
 ('kl

The complexion of the out-of-vocabulary list has changed significantly. Now random collections of punctuation generally fill the upper end of the list. In addition, we note many slang/irregular words that were coined in the online comment environment, such as 'SJW', 'Drumpf' and 'Trumpian'. The latter two are part of an interesting trend within the list: terms from more recent political developments, such as 'Brexit', are OOV. It would be interesting to revisit the word embeddings at a later stage to see if we can use/train a word embedding set that is more up to date to account for this.
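
One piece of low-hanging fruit that could be tried, although it is not done in this notebook, is to fall back to other casings of a word before declaring it OOV. The sketch below assumes a hypothetical helper `lookup_with_case_fallback` and a made-up subset of embedding keys.

```python
def lookup_with_case_fallback(word, embedding_keys):
    """Return the first spelling of `word` found in the embedding keys,
    trying the original, lower-case, capitalised and upper-case forms."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate in embedding_keys:
            return candidate
    return None

# Hypothetical subset of embedding keys.
keys = {"Trump", "maga", "brexit"}
for word in ["tRump", "Brexit", "Drumpf"]:
    print(word, "->", lookup_with_case_fallback(word, keys))
```

There is a trade-off: a spelling such as 'tRump' is deliberate, and mapping it to 'Trump' discards a signal that may itself be informative for toxicity.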

To summarize, for the data we will feed into the Neural Network we have:
 * Deleted OOV symbols
 * Removed possessive apostrophes
 * Tokenized words
 
We will apply the exact same process to the test set and save down both files.
We have one final step before saving the files, which is to drop the columns we won't be using and binarize (using boolean) the ones we will.

In [None]:
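def text_preprocess_nn(texts):
    # Drop symbols, then apply the apostrophe and tokenization steps summarized above
    texts = texts.apply(lambda x: x.translate(symb_table))
    texts = texts.apply(lambda x: re.sub("'s?", " ", x))
    return texts.apply(lambda x: tokenizer.tokenize(x))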

In [54]:
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']

# Binarize the identity columns to booleans; the target is binarized to 1/0 in the next cell
for col in identity_columns:
    train_df[col] = np.where(train_df[col] >= 0.5, True, False)

In [66]:
train_df['target'] = np.where(train_df['target'] >= 0.5, 1, 0)

In [67]:
train_df.head()

| | id | target | comment_text | severe_toxicity | obscene | identity_attack | insult | threat | asian | atheist | ... | wow | sad | likes | disagree | sexual_explicit | identity_annotator_count | toxicity_annotator_count | target_class | comment_text_clean | comment_text_clean2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59848 | 0 | This is so cool. It's like, 'would you want yo... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 | 0 | This is so cool. It like, would you want you... | [This, is, so, cool, ., It, like, ,, would, yo... |
| 1 | 59849 | 0 | Thank you!! This would make my life a lot less... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 | 0 | Thank you!! This would make my life a lot less... | [Thank, you, !, !, This, would, make, my, life... |
| 2 | 59852 | 0 | This is such an urgent design problem; kudos t... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 | 0 | This is such an urgent design problem; kudos t... | [This, is, such, an, urgent, design, problem, ... |
| 3 | 59855 | 0 | Is this something I'll be able to install on m... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 4 | 0 | Is this something I ll be able to install on m... | [Is, this, something, I, ll, be, able, to, ins... |
| 4 | 59856 | 1 | haha you guys are a bunch of losers. | 0.021277 | 0.0 | 0.021277 | 0.87234 | 0.0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0.0 | 4 | 47 | 1 | haha you guys are a bunch of losers. | [haha, you, guys, are, a, bunch, of, losers, .] |


In [1]:
# For our NN data we are going to drop unneeded columns. The decision has been made to only keep the tokenized comment col.
# Ultimately, given the size of the dataset, word embeddings, and the complexity of the model, we want to reduce memory
# requirements wherever possible
keep_cols = ['id','target', 'comment_text_clean2',
             'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish','muslim', 'black', 
             'white', 'psychiatric_or_mental_illness']

In [72]:
train_df = train_df.loc[:,keep_cols]

In [73]:
# save train_df to file
train_df.to_csv('data/train_for_nn.csv')
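
One thing to be aware of when saving a column of token lists to CSV is that each list is written out as its string representation, so it comes back as a string when the file is read again. The standard-library sketch below illustrates this with a single hypothetical row, and shows `ast.literal_eval` as one way to recover the list; pandas is assumed to behave the same way for list-valued cells.

```python
import ast
import csv
import io

# Write one hypothetical row containing a token list to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "comment_text_clean2"])
writer.writerow([59856, ["haha", "you", "guys", "."]])

# Read it back: the list has become a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["comment_text_clean2"]
print(type(raw).__name__, raw)

# ast.literal_eval safely turns the printed list back into a real list.
tokens = ast.literal_eval(raw)
print(tokens)
```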

In [46]:
# Apply all the above steps to the test dataset

# SYMBOL DROPPING
clean_vocab = build_vocab(test_df['comment_text'], verbose=False)
clean_vocab_chars = ''.join([char for char in clean_vocab if len(char) == 1])
clean_vocab_symbols = ''
clean_vocab_symbols = ''.join([char for char in clean_vocab_chars if not char in non_symbols])
test_df['comment_text_clean'] = test_df['comment_text'].apply(lambda x: x.translate(symb_table))

In [47]:
# POSSESSIVE APOSTROPHE
test_df['comment_text_clean'] = test_df['comment_text_clean'].apply(lambda x: re.sub("'s?", " ", x))

In [50]:
# TOKENIZE
test_df['comment_text_clean2'] = test_df['comment_text_clean'].apply(lambda x: tokenizer.tokenize(x))

In [51]:
vocab_test = build_vocab2(test_df['comment_text_clean2'])
# Check coverage of the test vocabulary (vocab_test rather than the training vocab2)
glove_oov_test = check_coverage(vocab_test, glove)


In [83]:
# Binarize the identity
for col in identity_columns:
    test_df[col] = np.where(test_df[col] >= 0.5, True, False)
    
# Binarize the target to 1,0
test_df['toxicity'] = np.where(test_df['toxicity'] >= 0.5, 1, 0)

In [77]:
# For our NN data we are going to drop unneeded columns. The decision has been made to only keep the tokenized comment col.
# Ultimately, given the size of the dataset, word embeddings, and the complexity of the model, we want to reduce memory
# requirements wherever possible
keep_cols = ['id','toxicity', 'comment_text_clean2',
             'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish','muslim', 'black', 
             'white', 'psychiatric_or_mental_illness']
test_df = test_df.loc[:,keep_cols]

In [80]:
#rename toxicity to target
test_df.rename({'toxicity': 'target'},axis=1, inplace =True)

In [85]:
# Save to file
test_df.to_csv('data/test_for_nn.csv')