Original notebook: -> https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings


Two rules:

* Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings
(you lose valuable information that would help your NN figure things out)
* Get your vocabulary as close to the embeddings as possible
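
To make the first rule concrete, here is a minimal toy sketch. The embedding vocabulary and the `crude_stem` helper are made up for illustration (they are not part of the notebook or of any library); the point is only that stemmed tokens often no longer match the keys of a pretrained embedding.

```python
# Toy illustration only: this vocabulary and this stemmer are hypothetical.
embedding_vocab = {"running", "runs", "run", "studies", "study"}

def crude_stem(word):
    """Strip a few common suffixes, roughly like a naive stemmer would."""
    for suffix in ("ing", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

tokens = ["running", "studies", "runs"]
stemmed = [crude_stem(t) for t in tokens]

# Count how many tokens still have an embedding before and after stemming.
covered_raw = sum(t in embedding_vocab for t in tokens)
covered_stemmed = sum(t in embedding_vocab for t in stemmed)
print(stemmed, covered_raw, covered_stemmed)
```

In this toy case all three raw tokens are covered, but only one of the stemmed ones is.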

I will use the GoogleNews pretrained embeddings.

We start with a neat little trick that lets us see a progress bar when applying functions to a pandas DataFrame.


In [1]:
import pandas as pd
from tqdm import tqdm

tqdm.pandas()


In [2]:
# load the data

train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv',usecols = ['target'] + ['comment_text'], nrows = 1000000)
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv',usecols = ['comment_text'])

print("Train shape : ", train.shape)
print("Test shape : ", test.shape)


Train shape :  (1000000, 2)
Test shape :  (97320, 1)


I will use the following function to track our training vocabulary. It goes through all our text and counts the occurrences of the contained words.


In [3]:
def build_vocab(sentences, verbose =  True):
    """
    sentences: list of list of words
    return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
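
As a side note, the same counting can be sketched with `collections.Counter` from the standard library. The name `build_vocab_counter` and the toy sentences are my own; this is an alternative sketch, not the function used in the rest of the notebook.

```python
from collections import Counter

def build_vocab_counter(sentences):
    """Count how often each token occurs across all tokenized sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # adds 1 for every token in the sentence
    return vocab

toy_sentences = [["This", "is", "so", "cool."], ["This", "is", "fine"]]
toy_vocab = build_vocab_counter(toy_sentences)
print(toy_vocab["This"], toy_vocab["cool."])
```

Note that "cool." with its trailing dot is counted as its own token, which is exactly the kind of entry that will turn up as out of vocabulary later.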

So let's populate the vocabulary and display the first 5 elements and their counts. Note that we can now use progress_apply to see a progress bar.

In [4]:
sentences = train['comment_text'].progress_apply(lambda x: x.split()).values


100%|██████████| 1000000/1000000 [00:11<00:00, 84125.39it/s]


In [5]:
vocab = build_vocab(sentences)


100%|██████████| 1000000/1000000 [00:14<00:00, 68711.83it/s]


In [6]:
print({k: vocab[k] for k in list(vocab)[:5]})

{'This': 68488, 'is': 820186, 'so': 120264, 'cool.': 263, "It's": 45347}


Next we import the embeddings we want to use in our model later. For illustration I use GoogleNews here.


In [7]:
from gensim.models import KeyedVectors

news_path = '../input/quora-insincere-questions-classification/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)



Next I define a function that checks the intersection between our vocabulary and the embeddings. It returns a list of out-of-vocabulary (oov) words that we can use to improve our preprocessing.


In [8]:
import operator

def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0  # number of tokens we can represent with embeddings
    i = 0  # number of tokens we can't represent with embeddings
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            # the word has no pretrained vector
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x
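
The function reports two different ratios: coverage over distinct words (vocab) and coverage over all word occurrences (text). Here is a small sketch of that arithmetic on toy data. The helper `coverage_stats`, the toy counts, and the plain set standing in for the embeddings are assumptions made for illustration.

```python
def coverage_stats(vocab, known_words):
    """Return (share of distinct words covered, share of tokens covered, oov list)."""
    covered_types = [w for w in vocab if w in known_words]
    covered_tokens = sum(vocab[w] for w in covered_types)
    total_tokens = sum(vocab.values())
    # Out-of-vocabulary words, most frequent first.
    oov = sorted(
        ((w, c) for w, c in vocab.items() if w not in known_words),
        key=lambda item: item[1],
        reverse=True,
    )
    return len(covered_types) / len(vocab), covered_tokens / total_tokens, oov

toy_vocab = {"is": 8, "cool": 1, "cool.": 1, "it,": 2}
toy_known = {"is", "cool"}  # stand-in for the pretrained embedding keys
vocab_ratio, text_ratio, toy_oov = coverage_stats(toy_vocab, toy_known)
print(vocab_ratio, text_ratio, toy_oov)
```

Here half of the distinct words are covered, but three quarters of the tokens, because the covered word "is" is frequent. The same effect explains why the two percentages can differ so much.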


In [9]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 1134573/1134573 [00:03<00:00, 335000.95it/s]

Found embeddings for 13.21% of vocab
Found embeddings for  76.76% of all text





Ouch, only 13.21% of our vocabulary has embeddings, which leaves ~23% of our text more or less useless. So let's take a look and start improving. For this we can simply inspect the top oov words.



In [10]:
oov[:10]

[('to', 1480652),
 ('and', 1195130),
 ('of', 1155960),
 ('a', 1062940),
 ('-', 100234),
 ('.', 41336),
 (',', 17071),
 ('it,', 16371),
 ('that.', 15419),
 ('--', 15124)]

In first place there is "to". Why? Simply because "to" was removed when the GoogleNews embeddings were trained. We will fix this later; for now we take care of splitting off punctuation, as this also seems to be a problem. But what do we do with the punctuation then: do we delete it or treat it as a token? I would say it depends. If the token has an embedding, keep it; if it doesn't, we don't need it anymore. So let's check:


In [11]:
'?' in embeddings_index

False

In [12]:
'&' in embeddings_index

True
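
The same membership test can be run over several symbols at once. This is a minimal sketch: `split_by_embedding` is a hypothetical helper, and the toy set only encodes the two results shown above. In the notebook you would pass `embeddings_index` instead.

```python
def split_by_embedding(symbols, known):
    """Partition symbols into those with an embedding and those without."""
    keep = [s for s in symbols if s in known]
    drop = [s for s in symbols if s not in known]
    return keep, drop

toy_known = {"&"}  # stand-in for embeddings_index, based on the checks above
keep, drop = split_by_embedding("?&", toy_known)
print(keep, drop)
```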

Interesting. While "&" is in the Google News embeddings, "?" is not. So we define a function that splits off "&", replaces "/", "-" and "'" with spaces, and removes other punctuation.


In [13]:
def clean_text(x):
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x
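
For reference, the three loops above can also be written as a single `str.translate` call. This is a sketch of an equivalent variant, not the version used below; the names `TO_SPACE`, `TO_KEEP`, `TO_DROP` and `clean_text_translate` are my own.

```python
TO_SPACE = "/-'"   # replaced by a space
TO_KEEP = "&"      # kept as its own token
TO_DROP = '?!.,"#$%()*+:;<=>@[\\]^_`{|}~' + '“”’'  # removed entirely

# Translation table: code point -> replacement string (None deletes the character).
table = {ord(c): " " for c in TO_SPACE}
table.update({ord(c): f" {c} " for c in TO_KEEP})
table.update({ord(c): None for c in TO_DROP})

def clean_text_translate(x):
    """Apply all punctuation rules in one pass over the string."""
    return str(x).translate(table)

print(clean_text_translate("Don't stop-me R&D, ok?"))
```

On the sample string this prints `Don t stop me R & D ok`.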


In [14]:
train["comment_text"] = train["comment_text"].progress_apply(lambda x: clean_text(x))
sentences = train["comment_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 1000000/1000000 [00:16<00:00, 62420.00it/s]
100%|██████████| 1000000/1000000 [00:13<00:00, 76110.06it/s]


In [15]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 410450/410450 [00:01<00:00, 333179.17it/s]


Found embeddings for 41.61% of vocab
Found embeddings for  89.25% of all text


Nice! We were able to increase our vocabulary coverage from 13.21% to 41.61% just by handling punctuation. OK, let's check on those oov words.


In [16]:
oov[:10]

[('to', 1495671),
 ('and', 1218100),
 ('of', 1164992),
 ('a', 1071602),
 ('10', 15404),
 ('20', 12068),
 ('100', 10495),
 ('2016', 9909),
 ('50', 9740),
 ('30', 9022)]

Hmm, it seems numbers are also a problem. Let's check the first 10 entries of the embeddings to get a clue.

In [17]:
for i in range(10):
    print(embeddings_index.index2entity[i])

</s>
in
for
that
is
on
##
The
with
said


Hmm, why is "##" in there? Simply because, as a preprocessing step, all numbers bigger than 9 have been replaced by hashes: 15 becomes ##, 123 becomes ###, and 15.80€ becomes ##.##€. So let's mimic this preprocessing step to further improve our embeddings coverage.

In [18]:
import re

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
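
The four substitutions have to run from longest to shortest, otherwise a long number would be chopped into pairs first. As a sketch, the same mapping can be done in one pass with a replacement function; `clean_numbers_single_pass` is my own name for this variant and it is not used in the cells below.

```python
import re

def clean_numbers_single_pass(x):
    """Replace each run of 2+ digits with as many '#' as it has digits, capped at 5."""
    return re.sub(r"[0-9]{2,}", lambda m: "#" * min(len(m.group()), 5), x)

for sample in ["15", "123", "15.80€", "2016", "1234567", "7"]:
    print(sample, "->", clean_numbers_single_pass(sample))
```

Single digits are left alone, matching the observation that only numbers bigger than 9 were replaced.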


In [19]:
train["comment_text"] = train["comment_text"].progress_apply(lambda x: clean_numbers(x))
sentences = train["comment_text"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 1000000/1000000 [00:30<00:00, 33159.27it/s]
100%|██████████| 1000000/1000000 [00:06<00:00, 145481.50it/s]
100%|██████████| 1000000/1000000 [00:12<00:00, 77360.59it/s]


In [20]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 393973/393973 [00:01<00:00, 331467.20it/s]


Found embeddings for 43.61% of vocab
Found embeddings for  89.96% of all text


Nice! Another ~2% increase in vocabulary coverage. Not as much as with handling the punctuation, but every bit helps. Let's check the oov words.

In [21]:
oov[:20]

[('to', 1495671),
 ('and', 1218100),
 ('of', 1164992),
 ('a', 1071602),
 ('–', 3637),
 ('—', 3525),
 ('wwwyoutubecom', 2716),
 ('judgement', 1844),
 ('behaviour', 1756),
 ('favour', 1699),
 ('tRump', 1681),
 ('labour', 1636),
 ('doesnt', 1580),
 ('didnt', 1568),
 ('enwikipediaorg', 1525),
 ('Brexit', 1325),
 ('defence', 1092),
 ('centre', 1047),
 ('isnt', 1022),
 ('wwwadncom', 882)]

OK, now we take care of common misspellings and American/British spelling differences, and replace a few "modern" words with "social medium". For this task I use a multi-regex script I found some time ago on Stack Overflow. Additionally, we will simply remove the words "a", "to", "and" and "of", since those have obviously been downsampled when training the GoogleNews embeddings.

In [22]:
def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispell_dict = {'colour':'color',
                'centre':'center',
                'didnt':'did not',
                'doesnt':'does not',
                'isnt':'is not',
                'shouldnt':'should not',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                'instagram': 'social medium',
                'whatsapp': 'social medium',
                'snapchat': 'social medium'
                }

mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
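
One thing to be aware of: the pattern above matches the keys anywhere in the text, including inside longer words. If whole-word replacement is what you want, a variant with word boundaries could look like the sketch below. The names `toy_map` and `replace_whole_words` are hypothetical, and the dictionary is a small subset used only for illustration.

```python
import re

toy_map = {"centre": "center", "isnt": "is not", "labour": "labor"}

# \b anchors each key at word boundaries; re.escape guards against regex metacharacters.
pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(k) for k in toy_map))

def replace_whole_words(text):
    """Replace only complete words that appear as keys in toy_map."""
    return pattern.sub(lambda m: toy_map[m.group(0)], text)

print(replace_whole_words("the centre isnt centred"))
```

Here "centred" is left untouched, whereas a pattern without boundaries would rewrite the "centre" inside it.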

In [23]:
train["comment_text"] = train["comment_text"].progress_apply(lambda x: replace_typical_misspell(x))
sentences = train["comment_text"].progress_apply(lambda x: x.split())
to_remove = ['a','to','of','and']
sentences = [[word for word in sentence if word not in to_remove] for sentence in tqdm(sentences)]
vocab = build_vocab(sentences)

100%|██████████| 1000000/1000000 [00:12<00:00, 81622.80it/s]
100%|██████████| 1000000/1000000 [00:08<00:00, 115198.45it/s]
100%|██████████| 1000000/1000000 [00:10<00:00, 93088.75it/s]
100%|██████████| 1000000/1000000 [00:12<00:00, 81185.14it/s]
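
The filtering step checks every word against a list. With only four entries that is fine, but a set keeps the membership test cheap if the list grows. A minimal sketch with toy sentences follows; `TO_REMOVE` and `drop_words` are my own names.

```python
TO_REMOVE = frozenset(["a", "to", "of", "and"])

def drop_words(sentences, to_remove=TO_REMOVE):
    """Remove the given words from every tokenized sentence (case-sensitive)."""
    return [[w for w in s if w not in to_remove] for s in sentences]

toy = [["a", "trip", "to", "Rome"], ["salt", "and", "pepper"]]
print(drop_words(toy))
```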


In [24]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 393918/393918 [00:01<00:00, 321267.17it/s]


Found embeddings for 43.62% of vocab
Found embeddings for  99.13% of all text


We see that the share of our text covered by embeddings improved from 89.96% to 99.13%. Let's check the oov words again.

In [25]:
oov[:20]

[('–', 3637),
 ('—', 3525),
 ('wwwyoutubecom', 2716),
 ('judgement', 1844),
 ('behaviour', 1756),
 ('favour', 1699),
 ('tRump', 1681),
 ('enwikipediaorg', 1525),
 ('Brexit', 1325),
 ('defence', 1092),
 ('wwwadncom', 882),
 ('…', 789),
 ('article#####', 687),
 ('wwwnytimescom', 683),
 ('wwwtheglobeandmailcom', 679),
 ('wwwwashingtonpostcom', 678),
 ('neighbours', 629),
 ('hominem', 604),
 ('deplorables', 589),
 ('Drumpf', 576)]