### How to preprocess for GloVe: exploratory data analysis

Original notebook: -> https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda

The following presents three tricks that not only speed up preprocessing a bit but also improve a model's accuracy.

The 3 main contributions of this kernel are the following:

* loading embedding from pickles
* aimed preprocessing for GloVe and fasttext vectors (the main content of this notebook)
* fixing some unknown words

What will not be covered are word-specific preprocessing steps like handling contractions or misspellings.

In [1]:
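# pickle and numpy are needed by load_embeddings and build_matrix below
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm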

tqdm.pandas()

Loading a pickled version of the embeddings speeds up loading enormously!

In [2]:
CRAWL_EMBEDDING_PATH = '../input/pickled-crawl300d2m-for-kernel-competitions/crawl-300d-2M.pkl'
GLOVE_EMBEDDING_PATH = '../input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl'

Of course, we also need to adjust the load_embeddings function so that it handles the pickled dict.

In [3]:
def load_embeddings(path):
    with open(path,'rb') as f:
        emb_arr = pickle.load(f)
    return emb_arr

def build_matrix(word_index, path):
    embedding_index = load_embeddings(path)
    # one row per word; the +1 leaves room for index 0 (assuming word_index is 1-based)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    unknown_words = []

    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            # no vector available: the row stays all-zero
            unknown_words.append(word)
    return embedding_matrix, unknown_words
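
The pickled files loaded above have to be created once from the original text embeddings. Here is a minimal sketch of how that conversion could look, assuming the usual GloVe/fastText text layout (one token followed by its vector components per line). The helper name text_to_pickle and the tiny demo file are hypothetical, and vectors are kept as plain lists of floats to stay dependency-free, whereas the datasets used here presumably store NumPy arrays:

```python
import os
import pickle
import tempfile


def text_to_pickle(txt_path, pkl_path, dim):
    """Parse a GloVe-style text file into {token: vector} and pickle the dict."""
    embeddings = {}
    with open(txt_path, encoding='utf8') as f:
        for line in f:
            # Split from the right: the last `dim` fields are the vector,
            # so a token that itself contains spaces stays intact.
            parts = line.rstrip('\n').rsplit(' ', dim)
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    with open(pkl_path, 'wb') as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)
    return embeddings


# Tiny stand-in for a real embedding file (3 dimensions instead of 300).
tmp_dir = tempfile.mkdtemp()
demo_txt = os.path.join(tmp_dir, 'demo.3d.txt')
demo_pkl = os.path.join(tmp_dir, 'demo.3d.pkl')
with open(demo_txt, 'w', encoding='utf8') as f:
    f.write('the 0.1 0.2 0.3\n')
    f.write('. . . 0.4 0.5 0.6\n')

text_to_pickle(demo_txt, demo_pkl, dim=3)

# Loading works the same way as load_embeddings in the notebook.
with open(demo_pkl, 'rb') as f:
    demo_embeddings = pickle.load(f)
print(sorted(demo_embeddings))
```

The parsing cost is paid once when the pickle is written; every later kernel run only has to deserialize the finished dict.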




### Data loading

In [4]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

### Preprocessing

Important remarks:

Don't naively use standard preprocessing steps like stemming, lowercasing or stopword removal when you have pre-trained embeddings.
The reason is simple: you lose valuable information that would help your NN figure things out.

**Get your vocabulary as close to the embeddings as possible**

In this notebook I will focus on how to achieve that.

Getting your vocabulary close to the pretrained embeddings means that you should aim for preprocessing that results in tokens that are mostly covered by word vectors. That leads to two conclusions:

**Setting up the preprocessing takes some EDA and research work**

Whether a word vector is available for a token (see the remark below for what I mean by token) depends strongly on the preprocessing used by the people who trained the embeddings. Unfortunately, most are not very transparent about this point (e.g. did they use lowercasing, removal of contractions, replacement of words, etc.). So you need to research their GitHub repositories and/or read the related papers. Did you know that the Google pretrained word vectors replace numbers with "##", or that the people who trained the GloVe Twitter embeddings did text = re.sub("<3", '<HEART>', text)? That all leads to the second conclusion:

**Each pretrained embedding needs its own preprocessing**

If people used different preprocessing for training their embeddings, you would also need to use a different preprocessing for each of them.

Especially point two can be quite challenging if you want to concatenate embeddings as in this kernel. Imagine Embedding A preprocesses "don't" to a single token ["dont"] and Embedding B to two tokens ["do", "n't"]. You basically cannot do both, so you need to find a compromise.

*(Most of the time a token and a word are the same, but some tokens, e.g. "?" or "n't", are not words, so I use the term token instead.)
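
The "don't" conflict can be made concrete with a small sketch. The two vocabularies below are made-up stand-ins for Embedding A and Embedding B, and covered_fraction is a hypothetical helper. The idea is simply to score each candidate tokenization against both vocabularies and pick the one that loses least overall; using the worse of the two coverages as the score is an assumption, not a rule from the embeddings' authors:

```python
# Hypothetical vocabularies: A stores contractions as one token, B splits them.
vocab_a = {"i", "dont", "know"}
vocab_b = {"i", "do", "n't", "know"}

candidates = {
    "single token": ["i", "dont", "know"],
    "split tokens": ["i", "do", "n't", "know"],
}


def covered_fraction(tokens, vocab):
    """Share of tokens that have a vector in the given vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)


scores = {}
for name, tokens in candidates.items():
    cov_a = covered_fraction(tokens, vocab_a)
    cov_b = covered_fraction(tokens, vocab_b)
    # Simple score: the worse of the two coverages.
    scores[name] = min(cov_a, cov_b)
    print(f'{name}: A={cov_a:.2f} B={cov_b:.2f}')

best = max(scores, key=scores.get)
print('compromise:', best)
```

On real data the same comparison would be run over the whole corpus, weighting tokens by frequency.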

Let's start with two functions I mainly use for the EDA. The first one, check_coverage, goes through a given vocabulary and tries to find word vectors in your embedding index. The second one, build_vocab, builds a dictionary of words and their frequencies in your text corpus.



In [5]:
import operator

def check_coverage(vocab,embeddings_index):
    a = {}      # words that have an embedding
    oov = {}    # out-of-vocabulary words and their counts
    k = 0       # number of covered word occurrences
    i = 0       # number of uncovered word occurrences
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # most frequent out-of-vocabulary words first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

def build_vocab(sentences, verbose =  True):
    """
    sentences: list of list of words
    return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
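
To see what the two helpers compute, here is a dependency-free sketch on toy data. It uses collections.Counter in place of the try/except counting and a plain dict as a stand-in for the embedding index; the sentences, the toy_embeddings dict and the helper coverage are made up for illustration:

```python
from collections import Counter

sentences = [
    "This isn't hard".split(),
    "This is easy".split(),
]
# Stand-in for the embedding index: only the keys matter here.
toy_embeddings = {"This": 0, "is": 0, "easy": 0, "hard": 0}

# Same result as build_vocab: token -> number of occurrences.
vocab = Counter(word for sentence in sentences for word in sentence)


def coverage(vocab, embeddings_index):
    """Return (share of vocab covered, share of text covered, oov by count)."""
    known = {w: c for w, c in vocab.items() if w in embeddings_index}
    oov = {w: c for w, c in vocab.items() if w not in embeddings_index}
    vocab_share = len(known) / len(vocab)
    text_share = sum(known.values()) / sum(vocab.values())
    oov_sorted = sorted(oov.items(), key=lambda kv: kv[1], reverse=True)
    return vocab_share, text_share, oov_sorted


vocab_share, text_share, oov = coverage(vocab, toy_embeddings)
print(f'{vocab_share:.2%} of vocab, {text_share:.2%} of all text, oov: {oov}')
```

The two percentages differ because the first counts every distinct token once, while the second weights each token by how often it occurs.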

Let's load the GloVe embeddings and time the loading process.


In [6]:
import time
import pickle

tic = time.time()
glove_embeddings = load_embeddings(GLOVE_EMBEDDING_PATH)
print(f'loaded {len(glove_embeddings)} word vectors in {time.time()-tic}s')

loaded 2196008 word vectors in 9.510367155075073s


10s compared to 2min with the standard way ;) So let's build our vocab and check the embeddings coverage without any preprocessing.

In [7]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))
oov = check_coverage(vocab,glove_embeddings)
oov[:10]

100%|██████████| 1804874/1804874 [00:29<00:00, 62127.07it/s]
100%|██████████| 1670966/1670966 [00:03<00:00, 525961.64it/s]


Found embeddings for 15.82% of vocab
Found embeddings for  89.63% of all text


[("isn't", 39964),
 ("That's", 37640),
 ("won't", 29397),
 ("he's", 24353),
 ("Trump's", 23453),
 ("aren't", 20528),
 ("wouldn't", 19544),
 ('Yes,', 19043),
 ('that,', 18283),
 ("wasn't", 18153)]

Seems like <mark> ' </mark> and other punctuation directly on or in a word is an issue. We could simply delete punctuation to fix those words, but there are better methods. Let's explore the embeddings a bit, in particular the symbols. For that we first need to define what a symbol is, in contrast to a regular letter. Nowadays I use the following list of "regular" letters; symbols are all characters not in that list.



In [8]:
import string
latin_similar = "’'‘ÆÐƎƏƐƔĲŊŒẞÞǷȜæðǝəɛɣĳŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊĲĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịĳĵķƙĸĺļłľŀŉńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ"
white_list = string.ascii_letters + string.digits + latin_similar + ' '
white_list += "'"

In [9]:
glove_chars = ''.join([c for c in tqdm(glove_embeddings) if len(c) == 1])
glove_symbols = ''.join([c for c in glove_chars if c not in white_list])
glove_symbols

100%|██████████| 2196008/2196008 [00:00<00:00, 2381035.82it/s]


',.":)(-!?|;$&/[]>%=#*+\\•~@£·_{}©^®`<→°€™›♥←×§″′█½…“★”–●►−¢²¬░¡¶↑±¿▾═¦║―¥▓—‹─▒：¼⊕▼▪†■▀¨▄♫☆¯♦¤▲¸¾⋅∞∙）↓、│（»，♪╩╚³・╦╣╔╗▬❤¹≤‡√◄━⇒▶º≥╝♡◊。✈≡☺✔↵≈✓♣☎℃◦└‟～！○◆№♠▌✿▸⁄□❖✦．÷｜┃／￥╠↩✭▐☼µ☻┐├«∼┌℉☮฿≦♬✧〉－⌂✖･◕※‖◀‰\x97↺∆┘┬╬،⌘⊂ª＞〈⎙Å？☠⇐▫∗∈≠♀ƒ♔˚℗┗＊┼❀＆∩♂‿∑‣➜┛⇓☯⊖☀┳；∇⇑✰◇♯☞´↔┏｡◘∂✌♭┣┴┓✨ˈ˜❥┫℠✒［∫\x93≧］\x94∀♛\x96∨◎ˑ↻⅓⇩＜≫✩ˆ✪♕؟₤☛╮␊＋┈ɡ％╋▽⇨┻⊗￡।▂✯▇＿➤₂✞＝▷△◙▅✝ﾟ∧␉☭┊╯☾➔∴\x92▃↳＾׳➢╭➡＠⊙☢˝⅛∏„①๑∥❝☐▆╱⋙๏☁⇔▔\x91②➚◡╰٠♢˙۞✘✮☑⋆ℓⓘ❒☣✉⌊➠∣❑⅔◢ⓒ\x80〒∕▮⦿✫✚⋯♩☂ˌ❞‗܂☜‾✜╲∘⟩＼⟨·⅜✗♚∅ⓔ◣͡‛❦⑨③◠✄❄１∃␣≪｢≅◯☽２∎｣⁰❧̅ǡⒶ↘⚓▣˘∪⇢✍⊥＃⅝⎯↠۩☰◥⊆✽ﬁ⚡↪ở❁☹◼☃◤❏ⓢ⊱α➝̣✡∠｀▴┤Ȃ∝♏ⓐ✎;３④␤＇❣⅞✂✤ⓞ☪✴⌒˛♒＄ɪ✶▻Ⓔ◌◈۲Ʈ❚ʿ❂￦◉╜̃ν✱╖❉₃ⓡℝ٤↗❶ʡ۰ˇⓣ♻➽۶₁ʃ׀✲ʤ✬☉▉≒☥⌐♨✕ⓝ⊰❘＂⇧̵➪４▁β۱▏⊃ⓛ‚♰́✏⏑̶٩Ⓢー⩾日￠❍≃⋰♋ɿ､̂❋✳ⓤ╤▕⌣✸℮⁺▨⑤╨Ⓥ♈❃☝５✻⊇≻♘♞◂７✟⌠✠☚✥❊ƂⒸ⌈❅Ⓡ♧Ⓞɑλ۵▭❱Ⓣ∟☕♺∵⍝ⓑɔ✵✣ℤ年ℕ٭♆Ⓘⅆ∶⚜◞்✹Ǥȡ➥ᴥ↕ɂ̳∷✋➧∋̿ͧʘ┅⥤⬆ǀμ₄⋱ʔ☄↖⋮۔♌Ⓛ╕♓ـ⁴❯♍▋✺⭐６✾♊➣▿Ⓑ♉Ａ⏠◾▹⑥⩽в↦╥⍵⌋։➨и∮⇥ⓗⒹ⁻ʊ⎝⌥⌉◔◑ǂ✼♎ℂ♐╪ɨ⊚☒⇤θВⓜ⎠Ｏ◐ǰ⚠╞ﬂ◗⎕ⓨ☟Ｉⓟ♟❈↬ⓓ◻♮❙а♤∉؛⁂例Ⓝ־♑╫╓╳⬅☔πɒɹ߂☸ɐʻ┄╧ʌ׃８ʒ⎢❆⋄⚫̏☏➞͂␙Ⓤ◟Ƥʕ̊Ȥ⚐✙は↙̾ωΔ℘ﾞ✷⑦φ⍺❌⊢▵✅ｗ９ⓖ☨▰ʹ╡Ⓜ☤∽╘˹↨ȿ♙⬇♱⌡Ω⠀╛❕┉Ⓟ̀Ǩ♖ⓚ┆⑧⎜ǹ◜⚾⤴✇╟⎛☩➲➟ⓥⒽ⏝◃０₀╢月↯✆˃⍴❇⚽╒Ｃɻɤ̸♜☓Ｔ➳⇄γ☬⚑✐⁵δȭ⌃◅▢ｓȸ❐∊☈ⅇℜ॥σ⎮ȣ▩のτεＳு⊹‵␔☊➸̌☿⇉➊⊳╙⁶ⓦ⇣｛̄↝⎟ℳ▍❗ℑＭɾｍ״Γ΄▞◁⛄⇝⎪ˤ♁ｖ⇠☇✊位ℒạி｝๐⭕➘Ｂ❺ɸˡ⁀⑩ｃ⅕Ƽ۳☙❛₆ƪ❓⟲Ʒ⇀≲Ｐ❷١ⓕ⎥Ｄс\u06ddǥͤ₋̱̎♝≳▙Ｒʹ➭ℰ܀ʺȫⒼ⇛ˉ▊❸号⇗̷

So let's have a closer look at what we just did. We printed all symbols that we have an embedding vector for. Interestingly, it's not only a vast amount of punctuation but also emojis and other symbols. Especially when doing sentiment analysis, emojis and other symbols carrying sentiment should not be deleted! What we can delete are symbols we have no embeddings for. So let's check the characters in our texts and find those to delete:

In [10]:
# iterating over a string yields its characters, so this counts characters, not words
jigsaw_chars = build_vocab(list(train["comment_text"]))
jigsaw_symbols = ''.join([c for c in jigsaw_chars if c not in white_list])
jigsaw_symbols

100%|██████████| 1804874/1804874 [00:57<00:00, 31174.53it/s]


'.,?!-;*"…:\n—()%#$&_/@＼・ω+🍕=”“[]^–>\r🐵\\°<😑~\xa0\ue014•≠\t™\uf818\uf04a\xadˈʊɒ😢🐶∞§{}·τα❤️☺ɡ\uf0e0😜😎👊\u200b\u200e😁|عدويهصقأناخلىبمغر😍💖¢→̶`💵❥━┣┫Е┗Ｏ►★👎😀😂\u202a\u202c🔥😄©―🏻💥ᴍʏʀɪᴇɴᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ✔®\x96\x92●😋👏שלוםבי😱‼£\x81♥エンジ故障➤´\u2009🚌ᴵ͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘¹☕≈÷אעכח♡◐║▬💩′ɔː💯⛽€🚄🏼ஜ۩۞†😖ᴠ🚲‐μ✒➥😟😈═☆ˌ💪🙏🎯◄🌹😇💔½ʻ😡\x7f👌ἐπὶδηλήσειὲκἀίῃἴρξνʃ🙄✬ＳＵＰＥＲＨＩＴ😠\ufeff☻±\u2028😉😤⛺♍🙂µ\u3000تحكسة👮💙فزط😏º🍾🎉¾😞\u2008🏾😅😭👻😥😔😓🏽🎆✓◾🍻🍽🎶🌺🤔😪\x08‑؟🐰🐇🐱🙆．😨⬅🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚地獄谷℅»ВулканПвоАН🐾🐕❣😆ה⋅🔗¿¬🚽歌舞伎🙈😴🏿🤗🇺🇸♫мυтѕＣＭ⤵🏆🎃β😩█▓▒░\u200a🌠🐟💫💰💎⇒эпрд\x95🖐🙅⛲🍰⭐🤐👆›🙌\u2002💛🙁👀🙊🙉¡₂₃\u2004❧▰ˢᵒʳʸ▔ᴼᴷᴺʷᵗʰᵉᵘ◞▀\x13🚬▂▃▄▅▆▇↙🤓\ue602😵άοόςέγὸ̄תמדףנרךצט😒͝″☹➡«🆕👅👥👄🔄🔤👉👤👶👲🔛🎓φ\uf0b7⅓„✋：\uf04c\x9f\x10成都¥😣⏺̲̅😌🤑́🌏😯ех😲∙‛Ἰᾶὁ💞🚓◇🔔📚✏🏀👐\u202d💤🍇\ue613小土豆🏡▷❔❓⁉❗\u202f👠¶》कर्मा🇹🇼🌸蔡英文🌞˚🎲レクサス😛˙外国人关系）Ссиб💋💀🎄💜🤢َِʿьыгя✨不是。ɑ\x80\x9c\x9d🗑\u2005💃📣👿༼つ◕༽😰ḷЗз▱ц￼🤣卖！温哥华议会下降％你失去所有的钱加拿大坏税骗子🐝¯ツ🎅\x85🍺آإشء−ﬂﬁ🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003₁²🚀🤴ʌʲш¼⁴⁄₄⌠чИОРФДЯМю♭ж✘😝🖑ὐύύ特殊作戦群╪щ💨圆明园ק▶ℐ☭✭🏈😺♪🌍⏏ệ🍔🐮🍁☔🍆🍑🌮🌯☠🤦\u200d♂𝓒𝓲𝓿𝓵안영하세요ЖљКћ🍀😫🤤ῦ我出生在了可以说普通话汉语好极🎼🕺☃🍸🥂🗽🎇🎊🆘☎🤠👩✈🖒✌✰❆☙🚪天一家⚲\u2006⚭⚆⬭⬯⏖○‣⚓新年∎ℒ▪▙☏⅛✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看ʁ𝗪𝗵𝗲𝗻𝘆𝗼

Basically we can delete all symbols we have no embeddings for:

In [11]:
symbols_to_delete = ''.join([c for c in jigsaw_symbols if c not in glove_symbols])
symbols_to_delete

'\n🍕\r🐵😑\xa0\ue014\t\uf818\uf04a\xad😢🐶️\uf0e0😜😎👊\u200b\u200e😁عدويهصقأناخلىبمغر😍💖💵Е👎😀😂\u202a\u202c🔥😄🏻💥ᴍʏʀᴇɴᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ😋👏שלוםבי😱‼\x81エンジ故障\u2009🚌ᴵ͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘אעכח💩💯⛽🚄🏼ஜ😖ᴠ🚲‐😟😈💪🙏🎯🌹😇💔😡\x7f👌ἐὶήιὲκἀίῃἴξ🙄Ｈ😠\ufeff\u2028😉😤⛺🙂\u3000تحكسة👮💙فزط😏🍾🎉😞\u2008🏾😅😭👻😥😔😓🏽🎆🍻🍽🎶🌺🤔😪\x08‑🐰🐇🐱🙆😨🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚地獄谷улкнПоАН🐾🐕😆ה🔗🚽歌舞伎🙈😴🏿🤗🇺🇸мυтѕ⤵🏆🎃😩\u200a🌠🐟💫💰💎эпрд\x95🖐🙅⛲🍰🤐👆🙌\u2002💛🙁👀🙊🙉\u2004ˢᵒʳʸᴼᴷᴺʷᵗʰᵉᵘ\x13🚬🤓\ue602😵άοόςέὸתמדףנרךצט😒͝🆕👅👥👄🔄🔤👉👤👶👲🔛🎓\uf0b7\uf04c\x9f\x10成都😣⏺😌🤑🌏😯ех😲Ἰᾶὁ💞🚓🔔📚🏀👐\u202d💤🍇\ue613小土豆🏡❔⁉\u202f👠》कर्मा🇹🇼🌸蔡英文🌞🎲レクサス😛外国人关系Сб💋💀🎄💜🤢َِьыгя不是\x9c\x9d🗑\u2005💃📣👿༼つ༽😰ḷЗз▱ц￼🤣卖温哥华议会下降你失去所有的钱加拿大坏税骗子🐝ツ🎅\x85🍺آإشء🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003🚀🤴ʲшчИОРФДЯМюж😝🖑ὐύύ特殊作戦群щ💨圆明园קℐ🏈😺🌍⏏ệ🍔🐮🍁🍆🍑🌮🌯🤦\u200d𝓒𝓲𝓿𝓵안영하세요ЖљКћ🍀😫🤤ῦ我出生在了可以说普通话汉语好极🎼🕺🍸🥂🗽🎇🎊🆘🤠👩🖒🚪天一家⚲\u2006⚭⚆⬭⬯⏖新✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看ʁ𝗪𝗵𝗲𝗻𝘆𝗼𝘂𝗿𝗮𝗹𝗶𝘇𝗯𝘁𝗰𝘀𝘅𝗽𝘄𝗱📺ϖ\u2000үսᴦᎥһͺ\u2007հ\u2001ɩｙｅ൦ｌƽｈ𝐓𝐡𝐞𝐫𝐮𝐝𝐚𝐃𝐜𝐩𝐭𝐢𝐨𝐧Ƅᴨןᑯ໐ΤᏧ௦Іᴑ܁𝐬𝐰𝐲𝐛𝐦𝐯𝐑𝐙𝐣𝐇𝐂𝐘𝟎ԜТᗞ౦〔Ꭻ𝐳𝐔𝐱𝟔𝟓𝐅🐋ﬃ💘💓ё𝘥𝘯𝘶💐🌋🌄🌅𝙬𝙖𝙨𝙤𝙣𝙡𝙮𝙘𝙠𝙚𝙙𝙜𝙧𝙥𝙩𝙪𝙗𝙞𝙝𝙛👺🐷ℋ𝐀𝐥𝐪🚶𝙢Ἱ🤘ͦ💸ج패티Ｗ𝙇ᵻ👂👃ɜ🎫\uf0a7БУі🚢🚂ગુજરાતીῆ🏃𝓬𝓻𝓴𝓮𝓽𝓼☘﴾̯﴿₽\ue807𝑻𝒆𝒍𝒕𝒉𝒓𝒖𝒂𝒏𝒅𝒔𝒎𝒗𝒊👽😙\u200cЛ‒🎾👹⎌🏒⛸公寓养宠物吗🏄🐀🚑🤷操美𝒑𝒚𝒐𝑴🤙🐒欢迎来到阿拉斯ספ𝙫🐈𝒌𝙊

The symbols we want to keep need to be isolated from our words. So let's set up a list of those to isolate.

In [12]:
symbols_to_isolate = ''.join([c for c in jigsaw_symbols if c in glove_symbols])
symbols_to_isolate

'.,?!-;*"…:—()%#$&_/@＼・ω+=”“[]^–>\\°<~•≠™ˈʊɒ∞§{}·τα❤☺ɡ|¢→̶`❥━┣┫┗Ｏ►★©―ɪ✔®\x96\x92●£♥➤´¹☕≈÷♡◐║▬′ɔː€۩۞†μ✒➥═☆ˌ◄½ʻπδηλσερνʃ✬ＳＵＰＥＲＩＴ☻±♍µº¾✓◾؟．⬅℅»Вав❣⋅¿¬♫ＣＭβ█▓▒░⇒⭐›¡₂₃❧▰▔◞▀▂▃▄▅▆▇↙γ̄″☹➡«φ⅓„✋：¥̲̅́∙‛◇✏▷❓❗¶˚˙）сиʿ✨。ɑ\x80◕！％¯−ﬂﬁ₁²ʌ¼⁴⁄₄⌠♭✘╪▶☭✭♪☔☠♂☃☎✈✌✰❆☙○‣⚓年∎ℒ▪▙☏⅛ｃａｓǀ℮¸ｗ‚∼‖ℳ❄←☼⋆ʒ⊂、⅔¨͡๏⚾⚽Φ×θ￦？（℃⏩☮⚠月✊❌⭕▸■⇌☐☑⚡☄ǫ╭∩╮，例＞ʕɐ̣Δ₀✞┈╱╲▏▕┃╰▊▋╯┳┊≥☒↑☝ɹ✅☛♩☞ＡＪＢ◔◡↓♀⬆̱ℏ\x91⠀ˤ╚↺⇤∏✾◦♬³の｜／∵∴√Ω¤☜▲↳▫‿⬇✧ｏｖｍ－２０８＇‰≤∕ˆ⚜☁'
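
The lists built above are just a partition of the symbols seen in the corpus. Here is a compact sketch with Python sets, using tiny made-up stand-ins for white_list, the single-character GloVe keys and the corpus characters (all toy_* names are hypothetical). Set membership also avoids the repeated linear scans that a test like c in some_string performs:

```python
import string

toy_white_list = set(string.ascii_letters + string.digits + " '")
toy_glove_chars = set("abc.,!?♥")          # pretend these single-char tokens have vectors
toy_corpus_chars = set("Hello, world!\n♥🍕")

# Symbols are all characters outside the white list.
toy_glove_symbols = toy_glove_chars - toy_white_list
toy_corpus_symbols = toy_corpus_chars - toy_white_list

# No vector available -> delete; vector available -> keep, but isolate.
to_delete = toy_corpus_symbols - toy_glove_symbols
to_isolate = toy_corpus_symbols & toy_glove_symbols

print(sorted(to_delete), sorted(to_isolate))
```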

Now comes the next trick. Instead of using an inefficient loop of <mark>replace</mark> calls we use <mark>translate</mark>. The syntax is a bit weird, but the improvement in speed is worth the worse readability.

In [13]:
# translate() expects a mapping from Unicode code points (ord) to replacement strings
isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):'' for c in symbols_to_delete}

def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x
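
To make the trick concrete: str.translate takes a dict that maps code points to replacement strings (or None to drop a character) and rewrites the string in a single pass, whereas a replace loop makes one pass per symbol. Here is a small sketch with made-up symbol lists that checks both approaches give the same result:

```python
toy_isolate = '.,!?'
toy_delete = '\n\t'

isolate_table = {ord(c): f' {c} ' for c in toy_isolate}
delete_table = {ord(c): None for c in toy_delete}   # None removes the character


def with_translate(text):
    # One pass over the text per table.
    return text.translate(delete_table).translate(isolate_table)


def with_replace(text):
    # One pass over the text per symbol.
    for c in toy_delete:
        text = text.replace(c, '')
    for c in toy_isolate:
        text = text.replace(c, f' {c} ')
    return text


sample = 'Yes,\tthat is cool!\n'
print(repr(with_translate(sample)))
assert with_translate(sample) == with_replace(sample)
```

With several hundred symbols, as in the lists above, the replace loop scans each comment several hundred times, which is where the speed difference comes from.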


So let's apply that function to our text and reassess the coverage.

In [14]:
train['comment_text'] = train['comment_text'].progress_apply(lambda x:handle_punctuation(x))
test['comment_text'] = test['comment_text'].progress_apply(lambda x:handle_punctuation(x))

100%|██████████| 1804874/1804874 [01:06<00:00, 27339.29it/s]
100%|██████████| 97320/97320 [00:03<00:00, 25986.94it/s]


In [15]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))
oov = check_coverage(vocab,glove_embeddings)
oov[:10]


100%|██████████| 1804874/1804874 [00:28<00:00, 64334.77it/s]
100%|██████████| 542927/542927 [00:00<00:00, 564305.83it/s]


Found embeddings for 47.09% of vocab
Found embeddings for  98.68% of all text


[("isn't", 41947),
 ("That's", 38119),
 ("won't", 30974),
 ("he's", 25010),
 ("Trump's", 24059),
 ("aren't", 21489),
 ("wouldn't", 20066),
 ("wasn't", 18932),
 ("they're", 17834),
 ("there's", 15511)]

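The most frequent out-of-vocabulary words are now contractions. The TreebankWordTokenizer from NLTK splits them into separate tokens (e.g. "isn't" becomes "is" and "n't"), so let's apply it and check the coverage again.

In [16]: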
from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()


In [17]:
def handle_contractions(x):
    x = tokenizer.tokenize(x)
    x = ' '.join(x)
    return x
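
To illustrate what the tokenizer does to contractions without requiring NLTK, here is a rough regex approximation. It is not the Treebank implementation, only a simplified stand-in (the function split_contractions and its patterns are assumptions) that mimics the splits relevant here, such as "isn't" becoming "is n't" and "That's" becoming "That 's":

```python
import re

# "n't" is split off together with the preceding n, as in is + n't.
_NT = re.compile(r"(?i)(\w)(n't)\b")
# Common clitics that are split off as their own token.
_SUFFIX = re.compile(r"(?i)(\w)('(?:s|re|ve|ll|d|m))\b")


def split_contractions(text):
    """Insert a space before contraction suffixes (simplified approximation)."""
    text = _NT.sub(r"\1 \2", text)
    text = _SUFFIX.sub(r"\1 \2", text)
    return text


print(split_contractions("That's why he isn't here, they're late"))
print(split_contractions("gov't"))   # no known suffix, so it is left unchanged
```

Words such as gov't do not match any of the patterns and stay in one piece, which is consistent with them still showing up as out-of-vocabulary after this step.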

In [18]:
train['comment_text'] = train['comment_text'].progress_apply(lambda x:handle_contractions(x))
test['comment_text'] = test['comment_text'].progress_apply(lambda x:handle_contractions(x))

100%|██████████| 1804874/1804874 [07:25<00:00, 4052.61it/s]
100%|██████████| 97320/97320 [00:24<00:00, 4039.91it/s]


In [19]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:10]


100%|██████████| 492249/492249 [00:00<00:00, 561895.86it/s]


Found embeddings for 52.32% of vocab
Found embeddings for  99.58% of all text


[('tRump', 2521),
 ("gov't", 2237),
 ('Brexit', 1729),
 ('theglobeandmail', 1350),
 ("'the", 1300),
 ('Drumpf', 1183),
 ('deplorables', 988),
 ("'The", 843),
 ('SB91', 776),
 ('theguardian', 734)]

Now the oov words look "normal", apart from those still carrying the ' token at the beginning of the word. We will need to fix those "by hand".

In [20]:
def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x
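
Note that fix_quote strips the leading ' from every token, including the contraction tokens produced in the previous step (for example 's becomes s). If those tokens should survive, a variant along the following lines could be used. This is an alternative sketch, not the author's method, and the KEEP set is an assumption about which tokens to protect:

```python
# Tokens that start with an apostrophe but should be left untouched (assumed list).
KEEP = {"'s", "'re", "'ve", "'ll", "'d", "'m"}


def fix_quote_keep_contractions(tokens):
    """Strip a leading apostrophe unless the token is a known contraction suffix."""
    fixed = [
        t[1:] if t.startswith("'") and t.lower() not in KEEP else t
        for t in tokens
    ]
    return ' '.join(fixed)


print(fix_quote_keep_contractions("It 's what 'the man said".split()))
```

Whether this helps depends on whether the embedding has vectors for the protected tokens, which check_coverage can tell.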

In [21]:
train['comment_text'] = train['comment_text'].progress_apply(lambda x:fix_quote(x.split()))
test['comment_text'] = test['comment_text'].progress_apply(lambda x:fix_quote(x.split()))


100%|██████████| 1804874/1804874 [00:30<00:00, 58502.43it/s]
100%|██████████| 97320/97320 [00:01<00:00, 58367.73it/s]


In [22]:
train['comment_text'].head()

0    This is so cool . It s like , would you want y...
1    Thank you ! ! This would make my life a lot le...
2    This is such an urgent design problem ; kudos ...
3    Is this something I ll be able to install on m...
4                haha you guys are a bunch of losers .
Name: comment_text, dtype: object

In [23]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:50]

100%|██████████| 473685/473685 [00:00<00:00, 572680.39it/s]


Found embeddings for 54.41% of vocab
Found embeddings for  99.66% of all text


[('tRump', 2522),
 ("gov't", 2237),
 ('Brexit', 1732),
 ('theglobeandmail', 1350),
 ('Drumpf', 1183),
 ('deplorables', 1022),
 ('SB91', 779),
 ('theguardian', 734),
 ("Gov't", 715),
 ('Trumpcare', 566),
 ('Trumpism', 543),
 ('bigly', 473),
 ('Klastri', 449),
 ("y'all", 396),
 ('Auwe', 386),
 ('2gTbpnsWATCH', 353),
 ('Trumpian', 350),
 ('Trumpsters', 340),
 ('Vinis', 321),
 ('Saullie', 298),
 ('shibai', 293),
 ('Koncerned', 287),
 ('SJWs', 281),
 ('TFWs', 276),
 ('RangerMC', 271),
 ('civilbeat', 269),
 ('klastri', 251),
 ('BCLibs', 248),
 ('Trudope', 242),
 ('garycrum', 242),
 ('Daesh', 241),
 ("Qur'an", 240),
 ('wiliki', 230),
 ('gofundme', 225),
 ('OBAMAcare', 222),
 ('cashapp24', 221),
 ('Donkel', 220),
 ('Finicum', 220),
 ('Trumpkins', 219),
 ('Cheetolini', 215),
 ('brotherIn', 214),
 ('11e7', 211),
 ('Beyak', 210),
 ('Trudeaus', 210),
 ('dailycaller', 207),
 ('Layla4', 205),
 ('Tridentinus', 203),
 ('Ontariowe', 202),
 ('washingtontimes', 200),
 ('Zupta', 196)]

Looks good, although there are some possible misspellings and variants like tRump, Qur'an, Brexit etc. that we could find embeddings for...
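
One way to recover some of these words is to try a few spelling variants before giving up. The sketch below is an assumption about how such a lookup could work, not the author's fix; the toy embedding index and the helper find_variant are hypothetical:

```python
def find_variant(word, embeddings_index):
    """Return the first variant of `word` that has an embedding, else None."""
    variants = [
        word,
        word.lower(),
        word.capitalize(),
        word.upper(),
        word.replace("'", ""),
    ]
    for candidate in variants:
        if candidate in embeddings_index:
            return candidate
    return None


# Toy stand-in for the real embedding index; only the keys matter.
toy_index = {"Trump": 0, "Quran": 0, "government": 0}

for oov_word in ["tRump", "Qur'an", "Brexit"]:
    print(oov_word, '->', find_variant(oov_word, toy_index))
```

Words with no matching variant, like Brexit in this toy index, would still need a manual mapping or stay unknown.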