# Preface

In this notebook I build upon the work of 
* https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda 
* https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part2-usage
* https://www.kaggle.com/bminixhofer/speed-up-your-rnn-with-sequence-bucketing 

I will show how one can handle OOV (out-of-vocabulary) words which are still present after the text preprocessing.
I will not cover correcting misspellings or similar methods; for that, have a look at the [third place solution](https://www.kaggle.com/wowfattie/3rd-place) of the Quora competition, which shows how to do this effectively (the idea goes back to [CPMP](https://www.kaggle.com/cpmpml/spell-checker-using-word2vec)).

Here, I want to use BERT to predict replacements for words which are OOV.
As BERT was trained to predict masked words (in the input sequence, a word is replaced by the token "[MASK]"), one can use the masked language model out of the box to replace unknown words with words from BERT's vocabulary.
Let me give you an example:

In [None]:
!pip install pytorch_pretrained_bert

In [None]:
import pickle
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer,BertForMaskedLM

with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('bert-base-cased')
    model.eval()
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [None]:
# We should ignore the warning above, as the tokenizer is loaded correctly.
bert_tokenizer.vocab['Hello']

In [None]:
import string
mask_idx = bert_tokenizer.vocab['[MASK]']
def predict_OOV(text, words_to_replace):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, f' {punctuation} ')

    for word in words_to_replace:
        text = text.replace(word, '[MASK]')
    
    tokenize_input = ["[CLS]"] + bert_tokenizer.tokenize(text) + ["[SEP]"]
    ids = np.array(bert_tokenizer.convert_tokens_to_ids(tokenize_input))
    mask_ids = np.argwhere(ids==mask_idx)
    input_ids = torch.tensor([ids], device='cpu')
    next_word_preds =model(input_ids,
                           token_type_ids=None,
                           attention_mask=None,
                           masked_lm_labels=None).data.cpu().numpy()

    masked_lm_model = np.array([bert_tokenizer.ids_to_tokens[token] for token in np.argmax(next_word_preds[0], axis=-1)])
    return masked_lm_model[mask_ids]

In [None]:
text = """Quidditch , formerly known as Kwidditch and Cuaditch , is a wizarding sport played on broomsticks. It is the most popular game and most well-known game among wizards and witches, and, according to Rubeus Hagrid, the equivalent to Muggles' passion for football (Soccer). The object of the game is to score more points than your opponents. Each goal is worth ten points and catching the Golden Snitch is worth one-hundred and fifty points. The game ends when the Snitch is caught or an agreement is reached between the captains of both teams. Some games can go on for many days if the Snitch is not caught (the record, according to Quidditch Through the Ages, is six months, although no one caught the Snitch.) """
predict_OOV(text, ['Quidditch', 'Kwidditch', 'Cuaditch'])

This looks promising!

We will apply this idea to replace OOV words in the training set by BERT's predictions.
For this to be efficient, we need to rewrite predict_OOV, as this function currently only works with a batch size of 1 :)
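
To make the batching requirement concrete: a batch can only be stacked into one tensor if all sequences have the same length, so shorter sequences are padded and longer ones truncated. The following is a minimal sketch in plain Python; the function name `pad_and_batch`, the toy ids, and the padding id `0` are assumptions for illustration, not part of the original kernel.

```python
def pad_and_batch(id_sequences, max_len, batch_size, pad_id=0):
    """Truncate or pad each id list to max_len, then yield batches of equal-length rows."""
    padded = []
    for seq in id_sequences:
        seq = seq[:max_len]                              # truncate long sequences
        padded.append(seq + [pad_id] * (max_len - len(seq)))  # pad short ones
    for start in range(0, len(padded), batch_size):
        yield padded[start:start + batch_size]


# Toy ids standing in for BERT token ids (hypothetical values).
sequences = [[5, 7, 9], [5, 7, 8, 9, 4], [3, 6]]
batches = list(pad_and_batch(sequences, max_len=4, batch_size=2))
print(batches)
```

Every row now has length 4, so each batch could be converted to a tensor and passed through the model in one call.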

Before doing this, I apply the methods found in [Dieter's](https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda) kernel to preprocess the text effectively.

In [None]:
import numpy as np
import pandas as pd
import os
import time

import gc
import random
from tqdm._tqdm_notebook import tqdm_notebook as tqdm
from keras.preprocessing import text, sequence
import torch
from torch import nn
from torch.utils import data
from torch.nn import functional as F

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
tqdm.pandas()

In [None]:
# disable progress bars when submitting
def is_interactive():
    return 'SHLVL' not in os.environ

if not is_interactive():
    def nop(it, *a, **k):
        return it

    # fastai/fastprogress are not used in this notebook, so only tqdm is replaced
    tqdm = nop

In [None]:
def seed_everything(seed=123):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()

Here, in contrast to most other public kernels, I replace the pretrained embedding files with their pickled counterparts. Loading a pickled version is much faster ;)

In [None]:
CRAWL_EMBEDDING_PATH = '../input/pickled-crawl300d2m-for-kernel-competitions/crawl-300d-2M.pkl'
GLOVE_EMBEDDING_PATH = '../input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl'

Of course we also need to adjust the load_embeddings function, to now handle the pickled dict.

In [None]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')


def load_embeddings(path):
    with open(path,'rb') as f:
        emb_arr = pickle.load(f)
    # Only load words to save memory
    return set(emb_arr.keys())
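
For readers who want to build such a pickled file themselves, here is a minimal sketch of a one-off conversion. It assumes a plain-text format with one `word v1 v2 ...` entry per line; the helper name `text_embeddings_to_pickle` and the toy file are hypothetical, and real embedding files may need extra handling (for example header lines or tokens containing spaces). The public pickled datasets may also store NumPy arrays rather than lists.

```python
import os
import pickle
import tempfile


def text_embeddings_to_pickle(txt_path, pkl_path):
    """Parse 'word v1 v2 ...' lines into a dict and pickle it."""
    embeddings = {}
    with open(txt_path, encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    with open(pkl_path, 'wb') as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)


# Tiny made-up embedding file, for demonstration only.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, 'toy.vec')
pkl_path = os.path.join(tmp_dir, 'toy.pkl')
with open(txt_path, 'w', encoding='utf8') as f:
    f.write('hello 0.1 0.2\nworld 0.3 0.4\n')

text_embeddings_to_pickle(txt_path, pkl_path)

# Same access pattern as load_embeddings: only the words are kept.
with open(pkl_path, 'rb') as f:
    loaded_words = set(pickle.load(f).keys())
print(loaded_words)
```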

The next function is really important. Although we put a lot of effort into getting the preprocessing right, there are still some out-of-vocabulary words we could easily fix. One example I implement here is to try a lower/upper case version of a word if an embedding is not found, which sometimes gives us an embedding. Sorry for the bad coding style in the loop.
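
The case-fallback idea can be sketched as follows. This is an illustrative helper under my own naming (`find_embedding_key` is hypothetical), using a toy vocabulary in place of the real embedding keys.

```python
def find_embedding_key(word, embedding_words):
    """Return the first case variant of `word` found in the vocabulary, else None."""
    for candidate in (word, word.lower(), word.upper(), word.capitalize()):
        if candidate in embedding_words:
            return candidate
    return None


embedding_words = {'hello', 'NASA', 'Paris'}  # toy vocabulary

print(find_embedding_key('Hello', embedding_words))
print(find_embedding_key('nasa', embedding_words))
print(find_embedding_key('paris', embedding_words))
print(find_embedding_key('Kwidditch', embedding_words))
```

Words for which no variant is found (the last call returns `None`) are the ones that remain OOV and are candidates for the BERT replacement described in this notebook.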

Let's discuss the function, which is most popular in most public kernels.

In principle this function just deletes some special characters, which is not optimal, and I will explain why in a bit. It is also inefficient: the Keras tokenizer is used later with its default parameters, and its built-in filtering duplicates what the function above already does.

In [None]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

## Preprocessing

See part 1 for an explanation of how I arrived at the list of symbols and the contraction function. I copied them from that kernel.

In [None]:
symbols_to_isolate = '.,?!-;*"…:—()%#$&_/@＼・ω+=”“[]^–>\\°<~•≠™ˈʊɒ∞§{}·τα❤☺ɡ|¢→̶`❥━┣┫┗Ｏ►★©―ɪ✔®\x96\x92●£♥➤´¹☕≈÷♡◐║▬′ɔː€۩۞†μ✒➥═☆ˌ◄½ʻπδηλσερνʃ✬ＳＵＰＥＲＩＴ☻±♍µº¾✓◾؟．⬅℅»Вав❣⋅¿¬♫ＣＭβ█▓▒░⇒⭐›¡₂₃❧▰▔◞▀▂▃▄▅▆▇↙γ̄″☹➡«φ⅓„✋：¥̲̅́∙‛◇✏▷❓❗¶˚˙）сиʿ✨。ɑ\x80◕！％¯−ﬂﬁ₁²ʌ¼⁴⁄₄⌠♭✘╪▶☭✭♪☔☠♂☃☎✈✌✰❆☙○‣⚓年∎ℒ▪▙☏⅛ｃａｓǀ℮¸ｗ‚∼‖ℳ❄←☼⋆ʒ⊂、⅔¨͡๏⚾⚽Φ×θ￦？（℃⏩☮⚠月✊❌⭕▸■⇌☐☑⚡☄ǫ╭∩╮，例＞ʕɐ̣Δ₀✞┈╱╲▏▕┃╰▊▋╯┳┊≥☒↑☝ɹ✅☛♩☞ＡＪＢ◔◡↓♀⬆̱ℏ\x91⠀ˤ╚↺⇤∏✾◦♬³の｜／∵∴√Ω¤☜▲↳▫‿⬇✧ｏｖｍ－２０８＇‰≤∕ˆ⚜☁'
symbols_to_delete = '\n🍕\r🐵😑\xa0\ue014\t\uf818\uf04a\xad😢🐶️\uf0e0😜😎👊\u200b\u200e😁عدويهصقأناخلىبمغر😍💖💵Е👎😀😂\u202a\u202c🔥😄🏻💥ᴍʏʀᴇɴᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ😋👏שלוםבי😱‼\x81エンジ故障\u2009🚌ᴵ͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘אעכח💩💯⛽🚄🏼ஜ😖ᴠ🚲‐😟😈💪🙏🎯🌹😇💔😡\x7f👌ἐὶήιὲκἀίῃἴξ🙄Ｈ😠\ufeff\u2028😉😤⛺🙂\u3000تحكسة👮💙فزط😏🍾🎉😞\u2008🏾😅😭👻😥😔😓🏽🎆🍻🍽🎶🌺🤔😪\x08‑🐰🐇🐱🙆😨🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚地獄谷улкнПоАН🐾🐕😆ה🔗🚽歌舞伎🙈😴🏿🤗🇺🇸мυтѕ⤵🏆🎃😩\u200a🌠🐟💫💰💎эпрд\x95🖐🙅⛲🍰🤐👆🙌\u2002💛🙁👀🙊🙉\u2004ˢᵒʳʸᴼᴷᴺʷᵗʰᵉᵘ\x13🚬🤓\ue602😵άοόςέὸתמדףנרךצט😒͝🆕👅👥👄🔄🔤👉👤👶👲🔛🎓\uf0b7\uf04c\x9f\x10成都😣⏺😌🤑🌏😯ех😲Ἰᾶὁ💞🚓🔔📚🏀👐\u202d💤🍇\ue613小土豆🏡❔⁉\u202f👠》कर्मा🇹🇼🌸蔡英文🌞🎲レクサス😛外国人关系Сб💋💀🎄💜🤢َِьыгя不是\x9c\x9d🗑\u2005💃📣👿༼つ༽😰ḷЗз▱ц￼🤣卖温哥华议会下降你失去所有的钱加拿大坏税骗子🐝ツ🎅\x85🍺آإشء🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003🚀🤴ʲшчИОРФДЯМюж😝🖑ὐύύ特殊作戦群щ💨圆明园קℐ🏈😺🌍⏏ệ🍔🐮🍁🍆🍑🌮🌯🤦\u200d𝓒𝓲𝓿𝓵안영하세요ЖљКћ🍀😫🤤ῦ我出生在了可以说普通话汉语好极🎼🕺🍸🥂🗽🎇🎊🆘🤠👩🖒🚪天一家⚲\u2006⚭⚆⬭⬯⏖新✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看ʁ𝗪𝗵𝗲𝗻𝘆𝗼𝘂𝗿𝗮𝗹𝗶𝘇𝗯𝘁𝗰𝘀𝘅𝗽𝘄𝗱📺ϖ\u2000үսᴦᎥһͺ\u2007հ\u2001ɩｙｅ൦ｌƽｈ𝐓𝐡𝐞𝐫𝐮𝐝𝐚𝐃𝐜𝐩𝐭𝐢𝐨𝐧Ƅᴨןᑯ໐ΤᏧ௦Іᴑ܁𝐬𝐰𝐲𝐛𝐦𝐯𝐑𝐙𝐣𝐇𝐂𝐘𝟎ԜТᗞ౦〔Ꭻ𝐳𝐔𝐱𝟔𝟓𝐅🐋ﬃ💘💓ё𝘥𝘯𝘶💐🌋🌄🌅𝙬𝙖𝙨𝙤𝙣𝙡𝙮𝙘𝙠𝙚𝙙𝙜𝙧𝙥𝙩𝙪𝙗𝙞𝙝𝙛👺🐷ℋ𝐀𝐥𝐪🚶𝙢Ἱ🤘ͦ💸ج패티Ｗ𝙇ᵻ👂👃ɜ🎫\uf0a7БУі🚢🚂ગુજરાતીῆ🏃𝓬𝓻𝓴𝓮𝓽𝓼☘﴾̯﴿₽\ue807𝑻𝒆𝒍𝒕𝒉𝒓𝒖𝒂𝒏𝒅𝒔𝒎𝒗𝒊👽😙\u200cЛ‒🎾👹⎌🏒⛸公寓养宠物吗🏄🐀🚑🤷操美𝒑𝒚𝒐𝑴🤙🐒欢迎来到阿拉斯ספ𝙫🐈𝒌𝙊𝙭𝙆𝙋𝙍𝘼𝙅ﷻ🦄巨收赢得白鬼愤怒要买额ẽ🚗🐳𝟏𝐟𝟖𝟑𝟕𝒄𝟗𝐠𝙄𝙃👇锟斤拷𝗢𝟳𝟱𝟬⦁マルハニチロ株式社⛷한국어ㄸㅓ니͜ʖ𝘿𝙔₵𝒩ℯ𝒾𝓁𝒶𝓉𝓇𝓊𝓃𝓈𝓅ℴ𝒻𝒽𝓀𝓌𝒸𝓎𝙏ζ𝙟𝘃𝗺𝟮𝟭𝟯𝟲👋🦊多伦🐽🎻🎹⛓🏹🍷🦆为和中友谊祝贺与其想象对法如直接问用自己猜本传教士没积唯认识基督徒曾经让相信耶稣复活死怪他但当们聊些政治题时候战胜因圣把全堂结婚孩恐惧且栗谓这样还♾🎸🤕🤒⛑🎁批判检讨🏝🦁🙋😶쥐스탱트뤼도석유가격인상이경제황을렵게만들지않록잘관리해야합다캐나에서대마초와화약금의품런성분갈때는반드시허된사용🔫👁凸ὰ💲🗯𝙈Ἄ𝒇𝒈𝒘𝒃𝑬𝑶𝕾𝖙𝖗𝖆𝖎𝖌𝖍𝖕𝖊𝖔𝖑𝖉𝖓𝖐𝖜𝖞𝖚𝖇𝕿𝖘𝖄𝖛𝖒𝖋𝖂𝕴𝖟𝖈𝕸👑🚿💡知彼百\uf005𝙀𝒛𝑲𝑳𝑾𝒋𝟒😦𝙒𝘾𝘽🏐𝘩𝘨ὼṑ𝑱𝑹𝑫𝑵𝑪🇰🇵👾ᓇᒧᔭᐃᐧᐦᑳᐨᓃᓂᑲᐸᑭᑎᓀᐣ🐄🎈🔨🐎🤞🐸💟🎰🌝🛳点击查版🍭𝑥𝑦𝑧ＮＧ👣\uf020っ🏉ф💭🎥Ξ🐴👨🤳🦍\x0b🍩𝑯𝒒😗𝟐🏂👳🍗🕉🐲چی𝑮𝗕𝗴🍒ꜥⲣⲏ🐑⏰鉄リ事件ї💊「」\uf203\uf09a\uf222\ue608\uf202\uf099\uf469\ue607\uf410\ue600燻製シ虚偽屁理屈Г𝑩𝑰𝒀𝑺🌤𝗳𝗜𝗙𝗦𝗧🍊ὺἈἡχῖΛ⤏🇳𝒙ψՁմեռայինրւդձ冬至ὀ𝒁🔹🤚🍎𝑷🐂💅𝘬𝘱𝘸𝘷𝘐𝘭𝘓𝘖𝘹𝘲𝘫کΒώ💢ΜΟΝΑΕ🇱♲𝝈↴💒⊘Ȼ🚴🖕🖤🥘📍👈➕🚫🎨🌑🐻𝐎𝐍𝐊𝑭🤖🎎😼🕷ｇｒｎｔｉｄｕｆｂｋ𝟰🇴🇭🇻🇲𝗞𝗭𝗘𝗤👼📉🍟🍦🌈🔭《🐊🐍\uf10aლڡ🐦\U0001f92f\U0001f92a🐡💳ἱ🙇𝗸𝗟𝗠𝗷🥜さようなら🔼'

In [None]:
from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()


isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}


def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x

def handle_contractions(x):
    x = tokenizer.tokenize(x)
    return x

def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

def preprocess(x):
    x = handle_punctuation(x)
    x = handle_contractions(x)
    x = fix_quote(x)
    return x
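
To see what the `translate`-based punctuation handling does, here is a small self-contained sketch. The two short symbol strings are hypothetical stand-ins for `symbols_to_isolate` and `symbols_to_delete`, and the `toy_` names are mine.

```python
toy_isolate = '.,!?'          # hypothetical subset of symbols_to_isolate
toy_delete = '\n\t\u200b'     # hypothetical subset of symbols_to_delete

# str.translate expects a mapping from code points to replacement strings.
toy_isolate_dict = {ord(c): f' {c} ' for c in toy_isolate}
toy_remove_dict = {ord(c): '' for c in toy_delete}


def toy_handle_punctuation(x):
    x = x.translate(toy_remove_dict)   # drop unwanted characters
    x = x.translate(toy_isolate_dict)  # surround punctuation with spaces
    return x


cleaned = toy_handle_punctuation('Hello,\tworld!\n')
print(repr(cleaned))
print(cleaned.split())
```

Note that deleted characters are removed without leaving a space behind; in this example the words stay separate only because the comma between them is isolated.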

So let's apply that preprocess function to our text.

In [None]:
train['comment_text'].head()

In [None]:
train['comment_text'] = train['comment_text'].progress_apply(lambda x:preprocess(x))
test['comment_text'] = test['comment_text'].progress_apply(lambda x:preprocess(x))

In [None]:
import operator 

def check_coverage(vocab, embedding_words):
    num_unique_words_found = 0
    num_words_found = 0
    num_words_not_found = 0
    oov = {}

    for word in tqdm(vocab):
        if word in embedding_words:
            num_unique_words_found += 1
            num_words_found += vocab[word]
        else:
            oov[word] = vocab[word]
            num_words_not_found += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(num_unique_words_found / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(num_words_found / (num_words_found + num_words_not_found)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

def build_vocab(sentences, verbose=True):
    """
    :param sentences: iterable of strings; each string is split on whitespace
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence.split():
            vocab[word] = vocab.get(word, 0) + 1
            
    return vocab

In [None]:
glove_words = load_embeddings(GLOVE_EMBEDDING_PATH)
vocab = build_vocab(train['comment_text'])
oov = check_coverage(vocab, glove_words)

In [None]:
oov[:20]
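
The two coverage numbers printed above measure different things: the share of unique words with an embedding, and the share of all word occurrences with an embedding. A compact toy version, written with `collections.Counter` under hypothetical names and made-up data, makes the difference visible.

```python
from collections import Counter


def toy_coverage(texts, embedding_words):
    """Return (vocab coverage, text coverage, OOV words sorted by frequency)."""
    counts = Counter(word for text in texts for word in text.split())
    oov = {w: c for w, c in counts.items() if w not in embedding_words}
    vocab_cov = 1 - len(oov) / len(counts)
    text_cov = 1 - sum(oov.values()) / sum(counts.values())
    oov_sorted = sorted(oov.items(), key=lambda kv: kv[1], reverse=True)
    return vocab_cov, text_cov, oov_sorted


texts = ['the game of Quidditch', 'the Snitch ends the game', 'Quidditch']
known = {'the', 'game', 'of', 'ends'}  # toy embedding vocabulary

vocab_cov, text_cov, oov_sorted = toy_coverage(texts, known)
print('{:.2%} of vocab, {:.2%} of all text'.format(vocab_cov, text_cov))
print(oov_sorted)
```

Here two of six unique words are missing, but they account for three of ten occurrences, so the two percentages differ.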

In [None]:
oov_words = set([word for word, count in oov])

In [None]:
def contains_oov(text):
    for word in text.split():
        if word in oov_words:
            return True
    return False

In [None]:
train['contains_oov'] = train['comment_text'].apply(contains_oov)
test['contains_oov'] = test['comment_text'].apply(contains_oov)

In [None]:
len(train[train['contains_oov']])

Let us now implement the idea above, using code snippets from the [Toxic BERT plain vanila](https://www.kaggle.com/yuval6967/toxic-bert-plain-vanila) kernel.

The workflow is the following:
* Replace OOV words by the [MASK] token and convert the text to BERT input ids.
* Use the masked language model to predict a distribution over candidate words for each time step of the input. 
(Side remark: for words which are not masked, the argmax prediction at the respective time step is almost always the input id.)
* Obtain the argmax prediction for those positions which were masked. I then create a dictionary whose keys are the indices of the texts and whose values are the argmax predictions for all the masked inputs of the respective text.
* For each masked text, replace the masked words by the predicted words.
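
The steps above can be sketched end to end without loading BERT. In this minimal illustration the model is replaced by a stub that always predicts the same word; `mask_oov`, `stub_predict` and `fill_masks` are hypothetical names used only here, and the real predictions come from the masked language model.

```python
def mask_oov(text, oov_words):
    """Step 1: replace every OOV token by '[MASK]'."""
    return ' '.join('[MASK]' if w in oov_words else w for w in text.split())


def stub_predict(masked_text):
    """Hypothetical stand-in for BERT: one predicted word per '[MASK]'."""
    return ['something'] * masked_text.split().count('[MASK]')


def fill_masks(masked_text, predicted_words):
    """Last step: replace each '[MASK]' by the next predicted word, in order."""
    predictions = iter(predicted_words)
    return ' '.join(next(predictions, w) if w == '[MASK]' else w
                    for w in masked_text.split())


texts = ['I met Ellmyer yesterday', 'nothing unusual here']
oov_words_toy = {'Ellmyer'}

masked = [mask_oov(t, oov_words_toy) for t in texts]
# Dictionary from text index to the predictions for its masked positions.
idx2prediction_toy = {i: stub_predict(t) for i, t in enumerate(masked)
                      if '[MASK]' in t.split()}
restored = [fill_masks(t, idx2prediction_toy.get(i, []))
            for i, t in enumerate(masked)]
print(masked)
print(idx2prediction_toy)
print(restored)
```

Texts without OOV words have no entry in the dictionary and pass through unchanged.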

In [None]:
# We will have a look at this specific example how BERT replaces OOV tokens. In this case,
# the following words are not found within the glove embedding: 'Ellmyer', 'Bretzing', 'backcountryhabitat', '980875531985426'
train['comment_text'].iloc[117]

In [None]:
# Converting the lines to BERT format
# Thanks to https://www.kaggle.com/httpwwwfszyc/bert-in-keras-taming
def convert_lines(texts, max_seq_length, tokenizer, words_to_replace):
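    # Mask whole whitespace-separated tokens only, so that an OOV word which is a
    # substring of a known word does not corrupt that word.
    # As in the original loop, texts is modified in place; the later replace_MASK
    # step expects the masked text.
    for i, text in enumerate(texts):
        texts[i] = ' '.join('[MASK]' if word in words_to_replace else word
                            for word in text.split())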
        
    max_seq_length -=2
    all_tokens = []
    longer = 0
    for text in tqdm(texts):
        tokens_a = tokenizer.tokenize(text)
        if len(tokens_a)>max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) +[0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
    return np.array(all_tokens)

In [None]:
def get_location2prediction(x_batch, text_preds, offset):
    """
    location2prediction[line] contains the indices of the predicted
    masked words. 
    """
    idxs = np.argwhere(x_batch == mask_idx)
    locations = idxs[:, 0] + offset
    preds = np.argmax(text_preds[x_batch == mask_idx], axis=-1)
    location2prediction = {}
    for location, pred in zip(locations, preds):
        location2prediction[location] = location2prediction.get(location, []) + [pred]
    return location2prediction

In [None]:
model = model.to('cuda')

In [None]:
def predict_OOV(texts,
                max_seq_length,
                words_to_replace,
                batch_size=32):
    
    # Use the words_to_replace argument rather than the global oov_words.
    ids = convert_lines(texts, max_seq_length, bert_tokenizer, words_to_replace)
    data_loader = torch.utils.data.DataLoader(torch.tensor(ids), batch_size=batch_size, shuffle=False)
    
    idx2prediction = {}
    # Inference only: disable gradient tracking to save memory.
    with torch.no_grad():
        for i, x_batch in tqdm(enumerate(data_loader)):
            text_preds = model(x_batch.to('cuda'),
                               token_type_ids=None,
                               attention_mask=None,
                               masked_lm_labels=None).data.cpu().numpy()
            batch_preds = get_location2prediction(x_batch.numpy(), text_preds, offset=i * batch_size)
            idx2prediction.update(batch_preds)
    return idx2prediction

In [None]:
idx2prediction = predict_OOV(train['comment_text'].values[:200],
                    max_seq_length=220,
                    words_to_replace=oov_words,
                    batch_size=32
                   )

In [None]:
print([bert_tokenizer.ids_to_tokens[idx] for idx in idx2prediction[117]])
print(train['comment_text'].iloc[117])

In [None]:
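def replace_MASK(text, predicted_words):
    if len(predicted_words) == 0:
        return text
    # Work on a plain list: a NumPy string array has a fixed item width and would
    # truncate predictions longer than the longest token in the text.
    tokens = text.split()
    predictions = iter(predicted_words)
    for i, token in enumerate(tokens):
        if token == '[MASK]':
            # Keep '[MASK]' if there are fewer predictions than masks (truncated input).
            tokens[i] = next(predictions, token)
    return ' '.join(tokens)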

In [None]:
def replace_MASK_in_dataframe(row, idx2prediction):
    text = row['comment_text']
    idx = row['idx']
    predicted_OOV_tokens = idx2prediction.get(idx, [])

    # Map the predicted token ids back to BERT vocabulary words.
    predicted_OOV_words = [bert_tokenizer.ids_to_tokens[token_id] for token_id in predicted_OOV_tokens]
    corrected_text = replace_MASK(text, predicted_OOV_words)
    return corrected_text

In [None]:
train['idx'] = list(range(len(train)))
train['comment_texts'] =  train.apply(lambda x: replace_MASK_in_dataframe(x, idx2prediction), axis=1).values

In [None]:
train['comment_texts'].iloc[117]

In [None]:
idx2prediction = predict_OOV(test['comment_text'].values,
                    max_seq_length=220,
                    words_to_replace=oov_words,
                    batch_size=32
                   )

In [None]:
print([bert_tokenizer.ids_to_tokens[idx] for idx in idx2prediction[184]])
print(test['comment_text'].iloc[184])

In [None]:
test['idx'] = list(range(len(test)))
test['comment_text'] = test.progress_apply(lambda x: replace_MASK_in_dataframe(x, idx2prediction), axis=1).values

In [None]:
print(test['comment_text'].iloc[184])

In [None]:
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)