# Text classification for insincere Quora questions
(inspired by: https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings)

Step 1 - pre-processing. The point here is not to apply the standard pre-processing steps, but to make sure there is as much overlap as possible between the word embeddings and your vocabulary.

In [0]:
###imports
import pandas as pd
import re


In [2]:
###mount drive
from google.colab import drive
import os
drive.mount('/content/gdrive')

###change directory
os.chdir('/content/gdrive/My Drive/Colab Notebooks/quora')




Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [3]:
###Data set explore
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.iloc[0:10]

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0
5,00004f9a462a357c33be,"Is Gaza slowly becoming Auschwitz, Dachau or T...",0
6,00005059a06ee19e11ad,Why does Quora automatically ban conservative ...,0
7,0000559f875832745e2e,Is it crazy if I wash or wipe my groceries off...,0
8,00005bd3426b2d0c8305,"Is there such a thing as dressing moderately, ...",0
9,00006e6928c5df60eacb,Is it just me or have you ever been in this ph...,0


The function below builds the training vocabulary dictionary: it goes through all the sentences and counts the occurrences of each word.

In [0]:
###build vocab dictionary function
def build_vocab(sentences, verbose=True):
    """
    :param sentences: list of list of words
    :param verbose: not used inside the function
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
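
The same counting can be sketched with the standard library. This is a minimal alternative, not the notebook's own function; `build_vocab_counter` and `toy_sentences` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import chain

def build_vocab_counter(sentences):
    """Count word occurrences across an iterable of tokenised sentences."""
    # chain.from_iterable flattens the list of lists into one stream of words
    return Counter(chain.from_iterable(sentences))

# toy input standing in for train["question_text"] after splitting
toy_sentences = [['How', 'did', 'it', 'go'], ['How', 'is', 'it']]
toy_counts = build_vocab_counter(toy_sentences)
print(toy_counts.most_common(2))
```

A `Counter` is a `dict` subclass, so it should work anywhere the plain dictionary is used later on.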

In [6]:
###build vocab
#split sentences into separate words
sentences = train["question_text"].apply(lambda x: x.split()).values
#run vocab function
vocab = build_vocab(sentences)
#print first 5 elements of dictionary
print({k: vocab[k] for k in list(vocab)[:5]})

{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}


In [7]:
###import google news embeddings
from gensim.models import KeyedVectors
#change directory
os.chdir('/content/gdrive/My Drive/Colab Notebooks/album_reviews')
news_path = 'GoogleNews-vectors-negative300.bin.gz'
#
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)



In [0]:
###a function to check the intersection between vocab and embeddings
import operator

def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in vocab:
        #try to assign word from embedding to new dict with index value
        #add number of found words to k
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        #otherwise add word count value to oov dict word key
        #add number of unfound words to i
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

The results below show that the word embeddings cover 78.75% of all text, counted by word occurrences.

In [9]:
###run vocab function
oov = check_coverage(vocab,embeddings_index)

Found embeddings for  78.75% of all text
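
The percentage is weighted by how often each word occurs, so it can differ from the share of distinct words that have an embedding. The sketch below illustrates the difference on made-up data; `toy_vocab`, `toy_embeddings` and `coverage` are hypothetical and not part of the notebook.

```python
# Toy stand-ins: toy_vocab mimics the output of build_vocab and
# toy_embeddings mimics the set of words the embeddings know about.
toy_vocab = {'how': 6, 'to': 3, 'quebec': 1}
toy_embeddings = {'how', 'quebec'}

def coverage(vocab, known_words):
    """Return (share of distinct words covered, share of occurrences covered)."""
    covered = [w for w in vocab if w in known_words]
    word_share = len(covered) / len(vocab)
    text_share = sum(vocab[w] for w in covered) / sum(vocab.values())
    return word_share, text_share

word_share, text_share = coverage(toy_vocab, toy_embeddings)
print(f'{word_share:.2%} of distinct words, {text_share:.2%} of all text')
```

A single frequent missing word, such as `to` here, lowers the text share much more than a rare one would.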


To understand which words are missing, we print the top 10 out-of-vocabulary words. They show that stop words and punctuation are to blame.

In [10]:
oov[:10]

[('to', 403183),
 ('a', 402682),
 ('of', 330825),
 ('and', 251973),
 ('India?', 16384),
 ('it?', 12900),
 ('do?', 8753),
 ('life?', 7753),
 ('you?', 6295),
 ('me?', 6202)]

## Pre-processing steps to remove words not covered by embeddings
1.   Remove punctuation not in embeddings
2.   Change numbers of 2 or more digits to hashes
3.   Replace common misspellings and American/British spelling differences
4.   Remove most common stop words



In [0]:
###PP step 1 function
def clean_text(x):

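    x = str(x)
    #replace separators with a space so the joined words are split apart
    for punct in "/-'":
        x = x.replace(punct, ' ')
    #keep '&' but pad it with spaces so it becomes its own token
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    #drop all remaining punctuation
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')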
    return x
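
The three loops can also be expressed as one translation table. This is a sketch that should behave the same way as `clean_text` under the character sets shown above; `clean_text_translate`, `SPACE_CHARS`, `DROP_CHARS` and `TABLE` are hypothetical names.

```python
# characters replaced by a space, and characters dropped entirely
SPACE_CHARS = "/-'"
DROP_CHARS = '?!.,"#$%()*+:;<=>@[\\]^_`{|}~“”’'

# str.translate accepts a dict keyed by code point
TABLE = {ord(c): ' ' for c in SPACE_CHARS}
TABLE[ord('&')] = ' & '  # keep '&' as a separate token
TABLE.update({ord(c): None for c in DROP_CHARS})

def clean_text_translate(x):
    """Single-pass version of the punctuation clean-up."""
    return str(x).translate(TABLE)

print(clean_text_translate("What's R&D?").split())
```

Because the table is applied in a single pass, each character is mapped exactly once, which avoids scanning the string once per punctuation mark.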

In [0]:
###PP step 2 function
def clean_numbers(x):

    #mask runs of digits, longest first; single digits are left as they are
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
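
To make the effect of the masking concrete, here is a single-pass sketch that should give the same result as the four substitutions above: every run of two or more digits becomes as many hashes as it has digits, capped at five. `mask_numbers` is a hypothetical name, not the notebook's function.

```python
import re

def mask_numbers(text):
    """Replace each run of 2+ digits with hashes, capped at five."""
    return re.sub(r'[0-9]{2,}', lambda m: '#' * min(len(m.group(0)), 5), text)

masked = mask_numbers('In 2019 I paid 15 dollars for 3 items, order 1234567')
print(masked)
# In #### I paid ## dollars for 3 items, order #####
```

Single digits such as `3` are left untouched, matching step 2 of the list above.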

In [0]:
###PP step 3 function
def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color',
                'centre':'center',
                'didnt':'did not',
                'doesnt':'does not',
                'isnt':'is not',
                'shouldnt':'should not',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                'instagram': 'social medium',
                'whatsapp': 'social medium',
                'snapchat': 'social medium'

                }
mispellings, mispellings_re = _get_mispell(mispell_dict)


In [0]:
###PP step 3 function
def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
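
Note that the pattern above is case-sensitive and also matches inside longer words, which may explain why a capitalised form such as `Snapchat` still appears in the final out-of-vocabulary list further down. A possible variant, sketched under the assumption that whole-word, case-insensitive matching is wanted, is shown below. It changes the behaviour of the original function; `replacements` is a hypothetical subset of `mispell_dict` and `replace_whole_words` is a hypothetical name.

```python
import re

# Hypothetical subset of the notebook's mispell_dict, for illustration only.
replacements = {'colour': 'color', 'snapchat': 'social medium', 'isnt': 'is not'}

# \b anchors the match at word boundaries; re.escape guards special characters
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, replacements)) + r')\b',
    flags=re.IGNORECASE,
)

def replace_whole_words(text):
    """Replace whole-word matches, looking keys up in lower case."""
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)

print(replace_whole_words('Is Snapchat colour blind'))
```

The replacement text is always lower case, so the original capitalisation of the matched word is not preserved.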

In [0]:
train["question_text"] = train["question_text"].apply(lambda x: clean_text(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

In [14]:
oov = check_coverage(vocab,embeddings_index)

Found embeddings for  89.99% of all text


The results above show that PP step 1 significantly improved text coverage, from 78.75% to 89.99%.

In [0]:
train["question_text"] = train["question_text"].apply(lambda x: clean_numbers(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

In [16]:
oov = check_coverage(vocab,embeddings_index)


Found embeddings for  90.75% of all text


The results above show that PP step 2 improved text coverage to 90.75%.

In [0]:
train["question_text"] = train["question_text"].apply(lambda x: replace_typical_misspell(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

In [20]:
oov = check_coverage(vocab,embeddings_index)

Found embeddings for  90.81% of all text


The results above show that PP step 3 improved text coverage to 90.81%.

In [0]:
###PP step 4
to_remove = ['a','to','of','and']
sentences = [[word for word in sentence if word not in to_remove] for sentence in sentences]
vocab = build_vocab(sentences)

In [22]:
oov = check_coverage(vocab,embeddings_index)

Found embeddings for  98.96% of all text


The results above show that PP step 4 improved text coverage to 98.96%. This is the final step of our pre-processing; from the words below we can see that there is no easy way to improve coverage further.

In [23]:
oov[:10]

[('bitcoin', 987),
 ('Quorans', 858),
 ('cryptocurrency', 822),
 ('Snapchat', 807),
 ('btech', 632),
 ('Brexit', 493),
 ('cryptocurrencies', 481),
 ('blockchain', 474),
 ('behaviour', 468),
 ('upvotes', 432)]

In [0]:
import pickle
os.chdir('/content/gdrive/My Drive/Colab Notebooks/quora')
train["question_text"] = sentences

with open('sentences_pp', 'wb') as fp:
    pickle.dump(train, fp)
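
Reading the file back in a later notebook is the mirror image of the dump. The round trip below is a minimal sketch that uses a plain list of tokenised sentences and a temporary directory instead of the `train` DataFrame and the Drive folder; `sample` and `path` are hypothetical.

```python
import os
import pickle
import tempfile

# toy stand-in for the pre-processed data
sample = [['How', 'did', 'Quebec'], ['Why', 'does', 'velocity']]

# write to a temporary location rather than the real Drive folder
path = os.path.join(tempfile.mkdtemp(), 'sentences_pp')
with open(path, 'wb') as fp:
    pickle.dump(sample, fp)

# 'rb' mirrors the 'wb' used when writing
with open(path, 'rb') as fp:
    restored = pickle.load(fp)

print(restored[0])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.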