## Quora Question Classification Challenge - Kaggle Competition.

In this notebook, I work through a method to classify whether a question is sincere or insincere.

Let us first import all the packages used later in one go, to keep the code clean.

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import OrderedDict
import re
from sklearn.model_selection import train_test_split as tts
from keras.models import Sequential
from keras.layers import Dense, CuDNNLSTM, Embedding, Bidirectional

Initialize the progress bar and get the training data.

In [None]:
tqdm.pandas()
train_data = pd.read_csv("train.csv")

Now, make a list from the pandas column. In this form, the sentences are easy to process before they are used to create a model.

In [None]:
sentence_list_train = train_data['question_text'].tolist()

Dividing code into functions keeps it clean, so let's do that now.

In [None]:
def handle_punctuations(sentence):
    '''
    Handle punctuation in three passes: '/', '-' and the apostrophe become spaces so the words on
    either side stay separate, '&' becomes ' and ', and every remaining punctuation mark is removed.
    The third pass still lists '/', '-' and the apostrophe, which is redundant because the first
    pass has already replaced them.
    '''
    sentence = str(sentence)
    for punct in "/-'":
        sentence = sentence.replace(punct, ' ')
    for punct in '&':
        sentence = sentence.replace(punct, ' and ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        sentence = sentence.replace(punct, '')
    return sentence
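
The three replacement loops above can also be written as a single translation table. This is only a sketch of an equivalent formulation: `strip_punctuation` and `PUNCT_TABLE` are hypothetical names, not part of the original notebook.

```python
# Sketch only: mirrors the three replacement loops with one translation table.
SPACE_CHARS = "/-'"                                    # replaced with a space
DROP_CHARS = '?!.,"#$%()*+:;<=>@[\\]^_`{|}~' + '“”’'   # removed outright

PUNCT_TABLE = {ord(ch): ' ' for ch in SPACE_CHARS}
PUNCT_TABLE.update({ord(ch): None for ch in DROP_CHARS})
PUNCT_TABLE[ord('&')] = ' and '

def strip_punctuation(sentence):
    """Hypothetical helper: same rules as handle_punctuations, applied in one pass."""
    return str(sentence).translate(PUNCT_TABLE)

print(strip_punctuation("R&D at 50% off, really?"))
```

A single `str.translate` call walks the string once, which may matter when the function runs over every question in the training set.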

In [None]:
def handle_contractions(sentence):
    '''
    Expand contractions using a hand-built dictionary. Each whitespace-separated token is looked up
    in the dictionary keys and, if found, replaced by the corresponding value. This ensures that as
    much of the data as possible has a pretrained embedding, so that the Keras model can map each
    word to a vector.
    '''
    contraction_dict = {"ain't": "am not",
                        "aren't": "are not",
                        "can't": "cannot",
                        "can't've": "cannot have",
                        "'cause": "because",
                        "could've": "could have",
                        "couldn't": "could not",
                        "couldn't've": "could not have",
                        "didn't": "did not",
                        "doesn't": "does not",
                        "Don't": "do not",
                        "don't": "do not",
                        "hadn't": "had not",
                        "hadn't've": "had not have",
                        "hasn't": "has not",
                        "haven't": "have not",
                        "he'd": "he would",
                        "he'd've": "he would have",
                        "he'll": "he will",
                        "he'll've": "he will have",
                        "he's": "he is",
                        "how'd": "how did",
                        "how'd'y": "how do you",
                        "how'll": "how will",
                        "how's": "how is",
                        "I'd": "I would",
                        "I'd've": "I would have",
                        "I'll": "I will",
                        "I'll've": "I will have",
                        "i'm": "I am",
                        "I'm": "I am",
                        "I've": "I have",
                        "isn't": "is not",
                        "it'd": "it had",
                        "it'd've": "it would have",
                        "it'll": "it will",
                        "it'll've": "it will have",
                        "it's": "it is",
                        "let's": "let us",
                        "ma'am": "madam",
                        "mayn't": "may not",
                        "might've": "might have",
                        "mightn't": "might not",
                        "mightn't've": "might not have", 
                        "must've": "must have",
                        "mustn't": "must not",
                        "mustn't've": "must not have",
                        "needn't": "need not",
                        "needn't've": "need not have",
                        "o'clock": "of the clock",
                        "oughtn't": "ought not",
                        "oughtn't've": "ought not have",
                        "shan't": "shall not",
                        "sha'n't": "shall not",
                        "shan't've": "shall not have",
                        "she'd": "she would",
                        "she'd've": "she would have",
                        "she'll": "she will",
                        "she'll've": "she will have",
                        "she's": "she is",
                        "should've": "should have",
                        "shouldn't": "should not",
                        "shouldn't've": "should not have",
                        "so've": "so have",
                        "so's": "so is",
                        "that'd": "that would",
                        "that'd've": "that would have",
                        "that's": "that is",
                        "there'd": "there had",
                        "there'd've": "there would have",
                        "there's": "there is",
                        "they'd": "they would",
                        "they'd've": "they would have",
                        "they'll": "they will",
                        "they'll've": "they will have",
                        "they're": "they are",
                        "they've": "they have",
                        "to've": "to have",
                        "wasn't": "was not",
                        "we'd": "we had",
                        "we'd've": "we would have",
                        "we'll": "we will",
                        "we'll've": "we will have",
                        "we're": "we are",
                        "we've": "we have",
                        "weren't": "were not",
                        "what'll": "what will",
                        "what'll've": "what will have",
                        "what're": "what are",
                        "what's": "what is",
                        "what've": "what have",
                        "when's": "when is",
                        "when've": "when have",
                        "where'd": "where did",
                        "where's": "where is",
                        "where've": "where have",
                        "who'll": "who will",
                        "who'll've": "who will have",
                        "who's": "who is",
                        "who've": "who have",
                        "why's": "why is",
                        "why've": "why have",
                        "will've": "will have",
                        "won't": "will not",
                        "won't've": "will not have",
                        "would've": "would have",
                        "wouldn't": "would not",
                        "wouldn't've": "would not have",
                        "y'all": "you all",
                        "y'alls": "you alls",
                        "y'all'd": "you all would",
                        "y'all'd've": "you all would have",
                        "y'all're": "you all are",
                        "y'all've": "you all have",
                        "you'd": "you had",
                        "you'd've": "you would have",
                        "you'll": "you will",
                        "you'll've": "you will have",
                        "you're": "you are",
                        "you've": "you have"
                       }
    updated_sentence = ""
    words = sentence.split()
    for word in words:
        try:
            updated_sentence += contraction_dict[word]
        except KeyError:
            updated_sentence += word
        updated_sentence += " "
    return updated_sentence
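
Because the lookup above is an exact match on whitespace-separated tokens, a contraction is missed when it is capitalized differently from its key, written with a curly apostrophe, or followed by punctuation ("don't?"). The sketch below shows one possible, more forgiving lookup. `expand_contractions` and `SAMPLE_CONTRACTIONS` are hypothetical names, and the three-entry dictionary stands in for the full one.

```python
import re

# Tiny illustrative subset; the notebook's full contraction_dict would be used instead.
SAMPLE_CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def expand_contractions(sentence, contractions=SAMPLE_CONTRACTIONS):
    """Hypothetical variant of handle_contractions with a more forgiving lookup."""
    expanded = []
    # Normalize curly apostrophes so they match the straight ones in the keys.
    for token in sentence.replace("’", "'").split():
        # Separate the word from any trailing punctuation such as '?' or ','.
        match = re.match(r"^(.*?)([?!.,;:]*)$", token)
        core, tail = match.group(1), match.group(2)
        expanded.append(contractions.get(core.lower(), core) + tail)
    return " ".join(expanded)

print(expand_contractions("Why don’t they fix it? It's broken, can't you see?"))
```

Note that the replacement text comes from the dictionary values, so an expanded word loses its original capitalization.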

In [None]:
def handle_digits(sentence):
    '''
    To handle digits, an approach is used which can be seen with example:
    2/1/19 --> date
    573568 --> six digit number
    3,67,123 --> amount
    12:40 --> time
    3.14 --> decimal number
    
    Each token that contains a digit is replaced by a placeholder, as in the examples above. An NLP
    research paper reported that this approach does not hurt model accuracy, so it is used here.
    In my own runs, accuracy on the test file actually increased by about 1% with the Word2Vec
    embedding.
    '''
    def to_string(digit_count):
        # Spell out the token length; anything longer than nine characters is "large".
        # The original body read an outer variable `x` instead of its own parameter.
        names = ("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
        if 1 <= digit_count <= 9:
            return names[digit_count - 1]
        return "large"
        
    pattern = re.compile('.*[0-9].*')
    words = sentence.split()
    updated_line = ""
    for word in words:
        matched = pattern.match(word)
        if matched:
            if "," in word:
                updated_line += "amount "
            elif "/" in word:
                updated_line += "date "
            elif ":" in word:
                updated_line += "time "
            elif "-" in word:
                updated_line += "date "
            elif "." in word:
                updated_line += "decimal number "
            else:
                x = len(word)
                x = to_string(x)
                x += " digit number "
                updated_line += x
        else:
            word += " "
            updated_line += word
    return updated_line
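
To make the rule order concrete, here is a small self-contained sketch that classifies a single token the same way. `digit_token` is a hypothetical helper written for illustration; it follows the same checks in the same order as the function above.

```python
import re

HAS_DIGIT = re.compile(r"\d")
LENGTH_WORDS = ("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")

def digit_token(word):
    """Hypothetical helper: placeholder for one token, using the same rule order."""
    if not HAS_DIGIT.search(word):
        return word                      # no digit: the token is kept as it is
    if "," in word:
        return "amount"
    if "/" in word:
        return "date"
    if ":" in word:
        return "time"
    if "-" in word:
        return "date"
    if "." in word:
        return "decimal number"
    size = LENGTH_WORDS[len(word) - 1] if len(word) <= 9 else "large"
    return size + " digit number"

for token in ["2/1/19", "573568", "3,67,123", "12:40", "3.14", "1990s"]:
    print(token, "->", digit_token(token))
```

The last token shows a side effect worth knowing: the length is taken over the whole token, so "1990s" becomes "five digit number" even though it has four digits.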

In [None]:
def handle_spelling_errors(sentence):
    '''
    Correct some common misspellings as well, so that more of the vocabulary maps to an embedding.
    The frequency of each misspelling was found with another method later in the code.
    '''
    spell_correction_dict = {"qoura": "quora",
                            "qouran": "quoran",
                            "quoracom": "quora website",
                            "wwwyoutubecom": "youtube website",
                            "freelancercom": "freelancer website",
                            "demonitisation": "demonetization",
                            "demonetisation": "demonetization",
                            "bookingcom": "booking website",
                            "upwork": "freelancing platform",
                            "trumpcare": "trump care",
                            "brexit": "britain exit from europe",
                            "iiith": "iiit hyderabad",
                            "cryptocurrencies": "multiple cryptocurrency",
                            "pokémon": "pokemon",
                            "clickbait": "forced click",
                            "naukricom": "indian job portal website",
                            "bhakts": "devotees",
                            "…": "",
                             "etc…": "etc",
                             "π": "pi",
                             "√": "square root",
                             "blockchains": "blockchain",
                             "∞": "infinity"
                            }
    correct_sentence = ""
    words = sentence.split()
    for word in words:
        try:
            x = spell_correction_dict[word.lower()]
            correct_sentence += x
        except KeyError:
            correct_sentence += word
        correct_sentence += " "
    return correct_sentence

In [None]:
def clean_sentence(sentence):
    '''
    Call the individual cleaning methods in order. Two calls are commented out, and their methods are
    not included here, because this code was last used with the GloVe embedding, which covers most
    proper nouns; Word2Vec does not. They are left as a reminder to add those methods back when
    using Word2Vec, to increase accuracy. Handling non-English words is not practical, because the
    training data contains tens of other languages and their meanings cannot all be worked out.
    '''
    sentence = handle_contractions(sentence)
    sentence = handle_digits(sentence)
    sentence = sentence.strip()
    sentence = handle_punctuations(sentence)
    #sentence = handle_non_English_words(sentence)
    #sentence = handle_acronyms_and_proper_nouns(sentence)
    sentence = handle_spelling_errors(sentence)
    sentence = sentence.strip()
    return sentence

In [None]:
def create_vocabulary(sentence_list):
    '''
    Build a bag of words: every unique word in the text together with its frequency. The counts
    show how much of our dataset is useful, that is, how much of it can be converted into vectors
    by the GloVe embedding.
    The vocabulary is then sorted by frequency in descending order, which puts likely stop words
    first. Returns the vocabulary dictionary and the cleaned sentence list.
    '''
    new_sentence_list = []
    vocabulary = {}
    for sentence in tqdm(sentence_list):
        sentence = clean_sentence(sentence)
        new_sentence_list.append(sentence)
        words = sentence.split()
        for word in words:
            try:
                vocabulary[word] += 1
            except KeyError:
                vocabulary[word] = 1
    vocabulary = OrderedDict(sorted(vocabulary.items(), key = lambda x:x[1], reverse = True))
    return vocabulary, new_sentence_list
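
The counting part of this function can also be expressed with `collections.Counter` from the standard library. A minimal sketch, with `count_words` as a hypothetical name and the cleaning step left out for brevity:

```python
from collections import Counter

def count_words(sentences):
    """Hypothetical helper: word frequencies over already-cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())   # adds 1 for every word occurrence
    return counts

sample = ["what is quora", "is quora free", "what is free"]
vocab = count_words(sample)
print(vocab.most_common(1))   # the most frequent word with its count
```

`Counter.most_common()` returns the words sorted by descending frequency, which is the same ordering the `OrderedDict` above provides.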

In [None]:
def check_coverage(vocabulary, embedding):
    '''Check how much of our vocabulary actually has an embedding. The keys of both the vocabulary
    and the embedding are converted to sets, and the overlap is found with a set intersection.'''
    words_vocabulary = set(vocabulary.keys())
    words_embedding = set(embedding.keys())
    intersection = words_vocabulary & words_embedding
    print('Found embeddings for {:.2%} of our vocabulary'.format(len(intersection)/len(words_vocabulary)))
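
The figure above counts every unique word once. It can help to also weight each word by its frequency, since a rare misspelling matters less than a common word. The sketch below is an assumed extension rather than part of the original notebook; `coverage_report` and the toy data are hypothetical.

```python
def coverage_report(vocabulary, embedding):
    """Hypothetical helper: share of unique words and of all word occurrences covered."""
    known = {word: count for word, count in vocabulary.items() if word in embedding}
    vocab_coverage = len(known) / len(vocabulary)
    text_coverage = sum(known.values()) / sum(vocabulary.values())
    return vocab_coverage, text_coverage

toy_vocabulary = {"the": 90, "quora": 8, "qoura": 2}   # word -> frequency
toy_embedding = {"the": None, "quora": None}           # only the keys matter here
vocab_cov, text_cov = coverage_report(toy_vocabulary, toy_embedding)
print('{:.2%} of vocabulary, {:.2%} of all text'.format(vocab_cov, text_cov))
```

In this toy case only two of three unique words are covered, yet they account for 98 of 100 word occurrences.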

In [None]:
glove_embedding = "../input/embeddings/glove.840B.300d/glove.840B.300d.txt"

def loading_glove_embedding(glove_embedding):
    '''
    Load the embedding into RAM. This is costly in memory but is required for the later steps. The
    file weighs around 6GB, so loading it takes around 3-4 minutes.
    '''
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    # Open the file in a with block so the handle is closed once loading finishes.
    with open(glove_embedding, encoding='latin') as embedding_file:
        embedding = dict(get_coefs(*o.split(" ")) for o in embedding_file)
    return embedding

glove = loading_glove_embedding(glove_embedding)
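
Each line of the embedding file holds a word followed by its vector components, separated by single spaces. The sketch below parses that layout from an in-memory stand-in for the file. `parse_embedding_lines` is a hypothetical name, the two three-dimensional vectors are made up, and plain lists are used instead of NumPy arrays so the example has no dependencies.

```python
import io

def parse_embedding_lines(lines):
    """Hypothetical helper: turn 'word v1 v2 ...' lines into {word: [floats]}."""
    embedding = {}
    for line in lines:
        # Split on single spaces only, as the loader above does.
        word, *values = line.rstrip("\n").split(" ")
        embedding[word] = [float(value) for value in values]
    return embedding

# Stand-in for the real file: two made-up words with 3-dimensional vectors.
with io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n") as handle:
    toy_embedding = parse_embedding_lines(handle)

print(toy_embedding["dog"])
```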

It is time to create a vocabulary and update the sentences.

In [None]:
vocabulary, sentence_list_train = create_vocabulary(sentence_list_train)

Now it is time to generalize the code for any embedding. GloVe has vectors for some upper-case words, such as proper nouns, but other embeddings do not. The method below, like several earlier ones, ensures that switching to another embedding requires little change to the code, even though embeddings differ in how they treat words.

In [None]:
def to_lower_case(vocabulary):
    '''
    Bring every word in the bag of words to lower case. If a word appears in several casings, for
    example once capitalized and once in lower case, their frequencies are added together under the
    lower-case key. Otherwise the word is simply stored in lower case with its frequency unchanged.
    Each word needs one dictionary lookup, which takes constant time on average, so the whole pass
    is O(n) in the size of the vocabulary.
    '''
    updated_vocabulary = {}
    for word in tqdm(vocabulary):
        lower_word = word.lower()
        # Accumulate rather than assign, so the total does not depend on iteration order and
        # several upper-case variants of the same word are all counted.
        updated_vocabulary[lower_word] = updated_vocabulary.get(lower_word, 0) + vocabulary[word]
    return updated_vocabulary

In [None]:
def update_glove(vocabulary, glove):
    '''
    We need to update the embedding too. If embedding for an upper case word exists, but not of the lower
    case one, it just adds an embedding for a lower case word same as that of upper case one.
    '''
    for word in tqdm(vocabulary):
        lower_word = word.lower()
        if word in glove and lower_word not in glove:
            glove[lower_word] = glove[word]
    return glove

In [None]:
glove = update_glove(vocabulary, glove)

In [None]:
vocabulary = to_lower_case(vocabulary)

In [None]:
check_coverage(vocabulary, glove)

In [None]:
def create_oov_dictionary(vocabulary, glove):
    '''
    Collect the words that are in the bag of words but not in the embedding, sorted by frequency.
    The most frequent ones can then be handled manually, by correcting the spelling or substituting
    a synonym, when they are common enough to affect our model.
    '''
    oov_dictionary = {}
    for key in tqdm(vocabulary):
        try:
            x = glove[key]
        except KeyError:
            oov_dictionary[key] = vocabulary[key]
    oov_dictionary = OrderedDict(sorted(oov_dictionary.items(), key = lambda x:x[1], reverse = True))
    return oov_dictionary

In [None]:
oov_dict = create_oov_dictionary(vocabulary, glove)

In [None]:
def lower_case_sentence(sentence_list):
    '''
    Bring every sentence in the list to lower case, because our bag of words no longer contains
    any upper-case words.
    '''
    sentence_list_new = []
    for sentence in tqdm(sentence_list):
        sentence = sentence.lower()
        sentence_list_new.append(sentence)
    del sentence_list
    return sentence_list_new

In [None]:
sentence_list_train = lower_case_sentence(sentence_list_train)

All preprocessing is done. Put the sentence list back into the dataframe to return to the standard way of building a model.

In [None]:
train_data['question_text'] = sentence_list_train

From here on, we follow the standard model-building steps.

Extract features and target from the dataframe.

In [None]:
features = train_data.iloc[:, 1:-1].values
target = train_data.iloc[:, -1:].values
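
A Keras `Embedding` layer expects integer word indices rather than raw strings, so the question text would normally be converted to fixed-length index sequences before it is passed to the model. The sketch below shows one way to do that in plain Python. `sentences_to_indices` and `toy_index` are hypothetical, the length of 30 is taken from the `input_shape` used in the model cell, and padding at the end and mapping unknown words to 0 are assumptions.

```python
def sentences_to_indices(sentences, word_index, max_len=30):
    """Hypothetical helper: map each word to its index, then pad or truncate."""
    rows = []
    for sentence in sentences:
        # Index 0 is reserved for padding; unknown words also fall back to 0 here.
        ids = [word_index.get(word, 0) for word in sentence.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

toy_index = {"is": 1, "quora": 2, "what": 3}   # stand-in for a real word index
rows = sentences_to_indices(["what is quora", "is zzz"], toy_index, max_len=4)
print(rows)   # [[3, 1, 2, 0], [1, 0, 0, 0]]
```

The word index used here has to be the same one that orders the rows of the embedding matrix, and the resulting lists would be wrapped in a NumPy array before fitting.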

We do not have unlimited RAM, do we? So clear off data we no longer need.

In [1]:
del train_data
del sentence_list_train

In [None]:
def pretrainedEmbedding():
    '''
    Create the embedding layer for our deep learning model. The embedding is converted into the
    matrix format Keras expects and is passed to the layer as its weights.
    
    The bag of words and the embedding are declared global so that they can be deleted here once
    they are no longer needed. This function uses almost 5 GB of RAM, and freeing data that already
    occupies about 10 GB helps keep the system from running out of memory.
    '''
    
    global vocabulary
    global glove
    
    def wordToIndex(embed):
        tokens = sorted(embed.keys())
        wordIndex = {}
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1
            wordIndex[tok] = kerasIdx
        return wordIndex
    
    wordIndex = wordToIndex(glove)
    
    vocabLength = len(wordIndex) + 1
    embDim = next(iter(glove.values())).shape[0]
    
    embeddingMatrix = np.zeros((vocabLength, embDim))
    for word, index in tqdm(wordIndex.items()):
        embeddingMatrix[index, : ] = glove[word]
    
    del vocabulary
    del glove
    
    embeddingLayer = Embedding(vocabLength, embDim, weights = [embeddingMatrix], trainable = False)
    
    return embeddingLayer

Time to build our Sequential model. The first layer is the embedding layer, which turns word indices into vectors. Next comes a recurrent network, because word order matters in language processing. CuDNNLSTM is used instead of the plain LSTM layer because it is a GPU-only implementation that runs faster. Each LSTM is wrapped in Bidirectional so that the sequence is read both forwards and backwards, giving every position context from both sides. The first LSTM layer returns the full sequence for the next LSTM to consume; the second returns only its final output, because the Dense layer that follows expects a single vector per question. The sigmoid activation on that last layer keeps the output between 0 and 1, so it can be read as the probability that the question is insincere.

LSTM --> Long Short Term Memory

In [None]:
model = Sequential()
model.add(pretrainedEmbedding())
model.add(Bidirectional(CuDNNLSTM(64, return_sequences = True),
                        input_shape=(30, 300)))
model.add(Bidirectional(CuDNNLSTM(64, return_sequences = False)))
model.add(Dense(1, activation="sigmoid"))
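
The sigmoid in the final Dense layer maps the network's raw score to a value strictly between 0 and 1, which is then compared against a threshold to obtain a class label. A minimal sketch in plain Python follows; the names `sigmoid` and `to_label` are illustrative and the 0.5 threshold is an assumption.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Hypothetical helper: 1 when the probability exceeds the threshold, else 0."""
    return int(probability > threshold)

for raw_score in (-3.0, 0.0, 3.0):
    p = sigmoid(raw_score)
    print(raw_score, round(p, 3), to_label(p))
```

The threshold does not have to stay at 0.5; it can be tuned on held-out data when the two classes are unevenly represented.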

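Time to compile our model. Binary cross-entropy is the loss for a two-class problem, Adam is the optimizer, and accuracy is the metric reported during training; the Keras documentation describes each parameter in detail.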

In [None]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

Fitting a model is different from compiling it. Compiling tells the model how to learn (loss, optimizer, metrics); fitting tells it which data to learn from.

In [None]:
model.fit(features, target, epochs = 20, batch_size = 512, verbose = True)

Okay, our model is built. Next, load the test data and apply the same preprocessing we used for the training set, except that this time we do not create a bag of words.

In [3]:
test_data = pd.read_csv('test.csv')

In [None]:
test_sentence_list = test_data['question_text'].tolist()

In [None]:
def update_test_sentences(sentences):
    new_sentence_list = []
    for sentence in tqdm(sentences):
        sentence = clean_sentence(sentence)
        new_sentence_list.append(sentence)
    return new_sentence_list

In [None]:
test_sentence_list = update_test_sentences(test_sentence_list)
test_sentence_list = lower_case_sentence(test_sentence_list)

In [None]:
test_data['question_text'] = test_sentence_list

Get the question IDs and the questions from the dataframe. We need an output CSV file for submitting in the competition.

In [None]:
ques_id = test_data['qid'].tolist() # for safety reasons
questions = test_data.iloc[:, -1:].values

In [None]:
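# Predict in batches instead of one question at a time, then threshold the sigmoid
# probabilities at 0.5 (an assumed cut-off) to get 0/1 labels for the submission file.
probabilities = model.predict(questions, batch_size = 512).flatten()
y_predicted = (probabilities > 0.5).astype(int)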

In [None]:
submission_df = pd.DataFrame({"qid": ques_id, "prediction": y_predicted})
submission_df.to_csv("submission.csv", index = False)

Our output file is ready. Submit it and see where we stand on the leaderboard!