# Quora Insincere Questions Classification (Kaggle)

<img src="https://qph.fs.quoracdn.net/main-qimg-416e6107aed22920d238a91f3bae6681" width="250px" alt="Quora Logo">

## Table Of Contents:
1. [Challenge Description](#Challenge-Description)
2. [Data Files Description](#Data-Files-Description)
3. [Import necessary libraries](#Import-necessary-libraries)
4. [File Paths](#File-Paths)
5. [Helper Methods](#Helper-Methods)
6. [Data Wrangling](#Data-Wrangling)
7. [Feature Engineering](#Feature-Engineering)
8. [Data Preprocessing](#Data-Preprocessing)
9. [LSTM](#LSTM)
10. [Evaluation](#Evaluation)

### Challenge Description

In this challenge, we have to train a model that can detect whether a given question is insincere. The model should be able to tell when a question is really a statement rather than a question whose answer would benefit Quora's online community. We will implement and compare various models, pick the highest-performing one, and deploy it on a live instance.

### Data Files Description

Value to be predicted: 0 (sincere) or 1 (insincere) for each `qid`

Data files:
* **train.csv**: Contains the training data
* **test.csv**: Contains the testing data
* **embeddings.zip**: A set of already existing embeddings for this project

### Import necessary libraries

In [1]:
import string
import os
import math

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
%matplotlib inline

In [3]:
from wordcloud import WordCloud
from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

In [4]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

In [5]:
from keras.layers import Input
from keras import Model
from keras.preprocessing import sequence,text
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense,Dropout,Embedding,LSTM,Conv1D,GlobalMaxPooling1D,Flatten,MaxPooling1D,GRU,SpatialDropout1D,Bidirectional
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam
from keras.callbacks import Callback
import keras.backend as K
from sklearn.model_selection import train_test_split
from tqdm import tqdm

Using TensorFlow backend.


In [6]:
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

In [7]:
# Parameters and definitions
RANDOM_SEED = 0
VAL_SET_SIZE = 0.2

In [8]:
np.random.seed(RANDOM_SEED)

### File Paths

In [9]:
DATA_DIR = "../input/"
TRAIN_SAMPLES = DATA_DIR+"train.csv"
TEST_SAMPLES = DATA_DIR+"test.csv"
EMBD_SAMPLES = DATA_DIR+"embeddings.zip"
SUBMISSION_FILE = DATA_DIR+"submission.csv"

### Helper Methods

In [10]:
def load_data():
    """Loads the training and testing sets into the memory.
    """
    return pd.read_csv(TRAIN_SAMPLES), pd.read_csv(TEST_SAMPLES)

### Data Wrangling

In [11]:
df_train, df_test = load_data()

### Feature Engineering

In [12]:
def build_features(data):
    """Adds simple per-question text statistics as new columns and returns the DataFrame.
    """
    # Number of words in each question
    data["n_words"] = data["question_text"].apply(lambda x: len(str(x).split()))
    
    # Number of unique words in each question
    data["uniq_words"] = data["question_text"].apply(lambda x: len(set(str(x).split())))
    
    # Number of characters in each question
    data["n_chars"] = data["question_text"].apply(lambda x: len(str(x)))

    # Number of stop words in each question (disabled)
#     data["n_swords"] = data["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))

    # Number of punctuation characters in each question
    data["n_punct"] = data['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

    # Number of upper-case words in each question
    data["n_up_words"] = data["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

    # Number of title-case words in each question
    data["n_titles"] = data["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

    # Average word length in each question (0.0 for empty text, which avoids NumPy's empty-mean warning)
    data["m_w_len"] = data["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]) if str(x).split() else 0.0)
    
    return data
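
To make the feature definitions concrete, the sketch below applies the same expressions to a single made-up question using only the standard library. The sample sentence and the `describe_question` helper are illustrative and not part of the notebook.

```python
import string

def describe_question(text):
    """Compute the same per-question statistics as build_features for one string."""
    text = str(text)
    words = text.split()
    return {
        "n_words": len(words),
        "uniq_words": len(set(words)),
        "n_chars": len(text),
        "n_punct": sum(1 for c in text if c in string.punctuation),
        "n_up_words": sum(1 for w in words if w.isupper()),
        "n_titles": sum(1 for w in words if w.istitle()),
        # Assumption: report 0.0 for empty text instead of NaN
        "m_w_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

sample = "Why is NASA in Texas, USA?"  # made-up question
features = describe_question(sample)
print(features)
```

Note that `isupper()` and `istitle()` ignore attached punctuation, so a token such as `USA?` still counts as an upper-case word.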

In [13]:
df_train = build_features(df_train)

In [14]:
# Record the min, max and average value for the new columns (n_words, uniq_words, n_chars, n_punct, n_up_words, n_titles, m_w_len)
df_train.describe()

| | target | n_words | uniq_words | n_chars | n_punct | n_up_words | n_titles | m_w_len |
|---|---|---|---|---|---|---|---|---|
| count | 1306122.0 | 1306122.0 | 1306122.0 | 1306122.0 | 1306122.0 | 1306122.0 | 1306122.0 | 1306122.0 |
| mean | 0.06187018 | 12.80361 | 12.13578 | 70.67884 | 1.746492 | 0.450657 | 2.121108 | 4.671008 |
| std | 0.2409197 | 7.052437 | 6.040779 | 38.78428 | 1.672051 | 0.8490158 | 1.495405 | 0.8187338 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 8.0 | 8.0 | 45.0 | 1.0 | 0.0 | 1.0 | 4.111111 |
| 50% | 0.0 | 11.0 | 11.0 | 60.0 | 1.0 | 0.0 | 2.0 | 4.6 |
| 75% | 0.0 | 15.0 | 15.0 | 85.0 | 2.0 | 1.0 | 3.0 | 5.142857 |
| max | 1.0 | 134.0 | 96.0 | 1017.0 | 411.0 | 37.0 | 37.0 | 57.66667 |
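
The mean of `target` in the summary above is also the share of insincere questions, so the class balance can be read off directly. The short calculation below uses only the two numbers printed by `describe()`; the variable names are illustrative. It suggests that a model always predicting 0 would already score close to 94% accuracy, which is why accuracy alone says little on this data set.

```python
# Values copied from the describe() output above
n_rows = 1306122
target_mean = 0.06187018

# The mean of a 0/1 column is the fraction of ones
n_insincere = round(n_rows * target_mean)
n_sincere = n_rows - n_insincere

# Accuracy of a trivial classifier that always predicts the majority class (0)
majority_baseline_accuracy = n_sincere / n_rows

print(n_insincere, n_sincere)
print(f"{majority_baseline_accuracy:.4f}")
```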


In [15]:
# Sneak peek into the updated training set
df_train.head()

| | qid | question_text | target | n_words | uniq_words | n_chars | n_punct | n_up_words | n_titles | m_w_len |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 | 13 | 13 | 72 | 1 | 0 | 2 | 4.615385 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 | 16 | 15 | 81 | 2 | 0 | 1 | 4.125 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 | 10 | 8 | 67 | 2 | 0 | 2 | 5.8 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 | 9 | 9 | 57 | 1 | 0 | 4 | 5.444444 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 | 15 | 15 | 77 | 1 | 2 | 3 | 4.2 |


### Data Preprocessing

In [16]:
def preprocess(data):
    # Convert data set to lowercase
    data["question_text"] = data["question_text"].apply(lambda s: s.lower())
    
    # Remove punctuation from the data set (regex=True is required on recent pandas versions)
    data["question_text"] = data['question_text'].str.replace(r'[^\w\s]', '', regex=True)

    # Remove digits from the data set
    data["question_text"] = data["question_text"].str.replace(r'\d+', '', regex=True)

    # Remove stop words from question text
    data["question_text"] = data["question_text"].apply(lambda s: " ".join([item for item in s.split() if item not in stop_words]))

    # Stem words
    data["question_text"] = data["question_text"].apply(lambda s: " ".join([stemmer.stem(w) for w in s.split()]))
    
    return data
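
The first four cleaning steps can be tried on a single string without pandas or NLTK. The sketch below is a simplified stand-in for `preprocess`: the tiny `TOY_STOP_WORDS` set and the sample sentence are made up for illustration, and stemming is left out.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"what", "are", "the", "in"}

def clean_text(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)       # remove digits
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean_text("What are the 3 best Python books in 2019?")
print(cleaned)
```

Because the result is re-joined with single spaces, the gaps left by removed digits and stop words do not survive into the cleaned text.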

In [17]:
# Preprocess training set
df_train = preprocess(df_train)

In [18]:
# Recompute the engineered features on the preprocessed text
df_train = build_features(df_train)

# Delete columns that are no longer meaningful after lowercasing and punctuation removal
del df_train["n_punct"]
del df_train["n_up_words"]
del df_train["n_titles"]

# Display the resulting DataFrame
df_train.head()



| | qid | question_text | target | n_words | uniq_words | n_chars | m_w_len |
|---|---|---|---|---|---|---|---|
| 0 | 00002165364db923c7e6 | quebec nationalist see provinc nation | 0 | 5 | 5 | 37 | 6.6 |
| 1 | 000032939017120e6e44 | adopt dog would encourag peopl adopt shop | 0 | 7 | 6 | 41 | 5.0 |
| 2 | 0000412ca6e4628ce2cf | veloc affect time veloc affect space geometri | 0 | 7 | 5 | 45 | 5.571429 |
| 3 | 000042bf85aa498cd78e | otto von guerick use magdeburg hemispher | 0 | 6 | 6 | 40 | 5.833333 |
| 4 | 0000455dfa3e01eae3af | convert montra helicon mountain bike chang tyre | 0 | 7 | 7 | 47 | 5.857143 |


### Prepare validation set

In [19]:
# Split training set into training and validation sets
df_train, df_val = train_test_split(df_train, test_size=VAL_SET_SIZE, random_state = RANDOM_SEED)

### Resources


For LSTM: https://www.kaggle.com/sdelecourt/simple-lstm-that-does-the-job

For loading embeddings: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

### Load word embeddings

In [20]:
# File path of pretrained word embeddings
EMB_FILE_PATH = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'

In [21]:
# Load GloVe Word Embeddings
def load_embeddings(file_path):
    """ Loads word embeddings and returns embeddings index
    """
    embeddings_index = {}
    # Open with an explicit encoding and let the context manager close the file
    with open(file_path, encoding='utf8') as f:
        for line in tqdm(f):
            values = line.split(" ")
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index
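
The parsing logic can be checked without the multi-gigabyte GloVe file. The sketch below feeds three made-up 4-dimensional vectors through the same "word followed by space-separated floats" format; `parse_embeddings` and the fake data are illustrative names, not part of the notebook.

```python
import io
import numpy as np

def parse_embeddings(lines):
    """Build a {word: vector} dict from lines of the form 'word v1 v2 ... vn'."""
    index = {}
    for line in lines:
        values = line.rstrip().split(" ")
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

# Made-up vectors standing in for the real 300-dimensional file
fake_file = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat -0.5 0.0 0.5 1.0\n"
    "dog 0.9 0.8 0.7 0.6\n"
)
toy_index = parse_embeddings(fake_file)
print(len(toy_index), toy_index["cat"].shape)
```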

In [22]:
emb_index = load_embeddings(EMB_FILE_PATH)
print('Found %s word vectors.' % len(emb_index))

2196017it [03:26, 10652.80it/s]

Found 2196016 word vectors.





In [23]:
# Extract text and targets from training set
train_questions = df_train['question_text'].values
y_train = df_train['target'].values

# Extract text and targets from validation set
val_questions = df_val['question_text'].values
y_val = df_val['target'].values

# Extract text and targets from test set
test_questions = df_test['question_text'].values

In [24]:
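# Upper bound on the vocabulary size kept by the tokenizer.
# Note: 1044897 is the number of training samples; the fitted vocabulary is smaller
# (the embedding layer below has 157745 rows, i.e. 157744 distinct words plus the padding index).
NUM_UNIQUE_WORDS = 1044897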
# Maximum number of words in a question
MAX_WORDS = 125

In [25]:
def get_tokenizer(num_unique_words):
    """ Returns tokenizer
    """
    return Tokenizer(num_words=num_unique_words)

In [26]:
# Convert questions into vectors of integers using Keras Tokenizer
tokenizer = get_tokenizer(NUM_UNIQUE_WORDS)
tokenizer.fit_on_texts(list(train_questions))

In [27]:
# Store tokenizer
import pickle

with open('LSTM_tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [28]:
X_train = tokenizer.texts_to_sequences(train_questions)
X_val = tokenizer.texts_to_sequences(val_questions)
X_test = tokenizer.texts_to_sequences(test_questions)
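
As a rough mental model of the two tokenizer calls, the sketch below builds a frequency-ranked word index and maps texts to id lists in plain Python. It is a simplification and not the Keras implementation: the real `Tokenizer` also lowercases, filters punctuation and honors `num_words`. The helper names and toy texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids starting at 1, most frequent word first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its id; words not seen during fitting are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["adopt dog adopt shop", "dog chase cat dog"]
toy_word_index = fit_word_index(train_texts)
toy_sequences = to_sequences(["dog adopt unicorn"], toy_word_index)
print(toy_word_index)
print(toy_sequences)
```

Since the tokenizer is fitted on the training questions only, words that appear solely in the validation or test set are dropped from their sequences, as `unicorn` is here.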

In [29]:
# Pad sequences so that they are all the same length. Questions shorter than maxlen are padded with zeros.
X_train = sequence.pad_sequences(X_train, maxlen=MAX_WORDS)
X_val = sequence.pad_sequences(X_val, maxlen=MAX_WORDS)
X_test = sequence.pad_sequences(X_test, maxlen=MAX_WORDS)
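
With its default arguments, `pad_sequences` pads on the left and, for sequences longer than `maxlen`, keeps the last `maxlen` ids. The NumPy sketch below reproduces that default behavior on toy data; `pad_pre` is an illustrative helper, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids of longer sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:  # skip empty sequences, which stay all-zero
            padded[row, -len(trimmed):] = trimmed
    return padded

toy_padded = pad_pre([[5, 3], [7, 1, 4, 2, 9, 6], []], maxlen=4)
print(toy_padded)
```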

In [30]:
# Create word index
word_index = tokenizer.word_index

In [31]:
# Dimension of embedding matrix
EMB_DIM = 300

### LSTM

In [36]:
embedding_matrix = np.zeros((len(word_index) + 1, EMB_DIM))
# Words not found in the embedding index keep their all-zero row.
for word, i in word_index.items():
    embedding_vector = emb_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
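
Because the questions were lowercased and stemmed, some tokens (for example `provinc` in the preprocessed output earlier) may have no entry in the pretrained vectors and would stay all-zero. The sketch below repeats the matrix construction on made-up data and counts how many vocabulary words were found; all `toy_` names and vectors are hypothetical.

```python
import numpy as np

toy_dim = 3
# Made-up pretrained vectors
toy_emb_index = {
    "dog": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
# Made-up tokenizer vocabulary (ids start at 1; row 0 is the padding row)
toy_vocab = {"dog": 1, "provinc": 2, "cat": 3}

toy_matrix = np.zeros((len(toy_vocab) + 1, toy_dim))
hits = 0
for word, i in toy_vocab.items():
    vector = toy_emb_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector
        hits += 1

coverage = hits / len(toy_vocab)
print(toy_matrix)
print(f"coverage: {coverage:.2f}")
```

Running the same count against `word_index` and `emb_index` would show how much of the real vocabulary the frozen embedding layer actually covers.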

In [37]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMB_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_WORDS,
                            trainable=False)

In [38]:
lstm_out = 200 # dimensionality of output space

In [39]:
model = Sequential()
model.add(embedding_layer)
# Keras 2 argument names: dropout (input connections) and recurrent_dropout (recurrent connections)
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 125, 300)          47323500  
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               400800    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 201       
Total params: 47,724,501
Trainable params: 401,001
Non-trainable params: 47,323,500
_________________________________________________________________
None
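
The parameter counts in the summary can be reproduced by hand. The calculation below assumes the standard LSTM layout of four gate-like transformations (input, forget, cell candidate, output), each with an input kernel, a recurrent kernel and a bias; the vocabulary row count is inferred from the embedding parameter total divided by the embedding dimension.

```python
vocab_rows = 157745   # len(word_index) + 1, inferred from 47,323,500 / 300
emb_dim = 300
lstm_units = 200

# One 300-dimensional vector per vocabulary row (frozen, hence non-trainable)
embedding_params = vocab_rows * emb_dim

# Four transformations, each with input kernel, recurrent kernel and bias
lstm_params = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)

# Single sigmoid unit: one weight per LSTM output plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```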


In [40]:
# # Create model 
# def create_LSTM(embedding_layer):
#     """ Creates LSTM model with embedding layer, LSTM and dense layer
#     """
#     model = Sequential()
#     model.add(embedding_layer)
#     model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
#     model.add(Dense(1,activation='sigmoid'))
#     model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
#     return model

In [41]:
# LSTM_model = create_LSTM(embedding_layer)

In [42]:
# Fit model to training data
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=2, batch_size=1024, verbose=1)

Train on 1044897 samples, validate on 261225 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fc2f4314fd0>

In [43]:
# Save model 
model.save('LSTM_1.h5')

In [44]:
# Make predictions for validation set 
y_pred_val = model.predict(X_val, verbose=1)



In [45]:
# Make predictions for training set
y_pred_train = model.predict(X_train, verbose=1)



In [46]:
# Convert probabilities into predictions for validation set
y_te_val = (np.array(y_pred_val) > 0.5).astype(int)

In [47]:
# Convert probabilities into predictions for training set
y_te_train = (np.array(y_pred_train) > 0.5).astype(int)

### Evaluation

In [48]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

In [49]:
def produce_metrics(y, y_pred):
    """Produces a report containing the accuracy, f1-score, precision and recall metrics.
    
    Args:
        y: The true classification
        y_pred: The predicted classification
    """
    print("Accuracy: {}, F1 Score: {}, Precision: {}, Recall: {}".format(accuracy_score(y, y_pred),
                                                                     f1_score(y, y_pred, average="macro"),
                                                                     precision_score(y, y_pred, average="macro"),
                                                                     recall_score(y, y_pred, average="macro")))


In [50]:
def produce_classification_report(y, y_pred):
    """Produces a classification report.
    
    Args:
        y: The true classification
        y_pred: The predicted classification
    """
    print(classification_report(y, y_pred))

In [51]:
produce_metrics(y_val, y_te_val)

Accuracy: 0.954717197817973, F1 Score: 0.777098944834635, Precision: 0.8227636210799454, Recall: 0.7442151677611695


In [52]:
produce_metrics(y_train, y_te_train)

Accuracy: 0.9555592560797859, F1 Score: 0.7835296667591048, Precision: 0.82829239606895, Recall: 0.7509097990272653


In [53]:
produce_classification_report(y_val, y_te_val)

              precision    recall  f1-score   support

           0       0.97      0.98      0.98    245149
           1       0.68      0.50      0.58     16076

   micro avg       0.95      0.95      0.95    261225
   macro avg       0.82      0.74      0.78    261225
weighted avg       0.95      0.95      0.95    261225
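
Since `produce_metrics` uses `average="macro"`, the reported F1 of about 0.78 is the unweighted mean of the per-class scores (0.98 and 0.58), not the F1 of the insincere class alone. The sketch below computes both quantities by hand on a made-up set of ten labels; `f1_for_class` is an illustrative helper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 8 sincere, 2 insincere, one insincere question missed
toy_y_true = [0] * 8 + [1] * 2
toy_y_pred = [0] * 8 + [1, 0]

f1_neg = f1_for_class(toy_y_true, toy_y_pred, 0)
f1_pos = f1_for_class(toy_y_true, toy_y_pred, 1)
macro_f1 = (f1_neg + f1_pos) / 2
print(f"class 0: {f1_neg:.3f}, class 1: {f1_pos:.3f}, macro: {macro_f1:.3f}")
```

On imbalanced data the macro average is pulled up by the easy majority class, so the class 1 row of the report is the more informative number for insincere questions.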



In [54]:
produce_classification_report(y_train, y_te_train)

              precision    recall  f1-score   support

           0       0.97      0.98      0.98    980163
           1       0.69      0.52      0.59     64734

   micro avg       0.96      0.96      0.96   1044897
   macro avg       0.83      0.75      0.78   1044897
weighted avg       0.95      0.96      0.95   1044897



In [55]:
print("Accuracy: {}, F1 Score: {}".format(accuracy_score(y_val, y_te_val), 
                                          f1_score(y_val, y_te_val, average="macro")))

Accuracy: 0.954717197817973, F1 Score: 0.777098944834635
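
The validation report shows a recall of only 0.50 for class 1 at the fixed 0.5 cutoff. One option, which this notebook does not implement, is to scan several cutoffs on the validation predictions and keep the one with the best class 1 F1. The sketch below shows the idea on made-up probabilities; `best_f1_threshold` and the toy arrays are hypothetical, and in the notebook the inputs would be `y_val` and `y_pred_val`.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, thresholds):
    """Return (threshold, f1) maximizing positive-class F1 over the candidates."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob).ravel()  # model.predict returns shape (n, 1)
    best = (None, -1.0)
    for t in thresholds:
        y_hat = (y_prob > t).astype(int)
        tp = int(np.sum((y_hat == 1) & (y_true == 1)))
        fp = int(np.sum((y_hat == 1) & (y_true == 0)))
        fn = int(np.sum((y_hat == 0) & (y_true == 1)))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (float(t), f1)
    return best

# Made-up labels and predicted probabilities
toy_true = [0, 0, 0, 0, 1, 1, 0, 1]
toy_prob = [0.05, 0.12, 0.22, 0.47, 0.36, 0.41, 0.33, 0.81]

threshold, score = best_f1_threshold(toy_true, toy_prob, [0.2, 0.3, 0.4, 0.5, 0.6])
print(threshold, score)
```

The cutoff should be chosen on validation data only and then applied unchanged to the test predictions.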
