# Quora Insincere Questions Classification (Kaggle) - AdaBoost

<img src="https://qph.fs.quoracdn.net/main-qimg-416e6107aed22920d238a91f3bae6681" width="250px" alt="Quora Logo">

## Table Of Contents:
1. [Challenge Description](#Challenge-Description)
2. [Data Files Description](#Data-Files-Description)
3. [Import necessary libraries](#Import-necessary-libraries)
4. [File Paths](#File-Paths)
5. [Helper Methods](#Helper-Methods)
6. [Data Wrangling](#Data-Wrangling)
7. [Data Preprocessing](#Data-Preprocessing)
8. [Vectorization](#Vectorization)
9. [Machine Learning](#Machine-Learning)
10. [Evaluation](#Evaluation)
11. [Submission](#Submission)
12. [Export the model](#Export-the-model)

The model implementation is inspired by and adapted from <a href="https://www.kaggle.com/adritab/sub1-no-deep-learning-vanilla-tfidf-logreg-svm?fbclid=IwAR2ob2La2qtWhWKwDwcc7CxCUHXa10_CiKMqJ4w7FK1i-KltVrhzYJPbuTo"><i>this Kaggle kernel</i></a>.

### Challenge Description

In this challenge, we have to train a model that can detect whether a given question is insincere. The model should be able to tell if a question is really a statement rather than a question whose answer would benefit Quora's online community. We will implement and compare various models, pick the highest-performing one, and deploy it on a live instance.

### Data Files Description

Value to be predicted: *target* (1 = insincere, 0 = sincere) for each *qid*

Data files:
* **train.csv**: Contains the training data
* **test.csv**: Contains the testing data
* **embeddings.zip**: A set of already existing embeddings for this project

### Import necessary libraries

In [1]:
import string
import os
import math
import pickle

In [2]:
import pandas as pd
import numpy as np
import nltk

In [3]:
from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

In [4]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

In [5]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

In [6]:
# Parameters and definitions
RANDOM_SEED = 0
VAL_SET_SIZE = 0.2

In [7]:
# Load the English stop words and the stemmer
# (requires the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords'))
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

In [8]:
np.random.seed(RANDOM_SEED)

### File Paths

In [9]:
DATA_DIR = "../input/"
TRAIN_SAMPLES = DATA_DIR+"train.csv"
TEST_SAMPLES = DATA_DIR+"test.csv"
EMBD_SAMPLES = DATA_DIR+"embeddings.zip"
MODEL_OUT = "model-ada.pkl"

### Helper Methods

In [10]:
def load_data():
    """Loads the training and testing sets into memory."""
    return pd.read_csv(TRAIN_SAMPLES), pd.read_csv(TEST_SAMPLES)

### Data Wrangling

In [11]:
df_train, df_test = load_data()

In [12]:
# Sneak peek into the training set
df_train.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


### Data Preprocessing

In [13]:
def preprocess(data, col):
    """Preprocesses a text column of a given DataFrame in place.

    It converts the text to lowercase, removes punctuation and digits,
    removes stop words and stems the remaining words.

    Args:
        data: A pandas DataFrame.
        col: The name of the column that needs NLP preprocessing.

    Returns:
        The same DataFrame with the column preprocessed.
    """
    # Convert the text to lowercase
    data[col] = data[col].str.lower()

    # Remove punctuation (regex=True is needed on recent pandas versions)
    data[col] = data[col].str.replace(r'[^\w\s]', '', regex=True)

    # Remove digits
    data[col] = data[col].str.replace(r'\d+', '', regex=True)

    # Remove stop words
    data[col] = data[col].apply(lambda s: " ".join([item for item in s.split() if item not in stop_words]))

    # Stem words
    data[col] = data[col].apply(lambda s: " ".join([stemmer.stem(w) for w in s.split()]))

    return data
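
The cleaning steps above can be tried on a single string without pandas or NLTK. The following is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and stemming is left out because it needs NLTK.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, for illustration only)
DEMO_STOP_WORDS = {"why", "does", "in", "the"}


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Applies the non-stemming steps of the pipeline to a single string."""
    text = text.lower()                   # lowercase
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\d+", "", text)       # drop digits
    # split() also collapses the extra whitespace left behind by the removals
    return " ".join(w for w in text.split() if w not in stop_words)


print(clean_text("Why does velocity affect time in 2019?"))
```

With this stand-in stop-word set, the call prints `velocity affect time`.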

In [14]:
# Preprocess the training set (df_test is not preprocessed here and would need the same treatment before vectorization)
df_train = preprocess(df_train, "question_text")

In [15]:
df_train.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | quebec nationalist see provinc nation | 0 |
| 1 | 000032939017120e6e44 | adopt dog would encourag peopl adopt shop | 0 |
| 2 | 0000412ca6e4628ce2cf | veloc affect time veloc affect space geometri | 0 |
| 3 | 000042bf85aa498cd78e | otto von guerick use magdeburg hemispher | 0 |
| 4 | 0000455dfa3e01eae3af | convert montra helicon mountain bike chang tyre | 0 |


In [16]:
df_test.head()

| | qid | question_text |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arroga... |
| 1 | 00002bd4fb5d505b9161 | When should I apply for RV college of engineer... |
| 2 | 00007756b4a147d2b0b3 | What is it really like to be a nurse practitio... |
| 3 | 000086e4b7e1c7146103 | Who are entrepreneurs? |
| 4 | 0000c4c3fbe8785a3090 | Is education really making good people nowadays? |


### Vectorization

In [17]:
def build_TF(dt_train, dt_test):
    """Builds the TF-IDF matrices and splits the training data.

    Returns:
        A list holding the train/validation split, the test matrix and the fitted vectorizer.
    """
    max_features = 50000  # A larger vocabulary would also let in noisy n-grams
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 4), max_df=0.90, min_df=5, max_features=max_features)
    X = tfidf_vectorizer.fit_transform(dt_train['question_text'])
    X_test = tfidf_vectorizer.transform(dt_test['question_text'])
    y = dt_train["target"]
    return [train_test_split(X, y, test_size=VAL_SET_SIZE), X_test, tfidf_vectorizer]

In [18]:
tfvect = build_TF(df_train, df_test)
X_train, X_val, y_train, y_val = tfvect[0]
X_test = tfvect[1]
tfidf_vectorizer = tfvect[2]
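
Because `ngram_range=(2, 4)` excludes single words, it may help to see which candidate features a preprocessed question can produce. The sketch below is a plain-Python approximation of word n-gram extraction, not scikit-learn's implementation: `word_ngrams` is a hypothetical helper, it reuses the vectorizer's default token pattern, and it ignores the later filtering by `min_df`, `max_df` and `max_features`.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# tokens made of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, min_n=2, max_n=4):
    """Returns all word n-grams of a text for min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text)
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(word_ngrams("adopt dog would"))
print(word_ngrams("entrepreneur"))
```

The first call yields two bigrams and one trigram. The second yields an empty list, so any question that shrinks to a single token during preprocessing ends up as an all-zero row in the TF-IDF matrix.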

### Machine Learning

In [19]:
def build_model(X_train, y_train):
    """Builds an AdaBoost model."""
    return AdaBoostClassifier(random_state=RANDOM_SEED).fit(X_train, y_train)

In [20]:
# Build the model
ada_model = build_model(X_train, y_train)

In [21]:
# Produce predictions
y_pred_train = ada_model.predict(X_train)
y_pred_val = ada_model.predict(X_val)
y_pred_test = ada_model.predict(X_test)

### Evaluation

In [22]:
def produce_metrics(y, y_pred):
    """Produces a report containing the accuracy, f1-score, precision and recall metrics.
    
    Args:
        y: The true classification
        y_pred: The predicted classification
    """
    print("Accuracy: {}, F1 Score: {}, Precision: {}, Recall: {}".format(accuracy_score(y, y_pred),
                                                                     f1_score(y, y_pred, average="macro"),
                                                                     precision_score(y, y_pred, average="macro"),
                                                                     recall_score(y, y_pred, average="macro")))

In [23]:
def produce_classification_report(y, y_pred):
    """Produces a classification report.
    
    Args:
        y: The true classification
        y_pred: The predicted classification
    """
    print(classification_report(y, y_pred))

In [24]:
produce_metrics(y_train, y_pred_train)

Accuracy: 0.9398715854289944, F1 Score: 0.5376517664246443, Precision: 0.806112808573314, Recall: 0.5279591022107871


In [25]:
produce_metrics(y_val, y_pred_val)

Accuracy: 0.940285194755479, F1 Score: 0.5390461908198274, Precision: 0.804229504915433, Recall: 0.5286986823703099


In [26]:
produce_classification_report(y_val, y_pred_val)

              precision    recall  f1-score   support

           0       0.94      1.00      0.97    245149
           1       0.67      0.06      0.11     16076

   micro avg       0.94      0.94      0.94    261225
   macro avg       0.80      0.53      0.54    261225
weighted avg       0.92      0.94      0.92    261225
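
The gap between the high accuracy and the low macro F1 score can be retraced from the report itself. The snippet below is a small worked calculation that uses only the rounded values printed above, so its results are approximate; the helper name `f1` is an assumption made for this sketch.

```python
# Values copied from the validation classification report (rounded to 2 decimals)
precision_1, recall_1 = 0.67, 0.06
f1_0 = 0.97
support_0, support_1 = 245149, 16076


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_1 = f1(precision_1, recall_1)          # F1 of the insincere class
macro_f1 = (f1_0 + f1_1) / 2              # unweighted mean over both classes
majority_baseline = support_0 / (support_0 + support_1)  # accuracy of always predicting 0

print(f"F1 (class 1): {f1_1:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
print(f"Majority-class accuracy: {majority_baseline:.3f}")
```

The class 1 F1 comes out at about 0.11 and the macro F1 at about 0.54, matching the report. Always predicting class 0 would already reach an accuracy of about 0.938 on this split, so the model's 0.940 accuracy says little about how well insincere questions are detected.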



### Export the model

In [27]:
def export_model(vectorizer, model):
    """Exports the fitted vectorizer and model to a single pickle file.

    Args:
        vectorizer: The fitted TF-IDF vectorizer
        model: A Scikit-learn model
    """
    with open(MODEL_OUT, "wb") as f:
        pickle.dump((vectorizer, model), f)

In [28]:
export_model(tfidf_vectorizer, ada_model)
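
To reuse the exported file later, the tuple can be read back in the same order it was written. This is a minimal sketch under the assumption that the file was produced by `export_model`; `load_model` and `predict_questions` are hypothetical helpers that do not appear in the notebook, and the texts passed in are expected to be preprocessed the same way as the training data.

```python
import pickle


def load_model(path):
    """Loads the (vectorizer, model) tuple written by export_model."""
    with open(path, "rb") as f:
        vectorizer, model = pickle.load(f)
    return vectorizer, model


def predict_questions(vectorizer, model, texts):
    """Vectorizes already-preprocessed texts and returns the predicted labels."""
    return model.predict(vectorizer.transform(texts))
```

A typical call would be `vectorizer, model = load_model("model-ada.pkl")` followed by `predict_questions(vectorizer, model, ["adopt dog would encourag peopl adopt shop"])`. Since unpickling can execute arbitrary code, only files from a trusted source should be loaded this way.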