In [71]:
import os
import re
import string
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df_train = pd.read_csv("../data/train.csv", header=0)

In [3]:
df_train.columns

Index(['qid', 'question_text', 'target'], dtype='object')

### Column meanings

- **qid** a unique key for each question
- **question_text** the question that was asked
- **target** the label where 1 is an insincere question

In [12]:
df_train.head(5)

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


### Balance of label

Only 6.2% of the questions are classed as insincere, so the data set is imbalanced and we will need to account for this during model development. When dealing with imbalanced classes, we need to consider the following:

- Use the correct evaluation metrics (i.e. not accuracy)
- Resampling
    - Over-sampling
    - Under-sampling
- Use K-Fold cross validation correctly with resampling
- Ensemble different resampled datasets
- Resample with different ratios
- Cluster the abundant class
- Choose specialised models such as XGBoost

See https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
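To illustrate the first point in the list, here is a minimal sketch of why accuracy is misleading at roughly this class ratio. The labels are toy data (6 positives out of 100, chosen to mirror the 6.2% figure), and the helper functions are written out by hand for illustration; they are not part of the notebook's pipeline.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (assumption): 6 insincere questions out of 100
y_true = [1] * 6 + [0] * 94

# A "model" that always predicts the majority class
y_all_zero = [0] * 100

print(accuracy(y_true, y_all_zero))  # 0.94
print(f1_score(y_true, y_all_zero))  # 0.0
```

A classifier that never flags a question scores 94% accuracy here while finding no insincere questions at all, which is why a metric that focuses on the positive class, such as F1, is a better fit.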

In [18]:
df_train[df_train["target"] == 1].shape[0] / df_train.shape[0] * 100

6.187017751787352
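The same share can also be read off with `value_counts(normalize=True)`, or with the mean of the column, since the label is 0/1. The sketch below uses a small made-up frame (`toy`) as a stand-in for `df_train`, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df_train: 2 insincere questions out of 10
toy = pd.DataFrame({"target": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# Share of each class as a percentage
class_share = toy["target"].value_counts(normalize=True) * 100
print(class_share)

# For a 0/1 label, the mean is the share of the positive class
insincere_pct = toy["target"].mean() * 100
print(insincere_pct)
```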

### Text data 'cleaning'

In [48]:
# To lower case
df_train['temp_question'] = df_train.question_text.str.lower()

In [60]:
# strip punctuation (re.escape keeps characters such as ], ^ and \ literal inside the character class)
df_train['temp_question'] = df_train.temp_question.apply(lambda x: re.sub('[' + re.escape(string.punctuation) + ']', '', x))
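As an alternative to the regular expression, punctuation can be removed with `str.translate`, which avoids any escaping concerns. This is a minimal sketch; `PUNCT_TABLE` and `strip_punctuation` are illustrative names, and the sample string is made up.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation from text."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("can't stop, won't stop [really]!"))
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes and other Unicode punctuation would pass through either approach unchanged.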

In [None]:
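# strip stop words
# requires the NLTK stopwords corpus: nltk.download('stopwords')
# build the set once; calling stopwords.words('english') for every word is very slow
english_stop_words = set(stopwords.words('english'))

def stop_words(text):
    text = [word for word in text.split() if word not in english_stop_words]
    
    return " ".join(text)

df_train['temp_question'] = df_train.temp_question.apply(stop_words)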
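The three cleaning steps can also be combined into one function, which makes them easy to check on a single string before running over the whole frame. This is a sketch under assumptions: `STOP_WORDS` is a tiny hand-picked stand-in for `stopwords.words('english')` so the example runs without the NLTK corpus, and `clean_question` is a hypothetical helper name.

```python
import re
import string

# Hypothetical stand-in for the NLTK English stop-word list (the real list is much longer)
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "do", "does", "how", "why", "to", "of", "in"}
)

# Compile once; re.escape keeps every punctuation character literal inside the class
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_question(text, stop_words=STOP_WORDS):
    """Lower-case, strip ASCII punctuation, then drop stop words."""
    text = text.lower()
    text = PUNCT_RE.sub("", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_question("Why does velocity affect time?"))  # velocity affect time
```

On the real data this could be applied as `df_train.question_text.apply(clean_question)`, passing the NLTK set as `stop_words`. Because punctuation is stripped first, contractions such as "don't" become "dont" and no longer match stop-word entries that contain an apostrophe.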

In [73]:
df_train.temp_question[0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True