# Introduction

Quora Insincere Questions Classification

![logo](images/quora.jpg)


## Context

Quora is a popular website where anyone can ask and/or answer questions. It receives more than 100 million unique visitors per month.

Like any other forum, Quora is facing a problem: toxic questions and comments.

As you can imagine, Quora's teams cannot check every question and answer by hand, so they asked the data science community to help them classify insincere questions automatically.

## Data

This challenge was launched on Kaggle: https://www.kaggle.com/c/quora-insincere-questions-classification

Read the overview on Kaggle. Quora provides a dataset of labeled questions with the following columns:

- `qid`: a unique identifier for each question, a hexadecimal number
- `question_text`: the text of the question
- `target`: either 1 (insincere question) or 0 (sincere question)

🔦 In this competition, the metric used for performance evaluation is the **F1 score**.
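
To make the metric concrete, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical and not taken from the Quora data). It assumes the F1 variant of the F-score, that is the harmonic mean of precision and recall, and compares the manual formula with `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (hypothetical): 3 insincere questions out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 3

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

On this toy example the accuracy is 0.7 while F1 is only 0.4, which is why accuracy alone can be misleading when one class is rare.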

In [3]:
import pandas as pd
import numpy as np

# EDA

In [4]:
filename = "../../../../../data/train.csv"

In [5]:
df = pd.read_csv(filename)

In [6]:
df.shape

(1306122, 3)

In [7]:
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [9]:
print("Ratio of toxic question", df[df.target == 1].target.count()/df.shape[0]*100, "%")
print("Ratio of non toxic question", df[df.target == 0].target.count()/df.shape[0]*100, "%")

Ratio of toxic question 6.187017751787352 %
Ratio of non toxic question 93.81298224821265 %


In [10]:
pd.set_option('display.max_colwidth', 1000)
df[df.target==1].head(n=10)

Unnamed: 0,qid,question_text,target
22,0000e91571b60c2fb487,Has the United States become the largest dictatorship in the world?,1
30,00013ceca3f624b09f42,Which babies are more sweeter to their parents? Dark skin babies or light skin babies?,1
110,0004a7fcb2bf73076489,If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?,1
114,00052793eaa287aff1e1,"I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?",1
115,000537213b01fd77b58a,Which races have the smallest penis?,1
119,00056d45a1ce63856fc6,Why do females find penises ugly?,1
127,0005de07b07a17046e27,How do I marry an American woman for a Green Card? How much do they charge?,1
144,00068875d7c82a5bcf88,"Why do Europeans say they're the superior race, when in fact it took them over 2,000 years until mid 19th century to surpass China's largest economy?",1
156,0006ffd99a6599ff35b3,Did Julius Caesar bring a tyrannosaurus rex on his campaigns to frighten the Celts into submission?,1
167,00075f7061837807c69f,In what manner has Republican backing of 'states rights' been hypocritical and what ways have they actually restricted the ability of states to make their own laws?,1


The dataset is quite big, so let's start with a sample of 10,000 rows.

In [12]:
from sklearn.utils import resample

df_sample = resample(df, n_samples=10000, replace=False, random_state=0)

In [13]:
df_sample.shape

(10000, 3)

Check the proportion of toxic questions within our sample:

In [14]:
print("Ratio of toxic question", df_sample[df_sample.target == 1].target.count()/df_sample.shape[0]*100, "%")
print("Ratio of non toxic question", df_sample[df_sample.target == 0].target.count()/df_sample.shape[0]*100, "%")

Ratio of toxic question 5.71 %
Ratio of non toxic question 94.28999999999999 %
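
The sample ratio differs slightly from the full dataset because the rows are drawn at random. If the class ratio needs to be preserved, one option is a stratified sample. The sketch below is an assumption about how this could be done, not what the notebook does: it uses `train_test_split` with `stratify` on a synthetic DataFrame (`toy_df` is a hypothetical stand-in for the training data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (hypothetical, about 6% positives).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "question_text": [f"question {i}" for i in range(50_000)],
    "target": (rng.random(50_000) < 0.06).astype(int),
})

# Keep 5,000 rows; stratifying on the label preserves the class ratio up to rounding.
toy_sample, _ = train_test_split(
    toy_df, train_size=5_000, stratify=toy_df["target"], random_state=0
)

full_ratio = toy_df["target"].mean()
sample_ratio = toy_sample["target"].mean()
print(f"full: {full_ratio:.4f}, sample: {sample_ratio:.4f}")
```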


# Text Preprocessing

In [2]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [71]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = stopwords.words("english")

def text_preprocess(text):
    text = text.lower()
    # After inspecting the out-of-vocabulary words, we decided to apply the following replacements
    text = text.replace("-", " ").replace("/", " ").replace("\\", " ").replace("'", " ")
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha()]
    return tokens
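
Note that `text_preprocess` does not use the `contraction_mapping` dictionary defined earlier, and that it replaces apostrophes with spaces, so any expansion would have to run before that step. The sketch below shows one possible way to apply such a mapping with a regular expression. The helper `expand_contractions` and the three-entry `toy_contractions` dictionary are hypothetical and only there to keep the example self-contained.

```python
import re

# Small subset of the mapping, repeated here so the sketch runs on its own.
toy_contractions = {"don't": "do not", "what's": "what is", "i'm": "i am"}

# One alternation pattern, longest keys first so longer contractions win.
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(toy_contractions, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercased match; the casing of the original is not preserved.
    return pattern.sub(lambda m: toy_contractions[m.group(0).lower()], text)

print(expand_contractions("What's wrong? I don't know."))
```

This sketch only handles the straight apostrophe `'`; typographic apostrophes would need to be normalized first.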

In [72]:
df_sample["tokens"] = df_sample["question_text"].apply(text_preprocess)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [73]:
df_sample.head()

Unnamed: 0,qid,question_text,target,tokens
879280,ac452e6e3f90075f2caa,Whether advances from customers are to be reinstated in the financial statements?,0,"[whether, advances, from, customers, are, to, be, reinstated, in, the, financial, statements, ?]"
43285,087846c595acf81fd460,How can you get help?,0,"[how, can, you, get, help, ?]"
740986,911f311c129684b0eb08,How does one succeed as a lecturer in medicine in a medical college in India?,0,"[how, does, one, succeed, as, a, lecturer, in, medicine, in, a, medical, college, in, india, ?]"
472594,5c8a3fa6a63e7e1b86c8,What is the purpose behind the Yellow wallpaper?,0,"[what, is, the, purpose, behind, the, yellow, wallpaper, ?]"
453814,58e70e16bd6a778b8e6e,What is the problem by applying under non spp programme for diploma?,0,"[what, is, the, problem, by, applying, under, non, spp, programme, for, diploma, ?]"


# Word Embeddings

Should we preprocess the text with our classic methods? Not really.
First, let's check what proportion of our document vocabulary is covered by the embeddings.

In [74]:
import numpy as np

# Read a pretrained GloVe file and return the vocabulary and a word -> vector dictionary
def read_glove_vecs(glove_file):
    words = set()  # a set keeps membership tests fast
    word_to_vec_map = {}
    bad = 0

    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            try:
                word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
                words.add(curr_word)
            except ValueError:
                # Skip lines whose values cannot be parsed as floats
                bad += 1

    print(f'There are {bad} bad lines')
    return words, word_to_vec_map

In [24]:
glove_file = "../../../../../pretrained_model/glove/glove.6B.50d.txt"
words, word_to_vec_map = read_glove_vecs(glove_file)

There are 0 bad lines


In [40]:
import operator

def is_in_vocab(tokens_list):
    # Split the corpus vocabulary into words covered / not covered by the embeddings
    in_vocab = {}
    out_vocab = {}
    for tokens in tokens_list:
        for w in tokens:
            if w.lower() in words:
                in_vocab[w] = 1
            else:
                # Count how often each out-of-vocabulary word occurs
                out_vocab[w] = out_vocab.get(w, 0) + 1
    # Sort out-of-vocabulary words by decreasing frequency
    out_vocab_ordered = sorted(out_vocab.items(), key=operator.itemgetter(1))[::-1]
    return in_vocab, out_vocab_ordered

In [75]:
text = np.array(df_sample.tokens)

In [76]:
in_vocab, out_vocab = is_in_vocab(text)

In [77]:
total_vocab = len(in_vocab) + len(out_vocab)
in_vocab_ratio = len(in_vocab) / total_vocab
out_vocab_ratio = len(out_vocab) / total_vocab

In [78]:
print("proportion of words in word embedding vocab: ", in_vocab_ratio*100, "%")
print("proportion of words not in word embedding vocab: ", out_vocab_ratio*100, "%")

proportion of words in word embedding vocab:  91.67427900300143 %
proportion of words not in word embedding vocab:  8.325720996998564 %
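
The ratio above is computed over distinct words. A complementary view, sketched below, weights each word by how often it occurs, since a rare out-of-vocabulary word matters less than a frequent one. The vocabulary and questions in this example are hypothetical toy data, not the GloVe vocabulary or the Quora sample.

```python
from collections import Counter

# Hypothetical toy vocabulary and tokenized questions.
toy_vocab = {"what", "is", "the", "price", "of", "gold"}
toy_questions = [
    ["what", "is", "ethereum"],
    ["what", "is", "the", "price", "of", "gold"],
]

# Number of occurrences of each token across all questions.
counts = Counter(token for question in toy_questions for token in question)

# Share of distinct words covered by the embeddings.
unique_coverage = sum(1 for w in counts if w in toy_vocab) / len(counts)

# Share of token occurrences covered by the embeddings.
token_coverage = sum(c for w, c in counts.items() if w in toy_vocab) / sum(counts.values())

print(f"unique words: {unique_coverage:.3f}, token occurrences: {token_coverage:.3f}")
```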


In [79]:
len(out_vocab)

1276

In [80]:
out_vocab

[('quorans', 9),
 ('cryptocurrencies', 7),
 ('kvpy', 7),
 ('x^2', 6),
 ('wbjee', 5),
 ('cryptocurrency', 5),
 ('brexit', 5),
 ('bitsat', 5),
 ('blockchain', 4),
 ('upvotes', 4),
 ('afsb', 3),
 ('infty', 3),
 ('kyc', 3),
 ('b+', 3),
 ('chsl', 3),
 ('async', 3),
 ('shopify', 3),
 ('viteee', 3),
 ('chitkara', 3),
 ('infp', 3),
 ('ethereum', 3),
 ('bnbr', 3),
 ('cot^2', 3),
 ('articleship', 3),
 ('flipkart', 3),
 ('josaa', 3),
 ('afcat', 3),
 ('.net', 3),
 ('offcampus', 2),
 ('igdtuw', 2),
 ('admirial', 2),
 ('tnpsc', 2),
 ('14500', 2),
 ('narcassist', 2),
 ('odsp', 2),
 ('n+1', 2),
 ('rightarrow', 2),
 ('lim_', 2),
 ('iiest', 2),
 ('covarient', 2),
 ('marksheet', 2),
 ('apist', 2),
 ('tution', 2),
 ('homolysis', 2),
 ('nh4oh', 2),
 ('0.1m', 2),
 ('sqrt', 2),
 ('mblog', 2),
 ('aadhaar', 2),
 ('hairfall', 2),
 ('aemon', 2),
 ('katachi', 2),
 ('cibil', 2),
 ('redmi', 2),
 ('oneplus', 2),
 ('litecoin', 2),
 ('aadhar', 2),
 ('g_2', 2),
 ('g_1', 2),
 ('a_n', 2),
 ('vajiram', 2),
 ('²', 2),
 ...

How can we improve this rate?

- Should we remove punctuation?
- Should we remove numbers?
- Should we remove stopwords?
- Should we stem or lemmatize?

We could also use TextBlob to correct misspellings.
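
Before choosing among these options, it can help to see which kind of token dominates the out-of-vocabulary list. The sketch below takes a few tokens copied from the output above and separates purely alphabetic ones (new words, proper nouns, misspellings) from those containing digits or symbols, which removing punctuation or numbers would affect. It is only an illustration on eight tokens, not an analysis of the full list.

```python
# A few out-of-vocabulary tokens taken from the list shown above.
oov_tokens = ["quorans", "x^2", "brexit", "b+", "14500", ".net", "tution", "0.1m"]

# Purely alphabetic tokens: new words, proper nouns, or misspellings.
alphabetic = [t for t in oov_tokens if t.isalpha()]

# Tokens with digits or symbols: affected by punctuation or number removal.
non_alphabetic = [t for t in oov_tokens if not t.isalpha()]

print("alphabetic:", alphabetic)
print("non-alphabetic:", non_alphabetic)
```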

In [141]:
def get_vector_from(tokens):
    # Average the embeddings of the tokens that are covered by the vocabulary
    vectors = [word_to_vec_map[t] for t in tokens if t in word_to_vec_map]
    if not vectors:
        # No known token: return NaN so that the row can be dropped later
        return np.nan
    return np.mean(vectors, axis=0).astype("float64")
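
To illustrate what averaging does, here is a minimal sketch with hypothetical 3-dimensional embeddings (the real GloVe vectors used here have 50 dimensions). The helper `average_vector` is an illustrative name, not part of the notebook. It also shows why some rows end up as NaN and are dropped later: a question with no covered token has nothing to average.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "how": np.array([1.0, 0.0, 2.0]),
    "help": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, embeddings):
    # Keep only the tokens that have an embedding.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Nothing to average: return a NaN vector of the right dimension.
        dim = len(next(iter(embeddings.values())))
        return np.full(dim, np.nan)
    return np.mean(vectors, axis=0)

print(average_vector(["how", "can", "help"], toy_embeddings))  # mean of "how" and "help"
print(average_vector(["kvpy"], toy_embeddings))                # no covered token
```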

In [142]:
df_sample["vector"] = df_sample["tokens"].apply(get_vector_from)


In [46]:
df_sample.head()

Unnamed: 0,qid,question_text,target,tokens
879280,ac452e6e3f90075f2caa,Whether advances from customers are to be reinstated in the financial statements?,0,"[whether, advances, from, customers, are, to, be, reinstated, in, the, financial, statements, ?]"
43285,087846c595acf81fd460,How can you get help?,0,"[how, can, you, get, help, ?]"
740986,911f311c129684b0eb08,How does one succeed as a lecturer in medicine in a medical college in India?,0,"[how, does, one, succeed, as, a, lecturer, in, medicine, in, a, medical, college, in, india, ?]"
472594,5c8a3fa6a63e7e1b86c8,What is the purpose behind the Yellow wallpaper?,0,"[what, is, the, purpose, behind, the, yellow, wallpaper, ?]"
453814,58e70e16bd6a778b8e6e,What is the problem by applying under non spp programme for diploma?,0,"[what, is, the, problem, by, applying, under, non, spp, programme, for, diploma, ?]"


In [144]:
X = df_sample.vector.apply(lambda x : pd.Series(x))
X = X.set_index(df_sample.index)
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
879280,0.518603,-0.138013,0.128421,0.007646,0.12983,0.051235,-0.319765,0.065653,-0.228245,0.014102,0.079651,0.097121,-0.307854,-0.337448,0.41093,0.300265,-0.159765,-0.316708,-0.15767,-0.415883,0.349012,-0.043168,0.226705,-0.057428,-0.224042,-1.599572,-0.126453,-0.070226,-0.049743,-0.173345,3.187738,0.155372,-0.050522,-0.600988,0.055891,-0.252864,-0.067751,0.035685,-0.006926,-0.297121,-0.118901,-0.020867,0.263501,0.403998,-0.109543,-0.002275,-0.268847,0.359126,-0.039279,-0.157245
43285,0.506896,0.062074,0.487136,-0.383953,0.459732,-0.12036,-0.463464,0.090315,0.023969,0.21075,-0.138,0.749918,-0.157341,0.066331,0.769273,0.699592,0.54502,-0.023262,0.385648,-1.035638,-0.218696,0.29524,0.483744,0.283692,0.63719,-1.90828,-0.604292,-0.135703,0.849316,-1.023038,3.70082,0.84966,-0.675888,-0.418946,-0.068444,0.148059,-0.061538,0.169992,0.486096,-0.57181,-0.148134,0.122221,0.176637,0.583872,0.293376,0.118642,0.153615,-0.006049,-0.15621,0.600302
740986,0.073237,0.328231,-0.394382,-0.21712,0.284743,0.17198,-0.492919,-0.18479,0.099077,-0.107368,0.106792,0.002601,-0.18472,-0.21843,0.22387,0.167539,-0.066288,0.330302,-0.362422,0.159119,0.241175,0.34109,0.005401,0.072503,0.10357,-1.870527,-0.317339,-0.275153,-0.274063,0.009833,3.15022,-0.020586,-0.286908,-0.569024,0.272107,0.074661,0.002756,0.508513,0.406337,-0.064843,-0.238314,0.145366,-0.121155,0.177531,-0.044804,0.305474,-0.050946,0.003374,-0.030235,0.093014
472594,0.202483,0.270304,-0.26049,0.057956,0.579097,0.161558,-0.370708,-0.411702,-0.03207,-0.257589,0.055538,0.03484,-0.299448,-0.002856,0.137555,0.08845,0.061608,-0.03844,-0.234108,-0.573639,-0.20429,-0.138235,0.024123,-0.076034,0.043941,-1.394616,-0.579051,0.51427,0.340709,-0.440973,2.926147,-0.123082,-0.394819,-0.356277,-0.148388,0.112892,-0.015213,0.176257,-0.071229,-0.205176,0.086117,-0.094823,-0.122852,0.097114,0.042062,0.104768,0.069113,-0.339189,0.154179,-0.38243
453814,0.102285,0.161628,-0.262136,-0.174965,0.013831,0.279919,-0.116523,-0.531807,0.06933,-0.043853,0.377705,0.166937,-0.212991,-0.200352,0.403988,0.164092,0.11596,0.271367,0.011845,-0.173805,0.161439,-0.03459,0.160093,-0.065891,-0.016001,-1.284528,-0.154639,-0.124656,-0.176258,0.042331,2.986267,-0.009553,-0.49532,-0.381523,-0.082474,0.02568,0.22865,0.15908,0.055869,-0.039974,-0.099632,-0.177519,0.119006,0.142807,-0.14472,-0.113618,-0.104358,0.320439,0.010098,-0.027357


In [145]:
y = df_sample.target

In [146]:
df_new = pd.concat([X, y], axis=1)

In [147]:
df_new = df_new.dropna()

# ML model

In [148]:
y = df_new.target
X = df_new.drop("target", axis=1)

In [149]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [150]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train the model
lr = LogisticRegression()
lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [151]:
# Evaluate on the test set (precision, recall and F1 per class)
y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1901
           1       0.42      0.11      0.18        99

   micro avg       0.95      0.95      0.95      2000
   macro avg       0.69      0.55      0.57      2000
weighted avg       0.93      0.95      0.93      2000
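
The report shows a low recall on class 1, which is common when the positive class is rare. One option that might help, sketched below as an assumption rather than a result from this notebook, is to reweight the classes with `class_weight="balanced"`. The example uses synthetic data from `make_classification`, so its numbers say nothing about the Quora data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 6% positives), not the Quora data.
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=20, weights=[0.94, 0.06], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Fit the same model with and without class reweighting and compare.
results = {}
for name, class_weight in [("default", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[name] = {
        "recall": recall_score(y_te, y_hat),
        "f1": f1_score(y_te, y_hat),
    }
    print(name, results[name])
```

Reweighting usually trades precision for recall, so the F1 score on the positive class should be checked on held-out data before adopting it.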

