# Quora Insincere Questions Classification

## Context
Quora is a popular website where anyone can ask and/or answer questions. It receives more than 100 million unique visitors per month.

Like any other forum, Quora is facing a problem: toxic questions and comments.

As you can imagine, Quora's teams cannot check every question and answer by hand, so they asked the data science community to help them classify insincere questions automatically.

## Data
This challenge was launched on Kaggle: https://www.kaggle.com/c/quora-insincere-questions-classification.

Read the competition overview on Kaggle. Quora provided a dataset of labeled questions with the following columns:

+ qid: a unique identifier for each question, a hexadecimal number
+ question_text: the text of the question
+ target: 1 for an insincere question, 0 otherwise

In this competition, submissions are evaluated with the F1 score.
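
As a reminder of what this metric measures, here is a minimal sketch that computes the F1 score of the positive class (target = 1) from raw counts. The function name `f1_positive` and the toy labels are illustrative assumptions, not part of the competition code.

```python
def f1_positive(y_true, y_pred):
    """F1 score for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positive: precision and recall are 0 (or undefined), so F1 is 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
f1 = f1_positive(y_true, y_pred)
print(f1)
```

Because F1 ignores true negatives, a model that labels almost everything as 0 can reach high accuracy on this dataset and still score poorly.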

## EDA

In [3]:
import pandas as pd
import numpy as np
import os

#TODO: Read the training data in CSV file


(1306122, 3)

In [29]:
#TODO: Check data
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province as a nation in the 1960s?,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you encourage people to adopt and not shop?",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity affect space geometry?,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg hemispheres?,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain bike by just changing the tyres?,0


In [30]:
# TODO: Print the class distribution


Ratio of toxic question 6.187017751787352 %
Ratio of non toxic question 93.81298224821265 %
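
One possible way to obtain these ratios is `value_counts(normalize=True)` on the `target` column. The sketch below runs on a tiny made-up DataFrame (`df_toy` is a hypothetical stand-in for the real training data) so that it is self-contained.

```python
import pandas as pd

# Toy stand-in for the training DataFrame (hypothetical rows, same column names)
df_toy = pd.DataFrame({
    "question_text": ["q1", "q2", "q3", "q4", "q5"],
    "target": [0, 0, 1, 0, 0],
})

# normalize=True returns fractions instead of raw counts
ratios = df_toy["target"].value_counts(normalize=True)
toxic_ratio = ratios.get(1, 0.0)      # fraction of rows with target == 1
non_toxic_ratio = ratios.get(0, 0.0)  # fraction of rows with target == 0

print("Ratio of toxic question", toxic_ratio * 100, "%")
print("Ratio of non toxic question", non_toxic_ratio * 100, "%")
```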


In [31]:
pd.set_option('display.max_colwidth', 1000)
df[df.target==1].head(n=10)

Unnamed: 0,qid,question_text,target
22,0000e91571b60c2fb487,Has the United States become the largest dictatorship in the world?,1
30,00013ceca3f624b09f42,Which babies are more sweeter to their parents? Dark skin babies or light skin babies?,1
110,0004a7fcb2bf73076489,If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?,1
114,00052793eaa287aff1e1,"I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?",1
115,000537213b01fd77b58a,Which races have the smallest penis?,1
119,00056d45a1ce63856fc6,Why do females find penises ugly?,1
127,0005de07b07a17046e27,How do I marry an American woman for a Green Card? How much do they charge?,1
144,00068875d7c82a5bcf88,"Why do Europeans say they're the superior race, when in fact it took them over 2,000 years until mid 19th century to surpass China's largest economy?",1
156,0006ffd99a6599ff35b3,Did Julius Caesar bring a tyrannosaurus rex on his campaigns to frighten the Celts into submission?,1
167,00075f7061837807c69f,In what manner has Republican backing of 'states rights' been hypocritical and what ways have they actually restricted the ability of states to make their own laws?,1


The dataset is quite big, so let's work with a sample of 10,000 rows.

In [33]:
from sklearn.utils import resample

#TODO: sample 10000 questions


(10000, 3)


Unnamed: 0,qid,question_text,target
879280,ac452e6e3f90075f2caa,Whether advances from customers are to be reinstated in the financial statements?,0
43285,087846c595acf81fd460,How can you get help?,0
740986,911f311c129684b0eb08,How does one succeed as a lecturer in medicine in a medical college in India?,0
472594,5c8a3fa6a63e7e1b86c8,What is the purpose behind the Yellow wallpaper?,0
453814,58e70e16bd6a778b8e6e,What is the problem by applying under non spp programme for diploma?,0
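
A minimal sketch of the sampling step, assuming a plain random draw without replacement. It uses `DataFrame.sample` on a made-up frame (`df_toy` and `df_toy_sample` are hypothetical names); `sklearn.utils.resample` with `replace=False` would be an alternative.

```python
import pandas as pd

# Toy stand-in for the full dataset: 100 rows, 10% labeled 1
df_toy = pd.DataFrame({
    "qid": [f"id{i}" for i in range(100)],
    "question_text": [f"question {i}" for i in range(100)],
    "target": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# Draw rows without replacement; random_state makes the draw reproducible.
# An explicit .copy() gives an independent DataFrame, so adding columns later
# cannot raise SettingWithCopyWarning.
df_toy_sample = df_toy.sample(n=20, random_state=0).copy()
print(df_toy_sample.shape)
```

A purely random draw does not guarantee that the sample keeps the original class ratio, which is why it is worth checking the proportion afterwards.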


Check the proportion of toxic questions within our sample.

In [34]:
#TODO


Ratio of toxic question 5.71 %
Ratio of non toxic question 94.28999999999999 %


## Text Preprocessing

In [35]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", 
                       "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not",
                       "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", 
                       "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                       "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                       "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", 
                       "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", 
                       "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                       "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
                       "mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", 
                       "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                       "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", 
                       "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                       "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", 
                       "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would",
                       "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", 
                       "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", 
                       "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", 
                       "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", 
                       "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", 
                       "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", 
                       "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", 
                       "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                       "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", 
                       "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                       "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are",
                       "y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
                       "you'll've": "you will have", "you're": "you are", "you've": "you have" }

#TODO: normalize the text by using contraction mapping
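
One way to apply such a mapping is a single regular expression built from its keys. The sketch below is an assumption about how the normalization could be done, not the reference solution; `expand_contractions` and `contractions_excerpt` are hypothetical names, and only a small excerpt of the mapping is used.

```python
import re

# Small excerpt of the mapping above, for illustration only
contractions_excerpt = {
    "don't": "do not",
    "can't": "cannot",
    "what's": "what is",
    "i'm": "i am",
}

def expand_contractions(text, mapping):
    """Replace every contraction found in `mapping` with its expanded form."""
    # Normalize curly apostrophes so that "don’t" matches "don't"
    text = text.replace("’", "'")
    # Longest keys first, so that e.g. "i'd've" is tried before "i'd"
    keys = sorted(mapping, key=len, reverse=True)
    # Lookarounds instead of \b, because some keys start with an apostrophe
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)

expanded = expand_contractions(
    "what's wrong? i don’t know, i can't tell", contractions_excerpt
)
print(expanded)
```

This step has to run before `text_preprocess`, since that function replaces apostrophes with spaces and the contractions would no longer match.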


In [36]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_preprocess(text):
    text = text.lower()
    # After inspecting the out-of-vocabulary dict, we decided to apply the following replacements
    text = text.replace("-", " ").replace("/", " ").replace("\\", " ").replace("'", " ")
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha()]
    return tokens

#TODO: you can further remove stop words and use a lemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = stopwords.words("english")
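
The optional step could look like the sketch below. To keep it runnable without downloading NLTK corpora, the stop word list and the lemmatizer are passed in as parameters and replaced here by toy stand-ins; `remove_stopwords_and_lemmatize`, `toy_stop_words` and `toy_lemmatize` are hypothetical names.

```python
def remove_stopwords_and_lemmatize(tokens, stop_words, lemmatize):
    """Drop stop words, then map each remaining token to its lemma."""
    stop_set = set(stop_words)  # set membership is much faster than list membership
    return [lemmatize(t) for t in tokens if t not in stop_set]

# Toy stand-ins so the example runs without any NLTK data
toy_stop_words = ["how", "can", "you", "the", "is", "what"]
toy_lemmas = {"statements": "statement", "customers": "customer"}

def toy_lemmatize(token):
    # Look the token up in a tiny hand-written table, else return it unchanged
    return toy_lemmas.get(token, token)

cleaned = remove_stopwords_and_lemmatize(
    ["what", "is", "the", "purpose", "customers", "statements"],
    toy_stop_words,
    toy_lemmatize,
)
print(cleaned)
```

In the notebook, the same function could be called with `stop_words` and `lemmatizer.lemmatize` from the cell above, provided the `stopwords` and `wordnet` NLTK resources have been downloaded.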

In [37]:
df_sample["tokens"] = df_sample["question_text"].apply(lambda x: text_preprocess(x))
df_sample.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,qid,question_text,target,tokens
879280,ac452e6e3f90075f2caa,Whether advances from customers are to be reinstated in the financial statements?,0,"[whether, advances, from, customers, are, to, be, reinstated, in, the, financial, statements]"
43285,087846c595acf81fd460,How can you get help?,0,"[how, can, you, get, help]"
740986,911f311c129684b0eb08,How does one succeed as a lecturer in medicine in a medical college in India?,0,"[how, does, one, succeed, as, a, lecturer, in, medicine, in, a, medical, college, in, india]"
472594,5c8a3fa6a63e7e1b86c8,What is the purpose behind the Yellow wallpaper?,0,"[what, is, the, purpose, behind, the, yellow, wallpaper]"
453814,58e70e16bd6a778b8e6e,What is the problem by applying under non spp programme for diploma?,0,"[what, is, the, problem, by, applying, under, non, spp, programme, for, diploma]"


## Word Embeddings

Should we preprocess the text with our classic methods? Not necessarily. First, let's check what proportion of our document vocabulary is covered by the pretrained embeddings.

In [5]:
import numpy as np

# Read a pretrained GloVe file; return the list of words and a dict mapping each word to its embedding vector
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = []
        word_to_vec_map = {}
        bad = 0
        
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.append(curr_word)
            try :
                word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
            except ValueError:
                bad +=1
            
        print(f'There are {bad} bad lines')
    return words, word_to_vec_map

In [6]:
#TODO: Load the GloVe embeddings
glove_file = "...glove.6B.50d.txt"
words, word_to_vec_map = read_glove_vecs(glove_file)

There are 0 bad lines


In [40]:
import operator

#TODO: check whether any token is missing from the GloVe embeddings

def is_in_vocab(tokens_list):
    in_vocab = {}
    out_vocab = {}
    ...
    out_vocab_ordered = sorted(out_vocab.items(), key=operator.itemgetter(1))[::-1]
    return in_vocab, out_vocab_ordered
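
A possible shape for this check, shown as a separate sketch rather than as the body of `is_in_vocab`: count every token, splitting the counts by whether the token has an embedding. `vocab_coverage`, `toy_vocab` and `toy_tokens` are hypothetical names, and the embedding dictionary is replaced by a toy one.

```python
from collections import Counter

def vocab_coverage(token_lists, embedding_vocab):
    """Count in-vocabulary and out-of-vocabulary tokens across all questions."""
    in_vocab = Counter()
    out_vocab = Counter()
    for tokens in token_lists:
        for token in tokens:
            if token in embedding_vocab:
                in_vocab[token] += 1
            else:
                out_vocab[token] += 1
    # most_common() orders by descending count, similar to sorted(...)[::-1]
    return dict(in_vocab), out_vocab.most_common()

# Toy embedding vocabulary (values are irrelevant for the membership test)
toy_vocab = {"how": None, "can": None, "you": None, "get": None, "help": None}
toy_tokens = [["how", "can", "you", "get", "help"], ["quorans", "help", "quorans"]]

in_v, out_v = vocab_coverage(toy_tokens, toy_vocab)
print(in_v)
print(out_v)
```

Note that the ratios computed in the next cell are over distinct words, not over token occurrences, so a rare misspelling weighs as much as a frequent word.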

In [41]:
text = np.array(df_sample.tokens)
in_vocab, out_vocab = is_in_vocab(text)

in_vocab_ratio = len(in_vocab.keys())/(len(in_vocab.keys()) + len(out_vocab))
out_vocab_ratio = len(out_vocab)/(len(in_vocab.keys()) + len(out_vocab))
                                                             
print("proportion of words in word embedding vocab: ", in_vocab_ratio*100, "%")
print("proportion of words not in word embedding vocab: ", out_vocab_ratio*100, "%")

proportion of words in word embedding vocab:  93.68450082735798 %
proportion of words not in word embedding vocab:  6.3154991726420295 %


In [42]:
len(out_vocab)

916

In [43]:
out_vocab

[('quorans', 9),
 ('cryptocurrencies', 7),
 ('kvpy', 7),
 ('wbjee', 5),
 ('cryptocurrency', 5),
 ('brexit', 5),
 ('bitsat', 5),
 ('blockchain', 4),
 ('upvotes', 4),
 ('afsb', 3),
 ('infty', 3),
 ('kyc', 3),
 ('chsl', 3),
 ('async', 3),
 ('shopify', 3),
 ('viteee', 3),
 ('chitkara', 3),
 ('infp', 3),
 ('ethereum', 3),
 ('bnbr', 3),
 ('articleship', 3),
 ('flipkart', 3),
 ('josaa', 3),
 ('afcat', 3),
 ('offcampus', 2),
 ('igdtuw', 2),
 ('admirial', 2),
 ('tnpsc', 2),
 ('narcassist', 2),
 ('odsp', 2),
 ('rightarrow', 2),
 ('iiest', 2),
 ('covarient', 2),
 ('marksheet', 2),
 ('apist', 2),
 ('tution', 2),
 ('homolysis', 2),
 ('sqrt', 2),
 ('mblog', 2),
 ('aadhaar', 2),
 ('hairfall', 2),
 ('aemon', 2),
 ('katachi', 2),
 ('cibil', 2),
 ('redmi', 2),
 ('oneplus', 2),
 ('litecoin', 2),
 ('aadhar', 2),
 ('vajiram', 2),
 ('starflower', 2),
 ('udemy', 2),
 ('upvote', 2),
 ('ntse', 2),
 ('hotstar', 2),
 ('kaggle', 2),
 ('mht', 2),
 ('paytm', 2),
 ('masterbate', 2),
 ('demonetisation', 2),
 ... (output truncated)

(OPTIONAL)
How to improve this rate:
* Should we remove punctuation?
* Should we remove numbers?
* Should we remove stopwords?
* Should we stem or lemmatize?

We could also use TextBlob to correct misspellings.

In [44]:
#Compute the embedding for each question text from word embeddings
def get_vector_from(tokens):
    word_vect = np.array(...)
    try:
        word_vect = ....astype("float64")
    except:
        print("Can not convert tokens into vector")
    return word_vect
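
A common way to turn word embeddings into a question embedding is to average the vectors of the known tokens. The sketch below assumes that approach; `average_embedding` and `toy_embeddings` are hypothetical names, and returning NaNs for questions without any known token is an assumption chosen so that such rows can be dropped afterwards.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    """Mean of the embedding vectors of the tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: return NaNs so the row can be dropped later
        return np.full(dim, np.nan)
    return np.mean(vectors, axis=0).astype("float64")

# Toy 2-dimensional embeddings (the real GloVe vectors here have 50 dimensions)
toy_embeddings = {
    "how": np.array([1.0, 0.0]),
    "help": np.array([0.0, 1.0]),
}

vec = average_embedding(["how", "quorans", "help"], toy_embeddings, dim=2)
empty = average_embedding(["quorans"], toy_embeddings, dim=2)
print(vec, empty)
```

Averaging discards word order, so two questions with the same words in a different order get the same vector.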

In [45]:
df_sample["vector"] = df_sample["tokens"].apply(lambda x: get_vector_from(x))
df_sample.head()


Unnamed: 0,qid,question_text,target,tokens,vector
879280,ac452e6e3f90075f2caa,Whether advances from customers are to be reinstated in the financial statements?,0,"[whether, advances, from, customers, are, to, be, reinstated, in, the, financial, statements]","[0.5186033333333332, -0.13801260833333331, 0.12842116666666664, 0.007646316666666669, 0.1298301666666667, 0.05123508333333334, -0.3197645, 0.06565250000000002, -0.2282449358333333, 0.014101916666666672, 0.07965062499999999, 0.09712083333333332, -0.30785416666666665, -0.3374477, 0.41093004166666663, 0.30026533333333333, -0.15976524999999997, -0.3167084166666667, -0.15766999999999998, -0.4158830833333333, 0.3490121666666666, -0.043168416666666674, 0.22670500000000002, -0.057427666666666655, -0.22404183333333336, -1.5995716666666666, -0.1264535, -0.07022583333333333, -0.049743249999999996, -0.17334549999999996, 3.1877383333333333, 0.15537208333333333, -0.05052241666666666, -0.6009883333333333, 0.0558911775, -0.25286434166666666, -0.06775129166666666, 0.03568535000000003, -0.006925999999999988, -0.2971213333333333, -0.11890124999999997, -0.020867016666666672, 0.2635008333333333, 0.40399775, -0.10954252499999999, -0.0022747916666666645, -0.2688467583333334, 0.35912608333333335, -0.03927..."
43285,087846c595acf81fd460,How can you get help?,0,"[how, can, you, get, help]","[0.50689562, 0.06207399999999998, 0.487136, -0.383953, 0.459732, -0.12035970199999999, -0.4634638, 0.09031544, 0.023969200000000003, 0.21075, -0.13800000000000004, 0.749918, -0.15734112000000003, 0.0663314, 0.769273, 0.699592, 0.5450200000000001, -0.023261999999999994, 0.385648, -1.035638, -0.21869600000000006, 0.29524, 0.483744, 0.283692, 0.63719, -1.90828, -0.6042919999999999, -0.13570267600000002, 0.849316, -1.0230380000000001, 3.70082, 0.8496599999999999, -0.6758879999999999, -0.41894600000000004, -0.06844380000000001, 0.1480588, -0.061538, 0.1699916, 0.48609600000000003, -0.5718099999999999, -0.148134, 0.12222060000000001, 0.17663679999999998, 0.583872, 0.29337579999999996, 0.11864199999999996, 0.1536154, -0.006048999999999999, -0.1562098, 0.6003020000000001]"
740986,911f311c129684b0eb08,How does one succeed as a lecturer in medicine in a medical college in India?,0,"[how, does, one, succeed, as, a, lecturer, in, medicine, in, a, medical, college, in, india]","[0.07323666666666664, 0.3282313333333333, -0.3943822, -0.21711979999999997, 0.2847428866666667, 0.17198027266666666, -0.4929190666666667, -0.18478980000000003, 0.09907726666666665, -0.10736793333333333, 0.10679233333333332, 0.0026013333333332897, -0.18472040000000006, -0.2184302, 0.22387013333333336, 0.16753906666666668, -0.06628766000000001, 0.33030198, -0.3624220666666666, 0.15911866666666666, 0.24117533333333332, 0.34109039999999996, 0.005400866666666667, 0.07250313333333333, 0.10356999999999998, -1.870526666666667, -0.31733919999999993, -0.275153092, -0.2740633333333334, 0.009833419999999992, 3.1502200000000005, -0.020586466666666674, -0.2869076666666667, -0.5690243999999999, 0.27210666666666666, 0.074661, 0.00275588666666667, 0.5085130666666668, 0.4063372, -0.0648432, -0.23831393333333334, 0.14536638666666668, -0.12115526666666664, 0.1775306666666667, -0.04480393333333333, 0.3054738, -0.05094600000000002, 0.0033737333333333287, -0.030235133333333334, 0.09301366666666669]"
472594,5c8a3fa6a63e7e1b86c8,What is the purpose behind the Yellow wallpaper?,0,"[what, is, the, purpose, behind, the, yellow, wallpaper]","[0.20248250000000004, 0.270303875, -0.26049, 0.05795637499999999, 0.5790974999999999, 0.16155825000000001, -0.37070826250000005, -0.41170225000000005, -0.0320700575, -0.257588875, 0.05553784, 0.03484000000000001, -0.29944774999999996, -0.002855874999999994, 0.137555125, 0.08844950000000001, 0.06160750000000002, -0.03844000000000003, -0.23410750000000002, -0.57363875, -0.20429025, -0.138234625, 0.024123250000000006, -0.07603412500000001, 0.04394124999999997, -1.3946162500000003, -0.5790506250000002, 0.5142702499999999, 0.3407095, -0.440973, 2.9261475, -0.12308175, -0.39481900000000003, -0.35627749999999997, -0.1483877175, 0.11289235, -0.015213125000000001, 0.17625675000000002, -0.07122885, -0.20517575, 0.086117125, -0.0948227125, -0.12285238750000002, 0.09711437499999998, 0.042062249999999995, 0.10476750000000001, 0.06911347500000001, -0.33918875, 0.15417925, -0.3824299875]"
453814,58e70e16bd6a778b8e6e,What is the problem by applying under non spp programme for diploma?,0,"[what, is, the, problem, by, applying, under, non, spp, programme, for, diploma]","[0.10228499999999996, 0.1616280833333333, -0.2621360833333333, -0.17496466666666666, 0.013830833333333334, 0.27991916666666666, -0.11652300833333334, -0.5318070833333334, 0.06933048083333336, -0.043853333333333334, 0.3777045833333334, 0.16693748333333336, -0.21299074999999998, -0.20035249999999996, 0.40398829166666667, 0.16409150000000003, 0.11596033333333333, 0.27136699999999997, 0.011845000000000017, -0.17380533333333334, 0.16143941666666667, -0.03459016666666666, 0.16009291666666667, -0.06589066666666667, -0.016000583333333315, -1.2845283333333333, -0.15463863333333336, -0.12465575, -0.17625783333333334, 0.0423305, 2.986266666666667, -0.009552916666666684, -0.4953204166666667, -0.3815233333333332, -0.08247433916666663, 0.025679699999999986, 0.22865049999999998, 0.15907983333333334, 0.05586916666666667, -0.03997416666666668, -0.099632, -0.17751914166666669, 0.11900566666666668, 0.14280725000000002, -0.14472025, -0.11361800000000001, -0.10435825833333333, 0.32043916666666666, 0.01..."


In [46]:
X = df_sample.vector.apply(lambda x : pd.Series(x))
X = X.set_index(df_sample.index)
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
879280,0.518603,-0.138013,0.128421,0.007646,0.12983,0.051235,-0.319765,0.065653,-0.228245,0.014102,...,-0.118901,-0.020867,0.263501,0.403998,-0.109543,-0.002275,-0.268847,0.359126,-0.039279,-0.157245
43285,0.506896,0.062074,0.487136,-0.383953,0.459732,-0.12036,-0.463464,0.090315,0.023969,0.21075,...,-0.148134,0.122221,0.176637,0.583872,0.293376,0.118642,0.153615,-0.006049,-0.15621,0.600302
740986,0.073237,0.328231,-0.394382,-0.21712,0.284743,0.17198,-0.492919,-0.18479,0.099077,-0.107368,...,-0.238314,0.145366,-0.121155,0.177531,-0.044804,0.305474,-0.050946,0.003374,-0.030235,0.093014
472594,0.202483,0.270304,-0.26049,0.057956,0.579097,0.161558,-0.370708,-0.411702,-0.03207,-0.257589,...,0.086117,-0.094823,-0.122852,0.097114,0.042062,0.104768,0.069113,-0.339189,0.154179,-0.38243
453814,0.102285,0.161628,-0.262136,-0.174965,0.013831,0.279919,-0.116523,-0.531807,0.06933,-0.043853,...,-0.099632,-0.177519,0.119006,0.142807,-0.14472,-0.113618,-0.104358,0.320439,0.010098,-0.027357


In [48]:
y = df_sample.target
y.head()

879280    0
43285     0
740986    0
472594    0
453814    0
Name: target, dtype: int64

In [49]:
df_new = pd.concat([X, y], axis=1)
df_new = df_new.dropna()

## ML model

In [51]:
y = df_new.target
X = df_new.drop("target", axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# TODO: Train the model


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
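
A minimal sketch of the training and evaluation steps, run on synthetic features so that it is self-contained. The data (`X_toy`, `y_toy`) is randomly generated and unrelated to the Quora questions; `stratify` and `max_iter=1000` are additions that the cells above do not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question vectors (not the real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 5))
# Imbalanced binary target driven by the first feature plus noise
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

# stratify keeps the class ratio identical in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

y_hat = model.predict(X_te)
# zero_division=0 avoids a warning if one class is never predicted
print(classification_report(y_te, y_hat, zero_division=0))
```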

In [53]:
# TODO Estimate the accuracy


              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1885
           1       0.46      0.10      0.17       115

    accuracy                           0.94      2000
   macro avg       0.70      0.55      0.57      2000
weighted avg       0.92      0.94      0.92      2000



## Can you explain why the performance for class 1 (insincere questions) is poor? Can we improve it?
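
One likely factor is class imbalance: only about 6% of the questions are labeled 1, so a classifier that predicts 1 only above the default 0.5 probability threshold tends to miss most of them. A possible remedy, sketched below under that assumption, is to pick the decision threshold that maximizes F1 on held-out data. The helper names and the toy probabilities are made up for illustration; passing `class_weight="balanced"` to `LogisticRegression` is another option worth trying.

```python
import numpy as np

def f1_at_threshold(y_true, proba, threshold):
    """F1 for class 1 when predicting 1 whenever proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, proba, candidates):
    """Return the candidate threshold with the highest F1, and that F1."""
    scores = [f1_at_threshold(y_true, proba, t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy validation labels and predicted probabilities of class 1 (made up)
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
p_val = np.array([0.05, 0.12, 0.22, 0.15, 0.31, 0.45, 0.36, 0.41, 0.02, 0.70])

candidates = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
threshold, f1 = best_threshold(y_val, p_val, candidates)
f1_default = f1_at_threshold(y_val, p_val, 0.5)
print(threshold, f1, f1_default)
```

With a real model, `p_val` would come from `model.predict_proba(X_val)[:, 1]`, and the threshold should be chosen on a validation split rather than on the test set.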