# Final Project Template

This workbook provides the template for the final project. 

## Instructions
- Work individually or in pairs
- Each team is to complete 1 copy of this template.
  - Complete all sections.
  - Feel free to include supporting material / slides / documents as needed.
- At the end of the project, you will get 15 minutes to present this workbook to the class.

### Submission Instructions
- Submit the .ipynb with the Output cells showing the results
  - Naming convention:
  ```
      <name1>-<name2>-<project_short_name>.ipynb
  ```
- If you provide your own datasets, include the data with your .ipynb, unless it is confidential

## Section 0: Team Members
- Member : Ting Dai


## Section 1: Project Title
Quora Insincere Questions Classification


## Section 2: Project Definition

### Goals

The goal of this project is to predict whether a question asked on Quora is sincere or not.

### Dataset

Data source: Kaggle competition:

Quora Insincere Questions Classification - Detect toxic content to improve online conversations
https://www.kaggle.com/c/quora-insincere-questions-classification#Kernels-FAQ

There are two files: a training dataset and a test dataset. Three columns are included: a unique question ID (`qid`), the question text (`question_text`), and the label (`target`). A target of 1 marks an insincere question; a target of 0 marks a sincere question.


### Tasks

1. Resample the imbalanced data
2. Process the text
3. Train classic machine learning models
4. Try a neural network
5. Evaluate with accuracy, a confusion matrix, and related metrics

## Section 3: Data Engineering


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('train1.csv',nrows = 400000)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 3 columns):
qid              400000 non-null object
question_text    400000 non-null object
target           400000 non-null int64
dtypes: int64(1), object(2)
memory usage: 9.2+ MB


In [3]:
# Count sincere (0) and insincere (1) questions
df['target'].value_counts()

0    375182
1     24818
Name: target, dtype: int64
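
The counts above can also be read as class shares. The following is a minimal sketch that reuses the two counts from the output instead of loading `train1.csv`; on the real frame, `df['target'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 375182, 1: 24818}, name="target")

# Share of each class among the 400,000 loaded rows.
share = counts / counts.sum()
print(share.round(4))
```

Roughly 6% of the loaded questions are labeled insincere, which is why the next cell rebalances the training data.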

In [4]:
# The dataset is imbalanced. Downsample the rows with target value 0.

from sklearn.model_selection import train_test_split
df_ones_train, df_ones_test = train_test_split(df.loc[df['target'] == 1], test_size=.2, random_state=8)
df_zeros_train, df_zeros_test = train_test_split(df.loc[df['target'] == 0], test_size=.2, random_state=8)

df_zeros_train_downsample = df_zeros_train.sample(n=32000, random_state=8, replace=True)

df_balanced = pd.concat([df_zeros_train_downsample, df_ones_train])

df_balanced['target'].value_counts()

pd.options.display.max_colwidth = 100

df_zeros_train.head(10)

Unnamed: 0,qid,question_text,target
272388,35520bc99015e0818fbd,How do I start a online boutique in India?,0
349133,446c498e6b18936f7cc5,What is the mening of alif?,0
167976,20d355aa43079de5e21c,What's the best lyric/line on the album The Live Anthology by Tom Petty & The Heartbreakers?,0
354355,4573b572b472c4b2aa8c,What does calcium do for your muscles?,0
163715,20039b923eb9af8e5554,Which urban myth did you believe was true?,0
226239,2c3ce52018b2877bad6f,Is it possible to be a full time student and work at the same time?,0
235364,2e0b0e7b1b6384564ce2,How do I get a free certificate on Coursera?,0
248277,3092d2271016240c0492,Is pierced ear affect my engineering career in any way?,0
232101,2d66555e464153938556,Do you think that the new Bahubali movie will not be able to match up to the expections we have?,0
95002,1298072896fbac4db993,How should a 17 year old spend their free time?,0
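
Note that In [4] draws the majority sample with `replace=True`, so the same question can be picked more than once. As a hedged alternative, the sketch below downsamples without replacement. The toy frame and its sizes are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Synthetic stand-in for the training split: 90 sincere rows, 10 insincere rows.
toy = pd.DataFrame({
    "question_text": [f"question {i}" for i in range(100)],
    "target": [0] * 90 + [1] * 10,
})

majority = toy[toy["target"] == 0]
minority = toy[toy["target"] == 1]

# replace=False draws each majority row at most once, so the sample has no duplicates.
majority_down = majority.sample(n=20, random_state=8, replace=False)

toy_balanced = pd.concat([majority_down, minority])
print(toy_balanced["target"].value_counts())
```

With `replace=False`, `n` must not exceed the number of majority rows, which holds for the 32,000 rows sampled in this notebook.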


In [5]:
df_ones_train.head(10)

Unnamed: 0,qid,question_text,target
245960,301aad706dec2e9c73ed,Why are many non-Muslim people so prejudiced against Turks?,1
244917,2fe71fb0fc88de3ce0b6,"Why do Muslims always remove ""O children of Israel"" From sura 5:32, is it part of deceptions fro...",1
278031,366aa955c3dc7856ea51,"If Muslims can oppress non Muslim minorities in Islamic countries, then what is wrong in Myanmar...",1
319570,3ea1c117a892318d6f9e,Hillary Clinton’s public complaints show that she still can’t get over her loss to Donald Trump....,1
298494,3a739851f11bf8851266,Why is Trump such an asshole?,1
130125,1974a54786e72d047334,What is the psychological condition that causes bestiality?,1
352234,4509b35b2df7a97df49e,What is your take on why so many Filipinos lack basic respect and the ability to follow guidelin...,1
171969,21a06469a292d804308f,Wouldn't landmines be a cheaper and more effective solution than a giant wall between the US and...,1
19541,03d3c9d692c368d7f6bc,Is Justin Trudeau starting to look like a grovelling suckhole on the world stage?,1
397209,4dd0c4f65b9eae5e184a,Why do Tamils think they are better than most Indians?,1


In [6]:
# Unique ID is not a valid feature, drop it
df_balanced.drop(['qid'],axis=1).head()

Unnamed: 0,question_text,target
190973,"Does my cat know that by feeding, grooming, cleaning his box that I am his caretaker?",0
327097,Why will Manchester City win over Stoke City?,0
54493,"If you could change a single negative trait about your nation, what would it be?",0
357837,What happens to magma as it rises through the Earth's mantle?,0
104803,Do you think Quora should put a word limit on answers?,0


In [7]:
df_test = pd.concat([df_zeros_test, df_ones_test])
df_test.drop(['qid'],axis=1).head()

Unnamed: 0,question_text,target
274034,How did the Middle East Crisis started? I am very curious.,0
262748,How do I talk my wife into letting me wear mini skirts?,0
187315,What influenced Allen Hoskins to start acting at an early age?,0
373193,What will happen if I don't get hemmorroids treated?,0
86542,Why is the GDP of the US so high?,0


## Section 4: Text Processing

In [8]:
# Series.as_matrix() is deprecated in pandas; .values returns the same NumPy array
Text_train_ML = df_balanced['question_text'].values
Target_train_ML = df_balanced['target'].values

Text_test_ML = df_test['question_text'].values
Target_test_ML = df_test['target'].values


In [9]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

tokens=[]

for i in range(len(Text_train_ML)):
    tokens.append(word_tokenize(Text_train_ML[i]))

wnl = WordNetLemmatizer()

Lemmatized_word = []
for i in range(len(Text_train_ML)):
    Lemmatized_word.append([wnl.lemmatize(token).lower() for token in tokens[i]])        

In [10]:
# Remove stop words and punctuation

import string
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

Remove_stop_word = []
for i in range(len(Text_train_ML)):
    Remove_stop_word.append([token for token in Lemmatized_word[i] if (not (token in stop or token in string.punctuation))])  


In [11]:
Remove_stop_word[3]

['happens', 'magma', 'rise', 'earth', "'s", 'mantle']

In [12]:
#Vectorization 
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
Corrected_list = [" ".join(question) for question in Remove_stop_word]
result = tfidf.fit_transform(Corrected_list)

# Note: scikit-learn 1.0 and later provide get_feature_names_out() for this
col_name = tfidf.get_feature_names()
temp = pd.DataFrame(result.todense(), columns=col_name)
temp

Unnamed: 0,00,000,0001,000w,000webhost,001,002200,01,010062,02,...,へも,傳送,油腻,王晓菲,短信,簡訊,红宝书,话不可以乱讲,邋遢,饭可以乱吃
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
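
The preview above is almost entirely zeros, which is typical for TF-IDF features. The sketch below, on a made-up three-question corpus, shows that `fit_transform` already returns a SciPy sparse matrix and that a random forest can be fit on it directly. This is an alternative to the dense `temp` frame, not what the notebook does; the names `tfidf_demo`, `X_sparse`, and `rf_demo` are hypothetical.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and labels, made up for illustration only.
corpus = [
    "happens magma rise earth mantle",
    "manchester city win stoke city",
    "quora put word limit answer",
]
labels = [0, 0, 1]

tfidf_demo = TfidfVectorizer()
X_sparse = tfidf_demo.fit_transform(corpus)  # SciPy sparse matrix, one row per question

# Fraction of cells that hold a non-zero weight.
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print(sparse.issparse(X_sparse), X_sparse.shape, round(density, 3))

# Tree ensembles in scikit-learn accept sparse input directly.
rf_demo = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
rf_demo.fit(X_sparse, labels)
print(rf_demo.predict(X_sparse).shape)
```

Keeping the matrix sparse avoids allocating one dense cell per (question, term) pair, which matters with tens of thousands of features.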


In [13]:
tokens_test=[]

for i in range(len(Text_test_ML)):
    tokens_test.append(word_tokenize(Text_test_ML[i]))

Lemmatized_word_test = []
for i in range(len(Text_test_ML)):
    Lemmatized_word_test.append([wnl.lemmatize(token).lower() for token in tokens_test[i]])         
    
Remove_stop_word_test = []
for i in range(len(Text_test_ML)):
    Remove_stop_word_test.append([token for token in Lemmatized_word_test[i] if (not (token in stop or token in string.punctuation))])  

Corrected_list_test = [" ".join(question) for question in Remove_stop_word_test]
    
Z_test=tfidf.transform(Corrected_list_test)
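
The test-set cell repeats the training-set steps from In [9] and In [10]. One way to keep the two in sync is a single helper, sketched below. The function name `preprocess` and its parameters are hypothetical, and the demo call uses simple stand-ins so it runs without NLTK data; in this notebook the arguments would be `word_tokenize`, `wnl.lemmatize`, and `stop`.

```python
import string

def preprocess(texts, tokenize, lemmatize, stop_words):
    """Tokenize, lemmatize, lowercase, then drop stop words and punctuation."""
    cleaned = []
    for text in texts:
        tokens = [lemmatize(tok).lower() for tok in tokenize(text)]
        kept = [tok for tok in tokens
                if tok not in stop_words and tok not in string.punctuation]
        cleaned.append(" ".join(kept))
    return cleaned

# Stand-ins for illustration: whitespace tokenizer, identity lemmatizer, tiny stop list.
demo = preprocess(
    ["What does calcium do for your muscles ?"],
    tokenize=str.split,
    lemmatize=lambda tok: tok,
    stop_words={"what", "does", "do", "for", "your"},
)
print(demo)
```

The helper returns space-joined strings, the same shape as `Corrected_list` and `Corrected_list_test`, so its result could be passed to `tfidf.fit_transform` or `tfidf.transform`.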


## Section 4: Random forest

In [14]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

In [15]:
rf = RandomForestClassifier(n_estimators=10, n_jobs=2, max_depth=7)

In [16]:
rf.fit(temp, Target_train_ML)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=7, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [17]:
rf.estimators_

[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1136075534, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=357855817, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
                        max_features='auto', max_leaf_nodes=None,
                        min

In [18]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred_rf = rf.predict(Z_test)
print(classification_report(Target_test_ML, y_pred_rf))
print(confusion_matrix(Target_test_ML, y_pred_rf))
print(rf.score(Z_test, Target_test_ML))
print(rf.score(temp, Target_train_ML))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     75037
           1       0.63      0.02      0.05      4964

    accuracy                           0.94     80001
   macro avg       0.79      0.51      0.51     80001
weighted avg       0.92      0.94      0.91     80001

[[74967    70]
 [ 4844   120]]
0.9385757678029024
0.6261426312338488


In [19]:
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index = temp.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances.head(100)

Unnamed: 0,importance
jewish,0.061535
people,0.058609
men,0.057214
indians,0.057004
quora,0.043017
racist,0.040617
hate,0.033002
conservative,0.025310
british,0.023122
christians,0.022692


In [20]:
feature_importances.shape

(32975, 1)

In [21]:
def plot_tree(tree, columns, filename):
    from sklearn.tree import export_graphviz
    import graphviz

    export_graphviz(tree,
                    out_file=filename,
                    feature_names=columns,
                    filled=True,
                    rounded=True)
    source = graphviz.Source.from_file(filename)
    source.render(view=True)

plot_tree(rf.estimators_[0], temp.columns, 'tree1.dot')
plot_tree(rf.estimators_[1], temp.columns, 'tree2.dot')
plot_tree(rf.estimators_[2], temp.columns, 'tree3.dot')

## Section 5: Deep learning with word embedding

In [None]:
# Load pretrained GloVe embeddings
import numpy as np

embeddings_index = dict()
# Each line holds a word followed by its vector components
with open('glove.6B.200d.txt', encoding='UTF-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

vector_size = 200

In [None]:
embeddings_index['people']

In [None]:
X_train_DL = df_balanced['question_text']
y_train_DL = df_balanced['target']

X_test_DL = df_test['question_text']
y_test_DL = df_test['target']

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_DL)
X_train_DL = tokenizer.texts_to_sequences(X_train_DL)
X_test_DL = tokenizer.texts_to_sequences(X_test_DL)

In [None]:
question_length = [len(question) for question in X_train_DL]
Max_len = max(question_length)
Mean_len = sum(question_length)/len(question_length)
Max_len,Mean_len

In [None]:
from keras.preprocessing.sequence import pad_sequences
Max_length = 50
X_train_DL = pad_sequences(X_train_DL, maxlen=Max_length)
X_test_DL = pad_sequences(X_test_DL,maxlen=Max_length)
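
With its default settings, `pad_sequences` pads short sequences on the left with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The NumPy function below is a hypothetical stand-in that mimics those defaults on two toy sequences so the effect is visible without Keras installed.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with 0 and keep the last `maxlen` tokens of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:  # leave empty sequences as all zeros
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

With `Max_length = 50`, any question longer than 50 tokens therefore loses its opening words rather than its ending.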

In [None]:
vocab_size = len(tokenizer.word_index) +1
embedding_matrix = np.zeros((vocab_size, vector_size))

In [None]:
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
embedding_matrix.shape
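
Words that are missing from GloVe keep an all-zero row in `embedding_matrix`. The sketch below repeats the same loop on hypothetical toy dictionaries (stand-ins for `embeddings_index` and `tokenizer.word_index`) and also reports how much of the vocabulary is covered.

```python
import numpy as np

vector_size = 4  # toy dimension; the notebook uses 200

# Hypothetical stand-ins for embeddings_index and tokenizer.word_index.
toy_embeddings = {
    "people": np.ones(vector_size, dtype="float32"),
    "quora": np.full(vector_size, 2.0, dtype="float32"),
}
toy_word_index = {"people": 1, "quora": 2, "bahubali": 3}

# Row 0 is reserved for padding, hence the +1.
matrix = np.zeros((len(toy_word_index) + 1, vector_size))
missing = []
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = vec
    else:
        missing.append(word)  # row stays all zeros

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage={coverage:.2f}, missing={missing}")
```

A low coverage figure on the real vocabulary would suggest that many question words reach the LSTM as zero vectors.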

In [None]:
from keras.models import Sequential
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import LSTM 
from keras.layers import Flatten 
from keras.layers import Bidirectional
from keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

tensorboard = TensorBoard(log_dir='./logs/quora_questions_balanced_data')

earlystop = EarlyStopping(patience=3)

checkpoint = ModelCheckpoint('weights.{epoch:02d}-{loss:.2f}.hdf5')

In [None]:
model = Sequential()
e = Embedding(vocab_size, vector_size, weights=[embedding_matrix], input_length=Max_length, trainable=False)
model.add(e)
model.add(Bidirectional(LSTM(100,input_shape=(Max_length,vector_size))))
model.add(Dense(50, activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())
model.fit(X_train_DL, y_train_DL,epochs=20, verbose=1,callbacks=[tensorboard, earlystop,checkpoint],validation_split=.2)
# Loss and accuracy on the training data, not on the held-out test set
loss, accuracy = model.evaluate(X_train_DL, y_train_DL, verbose=1)

In [None]:
loss,accuracy

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
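# predict_classes is not available in newer Keras releases; thresholding the sigmoid output at 0.5 is equivalent
y_pred_DL = (model.predict(X_test_DL) > 0.5).astype('int32')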
print(confusion_matrix(y_test_DL, y_pred_DL))
print(classification_report(y_test_DL, y_pred_DL))

In [None]:
# Collect positions of false positives: predicted insincere (1) but actually sincere (0)
actual_class = np.asarray(y_test_DL)
wrong_question = []
for i in range(len(y_pred_DL)):
    if (y_pred_DL[i] != actual_class[i]) & (y_pred_DL[i] == 1):
        wrong_question.append(i)

In [None]:
df_test.iloc[wrong_question]
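
The same error lookup can be written with boolean masks, which also makes it easy to list false negatives. This is a sketch on made-up labels; `y_true` and `y_pred` are hypothetical stand-ins for `y_test_DL` and `y_pred_DL`.

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 0])
# Column vector, matching the (n, 1) shape of the model's class predictions.
y_pred = np.array([[1], [1], [0], [0], [1]])

y_pred_flat = y_pred.ravel()

# Predicted insincere but actually sincere.
false_pos = np.flatnonzero((y_pred_flat == 1) & (y_true == 0))
# Predicted sincere but actually insincere.
false_neg = np.flatnonzero((y_pred_flat == 0) & (y_true == 1))
print(false_pos, false_neg)
```

Either index array can be passed to `df_test.iloc[...]` to read the misclassified questions.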