# Project on Detecting Insults

# Goal of the Project: To detect whether the online comment is an insult or not using machine learning.

Have used Linear and Non-linear classifiers to test and have finalized the best classifier suited for the data.

In [183]:
# Import Numpy package
import numpy as np
# Import Pandas package
import pandas as pd
#Importing Scikit-learn to run the models
import sklearn
#To split data, import train_test_split and ShuffleSplit
from sklearn.cross_validation import train_test_split, ShuffleSplit
#For feature selection, import sklearn.feature_extraction.text
import sklearn.feature_extraction.text as text

In [184]:
#For decision tree
from sklearn.tree import DecisionTreeClassifier
#For Naive Bayes
import sklearn.naive_bayes as nb
#For Logistic Regression
from sklearn.linear_model import LogisticRegression
#For random forest 
from sklearn.ensemble import RandomForestClassifier
#For Support Vector machine
from sklearn import svm

# 1. Data Discovery and Exploration

The test and train data to detect insults is downloaded from Kaggle. It contains 3947 observations, consisting of columns comments, date and its actual label. A label of 1 means its an insulting post, while a label of 0 represents not an insulting post. For Example, If the comment is  "You fuck your dad." then Label would be 1; if the comment is “Yeah and where are you now?”, then Label would be: 0

#### Read Data File

In [185]:
# Opening the training file using pandas
data = pd.read_csv("Data/train.csv")

#### Exploring data

In [186]:
#Check the dataframe shape
data.shape

(3947, 3)

In [187]:
# Inspect data
data.tail(5)

Unnamed: 0,Insult,Date,Comment
3942,1,20120502172717Z,"""you are both morons and that is never happening"""
3943,0,20120528164814Z,"""Many toolbars include spell check, like Yahoo..."
3944,0,20120620142813Z,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,20120528205648Z,"""How about Felix? He is sure turning into one ..."
3946,0,20120515200734Z,"""You're all upset, defending this hipster band..."


In [188]:
# Check the top 10 rows of the dataset
data.head(13)

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
1,0,20120528192215Z,"""i really don't understand your point.\xa0 It ..."
2,0,,"""A\\xc2\\xa0majority of Canadians can and has ..."
3,0,,"""listen if you dont wanna get married to a man..."
4,0,20120619094753Z,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd..."
5,0,20120620171226Z,"""@SDL OK, but I would hope they'd sign him to ..."
6,0,20120503012628Z,"""Yeah and where are you now?"""
7,1,,"""shut the fuck up. you and the rest of your fa..."
8,1,20120502173553Z,"""Either you are fake or extremely stupid...may..."
9,1,20120620160512Z,"""That you are an idiot who understands neither..."


In [189]:
# Getting the Count of Positive and Negative
data.groupby('Insult').count()

Unnamed: 0_level_0,Date,Comment
Insult,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2265,2898
1,964,1049


# 2. Data Pre-Processing

In [190]:
# splitting Labels and comments from the dataset
insult_label, comments = data['Insult'],data['Comment']

In [191]:
# checking type label and comments from the dataset
type(insult_label), type(comments)

(pandas.core.series.Series, pandas.core.series.Series)

In [192]:
#Text Cleaning: To Lower Case
# converting data to lowercase
comments=comments.str.lower()

In [193]:
comments.iloc[27]

'"but how would you actually get the key out?"'

In [194]:
#Checking comment type
type(comments)

pandas.core.series.Series

In [195]:
# Import Counter package to count the most common sentences
from collections import Counter
c = Counter(comments)
c.most_common(5)

[('"dan_amd\\n\\n\\n\\n\\nyou have realy no clue on every single idiotic comment of all your posts.\\nwe all don\'t enjoy your stupid pro amd b:u:l:l:s:h:i:t ignorance.\\nplease crawl back then in the fat ass of your gay loving buddy charlie\\ndemerjian semiinaccurate and try to find the light inside - u stupid fag!\\n\\n\\n\\n\\nwe realy don\'t need and want your post here anymore!"',
  3),
 ('"you\'re an idiot"', 3),
 ('"faggot"', 2),
 ('"how old are you?"', 2),
 ('"fucking idiots"', 2)]

In [196]:
#nltk
from nltk import sent_tokenize
#RegularExpression
import re
from re import sub
#Import String package to perform string operations
import string
from nltk.stem.wordnet import WordNetLemmatizer #for lemmatizing
from nltk.stem.snowball import SnowballStemmer  #for stemming

In [197]:
#For Text Cleaning, Removing all special characters
def remove_special_char(x):
# Replacing all punctuations with spaces
    punc = string.punctuation.replace("-", "")
    punc = punc.replace("'", "")
    pat= r"[{}]".format(punc)
    x=re.sub(pat, " ", x)
    # Replacing all digits with None
    x=re.sub(pattern=r"\d", repl=r" ", string=x)
    # Stripping extra white spaces
    return " ".join( i for i in stemming(x))

In [198]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shrad\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [199]:
#Stemming and Lemmatization
def stemming(x):
    st = WordNetLemmatizer()
    words=x.strip().split()
    st3=SnowballStemmer("english")
    return [st3.stem(st.lemmatize(x)) for x in words]

In [200]:
# creating copy of the transformed 
comments_transformed = comments.apply(remove_special_char)

In [201]:
# check the change
comments.iloc[1]

'"i really don\'t understand your point.\\xa0 it seems that you are mixing apples and oranges."'

In [202]:
# check the change
comments_transformed.iloc[23]

'you are a land creatur you would drown'

In [203]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shrad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [204]:
#To Remove Stopwords
import nltk
from nltk.corpus import stopwords
# getting stopwords
stopWords = set(stopwords.words('english'))
type(stopWords)

set

In [205]:
#Checking the length of stop words
len(stopWords)

179

In [206]:
#Tokenization
tokens = text.TfidfVectorizer(stop_words=stopWords, ngram_range=(1, 1))
D = tokens.fit_transform(comments_transformed)
print(D.shape)

(3947, 11775)


In [207]:
# checking the sparsity of matrix
print("Each sample has ~{0:.2%} non-zero features".format(D.nnz / float(D.shape[0] * D.shape[1])))

Each sample has ~0.13% non-zero features


In [208]:
# Storing words in vocab list
vocab=list((tokens.vocabulary_).keys())
len(vocab)

11775

In [209]:
vocab.sort()
print(vocab[:50])

['aaaaaaaaa', 'aaaah', 'aaahhh', 'aac', 'aamir', 'aap', 'aarongmy', 'ab', 'abacha', 'abandon', 'abc', 'abe', 'abel', 'aberdeen', 'abet', 'abid', 'abigail', 'abil', 'ability', 'abit', 'abl', 'abnorm', 'abolish', 'abolit', 'abomin', 'abort', 'abortifaci', 'abortion', 'abov', 'abraham', 'abroad', 'abrupt', 'abscam', 'absenc', 'absolut', 'absolutejok', 'absolutely', 'abstain', 'abstractfirework', 'absurd', 'absurdum', 'absurt', 'aburrido', 'abus', 'abuse', 'abuses', 'abxxv', 'abysm', 'ac', 'academ']


# 3. Model Planning and Building Models

Here, we will split the data set into train and test data in 75:25 ratio. Post which we will define the models like Decision Tree, Naive Bayes, Logistic Regression, Random forest and support vector machine.

In [210]:
#Preparing Data - Train and Validation
(D_train, D_val,label_train, label_val) = train_test_split(D, insult_label, test_size=.25)

In [211]:
#Checking shape of train data
D_train.shape

(2960, 11775)

In [214]:
#Checking shape of value
D_val.shape

(987, 11775)

### Linear Classification Models

In [215]:
#Logistic Regression (LR)
def model_LR():
    # creating classifier
    clf = LogisticRegression(tol=1e-8, penalty='l2', C=2)
    # training classifier
    clf.fit(D_train, label_train)
    # model type
    print("Model: ",type(clf))
    # Predicting probabilities
    p = clf.predict_proba(D_val)
    return (clf.predict(D_val),p)

In [216]:
#Support Vector Machine (SVM)
def model_SVM():
    # creating classifier
    clf = svm.LinearSVC(penalty='l2', loss='squared_hinge',tol=1e-8)
    # training classifier
    clf.fit(D_train, label_train)
    # model type
    print("Model: ",type(clf))
    return clf.predict(D_val)

In [217]:
#Naive Bayes (NB)
# Bernoulli Naive Baiyes
def model_BernoulliNB():
    # creating classifier
    clf = nb.BernoulliNB(alpha=1.0, binarize=0.0)
    # training classifier
    clf.fit(D_train, label_train)
    # model type
    print("Model: ",type(clf))
    # Predicting probabilities
    p = clf.predict_proba(D_val)
    return (clf.predict(D_val),p)

### Non-Linear Classification Models

In [218]:
#Random Forest Classifier
def model_RF():
    # creating classifier
    clf = RandomForestClassifier(n_estimators=100)
    # training classifier
    clf.fit(D_train, label_train)
    # model type
    print("Model: ",type(clf))
    # Predicting probabilities
    p = clf.predict_proba(D_val)
    return (clf.predict(D_val),p)

In [219]:
#Decision Tree Classifier (DT)
def model_DT():
    # creating classifier
    clf = DecisionTreeClassifier(max_depth=100)
    # training classifier
    clf.fit(D_train, label_train)
    # model type
    print("Model: ",type(clf))
    # Predicting probabilities
    p = clf.predict_proba(D_val)
    return (clf.predict(D_val),p)

# 4. Model Evaluation

In [220]:
# Importing roc_auc_score for ROC and AUC score
from sklearn.metrics import roc_auc_score as auc_score
# Importing confusion_matrix
from sklearn.metrics import confusion_matrix

In [221]:
def model_evaluation(model,label_test):
    # confusion matrix
    cm = confusion_matrix(label_test, model, labels=None, sample_weight=None)
    tp, fn, fp, tn = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
    precision= float(tp)/(tp+fp)
    recall =  float(tp)/(tp+tn)
    accuracy = np.mean(model == label_test)
    print_results (precision, recall, accuracy)
    return accuracy
    

def print_results (precision, recall, accuracy):
    banner = "Here is the classification report"
    print ('\n',banner)
    print ('=' * len(banner))
    print ('{0:10s} {1:.1f}'.format('Precision',precision*100))
    print ('{0:10s} {1:.1f}'.format('Recall',recall*100))
    print ('{0:10s} {1:.1f}'.format('Accuracy',accuracy*100))
    

#### Evaluating Precision, Recall and Accuracy

In [222]:
#Logistic Regression
# prediction of the model
clf_LR, p = model_LR()
# evaluating model
acc_LR = model_evaluation(clf_LR, label_val)

print ('{0:10s} {1:.1f}'.format('AUC Score',auc_score(label_val, p[:,1])*100))

Model:  <class 'sklearn.linear_model.logistic.LogisticRegression'>

 Here is the classification report
Precision  82.2
Recall     88.2
Accuracy   81.8
AUC Score  86.7


In [223]:
#Support Vector Machine
# prediction of the model
clf_SVM = model_SVM()
# evaluating model
acc_SVM = model_evaluation(clf_SVM, label_val)

Model:  <class 'sklearn.svm.classes.LinearSVC'>

 Here is the classification report
Precision  85.0
Recall     84.5
Accuracy   83.1


In [227]:
#Naive Bayes
# prediction of the model
clf_NB,p=model_BernoulliNB()
# evaluating model
acc_NB = model_evaluation(clf_NB, label_val)

print ('{0:10s} {1:.1f}'.format('AUC Score',auc_score(label_val, p[:,1])*100))

Model:  <class 'sklearn.naive_bayes.BernoulliNB'>

 Here is the classification report
Precision  77.4
Recall     94.5
Accuracy   76.2
AUC Score  80.4


In [228]:
#Random Forest
# prediction of the model
clf_RF,p=model_RF()
# evaluating model
acc_RF = model_evaluation(clf_RF, label_val)

print ('{0:10s} {1:.1f}'.format('AUC Score',auc_score(label_val, p[:,1])*100))

Model:  <class 'sklearn.ensemble.forest.RandomForestClassifier'>

 Here is the classification report
Precision  82.7
Recall     87.3
Accuracy   81.7
AUC Score  82.9


In [229]:
#Decision Tree
# prediction of the model
clf_DT,p=model_DT()
# evaluating model
acc_DT = model_evaluation(clf_DT, label_val)

print ('{0:10s} {1:.1f}'.format('AUC Score',auc_score(label_val, p[:,1])*100))

Model:  <class 'sklearn.tree.tree.DecisionTreeClassifier'>

 Here is the classification report
Precision  84.8
Recall     82.7
Accuracy   78.4
AUC Score  68.2


In [230]:
# Accuracy for all Models
accuracy_model=[acc_LR, acc_SVM, acc_NB, acc_RF, acc_DT]
accuracy_model=[('{0:2f}'.format(i*100)) for i in accuracy_model]

In [231]:
accuracy_model

['81.762918', '83.080041', '76.190476', '81.661601', '78.419453']

In [232]:
def model_evaluation12(model,label_test):
    # confusion matrix code
    cm = confusion_matrix(label_test, model, labels=None, sample_weight=None)
    tp, fn, fp, tn = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
    recall =  float(tp)/(tp+tn)
    #print_results (precision, recall, accuracy)
    return recall
recall_LR = model_evaluation12(clf_LR, label_val)
recall_SVM = model_evaluation12(clf_SVM, label_val)
recall_NB = model_evaluation12(clf_NB, label_val)
recall_RF = model_evaluation12(clf_RF, label_val)
recall_DT = model_evaluation12(clf_DT, label_val)


In [233]:
# Sensitivity for all Models
sensitivity_model=[recall_LR, recall_SVM, recall_NB, recall_RF, recall_DT]
sensitivity_model=[('{0:2f}'.format(i*100)) for i in sensitivity_model]

In [234]:
sensitivity_model

['88.228005', '84.512195', '94.547872', '87.344913', '82.687339']

In [235]:
#Accuracy : 10-Fold Cross Validation
# Logistic Regression
clf1 = LogisticRegression(tol=1e-8, penalty='l2', C=2)
# Support Vector Machines
clf2 = svm.LinearSVC(penalty='l2', loss='squared_hinge')
# Naive Bayes
clf3 = nb.BernoulliNB(alpha=1.0, binarize=0.0)
# Random Forest
clf4 = RandomForestClassifier(n_estimators=100)
# Decision Tree
clf5 = DecisionTreeClassifier(max_depth=100)

models=[clf1, clf2, clf3, clf4, clf5]

In [236]:
### Finding the accuracy of all the models, run the 10 fold cross validation and find the accuracy
n_Folds = 10
# Accuracy after cross validation:
accuracy_cv=[]
for clf in models:
    accuracy_common=0
    for test_run in range(n_Folds):
        (X_train, X_test, y_train, y_test) = train_test_split(D, insult_label, test_size=.2)
        # call classifier
        clf.fit(X_train, y_train)
        model=clf.predict(X_test)
        # compare result
        accuracy=np.mean(model == y_test)
        # append to common
        accuracy_common += accuracy
        # final score
    print ('{0:10s} {1:.1f}'.format('Accuracy',float(accuracy_common)/10*100))
    accuracy_cv.append('{0:.1f}'.format(float(accuracy_common)/10*100))
    
print("Normal Accuracy")
print(accuracy_model)
print("Accuracy post CV")
print(accuracy_cv)

Accuracy   81.1
Accuracy   82.5
Accuracy   75.1
Accuracy   81.8
Accuracy   78.5
Normal Accuracy
['81.762918', '83.080041', '76.190476', '81.661601', '78.419453']
Accuracy post CV
['81.1', '82.5', '75.1', '81.8', '78.5']


In [237]:
# Visualize results
import plotly
plotly.__version__
import plotly.graph_objs as go

In [238]:
#Checking type of accuracy
type(accuracy)

numpy.float64

In [239]:
#accuracy1=np.array(accuracy).tolist()

In [240]:
#type(accuracy1)

# 5. Data Visualization of Results

#### Accuracy of all models

In [241]:
# Plot bar comparison of normal accuracy
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

normal = go.Bar(
            x=['Logistic', 'SVM', 'Naive Bayes', 'Decion Tree', 'Random Forest'],
            y= accuracy_model,
            text=accuracy_model,
            textposition = 'auto',
            marker=dict(
                color='rgb(190,98,221)',
                line=dict(
                color='rgb(8,48,107)',
                width=1.5
            )
    )
)
data = [normal]
iplot(data, filename='basic-bar')

In [242]:
# Plot bar comparison of cross validation accuracy
data = [go.Bar(
            x=['Logistic', 'SVM', 'Naive Bayes', 'Decion Tree', 'Random Forest'],
            y= accuracy_cv,
            text=accuracy_cv,
            textposition = 'auto',
    marker=dict(
                color='rgb(95,208,224)',
                line=dict(
                color='rgb(8,48,107)',
                width=1.5
                )
    )
    )]
iplot(data, filename='basic-bar')

#### Sensitivity of all models

In [243]:
# Ploting bar comparison of sensitivity of all the models
sensitivity = [go.Bar(
            x=['Logistic', 'SVM', 'Naive Bayes', 'Decion Tree', 'Random Forest'],
            y= sensitivity_model,
            text=sensitivity_model,
            textposition = 'auto',
    marker=dict(
                color='rgb(128,0,0)',
                line=dict(
                color='rgb(8,48,107)',
                width=1.5
                )
    )
    )]
iplot(sensitivity, filename='basic-bar')

# 6. Analysis on Results

Logistic Regression:    Normal Accuracy: 82.28, Accuracy with 10 fold cross validation: 81.6,  Sensitivity:84.92
Support Vector Machine: Normal Accuracy: 81.39, Accuracy with 10 fold cross validation: 83,  Sensitivity:81.96
Random Forest:          Normal Accuracy: 79.75 Accuracy with 10 fold cross validation: 78.7, Sensitivity: 78.41

Above three are the top 3 classifier results. The normal accuracy is high for logistic regression compared to Support Vector Machine and Random Forest.
But accuracy with cross vlidation, Support vector machine is high. In case of sensitivity Logistic regression has the highest.

Thus, combining the overall analysis, Logistic regression is best for the data, after which Support Vector Machine and Random Forest perform best.