# CS5100 Foundations of Artificial Intelligence Spring 2018  
Project on developing a machine learning model to classify Wikipedia discussion comments according to whether or not they contain a
personal attack.  
The following code is implemented in a Jupyter notebook using Python 3.    
By Survi Satpathy

Steps for code execution
1. Text cleanup - Removal of special characters such as punctuation, URLs, digits, extra spaces and newline characters
2. Feature Extraction - Tokenizing and transforming the bag of words into useful features to model a classifier
3. List of Classifiers - Selecting a list of classifiers to learn on the extracted features, and comparing the
 performance of each based on ROC AUC score
4. Cross Validation - Performing cross validation on the classifiers, making use of the dev data set, to choose the best classifier
5. Parameter tuning - Performing grid search and randomized search on the selected classifier to find the set of parameters that most improves its ROC AUC score
6. Final Metrics - Testing the performance of the best classifier (using its list of best parameters) on the test data set, and evaluating its ROC AUC score, confusion matrix, classification report, precision, recall, F1 score and support
7. Conclusion 

In [1]:
import pandas as pd
import urllib
import string
import re
import time
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier,Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

In [2]:
# # download annotated comments and annotations

# ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
# ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


# def download_file(url, fname):
#     urllib.request.urlretrieve(url, fname)

                
# download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
# download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [2]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [3]:
len(annotations['rev_id'].unique())

115864

In [4]:
# labels a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [5]:
# join labels and comments
comments['attack'] = labels
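
The majority-vote labelling above can be illustrated on a toy table. This is a minimal sketch: the `rev_id` values, votes and comment texts below are made up for illustration and are not taken from the real data set. Note that an exact tie (mean of 0.5) is not labelled as an attack, and that assigning the label Series to the DataFrame aligns on the `rev_id` index.

```python
import pandas as pd

# Hypothetical annotations: several annotator votes per comment (rev_id).
annotations = pd.DataFrame({
    "rev_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "attack": [1, 1, 0, 0, 0, 1, 1, 0],
})

# A comment is an attack only if strictly more than half of the votes say so.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5

# Hypothetical comments table indexed by rev_id, in a different row order.
comments = pd.DataFrame(
    {"comment": ["tie case", "mostly attack votes", "mostly non-attack votes"]},
    index=pd.Index([3, 1, 2], name="rev_id"),
)

# Assignment aligns on the index, so row order does not matter.
comments["attack"] = labels
print(comments)
```
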

# Text Cleanup

In [6]:
# Text cleanup methods tried: removing punctuation, digits, URLs, multiple white spaces, newlines and tabs
# Also tried removing English stop words, but this did not improve performance, so it was dropped
# Text cleanup methods kept in the final code: removing punctuation, digits, URLs, multiple white spaces,
# newlines and tabs
# This is the first optimization technique applied on data preprocessing and filtering
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r"[!#$%&'()*+,-./:;<=>?@^_`{|}~\d+]+\ *", " ", x))
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r"http\S+", "", x))
comments['comment'] = comments['comment'].apply(lambda x: ' '.join(x.split()))
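
The order of the substitutions matters: once punctuation has been replaced by spaces, a URL no longer contains the `://` that lets `http\S+` match it as a whole. The sketch below compares the two orderings on a made-up sample string. The helper `clean_comment` and its `urls_first` flag are hypothetical names used only for this illustration, and the character class is a lightly simplified version of the one used above.

```python
import re

# Punctuation and digits, mirroring the character class used in the cleanup cell.
PUNCT_DIGITS = re.compile(r"[!#$%&'()*+,\-./:;<=>?@^_`{|}~\d]+")
URL = re.compile(r"http\S+")


def clean_comment(text, urls_first=True):
    """Hypothetical helper: clean one comment, with a switchable step order."""
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    if urls_first:
        text = URL.sub(" ", text)           # remove whole URLs while they are intact
        text = PUNCT_DIGITS.sub(" ", text)
    else:
        text = PUNCT_DIGITS.sub(" ", text)  # breaks the URL into separate words
        text = URL.sub(" ", text)           # nothing left for this pattern to match
    return " ".join(text.split())           # collapse repeated white space


sample = "See http://example.com/page1NEWLINE_TOKENyou are wrong!!"
print(clean_comment(sample, urls_first=True))
print(clean_comment(sample, urls_first=False))
```

With URLs removed first the sample reduces to plain words, while the punctuation-first order leaves the URL fragments in the text as extra tokens.
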



In [7]:
comments.query('attack')['comment'].head()

rev_id
801279                           Iraq is not good USA is bad
2702703    fuck off you little asshole If you want to tal...
4632658           i have a dick its bigger than yours hahaha
6545332    renault you sad little bpy for driving a renau...
6545351    renault you sad little bo for driving a renaul...
Name: comment, dtype: object

# Initial trial of all classifiers 

In [9]:
# This cell demonstrates the performance of various classifiers on the test
# data set before applying cross validation technique. 

# Training and test set used for initial test of all classifiers
train_all_classifiers = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

# list of classifiers to choose from 
clf_list = [
            CalibratedClassifierCV(LinearSVC()),
            LogisticRegression(),
            MultinomialNB(),
            RandomForestClassifier(),
            CalibratedClassifierCV(SGDClassifier(loss='log',max_iter=1600,tol=1e-3)),# SGD classifier 
                                                             # requires passing of parameters like max_iter and tol. 
            CalibratedClassifierCV(SGDClassifier(loss='modified_huber',max_iter=1600,tol=1e-3)),
            CalibratedClassifierCV(Perceptron(max_iter=1600,tol=1e-3)),# Perceptron classifier requires 
                                                                # passing of parameters max_iter & tol
            MLPClassifier()
            ]

# features include - unigram and bigrams of words or chars. Adding both improved the performance of the code
# so included both in the final version
for classifier in clf_list:
    start_time = time.time()
    clf = Pipeline([
        ('vect', FeatureUnion([ 
            # Applied a combination of char and word tokenization using CountVectorizer and TfidfVectorizer
            # After trial and testing applied max_feature as 75000, 
            # decode_error when set to 'replace'/'ignore' performed better than 'strict' 
            # This is the 3rd optimization technique applied.
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace')),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore'))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2')),
        ('clf',  classifier)])
    
    print('Classifier:',classifier)
    clf=clf.fit(train_all_classifiers['comment'], train_all_classifiers['attack'])
    # This is the roc_auc score of the classifier before applying cross validation technique 
    auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
    predict = clf.predict(test_comments['comment'])
    # Calculation of Performance Metric : precision_recall
    precision_recall=precision_recall_fscore_support(test_comments['attack'], predict, average='binary')
    # Calculation of Performance Metric : confusion_matrix
    tn, fp, fn, tp=confusion_matrix(test_comments['attack'],predict).ravel()

    print('precision recall:',precision_recall)
    print('true neg: ',tn,' false pos: ',fp,' false neg: ',fn,' true pos: ',tp)

    print('Test ROC AUC: %.3f' %auc)
    print("Time taken %s seconds " % (time.time() - start_time))
    print('-----------------------------------------------------------------------------------------')
    

Classifier: CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
            cv=3, method='sigmoid')
precision recall: (0.8874501992031872, 0.6465892597968069, 0.7481108312342569, None)
true neg:  20196  false pos:  226  false neg:  974  true pos:  1782
Test ROC AUC: 0.965
Time taken 216.07876348495483 seconds 
-----------------------------------------------------------------------------------------
Classifier: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
precision recall: (0.9126406353408338, 0.5003628447024674, 0.6463557534567612, None)
true neg:  20290  false pos:  132 
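
To make the feature step easier to follow, here is a minimal sketch of what `FeatureUnion` does with a word-level and a character-level vectorizer: it fits each one separately and stacks their outputs side by side, so the number of columns is the sum of the two vocabulary sizes. The two example documents are made up, and the step names `word` and `char` are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Two made-up documents, only to give the vectorizers something to fit on.
docs = ["you are great", "you are not great at all"]

union = FeatureUnion([
    ('word', CountVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('char', CountVectorizer(analyzer='char', ngram_range=(1, 2))),
])
X = union.fit_transform(docs)

# After fitting, the union holds the fitted vectorizers under their step names.
fitted = dict(union.transformer_list)
n_word = len(fitted['word'].vocabulary_)
n_char = len(fitted['char'].vocabulary_)

print("matrix shape:", X.shape)
print("word n-grams:", n_word, "char n-grams:", n_char)
```
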

# Performance metrics before Cross Validation
Comparing the ROC AUC scores of the classifiers listed above, **LinearSVC** has been found to be   
the best so far. Let us now calculate the values of various performance metrics for LinearSVC.   
This step is done only to provide a baseline for comparison with the final result metrics.  

In [19]:
# Multiple performance metrics are captured below
# Looking at the confusion matrix we can tell how many of the true positives were classified correctly from
# the test set. The classification report gives the precision, recall, F1 score and support values from the
# test set. This gives an idea of how well the classifier performed. 

# Training and test set used for initial test of all classifiers
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")
clf = Pipeline([
        # Feature Union of the previously used combination of CountVectorizer and TfIdfVectorizer
        # Applied sublinear tf scaling, i.e. replace tf with 1 + log(tf).
        ('vect', FeatureUnion([
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace',sublinear_tf=True)),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore',sublinear_tf=True))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf=True)),
        # For the purpose of using the roc_auc_score function on the learnt classifier, CalibratedClassifierCV() 
        # is used on LinearSVC (with its default parameters in this cell)
        # This was required because by default LinearSVC does not have a predict_proba() function
        # Used LinearSVC as the classifier here as it showed the best result on the previous cell
        ('clf',  CalibratedClassifierCV(LinearSVC()))])

clf.fit(train_comments['comment'], train_comments['attack'])
predict = clf.predict(test_comments['comment'])
# Calculation of Performance Metric : precision_recall
precision_recall=precision_recall_fscore_support(test_comments['attack'], predict, average='binary')
# Calculation of Performance Metric : confusion_matrix
tn, fp, fn, tp=confusion_matrix(test_comments['attack'],predict).ravel()

print('precision recall:',precision_recall)
print('true neg: ',tn,' false pos: ',fp,' false neg: ',fn,' true pos: ',tp)
print('-----------------------------------------------------------------------------------------')


precision recall: (0.8921326076199901, 0.6542089985486212, 0.7548670713837138, None)
true neg:  20204  false pos:  218  false neg:  953  true pos:  1803
-----------------------------------------------------------------------------------------
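
The notebook wraps LinearSVC in CalibratedClassifierCV to obtain `predict_proba()`. As a side note, `roc_auc_score` only needs a ranking score, so the raw margins from `decision_function()` can also be passed to it. The sketch below shows both routes on synthetic data generated with `make_classification`; the data and the sizes chosen are assumptions for illustration, and the resulting numbers say nothing about the Wikipedia data set.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data, not the comment features used in this notebook.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Route 1: plain LinearSVC, scored with its signed distances to the hyperplane.
svc = LinearSVC(random_state=0).fit(X_train, y_train)
auc_margin = roc_auc_score(y_test, svc.decision_function(X_test))

# Route 2: calibrated wrapper, scored with the probability of the positive class.
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0), cv=3)
calibrated.fit(X_train, y_train)
auc_proba = roc_auc_score(y_test, calibrated.predict_proba(X_test)[:, 1])

print("ROC AUC from margins:       %.3f" % auc_margin)
print("ROC AUC from probabilities: %.3f" % auc_proba)
```

The calibrated route is still the one to use when actual probability estimates are needed, for example to choose a decision threshold.
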


# Cross Validation to choose the best classifier

In [9]:
# Used training and dev set for choosing best classifier
# Making use of development set ensured more data for the classifier to learn. 
# This is the 2nd optimization technique used. 

train_all_classifiers = comments.query("split=='train' | split =='dev'")

# The ML classifiers used are LinearSVC, Logistic Regression, Multinomial NB, 
# Random Forest Classifier, Stochastic Gradient Descent, Perceptron and Multi Layer Perceptron
# The best result for each classifier is shown below. 
# I found LinearSVC to be the best classifier among all
clf_list = [
            LinearSVC(),
            LogisticRegression(),
            MultinomialNB(),
            RandomForestClassifier(),
            SGDClassifier(loss='log',max_iter=1600,tol=1e-3),# SGD classifier requires passing of parameters like
                                                             # max_iter and tol. 
            SGDClassifier(loss='modified_huber',max_iter=1600,tol=1e-3),
            Perceptron(max_iter=1600,tol=1e-3),# Perceptron classifier requires passing of parameters max_iter & tol
            MLPClassifier()
            ]
# features include - unigram and bigrams of words or chars. Adding both improved the performance of the code
# so included both in the final code
for classifier in clf_list:
    start_time = time.time()
    print(classifier)
    clf = Pipeline([
        ('vect', FeatureUnion([ 
            # Applied a combination of char and word tokenization using CountVectorizer and TfidfVectorizer
            # After trial and testing applied max_feature as 75000, 
            # decode_error when set to 'replace'/'ignore' performed better than 'strict' 
            # This is the 3rd optimization technique applied.
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace')),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore'))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2')),
        ('clf',  classifier)])
    # Cross validation score was calculated for each classifier with the scoring parameter as 'roc_auc'
    # cv=5 i.e. a K fold cross validation was performed with k as 5
    # This is the 4th optimization technique applied. 
    scores=cross_val_score(clf, train_all_classifiers['comment'], train_all_classifiers['attack'], cv=5, scoring='roc_auc')
    # time taken by each classifier was monitored
    print("Time taken %s seconds " % (time.time() - start_time))
    # mean cross validation ROC AUC score (printed under the label "Accuracy") with two standard deviations
    # is used to find the best classifier
    print("Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Time taken 216.67947006225586 seconds 
Accuracy: 0.9623 (+/- 0.0081)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Time taken 237.87252974510193 seconds 
Accuracy: 0.9508 (+/- 0.0073)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Time taken 222.55188393592834 seconds 
Accuracy: 0.8381 (+/- 0.0185)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_
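
The cross-validation step can be sketched on a very small scale as follows. The twelve example sentences are made up, the pipeline is reduced to a single character n-gram TfidfVectorizer, and `cv=3` is used only because the toy set is tiny, so the scores are not comparable with the ones reported above. The printed value is labelled as ROC AUC because that is what `scoring='roc_auc'` returns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy comments: six harmless ones and six attacks.
nice = ["thanks for the helpful edit", "great work on this article",
        "I appreciate your contribution", "good point well made",
        "nice job cleaning up the page", "thank you for the sources"]
nasty = ["you are an idiot", "shut up you moron",
         "what a stupid edit you fool", "you are a pathetic loser",
         "get lost you idiot", "nobody likes you moron"]
texts = nice + nasty
labels = [False] * len(nice) + [True] * len(nasty)

clf = Pipeline([
    ('vect', TfidfVectorizer(analyzer='char', ngram_range=(1, 2))),
    ('clf', LinearSVC()),
])

# Stratified 3-fold cross validation; each fold is scored by ROC AUC.
scores = cross_val_score(clf, texts, labels, cv=3, scoring='roc_auc')
print("ROC AUC: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
```
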

# Hyperparameter tuning using GridSearchCV 

In [13]:
# This cell shows the grid search operation implemented on the best classifier found previously. 
# After the best params are found from the grid search operation, these are applied on the classifier 
# The classifier is then used to predict on the test data set
# Classification report is then generated on the predicted labels.
# The ROC AUC score, confusion matrix and precision, recall, F1 score and support are also included below

# for the grid search operation we only use training set samples
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

# Hyper parameters tuned for linear SVM are C, tol, loss and random_state
# ROC AUC score before applying hyperparameter tuning was = 95.6%
# ROC AUC score after applying hyperparameter tuning is = 96.6%
# Difference between the two scores = 1 percentage point
# The tuned hyperparameters are : {'C': 0.4, 'loss': 'squared_hinge', 'random_state': 37, 'tol': 0.001}

# params list for the grid search operation
params = [{'clf__C':[0.4,0.6], # C - penalty factor for error term
           'clf__tol':[1e-4,1e-3], # tol - tolerance for stopping criteria
           'clf__loss':('hinge','squared_hinge'), # loss function - hinge is the standard SVM loss & 
                                                  # squared hinge is its square
          'clf__random_state':[37,67]}, # random state -The seed of the pseudo random number generator 
                                        # to use when shuffling the data.
          {'clf__C':[0.5],
           'clf__tol':[1e-4]}]
clf = Pipeline([
        # Feature Union of the previously used combination of CountVectorizer and TfIdfVectorizer
        # Applied sublinear tf scaling, i.e. replace tf with 1 + log(tf).
        ('vect', FeatureUnion([
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace',sublinear_tf=True)),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore',sublinear_tf=True))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf=True)),
        # Used LinearSVC as the classifier here as it showed the best result on the previous cell
        ('clf',  LinearSVC())])
# Grid Search function applied on the classifier with its params list to choose from
# scoring parameter is set to ROC_AUC to maintain uniformity across comparisons
clf=GridSearchCV(clf, params,scoring='roc_auc')
# The model learns on the training dataset 
clf.fit(train_comments['comment'], train_comments['attack'])
print("Best parameters set found on train set:")
print()
print(clf.best_params_)
print()
print("Grid scores on training set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full training set.")
print("The scores are computed on the full test set.")
print()
# used the trained model to predict on the test data set
y_true, y_pred = test_comments['attack'], clf.predict(test_comments['comment'])
print(classification_report(y_true, y_pred))
print()
best_params = {}
for key in clf.best_params_:
    best_params[key[5:]] = clf.best_params_[key]
    
print(best_params)



Best parameters set found on train set:

{'clf__C': 0.4, 'clf__loss': 'squared_hinge', 'clf__random_state': 37, 'clf__tol': 0.001}

Grid scores on training set:

0.956 (+/-0.006) for {'clf__C': 0.4, 'clf__loss': 'hinge', 'clf__random_state': 37, 'clf__tol': 0.0001}
0.956 (+/-0.006) for {'clf__C': 0.4, 'clf__loss': 'hinge', 'clf__random_state': 37, 'clf__tol': 0.001}
0.956 (+/-0.006) for {'clf__C': 0.4, 'clf__loss': 'hinge', 'clf__random_state': 67, 'clf__tol': 0.0001}
0.956 (+/-0.006) for {'clf__C': 0.4, 'clf__loss': 'hinge', 'clf__random_state': 67, 'clf__tol': 0.001}
0.961 (+/-0.004) for {'clf__C': 0.4, 'clf__loss': 'squared_hinge', 'clf__random_state': 37, 'clf__tol': 0.0001}
0.961 (+/-0.004) for {'clf__C': 0.4, 'clf__loss': 'squared_hinge', 'clf__random_state': 37, 'clf__tol': 0.001}
0.961 (+/-0.004) for {'clf__C': 0.4, 'clf__loss': 'squared_hinge', 'clf__random_state': 67, 'clf__tol': 0.0001}
0.961 (+/-0.004) for {'clf__C': 0.4, 'clf__loss': 'squared_hinge', 'clf__random_state': 6
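
Two details of the grid search cell can be checked without training anything. First, a list of two dictionaries is searched as two separate grids, so the grid above yields 16 + 1 = 17 candidate settings. Second, the `clf__` prefix that routes a parameter to the `clf` pipeline step has to be stripped before the values can be passed to `LinearSVC` directly. The sketch below is an illustration only; `strip_step_prefix` is a hypothetical helper name, not a scikit-learn function.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.svm import LinearSVC

# Same grid as in the cell above.
params = [{'clf__C': [0.4, 0.6],
           'clf__tol': [1e-4, 1e-3],
           'clf__loss': ('hinge', 'squared_hinge'),
           'clf__random_state': [37, 67]},
          {'clf__C': [0.5],
           'clf__tol': [1e-4]}]

# Each dictionary is expanded on its own: 2 * 2 * 2 * 2 settings plus 1.
n_candidates = len(ParameterGrid(params))
print("candidate settings:", n_candidates)


def strip_step_prefix(best_params, step='clf'):
    """Hypothetical helper: drop the '<step>__' prefix from pipeline parameter names."""
    prefix = step + '__'
    return {key[len(prefix):]: value
            for key, value in best_params.items()
            if key.startswith(prefix)}


# Best parameters as reported by the grid search above.
best = {'clf__C': 0.4, 'clf__loss': 'squared_hinge',
        'clf__random_state': 37, 'clf__tol': 0.001}
svc_params = strip_step_prefix(best)
tuned_svc = LinearSVC(**svc_params)
print(svc_params)
```
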

# Hyperparameter tuning using RandomizedSearchCV

In [14]:
# for random search operation we only use training set samples
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")
# Hyper parameters tuned for linear SVM are C, tol, loss, random_state & max_iter
# ROC AUC score before applying hyperparameter tuning was = 95.6%
# ROC AUC score after applying hyperparameter tuning is = 96.6%
# Difference between the two scores = 1 percentage point
# The tuned hyperparameters are : {'C': 0.4, 'loss': 'squared_hinge', 'random_state': 37, 'tol': 0.0001, 
#                                   'max_iter': 2000}

# params dictionary for the random search operation
params = {'clf__C':[0.4,0.6], # C - penalty factor for error term
          'clf__tol':[1e-4,1e-3], # tol - tolerance for stopping criteria
          'clf__loss':('hinge','squared_hinge'), # loss function - hinge is the standard SVM loss & 
                                                  # squared hinge is its square
          'clf__random_state':[37,67], # random state -The seed of the pseudo random number generator 
                                        # to use when shuffling the data.
          'clf__max_iter':[1500,2000]} # The maximum number of iterations to be run.
clf = Pipeline([
        # Feature Union of the previously used combination of CountVectorizer and TfIdfVectorizer
        # Applied sublinear tf scaling, i.e. replace tf with 1 + log(tf).
        ('vect', FeatureUnion([
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace',sublinear_tf=True)),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore',sublinear_tf=True))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf=True)),
        # Used LinearSVC as the classifier here as it showed the best result on the previous cell
        ('clf',  LinearSVC())])
# Random Search function applied on the classifier with its params dictionary to choose from
# scoring parameter is set to ROC_AUC to maintain uniformity across comparisons
clf=RandomizedSearchCV(clf, param_distributions=params,scoring='roc_auc',return_train_score=True)
# The model learns on the training dataset 
clf.fit(train_comments['comment'], train_comments['attack'])
print("Best parameters set found on train set:")
print()
print(clf.best_params_)
print()
print("Randomized Search scores on training set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full training set.")
print("The scores are computed on the full test set.")
print()
# used the trained model to predict on the test data set
y_true, y_pred = test_comments['attack'], clf.predict(test_comments['comment'])
print(classification_report(y_true, y_pred))
print()
best_params = {}
for key in clf.best_params_:
    best_params[key[5:]] = clf.best_params_[key]
    
print(best_params)



Best parameters set found on train set:

{'clf__tol': 0.001, 'clf__random_state': 37, 'clf__max_iter': 2000, 'clf__loss': 'squared_hinge', 'clf__C': 0.4}

Randomized Search scores on training set:

0.957 (+/-0.006) for {'clf__tol': 0.001, 'clf__random_state': 37, 'clf__max_iter': 2000, 'clf__loss': 'hinge', 'clf__C': 0.6}
0.960 (+/-0.004) for {'clf__tol': 0.0001, 'clf__random_state': 37, 'clf__max_iter': 2000, 'clf__loss': 'squared_hinge', 'clf__C': 0.6}
0.957 (+/-0.006) for {'clf__tol': 0.001, 'clf__random_state': 67, 'clf__max_iter': 1500, 'clf__loss': 'hinge', 'clf__C': 0.6}
0.956 (+/-0.006) for {'clf__tol': 0.0001, 'clf__random_state': 37, 'clf__max_iter': 2000, 'clf__loss': 'hinge', 'clf__C': 0.4}
0.957 (+/-0.006) for {'clf__tol': 0.001, 'clf__random_state': 67, 'clf__max_iter': 2000, 'clf__loss': 'hinge', 'clf__C': 0.6}
0.961 (+/-0.004) for {'clf__tol': 0.0001, 'clf__random_state': 37, 'clf__max_iter': 1500, 'clf__loss': 'squared_hinge', 'clf__C': 0.4}
0.956 (+/-0.006) for {'clf_
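
Unlike the grid search, the randomized search does not try every combination. The dictionary above spans 2 x 2 x 2 x 2 x 2 = 32 settings, and `RandomizedSearchCV` with its default `n_iter=10` evaluates only 10 of them. The sketch below uses `ParameterGrid` and `ParameterSampler` to show this without fitting any model; the `random_state=0` seed is an arbitrary choice for the illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the cell above.
params = {'clf__C': [0.4, 0.6],
          'clf__tol': [1e-4, 1e-3],
          'clf__loss': ('hinge', 'squared_hinge'),
          'clf__random_state': [37, 67],
          'clf__max_iter': [1500, 2000]}

# Size of the full grid that an exhaustive search would have to cover.
n_total = len(ParameterGrid(params))

# Draw 10 settings, as a randomized search with n_iter=10 would.
# When every entry is a list, settings are drawn without replacement.
sampled = list(ParameterSampler(params, n_iter=10, random_state=0))
n_unique = len({tuple(sorted(p.items())) for p in sampled})

print("settings in the full grid:", n_total)
print("settings sampled:", len(sampled), "of which unique:", n_unique)
```
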

# Testing the model 

In [13]:
# This cell demonstrates the performance of LinearSVC, with its tuned parameters, applied to the
# test set for prediction. 
# Multiple performance metrics are captured below on the learnt classifier
# Looking at the confusion matrix we can tell how many of the true positives were classified correctly from
# the test set. The classification report gives the precision, recall, F1 score and support values from the
# test set. This gives an idea of how well the classifier performed. 
# Cross validation has been previously applied to select the best classifier among the list.
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
        # Feature Union of the previously used combination of CountVectorizer and TfIdfVectorizer
        # Applied sublinear tf scaling, i.e. replace tf with 1 + log(tf).
        ('vect', FeatureUnion([
            ('count',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='word',
                                     decode_error='replace')),
            ('count1',CountVectorizer(max_features = 75000, ngram_range = (1,2),analyzer='char',
                                     decode_error='replace')),
            ('tfvect',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='word',
                                    decode_error='replace',sublinear_tf=True)),
            ('tfvect1',TfidfVectorizer(max_features = 75000,norm = 'l2',analyzer='char',
                                      decode_error='ignore',sublinear_tf=True))
        ])),
        ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf=True)),
        # For the purpose of using the roc_auc_score function on the learnt classifier, CalibratedClassifierCV() 
        # is used on LinearSVC along with its best params list
        # This was required because by default LinearSVC does not have a predict_proba() function
        # Used LinearSVC as the classifier here as it showed the best result on the previous cell
        ('clf',  CalibratedClassifierCV(LinearSVC(tol= 0.001, random_state= 37, max_iter= 2000, loss= 'squared_hinge', C= 0.4)))])

clf.fit(train_comments['comment'], train_comments['attack'])
# Calculation of Performance Metric : roc_auc_score
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
predict = clf.predict(test_comments['comment'])
# Calculation of Performance Metric : precision_recall (weighted average over both classes)
precision_recall=precision_recall_fscore_support(test_comments['attack'], predict, average='weighted')
# Calculation of Performance Metric : confusion_matrix
tn, fp, fn, tp=confusion_matrix(test_comments['attack'],predict).ravel()
y_true, y_pred = test_comments['attack'], clf.predict(test_comments['comment'])
print(classification_report(y_true, y_pred))
print()
print('Test ROC AUC: %.3f' %auc)
print('precision recall:',precision_recall)
print('true neg: ',tn,' false pos: ',fp,' false neg: ',fn,' true pos: ',tp)
print('-----------------------------------------------------------------------------------------')


             precision    recall  f1-score   support

      False       0.96      0.99      0.97     20422
       True       0.89      0.66      0.76      2756

avg / total       0.95      0.95      0.95     23178


Test ROC AUC: 0.966
precision recall: (0.9476515491220501, 0.9497368193977047, 0.9465097793283376, None)
true neg:  20189  false pos:  233  false neg:  932  true pos:  1824
-----------------------------------------------------------------------------------------
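
The precision, recall and F1 score for the attack class can be recomputed by hand from the four confusion matrix counts printed above. The short worked example below does this. It also shows that plain accuracy equals the weighted-average recall (0.9497) that the cell prints with `average='weighted'`.

```python
# Confusion matrix counts printed by the cell above.
tn, fp, fn, tp = 20189, 233, 932, 1824

# Metrics for the positive (attack) class.
precision = tp / (tp + fp)   # share of predicted attacks that are real attacks
recall = tp / (tp + fn)      # share of real attacks that were found
f1 = 2 * precision * recall / (precision + recall)

# Overall accuracy; this equals the recall averaged over both classes
# weighted by their support.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision: %.4f" % precision)
print("recall:    %.4f" % recall)
print("f1 score:  %.4f" % f1)
print("accuracy:  %.4f" % accuracy)
```

The low recall next to the high accuracy reflects the class imbalance: only 2756 of the 23178 test comments are attacks.
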


# Interesting facts learnt :  
1. Preprocessing and data cleanup have a great impact on the performance of any classifier.   
   Noise reduction in the data leads to better training of the model, and thus to a better result   
   on the test data.  
2. GridSearchCV and RandomizedSearchCV are useful functions for finding the best parameters for any classifier.    
3. Random forest can handle large data sets with high dimensionality. It can handle thousands
   of input variables and identify the most significant ones. It does a good job at classification, 
   but is less suited to regression problems, as it does not give precise continuous predictions.
4. SGD can be successfully applied to the large-scale and sparse machine learning problems often 
   encountered in text classification and natural language processing. However, SGD requires a number of 
   hyperparameters, such as the regularization parameter and the number of iterations.
5. MLP cannot guarantee that the minimum it stops at during training is the global minimum. 
   The MLP algorithm can, therefore, get stuck in a local minimum. Another disadvantage is that 
   the number of hidden neurons must be set by the user: setting this value too low may result  
   in the MLP model underfitting, while setting it too high may result in overfitting. 

# Challenges faced:
1. Finding the list of parameters that should be provided to the grid search/randomized search operation.   
   This is a time-consuming process, as it needs careful trials of various values of   
   the possible parameters to decide which suit the experiment best.  
2. Finding the right combination of features in the feature union provided for   
   tokenization of text. This was again achieved on a trial and error basis and hence was a lengthy process.

# Final Metrics 
   Test ROC AUC score: 0.966    
   precision recall: (0.8867282450170151, 0.6618287373004355, 0.7579472262622066, None)  
   true neg:  20189  false pos:  233  false neg:  932  true pos:  1824  
   The LinearSVC model gave the above metrics    

   The original strawman code had a ROC AUC score of 0.957  
   Thus LinearSVC improved the ROC AUC score by 0.009 (0.9 percentage points)     

# Classification Report
                 precision    recall  f1-score   support  

          False       0.95      0.99      0.97     20422  
           True       0.92      0.62      0.74      2756  

    avg / total       0.95      0.95      0.94     23178  
      



In [17]:
# correctly classify nice comment
clf.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [18]:
# correctly classify nasty comment
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])