The Unhealthy Comments Corpus (UCC) is a corpus of 44355 comments intended to assist research on identifying subtle attributes that contribute to unhealthy conversations online. Each comment is labelled either healthy or unhealthy, and is additionally annotated with 7 sub-attributes: (1) hostile; (2) antagonistic, insulting, provocative or trolling; (3) dismissive; (4) condescending or patronising; (5) sarcastic; (6) an unfair generalisation; and/or (7) a generalisation. These sub-attributes let a model judge whether a comment is healthy, using both whether each sub-attribute is True (1) or False (0) and the confidence score attached to it.

This notebook will walk you through how ensemble learning can be used to predict whether a comment is healthy or unhealthy.

The data we will be using comes from two CSV files provided by UCC: train.csv and test.csv.

    First we import the three libraries we will rely on throughout (NumPy, Matplotlib and pandas) and load the training set into a DataFrame.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('train.csv')
dataset.head()

Unnamed: 0,_unit_id,_trusted_judgments,comment,antagonise,antagonise:confidence,condescending,condescending:confidence,dismissive,dismissive:confidence,generalisation,generalisation:confidence,generalisation_unfair,generalisation_unfair:confidence,healthy,healthy:confidence,hostile,hostile:confidence,sarcastic,sarcastic:confidence
0,1739460326,5,proving there is no cure for stupidity.,1,0.5816,1,0.5816,1,0.5816,0,1.0,0.0,1.0,0,0.5816,1.0,0.5816,0.0,0.8001
1,2297540155,5,Personally I prefer the Flying Spaghetti Monst...,0,1.0,0,1.0,0,1.0,0,1.0,0.0,1.0,1,0.7981,0.0,1.0,0.0,1.0
2,1812168131,5,Your comparing a pipeline to a well? One that ...,0,1.0,0,0.8063,0,1.0,0,1.0,0.0,1.0,1,0.6081,0.0,1.0,0.0,0.8063
3,1739470334,5,who is writing this pap!?,0,0.7931,0,0.5959,0,0.5959,0,1.0,0.0,1.0,0,0.7917,0.0,1.0,0.0,0.6052
4,1739466190,3,Natives refuse to even consider that their cur...,0,1.0,0,1.0,0,1.0,0,1.0,0.0,1.0,1,1.0,0.0,1.0,0.0,1.0


    We first want to check that the dataset is balanced; if it is not, a model trained on it will be biased towards the majority class.

In [2]:
dataset_healthy = dataset[dataset.healthy == 1]
dataset_unhealthy = dataset[dataset.healthy == 0]


    First let's check how many healthy comments there are in the dataset.

In [3]:
dataset_healthy["healthy"]

1        1
2        1
4        1
5        1
6        1
        ..
35498    1
35499    1
35500    1
35501    1
35502    1
Name: healthy, Length: 32848, dtype: int64

    Now let's look at how many unhealthy comments there are in the dataset.

In [4]:
dataset_unhealthy["healthy"]

0        0
3        0
17       0
29       0
41       0
        ..
35430    0
35458    0
35466    0
35490    0
35497    0
Name: healthy, Length: 2655, dtype: int64
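
    A more compact way to run the same check is `value_counts()`. The sketch below does not read train.csv; it rebuilds only the label column from the two lengths printed above (an assumption made to keep the example self-contained) and also computes the accuracy a model would reach by always predicting the majority class.

```python
import pandas as pd

# Stand-in for dataset["healthy"], rebuilt from the lengths printed above:
# 32848 healthy (1) and 2655 unhealthy (0) comments.
labels = pd.Series([1] * 32848 + [0] * 2655, name="healthy")

# Absolute counts per class and the same counts as fractions of the total.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()

print(counts)
print("always-majority baseline accuracy: {:.4f}".format(majority_baseline))
```

    In the notebook itself, `dataset["healthy"].value_counts()` would give the same counts in one line.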

    There is clearly an imbalance between unhealthy and healthy comments, which would skew our results: if we trained on this data, a model could predict 1 for every comment and still score above 90% accuracy. To avoid this we balance the dataset by upsampling the minority class (unhealthy comments) and downsampling the majority class (healthy comments).

In [5]:
from sklearn.utils import resample
dataset_unhealthy_upsampled = resample(dataset_unhealthy,
                                      replace=True,
                                      n_samples=20000,
                                      random_state=123)
dataset_healthy_downsampled = resample(dataset_healthy,
                                        replace=False,
                                        n_samples=20000,
                                        random_state=123)
train_dataset = pd.concat([dataset_healthy_downsampled, dataset_unhealthy_upsampled])
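
    To make the effect of the two `resample` calls concrete, here is a minimal sketch on a made-up eight-row frame (the rows and sample sizes are illustrative, not taken from the corpus). It shows that upsampling with `replace=True` necessarily repeats rows, while downsampling with `replace=False` does not.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: 6 healthy rows and 2 unhealthy rows (illustrative values only).
toy = pd.DataFrame({"comment": list("abcdefgh"),
                    "healthy": [1, 1, 1, 1, 1, 1, 0, 0]})
toy_healthy = toy[toy.healthy == 1]
toy_unhealthy = toy[toy.healthy == 0]

# Upsample the minority class with replacement: 2 rows become 4.
up = resample(toy_unhealthy, replace=True, n_samples=4, random_state=123)
# Downsample the majority class without replacement: 6 rows become 4.
down = resample(toy_healthy, replace=False, n_samples=4, random_state=123)

balanced = pd.concat([down, up])
print(balanced["healthy"].value_counts())
# Index labels that occur more than once are rows repeated by upsampling.
print("repeated rows:", balanced.index.duplicated().sum())
```

    The repeated index labels are worth keeping in mind later: any boolean mask used on the resampled frame should be built from that frame itself.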


    Now let us check how balanced the new dataset is.

In [6]:
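# Build the masks from train_dataset itself: a mask taken from the original
# `dataset` does not line up with the resampled index and triggers a warning.
new_dataset_healthy = train_dataset[train_dataset.healthy == 1]
print(new_dataset_healthy["healthy"])
new_dataset_unhealthy = train_dataset[train_dataset.healthy == 0]
print(new_dataset_unhealthy["healthy"])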

27173    1
32171    1
32036    1
17321    1
26325    1
        ..
26798    1
32108    1
7360     1
8128     1
1054     1
Name: healthy, Length: 20000, dtype: int64
18151    0
14849    0
23955    0
29069    0
15192    0
        ..
13998    0
30137    0
1370     0
19554    0
25081    0
Name: healthy, Length: 20000, dtype: int64



     Now that the data is balanced, we can move on to selecting the useful features from train_dataset. For X_data we select all the sub-attributes and their confidence scores, and for Y_data the "healthy" column.

In [7]:
X_set_1 = train_dataset.iloc[:, 3:13].values
df_X_1 = pd.DataFrame(X_set_1,
                      columns=["antagonise", "antagonise:confidence", "condescending", "condescending:confidence",
                               "dismissive", "dismissive:confidence", "generalisation", "generalisation:confidence",
                               "generalisation_unfair", "generalisation_unfair:confidence"])
X_set_2 = train_dataset.iloc[:, 15:].values
df_X_2 = pd.DataFrame(X_set_2, columns=["hostile", "hostile:confidence", "sarcastic", "sarcastic:confidence"])

X_data_train = pd.concat([df_X_1.reset_index(drop=True),
                          df_X_2.reset_index(drop=True)],
                         axis=1,
                         ignore_index=True)
X_data_columns = [
    list(df_X_1.columns),
    list(df_X_2.columns)]

flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
X_data_train.columns = flatten(X_data_columns)


y_data = train_dataset.iloc[:, 13].values
y_train = pd.DataFrame(y_data, columns=["Healthy"])
y_train = y_train.values.ravel()


In [8]:
X_data_train.head()

Unnamed: 0,antagonise,antagonise:confidence,condescending,condescending:confidence,dismissive,dismissive:confidence,generalisation,generalisation:confidence,generalisation_unfair,generalisation_unfair:confidence,hostile,hostile:confidence,sarcastic,sarcastic:confidence
0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.0,0.6134,0.0,0.6134,0.0,0.6134,0.0,1.0,0.0,1.0,0.0,0.8026,0.0,0.8026
3,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.8046,0.0,0.8046
4,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0


    If we look at X_data, we can see that each confidence score is relative to the boolean value of its attribute: it is the confidence in whichever label (0 or 1) was assigned. We want every confidence score to be of the same type (either all unhealthy confidence scores or all healthy confidence scores), so that the model can make better sense of the data.

In [9]:
c1 = X_data_train["antagonise"] - X_data_train["antagonise:confidence"]
c2 = X_data_train["condescending"] - X_data_train["condescending:confidence"]
c3 = X_data_train["dismissive"] - X_data_train["dismissive:confidence"]
c4 = X_data_train["generalisation"] - X_data_train["generalisation:confidence"]
c5 = X_data_train["generalisation_unfair"] - X_data_train["generalisation_unfair:confidence"]
c6 = X_data_train["hostile"] - X_data_train["hostile:confidence"]
c7 = X_data_train["sarcastic"] - X_data_train["sarcastic:confidence"]

X_train = pd.DataFrame({"antagonise": c1,
                        "condescending": c2,
                        "dismissive": c3,
                        "generalisation": c4,
                        "generalisation_unfair": c5,
                        "hostile": c6,
                        "sarcastic": c7},
                       )
X_train = X_train.abs()

In [10]:
X_train.head()

Unnamed: 0,antagonise,condescending,dismissive,generalisation,generalisation_unfair,hostile,sarcastic
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,0.6134,0.6134,0.6134,1.0,1.0,0.8026,0.8026
3,1.0,1.0,1.0,1.0,1.0,0.8046,0.8046
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0


    Now all the scores point the same way. Since we take |label - confidence|, a score of 1 means full confidence that the attribute is absent (the healthy direction) and a score of 0 means full confidence that it is present (the unhealthy direction).
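
    To see what the subtraction followed by `abs()` yields for each combination of label and confidence, here is a small worked example. The column names come from the dataset; the four rows are made-up values chosen to cover each case.

```python
import pandas as pd

# Illustrative rows: label 1 or 0, each with full or partial confidence.
toy = pd.DataFrame({
    "hostile": [1, 1, 0, 0],
    "hostile:confidence": [1.0, 0.6, 0.6, 1.0],
})

# Same transform as in the notebook: |label - confidence|.
# label 1 -> 1 - confidence, label 0 -> confidence.
score = (toy["hostile"] - toy["hostile:confidence"]).abs()
toy["score"] = score
print(toy)
```

    Reading the result row by row: a confidently present attribute maps to 0, a confidently absent one maps to 1, and uncertain annotations land in between.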

    The same feature extraction is now repeated for the test dataset (without any resampling).

In [11]:
test_dataset = pd.read_csv('test.csv')
X_set_1 = test_dataset.iloc[:, 3:13].values
df_X_1 = pd.DataFrame(X_set_1,
                      columns=["antagonise", "antagonise:confidence", "condescending", "condescending:confidence",
                               "dismissive", "dismissive:confidence", "generalisation", "generalisation:confidence",
                               "generalisation_unfair", "generalisation_unfair:confidence"])
X_set_2 = test_dataset.iloc[:, 15:].values
df_X_2 = pd.DataFrame(X_set_2, columns=["hostile", "hostile:confidence", "sarcastic", "sarcastic:confidence"])

X_data_test = pd.concat([df_X_1.reset_index(drop=True),
                         df_X_2.reset_index(drop=True)],
                        axis=1,
                        ignore_index=True)
X_data_columns = [
    list(df_X_1.columns),
    list(df_X_2.columns)]

flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
X_data_test.columns = flatten(X_data_columns)

c1 = X_data_test["antagonise"] - X_data_test["antagonise:confidence"]
c2 = X_data_test["condescending"] - X_data_test["condescending:confidence"]
c3 = X_data_test["dismissive"] - X_data_test["dismissive:confidence"]
c4 = X_data_test["generalisation"] - X_data_test["generalisation:confidence"]
c5 = X_data_test["generalisation_unfair"] - X_data_test["generalisation_unfair:confidence"]
c6 = X_data_test["hostile"] - X_data_test["hostile:confidence"]
c7 = X_data_test["sarcastic"] - X_data_test["sarcastic:confidence"]

X_test = pd.DataFrame({"antagonise": c1,
                       "condescending": c2,
                       "dismissive": c3,
                       "generalisation": c4,
                       "generalisation_unfair": c5,
                       "hostile": c6,
                       "sarcastic": c7},
                      )
X_test = X_test.abs()

y_data = test_dataset.iloc[:, 13].values
y_test = pd.DataFrame(y_data, columns=["Healthy"])
y_test = y_test.values.ravel()

In [12]:
X_test.head()

Unnamed: 0,antagonise,condescending,dismissive,generalisation,generalisation_unfair,hostile,sarcastic
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,0.8041,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,0.7529,0.7529,0.7529,0.7529,1.0,0.7529,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,0.6068
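
    Since the train and test cells repeat the same steps, the preprocessing could be wrapped in one function that selects columns by name instead of by position. This is only a sketch: `make_features`, `make_target` and `ATTRIBUTES` are hypothetical names introduced here, and the two-row frame stands in for the real CSV files.

```python
import pandas as pd

# The seven sub-attribute columns used as features in this notebook.
ATTRIBUTES = ["antagonise", "condescending", "dismissive", "generalisation",
              "generalisation_unfair", "hostile", "sarcastic"]


def make_features(df):
    """Return |label - confidence| for every sub-attribute, selected by name."""
    features = {name: (df[name] - df[name + ":confidence"]).abs()
                for name in ATTRIBUTES}
    # Reset the index so resampled frames with repeated labels stay aligned.
    return pd.DataFrame(features).reset_index(drop=True)


def make_target(df):
    """Return the 'healthy' column as a flat NumPy array."""
    return df["healthy"].to_numpy()


# Two made-up rows standing in for train.csv / test.csv.
toy = pd.DataFrame({"healthy": [0, 1]})
for name in ATTRIBUTES:
    toy[name] = [1, 0]
    toy[name + ":confidence"] = [0.6, 1.0]

X_toy = make_features(toy)
y_toy = make_target(toy)
print(X_toy)
print(y_toy)
```

    With helpers like these, the notebook would call `make_features(train_dataset)` and `make_features(test_dataset)`, which removes the risk of the two copies drifting apart.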


       Now that the data has been preprocessed, we can move on to the ensemble methods we can use to build accurate prediction models. We will use three common techniques: bagging, boosting and stacking, implemented in that order.

       Random Forest Classifier as a method of bagging

In [13]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(
    n_estimators=400,
    max_leaf_nodes=15,
    random_state=42)
rnd_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=15,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=400,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

    We evaluate the model on the test set with a confusion matrix, the accuracy score and the ROC AUC score.

In [14]:
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

y_pred_rf = rnd_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)
print("accuracy score rf: ",accuracy_score(y_test, y_pred_rf), "\n"
      "  roc auc score rf: ", roc_auc_score(y_test, y_pred_rf))

[[ 266   54]
 [ 653 3452]]
accuracy score rf:  0.840225988700565 
  roc auc score rf:  0.836087850182704
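
    Because `roc_auc_score` is given hard 0/1 predictions here rather than probabilities, the ROC curve has a single operating point and the score equals the mean of the two per-class recalls. The arithmetic below reproduces both printed figures directly from the confusion matrix above.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[266, 54],
               [653, 3452]])

# Recall of each class: correct predictions divided by the row total.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Accuracy: all correct predictions divided by all predictions.
accuracy = cm.diagonal().sum() / cm.sum()

# With hard labels, ROC AUC reduces to the mean of the per-class recalls.
auc_from_labels = recall_per_class.mean()

print("recall per class:", recall_per_class)
print("accuracy:", accuracy)
print("roc auc from hard labels:", auc_from_labels)
```

    This also explains why accuracy and ROC AUC can rank models differently on this test set: accuracy is dominated by the much larger healthy class, while this form of ROC AUC weights both classes equally.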


    Cross-validation on the training set gives us the mean ROC AUC score across 10 folds and the standard deviation of those scores.

In [15]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = rnd_clf,
                             X = X_train, y = y_train,
                             scoring= "roc_auc",
                             cv = 10)
print("ROC Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

ROC Accuracy: 91.74 %
Standard Deviation: 0.40 %
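
    One caveat about these figures: X_train was upsampled before cross-validation, so copies of the same unhealthy comment can end up in both the training and validation folds, which tends to make the scores optimistic. A possible alternative, sketched here on synthetic data (`make_classification` stands in for the UCC features, and the fold count and forest size are arbitrary choices), is to upsample inside each training fold only and to score with predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the real features (assumption).
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Upsample the minority class using rows from the training fold only.
    minority = np.flatnonzero(y_tr == 1)
    majority = np.flatnonzero(y_tr == 0)
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    idx = np.concatenate([majority, minority_up])

    clf = RandomForestClassifier(n_estimators=50, max_leaf_nodes=15,
                                 random_state=42)
    clf.fit(X_tr[idx], y_tr[idx])

    # Score the untouched validation fold with class-1 probabilities.
    proba = clf.predict_proba(X[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[val_idx], proba))

fold_scores = np.array(fold_scores)
print("mean ROC AUC: {:.3f}".format(fold_scores.mean()))
print("std: {:.3f}".format(fold_scores.std()))
```

    The validation folds here keep their original class balance, so the resulting scores are closer in spirit to what the untouched test set measures.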


    Now let's use a boosting technique called AdaBoost so that we can compare its results with those of the Random Forest Classifier.

In [16]:
from sklearn.ensemble import AdaBoostClassifier
boost_clf = AdaBoostClassifier(n_estimators=100, learning_rate=1)
boost_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=100, random_state=None)

    As before, we evaluate the model on the test set with a confusion matrix, the accuracy score and the ROC AUC score.

In [17]:
y_pred_ab = boost_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_ab)
print(cm)
print("accuracy score ab: ",accuracy_score(y_test, y_pred_ab), "\n"
      "  roc auc score ab: ", roc_auc_score(y_test, y_pred_ab))
accuracy_score(y_test, y_pred_ab)

[[ 256   64]
 [ 555 3550]]
accuracy score ab:  0.8601129943502824 
  roc auc score ab:  0.8323995127892815


0.8601129943502824

    Cross-validation on the training set again gives us the mean ROC AUC score and the standard deviation of those scores.

In [18]:
accuracies = cross_val_score(estimator=boost_clf, X=X_train, y=y_train, scoring="roc_auc", cv=10)
print("ROC Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

ROC Accuracy: 91.71 %
Standard Deviation: 0.38 %


    Important to note: we could have given AdaBoost a base estimator other than the default (a decision tree), such as Logistic Regression, provided the estimator supports sample weights. However, the default was the one that gave the best result.
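
    For reference, swapping the base estimator could look like the sketch below. It runs on synthetic data rather than the UCC features, the hyperparameters are arbitrary, and the base estimator is passed positionally because its keyword name differs between scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the seven UCC features (assumption).
X, y = make_classification(n_samples=400, n_features=7, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# The first positional argument is the base estimator; it must accept
# sample weights in fit(), which LogisticRegression does.
ada_lr = AdaBoostClassifier(LogisticRegression(max_iter=1000),
                            n_estimators=25,
                            learning_rate=1.0,
                            random_state=0)
ada_lr.fit(X_tr, y_tr)

acc = ada_lr.score(X_te, y_te)
print("AdaBoost with logistic regression, accuracy: {:.3f}".format(acc))
```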

    Another boosting technique we can use is XGBoost, which is implemented below.

In [19]:
import sys
!{sys.executable} -m pip install xgboost

from xgboost import XGBClassifier
## XGBoost Classifier
xg_clf = XGBClassifier()
xg_clf.fit(X_train, y_train)
## Confusion Matrix
y_pred_xg = xg_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_xg)
print(cm)
print("accuracy score xg: ",accuracy_score(y_test, y_pred_xg), "\n"
      "  roc auc score xg: ", roc_auc_score(y_test, y_pred_xg))
## Cross Validation
accuracies_xg = cross_val_score(estimator = xg_clf, X=X_train, y=y_train, scoring= "roc_auc", cv=10)
print("ROC Accuracy xg: {:.2f} %".format(accuracies_xg.mean()*100))
print("Standard Deviation xg: {:.2f} %".format(accuracies_xg.std()*100))

[[ 231   89]
 [ 366 3739]]
accuracy score xg:  0.8971751412429378 
  roc auc score xg:  0.8163577192448235
ROC Accuracy xg: 95.20 %
Standard Deviation xg: 0.35 %


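    By cross-validated ROC AUC on the training set, XGBoost is the best model so far at 95.20%, ahead of Random Forest (91.74%) and AdaBoost (91.71%). On the test set its ROC AUC (0.816) is slightly lower than that of the other two, although its accuracy (0.897) is the highest. Another ensemble method we can use is soft voting, which is implemented below.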

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

log_clf = LogisticRegression()
knn_clf = KNeighborsClassifier(n_neighbors=4)
svc_clf = SVC(probability=True)
voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("knn", knn_clf), ("svc", svc_clf)],
    voting="soft"
)
for clf in (log_clf, knn_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, roc_auc_score(y_test, y_pred))



LogisticRegression 0.8349364342265531
KNeighborsClassifier 0.8427527405602924
SVC 0.8305648599269184
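
    Soft voting averages the class probabilities predicted by each model and picks the class with the highest average, whereas hard voting counts each model's predicted label. The worked example below uses made-up probabilities for three comments (not outputs of the models above) to show a case where the two rules disagree.

```python
import numpy as np

# Made-up probabilities for 3 comments; columns are [P(unhealthy), P(healthy)].
p_lr = np.array([[0.60, 0.40], [0.30, 0.70], [0.45, 0.55]])
p_knn = np.array([[0.25, 0.75], [0.40, 0.60], [0.90, 0.10]])
p_svc = np.array([[0.45, 0.55], [0.20, 0.80], [0.45, 0.55]])

# Soft voting: average the probabilities, then take the most likely class.
avg = np.mean([p_lr, p_knn, p_svc], axis=0)
soft_vote = avg.argmax(axis=1)

# Hard voting: each model votes with its own predicted label; majority wins.
hard_votes = np.stack([p.argmax(axis=1) for p in (p_lr, p_knn, p_svc)])
hard_vote = (hard_votes.sum(axis=0) >= 2).astype(int)

print("averaged probabilities:\n", avg)
print("soft vote:", soft_vote)
print("hard vote:", hard_vote)
```

    For the third comment, two models lean slightly towards healthy while one is very confident it is unhealthy, so soft voting follows the confident model and hard voting follows the majority.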




    The final ensemble method we will use is the Stacking Classifier, which combines the previous models into one meta-model (excluding AdaBoost, as the XGBoost model performed better).
    

In [None]:
from mlxtend.classifier import StackingCVClassifier
# Stack the voting, random forest and XGBoost models under an SVC meta-classifier
stack_clf = StackingCVClassifier(classifiers=[voting_clf, rnd_clf, xg_clf],
                                 shuffle=False,
                                 use_probas=True,
                                 cv=5,
                                 meta_classifier=SVC(probability=True))
classifiers = {"voting": voting_clf,
               "RF": rnd_clf,
               "XG": xg_clf,
               "Stack": stack_clf}
for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # roc_auc_score expects the true labels first, then the predictions
    print(key, roc_auc_score(y_test, y_pred))
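
    If mlxtend is not available, scikit-learn ships its own `StackingClassifier`, which follows the same idea: base models are fitted with cross-validation and a final estimator learns from their out-of-fold predictions. The sketch below is an alternative, not the method used above; it runs on synthetic data, and the choice of base models and final estimator is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the seven UCC features (assumption).
X, y = make_classification(n_samples=400, n_features=7, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Base models feed their cross-validated probabilities to the final estimator.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50,
                                              max_leaf_nodes=15,
                                              random_state=42)),
                ("knn", KNeighborsClassifier(n_neighbors=4))],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5)
stack.fit(X_tr, y_tr)

# Score with class-1 probabilities instead of hard labels.
stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print("stacked ROC AUC: {:.3f}".format(stack_auc))
```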