# Comment classification

The goal of this notebook is to create a text classification model on comments, to distinguish comments made by bots from those made by humans. 

Let's first define a couple of imports. 

In [1]:
import pandas
import numpy as np
import sklearn

Since many parts of this notebook will rely on randomness, let's set an initial random state for replicability.

In [2]:
SEED = 12345

## Dataset

We created a training set in the other notebook. Here, we will use that training set for model selection and hyper-parameter tuning.

In [3]:
df = (
    pandas.read_csv('../data/df_training.csv.gz')
    .drop('training',axis=1)
)

In [4]:
df.sample(n=10, random_state=SEED)

Unnamed: 0,commenter,comment,label
5963,b'nqWR8UfegvOkyF4wVPhRMg==',@odeer95 Just do:\r\n```\r\nnpm test\r\n```,0
9061,b'wRBuA0cPXoSMOOqC/Oqw6g==',Please change name `api_server_client.go` to `...,0
3813,b'MoT7ah0NH60r4CutyhPChQ==',There are several cases where Docker doesn't a...,1
7330,b'L6Upr6wX75DdJ3yFdg1KDw==',Thanks again @floatW0lf,0
8363,b'ctsrvcuE18kTp3Uc/1/nbQ==',Closes #113,0
5813,b'ctOc+cfw0rBOqYioNtCXZA==',"Arf, do you think I can re-create the release ...",0
9370,b'GNH7oIEmmch73L7425edSw==',I think its something we can evaluate. Its an ...,0
1599,b'TkhhiZ2i88MGb6bQX4Sxqw==',"Congratulations, your PR was merged at 67c8ef8...",1
5553,b'45/YEaKXwT7uiL5i2z1YdA==',@mcjazzyfunky I think the hooks and `.Children...,0
6181,b'iE4wRJxELUrNws3C1Zv5cQ==',"In some cases, the developer wants to add a ne...",0


## Pre-processing

Since our goal is to classify comments based on their content, we need an appropriate representation for these comments. For text classification, it is common to vectorize the words of a text and to store their frequencies. Since some words are far more frequent than others across the corpus, we will also apply TF-IDF weighting. However, TF-IDF needs the document frequency of each term, computed over the corpus, and as such it has to be part of the learning pipeline so that these statistics are learned from the training folds only.
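
To make the representation concrete, here is a minimal sketch of the counting and TF-IDF steps on a made-up corpus (the three sentences are hypothetical and not taken from the dataset). It only illustrates that the IDF weights are learned when the pipeline is fitted, and that rarer terms receive a larger weight.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, not taken from the dataset.
corpus = [
    "build passed thanks",
    "build failed please fix",
    "thanks for the fix",
]

vectorization = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])

# Fitting learns the vocabulary and the IDF weights from the corpus;
# transforming returns one L2-normalised row per document.
X = vectorization.fit_transform(corpus)

vocabulary = vectorization.named_steps["vectorizer"].vocabulary_  # term -> column
idf = vectorization.named_steps["tfidf"].idf_                     # one weight per column

# "passed" occurs in one document, "build" in two, so "passed" weighs more.
idf_passed = idf[vocabulary["passed"]]
idf_build = idf[vocabulary["build"]]
row_norms = np.linalg.norm(X.toarray(), axis=1)
```

Because the IDF weights depend on the documents seen during fitting, keeping both steps inside the pipeline ensures they are recomputed on each training fold.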

We won't do any lemmatization, since (1) we can have multiple languages in our dataset and (2) it has been shown to be of little interest in text classification (Toman, M., Tesar, R., & Jezek, K. (2006). Influence of word normalization on text classification. Proceedings of InSciT, 4, 354-358).

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

The `CountVectorizer` of `sklearn` has many different parameters that can be used to limit the number of extracted features. This includes bounds on the document frequency of a term, as well as an absolute number of features. Most of the time, these parameters are used to prevent the model from overfitting the training set. We'll use these parameters mainly to remove words that are too frequent, and words that are too infrequent. We'll do a hyper-parameter grid search to find *good* values for these parameters.
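
As an illustration of what these bounds do, the sketch below fits `CountVectorizer` on four made-up sentences (hypothetical data) and compares the resulting vocabularies. The helper `vocabulary` is introduced here only for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the dataset.
docs = [
    "the bot closed the issue",
    "the build passed",
    "the tests failed again",
    "the bot merged the build",
]

def vocabulary(**kwargs):
    """Return the sorted vocabulary learned with the given vectorizer options."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

full = vocabulary()                   # every term is kept
no_rare = vocabulary(min_df=2)        # keep terms present in at least 2 documents
no_common = vocabulary(max_df=0.5)    # drop terms present in more than 50% of documents
top_two = vocabulary(max_features=2)  # keep the 2 most frequent terms overall
```

Here `min_df=2` keeps only `bot`, `build` and `the`, whereas `max_df=0.5` removes `the`, which appears in every document.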

In [6]:
preprocessing = [
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer())
]

preprocessing_params = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__min_df': [1, 2, 5], 
    'vectorizer__max_df': [1.0, 0.5, 0.1],
    'vectorizer__max_features': [None, 500, 2500],
    
    'tfidf__use_idf': [True, False],
}

There are many text classifiers, and we'll test a few of them, including kNN, SVM, Naive Bayes and random forest.

We also add ZeroR as a baseline.
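
For readers unfamiliar with ZeroR, the following sketch shows its behaviour on made-up labels (hypothetical data): it ignores the features and always predicts the most frequent class seen during training, which gives a floor that any useful classifier should beat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: three human comments (0) and one bot comment (1).
X = np.zeros((4, 1))  # the features are ignored by the dummy classifier
y = np.array([0, 0, 0, 1])

zeror = DummyClassifier(strategy="most_frequent").fit(X, y)

predictions = zeror.predict(np.zeros((3, 1)))  # always the majority class
accuracy = zeror.score(X, y)                   # share of the majority class
```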

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn import svm
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    'linear_svc': svm.SVC(random_state=SEED),
    'multinomial_nb': MultinomialNB(), 
    'knn': KNeighborsClassifier(),
    'rndfrst':RandomForestClassifier(random_state=SEED),
    'zeror': DummyClassifier(strategy='most_frequent'),
}

Of course, these classifiers also have parameters that could be tuned for the analysis. We won't test them now, since our goal at this stage is mainly to find the parameters we'll use to preprocess the data.

In [8]:
classifier_params = {
    'linear_svc__kernel': ['linear','poly','rbf', 'sigmoid'], 
    
    'multinomial_nb__alpha': [0.5, 1.0, 1.5],
    'multinomial_nb__fit_prior': [True, False],
    
    'rndfrst__n_estimators': [10, 100, 200],
    'rndfrst__criterion':['gini', 'entropy'],
    'rndfrst__max_depth':[4,6,8],
    
    'knn__leaf_size': [5, 15, 30], 
    'knn__n_neighbors': [1, 3, 5], 
}

Since there are many possible combinations of classifiers and parameters, and since testing all of them would be very time-consuming, we first evaluate each classifier using its "default values". Then we'll do hyper-parameter optimization for the preprocessing with the selected classifier.

Let's first define some scoring functions.

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer

scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted'),
    'precision_bot': make_scorer(precision_score, pos_label=1),
    'recall_bot': make_scorer(recall_score, pos_label=1),
    'f1_bot': make_scorer(f1_score, pos_label=1),
    'precision_human': make_scorer(precision_score, pos_label=0),
    'recall_human': make_scorer(recall_score, pos_label=0),
    'f1_human': make_scorer(f1_score, pos_label=0),
}
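
To see what the per-class scorers measure, here is a small worked example on made-up predictions (hypothetical data, with 1 = bot and 0 = human as in the dataset). `pos_label` selects which class is treated as the positive one, while `average='weighted'` averages the per-class scores weighted by class support.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predictions: 1 = bot, 0 = human.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

# Bots: 3 predicted, 2 of them correct; 3 actual bots, 2 of them found.
precision_bot = precision_score(y_true, y_pred, pos_label=1)  # 2/3
recall_bot = recall_score(y_true, y_pred, pos_label=1)        # 2/3

# Humans: 5 predicted, 4 of them correct; 5 actual humans, 4 of them found.
precision_human = precision_score(y_true, y_pred, pos_label=0)  # 4/5
recall_human = recall_score(y_true, y_pred, pos_label=0)        # 4/5

# Support-weighted average: (3 * 2/3 + 5 * 4/5) / 8 = 0.75
precision_weighted = precision_score(y_true, y_pred, average="weighted")
```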

As already stated, the comments are not all independent: each commenter has exactly 20 comments. Since we are going to use K-fold cross-validation in several places, we should avoid spreading the comments of a single commenter across multiple folds. Otherwise, the classifier would learn from some comments of a commenter and be tested on the remaining ones for that commenter; it would likely overfit the training data, and the cross-validation scores would not reflect its ability to generalize to new commenters.
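
The leakage can be sketched on made-up data (6 hypothetical commenters with 4 comments each, interleaved). A plain `KFold` puts every commenter on both sides of each split, while `GroupKFold` keeps them apart. Note that `GroupKFold` handles the grouping but does not stratify on the label. The helper `shared_groups` is introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical data: 6 commenters, 4 comments each, interleaved in the dataset.
groups = np.tile(np.arange(6), 4)
X = np.zeros((len(groups), 1))

def shared_groups(cv, **split_kwargs):
    """Return, for each fold, the commenters present in both train and test."""
    return [
        set(groups[train]) & set(groups[test])
        for train, test in cv.split(X, **split_kwargs)
    ]

# Plain K-fold ignores commenters: each test fold shares commenters with train.
leaky = shared_groups(KFold(n_splits=3))

# Group-aware K-fold keeps all comments of a commenter in the same fold.
grouped = shared_groups(GroupKFold(n_splits=3), groups=groups)
```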

Unfortunately, at the time of writing, `sklearn` does not provide a splitter that is both stratified and group-aware (there is a PR in progress: https://github.com/scikit-learn/scikit-learn/pull/18649).
Let's borrow the code:

In [10]:
from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


Let's use it on our training set. 

In [11]:
from sklearn.model_selection import cross_validate

X_train, y_train = df.comment, df.label
groups = df.commenter

results = dict()

for name, estimator in classifiers.items():
    pipeline = Pipeline(preprocessing + [(name, estimator)])
    cv = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=StratifiedGroupKFold(5), groups=groups, n_jobs=-1)
    results[name] = cv

In [12]:
(
    pandas.DataFrame.from_dict(results)
    .applymap(np.mean)
    .T
    .style.highlight_max(axis=0)
)

classifier,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_precision_bot,test_recall_bot,test_f1_bot,test_precision_human,test_recall_human,test_f1_human
linear_svc,13.351954,2.227446,0.795406,0.846081,0.795406,0.787352,0.977636,0.604659,0.746271,0.714485,0.986217,0.828443
multinomial_nb,0.568958,0.145986,0.856995,0.859141,0.856995,0.856758,0.885377,0.819887,0.851216,0.832884,0.894132,0.862303
knn,0.757356,1.988466,0.679951,0.802174,0.679951,0.643029,0.993613,0.362519,0.528726,0.610662,0.997494,0.757357
rndfrst,7.108972,0.355559,0.742374,0.827437,0.742374,0.724053,0.993668,0.487969,0.65327,0.661147,0.996867,0.794854
zeror,1.217383,0.287509,0.498956,0.248957,0.498956,0.332174,0.199529,0.4,0.266248,0.299427,0.6,0.39949


So, it seems that **multinomial_nb** performs the best with this "default configuration".

Let's run a grid search to find out which preprocessing parameters are adequate. The grid search is evaluated by cross-validation on the training set only.
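
As a reminder of the mechanics, here is a minimal sketch of a grid search over pipeline parameters on made-up comments (all data are hypothetical). Parameters are addressed as `<step name>__<parameter>`, several metrics can be scored at once, and `refit` names the metric used to pick the best configuration. The sketch uses `GroupKFold` and two string-named scorers only to stay short and self-contained.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical toy data: 4 commenters with 2 comments each (1 = bot, 0 = human).
comments = [
    "build passed", "build failed",
    "thanks a lot", "looks good to me",
    "coverage increased", "coverage decreased",
    "please rebase", "nice catch thanks",
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
commenters = ["bot_a", "bot_a", "ann", "ann", "bot_b", "bot_b", "bob", "bob"]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("multinomial_nb", MultinomialNB()),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    scoring={"accuracy": "accuracy", "f1": "f1_weighted"},
    refit="f1",  # metric used to select best_params_ and refit the pipeline
    cv=GroupKFold(n_splits=2),
)
grid.fit(comments, labels, groups=commenters)
```

After fitting, `grid.cv_results_` holds one `mean_test_<metric>` entry per scorer, and `grid.best_params_` refers to the metric given in `refit`.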

In [13]:
name = 'multinomial_nb'

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(preprocessing + [(name, classifiers[name])])
grid = GridSearchCV(pipeline, preprocessing_params, scoring=scoring, refit='f1', cv=StratifiedGroupKFold(5), n_jobs=-1)

grid.fit(X_train, y_train, groups=groups);

In [14]:
grid.best_score_, grid.best_params_

(0.8682308503803154,
 {'tfidf__use_idf': True,
  'vectorizer__max_df': 1.0,
  'vectorizer__max_features': None,
  'vectorizer__min_df': 1,
  'vectorizer__ngram_range': (1, 2)})

The best score is obtained with the above parameters. However, as we'll see next, the differences between the top configurations are small.

In [15]:
with pandas.option_context('display.max_colwidth', None):
    display(
        pandas.DataFrame.from_dict(grid.cv_results_)
        .sort_values('mean_test_f1', ascending=False)
        [['params'] + ['mean_test_' + score for score in scoring.keys()]]
        .rename(columns=lambda s: s[10:] if s.startswith('mean_test_') else s)
        .head(10)
        .style.highlight_max()
    )

Unnamed: 0,params,accuracy,precision,recall,f1,precision_bot,recall_bot,f1_bot,precision_human,recall_human,f1_human
19,"{'tfidf__use_idf': True, 'vectorizer__max_df': 0.5, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.868281,0.868766,0.868281,0.868231,0.870284,0.865651,0.867666,0.867214,0.870956,0.868802
1,"{'tfidf__use_idf': True, 'vectorizer__max_df': 1.0, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.868281,0.868766,0.868281,0.868231,0.870284,0.865651,0.867666,0.867214,0.870956,0.868802
73,"{'tfidf__use_idf': False, 'vectorizer__max_df': 0.5, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.867759,0.868272,0.867759,0.867703,0.868774,0.866486,0.867308,0.867737,0.869075,0.868104
55,"{'tfidf__use_idf': False, 'vectorizer__max_df': 1.0, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.867654,0.86816,0.867654,0.867598,0.869048,0.865857,0.867133,0.86724,0.869494,0.868069
91,"{'tfidf__use_idf': False, 'vectorizer__max_df': 0.1, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.866826,0.868311,0.866826,0.866678,0.887131,0.840846,0.863027,0.849435,0.892881,0.870339
54,"{'tfidf__use_idf': False, 'vectorizer__max_df': 1.0, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 1)}",0.864945,0.865717,0.864945,0.864848,0.876481,0.849378,0.86241,0.854917,0.88056,0.867293
72,"{'tfidf__use_idf': False, 'vectorizer__max_df': 0.5, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 1)}",0.864529,0.865298,0.864529,0.864432,0.876034,0.848963,0.86198,0.854527,0.880142,0.86689
37,"{'tfidf__use_idf': True, 'vectorizer__max_df': 0.1, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}",0.858046,0.859539,0.858046,0.857889,0.880107,0.829118,0.853643,0.838925,0.887035,0.862143
74,"{'tfidf__use_idf': False, 'vectorizer__max_df': 0.5, 'vectorizer__max_features': None, 'vectorizer__min_df': 2, 'vectorizer__ngram_range': (1, 1)}",0.857945,0.860903,0.857945,0.857634,0.891989,0.814511,0.851295,0.829773,0.90144,0.863982
0,"{'tfidf__use_idf': True, 'vectorizer__max_df': 1.0, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 1)}",0.856995,0.859141,0.856995,0.856758,0.885377,0.819887,0.851216,0.832884,0.894132,0.862303


In [16]:
preprocessing = [
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0, max_features=None)),
    ('tfidf', TfidfTransformer(use_idf=True)),
]

## Classifier selection & parameter tuning

Now, we can do a grid search on the classifiers and their parameters. 

In [17]:
from sklearn.model_selection import GridSearchCV

results = dict()

for name, estimator in classifiers.items():
    pipeline = Pipeline(preprocessing + [(name, estimator)])
    params = { k:v for k,v in classifier_params.items() if k.startswith(name)}
    
    grid = GridSearchCV(pipeline, params, scoring=scoring, refit='f1', cv=StratifiedGroupKFold(5), n_jobs=-1)
    grid.fit(X_train, y_train, groups=groups)
    
    results[name] = {
        'score': grid.best_score_, 
        'params': grid.best_params_, 
        'cv': grid.cv_results_,
        'index': grid.best_index_,
    }

In [18]:
with pandas.option_context('display.max_colwidth', None):
    display(
        pandas.DataFrame.from_dict(results)
        .drop(['cv', 'index'])
    )

Unnamed: 0,linear_svc,multinomial_nb,knn,rndfrst,zeror
score,0.845378,0.871881,0.647742,0.727304,0.332174
params,{'linear_svc__kernel': 'sigmoid'},"{'multinomial_nb__alpha': 1.5, 'multinomial_nb__fit_prior': False}","{'knn__leaf_size': 5, 'knn__n_neighbors': 1}","{'rndfrst__criterion': 'gini', 'rndfrst__max_depth': 8, 'rndfrst__n_estimators': 10}",{}



Let's have a closer look at the scores:

In [19]:
temp = dict()

for name in classifiers.keys():
    cv = results[name]['cv']
    index = results[name]['index']
    temp[name] = {score: cv['mean_test_' + score][index] for score in scoring.keys()}

pandas.DataFrame.from_dict(temp).T.style.highlight_max()

classifier,accuracy,precision,recall,f1,precision_bot,recall_bot,f1_bot,precision_human,recall_human,f1_human
linear_svc,0.848223,0.874151,0.848223,0.845378,0.971157,0.717784,0.824705,0.777118,0.9787,0.866057
multinomial_nb,0.871934,0.872586,0.871934,0.871881,0.864021,0.88253,0.872908,0.881142,0.86135,0.870855
knn,0.683091,0.802769,0.683091,0.647742,0.992623,0.369023,0.536488,0.612843,0.997284,0.759036
rndfrst,0.738615,0.78495,0.738615,0.727304,0.897645,0.541896,0.672958,0.67215,0.935485,0.781695
zeror,0.498956,0.248957,0.498956,0.332174,0.199529,0.4,0.266248,0.299427,0.6,0.39949


In [20]:
(
    pandas.DataFrame.from_dict(temp).T
    [['precision_bot','recall_bot','precision_human','recall_human','precision','recall','f1']]
    .sort_values('f1',ascending=False)
    .round(3)
)

classifier,precision_bot,recall_bot,precision_human,recall_human,precision,recall,f1
multinomial_nb,0.864,0.883,0.881,0.861,0.873,0.872,0.872
linear_svc,0.971,0.718,0.777,0.979,0.874,0.848,0.845
rndfrst,0.898,0.542,0.672,0.935,0.785,0.739,0.727
knn,0.993,0.369,0.613,0.997,0.803,0.683,0.648
zeror,0.2,0.4,0.299,0.6,0.249,0.499,0.332


We'll therefore use **multinomial_nb** with its best parameters.
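
Putting the selected values together, the resulting model could be assembled as sketched below. The preprocessing and classifier parameters are those reported by the grid searches above; the four training comments are hypothetical stand-ins for `df.comment` and `df.label`, used only so that the sketch runs on its own.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

final_model = Pipeline([
    # Preprocessing values selected by the first grid search.
    ("vectorizer", CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=1.0, max_features=None)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    # Classifier values selected by the second grid search.
    ("multinomial_nb", MultinomialNB(alpha=1.5, fit_prior=False)),
])

# Hypothetical stand-in for the real training set (1 = bot, 0 = human).
toy_comments = [
    "build passed",
    "build failed on master",
    "thanks for the review",
    "we should rename this file",
]
toy_labels = [1, 1, 0, 0]

final_model.fit(toy_comments, toy_labels)
prediction = final_model.predict(["build passed on master"])
```

On the real data, the same pipeline would be fitted on `X_train` and `y_train`.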