Creating a baseline for the project using NBSVM (Naive Bayes - Support Vector Machine)

NBSVM was introduced by Sida Wang and Chris Manning in the paper [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).
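
As background, the core idea of the paper can be summarised as a log-count ratio that rescales each feature before a linear classifier is trained. The notation below ($f^{(i)}$ for the feature count vector of document $i$, $\alpha$ for the smoothing constant) follows my reading of the paper; the classifier later in this notebook normalises by the number of documents per class plus one instead of the $\ell_1$ norm, so this is a sketch of the idea rather than a description of that code.

$$
p = \alpha + \sum_{i:\, y^{(i)} = 1} f^{(i)}, \qquad
q = \alpha + \sum_{i:\, y^{(i)} = -1} f^{(i)}, \qquad
r = \log\left( \frac{p / \lVert p \rVert_1}{q / \lVert q \rVert_1} \right)
$$

The linear model is then fitted on the element-wise product $f^{(i)} \circ r$ rather than on $f^{(i)}$ itself.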



## Observations:

- The dataset is multilabel, **not** just multiclass: a single comment can carry several of the six labels at once
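
A quick way to see how the labels overlap is to count rows with more than one positive label. The sketch below uses a small made-up frame (`toy_labels` is a hypothetical stand-in for `train[label_cols]`), so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for train[label_cols]; the values are made up for illustration.
toy_labels = pd.DataFrame(
    {
        "toxic":   [1, 1, 0, 0],
        "obscene": [1, 0, 0, 0],
        "insult":  [1, 0, 0, 1],
    }
)

# Number of positive labels on each row
labels_per_row = toy_labels.sum(axis=1)

n_multi = int((labels_per_row > 1).sum())   # rows carrying several labels
n_clean = int((labels_per_row == 0).sum())  # rows carrying no label at all

print(f"rows with more than one label: {n_multi}")
print(f"rows with no label at all: {n_clean}")
```

Running the same two lines on `train[label_cols]` should show whether a one-model-per-label setup is needed.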

## Constants

In [1]:
TRAIN_DATA_PATH = "downloads/train.csv.zip"
TEST_DATA_PATH = "downloads/test.csv.zip"
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
comment_col = 'comment_text'

In [49]:
%matplotlib inline

import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.base import BaseEstimator, ClassifierMixin

from sklearn.utils.validation import check_X_y
from sklearn.utils.validation import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.multiclass import unique_labels

from sklearn.model_selection import cross_validate

In [3]:
train = pd.read_csv(TRAIN_DATA_PATH)
test = pd.read_csv(TEST_DATA_PATH)

In [50]:
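class NbSVMClassifier(BaseEstimator, ClassifierMixin):
    """One binary linear model per label, trained on NB log-count-ratio features."""

    def __init__(self, modelType = 'lg', **kwargs):
        self.modelType = modelType
        # Extra keyword arguments are forwarded to the underlying linear model
        self.modelArgs = kwargs

    def get_params(self, deep = True):
        # BaseEstimator ignores **kwargs, so clone() (used by cross_validate)
        # would silently drop them; expose them as regular parameters instead
        return {'modelType': self.modelType, **self.modelArgs}

    def set_params(self, **params):
        if 'modelType' in params:
            self.modelType = params.pop('modelType')
        self.modelArgs.update(params)
        return self

    def fit(self, X, y):
        if self.modelType not in ('lg', 'svm'):
            raise ValueError("modelType must be 'lg' or 'svm'")

        # Accept DataFrames as well as arrays, then check that X and y have correct shape
        y = np.asarray(y)
        X, y = check_X_y(X, y, accept_sparse=True, multi_output=True)

        # For a multilabel indicator matrix these are the label column indices
        self.classes_ = unique_labels(y)

        self._clf = []
        self._r = []

        def cr(X, y_i, value):
            # Smoothed per-feature total over the rows where the label equals `value`
            mask = y_i == value
            return (1 + X[mask].sum(axis = 0)) / (mask.sum() + 1)

        for i in self.classes_:
            print('Fitting Model for: ', label_cols[i])
            y_i = y[:, i]
            log_count_ratio = np.asarray(np.log(cr(X, y_i, 1) / cr(X, y_i, 0)))
            X_enhanced = X.multiply(log_count_ratio)

            if self.modelType == 'lg':
                model = LogisticRegression(**self.modelArgs)
            else:
                model = SGDClassifier(**self.modelArgs)

            self._clf.append(model.fit(X_enhanced, y_i))
            self._r.append(log_count_ratio)

        return self

    def predict(self, X):
        check_is_fitted(self, ['_r', '_clf'])
        predict = np.zeros((X.shape[0], len(self.classes_)))
        for i in range(predict.shape[1]):
            predict[:, i] = self._clf[i].predict(X.multiply(self._r[i]))
        return predict

    def predict_proba(self, X):
        if self.modelType == 'svm':
            print('No Probabilistic Interpretation for SVM')
            return None
        check_is_fitted(self, ['_r', '_clf'])
        predict_proba = np.zeros((X.shape[0], len(self.classes_)))
        for i in range(predict_proba.shape[1]):
            # Keep only the probability of the positive class (column 1)
            predict_proba[:, i] = self._clf[i].predict_proba(X.multiply(self._r[i]))[:, 1]
        return predict_proba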
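
To make the feature scaling inside `fit` easier to follow, here is the same computation in isolation on a tiny sparse matrix. The data is made up and `smoothed_rate` is a hypothetical helper name used only for this sketch; it mirrors the add-one smoothing used in the class, not the normalisation from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy term-count matrix: 4 documents x 3 features (illustrative values only).
X = csr_matrix(np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
], dtype=float))
y = np.array([1, 1, 0, 0])  # binary label for a single column


def smoothed_rate(X, y, value):
    """Add-one smoothed per-feature total over the rows where y == value."""
    mask = y == value
    return (X[mask].sum(axis=0) + 1) / (mask.sum() + 1)


# Log-count ratio: one weight per feature, shape (1, n_features)
r = np.asarray(np.log(smoothed_rate(X, y, 1) / smoothed_rate(X, y, 0)))

# Scale every document's features by the ratio; the result stays sparse
X_scaled = X.multiply(r).tocsr()

print(r.round(3))
print(X_scaled.toarray().round(3))
```

Features that are more frequent in positive rows get a positive weight, features more frequent in negative rows get a negative one, and a feature that is equally common in both (the third column here) is scaled to zero.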

In [51]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), 
                      min_df = 3, 
                      max_df = 0.9, 
                      strip_accents = 'unicode', use_idf=True,
                      smooth_idf=True, sublinear_tf=True)),
    ('clf', NbSVMClassifier(modelType = 'svm', loss = 'hinge', class_weight = 'balanced'))
])

In [52]:
# cv = 3 matches the three folds reported below (newer scikit-learn defaults to 5)
cv_results = cross_validate(pipeline, train[comment_col], train[label_cols], 
                           cv = 3, n_jobs = 2, return_train_score = False,
                           scoring = 'accuracy')

Fitting Model for:  toxic
Fitting Model for:  toxic
Fitting Model for:  severe_toxic
Fitting Model for:  severe_toxic
Fitting Model for:  obscene
Fitting Model for:  obscene
Fitting Model for:  threat
Fitting Model for:  threat
Fitting Model for:  insult
Fitting Model for:  insult
Fitting Model for:  identity_hate
Fitting Model for:  identity_hate
Fitting Model for:  toxic
Fitting Model for:  severe_toxic
Fitting Model for:  obscene
Fitting Model for:  threat
Fitting Model for:  insult
Fitting Model for:  identity_hate

CV result with Logistic Regression using NB features
```python
{'fit_time': array([79.08852482, 79.20061374, 42.78198028]),
 'score_time': array([6.6369772 , 6.62154627, 5.88802028]),
 'test_score': array([0.92122728, 0.92169581, 0.92276744])}
```

CV result with Linear SVM using NB features
```python
{'fit_time': array([30.04560089, 30.4099865 , 24.19594097]),
 'score_time': array([7.33976412, 7.11350369, 6.19125676]),
 'test_score': array([0.91893365, 0.92066178, 0.9214326 ])}
```
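
When reading these scores, keep in mind that for multilabel targets scikit-learn's `accuracy` scorer reports subset accuracy: a row counts as correct only if all of its labels are predicted correctly. The sketch below contrasts that with per-label accuracy on made-up arrays (`y_true` and `y_pred` are illustrative, not model output).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up multilabel targets and predictions: 4 rows x 3 labels.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0], [1, 1, 1]])

# Subset accuracy: fraction of rows where every label matches
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: fraction of matching entries in each column
per_label_acc = (y_true == y_pred).mean(axis=0)

print("subset accuracy:", subset_acc)
print("per-label accuracy:", per_label_acc)
```

Subset accuracy is the stricter of the two, so per-label scores for the same predictions will generally be at least as high.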