# 2 - Establish Baseline
Since we saw that the classes are unbalanced, simply guessing the most likely outcome for every comment will be correct well over 50% of the time. We therefore need a baseline to compare against, so that we know whether any model we build produces an actual improvement.
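
To make the arithmetic concrete, here is a minimal sketch of the accuracy you get from always guessing the majority class of each label. The `toy_labels` frame is hypothetical and made up for illustration; it is not the competition data.

```python
import pandas as pd

# Hypothetical toy labels, not taken from the competition data.
toy_labels = pd.DataFrame({
    "toxic":  [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    "threat": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Fraction of positive examples per label.
positive_rate = toy_labels.mean()

# Always guessing the majority class is right max(p, 1 - p) of the time.
majority_accuracy = positive_rate.where(positive_rate > 0.5, 1 - positive_rate)
print(majority_accuracy)
```

The rarer the positive class, the higher this number gets, which is why raw accuracy alone says little about a model on unbalanced data.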

In [57]:
import numpy as np
import pandas as pd
import zipfile

filepath = '/Users/freddiekarlbom/.kaggle/competitions/jigsaw-toxic-comment-classification-challenge/train.csv.zip'

# Read train.csv straight from the zip archive without extracting it to disk
with zipfile.ZipFile(filepath) as archive:
    with archive.open('train.csv') as csv_file:
        df = pd.read_csv(csv_file)

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier

In [39]:
prediction_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

X = df['comment_text']
Y = df[prediction_columns]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1337)

In [41]:
import numpy as np

# No need to convert the text to features: the dummy classifier ignores X and predicts from label frequency alone
X_train_dummy = np.zeros((X_train.shape[0], 1))

In [42]:
clf = DummyClassifier(strategy='most_frequent', random_state=0)
multi_clf = MultiOutputClassifier(clf, n_jobs=-1)

multi_clf.fit(X_train_dummy, Y_train)

MultiOutputClassifier(estimator=DummyClassifier(constant=None, random_state=0, strategy='most_frequent'),
           n_jobs=-1)

In [43]:
multi_clf.score(X_train_dummy, Y_train)

0.8983100415700529

In [44]:
X_test_dummy = np.zeros((X_test.shape[0], 1))
multi_clf.score(X_test_dummy, Y_test)

0.8984208547437023
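
When reading the two scores above, it helps to know what `score` measures for a multi-output classifier: a row counts as correct only if every label in it matches. The sketch below contrasts that exact-match accuracy with per-label accuracy. It assumes a small, hypothetical two-label array (`Y_toy`) rather than the competition data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy labels with two columns; not the competition data.
Y_toy = np.array([
    [0, 0],
    [0, 0],
    [1, 0],
    [0, 1],
    [0, 0],
])
X_toy = np.zeros((Y_toy.shape[0], 1))  # the dummy classifier ignores features

toy_clf = MultiOutputClassifier(DummyClassifier(strategy="most_frequent"))
toy_clf.fit(X_toy, Y_toy)
Y_pred = toy_clf.predict(X_toy)

# Exact-match accuracy: every label in the row must be right.
exact_match = np.all(Y_pred == Y_toy, axis=1).mean()

# Per-label accuracy: each column is scored on its own.
per_label = (Y_pred == Y_toy).mean(axis=0)

print(exact_match, per_label, toy_clf.score(X_toy, Y_toy))
```

Per-label accuracy is never lower than exact-match accuracy, so a per-label baseline on the real data would look even more flattering than the figure above.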

In [54]:
# 0 is the most common value for every label, so it is always predicted
multi_clf.predict(X_test_dummy)

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

## Takeaways
- Just by predicting zeros for every label, you get roughly **90% accuracy** (0.898 on both the training and test splits), since most comments aren't inflammatory. Note that `score` here counts a comment as correct only when all six labels match.
- Recall is zero, though, since the classifier never predicts a positive label and therefore finds no true positives at all.
- Because the competition is evaluated [based on probabilities](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge#evaluation) rather than binary outcomes, two models with similar classification accuracy can still receive different competition scores.
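
The takeaways above can be checked on a small scale. This is a minimal sketch on a hypothetical single-label example (`y_true` is made up), and it assumes ROC AUC as a stand-in for a probability-based metric; check the linked evaluation page for the metric the competition actually uses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical single-label example: 2 positives out of 10 comments.
y_true = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

# An all-zeros baseline: hard predictions plus a constant score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

accuracy = accuracy_score(y_true, y_pred)  # high, because negatives dominate
recall = recall_score(y_true, y_pred)      # no positive is ever found
auc = roc_auc_score(y_true, y_score)       # a constant score cannot rank comments

print(accuracy, recall, auc)
```

A constant score gives a ROC AUC of 0.5, the same as random ranking, so a baseline that looks strong on accuracy is no better than chance under a ranking metric.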