# Classifying Toxic Comments

Using naïve tokenization, TF-IDF vectorization, and a few linear models from scikit-learn.

This project uses labeled data originally sourced from [this comment classification challenge on Kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

## Data import

You'll need the training data downloaded locally. Either use curl/wget in a terminal, or run the cell below:

In [None]:
!curl http://web.stanford.edu/~sjespers/mse231/toxic.train.csv.gz -o toxic.train.csv.gz

This can take a while.

In [None]:
train_df = pd.read_csv("toxic.train.csv.gz", index_col="id")

Might as well get a peek at it:

In [None]:
train_df.sample(10)

There's some nasty stuff in there! (Your results may vary, since we're using `sample`.) The `toxic` column will be our response variable:

In [None]:
y = train_df.toxic.values

The `comment_text` column is the natural input. Here we transform each comment (or "document") into a TF-IDF vector using scikit-learn's handy `TfidfVectorizer` class:

In [None]:
vectorizer = TfidfVectorizer(strip_accents='ascii')
X = vectorizer.fit_transform(train_df.comment_text)

## Logistic regression

Now that we've set up our training set, we can try some models.

**Best practices note:** In the real world, you would hold out some data for validation. This being a quick and dirty workbook, we ignore this crucial step.
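
For reference, a held-out validation split might look roughly like the sketch below. It uses `train_test_split` from scikit-learn on a small made-up dataset; the `comments` and `labels` lists are hypothetical stand-ins for `train_df.comment_text` and `train_df.toxic`, and the split ratio is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in this notebook these would be
# train_df.comment_text and train_df.toxic.
comments = [
    "you are an idiot", "thanks for the edit", "what a stupid idea",
    "nice work on this page", "shut up you moron", "please add a source",
    "I agree with this change", "this article needs cleanup",
    "you stupid fool", "good catch, fixed",
]
labels = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# stratify keeps the toxic / non-toxic ratio similar in both parts
train_docs, val_docs, y_train, y_val = train_test_split(
    comments, labels, test_size=0.3, stratify=labels, random_state=0
)

# Fit the vectorizer on the training part only, then reuse it for validation
vec = TfidfVectorizer(strip_accents="ascii")
X_train = vec.fit_transform(train_docs)
X_val = vec.transform(val_docs)

clf = LogisticRegression(solver="lbfgs")
clf.fit(X_train, y_train)
val_accuracy = clf.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
```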

Let's try a simple logistic regression model, using the TF-IDF vectors as input features and the toxicity label as the response variable.

In [None]:
logreg_clf = LogisticRegression(solver='lbfgs')

In [None]:
logreg_clf.fit(X, y)

How well did we do?

In [None]:
logreg_clf.score(X, y)

Not bad! But our positives and negatives are not evenly distributed:

In [None]:
sum(train_df.toxic == 1) / len(train_df)

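Only ~10% of the comments are labeled as toxic, so a model that always predicts "not toxic" would already score ~90% accuracy. Let's get an AUC measure instead. AUC is a ranking metric, so we pass it the predicted probability of the positive class rather than hard 0/1 labels:

In [None]:
# Column 1 of predict_proba is the probability of the positive (toxic) class
roc_auc_score(y, logreg_clf.predict_proba(X)[:, 1])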

Okay, at least that's a lot better than random.
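
One thing to watch with `roc_auc_score`: it is meant to receive continuous scores (probabilities or decision values), and feeding it hard 0/1 predictions collapses the ROC curve to a single operating point. A small sketch with made-up numbers (the labels and scores below are illustrative only) shows how much that can matter:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
# Every positive is ranked above every negative, but most scores are below 0.5
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.90]

# Thresholding at 0.5 mimics what predict() does for a binary logistic regression
hard = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)
print(auc_from_scores, auc_from_labels)
```

The ranking is perfect, so the score-based AUC is 1.0, while the label-based value drops to about 0.67 because two of the three positives fall below the threshold.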

## SVM trained with SGD

For a second model, let's try an SVM:

In [None]:
svm_clf = SGDClassifier()

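SVMs can be slow to train, but fitting a linear SVM with stochastic gradient descent (SGD) speeds things up a lot. `SGDClassifier` uses hinge loss by default, which is what makes it a linear SVM.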

In [None]:
svm_clf.fit(X, y)
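
Spelling out the defaults makes the "SVM" part more visible. The sketch below, on a hypothetical four-comment dataset, sets `loss="hinge"` and `penalty="l2"` explicitly (both are the scikit-learn defaults) and fixes `random_state` so that repeated runs give the same model. The toy data and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data, for illustration only.
docs = [
    "you are an idiot",
    "thanks for the help",
    "what an idiot",
    "great help, thanks",
]
labels = [1, 0, 1, 0]

X_toy = TfidfVectorizer().fit_transform(docs)

# hinge loss + L2 penalty = linear SVM objective; random_state fixes the shuffling
toy_svm = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
toy_svm.fit(X_toy, labels)

# With hinge loss there is no predict_proba; the signed distance to the
# separating hyperplane plays the role of a score.
margins = toy_svm.decision_function(X_toy)
print(margins)
```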

How did that one do?

In [None]:
svm_clf.score(X, y)

I guess not quite as well. What about in terms of AUC?

In [None]:
# A hinge-loss SGDClassifier has no predict_proba; use the signed margin as the score
roc_auc_score(y, svm_clf.decision_function(X))

Well, okay, we just took the model out of the box and applied it, so I guess that's not terrible, considering.

## Testing it out

Let's try out the models on the test data to see how they do! First, download the test data the way we did the training data:

In [None]:
!curl http://web.stanford.edu/~sjespers/mse231/toxic.test.csv.gz -o toxic.test.csv.gz

Now, read it in:

In [None]:
test_df = pd.read_csv("toxic.test.csv.gz", index_col="id")

In order to make predictions on the comments, remember that we have to apply the TF-IDF transformation:

In [None]:
X_test = vectorizer.transform(test_df.comment_text)

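Note that we are not *fitting* the TF-IDF vectorizer to these unseen examples, only transforming them with the vocabulary and IDF weights learned from the training set. Refitting on the test data would be a slight violation of the train-test split.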
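
A tiny sketch of what `transform` does with words the vectorizer never saw during fitting (the documents below are made up for illustration): unseen words are simply dropped, and the test matrix keeps the same columns as the training matrix, so the trained models can consume it directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
toy_train = ["good point", "bad take", "good article"]
toy_test = ["good grief", "zebra"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(toy_train)  # learns vocabulary and IDF weights
X_te = vec.transform(toy_test)       # reuses them; nothing is re-learned

print(sorted(vec.vocabulary_))       # only words from the training documents
print(X_tr.shape, X_te.shape)        # same number of columns
print(X_te.toarray())                # "grief" and "zebra" contribute nothing
```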

The `y_test` is as straightforward as it was in training:

In [None]:
y_test = test_df.toxic.values

🥁Drumroll please...

In [None]:
logreg_clf.score(X_test, y_test)

In [None]:
svm_clf.score(X_test, y_test)

We've just scratched the surface by training linear models on vector representations of documents here.

There are many directions in which this could go differently: n-gram representations, neural word/document embeddings (e.g. word2vec, doc2vec, GloVe), or RNNs and CNNs instead of linear models. Which of these is worth trying depends on the particular application.

Hopefully this gives just a tiny hint of what's possible.