My first Kaggle kernel!

I'm going to take my first stab at the Toxic Comments challenge using a simple SVM. In the course of doing this I learned that I can't throw all the data into a single SVM (I forgot about scaling issues). I'm using a simple bagging classifier to split the data across an ensemble, so that I can mitigate the quadratic (IIRC) scaling of SVM training time with the number of samples.
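
A minimal sketch of the bagging idea, assuming scikit-learn's `BaggingClassifier` wrapped around `SVC`. The synthetic data, the ensemble size, and the sample fraction are illustrative assumptions, not the settings used for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized comments (assumption: not the real data).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Each of the 10 SVMs is trained on a bootstrap sample of 200 rows (10% of
# the data), so no single SVM has to cope with all 2000 rows.
bagged_svm = BaggingClassifier(
    SVC(kernel="linear", probability=True),
    n_estimators=10,
    max_samples=0.1,
    random_state=0,
)
bagged_svm.fit(X, y)

# Class probabilities are averaged over the ensemble members.
proba = bagged_svm.predict_proba(X)[:, 1]
print(proba[:5])
```

Because training cost grows faster than linearly with the number of rows, ten small fits should be cheaper than one large one, at the price of each member seeing less data.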

In [None]:
import numpy as np 
import pandas as pd
import os
from scipy.sparse import hstack

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import re

import time

In [None]:
print(os.listdir("../input"))

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sample_submission = pd.read_csv('../input/sample_submission.csv')

In [None]:
print(train.shape)
print(test.shape)

In [None]:
# for testing purposes
#train = train.iloc[:10000,:]
#test = test.iloc[:10000,:]

In [None]:
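# Assign the result back: in-place fillna on a selected column is not reliable in recent pandas.
train['comment_text'] = train['comment_text'].fillna("_na_")
test['comment_text'] = test['comment_text'].fillna("_na_")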

In [None]:
# Note: normalizing the URLs had minimal effect on my CV scores,
# as did normalizing internal Wikipedia references (though that gave a teeny-tiny improvement).
# Adding this step did raise my competition result by .0029, which will matter more if I get my rank way up.

mod_comments =[]
URLReg = re.compile(r'(http|https)://[^\s]*')
WikiReg = re.compile(r'(Wikipedia|Image|Help):[^\s]*') #finds all reference to internal wikipedia tags
IPReg = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
NumReg = re.compile(r'[0-9]+')
HTMLReg = re.compile(r'\|[\w\s\"\#-\:\;\!\?\%\=]+=[\w\s\"\#-\:\;\!\?\%\=\@\^\&]+') #I think this gets some stuff it shouldn't.... but it also gets a lot of the junk html looking code.
for comment in train['comment_text']:
    comment = re.sub(URLReg, 'httpaddr', comment)
    comment = re.sub(WikiReg, 'wikitag', comment)
    comment = re.sub(IPReg, 'IPaddress', comment)
    comment = re.sub(NumReg, 'number', comment)
    comment = re.sub(HTMLReg, 'htmlcode', comment)
    mod_comments.append(comment)
new_comments_df = pd.DataFrame({'comment_text': mod_comments})
    
train.update(new_comments_df)
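
The cell above normalizes only the training comments, so the test comments reach the vectorizer with raw URLs and numbers. A sketch of one way to share the same substitutions between both sets follows. `normalize_comment` is a hypothetical helper name, the patterns are adapted from the cell above, and the HTML pattern is left out for brevity. Whether this changes the score is untested.

```python
import re

URL_RE = re.compile(r'https?://\S*')
WIKI_RE = re.compile(r'(Wikipedia|Image|Help):\S*')
IP_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
NUM_RE = re.compile(r'[0-9]+')

def normalize_comment(text):
    """Replace URLs, wiki tags, IP addresses and numbers with placeholder tokens."""
    text = URL_RE.sub('httpaddr', text)
    text = WIKI_RE.sub('wikitag', text)
    text = IP_RE.sub('IPaddress', text)  # must run before NUM_RE eats the digits
    text = NUM_RE.sub('number', text)
    return text

example = "See http://example.com and Wikipedia:Vandalism posted from 192.168.0.1 in 2018"
print(normalize_comment(example))
# See httpaddr and wikitag posted from IPaddress in number
```

With the helper in place, `train['comment_text'].map(normalize_comment)` and `test['comment_text'].map(normalize_comment)` would apply identical preprocessing to both frames.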

In [None]:
train_text = train['comment_text']
test_text = test['comment_text']
all_data = train.iloc[:,2:]

In [None]:
headings = list(train.columns.values)
comment_headings = headings[2:]

To start, I'll explore the data a little bit to get a sense of what I'm dealing with.

In [None]:
train.head(15)

In [None]:
train.describe()

In [None]:
for i in range(3):
    print(train['comment_text'][i] + '\n')

General impressions of the training data:

    Comment length varies significantly.
    Most comments are fine (not toxic).
    My model should predict a comment is toxic whenever it predicts it to be severe_toxic.
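
One way to enforce the last point after prediction, sketched with made-up probabilities. The arrays `pred_toxic` and `pred_severe` are hypothetical stand-ins for two columns of model output, and the sketch assumes severe_toxic really does imply toxic.

```python
import numpy as np

# Hypothetical predicted probabilities for four comments (made-up numbers).
pred_toxic = np.array([0.90, 0.20, 0.05, 0.60])
pred_severe = np.array([0.40, 0.35, 0.01, 0.70])

# Never let the toxic probability fall below the severe_toxic probability.
pred_toxic_fixed = np.maximum(pred_toxic, pred_severe)
print(pred_toxic_fixed)
```

This only touches rows where the two predictions contradict each other. Its effect on the competition score is not something I have measured.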

Time to vectorize the data and start learning!

# Comment Processing

Inspiration for this part comes from Bojan Tunguz's kernel: Logistic Regression with words and char n-grams.

Vectorize the comments into word and char n-grams. The rationale is that these encode information differently. For example, users might obfuscate swear words, which hides them from word-level features but leaves many of their character n-grams intact.

Bojan's justification for this approach: "People often try to obfuscate bad words with additional characters. Using character n-grams can potentially detect those."
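
A small illustration of why this can work, in plain Python. `char_ngrams` is a hypothetical helper written for this example, not part of scikit-learn, and the 2 to 6 range simply mirrors the char vectorizer settings used in this notebook.

```python
def char_ngrams(text, n_min=2, n_max=6):
    """Return the set of character n-grams of `text` for n_min <= n <= n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }

plain = char_ngrams("stupid")
masked = char_ngrams("stu.pid")

# The obfuscated spelling still shares several n-grams with the plain word.
shared = plain & masked
print(sorted(shared))
# ['id', 'pi', 'pid', 'st', 'stu', 'tu']
```

A word-level tokenizer would split "stu.pid" into "stu" and "pid" and never see "stupid", while the shared character n-grams give the model something to latch onto.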


In [None]:
# TODO play with settings of vectorizer further

#all_text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(
    analyzer='word',
    token_pattern=r'\w{1,}',
    strip_accents='unicode', 
    stop_words='english',
    lowercase=False, # because usage of all caps is likely indicative of naughty behavior
    sublinear_tf=True,
    ngram_range=(1,1),
    max_features=10000)
#word_vectorizer.fit(all_text)
word_vectorizer.fit(train_text)
train_text_word_transform = word_vectorizer.transform(train_text)
test_text_word_transform = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    analyzer='char', 
    strip_accents='unicode', 
    # stop_words is omitted here: it only applies when analyzer='word'
    lowercase=False, # because usage of all caps is likely indicative of naughty behavior
    sublinear_tf=True,
    ngram_range=(2,6), #TODO I want to set the upper bound based off average word length, I think
    max_features=50000)
#char_vectorizer.fit(all_text)
char_vectorizer.fit(train_text)
train_text_char_transform = char_vectorizer.transform(train_text)
test_text_char_transform = char_vectorizer.transform(test_text)

complete_train_text = hstack((train_text_word_transform, train_text_char_transform))
complete_test_text = hstack((test_text_word_transform, test_text_char_transform))
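
The fit, transform, and hstack steps above can also be bundled into a single object. The sketch below uses scikit-learn's `FeatureUnion` on a toy corpus. The corpus and the trimmed-down vectorizer settings are assumptions for illustration, not the configuration used for the submission.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy corpus standing in for train_text / test_text (assumption).
toy_train = ["you are an idiot", "thanks for the helpful edit", "you are an id1ot"]
toy_test = ["what an idiot"]

union = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}",
                             lowercase=False, sublinear_tf=True)),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6),
                             lowercase=False, sublinear_tf=True)),
])

# fit_transform learns both vocabularies on the training text only and
# returns the word and char features stacked side by side.
X_train = union.fit_transform(toy_train)
X_test = union.transform(toy_test)
print(X_train.shape, X_test.shape)
```

The main benefit is bookkeeping: the test set is guaranteed to go through exactly the same fitted vectorizers, in the same column order, as the training set.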

In [None]:
print(train_text_word_transform.shape)
print(train_text_char_transform.shape)
print(complete_train_text.shape)
print(all_data.shape)

# Model

A simple logistic regression model, fit separately for each of the six labels.

In [None]:
# Comment/uncomment this if running for testing.
"""
#X_train, X_test, y_train, y_test = train_test_split(complete_train_text, all_data, test_size=0.3)

start = time.time()

pred = {}
cv_scores =[]
for category in comment_headings:
    clf = LogisticRegression(
            C=1.,
            solver='sag',
            max_iter=1000)
    scores = cross_val_score(clf, complete_train_text, all_data[category], cv=5)
    print(f'CV scores for {category}: {scores}, and average: {scores.mean()}')
        
end = time.time()
print(end-start)
"""

In [None]:
# Comment/uncomment this depending on if running for submission

pred = {}
train_scores = []
for category in comment_headings:
    clf = LogisticRegression(
            C=1.0,
            solver='sag',
            max_iter=1000)
    clf.fit(complete_train_text, all_data[category])
    # Accuracy on the same rows the model was fit on, so this is a
    # training score, not a validation score.
    train_score = clf.score(complete_train_text, all_data[category])
    train_scores.append(train_score)
    print(f'Training accuracy for {category}: {train_score}')
    # Keep only the probability of the positive class.
    pred[category] = clf.predict_proba(complete_test_text)[:,1]
print(f'Mean training accuracy: {sum(train_scores)/len(train_scores)}')
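
The scores printed by the cell above are accuracies measured on the same rows the model was fit on, so they are optimistic. The competition was scored on mean column-wise ROC AUC, so a held-out estimate in that metric is closer to the leaderboard number. A sketch on synthetic data follows. The data and the default solver are assumptions, and only the `scoring="roc_auc"` argument is the point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for one label column (assumption: not the real data).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(C=1.0, max_iter=1000)

# Each fold is scored on rows the model did not see while fitting.
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(auc_scores.mean())
```

In the notebook, the same call with `complete_train_text` and `all_data[category]` inside the per-label loop would give one AUC per label, and the mean of those is the figure to compare against the leaderboard.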

In [None]:
submission_id = pd.DataFrame({'id': test["id"]})
submission = pd.concat([submission_id, pd.DataFrame(pred, columns = headings[2:])], axis=1)
submission.describe()

In [None]:
submission.to_csv('submission.csv', index=False)

Final result: 0.9773; first place scored 0.9885.