# Toxic Comment Classification
## Jigsaw Kaggle Competition 

For this Kaggle competition, comments from Wikipedia had to be analyzed to estimate the probability that each one meets each of several toxicity criteria. The criteria were the following:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

The comments range from relatively mundane to extremely offensive. For EDA purposes I created word clouds and a frequency analysis, but given the offensive content and for the sake of brevity, I will focus only on the code that is relevant to the machine learning part of the challenge.

First, I imported the relevant libraries and loaded the data:


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
import re
from sklearn.linear_model import LogisticRegression

train=pd.read_csv("C:/Users/Malte/Documents/My repositories/Toxic-Comments-Classification/train.csv",sep=",")
test=pd.read_csv("C:/Users/Malte/Documents/My repositories/Toxic-Comments-Classification/test.csv",sep=",")

The text cleaning is very rudimentary. With more time, I suspect that improving it would have yielded the largest increase in accuracy.

In [None]:
def clean_text(text):
    """Lowercase, expand common English contractions, and strip punctuation."""
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # Must run before the generic "'s" rule, which would otherwise consume the "'s" in "'scuse"
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\W", " ", text)   # replace non-word characters with spaces
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.strip(" ")
    return text

train["clean_comments"] = train["comment_text"].apply(clean_text)
test["clean_comments"] = test["comment_text"].apply(clean_text)
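
To see what this kind of cleaning does to a single comment, here is a condensed, table-driven sketch of the same idea. The names `clean_text_sketch` and `CONTRACTIONS` and the sample sentence are made up for illustration, and the rule order is an assumption: specific patterns are listed before the generic `'s` rule so that it cannot consume part of a longer match.

```python
import re

# Contraction rules, applied in order. More specific patterns come first.
CONTRACTIONS = [
    (r"what's", "what is "),
    (r"'scuse", " excuse "),
    (r"can't", "cannot "),
    (r"n't", " not "),
    (r"i'm", "i am "),
    (r"'s", " "),
    (r"'ve", " have "),
    (r"'re", " are "),
    (r"'d", " would "),
    (r"'ll", " will "),
]


def clean_text_sketch(text):
    """Lowercase, expand contractions, then drop punctuation and extra spaces."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\W", " ", text)   # non-word characters become spaces
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()


# Hypothetical sample comment, not taken from the data set
sample = "I can't believe what's here... They've GONE too far!!!"
cleaned = clean_text_sketch(sample)
print(cleaned)  # i cannot believe what is here they have gone too far
```

Note that this step removes all punctuation, so repeated exclamation marks and similar signals do not survive in the cleaned column.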

To prepare the text for vectorization, we concatenate the train and test comments so that the TF-IDF vectorizer incorporates the words from both sets into its vocabulary.

In [None]:
word_vectorizer = TfidfVectorizer(strip_accents="unicode", token_pattern=r'\w{1,}',
                                  analyzer="word", ngram_range=(1, 1), stop_words="english", max_features=100000)
# Fit on the same cleaned column that is transformed below, so the vocabulary matches the input
all_text = pd.concat([train["clean_comments"], test["clean_comments"]])

word_vectorizer.fit(all_text)

train_word_features = word_vectorizer.transform(train["clean_comments"])
test_word_features = word_vectorizer.transform(test["clean_comments"])
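
The fit-on-everything, transform-separately pattern can be seen on a toy example. This is a minimal sketch with made-up sentences (`toy_train` and `toy_test` are hypothetical), not the competition data: because the vectorizer is fitted on the concatenation, both matrices share the same columns and words that occur only in the test set still get a column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the train and test comment columns
toy_train = pd.Series(["you are an idiot", "thanks for the helpful edit"])
toy_test = pd.Series(["what a helpful idiot", "completely unseen words"])

vec = TfidfVectorizer(token_pattern=r"\w{1,}", analyzer="word")
vec.fit(pd.concat([toy_train, toy_test]))  # vocabulary built from both sets

train_matrix = vec.transform(toy_train)
test_matrix = vec.transform(toy_test)

# Same number of columns, so a model trained on one can score the other
print(train_matrix.shape, test_matrix.shape)
# A word that only occurs in the test set is still part of the vocabulary
print("unseen" in vec.vocabulary_)
```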

Next, we add a character-level vectorizer, which increases accuracy tremendously. I suppose the idiosyncratic spelling of toxic comments ("IDIOT!!!!!11!!!") is responsible for that. Afterwards, we combine the two feature sets for both the training and the test set.

In [None]:
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',  # stop_words is omitted: scikit-learn only applies it when analyzer == 'word'
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train["clean_comments"])
test_char_features = char_vectorizer.transform(test["clean_comments"])

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
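
A plausible reason character n-grams help is that creatively spelled variants of a word still share many short substrings, while a word-level tokenizer treats them as unrelated tokens. The following is a small pure-Python sketch of that intuition. `char_ngrams` is a hypothetical helper written for this illustration and is not how scikit-learn implements its analyzer.

```python
def char_ngrams(text, n_min=2, n_max=6):
    """Return the set of character n-grams of text with n_min <= n <= n_max."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


plain = char_ngrams("idiot")
shouted = char_ngrams("idiooot!!!")

# As whole words the two strings differ, yet they overlap at character level
shared = plain & shouted
print(sorted(shared))  # ['di', 'dio', 'id', 'idi', 'idio', 'io', 'ot']
```

The range `(2, 6)` mirrors the `ngram_range` used above; the two example strings are invented.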


Finally, we fit a logistic regression on these features, one model per label. I experimented with some parameters, but the defaults yielded my best score, which was 0.9802.

In [None]:
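# The six target columns listed at the top of this notebook
ratings = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

submission = pd.DataFrame()
submission["id"] = test["id"]

# Fit one binary classifier per label and store the probability of the positive class
for e in ratings:
    y = train[e]
    clf = LogisticRegression()
    clf.fit(train_features, y)
    submission[e] = clf.predict_proba(test_features)[:, 1]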
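
To compare parameter settings without submitting each time, a local estimate of the score helps. The competition leaderboard used the mean column-wise ROC AUC, so the sketch below averages a cross-validated ROC AUC over the labels. It is only an illustration under stated assumptions: the `toy` data frame and its labels are invented, only two of the six labels are shown, and fitting TF-IDF before splitting lets a little information leak between folds.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny made-up corpus, repeated so every fold contains both classes
toy = pd.DataFrame({
    "text": ["you are an idiot", "thanks for the helpful edit",
             "what a stupid thing to say", "please see the talk page"] * 5,
    "toxic": [1, 0, 1, 0] * 5,
    "insult": [1, 0, 0, 0] * 5,
})

features = TfidfVectorizer().fit_transform(toy["text"])

cv_scores = {}
for label in ["toxic", "insult"]:
    clf = LogisticRegression()
    # For classifiers, an integer cv uses stratified folds
    scores = cross_val_score(clf, features, toy[label], cv=5, scoring="roc_auc")
    cv_scores[label] = scores.mean()

# Mean column-wise ROC AUC across the labels
mean_auc = sum(cv_scores.values()) / len(cv_scores)
print(cv_scores, mean_auc)
```

On the real data, the same loop would run over all six label columns with the stacked word and character features in place of `features`.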

## Acknowledgements

As always, the Kaggle community was extremely helpful. I'd especially like to point out this kernel:

https://www.kaggle.com/ogrellier/lgbm-with-words-and-chars-n-gram

This helped me tremendously. I experimented with LightGBM as well but could not come close to beating logistic regression.