### Introduction

Description from the challenge page:

<i>The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) are working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

<b>Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.</b></i>

Link to the challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

### Imports and load data

In [36]:
import pandas as pd
import numpy as np
import nltk
from matplotlib import pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import RidgeClassifier, LogisticRegression, RidgeClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.dummy import DummyClassifier
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm import tqdm, tqdm_pandas
import string
import re
import collections

%matplotlib inline
pd.set_option('display.max_colwidth', -1)

In [2]:
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
sample_submission_data = pd.read_csv('data/sample_submission.csv')

In [3]:
train_data.head()
train_data.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

### Short exploration

In [4]:
def get_random_examples(data, keyword='severe_toxic', n_samples=4):
    if n_samples%2 >0:
        n_samples = n_samples + 1
    ones_df = data[(data[keyword] == 1) & (data['comment_text'].str.len() < 600)].copy().sample(int(n_samples/2))
    zeroes_df = data[(data[keyword] == 0) & (data['comment_text'].str.len() < 600)].copy().sample(int(n_samples/2+1))
    merged = zeroes_df.append(ones_df)
    
    return merged[['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
                   'insult', 'identity_hate']].sample(frac=1).reset_index(drop=True)

In [5]:
get_random_examples(train_data, keyword='threat', n_samples=6)

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"I've looked over Tznkai's proposal, but my vote is still to leave things where they are. The confusion of a move would be much greater than any current confusion that may or may not exist.",0,0,0,0,0,0
1,Do not add a speedy deletion tag to this as I may expand it.,0,0,0,0,0,0
2,u motherfukkin bitch i want to rape you smelly whore stop fucking blocking my account or ill get my nigga homies to came and kidnap and rape you and your family,1,1,1,1,1,1
3,"JEW \n\nGet the fuck out of here you jewish son of a bitch, I'll rape your fucking family if you don't leave you semite bastard. I will shoot you if you return, because you're a dirty semite, I hope you choke on a fucking bagel, prick. Shalom. We Came In?",1,1,1,1,1,1
4,"Block Me! I DARE YOU \n\nHey, don't tell me what I can and can't do, go ahead and block me, cause if you do, I will have this whole damn website shut down for good. So try me!!! Wweppvguy",1,0,1,1,0,0
5,Even look at Marty Sertich's Wiki page. 2000-2001. HIGH SCHOOL. ROSEVILLE.\n\nWhat do you need? A bloody news article stating the aforementioned people went to RAHS? Why can't you uptight assholes just accept personal assurance? What do I have to gain from lying about past alumni? Seriously.,1,0,1,0,0,0
6,"I said that it's a mix of the various bits, anyway, I'm sure we'll email back and fourth a few times before I get permission, so I'll tell then if they specifically ask. However, I feel that it's still fair use. (talk|email)",0,0,0,0,0,0


### Clean data: tokenize, remove stop words

In [6]:
stopset = set(stopwords.words('english'))
snow = SnowballStemmer('english')
WNlemma = WordNetLemmatizer()

def clean_text(x, normalization='stemming', remove_stop=False):
    """Function to preprocess text data. Removes punctuation and numbers. 
    Lemmatizes or stems words, depending on given parameter. Can also remove 
    stopwords if specified.
    
    Args:
        x (str): The piece of text to process.
        normalization (str): how to normalize words, 'stemming' (default) or 'lemmatization'.
        remove_stop (bool): whether to remove stopwords. Default is False.
        
    Returns:
        str: Preprocessed tokens, re-joined with spaces.
    """
    # split text
    words = word_tokenize(x)
    
    # remove punctuation and numbers
    words = [word for word in words if word not in string.punctuation and not bool(re.search(r'\d', word))]
    
    if normalization == 'stemming':
        words = [snow.stem(t) for t in words] # stemming
    elif normalization == 'lemmatization':
        words = [WNlemma.lemmatize(t.lower()) for t in words] # lemmatize words (advanced stemming)
    else:
        return 'Invalid parameter for normalization'
    
    # remove stop words
    if remove_stop:
        words = [word for word in words if word not in stopset]
    
    joined_words = ' '.join(words).replace('_', '')
    
    return joined_words

Apply text cleaning to column

In [7]:
#tqdm.pandas(tqdm()) # for tracking progress
#train_data['comment_text'] = train_data['comment_text'].progress_apply(lambda x: clean_text(x, normalization='lemmatization'))

train_data['comment_text'] = train_data['comment_text'].apply(lambda x: clean_text(x, normalization='lemmatization'))

In [8]:
clean_text('This is a __test FUCK 99 !! .. fUcking ObSCENE languages ASsh0l3.', normalization='lemmatization')
# note: words with numbers in them currently get dropped. 
# suggestion: replace numbers in words with letters (e.g. 0 = o, 1 = i, 7 = t, 3 = e)

'this is a test fuck .. fucking obscene language'

### Explore N-gram frequencies to better estimate appropriate min_df parameter for model

In [9]:
vect = CountVectorizer(ngram_range=(1,2))
train_vect = vect.fit_transform(train_data['comment_text'])
dist = np.sum(train_vect, axis=0).tolist()[0]
vocab = vect.get_feature_names()

In [10]:
ngram_freq = {}

for tag, count in zip(vocab, dist):
    ngram_freq[tag]=count
    
counts = collections.Counter(list(ngram_freq.values()))

In [11]:
# freq, occurrences of freq
# e.g. 2045516 words occur one time
counts.most_common()[:10]

[(1, 1512510),
 (2, 277999),
 (3, 111360),
 (4, 61127),
 (5, 39095),
 (6, 27197),
 (7, 19998),
 (8, 15470),
 (9, 12206),
 (10, 9907)]

### Split data

In [12]:
X = train_data['comment_text']
y = train_data[['toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate']]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Pipeline a few models

In [14]:
NB_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=4, max_df=0.5, max_features=50000)),
                    ('tfidf', TfidfTransformer(use_idf=True)),
                    ('clf', OneVsRestClassifier(MultinomialNB(alpha=0.01), n_jobs=-1))])

SVC_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=4, max_df=0.5, max_features=50000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(SVC(C=10, probability=True), n_jobs=-1))])

logistic_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=4, max_df=0.5, max_features=15000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(LogisticRegression(C=0.1, class_weight='balanced'), n_jobs=-1))])

### Hyperparameter tuning

In [15]:
# Define parameters, specify for which part of pipeline with prefix, e.g. 'vect__'
SVC_params = {'vect__ngram_range': [(1,2)],
              'tfidf__use_idf': [True],
              'clf__estimator__C':[0.1, 1, 10]}

logistic_params = {#'vect__ngram_range': [(1,2)],
                   'vect__min_df': [3, 4, 5, 6],
                   'vect__max_df': [0.3, 0.4, 0.5, 0.6],
                   #'vect__max_features': [25000, 50000, 100000, None],
                    #'vect__max_features': [5000, 7500, 10000, 12500],
                  #'tfidf__use_idf': [True],
                  #'clf__estimator__C':[0.1, 0.3, 0.6, 1, 3],
                  #'clf__estimator__class_weight':['balanced', None],
                  #'clf__estimator__penalty':['l1', 'l2']
                  }

NB_params = {'vect__ngram_range': [(1,2)],
              'tfidf__use_idf': [True],
              'clf__estimator__alpha':[0.1, 1, 10]}

### New pipeline with optimal parameters

In [16]:
Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=6, max_df=0.3, max_features=25000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(LogisticRegression(C=0.1, class_weight='balanced'), n_jobs=-1))])

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.3, max_features=25000, min_df=6,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        stri...None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False),
          n_jobs=-1))])

### Validation

In [17]:
# models to test
models = {'Logistic regression': logistic_clf,
          #'SVC': SVC_clf,
         #'Naïve Bayes': NB_clf
         }

In [18]:
# Benchmark score (in case of all 0 predictions):
pred = np.zeros(y_test.shape)
roc_auc_score(y_test, pred)

0.5

In [19]:
for model_name, model in models.items():
    print('Training {}...'.format(model_name))
    clf = model.fit(X_train, y_train)
    
    y_pred = clf.predict(X_train)
    print('{} train ROC_AUC score: {}'.format(model_name, roc_auc_score(y_train, y_pred)))
    
    y_pred = clf.predict(X_test)
    print('{} test ROC_AUC score: {}'.format(model_name, roc_auc_score(y_test, y_pred)))
    print('{} cross validation ROC_AUC score on 5 folds: {}'.format(model_name, cross_val_score(model, X, y, scoring='roc_auc', cv=5, n_jobs=-1).mean()))
    print('')

Training Logistic regression...
Logistic regression train ROC_AUC score: 0.9553583845455703
Logistic regression test ROC_AUC score: 0.9066418804422969
Logistic regression cross validation ROC_AUC score on 5 folds: 0.9765465005353489



#### Validation log

### Final parameter tuning (manual)

In [20]:
logistic_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=6, max_df=0.3, max_features=25000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(LogisticRegression(C=0.1, class_weight='balanced'), n_jobs=-1))])

print('Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=6: ', cross_val_score(logistic_clf, X, y, scoring='roc_auc', cv=5, n_jobs=-1).mean())

logistic_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=5, max_df=0.3, max_features=25000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(LogisticRegression(C=0.1, class_weight='balanced'), n_jobs=-1))])

print('Cross_val_score with C=0.1, max_features=25000, max_df=0.4, min_df=5: ', cross_val_score(logistic_clf, X, y, scoring='roc_auc', cv=5, n_jobs=-1).mean())

#Cross_val_score with C=0.1:  0.9759436463859725
#Cross_val_score with C=0.01:  0.9635830622046198

#Cross_val_score with C=0.1, max_features=25000, max_df=0.5, min_df=4:  0.9760136616571229
#Cross_val_score with C=0.1, max_features=50000, max_df=0.5, min_df=4:  0.9759436463859725

#Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=4:  0.9762885351187846
#Cross_val_score with C=0.1, max_features=25000, max_df=0.4, min_df=4:  0.9760047347166096

#Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=6:  0.9763128416184881
#Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=5:  0.976302536076927

# Same test, but with correction in clean_text function (fixed lowercase issue):
# Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=6:  0.977167248668523
# Cross_val_score with C=0.1, max_features=25000, max_df=0.4, min_df=5:  0.9771515047562769

Cross_val_score with C=0.1, max_features=25000, max_df=0.3, min_df=6:  0.977167248668523
Cross_val_score with C=0.1, max_features=25000, max_df=0.4, min_df=5:  0.9771515047562769


To do: write automated test function (start with default, set params to best so far, test one param per iteration)

### Declare model with final parameters to use

In [21]:
final_model = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), min_df=6, max_df=0.3, max_features=25000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                       ('clf', OneVsRestClassifier(LogisticRegression(C=0.1, class_weight='balanced'), n_jobs=-1))])

### Make predictions for submission and save

Note: submission should be probabilities

#### Clean submission data

In [22]:
print(test_data.comment_text[0])

Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,


In [23]:
# clean test data
#test_data['comment_text'] = test_data['comment_text'].progress_apply(lambda x: clean_text(x, normalization='lemmatization'))
test_data['comment_text'] = test_data['comment_text'].apply(lambda x: clean_text(x, normalization='lemmatization'))

In [24]:
print(test_data.comment_text[0])

yo bitch ja rule is more succesful then you 'll ever be whats up with you and hating you sad mofuckas ... i should bitch slap ur pethedic white face and get you to kiss my as you guy sicken me ja rule is about pride in da music man dont dis that shit on him and nothin is wrong bein like tupac he wa a brother too ... fuckin white boy get thing right next time.


In [25]:
sample_submission_data.head(2)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
1,0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5


#### Train final model and predict probabilities for the labels

In [26]:
y_pred_final = final_model.fit(X, y).predict_proba(test_data['comment_text'])

In [27]:
predictions = pd.DataFrame(y_pred_final, columns=y_test.columns)

In [28]:
predictions.head(2)

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0.99539,0.942465,0.996497,0.866189,0.987634,0.947539
1,0.116185,0.063854,0.073821,0.03913,0.103398,0.085246


#### Merge predictions with IDs, explore the outcome and confirm correct shape

In [29]:
submission = pd.concat([test_data['id'], predictions], axis=1)

In [30]:
print('ID 00001cee341fdb12:\n',test_data.comment_text[0], '\n')
print('ID 0000247867823ef7:\n',test_data.comment_text[1])
submission.head(2)

ID 00001cee341fdb12:
 yo bitch ja rule is more succesful then you 'll ever be whats up with you and hating you sad mofuckas ... i should bitch slap ur pethedic white face and get you to kiss my as you guy sicken me ja rule is about pride in da music man dont dis that shit on him and nothin is wrong bein like tupac he wa a brother too ... fuckin white boy get thing right next time. 

ID 0000247867823ef7:
 == from rfc == the title is fine a it is imo


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.99539,0.942465,0.996497,0.866189,0.987634,0.947539
1,0000247867823ef7,0.116185,0.063854,0.073821,0.03913,0.103398,0.085246


In [31]:
submission.shape

(153164, 7)

#### Save predictions to .csv, ready for submission

In [32]:
TARGET_PATH = './data/submission_simple.csv'
submission.to_csv(TARGET_PATH, index=False)

My highest public leaderboard score on Kaggle: 0.0.9723

### Bonus: most important words per label 

Show 10 ten most important words, and 10 least important words per label.

In [33]:
all_estimators = final_model.named_steps['clf'].estimators_
vocab = final_model.named_steps['vect'].vocabulary_
index_to_words = {value: key for key,value in vocab.items()}

In [34]:
for index, label in enumerate(y.columns):
    print('Current label: {}'.format(label))
    words = {}
    coefs = all_estimators[index].coef_[0]
    for key in index_to_words.keys():
        words[index_to_words[key]] = coefs[key]
    words = sorted(words.items(), key=lambda x:x[1], reverse=True)
    top_5_most_important = words[:10]
    top_5_least_important = words[-10:]
    print('Top 10 most {} words:'.format(label))
    for pair in top_5_most_important:
        print(pair)
    print('')
    print('Top 10 least {} words:'.format(label))
    for pair in top_5_least_important:
        print(pair)
    print('\n')

Current label: toxic
Top 10 most toxic words:
('fuck', 9.295420547748519)
('fucking', 8.01926826032406)
('stupid', 7.775941645246558)
('idiot', 7.415786121096115)
('shit', 7.325892787270488)
('suck', 6.230368382977478)
('as', 5.351426184985407)
('asshole', 5.212978738540356)
('crap', 5.074895273248084)
('dick', 5.033463892230288)

Top 10 least toxic words:
('but', -2.018151832647545)
('at', -2.0247040154355096)
('source', -2.086360135356195)
('thank you', -2.116636549977007)
('thank', -2.1968729718784465)
('utc', -2.29552975080801)
('article', -2.5343513950197027)
('please', -2.843646044913631)
('talk', -3.364017510262568)
('thanks', -3.3772175055523883)


Current label: severe_toxic
Top 10 most severe_toxic words:
('fucking', 9.139312853190223)
('fuck', 9.003776593117689)
('bitch', 5.724100840073539)
('asshole', 5.601583500585075)
('shit', 5.515879511526424)
('suck', 5.185368550320496)
('dick', 4.986553179438271)
('as', 4.767529613564076)
('fucker', 4.682142091935353)
('faggot', 4.530