# Toxic Comment Classification
#### Identify and classify toxic online comments

1. [Goal](#goal)
2. [Data](#dat)
3. [Load Data](#loaddata)
4. [Feature Extraction](#features)
5. [Naïve Bayes](#nb)
6. [Stochastic Gradient Descent](#sgd)
7. [Results](#result)

<a id='goal'></a>
## Goal  
Study negative online behaviors, such as toxic comments (comments that are rude, disrespectful, or otherwise likely
to make someone leave a discussion), and build a model that identifies the type of toxicity in each comment.

<a id='dat'></a>
## Data  

The data comes from the Kaggle Toxic Comment Classification Challenge.

URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

### Import Required Libraries

In [8]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

import tqdm

<a id='loaddata'></a>
## Load Data

In [9]:
train = pd.read_csv('D:/Kaggle/Toxic Comment Classification Challenge/train.csv')
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [10]:
test = pd.read_csv('D:/Kaggle/Toxic Comment Classification Challenge/test.csv')
test.head()

| | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


<a id='features'></a>
## Feature Extraction

In [15]:
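# Count unigrams and bigrams; drop terms that appear in fewer than 3 comments.
word_count_vect = CountVectorizer(ngram_range=(1, 2), min_df=3, strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}')  # a token is any run of one or more word characters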
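
To make the vectorizer settings concrete, here is a minimal plain-Python sketch of what they ask for: lowercasing, accent stripping, tokenizing with `\w{1,}`, emitting unigrams and bigrams, and keeping only terms whose document frequency reaches `min_df`. It is an approximation for illustration, not scikit-learn's implementation; the helper names (`strip_accents`, `word_ngrams`, `build_vocabulary`) and the toy comments are assumptions introduced here.

```python
import re
import unicodedata
from collections import Counter

TOKEN_RE = re.compile(r"\w{1,}")  # same token pattern as the vectorizer above


def strip_accents(text):
    # Decompose characters (NFKD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_ngrams(text, ngram_range=(1, 2)):
    # Lowercase, strip accents, tokenize, then emit every n-gram in the range.
    tokens = TOKEN_RE.findall(strip_accents(text.lower()))
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, min_df=3):
    # Document frequency: the number of documents in which each n-gram occurs.
    df = Counter()
    for doc in docs:
        df.update(set(word_ngrams(doc)))
    return sorted(term for term, count in df.items() if count >= min_df)


# Hypothetical toy comments, not taken from the Kaggle data.
toy_docs = [
    "You are my hero",
    "you are so naïve",
    "Are you my hero?",
    "you are here",
]

print(word_ngrams("Naïve Bayes"))         # ['naive', 'bayes', 'naive bayes']
print(build_vocabulary(toy_docs, min_df=3))  # ['are', 'you', 'you are']
```

With `min_df=3`, rare n-grams such as `my hero` (two documents) are dropped, which keeps the feature space manageable when bigrams are included.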


<a id='nb'></a>
## Naïve Bayes

In [16]:
from sklearn.naive_bayes import MultinomialNB
clf_NB = MultinomialNB()
pipe = Pipeline([("word_count_vect", word_count_vect),('NB',clf_NB)])
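
As a rough illustration of what the classifier in this pipeline does with word counts, the sketch below implements multinomial Naïve Bayes with additive smoothing in plain Python. It assumes whitespace tokenization and `alpha=1.0` (the `MultinomialNB` default); the function names and toy comments are hypothetical, and this is not scikit-learn's code.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    # Vocabulary over all training documents (whitespace tokens).
    vocab = sorted({w for d in docs for w in d.split()})
    classes = sorted(set(labels))
    log_prior, log_like = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Additive (Laplace) smoothing keeps unseen words from zeroing the product.
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return classes, log_prior, log_like


def predict_proba(model, doc):
    classes, log_prior, log_like = model
    # Joint log-probability per class; words outside the vocabulary are ignored.
    joint = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
        for c in classes
    }
    # Normalize with the max subtracted first for numerical stability.
    top = max(joint.values())
    expd = {c: math.exp(v - top) for c, v in joint.items()}
    norm = sum(expd.values())
    return {c: e / norm for c, e in expd.items()}


# Hypothetical toy data: label 1 marks a toxic comment.
toy_docs = ["you are great", "thanks for the edit", "you are stupid", "stupid stupid edit"]
toy_labels = [0, 0, 1, 1]

model = train_nb(toy_docs, toy_labels)
print(predict_proba(model, "stupid"))
```

The probability returned for class `1` plays the same role as the `[:, 1]` column of `predict_proba` in the training loop.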

In [17]:
target=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
Total_Auc=0

# One binary classifier per label, scored with 3-fold cross-validated ROC AUC.
for i in target:
    print('Training for label '+str(i)+'.......')
    cv_score = np.mean(cross_val_score(pipe, train['comment_text'].values, train[i].values, cv=3, scoring='roc_auc'))
    Total_Auc=Total_Auc+cv_score
    print('AUC for label '+str(i)+' '+str(round(cv_score,5)))
    # Refit on the full training set, then keep the positive-class probability for the test set.
    pipe.fit(train['comment_text'].values, train[i].values)
    print('Testing for label '+str(i)+'.......')
    test[i]=pipe.predict_proba(test['comment_text'].values)[:, 1]
    print('Prediction ready for label '+str(i)+'.......')
    print('#####################################################')
print('Avg AUC ',round(float(Total_Auc/len(target)),3))

Training for label toxic.......
AUC for label toxic 0.91684
Testing for label toxic.......
Prediction ready for label toxic.......
#####################################################
Training for label severe_toxic.......
AUC for label severe_toxic 0.90191
Testing for label severe_toxic.......
Prediction ready for label severe_toxic.......
#####################################################
Training for label obscene.......
AUC for label obscene 0.91227
Testing for label obscene.......
Prediction ready for label obscene.......
#####################################################
Training for label threat.......
AUC for label threat 0.78705
Testing for label threat.......
Prediction ready for label threat.......
#####################################################
Training for label insult.......
AUC for label insult 0.909
Testing for label insult.......
Prediction ready for label insult.......
#####################################################
Training for label identity_hate.
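
The `roc_auc` scoring used in the loop above measures how well a model ranks toxic comments above non-toxic ones: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. A minimal pure-Python sketch of that definition follows, assuming binary 0/1 labels; the function name `roc_auc` and the toy values are illustrative and this is not how scikit-learn computes it internally.

```python
def roc_auc(y_true, scores):
    # Split the scores by true label.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    # Count positive/negative pairs that are ranked correctly (ties count half).
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical values).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Because the metric depends only on ranking, it stays informative for rare labels such as `threat`, where accuracy would look high even for a model that never predicts the positive class.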

In [None]:
test.loc[:,['id','toxic','severe_toxic','obscene','threat','insult','identity_hate']].to_csv('ToxicComments_Submission_NB.csv',index=False)

<a id='sgd'></a>
## Stochastic Gradient Descent

In [31]:
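# n_iter and loss='log' are the parameter names of the scikit-learn version this notebook ran on;
# newer releases use max_iter and loss='log_loss' instead.
SGD=SGDClassifier(n_iter=100,random_state=100,loss='log')
pipe = Pipeline([("word_count_vect", word_count_vect),('SGD',SGD)])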

In [32]:
SGD

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=100, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=100, shuffle=True,
       verbose=0, warm_start=False)
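
With a logistic loss and an L2 penalty, this classifier is a logistic regression fitted by stochastic gradient descent, which is why `predict_proba` is available. The sketch below shows the per-example update in plain Python. It is a simplified illustration, not scikit-learn's algorithm: it assumes a constant learning rate `lr` (the notebook uses `learning_rate='optimal'`), and the names `sigmoid`, `sgd_logistic`, and the toy data are introduced here.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic(X, y, epochs=100, lr=0.1, alpha=0.0001, seed=100):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss with respect to the logit
            # Gradient step on the log loss plus the L2 penalty (alpha * w).
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b


def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy count features: column 0 is a "toxic" word count, column 1 a "benign" one.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
w, b = sgd_logistic(X, y)
print(predict_proba([1, 0], w, b), predict_proba([0, 1], w, b))
```

The branch in `sigmoid` avoids overflow in the exponential for logits of large magnitude, which can occur with unscaled count features.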

In [33]:
target=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
Total_Auc=0

# One binary classifier per label, scored with 3-fold cross-validated ROC AUC.
for i in target:
    print('Training for label '+str(i)+'.......')
    cv_score = np.mean(cross_val_score(pipe, train['comment_text'].values, train[i].values, cv=3, scoring='roc_auc'))
    Total_Auc=Total_Auc+cv_score
    print('AUC for label '+str(i)+' '+str(round(cv_score,5)))
    # Refit on the full training set, then keep the positive-class probability for the test set.
    pipe.fit(train['comment_text'].values, train[i].values)
    print('Testing for label '+str(i)+'.......')
    test[i]=pipe.predict_proba(test['comment_text'].values)[:, 1]
    print('Prediction ready for label '+str(i)+'.......')
    print('#####################################################')
print('Avg AUC ',round(float(Total_Auc/len(target)),3))

Training for label toxic.......
AUC for label toxic 0.95732
Testing for label toxic.......
Prediction ready for label toxic.......
#####################################################
Training for label severe_toxic.......
AUC for label severe_toxic 0.96394
Testing for label severe_toxic.......
Prediction ready for label severe_toxic.......
#####################################################
Training for label obscene.......
AUC for label obscene 0.97131
Testing for label obscene.......
Prediction ready for label obscene.......
#####################################################
Training for label threat.......
AUC for label threat 0.94577
Testing for label threat.......
Prediction ready for label threat.......
#####################################################
Training for label insult.......
AUC for label insult 0.95935
Testing for label insult.......
Prediction ready for label insult.......
#####################################################
Training for label identity_hate.......
AUC for label identity_hate 0.93743
Testing for label identity_hate.......
Prediction ready for label identity_hate.......
#####################################################
Avg AUC  0.956
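
The reported average is the unweighted mean of the six per-label scores, so every label counts equally regardless of how rare it is. As a quick check, the arithmetic can be reproduced from the values printed above (the dictionary name `sgd_auc` is introduced here for illustration):

```python
# Per-label cross-validated AUC values copied from the SGD output above.
sgd_auc = {
    'toxic': 0.95732,
    'severe_toxic': 0.96394,
    'obscene': 0.97131,
    'threat': 0.94577,
    'insult': 0.95935,
    'identity_hate': 0.93743,
}

avg_auc = sum(sgd_auc.values()) / len(sgd_auc)
print(round(avg_auc, 3))  # 0.956
```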


In [34]:
test.loc[:,['id','toxic','severe_toxic','obscene','threat','insult','identity_hate']].to_csv('ToxicComments_Submission_SGD.csv',index=False)

<a id='result'></a>
## Results

**Naïve Bayes**
- Cross-validation score (mean ROC AUC across the six labels, 3-fold) = 0.872
- Test data score (Kaggle private score) = 0.865

**Stochastic Gradient Descent**
- Cross-validation score (mean ROC AUC across the six labels, 3-fold) = 0.956
- Test data score = 0.954