# Toxic Comment Classification

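Installing the scikit-learn (sklearn) library <br>
Importing numpy, pandas and matplotlib for data processing and plotting <br>
Importing re and string to work with strings and regular expressions <br>
Importing `LogisticRegression` and `TfidfVectorizer` from sklearn for the classifier and the text features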

In [1]:
%%capture
!pip install scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

Reading the input files from CSV (after manually unzipping them) and previewing the column labels and the first five rows using `train.head()`

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
subm = pd.read_csv('sample_submission.csv')

train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Creating a `non-toxic` column for comments that carry none of the six labels. These are the vast majority, as the `mean` row of `train.describe()` shows (about 89.8% of the training comments).
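As a minimal sketch of how the `1 - max(axis=1)` trick behaves, the toy frame below uses the same six label names with made-up values; `toy` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Toy frame with the same six label columns (values are made up for illustration).
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],   # no label set
     [1, 0, 1, 0, 0, 0],   # two labels set
     [0, 0, 0, 0, 1, 0]],  # one label set
    columns=label_cols,
)

# max(axis=1) is 1 when any label is set, so 1 - max flags fully clean rows.
toy['non-toxic'] = 1 - toy[label_cols].max(axis=1)
print(toy['non-toxic'].tolist())  # [1, 0, 0]
```

Only the first row, where every label is 0, is marked as non-toxic.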

In [3]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train['non-toxic'] = 1-train[label_cols].max(axis=1)
train.describe()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,non-toxic
count,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0
mean,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805,0.898321
std,0.294379,0.099477,0.223931,0.05465,0.216627,0.09342,0.302226
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Checking whether any column, including the comment text, contains null values that would need handling before prediction (it turns out the data is clean)

In [4]:
train.isnull().sum(), test.isnull().sum()

(id               0
 comment_text     0
 toxic            0
 severe_toxic     0
 obscene          0
 threat           0
 insult           0
 identity_hate    0
 non-toxic        0
 dtype: int64,
 id              0
 comment_text    0
 dtype: int64)

The function `token` finds every punctuation character, surrounds it with spaces so that it becomes a separate token, and then splits the resulting string on whitespace. The pattern is built from `string.punctuation` using the `re.compile()` function. <br><br>
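To make the tokenizer's behaviour concrete, here is a small check on a made-up sentence. The function uses the same regular expression logic as the notebook's `token`; the sample string is an assumption chosen for illustration.

```python
import re
import string

def token(s):
    # Pad every punctuation character with spaces, then split on whitespace.
    return re.compile(f'([{string.punctuation}])').sub(r' \1 ', s).split()

tokens = token("Hey man, I'm fine!")
print(tokens)  # ['Hey', 'man', ',', 'I', "'", 'm', 'fine', '!']
```

Punctuation is kept as its own token rather than discarded, so marks such as `!` can still contribute features.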

The `TfidfVectorizer` class imported from `sklearn` is configured as follows:
- The `ngram_range` defines the min and max sizes of the phrases that are used in the vocabulary. (Here it is either 1-word or 2-word phrases).
- The `tokenizer` creates these words by using the defined function `token`
- The `min_df=3` ignores terms that appear in fewer than 3 documents and `max_df=0.9` ignores terms that appear in more than 90% of the documents, i.e., too frequently used to be helpful for our model.
- `strip_accents` removes accents from characters during preprocessing. The <b>'unicode'</b> option normalizes any character, whereas <b>'ascii'</b> is faster but only handles characters with a direct ASCII mapping.
- `sublinear_tf` is a boolean that controls how the raw term frequency is scaled. Setting it to true replaces the term frequency `tf` with `1 + log(tf)`, so repeated occurrences of a term add progressively less weight.

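And now we call it on the <b>"comment_text"</b> column for both the training and testing data. <br>
<u><b>Note:</b></u> We use `fit_transform()` on the training data because it learns the vocabulary and IDF weights and applies them in one step. The testing data only goes through `transform()`, so it is encoded with the vocabulary and weights learned from the training data.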
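The difference between fitting and transforming can be sketched on a made-up mini corpus. The documents, the `toy_vec` name and the looser `min_df=1` threshold are assumptions for illustration; the default tokenizer is used instead of `token` to keep the sketch short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, only for illustration.
train_docs = ["you are great", "you are awful awful", "great edit thanks"]
test_docs = ["awful edit", "totally unseen words"]

# Looser threshold than the notebook's min_df=3 so the toy corpus keeps its terms.
toy_vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)

X_toy = toy_vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_X_toy = toy_vec.transform(test_docs)   # reuses them; unseen terms are dropped

print(sorted(toy_vec.vocabulary_))          # unigrams and bigrams seen in training
print(X_toy.shape, test_X_toy.shape)        # same number of columns in both
```

The second test document contains only words that never occurred in training, so its row in `test_X_toy` is entirely zero.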

In [5]:
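# Compile once; re.escape keeps characters such as ']' and '\' literal inside the class.
PUNCT_RE = re.compile(f'([{re.escape(string.punctuation)}])')

def token(s):
    return PUNCT_RE.sub(r' \1 ', s).split()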

vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=token, min_df=3, max_df=0.9, strip_accents='unicode', sublinear_tf=True )
X = vec.fit_transform(train['comment_text'])
test_X = vec.transform(test['comment_text'])

X, test_X

(<159571x425924 sparse matrix of type '<class 'numpy.float64'>'
 	with 17758669 stored elements in Compressed Sparse Row format>,
 <153164x425924 sparse matrix of type '<class 'numpy.float64'>'
 	with 14751682 stored elements in Compressed Sparse Row format>)

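Now we define functions:
- `bayes` takes a class value (`Y_index`, either 0 or 1) and the label vector (`Y`). It sums each feature of the sparse matrix `X` over the rows belonging to that class and divides by the number of such rows, with add-one smoothing in both numerator and denominator. These per-feature averages are the terms of the naive Bayes log-count ratio.
- `get_model` computes the log-count ratio `r = log(bayes(1, Y) / bayes(0, Y))`, scales the columns of `X` by `r`, and fits a logistic regression on the result for the given label column using the default `solver='lbfgs'` (a quasi-Newton optimizer). It returns the fitted model together with `r`.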
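The log-count ratio can be traced by hand on a tiny example. The sketch below uses a made-up 4-document, 3-term matrix; `smoothed_mean`, `X_toy`, `y_toy` and `r_toy` are hypothetical names, and the result is flattened to a plain 1-D array for readability, which is an assumption of this sketch rather than what the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up feature matrix (4 documents, 3 terms) and binary labels.
X_toy = csr_matrix(np.array([[1.0, 0.0, 2.0],
                             [0.0, 1.0, 0.0],
                             [3.0, 0.0, 0.0],
                             [0.0, 2.0, 1.0]]))
y_toy = np.array([1, 0, 1, 0])

def smoothed_mean(X, y, label):
    # Sum each feature over the rows of one class, with add-one smoothing.
    p = np.asarray(X[y == label].sum(axis=0)).ravel()
    return (p + 1) / ((y == label).sum() + 1)

# Log-count ratio: positive when a term is more typical of class 1.
r_toy = np.log(smoothed_mean(X_toy, y_toy, 1) / smoothed_mean(X_toy, y_toy, 0))

# Scale every column by its ratio before fitting the classifier.
X_nb_toy = X_toy.multiply(r_toy.reshape(1, -1)).tocsr()
print(r_toy)  # log of [5, 0.25, 1.5]
```

Term 0 appears only in class-1 documents and gets a positive weight, while term 1 appears only in class-0 documents and gets a negative one.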

In [6]:
def bayes(Y_index, Y):
    p = X[Y==Y_index].sum(0)
    return (p+1) / ((Y==Y_index).sum()+1)

def get_model(Y):
    Y = Y.values
    r = np.log(bayes(1,Y) / bayes(0,Y))
    m = LogisticRegression(C=4, max_iter=10000)
    X_nb = X.multiply(r)
    return m.fit(X_nb, Y), r

We first run each label column of our training data through the `get_model` function, which returns the fitted model `m` and the log-count ratio `r`.<br> 
Then we run `m.predict_proba()` on the testing data scaled by the same `r`, keep the probability of the positive class, and store the results in the <b>submission.csv</b> file.
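The `[:,1]` indexing on the output of `predict_proba()` can be illustrated with a made-up one-feature data set. The data and the names `clf`, `proba` and `p_positive` are assumptions for this sketch; the `C` and `max_iter` values mirror the ones used in `get_model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data set: larger values belong to class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=4, max_iter=10000).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)   # shape (n_samples, 2); columns follow clf.classes_
p_positive = proba[:, 1]           # probability of class 1, the column kept for submission

print(clf.classes_)
print(p_positive.round(3))
```

Each row of `proba` sums to 1, so column 1 alone is enough to describe a binary label.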

In [7]:
preds = np.zeros((len(test), len(label_cols)))

for i, j in enumerate(label_cols):
    m,r = get_model(train[j])
    preds[:,i] = m.predict_proba(test_X.multiply(r))[:,1]

submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds, columns = label_cols)], axis=1)
submission.to_csv('submission.csv', index=False)
submission

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999988,0.109407,0.999987,0.002334,0.963042,0.095380
1,0000247867823ef7,0.002865,0.000588,0.001887,0.000089,0.002215,0.000330
2,00013b17ad220c46,0.011778,0.000843,0.005566,0.000090,0.003255,0.000288
3,00017563c3f7919a,0.000955,0.000220,0.001140,0.000157,0.001051,0.000291
4,00017695ad8997eb,0.009949,0.000464,0.001992,0.000115,0.002369,0.000339
...,...,...,...,...,...,...,...
153159,fffcd0960ee309b5,0.586452,0.000295,0.068315,0.000110,0.018617,0.000366
153160,fffd7a9a6eb32c16,0.017819,0.001031,0.019616,0.001074,0.018754,0.001581
153161,fffda9e8d6fafa9e,0.001381,0.000154,0.002696,0.000070,0.000937,0.000187
153162,fffe8f1340a79fc2,0.008045,0.000329,0.002248,0.000087,0.002192,0.000863
