### Model training

#### Import all the required packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import feature_extraction, model_selection, preprocessing, naive_bayes, pipeline, manifold
from sklearn.metrics import accuracy_score, classification_report
import sys  
sys.path.append('F:/AI/Toxic-comment-classifier/src')
from word_embeddings import w_embeddings


#### Load the processed dataset

In [3]:
df = pd.read_csv('../data/processed/processed_stem_data.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate,comment_text
0,0,0000997932d777bf,0,0,0,0,0,0,explan edit made usernam hardcor metallica fan...
1,1,000103f0d9cfb60f,0,0,0,0,0,0,aww match background colour seemingli stuck th...
2,2,000113f07ec002fd,0,0,0,0,0,0,hey man realli tri edit war guy constantli rem...
3,3,0001b41b1c6bb37e,0,0,0,0,0,0,ca make real suggest improv wonder section sta...
4,4,0001d958c54c6e35,0,0,0,0,0,0,sir hero chanc rememb page


In [5]:
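# Fill any missing comments with a placeholder string
# (assigning the result back avoids calling inplace=True on a column selection)
df['comment_text'] = df['comment_text'].fillna("missing")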

In [6]:
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
corpus = df['comment_text']

### Split the data into train and test datasets

In [18]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(corpus, df[labels], test_size=0.25, random_state=42)

In [19]:
X_train.shape, X_test.shape

((119678,), (39893,))

In [20]:
# Number of positive comments per label in the training set (y_train)
counts = []
for i in labels:
    counts.append((i, y_train[i].sum()))
df_stats = pd.DataFrame(counts, columns=['Labels', 'number_of_comments'])
df_stats

Unnamed: 0,Labels,number_of_comments
0,toxic,11479
1,severe_toxic,1189
2,obscene,6306
3,threat,373
4,insult,5866
5,identity_hate,1048


In [21]:
# Number of positive comments per label in the test set (y_test)
counts = []
for i in labels:
    counts.append((i, y_test[i].sum()))
df_stats = pd.DataFrame(counts, columns=['Labels', 'number_of_comments'])
df_stats

Unnamed: 0,Labels,number_of_comments
0,toxic,3815
1,severe_toxic,406
2,obscene,2143
3,threat,105
4,insult,2011
5,identity_hate,357


### Converting text comments into vectors using bag of words or TF-IDF 

Machine learning models do not accept raw text as input, so we need to convert the text data into vector form. The resulting vectors are also called **Word Embeddings**. Word Embeddings can be broadly classified as:
1. Frequency based - The most popular techniques are **Bag-of-Words** and **TF-IDF**
2. Prediction based - The most popular techniques are **Word2vec** and **GloVe**



Here we will be using **Bag-of-Words** and **TF-IDF**.<br>
<br>**Bag-of-Words (BOW)** - To get embeddings from BOW, we first build a dictionary of the words in the text data along with the number of occurrences of each word. The words in the dictionary are then sorted in descending order of occurrence and placed in the columns, where they serve as independent features; the rows are the sentences (samples). Each feature takes the value 0 or 1 depending on whether the word exists in the sentence.
<br>
**Disadvantage** of BOW - The embeddings we get from BOW contain only 0s and 1s, so no weight is given to words according to their importance in the sentence. That means we cannot capture the semantics of the sentence.
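
The binary BOW representation described above can be sketched on a toy corpus. This is only an illustration, assuming scikit-learn's `CountVectorizer` with `binary=True`; the sentences are made up and are not taken from the dataset. Note that scikit-learn orders the columns alphabetically rather than by frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration, not taken from the dataset)
sentences = ["good edit good page", "bad edit", "good page"]

# binary=True stores 1 if the word occurs in the sentence and 0 otherwise;
# the default (binary=False) would store the raw occurrence count instead
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(sentences)

# Column names, ordered by their column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
bow_rows = bow.toarray().tolist()

print(vocab)
for row in bow_rows:
    print(row)
```

The word "good" appears twice in the first sentence, but its feature value is still 1, which is exactly the loss of weighting mentioned as the disadvantage of BOW.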

**TF-IDF** - It stands for Term Frequency - Inverse Document Frequency.
<br> To get embeddings with TF-IDF, we calculate the term frequency and the inverse document frequency separately and then multiply them together to get TF-IDF.
Formulas to calculate TF-IDF: <br>
**TF**: $$\frac{\text{Number of repetitions of the word in a sentence}}{\text{Number of words in the sentence}}$$
**IDF**: $$\log\Bigg[\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\Bigg]$$
<br>
**TF-IDF**: $$\Bigg[\frac{\text{Number of repetitions of the word in a sentence}}{\text{Number of words in the sentence}}\Bigg] \times \log\Bigg[\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\Bigg]$$
<br>**TF-IDF** also needs a dictionary of words with their occurrence counts for the calculation. **TF** assigns more weight to a word that repeats multiple times in the sentence, whereas **IDF** decreases the weight of a word as the number of sentences containing it increases. Here, the feature vectors contain not only 0s and 1s but also other values, depending on the importance of the word in the sentence. This retains the semantics of the sentence to some extent, so it should perform better than BOW.
<br>Note that **TF-IDF** is zero for a word that exists in every sentence, and it gives more weight to words that occur less often. That could cause an over-fitting problem, but this is yet to be verified.
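
As a worked example, the formulas above can be evaluated by hand on a toy corpus. The sketch below is plain Python that follows the formulas literally; the sentences and the function names `tf`, `idf` and `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values differ from these.

```python
import math

# Toy corpus of tokenized sentences (made up for illustration)
sentences = [
    ["good", "edit", "good", "page"],
    ["bad", "edit"],
    ["good", "edit", "page"],
]

def tf(word, sentence):
    # Repetitions of the word in the sentence / number of words in the sentence
    return sentence.count(word) / len(sentence)

def idf(word, corpus):
    # log(total sentences / sentences containing the word);
    # assumes the word occurs in at least one sentence
    containing = sum(1 for s in corpus if word in s)
    return math.log(len(corpus) / containing)

def tf_idf(word, sentence, corpus):
    return tf(word, sentence) * idf(word, corpus)

print(tf_idf("good", sentences[0], sentences))  # (2/4) * log(3/2)
print(tf_idf("edit", sentences[0], sentences))  # word is in every sentence, so IDF = log(1) = 0
print(tf_idf("bad", sentences[1], sentences))   # (1/2) * log(3/1)
```

The word "edit" occurs in every sentence and therefore gets a weight of zero, while the rare word "bad" gets the largest weight of the three.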

In [11]:
Xv_train, Xv_test = w_embeddings(X_train, X_test, "tfidf")
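
The `w_embeddings` helper is imported from the project's `src` folder and its source is not shown in this notebook. The following is a hypothetical stand-in, assuming it wraps scikit-learn's vectorizers; the name `w_embeddings_sketch` and the `"bow"` option are assumptions, and the real helper may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def w_embeddings_sketch(train_text, test_text, method="tfidf"):
    """Hypothetical stand-in for the project's w_embeddings helper."""
    if method == "tfidf":
        vectorizer = TfidfVectorizer()
    elif method == "bow":  # assumed option name, not confirmed by the source
        vectorizer = CountVectorizer(binary=True)
    else:
        raise ValueError("unknown method: {}".format(method))
    # Learn the vocabulary on the training text only, then reuse it for the test text
    train_vectors = vectorizer.fit_transform(train_text)
    test_vectors = vectorizer.transform(test_text)
    return train_vectors, test_vectors

# Toy usage with made-up sentences
train_vecs, test_vecs = w_embeddings_sketch(["good edit", "bad page edit"], ["good page"])
print(train_vecs.shape, test_vecs.shape)
```

Fitting the vectorizer on the training split only keeps the test comments from influencing the vocabulary and the IDF weights.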

### Training

In [52]:
### Logistic regression
for label in labels:
    print('... Processing {}'.format(label))
    # train the model 
    logreg = OneVsRestClassifier(LogisticRegression(solver='sag'))
    logreg.fit(Xv_train, y_train[label])
    # report precision, recall and F1 on the test set
    prediction = logreg.predict(Xv_test)
    print('Validation accuracy is \n {}'.format(classification_report(y_test[label], prediction)))

... Processing toxic
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.92      1.00      0.96     36078
          1       0.94      0.16      0.28      3815

avg / total       0.92      0.92      0.89     39893

... Processing severe_toxic
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.99      1.00      0.99     39487
          1       0.57      0.07      0.13       406

avg / total       0.99      0.99      0.99     39893

... Processing obscene
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.96      1.00      0.98     37750
          1       0.96      0.26      0.41      2143

avg / total       0.96      0.96      0.95     39893

... Processing threat


Validation accuracy is 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00     39788
          1       0.00      0.00      0.00       105

avg / total       0.99      1.00      1.00     39893

... Processing insult
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.96      1.00      0.98     37882
          1       0.80      0.18      0.29      2011

avg / total       0.95      0.96      0.94     39893

... Processing identity_hate
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.99      1.00      1.00     39536
          1       0.00      0.00      0.00       357

avg / total       0.98      0.99      0.99     39893



<br> Checked the impact of using Bag-of-Words versus TF-IDF on the accuracy of logistic regression.
<br>Accuracy remained the same for `identity_hate`, `threat` and `severe_toxic`.
<br>Accuracy improved slightly with TF-IDF for `toxic`, `obscene` and `insult`, but the change was not significant.
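
A comparison of this kind can be sketched by training the same classifier on both representations. The example below is a minimal sketch on a made-up toy corpus that calls scikit-learn's vectorizers directly instead of the project's `w_embeddings` helper. It reports the F1 score of the positive class, as in the classification reports above; the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy data (made up for illustration); 1 marks a toxic comment
train_text = ["you are stupid", "thanks for the edit", "stupid idiot",
              "nice page thanks", "idiot go away", "good edit"]
train_y = [1, 0, 1, 0, 1, 0]
test_text = ["stupid page", "thanks good edit"]
test_y = [1, 0]

scores = {}
for name, vectorizer in [("bow", CountVectorizer(binary=True)),
                         ("tfidf", TfidfVectorizer())]:
    # Fit the vocabulary on the training text, then transform both splits
    X_tr = vectorizer.fit_transform(train_text)
    X_te = vectorizer.transform(test_text)
    model = LogisticRegression()
    model.fit(X_tr, train_y)
    # F1 of the positive (toxic) class on the held-out sentences
    scores[name] = f1_score(test_y, model.predict(X_te))

print(scores)
```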

In [64]:
### Naive Bayes
for label in labels:
    print('... Processing {}'.format(label))
    # train the model 
    nbayes = OneVsRestClassifier(naive_bayes.MultinomialNB())
    nbayes.fit(Xv_train, y_train[label])
    # report precision, recall and F1 on the test set
    prediction = nbayes.predict(Xv_test)
    print('Validation accuracy is \n {}'.format(classification_report(y_test[label], prediction)))


... Processing toxic
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.92      1.00      0.96     36078
          1       0.99      0.21      0.35      3815

avg / total       0.93      0.92      0.90     39893

... Processing severe_toxic
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.99      1.00      0.99     39487
          1       0.00      0.00      0.00       406

avg / total       0.98      0.99      0.98     39893

... Processing obscene


Validation accuracy is 
              precision    recall  f1-score   support

          0       0.96      1.00      0.98     37750
          1       0.97      0.33      0.49      2143

avg / total       0.96      0.96      0.95     39893

... Processing threat
Validation accuracy is 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00     39788
          1       0.00      0.00      0.00       105

avg / total       0.99      1.00      1.00     39893

... Processing insult


Validation accuracy is 
              precision    recall  f1-score   support

          0       0.96      1.00      0.98     37882
          1       0.78      0.20      0.32      2011

avg / total       0.95      0.96      0.94     39893

... Processing identity_hate
Validation accuracy is 
              precision    recall  f1-score   support

          0       0.99      1.00      1.00     39536
          1       0.00      0.00      0.00       357

avg / total       0.98      0.99      0.99     39893



