# NLP Toxcicty Machine Learning Model

!['X_now_tweeter'](https://www.stockvault.net/data/2019/10/07/269936/preview16.jpg)

The aim for this project is to predict text in order to create an optional filter for internet users to better combat toxcicty. To achieve this task I will use different supervised classification models in order to best predict when a comment is considered toxic. 

## Imports

In [3]:
import pandas as pd
import string
import numpy as np

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, hamming_loss
from sklearn.multioutput import MultiOutputClassifier

import re

# Data

In [4]:
df_train= pd.read_csv('data/train.csv', index_col = 'id')

In [5]:
df_train.head()

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


The train data is from https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data which was gathered from various wikipedia comments. It contains the comment text as well as the different categorizations of toxcicity it may be associated with. 

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159571 entries, 0000997932d777bf to fff46fc426af1f9a
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   comment_text   159571 non-null  object
 1   toxic          159571 non-null  int64 
 2   severe_toxic   159571 non-null  int64 
 3   obscene        159571 non-null  int64 
 4   threat         159571 non-null  int64 
 5   insult         159571 non-null  int64 
 6   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 9.7+ MB


In [11]:
# Amount of labels in comment_text
y.sum()

toxic            15294
severe_toxic      1595
obscene           8449
threat             478
insult            7877
identity_hate     1405
dtype: int64

## Data cleaning


The data looks quite clean as there are no null values. I believe the data manipulation will be: normazling the text by lower casing all the letters as well as removing any non-alpha or numeric charecter, tokenizing the words which will seperate the words and make them easier to work with, removing stop words - such as you, he, so, the -  to reduce dimensionality of the data as well as removing words that don't contribute as much meaning, and finally lematizing the words to find the root word of the tokens so the model can better understand the comments. 

In [7]:
# Function for improving parts of speech information

### get_wordnet_pos was taken from Lecture 51-nlp_modeling.ipynb 
### link to the lecture: https://github.com/dvdhartsman/NTL-DS-080723/blob/main/4phase/51-nlp_modeling.ipynb


def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [8]:
# Function for handling the transformation of data

### preprocess taken from nlp-sentiment-analysis
### link to the project: https://github.com/dvdhartsman/NLP-Sentiment-Analysis/blob/main/Text_Classification_Final_Notebook.ipynb

def preprocess(comment):
    """
    This is a function that is intended to handle all of the tokenization, lemmatization, and other
    preprocessing for our tweet data. It will make use of objects from other libraries, and will return
    a complete list of tokens that are ready to be vectorized into numerical data.
    """
    
    # Create a list of stopwords to be removed from our tokenized word list
    stops = stopwords.words("english")
    # Add punctuation to the list of stopwords
    stops += string.punctuation
    # Providing a regex pattern for the tokenizer to handle
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    # Instantiating a tokenizer
    tokenizer = RegexpTokenizer(pattern)
    # Creating a list of raw tokens
    raw_tokens = tokenizer.tokenize(comment)
    # Using a comprehension to lower case every token
    lower_tokens = [i.lower() for i in raw_tokens]
    # Remove the stopwords from the list of tokens
    stopped_words = [i for i in lower_tokens if i not in stops]
    
    # Adding parts of speech to prepare for Lemmatization
    
    # This is the initial method to get parts of speech
    stopped_words = pos_tag(stopped_words)
    
    # Get_wordnet_pos() is the function to modify the pos definitions/assignments, creates tuples of (<word>, <pos>)
    stopped_words = [(word[0], get_wordnet_pos(word[1])) for word in stopped_words]
    
    lemmatizer = WordNetLemmatizer() 
    
    # This corrects the parts of speech and maximizes the usefulness of the lemmatization!!!!!
    document = [lemmatizer.lemmatize(word[0], word[1]) for word in stopped_words]
    
    # Re-join the list of cleaned tokens
    cleaned_doc = " ".join(document)
    return cleaned_doc

In [13]:
# Splitting the comment_text and target
X = df_train.comment_text
y = df_train[['toxic', 'severe_toxic', 'obscene', 'threat','insult','identity_hate']]

In [14]:
# Example of what the text looks like before and after being processed
print(X.iloc[1])
print(preprocess(X.iloc[1]))

D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)
d'aww match background colour i'm seemingly stuck thanks talk january utc


In [15]:
# Tokenizing, removing stop words and lemmatizing the comment_text
X_clean = X.apply(preprocess)

In [16]:
# Vectorizing the data to begin modeling
# Count Vectorizer counts how many times a word appears per comment
# Term Frequency - Inverse Document Frequency measures a terms relevance based on how infrequnt it is in the corpus
count_vec = CountVectorizer(ngram_range=(1, 2), max_features=10000)
tf_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)

In [17]:
# Fitting the data
count_vec.fit(X_clean)
tf_vec.fit(X_clean)

In [18]:
# Transforming the data
X_count = count_vec.transform(X_clean)
X_tfidf = tf_vec.transform(X_clean)

In [19]:
X_count

<159571x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 4444977 stored elements in Compressed Sparse Row format>

In [20]:
X_tfidf

<159571x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 4444977 stored elements in Compressed Sparse Row format>

In [21]:
X_train_cv, X_test_cv, y_train, y_test = train_test_split(X_count,y, random_state = 42)

In [22]:
X_train_tf, X_test_tf, y_train, y_test = train_test_split(X_tfidf,y, random_state = 42)

# Modeling

## Baseline Dummy 

In [22]:
# Using a Dummy Model as a baseline and predict the most frequent class
dummy = DummyClassifier(strategy='most_frequent')

In [23]:
dummy_clf = MultiOutputClassifier(dummy).fit(X_train_cv,y_train)

In [24]:
accuracy_score(y_train, dummy_clf.predict(X_train_cv))

0.898343889436655

In [25]:
# Baseline accuracy of all comments classified as not any kind of toxic
accuracy_score(y_test, dummy_clf.predict(X_test_cv))

0.8982528263103803

In [26]:
hamming_loss(y_test, dummy_clf.predict(X_test_cv))

0.036919593245264413

In [27]:
# A classification report to compare to later models, looking to find the best recall scores

print(classification_report(y_test, dummy_clf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      3815
           1       0.00      0.00      0.00       406
           2       0.00      0.00      0.00      2143
           3       0.00      0.00      0.00       105
           4       0.00      0.00      0.00      2011
           5       0.00      0.00      0.00       357

   micro avg       0.00      0.00      0.00      8837
   macro avg       0.00      0.00      0.00      8837
weighted avg       0.00      0.00      0.00      8837
 samples avg       0.00      0.00      0.00      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression

### Logreg Count Vec

In [23]:
# A logistic regression model to classify comments, using a MultiOutputClassifier to predict all the labels at once
logreg_clf = MultiOutputClassifier(LogisticRegression()).fit(X_train_cv, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [24]:
accuracy_score(y_train, logreg_clf.predict(X_train_cv))

0.9319590902254382

In [25]:
accuracy_score(y_test, logreg_clf.predict(X_test_cv))

0.9147970821948713

In [36]:
print(classification_report(y_test, logreg_clf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.84      0.66      0.74      3815
           1       0.57      0.23      0.33       406
           2       0.88      0.68      0.77      2143
           3       0.28      0.12      0.17       105
           4       0.79      0.52      0.63      2011
           5       0.47      0.22      0.30       357

   micro avg       0.82      0.59      0.69      8837
   macro avg       0.64      0.41      0.49      8837
weighted avg       0.81      0.59      0.68      8837
 samples avg       0.06      0.05      0.05      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [27]:
#Example of what the predictions look like

logreg_clf.predict(X_test_cv)

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

### Logreg TF vec

In [29]:
# Logistic Regression using the TF-IDF data
logreg_tf = MultiOutputClassifier(LogisticRegression()).fit(X_train_tf, y_train)

In [30]:
logreg_tf.score(X_test_tf, y_test)

0.9191336826009575

In [31]:
accuracy_score(y_train, logreg_tf.predict(X_train_tf))

0.9246227376794398

In [32]:
accuracy_score(y_test, logreg_tf.predict(X_test_tf))

0.9191336826009575

In [37]:
print(classification_report(y_test, logreg_tf.predict(X_test_tf)))

              precision    recall  f1-score   support

           0       0.91      0.62      0.73      3815
           1       0.57      0.21      0.31       406
           2       0.92      0.63      0.75      2143
           3       0.61      0.10      0.18       105
           4       0.82      0.51      0.63      2011
           5       0.68      0.15      0.24       357

   micro avg       0.88      0.55      0.68      8837
   macro avg       0.75      0.37      0.47      8837
weighted avg       0.86      0.55      0.67      8837
 samples avg       0.06      0.05      0.05      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [39]:
accuracy_score(y_test, logreg_tf.predict(X_test_cv))

0.8704785300679317

In [38]:
# It seems the count vectorized test data performs extremely well on the TFIDF model in terms of recall

print(classification_report(y_test, logreg_tf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.66      0.82      0.73      3815
           1       0.28      0.82      0.42       406
           2       0.60      0.89      0.72      2143
           3       0.08      0.79      0.15       105
           4       0.50      0.84      0.63      2011
           5       0.18      0.78      0.29       357

   micro avg       0.49      0.84      0.62      8837
   macro avg       0.38      0.82      0.49      8837
weighted avg       0.56      0.84      0.66      8837
 samples avg       0.06      0.08      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [263]:
def final_model_predictor(text):
    label = []
    einput = [preprocess(text)]
    exi_vec = tf_vec.transform(einput)
    prediction = logreg_tf.predict(exi_vec)
    for i in range(len(y.columns)):
        if prediction[0][i] == 1:
            label.append(y.columns[i])
    return label

In [40]:
# Example of how the model works
example_input = 'you suck, I hope you have a bad day'

In [260]:
preprocess(example_input)

'suck hope bad day'

In [266]:
final_model_predictor("taco bell isn't as bad as people think")

[]

### Grid search

In [41]:
# Using a gird search to see if the hyperparameters can be optimized

grid = [{'estimator__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
              {'estimator__penalty':['none', 'elasticnet', 'l1', 'l2']},
              {'estimator__max_iter':['100','1000','10000']},
              {'estimator__C':[0.001, 0.01, 0.1, 1, 10, 100]}]

In [42]:
logreg_tf_gs = MultiOutputClassifier(LogisticRegression())

In [43]:
grid_search = GridSearchCV(estimator=logreg_tf_gs,
                          param_grid = grid,
                          cv = 5,
                          verbose = 1,
                          n_jobs=-1)

In [44]:
grid_search.fit(X_train_tf,y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


25 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\rchag\anaconda3\envs\tf_gpu\lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\rchag\anaconda3\envs\tf_gpu\lib\site-packages\sklearn\multioutput.py", line 538, in fit
    super().fit(X, Y, sample_weight=sample_weight, **fit_params)
  File "C:\Users\rchag\anaconda3\envs\tf_gpu\lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\rchag\anaconda3\envs\tf_gpu\lib\site-packages\sklearn\multioutput

In [49]:
grid_search.best_estimator_

In [50]:
#Relatively similar

accuracy_score(y_test, grid_search.best_estimator_.predict(X_test_cv))

0.8719574862757877

In [51]:

print(classification_report(y_test, grid_search.best_estimator_.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.66      0.82      0.73      3815
           1       0.29      0.82      0.43       406
           2       0.60      0.89      0.72      2143
           3       0.08      0.77      0.15       105
           4       0.50      0.84      0.63      2011
           5       0.18      0.78      0.30       357

   micro avg       0.50      0.84      0.62      8837
   macro avg       0.39      0.82      0.49      8837
weighted avg       0.57      0.84      0.67      8837
 samples avg       0.06      0.08      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Random Forest

### Random Forest Count Vectorization

In [57]:
from sklearn.ensemble import RandomForestClassifier

rfclf = MultiOutputClassifier(RandomForestClassifier(n_jobs = -1, random_state=42, max_depth=50, verbose = 0)).fit(X_train_cv, y_train)

In [58]:
accuracy_score(y_test, rfclf.predict(X_test_cv))

0.9049958639360289

In [59]:
print(classification_report(y_test, rfclf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.97      0.30      0.45      3815
           1       0.52      0.03      0.06       406
           2       0.95      0.36      0.52      2143
           3       0.00      0.00      0.00       105
           4       0.88      0.22      0.35      2011
           5       0.71      0.01      0.03       357

   micro avg       0.94      0.27      0.41      8837
   macro avg       0.67      0.15      0.23      8837
weighted avg       0.90      0.27      0.40      8837
 samples avg       0.03      0.02      0.02      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Random Forest TFIDF Vectorization

In [61]:
rfclf_tf = MultiOutputClassifier(RandomForestClassifier(n_jobs = -1, random_state=42, max_depth=50, verbose = 0)).fit(X_train_tf, y_train)

In [62]:
accuracy_score(y_test, rfclf_tf.predict(X_test_cv))

0.9037926453262477

In [65]:
# Looks like the magic is gone, back to our regularly scheduled programming

print(classification_report(y_test, rfclf_tf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.92      0.32      0.47      3815
           1       0.50      0.04      0.08       406
           2       0.93      0.36      0.52      2143
           3       0.00      0.00      0.00       105
           4       0.89      0.22      0.35      2011
           5       0.50      0.01      0.03       357

   micro avg       0.91      0.28      0.42      8837
   macro avg       0.62      0.16      0.24      8837
weighted avg       0.87      0.28      0.41      8837
 samples avg       0.03      0.02      0.02      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [66]:
# Looks like the recall rate is very poor, this is likely due to imbalanced classes
# and will be addressed later with downsampling
print(classification_report(y_test, rfclf.predict(X_test_tf)))

              precision    recall  f1-score   support

           0       1.00      0.03      0.05      3815
           1       0.00      0.00      0.00       406
           2       1.00      0.03      0.05      2143
           3       0.00      0.00      0.00       105
           4       0.85      0.01      0.03      2011
           5       0.00      0.00      0.00       357

   micro avg       0.97      0.02      0.04      8837
   macro avg       0.48      0.01      0.02      8837
weighted avg       0.87      0.02      0.04      8837
 samples avg       0.00      0.00      0.00      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Multinomial Naive Bayes

### MNB count vec

In [74]:
from sklearn.naive_bayes import MultinomialNB

mnb_cv = MultiOutputClassifier(MultinomialNB()).fit(X_train_cv, y_train)

In [75]:
accuracy_score(y_test, mnb_cv.predict(X_test_cv))

0.9033915724563206

In [76]:
print(classification_report(y_test, mnb_cv.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.81      0.63      0.71      3815
           1       0.37      0.62      0.46       406
           2       0.75      0.67      0.71      2143
           3       0.14      0.45      0.21       105
           4       0.66      0.61      0.63      2011
           5       0.23      0.43      0.30       357

   micro avg       0.65      0.62      0.64      8837
   macro avg       0.49      0.57      0.50      8837
weighted avg       0.71      0.62      0.66      8837
 samples avg       0.05      0.05      0.05      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### MNB TF-IDF

In [77]:
mnb_tf = MultiOutputClassifier(MultinomialNB()).fit(X_train_tf, y_train)

In [78]:
accuracy_score(y_test, mnb_tf.predict(X_test_cv))

0.8979018875491941

In [80]:
print(classification_report(y_test, mnb_tf.predict(X_test_cv)))

              precision    recall  f1-score   support

           0       0.71      0.74      0.73      3815
           1       0.39      0.56      0.46       406
           2       0.69      0.74      0.72      2143
           3       0.09      0.12      0.10       105
           4       0.63      0.70      0.66      2011
           5       0.25      0.29      0.27       357

   micro avg       0.64      0.70      0.67      8837
   macro avg       0.46      0.52      0.49      8837
weighted avg       0.65      0.70      0.67      8837
 samples avg       0.06      0.06      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [82]:
accuracy_score(y_test, mnb_tf.predict(X_test_tf))

0.9127165166821247

In [83]:
# Doesn't seem like we're finding anything special here
print(classification_report(y_test, mnb_tf.predict(X_test_tf)))

              precision    recall  f1-score   support

           0       0.91      0.52      0.67      3815
           1       0.61      0.23      0.33       406
           2       0.89      0.51      0.65      2143
           3       0.00      0.00      0.00       105
           4       0.81      0.44      0.57      2011
           5       0.40      0.06      0.11       357

   micro avg       0.86      0.46      0.60      8837
   macro avg       0.60      0.29      0.39      8837
weighted avg       0.84      0.46      0.59      8837
 samples avg       0.05      0.04      0.04      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Sequential

In [70]:
from keras.models import Sequential
from keras import layers
from keras.layers import Dropout


In [71]:
import tensorflow as tf
# works on VS code but not on jupyter
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  0


In [68]:
# A class to help build Neural Network models

def get_model(n_inputs, n_outputs, dropout = None, layer_amnt = 1):    
    model = Sequential()
    if dropout != None:
        model.add(layers.Dropout(0.2, input_shape = (n_inputs,)))
    else:
        model.add(layers.Dense(128, input_dim = n_inputs, activation = 'relu'))
    for i in range(layer_amnt):
        model.add(layers.Dense(128, activation = 'relu'))
    model.add(layers.Dense(n_outputs, activation = 'sigmoid'))
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics= ['accuracy'])
    return model

### Neural Netwmork 1 Count Vec

In [90]:
nn1 = get_model(10000, y_train.shape[1])

In [91]:
X_train_cv_df = pd.DataFrame(X_train_cv.toarray())
X_test_cv_df = pd.DataFrame(X_test_cv.toarray())

In [92]:
X_train_tf_df = pd.DataFrame(X_train_tf.toarray())
X_test_tf_df = pd.DataFrame(X_test_tf.toarray())

In [87]:
X_train_cv_df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [93]:
nn1.fit(X_train_cv_df, y_train, verbose = 1, epochs = 4, workers = -1)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x291b5d30a00>

In [94]:
nn1.evaluate(X_test_cv_df, y_test)



[0.10160305351018906, 0.8115960359573364]

In [95]:
# Predictions from X_test Neural Network 1

yhat_cv = nn1.predict(X_test_cv_df)

In [96]:
yhat_cv = yhat_cv.round()

In [111]:
# There is a discrepency in accuracy as the way they are measured is different, Keras measures total amount of predictions
# corrrect, while accuracy_score only measures the set of predictoins

accuracy_score(y_test, yhat_cv)

0.911939437996641

In [98]:
print(classification_report(y_test, yhat_cv))

              precision    recall  f1-score   support

           0       0.80      0.70      0.75      3815
           1       0.49      0.47      0.48       406
           2       0.84      0.73      0.78      2143
           3       0.52      0.29      0.37       105
           4       0.74      0.60      0.66      2011
           5       0.63      0.30      0.40       357

   micro avg       0.77      0.65      0.71      8837
   macro avg       0.67      0.51      0.57      8837
weighted avg       0.77      0.65      0.70      8837
 samples avg       0.06      0.06      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Neural Network 300

In [99]:
nn2 = get_model(300, y_train.shape[1])

In [100]:
X_train_cv_300 = X_train_cv_df.iloc[:,0:300]
X_test_cv_300 = X_test_cv_df.iloc[:,0:300]

In [101]:
# The most frequent 300 terms from count vectorizer
X_train_cv_300.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [102]:
nn2.fit(X_train_cv_300, y_train, verbose = 1, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x291b87e53d0>

In [109]:
nn2_test = nn2.predict(X_test_cv_300)
nn2_test = nn2_test.round()
accuracy_score(y_test, nn2_test)

0.8981274910385281

In [110]:
# Seems pretty much like the dummy model, not great
print(classification_report(y_test, nn2_test))

              precision    recall  f1-score   support

           0       0.42      0.00      0.00      3815
           1       0.00      0.00      0.00       406
           2       0.00      0.00      0.00      2143
           3       0.00      0.00      0.00       105
           4       0.00      0.00      0.00      2011
           5       0.00      0.00      0.00       357

   micro avg       0.33      0.00      0.00      8837
   macro avg       0.07      0.00      0.00      8837
weighted avg       0.18      0.00      0.00      8837
 samples avg       0.00      0.00      0.00      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### NN 1000

In [112]:
X_train_cv_1000 = X_train_cv_df.iloc[:,0:1000]
X_test_cv_1000 = X_test_cv_df.iloc[:,0:1000]

In [113]:
nn_1000_cv = get_model(1000, y_train.shape[1])

In [114]:
nn_1000_cv.fit(X_train_cv_1000, y_train, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x291fd5e5880>

In [115]:
nn_1000_test = nn_1000_cv.predict(X_test_cv_1000)
nn_1000_test = nn_1000_test.round()
accuracy_score(y_test, nn_1000_test)

0.9009600681823879

In [117]:
# Getting better but it seems the best is when more features are used

print(classification_report(y_test, nn_1000_test))

              precision    recall  f1-score   support

           0       0.81      0.13      0.23      3815
           1       0.62      0.01      0.02       406
           2       0.81      0.18      0.29      2143
           3       0.00      0.00      0.00       105
           4       0.69      0.16      0.26      2011
           5       0.00      0.00      0.00       357

   micro avg       0.77      0.14      0.23      8837
   macro avg       0.49      0.08      0.13      8837
weighted avg       0.73      0.14      0.23      8837
 samples avg       0.01      0.01      0.01      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Neural Network dropout

In [118]:
nndrop = get_model(10000, 6, dropout = True, layer_amnt = 3)

In [119]:
nndrop.fit(X_train_cv_df, y_train, verbose = 1, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x2921217dc70>

In [120]:
nndrop_test = nndrop.predict(X_test_cv)
nndrop_test = nndrop_test.round()

In [122]:
accuracy_score(y_test, nndrop_test)

0.917905396936806

In [123]:
print(classification_report(y_test, nndrop_test))

              precision    recall  f1-score   support

           0       0.85      0.68      0.76      3815
           1       0.58      0.27      0.37       406
           2       0.85      0.72      0.78      2143
           3       0.68      0.12      0.21       105
           4       0.74      0.65      0.69      2011
           5       0.58      0.25      0.35       357

   micro avg       0.81      0.64      0.72      8837
   macro avg       0.71      0.45      0.53      8837
weighted avg       0.80      0.64      0.71      8837
 samples avg       0.06      0.06      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### NND 2

In [124]:
nndrop2 = get_model(10000, 6, dropout = True, layer_amnt = 5)

In [125]:
nndrop2.fit(X_train_cv_df, y_train, verbose = 1, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x295914a3190>

In [126]:
nndrop2.evaluate(X_test_cv_df, y_test)



[0.06256882101297379, 0.9934073686599731]

In [127]:
nndrop2_test = nndrop2.predict(X_test_cv)
nndrop2_test = nndrop2_test.round()
accuracy_score(y_test, nndrop2_test)

0.9154237585541323

In [128]:
print(classification_report(y_test, nndrop2_test))

              precision    recall  f1-score   support

           0       0.84      0.67      0.75      3815
           1       0.83      0.05      0.09       406
           2       0.86      0.71      0.78      2143
           3       0.00      0.00      0.00       105
           4       0.75      0.58      0.65      2011
           5       0.52      0.04      0.07       357

   micro avg       0.83      0.59      0.69      8837
   macro avg       0.63      0.34      0.39      8837
weighted avg       0.80      0.59      0.67      8837
 samples avg       0.06      0.05      0.05      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### NN TF-IDF

In [129]:
nntf = get_model(X_train_tf_df.shape[1], y_train.shape[1], layer_amnt=1)

In [130]:
nntf.fit(X_train_tf_df, y_train, epochs = 3, verbose=1, shuffle = True)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x292185658e0>

In [131]:
nntf.evaluate(X_test_tf_df, y_test)



[0.06329762935638428, 0.8367382884025574]

In [132]:
nntf_test = nntf.predict(X_test_tf_df)
nntf_test = nntf_test.round()
accuracy_score(y_test, nntf_test)

0.9157997643696889

In [134]:
print(classification_report(y_test, nntf_test))

              precision    recall  f1-score   support

           0       0.85      0.67      0.75      3815
           1       0.54      0.25      0.34       406
           2       0.86      0.70      0.77      2143
           3       0.54      0.25      0.34       105
           4       0.74      0.62      0.68      2011
           5       0.69      0.29      0.40       357

   micro avg       0.81      0.63      0.71      8837
   macro avg       0.70      0.46      0.55      8837
weighted avg       0.80      0.63      0.70      8837
 samples avg       0.06      0.06      0.06      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [135]:
nntf_test2 = nntf.predict(X_test_cv_df)
nntf_test2 = nntf_test2.round()
accuracy_score(y_test, nntf_test2)

0.7721655428270624

In [136]:
print(classification_report(y_test, nntf_test2))

              precision    recall  f1-score   support

           0       0.34      0.90      0.50      3815
           1       0.44      0.50      0.47       406
           2       0.48      0.86      0.62      2143
           3       0.39      0.28      0.32       105
           4       0.54      0.78      0.63      2011
           5       0.45      0.43      0.44       357

   micro avg       0.41      0.82      0.55      8837
   macro avg       0.44      0.62      0.50      8837
weighted avg       0.43      0.82      0.55      8837
 samples avg       0.07      0.08      0.07      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### NNTF Dropout

In [97]:
nntf_drop = get_model(X_train_tf_df.shape[1], y_train.shape[1],dropout = True, layer_amnt=3)

In [98]:
nntf_drop.fit(X_train_tf_df, y_train, epochs = 3, verbose = True)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x26e29674a30>

In [99]:
nntf_drop_test = nntf_drop.predict(X_test_tf_df)
nntf_drop_test = nntf_drop_test.round()
accuracy_score(y_test, nntf_drop_test)

0.918080866317399

In [101]:
print(classification_report(y_test, nntf_drop_test))

              precision    recall  f1-score   support

           0       0.88      0.64      0.74      3815
           1       0.53      0.31      0.39       406
           2       0.88      0.71      0.78      2143
           3       0.50      0.01      0.02       105
           4       0.75      0.60      0.67      2011
           5       0.69      0.27      0.38       357

   micro avg       0.83      0.61      0.70      8837
   macro avg       0.70      0.42      0.50      8837
weighted avg       0.82      0.61      0.70      8837
 samples avg       0.06      0.05      0.05      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### NN TFIDF 1000

In [102]:
X_tf_1000 = X_train_tf_df.iloc[:,0:1000]
X_test_tf_1000 = X_test_tf_df.iloc[:,0:1000]

In [103]:
tf_1000 = get_model(X_tf_1000.shape[1], y_train.shape[1], dropout = True, layer_amnt= 3)

In [104]:
tf_1000.fit(X_tf_1000, y_train, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x26e2d1b3340>

In [105]:
tf_1000_test = tf_1000.predict(X_test_tf_1000)
tf_1000_test = tf_1000_test.round()
accuracy_score(y_test, tf_1000_test)

0.9018374150853533

In [106]:
print(classification_report(y_test, tf_1000_test))

              precision    recall  f1-score   support

           0       0.87      0.12      0.21      3815
           1       0.00      0.00      0.00       406
           2       0.81      0.17      0.28      2143
           3       0.00      0.00      0.00       105
           4       0.72      0.15      0.25      2011
           5       0.00      0.00      0.00       357

   micro avg       0.80      0.13      0.22      8837
   macro avg       0.40      0.07      0.12      8837
weighted avg       0.74      0.13      0.22      8837
 samples avg       0.01      0.01      0.01      8837



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Seems like my models are having a hard time with such a big data imbalance, I will now try with more balanced datasets

# Downsampled models

### Downsampling

In [108]:
y.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159571 entries, 0000997932d777bf to fff46fc426af1f9a
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   toxic          159571 non-null  int64
 1   severe_toxic   159571 non-null  int64
 2   obscene        159571 non-null  int64
 3   threat         159571 non-null  int64
 4   insult         159571 non-null  int64
 5   identity_hate  159571 non-null  int64
dtypes: int64(6)
memory usage: 8.5+ MB


In [109]:
y.sum()

toxic            15294
severe_toxic      1595
obscene           8449
threat             478
insult            7877
identity_hate     1405
dtype: int64

In [146]:
from sklearn.utils import resample

In [147]:
df_majority = df_train[df_train.iloc[:,1] == 0]
df_minority = df_train[df_train.iloc[:,1] == 1]

In [148]:
#sanity check
df_majority.head()

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [149]:
df_majority.info()

<class 'pandas.core.frame.DataFrame'>
Index: 144277 entries, 0000997932d777bf to fff46fc426af1f9a
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   comment_text   144277 non-null  object
 1   toxic          144277 non-null  int64 
 2   severe_toxic   144277 non-null  int64 
 3   obscene        144277 non-null  int64 
 4   threat         144277 non-null  int64 
 5   insult         144277 non-null  int64 
 6   identity_hate  144277 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 8.8+ MB


In [150]:
df_minority.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15294 entries, 0002bcb3da6cb337 to ffbdbb0483ed0841
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   comment_text   15294 non-null  object
 1   toxic          15294 non-null  int64 
 2   severe_toxic   15294 non-null  int64 
 3   obscene        15294 non-null  int64 
 4   threat         15294 non-null  int64 
 5   insult         15294 non-null  int64 
 6   identity_hate  15294 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 955.9+ KB


In [151]:
df_minority.head()

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
0005c987bdfc9d4b,Hey... what is it..\n@ | talk .\nWhat is it......,1,0,0,0,0,0
0007e25b2121310b,"Bye! \n\nDon't look, come or think of comming ...",1,0,0,0,0,0
001810bf8c45bf5f,You are gay or antisemmitian? \n\nArchangel WH...,1,0,1,0,1,1
00190820581d90ce,"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!",1,0,1,0,1,0


In [152]:
df_majority_downsampled = resample(df_majority,
                                  replace = False,
                                  n_samples = 15294)
df_downsampled = pd.concat(([df_majority_downsampled, df_minority]))

In [153]:
df_downsampled.head()

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1d57ec7bfa863ae5,"""\n\nJust remember that there are still specif...",0,0,0,0,0,0
b8d134969d9be4e3,"""The only other user to comment cited a non-ex...",0,0,0,0,0,0
6ffcefcc59519d40,"P.S. Oh, never mind. yes, please have those ro...",0,0,0,0,0,0
8ce3d2a2d966f51c,{{unblock reviewed|1= The unrelated person has...,0,0,0,0,0,0
d25975ce11376c7b,"Please stop your vandalism, or you are liable ...",0,0,0,0,0,0


In [154]:
X_downsampled = df_downsampled['comment_text']
y_downsampled = df_downsampled.drop('comment_text', axis = 1)

In [155]:
y_downsampled.sum()

toxic            15294
severe_toxic      1595
obscene           7981
threat             450
insult            7397
identity_hate     1312
dtype: int64

In [156]:
X_clean_ds = X_downsampled.apply(preprocess)

In [157]:
X_clean_ds.head()

id
1d57ec7bfa863ae5    remember still specific meaning craft trade ge...
b8d134969d9be4e3    user comment cite non existant wiki guideline ...
6ffcefcc59519d40    p oh never mind yes please rouge rule place pl...
8ce3d2a2d966f51c    unblock review unrelated person read discussio...
d25975ce11376c7b      please stop vandalism liable block edit jul utc
Name: comment_text, dtype: object

In [158]:
count_vec_ds = CountVectorizer(ngram_range=(1, 2), max_features=10000)
tf_vec_ds = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)

In [159]:
count_vec_ds.fit(X_clean_ds)
tf_vec_ds.fit(X_clean_ds)

In [160]:
ds_count = count_vec.transform(X_clean_ds)
ds_tf = tf_vec_ds.transform(X_clean_ds)


In [161]:
ds_count

<30588x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 702610 stored elements in Compressed Sparse Row format>

In [162]:
X_train_ds_cv, X_test_ds_cv, y_train_ds, y_test_ds = train_test_split(ds_count, y_downsampled, random_state = 42)

In [163]:
X_train_ds_tf, X_test_ds_tf, y_train_ds, y_test_ds = train_test_split(ds_tf, y_downsampled, random_state = 42)

## Downsampled Modeling

### Dummy (baseline)

In [164]:
dummy_ds = DummyClassifier(strategy='most_frequent')

In [165]:
dummy_ds.fit(X_train_ds_cv, y_train_ds)

In [166]:
print(classification_report(y_test_ds,(dummy_ds.predict(X_test_ds_cv))))

              precision    recall  f1-score   support

           0       0.49      1.00      0.66      3773
           1       0.00      0.00      0.00       387
           2       0.00      0.00      0.00      1969
           3       0.00      0.00      0.00       117
           4       0.00      0.00      0.00      1790
           5       0.00      0.00      0.00       325

   micro avg       0.49      0.45      0.47      8361
   macro avg       0.08      0.17      0.11      8361
weighted avg       0.22      0.45      0.30      8361
 samples avg       0.49      0.30      0.35      8361



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Logreg

#### Logreg cv

In [167]:
logreg_ds_cv = MultiOutputClassifier(LogisticRegression()).fit(X_train_ds_cv, y_train_ds)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [168]:
print(classification_report(y_test_ds, logreg_ds_cv.predict(X_test_ds_cv)))

              precision    recall  f1-score   support

           0       0.90      0.87      0.88      3773
           1       0.59      0.24      0.34       387
           2       0.89      0.75      0.82      1969
           3       0.42      0.22      0.29       117
           4       0.76      0.60      0.67      1790
           5       0.55      0.28      0.37       325

   micro avg       0.85      0.72      0.78      8361
   macro avg       0.69      0.49      0.56      8361
weighted avg       0.83      0.72      0.77      8361
 samples avg       0.40      0.36      0.36      8361



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [169]:
accuracy_score(y_test_ds, logreg_ds_cv.predict(X_test_ds_cv))

0.666928207140055

#### Logreg TF-IDF

In [170]:
logreg_ds_tf = MultiOutputClassifier(LogisticRegression()).fit(X_train_ds_tf, y_train_ds)

In [171]:
print(classification_report(y_test_ds, logreg_ds_tf.predict(X_test_ds_tf)))

              precision    recall  f1-score   support

           0       0.91      0.86      0.88      3773
           1       0.50      0.16      0.25       387
           2       0.91      0.71      0.80      1969
           3       0.53      0.08      0.13       117
           4       0.79      0.59      0.68      1790
           5       0.67      0.17      0.27       325

   micro avg       0.88      0.70      0.78      8361
   macro avg       0.72      0.43      0.50      8361
weighted avg       0.85      0.70      0.76      8361
 samples avg       0.40      0.35      0.36      8361



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [172]:
accuracy_score(y_test_ds, logreg_ds_tf.predict(X_test_ds_tf))

0.6729436380279848

In [175]:
# Just for grins
print(classification_report(y_test_ds, logreg_ds_tf.predict(X_test_ds_cv)))

              precision    recall  f1-score   support

           0       0.45      0.32      0.37      3773
           1       0.06      0.05      0.05       387
           2       0.16      0.08      0.11      1969
           3       0.00      0.00      0.00       117
           4       0.15      0.12      0.13      1790
           5       0.03      0.02      0.03       325

   micro avg       0.27      0.19      0.23      8361
   macro avg       0.14      0.10      0.12      8361
weighted avg       0.28      0.19      0.23      8361
 samples avg       0.14      0.12      0.11      8361



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Random Forest Classifier

#### RFC CV

In [176]:
rf_ds_cv = MultiOutputClassifier(RandomForestClassifier(n_jobs = -1, random_state=42, 
                                                        max_depth=50, verbose = 1)).fit(X_train_ds_cv, y_train_ds)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.0s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.9s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.1s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.5s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0

In [177]:
# Performing a lot better but still not the best model in terms of recall

print(classification_report(y_test_ds, rf_ds_cv.predict(X_test_ds_cv)))

[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s


              precision    recall  f1-score   support

           0       0.83      0.84      0.84      3773
           1       0.60      0.04      0.07       387
           2       0.94      0.55      0.69      1969
           3       1.00      0.02      0.03       117
           4       0.87      0.38      0.53      1790
           5       0.71      0.02      0.03       325

   micro avg       0.86      0.59      0.70      8361
   macro avg       0.83      0.31      0.37      8361
weighted avg       0.85      0.59      0.66      8361
 samples avg       0.41      0.30      0.33      8361



[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [178]:
accuracy_score(y_test_ds,rf_ds_cv.predict(X_test_ds_cv))

[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0

0.6014123185562966

#### RFC TF

In [179]:
rf_ds_tf = MultiOutputClassifier(RandomForestClassifier(n_jobs = -1, random_state=42, 
                                                        max_depth=50, verbose = 1)).fit(X_train_ds_tf, y_train_ds)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.1s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.0s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.1s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.7s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0

In [180]:
# Performing a lot better but still not the best model in terms of recall

print(classification_report(y_test_ds, rf_ds_tf.predict(X_test_ds_tf)))

[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.


              precision    recall  f1-score   support

           0       0.85      0.83      0.84      3773
           1       0.41      0.02      0.03       387
           2       0.94      0.53      0.67      1969
           3       0.75      0.03      0.05       117
           4       0.88      0.37      0.52      1790
           5       0.71      0.04      0.07       325

   micro avg       0.87      0.58      0.70      8361
   macro avg       0.75      0.30      0.37      8361
weighted avg       0.85      0.58      0.65      8361
 samples avg       0.40      0.30      0.33      8361



[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [181]:
accuracy_score(y_test_ds, rf_ds_tf.predict(X_test_ds_tf))

[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=24)]: Using backend ThreadingBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0

0.6036354125800968

### MNB downsampled

#### MNB cv

In [146]:
mnb_ds_cv = MultiOutputClassifier(MultinomialNB()).fit(X_train_ds_cv, y_train_ds)

In [148]:
print(classification_report(y_test_ds, mnb_ds_cv.predict(X_test_ds_cv)))

              precision    recall  f1-score   support

           0       0.93      0.74      0.82      3773
           1       0.43      0.48      0.45       387
           2       0.80      0.69      0.74      1972
           3       0.28      0.26      0.27       117
           4       0.70      0.62      0.66      1789
           5       0.41      0.28      0.34       325

   micro avg       0.79      0.67      0.72      8363
   macro avg       0.59      0.51      0.55      8363
weighted avg       0.80      0.67      0.73      8363
 samples avg       0.32      0.31      0.30      8363



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [149]:
accuracy_score(y_test_ds, mnb_ds_cv.predict(X_test_ds_cv))

0.6147508826990977

In [150]:
hamming_loss(y_test_ds, mnb_ds_cv.predict(X_test_ds_cv))

0.0931737936445665

#### MNB tf

In [151]:
mnb_df_tf = MultiOutputClassifier(MultinomialNB()).fit(X_train_ds_tf, y_train_ds)

In [154]:
print(classification_report(y_test_ds, mnb_df_tf.predict(X_test_ds_tf)))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87      3773
           1       0.59      0.11      0.18       387
           2       0.84      0.67      0.74      1972
           3       0.11      0.01      0.02       117
           4       0.74      0.58      0.65      1789
           5       0.41      0.04      0.08       325

   micro avg       0.83      0.68      0.75      8363
   macro avg       0.59      0.38      0.42      8363
weighted avg       0.79      0.68      0.72      8363
 samples avg       0.40      0.35      0.36      8363



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [155]:
accuracy_score(y_test_ds, mnb_df_tf.predict(X_test_ds_tf))

0.6360664312802407

In [156]:
hamming_loss(y_test_ds, mnb_df_tf.predict(X_test_ds_tf))

0.08356218124754806

### Neural Networks

#### NN downsample tf

In [157]:
X_train_ds_cv_df = pd.DataFrame(X_train_ds_cv.toarray())
X_test_ds_cv_df = pd.DataFrame(X_test_ds_cv.toarray())

X_train_ds_tf_df = pd.DataFrame(X_train_ds_tf.toarray())
X_test_ds_tf_df = pd.DataFrame(X_test_ds_tf.toarray())

In [158]:
nn_ds = get_model(X_train_ds_tf_df.shape[1], y_train_ds.shape[1])

In [159]:
nn_ds.fit(X_train_ds_tf_df, y_train_ds, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x26e364c93a0>

In [160]:
nn_ds.evaluate(X_test_ds_tf_df, y_test_ds)



[0.1999722719192505, 0.9351379871368408]

In [161]:
y_hat_ds_tf =  nn_ds.predict(X_test_ds_tf_df)
y_hat_ds_tf = y_hat_ds_tf.round()

In [163]:
accuracy_score(y_test_ds, y_hat_ds_tf)

0.6422126324048647

In [164]:
hamming_loss(y_test_ds, y_hat_ds_tf)

0.07706725949173968

In [165]:
print(classification_report(y_test_ds, y_hat_ds_tf))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87      3773
           1       0.46      0.28      0.35       387
           2       0.84      0.78      0.81      1972
           3       0.55      0.26      0.36       117
           4       0.71      0.70      0.71      1789
           5       0.58      0.37      0.45       325

   micro avg       0.81      0.76      0.78      8363
   macro avg       0.67      0.54      0.59      8363
weighted avg       0.80      0.76      0.77      8363
 samples avg       0.39      0.37      0.36      8363



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#### NN downsample cv

In [166]:
nn_ds_cv = get_model(X_train_ds_cv_df.shape[1], y_train_ds.shape[1])

In [167]:
nn_ds_cv.fit(X_train_ds_cv_df, y_train_ds, epochs = 3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x26e35e4e6a0>

In [168]:
nn_ds_cv.evaluate(X_test_ds_tf_df, y_test_ds)



[0.5780880451202393, 0.9894075989723206]

In [169]:
y_hat_ds_cv =  nn_ds_cv.predict(X_test_ds_cv_df)
y_hat_ds_cv = y_hat_ds_cv.round()

In [170]:
accuracy_score(y_test_ds, y_hat_ds_cv)

0.6449588073754413

In [171]:
print(classification_report(y_test_ds, y_hat_ds_cv))

              precision    recall  f1-score   support

           0       0.90      0.85      0.87      3773
           1       0.47      0.32      0.38       387
           2       0.81      0.84      0.82      1972
           3       0.52      0.12      0.19       117
           4       0.69      0.74      0.71      1789
           5       0.49      0.05      0.09       325

   micro avg       0.80      0.76      0.78      8363
   macro avg       0.64      0.49      0.51      8363
weighted avg       0.79      0.76      0.77      8363
 samples avg       0.37      0.37      0.36      8363



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
