# NLP Project - Toxic Comment Detection

The code below detects toxic comments in the Jigsaw dataset (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). A RandomForest model was chosen to predict, primarily, whether a comment is toxic; it is also trained to predict whether a comment is severely toxic, obscene, a threat, an insult and/or identity hate.

The code begins by importing all the relevant libraries, of which the RandomForestClassifier from the sklearn library is the main player. After the imports, a function called cleanText is defined. It takes the text to be cleaned as its input and cleans it as follows:
- First, the stopwords are loaded from nltk.corpus; the 'english' stopwords are selected because we are dealing with mostly English text;
- The decision was made to whitelist the following words: 'not', 'you', 'your', 'you're' and 'are'. 'not' is kept because it is a negation and can make a toxic comment non-toxic; the variations of 'you' are kept because they direct a statement at someone, which is common when swearing;
- The text is converted to lowercase so we do not have to deal with capital letters; keeping the capitalization was also found not to improve our accuracy;
- English contractions are expanded to their full spelling;
- The stopwords are deleted from the text, except for the whitelisted words;
- All numerals are deleted from the data, because one can hardly be toxic by using numbers;
- Punctuation is deleted;
- In the final step the word tokens are joined back together into a single string.
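
To illustrate the whitelist step in isolation, here is a minimal sketch that uses a small hand-written stopword set instead of the NLTK list. The names `DEMO_STOPWORDS`, `WHITELIST` and `remove_stopwords` are hypothetical and exist only for this illustration; the notebook itself relies on `stopwords.words('english')`.

```python
# Hypothetical stand-in for the NLTK English stopword list (illustration only).
DEMO_STOPWORDS = {"i", "you", "your", "are", "not", "the", "a", "is", "do"}

# Words kept even though they are stopwords: negation and second-person forms.
WHITELIST = {"not", "you", "your", "you're", "are"}


def remove_stopwords(tokens, stopwords=DEMO_STOPWORDS, whitelist=WHITELIST):
    """Drop stopwords from a token list, except the whitelisted ones."""
    effective = set(stopwords) - set(whitelist)
    return [tok for tok in tokens if tok not in effective]


tokens = "you are not the nicest person".split()
print(remove_stopwords(tokens))  # ['you', 'are', 'not', 'nicest', 'person']
```

Only "the" is dropped here: without the whitelist, "you", "are" and "not" would disappear as well, and the comment would lose both its target and its negation.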


In [16]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import collections
import re
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from string import punctuation
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer


#cleaning the available text to get clean data
def cleanText(text):

    stopw = stopwords.words('english')

    # keep 'not' because it can negate an insult
    # keep 'you', 'your', "you're" and 'are' because toxic comments are
    # usually aimed at someone
    stopw.remove('not')
    stopw.remove('you')
    stopw.remove('your')
    stopw.remove('you\'re')
    stopw.remove('are')

    #make the whole text lowercase so we don't make differences between capitalization
    text = text.lower()

    # expand common contractions; the text is lowercase at this point,
    # so the patterns are written in lowercase as well
    text = re.sub(r"'s", " ", text)
    text = re.sub(r" whats ", " what is ", text)
    text = re.sub(r"\bi'm\b", "i am", text)
    text = re.sub(r"shouldn't", " should not ", text)
    text = re.sub(r"weren't", " were not ", text)
    text = re.sub(r"can't", " can not ", text)
    # generic fallback for the remaining "n't" forms, e.g. "don't" -> "do not"
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'ve", " have ", text)
    text = re.sub(r"'ll", " will ", text)

    #remove all the stopwords and remove all non letters
    words = word_tokenize(text)
    tokens = [word for word in words if word not in stopw]

    #remove all non letters in the dataset
    tokens = [word for word in tokens if re.match(r'[^\W\d]*$', word)]

    #remove all URLs in the dataset
    # TODO: write a regex for this

    text = ' '.join(tokens)

    # remove any remaining punctuation characters
    text = ''.join(char for char in text if char not in punctuation)

    # an empty comment simply yields an empty string
    return text
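
The URL-removal step that cleanText leaves as a TODO could be sketched as follows. The pattern is an assumption about what counts as a URL here (anything starting with `http://`, `https://` or `www.` up to the next whitespace), and `strip_urls` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumed pattern: http://, https:// or www. followed by non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)


def strip_urls(text):
    """Replace URLs with a space, then collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


print(strip_urls("see https://example.com/page?x=1 and www.example.org now"))
```

A step like this would presumably have to run before tokenization and punctuation removal, since those break a URL into fragments that the pattern no longer recognizes.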


With the cleaning method finished, the dataset is imported and the method is applied to the dataFrame. The cleaned text replaces the original comments in the dataFrame, which makes it easier to work with later on. The commented-out code was used to investigate the text and decide what needed to be cleaned; this is easier to do in spreadsheet software such as Excel than in an IDE. The original text was also exported to check how well it was being cleaned.

In [35]:
# importing the training data using pandas.
df_train = pd.read_csv('Data/train.csv')

# save the original text to easily inspect it and derive what has to be cleaned
# df_train['comment_text'].to_csv('Data/OriginalText.csv')

# save the cleaned text to easily inspect it
# df_train['comment_text'].to_csv('Data/cleanedText.csv')

# clean the text
df_train['comment_text'] = df_train['comment_text'].apply(cleanText)


# Preparing the dataFrame for the RandomForestClassifier

With a clean dataset, the first step towards implementing the RandomForestClassifier is complete. However, more preparation is needed. The RandomForestClassifier (from here on abbreviated as RFC) needs numeric input, and the dataFrame currently holds only strings. These strings therefore have to be converted to floats. The choice was made to do this with a word2vec model from the Gensim library. Word2Vec assigns a vector to every word in the dataset; to make this usable, the vectors of all words in a comment are combined into a single averaged vector, so that every comment is represented by one fixed-length vector. To do this, two methods were constructed:

averageVecValue:
   - Inputs:
        comment; the sentence which has to be averaged
        model; the word2vec model
        vectorSize; the vector size that was chosen to initialize the word2vec model
        vocab; the vocabulary built by the word2vec model

The method takes every word in the comment and checks whether it is in the vocabulary. The vectors of the words that are found are summed, and the sum is divided by vectorSize to obtain the final vector. Note that dividing by vectorSize rescales the sum by a constant; a true average would divide by the number of words found.

word2Vec:
   - Inputs:
        cleanedData; the cleaned comments (currently not used inside the function)
        dataSet; the full dataFrame

The method initializes the word2vec model from the gensim library and uses averageVecValue to compute the vector of each comment. The vectors of all comments are collected in the list vectorizedData, which is then returned.
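
A conventional sentence embedding divides the summed word vectors by the number of in-vocabulary words rather than by the vector dimension. The sketch below shows that variant under stated assumptions: `toy_vectors` is a made-up dictionary standing in for a trained Word2Vec model, and `mean_vector` is a hypothetical helper, not the notebook's averageVecValue.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
toy_vectors = {
    "you": np.array([1.0, 0.0, 2.0]),
    "are": np.array([3.0, 2.0, 0.0]),
}


def mean_vector(tokens, vectors, vector_size):
    """Average the vectors of the tokens found in `vectors`.

    Returns a zero vector when none of the tokens is known.
    """
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


print(mean_vector(["you", "are", "unknownword"], toy_vectors, 3))
```

The unknown token is skipped, so the result is the element-wise mean of the two known vectors, (2, 1, 1). With this definition, long and short comments end up on the same scale.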


In [18]:
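def averageVecValue(comment, model, vectorSize, vocab):
    # comment is expected to be a list of word tokens; iterating over a
    # raw string would yield single characters instead of words
    Vector = np.zeros(vectorSize)

    # sum the vectors of all tokens known to the word2vec model
    for word in comment:
        if word in vocab:
            Vector += np.array(model.wv.get_vector(word))

    # NOTE: this divides by the vector dimension, which rescales the sum by a
    # constant; a true mean would divide by the number of words found
    Vector_value = np.divide(Vector, vectorSize)

    return Vector_value.tolist()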

In [19]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

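def word2Vec(cleanedData, dataSet):
    # tokenize every cleaned comment (cleanedData itself is not used here)
    dataSet['comment_text_tokenized'] = dataSet['comment_text'].apply(word_tokenize)
    tokens = dataSet['comment_text_tokenized']

    # gensim 3.x API: gensim 4.x renames `size` to `vector_size` and
    # replaces `wv.vocab` with `wv.key_to_index`
    vectorSize = 300
    word2vec = Word2Vec(tokens, min_count=2, size=vectorSize)
    vocab = word2vec.wv.vocab

    # build one fixed-length vector per comment from its tokens
    vectorizedData = []
    for index, row in dataSet.iterrows():
        vectorizedData.append(averageVecValue(row['comment_text_tokenized'], word2vec, vectorSize, vocab))

    return vectorizedData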


# Initializing the RFC

To initialize the RFC, the dataSet is first reduced to a sample of 25000 comments, because training on the full dataset would take too long for testing. The Xtrain variable holds the computed word2vec vectors for every comment, and Ytrain holds the labels that come with our dataSet. Because this is the first run, and it is interesting to see how the model performs without tuned parameters, the RFC is created with its default parameters.

The training data is split into a train and a test set, and the train part is fitted to the RFC. To get a simple view of its accuracy the .score method is used; however, this is not a really good indication. To get a better indication, the RFC makes a prediction on the test part and metrics from the sklearn library are used to assess the model's performance.
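
For multi-label targets, the `score` method of a scikit-learn classifier reports subset accuracy: a sample counts as correct only if all of its labels match. Because most comments carry no label at all, a model that flags nothing can still score highly. The toy data below is made up to illustrate this effect; it is not taken from the dataset.

```python
import numpy as np

# Toy labels: 10 comments, 2 labels each; only two comments carry any label.
y_true = np.zeros((10, 2), dtype=int)
y_true[0] = [1, 0]
y_true[1] = [1, 1]

# A degenerate "model" that never predicts any label.
y_pred = np.zeros_like(y_true)

# Subset accuracy: every label in the row must match.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Recall over all positive labels: how many of the true labels were found.
true_positives = np.sum((y_true == 1) & (y_pred == 1))
recall = true_positives / np.sum(y_true == 1)

print(subset_accuracy, recall)
```

The do-nothing model reaches a subset accuracy of 0.8 with a recall of 0.0, which is why precision, recall and AUC are worth inspecting alongside the accuracy.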

In [20]:
# setting up the X training comments (vectorize them to be able to be used as input for model) and Y training labels
print("setting up training data ")
df_train = df_train.sample(n=25000, random_state=33)
Xtrain = word2Vec(df_train['comment_text'], df_train)
Ytrain = df_train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]


# importing the test data set to test the algorithm
df_test = pd.read_csv('Data/test.csv')


#run the algorithm for the first time and get an idea of the accuracy with the basic parameters.
print("started training the model")
rf_model = RandomForestClassifier()
# rf_model.fit(Xtrain, Ytrain)

# test the accuracy of the model on a split training dataset
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score


xtrain, xtest,ytrain,ytest = train_test_split(Xtrain,Ytrain,test_size=0.33, random_state=66)
rf_model.fit(xtrain, ytrain)
print("RF Accuracy: %0.2f%%" % (100 * rf_model.score(xtest, ytest)))

#predictions
rf_predict = rf_model.predict(xtest)
rfc_cv_score = cross_val_score(rf_model, Xtrain, Ytrain, cv=10, scoring='roc_auc')

setting up training data 
started training the model




RF Accuracy: 89.27%


# Tests

The tests conducted can be found below. Judged by accuracy alone, the model performs quite well with just the default parameters: the RFC scores 89.27% on the held-out split.
The other metrics are less favourable. In the classification report the micro-averaged precision is 0.58 but the recall is only 0.11, classes 3 and 5 (threat and identity hate) are never detected, and the mean cross-validated AUC is about 0.68.

However, this is just with the default parameters of the RFC. To get a higher accuracy, we can search for better hyperparameter values. Therefore, the choice was made to implement a RandomizedSearchCV, which is explained later.
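
One caveat about the confusion matrix in this section: it is built by taking `argmax` over the six label columns. `argmax` returns the index of the first maximum, so a comment without any label and a comment labelled only as toxic both map to index 0. The sketch below uses made-up rows to show the effect and a per-label count as one possible alternative.

```python
import numpy as np

# Three toy comments with six label columns:
# toxic, severe_toxic, obscene, threat, insult, identity_hate
y = np.array([
    [0, 0, 0, 0, 0, 0],  # clean comment, no label set
    [1, 0, 0, 0, 0, 0],  # toxic only
    [1, 0, 1, 0, 1, 0],  # toxic, obscene and insult
])

# argmax returns the index of the first maximum, so all three rows map to 0.
print(y.argmax(axis=1))

# Per-label counts keep the information that argmax discards.
y_pred = np.zeros_like(y)  # hypothetical predictions: nothing flagged
false_negatives = np.sum((y == 1) & (y_pred == 0), axis=0)
print(false_negatives)
```

This is consistent with nearly all entries of the printed matrix ending up in the first column. scikit-learn also offers `multilabel_confusion_matrix` in `sklearn.metrics`, which returns one 2x2 matrix per label.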
    

In [21]:
#printing tests 
#TODO: explain the results... 
print("Confusion Matrix:")
print(confusion_matrix(ytest.values.argmax(axis=1), rf_predict.argmax(axis=1)))
print('\n')

# TODO: recall and f-score warning should be fixed 
print("Classification Report")
print(classification_report(ytest, rf_predict))
print('\n')

print("All Cross Validation Scores")
print(rfc_cv_score)
print('\n')


print("Mean Cross Validation Score")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())


Confusion Matrix:
[[8210    1    0    0    0]
 [  24    0    0    0    0]
 [   1    0    0    0    0]
 [  11    0    0    0    0]
 [   3    0    0    0    0]]


Classification Report
              precision    recall  f1-score   support

           0       0.55      0.12      0.19       798
           1       0.38      0.06      0.11        81
           2       0.67      0.14      0.24       418
           3       0.00      0.00      0.00        24
           4       0.59      0.11      0.18       409
           5       0.00      0.00      0.00        88

   micro avg       0.58      0.11      0.19      1818
   macro avg       0.37      0.07      0.12      1818
weighted avg       0.54      0.11      0.18      1818
 samples avg       0.01      0.01      0.01      1818



All Cross Validation Scores
[0.67092956 0.67907178 0.72430859 0.6652532  0.65776084 0.69016694
 0.69766884 0.69008297 0.63862162 0.67633277]


Mean Cross Validation Score
Mean AUC Score - Random Forest:  0.679019710639
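
As a reading aid for the AUC figures: the ROC AUC can be interpreted as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. A minimal pure-Python sketch of that definition follows; the labels and scores are made up, and `roc_auc` is a hypothetical helper rather than the scikit-learn implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a mean of about 0.68 sits between the two.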



# Test of the optimized model and RandomizedSearchCV 

As mentioned, a RandomizedSearchCV was used to find good hyperparameters for the RFC. A RandomizedSearchCV draws random combinations of values from the parameter grid and evaluates the model for a predefined number of such combinations.

To elaborate, first the parameters to be investigated have to be chosen. In this case these are the most important hyperparameters of the RFC: the number of estimators, the maximum number of features, the maximum tree depth and the bootstrap Boolean value.
The number of iterations of the RandomizedSearchCV is set to 50, and each of these 50 candidates is evaluated with 2-fold cross-validation. In total the algorithm therefore fits 100 models; this was again chosen to avoid spending hours waiting for the algorithm. After the search has finished, the best parameters found can be retrieved and inserted into a new model, which is then tested again. The parameters found in my initial run were: n_estimators = 200, max_features = 'sqrt', max_depth = 140, bootstrap = 'true'.
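
The idea behind the random search can be sketched without scikit-learn. The grid below is a shortened, illustrative one, and `sample_candidates` is a hypothetical helper. Unlike scikit-learn, which samples without replacement when every parameter is given as a list, this sketch may draw the same combination more than once.

```python
import random

# Candidate values per hyperparameter (a shortened, illustrative grid).
grid = {
    "n_estimators": [200, 400, 600],
    "max_features": ["sqrt", "log2"],
    "max_depth": [100, 140, None],
    "bootstrap": [True, False],
}


def sample_candidates(grid, n_iter, seed):
    """Draw n_iter random parameter combinations from the grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


n_iter, cv = 50, 2
candidates = sample_candidates(grid, n_iter, seed=42)
total_fits = n_iter * cv  # each candidate is trained once per fold
print(len(candidates), total_fits)  # 50 100
```

Each candidate would then be scored by cross-validation, and the best-scoring combination is kept.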

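Finally, to test whether the model was improved by the found hyperparameters, the same tests are conducted as before. This resulted in an accuracy of 89.70% (against 89.27% before) and a micro-averaged recall of 0.12 (against 0.11).
Note that the support in the two classification reports differs (377 against 1818 positive labels), so the two runs were evaluated on test sets of different size and the figures are not strictly comparable.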


In [36]:
# testing if the retrieved parameters are actually better
# parameters;
# n_estimators = 200 , max_features = 'sqrt' , max_depth = 140 , bootstrap = 'true'
    
#initializing the model and fitting it to the training data
print("beginning")


rf_optimized_model = RandomForestClassifier(n_estimators=200, max_features='sqrt', max_depth=140)
rf_optimized_model.fit(xtrain,ytrain)

# normal score of the model
print("Optimized Model Accuracy")
print("RF Accuracy: %0.2f%%" % (100 * rf_optimized_model.score(xtest, ytest)))


print("predicting")
# prediction and crossvalidation
rf_predict_optimized = rf_optimized_model.predict(xtest)
# rfc_cv_score_optimized = cross_val_score(rf_optimized_model, Xtrain, Ytrain, cv=10, scoring='roc_auc')

# tests
# TODO: explain the results.
print("Optimized model Confusion Matrix:")
print(confusion_matrix(ytest.values.argmax(axis=1), rf_predict_optimized.argmax(axis=1)))
print('\n')

#f1 score and recall things should be fixed
print("Optimized Model Classification Report")
print(classification_report(ytest, rf_predict_optimized))
print('\n')

# print("Optimized Model All Cross Validation Scores")
# print(rfc_cv_score_optimized)
# print('\n')

# print("Optimized Model Mean Cross Validation Score")
# print(rfc_cv_score_optimized.mean())
    

beginning
Optimized Model Accuracy
RF Accuracy: 89.70%
predicting
Optimized model Confusion Matrix:
[[1644    0    0]
 [   4    0    0]
 [   2    0    0]]


Optimized Model Classification Report
              precision    recall  f1-score   support

           0       0.62      0.13      0.21       157
           1       0.40      0.12      0.18        17
           2       0.67      0.14      0.23        99
           3       0.00      0.00      0.00         2
           4       0.48      0.12      0.19        84
           5       0.00      0.00      0.00        18

   micro avg       0.58      0.12      0.20       377
   macro avg       0.36      0.08      0.14       377
weighted avg       0.56      0.12      0.20       377
 samples avg       0.01      0.01      0.01       377







In [None]:
# improve the model by searching for better random forest parameters with RandomizedSearchCV
# hyperparameter tuning over 4 parameters
# only has to be run once!
# beware of overfitting
# https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74 with modified parameters
from sklearn.model_selection import RandomizedSearchCV

# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]

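# number of features at every split
# NOTE: for classifiers 'auto' meant the same as 'sqrt' in older scikit-learn
# releases; it was deprecated and later removed in newer ones
max_features = ['auto', 'sqrt']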

# max depth
max_depth = [int(x) for x in np.linspace(100, 500, num=11)]
max_depth.append(None)

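# bootstrap: use real booleans; the strings 'true' and 'false' are both
# truthy in Python, so they would not distinguish the two settings
bootstrap = [True, False]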

# create random grid
random_grid = {
        'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'bootstrap': bootstrap
}

# Random search of parameters
rfc_random = RandomizedSearchCV(estimator=rf_model, param_distributions=random_grid, n_iter=50, cv=2, verbose=2,
                                    random_state=42, n_jobs=-1)

# Fit the model
rfc_random.fit(xtrain, ytrain)
# print the best result
print(rfc_random.best_params_)

#result = {'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 140, 'bootstrap': 'true'}
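
The recorded result lists `bootstrap` as the string `'true'`, whereas scikit-learn documents this parameter as a boolean. The short check below shows why strings are a poor substitute; it demonstrates plain Python truthiness only and makes no claim about how a particular scikit-learn version handles such values.

```python
# Non-empty strings are truthy in Python, whatever they spell.
for value in ["true", "false", True, False]:
    print(repr(value), "->", bool(value))

# A grid with real booleans distinguishes the two settings.
bootstrap_options = [True, False]
```

Both `'true'` and `'false'` evaluate to `True`, so a grid built from these strings would, at best, never actually switch bootstrapping off.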
