# Training Data for Toxicity Report


One of the major issues in working with the public is assessing when comments cross the line between playful banter or informative dialogue and toxicity. At small scale, a human moderator can flag and address toxic comments by hand. Large public forums like Twitter and Reddit do not have that option: the sheer number of posts and comments uploaded every day is far more than any person could review.

This is why natural language processing matters for companies that run public platforms like these. Utilizing the power of computers to identify patterns based on training data, data scientists can automate the process of flagging text for toxicity. That is the ultimate goal of this notebook: to develop a logistic regression model that can read text and flag it if it contains instances of toxicity, obscenities, threats, insults, or hateful language. After training, the model is put to practical use flagging the unlabeled comments found in the test.csv file.

## Importing Modules

The first step in solving this problem is to import the useful modules and create a function that normalizes the text data. To begin, I imported pandas to help parse the CSV files and identify the relevant data. Then, I imported sklearn and some specific functions from that module used to train my program for multilabel identification.

In [217]:
import pandas as pd # to read csv file
import numpy as np

from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

Next, it's imperative to prepare the vectorizer and the model ahead of time. For this, I set up a TfidfVectorizer. Although NLTK's word_tokenize is imported above, the vectorizer as configured here tokenizes with a regular expression (token_pattern) that also accepts angle-bracketed placeholder tokens. It also filters out English stopwords, as these aren't necessarily relevant. For the model, I have chosen the One-vs-Rest classifier strategy, which trains one logistic regression per label. The goal here is to flag the text for multiple types of toxicity when necessary, and this strategy allows for this.

In [218]:
vect = TfidfVectorizer(stop_words='english', token_pattern=r"<?\w+>?")
clf = OneVsRestClassifier(LogisticRegression(max_iter=3000)) # One-vs-Rest multilabel classification
    # with one logistic regression per label; at most 3000 solver iterations each
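To see what the custom `token_pattern` does, here is a minimal sketch using only the standard `re` module. It assumes the vectorizer applies the pattern with `findall`-style matching, which is how scikit-learn's tokenizer behaves for a pattern without capture groups. The sample sentence is made up for illustration.

```python
import re

# Same pattern passed to TfidfVectorizer above: an optional "<", one or more
# word characters, and an optional ">".
token_pattern = re.compile(r"<?\w+>?")

sample = "check <url> from <mention> now"  # hypothetical, already lowercased
tokens = token_pattern.findall(sample)
print(tokens)
```

Each angle-bracketed placeholder comes out as a single token. Unlike scikit-learn's default pattern, this one also keeps single-character words.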

## ETL for Language Data

For the next step, I defined a function that cleans and normalizes the raw text before vectorization. The function performs a series of normalization tasks through regular expressions and a helper function that expands contractions. This helps ensure that the data is clean for the TfidfVectorizer so that URLs, hashtags, mentions, and punctuation do not influence the features.

In [180]:
import re, string, json

def normalize_data(data):
    # Load the contraction -> expansion mapping; the context manager closes the file
    with open('english_contractions.json') as f:
        contractions = json.load(f)
    data = data.str.lower()
    # Replace URLs, hashtags, and mentions with placeholder tokens
    data = data.str.replace(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',
                            '<URL>',
                            regex=True)
    data = data.str.replace(r'#(\w+)',
                            '<HASHTAG>',
                            regex=True)
    data = data.str.replace(r'@(\w+)',
                            '<MENTION>',
                            regex=True)
    data = data.apply(lambda text: normalize_contractions(text, contractions))
    # Strip punctuation except angle brackets, which delimit the placeholders;
    # re.escape keeps characters such as ], \ and ^ literal inside the class
    punctuation = '[' + re.escape(re.sub('[<>]', '', string.punctuation) + '“”¨«»®´·º½¾¿¡§£₤‘’') + ']'
    data = data.str.replace(punctuation,
                            '',
                            regex=True)
    # Remove stray angle brackets that are not part of a placeholder
    data = data.str.replace(r'<(?!(?:URL|HASHTAG|MENTION)>)|(?<!<URL)(?<!<HASHTAG)(?<!<MENTION)>',
                            '',
                            regex=True)
    data = data.str.replace(r'\s+',
                            ' ',
                            regex=True)
    data = data.str.replace(r'\d+',
                            '',
                            regex=True)
    return data

def normalize_contractions(text, contractions):
    new_token_list = []
    token_list = text.split()
    for word in token_list:
        if word.lower() in contractions:
            # an expansion may contain several words, so append every one of them
            new_token_list.extend(contractions[word.lower()].split())
        else:
            new_token_list.append(word)
    sentence = " ".join(new_token_list).strip(" ")
    return sentence
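The same steps can be tried on a single string without pandas. The sketch below is illustrative only: `CONTRACTIONS` is a hypothetical stand-in for the contents of english_contractions.json (assumed to map each contraction to its expansion), the URL pattern is a simplified assumption rather than the one used above, and the function names are made up for this example.

```python
import re

# Hypothetical stand-in for english_contractions.json: contraction -> expansion.
CONTRACTIONS = {"don't": "do not", "i'd've": "i would have"}

def expand_contractions(text, contractions):
    """Replace each whitespace-separated token that is a known contraction."""
    words = []
    for word in text.split():
        # Unknown words pass through unchanged; expansions may span several words.
        words.extend(contractions.get(word.lower(), word).split())
    return " ".join(words)

def replace_entities(text):
    """Lowercase the text, then swap URLs, hashtags, and mentions for placeholders."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<URL>", text)  # simplified URL pattern (assumption)
    text = re.sub(r"#\w+", "<HASHTAG>", text)
    text = re.sub(r"@\w+", "<MENTION>", text)
    return text

cleaned = expand_contractions(
    replace_entities("I don't trust https://example.com #spam @bob"), CONTRACTIONS
)
print(cleaned)
```

Lowercasing happens before the placeholders are inserted, which is why the placeholders themselves stay uppercase at this stage.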

Now, I'm ready to establish my training set. Using pandas to read the CSV, I load the training data from train.csv. I also define the list of tags that act as the dependent variables the model will train on. Finally, I normalize the text data so that it is ready for TF-IDF vectorization.

In [191]:
train_data = pd.read_csv('train.csv')
X = normalize_data(train_data['comment_text'])
tagset=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
y = train_data[tagset]

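Using the data established above, I use TfidfVectorizer with the parameters I specified to create a TF-IDF matrix of the comments. I then split the matrix and the labels into training and testing sets with sklearn's train_test_split function (by default, 75% training and 25% testing). Because the vectorizer is fit before the split, its vocabulary and IDF weights are learned from all of train.csv, including the rows later held out for testing.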

In [192]:
X = vect.fit_transform(X)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y)
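An alternative ordering, shown here only as a sketch and not as what this notebook does, is to split the raw text first and fit the vectorizer on the training portion alone, so that nothing about the held-out comments reaches the vocabulary or IDF weights. The toy corpus, the `toy_` names, and the `random_state` value are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the normalized comments.
toy_texts = ["you are great", "you are awful", "have a nice day",
             "awful awful person", "thanks for the help", "nice work"]
toy_labels = [0, 1, 0, 1, 0, 0]

# Split the raw text first; random_state makes the split reproducible.
txt_trn, txt_tst, lab_trn, lab_tst = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=0
)

toy_vect = TfidfVectorizer()
toy_X_trn = toy_vect.fit_transform(txt_trn)  # vocabulary and IDF from training text only
toy_X_tst = toy_vect.transform(txt_tst)      # held-out text reuses that vocabulary

print(toy_X_trn.shape, toy_X_tst.shape)
```

Both matrices share the same number of columns because the held-out text is projected onto the training vocabulary; words seen only in the held-out text are ignored.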

## Training the Model

With the training matrix (X_trn), I fit the OneVsRestClassifier with LogisticRegression, allowing each underlying model up to 3000 solver iterations. When training is complete, the accuracy score on the held-out split is computed and printed.

In [196]:
clf.fit(X_trn, y_trn) # Trains data classifier on training data

In [198]:
y_pred = clf.predict(X_tst)

accuracy = accuracy_score(y_tst, y_pred) # Calculate the accuracy
print('The calculated accuracy is: {:.2%}'.format(accuracy))

The calculated accuracy is: 91.64%
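It helps to know what this number measures. For multilabel targets, `accuracy_score` reports subset accuracy: a row counts as correct only if every label matches. The sketch below uses a made-up two-label example to contrast that with a per-label F1 score, which can be more informative when some labels are rare.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions for four comments and two labels.
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [0, 0], [1, 0], [0, 1]])

# Fraction of rows where all labels match exactly (three of four here).
subset_accuracy = accuracy_score(y_true, y_pred)

# One F1 score per label column; the second label has one missed positive.
per_label_f1 = f1_score(y_true, y_pred, average=None)

print(subset_accuracy, per_label_f1)
```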


## Finding the Best Parameters

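While our current model scores well (between 91 and 92% accuracy), the question remains whether it is the most accurate we can get. Note that for multilabel targets accuracy_score reports subset accuracy, so a comment counts as correct only when all six labels match. To find out, we use a grid search over the logistic regression's tolerance and maximum iteration count.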

In [205]:
from sklearn.model_selection import GridSearchCV

parameters = {'estimator__tol':[0.01, 0.001, 0.0001], 'estimator__max_iter':[1500,2500,3000]}

In [210]:
#Instantiate the model
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, cv=5)

# Fit grid_search to the data
grid_search_result = grid_search.fit(X_trn, y_trn)

# Summarize results
best_score, best_params = grid_search_result.best_score_, grid_search_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_search_result.best_estimator_
print("Accuracy of the best classifier: {:.2%}".format(best_model.score(X_tst, y_tst)))

Best: 0.916384 using {'estimator__max_iter': 1500, 'estimator__tol': 0.001}
Accuracy of the best classifier: 91.64%
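The `estimator__` prefix in the grid is how parameters are routed through the One-vs-Rest wrapper to the logistic regression inside it. As a rough sketch, the snippet below checks that the grid keys are valid parameter names and counts how many models the search has to fit. The names `ovr`, `names_ok`, and `n_fits` are introduced here for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.multiclass import OneVsRestClassifier

ovr = OneVsRestClassifier(LogisticRegression())
parameters = {'estimator__tol': [0.01, 0.001, 0.0001],
              'estimator__max_iter': [1500, 2500, 3000]}

# Every key in the grid must be a parameter name the wrapper exposes.
names_ok = all(name in ovr.get_params() for name in parameters)

# 3 tolerances x 3 iteration caps = 9 candidate settings.
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 5  # cv=5 folds per candidate, excluding the final refit

print(names_ok, n_candidates, n_fits)
```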


## Practical Usage

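The grid search settles on max_iter=1500 and tol=0.001, with a test accuracy of 91.64%, the same as the original model, so these two parameters make little difference here. With the best model selected and trained, I am ready to move to the final step. Here, I take the test.csv file, which contains only comments that have yet to be tagged. The process largely remains the same, except that there are no labels to compute an accuracy score against. I first normalize and vectorize the data, then have the model predict toxicity.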

In [211]:
prediction_data = pd.read_csv('test.csv').set_index(keys=['id'])
test_data = normalize_data(prediction_data['comment_text'])
tst_mtx = vect.transform(test_data)

In [213]:
predict = best_model.predict(tst_mtx)
prediction_data[tagset] = predict

Out of curiosity, I also single out the rows where a comment received at least one toxicity tag, and print the first 10 to see which comments these were.

In [215]:
toxic_ids = prediction_data[tagset].where(prediction_data[tagset] == 1).dropna(how='all')
toxic_comments = prediction_data[prediction_data.index.isin(toxic_ids.index)]['comment_text'].tolist()
print(*toxic_comments[:10], sep='\n\n===============New Comment===============\n')

Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,

:Dear god this site is horrible.

" 

 ==balance== 
 This page has one sentence about the basic definition of the word, and a huge amount about the slang/profane uses. Perhaps the former should be extended; is there no information about female dogs available beyond their name? This is an encyclopaedia, not a dictionary.  

  
 i feel that whoever is looking this definition up is very appropiate and should be deleted from wikipedia...IMMEDIATLY. this word is used very often and is also a very ""mean"" word. i belive that is majorly true. very much so. okay so, the good meaning is a female dog.  BITCH !!!!!!!!!It also stands 
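The same selection can also be written with a boolean mask, which some readers may find easier to follow. This is a sketch on a made-up three-row frame, not the notebook's actual predictions; the `toy` frame and its ids are hypothetical.

```python
import pandas as pd

tagset = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical predictions for three comments.
toy = pd.DataFrame(
    {'comment_text': ['first', 'second', 'third'],
     'toxic': [1, 0, 0], 'severe_toxic': [0, 0, 0], 'obscene': [1, 0, 0],
     'threat': [0, 0, 0], 'insult': [0, 0, 1], 'identity_hate': [0, 0, 0]},
    index=pd.Index(['a1', 'b2', 'c3'], name='id'),
)

# True for every row where at least one tag was predicted.
flagged_mask = toy[tagset].eq(1).any(axis=1)
flagged_comments = toy.loc[flagged_mask, 'comment_text'].tolist()

print(flagged_comments)
```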

Finally, I save the output of the practical usage into a CSV file, where it can be accessed later.

In [216]:
prediction_data.to_csv('final_output.csv')