# Training Data for Toxicity Report


This final was tough: it took a lot of time and trial and error to find an algorithm that worked, and worked fast. As a reminder, my initial goal for the program was to train a naïve Bayes classifier on a .csv of Wikipedia comments so that it could identify toxic or negative comments. Ultimately, the hope was that the program would see a potentially toxic comment and flag it for review. The proposed steps for solving this problem were the following:

1. Build functions that parse through and normalize the .csv documents;
2. Define a vocabulary of unique words within the normalized training data;
3. Load training data into two arrays:
 - An array of vectors based on the size of the vocabulary;
 - An array of labels for each comment in the training data;
4. Train the naïve Bayes classifier with these arrays;
5. Test the classifier on the test data;
6. Calculate the prediction score.
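
The proposed plan can be sketched in a few lines with scikit-learn. This is only an illustration of the idea under stated assumptions, not code from the project: the toy comments and labels are hypothetical, and `CountVectorizer` and `MultinomialNB` stand in for the hand-built vocabulary and naïve Bayes classifier described in the list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy comments and binary labels (1 = toxic, 0 = not toxic)
train_comments = [
    "you are an idiot",
    "what a stupid idiot",
    "thanks for the helpful edit",
    "great article, thanks",
]
train_labels = [1, 1, 0, 0]

# Build the vocabulary and the bag-of-words count vectors
vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_comments)

# Train the naive Bayes classifier on the vectors and labels
nb = MultinomialNB()
nb.fit(train_vectors, train_labels)

# Vectorize unseen comments with the same vocabulary, then predict and score
test_comments = ["stupid idiot", "thanks for the great edit"]
test_labels = [1, 0]
test_vectors = vectorizer.transform(test_comments)
predictions = nb.predict(test_vectors)
score = nb.score(test_vectors, test_labels)

print(predictions, score)
```

A classifier like this predicts a single toxic/not-toxic label per comment.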

In practice, I had to make many changes to that initial proposal in order to have a functioning program that did not take days to process. For instance, I did away with normalizing the data in a separate program. That program took time to figure out, and I found guidance in the article by Duque (2020). It parsed through the original data, filtered and normalized it, and saved the result to a new .txt file. While it worked, it took several hours to parse through and normalize all the data. Another roadblock was the way we had learned to create a bag of words, as a matrix of float tensors. Initially, I was going to use an RNN model like the ones created for labs 2 and 5. However, the bag of words turned out to be too large for my computer to handle, and the kernel would die regularly. As a result, I had to do a lot of research on quicker and easier means of parsing the data and training on it. My sources included the handbooks for pandas and sklearn, video tutorials, and suggestions from others who completed the original Kaggle challenge. My final method for solving the problem was as follows:

1. Import the modules necessary for reading csv files and multilabel classification training;
2. Identify the training and testing data and normalize them;
3. Use the scikit-learn vectorizing tools for making the data computer readable;
4. Train the classifier and calculate its accuracy;
5. Test the classifier;
6. Save the tested classifications to a .csv file.
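
One reason the scikit-learn vectorizers are workable where a dense bag-of-words matrix was not is that they return SciPy sparse matrices, which store only the nonzero entries. The sketch below compares the two storage costs on a hypothetical three-comment corpus. On data this small the difference is negligible (the sparse form can even be larger), but a dense matrix grows with comments × vocabulary while a sparse one grows only with the words actually present.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the real training set is far larger
comments = [
    "this is a perfectly fine comment",
    "another harmless remark about an article",
    "a rude and hostile reply",
]

counts = CountVectorizer().fit_transform(comments)  # SciPy sparse matrix

n_rows, n_cols = counts.shape
# Bytes needed if every cell were stored as a 32-bit float
dense_bytes = n_rows * n_cols * np.dtype(np.float32).itemsize
# Bytes actually used by the sparse (CSR) representation
sparse_bytes = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes

print(f"shape: {counts.shape}, nonzero entries: {counts.nnz}")
print(f"dense float32 size: {dense_bytes} bytes, sparse size: {sparse_bytes} bytes")
```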

The first step in solving this problem is to import the necessary modules and create a function that normalizes the text data. To begin, I imported pandas to help parse through the csv files and identify the relevant data. Then, I imported sklearn and the specific functions from that module used to train my program for multilabel identification.

In [1]:
import pandas as pd # to read csv file
import numpy as np

In [48]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
vect = TfidfVectorizer(stop_words = 'english')
clf = OneVsRestClassifier(LogisticRegression(max_iter=3000)) # Uses one-vs-rest multilabel classification
    # with a logistic regression model as the base estimator; max_iter=3000 caps the solver iterations
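
To show what the one-vs-rest wrapper does, here is a minimal sketch on hypothetical toy data (two numeric features, two labels, where label j simply follows feature j). It fits one logistic regression per label column and predicts a row of 0/1 flags for each sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy features and a two-column multilabel target
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)  # label j follows feature j

ovr = OneVsRestClassifier(LogisticRegression(max_iter=3000))
ovr.fit(X, Y)

# One fitted LogisticRegression per label column
n_estimators = len(ovr.estimators_)
# Predictions have the same (n_samples, n_labels) layout as Y
pred = ovr.predict(np.array([[1.0, 0.0]]))

print(n_estimators)
print(pred)
```

Each label is treated as its own binary problem, so a comment can receive several tags at once or none at all.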

For the next step, I defined a function that parses through the training data and normalizes it. I hoped this would make the training vectors a little less complicated, particularly by removing any unnecessary punctuation or complicated images ([NB-SVM strong linear baseline](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline?scriptVersionId=2329069&cellId=1), [Duque (2020)](https://towardsdatascience.com/text-normalization-7ecc8e084e31)).

In [3]:
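import re, string, json

# Load the contraction map once instead of re-reading the file for every comment
with open('english_contractions.json') as f:
    CONTRACTIONS = json.load(f)

def normalize_data(data):
    data = data.lower()
    # Replace URLs, hashtags and mentions with placeholder tokens
    # (the punctuation step below strips the angle brackets, leaving URL, HASHTAG, MENTION)
    data = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', '<URL>', data)
    data = re.sub(r'#(\w+)', '<HASHTAG>', data)
    data = re.sub(r'@(\w+)', '<MENTION>', data)
    data = normalize_contractions(data, CONTRACTIONS)
    # Escape the punctuation so ], \ and - are literal inside the character class
    data = re.sub(f'[{re.escape(string.punctuation)}“”¨«»®´·º½¾¿¡§£₤‘’]', '', data)
    data = re.sub(r'\d+', '', data)
    data = re.sub(r'\s+', ' ', data)  # collapse whitespace last, after digits are removed
    return data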

def normalize_contractions(text, contractions):
    new_token_list = []
    for word in text.split():
        # Remember whether the token was capitalized so the expansion can match
        first_upper = word[0].isupper()
        if word.lower() in contractions:
            replacement = contractions[word.lower()]
            if first_upper:
                replacement = replacement[0].upper() + replacement[1:]
            # Keep every word of the expansion, however many there are
            new_token_list.extend(replacement.split())
        else:
            new_token_list.append(word)
    return " ".join(new_token_list)
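
To see what this kind of normalization does to a single comment, here is a condensed, self-contained sketch. It is not the function above: the URL pattern is simplified, and the small `CONTRACTIONS` dictionary is a hypothetical stand-in for `english_contractions.json`, whose contents are not shown here.

```python
import re
import string

# Hypothetical stand-in for english_contractions.json
CONTRACTIONS = {"don't": "do not", "you're": "you are"}

def expand_contractions(text, contractions):
    # Replace each whitespace-separated token that is a known contraction
    return " ".join(contractions.get(tok, tok) for tok in text.split())

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "<URL>", text)  # simplified URL pattern
    text = expand_contractions(text, CONTRACTIONS)
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "You're WRONG!!! Don't edit page 42 again, see http://example.com/rules"
result = normalize(sample)
print(result)
# prints: you are wrong do not edit page again see URL
```

Contractions have to be expanded before the punctuation is stripped, because the apostrophe is what identifies them.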

Now, I'm ready to establish my training set. Using the pandas.read_csv function, I establish which document is the training data and which one is the testing data. I also list the tags used in the header of each file for later. Finally, I assign the normalized training and test data to variables X_trn and X_tst, respectively. Initially, I planned to do the same with the tag lists, but I later realized that the data provided in __test_labels.csv__ does not accurately reflect the tags associated with __test.csv__. I am not sure why they were included in the dataset, honestly.

In [4]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
#tst_tags = pd.read_csv('test_labels.csv')
tagset=['toxic','severe_toxic','obscene','threat','insult','identity_hate']

In [58]:
X_trn = train.comment_text.apply(normalize_data)
X_tst = test.comment_text.apply(normalize_data)
Y_trn = train[tagset].values
#Y_tst = tst_tags[tagset].values
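
The column selection above can be tried on a small in-memory file. The two-row CSV below is a hypothetical stand-in for train.csv. The `comment_text` column and the six tag columns come from the code above, while the `id` column and the comments themselves are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for train.csv
csv_text = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,"Thanks for the fix",0,0,0,0,0,0
b2,"You are an idiot",1,0,0,0,1,0
"""
tagset = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

toy = pd.read_csv(io.StringIO(csv_text))
X_toy = toy.comment_text.apply(str.lower)  # any str -> str function can be applied
Y_toy = toy[tagset].values                 # NumPy array, one column per tag

print(X_toy.tolist())
print(Y_toy.shape)
```

`apply` runs the function once per comment, and `.values` turns the six tag columns into the label matrix that the classifier is trained on.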

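Using the data established above, I use TfidfVectorizer with the parameters I specified to create a matrix of my training data. Using the vocabulary established in fit_transform, I then create a matrix of the testing data. With the training matrix (trn_mtx), I train the OneVsRestClassifier with LogisticRegression, allowing the solver up to 3000 iterations. When training is complete, an accuracy is calculated and printed. Because clf.score is called on the training matrix itself, this is the accuracy on the training data, and for multilabel targets it counts a comment as correct only when all six tags match.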

In [49]:
trn_mtx = vect.fit_transform(X_trn)
tst_mtx = vect.transform(X_tst)
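
The difference between `fit_transform` and `transform` is easiest to see on a tiny example. In this sketch (toy comments are hypothetical), the vocabulary is learned from the training documents only, so words that appear only in the test documents are dropped and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy comments
train_docs = ["angry rude comment", "polite helpful comment"]
test_docs = ["rude unseen words", "helpful comment"]

toy_vect = TfidfVectorizer(stop_words='english')
toy_trn = toy_vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
toy_tst = toy_vect.transform(test_docs)       # reuses them; unknown words are dropped

vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(toy_trn.shape, toy_tst.shape)
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.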

In [50]:
clf.fit(trn_mtx, Y_trn) # Trains the classifier on the training data
accuracy = clf.score(trn_mtx, Y_trn) # Subset accuracy, measured on the training data itself
print('The calculated accuracy is: {:.2%}'.format(accuracy))

The calculated accuracy is: 92.42%
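
For multilabel targets, the accuracy that scikit-learn reports is a subset accuracy: a comment counts as correct only if every one of its labels is predicted correctly. The sketch below, on hypothetical label matrices, contrasts that with a per-label accuracy that scores each cell separately.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted label matrices (3 comments, 2 labels)
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [0, 0], [1, 1]])

subset_acc = accuracy_score(y_true, y_pred)  # whole rows must match
per_label_acc = (y_true == y_pred).mean()    # each cell counted separately

print(f"subset accuracy: {subset_acc:.3f}")
print(f"per-label accuracy: {per_label_acc:.3f}")
```

Here two of three rows match exactly (0.667), while five of six individual labels are right (0.833), so the two figures should not be compared directly.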


The final step in this process was to calculate predictions for the testing data and save them to an output file. As I said, I expected the data in __test_labels.csv__ to be more data to help calculate accuracy. However, the list of labels turned out to be -1's and 0's, with no information about the comments indicated. Upon further reflection, the data seems to have been added later for those competing to peruse. It was not initially part of the challenge, so I chose to ignore it and submit my predictions instead for perusal.

In [51]:
predict = clf.predict(tst_mtx)

In [52]:
submission = pd.read_csv('sample_submission.csv')
submission[tagset] = predict

In [53]:
submission.to_csv('final_output.csv', index=False)
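
`clf.predict` returns hard 0/1 decisions for each tag. If per-tag probabilities were wanted in the output file instead, `OneVsRestClassifier` also offers `predict_proba` when its base estimator supports it, as `LogisticRegression` does. The sketch below shows both on hypothetical toy data. The ids, features and the two-label subset are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: two features, two labels (label j follows feature j)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
labels = ['toxic', 'insult']

toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=3000)).fit(X, Y)

hard = toy_clf.predict(X[:4])        # 0/1 decisions per label
soft = toy_clf.predict_proba(X[:4])  # per-label probabilities in [0, 1]

# Assemble an output table with one probability column per label
out = pd.DataFrame(soft, columns=labels)
out.insert(0, 'id', ['a', 'b', 'c', 'd'])  # hypothetical ids
print(out.round(3))
```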

As I said, this challenge was difficult and took a lot of extra research to complete. However, I found it very valuable: it taught me the logistics of teaching myself a machine learning library like scikit-learn, and it established a solid foundation in multilabel classification compared to the naïve Bayes binary classification.