# Profanity Classification
Based on [this kaggle challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) called "Toxic Comment Classification Challenge - Identify and classify toxic online comments"

### Quick research
Turned up these papers (pdf warning and such):
1. [Text classification of short messages. Lundborg et al.](http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8928009&fileOId=8928011)
2. [Automatic Classification of Abusive Language and Personal Attacks in Various Forms of Online Communication. Bourgonje et al.](https://link.springer.com/content/pdf/10.1007%2F978-3-319-73706-5_15.pdf)
3. [Automatic Detection of Cyberbullying in Social Media Text. van Hee et al.](https://arxiv.org/pdf/1801.05617.pdf) and a [different version](https://biblio.ugent.be/publication/6969774/file/6969839.pdf)
4. [Harnessing the Power of Text Mining for the Detection of Abusive Content in Social Media. Chen et al.](https://arrow.dit.ie/cgi/viewcontent.cgi?referer=https://scholar.google.nl/scholar?as_ylo=2014&q=profanity+text+classification&hl=en&as_sdt=0,5&httpsredir=1&article=1196&context=scschcomcon)

#### A shallow scan of these papers/theses:
1. They also look at multi-class problems, although generally with fewer than our 7 classes.
2. Besides English, they work on Swedish and Dutch (well, Belgian ;)) data.
3. Annotation seems to be one of the hardest problems. Already done for us :D
4. There could be a nice scientific comparison to peer work, if time permits.
5. n-grams (character, word, skip-grams)
   - 1-3 word n-grams and 1-6 character n-grams
   - skipgrams using NER from spacy, mcparseface or NLTK. For example 2-grams in the list of nouns 
6. (Feature) Spellchecker results
7. (Feature) Sentiment analysis (In this case: #positive words/#total words, #negative words/#total words)
8. (Feature) Linguistic features
   - #words, #characters
   - #uppercase words (normalized)
   - #uppercase characters (normalized) (-> avg ratio uppercase characters per word?)
   - Longest word
   - Average word length
   - #one letter tokens, #number of one letter tokens / #words.
   - #punctuation, spaces, exclamation marks, question marks, at signs and commas
9. (Feature) You can extract syntactic features from word trees (generated by, for example, spaCy, if I remember correctly).
   - Word + parent
   - Word + grandparent
   - Word + children
   - Word + siblings of parent
10. (Feature) Specifically engineered term lists
   - Binary features indicating that a term from given list is in the text. Lists are researcher engineered and contain for example words indicating bullying (in the case of paper 3)
11. (Classifiers Used) Naive Bayes, SVM, Random Forest, Neural Networks
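
Items 7 and 8 above can be sketched in a few lines of plain Python. This is a rough illustration only: the function name `linguistic_features` and the two tiny word lists are hypothetical placeholders (the papers use proper sentiment lexicons), and none of this is used further down in this notebook.

```python
import string

# Hypothetical toy lexicons; a real run would load a proper sentiment word list.
POSITIVE_WORDS = {"good", "great", "thanks", "nice"}
NEGATIVE_WORDS = {"bad", "stupid", "hate", "idiot"}


def linguistic_features(text):
    """Return a dict of simple count and ratio features for one comment."""
    tokens = text.split()
    # Strip surrounding punctuation so "idiot!" still matches the lexicon.
    words = [t.strip(string.punctuation).lower() for t in tokens]
    words = [w for w in words if w]
    # Guard the denominators so empty comments do not divide by zero.
    safe_tokens = max(len(tokens), 1)
    safe_words = max(len(words), 1)
    safe_chars = max(len(text), 1)
    return {
        "num_words": len(words),
        "num_characters": len(text),
        "uppercase_word_ratio": sum(t.isupper() for t in tokens) / safe_tokens,
        "uppercase_char_ratio": sum(c.isupper() for c in text) / safe_chars,
        "longest_word": max((len(w) for w in words), default=0),
        "avg_word_length": sum(len(w) for w in words) / safe_words,
        "one_letter_ratio": sum(len(w) == 1 for w in words) / safe_words,
        "positive_ratio": sum(w in POSITIVE_WORDS for w in words) / safe_words,
        "negative_ratio": sum(w in NEGATIVE_WORDS for w in words) / safe_words,
    }


features = linguistic_features("YOU are a stupid IDIOT!")
print(features)
```

All ratios are normalized by the number of words or characters, so long and short comments stay comparable.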

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

In [None]:
# Load data
trainPd = pd.read_csv("train.csv")
testPd = pd.read_csv("test.csv")

### So, what's in the data:

In [None]:
def show_text_len_values(column, label):
    """ Given column, calculates numbers such as
    mean, median, std dev, min, max text length 
    of the character counts of the text
    and prints these. Also prints a histogram based 
    on the character counts of the texts
    """
    textLen = column.apply(len)
    print("{7}:\n\tmin:\t{0}\n\tmax:\t{1}\n\tmedian:\t{2}\n\tmean:\t{3}\n\tstddev:\t{4}\n\t10th perc:\t{5}\n\t90th perc:\t{6}".format(
        np.min(textLen),
        np.max(textLen),
        np.median(textLen),
        round(np.mean(textLen),2),
        round(np.std(textLen),2),
        round(np.percentile(textLen,10),2),
        round(np.percentile(textLen,90),2),
        label
    ))
    
    # Show hist
    plt.hist(textLen, bins=50)
    plt.yscale('log')  # log-scaled y axis; non-positive values are clipped by default
    plt.ylabel('#texts (log)')
    plt.xlabel('text length')
    plt.show()
    
def show_data_stats(trainPd, testPd):
    """ Shows/prints information on the dataset
    """
    # Amount of rows?
    print("Train data:\n\t#rows: {0}\n\tcolumns: {1}\n".format(trainPd.shape[0], list(trainPd.columns)))
    print("Test data:\n\t#rows: {0}\n\tcolumns: {1}\n".format(testPd.shape[0], list(testPd.columns)))

    # Can one line have multiple true labels?
    multipleLables = trainPd[np.sum(trainPd[['toxic','severe_toxic','obscene','threat','insult','identity_hate']], axis=1) > 1]
    print("Can one line have multiple true labels? (train)")
    print("\t#rows with more than one true label: {0}\n".format(multipleLables.shape[0]))

    # How many of each label is there?
    print("How many of each label is there?")
    for label in ['toxic','severe_toxic','obscene','threat','insult','identity_hate']:
        sumL = np.sum(trainPd[label], axis=0)
        print("\t#{0}:\t{1}\t(={2}% of train data)".format(label, sumL, round((sumL/trainPd.shape[0])*100,2)))

    # Also print amount of normal samples
    sumL = trainPd[np.sum(trainPd[['toxic','severe_toxic','obscene','threat','insult','identity_hate']], axis=1) < 1].shape[0]
    print("\t#normal:\t{0}\t(={1}% of train data)\n".format(sumL, round((sumL/trainPd.shape[0])*100,4)))

    # What is the mean, median, std dev, min, max text length? (train and test)
    show_text_len_values(trainPd['comment_text'],"Train")
    show_text_len_values(testPd['comment_text'],"Test")

    # What is the mean, median, std dev, min, max message length per class? 
    #  (ignoring the fact that a message can have multiple classes)
    for label in ['toxic','severe_toxic','obscene','threat','insult','identity_hate']:
        column = trainPd[trainPd[label] == 1]['comment_text']
        show_text_len_values(column,"Train {0} == 1".format(label))

show_data_stats(trainPd, testPd)

### Interpretation
- Multilabel classification problem. One row can have more than one label.
- Skewed dataset. Roughly 90% of the rows are "normal", and only 0.88% of the rows have the label "threat".
- Lots of rows have more than one label. For example, 9.58% of the train rows are "toxic", which covers almost 100% of the non-normal train rows.

### What to do?
We could (among others):
1. Bootstrap our samples in the underrepresented classes
2. Undersample the overrepresented classes (Or in combination with 1.)
3. Do nothing, let the classification algorithm figure it out (For example Random Forest can handle some skewness)
4. Do a binary classification for each of the classes (Is it normal or other, is it obscene or other, etc. We can now sample non-skewed sets per class)
5. Create, from the 7 binary columns one integer (0-127). 
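
Option 5 can be sketched as simple bit packing. This is only an illustration: the helper names `encode_flags` and `decode_flags` are made up here, and I assume the seventh column is a derived "normal" flag next to the six label columns in the data.

```python
# Assumed column order; "normal" is a derived flag, not a column in train.csv.
LABELS = ["toxic", "severe_toxic", "obscene", "threat",
          "insult", "identity_hate", "normal"]


def encode_flags(flags):
    """Pack a list of 0/1 flags into one integer, first flag = lowest bit."""
    code = 0
    for position, flag in enumerate(flags):
        if flag:
            code |= 1 << position
    return code


def decode_flags(code, n_flags):
    """Unpack an integer back into a list of n_flags 0/1 values."""
    return [(code >> position) & 1 for position in range(n_flags)]


# A comment that is toxic, obscene and an insult.
flags = [1, 0, 1, 0, 1, 0, 0]
code = encode_flags(flags)
print(code)                              # 21
print(decode_flags(code, len(LABELS)))   # [1, 0, 1, 0, 1, 0, 0]
print(encode_flags([1] * len(LABELS)))   # 127
```

Note that neighbouring integers can mean very different label sets (21 is toxic + obscene + insult, 22 is severe_toxic + obscene + insult), and decoding yields hard 0/1 flags rather than a probability per class.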

### What should we do?
- I think 4, because 1 and 2 are hard due to the multilabel properties. Sampling more from the "threat" class will also result in lots of samples from other classes, so we would still have a skewed set (it is solvable though). Also, this could result in a biased train set.
- 3 is also not optimal because the set is not slightly skewed but significantly skewed.
- 4 allows us to create balanced sets and still get a probability per class, which is what the Kaggle challenge asks for.
- 5: For now I wouldn't know how to handle the results. We would have to map back from this single integer to 7 probabilities.

_(So, I'm going with 4 and not trying more due to time constraints)_

## Implementation

1. create_binary_training_data
   - Creates, for each class, a training set with n True and n False samples
2. train_on_binary_training_data 
   - For each class separately, calculate features and train a model
   - Creates some simple counting features
   - Uses create_document_term_matrix to create TF and TF/IDF matrices
   - Trains a Random Forest Classifier using Grid Search and 10 fold cross validation
3. show_feature_importances
   - Shows for each of the models the n most important features
   - Allows for visual inspection. Does it make sense? Does it look like we could improve on this? 
4. validate_on_remaining_train_data
   - In train_on_binary_training_data, only part of the training data has been sampled for training
   - Use the rest of this training data as a validation set
   - Predict on these validation samples using the trained models (in "predict")
   - Calculate the avg ROC AUC on the results. Kaggle ranks on this value (on the test set, that is)
5. predict
   - Uses the existing models to predict for each class the probability that this comment falls in given class
6. save_predictions
   - Saves the comment ids with the predicted probabilities to a CSV (to be uploaded to Kaggle)
7. print_random_classifications
   - For visual inspection
   - Shows comment_text for n random samples with actual label and predicted labels
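
The score from step 4 (the mean column-wise ROC AUC) can be sketched without any library. This is a minimal illustration, not what the notebook runs (it calls `roc_auc_score` from scikit-learn): the helper names are hypothetical, and the pairwise loop is quadratic, so it is only meant for small toy inputs.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_columnwise_roc_auc(y_true_by_label, y_score_by_label):
    """Average the per-label AUCs; also return the individual values."""
    aucs = {label: roc_auc(y_true_by_label[label], y_score_by_label[label])
            for label in y_true_by_label}
    return sum(aucs.values()) / len(aucs), aucs


# Toy data for two labels and four comments.
truth = {"toxic": [0, 0, 1, 1], "threat": [0, 0, 0, 1]}
scores = {"toxic": [0.1, 0.4, 0.35, 0.8], "threat": [0.2, 0.1, 0.3, 0.9]}
mean_auc, per_label = mean_columnwise_roc_auc(truth, scores)
print(per_label)   # {'toxic': 0.75, 'threat': 1.0}
print(mean_auc)    # 0.875
```

A label without any positive (or any negative) sample has no defined AUC, which is worth keeping in mind when the validation sample is small and a class like "threat" is rare.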

In [None]:
def create_binary_training_data(trainPd, 
                                classes=['toxic','severe_toxic','obscene','threat','insult','identity_hate'], 
                                n=500):
    """ Creates binary training data
    Returns a dictionary with, for each class in classes, a dataframe df
    where df has two classes of n rows each: the given class and "other" (True = given, False = other)
    """
    trainPds = {}

    # Create train datasets for each class. Resulting in two types of rows *class and *other
    # By the way, we're not going to classify anything as "normal". It results from "other" in all binary classifications
    for label in classes:
        # Sample "label" (without replacement if the class has at least n rows, otherwise with replacement)
        ofLabel = (trainPd[label] == 1)
        replace = ofLabel.sum() < n
        classPd = trainPd[ofLabel].sample(n, replace=replace)
        classPd['label'] = True

        # Sample not "label" #TODO: Stratified sampling? Or an even amount per class?
        # This class will contain LOTS (90%) of "normal" values if we don't do anything about it
        nonClassPd = trainPd[trainPd[label] != 1].sample(n, replace=False)
        nonClassPd['label'] = False

        # Add as one (vertically joined) dataframe to trainPds
        trainPds[label] = pd.concat([classPd, nonClassPd], axis=0)

        # We now have, per label, one trainPd with samples of given label and samples of all other labels
        # Note, the True values belong to the label class. But can ALSO belong to other classes.
        #       the False values DON'T belong to the label class. But to any or no amount of other classes 
        #       (With same distribution as the original dataset)
    return trainPds

binaryTrainingData = create_binary_training_data(trainPd)

In [None]:
def calculate_simple_features(message):
    """Cleans message, calculates values on message such as:
    - % of capitals
    - #characters
    - #words
    - #punctuation
    - #!
    - #?
    etc...
    
    Returns pandas series of dictionary with features
    """
    return pd.Series({
        'numCapitals':sum(message.count(x) for x in ('Q','W','E','R','T','Y','U','I','O','P','A','S','D','F','G','H','J','K','L','Z','X','C','V','B','N','M')),   # RF model will scale/normalize in respect to num characters
        'numCharacters':len(message),
        'numWords':len(message.replace('\n', ' ').replace('  ', ' ').split(' ')),  # Not entirely correct, close enough
        'numPunctuation':sum(message.count(x) for x in ('!','@','#','$','%','^','&','*','(',')','.',',','/','\\',']','[','{','}','"',':',';',"'",'`','~','|')),
        'numExclamation':message.count('!'),
        'numQuestion':message.count('?'),
    })

In [None]:
def create_document_term_matrix(messages, 
                                max_features=1000, 
                                strip_accents='unicode',  #None
                                analyzer='word',
                                ngram_range=(1,5),        #Optimize
                                stop_words='english',
                                lowercase=True,
                                max_df=0.9,               #Optimize
                                min_df=1,                 #Optimize
                                tfidf=False):
    """ Create a document term matrix from given list of messages
    """
    vectorizer = CountVectorizer(max_features=max_features,
                                strip_accents=strip_accents,
                                analyzer=analyzer,
                                ngram_range=ngram_range,
                                stop_words=stop_words,
                                lowercase=lowercase,
                                max_df=max_df,
                                min_df=min_df,
                                )
    
    if tfidf:
        vectorizer = TfidfVectorizer(max_features=max_features,
                                strip_accents=strip_accents,
                                analyzer=analyzer,
                                ngram_range=ngram_range,
                                stop_words=stop_words,
                                lowercase=lowercase,
                                max_df=max_df,
                                min_df=min_df)    # Override. Use the tfidf vectorizer
    vectorizer.fit(messages)

    dtm = vectorizer.transform(messages)
    # get_feature_names_out() requires scikit-learn >= 1.0
    dtmDf = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
    
    return dtmDf, vectorizer

In [None]:
def calculate_features(textPd, max_dtm_features=100, vectorizers=None):
    """ Calculates features based on text column of dataframe
    Returns features and a dictionary of built vectorizers
    """
    # Simple
    print("Creating simple features..")
    simpleDf = textPd['comment_text'].apply(calculate_simple_features)
    simpleDf = simpleDf.reset_index()
    simpleDf = simpleDf.drop(columns=['index'])
    
    # Based on vectorizers given or not given, calculate or use
    if vectorizers is None:
        print("No vectorizers given. Creating new on data.")
        
        # DTM
        print("Creating document term matrix..")
        print("Count..")
        dtmCountDf, dtmCountVectorizer = create_document_term_matrix(list(textPd['comment_text']), max_features=max_dtm_features)

        # TF/IDF
        print("TF/IDF..")
        dtmTfIdfDf, tfIdfVectorizer = create_document_term_matrix(list(textPd['comment_text']), max_features=max_dtm_features, tfidf=True)
    else:
        print("Vectorizers found. Only applying new data on these vectorizers.")
        
        # DTM
        print("Count..")
        dtmCount = vectorizers['dtmCountVectorizer'].transform(list(textPd['comment_text']))
        dtmCountDf, dtmCountVectorizer = pd.DataFrame(dtmCount.toarray(), columns=vectorizers['dtmCountVectorizer'].get_feature_names_out()), vectorizers['dtmCountVectorizer']
        
        # TF/IDF
        print("TF/IDF..")
        dtmTfIdf = vectorizers['tfIdfVectorizer'].transform(list(textPd['comment_text']))
        # Build the frame from the TF/IDF matrix, not from the count matrix
        dtmTfIdfDf, tfIdfVectorizer = pd.DataFrame(dtmTfIdf.toarray(), columns=vectorizers['tfIdfVectorizer'].get_feature_names_out()), vectorizers['tfIdfVectorizer']
    
    # Merge into one feature DF
    featureDf = pd.concat([simpleDf, dtmCountDf, dtmTfIdfDf], axis=1)
    return featureDf, {'dtmCountVectorizer':dtmCountVectorizer, 'tfIdfVectorizer':tfIdfVectorizer}

In [None]:
def train_on_binary_training_data(binaryTrainingData,
                                  labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate'],
                                  max_dtm_features=1000):
    """ Train an classifier for each of the binary classes in binaryTrainingData
    """
    gscvScores = {}       # To hold results
    featuresPds = {}      # To hold dataframes with features
    vectorizers = {}      # To hold vectorizers necessary for testing
    for label in labels: 
        print("Processing label {0}:\n------------------".format(label))
        currentPd = binaryTrainingData[label]
        textPd = currentPd[['id','comment_text']]
        originalLabelsPd = currentPd[labels]
        labelsPd = currentPd['label']

        # Calculate features
        featureDf, vectorizers[label] = calculate_features(textPd, max_dtm_features)
        featuresPds[label] = featureDf
        print("\n#Features: {0}".format(featureDf.shape[1]))

        # Parameter optimization
        parameters = {
            'n_estimators': [8, 10, 12], 
            'max_depth': [None, 5, 10],
            'max_features': ['sqrt','log2',0.25,50] #,25   ('sqrt' is what 'auto' meant for classifiers)
        }

        # Use random forest
        # grid search on given parameters
        # Use 10 fold cross validation
        print("Grid search, cross validation..")
        rf = RandomForestClassifier()
        gridSearchCV = GridSearchCV(rf, parameters, cv=10)
        cvs = gridSearchCV.fit(featureDf, labelsPd)

        # Best parameters and corresponding CV score
        # When necessary, more results in: cvs.cv_results_
        print("\n--- Results for {0} ---".format(label))
        print("Best parameters: {0}".format(cvs.best_params_))
        print("With best score: {0}".format(cvs.best_score_))
        print("-------------------------\n")

        # Append scores to history
        gscvScores[label] = {
            'score':cvs.best_score_,
            'params':cvs.best_params_,
            'max_features':max_dtm_features,
            'best_estimator':cvs.best_estimator_
        }
        
    return gscvScores, featuresPds, vectorizers

startTime = time.time()
gscvScores, featuresPds, vectorizers = train_on_binary_training_data(binaryTrainingData)
elapsedTime = time.time() - startTime
print("Training finished in {0}s".format(elapsedTime))

In [None]:
def show_feature_importances(gscvScores, top_n=30):
    """ Look into feature importances. Does it make sense?
    Allows for visual inspection
    """
    featureImportances = {}
    for label in ['toxic','severe_toxic','obscene','threat','insult','identity_hate']: 
        fi = gscvScores[label]['best_estimator'].feature_importances_
        featureImportances[label] = pd.DataFrame([fi], columns=featuresPds[label].columns).T.sort_values(by=0, ascending=False)
        print("Feature importances for {0}:\n\n{1}\n\n".format(label, featureImportances[label][0:top_n]))
show_feature_importances(gscvScores)

In [None]:
def predict(inPd, labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate']):
    """ Classify on given DF using 
    the "best estimators" for each class
    and the precalculated (on train data, without labels) vectorizers
    
    DF should contain column 'comment_text'
    
    Returns prediction (continuous, 0-1) for each of the labels to be True
    """
    predicted = {}
    for label in labels: 
        print("Predicting on label {0}".format(label))

        # Calculate features on data
        # There is double work being done here 
        #   (for example for each label the simple features are extracted)
        #   This should be moved to a different location (or memoized)
        print("Calculating features..")
        features, vectorizer = calculate_features(inPd, max_dtm_features=500, vectorizers=vectorizers[label])

        # Predict on previously trained models
        print("Predicting..")
        estimator = gscvScores[label]['best_estimator']
        predicted[label] = estimator.predict_proba(features)
    print("Done")
    return predicted

In [None]:
def validate_on_remaining_train_data(trainPd, 
                                     binaryTrainingData,
                                     sample=5000,
                                     labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate']):
    """ Uses remaining training data 
    predicts using trained models
    calculates and shows average area under curve roc result
    
    Returns the average AUC ROC, the validation sample and the predictions on that sample
    """
    # Create a dataframe from the train dataframe 
    # without all samples on which the models are trained
    trainedWith = pd.concat([binaryTrainingData[label]['id'] for label in labels], axis=0)
    validatePd = trainPd[~trainPd['id'].isin(trainedWith)]

    # Validate on all train samples which have not been used in training
    if sample is not None:
        samplePd = validatePd.sample(sample)
    else:
        # Use the entire validation Pd
        samplePd = validatePd
    predictedValidation = predict(samplePd)

    # Calculate average area under roc
    # Per label, keep the probability of the positive class (second column of predict_proba)
    resultPd = pd.DataFrame({label: predictedValidation[label][:, 1] for label in labels})
    roc_auc = roc_auc_score(samplePd[labels], resultPd[labels])
    
    return roc_auc,samplePd,predictedValidation

startTime = time.time()
roc_auc, validatePd, predictedValidation = validate_on_remaining_train_data(trainPd, binaryTrainingData)
elapsedTime = time.time() - startTime
print("Avg ROC AUC on our validation data is {0}. Calculated in {1}s".format(roc_auc, elapsedTime))
# Note: This ROC AUC can be based on too little data for some 
#       classes when a large amount of data has been used for training.
#       In any case, the sampling of the train data affects the skewness of the validation set.
#       Take this into account when considering this value.

In [None]:
# Predict on test set
predicted = predict(testPd)

In [None]:
def save_predictions(inPd, 
                     predictions,
                     csvPath,
                     labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate']):
    """ Save results to output csv
    """
    # Build the output from the arguments passed in
    resultPd = inPd[['id']].reset_index(drop=True)
    for label in labels:
        labelResult = pd.DataFrame(predictions[label], columns=['not_{0}'.format(label),label])
        resultPd = pd.concat([resultPd, labelResult[label]], axis=1)
    resultPd.to_csv(csvPath, index=False)
save_predictions(testPd,predicted,"test_result_009.csv")

In [None]:
def print_random_classifications(actual,
                                 predicted,
                                 labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate'],
                                 n=10):
    """Shows n messages with their classification
    """
    print(predicted.keys())
    # Reset the index so rows line up with the positional prediction arrays
    resultPd = actual.reset_index(drop=True)
    for label in labels:
        labelResult = pd.DataFrame(predicted[label], columns=['predicted_not_{0}'.format(label),'predicted_{0}'.format(label)])
        resultPd = pd.concat([resultPd, labelResult], axis=1)
    resultPd = resultPd.sample(n)
    
    for index, row in resultPd.iterrows():
        print("===================\n")
        print("Class:\t\tPredicted\tActual")
        print("Toxic:\t\t{0}\t\t{1}".format(round(row['predicted_toxic'],2),row['toxic']))
        print("severe_toxic:\t{0}\t\t{1}".format(round(row['predicted_severe_toxic'],2),row['severe_toxic']))
        print("obscene:\t{0}\t\t{1}".format(round(row['predicted_obscene'],2),row['obscene']))
        print("threat:\t\t{0}\t\t{1}".format(round(row['predicted_threat'],2),row['threat']))
        print("insult:\t\t{0}\t\t{1}".format(round(row['predicted_insult'],2),row['insult']))
        print("identity_hate:\t{0}\t\t{1}\n".format(round(row['predicted_identity_hate'],2),row['identity_hate']))
        print(row['comment_text'])
        print("\n===================\n")
print_random_classifications(validatePd,predictedValidation,n=100)

In [None]:
# Kaggle expects result:
# id,toxic,severe_toxic,obscene,threat,insult,identity_hate
# 00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
# 0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
# etc.

# Evaluation:
# Submissions are now evaluated on the mean column-wise ROC AUC. 
# In other words, the score is the average of the individual AUCs of each predicted column.
# - Which basically is the average of the area under the ROC curve (true positive rate vs. false positive rate)

## Results
- Train cross-validation scores are in the 0.8-0.95 range (for all binary classes)
- Validation avg ROC AUC is ~0.9 (Note: the validation set has different class ratios than the train set because of the sampling and the removal of the train data)
- Visual inspection shows some correctness

## Conclusion
There are still lots of possibilities for improvement. 
A lot of these have been mentioned in the introduction above.
In addition, features from topics resulting from LDA might improve on our results.

They are not implemented due to time constraints on my part. 
Also, for example spell checking and NER tools will take significant processing time. 
When distributed, this should not be a problem but on my 2 core mac it is ;)

In my opinion, the numbers look good. It is fairly accurate (although 10% False classification will result in 15k misclassifications).

Visual inspection of the results shows some correctness, although I find it hard to believe it is anywhere near good enough to use. It also doesn't seem to be close to the numbers resulting from the CV and the separate validation. I wouldn't be surprised if something is wrong here somewhere (and I am looking for it ;)). It probably has to do with the skewness, the binary sampling and the resulting validation set distributions.
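
The suspicion about skewness can be made concrete with a bit of arithmetic. This is a sketch under loud assumptions: the 0.9 true positive rate and 0.1 false positive rate are illustrative round numbers, not measurements from the models above, and `expected_precision` is a helper made up for this example. The 0.0088 prevalence is the "threat" share quoted in the interpretation section.

```python
def expected_precision(prevalence, true_positive_rate, false_positive_rate):
    """Share of flagged comments that really belong to the class (Bayes' rule)."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Same classifier quality, two different class distributions.
balanced = expected_precision(0.5, 0.9, 0.1)     # balanced binary training set
skewed = expected_precision(0.0088, 0.9, 0.1)    # rare class in the full data

print(round(balanced, 3))   # 0.9
print(round(skewed, 3))     # 0.074
```

So a model that looks about 90% right on a balanced sample would, under these assumptions, flag mostly normal comments when applied to the skewed data, which would match scores that look good next to a visual inspection that looks poor.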