## Modelling - ML

The goal of our work is not only to train a model to identify toxic comments, but to do so while reducing bias.
Bias in this task is the situation where certain identities such as 'Black', 'Muslim', 'Gay', etc. begin triggering a toxic classification for the comments they appear in, even when the comment is actually positive. This is a key issue in toxic comment classification.

The goal of the [Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) Kaggle challenge was to reduce this bias via a newly developed metric, which we define below.

**Note: The goal of the model is simply to predict the toxicity score of a comment.**

The bias-weighted ROC-AUC metric below is calculated by segmenting the dataset into identity subgroups using the provided identity labels and then calculating the metrics for each subgroup.

### Metrics:

In addition to accuracy, we will observe the metrics below for our models.

#### Overall ROC-AUC:

This is the standard ROC-AUC for the full evaluation set, i.e. the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate of a binary classifier across all decision thresholds.
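As a small illustration of why ROC-AUC is normally computed on continuous scores rather than thresholded labels, the sketch below compares the two on toy data. The labels and scores are made-up values, not results from this project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and hypothetical model scores (illustrative values only)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.60, 0.35, 0.80, 0.90])

# AUC on continuous scores: the fraction of (positive, negative) pairs
# in which the positive example receives the higher score
auc_scores = roc_auc_score(y_true, y_score)

# AUC on thresholded 0/1 predictions throws away the ranking within each class
y_hard = (y_score >= 0.5).astype(int)
auc_hard = roc_auc_score(y_true, y_hard)

print(f'AUC on scores: {auc_scores:.3f}')
print(f'AUC on hard predictions: {auc_hard:.3f}')
```

Here 7 of the 9 positive/negative pairs are ranked correctly by the scores, while thresholding introduces ties that lower the AUC.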

#### Subgroup ROC-AUC:

Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

#### BPSN AUC:

BPSN (Background Positive, Subgroup Negative) AUC: Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

#### BNSP AUC:

BNSP (Background Negative, Subgroup Positive) AUC: Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
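The three subsets described above can be sketched on a toy dataframe. The column names `mentions_identity`, `toxic`, and `score` are hypothetical and chosen for illustration; the scores are constructed so that the model inflates toxicity for comments mentioning the identity.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy data: every comment that mentions the identity gets a higher score
# than every comment that does not, regardless of true toxicity
toy = pd.DataFrame({
    'mentions_identity': [True, True, True, True, False, False, False, False],
    'toxic':             [True, True, False, False, True, True, False, False],
    'score':             [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1],
})

sub = toy['mentions_identity']
tox = toy['toxic']

def auc_on(mask):
    """ROC-AUC restricted to the rows selected by a boolean mask."""
    subset = toy[mask]
    return roc_auc_score(subset['toxic'], subset['score'])

# Subgroup AUC: only comments that mention the identity
subgroup_auc = auc_on(sub)
# BPSN: non-toxic comments mentioning the identity + toxic comments that do not
bpsn_auc = auc_on((sub & ~tox) | (~sub & tox))
# BNSP: toxic comments mentioning the identity + non-toxic comments that do not
bnsp_auc = auc_on((sub & tox) | (~sub & ~tox))

print(subgroup_auc, bpsn_auc, bnsp_auc)
```

In this constructed case the subgroup and BNSP AUCs are perfect while the BPSN AUC collapses, which is the signature of a model that over-predicts toxicity for non-toxic comments mentioning the identity.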


#### Generalized Mean of Bias AUCs
To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:

$M_p(m_s) = \left(\frac{1}{N} \sum_{s=1}^{N} m_s^p\right)^\frac{1}{p}$

Where:

$M_p$ = the $p$th power-mean function

$m_s$ = the bias metric $m$ calculated for subgroup $s$

$N$ = number of identity subgroups

For this competition, Jigsaw AI uses a $p$ value of -5 to encourage competitors to improve the model for the identity subgroups with the lowest model performance. A negative $p$ makes the mean weight the lowest subgroup values most heavily.
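To make the effect of $p = -5$ concrete, here is a minimal sketch of the power mean applied to three hypothetical subgroup AUCs, one of which is much lower than the others. The AUC values are invented for illustration.

```python
import numpy as np

def power_mean(values, p):
    """Generalized (power) mean M_p of a sequence of positive values."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** p) ** (1.0 / p)

# Hypothetical per-subgroup AUCs; the third subgroup performs poorly
aucs = [0.95, 0.93, 0.70]

arithmetic = power_mean(aucs, 1)    # ordinary average (p = 1)
generalized = power_mean(aucs, -5)  # competition setting (p = -5)

print(f'arithmetic mean: {arithmetic:.3f}')
print(f'power mean, p=-5: {generalized:.3f}')
```

The $p = -5$ mean lands below the arithmetic mean and closer to the weakest subgroup, so a single poorly served identity drags the combined score down noticeably.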

### Final Metric
We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:

$score = w_0 AUC_{overall} + \sum_{a=1}^{A} w_a M_p(m_{s,a})$

Where:

$A$ = number of submetrics (3)

$m_{s,a}$ = bias metric for identity subgroup $s$ using submetric $a$

$w_a$ = a weighting for the relative importance of each submetric; all four $w$ values are set to 0.25
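As a worked example of the formula, the sketch below combines a hypothetical overall AUC with three hypothetical generalized means (one per submetric), using the weight of 0.25 for every term. The function name `final_score` and the input values are assumptions for illustration only.

```python
def final_score(overall_auc, bias_means, weight=0.25):
    """Weighted sum of the overall AUC and the per-submetric generalized means."""
    return weight * overall_auc + sum(weight * m for m in bias_means)

# Hypothetical inputs: overall AUC, then M_p for subgroup, BPSN and BNSP AUCs
score = final_score(0.90, [0.80, 0.85, 0.75])
print(f'final score: {score:.3f}')
```

With equal weights the score is simply the average of the four terms, here (0.90 + 0.80 + 0.85 + 0.75) / 4.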


### Process:

#### Classical ML models
This is primarily an NLP task: our X feature matrix will be based on the text of online comments. We have defined a pre-processing pipeline in the 'preprocessing.ipynb' notebook for our ML classifiers, and a separate pre-processing pipeline for the neural network models we plan to train.

From the classic ML classifier models, we intend to use the following, with TF-IDF as our base text vectorization technique:

   * Logistic Regression
   * SVM
   * Random Forest
 
   
We will carry out hyperparameter optimization for each model and calculate the metrics for each.
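The general shape of a TF-IDF plus classifier pipeline can be sketched on a handful of toy comments. The texts and labels below are invented for illustration, and the default vectorizer and model settings are assumptions rather than the tuned configuration used later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy comments and labels (1 = toxic), for illustration only
texts = [
    "you are stupid and awful",
    "what a stupid idea, you fool",
    "thank you for the helpful article",
    "great point, I agree completely",
]
labels = [1, 1, 0, 0]

# TF-IDF turns each comment into a sparse weighted bag-of-words vector,
# which the classifier then consumes
clf = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('model', LogisticRegression()),
])
clf.fit(texts, labels)

# Probability of the toxic class for two unseen comments
probs = clf.predict_proba(["you stupid fool", "helpful and great article"])[:, 1]
print(probs)
```

Wrapping the vectorizer and model in one pipeline means the TF-IDF vocabulary is refitted on each training fold during cross-validation, which avoids leaking information from the held-out fold.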

#### Neural Networks

We will also train a neural network to answer this problem. We will start with a basic LSTM model which will be made of:
    
   * Two LSTM layers to read through the data
   * Two Dense layers w/ 4 nodes
   * Output layer using sigmoid for the classes
   
We will then seek to improve this LSTM by creating a Bidirectional LSTM (BiLSTM), which we believe will improve accuracy by reading input sequences in both directions. If time allows, we will also attempt to include a simple attention mechanism.

The NN models will use GloVe 840B 300d word embeddings.
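One common way to feed pre-trained vectors to a neural network is to build an embedding matrix indexed by the tokenizer's word ids. The sketch below assumes the usual GloVe text layout (a token followed by its vector on each line) and uses a tiny in-memory stand-in with 3 dimensions instead of 300. The helper names `load_vectors` and `build_embedding_matrix` and the `word_index` mapping are hypothetical.

```python
import io
import numpy as np

# Tiny stand-in for a GloVe text file: one token followed by its vector per line.
# The real GloVe 840B vectors have 300 dimensions; 3 are used here for brevity.
fake_glove = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
)

def load_vectors(handle):
    """Parse 'token v1 v2 ...' lines into a {token: vector} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; row 0 is left for padding."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype='float32')
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]  # out-of-vocabulary words stay as zeros
    return matrix

vectors = load_vectors(fake_glove)
word_index = {'good': 1, 'bad': 2, 'unseen': 3}  # hypothetical tokenizer vocabulary
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(embedding_matrix)
```

Leaving unknown words as zero vectors is only one possible design choice; initializing them randomly or mapping them to a shared unknown token are common alternatives.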


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import contractions
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects
import gc
import operator


import warnings
warnings.filterwarnings('ignore')



## Stage 1: Apply pre-processing to text

In [2]:
# Load train_data
train_df = pd.read_csv('data/train_clean.csv')

# Load in test data
test_df = pd.read_csv('data/test_clean.csv')

In [3]:
test_df.rename(columns={'toxicity': 'target'}, inplace=True)

In [5]:
# Drop the unneeded columns
train_df = train_df.iloc[:,1:]
test_df = test_df.iloc[:,1:]

In [6]:
# In this cell we define the function that pre-processes our text

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
import contractions
import string

def text_cleaner(df, col_name, clean_col_name):
    '''
    Text pre-processing pipeline, we lemmatize words, expand contractions, remove common stop words, apply lower case,
    tokenize, and delete punctuation. All functions use apply and list comprehension for speed benefit.
   
    INPUT:
    df = name of dataframe
    col_name = name of column to pre-process
    clean_col_name = name of new cleaned_column
   
    OUTPUT:
    None - changes are made directly to dataframe
    
    '''

    # Lemmatize helper functions
    # Lemmatize nouns
    def lemmatize_text_noun(text):
        return [lemmatizer.lemmatize(w, pos='n') for w in text]
    
    # Lemmatize verbs
    def lemmatize_text_verb(text):
        return [lemmatizer.lemmatize(w, pos='v') for w in text]
    # Lemmatize adjectives
    def lemmatize_text_adj(text):
        return [lemmatizer.lemmatize(w, pos='a') for w in text]

    # Lemmatize adverbs
    def lemmatize_text_adv(text):
        return [lemmatizer.lemmatize(w, pos='r') for w in text]
    
    # Expand contraction method
    def contraction_expand(text):
        return contractions.fix(text)
    
    # To lower case.
    df[clean_col_name] = df[col_name].apply(lambda x: x.lower())
    
    # Expand contractions
    df[clean_col_name] = df[clean_col_name].apply(lambda x: contraction_expand(x))
    
    #Tokenize:
    tokenizer = TweetTokenizer(reduce_len=True)
    df[clean_col_name] = df[clean_col_name].apply(lambda x: tokenizer.tokenize(x))
   
    
    #Remove Stop words
    stop_words = stopwords.words('english')
    df[clean_col_name] = df[clean_col_name].apply(lambda x: [item for item in x if item not in stop_words])
    
    #Delete punctuation
    punc_table = str.maketrans('', '', string.punctuation)
    df[clean_col_name] = df[clean_col_name].apply(lambda x: [item.translate(punc_table) for item in x])
    
    # LEMMATIZATION
    lemmatizer = WordNetLemmatizer()
    
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_noun)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_verb)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_adj)
    df[clean_col_name] = df[clean_col_name].apply(lemmatize_text_adv)
    
    
    return None


def detokenizer(df, col_name):
    '''Join the token lists in col_name back into strings, stored in col_name + "_detokenize".'''
    treebank_detokenizer = TreebankWordDetokenizer()
    df[col_name+'_detokenize'] = df[col_name].apply(treebank_detokenizer.detokenize)
    
    return None

In [7]:
%%time
# Run the cleaner func on train data
text_cleaner(train_df, 'comment_text', 'comment_text_clean')

CPU times: user 19min 18s, sys: 8.28 s, total: 19min 26s
Wall time: 19min 30s


In [None]:
%%time 
# Run the same cleaner on the test data
text_cleaner(test_df, 'comment_text', 'comment_text_clean')

In [9]:
%%time
detokenizer(train_df,'comment_text_clean')

CPU times: user 3min 34s, sys: 803 ms, total: 3min 35s
Wall time: 3min 37s


In [10]:
%%time
# detokenize the test data
detokenizer(test_df, 'comment_text_clean')

CPU times: user 23.9 s, sys: 194 ms, total: 24.1 s
Wall time: 24.4 s


## Stage 2: Modelling

The data has been pre-processed for our models. We can now begin model training.

We have a separate test dataset that will be kept aside and used only once we have an ideal model. For hyperparameter optimization we will use Scikit-learn's GridSearchCV.

After we have fitted a model and predicted results, we can append the predictions to the dataframe and calculate the subgroup AUCs and the final weighted metric.

#### Defining subgroup AUC metrics

In [11]:
from sklearn import metrics
SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive


# These calculations have been provided by Jigsaw AI for scoring based on the metrics of the kaggle competition
# https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation

# They work by filtering the relevant dataframe into specific subgroups and using the roc_auc_score metric from sklearn.

def compute_auc(y_true, y_pred):
    try:
        return metrics.roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan

def compute_subgroup_auc(df, subgroup, label, model_name):
    subgroup_examples = df[df[subgroup]]
    return compute_auc(subgroup_examples[label], subgroup_examples[model_name])

def compute_bpsn_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
    subgroup_negative_examples = df.loc[df[subgroup] & ~df[label]]
    non_subgroup_positive_examples = df.loc[~df[subgroup] & df[label]]
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bnsp_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
    subgroup_positive_examples = df.loc[df[subgroup] & df[label]]
    non_subgroup_negative_examples = df.loc[~df[subgroup] & ~df[label]]
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
    """Computes per-subgroup metrics for all subgroups and one model."""
    records = []
    for subgroup in subgroups:
        record = {
            'subgroup': subgroup,
            'subgroup_size': len(dataset.loc[dataset[subgroup]])
        }
        record[SUBGROUP_AUC] = compute_subgroup_auc(dataset, subgroup, label_col, model)
        record[BPSN_AUC] = compute_bpsn_auc(dataset, subgroup, label_col, model)
        record[BNSP_AUC] = compute_bnsp_auc(dataset, subgroup, label_col, model)
        records.append(record)
    return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)



In [12]:
# These calculations have been provided by Jigsaw AI for scoring based on the metrics of the kaggle competition
# https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation

# They work by filtering the relevant dataframe into specific subgroups and using the roc_auc_score metric from sklearn.

def calculate_overall_auc(df, model_name):
    true_labels = df[TOXICITY_COLUMN]
    predicted_labels = df[model_name]
    return metrics.roc_auc_score(true_labels, predicted_labels)

def power_mean(series, p):
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)

def get_final_metric(bias_df, overall_auc, POWER=-5, OVERALL_MODEL_WEIGHT=0.25):
    bias_score = np.average([
        power_mean(bias_df[SUBGROUP_AUC], POWER),
        power_mean(bias_df[BPSN_AUC], POWER),
        power_mean(bias_df[BNSP_AUC], POWER)
    ])
    return (OVERALL_MODEL_WEIGHT * overall_auc) + ((1 - OVERALL_MODEL_WEIGHT) * bias_score)
    


y_train shape: (1804874,)
y_test shape: (194640,)


### Logistic Regression

We will use grid search cross-validation to optimize our hyperparameters on the training set. We will then fit the best estimator, predict on the test set, and calculate the relevant metrics.

Because the subgroup ROC-AUCs are calculated post-prediction and require predictions to be appended to a dataframe, we cannot pass these metrics directly into the scoring argument of scikit-learn's GridSearchCV. Instead, we will score the models on accuracy and overall ROC-AUC, and refit the grid search on the parameters with the best ROC-AUC as a proxy.
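The proxy approach can be illustrated with a small multi-metric grid search on synthetic data. The dataset from `make_classification`, the parameter values, and the bare `LogisticRegression` estimator are assumptions for the sketch, not the configuration used on the comment data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized comments
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Track both metrics, but pick and refit the winner on AUC only
scorers = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]},
                    scoring=scorers, refit='AUC', cv=3)
grid.fit(X, y)

# Each metric gets its own column in cv_results_
print(grid.best_params_)
print(grid.cv_results_['mean_test_AUC'])
print(grid.cv_results_['mean_test_Accuracy'])
```

With `refit='AUC'`, `best_params_`, `best_score_`, and `best_estimator_` all refer to the candidate with the highest mean cross-validated AUC, while accuracy is recorded for reference only.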

In [19]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
# No scaling step: the TF-IDF output is already L2-normalised

# Instantiate the tokenizer to use in the vectorizer
tweet_tokenizer = TweetTokenizer(reduce_len=True)

# Instantiate the vectorizer, we will pass in the TweetTokenizer() from nltk 
tfid_vec_2 = TfidfVectorizer(lowercase=False, tokenizer = tweet_tokenizer.tokenize)

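# Define the pipeline. SVC() is a placeholder model; the parameter grids below
# override the 'model' step where needed. The 'tf-idf' step must be present
# because the grids reference it.
pipeline = Pipeline([('tf-idf', tfid_vec_2),
                     ('model', SVC())])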



CPU times: user 1.65 ms, sys: 3.6 ms, total: 5.25 ms
Wall time: 8.3 ms


In [15]:
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
# Define scoring functions
scorers = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}


In [None]:
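from sklearn.linear_model import LogisticRegression

# define parameters
# solver='liblinear' is set because it supports both the 'l1' and 'l2' penalties
param_grid_log = [{'model': [LogisticRegression(solver='liblinear')], 'tf-idf': [tfid_vec_2], 
                   'model__penalty': ['l1', 'l2'],
                   'model__C': [.001, 0.01, 0.1, 1, 10, 100]}]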

# Setting refit='AUC', refits an estimator on the whole dataset with the
# parameter setting that has the best cross-validated AUC score, i.e it finds the params that gives the best scores
# then refits using the ones that give the best AUC. 
grid_log = GridSearchCV(pipeline, param_grid_log, cv=5, scoring=scorers, refit='AUC', 
                        return_train_score=True, n_jobs=-1)

fittedgrid_log = grid_log.fit(train_df['comment_text_clean_detokenize'], train_df['target'])

In [None]:
# sklearn.externals.joblib has been removed from scikit-learn; use joblib directly
import joblib
# Save best estimator to file
joblib.dump(fittedgrid_log.best_estimator_, 'saved_models/best_log_reg.pkl')

In [50]:
# Run this to load the model if required
from joblib import load
fittedgrid_log = load('saved_models/best_log_reg.pkl')

In [52]:
# Get scores
train_accuracy = fittedgrid_log.score(train_df['comment_text_clean_detokenize'], y_train)
test_accuracy = fittedgrid_log.score(test_df['comment_text_clean_detokenize'], y_test)

In [53]:
# predict train and test values
y_train_pred = fittedgrid_log.predict(train_df['comment_text_clean_detokenize'])
y_test_pred = fittedgrid_log.predict(test_df['comment_text_clean_detokenize'])

y_train_pred_prob = fittedgrid_log.predict_proba(train_df['comment_text_clean_detokenize'])
y_test_pred_prob = fittedgrid_log.predict_proba(test_df['comment_text_clean_detokenize'])

In [57]:
# Confusion Matrix and Classification Report, we want to store the F1 Score.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

# store the precision, recall, and f1 score for later and print the classification report
log_precision = precision_score(y_test, y_test_pred)
log_recall = recall_score(y_test, y_test_pred)
log_f1 = f1_score(y_test, y_test_pred)

print(classification_report(y_test, y_test_pred))


              precision    recall  f1-score   support

           0       0.96      0.99      0.97    179192
           1       0.76      0.50      0.60     15448

    accuracy                           0.95    194640
   macro avg       0.86      0.74      0.79    194640
weighted avg       0.94      0.95      0.94    194640



In [85]:
# Append preds and probabilities to train and test dfs
train_df['Prediction_log'] = y_train_pred
train_df['Prediction_probability_log'] = y_train_pred_prob[:, 1]
test_df['Prediction_log'] = y_test_pred
test_df['Prediction_probability_log'] = y_test_pred_prob[:, 1]

In [34]:
# Identity columns used to calculate subgroup AUC
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']



In [None]:
# Convert the fractional identity and target labels to booleans (>= 0.5 counts as True)
for col in identity_columns + ['target']:
    train_df[col] = train_df[col] >= 0.5
    test_df[col] = test_df[col] >= 0.5

In [None]:
# Generate the AUC metrics 

SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
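BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive

# Note: the AUCs below are computed on the hard 0/1 predictions in this column.
# 'Prediction_probability_log' holds the probability scores, which AUC is normally computed on.
MODEL_NAME = 'Prediction_log'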
TOXICITY_COLUMN = 'target'

log_bias_metrics_df_train = compute_bias_metrics_for_model(train_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
log_final_metric_train = get_final_metric(log_bias_metrics_df_train, calculate_overall_auc(train_df, MODEL_NAME))

log_bias_metrics_df_test = compute_bias_metrics_for_model(test_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
log_final_metric_test = get_final_metric(log_bias_metrics_df_test, calculate_overall_auc(test_df, MODEL_NAME))

In [None]:
log_bias_metrics_df_test

In [None]:
print(f'train_accuracy: {train_accuracy}')
print(f'train weighted subgroup AUC: {log_final_metric_train}')
print(f'test_accuracy: {test_accuracy}')
print(f'test weighted subgroup AUC: {log_final_metric_test}')


#### Model Evaluation:

The logistic regression performed well on pure accuracy, with a train accuracy of 95.38% and a test accuracy of 94.96%. It is also encouraging that our hyperparameter optimization has led to a model which does not overfit excessively.

However, when we look at the weighted subgroup AUC metric, the 71.9% train score and 71.5% test score show that the model did have a tendency towards biased predictions for certain subgroups. In comparison, the benchmark CNN that was provided had a weighted AUC score of 88.35%, albeit just on a validation set.

For example, we can see that for the 'black' identity the BPSN AUC was relatively low, suggesting the model is likely over-associating mentions of the 'black' identity with toxicity.

--------------

### SVM

We will now carry out a similar process with SVM to see whether it performs appreciably differently from logistic regression.

#### Grid searching on a subset of data

Given the length of time taken to grid search using SVM, we will run a gridsearch on a subset of our dataset by taking a new train test split. We will then use the results there as a proxy for the optimal parameters for our full dataset.  
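A stratified split keeps the share of toxic comments in the subset close to that of the full data, which matters when the positive class is rare. The sketch below checks this on synthetic labels. The 8% toxic rate and the dataframe `toy_df` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic boolean target with roughly 8% positives (illustrative only)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'target': rng.random(1000) < 0.08})

# Keep a quarter of the rows, stratified on the target
_, subset = train_test_split(toy_df, test_size=0.25,
                             stratify=toy_df['target'], random_state=1)

full_rate = toy_df['target'].mean()
subset_rate = subset['target'].mean()
print(f'toxic rate in full data: {full_rate:.3f}')
print(f'toxic rate in subset:    {subset_rate:.3f}')
```

Without `stratify`, a random quarter of the data could end up with a noticeably different toxic rate, which would make the subset a weaker proxy for the full dataset.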

In [None]:
from sklearn.model_selection import train_test_split
# Create a new split of the training dataframe to use for this reduced test. 
# we will not need the remainder. We will take 1/4 of the total data set
remainder, reduced_df = train_test_split(train_df, test_size=0.25, stratify=train_df['target'], random_state=1)

In [None]:
# Grid search applies tfid_vec_2 to all models. We try combinations of the penalty strength C and
# gamma. Gamma is the kernel coefficient for non-linear (RBF) decision boundaries:
# the higher the gamma value, the more closely the model tries to fit the training data.

param_grid_svc = [{'model__kernel': ['rbf'], 'tf-idf': [tfid_vec_2],
                   'model__C': [0.1, 1, 10], 'model__gamma': [0.1, 1, 10, 'scale'],
                   'model__probability': [True]}]

grid_svc = GridSearchCV(pipeline, param_grid_svc, scoring=scorers, cv=5, refit='AUC',
                       return_train_score=True, n_jobs=-1, verbose=10)

# Fit on the reduced subset created above, not the full training data
reduced_grid_svm = grid_svc.fit(reduced_df['comment_text_clean_detokenize'], reduced_df['target'])

In [None]:
# Gives the model with the best AUC score
reduced_grid_svm.best_estimator_

In [None]:
# gives the best AUC score
reduced_grid_svm.best_score_

In [None]:
%%time
# Full dataset grid search

param_grid_svc = [{'model__kernel': ['rbf'],'tf-idf': [tfid_vec_2],
              'model__C': [0.1, 1, 10], 'model__probability': [True]}]

grid_svc = GridSearchCV(pipeline, param_grid_svc, scoring=scorers, cv=5, refit='AUC',
                       return_train_score=True, n_jobs=-1, verbose=10)

fittedgrid_svc = grid_svc.fit(train_df['comment_text_clean_detokenize'], train_df['target'])

In [None]:
# Save best estimator to file
joblib.dump(fittedgrid_svc.best_estimator_, 'saved_models/best_svm.pkl')

In [None]:
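# Get accuracy scores. With multi-metric scoring and refit='AUC', GridSearchCV.score()
# returns the AUC, so we score the refitted best estimator directly to get accuracy.
train_accuracy = fittedgrid_svc.best_estimator_.score(train_df['comment_text_clean_detokenize'], y_train)
test_accuracy = fittedgrid_svc.best_estimator_.score(test_df['comment_text_clean_detokenize'], y_test)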

In [None]:
# predict train and test values
y_train_pred = fittedgrid_svc.predict(train_df['comment_text_clean_detokenize'])
y_test_pred = fittedgrid_svc.predict(test_df['comment_text_clean_detokenize'])

y_train_pred_prob = fittedgrid_svc.predict_proba(train_df['comment_text_clean_detokenize'])
y_test_pred_prob = fittedgrid_svc.predict_proba(test_df['comment_text_clean_detokenize'])

In [None]:
# store the precision, recall, and f1 score for later and print the classification report
svm_precision = precision_score(y_test, y_test_pred)
svm_recall = recall_score(y_test, y_test_pred)
svm_f1 = f1_score(y_test, y_test_pred)

print(classification_report(y_test, y_test_pred))

In [84]:
# Append preds and probabilities to train and test dfs
train_df['Prediction_svc'] = y_train_pred
train_df['Prediction_probability_svc'] = y_train_pred_prob[:, 1]
test_df['Prediction_svc'] = y_test_pred
test_df['Prediction_probability_svc'] = y_test_pred_prob[:, 1]

In [None]:
# Generate the AUC metrics 

SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'

MODEL_NAME = 'Prediction_svc'
TOXICITY_COLUMN = 'target'

svm_bias_metrics_df_train = compute_bias_metrics_for_model(train_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
svm_final_metric_train = get_final_metric(svm_bias_metrics_df_train, calculate_overall_auc(train_df, MODEL_NAME))

svm_bias_metrics_df_test = compute_bias_metrics_for_model(test_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
svm_final_metric_test = get_final_metric(svm_bias_metrics_df_test, calculate_overall_auc(test_df, MODEL_NAME))

--------

### Random Forest

For our final model we will try the Random Forest Classifier, which is an ensemble method based on the Decision Tree model. Each tree in the forest is fitted on a random sub-sample of the data drawn with replacement, which is known as 'bagging'. A voting algorithm is then applied to the results of the individual trees to determine the final class of each data point. The aim of Random Forest is to train a series of overfit models and then average out their results via voting to get better predictions. The main hyperparameter we will optimize is the number of decision trees to use.
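The bagging and voting idea can be sketched by hand with plain decision trees on synthetic data. This is a simplified illustration under stated assumptions, not how `RandomForestClassifier` is implemented: scikit-learn's forest also samples a random subset of features at each split and averages predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized comments
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Each unpruned tree is free to overfit its own bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Collect every tree's prediction: shape (n_trees, n_samples)
votes = np.stack([tree.predict(X) for tree in trees])

# Majority vote across trees gives the ensemble prediction
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(f'ensemble training accuracy: {(majority == y).mean():.3f}')
```

An odd number of trees is used so that a binary majority vote can never tie.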


#### Grid Search on a subset of data

Given the length of time taken to grid search on the Random Forest Classifier, we will run a gridsearch on a subset of our dataset by taking a new train test split. We will then use the results there as a proxy for the optimal parameters for our full dataset. We will use the same train_test split as used for the SVM subset search. 

In [60]:
%%time
from sklearn.ensemble import RandomForestClassifier

# Note, most of the time here actually comes from the application of the tfid vectorizer to each fold of the data
# Random forest itself takes ~5minutes to run on the vectorized data.

# We control the model max depth 
param_grid_RF = {'model': [RandomForestClassifier()], 'tf-idf': [tfid_vec_2], 
                 'model__n_estimators': [10,50,100],
                 'model__max_depth': [100, 500, 1000, 5000],
             }
reduced_grid_RF = GridSearchCV(pipeline, param_grid_RF, scoring=scorers, cv=3, refit='AUC',
                       return_train_score=True, n_jobs=-1, verbose = 10)

reduced_grid_RF = reduced_grid_RF.fit(reduced_df['comment_text_clean_detokenize'], reduced_df['target'])

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed: 13.1min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed: 35.6min
[Parallel(n_jobs=-1)]: Done  25 out of  36 | elapsed: 78.8min remaining: 34.7min
[Parallel(n_jobs=-1)]: Done  29 out of  36 | elapsed: 113.6min remaining: 27.4min
[Parallel(n_jobs=-1)]: Done  33 out of  36 | elapsed: 132.4min remaining: 12.0min
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed: 165.0min finished


CPU times: user 50min 44s, sys: 12.9 s, total: 50min 57s
Wall time: 3h 35min 49s


In [61]:
# check the best estimator parameters
reduced_grid_RF.best_estimator_

Pipeline(memory=None,
         steps=[('tf-idf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=1000,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

In [62]:
# gives the best AUC score
reduced_grid_RF.best_score_

0.9230210872062684

In [None]:
#reduced_grid_RF.cv_results_

The above test has suggested that in terms of hyperparameters, we should look at a model that has n_estimators around 100 or above, and max_depth of around 1000. We will use this as a guide for our grid search. 

#### Grid Search on the full dataset
Now that we have a good idea of optimal parameters, we will run another grid search on a few parameters around the optimal figures provided by our test on the subset of data. 

In [None]:
# Running a grid-search on the full set of data 
param_grid_RF = {'model': [RandomForestClassifier()], 'tf-idf': [tfid_vec_2], 
                 'model__n_estimators': [100,150,200],
                 'model__max_depth': [1000, 2000, 3000, 4000],
             }
grid_RF = GridSearchCV(pipeline, param_grid_RF, scoring=scorers, cv=3, refit='AUC',
                       return_train_score=True, n_jobs=-1, verbose = 10)

fittedgrid_RF = grid_RF.fit(train_df['comment_text_clean_detokenize'], train_df['target'])

In [None]:
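# Get accuracy scores from the refitted best estimator
# (GridSearchCV.score() would return the AUC because refit='AUC')
train_accuracy = fittedgrid_RF.best_estimator_.score(train_df['comment_text_clean_detokenize'], y_train)
test_accuracy = fittedgrid_RF.best_estimator_.score(test_df['comment_text_clean_detokenize'], y_test)

# predict train and test values
y_train_pred = fittedgrid_RF.predict(train_df['comment_text_clean_detokenize'])
y_test_pred = fittedgrid_RF.predict(test_df['comment_text_clean_detokenize'])

y_train_pred_prob = fittedgrid_RF.predict_proba(train_df['comment_text_clean_detokenize'])
y_test_pred_prob = fittedgrid_RF.predict_proba(test_df['comment_text_clean_detokenize'])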

In [None]:
# store the precision, recall, and f1 score for later and print the classification report
rf_precision = precision_score(y_test, y_test_pred)
rf_recall = recall_score(y_test, y_test_pred)
rf_f1 = f1_score(y_test, y_test_pred)

print(classification_report(y_test, y_test_pred))

In [None]:
# Identity columns used to calculate subgroup AUC
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']

In [None]:
# Convert the fractional identity and target labels to booleans (>= 0.5 counts as True)
for col in identity_columns + ['target']:
    train_df[col] = train_df[col] >= 0.5
    test_df[col] = test_df[col] >= 0.5

In [None]:
# Append preds and probabilities to train and test dfs
train_df['Prediction_RF'] = y_train_pred
train_df['Prediction_probability_RF'] = y_train_pred_prob[:, 1]
test_df['Prediction_RF'] = y_test_pred
test_df['Prediction_probability_RF'] = y_test_pred_prob[:, 1]

In [None]:
# Generate the AUC metrics 

SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'

MODEL_NAME = 'Prediction_RF'
TOXICITY_COLUMN = 'target'

rf_bias_metrics_df_train = compute_bias_metrics_for_model(train_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
rf_final_metric_train = get_final_metric(rf_bias_metrics_df_train, calculate_overall_auc(train_df, MODEL_NAME))

rf_bias_metrics_df_test = compute_bias_metrics_for_model(test_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
rf_final_metric_test = get_final_metric(rf_bias_metrics_df_test, calculate_overall_auc(test_df, MODEL_NAME))

In [None]:
rf_bias_metrics_df_test

In [None]:
rf_final_metric_test

In [None]:
print(f'train_accuracy:{train_accuracy}')
print(f'train weighted AUC:{rf_final_metric_train}')
print(f'test_accuracy:{test_accuracy}')
print(f'test weighted AUC::{rf_final_metric_test}')


In [None]:
We can see that the best estimator was selected with no bound on the max tree depth. This has caused the model to significantly overfit on the training data.

--------

### Model performance

Now that we have trained three separate models, each of a type that has been shown to deliver strong performance in text classification tasks, let us compare them side by side and discuss their shortcomings in terms of reducing bias.

In [None]:
#Display the subgroup bias metrics tables for each model and then the final metrics for each model for test only
display(log_bias_metrics_df_test)
display(svm_bias_metrics_df_test)
display(rf_bias_metrics_df_test)

In [None]:
# Display final metrics for each model for test only.
print(f' Final metric for Logistic Regression: {log_final_metric_test}')
print(f' Final metric for SVM: {svm_final_metric_test}')
print(f' Final metric for Random Forest: {rf_final_metric_test}')

-----

In [None]:
# Lime implementation for logreg

In [66]:
import eli5
from eli5.lime import TextExplainer

Using TensorFlow backend.


In [None]:
# Load in log model
fittedgrid_log = joblib.load('saved_models/best_log_reg.pkl')

In [105]:
text_model_log = TextExplainer(random_state=1)

text = test_df.loc[59102,'comment_text']

text_model_log.fit(text, fittedgrid_log.predict_proba)
text_model_log.show_prediction()

Contribution?,Feature
1.845,Highlighted in text (sum)
-0.821,<BIAS>


In [77]:
test_df['comment_text_clean_detokenize']

0         jeff session another one trump orwellian choic...
1         actually inspected infrastructure grand chief ...
2         wishful thinking democrat fault  100 th time  ...
3         instead wringing hand nibbling periphery issue...
4         many commenters garbage piled high yard  bald ...
                                ...                        
194635    lose job promoting misinformation harmful student
194636    thinning project meant lower fire danger impro...
194637            hope millennials happy put airhead charge
194638    I thinking kellyanne conway   k   trump whispe...
194639    still figure pizza ak cost pizza washington  i...
Name: comment_text_clean_detokenize, Length: 194640, dtype: object

In [103]:
# Non-toxic comments mentioning 'black' that the logistic regression flagged as toxic
test_df[(test_df['target'] == 0) & (test_df['Prediction_log'] == 1) & (test_df['black'] == 1)].iloc[40:60, :]

| Unnamed: 0 | id | comment_text | created_date | publication_id | article_id | rating | funny | wow | sad | likes | ... | physical_disability | intellectual_or_learning_disability | psychiatric_or_mental_illness | other_disability | comment_text_clean | comment_text_clean_detokenize | Prediction_svc | Prediction_probability_svc | Prediction_log | Prediction_probability_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59102 | 7059102 | In the US at least, someone who is half-black ... | 2016-10-15 07:49:23.171142+00 | 53 | 148503 | approved | 0 | 0 | 0 | 7 | ... | 0 | 0 | 0 | 0 | [u, least, , someone, halfblack, considered, b... | u least someone halfblack considered black g... | 1 | 0.73604 | 1 | 0.73604 |
| 62101 | 7062101 | I would say the photo is not fake. It is a rea... | 2016-06-04 23:07:41.077770+00 | 21 | 138054 | approved | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | [would, say, photo, fake, , real, photo, , eve... | would say photo fake real photo even say tak... | 1 | 0.58031 | 1 | 0.58031 |
| 63982 | 7063982 | The reason no KKK member or neo- Nazi wouldn't... | 2017-08-17 19:00:58.413842+00 | 102 | 367562 | approved | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | [reason, kkk, member, neo, , nazi, would, vote... | reason kkk member neo nazi would voted barack... | 1 | 0.842451 | 1 | 0.842451 |
| 67553 | 7067553 | You may have lost to Trump; but you sure have ... | 2017-01-23 15:47:27.135615+00 | 54 | 163496 | approved | 0 | 0 | 0 | 7 | ... | 0 | 0 | 0 | 0 | [may, lost, trump, , sure, clue, happened, , h... | may lost trump sure clue happened here tip r... | 1 | 0.629164 | 1 | 0.629164 |
| 67887 | 7067887 | I posted this info, but people have more fun p... | 2017-08-30 20:16:15.845371+00 | 102 | 372110 | approved | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | [posted, info, , people, fun, pointing, finger... | posted info people fun pointing finger rather... | 1 | 0.747986 | 1 | 0.747986 |
| 68316 | 7068316 | The FBI statistics you're referring to point o... | 2017-08-30 04:37:03.116935+00 | 102 | 372110 | approved | 0 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | [fbi, statistic, referring, point, white, blac... | fbi statistic referring point white black most... | 1 | 0.539965 | 1 | 0.539965 |
| 71154 | 7071154 | Typical response of the entitled white male wi... | 2016-07-10 00:23:04.787112+00 | 21 | 141018 | approved | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | [typical, response, entitled, white, male, und... | typical response entitled white male underlyin... | 1 | 0.756865 | 1 | 0.756865 |
| 71313 | 7071313 | The Electoral College was designed to support ... | 2017-08-13 01:33:33.428425+00 | 13 | 365725 | approved | 0 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | [electoral, college, designed, support, race, ... | electoral college designed support race based ... | 1 | 0.5424 | 1 | 0.5424 |
| 77472 | 7077472 | Obama is to blame for all the hate shown towar... | 2016-07-12 23:22:56.167130+00 | 21 | 141234 | approved | 0 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | [obama, blame, hate, shown, towards, police, ,... | obama blame hate shown towards police furguso... | 1 | 0.904188 | 1 | 0.904188 |
| 79693 | 7079693 | 1/3 black. | 2017-10-12 17:35:16.953682+00 | 54 | 388225 | rejected | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | [13, black, ] | 13 black | 1 | 0.541024 | 1 | 0.541024 |
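
Inspecting individual rows is useful, but the same kind of filter can also be summarized as a false-positive rate per identity column. The sketch below assumes a DataFrame with a binary `target`, a binary prediction column, and 0/1 identity columns like those used in the filter above. The toy data and the helper name `false_positive_rate_by_identity` are hypothetical.

```python
import pandas as pd


def false_positive_rate_by_identity(df, identity_cols, pred_col,
                                    target_col='target'):
    """Share of non-toxic comments mentioning each identity that were flagged toxic."""
    rates = {}
    for col in identity_cols:
        non_toxic = df[(df[target_col] == 0) & (df[col] == 1)]
        # Leave the rate undefined when no non-toxic comment mentions the identity.
        if len(non_toxic):
            rates[col] = (non_toxic[pred_col] == 1).mean()
        else:
            rates[col] = float('nan')
    return pd.Series(rates, name='false_positive_rate')


# Toy data for illustration only; not drawn from test_df.
toy_df = pd.DataFrame({
    'target':         [0, 0, 0, 0, 1, 0],
    'Prediction_log': [1, 0, 1, 0, 1, 0],
    'black':          [1, 1, 0, 0, 1, 0],
    'muslim':         [0, 0, 1, 1, 0, 0],
})
fpr = false_positive_rate_by_identity(toy_df, ['black', 'muslim'], 'Prediction_log')
print(fpr)
```

Comparing these rates with the false-positive rate on comments that mention no identity gives a rough, threshold-dependent complement to the BPSN AUC.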


In [68]:
y_predict_

Pipeline(memory=None,
         steps=[('tf-idf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=False, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern...
                                 tokenizer=<bound method TweetTokenizer.tokenize of <nltk.tokenize.casual.TweetTokenizer object at 0x26c293668>>,
                                 use_idf=True, vocabulary=None)),
                ('model',
                 LogisticRegression(C=1, class_weight=None, dual

In [86]:
test_df['Prediction_log']

0         0
1         0
2         0
3         0
4         1
         ..
194635    0
194636    0
194637    1
194638    0
194639    0
Name: Prediction_log, Length: 194640, dtype: int64