# Key Problem IV: Binarizing Classes
# Empirically Setting the Threshold

Here, we demonstrate how to empirically select the classifier threshold used to binarize classes for group comparisons. For example, suppose we want to investigate whether tweets that use the argumentation strategy "opinion" are less hateful than tweets that do not (see also the notebook on sampling groups for comparison). To do so, we need a classifier threshold that separates tweets with "opinion" from tweets with "no opinion". In the following, we tune this threshold against the classifier's F1 score so that the two classes are as distinct as possible.
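To make the idea of binarizing concrete, here is a minimal sketch: the classifier emits one raw score per class, softmax turns the scores into probabilities, and a tweet counts as "opinion" if the probability of that class exceeds the threshold. The scores, the column index of the "opinion" class, and the threshold of 0.4 are illustrative assumptions, and the softmax is written out in NumPy (it should give the same result as `scipy.special.softmax` with `axis=1`).

```python
import numpy as np

# Hypothetical raw classifier outputs for three tweets and five classes.
raw_preds = np.array([
    [0.2, 2.5, 0.1, -1.0, 0.3],
    [1.5, 0.4, 0.2, 0.1, 0.0],
    [0.0, 1.0, 0.9, 0.8, 0.7],
])

# Row-wise softmax: subtract the row maximum for numerical stability.
shifted = raw_preds - raw_preds.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

opinion_index = 1  # assumed column of the "opinion" class
threshold = 0.4    # illustrative value, not a tuned one

# One-vs-rest binarization: "opinion" if the class probability exceeds the threshold.
is_opinion = probs[:, opinion_index] > threshold
print(is_opinion)
```

Only the first tweet ends up in the "opinion" group. The third tweet has "opinion" as its most likely class, but its probability stays below the threshold, which is exactly the kind of borderline case the threshold is meant to exclude.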

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.metrics import classification_report
from scipy.special import softmax
import os

In [2]:
src = "data"

## Define Helper Functions

### Find the Best Model

First, we need to find the best-performing model overall for a given classification task (e.g., classifying argumentation strategy). Remember: we trained models on five different data splits for the same task so that we can select the best of the five (see also the notebook on training on confident examples).

In [3]:
def get_best_split(src, model, class_dict):
    """
    Finds the split that performs best in terms of macro avg F1 score over
    all classes (for argumentation strategy) and returns the split number.
    Assume we have already generated files with predictions of each of the
    models (splits) for our test set (here in directory "data/preds_strategy_test").
    """
    pred_files = os.listdir(Path(src, "preds_strategy_test"))
    pred_files = [f for f in pred_files if model in f]
    pred_files.sort()
    pred_cols = [f"raw_pred_{i}" for i in range(len(class_dict))]
    
    performances = pd.DataFrame()
    for split in range(5):
        # load raw predictions of classifier
        preds = pd.read_csv(
            Path(src, "preds_strategy_test", pred_files[split])
        )
        # apply softmax to turn raw predictions into probabilities
        preds[pred_cols] = softmax(preds[pred_cols], axis=1)
        preds = preds.rename(columns={"label": "true_label"})
        report = classification_report(
            preds["true_label"], 
            preds["pred_label"], 
            output_dict=True,
            zero_division = 0
        )
        
        row = pd.DataFrame({
            "split": [split], 
            "f1-score": [report["macro avg"]["f1-score"]]
        })
        
        performances = pd.concat([performances, row]).reset_index(drop=True)
    
    best_split = int(performances.loc[performances.idxmax()["f1-score"]]["split"])
    
    return best_split
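As a small illustration of the selection criterion, the sketch below computes the macro-averaged F1 score for a few sets of predictions and picks the best one. The labels and predictions are invented toy data held in memory rather than read from the prediction files, and the dictionary `preds_by_split` is a hypothetical stand-in for those files.

```python
from sklearn.metrics import classification_report

# Toy ground truth for six examples and three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]

# Hypothetical predictions of three models trained on different splits.
preds_by_split = {
    0: [0, 1, 1, 1, 2, 0],
    1: [0, 0, 1, 1, 2, 1],
    2: [1, 0, 1, 0, 2, 2],
}

# Macro avg F1 gives every class the same weight, regardless of its frequency.
macro_f1 = {}
for split, y_pred in preds_by_split.items():
    report = classification_report(
        y_true, y_pred, output_dict=True, zero_division=0
    )
    macro_f1[split] = report["macro avg"]["f1-score"]

best_split = max(macro_f1, key=macro_f1.get)
print(macro_f1, best_split)
```

Because the macro average weights all classes equally, a model that neglects a rare class is penalized more than it would be under accuracy or a weighted average.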

### Try Thresholds & Get Optimal Threshold

After finding the best-performing model for argumentation strategy overall, we scan through a range of thresholds for separating, for example, "opinion" from "not opinion". For each threshold, we calculate the F1 score against the human-annotated test set. We then select the threshold with the highest F1 score, which makes the groups "opinion" and "not opinion" maximally distinct.
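The scan can be sketched for a single class on a handful of invented examples. The labels and probabilities below are assumptions made for illustration, and `f1_score` with `average="macro"` is used here as a shortcut for the macro avg entry of `classification_report`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy one-vs-rest ground truth (1 = "opinion") and hypothetical class probabilities.
y_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.035, 0.085, 0.125, 0.185, 0.225,
                  0.415, 0.465, 0.635, 0.775, 0.915])

# Same grid as in the notebook: 0.00, 0.01, ..., 0.99.
thresholds = np.arange(0, 1, 0.01)
scores = np.array([
    f1_score(y_true, (probs > t).astype(int), average="macro", zero_division=0)
    for t in thresholds
])

# argmax returns the first threshold that reaches the maximum score.
best_idx = int(scores.argmax())
best_threshold = thresholds[best_idx]
best_f1 = scores[best_idx]
print(f"best threshold: {best_threshold:.2f}, macro F1: {best_f1:.3f}")
```

On small test sets the F1 curve is a step function, so several neighbouring thresholds often share the maximum. Taking the first one, as above, is only one possible tie-breaking rule.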

In [4]:
def scan_thresholds(src, model, class_dict, split):
    """
    Scans different thresholds for prediction probabilities to binarize 
    predictions. Returns the classifier performances for each threshold.
    """
    # load raw predictions of classifier
    pred_files = os.listdir(Path(src, "preds_strategy_test"))
    pred_files = [f for f in pred_files if model in f]
    pred_files.sort()
    pred_cols = [f"raw_pred_{i}" for i in range(len(class_dict))]
    preds = pd.read_csv(
        Path(src, "preds_strategy_test", pred_files[split])
    )
    # apply softmax to turn raw predictions into probabilities
    preds[pred_cols] = softmax(preds[pred_cols], axis=1)
    preds = preds.rename(columns={"label": "true_label"})

    thresholds = np.arange(0, 1, 0.01)
    performances = pd.DataFrame()
    for label in range(len(class_dict)):
        for threshold in thresholds:
            true_label = np.where(preds["true_label"] == label, 1, 0)
            pred_label = np.where(preds[f"raw_pred_{label}"] > threshold, 1, 0)
            report = classification_report(
                true_label, 
                pred_label, 
                output_dict=True,
                zero_division = 0
            )

            row = pd.DataFrame({
                "label": [label],
                "threshold": [threshold],
                "precision": [report["macro avg"]["precision"]],
                "recall": [report["macro avg"]["recall"]],
                "f1-score": [report["macro avg"]["f1-score"]]
            })
            
            performances = pd.concat([performances, row])

    performances = performances.reset_index(drop=True)

    return performances

In [5]:
def get_optimal_thresholds(performances, class_dict):
    """
    Returns the optimum threshold for every class (in argumentation strategy)
    including the F1 score, precision and recall of the classifier in a
    one vs. all prediction scenario.
    """
    optima = pd.DataFrame()
    for label in class_dict.keys():
        subset = performances[performances["label"] == label]
        idx = subset.idxmax()["f1-score"]
        optimum = subset.loc[idx]
        optima = pd.concat([optima, pd.DataFrame(optimum).transpose()])

    optima["label"] = optima["label"].astype(int)
    optima["label"] = optima["label"].replace(class_dict)
    optima = optima.rename(columns={"threshold": "optimum_threshold"})

    return optima
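For comparison, the same per-class selection can be expressed with a `groupby`. This is a sketch of an alternative formulation, not the notebook's implementation, and the small `performances` table uses invented values with only two classes.

```python
import pandas as pd

# Toy performance table in the shape produced by a threshold scan (values invented).
performances = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 1],
    "threshold": [0.2, 0.3, 0.4, 0.2, 0.3, 0.4],
    "f1-score": [0.61, 0.74, 0.70, 0.55, 0.68, 0.78],
})
class_dict = {0: "construct", 1: "opin"}

# Index of the row with the highest F1 score within each class.
best_rows = performances.groupby("label")["f1-score"].idxmax()

# Select those rows and replace the numeric labels by class names.
optima = performances.loc[best_rows].reset_index(drop=True)
optima["label"] = optima["label"].map(class_dict)
optima = optima.rename(columns={"threshold": "optimum_threshold"})
print(optima)
```

Selecting whole rows keeps the column dtypes intact, so no cast back to integer labels is needed before mapping them to class names.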


## Apply Functions

We now search for the optimum threshold for binarizing predictions for each class in argumentation strategy, treating each class in a one vs. all other classes scenario.

In [6]:
model = "twitter-xlm-roberta-base_epochs-100_batchsize-64_strategy"
class_dict = {
    0:"construct",
    1:"opin",
    2:"sarc",
    3:"leave_fact",
    4:"other",
}

In [7]:
# First, we retrieve the best out of five models for argumentation strategy overall.
best_split = get_best_split(src, model, class_dict)
print(f"Best model for argumentation strategy overall is split {best_split}.")
# Second, we scan through the thresholds in increments of 0.01 for each of opinion - not opinion, construct - not construct, etc.
performances = scan_thresholds(src, model, class_dict, best_split)
# Third, we retrieve the thresholds where the F1 score is the highest for each of the classes.
optima = get_optimal_thresholds(performances, class_dict)

Best model for argumentation strategy overall is split 3.


In [8]:
cols = ["label", "optimum_threshold", "f1-score", "precision", "recall"]
optima = optima[cols].reset_index(drop=True)

In [9]:
# display optimal thresholds
optima

|   | label | optimum_threshold | f1-score | precision | recall |
|---|---|---|---|---|---|
| 0 | construct | 0.27 | 0.739928 | 0.713683 | 0.780287 |
| 1 | opin | 0.40 | 0.782448 | 0.789912 | 0.776485 |
| 2 | sarc | 0.28 | 0.670686 | 0.663035 | 0.679363 |
| 3 | leave_fact | 0.47 | 0.839352 | 0.840664 | 0.838074 |
| 4 | other | 0.31 | 0.884361 | 0.907378 | 0.866389 |
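Once a threshold has been chosen, it can be used to assign tweets to the two groups for comparison. The sketch below applies the optimum threshold for "opin" from the table above to a few invented probabilities. The column names `tweet_id` and `prob_opin` are hypothetical, and the strict `>` comparison mirrors the one used during the scan.

```python
import pandas as pd

# Optimum threshold for "opin" as reported in the table above.
opin_threshold = 0.40

# Hypothetical softmax probabilities of the "opin" class for five tweets.
tweets = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104, 105],
    "prob_opin": [0.12, 0.41, 0.39, 0.88, 0.40],
})

# Strictly greater than the threshold counts as "opinion", as in the scan.
tweets["group"] = (tweets["prob_opin"] > opin_threshold).map(
    {True: "opinion", False: "no opinion"}
)
print(tweets)
```

A tweet whose probability equals the threshold exactly falls into the "no opinion" group, which is consistent with how the threshold was evaluated.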
