# Key Problem I: Unbalanced Training Data
# Iterative Sampling

In this example, we load a pool of unlabeled tweets from which we want to sample examples for the annotators to label. Because we want the minority class to be overrepresented in the sample, we use a preliminary classifier, a Support Vector Machine trained on already labeled examples, to classify every example in the pool. We then sample only from the examples that the preliminary classifier has assigned to the minority class.

In [1]:
import os
import re
import pandas as pd
import numpy as np
from pathlib import Path
from joblib import load

## Load and clean the data

In [2]:
# load the data from which we can sample
# This is data without labels.
src = "data"
fname = "sample_pool.csv"
sample_pool = pd.read_csv(Path(src, fname), dtype={"tweet_id":str})
sample_pool.head()

Unnamed: 0,tweet_id,text
0,960158527582097408,Das ist so nicht richtig.
1,837717128480444418,Man sollte die Drohungen von der ISIS Gruppe e...
2,1082375206319128576,Gerade gibt's in den USA eine Ausschreibung fü...
3,1013645930443235328,Horst los! Das Netz hat dein Angebot voll ange...
4,1031345564649172992,Und was jetzt? Müssen wir uns jetzt neben dem ...


In [3]:
# clean the text in the examples (necessary for the TFIDF-based classifiers)

# remove URLs
sample_pool["text_clean"] = sample_pool["text"]\
    .apply(lambda x: re.sub(r"https?:\/\/\S*", "", x, flags=re.MULTILINE))

# lowercase all text (operate on "text_clean" so the URL removal above is kept)
sample_pool["text_clean"] = sample_pool["text_clean"]\
    .apply(lambda x: x.lower())

sample_pool.head()

Unnamed: 0,tweet_id,text,text_clean
0,960158527582097408,Das ist so nicht richtig.,das ist so nicht richtig.
1,837717128480444418,Man sollte die Drohungen von der ISIS Gruppe e...,man sollte die drohungen von der isis gruppe e...
2,1082375206319128576,Gerade gibt's in den USA eine Ausschreibung fü...,gerade gibt's in den usa eine ausschreibung fü...
3,1013645930443235328,Horst los! Das Netz hat dein Angebot voll ange...,horst los! das netz hat dein angebot voll ange...
4,1031345564649172992,Und was jetzt? Müssen wir uns jetzt neben dem ...,und was jetzt? müssen wir uns jetzt neben dem ...
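The two cleaning steps can also be folded into a single helper, which makes it easy to check on a toy string that both are actually applied. This is only a minimal sketch: the name `clean_text` and the example tweet are made up for illustration and are not part of the original pipeline.

```python
import re

# Hypothetical helper (not from the original notebook):
# strips http(s) URLs first, then lowercases what is left.
URL_PATTERN = re.compile(r"https?://\S*")

def clean_text(text: str) -> str:
    """Remove http(s) URLs and lowercase the remaining text."""
    return URL_PATTERN.sub("", text).lower()

# Made-up example tweet containing a URL and upper-case letters.
example = "Das ist RICHTIG https://t.co/abc123 oder?"
cleaned = clean_text(example)
print(cleaned)
```

With pandas, such a helper could be applied as `sample_pool["text"].apply(clean_text)`. Note that the whitespace around a removed URL is left in place; whether that matters depends on the tokenizer of the TFIDF model.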


In [4]:
# The preliminary classifier is actually two classifiers, each trained on 4000
# annotations from one of two annotators (AS and LT). Together they form an
# "ensemble prediction".
# We load both classifiers as well as the embedding model (TFIDF) here.
# Please note: We do not demonstrate the training of the Support Vector Machine
# (preliminary classifier) here. Assume we've previously trained the SVM as described above.
raters = ["AS", "LT"]
classifier_src = "finetuned_models"
classifier_model = "LinearSVC"
embedding = "TfidfVectorizer"

classifiers = {rater:{} for rater in raters}
for rater in raters:
    tfidf = load(Path(classifier_src, embedding, f"rater_{rater}.joblib"))
    clf = load(Path(classifier_src, classifier_model, f"rater_{rater}.joblib")) 
    classifiers[rater]["embedding"] = tfidf
    classifiers[rater]["classifier"] = clf

# Note on Inconsistent Version Warning: The SVM classifier might not make accurate
# predictions when the scikit-learn version used for training differs from the one
# used for loading. This is irrelevant for illustration purposes, but becomes
# important in production.

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
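One way to notice a version mismatch before it matters is to record the library version next to each saved model and compare it at load time. The following is a rough sketch under assumptions: the sidecar file convention (`<model file>.version.json`) and all function names are hypothetical, and no model is trained or saved here.

```python
import json
import tempfile
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def sidecar_path(model_path) -> Path:
    # Hypothetical naming convention: "<model file>.version.json"
    return Path(str(model_path) + ".version.json")

def write_version_sidecar(model_path, package: str = "scikit-learn") -> Path:
    """Record the currently installed package version next to the model file."""
    path = sidecar_path(model_path)
    path.write_text(json.dumps({package: installed_version(package)}))
    return path

def versions_match(model_path, package: str = "scikit-learn") -> bool:
    """Compare the recorded version with the one installed right now."""
    recorded = json.loads(sidecar_path(model_path).read_text()).get(package)
    return recorded == installed_version(package)

# Demo in a temporary directory with a placeholder model path.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp, "rater_AS.joblib")
    write_version_sidecar(model_file)          # e.g. right after saving the model
    same_version = versions_match(model_file)  # e.g. right before loading it

print(same_version)
```

In this demo both calls run in the same environment, so the versions match. The check only becomes informative when the model is saved in one environment and loaded in another.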


In [5]:
# We use both preliminary classifiers to create predictions for each of the
# examples in the sample pool.
for rater in raters:
    X = classifiers[rater]["embedding"].transform(sample_pool["text_clean"]).toarray()
    pred = classifiers[rater]["classifier"].predict(X)
    sample_pool[f"pred_{rater}"] = pred

sample_pool.head()

Unnamed: 0,tweet_id,text,text_clean,pred_AS,pred_LT
0,960158527582097408,Das ist so nicht richtig.,das ist so nicht richtig.,1,1
1,837717128480444418,Man sollte die Drohungen von der ISIS Gruppe e...,man sollte die drohungen von der isis gruppe e...,1,1
2,1082375206319128576,Gerade gibt's in den USA eine Ausschreibung fü...,gerade gibt's in den usa eine ausschreibung fü...,1,1
3,1013645930443235328,Horst los! Das Netz hat dein Angebot voll ange...,horst los! das netz hat dein angebot voll ange...,2,2
4,1031345564649172992,Und was jetzt? Müssen wir uns jetzt neben dem ...,und was jetzt? müssen wir uns jetzt neben dem ...,4,4


In [6]:
# We retain only entries for which both classifiers agree.
sample_pool = sample_pool[sample_pool[[f"pred_{rater}" for rater in raters]]\
            .apply(lambda x: len(set(x.values)) == 1, axis=1)]
sample_pool = sample_pool.drop(columns=[f"pred_{rater}" for rater in raters][1:] + ["text_clean"])
sample_pool = sample_pool.rename(columns={f"pred_{raters[0]}": "pred"})  
sample_pool.head()

Unnamed: 0,tweet_id,text,pred
0,960158527582097408,Das ist so nicht richtig.,1
1,837717128480444418,Man sollte die Drohungen von der ISIS Gruppe e...,1
2,1082375206319128576,Gerade gibt's in den USA eine Ausschreibung fü...,1
3,1013645930443235328,Horst los! Das Netz hat dein Angebot voll ange...,2
4,1031345564649172992,Und was jetzt? Müssen wir uns jetzt neben dem ...,4
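The agreement filter can also be written without a row-wise lambda. The sketch below uses made-up toy predictions to show two equivalent formulations: `DataFrame.nunique(axis=1)`, which works for any number of classifiers, and a direct column comparison, which is enough when there are exactly two.

```python
import pandas as pd

# Toy predictions (made up for illustration), one column per classifier.
toy = pd.DataFrame({
    "tweet_id": ["a", "b", "c", "d"],
    "pred_AS": [1, 0, 2, 0],
    "pred_LT": [1, 3, 2, 0],
})
pred_cols = ["pred_AS", "pred_LT"]

# Classifiers agree on a row when its prediction columns
# contain exactly one distinct value.
agree = toy[pred_cols].nunique(axis=1) == 1

# With exactly two classifiers, a direct comparison gives the same mask.
agree_two = toy["pred_AS"] == toy["pred_LT"]

agreed = toy[agree]
print(agreed["tweet_id"].tolist())
```

Rows "a", "c" and "d" are kept because both columns hold the same label; row "b" is dropped.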


In [7]:
# This is what the distribution of predicted labels in the remaining sample pool looks like.
sample_pool["pred"].value_counts()

pred
1    2714
0    1633
3    1274
2    1114
4     661
5      65
Name: count, dtype: int64
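To see how strongly the predicted labels are skewed, the counts above can be turned into shares. The sketch below copies the counts from the output and uses plain Python; in pandas, `sample_pool["pred"].value_counts(normalize=True)` should give the same proportions directly.

```python
# Predicted-label counts copied from the value_counts() output above.
counts = {1: 2714, 0: 1633, 3: 1274, 2: 1114, 4: 661, 5: 65}

# Total number of examples on which both classifiers agreed.
total = sum(counts.values())

# Share of each predicted label in the remaining pool.
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(label, f"{share:.1%}")
```

These are shares of predicted labels, not of true labels, so they only approximate the real class distribution in the pool.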

In [8]:
# We want to bias the sampling towards class "0", which contains all the
# constructive counter speech strategies.
id_to_label = {
    0:"construct",
    1:"opin",
    2:"sarc",
    3:"other",
    4:"unint",
    5:"foreign"
}

# Therefore we retain only the 1633 examples that were classified as
# "construct" by the preliminary classifier.
sample_pool["pred_label"] = sample_pool["pred"].replace(id_to_label)
sample_pool = sample_pool[sample_pool["pred_label"] == "construct"]

sample_pool.head()

Unnamed: 0,tweet_id,text,pred,pred_label
13,894138148686573568,"Thema Völkerrecht: Irak-Krieg, Libyen, Guantán...",0,construct
18,1025077251258368000,Leeres Gerede! Nur eine einzige Nation gibt un...,0,construct
29,1076480651044503552,"Es wird Zeit, die „romantische Idee“ loszulass...",0,construct
33,882150372592234496,Jetzt ham wa alle ma was kapiert: was politisc...,0,construct
36,912209524123213829,"Hahaha, die werden sowieso zusammenarbeiten. H...",0,construct


In [9]:
# Finally, we draw a sample of 1000 examples from the remaining pool ...
sample_size = 1000
seed = 42  # for reproducibility
sample = sample_pool.sample(n=sample_size, random_state=seed).reset_index(drop=True)
sample = sample[["tweet_id", "text"]]

# ... and save it to a file for annotators to label.
fname = "biased_sample.csv"
sample.to_csv(Path(src, fname), index=False)
sample.head()

# Because our preliminary classifier (the Support Vector Machine) still performs
# poorly, a sample drawn only from examples that it labeled as "construct" will
# still contain enough examples of the majority classes to result in a balanced sample.

Unnamed: 0,tweet_id,text
0,1021094619222822913,Warum sollen AfD-Wähler rausgeekelt werden? Bi...
1,1007198873130078208,Deutschlandweite STASI-Aktion gegen freie Mein...
2,1046681052557848576,Schon mal was von Urheberrechtsverletzung gehö...
3,1060921392118620160,"Das ist voll das Standard-Ding in der Politik,..."
4,813671235158687744,Populismus bei den Mossad-Medien! @newsflash12...
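Two properties of the sampling step are worth checking when this is repeated in later iterations: a fixed `random_state` makes the draw reproducible, and `DataFrame.sample` without replacement raises a `ValueError` if more rows are requested than the pool holds. The sketch below illustrates both on a made-up toy pool; capping the sample size with `min` is one possible guard, not part of the original notebook.

```python
import pandas as pd

# Toy pool (made up for illustration) standing in for the filtered sample pool.
pool = pd.DataFrame({
    "tweet_id": [str(i) for i in range(5)],
    "text": [f"tweet {i}" for i in range(5)],
})

requested = 10  # deliberately more than the pool holds
seed = 42       # fixed seed for reproducibility

# Cap the sample size at the pool size so that sampling
# without replacement cannot fail.
n = min(requested, len(pool))

# Drawing twice with the same seed yields the same rows in the same order.
first = pool.sample(n=n, random_state=seed).reset_index(drop=True)
second = pool.sample(n=n, random_state=seed).reset_index(drop=True)

print(len(first), first.equals(second))
```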
