# Model validation using Augmentation
In this class we will validate a model using augmentation, primarily with the package [Augmenty](https://kennethenevoldsen.github.io/augmenty/).

## Setup

We will need to set up a few things before we start.

### Packages:
For this tutorial you will need the following packages:

- spaCy and augmenty are used for the augmentation
- transformers is used to run the model we wish to validate
- danlp is used to download the dataset we want to use

In [1]:
!pip install augmenty spacy==3.1.1 transformers==4.2.2 danlp==0.0.12
!python -m spacy download da_core_news_lg

Collecting augmenty
  Using cached augmenty-0.0.9-py3-none-any.whl (61 kB)
Collecting spacy==3.1.1
  Using cached spacy-3.1.1-cp36-cp36m-win_amd64.whl (11.8 MB)
Collecting transformers==4.2.2
  Using cached transformers-4.2.2-py3-none-any.whl (1.8 MB)
Collecting danlp==0.0.12
  Using cached danlp-0.0.12-py3-none-any.whl (71 kB)
Collecting typer<0.4.0,>=0.3.0
  Using cached typer-0.3.2-py3-none-any.whl (21 kB)
Collecting tokenizers==0.9.4
  Using cached tokenizers-0.9.4-cp36-cp36m-win_amd64.whl (1.9 MB)
Collecting tweepy
  Using cached tweepy-4.4.0-py2.py3-none-any.whl (65 kB)
Collecting pyconll
  Using cached pyconll-3.1.0-py3-none-any.whl (26 kB)
Collecting conllu
  Using cached conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Collecting click<7.2.0,>=7.1.1
  Using cached click-7.1.2-py2.py3-none-any.whl (82 kB)
Installing collected packages: click, typer, tweepy, tokenizers, spacy, pyconll, conllu, transformers, danlp, augmenty
  Attempting uninstall: click
    Found existing installation: 

You should consider upgrading via the 'C:\Users\louis\OneDrive - Aarhus universitet\Masters\1. Semester\Natural Language Processing\nlp_env\Scripts\python.exe -m pip install --upgrade pip' command.


Collecting da-core-news-lg==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/da_core_news_lg-3.1.0/da_core_news_lg-3.1.0-py3-none-any.whl (573.2 MB)
Installing collected packages: da-core-news-lg
Successfully installed da-core-news-lg-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('da_core_news_lg')


2021-12-02 13:52:34.833208: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-12-02 13:52:34.834955: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## Dataset
In this tutorial we will use the [DKHate](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dkhate) dataset. It contains user-generated comments from social media platforms (Facebook and Reddit) annotated for various types and targets of offensive language. Note that only the labels for sub-task A (offensive language identification), i.e. NOT (not offensive) / OFF (offensive), are available.

In [2]:
from danlp.datasets import DKHate
import pandas as pd
dkhate = DKHate()
test, train = dkhate.load_with_pandas()

To make everything run faster, we will only use a subsample of the dataset:

In [4]:
samples = 20

# make sure to sample evenly from the two labels
n_labels = len(test["subtask_a"].unique())
samples_pr_lab = samples//n_labels

off = test[test["subtask_a"] == "OFF"].sample(samples_pr_lab)
not_off = test[test["subtask_a"] == "NOT"].sample(samples_pr_lab)
mini_test = pd.concat([off, not_off])
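
Because `sample` draws at random, rerunning the cell above gives a different subsample each time. A minimal sketch of a reproducible, label-balanced draw is shown below. It uses a small made-up DataFrame in place of the DKHate test set; the toy rows, the helper name `balanced_sample` and the `seed` value are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the DKHate test set (made-up rows, for illustration only)
toy = pd.DataFrame(
    {
        "tweet": [f"text {i}" for i in range(10)],
        "subtask_a": ["OFF"] * 3 + ["NOT"] * 7,
    }
)


def balanced_sample(df, label_col, n_total, seed=42):
    """Draw the same number of rows from each label, reproducibly."""
    labels = df[label_col].unique()
    n_per_label = n_total // len(labels)
    parts = [
        df[df[label_col] == label].sample(n_per_label, random_state=seed)
        for label in labels
    ]
    return pd.concat(parts)


mini_toy = balanced_sample(toy, "subtask_a", n_total=4)
print(mini_toy["subtask_a"].value_counts())
```

Fixing the seed means the predictions reported later in the notebook refer to the same rows on every run.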

We can now inspect the data using:

In [5]:
mini_test

| id | tweet | subtask_a |
|---|---|---|
| 924 | Jeg ville ønske jeg kunne anerkende [KORRUPT P... | OFF |
| 159 | Fuck perkerne. ses på 4chan, tabere. | OFF |
| 658 | Fuckkkkk det her er mig.... | OFF |
| 9 | Hvordan i helvede fik de overhovedet dit numme... | OFF |
| 137 | @USER ryger du hash. ??? | OFF |
| 1139 | Var vi de eneste røvhuller som dyppede karamel... | OFF |
| 544 | Normalt synes jeg Marx var lige højreorientere... | OFF |
| 3545 | Potentiale til månedens billede, lige der. De... | OFF |
| 3326 | Det fandme en fugtig migmig det der | OFF |
| 753 | uuh, denne her bliver nok upopulær, men jeg er... | OFF |


## Loading the model
For this task we will use a model trained on the training set of the corpus:

In [6]:
from transformers import pipeline

model_name = "DaNLP/da-bert-hatespeech-detection"
pipe = pipeline("sentiment-analysis", # text classification == sentiment analysis (don't ask me why, but they removed textcat in the latest version)
               model=model_name)

Downloading: 100%|██████████| 905/905 [00:00<00:00, 908kB/s]
Downloading: 100%|██████████| 443M/443M [00:35<00:00, 12.5MB/s]
Downloading: 100%|██████████| 253k/253k [00:00<00:00, 563kB/s]
Downloading: 100%|██████████| 112/112 [00:00<00:00, 37.4kB/s]
Downloading: 100%|██████████| 342/342 [00:00<00:00, 344kB/s]


We can quickly check the output using:

In [7]:
pipe(["Gamle stupide idiot", "Lækkert vejr i dag"]) # old stupid idiot, nice weather today

[{'label': 'offensive', 'score': 0.9902198910713196},
 {'label': 'not offensive', 'score': 0.9998297691345215}]

We can now apply the model to all our examples and save the predictions in the dataset:

In [8]:
texts = mini_test["tweet"].to_list()

def apply(texts):
    """Return the model's probability of the offensive class for each text."""
    # truncate texts that are too long for the model to handle
    output = pipe(texts, truncation=True)
    return [t["score"] if t["label"] == "offensive" else 1 - t["score"] for t in output]


# first without augmentations
mini_test["p_offensive_no_aug"] = apply(texts)
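
The pipeline returns only the winning label and its score, so `apply` converts this into the probability of the offensive class. A small sketch of that conversion, run on hand-written output in the same format as the pipeline output above. The scores are made up, `to_p_offensive` is a hypothetical name, and the conversion assumes a two-class model whose scores sum to one.

```python
def to_p_offensive(pipe_output):
    """Convert (label, score) pairs into P(offensive) for a two-class model."""
    return [
        pred["score"] if pred["label"] == "offensive" else 1 - pred["score"]
        for pred in pipe_output
    ]


# Made-up predictions in the format returned by the pipeline
fake_output = [
    {"label": "offensive", "score": 0.75},
    {"label": "not offensive", "score": 0.75},
]
p_offensive = to_p_offensive(fake_output)
print(p_offensive)  # [0.75, 0.25]
```

Putting both classes on the same scale is what makes the later comparison against a 0.5 threshold meaningful.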

In [9]:
mini_test #inspecting

| id | tweet | subtask_a | p_offensive_no_aug |
|---|---|---|---|
| 924 | Jeg ville ønske jeg kunne anerkende [KORRUPT P... | OFF | 0.99376 |
| 159 | Fuck perkerne. ses på 4chan, tabere. | OFF | 0.995237 |
| 658 | Fuckkkkk det her er mig.... | OFF | 0.013337 |
| 9 | Hvordan i helvede fik de overhovedet dit numme... | OFF | 0.972994 |
| 137 | @USER ryger du hash. ??? | OFF | 0.99774 |
| 1139 | Var vi de eneste røvhuller som dyppede karamel... | OFF | 0.926999 |
| 544 | Normalt synes jeg Marx var lige højreorientere... | OFF | 0.125608 |
| 3545 | Potentiale til månedens billede, lige der. De... | OFF | 0.001631 |
| 3326 | Det fandme en fugtig migmig det der | OFF | 0.02239 |
| 753 | uuh, denne her bliver nok upopulær, men jeg er... | OFF | 0.021918 |


# Behavioural check using Augmentation

In the following we want to examine the behavioural consistency of the model using augmentation. The idea is to check whether the model behaves consistently under small changes to the input. For instance, if we introduce slight spelling errors, a model should still be able to recognize names. If this is not the case, it might be unwise to apply the model to domains where spelling errors are common, such as social media.

![](img/aug.png)
**Figure 1**: Examples of augmentation applied by Enevoldsen et al. (2020) and what domains they might be of relevance.




## Augmenty
For the augmentation we will be using the package augmenty. The following provides a brief introduction to it.

**NOTE**: You are naturally not forced to use augmenty; you can implement your own augmenters. The uppercasing example used below, for instance, is easy to implement by hand. Similarly, if you want to examine the effect of question marks, you could write the augmentation:
```py
q_aug = [text + "?" for text in texts]
```
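
Along the same lines, here is a minimal sketch of a couple of hand-written augmenters as plain functions over a list of strings. The function names and example texts are made up for illustration and are not part of augmenty.

```python
def upper_case(texts):
    """Uppercase every text."""
    return [text.upper() for text in texts]


def append_suffix(texts, suffix="?"):
    """Append a fixed suffix, e.g. a question mark, to every text."""
    return [text + suffix for text in texts]


examples = ["this is an example", "and another one"]
print(upper_case(examples))
print(append_suffix(examples, "?"))
```

Functions like these return a plain list, so their output can be passed directly to the `apply` function defined earlier.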

In [10]:
import augmenty
import spacy

nlp = spacy.load("da_core_news_lg")

# a list of augmenters from the augmenty module
for augmenter in augmenty.augmenters():
    print(augmenter)


spacy.orth_variants.v1
spacy.lower_case.v1
random_casing.v1
char_replace_random.v1
char_replace.v1
keystroke_error.v1
remove_spacing.v1
char_swap.v1
random_starting_case.v1
conditional_token_casing.v1
token_dict_replace.v1
wordnet_synonym.v1
token_replace.v1
word_embedding.v1
grundtvigian_spacing_augmenter.v1
spacing_insertion.v1
token_swap.v1
token_insert.v1
token_insert_random.v1
duplicate_token.v1
random_synonym_insertion.v1
ents_replace.v1
per_replace.v1
ents_format.v1
upper_case.v1
spongebob.v1
da_æøå_replace.v1
da_historical_noun_casing.v1


A list naturally does not give you all the information you need. You can always examine a specific augmenter in more detail in the [documentation](https://kennethenevoldsen.github.io/augmenty/).


Let us try one of the augmenters. We can use `augmenty.load` as a common interface for all augmenters.

In [11]:
# load an augmenter
upper_case_augmenter = augmenty.load("upper_case.v1", level=1.00) # augment 100% 

These augmenters are made to work on the spaCy data class `Example`, which allows for much more detailed augmentation. However, augmenty has utility functions that allow us to use them on strings:

In [12]:
examples = ["this is an example", "and another one"]
aug_texts = augmenty.texts(examples, augmenter=upper_case_augmenter, nlp=nlp)
list(aug_texts)

['THIS IS AN EXAMPLE', 'AND ANOTHER ONE']

## Is uppercasing more offensive?

Now we can apply our model to the augmented examples to see if the augmentation changes the model's predictions.


In [13]:
aug_texts = augmenty.texts(texts, augmenter=upper_case_augmenter, nlp=nlp)
mini_test["p_offensive_upper"] = apply(list(aug_texts))

Examining the output, we quickly see that uppercasing doesn't change the model's predictions at all!

In [14]:
mini_test

| id | tweet | subtask_a | p_offensive_no_aug | p_offensive_upper |
|---|---|---|---|---|
| 924 | Jeg ville ønske jeg kunne anerkende [KORRUPT P... | OFF | 0.99376 | 0.99376 |
| 159 | Fuck perkerne. ses på 4chan, tabere. | OFF | 0.995237 | 0.995237 |
| 658 | Fuckkkkk det her er mig.... | OFF | 0.013337 | 0.013337 |
| 9 | Hvordan i helvede fik de overhovedet dit numme... | OFF | 0.972994 | 0.972994 |
| 137 | @USER ryger du hash. ??? | OFF | 0.99774 | 0.99774 |
| 1139 | Var vi de eneste røvhuller som dyppede karamel... | OFF | 0.926999 | 0.926999 |
| 544 | Normalt synes jeg Marx var lige højreorientere... | OFF | 0.125608 | 0.125608 |
| 3545 | Potentiale til månedens billede, lige der. De... | OFF | 0.001631 | 0.001631 |
| 3326 | Det fandme en fugtig migmig det der | OFF | 0.02239 | 0.02239 |
| 753 | uuh, denne her bliver nok upopulær, men jeg er... | OFF | 0.021918 | 0.021918 |


To be a bit more explicit we can also compare it using summary information:

In [15]:
def compare_cols(
    augmentation,
    baseline=mini_test["p_offensive_no_aug"],
    category=mini_test["subtask_a"],
):
    """Compares augmentation with the baseline for each of the categories"""
    changes = ((augmentation > 0.5) != (baseline > 0.5)).sum()
    n = len(augmentation)
    print(f"The augmentation led to classification changes in {changes}/{n}")
    for cat in set(category):
        aug_cat_mean = augmentation[category == cat].mean().round(3)
        aug_cat_std = augmentation[category == cat].std().round(3)
        cat_mean = baseline[category == cat].mean().round(3)
        cat_std = baseline[category == cat].std().round(3)
        print(
            f"The average prob. of {cat} went from {cat_mean}({cat_std}) to {aug_cat_mean}({aug_cat_std})."
        )

compare_cols(mini_test["p_offensive_upper"])

The augmentation led to classification changes in 0/20
The average prob. of OFF went from 0.507(0.497) to 0.507(0.497).
The average prob. of NOT went from 0.006(0.012) to 0.006(0.012).
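
If you want to reuse these numbers rather than only print them, a variant that returns the summary can be handy. The sketch below runs on made-up probabilities; `summarise_changes` is a hypothetical helper and not part of the notebook above.

```python
import pandas as pd


def summarise_changes(baseline, augmented, category, threshold=0.5):
    """Return the number of flipped classifications and the mean shift per category."""
    flipped = int(((augmented > threshold) != (baseline > threshold)).sum())
    mean_shift = (augmented - baseline).groupby(category).mean()
    return flipped, mean_shift


# Made-up probabilities of the offensive class before and after augmentation
labels = pd.Series(["OFF", "OFF", "NOT", "NOT"])
before = pd.Series([0.9, 0.4, 0.1, 0.2])
after = pd.Series([0.8, 0.6, 0.1, 0.7])

flipped, mean_shift = summarise_changes(before, after, labels)
print(f"{flipped}/{len(before)} classifications changed")
print(mean_shift.round(3))
```

Looking at the per-text difference, rather than at the two group means separately, makes small but systematic shifts easier to spot.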


# Exercises:

1) Solve the above mystery: why doesn't the model's estimate change when uppercasing? *Hint*: Check the tokenizer of the model.
2) Examining the data, I seemed to notice that spelling errors were more common among offensive tweets. Is this correct? [*Hint*](https://kennethenevoldsen.github.io/augmenty/augmenty.character.html?highlight=keystroke#augmenty.character.replace.create_keystroke_error_augmenter)
3) Examine the data yourself and create three hypotheses on what augmentation might change the performance.
4) Outline how you could apply augmentation (behavioural testing) to examine a model (or pipeline) in your project.
5) (Optional): Apply this behavioural testing to your model.

In [None]:
# 1
# Probably because the model's tokenizer lower-cases all input, so the uppercased text
# is converted back to lower case before it reaches the model.
# Yes, it does.
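
One way to test this hypothesis is to compare the tokens produced for a text and for its uppercased version. The sketch below uses toy tokenizers as stand-ins so that it runs without downloading anything; `is_case_insensitive` and the two toy functions are hypothetical helpers. To check the real model you would pass the `tokenize` method of its tokenizer instead, for example one loaded with `AutoTokenizer.from_pretrained(model_name)`.

```python
def is_case_insensitive(tokenize, texts):
    """True if uppercasing never changes the tokens produced by `tokenize`."""
    return all(tokenize(text) == tokenize(text.upper()) for text in texts)


def toy_uncased_tokenize(text):
    """Toy stand-in for an uncased tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def toy_cased_tokenize(text):
    """Toy stand-in for a cased tokenizer: split on whitespace only."""
    return text.split()


examples = ["Gamle stupide idiot", "Lækkert vejr i dag"]
print(is_case_insensitive(toy_uncased_tokenize, examples))  # True
print(is_case_insensitive(toy_cased_tokenize, examples))  # False
```

If the real tokenizer behaves like the uncased stand-in, the model receives identical input for both versions, which would explain the identical predictions.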

In [24]:
# 2
# load an augmenter
keystroke_augmenter = augmenty.load("keystroke_error.v1", level=0.05) # augment 5%

# Apply
aug_texts = augmenty.texts(texts, augmenter=keystroke_augmenter, nlp=nlp)
mini_test["p_offensive_keystroke_error"] = apply(list(aug_texts))

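# Comparing (compare_cols prints its summary and returns None, so no print() is needed)
compare_cols(mini_test["p_offensive_keystroke_error"])

# With only 20 samples there is no clear trend, but several predictions change a lot,
# which suggests that the model is not robust to keystroke errors.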

                                                  tweet subtask_a  \
id                                                                  
924   Jeg ville ønske jeg kunne anerkende [KORRUPT P...       OFF   
159                Fuck perkerne. ses på 4chan, tabere.       OFF   
658                         Fuckkkkk det her er mig....       OFF   
9     Hvordan i helvede fik de overhovedet dit numme...       OFF   
137                            @USER ryger du hash. ???       OFF   
1139  Var vi de eneste røvhuller som dyppede karamel...       OFF   
544   Normalt synes jeg Marx var lige højreorientere...       OFF   
3545  Potentiale til månedens billede, lige der.  De...       OFF   
3326                Det fandme en fugtig migmig det der       OFF   
753   uuh, denne her bliver nok upopulær, men jeg er...       OFF   
777   Jeg mener ikke at en mand er mere troværdig. J...       NOT   
621   Tillykke til både Danmark og Island  - Hilsen ...       NOT   
1089  Selv mod Trump kæmper gudern

In [25]:
mini_test

| id | tweet | subtask_a | p_offensive_no_aug | p_offensive_upper | p_keystroke_error | p_offensive_keystroke_error |
|---|---|---|---|---|---|---|
| 924 | Jeg ville ønske jeg kunne anerkende [KORRUPT P... | OFF | 0.99376 | 0.99376 | 0.006027 | 0.999302 |
| 159 | Fuck perkerne. ses på 4chan, tabere. | OFF | 0.995237 | 0.995237 | 0.002658 | 0.993182 |
| 658 | Fuckkkkk det her er mig.... | OFF | 0.013337 | 0.013337 | 0.004347 | 0.773904 |
| 9 | Hvordan i helvede fik de overhovedet dit numme... | OFF | 0.972994 | 0.972994 | 0.003511 | 0.989061 |
| 137 | @USER ryger du hash. ??? | OFF | 0.99774 | 0.99774 | 0.008164 | 0.997467 |
| 1139 | Var vi de eneste røvhuller som dyppede karamel... | OFF | 0.926999 | 0.926999 | 0.004592 | 0.963933 |
| 544 | Normalt synes jeg Marx var lige højreorientere... | OFF | 0.125608 | 0.125608 | 0.007933 | 0.995904 |
| 3545 | Potentiale til månedens billede, lige der. De... | OFF | 0.001631 | 0.001631 | 0.012909 | 0.000121 |
| 3326 | Det fandme en fugtig migmig det der | OFF | 0.02239 | 0.02239 | 0.000745 | 0.011757 |
| 753 | uuh, denne her bliver nok upopulær, men jeg er... | OFF | 0.021918 | 0.021918 | 0.004804 | 0.857409 |
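
Since keystroke errors are sampled at random, a single run on 20 texts can be noisy. One option is to repeat the augmentation with several seeds and look at the spread of the predictions. Below is a hand-written sketch of a seeded character-level noise function. This is not augmenty's keystroke augmenter: it replaces letters with random lowercase ASCII letters rather than with neighbouring keys, and all names are made up for illustration.

```python
import random
import string


def random_char_noise(text, level=0.05, seed=0):
    """Replace each letter with a random lowercase ASCII letter with probability `level`."""
    rng = random.Random(seed)
    chars = []
    for char in text:
        if char.isalpha() and rng.random() < level:
            chars.append(rng.choice(string.ascii_lowercase))
        else:
            chars.append(char)
    return "".join(chars)


text = "Lækkert vejr i dag"
# One noisy variant per seed; each could be scored by the model and the results averaged
variants = [random_char_noise(text, level=0.2, seed=seed) for seed in range(5)]
for variant in variants:
    print(variant)
```

Averaging the model's predictions over several such variants gives a more stable estimate of the effect than a single augmented copy.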


In [None]:
# 3
# Examine the data yourself and create three hypotheses on what augmentation might change the performance.
# Kenneth looked at @USER mentions (a tweet may be more likely to be offensive when it is directed at someone)
# and at the use of "!!!".

In [None]:
# 4
#Outline how you could apply augmentation (behavioral testing) to examine a model (or pipeline) in your project