# Introduction to augmenters

This notebook provides a short introduction to some of the augmentation tools included in `DaCy`. For information on how to test the robustness of your models, please see `dacy-robustness.ipynb`.

Let's start by seeing how different augmenters change your text. The augmenters included in `DaCy` are based on the spaCy augmenters, which means you can use them for both training and behavioural testing. They therefore operate on spaCy's `Example` class (as opposed to the `Doc`), so let's write a little helper function that converts a `Doc` to an `Example` and write some text to test on.

The `Example` class consists of two `Doc`s: the reference (or gold standard), which contains the labelled data, and the predicted, which contains the model's predictions.

In [1]:
!pip install dacy

Collecting dacy
  Using cached dacy-1.0.0-py3-none-any.whl (148 kB)
Collecting spacy>=3.0.3
  Using cached spacy-3.1.0-cp38-cp38-macosx_10_9_x86_64.whl (6.0 MB)
Collecting spacy-transformers>=1.0.1
  Using cached spacy_transformers-1.0.3-py2.py3-none-any.whl (39 kB)
Collecting pandas>=1.0.0
  Using cached pandas-1.3.0-cp38-cp38-macosx_10_9_x86_64.whl (11.4 MB)
Collecting tqdm>=4.42.1
  Using cached tqdm-4.61.2-py2.py3-none-any.whl (76 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.1-py2.py3-none-any.whl (510 kB)
Collecting numpy>=1.17.3
  Using cached numpy-1.21.0-cp38-cp38-macosx_10_9_x86_64.whl (16.9 MB)
Collecting murmurhash<1.1.0,>=0.28.0
  Using cached murmurhash-1.0.5-cp38-cp38-macosx_10_9_x86_64.whl (18 kB)
Collecting pathy>=0.3.5
  Using cached pathy-0.6.0-py3-none-any.whl (42 kB)
Collecting preshed<3.1.0,>=3.0.2
  Using cached preshed-3.0.5-cp38-cp38-macosx_10_9_x86_64.whl (105 kB)
Collecting cymem<2.1.0,>=2.0.2
  Using cached cymem-2.0.5-cp38-cp38-macosx_10_9_x86_64.whl

In [2]:
import dacy
from spacy.training import Example
from typing import List, Callable, Iterator

In [3]:
def doc_to_example(doc):
    # Use the same Doc as both the predicted and the reference side of the Example
    return Example(doc, doc)

In [4]:
nlp = dacy.load("da_dacy_small_tft-0.0.0")
doc = nlp("Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.")
example = doc_to_example(doc)



Let's see how some of the simple augmenters transform the text.

In [5]:
from spacy.training.augment import create_lower_casing_augmenter
from dacy.augmenters import (create_keyboard_augmenter, create_pers_augmenter,
                             create_spacing_augmenter)
from dacy.datasets import danish_names

In [7]:
lower_aug = create_lower_casing_augmenter(level=1)
keyboard_05 = create_keyboard_augmenter(doc_level=1, char_level=0.05, keyboard="QWERTY_DA")
keyboard_15 = create_keyboard_augmenter(doc_level=1, char_level=0.15, keyboard="QWERTY_DA")
space_aug = create_spacing_augmenter(doc_level=1, spacing_level=0.4)

`lower_aug` changes all text to lowercase, since `level` is set to 1 (100%). `keyboard_05` and `keyboard_15` replace 5% and 15% of all characters, respectively, with a character from a neighbouring key on a Danish QWERTY keyboard (replace `DA` with `EN` for English), and `space_aug` removes 40% of all whitespace. The augmenters take an `Example`, modify both the reference and the predicted `Doc` in it, and make sure that spans for NER, POS, etc. remain correct. Let's see how the text looks. As an augmenter can return multiple examples, we use `next` to extract the first (and only) example.

In [12]:
for aug in [lower_aug, keyboard_05, keyboard_15, space_aug]:
    aug_example = next(aug(nlp, example))       # augment the example
    doc = aug_example.y                         # extract the reference doc
    print(doc)

peter schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod england.
Peter Schmekchel mener også, at det dansje landshole anno 2021 tilh-rer verdenstoppen og kan vinde den kommende kamp mod England.
Perer Schm3icheo mener også, ag det dansme lanfshold anno 2021 tilhørwr verdenat9ppen og jan cinde den kommende kamp mod Emglanf.
Peter Schmeichelmener også, atdetdanskelandsholdanno 2021tilhører verdenstoppen og kanvindeden kommende kamp modEngland.


Pretty neat, right? 
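
To build some intuition for what the keyboard and spacing augmenters do at the character level, here is a minimal standard-library sketch. It is not `DaCy`'s implementation: the `NEIGHBOURS` map is a hypothetical, heavily truncated stand-in for a real keyboard layout, the function names are made up for illustration, and it assumes that `char_level` and `spacing_level` act as independent per-character probabilities. Unlike the real augmenters, it works on plain strings and does not keep any annotations aligned.

```python
import random

# Hypothetical, truncated neighbour map for illustration only; a real layout
# such as QWERTY_DA would list the neighbours of every key.
NEIGHBOURS = {
    "a": "qwsz",
    "d": "serfcx",
    "e": "wsdr",
    "k": "jiolm",
    "n": "bhjm",
    "t": "rfgy",
}


def keyboard_noise(text: str, char_level: float, rng: random.Random) -> str:
    """Replace each mapped character with a neighbouring key with probability char_level."""
    out = []
    for ch in text:
        options = NEIGHBOURS.get(ch.lower())
        if options and rng.random() < char_level:
            out.append(rng.choice(options))  # simplification: always inserts a lowercase key
        else:
            out.append(ch)
    return "".join(out)


def drop_spaces(text: str, spacing_level: float, rng: random.Random) -> str:
    """Remove each space character with probability spacing_level."""
    return "".join(
        ch for ch in text if not (ch == " " and rng.random() < spacing_level)
    )


rng = random.Random(42)  # seeded so that repeated runs give the same noise
text = "det danske landshold kan vinde den kommende kamp"
print(keyboard_noise(text, char_level=0.15, rng=rng))
print(drop_spaces(text, spacing_level=0.4, rng=rng))
```

Substituting one character for another keeps the text length unchanged, whereas dropping spaces merges tokens, which is why the real augmenters have to take care of the annotations as well.
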
`DaCy` also includes a more sophisticated augmenter for names, `create_pers_augmenter`. It is highly flexible: it can augment names to fit a certain pattern (e.g. first_name, last_name; abbreviated_first_name, last_name) or replace names with ones sampled from a dictionary. `DaCy` provides four utility functions for constructing such name dictionaries: `danish_names`, `female_names`, `male_names`, and `muslim_names` (see the README in `datasets/lookup_tables` for sources). Each dictionary has the keys `first_name` and `last_name`, which each contain a list of names to sample from. The `pers_augmenter` uses this dictionary when it replaces names, so that first names are replaced with first names and last names with last names. Let's go through a couple of examples to demonstrate how it works.

In [11]:
print(danish_names().keys())
print(danish_names()["first_name"][0:5])
print(danish_names()["last_name"][0:5])

dict_keys(['first_name', 'last_name'])
['Marie', 'Anna', 'Margrethe', 'Karen', 'Kirstine']
['Jensen', 'Nielsen', 'Hansen', 'Pedersen', 'Andersen']


In [None]:
def augment_texts(texts: List[str], augmenter: Callable) -> Iterator[Example]:
    """Takes a list of strings and yields augmented examples"""
    docs = nlp.pipe(texts)
    for doc in docs:
        ex = Example(doc, doc)
        aug = augmenter(nlp, ex)
        yield next(aug).y

In [14]:
texts = [
    "Hans Christian Andersen var en dansk digter og forfatter",
    "1, 2, 3, Schmeichel er en mur",
    "Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England."
    ]

# Create a dictionary to use for name replacement
dk_name_dict = danish_names()


# force_pattern_size augments PER entities to fit the format and length of `patterns`. `patterns` lets you specify arbitrary
# combinations of "fn" (first names), "ln" (last names), "abb" (abbreviated to first character) and "abbpunct" (abbreviated
# to first character + ".") separated by ",". If keep_name=True, the augmenter will not change names, but if force_pattern_size
# is True it will make them fit the length and potentially abbreviate names.
pers_aug = create_pers_augmenter(dk_name_dict, force_pattern_size=True, keep_name=False, patterns=["fn,ln"])
augmented_docs = augment_texts(texts, pers_aug)
for d in augmented_docs:
    print(d)

Ricard Melgaard var en dansk digter og forfatter
1, 2, 3, Nikolaj Bengtsson er en mur
May Kjellerup mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.


In [16]:
# Here's an example with keep_name=True and force_pattern_size=False which simply abbreviates first names
abb_aug = create_pers_augmenter(dk_name_dict, force_pattern_size=False, keep_name=True, patterns=["abbpunct"])
augmented_docs = augment_texts(texts, abb_aug)
for d in augmented_docs:
    print(d)

H. Christian Andersen var en dansk digter og forfatter
1, 2, 3, S. er en mur
P. Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
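
The pattern options can be made more concrete with a small standard-library sketch that mimics the two settings shown so far. It is an illustration only, not `DaCy`'s code: `render_slot`, `apply_pattern` and the toy name dictionary are hypothetical, the sketch assumes that `abb` and `abbpunct` slots draw on first names when a name is sampled, and it only models the two parameter combinations used in this notebook.

```python
import random
from typing import Dict, List, Optional


def render_slot(slot: str, original: Optional[str],
                names: Dict[str, List[str]], rng: random.Random) -> str:
    """Fill one pattern slot, keeping `original` if given and sampling a name otherwise."""
    # Assumption: abbreviated slots draw on first names when a name is sampled.
    key = "last_name" if slot == "ln" else "first_name"
    token = original if original is not None else rng.choice(names[key])
    if slot == "abb":
        return token[0]
    if slot == "abbpunct":
        return token[0] + "."
    return token


def apply_pattern(name: List[str], pattern: str, names: Dict[str, List[str]],
                  rng: random.Random, keep_name: bool = False,
                  force_pattern_size: bool = True) -> List[str]:
    """Rewrite the tokens of one PER entity according to a comma-separated pattern."""
    slots = pattern.split(",")
    out = []
    for i, slot in enumerate(slots):
        original = name[i] if keep_name and i < len(name) else None
        if keep_name and original is None and not force_pattern_size:
            break  # nothing left to keep and no requirement to pad the name
        out.append(render_slot(slot, original, names, rng))
    if not force_pattern_size:
        out.extend(name[len(slots):])  # tokens beyond the pattern stay untouched
    return out


toy_names = {"first_name": ["Marie", "Anna"], "last_name": ["Jensen", "Nielsen"]}
rng = random.Random(0)

# keep_name=True, force_pattern_size=False: only the first token is abbreviated
print(apply_pattern(["Hans", "Christian", "Andersen"], "abbpunct", toy_names, rng,
                    keep_name=True, force_pattern_size=False))
# ['H.', 'Christian', 'Andersen']

# keep_name=False, force_pattern_size=True: a one-token name becomes "fn ln"
print(apply_pattern(["Schmeichel"], "fn,ln", toy_names, rng))
```
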


In [18]:
# patterns can also take a list of patterns to sample from (which can be weighted using the
# patterns_prob argument). The pattern to use is sampled for each entity.
# This setting is especially useful for fine-tuning models.
multiple_pats = create_pers_augmenter(dk_name_dict, 
                                      force_pattern_size=True,
                                      keep_name=False,
                                      patterns=["fn,ln", "abbpunct,ln", "fn,ln,ln,ln"])
augmented_docs = augment_texts(texts, multiple_pats)
for d in augmented_docs:
    print(d)

A. Wollesen var en dansk digter og forfatter
1, 2, 3, Richardt Korsholm er en mur
L. Tobiasen mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
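
Sampling one pattern per entity, optionally with weights, can be sketched with the standard library as follows. The argument name `patterns_prob` comes from the comment above, but the weights here are made up for illustration and `sample_pattern` is a hypothetical helper; how `DaCy` samples internally is not shown in this notebook.

```python
import random
from collections import Counter
from typing import List

patterns = ["fn,ln", "abbpunct,ln", "fn,ln,ln,ln"]
patterns_prob = [0.6, 0.3, 0.1]  # illustrative weights, not DaCy defaults


def sample_pattern(patterns: List[str], weights: List[float], rng: random.Random) -> str:
    """Draw a single pattern; called once per entity so that names in a text can differ in shape."""
    return rng.choices(patterns, weights=weights, k=1)[0]


rng = random.Random(0)
counts = Counter(sample_pattern(patterns, patterns_prob, rng) for _ in range(1000))
print(counts.most_common())  # frequencies should roughly follow the weights
```
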


Feel free to play around with the options for `create_pers_augmenter` to get a feel for how it works, and check out the docs.

The main strength of making the augmenters work with spaCy is that the spans of the augmented data keep the correct tags even when words are added or removed. This allows us to use the augmenters with gold-standard tagged datasets such as DaNE, for both training and evaluation.

In [20]:
docs = nlp.pipe(texts)
augmented_docs = augment_texts(texts, multiple_pats)

# Check that the added/removed PER entities are still tagged as entities
for doc, aug_doc in zip(docs, augmented_docs):
    print(doc.ents, "\t\t", aug_doc.ents)

(Hans Christian Andersen, dansk) 		 (Ib Bojesen Witt Walther, dansk)
(Schmeichel,) 		 (Bernhard Østergård,)
(Peter Schmeichel, danske, England) 		 (B. Knudsen, danske, England)
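
The bookkeeping that keeps entity spans correct when a name grows or shrinks can be illustrated with plain token lists. This is a simplified sketch, not what `DaCy` or spaCy do internally: `replace_entity` is a hypothetical helper, spans are `(start, end, label)` token offsets with an exclusive end, and they are assumed not to overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token (exclusive), label)


def replace_entity(tokens: List[str], spans: List[Span], target: int,
                   replacement: List[str]) -> Tuple[List[str], List[Span]]:
    """Replace the tokens of spans[target] and shift every later span accordingly."""
    start, end, _ = spans[target]
    shift = len(replacement) - (end - start)  # how much the text grew or shrank
    new_tokens = tokens[:start] + replacement + tokens[end:]
    new_spans = []
    for i, (s, e, label) in enumerate(spans):
        if i == target:
            new_spans.append((s, s + len(replacement), label))
        elif s >= end:
            new_spans.append((s + shift, e + shift, label))  # span lies after the edit
        else:
            new_spans.append((s, e, label))  # span lies before the edit
    return new_tokens, new_spans


tokens = ["Peter", "Schmeichel", "kan", "vinde", "mod", "England", "."]
spans = [(0, 2, "PER"), (5, 6, "LOC")]
new_tokens, new_spans = replace_entity(tokens, spans, 0, ["Ib", "Bojesen", "Witt", "Walther"])
print(new_spans)  # [(0, 4, 'PER'), (7, 8, 'LOC')]
print(new_tokens[7:8])  # ['England']
```
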


## Contributing

We highly encourage others to contribute more augmenters that cover a wider range of use cases. For inspiration on how to make your own, check out the source code for the ones included in `DaCy` in the `dacy/augmenters` folder and spaCy's documentation [here](https://spacy.io/usage/training#data-augmentation). If you have a good idea for one or encounter any problems, please open an issue or write on the discussion board.