# Snorkel Workshop: Augmentation Tutorial

## Getting started

In this tutorial, we'll explore augmenting training datasets using transformation functions (TFs). We'll focus on the Words in Context task from SuperGLUE. But first, we'll take care of a few imports and defaults.

In [1]:
import os
import random
import sys
from pathlib import Path

import pandas as pd
from snorkel.augmentation.apply import PandasTFApplier
from snorkel.augmentation.policy import RandomAugmentationPolicy
from snorkel.augmentation.tf import transformation_function

In [2]:
if "cwd" not in globals():
    cwd = Path(os.getcwd())
sys.path.insert(0, str(cwd.parents[0]))

from dataloaders import get_jsonl_path
from superglue_parsers.wic import get_rows

In [3]:
task_name = "WiC"
data_dir = os.environ.get("SUPERGLUEDATA", os.path.join(str(cwd.parents[0]), "data"))
split = "train"
max_data_samples = 50

## Loading data

We'll load the WiC data from our local download and construct a Pandas DataFrame from it. As a quick check, let's look at the first few entries.

In [4]:
jsonl_path = get_jsonl_path(data_dir, task_name, split)
wic_df = pd.DataFrame.from_records(get_rows(jsonl_path, max_data_samples))
wic_df.head()

| | idx | label | pos | sentence1 | sentence1_idx | sentence2 | sentence2_idx | word |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | False | V | You must carry your camping gear . | 2 | Sound carries well over water . | 1 | carry |
| 1 | 1 | False | V | Messages must go through diplomatic channels . | 2 | Do you think the sofa will go through the door ? | 6 | go |
| 2 | 2 | False | V | Break an alibi . | 0 | The wholesaler broke the container loads into ... | 2 | break |
| 3 | 3 | True | N | He wore a jock strap with a metal cup . | 8 | Bees filled the waxen cups with honey . | 4 | cup |
| 4 | 4 | False | N | The Academy of Music . | 1 | The French Academy . | 2 | academy |


## Writing transformation functions

Let's write our first transformation function. A common approach in NLP tasks is to replace important words with synonyms. Here, we'll replace the key word in the two sentences with a new word randomly sampled from a synonym set. We'll filter out multi-word phrases and synonyms whose part of speech differs from the original.

First, we'll write a helper function to execute the core logic of our TF. Given the key word and its part-of-speech, it calls `nltk`'s wordnet tooling to create a filtered set of synonym words.

In [5]:
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")


def get_filtered_syns(word, pos):
    # Use Wordnet to find synonyms and filter out
    # synonyms that are
    #  * the same word as the original
    #  * composed of multiple words
    #  * different POS from the original
    syns = wordnet.synsets(word)
    syns_filtered = set()
    for s in syns:
        name_parts = s.name().split(".")
        s_word = name_parts[0]
        same_pos = name_parts[1] == pos.lower()
        if s_word != word and ("_" not in s_word) and same_pos:
            syns_filtered.add(s_word)
    return list(syns_filtered)

[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We can look at a simple example to verify that the helper works. As expected, we get different synonyms when "stream" is used as a verb than when it is used as a noun. Try out a few more words. Note that this method doesn't produce perfect substitutions, but the imperfect examples can still help with training. For more information, see [this blog post](https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610).

In [6]:
word = "stream"

print(f"Synonyms for '{word}' (verb):", get_filtered_syns(word, "V"))
print(f"Synonyms for '{word}' (noun):", get_filtered_syns(word, "N"))

Synonyms for 'stream' (verb): ['pour']
Synonyms for 'stream' (noun): ['flow', 'current']
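
To make the three filters easier to see in isolation, here is a minimal sketch that applies the same rules to a plain list of synset names, with no WordNet lookup. The function name `filter_synset_names` and the list of names are hypothetical, chosen only to exercise each filter; the result is sorted so that the output is deterministic.

```python
def filter_synset_names(names, word, pos):
    """Keep single-word synonyms with the same POS that differ from `word`."""
    kept = set()
    for name in names:
        # Synset names look like "<lemma>.<pos>.<sense number>"
        lemma, syn_pos = name.split(".")[:2]
        if lemma != word and "_" not in lemma and syn_pos == pos.lower():
            kept.add(lemma)
    return sorted(kept)


# Hypothetical synset names, not an actual WordNet query result
names = ["stream.n.01", "flow.n.01", "current.n.01", "pour.v.01", "well_out.v.01"]
print(filter_synset_names(names, "stream", "N"))  # ['current', 'flow']
print(filter_synset_names(names, "stream", "V"))  # ['pour']
```

Here `stream.n.01` is dropped because it matches the original word, `well_out.v.01` because it is a multi-word phrase, and the rest are kept or dropped based on their part of speech.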


Now, we'll wrap our helper in a transformation function. In addition to sampling from the generated synonym set, we need to reconstruct our example. Note that the TF returns `None` if there's no available transformation. This happens if there are no valid synonyms, or if the key word appears in different forms between the sentences.

In [7]:
@transformation_function
def replace_word(x):
    # Break up sentence into tokens
    sentence1_tokens = x.sentence1.split()
    sentence2_tokens = x.sentence2.split()
    sentence1_instance = sentence1_tokens[x.sentence1_idx]
    sentence2_instance = sentence2_tokens[x.sentence2_idx]
    # Check if any word forms are different
    if len({sentence1_instance, sentence2_instance, x.word}) > 1:
        return None
    # Get and filter synonyms, then randomly sample
    syns = get_filtered_syns(x.word, x.pos)
    if len(syns) == 0:
        return None
    syn = random.choice(syns)
    # Swap in synonym
    sentence1_tokens[x.sentence1_idx] = syn
    sentence2_tokens[x.sentence2_idx] = syn
    # Reconstruct example and return
    x.sentence1 = " ".join(sentence1_tokens)
    x.sentence2 = " ".join(sentence2_tokens)
    x.word = syn
    return x
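
The word-form check is the part of this TF that most often causes it to return `None`, so it's worth trying on its own. The sketch below reproduces only that check on two rows copied from the table above; `forms_match` is a hypothetical helper name, and `SimpleNamespace` stands in for a DataFrame row.

```python
from types import SimpleNamespace


def forms_match(x):
    """True if the key word appears in the same form in both sentences."""
    token1 = x.sentence1.split()[x.sentence1_idx]
    token2 = x.sentence2.split()[x.sentence2_idx]
    # Same test as the TF: all three strings must be identical
    return len({token1, token2, x.word}) == 1


row0 = SimpleNamespace(
    sentence1="You must carry your camping gear .", sentence1_idx=2,
    sentence2="Sound carries well over water .", sentence2_idx=1,
    word="carry",
)
row1 = SimpleNamespace(
    sentence1="Messages must go through diplomatic channels .", sentence1_idx=2,
    sentence2="Do you think the sofa will go through the door ?", sentence2_idx=6,
    word="go",
)

print(forms_match(row0))  # False: "carry" vs. "carries"
print(forms_match(row1))  # True: "go" in both sentences
```

The comparison is case-sensitive, so a sentence-initial "Break" would also fail to match the key word "break".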

## Applying our transformation function

In order to apply our TF, we need two things: a policy and an applier.

_Policy_

The policy dictates how the TFs should be composed in a sequence. Basic data augmentation systems use a random policy that applies a randomly sampled sequence of transformations to the input training example. Augmentation policies can also be learned using [TANDA](https://hazyresearch.github.io/snorkel/blog/tanda.html) and other related techniques. Here, since we only have one TF, we can use just about any policy.

In [8]:
tfs = [replace_word]
policy = RandomAugmentationPolicy(len(tfs), sequence_length=1)
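
To build intuition for what a random policy does, here is a minimal sketch of the idea, assuming the policy only has to pick TF indices uniformly at random. It is an illustration rather than Snorkel's implementation, and `sample_tf_sequence` is a hypothetical name.

```python
import random


def sample_tf_sequence(n_tfs, sequence_length, rng):
    """Pick `sequence_length` TF indices uniformly at random, with replacement."""
    return [rng.randrange(n_tfs) for _ in range(sequence_length)]


rng = random.Random(0)
# With three TFs and sequence_length=2, each sample is a pair of indices in {0, 1, 2}
sequences = [sample_tf_sequence(3, 2, rng) for _ in range(5)]
print(sequences)

# With a single TF, the only possible sequence is [0]
single = sample_tf_sequence(1, 1, rng)
print(single)
```

This is why the choice of policy hardly matters when there is only one TF: every sampled sequence is the same.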

_Applier_

The applier takes our TFs and policy, and applies them to a DataFrame of examples. We'll specify that we want 1 transformed example per original, and that we want to keep the original as well. If our TF returns `None`, there won't be a transformed example in our output DataFrame.

In [9]:
random.seed(1)

applier = PandasTFApplier(tfs, policy, k=1, keep_original=True)
wic_df_synonym = applier.apply(wic_df)

100%|██████████| 50/50 [00:00<00:00, 800.17it/s]
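
The roles of `k`, `keep_original`, and `None` can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not the `PandasTFApplier` implementation: rows are dicts, `apply_tfs` and `shout` are hypothetical names, and a transformed copy is dropped as soon as any TF in its sequence returns `None` (the real applier may handle longer sequences differently).

```python
def apply_tfs(rows, tfs, sample_sequence, k=1, keep_original=True):
    """Toy applier: originals plus up to k transformed copies per row."""
    out = []
    for row in rows:
        if keep_original:
            out.append(row)
        for _ in range(k):
            x = dict(row)  # copy so the original row is never mutated
            for tf_index in sample_sequence():
                x = tfs[tf_index](x)
                if x is None:  # no valid transformation: drop this copy
                    break
            if x is not None:
                out.append(x)
    return out


def shout(x):
    # Toy TF: only transforms texts that contain the letter "a"
    if "a" not in x["text"]:
        return None
    x["text"] = x["text"].upper()
    return x


rows = [{"text": "abc"}, {"text": "xyz"}]
augmented = apply_tfs(rows, [shout], sample_sequence=lambda: [0])
print(augmented)  # [{'text': 'abc'}, {'text': 'ABC'}, {'text': 'xyz'}]
```

The second row has no valid transformation, so only its original appears in the output. The same effect explains why the augmented WiC DataFrame has fewer than twice as many rows as the input.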


Now let's take a look at the augmented dataset.

In [10]:
wic_df_synonym.head(25)

| | idx | label | pos | sentence1 | sentence1_idx | sentence2 | sentence2_idx | word |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | False | V | You must carry your camping gear . | 2 | Sound carries well over water . | 1 | carry |
| 1 | 1 | False | V | Messages must go through diplomatic channels . | 2 | Do you think the sofa will go through the door ? | 6 | go |
| 1 | 1 | False | V | Messages must move through diplomatic channels . | 2 | Do you think the sofa will move through the do... | 6 | move |
| 2 | 2 | False | V | Break an alibi . | 0 | The wholesaler broke the container loads into ... | 2 | break |
| 3 | 3 | True | N | He wore a jock strap with a metal cup . | 8 | Bees filled the waxen cups with honey . | 4 | cup |
| 4 | 4 | False | N | The Academy of Music . | 1 | The French Academy . | 2 | academy |
| 5 | 5 | False | V | Set the table . | 0 | To set glass in a sash . | 1 | set |
| 6 | 6 | True | V | Starch clothes . | 0 | She starched her blouses . | 1 | starch |
| 7 | 7 | False | V | Do you take sugar in your coffee ? | 2 | A reading was taken of the earth 's tremors . | 3 | take |
| 8 | 8 | True | V | I try to avoid the company of gamblers . | 3 | We avoided the ball . | 1 | avoid |


## Writing more transformation functions

Next, we'll write what is perhaps the simplest TF possible for WiC. The two sentences in each training example are unordered, so swapping them doesn't change the label. Since the model architecture we're using is not invariant to input order, swapping the two sentences gives us a new, unique training example. We'll apply this TF to the DataFrame that already contains the synonym-swapped examples, so that we get swapped versions of those too.

In [11]:
@transformation_function
def swap_sentences(x):
    x.sentence1, x.sentence2 = x.sentence2, x.sentence1
    x.sentence1_idx, x.sentence2_idx = x.sentence2_idx, x.sentence1_idx
    return x
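
The index swap on the second line of the TF matters: without it, each index would point into the wrong sentence. The sketch below checks this on one row from the table above, using an undecorated copy of the function (named `swap` here) and a `SimpleNamespace` in place of a DataFrame row.

```python
from types import SimpleNamespace


def swap(x):
    # Same body as the TF above, without the Snorkel decorator
    x.sentence1, x.sentence2 = x.sentence2, x.sentence1
    x.sentence1_idx, x.sentence2_idx = x.sentence2_idx, x.sentence1_idx
    return x


x = SimpleNamespace(
    sentence1="He wore a jock strap with a metal cup .", sentence1_idx=8,
    sentence2="Bees filled the waxen cups with honey .", sentence2_idx=4,
    word="cup", label=True,
)
swapped = swap(x)

# Each index still points at the key word in its own sentence
print(swapped.sentence1.split()[swapped.sentence1_idx])  # cups
print(swapped.sentence2.split()[swapped.sentence2_idx])  # cup
```

The label and key word are left untouched, since swapping the sentence order doesn't change whether the word is used in the same sense.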

Again, we'll define our policy and applier, then create our augmented DataFrame.

In [12]:
tfs = [swap_sentences]
policy = RandomAugmentationPolicy(len(tfs), sequence_length=1)
applier = PandasTFApplier(tfs, policy, k=1, keep_original=True)
wic_df_swapped = applier.apply(wic_df_synonym)
wic_df_swapped.head()

100%|██████████| 66/66 [00:00<00:00, 1289.65it/s]


| | idx | label | pos | sentence1 | sentence1_idx | sentence2 | sentence2_idx | word |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | False | V | You must carry your camping gear . | 2 | Sound carries well over water . | 1 | carry |
| 0 | 0 | False | V | Sound carries well over water . | 1 | You must carry your camping gear . | 2 | carry |
| 1 | 1 | False | V | Messages must go through diplomatic channels . | 2 | Do you think the sofa will go through the door ? | 6 | go |
| 1 | 1 | False | V | Do you think the sofa will go through the door ? | 6 | Messages must go through diplomatic channels . | 2 | go |
| 1 | 1 | False | V | Messages must move through diplomatic channels . | 2 | Do you think the sofa will move through the do... | 6 | move |


## Now it's your turn!

Try writing a transformation function of your own! Remember, it should output either a new example or `None`. Get creative! Just as we wrapped a resource from `nltk` in our synonym-swapping TF, you can wrap other existing resources, such as a language model. For more ideas, check out this [blog post](https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610) or this [more advanced blog post](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28).

We've included some starter code for inspiration:
```
@transformation_function
def my_tf(x):
    # Placeholder: replace this with your own transformation of x
    return x
    
tfs = [replace_word, my_tf]
policy = RandomAugmentationPolicy(len(tfs), sequence_length=2)
applier = PandasTFApplier(tfs, policy, k=2, keep_original=True)
wic_df_augmented = applier.apply(wic_df)
wic_df_augmented.head()
```
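
As one possible idea, here is a sketch of a TF body that deletes a random token other than the key word from the first sentence and adjusts the key word's index to match. The name `drop_random_token` and the `rng` parameter are hypothetical, and whether random deletion preserves the label well enough for WiC is an assumption you'd want to check. In the notebook, the function would be wrapped with `@transformation_function`; here it runs on a `SimpleNamespace` so it can be tried on its own.

```python
import random
from types import SimpleNamespace


def drop_random_token(x, rng=random):
    tokens = x.sentence1.split()
    # Candidate positions: every token except the key word
    candidates = [i for i in range(len(tokens)) if i != x.sentence1_idx]
    if not candidates:
        return None  # nothing we can safely delete
    drop = rng.choice(candidates)
    del tokens[drop]
    # Tokens after the deleted one shift left by one position
    if drop < x.sentence1_idx:
        x.sentence1_idx -= 1
    x.sentence1 = " ".join(tokens)
    return x


x = SimpleNamespace(
    sentence1="You must carry your camping gear .", sentence1_idx=2, word="carry"
)
result = drop_random_token(x, rng=random.Random(0))
print(result.sentence1, result.sentence1_idx)
```

Whichever token is removed, `sentence1_idx` still points at "carry" afterwards, which is the invariant any TF for this task needs to maintain.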

## Train with augmented data

Feeling ambitious? Try training a WiC model with your augmented data.

**_Important_**: to get the full training set, you'll need to re-execute from the beginning and set `max_data_samples` to `None`.

We'll construct our dataset with the default helpers.

In [13]:
from snorkel.mtl.data import MultitaskDataLoader
from snorkel.mtl.model import MultitaskModel
from snorkel.mtl.snorkel_config import default_config as config
from snorkel.mtl.trainer import Trainer

import superglue_tasks
from dataloaders import get_dataloaders
from superglue_parsers.wic import parse_from_rows
from tokenizer import get_tokenizer


max_sequence_length = 256
batch_size = 4
tokenizer_name = "bert-large-cased"
tokenizer = get_tokenizer(tokenizer_name)

# Construct training dataloader from augmented DF
rows = wic_df_swapped.to_dict("records")
dataset = parse_from_rows(rows, tokenizer, max_sequence_length)
train_dataloader = MultitaskDataLoader(
    task_to_label_dict={task_name: "labels"},
    dataset=dataset,
    split="train",
    batch_size=batch_size,
    shuffle=True,
)

valid_dataloader = get_dataloaders(
    data_dir,
    task_name=task_name,
    splits=["valid"],
    max_data_samples=None,
    max_sequence_length=max_sequence_length,
    tokenizer_name=tokenizer_name,
    batch_size=batch_size,
)[0]

dataloaders = [train_dataloader, valid_dataloader]

As in the slicing tutorial, we'll use the Snorkel API to configure a BERT-based model for this natural language understanding task. The BERT implementation again comes from [huggingface's BERT library](https://github.com/huggingface/pytorch-pretrained-BERT).

In [14]:
bert_model = "bert-large-cased"
base_task = superglue_tasks.task_funcs[task_name](bert_model)
tasks = [base_task]
tasks

[Task(name=WiC)]

In [15]:
model = MultitaskModel(
    name="SuperGLUE",
    tasks=tasks,
    dataparallel=False,
    device=-1,  # use CPU
)

Feel free to uncomment this block to experiment with it yourself! It will take a while to train on CPU.

In [16]:
# trainer = Trainer(**config)
# trainer.train_model(model, dataloaders)
# model.save("./model_with_data_augmentation.pth")
# model.score(dataloaders[1])