# An Introduction to WISER, Part 1: Labeling Rules

Welcome to WISER (*Weak and Indirect Supervision for Entity Recognition*), a system for training sequence tagging models, particularly neural networks for named entity recognition (NER) and related tasks. WISER trains these models with *weak supervision* in the form of rules, rather than with hand-labeled training data.

In this first part of the tutorial, we will be writing labeling rules to identify names, awards, and movies from a text corpus of actor descriptions extracted from Wikipedia.

## Loading Data
WISER is an add-on to [AllenNLP](http://allennlp.org), a framework for natural language processing. That means we can use its tools for working with data.

Let's start by loading the [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/) dataset, a common benchmark for NER.

In [29]:
%%capture

from allennlp.data.dataset_readers.conll2003 import Conll2003DatasetReader

dataset_reader = Conll2003DatasetReader(coding_scheme='IOB1')
train_data = dataset_reader.read('data/conll/eng.train')
dev_data = dataset_reader.read('data/conll/eng.testa')
test_data = dataset_reader.read('data/conll/eng.testb')

# Merge the training and development partitions so that the weak
# supervision rules can be applied to both at the same time
data = train_data + dev_data

The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Once the data is loaded, we use a WISER class called `Viewer` to inspect the sentences and tags.

In [6]:
from wiser.viewer import Viewer
Viewer(dev_data, height=100)


You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is a person (PER), award (AWD), or movie (MOV).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 2 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), meaning that each entity is represented as a run of consecutive tags beginning with `I`. The characters after this prefix name the class, for example `-PER` and `-MOV` here for person and movie, respectively. The `O` tag means that the token is not part of an entity. Tags beginning with `B` carry the same class suffixes as those beginning with `I`, but they are used only to start a new entity that immediately follows another entity of the same class with no `O` tag in between.
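
To make the IOB1 convention concrete, here is a minimal sketch in plain Python that turns a tag sequence into entity spans. The function name `iob1_to_spans` and the example tags are made up for illustration; they are not part of WISER or AllenNLP.

```python
def iob1_to_spans(tags):
    """Convert IOB1 tags into (start, end_exclusive, label) spans.

    Hypothetical helper for illustration only; not part of WISER.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            prefix, cls = 'O', None
        else:
            prefix, cls = tag.split('-', 1)
        # Close the open span when we leave the entity, hit a B tag,
        # or the class changes
        if start is not None and (prefix in ('O', 'B') or cls != label):
            spans.append((start, i, label))
            start, label = None, None
        # Open a new span on the first tag of an entity
        if prefix != 'O' and start is None:
            start, label = i, cls
    # Close a span that runs to the end of the sentence
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Two movies in a row: the B tag marks where the second one starts
tags = ['I-PER', 'I-PER', 'O', 'I-MOV', 'B-MOV', 'I-MOV', 'O']
spans = iob1_to_spans(tags)
print(spans)  # [(0, 2, 'PER'), (3, 4, 'MOV'), (4, 6, 'MOV')]
```

Without the `B-MOV` tag at index 4, the three movie tokens would be read as a single entity.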

## Writing Labeling Functions

After inspecting the data to get an idea of the tagging patterns in the text corpus, we are ready to start writing our own labeling functions! 

In [30]:
from wiser.lf import LabelingFunction

In [32]:
# TODO: Change to something different for the actor-award dataset
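class ConsecutiveCapitalization(LabelingFunction):

    def apply_instance(self, instance):
        tokens = instance['tokens']
        # Start with every token abstaining ('ABS' means no vote)
        labels = ['ABS'] * len(tokens)

        # Skip index 0: the first token of a sentence is usually capitalized
        for i in range(1, len(tokens)):
            text = tokens[i].text
            # Vote I-PER when the token's first character is uppercase
            if text.lower()[0] != text[0]:
                labels[i] = "I-PER"
        return labels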

lf = ConsecutiveCapitalization()
lf.apply(data)
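
The rule above votes on every capitalized token after the first one, which is part of why it produces many false positives. As a sketch of a stricter variant, the plain-Python function below votes only when a capitalized token has a capitalized neighbor. It works on a list of strings rather than AllenNLP instances, and the name `consecutive_capitalization_labels` is hypothetical.

```python
def consecutive_capitalization_labels(tokens):
    """Vote I-PER on capitalized tokens that sit next to another one.

    Illustrative sketch on plain strings; not the WISER class above.
    """
    def is_capitalized(token):
        return token[:1].isupper()

    labels = ['ABS'] * len(tokens)
    # Skip index 0, since sentence-initial capitalization is uninformative
    for i in range(1, len(tokens)):
        if not is_capitalized(tokens[i]):
            continue
        left = i - 1 >= 1 and is_capitalized(tokens[i - 1])
        right = i + 1 < len(tokens) and is_capitalized(tokens[i + 1])
        if left or right:
            labels[i] = 'I-PER'
    return labels


tokens = ['The', 'actor', 'Tom', 'Hanks', 'won', 'an', 'Oscar', '.']
labels = consecutive_capitalization_labels(tokens)
print(labels)
# ['ABS', 'ABS', 'I-PER', 'I-PER', 'ABS', 'ABS', 'ABS', 'ABS']
```

Here `Oscar` is left as an abstention because neither neighbor is capitalized, while `Tom Hanks` receives two `I-PER` votes.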

You can also use the existing labeling functions and helpers available in `wiser.lf`.

In [33]:
from wiser.lf import DictionaryMatcher

In [34]:
# TODO: change to some other dataset

# A small hard-coded set of person names; each name is a tuple of tokens
names = {("Steve", "Bach"), ("Barack", "Obama"), ("Phil", "Simmons")}
lf = DictionaryMatcher("Actor", names, i_label="I-PER")
lf.apply(data)

In [35]:
# Tag all punctuation marks as 'O'
punctuation_chars = {'.', ',', ':', ';', '-', '?', '!', '@', '$'}

lf = DictionaryMatcher("Punctuation", terms=punctuation_chars, i_label="O")
lf.apply(data)
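
To show the idea behind dictionary-based rules, here is a minimal plain-Python sketch that scans a token list for multi-token dictionary entries, preferring the longest match at each position. It is an illustration under simplifying assumptions, not WISER's `DictionaryMatcher` implementation, and the function name `dictionary_match` is hypothetical.

```python
def dictionary_match(tokens, terms, i_label):
    """Label every occurrence of a dictionary term; abstain elsewhere.

    Illustrative sketch only; each term is a tuple of tokens.
    """
    terms = {tuple(term) for term in terms}
    max_len = max((len(term) for term in terms), default=0)
    labels = ['ABS'] * len(tokens)

    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so longer terms win
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in terms:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = i_label
            i += matched
        else:
            i += 1
    return labels


tokens = ['Barack', 'Obama', 'met', 'Phil', 'Simmons', '.']
names = {('Steve', 'Bach'), ('Barack', 'Obama'), ('Phil', 'Simmons')}
name_labels = dictionary_match(tokens, names, 'I-PER')
punct_labels = dictionary_match(tokens, {('.',), (',',)}, 'O')
print(name_labels)   # ['I-PER', 'I-PER', 'ABS', 'I-PER', 'I-PER', 'ABS']
print(punct_labels)  # ['ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'O']
```

Note that a rule can vote `O` as well as an entity tag, which is how the punctuation rule tells the model that a token is not part of any entity.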

## Evaluating Labeling Functions

We can evaluate labeling functions on the development set in either of two ways. First, we can inspect individual labeling functions using the `score_labeling_functions` function.

In [36]:
from wiser.eval import score_labeling_functions
score_labeling_functions(dev_data)

| Labeling Function | TP | FP | FN | Token Acc. | Token Votes |
|---|---|---|---|---|---|
| Actor | 1 | 0 | 5942 | 1.0 | 2 |
| ConsecutiveCapitalization | 1387 | 4510 | 4556 | 0.3252 | 8575 |
| Punctuation | 0 | 0 | 5943 | 0.9993 | 4463 |


We can also look at the precision, recall, and F1 scores of the combined labeling rules with `score_labels_majority_vote`.

In [37]:
from wiser.eval import score_labels_majority_vote
score_labels_majority_vote(dev_data)

| | TP | FP | FN | P | R | F1 |
|---|---|---|---|---|---|---|
| Majority Vote | 1387 | 4510 | 4556 | 0.2352 | 0.2334 | 0.2343 |
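
The P, R, and F1 columns follow from the TP, FP, and FN counts by the standard definitions. The short sketch below recomputes them from the counts reported above; the helper name `precision_recall_f1` is made up for this example and is not part of `wiser.eval`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts taken from the majority vote row above
p, r, f1 = precision_recall_f1(tp=1387, fp=4510, fn=4556)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
# P=0.2352 R=0.2334 F1=0.2343
```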


## Saving Progress
We can use `pickle` to store the data with the labeling functions applied to it.

In [39]:
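import os
import pickle

# Create the output directory if it does not already exist
os.makedirs('tmp', exist_ok=True)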

with open('tmp/train_data.p', 'wb') as f:
    pickle.dump(train_data, f)

with open('tmp/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)

with open('tmp/test_data.p', 'wb') as f:
    pickle.dump(test_data, f)
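
In a later session, the saved data can be read back with `pickle.load`. The sketch below shows the round trip with a small stand-in list and a temporary directory, both of which are assumptions made so that the example runs on its own; in the tutorial you would open the files written above instead.

```python
import os
import pickle
import tempfile

# Stand-in for labeled data; the tutorial's real objects are AllenNLP instances
dev_data = [{'tokens': ['Tom', 'Hanks'], 'labels': ['I-PER', 'I-PER']}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dev_data.p')

    # Write the data in binary mode, as in the cell above
    with open(path, 'wb') as f:
        pickle.dump(dev_data, f)

    # Read it back, also in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == dev_data)  # True
```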

You have completed part 1! Now you can move on to part 2.