# An Introduction to WISER, Part 1: Tagging and Linking Rules

Welcome to WISER (*Weak and Indirect Supervision for Entity Recognition*), a system for training sequence tagging models, particularly neural networks for named entity recognition (NER) and related tasks. Instead of hand-labeled training data, WISER trains these models with *weak supervision* in the form of rules.

In this first part of the tutorial, we will write tagging and linking rules to identify actor and actress names, awards, and movies in a text corpus of actor descriptions extracted from Wikipedia.

## Loading Data
WISER is an add-on to [AllenNLP](http://allennlp.org), a great framework for natural language processing. That means we can use its tools for working with data.

Let's start by loading the Wikipedia actor-description corpus with WISER's `MediaDatasetReader`.

In [1]:
%%capture
from wiser.data.dataset_readers import MediaDatasetReader

dataset_reader = MediaDatasetReader()
train_data = dataset_reader.read('data/wikipedia/train.p')
dev_data = dataset_reader.read('data/wikipedia/dev.p')
test_data = dataset_reader.read('data/wikipedia/test.p')

# Merge the training, development, and test partitions so that
# rules can be applied to all of them at once
data = train_data + dev_data + test_data

The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Once the data is loaded, we use a WISER class called `Viewer` to inspect the sentences and tags.

In [2]:
from wiser.viewer import Viewer

Viewer(dev_data, height=100)

You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is an actor or actress (ACT), award (AWD), or movie (MOV).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 2 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), in which the tokens of an entity are represented as consecutive tags beginning with `I`. A suffix identifies the class, for example `-ACT` and `-MOV` here for actor/actress and movie, respectively. The `O` tag means that the token is not part of an entity. Tags beginning with `B` carry the same class suffixes but are used only to start a new entity that immediately follows another entity of the same class, with no `O` tag in between.
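
To make the scheme concrete, here is a minimal sketch of decoding IOB1 tags into entity spans. The helper `iob1_to_spans` and the example sentence are made up for illustration; they are not part of WISER or of the corpus.

```python
from typing import List, Tuple

def iob1_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB1 tags into (class, start, end) spans; end is inclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition('-')
        # An open span continues only on an I- tag of the same class.
        continues = prefix == 'I' and label == current
        if not continues:
            if current is not None:
                spans.append((current, start, i - 1))
            if prefix in ('I', 'B'):
                start, current = i, label   # a new entity starts here
            else:
                start, current = None, None  # 'O': outside any entity
    if current is not None:
        spans.append((current, start, len(tags) - 1))
    return spans

# Hypothetical sentence: two movies follow each other directly,
# so the second one needs a B- tag to be kept separate.
tokens = ['Tom', 'Hanks', 'made', 'Big', 'Splash']
tags = ['I-ACT', 'I-ACT', 'O', 'I-MOV', 'B-MOV']
print(iob1_to_spans(tags))
# [('ACT', 0, 1), ('MOV', 3, 3), ('MOV', 4, 4)]
```

Without the `B-MOV` tag, the last two tokens would be read as a single two-token movie title.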

# Tagging Rules
Tagging rules are functions that map text instances to sequences of labels. We can define our own tagging rules by writing small functions that look at sequences of instance tokens and vote on their corresponding tags.
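
As a rough illustration of what "voting" means, the sketch below combines the outputs of several rules token by token with a simple majority vote. It assumes that a rule abstains with the label `'ABS'`, and that tokens with no votes fall back to `'O'`; the function `majority_vote` and its tie-breaking are our own choices, not WISER's implementation.

```python
from collections import Counter
from typing import List

ABSTAIN = 'ABS'  # assumed abstention label

def majority_vote(votes_per_rule: List[List[str]], default: str = 'O') -> List[str]:
    """Pick the most frequent non-abstaining vote for each token."""
    n_tokens = len(votes_per_rule[0])
    combined = []
    for i in range(n_tokens):
        counts = Counter(v[i] for v in votes_per_rule if v[i] != ABSTAIN)
        if counts:
            # On ties, most_common keeps the label that was counted first.
            combined.append(counts.most_common(1)[0][0])
        else:
            combined.append(default)  # every rule abstained on this token
    return combined

rule_a = ['I-ACT', 'ABS',   'ABS', 'ABS']
rule_b = ['I-ACT', 'I-ACT', 'ABS', 'O']
rule_c = ['O',     'I-ACT', 'ABS', 'O']
print(majority_vote([rule_a, rule_b, rule_c]))
# ['I-ACT', 'I-ACT', 'O', 'O']
```

Because each rule only has to vote where it is confident, many narrow rules can be combined into broader coverage.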

## Writing Tagging Rules
From inspecting the data, we know tokens [TODO] are likely tagged as actor names. Therefore, we can write our first tagging rule to reflect this!

In [3]:
import sys
sys.path.append('../..')
from wiser.lf import LabelingFunction

In [4]:
# TODO: Change to actor labeling rule from dataset
locations = {'australia', 'canada', 'usa', 'france', 'england'}

class Location(LabelingFunction):

    def apply_instance(self, instance):
        # Start by abstaining ('ABS') on every token
        labels = ['ABS'] * len(instance['tokens'])

        # Vote 'I-LOC' on every token found in the set, including the first one
        for i, token in enumerate(instance['tokens']):
            if token.text.lower() in locations:
                labels[i] = 'I-LOC'
        return labels

lf = Location()
lf.apply(data)

You can also use the existing tagging functions and helpers available in `wiser.lf`. `DictionaryMatcher` is a helper that lets us quickly create a new rule that votes a predefined tag on every token found in a given set of terms.

In [5]:
from wiser.lf import DictionaryMatcher

In [6]:
#TODO: Add some generic dictionary matcher

A useful trick when developing effective sequence taggers is to also generate some negative supervision in the form of `O` tags, which mark tokens that are not part of any entity.

In [7]:
# Tags punctuation marks as 'O'. Feel free to add your own to the set!
punctuation_chars = {'.', ',', ':', ';', '-', '?', '!', '@', '$'}

lf = DictionaryMatcher("Punctuation", terms=punctuation_chars, i_label="O")
lf.apply(data)
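
To show the idea behind a dictionary-based rule without depending on WISER, here is a plain-Python sketch. The function `dictionary_vote`, its parameters, and the example sentence are hypothetical; it handles only single-token terms and assumes `'ABS'` as the abstention label, so the real `DictionaryMatcher` may well behave differently.

```python
from typing import Iterable, List

def dictionary_vote(tokens: List[str], terms: Iterable[str], label: str,
                    abstain: str = 'ABS', uncased: bool = False) -> List[str]:
    """Vote `label` on every token found in `terms`; abstain elsewhere."""
    normalize = (lambda s: s.lower()) if uncased else (lambda s: s)
    vocabulary = {normalize(term) for term in terms}
    return [label if normalize(tok) in vocabulary else abstain for tok in tokens]

# Hypothetical sentence, already tokenized
tokens = ['She', 'won', ',', 'then', 'retired', '.']
print(dictionary_vote(tokens, terms={'.', ','}, label='O'))
# ['ABS', 'ABS', 'O', 'ABS', 'ABS', 'O']
```

The rule stays silent on everything outside its dictionary, which is what lets it be combined safely with other rules.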

## Evaluating Tagging Rules
We can evaluate labeling functions on the development set in either of two ways. First, we can inspect individual labeling functions using the ``score_labeling_functions`` method.

In [8]:
from wiser.eval import score_labeling_functions

score_labeling_functions(dev_data)

| Labeling function | TP | FP | FN | Token Acc. | Token Votes |
|---|---|---|---|---|---|
| Location | 102 | 8 | 5841 | 0.9273 | 110 |
| Punctuation | 0 | 0 | 5943 | 0.9993 | 4463 |

We can also inspect the precision, recall, and F1 scores of the combined labeling rules with ``score_labels_majority_vote``.

In [9]:
from wiser.eval import score_labels_majority_vote

score_labels_majority_vote(dev_data)

| | TP | FP | FN | P | R | F1 |
|---|---|---|---|---|---|---|
| Majority Vote | 102 | 8 | 5841 | 0.9273 | 0.0172 | 0.0338 |
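
For reference, the standard definitions of precision, recall, and F1 can be checked against the counts reported for Majority Vote. The helper below is our own sketch, not WISER's code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts taken from the Majority Vote row
p, r, f1 = precision_recall_f1(tp=102, fp=8, fn=5841)
print(round(p, 4), round(r, 4), round(f1, 4))
# 0.9273 0.0172 0.0337

# F1 recomputed from the already rounded P and R
f1_from_rounded = 2 * 0.9273 * 0.0172 / (0.9273 + 0.0172)
print(round(f1_from_rounded, 4))
# 0.0338
```

The reported F1 of 0.0338 matches the second calculation, so the small gap from 0.0337 looks like a rounding effect rather than a different formula. Either way, the picture is the same: the two rules so far are precise but cover very few entities.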


# Linking Rules
Linking rules are functions that vote on whether two or more adjacent tokens should belong to the same entity.

## Writing Linking Rules
Tagging rules do not always vote correctly on all the tokens in multi-token entities. For instance, a rule may tag only *Barack* as a name in the span *Barack Obama*. We can therefore write linking rules to indicate that *Barack* and *Obama* should share the same tag.

In [10]:
from wiser.lf import LinkingFunction

In [11]:
# Two consecutive capitalized tokens should share the same tag
class ConsecutiveCapitals(LinkingFunction):

    def apply_instance(self, instance):
        # links[i] = 1 votes that token i belongs to the same entity as token i-1
        links = [0] * len(instance['tokens'])
        for i in range(1, len(instance['tokens'])):
            if instance['tokens'][i-1].text.istitle() \
                and instance['tokens'][i].text.istitle():
                links[i] = 1
        return links

lf = ConsecutiveCapitals()
lf.apply(data)
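
To see what a link buys us, the sketch below copies a tag onto a neighbouring token that a tagging rule abstained on, following the same convention as above (`links[i] == 1` ties token `i` to token `i-1`). This is only an illustration of what a link expresses, with `'ABS'` assumed as the abstention label and a made-up sentence; it is not how WISER itself combines tagging and linking votes.

```python
from typing import List

ABSTAIN = 'ABS'  # assumed abstention label

def consecutive_capital_links(tokens: List[str]) -> List[int]:
    """Same heuristic as ConsecutiveCapitals, on plain strings."""
    links = [0] * len(tokens)
    for i in range(1, len(tokens)):
        if tokens[i - 1].istitle() and tokens[i].istitle():
            links[i] = 1
    return links

def propagate_along_links(tags: List[str], links: List[int]) -> List[str]:
    """Fill abstentions from a linked neighbour that has a tag."""
    result = list(tags)
    # Left to right: take the tag of the linked token on the left.
    for i in range(1, len(result)):
        if links[i] == 1 and result[i] == ABSTAIN and result[i - 1] != ABSTAIN:
            result[i] = result[i - 1]
    # Right to left: take the tag of the linked token on the right.
    for i in range(len(result) - 2, -1, -1):
        if links[i + 1] == 1 and result[i] == ABSTAIN and result[i + 1] != ABSTAIN:
            result[i] = result[i + 1]
    return result

tokens = ['Meryl', 'Streep', 'won', 'an', 'Oscar']  # hypothetical sentence
tags = ['I-ACT', 'ABS', 'ABS', 'ABS', 'ABS']        # a rule tagged only 'Meryl'
links = consecutive_capital_links(tokens)
print(links)                               # [0, 1, 0, 0, 0]
print(propagate_along_links(tags, links))  # ['I-ACT', 'I-ACT', 'ABS', 'ABS', 'ABS']
```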

## Evaluating Linking Rules

Similar to tagging rules, we can evaluate the accuracy of our linking rules using the ``score_linking_functions`` method.

In [12]:
from wiser.eval import score_linking_functions

score_linking_functions(dev_data)

| Linking function | Entity Links | Non-Entity Links | Incorrect Links | Accuracy |
|---|---|---|---|---|
| ConsecutiveCapitals | 2109 | 106 | 290 | 0.8842 |
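
The reported accuracy is consistent with counting both entity links and non-entity links as correct and dividing by all links the rule produced. This reading is inferred from the numbers alone, not from WISER's source, and the helper `link_accuracy` is ours.

```python
def link_accuracy(entity_links: int, non_entity_links: int, incorrect_links: int) -> float:
    """Share of produced links that are not counted as incorrect."""
    correct = entity_links + non_entity_links
    total = correct + incorrect_links
    return correct / total if total else 0.0

# Counts taken from the ConsecutiveCapitals row
accuracy = link_accuracy(2109, 106, 290)
print(round(accuracy, 4))
# 0.8842
```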


# Saving Progress
We can use `pickle` to store the data with the tagging and linking rules applied to it.

In [13]:
import os
import pickle

# Create the output directory if it does not exist yet
os.makedirs('tmp', exist_ok=True)

with open('tmp/train_data.p', 'wb') as f:
    pickle.dump(train_data, f)

with open('tmp/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)

with open('tmp/test_data.p', 'wb') as f:
    pickle.dump(test_data, f)
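
Reading the data back is the mirror image of writing it. The round trip below uses a temporary directory and a small stand-in list instead of real instances, so the record layout shown is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a list of instances with rule outputs attached (made-up layout)
records = [{'tokens': ['Tom', 'Hanks'], 'votes': ['I-ACT', 'ABS']}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dev_data.p')

    # Write in binary mode ...
    with open(path, 'wb') as f:
        pickle.dump(records, f)

    # ... and read back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == records)
# True
```

Keep in mind that unpickling real instances requires the classes they were built from to be importable in the session that loads them.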

You have completed part 1! Now you can move on to part 2.