In [1]:
import sys
sys.path.append('../..')

# An Introduction to WISER: Part 1

Welcome to WISER (_Weak and Indirect Supervision for Entity Recognition_), a system for training sequence tagging models, particularly neural networks for named entity recognition (NER) and related tasks.

WISER uses _weak supervision_ in the form of rules to train these models, as opposed to hand-labeled training data.

## Loading Data
WISER is an add-on to [AllenNLP](http://allennlp.org), a framework for natural language processing. That means we can use its tools for working with data.

__TODO: pointers to prerequisite installation__

Let's start by loading the [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/) dataset, a common benchmark for NER.

In [2]:
from allennlp.data.dataset_readers.conll2003 import Conll2003DatasetReader

dataset_reader = Conll2003DatasetReader(coding_scheme='IOB1')
training_data = dataset_reader.read('data/eng.train')
dev_data = dataset_reader.read('data/eng.testa')

0it [00:00, ?it/s]06/18/2019 10:41:27 - INFO - allennlp.data.dataset_readers.conll2003 -   Reading instances from lines in file at: data/eng.train
14041it [00:01, 8653.63it/s]
0it [00:00, ?it/s]06/18/2019 10:41:28 - INFO - allennlp.data.dataset_readers.conll2003 -   Reading instances from lines in file at: data/eng.testa
3250it [00:00, 7645.85it/s] 


The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Now that the data is loaded, let's view it in a WISER class called `Viewer`.

In [3]:
from wiser.viewer import Viewer

Viewer(dev_data, height=100)



You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is a person (PER), location (LOC), organization (ORG), or miscellaneous (MISC).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 2 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

In [25]:
print(dev_data[2]['tokens'])

TextField of length 35 with text: 
 		[West, Indian, all-rounder, Phil, Simmons, took, four, for, 38, on, Friday, as, Leicestershire,
		beat, Somerset, by, an, innings, and, 39, runs, in, two, days, to, take, over, at, the, head, of,
		the, county, championship, .]
 		and TokenIndexers : {'tokens': 'SingleIdTokenIndexer'}


In [27]:
print(dev_data[2]['tokens'][:5])

[West, Indian, all-rounder, Phil, Simmons]


In [26]:
print(dev_data[2]['tags'])

SequenceLabelField of length 35 with labels:
 		['I-MISC', 'I-MISC', 'O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-ORG', 'O',
		'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
		'O', 'O']
 		in namespace: 'labels'.


In [41]:
print(dev_data[2]['WISER_LFs'])

{'ConsecutiveCapitalization': ['ABS', 'I-PER', 'ABS', 'I-PER', 'I-PER', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'I-PER', 'ABS', 'I-PER', 'ABS', 'I-PER', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS'], 'BornIn': ['ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS', 'ABS']}


Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), in which an entity is represented as a run of consecutive tags beginning with `I`. The suffix after the prefix names the entity class, for example `-MISC` and `-PER` here for miscellaneous and person, respectively. The `O` tag means that the token is not part of any entity. Tags beginning with `B` take the same class suffixes as `I` tags, but are used only to start a new entity that immediately follows another entity of the same class, with no `O` tag in between.
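To make the IOB1 rules concrete, here is a minimal sketch in plain Python that groups a tag sequence into entity spans. The helper `iob1_to_spans` is a hypothetical name used for illustration and is not part of WISER or AllenNLP; the tokens and tags are the first five from the instance at index 2, written as plain strings.

```python
def iob1_to_spans(tags):
    """Return (start, end_exclusive, label) for each entity in an IOB1 tag sequence."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            prefix, cls = 'O', None
        else:
            prefix, cls = tag.split('-', 1)
        # An open span ends at an 'O' tag, at a 'B-' tag, or when the class changes.
        if start is not None and (prefix != 'I' or cls != label):
            spans.append((start, i, label))
            start, label = None, None
        # Any entity tag opens a new span when none is currently open.
        if prefix != 'O' and start is None:
            start, label = i, cls
    # Close a span that runs to the end of the sequence.
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ['West', 'Indian', 'all-rounder', 'Phil', 'Simmons']
tags = ['I-MISC', 'I-MISC', 'O', 'I-PER', 'I-PER']

for start, end, label in iob1_to_spans(tags):
    print(label, tokens[start:end])
# MISC ['West', 'Indian']
# PER ['Phil', 'Simmons']
```

A `B` tag only matters when two entities of the same class touch: `iob1_to_spans(['I-PER', 'B-PER', 'I-PER'])` yields two `PER` spans, `(0, 1)` and `(1, 3)`, instead of one.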

## Writing Labeling Functions

In [3]:
from wiser.lf import LabelingFunction

In [5]:
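class ConsecutiveCapitalization(LabelingFunction):
    """Votes I-PER on every token after the first that starts with an uppercase letter."""

    def label_instance(self, instance):
        tokens = instance['tokens']
        # Start by abstaining ('ABS') on every token.
        labels = ['ABS'] * len(tokens)

        # Skip index 0: the first token of a sentence is usually capitalized anyway.
        for i in range(1, len(tokens)):
            text = tokens[i].text
            # If lowercasing changes the first character, it was uppercase.
            if text.lower()[0] != text[0]:
                labels[i] = 'I-PER'
        return labels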

lf = ConsecutiveCapitalization()
lf.apply(training_data)
lf.apply(dev_data)

In [4]:
from wiser.lf import DictionaryMatcher
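# A small hand-written set of person names; each name is a tuple of tokens.
names = {("Steve", "Bach"), ("Barack", "Obama"), ("Phil", "Simmons")}
# Both the first token and the following tokens of a match get the I-PER tag.
lf = DictionaryMatcher("BornIn", names, b_label="I-PER", i_label="I-PER")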

In [5]:
lf.apply(dev_data)

## Evaluating Labeling Functions

In [6]:
from wiser.eval import score_lfs, score_lfs_majority_vote

In [7]:
score_lfs(dev_data)

| | TP | FP | FN |
|---|---|---|---|
| BornIn | 1 | 0 | 5901 |


In [8]:
score_lfs_majority_vote(dev_data)

| | TP | FP | FN | P | R | F1 |
|---|---|---|---|---|---|---|
| Majority Vote | 1384 | 4513 | 4518 | 0.2347 | 0.2345 | 0.2346 |
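The precision (P), recall (R), and F1 values follow from the TP, FP, and FN counts. As a quick check, the sketch below recomputes them for the majority-vote row using the standard definitions. The helper `precision_recall_f1` is our own name for illustration and is not part of WISER.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the majority-vote row above.
p, r, f1 = precision_recall_f1(tp=1384, fp=4513, fn=4518)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
# P=0.2347 R=0.2345 F1=0.2346
```

Applied to the `BornIn` counts (TP=1, FP=0, FN=5901), the same function gives a precision of 1.0 but a recall of only about 0.0002: the rule is right when it votes, but it almost never votes.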


## Saving Progress
Let's store the data with the labeling function outputs for use in the next part of the tutorial. We just pickle the data.

In [9]:
import pickle

with open('data/training_data.p', 'wb') as f:
    pickle.dump(training_data, f)

with open('data/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)
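Reading the data back is the mirror image: open the file in binary mode and call `pickle.load`. The sketch below shows the round trip on a small stand-in list written to a temporary directory, so it runs without the CoNLL files. The helper `load_pickle` and the stand-in data are illustrative assumptions, not part of WISER.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Read back one object that was written with pickle.dump."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in for the labeled instances (hypothetical data, for illustration only).
example = [{'tokens': ['Phil', 'Simmons'], 'tags': ['I-PER', 'I-PER']}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.p')
    with open(path, 'wb') as f:
        pickle.dump(example, f)
    restored = load_pickle(path)

# The restored object compares equal to the one that was saved.
print(restored == example)
```

With the real files, the same helper would be called as `load_pickle('data/training_data.p')` and `load_pickle('data/dev_data.p')`, in an environment where AllenNLP and WISER are importable so that the pickled classes can be resolved.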

You have completed part 1! Now you can move on to part 2.