# An Introduction to WISER, Part 1: Tagging and Linking Rules

Welcome to WISER (*Weak and Indirect Supervision for Entity Recognition*), a system for training sequence tagging models, particularly neural networks for named entity recognition (NER) and related tasks. WISER trains these models with *weak supervision* in the form of rules instead of hand-labeled training data.

In this first part of the tutorial, we will write tagging and linking rules to identify actor and actress names, awards, and movies in a text corpus of actor descriptions extracted from Wikipedia.

## Loading Data
WISER is an add-on to [AllenNLP](http://allennlp.org), a great framework for natural language processing. That means we can use its tools for working with data.

Let's start by loading the MovieAwards dataset, a new dataset we created just for this tutorial.

In [1]:
%%capture
from wiser.data.dataset_readers import MediaDatasetReader

dataset_reader = MediaDatasetReader()
train_data = dataset_reader.read('data/wikipedia/unlabeled_train.csv')
dev_data = dataset_reader.read('data/wikipedia/labeled_dev.csv')
test_data = dataset_reader.read('data/wikipedia/labeled_test.csv')

# Merge the data partitions so that each rule can be applied to all of them at once
data = train_data + dev_data + test_data

The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Once the data is loaded, we use a WISER class called `Viewer` to inspect the sentences and tags.

In [2]:
from wiser.viewer import Viewer

Viewer(dev_data, height=120)


You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is an award (AWD) or a movie (MOV).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 4 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), meaning that an entity is represented as a run of consecutive tags beginning with `I`. A suffix on each tag identifies the class, for example `-ACT` and `-MOV` here for actor/actress and movie, respectively. The `O` tag means that the token is not part of an entity. Tags beginning with `B` carry the same class suffixes as those beginning with `I`, but they are used only to start a new entity that immediately follows another entity of the same class with no `O` tag in between.
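To make the IOB1 convention concrete, here is a minimal sketch that decodes a tag sequence into entity spans. It is plain Python rather than part of WISER: the helper name `iob1_to_spans` and the example sentence are invented for illustration.

```python
def iob1_to_spans(tags):
    """Convert IOB1 tags into (start, end, label) spans, with end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            # An O tag closes any entity that is currently open
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
            continue
        prefix, cls = tag.split('-', 1)
        # A new entity starts on a B- tag, after an O tag, or on a class change
        if prefix == 'B' or start is None or cls != label:
            if start is not None:
                spans.append((start, i, label))
            start, label = i, cls
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

# Invented example: two adjacent movies, then a two-token award
tokens = ['She', 'starred', 'in', 'Alien', 'Aliens', 'and', 'won', 'a', 'Saturn', 'Award']
tags = ['O', 'O', 'O', 'I-MOV', 'B-MOV', 'O', 'O', 'O', 'I-AWD', 'I-AWD']

spans = iob1_to_spans(tags)
for start, end, label in spans:
    print(label, tokens[start:end])
# MOV ['Alien']
# MOV ['Aliens']
# AWD ['Saturn', 'Award']
```

Without the `B-MOV` tag, the two adjacent movie tokens would be read as a single two-token entity, which is exactly the ambiguity the `B` prefix resolves.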

# Tagging Rules
Tagging rules are functions that map text instances to sequences of labels. We can define our own tagging rules by writing small functions that look at the tokens of an instance and vote on their corresponding tags.

## Writing Tagging Rules
From inspecting the data, we know that proper nouns followed by a year (or years) in parentheses are likely to be movies. We can write our first tagging rule to reflect this.

In [3]:
from wiser.lf import TaggingRule

In [4]:
class MovieYear(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        for i in range(len(tokens)-2):
            
            # Proper nouns followed by a numerical year between parentheses
            if tokens[i].istitle() and tokens[i+1] == '(' and tokens[i+2].isdigit():
                labels[i] = 'I-MOV'
               
        return labels

lf = MovieYear()
lf.apply(data)

In [12]:
common_fp = {'network', 'netflix', 'hulu', 'bbc', 'fox', 'disney', 'hbo', 'cbs',
             'channel', 'american', 'british', 'television', 'showtime', 'productions'}

class CommonFP(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        for i in range(len(tokens)):
            if tokens[i].lower() in common_fp:
                labels[i] = 'O'
               
        return labels

lf = CommonFP()
lf.apply(data)

You can also read the votes of previously applied tagging rules to build more complex rules. However, be mindful of the order in which you run the rules: a rule that reads another rule's votes must run after it.

In [13]:
movie_keywords = {'trilogy', 'miniseries', 'saga', 'series',
                  'show', 'sitcom', 'drama', 'musical', 'franchise'}
# Suggested additions: 'film', 'prequel', 'sequel', 'thriller', 'opera'

class MovieKeywords(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        false_positives = [t for t in instance['WISER_LABELS']['CommonFP']]
        labels = ['ABS'] * len(tokens)

        for i in range(len(tokens)):
            if tokens[i].lower() in movie_keywords:
                
                # Only tag a neighboring word as a movie if the false-positive
                # rule abstained from voting on it (e.g., we want to avoid
                # tagging "Hulu" in phrases like "The Hulu miniseries").
                # We also skip neighbors that are themselves keywords, to
                # avoid some award names like "... Musical Drama".

                # Keyword followed by a movie (e.g., "the TV series The Mandalorian")
                if i < len(tokens) - 1 and tokens[i+1].istitle() and false_positives[i+1] == 'ABS':
                    if tokens[i+1].lower() not in movie_keywords:
                        labels[i+1] = 'I-MOV'

                # Movie followed by a keyword (e.g., "the Kung-Fu Panda franchise")
                elif i > 0 and tokens[i-1].istitle() and false_positives[i-1] == 'ABS':
                    if tokens[i-1].lower() not in movie_keywords:
                        labels[i-1] = 'I-MOV'
        return labels

lf = MovieKeywords()
lf.apply(data)
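The ordering constraint can be illustrated with a small stand-alone sketch. It uses a plain dictionary as a stand-in for an instance; only the `'WISER_LABELS'` key mirrors the tutorial code, while the `run` helper and the simplified rules are hypothetical and not part of WISER.

```python
# Stand-in for an instance: tokens plus a store of per-rule votes
instance = {'tokens': ['The', 'Hulu', 'miniseries'], 'WISER_LABELS': {}}

def common_fp(inst):
    """Vote O on known false positives, abstain elsewhere."""
    blocked = {'hulu', 'netflix'}
    return ['O' if t.lower() in blocked else 'ABS' for t in inst['tokens']]

def movie_keywords(inst):
    """Tag the token before 'miniseries' unless CommonFP voted on it."""
    earlier = inst['WISER_LABELS']['CommonFP']  # KeyError if CommonFP has not run
    labels = ['ABS'] * len(inst['tokens'])
    for i, tok in enumerate(inst['tokens']):
        if tok == 'miniseries' and i > 0 and earlier[i - 1] == 'ABS':
            labels[i - 1] = 'I-MOV'
    return labels

def run(rules, inst):
    """Apply rules in the given order and store each rule's votes by name."""
    for name, rule in rules:
        inst['WISER_LABELS'][name] = rule(inst)

# Wrong order: the dependent rule cannot find the votes it needs
try:
    run([('MovieKeywords', movie_keywords)], instance)
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True

# Right order: CommonFP first, then the rule that reads its votes
instance['WISER_LABELS'].clear()
run([('CommonFP', common_fp), ('MovieKeywords', movie_keywords)], instance)

print(wrong_order_failed)                         # True
print(instance['WISER_LABELS']['MovieKeywords'])  # ['ABS', 'ABS', 'ABS']
```

In the second run "Hulu" stays untagged because the false-positive rule already voted `O` on it.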

You can also use the existing tagging rules and helpers available in `wiser.lf`. `DictionaryMatcher` is a helper that lets us quickly create a new rule that votes a predefined tag on every token found in a particular set.

In [14]:
from wiser.lf import DictionaryMatcher

A good trick for developing effective sequence taggers is to also generate some negative supervision in the form of *O* tags. We can therefore write a rule that tags punctuation marks as *O*.

In [15]:
# Feel free to add your own characters to the set!
non_entity_punctuation_chars = {'.', ';', '(', ')'}

lf = DictionaryMatcher("Non-Entity-Punctuation", terms=non_entity_punctuation_chars, i_label="O")
lf.apply(data)
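As a rough mental model of what a dictionary-based rule does, the sketch below votes a fixed label on every token found in a set and abstains elsewhere. It is not WISER's implementation: `dictionary_votes` is a hypothetical helper, it handles single-token terms only, and the example tokens are invented.

```python
def dictionary_votes(tokens, terms, label, abstain='ABS'):
    """Vote `label` on every token found in `terms`; abstain elsewhere."""
    return [label if tok in terms else abstain for tok in tokens]

tokens = ['Alien', '(', '1979', ')', '.']
votes = dictionary_votes(tokens, {'.', ';', '(', ')'}, 'O')
print(votes)  # ['ABS', 'O', 'ABS', 'O', 'O']
```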

## Evaluating Tagging Rules
We can evaluate tagging rules on the development set in either of two ways. First, we can inspect individual tagging rules using the ``score_tagging_rules`` function.

In [16]:
from wiser.eval import score_tagging_rules

score_tagging_rules(dev_data)

| Tagging rule | TP | FP | FN | Token Acc. | Token Votes |
|---|---|---|---|---|---|
| CommonFP | 0 | 0 | 1612 | 0.9048 | 315 |
| MovieKeywords | 45 | 148 | 1567 | 0.9124 | 194 |
| MovieYear | 163 | 662 | 1449 | 0.9661 | 825 |
| Non-Entity-Punctuation | 0 | 0 | 1612 | 0.9976 | 2970 |


We can also inspect the precision, recall, and F1 scores of the combined tagging rules with ``score_labels_majority_vote``.

In [17]:
from wiser.eval import score_labels_majority_vote

score_labels_majority_vote(dev_data)

|  | TP | FP | FN | P | R | F1 |
|---|---|---|---|---|---|---|
| Majority Vote | 217 | 775 | 1395 | 0.2188 | 0.1346 | 0.1667 |
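For intuition, here is a minimal sketch of a token-level majority vote and of the standard precision, recall, and F1 formulas applied to the counts reported above. The tie-breaking and the fallback to `O` when every rule abstains are assumptions made for this sketch, and WISER's own implementation may differ; `majority_vote` and `precision_recall_f1` are hypothetical helpers.

```python
from collections import Counter

def majority_vote(votes_per_rule, abstain='ABS', default='O'):
    """Combine per-rule tag sequences token by token, ignoring abstentions."""
    combined = []
    for token_votes in zip(*votes_per_rule):
        counts = Counter(v for v in token_votes if v != abstain)
        # Assumption: fall back to `default` when every rule abstains
        combined.append(counts.most_common(1)[0][0] if counts else default)
    return combined

def precision_recall_f1(tp, fp, fn):
    """Standard definitions, guarding against division by zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Invented votes from two rules over four tokens
votes = [
    ['I-MOV', 'ABS', 'ABS', 'ABS'],  # e.g. a MovieYear-style rule
    ['ABS', 'O', 'ABS', 'O'],        # e.g. a punctuation rule
]
combined = majority_vote(votes)
print(combined)  # ['I-MOV', 'O', 'O', 'O']

# Counts taken from the majority-vote results above
p, r, f1 = precision_recall_f1(tp=217, fp=775, fn=1395)
print(round(p, 4), round(r, 4), round(f1, 4))
```

The low recall reflects how few tokens our handful of rules vote on, which is the usual starting point before more rules are added.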


# Linking Rules
Linking rules are functions that vote on whether two or more adjacent tokens should belong to the same entity.

## Writing Linking Rules
Tagging rules do not always vote correctly on all the tokens in a multi-token entity. For instance, a rule may tag only *Barack* as a name in the span *Barack Obama*. We can therefore write linking rules to indicate that *Barack* and *Obama* should share the same tag.
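The sketch below illustrates the idea by copying a tag across a link onto a neighboring token that no tagging rule voted on. It is only an intuition aid and not how WISER itself combines tagging and linking votes. It assumes that `links[i] == 1` means token `i` is linked to token `i - 1`; the helper `propagate_links` and the `I-PER` label are hypothetical.

```python
def propagate_links(tags, links, abstain='ABS'):
    """Copy a tag onto an abstaining neighbor when a link joins the two tokens.

    links[i] == 1 means token i is voted to share a tag with token i - 1.
    """
    out = list(tags)
    # Left to right: extend a tag forward across links
    for i in range(1, len(out)):
        if links[i] == 1 and out[i] == abstain and out[i - 1] != abstain:
            out[i] = out[i - 1]
    # Right to left: extend a tag backward across links
    for i in range(len(out) - 1, 0, -1):
        if links[i] == 1 and out[i - 1] == abstain and out[i] != abstain:
            out[i - 1] = out[i]
    return out

tokens = ['Barack', 'Obama', 'spoke']
tags = ['I-PER', 'ABS', 'ABS']  # a tagging rule voted on "Barack" only
links = [0, 1, 0]               # a linking rule joined "Obama" to "Barack"

linked = propagate_links(tags, links)
print(linked)  # ['I-PER', 'I-PER', 'ABS']
```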

In [18]:
from wiser.lf import LinkingRule

In [19]:
class ConsecutiveCapitals(LinkingRule):
    
    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i-1].istitle() and tokens[i].istitle():
                links[i] = 1
        return links

lf = ConsecutiveCapitals()
lf.apply(data)

In [20]:
linkers = {':', ';', '-'}

class SentenceLinkers(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in linkers:
                links[i] = 1
                links[i+1] = 1
        return links

lf = SentenceLinkers()
lf.apply(data)

In [21]:
# Assumes the tokenizer splits "don't" into "do" + "n't"
contraction_suffixes = {"'s", "n't", "'ve"}

class Contractions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i] in contraction_suffixes:
                links[i] = 1
        return links

lf = Contractions()
lf.apply(data)

In [22]:
common_prepositions = {'a', 'the', 'at', 'with', 'of'}
# Suggestions: in, by, for
class CommonPrepositions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in common_prepositions:
                if tokens[i-1].istitle() and tokens[i+1].istitle():
                    links[i] = 1
                    links[i+1] = 1
        return links

lf = CommonPrepositions()
lf.apply(data)

## Evaluating Linking Rules

As with tagging rules, we can evaluate the accuracy of our linking rules using the ``score_linking_rules`` function.

In [23]:
from wiser.eval import score_linking_rules

score_linking_rules(dev_data)

| Linking rule | Entity Links | Non-Entity Links | Incorrect Links | Accuracy |
|---|---|---|---|---|
| CommonPrepositions | 206 | 75 | 1 | 0.9965 |
| ConsecutiveCapitals | 1738 | 816 | 17 | 0.9934 |
| Contractions | 38 | 111 | 0 | 1.0 |
| SentenceLinkers | 182 | 404 | 12 | 0.9799 |
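The reported accuracies are consistent with counting both entity links and non-entity links as correct and dividing by the total number of links. This reading is inferred from the numbers above rather than from WISER's source, and `link_accuracy` is a hypothetical helper.

```python
def link_accuracy(entity_links, non_entity_links, incorrect_links):
    """Fraction of links that join tokens sharing the same gold tag."""
    correct = entity_links + non_entity_links
    return correct / (correct + incorrect_links)

# (Entity Links, Non-Entity Links, Incorrect Links) from the results above
rows = {
    'CommonPrepositions': (206, 75, 1),
    'ConsecutiveCapitals': (1738, 816, 17),
    'Contractions': (38, 111, 0),
    'SentenceLinkers': (182, 404, 12),
}

accuracies = {name: link_accuracy(*counts) for name, counts in rows.items()}
for name, acc in accuracies.items():
    print(name, round(acc, 4))
```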


# Saving Progress
We can use `pickle` to store the data with the tagging and linking rules applied to it.

In [24]:
import pickle

with open('output/tmp/train_data.p', 'wb') as f:
    pickle.dump(train_data, f)

with open('output/tmp/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)

with open('output/tmp/test_data.p', 'wb') as f:
    pickle.dump(test_data, f)
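Note that `open(..., 'wb')` fails if the target directory does not exist yet. The following sketch shows one way to create the directory first and to check that the data survives a round trip. It uses a temporary directory and an invented stand-in record rather than real WISER instances.

```python
import os
import pickle
import tempfile

# Invented stand-in for a list of instances with rule votes attached
data = [{'tokens': ['Alien', '(', '1979', ')'],
         'votes': {'MovieYear': ['I-MOV', 'ABS', 'ABS', 'ABS']}}]

with tempfile.TemporaryDirectory() as tmp_root:
    out_dir = os.path.join(tmp_root, 'output', 'tmp')
    os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing

    path = os.path.join(out_dir, 'dev_data.p')
    with open(path, 'wb') as f:
        pickle.dump(data, f)

    # Reload the file to confirm that the votes were stored with the data
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == data)  # True
```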

You have completed part 1! Now you can move on to part 2.