# An Introduction to WISER, Part 1: Tagging and Linking Rules

Welcome to WISER (*Weak and Indirect Supervision for Entity Recognition*), a system for training sequence tagging models, particularly neural networks for named entity recognition (NER) and related tasks. WISER uses *weak supervision* in the form of rules to train these models, rather than hand-labeled training data.

In this first part of the tutorial, we will write tagging and linking rules to identify award names, along with movies, T.V. shows, and plays, in a text corpus extracted from Wikipedia.

## Loading Data
WISER is an add-on to [AllenNLP](http://allennlp.org), a great framework for natural language processing. That means we can use its tools for working with data.

Let's start by loading the Media dataset, a new dataset we created just for this tutorial. 

In [1]:
%%capture
from wiser.data.dataset_readers import MediaDatasetReader

dataset_reader = MediaDatasetReader()
train_data = dataset_reader.read('data/wikipedia/unlabeled_train.csv') # Reads only data for 100 actors
dev_data = dataset_reader.read('data/wikipedia/labeled_dev.csv')
test_data = dataset_reader.read('data/wikipedia/labeled_test.csv')

# We must merge the data to apply the rules to all of it at once
data = train_data + dev_data + test_data

The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Once the data is loaded, we use a WISER class called `Viewer` to inspect the sentences and tags.

In [2]:
from wiser.viewer import Viewer
Viewer(dev_data, height=120)

You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is an award (**AWD**), or one of a movie, T.V. show, or play (**MOV**).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 4 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), meaning that entities are represented as consecutive tags beginning with **I**. Many data sets use subsequent characters for different classes, for example **-AWD** and **-MOV** here for awards and movies/T.V. shows/plays, respectively. The **O**, or other tag, means that the token is not part of an entity. There is also a special set of tags beginning with **B** (like those beginning with **I**) that are used to start a new entity that immediately follows another of the same class without an **O** tag in between.
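
To make the scheme concrete, here is a minimal sketch that decodes an IOB1 tag sequence into entity spans. The helper name `iob1_to_spans` and the example tag sequence are illustrative assumptions; they are not part of WISER.

```python
def iob1_to_spans(tags):
    """Decodes IOB1 tags into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            prefix, cls = 'O', None
        else:
            prefix, _, cls = tag.partition('-')
        # A new entity starts on a B- tag or when the class changes
        starts_new = cls is not None and (prefix == 'B' or cls != label)
        # Close the open entity on an O tag or when a new entity starts
        if label is not None and (cls is None or starts_new):
            spans.append((start, i, label))
            start, label = None, None
        if starts_new:
            start, label = i, cls
    # Close an entity that runs to the end of the sequence
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

tags = ['O', 'I-AWD', 'I-AWD', 'O', 'I-MOV', 'B-MOV', 'I-MOV']
spans = iob1_to_spans(tags)
print(spans)
```

The **B-MOV** tag at index 5 splits what would otherwise be one three-token movie span into two adjacent movie entities.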

# Tagging Rules
Tagging rules are functions that map unlabeled text instances to sequences of labels. We can define our own tagging rules by writing small functions that look at sequences of instance tokens and vote on their corresponding tags. Let's first import the ``TaggingRule`` class from ``wiser.lf``.

In [3]:
from wiser.lf import TaggingRule

## Writing Simple Tagging Rules
From inspecting the data, we know that proper nouns followed by a year in parentheses are likely to be movies. For instance, the token ``Friends`` in the span ``Friends (1994 - 2004)`` should be tagged **I-MOV**. We can write our first tagging rule to reflect this!

In [4]:
class MovieYear(TaggingRule):
    
    def apply_instance(self, instance):

        # Creates a list of token strings to inspect
        tokens = [t.text for t in instance['tokens']]
        
        # Initializes a list of abstained label votes 
        # (All abstained votes should have the ABS tag)
        labels = ['ABS'] * len(tokens)
        
        # Iterates over the list of tokens
        for i in range(len(tokens)-2):    
            # Tags as movies all proper nouns followed by a number between parentheses
            if tokens[i].istitle() and tokens[i+1] == '(' and tokens[i+2].isdigit():
                labels[i] = 'I-MOV'
               
        # Returns the modified label vote list
        return labels

# Applies the tagging rule to all dataset entries 
lf = MovieYear()
lf.apply(data)
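
To check the rule's logic without loading the dataset, the same loop can be run on a plain list of token strings. The function `movie_year_votes` below is a hypothetical stand-in that mirrors `MovieYear.apply_instance`; it does not use the WISER API.

```python
def movie_year_votes(tokens):
    """Mirrors the loop in MovieYear.apply_instance on plain token strings."""
    labels = ['ABS'] * len(tokens)
    for i in range(len(tokens) - 2):
        # A capitalized token, then "(", then a number
        if tokens[i].istitle() and tokens[i + 1] == '(' and tokens[i + 2].isdigit():
            labels[i] = 'I-MOV'
    return labels

friends = movie_year_votes(['Friends', '(', '1994', '-', '2004', ')'])
gatsby = movie_year_votes(['The', 'Great', 'Gatsby', '(', '2013', ')'])
print(friends)
print(gatsby)
```

Only the token immediately before the opening parenthesis receives a vote, so in a multi-token title such as ``The Great Gatsby`` the rule abstains on ``The`` and ``Great``.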

We can also write a tagging rule to identify award categories like ``for Outstanding Lead Actress`` in award spans such as ``BAFTA Award for Outstanding Lead Actress``. Categories are generally preceded by a capitalized word and begin with the strings ``for Outstanding`` or ``for Best`` (skip to the instance at index 27 to see an example of this).

In [5]:
class AwardCategory(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        for i in range(len(tokens)-2):
            if tokens[i].istitle() and tokens[i+1] == 'for' and tokens[i+2] in {'Best', 'Outstanding'}:
                # We tag the "for" and "Best"/"Outstanding" tokens as award names
                labels[i+1] = 'I-AWD'
                labels[i+2] = 'I-AWD'
               
        return labels

lf = AwardCategory()
lf.apply(data)
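
Both rules so far use Python's built-in `str.istitle()` as a proxy for "capitalized word", so it helps to know exactly what it accepts. The word list below is only an illustration; the behavior shown is that of the standard library.

```python
# str.istitle() is True only when every run of letters starts with an
# uppercase letter and continues in lowercase
examples = ['Award', 'BAFTA', 'Kung-Fu', 'McDonald', 'the', '2013']
results = {word: word.istitle() for word in examples}

for word, is_title in results.items():
    print(f'{word!r}: {is_title}')
```

All-caps acronyms such as ``BAFTA`` and mixed-case names such as ``McDonald`` are not title-cased, and neither are strings without letters. In ``BAFTA Award for Outstanding Lead Actress`` the `AwardCategory` rule therefore fires because ``Award`` precedes ``for``, not because of ``BAFTA``.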

### Tagging Function Helpers

You can also use the existing tagging rules and helpers available at `wiser.lf`. The ``DictionaryMatcher`` is a helper that lets us quickly create a new rule that votes on any term found in a list or set of characters or words.

In [6]:
from wiser.lf import DictionaryMatcher

Now let's tag some award keywords! Any token spelled ``Award``, ``Awards``, ``Prize`` or ``Cup`` should be tagged as an award. Be mindful of capitalization, since awards are generally proper nouns.

In [7]:
award_keywords = [['Award'], ['Awards'], ['Prize'], ['Cup']]
                  
lf = DictionaryMatcher("AwardKeywords", terms=award_keywords, i_label="I-AWD", uncased=False)
lf.apply(data)
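
Conceptually, a dictionary-based rule scans each sentence for the listed terms and votes on the tokens it finds. The sketch below is a simplified plain-Python stand-in, not WISER's implementation: the function `match_terms` is hypothetical, and lowercasing both the terms and the tokens when `uncased=True` is an assumption about how case-insensitive matching would work.

```python
def match_terms(tokens, terms, i_label, uncased=False):
    """Votes i_label on every token inside a matched term, ABS elsewhere."""
    normalize = (lambda s: s.lower()) if uncased else (lambda s: s)
    text = [normalize(t) for t in tokens]
    labels = ['ABS'] * len(tokens)
    for term in terms:
        term = [normalize(t) for t in term]
        n = len(term)
        # Slide a window of the term's length over the sentence
        for start in range(len(text) - n + 1):
            if text[start:start + n] == term:
                labels[start:start + n] = [i_label] * n
    return labels

award_keywords = [['Award'], ['Awards'], ['Prize'], ['Cup']]
tokens = ['She', 'won', 'the', 'Pulitzer', 'Prize', 'and', 'another', 'award']

cased = match_terms(tokens, award_keywords, 'I-AWD', uncased=False)
uncased = match_terms(tokens, award_keywords, 'I-AWD', uncased=True)
print(cased)
print(uncased)
```

With case-sensitive matching only ``Prize`` is tagged; the lowercase ``award`` at the end of the sentence is picked up only when matching is case-insensitive, which is why the capitalization setting matters for proper nouns.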

A good trick for developing effective sequence taggers is to also generate some negative supervision in the form of **O** tags. To do so, we can write a rule that tags punctuation marks as **O**.

In [8]:
non_entity_punctuation_chars = {'.', ';', '(', ')'}

lf = DictionaryMatcher("Non-EntityPunctuation", terms=non_entity_punctuation_chars, i_label="O")
lf.apply(data)

We recommend going over the data and identifying a few false positive tokens, that is, tokens that resemble entities but are not (e.g., capitalized tokens such as studio names, and recurrent proper nouns near movie titles). We will also write a `DictionaryMatcher` to identify some common false positives and tag them as such:

In [9]:
common_false_positives = [['network'], ['netflix'], ['hulu'], ['bbc'], ['fox'], 
                          ['disney'], ['hbo'], ['CBS'], ['channel'], ['american'], 
                         ['showtime'], ['productions'], ['TV']]

lf = DictionaryMatcher("CommonFalsePositives", terms=common_false_positives, i_label="O", uncased=True)
lf.apply(data)

### Looking at Previous Tagging Rules

You can also develop more complex tagging rules by looking at previous tagging rule votes using the ``WISER_LABELS`` field. However, be mindful of the order in which you run the tagging functions.

In the following example, we will write a tagging rule to identify typical keywords that precede **MOV** entities (e.g., ``The TV series The Mandalorian``) or follow them (e.g., ``Kung-Fu Panda franchise``). However, we also want to avoid tagging some common false positives (e.g., ``TV``) as movies, which is why we will reference the output votes of the ``CommonFalsePositives`` rule.

In [10]:
movie_keywords = {'trilogy', 'saga', 'series', 'miniseries', 
                'show', 'opera', 'drama', 'musical', 'sequel',
                'prequel', 'franchise', 'thriller', 'sitcom'}

class MovieKeywords(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)

        # List of tag votes of CommonFalsePositives rule
        false_positives = [t for t in instance['WISER_LABELS']['CommonFalsePositives']]
        
        for i in range(len(tokens)):
            if tokens[i].lower() in movie_keywords:
                
                # We only tag a word as a movie if the CommonFalsePositives
                # rule abstained from voting on it (e.g., we want to avoid
                # false positives such as "Hulu" in spans like
                # "the Hulu miniseries").
                #
                # We also want to avoid award names like "... Musical Drama".

                # Keywords followed by movies (e.g., the series The Mandalorian).
                # The bound check keeps tokens[i+1] inside the sentence.
                if i < len(tokens) - 1 and tokens[i+1].istitle() and false_positives[i+1] == 'ABS':
                    if tokens[i+1].lower() not in movie_keywords:
                        labels[i+1] = 'I-MOV'

                # Movies followed by keywords (e.g., Kung-Fu Panda franchise)
                elif i > 0 and tokens[i-1].istitle() and false_positives[i-1] == 'ABS':
                    if tokens[i-1].lower() not in movie_keywords:
                        labels[i-1] = 'I-MOV'
        return labels

lf = MovieKeywords()
lf.apply(data)
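
The order dependence can be seen in a toy setup where a plain dictionary plays the role of the ``WISER_LABELS`` field. Everything below (the function names, the tiny keyword sets, and the `run` helper) is a hypothetical simplification for illustration and does not use the WISER API.

```python
def false_positive_votes(tokens):
    """Toy version of CommonFalsePositives: votes O on a few known names."""
    return ['O' if t.lower() in {'hulu', 'tv'} else 'ABS' for t in tokens]

def keyword_votes(tokens, earlier_votes):
    """Votes I-MOV on a capitalized token before a keyword, unless vetoed."""
    labels = ['ABS'] * len(tokens)
    for i in range(1, len(tokens)):
        if tokens[i].lower() in {'miniseries', 'franchise'}:
            if tokens[i - 1].istitle() and earlier_votes[i - 1] == 'ABS':
                labels[i - 1] = 'I-MOV'
    return labels

def run(tokens, order):
    votes = {}  # plays the role of instance['WISER_LABELS']
    for name in order:
        if name == 'CommonFalsePositives':
            votes[name] = false_positive_votes(tokens)
        else:
            # Reads the earlier rule's votes; fails if that rule has not run
            votes[name] = keyword_votes(tokens, votes['CommonFalsePositives'])
    return votes

right_order = ['CommonFalsePositives', 'MovieKeywords']
hulu = run(['the', 'Hulu', 'miniseries'], right_order)
panda = run(['Kung-Fu', 'Panda', 'franchise'], right_order)

try:
    run(['the', 'Hulu', 'miniseries'], ['MovieKeywords', 'CommonFalsePositives'])
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True

print(hulu['MovieKeywords'], panda['MovieKeywords'], wrong_order_failed)
```

In this toy version, ``Hulu`` is left untagged because the earlier rule voted **O** on it, ``Panda`` is tagged **I-MOV**, and running the rules in the opposite order fails with a `KeyError` because the earlier votes do not exist yet.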

## Using Existing Models

We also recommend having one or two high-recall tagging rules that add a lot of negative supervision in the form of **O** tags. These types of rules are generally weighted in the discriminative model to strike a balance between positive entity and non-entity votes.

For this, we will use spaCy's part-of-speech tagger.

In [None]:
import spacy

# Loads spaCy's small English pipeline, which includes a part-of-speech tagger
nlp = spacy.load("en_core_web_sm")


class NonEntityWords(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        
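        # Runs spaCy on each token string separately and keeps the
        # part of speech of the first spaCy token in each result
        parts_of_speech = [doc[0].pos_ for doc in nlp.pipe(tokens)]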
        labels = ['ABS'] * len(tokens)

        for i, (token, pos) in enumerate(zip(tokens, parts_of_speech)):
            if pos in {'NOUN', 'VERB', 'ADJ', 'SPACE', 'NUM'} and not token.istitle():
                labels[i] = 'O'
        return labels

lf = NonEntityWords()
lf.apply(data)

## Evaluating Tagging Rules
We can inspect the performance of individual tagging rules on the development set using the ``score_tagging_rules`` function.

* True Positives (TP) are the number of items correctly labeled as belonging to a positive class (e.g. **I-MOV**).

* False Positives (FP) are the number of items incorrectly labeled as belonging to a positive class.

* False Negatives (FN) are the number of items that were not labeled as belonging to a positive class but should have been.

* Token Accuracy (Token Acc.) represents the fraction of issued votes that correctly identified a positive class.

* Token Votes is the total number of times the tagging rule issued a vote for a positive class.

In [None]:
from wiser.eval import score_tagging_rules
score_tagging_rules(dev_data)
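
To make the definitions above concrete, here is a small sketch that computes them for one rule on one sentence. It is not the implementation behind `score_tagging_rules`: the function `score_votes` is hypothetical, all counts are taken per token with exact tag matching, and a false negative is assumed to be a gold entity token on which the rule abstained or voted **O**.

```python
def score_votes(votes, gold):
    """Token-level TP, FP, FN, accuracy and vote count for one rule."""
    tp = fp = fn = 0
    for vote, truth in zip(votes, gold):
        if vote not in ('ABS', 'O'):
            # The rule voted for a positive class
            if vote == truth:
                tp += 1
            else:
                fp += 1
        elif truth != 'O':
            # The rule stayed silent (or voted O) on an entity token
            fn += 1
    token_votes = tp + fp
    accuracy = tp / token_votes if token_votes else None
    return {'TP': tp, 'FP': fp, 'FN': fn,
            'Token Acc.': accuracy, 'Token Votes': token_votes}

gold = ['O', 'I-MOV', 'I-MOV', 'O', 'I-AWD']
votes = ['ABS', 'ABS', 'I-MOV', 'I-MOV', 'ABS']
scores = score_votes(votes, gold)
print(scores)
```

Here the rule issues two positive votes, one of which is correct, so its token accuracy is 0.5 even though it missed two entity tokens; accuracy and coverage are separate concerns.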

In [None]:
# Counts the total number of tokens in the development set
c = 0
for instance in dev_data:
    for tok in instance['tokens']:
        c += 1
print(c)

A good rule of thumb is to write tagging rules whose accuracy is above 90%.

# Linking Rules
Linking rules are simple functions that vote on whether two or more adjacent tokens should belong to the same entity. To get started with linking rules, you can import the ``LinkingRule`` class from ``wiser.lf``.

In [None]:
from wiser.lf import LinkingRule

## Writing Linking Rules
Tagging rules do not always vote correctly on *all* the tokens in multi-token entities. For instance, the **MovieYear** tagging rule only tags the last token in a movie span: it would only tag ``Gatsby`` as **I-MOV** in ``The Great Gatsby (2013)``.

Our job is to ensure that entire entity spans are tagged correctly. For instance, we could write a linking rule to indicate that consecutive capitalized words should share the same tag. Voting that ``The`` and ``Great`` share the same tag as ``Gatsby`` would then tag the entire movie name as **I-MOV**, rather than only the last token.

In [None]:
class ConsecutiveCapitals(LinkingRule):
    
    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i-1].istitle() and tokens[i].istitle():
                links[i] = 1 # token at index "i" shares tag with token at index "i-1"
        return links

lf = ConsecutiveCapitals()
lf.apply(data)
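
The effect a link vote is meant to have can be illustrated by spreading a tag along linked neighbors. The sketch below is a deterministic toy: the helpers `consecutive_capitals` and `propagate` are hypothetical, and the code illustrates what a link vote expresses rather than how WISER combines tagging and linking votes.

```python
def consecutive_capitals(tokens):
    """Mirrors ConsecutiveCapitals: links[i] == 1 ties token i to token i-1."""
    links = [0] * len(tokens)
    for i in range(1, len(tokens)):
        if tokens[i - 1].istitle() and tokens[i].istitle():
            links[i] = 1
    return links

def propagate(labels, links):
    """Copies a tag across each link until no abstained neighbor remains."""
    labels = list(labels)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(labels)):
            if links[i] != 1:
                continue
            if labels[i] == 'ABS' and labels[i - 1] != 'ABS':
                labels[i] = labels[i - 1]
                changed = True
            elif labels[i - 1] == 'ABS' and labels[i] != 'ABS':
                labels[i - 1] = labels[i]
                changed = True
    return labels

tokens = ['The', 'Great', 'Gatsby', '(', '2013', ')']
tag_votes = ['ABS', 'ABS', 'I-MOV', 'ABS', 'ABS', 'ABS']  # MovieYear's vote

links = consecutive_capitals(tokens)
linked_tags = propagate(tag_votes, links)
print(links)
print(linked_tags)
```

The single **I-MOV** vote on ``Gatsby`` reaches ``Great`` and ``The`` through the two links, while the parenthesis and the year stay untouched because no link connects them to the title.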

In our data we have also observed several movie and award names that have hyphens, semicolons or colons (e.g. ``Avengers: Endgame``). We can write a linking rule to indicate that these linking punctuation characters, along with their preceding and succeeding token, should all be a part of the same entity.

In [None]:
linkers = {':', ';', '-'}

class PunctuationLinkers(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in linkers:
                
                # The linking punctuation character and its succeeding token
                # share the same tag as the preceding token at index "i-1"
                links[i] = 1
                links[i+1] = 1
        return links

lf = PunctuationLinkers()
lf.apply(data)

Similarly, we can write a rule to indicate that contractions share the same tag with the token preceding them.

In [None]:
# Note that the negation suffix is tokenized as "n't", not "'nt"
contraction_suffixes = {"'s", "n't", "'ve", "'", "'d"}

class Contractions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i] in contraction_suffixes:
                links[i] = 1
        return links

lf = Contractions()
lf.apply(data)

We can also link noun phrases using a list of prepositions and articles that are common in award and movie names. These words are part of the names, are usually lowercase, and are adjacent to other such words or to capitalized words. For example, ``Golden Globe for Best Actor`` or ``Guardians of the Galaxy``.

In [None]:
common_prepositions = {'a', 'the', 'at', 'with', 'of', 'by', '&'}

class CommonPrepositions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in common_prepositions:
                if tokens[i-1].istitle() or tokens[i-1] in common_prepositions:
                    if tokens[i+1].istitle() or tokens[i+1] in common_prepositions:
                        links[i] = 1
                        links[i+1] = 1
        return links

lf = CommonPrepositions()
lf.apply(data)
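
As a quick sanity check, the same loop can be traced on plain token lists. The function `preposition_links` is a hypothetical stand-in that mirrors `CommonPrepositions.apply_instance` without using the WISER API, and the second sentence is an invented negative example.

```python
common_prepositions = {'a', 'the', 'at', 'with', 'of', 'by', '&'}

def preposition_links(tokens):
    """Mirrors the loop in CommonPrepositions.apply_instance."""
    links = [0] * len(tokens)
    for i in range(1, len(tokens) - 1):
        if tokens[i] in common_prepositions:
            left_ok = tokens[i - 1].istitle() or tokens[i - 1] in common_prepositions
            right_ok = tokens[i + 1].istitle() or tokens[i + 1] in common_prepositions
            if left_ok and right_ok:
                links[i] = 1      # the listed word joins the token before it
                links[i + 1] = 1  # and the token after it joins as well
    return links

title = preposition_links(['Guardians', 'of', 'the', 'Galaxy'])
plain = preposition_links(['fell', 'out', 'of', 'favor'])
print(title)
print(plain)
```

The movie title is linked from end to end, while the ordinary phrase gets no links because the words around ``of`` are neither capitalized nor in the list.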

### Linking Rule Helpers

Similar to tagging rules, we have linking rule helpers available at ``wiser.lf``. For the next linking rule, we will use the ``ElmoLinkingRule``, a rule that vectorizes tokens using [ELMo](https://allennlp.org/elmo) and links those with a cosine similarity larger than a given threshold.

In [None]:
from wiser.lf import ElmoLinkingRule

In [None]:
# We link tokens whose cosine similarity is larger than 0.8
# (this may take a while)
lf = ElmoLinkingRule(0.8)
lf.apply(data)
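
The thresholding idea can be sketched with toy vectors in plain Python. This is not the `ElmoLinkingRule` implementation: the two-dimensional vectors stand in for real ELMo embeddings, the helper names are hypothetical, and comparing each token only with the one before it is an assumption based on how the other linking rules in this tutorial work.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_links(vectors, threshold):
    """Links token i to token i-1 when their vectors are similar enough."""
    links = [0] * len(vectors)
    for i in range(1, len(vectors)):
        if cosine_similarity(vectors[i - 1], vectors[i]) > threshold:
            links[i] = 1
    return links

# Toy vectors: the first two point in nearly the same direction
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_links = similarity_links(vectors, threshold=0.8)
print(toy_links)
```

Raising the threshold makes the rule link fewer token pairs, which trades coverage for accuracy.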

## Evaluating Linking Rules

Similar to tagging rules, we can evaluate the accuracy of our linking rules using the ``score_linking_rules`` function.

* Entity Links represents the number of correct links generated for positive classes.
* Non-Entity Links represents the number of correct links generated for negative classes (e.g., **O** tags).
* Incorrect Links represents the total number of incorrectly generated links.
* Accuracy represents the fraction of issued links that identified correct links.

In [None]:
from wiser.eval import score_linking_rules
score_linking_rules(dev_data)
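
The four quantities above can be computed by hand for a single sentence. The sketch below is not the implementation behind `score_linking_rules`: the helper names are hypothetical, and a link is assumed to be correct when the two gold tags belong to the same entity (same class, and the second tag does not start a new entity with **B-**) or when both are **O**.

```python
def same_entity(prev_tag, tag):
    """True if two adjacent gold IOB1 tags continue the same entity."""
    if prev_tag == 'O' or tag == 'O':
        return False
    return prev_tag[2:] == tag[2:] and not tag.startswith('B-')

def score_links(links, gold):
    """Counts entity, non-entity and incorrect links for one sentence."""
    entity = non_entity = incorrect = 0
    for i in range(1, len(links)):
        if links[i] != 1:
            continue  # the rule abstained on this pair
        if same_entity(gold[i - 1], gold[i]):
            entity += 1
        elif gold[i - 1] == 'O' and gold[i] == 'O':
            non_entity += 1
        else:
            incorrect += 1
    total = entity + non_entity + incorrect
    accuracy = (entity + non_entity) / total if total else None
    return {'Entity Links': entity, 'Non-Entity Links': non_entity,
            'Incorrect Links': incorrect, 'Accuracy': accuracy}

# Gold tags for: The Great Gatsby ( 2013 )
gold = ['I-MOV', 'I-MOV', 'I-MOV', 'O', 'O', 'O']
links = [0, 1, 1, 1, 0, 1]
link_scores = score_links(links, gold)
print(link_scores)
```

The link between ``Gatsby`` and the opening parenthesis crosses an entity boundary, so it counts as incorrect and brings the accuracy of this toy rule down to 0.75.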

Once more, a good rule of thumb is for every linking rule to have an accuracy above 90%.

# Saving Progress
We can use pickle to store the data with the tagging and linking rules applied to it.

In [None]:
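import os
import pickle

# Creates the output directory if it does not exist yet
os.makedirs('output/tmp', exist_ok=True)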

with open('output/tmp/train_data.p', 'wb') as f:
    pickle.dump(train_data, f)

with open('output/tmp/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)

with open('output/tmp/test_data.p', 'wb') as f:
    pickle.dump(test_data, f)
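
Reading the data back in a later session is the mirror image of the code above. The round trip below is a self-contained sketch: it uses a temporary directory and a small stand-in list (`records`) instead of the real instances, so the file name and contents are illustrative only.

```python
import os
import pickle
import tempfile

# Stand-in for a list of instances with rule votes attached
records = [{'tokens': ['Avengers', ':', 'Endgame'],
            'votes': ['I-MOV', 'ABS', 'I-MOV']}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dev_data.p')

    # Save, as in the cell above
    with open(path, 'wb') as f:
        pickle.dump(records, f)

    # Load again: note the 'rb' mode and pickle.load
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == records)
```

Since unpickling can execute arbitrary code, only load pickle files that you created yourself or that come from a source you trust.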

You have completed part 1! Now you can move on to part 2.