# An Introduction to WISER, Part 1: Tagging and Linking Rules

Welcome to WISER (*Weak and Indirect Supervision for Entity Recognition*), a system for training sequence-to-sequence models, particularly neural networks for named entity recognition (NER) and related tasks. WISER uses *weak supervision* in the form of rules to train these models, as opposed to hand-labeled training data.

In this first part of the tutorial, we will be writing tagging and linking rules to identify actor and acress names, awards, and movies from a text corpus of actor descriptions extracted from Wikipedia.

## Loading Data
WISER is an add-on to [Allen NLP](http://allennlp.org), a great framework for natural language processing. That means we can use their tools for working with data.

Let's start by loading the MovieAwards dataset, a new NEW dataset we created just for this tutorial. 

In [1]:
%%capture
from wiser.data.dataset_readers import MediaDatasetReader

dataset_reader = MediaDatasetReader()
train_data = dataset_reader.read('data/wikipedia/unlabeled_train.csv')
dev_data = dataset_reader.read('data/wikipedia/labeled_dev.csv')
test_data = dataset_reader.read('data/wikipedia/labeled_test.csv')

# To simultaneously apply rules to the data, we must merge it """ 
data = train_data + dev_data + test_data

The easiest way to use WISER with other data sets is to implement a new subclass of AllenNLP's [DatasetReader](https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset_readers.dataset_reader.html#allennlp.data.dataset_readers.dataset_reader.DatasetReader). We have some additional examples in the package `wiser.data.dataset_readers`.

## Inspecting Data
Once the data is loaded, we use a WISER class called `Viewer` to inspect the sentences and tags.

In [3]:
from wiser.viewer import Viewer

In [4]:
Viewer(dev_data, height=120)

<IPython.core.display.Javascript object>

Viewer(html='<head>\n<style>\nspan.active {\n    background-color: skyblue;\n    box-shadow: 1px 1px 1px grey;…

You can use the left and right buttons to flip through the items in `dev_data`, each of which is an AllenNLP [`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance). The highlighted spans are the entities, and you can hover over each one with your cursor to see whether it is an award (**AWD**), or movie (**MOV**).

The drop-down menu selects which source of labels is displayed. Currently only the gold labels from the benchmark are available, but we will add more soon.

Advance to the instance at index 4 to see an example with multiple entities of different classes. You can access the underlying tokens and tags too.

Notice that WISER uses the [IOB1 tagging scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), meaning that entities are represented as consecutive tags beginning with `I`. Many data sets use subsequent characters for different classes, for example `-ACT` and `-MOV` here for actor/actress and movie, respectively. The `O` tag means that the token is not part of an entity. There is also a special set of tags beginning with `B` (like those beginning with `I`) that are used to start a new entity that immediately follows another of the same class without an `O` tag in between.

# Tagging Rules
Tagging rules are functions that map text instances to sequences of labels. We can define our own tagging rules by writing small functions that look at sequences of instance tokens, and vote on their correponding tags. Let's first import the ``TaggingRule`` class from ``wiser.lf``

In [15]:
from wiser.lf import TaggingRule

## Writing Tagging Rules
From inspecting the data, we know tokens proper nouns followed by a year between parentheses are likely tagged as movies. For instance, the token *Avatar* in the span "Avatar (2019)" should be classified as a movie. Therefore, we can write our first tagging rule to reflect this!

In [16]:
class MovieYear(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        # Proper nouns followed by a numerical year between parentheses
        for i in range(len(tokens)-2):    
            if tokens[i].istitle() and tokens[i+1] == '(' and tokens[i+2].isdigit():
                labels[i] = 'I-MOV'
               
        return labels

lf = MovieYear()
lf.apply(data)

We can also write a tagging rule to identify award categories such as *for Oustanding Lead Actress* in *BAFTA Award for Oustanding Lead Actress*. Categories are generally preceded by capitalized letters, and follow with *for Oustanding* or *for Best*. 

In [17]:
class AwardCategory(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        for i in range(len(tokens)-2):
            if tokens[i].istitle() and tokens[i+1] == 'for' and tokens[i+2] in {'Best', 'Oustanding'}:
                labels[i+1] = 'I-AWD'
                labels[i+2] = 'I-AWD'
               
        return labels

lf = AwardCategory()
lf.apply(data)

Similarly, any token spelling *Award* or *Prize* or *Cup* should be tagged as an award (note the capitalization, since awards are generally proper nouns).

In [18]:
keywords = {'Award', 'Awards', 'Prize', 'Cup'}
class AwardKeywords(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        labels = ['ABS'] * len(tokens)
        
        # Searches for Award or Prize (case sensitive)
        for i in range(len(tokens)):
            if tokens[i] in keywords:
                labels[i] = 'I-AWD'
               
        return labels

lf = AwardKeywords()
lf.apply(data)

You can also use existing tagging functions and helpers available at `wiser.lf`. The ``DictionaryMatcher`` is a tagging function helper that allows us to quickly create a new rule that votes on tokens encountered in in a particular set using a predefined tag. One of these helper functions is the `DictionaryMatcher`, which allows us to vote on any element found in a list or set of characters or words.

In [19]:
from wiser.lf import DictionaryMatcher

For this tutorial, we will use the ``DictionaryMatcher`` to tag tokens that appear in set of award keywords.

In [20]:
award_keywords = [['Award'], ['Awards'], ['Prize'], ['Cup']]
                  
lf = DictionaryMatcher("AwardKeywords", terms=award_keywords, i_label="I-AWD", uncased=False)
lf.apply(data)

A good trick to developing efficient sequence taggers is to also generate some negative supervision in the form of *O* tags. To do so, we must write a function to tag punctuations signs as *O* tags.

In [21]:
non_entity_punctuation_chars = {'.', ';', '(', ')'}

lf = DictionaryMatcher("Non-EntityPunctuation", terms=non_entity_punctuation_chars, i_label="O")
lf.apply(data)

We recommend going over the data and identifying a few false positive tokens. That is, tokens that are similar to entities but are not (e.g., capitalized tokens such as studio names, and recurrent tokens near movie titles). We will also write a `DictionaryMatcher` to reflect this heuristic.

In [22]:
common_false_positives = [['network'], ['netflix'], ['hulu'], ['bbc'], ['fox'], 
                          ['disney'], ['hbo'], ['CBS'], ['channel'], ['american'], 
                         ['showtime'], ['productions'], ['TV']]

lf = DictionaryMatcher("CommonFalsePositives", terms=common_false_positives, i_label="O", uncased=True)
lf.apply(data)

A trick to develop more complex tagging rules is to look at previous tagging rules using the ``WISER_LABELS`` field. However, be mindful of the order in which you run the tagging functions.

In [23]:
movie_keywords = {'trilogy', 'saga', 'series', 'miniseries', 
                'show', 'opera', 'drama', 'musical', 'sequel',
                'prequel', 'franchise', 'thriller', 'sitcom'}

class MovieKeywords(TaggingRule):
    
    def apply_instance(self, instance):

        tokens = [t.text for t in instance['tokens']]
        false_positives = [t for t in instance['WISER_LABELS']['CommonFalsePositives']]
        labels = ['ABS'] * len(tokens)

        for i in range(len(tokens)):
            if tokens[i].lower() in movie_keywords:
                
                """
                    We will only tag a word as a movie if 
                    the CommonFalsePositives asbtained from voting it
                    (e.g., we want to avoid false positives such as 
                    "Hulu" in sentences like "The Hulu miniseries")

                    We also want to avoid some award names 
                    like "... Musical Drama", etc.
                """ 
                
                # Keywords followed by movies (e.g., Kung-Fu Panda franchise)
                if i < len(tokens) and tokens[i+1].istitle() and false_positives[i+1] == 'ABS':
                    if tokens[i+1].lower() not in movie_keywords:
                        labels[i+1] = 'I-MOV'
                       
                # Movies followed by keywords (e.g., The TV series The Mandalorian)
                elif i > 0 and tokens[i-1].istitle() and false_positives[i-1] == 'ABS':
                    if tokens[i-1].lower() not in movie_keywords:
                        labels[i-1] = 'I-MOV'
        return labels

lf = MovieKeywords()
lf.apply(data)

## Evaluating Tagging Rules
We can inspect the performance of individual labeling functions on the development set using the ``score_labeling_functions`` method. 

* True positives (TP) represent the number of items correctly labeled items as belonging to a positive class (e.g. **I-MOV**).

* False positives (FP) are the number of items incorrectly labeled as belonging to a positive class.

* False Negatives (FN) are the items which were not labeled as belonging to the positive class but should have been.

* Token Accuracy (Token Acc.) represents the fraction of issued votes that correctly identified a positive class.

* Token Votes is the total number of times the tagging rules issued a vote belongint to a positive class.

In [24]:
from wiser.eval import score_tagging_rules

score_tagging_rules(dev_data)

Unnamed: 0,TP,FP,FN,Token Acc.,Token Votes
AwardCategory,0,58,1612,0.9655,116
AwardKeywords,0,207,1612,0.9855,207
CommonFalsePositives,0,0,1612,0.9317,205
MovieKeywords,49,165,1563,0.9163,215
MovieYear,163,662,1449,0.9661,825
Non-EntityPunctuation,0,0,1612,0.9976,2970


A good rule of thumb is to aim for tagging rules whose accuracy is above 90%.

# Linking Rules
Linking rules are simple functions that vote on whether two or more adjacent tokens belong should belong to the same entity.

## Writing Linking Rules
Tagging rules do not always correctly vote on all the tokens in multi-span entities. For instance, the **Award** tagging rule only tag the *Award* keywords as a positive class in the term *Emmy Award*. Therefore, since we want to tag the entire span as an award, we can write linking rules to indicate that consecutively capitalized words should share the same tag.

In [25]:
from wiser.lf import LinkingRule

In [26]:
class ConsecutiveCapitals(LinkingRule):
    
    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i-1].istitle() and tokens[i].istitle():
                links[i] = 1
        return links

lf = ConsecutiveCapitals()
lf.apply(data)

In our data we have also observed several movie and award names that have hyphens, semicolons or colons. We can write a linking rule to indicate that these characters should be a part of the same span.

In [27]:
linkers = {':', ';', '-'}

class PunctuationLinkers(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in linkers:
                links[i] = 1
                links[i+1] = 1
        return links

lf = PunctuationLinkers()
lf.apply(data)

Similarly, we know that contractions share the same tag with the token preceding them

In [28]:
contraction_suffixes = {'\'s', '\'nt', '\'ve', '\'', '\'d'}

class Contractions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)):
            if tokens[i] in contraction_suffixes:
                links[i] = 1
        return links

lf = Contractions()
lf.apply(data)

In [31]:
common_prepositions = {'a', 'the', 'at', 'with', 'of', 'by', '&'}
class CommonPrepositions(LinkingRule):

    def apply_instance(self, instance):
        tokens = [t.text for t in instance['tokens']]
        links = [0] * len(tokens)
        
        for i in range(1, len(tokens)-1):
            if tokens[i] in common_prepositions:
                if tokens[i-1].istitle() and tokens[i+1].istitle():
                    links[i] = 1
                    links[i+1] = 1
        return links

lf = CommonPrepositions()
lf.apply(data)

Todo: movie enumerations separated by a comma

Similar to tagging rules, we have linking rule helpers available at ``wiser.lf``. For the next linking rule, we will use the ``ElmoLinkingRule``, a rule that vectorizes tokens using [Elmo](https://allennlp.org/elmo) and links tokens whose cosine similaritiy is larger than a given threshold.

In [36]:
from wiser.lf import ElmoLinkingRule

In [39]:
# We link tokens whose cosine similarity is larger than 0.8 (this may take a while)
lf = ElmoLinkingRule(0.8)
lf.apply(data)

KeyboardInterrupt: 

## Evaluating Linking Rules

Similar to tagging rules, we can evaluate the accuracy of our linking rules using the ``score_linking_functions`` method.

* Entity Links represents the number of correct links generated for positive classes.
* Non-Entity Links represents the number of correct links generated for negative classes (e.g., **O** tags).
* Incorrect links represent the total number of incorrectly generated links.
* Accuracy represents the fraction of issued links that identified correct links.

In [32]:
from wiser.eval import score_linking_rules

score_linking_rules(dev_data)

Unnamed: 0,Entity Links,Non-Entity Links,Incorrect Links,Accuracy
CommonPrepositions,208,78,2,0.9931
ConsecutiveCapitals,1738,816,17,0.9934
Contractions,40,116,0,1.0
PunctuationLinkers,182,404,12,0.9799


# Saving Progress
We can use pickle to store the data with the tagging and linking rules applied to it

In [33]:
import pickle

with open('output/tmp/train_data.p', 'wb') as f:
    pickle.dump(train_data, f)

with open('output/tmp/dev_data.p', 'wb') as f:
    pickle.dump(dev_data, f)

with open('output/tmp/test_data.p', 'wb') as f:
    pickle.dump(test_data, f)

You have completed part 1! Now you can move on to part 2.