# Using Snorkel to Extract Performances and their Directors

**Notes**: 
- You are supposed to write your code or modify our code in any cell with `# TODO`.
- Much of the content of this notebook is borrowed from the Snorkel Introduction Tutorial.

State-of-the-art extraction techniques require massive labeled training sets, which are costly to obtain. To overcome this problem, Snorkel helps you rapidly create training sets using the data programming paradigm. To start, developers write a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel uses a generative model to learn how to combine those labeling functions to label more data. The newly labeled data can then be used to train high-quality end models.

**In summary, in this task, you will first manually label 50 documents and use this labeled data as a development set to create your own labeling functions. Then, you will train a generative model to label the remaining 450 documents in the training set. Finally, you will train a discriminative model (Bi-LSTM) to produce your final extraction model!**

## Task

**In this homework, you need to extract the list of `performances` and their `directors` from the set of IMDB biographies that you collected for Homework 2. For example, you need to extract three tuples: [(`Lost on Purpose`, `the Nelms Brothers`), (`Waffle Street`, `the Nelms Brothers`), (`Small Town Crime`, `the Nelms Brothers`)] from the following sentence.**

```
He would go on to act in three consecutive, but very different films written and directed by the Nelms Brothers: Lost on Purpose, Waffle Street and Small Town Crime.
```

**In cases where your collected biographies do not contain enough pairs of `performances` and `directors`, please feel free to use the example dataset as well**.

In [None]:
# TODO: COMBINE ALL OF YOUR BIOGRAPHIES IN ONE CSV FILE AND SUBMIT "Firstname_Lastname_hw05_all.csv"

## Prepare environment

Let's install the packages we will use. In my testing, Snorkel v0.7 works best with Python 3.6.

In [None]:
# If you are using Anaconda, you can create a new Python 3.6 environment.

# !conda create -n py36 python=3.6

In [None]:
!pip install -r requirements.txt

We will work with Snorkel version 0.7 (Beta). We can retrieve it by running the following command:

In [None]:
!curl -L "https://github.com/snorkel-team/snorkel/archive/v0.7.0-beta.tar.gz" -o snorkel_v0_7_0.tar.gz

Now let's uncompress the package and install Snorkel

In [None]:
!tar -xvzf snorkel_v0_7_0.tar.gz

In [None]:
!pip install snorkel-0.7.0-beta/

## Creating a development set

We need to preprocess our documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We'll also create _candidates_ out of these contexts, which are the objects we want to classify: in this case, possible pairs of a performance and its director mentioned in the biographies. Finally, we'll load some gold labels for evaluation.

All of this preprocessed input data is saved to a database. In Snorkel, if no database is specified, then a SQLite database at `./snorkel.db` is created by default -- so no setup is needed here!

In [None]:
# ** STUDENT CODE

import numpy as np, os
from pathlib import Path

from snorkel import SnorkelSession
from snorkel.parser import TSVDocPreprocessor, CorpusParser
from snorkel.parser.spacy_parser import Spacy
from snorkel.models import Document, Sentence, candidate_subclass
from snorkel.viewer import SentenceNgramViewer
from snorkel.annotations import LabelAnnotator, load_gold_labels

# TODO: SET LOCATION WHERE YOU STORE YOUR HW5 FILES
if 'HW_DIR' not in os.environ:
    HW_DIR = Path(".")
else:
    HW_DIR = Path(os.environ['HW_DIR'])
    assert HW_DIR.exists()

## Initializing a `SnorkelSession`

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

session = SnorkelSession()

## Loading the Corpus

Next, we load and pre-process the corpus of documents.

In [None]:
doc_preprocessor = TSVDocPreprocessor(HW_DIR / 'cast_bios.tsv')

## Running a `CorpusParser`

We'll use [Spacy](https://spacy.io/), an NLP preprocessing tool, to split our documents into sentences and tokens, and provide named entity annotations.

In [None]:
# Uncomment this to download spacy model
# !python -m spacy download [model_name] (e.g. en_core_web_lg)

corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor)

We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [None]:
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

## Generating Candidates

The next step is to extract _candidates_ from our corpus. A `Candidate` in Snorkel is an object for which we want to make a prediction. In this case, the candidates are pairs of performances and directors mentioned in sentences.

The [Spacy](https://spacy.io/) parser we used performs _named entity recognition_ for us. Next, we'll split the documents into training and development sets and collect the associated sentences.

### Writing a simple director name matcher

Our **simple** name matcher makes use of the fact that the names of the directors are mentions of person-type named entities in the documents. `Snorkel` provides a list of built-in matchers that can be used in many information extraction tasks. We will use `PersonMatcher` to extract director names.

In [None]:
from snorkel.matchers import PersonMatcher, OrganizationMatcher

director_matcher = PersonMatcher(longest_match_only=True)

In [None]:
# ** STUDENT CODE

# TODO: WRITE YOUR PERFORMANCE MATCHER. YOU CAN REUSE EXTRACTORS IN HOMEWORK 2

### Writing a random performance matcher

We design our **random** performance matcher to capture all capitalized `span`s of text that contain the letter `A`.

In [None]:
from snorkel.matchers import RegexMatchEach, LambdaFunctionMatcher

def mention_span_captilized_with_A(mention):
    performance_string = mention.get_span()
    for word in performance_string.split():
        if word[0].islower():
            return False
    if "A" in performance_string:
        return True
    else:
        return False

performance_matcher = LambdaFunctionMatcher(func=mention_span_captilized_with_A)
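
The random matcher above rejects any span containing a lowercase word, so a title such as `Lost on Purpose` can never match. As a rough starting point for your own matcher, here is a minimal sketch of a title-like predicate that tolerates small connecting words. The names `LOWERCASE_OK`, `looks_like_title`, `mention_looks_like_title`, and `FakeMention` are hypothetical; only `get_span()` mirrors the call used in the cell above, and the word list is an illustrative guess rather than a tested rule.

```python
# Small connecting words that may stay lowercase inside a title (illustrative list).
LOWERCASE_OK = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}

def looks_like_title(text):
    """Return True if `text` reads like a title: capitalized words, with
    small connecting words allowed in lowercase after the first word."""
    words = text.split()
    if not words or not words[0][0].isupper():
        return False
    for word in words[1:]:
        if word[0].isupper() or word[0].isdigit():
            continue
        if word.lower() in LOWERCASE_OK:
            continue
        return False
    # A title should not end on a connecting word such as "of" or "the".
    return words[-1].lower() not in LOWERCASE_OK

def mention_looks_like_title(mention):
    # Same signature as the function passed to LambdaFunctionMatcher above.
    return looks_like_title(mention.get_span())

class FakeMention:
    """Stand-in for a Snorkel span so the sketch runs without a database."""
    def __init__(self, text):
        self.text = text

    def get_span(self):
        return self.text

print(mention_looks_like_title(FakeMention("Lost on Purpose")))
print(mention_looks_like_title(FakeMention("written and directed")))
```

A predicate like this could be passed as `func=` to `LambdaFunctionMatcher` in the same way as the random matcher. Expect it to over-generate, since any capitalized phrase passes.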

In [None]:
# ** STUDENT CODE

# TODO: WRITE YOUR DIRECTOR MATCHER. YOU CAN REUSE EXTRACTORS IN HOMEWORK 2

We know that normally each `director` name will contain at least two words (first name, last name). Considering
additional middle names, we expect a maximum of four words per name.

Similarly, we assume the `performance` name to be a `span` of one to seven words.

We use the default `Ngrams` class provided by `Snorkel` to define these properties:

In [None]:
from snorkel.candidates import Ngrams
# ** STUDENT CODE

# TODO: FEEL FREE TO CHANGE THE NGRAMS LENGTH IF YOU WANT
# Performance titles: spans of one to seven words. Director names: up to four words.
performance_ngrams = Ngrams(n_max=7)
director_ngrams = Ngrams(n_max=4)
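
To get a feel for what `n_max` controls, the sketch below enumerates every contiguous span of up to `n_max` tokens in a sentence, which is roughly the search space an `Ngrams` object hands to a matcher. It is a plain-Python illustration, not Snorkel's implementation, and `ngram_spans` is a hypothetical helper name.

```python
def ngram_spans(tokens, n_max):
    """Yield (start, end, text) for every contiguous span of 1..n_max tokens.

    `start` and `end` are token offsets, with `end` exclusive.
    """
    for start in range(len(tokens)):
        for length in range(1, n_max + 1):
            end = start + length
            if end > len(tokens):
                break
            yield start, end, " ".join(tokens[start:end])

tokens = "directed by the Nelms Brothers".split()
spans = list(ngram_spans(tokens, n_max=2))
print(len(spans))  # 5 unigrams + 4 bigrams = 9
print([text for _, _, text in spans if text.startswith("Nelms")])
```

The number of spans grows with `n_max`, so a larger value gives the matcher more chances to find long titles but also slows down candidate extraction.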

We create a candidate that is composed of a `performance` and a `director` mention as we defined above. We name this candidate `performance_director`. The candidate extractor will then extract all such pairs of mentions that appear together in a sentence.

In [None]:
from snorkel.candidates import Ngrams, CandidateExtractor

performance_with_director = candidate_subclass('performance_director', ['performance', 'director'])
# The n-gram spaces and the matchers must be listed in the same order as the
# candidate's arguments: performance first, then director.
cand_extractor = CandidateExtractor(
    performance_with_director,
    [performance_ngrams, director_ngrams],
    [performance_matcher, director_matcher],
)
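
Conceptually, a candidate is one performance mention paired with one director mention from the same sentence. The following sketch shows that pairing on hand-written token offsets. It is a simplified picture under stated assumptions: `pair_spans` is a hypothetical helper, and skipping overlapping spans is a choice made for this sketch, not a description of how the real extractor behaves.

```python
from itertools import product

def pair_spans(performance_spans, director_spans):
    """Pair every performance span with every director span from one sentence.

    Each span is a (start, end, text) tuple of token offsets, end exclusive.
    Pairs whose spans overlap are skipped (an assumption of this sketch).
    """
    pairs = []
    for perf, director in product(performance_spans, director_spans):
        p_start, p_end, p_text = perf
        d_start, d_end, d_text = director
        if p_start < d_end and d_start < p_end:
            continue  # the two spans share at least one token
        pairs.append((p_text, d_text))
    return pairs

# Token offsets for: "directed by the Nelms Brothers : Lost on Purpose"
performances = [(6, 9, "Lost on Purpose")]
directors = [(3, 5, "Nelms Brothers"), (6, 7, "Lost")]
pairs = pair_spans(performances, directors)
print(pairs)
```

Because every matched performance is paired with every matched director, loose matchers multiply the number of candidates you will have to label.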

## Create the development set

We create our development set by generating a `dev_ids.csv` file, which has one column `id` and contains 50 random biography URLs. You can choose any subset of 50 biographies that have `performance` and `director`.
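
If you would rather sample the 50 biographies programmatically, a minimal sketch is shown below. It assumes you already have a list of document identifiers; `write_dev_ids` and the placeholder ids are hypothetical, and the demo writes to a temporary directory so that it does not overwrite your own file.

```python
import csv
import os
import random
import tempfile

def write_dev_ids(doc_names, path, k=50, seed=1701):
    """Sample up to k document names and write them to a one-column CSV with header `id`."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    names = sorted(doc_names)
    chosen = rng.sample(names, min(k, len(names)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id"])
        writer.writerows([name] for name in chosen)
    return chosen

# Placeholder identifiers; in the notebook these would be your biography URLs.
demo_names = [f"bio_{i:03d}" for i in range(120)]
demo_path = os.path.join(tempfile.mkdtemp(), "dev_ids.csv")
chosen = write_dev_ids(demo_names, demo_path)

with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[0], len(rows) - 1)
```

Random sampling does not guarantee that the chosen biographies contain `performance` and `director` pairs, so you may still need to swap some documents by hand.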

In [None]:
import pandas as pd

docs = session.query(Document).order_by(Document.name).all()

gold_data = pd.read_csv("dev_ids.csv")

dev_docs = gold_data["id"].values.tolist()

print(f"Number of dev documents: {len(dev_docs)}")

train_sents = set()
dev_sents   = set()

for doc in docs:
    sents = [s for s in doc.sentences]
    if doc.name in dev_docs:
        dev_sents.update(sents)
    else:
        train_sents.update(sents)
        
print("Number of dev sents:", len(dev_sents))
print("Number of train sents:", len(train_sents))

Finally, we'll apply the candidate extractor to the two sets of sentences. The results will be persisted in the database backend.

In [None]:
%%time
for i, sents in enumerate([train_sents, dev_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(performance_with_director).filter(performance_with_director.split == i).count())

## Label 50 documents in development set

In this task, you will use `SentenceNgramViewer` to label each candidate. Click the green button to mark a candidate as correct, or the red button to mark it as incorrect. Your labels are automatically stored in the database.

In [None]:
from snorkel.models import GoldLabel, GoldLabelKey

def get_gold_labels(session: SnorkelSession, annotator_name: str="gold"):
    # Look up the label key for this annotator (None if nothing has been labeled yet)
    ak = session.query(GoldLabelKey).filter(GoldLabelKey.name == annotator_name).first()
    return session.query(GoldLabel).filter(GoldLabel.key == ak).all()

gold_labels = get_gold_labels(session)
labeled_sents = {lbl.candidate.performance.sentence.id for lbl in gold_labels}
unlabeled = [
    x for x in session.query(performance_with_director).filter(performance_with_director.split == 1).all() 
    if x.performance.sentence.id not in labeled_sents
]
print("Number unlabeled:", len(unlabeled))

**Please remember to label all pairs of mentions, both correct and incorrect ones**

`SentenceNgramViewer` only shows candidates that are matched by your matchers. Therefore, your annotations rest on the assumption that your matchers work perfectly.

In [None]:
# Uncomment and run this if you see "SentenceNgramViewer" text instead of a UI component. Then restart your notebook and refresh your browser.

#!jupyter nbextension enable --py --sys-prefix widgetsnbextension

In [None]:
SentenceNgramViewer(unlabeled, session, annotator_name="gold")

After you finish labeling, execute the cell below to **save your result** to CSV files.

In [None]:
# ** STUDENT CODE

def extract_gold_labels(session: SnorkelSession, annotator_name: str="gold", split: int=None):
    ''' Extract pairwise gold labels and store in a file. '''
    gold_labels = get_gold_labels(session, annotator_name)

    results = []
    for gold_label in gold_labels:
        rel = gold_label.candidate
        if split is not None and rel.split != split:
            continue

        results.append({
            "id": rel.performance.sentence.document.name,
            "performance": rel.performance.get_span(),
            "director": rel.director.get_span(),
            "value": gold_label.value
        })

    return results

gold_labels = extract_gold_labels(session, split=1)
gold_labels

In [None]:
# TODO: CHANGE TO YOUR NAME AND SAVE THE GOLD LABELS (TASK 1)
pd.DataFrame(gold_labels).to_csv("Firstname_Lastname_hw05_gold.dev.csv", index=None)

## Define labeling functions (LFs)

In this task, you will define your own LFs, which Snorkel uses to create a noise-aware training set. Usually, you will go through a couple of iterations (create LFs, test them, and refine them) to come up with a good set of LFs. At the end of this section, we provide a helper to quickly see which candidates your model failed to classify. You can refer to the [Snorkel tutorial](https://github.com/snorkel-team/snorkel-extraction/tree/master/tutorials) for more information.

You are free to write any extra code to create a set of sophisticated LFs. More LF helper functions can be found [here](https://github.com/snorkel-team/snorkel-extraction/blob/master/snorkel/lf_helpers.py).
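
As one possible shape for a pattern-based LF, the sketch below votes from the words immediately to the left of the director mention. It is kept free of Snorkel imports so it can be tried in isolation. The label values (1 = true, -1 = false, 0 = abstain) are assumed from the Snorkel v0.7 tutorials, `vote_directed_by` is a hypothetical helper, and the cue words are illustrative guesses rather than tested rules.

```python
# Label values assumed from the Snorkel v0.7 tutorials: 1 = true, -1 = false, 0 = abstain.
LF_TRUE, LF_FALSE, LF_ABSTAIN = 1, -1, 0

def vote_directed_by(left_tokens):
    """Vote on a candidate from the tokens immediately left of the director mention."""
    left_text = " ".join(left_tokens).lower()
    if "directed by" in left_text:
        return LF_TRUE
    # Cues that the person is more likely a co-star than a director (illustrative).
    if "starring" in left_text or "opposite" in left_text:
        return LF_FALSE
    return LF_ABSTAIN

# Inside the notebook, an LF could wrap this helper, for example by collecting
# the tokens to the left of `c.director` with one of the lf_helpers functions.

print(vote_directed_by(["written", "and", "directed", "by", "the"]))
print(vote_directed_by(["starring", "opposite"]))
print(vote_directed_by(["he", "later", "met"]))
```

An LF that abstains on most candidates is fine as long as it is accurate where it does vote. Coverage comes from combining several such functions.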

In [None]:
# ** STUDENT CODE 

# THESE ARE SOME HELPER FUNCTIONS THAT YOU CAN USE
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

# TODO: DEFINE YOUR LFS HERE. BELOW ARE SOME RANDOM LFS

# Snorkel v0.7 label convention for binary tasks: 1 = true, -1 = false, 0 = abstain.
ABSTAIN = 0
FALSE = -1
TRUE = 1


def random_lf1(c):
    p1 = c.performance.get_word_start()
    p2 = c.director.get_word_start()
    if p1 < p2:
        return TRUE
    else:
        return FALSE
    
def random_lf2(c):
    p1 = c.performance.get_word_start()
    p2 = c.director.get_word_start()
    if p1 > p2:
        return TRUE
    else:
        return FALSE
    
def random_lf3(c):
    p1 = c.performance.get_word_start()
    p2 = c.director.get_word_start()
    if p1 == p2:
        return TRUE
    else:
        return FALSE

In [None]:
# ** STUDENT CODE

# TODO: PUT ALL YOUR LABELING FUNCTIONS HERE

performance_with_director_lfs = [
    random_lf1,
    random_lf2,
    random_lf3
]

## Train generative model

Now, we'll train a model of the LFs to estimate their accuracies. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor. Intuitively, we'll model the LFs by observing how they overlap and conflict with each other.
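
To make "overlap" and "conflict" concrete, here is a small sketch that computes per-LF coverage, overlap, and conflict rates on a toy label matrix. It assumes a dense matrix in which rows are candidates, columns are LFs, and 0 means the LF abstained. `lf_summary` is a hypothetical helper that only illustrates the definitions; it is not Snorkel's code.

```python
def lf_summary(label_matrix):
    """Per-LF coverage, overlap and conflict rates for a dense label matrix.

    Rows are candidates, columns are LFs, and 0 means the LF abstained.
    """
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in label_matrix:
            if row[j] == 0:
                continue  # this LF abstained on this candidate
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlaps += 1   # at least one other LF also voted
            if any(v != row[j] for v in others):
                conflicts += 1  # at least one other LF disagreed
        stats.append({
            "coverage": covered / n_rows,
            "overlaps": overlaps / n_rows,
            "conflicts": conflicts / n_rows,
        })
    return stats

toy_matrix = [
    [1, 1, 0],
    [1, -1, 0],
    [0, 0, 0],
    [1, 0, -1],
]
summary = lf_summary(toy_matrix)
for j, s in enumerate(summary):
    print(j, s)
```

LFs that never overlap with any other LF give the generative model nothing to compare them against, so some overlap between your LFs is desirable.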

In [None]:
np.random.seed(1701)

labeler = LabelAnnotator(lfs=performance_with_director_lfs)
L_train = labeler.apply(split=0)

Get detailed statistics of LFs before training the model

In [None]:
L_train.lf_stats(session)

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 2.3

In [None]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train, epochs=100, decay=0.95, step_size=0.1 / L_train.shape[0], reg_param=1e-6)

print("LF weights:", gen_model.weights.lf_accuracy)

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 2.2

Now that we have learned the generative model, we will measure its performance on the development set that you labeled.

In [None]:
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

In [None]:
L_dev = labeler.apply_existing(split=1)
tp, fp, tn, fn = gen_model.error_analysis(session, L_dev, L_gold_dev)

Get detailed statistics of LFs learned by the model

In [None]:
L_dev.lf_stats(session, L_gold_dev, gen_model.learned_lf_stats()['Accuracy'])

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 2.3

We now apply the generative model to the training candidates to get the noise-aware training label set. We'll refer to these as the training marginals:

In [None]:
train_marginals = gen_model.marginals(L_train)
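
A training marginal is a probability that a candidate is true, rather than a hard label. The toy sketch below shows how weighted LF votes can be turned into such a probability under a deliberately simplified model in which LFs are independent and p(y | votes) is proportional to exp(y * sum_j w_j * vote_j). This is not Snorkel's actual generative model, and `naive_marginal` and the weights are hypothetical.

```python
import math

def naive_marginal(votes, weights):
    """P(y = 1 | votes) under a toy model where
    p(y | votes) is proportional to exp(y * sum_j w_j * vote_j), with y in {-1, +1}."""
    score = sum(w * v for w, v in zip(weights, votes))
    # exp(s) / (exp(s) + exp(-s)) simplifies to a logistic function of 2 * s.
    return 1.0 / (1.0 + math.exp(-2.0 * score))

weights = [0.8, 0.3, 0.5]                       # hypothetical per-LF weights
print(naive_marginal([0, 0, 0], weights))       # every LF abstains
print(naive_marginal([1, 1, 0], weights))       # two LFs agree on "true"
print(naive_marginal([1, -1, -1], weights))     # LFs disagree
```

In this toy model, a candidate on which every LF abstains gets a marginal of exactly 0.5, which is one reason a histogram of marginals often shows a spike in the middle when LF coverage is low.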

We'll look at the distribution of the training marginals:

In [None]:
import matplotlib.pyplot as plt
plt.hist(train_marginals, bins=20)
plt.show()

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 2.4

In [None]:
# TODO: CHANGE THIS CELL TO MARKDOWN CELL AND WRITE YOUR ANSWER TO TASK 2.5 HERE.

You might want to look at some examples in one of the error buckets to improve your LFs. For example, below are the false positives, that is, candidates that we labeled as correct but that are actually incorrect.

In [None]:
SentenceNgramViewer(fp, session)

## Adding Distant Supervision Labeling Function

Distant supervision generates training data automatically using an external, imperfectly aligned training resource, such as a Knowledge Base.

Define an additional distant-supervision-based labeling function which uses Wikidata or DBpedia. With the additional labeling function you added, please make sure to answer all questions for Task 3.3, 3.4, 3.5 mentioned in the homework.
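
The core of a distant-supervision LF is a lookup against known pairs. The sketch below uses a tiny in-memory set in place of a real knowledge base. `KNOWN_PAIRS`, `normalize`, and `vote_from_kb` are hypothetical names, the three pairs come from the example sentence in the Task section, and loading real pairs from Wikidata or DBpedia is not shown. The label values (1 = true, -1 = false, 0 = abstain) are assumed from the Snorkel v0.7 tutorials.

```python
def normalize(name):
    """Lowercase and collapse whitespace so small surface differences still match."""
    return " ".join(name.lower().split())

# Hypothetical in-memory knowledge base of (performance, director) pairs.
KNOWN_PAIRS = {
    (normalize("Lost on Purpose"), normalize("the Nelms Brothers")),
    (normalize("Waffle Street"), normalize("the Nelms Brothers")),
    (normalize("Small Town Crime"), normalize("the Nelms Brothers")),
}
KNOWN_PERFORMANCES = {performance for performance, _ in KNOWN_PAIRS}

def vote_from_kb(performance, director):
    """Return 1 if the pair is in the KB, -1 if the KB lists the performance
    with a different director, and 0 (abstain) if the performance is unknown."""
    pair = (normalize(performance), normalize(director))
    if pair in KNOWN_PAIRS:
        return 1
    if pair[0] in KNOWN_PERFORMANCES:
        return -1
    return 0

print(vote_from_kb("Waffle  Street", "The Nelms Brothers"))
print(vote_from_kb("Waffle Street", "Someone Else"))
print(vote_from_kb("Unknown Film", "Someone Else"))
```

The negative vote is a design choice that assumes the knowledge base lists every director of a performance. If your knowledge base is incomplete, abstaining instead of voting -1 may be the safer option.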

In [None]:
# TODO: ADD YOUR DISTANT SUPERVISION LABELING FUNCTIONS AND ANSWER TASK 3 QUESTIONS

## Training a Discriminative Model

In this final task, we'll use the noisy training labels we generated to train our end extraction model. In particular, we will be training a Bi-LSTM.

In [None]:
train_cands = session.query(performance_with_director).filter(performance_with_director.split == 0).order_by(performance_with_director.id).all()
dev_cands   = session.query(performance_with_director).filter(performance_with_director.split == 1).order_by(performance_with_director.id).all()

In [None]:
from snorkel.annotations import load_gold_labels

L_gold_dev  = load_gold_labels(session, annotator_name='gold', split=1)

Try tuning the hyper-parameters below to get your best F1 score

In [None]:
# ** STUDENT CODE

# TODO: TUNE YOUR HYPERPARAMETERS TO OBTAIN THE BEST RESULTS. WE EXPECT AN F1 SCORE HIGHER THAN 0.7

from snorkel.learning.pytorch import LSTM

train_kwargs = {
    'lr':            0.01, # learning rate of the model
    'embedding_dim': 50,   # size of the feature vector
    'hidden_dim':    50,   # number of nodes in each layer in the model
    'n_epochs':      10,   # number of training epochs
    'dropout':       0.2,  # dropout rate (during learning)
    'batch_size':    64,   # training batch size
    'seed':          1701
}

lstm = LSTM(n_threads=None)
lstm.train(train_cands, train_marginals, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)

## Report performance of your final extractor

In [None]:
p, r, f1 = lstm.score(dev_cands, L_gold_dev)
print("Prec: {0:.3f}, Recall: {1:.3f}, F1 Score: {2:.3f}".format(p, r, f1))
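
For reference, precision, recall, and F1 can be derived from the counts of true positives, false positives, and false negatives. The sketch below spells out the arithmetic. `precision_recall_f1` is a hypothetical helper and the counts are made-up numbers used only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 7 correct extractions, 3 spurious ones, 7 missed ones.
p_demo, r_demo, f1_demo = precision_recall_f1(tp=7, fp=3, fn=7)
print("Prec: {0:.3f}, Recall: {1:.3f}, F1 Score: {2:.3f}".format(p_demo, r_demo, f1_demo))
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.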

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 4

In [None]:
tp, fp, tn, fn = lstm.error_analysis(session, dev_cands, L_gold_dev)

In [None]:
# TODO: MAKE SURE THE ABOVE CELL OUTPUT IS SHOWN IN YOUR PDF VERSION. THIS WILL BE YOUR ANSWER FOR TASK 4

Use your new model to extract relations from the testing documents, and save them to JSON files.

In [None]:
# ** STUDENT CODE

# TODO: EXPORT YOUR PREDICTION OF THE DEV SET TO A CSV FILE