# Using Snorkel to Extract Education of Actresses and Actors

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>CSCI 563/INF 558: Building Knowledge Graphs</u> during Spring 2020 at University of Southern California (USC).</sub>

**Notes**: 
- You are supposed to write your code or modify our code in any cell starting with `# ** STUDENT CODE`.
- Much of this notebook's content is borrowed from the Snorkel Introduction Tutorial.

State-of-the-art extraction techniques require massive labeled training sets, which are costly to obtain. To overcome this problem, Snorkel helps you rapidly create training sets using the data programming paradigm. To start, developers write a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel uses a generative model to learn how to combine those labeling functions to label more data. The newly labeled data can then be used to train high-quality end models.
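
To make the idea concrete, here is a minimal, library-free sketch of data programming. The two labeling functions, the example sentences, and the majority-vote combiner are hypothetical illustrations and not part of Snorkel's API; Snorkel replaces the simple vote with a learned generative model. The sketch assumes the label convention 1 (positive), -1 (negative), and 0 (abstain).

```python
# Hypothetical labeling functions over raw sentences (illustration only).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_graduated(sentence):
    # Vote positive when the sentence mentions graduating from somewhere.
    return POSITIVE if "graduated from" in sentence.lower() else ABSTAIN

def lf_film_studio(sentence):
    # Vote negative when the organization is likely a studio, not a school.
    return NEGATIVE if "studio" in sentence.lower() else ABSTAIN

def majority_vote(sentence, lfs):
    # Sum the votes; the sign of the total is the combined (noisy) label.
    total = sum(lf(sentence) for lf in lfs)
    return POSITIVE if total > 0 else NEGATIVE if total < 0 else ABSTAIN

lfs = [lf_graduated, lf_film_studio]
examples = [
    "She graduated from Example University in 1975.",
    "He signed a contract with a major film studio.",
    "She was born in a small town.",
]
noisy_labels = [majority_vote(s, lfs) for s in examples]
print(noisy_labels)
```

A plain vote treats every labeling function as equally reliable; the generative model trained later in this notebook instead estimates how much to trust each one.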

**In summary, in this task, you will first manually label 99 documents and use this labeled data as a development set to create your own labeling functions. Then, you will train a generative model to label 1025 documents in the training set. Finally, you will train a discriminative model (Bi-LSTM) to produce your final extraction model!**

## Prepare environment

Let's install the packages we will use:

In [None]:
!pip install -r requirements.txt

We will work with Snorkel version 0.7 (Beta), which we can retrieve by running the following command:

In [None]:
!curl -L "https://github.com/snorkel-team/snorkel/archive/v0.7.0-beta.tar.gz" -o snorkel_v0_7_0.tar.gz

Now let's uncompress the package and install Snorkel:

In [None]:
!tar -xvzf snorkel_v0_7_0.tar.gz

In [None]:
!pip install snorkel-0.7.0-beta/

## Creating a development set

Before you proceed with Task 1.1, we need to preprocess our documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We'll also create _candidates_ out of these contexts, which are the objects we want to classify; in this case, possible mentions of schools and colleges that the cast have attended. Finally, we'll load some gold labels for evaluation.

All of this preprocessed input data is saved to a database. In Snorkel, if no database is specified, then a SQLite database at `./snorkel.db` is created by default -- so no setup is needed here!

In [None]:
# ** STUDENT CODE

import numpy as np, os
from pathlib import Path

from snorkel import SnorkelSession
from snorkel.parser import TSVDocPreprocessor, CorpusParser
from snorkel.parser.spacy_parser import Spacy
from snorkel.models import Document, Sentence, candidate_subclass
from snorkel.viewer import SentenceNgramViewer
from snorkel.annotations import LabelAnnotator, load_gold_labels

# Helper functions from the homework's `utils` module; later cells call them.
from utils import reload_external_labels, save_gold_labels, save_predicted_relations, \
    save_gold_relations, get_dev_doc_ids, get_test_doc_ids, get_gold_labels, number_of_people

# TODO: Set location where you store your homework 5 files
if 'HW_DIR' not in os.environ:
    # HW_DIR = Path("/.../Homework05")
    HW_DIR = Path(os.getcwd())
else:
    HW_DIR = Path(os.environ['HW_DIR'])
    assert HW_DIR.exists()

**Initializing a `SnorkelSession`**

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

session = SnorkelSession()

**Loading the Corpus**

Next, we load and pre-process the corpus of documents.

In [None]:
doc_preprocessor = TSVDocPreprocessor(HW_DIR / 'cast_bios.tsv')

**Running a `CorpusParser`**

We'll use [Spacy](https://spacy.io/), an NLP preprocessing tool, to split our documents into sentences and tokens, and provide named entity annotations.

In [None]:
corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor)

We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [None]:
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

**Generating Candidates**

The next step is to extract _candidates_ from our corpus. A `Candidate` in Snorkel is an object for which we want to make a prediction. In this case, each candidate is a pair of a person and an organization mentioned in the same sentence.
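
As a rough sketch of what candidate generation amounts to, the snippet below pairs every person mention with every organization mention found in one sentence. The function `candidate_pairs` and the example mentions are hypothetical stand-ins; Snorkel's own extractor works on span objects stored in the database rather than on plain strings.

```python
from itertools import product

def candidate_pairs(person_spans, org_spans):
    # Every (person, organization) combination in one sentence is a candidate.
    return list(product(person_spans, org_spans))

# Hypothetical entity mentions, as a named entity recognizer might tag them.
people = ["Jane Doe"]
orgs = ["Example University", "Example Pictures"]

pairs = candidate_pairs(people, orgs)
for person, org in pairs:
    print(person, "->", org)
```

Most of these pairs are not education relations, which is why each candidate still needs to be classified.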

The [Spacy](https://spacy.io/) parser we used performs _named entity recognition_ for us. Next, we'll split up the documents into train, development, and test splits; and collect the associated sentences.

In [None]:
Education = candidate_subclass('Education', ['person', 'organization'])

In [None]:
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import PersonMatcher, OrganizationMatcher

ngrams         = Ngrams(n_max=7)
person_matcher = PersonMatcher(longest_match_only=True)
org_matcher    = OrganizationMatcher(longest_match_only=True)
cand_extractor = CandidateExtractor(Education, [ngrams, ngrams], [person_matcher, org_matcher])

In [None]:
docs = session.query(Document).order_by(Document.name).all()

dev_docs = get_dev_doc_ids(HW_DIR / "cast.dev.txt")
test_docs = get_test_doc_ids(HW_DIR / "cast.test.txt")

train_sents = set()
dev_sents   = set()
test_sents  = set()

for doc in docs:
    sents = (s for s in doc.sentences if number_of_people(s) <= 5)
    if doc.name in dev_docs:
        dev_sents.update(sents)
    elif doc.name in test_docs:
        test_sents.update(sents)
    else:
        train_sents.update(sents)
        
print("Number of dev sents:", len(dev_sents))
print("Number of train sents:", len(train_sents))
print("Number of test sents:", len(test_sents))

Finally, we'll apply the candidate extractor to the three sets of sentences. The results will be persisted in the database backend.

In [None]:
%%time
for i, sents in enumerate([train_sents, dev_sents, test_sents]):
    cand_extractor.apply(sents, split=i)
    print("Number of candidates:", session.query(Education).filter(Education.split == i).count())

## Task 1.1. Label 99 documents in the development set

In this task, you will use `SentenceNgramViewer` to label each mention. Click the green button to mark a candidate as correct, or the red button to mark it as incorrect. Your labels are automatically stored in the database.

In [None]:
gold_labels = get_gold_labels(session)
labeled_sents = {lbl.candidate.person.sentence.id for lbl in gold_labels}
unlabeled = [
    x for x in session.query(Education).filter(Education.split == 1).all() 
    if x.person.sentence.id not in labeled_sents
]
print("Number unlabeled:", len(unlabeled))

In [None]:
SentenceNgramViewer(unlabeled, session, annotator_name="gold")

After you finish labeling, execute the cell below to **save your result** to JSON files.

In [None]:
# ** STUDENT CODE

# TODO: change to your name
save_gold_labels(session, HW_DIR / "Firstname_Lastname_hw05_gold_labels.dev.json", split=1)
save_gold_relations(session, HW_DIR / "Firstname_Lastname_hw05_extracted_relation.dev.json", split=1)

## Tasks 1.2 & 1.3: Define labeling functions (LFs)

In this task, you will define your own LFs, which Snorkel uses to create a noise-aware training set. Usually, you will go through a couple of iterations (create LFs, test them, and refine them) to come up with a good set of LFs. At the end of this section, we provide a helper to quickly see which candidates your model failed to classify. You can refer to the Snorkel tutorial or online documentation for more information.

You are free to write any extra code to create a set of sophisticated LFs. For example, you could build a list of universities and check whether your candidate matches it.
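
The keyword-matching idea can be sketched without Snorkel as follows. Everything here is an assumption made for illustration: `Cand` is a plain stand-in for a Snorkel candidate, the keyword sets are hypothetical, and the label convention is taken to be 1 (positive), -1 (negative), and 0 (abstain).

```python
from collections import namedtuple

# Stand-in for a Snorkel candidate: plain strings instead of span objects.
Cand = namedtuple("Cand", ["person", "organization", "between"])

# Hypothetical keyword lists; extend them from what you see in the dev set.
SCHOOL_WORDS = {"university", "college", "school", "academy", "institute"}
EDUCATION_CUES = {"graduated", "attended", "studied", "enrolled"}

def LF_school_keyword(c):
    # Vote negative when the organization name has no school-like word.
    words = set(c.organization.lower().split())
    return 0 if words & SCHOOL_WORDS else -1

def LF_education_cue(c):
    # Vote positive when an education verb appears between the two mentions.
    words = set(c.between.lower().split())
    return 1 if words & EDUCATION_CUES else 0

c1 = Cand("Jane Doe", "Example University", "graduated from")
c2 = Cand("Jane Doe", "Example Pictures", "signed with")
print(LF_school_keyword(c1), LF_education_cue(c1))
print(LF_school_keyword(c2), LF_education_cue(c2))
```

In the notebook itself, the same checks would read their text from the candidate's spans, for example with Snorkel helpers such as `get_between_tokens`.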

In [None]:
# ** STUDENT CODE 

# These are some example snorkel helpers you can use...
from snorkel.lf_helpers import (
    get_left_tokens, get_right_tokens, get_between_tokens,
    get_text_between, get_tagged_text,
)

# TODO: Define your LFs here; below is a very simple LF that labels every candidate as negative

def LF_sample(c):
    return -1

In [None]:
# ** STUDENT CODE

# TODO: store all of your labeling functions into LFs

LFs = [LF_sample]

**Train generative model**

In [None]:
np.random.seed(1701)

labeler = LabelAnnotator(lfs=LFs)
L_train = labeler.apply(split=0)

In [None]:
from snorkel.learning import GenerativeModel

gen_model = GenerativeModel()
gen_model.train(L_train, epochs=100, decay=0.95, step_size=0.1 / L_train.shape[0], reg_param=1e-6)

print("LF weights:", gen_model.weights.lf_accuracy)

We now apply the generative model to the training candidates to get the noise-aware training label set. We'll refer to these as the training marginals:

In [None]:
train_marginals = gen_model.marginals(L_train)
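
For intuition about what a marginal is, the sketch below computes one under a deliberately simplified assumption: the labeling functions are independent and each has an already-learned log-odds weight. The function `sketch_marginals`, the label matrix, and the weights are all hypothetical, and this is not Snorkel's actual implementation.

```python
import numpy as np

def sketch_marginals(L, weights):
    # L: (n_candidates, n_lfs) matrix with entries in {-1, 0, 1}.
    # weights: one log-odds weight per LF (assumed to be already learned).
    scores = L @ weights                  # abstentions (0) contribute nothing
    return 1.0 / (1.0 + np.exp(-scores))  # sigmoid gives P(label = 1)

L = np.array([[1, 1], [1, -1], [0, 0], [-1, -1]])
weights = np.array([1.5, 0.5])
marginals = sketch_marginals(L, weights)
print(np.round(marginals, 3))
```

A candidate on which every labeling function abstains gets a marginal of 0.5, while agreement among highly weighted functions pushes the value toward 0 or 1.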

We'll look at the distribution of the training marginals:

In [None]:
import matplotlib.pyplot as plt
plt.hist(train_marginals, bins=20)
plt.show()

Now that we have learned the generative model, we will measure its performance on the development set. The cell below also reloads the provided test-set gold labels:

In [None]:
# Load test-set first
reload_external_labels(session, HW_DIR / "gold_labels.test.json")
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

In [None]:
L_dev = labeler.apply_existing(split=1)
tp, fp, tn, fn = gen_model.error_analysis(session, L_dev, L_gold_dev)

Get detailed statistics of the LFs learned by the model:

In [None]:
L_dev.lf_stats(session, L_gold_dev, gen_model.learned_lf_stats()['Accuracy'])

You might want to look at some examples in one of the error buckets to improve your LFs. For example, the cell below shows the false negatives, that is, true mentions that we failed to label as positive:

In [None]:
SentenceNgramViewer(fn, session)

## Task 1.4. Training an End Extraction Model

In this final task, we'll use the noisy training labels we generated to train our end extraction model. In particular, we will be training a Bi-LSTM.
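
One common way to train on probabilistic labels is to minimize the expected cross-entropy with respect to the marginals. The sketch below illustrates that idea with NumPy; `noise_aware_loss` and the example values are hypothetical, and the loss used internally by Snorkel's `LSTM` may differ in its details.

```python
import numpy as np

def noise_aware_loss(predicted, marginals, eps=1e-12):
    # Expected cross-entropy: each candidate counts as positive with
    # probability marginals[i] and as negative otherwise.
    p = np.clip(predicted, eps, 1 - eps)
    losses = -(marginals * np.log(p) + (1 - marginals) * np.log(1 - p))
    return float(np.mean(losses))

marginals = np.array([0.9, 0.1, 0.5])
agreeing = noise_aware_loss(np.array([0.9, 0.1, 0.5]), marginals)
disagreeing = noise_aware_loss(np.array([0.1, 0.9, 0.5]), marginals)
print(agreeing, disagreeing)
```

Predictions that agree with confident marginals are rewarded, while candidates with marginals near 0.5 have little influence on which way the model is pushed.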

In [None]:
train_cands = session.query(Education).filter(Education.split == 0).order_by(Education.id).all()
dev_cands   = session.query(Education).filter(Education.split == 1).order_by(Education.id).all()
test_cands  = session.query(Education).filter(Education.split == 2).order_by(Education.id).all()

In [None]:
from snorkel.annotations import load_gold_labels

L_gold_dev  = load_gold_labels(session, annotator_name='gold', split=1)
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)

Try tuning the hyper-parameters below to get your best F1 score:

In [None]:
# ** STUDENT CODE

# TODO: tune your hyper-parameters for best results

from snorkel.learning.pytorch import LSTM

train_kwargs = {
    'lr':            0.01, # learning rate of the model
    'embedding_dim': 50,   # size of the feature vector
    'hidden_dim':    50,   # number of nodes in each layer in the model
    'n_epochs':      10,   # number of training epochs
    'dropout':       0.2,  # dropout rate (during learning)
    'batch_size':    64,   # training batch size
    'seed':          1701
}

lstm = LSTM(n_threads=None)
lstm.train(train_cands, train_marginals, X_dev=dev_cands, Y_dev=L_gold_dev, **train_kwargs)

**Report performance of your final extractor**

In [None]:
p, r, f1 = lstm.score(test_cands, L_gold_test)
print("Prec: {0:.3f}, Recall: {1:.3f}, F1 Score: {2:.3f}".format(p, r, f1))
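
For reference, precision, recall, and F1 can be computed from aligned predicted and gold labels as in the sketch below. The function `precision_recall_f1` and the example labels are hypothetical, and labels are assumed to be 1 (positive) or -1 (negative).

```python
def precision_recall_f1(predicted, gold):
    # Labels are 1 (positive) or -1 (negative), aligned by position.
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == -1)
    fn = sum(1 for p, g in zip(predicted, gold) if p == -1 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1_score

predicted = [1, 1, -1, -1, 1]
gold = [1, -1, 1, -1, 1]
prec, rec, f1_score = precision_recall_f1(predicted, gold)
print("Prec: {0:.3f}, Recall: {1:.3f}, F1 Score: {2:.3f}".format(prec, rec, f1_score))
```

True negatives do not enter any of the three scores, so a model cannot raise its F1 simply by rejecting the many non-education pairs.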

In [None]:
tp, fp, tn, fn = lstm.error_analysis(session, test_cands, L_gold_test)

Use your new model to extract relations from the test documents, and save them to a JSON file.

In [None]:
# ** STUDENT CODE

# TODO: change to your name
save_predicted_relations(HW_DIR / "Firstname_Lastname_hw05_extracted_relation.test.json", test_cands, lstm.predictions(test_cands))