# Part I: The Babble Labble Pipeline

The purpose of this notebook is to introduce the basic pipeline of a Babble Labble application. 

Our task is to classify candidate mentions of spouses from news articles. That is, given a sentence with two identified entities (people), we want to classify whether or not the two people were/are/will soon be married (according to the text). A classifier trained on this task could be used, for example, to populate a knowledge base.

This notebook consists of five steps:
1. Load candidates
2. Collect explanations
3. Parse and filter
4. Aggregate labels
5. Train classifier

Let's get started!

## Step 1: Load Candidates

In [1]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


First, we load the candidates and their target labels.

In [2]:
import pickle

DATA_FILE = 'data/tutorial_data.pkl'
with open(DATA_FILE, 'rb') as f:
    Cs, Ys = pickle.load(f)

Our data is now divided into three splits (80/10/10), which we'll refer to as the training, development (dev), and test splits. In these tutorials, we will do the bulk of our analysis on the dev split to protect the integrity of the held-out test set.

The variables `Cs` and `Ys` are each lists of length 3, corresponding to the three splits; each `C` is a list of candidates, and each `Y` is a NumPy array of gold labels. Our labels are categorical (1=True, 2=False).

In [3]:
print(f"Train Size: {len(Cs[0])}")
print(f"Dev Size:   {len(Cs[1])}")
print(f"Test Size:  {len(Cs[2])}")

Train Size: 8000
Dev Size:   1000
Test Size:  1000
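
Before going further, it can help to check how the two label values are distributed in a split. The sketch below uses a small hypothetical label list (`Y_toy`) rather than the tutorial data, and `label_distribution` is an illustrative helper, not part of Babble Labble.

```python
from collections import Counter

# Hypothetical toy gold labels using the tutorial's convention: 1 = True, 2 = False.
Y_toy = [1, 2, 2, 1, 2, 2, 2, 2]

def label_distribution(Y):
    """Return the fraction of candidates carrying each categorical label."""
    counts = Counter(int(y) for y in Y)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

dist = label_distribution(Y_toy)
print(dist)  # {1: 0.25, 2: 0.75}
```

The same helper should work on one of the `Ys` arrays (for example `label_distribution(Ys[1])`), since it only iterates over the labels.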


Each candidate consists of two spans from the same sentence (which we refer to as X and Y in explanations). These spans correspond to tokens identified as people using a standard NER tagger. Our first candidate from the train split does appear to be an actual pair of spouses, so it should be classified as True by our classifier.

In [4]:
candidate = Cs[0][0]
print(f"Sentence:\n{candidate.text}")
print(f"Candidate:\nX: {candidate[0]}\nY: {candidate[1]}")

Sentence:
His mother Joanna, 36, who lives with husband Ian in a detached house, declined to comment when approached yesterday.   
Candidate:
X: EntityMention(doc_id=14945: 'Joanna'(11:17)
Y: EntityMention(doc_id=14945: 'Ian'(46:49)
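
The numbers printed after each mention, such as `(11:17)`, look like character offsets into the sentence. Assuming they are start-inclusive and end-exclusive (the same convention as Python slicing), the following sketch recovers the two spans from plain strings. The variable names here are illustrative and do not use the `EntityMention` class.

```python
# The sentence from the output above, as a plain string.
sentence = (
    "His mother Joanna, 36, who lives with husband Ian in a detached house, "
    "declined to comment when approached yesterday."
)

# Offsets copied from the printed candidate; assumed to be [start, end).
x_span = (11, 17)
y_span = (46, 49)

def span_text(text, span):
    """Return the substring covered by a (start, end) character span."""
    start, end = span
    return text[start:end]

x_text = span_text(sentence, x_span)
y_text = span_text(sentence, y_span)
# Text strictly between the two mentions, which many explanations refer to.
between = sentence[x_span[1]:y_span[0]]

print(x_text, "|", y_text, "|", repr(between))
```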


## Step 2: Collect Explanations

We now collect a small number of **natural language explanations** for why candidates should be labeled in a certain way. In Part II of this tutorial, you can look at examples from the dataset and write your own explanations. In this first notebook, we load 10 sample explanations as an example.

To improve the coverage of explanations, users may provide aliases, sets of words that can be referred to with a single term. For example, you may define "spouse" words as "husband, wife, spouse, bride, groom" and then refer to these terms collectively in an explanation like "There is at least one spouse word between person1 and person2." We store these user-provided aliases in a dictionary.
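
As a rough illustration of the idea, an alias dictionary can be thought of as a mapping from an alias name to the words it stands for. The dictionary contents, the whitespace tokenization, and the helper `has_alias_word_between` below are all assumptions made for this sketch; the real `aliases` object is loaded from `data.sample_explanations` in the next cell.

```python
# Hypothetical alias dictionary: alias name -> list of words it stands for.
toy_aliases = {"spouse": ["husband", "wife", "spouse", "bride", "groom"]}

def has_alias_word_between(tokens, x_idx, y_idx, alias, aliases):
    """True if any word of the named alias appears strictly between two token positions."""
    lo, hi = sorted((x_idx, y_idx))
    between = {tok.lower() for tok in tokens[lo + 1:hi]}
    return any(word in between for word in aliases[alias])

# Pre-tokenized toy sentence (tokenization is simplified for illustration).
tokens = "His mother Joanna , 36 , who lives with husband Ian in a detached house".split()
x_idx, y_idx = tokens.index("Joanna"), tokens.index("Ian")

fires = has_alias_word_between(tokens, x_idx, y_idx, "spouse", toy_aliases)
print(fires)  # True, because "husband" lies between the two mentions
```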

### Load existing explanations

In [5]:
from data.sample_explanations import explanations, aliases

Here are the first five explanations in our set:

In [6]:
for exp in explanations[:5]:
    print(exp)

Explanation(LF_and_married: 1, "the word 'and' is between X and Y and 'married' within five words of Y")
Explanation(LF_third_wheel: 2, "there is a person between X and Y")
Explanation(LF_married_two_people: 1, "the word 'married' is in the sentence and there are only two people in the sentence")
Explanation(LF_same_person: 2, "X and Y are identical")
Explanation(LF_husband_wife: 1, "there is at least one spouse word between X and Y")


## Step 3: Parse Explanations & Apply Filter Bank

The conversion from Explanations into Labeling Functions (LFs) is performed by an instance of the `Babbler` class. This class includes a semantic parser and filter bank chained together. The semantic parser creates (possibly multiple) candidate LFs for each Explanation, and the filter bank removes as many of these as it can. (See the paper for a description of the different filters.)
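
To give a feel for what filtering on labeling behavior might look like, here is a minimal sketch that drops parses whose "signature" (the vector of labels they assign across candidates) is uniform or duplicates an earlier parse. This is a simplified stand-in written for illustration, not Babble Labble's filter bank; `filter_by_signature` and `L_toy` are hypothetical.

```python
import numpy as np

def filter_by_signature(L):
    """Return indices of columns whose labeling signature is neither uniform nor a repeat.

    L is an (n_candidates, n_parses) array where 0 means abstain.
    """
    kept, seen = [], set()
    for j in range(L.shape[1]):
        signature = tuple(L[:, j].tolist())
        if len(set(signature)) == 1:
            # Same value on every candidate (all abstain or all one label).
            continue
        if signature in seen:
            # Identical behavior to a parse we already kept.
            continue
        seen.add(signature)
        kept.append(j)
    return kept

# Four candidates, four hypothetical parses.
L_toy = np.array([
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
])

kept = filter_by_signature(L_toy)
print(kept)  # [0]: column 1 duplicates column 0; columns 2 and 3 are uniform
```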

In [7]:
from babble import Babbler
babbler = Babbler(Cs, Ys, aliases=aliases)

Grammar construction complete.


In this case, we see that our 10 explanations become 32 parses (labeling functions) that are then filtered back down to 10:

In [8]:
babbler.apply(explanations, split=0)

Building list of target candidate ids...
Collected 10 unique target candidate ids from 10 explanations.
Gathering desired candidates...
Found 10/10 desired candidates
Linking explanations to candidates...
Linked 10/10 explanations
10 explanation(s) out of 10 were parseable.
32 parse(s) generated from 10 explanation(s).
17 parse(s) remain (15 parse(s) removed by DuplicateSemanticsFilter).
14 parse(s) remain (3 parse(s) removed by ConsistencyFilter).
Applying labeling functions to investigate labeling signature.

14 parse(s) remain (0 parse(s) removed by UniformSignatureFilter: (0 None, 0 All)).
11 parse(s) remain (3 parse(s) removed by DuplicateSignatureFilter).
10 parse(s) remain (1 parse(s) removed by LowestCoverageFilter).
Added 10 parse(s) from 10 explanations to set. (Total # parses = 10)

Applying labeling functions to split 1

Added 986 labels to split 1: L.nnz = 986, L.shape = (1000, 10).
Applying labeling functions to split 2

Added 980 labels to split 2: L.nnz = 980, L.shape =

### Apply LFs

Now that we have our final (filtered) set of LFs, we can label all three splits of our data to get our label matrices, which we'll store in a list called `Ls`, similar to our `Cs` and `Ys` lists.

In [9]:
Ls = []
for split in [0,1,2]:
    L = babbler.get_label_matrix(split)
    Ls.append(L)

Retrieved label matrix for split 0: L.nnz = 7838, L.shape = (8000, 10)
Retrieved label matrix for split 1: L.nnz = 986, L.shape = (1000, 10)
Retrieved label matrix for split 2: L.nnz = 980, L.shape = (1000, 10)


Each label matrix is an n × m sparse matrix (n candidates, m labeling functions) where `L[i, j]` is the label given by labeling function j to candidate i. Most of the entries in L are 0 (representing an abstention), since most labeling functions apply to only a small portion of the candidates.

In [10]:
Ls[0]

<8000x10 sparse matrix of type '<class 'numpy.int64'>'
	with 7838 stored elements in Compressed Sparse Row format>
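
To make the structure of a label matrix concrete, the sketch below builds a tiny one by hand with `scipy.sparse.csr_matrix`. The votes are hypothetical, and only the entries we list are stored; everything else is an implicit 0 (abstain).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (candidate, LF, label) votes; anything not listed is an abstention (0).
rows = np.array([0, 0, 2, 3])
cols = np.array([0, 2, 1, 2])
vals = np.array([1, 2, 2, 1])

# 5 candidates x 3 labeling functions.
L_toy = csr_matrix((vals, (rows, cols)), shape=(5, 3))

print(L_toy.shape)  # (5, 3)
print(L_toy.nnz)    # 4 stored (non-abstain) entries

# Fraction of cells that hold a label rather than an abstention.
density = L_toy.nnz / (L_toy.shape[0] * L_toy.shape[1])
print(round(density, 3))
```

Applying the same arithmetic to the output above, 7838 stored entries out of 8000 × 10 = 80,000 cells means roughly 9.8% of the entries in the training label matrix are non-abstentions.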

## Step 4: Aggregate Labels

We now aggregate the noisy labels in L into one label per example. We do this with the `LabelModel` class from [Snorkel MeTaL](https://github.com/HazyResearch/metal), which implements a new matrix approximation approach to data programming with significantly improved speed and scaling properties.
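
For intuition about what "aggregating" means, here is a deliberately simple baseline: an unweighted majority vote over the non-abstaining labeling functions. This is not the matrix approximation approach that `LabelModel` implements; it is a swapped-in, simpler technique shown only for illustration, and the function name, tie handling, and toy matrix are assumptions.

```python
import numpy as np

def majority_vote(L, n_classes=2, tie_label=0):
    """Aggregate an (n, m) matrix of votes in {0 = abstain, 1..n_classes} by plurality.

    Candidates with no votes, or with a tie for first place, receive `tie_label`.
    """
    L = np.asarray(L)
    # counts[i, c - 1] = number of labeling functions voting class c on candidate i.
    counts = np.stack([(L == c).sum(axis=1) for c in range(1, n_classes + 1)], axis=1)
    best = counts.max(axis=1)
    winners = counts.argmax(axis=1) + 1
    is_tie = (counts == best[:, None]).sum(axis=1) > 1
    return np.where(is_tie, tie_label, winners)

L_toy = np.array([
    [1, 1, 2],  # two votes for 1, one for 2
    [0, 2, 0],  # a single vote for 2
    [1, 2, 0],  # tie
    [0, 0, 0],  # no votes at all
])

Y_mv = majority_vote(L_toy)
print(Y_mv.tolist())  # [1, 2, 0, 0]
```

Unlike this baseline, a learned label model can weight labeling functions differently, which is why the tutorial uses one.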

To run the label model with a single setting, we can do the following:

In [11]:
from metal import LabelModel

label_aggregator = LabelModel()
label_aggregator.train(Ls[0], n_epochs=50, lr=0.01)
label_aggregator.score(Ls[1], Ys[1])

Computing O...
Estimating \mu...
[E:0]	Train Loss: 4.957
[E:10]	Train Loss: 0.101
[E:20]	Train Loss: 0.267
[E:30]	Train Loss: 0.127
[E:40]	Train Loss: 0.023
[E:49]	Train Loss: 0.032
Finished Training
Accuracy: 0.287


0.287

Or we can perform a random search to identify the hyperparameters that result in the best F1 score on the dev set.

In [12]:
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

[SUMMARY]
Best model: [5]
Best config: {'n_epochs': 100, 'show_plots': False, 'lr': 0.0037849826648026384, 'seed': 127}
Best score: 0.6968838526912181


Notice that our labeling functions have limited coverage. In fact, over 40% of our candidates do not have a single label.

In [13]:
from metal.analysis import label_coverage

print(f"Fraction of dev data with at least one label: {label_coverage(Ls[1])}")

Fraction of dev data with at least one label: 0.58
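
The coverage number can also be computed by hand, which makes its definition explicit: the fraction of rows in the label matrix with at least one non-zero entry. The sketch below works on a small dense NumPy array with hypothetical votes; `coverage` and `lf_coverage` are illustrative helpers, not the `metal.analysis` functions, and a sparse matrix would need to be handled differently.

```python
import numpy as np

def coverage(L):
    """Fraction of candidates (rows) that receive at least one non-abstain label."""
    return float((L != 0).any(axis=1).mean())

def lf_coverage(L):
    """Per-labeling-function fraction of candidates labeled (one value per column)."""
    return (L != 0).mean(axis=0)

# 5 candidates x 3 labeling functions; 0 = abstain.
L_toy = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [0, 2, 2],
    [0, 0, 0],
    [1, 0, 2],
])

print(coverage(L_toy))     # 0.6
print(lf_coverage(L_toy))  # per-column coverage: 0.4, 0.2, 0.4
```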


Because of this limited coverage, we won't use the label aggregator as our final classifier. Instead, we'll use it to generate approximate labels for our training set, which will then be used to train a discriminative classifier. In a typical data programming pipeline, we would generate probabilistic labels here. In this tutorial, we want to take advantage of scikit-learn's blazing fast LogisticRegression classifier, so we'll just use normal hard labels.

In [14]:
Y_p = label_aggregator.predict(Ls[0])
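
To make the difference between probabilistic and hard labels concrete, here is a minimal sketch that converts a hypothetical matrix of class probabilities into hard categorical labels by taking the most likely class. The array `Y_probs` is made up for illustration, and breaking ties toward the first class is simply NumPy's `argmax` behavior, not a claim about how `predict` resolves ties.

```python
import numpy as np

# Hypothetical probabilistic labels: one row per candidate,
# columns are P(label = 1) and P(label = 2).
Y_probs = np.array([
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
])

# argmax gives a 0-based column index; add 1 to match the 1/2 label convention.
Y_hard = Y_probs.argmax(axis=1) + 1
print(Y_hard.tolist())  # [1, 2, 1]
```

Hard labels discard the confidence information (0.90 and 0.50 both become 1), which is the trade-off accepted here in exchange for a simpler classifier.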

## Step 5: Train Classifier

There are a variety of reasons why we might find it advantageous to train a discriminative model rather than use the label aggregator directly. Some of these include:

* **increased coverage**:  
As alluded to above, our labeling functions often do not provide labels for all examples in our training set. A trained discriminative model, however, can make informed predictions about any candidate that has features for which it has learned weights.
* **improved generalization**:  
One of the long-standing success stories in weak supervision is distant supervision (e.g., using a database of known spouses to vote positive on those candidates). However, the goal of distant supervision is to _generalize_ beyond the known examples, not memorize them. Similarly, passing supervision information from the user to the model in the form of a dataset--rather than hard rules--facilitates such generalization.
* **larger feature set**:  
The label model uses only those "features" described in labeling functions; by training a discriminative model, however, we open the door to using larger sets of features known to be helpful in our domain, or learning features appropriate for the problem via deep learning.
* **faster execution**:  
For the label model to make a prediction on a new example, it must execute all of its labeling functions, some of which may be expensive (e.g., requiring database lookups). A trained discriminative model, however, requires only a single forward pass through the network, often making it faster to execute.
* **servable features**:  
Sometimes, there are features that are convenient to supervise over, but hard to use in a servable model (e.g., statistics aggregated over time, features generated by heavy-weight third-party tools, etc.). Training a discriminative model on the label model's outputs allows us to transfer the supervision signal to a new serving environment.

For additional discussion of this topic in the larger context of a shift toward "Software 2.0" systems, see our [technical report](https://ajratner.github.io/assets/papers/software_2_mmt_vision.pdf).

In this tutorial, we use a very simple feature set with a simple logistic regression model for the sake of simplicity and fast runtimes. However, these can easily be swapped out for more advanced features and more sophisticated models.

### Generate Features

Our feature set is simply a bag of ngrams (size 1-3) for the text between the two entities in a relation, plus a small amount of additional context on either side. The text is preprocessed by lowercasing, removing stopwords, and replacing entities with generic markers.
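
The following pure-Python sketch illustrates this kind of featurization on a single tokenized sentence. It is not the `RelationNgramFeaturizer` implementation: the stopword list, the `PERSON_X`/`PERSON_Y` marker names, the one-token context window, and the choice to drop punctuation are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stopword list (the real featurizer uses NLTK's list).
STOPWORDS = {"a", "an", "the", "in", "with", "who", "his", "her"}

def relation_ngrams(tokens, x_idx, y_idx, window=1, max_n=3):
    """Bag of 1..max_n-grams over the tokens from X to Y plus `window` tokens of context."""
    lo, hi = sorted((x_idx, y_idx))
    start, end = max(0, lo - window), min(len(tokens), hi + window + 1)
    cleaned = []
    for i in range(start, end):
        if i == x_idx:
            cleaned.append("PERSON_X")  # generic marker replaces the entity text
        elif i == y_idx:
            cleaned.append("PERSON_Y")
        else:
            tok = tokens[i].lower()
            if tok.isalnum() and tok not in STOPWORDS:
                cleaned.append(tok)
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(cleaned) - n + 1):
            bag[" ".join(cleaned[i:i + n])] += 1
    return bag

tokens = "His mother Joanna , 36 , who lives with husband Ian in a detached house".split()
bag = relation_ngrams(tokens, tokens.index("Joanna"), tokens.index("Ian"))
print(sorted(bag))
```

On this sentence the cleaned token sequence is `mother PERSON_X 36 lives husband PERSON_Y`, so n-grams such as `husband PERSON_Y` become features that a classifier can learn weights for.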

In [15]:
from metal.contrib.featurizers.ngram_featurizer import RelationNgramFeaturizer

featurizer = RelationNgramFeaturizer(min_df=3)
featurizer.fit(Cs[0])
Xs = [featurizer.transform(C) for C in Cs]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bradenjh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The resulting `X` objects (one per split) are sparse one-hot matrices.

In [16]:
Xs[0]

<8000x30879 sparse matrix of type '<class 'numpy.int64'>'
	with 233249 stored elements in Compressed Sparse Row format>

### Train Model

Once again, we perform random search over hyperparameters to select the best model.

In [17]:
from babble.disc_model import LogisticRegressionWrapper
from metal.metrics import metric_score

search_space = {
    'C': {'range': [0.0001, 1000], 'scale': 'log'},
    'penalty': ['l1', 'l2'],
}

tuner = RandomSearchTuner(LogisticRegressionWrapper, seed=123)
disc_model = tuner.search(
    search_space, 
    train_args=[Xs[0], Y_p],
    X_dev=Xs[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

[SUMMARY]
Best model: [19]
Best config: {'penalty': 'l2', 'C': 25.184688168733086, 'seed': 141}
Best score: 0.6931818181818181
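
A search space entry such as `{'range': [0.0001, 1000], 'scale': 'log'}` suggests that values are drawn uniformly in log space, so that each order of magnitude is equally likely. We are assuming, not confirming, that the tuner behaves this way; the sketch below only illustrates log-uniform sampling in general, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    lo, hi = sorted((low, high))
    return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

rng = random.Random(123)
samples = [sample_log_uniform(0.0001, 1000, rng) for _ in range(1000)]

# Four of the seven decades in [1e-4, 1e3] lie below 1, so a little over half
# of the draws should fall there, unlike with plain uniform sampling.
frac_below_one = sum(s < 1 for s in samples) / len(samples)
print(round(frac_below_one, 2))
```

Sampling this way matters for a parameter like the regularization strength `C`, where the difference between 0.001 and 0.01 is as meaningful as the difference between 10 and 100.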


### Evaluation

In this case, even with a very simple model class and feature set, we see that the discriminative model performs on par with the label aggregator. In other words, the supervision signal provided via natural language explanations has been successfully transferred to a more transportable, generalizable discriminative model via an auto-generated labeled training set!

In [18]:
pr, re, f1 = label_aggregator.score(Ls[1], Ys[1], metric=['precision', 'recall', 'f1'])

Precision: 0.764
Recall: 0.641
F1: 0.697


In [19]:
pr, re, f1 = disc_model.score(Xs[1], Ys[1], metric=['precision', 'recall', 'f1'])

Precision: 0.762
Recall: 0.635
F1: 0.693
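
For reference, the three metrics reported above can be computed from scratch as follows, treating label 1 (True) as the positive class. The label lists are hypothetical and `precision_recall_f1` is an illustrative helper, not the `score` method used in the cells above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels and predictions (1 = True, 2 = False).
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 1, 2, 2, 2]

pr_toy, re_toy, f1_toy = precision_recall_f1(y_true, y_pred)
print(f"Precision: {pr_toy:.3f}  Recall: {re_toy:.3f}  F1: {f1_toy:.3f}")
```

The same formula ties the reported numbers together: for the label aggregator, 2 · 0.764 · 0.641 / (0.764 + 0.641) ≈ 0.697, and for the discriminative model, 2 · 0.762 · 0.635 / (0.762 + 0.635) ≈ 0.693.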


### Saving

Before we move on to our next notebook, we'll save the `Ls` and training set predictions `Y_p` in pickles so we can use them in the other notebooks without having to repeat the parsing and labeling process.

In [20]:
import pickle

with open("Ls.pkl", 'wb') as f:
    pickle.dump(Ls, f)
    
with open("Y_p.pkl", 'wb') as f:
    pickle.dump(Y_p, f)
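
A later notebook can then read these files back with `pickle.load`. The sketch below shows the round trip on a small stand-in object written to a temporary directory, so it does not touch the real `Ls.pkl` or `Y_p.pkl`; the helper names are illustrative.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write an object to `path` in binary pickle format."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read back an object previously written with save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for the training set predictions (1 = True, 2 = False).
toy_predictions = [1, 2, 2, 1]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Y_p.pkl")
    save_pickle(toy_predictions, path)
    restored = load_pickle(path)

print(restored == toy_predictions)  # True
```

As with any pickle file, only load files that you created yourself or otherwise trust.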