# Part II: Writing Explanations

In this notebook, we'll walk through how to create your own explanations that can be fed into Babble Labble.

Creating explanations generally happens in five steps:
1. View candidates
2. Write explanations
3. Get feedback
4. Update explanations 
5. Apply label aggregator

Steps 3-5 are optional; explanations may be submitted without any feedback on their quality. However, in our experience, checking how well explanations are parsed and what their accuracy and coverage are on a dev set (if one is available) can quickly lead to simple improvements that yield significantly more useful labeling functions. Once a few labeling functions have been collected, you can use the label aggregator to identify candidates that are being mislabeled and write additional explanations targeting those failure modes.

We'll walk through each of the steps individually with examples; at the end of the notebook is an area for you to iterate with your own explanations.

## Step 0: Setup

Once again, we need to first load the data (candidates and labels) from the pickle.

In [1]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import pickle

DATA_FILE = 'data/tutorial_data.pkl'
with open(DATA_FILE, 'rb') as f:
    Cs, Ys = pickle.load(f)

## Step 1: View Candidates

We've combined most of the steps required for writing explanations into a single class for convenience: the `BabbleStream`. This will allow you to view candidates, submit explanations, analyze the resulting parses, save explanations that you're satisfied with, and generate label matrices from the parses you've saved so far. (The `Babbler` class seen in Tutorial 1 is simply a subclass of `BabbleStream` that submits explanations as a batch and commits them immediately, for non-iterative workflows.)

In [3]:
from babble import BabbleStream

babbler = BabbleStream(Cs, Ys, balanced=True, shuffled=True, seed=321)

Grammar construction complete.


Now that the `BabbleStream` has been instantiated, we can run the cell below repeatedly to iterate through candidates for labeling. Some candidates will prove very difficult to give explanations for; **feel free to skip these**! The number of unlabeled candidates is often orders of magnitude larger than the number of explanations we need, so we can afford to skip the tricky ones.

Since many explanations end up referring to distances between words, each candidate will be displayed in two ways: as a single string, and as a list of tokens. In both cases, curly brackets have been placed around the entities; these are shown for your convenience only and are not actually a part of the raw text.

In [4]:
from babble.utils import display_candidate

candidate = babbler.next()
display_candidate(candidate)

Legendary Russian actor {Ivan Krasko} married his 24-year - old fiance {Natalia Shevel} , a former student of his , in a secret ceremony attended only by close friends and family yesterday in St Petersburg .

['Legendary', 'Russian', 'actor', '{Ivan', 'Krasko}', 'married', 'his', '24-year', '-', 'old', 'fiance', '{Natalia', 'Shevel}', ',', 'a', 'former', 'student', 'of', 'his', ',', 'in', 'a', 'secret', 'ceremony', 'attended', 'only', 'by', 'close', 'friends', 'and', 'family', 'yesterday', 'in', 'St', 'Petersburg', '.']
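
Because distance-based conditions depend on token positions, it can help to count the words between the two entities directly. The sketch below recovers the entity spans from the curly brackets in the displayed token list and counts the tokens between them. The helper name `entity_spans` is hypothetical and not part of Babble Labble, and the approach assumes the raw text contains no curly brackets of its own.

```python
def entity_spans(tokens):
    """Return (start, end) token spans, end-exclusive, for each {entity}.

    Hypothetical helper: relies only on the display convention of
    wrapping each entity in curly brackets.
    """
    spans, start = [], None
    for i, tok in enumerate(tokens):
        if tok.startswith("{"):
            start = i
        if tok.endswith("}") and start is not None:
            spans.append((start, i + 1))
            start = None
    return spans


# First tokens of the candidate displayed above.
tokens = ['Legendary', 'Russian', 'actor', '{Ivan', 'Krasko}', 'married',
          'his', '24-year', '-', 'old', 'fiance', '{Natalia', 'Shevel}', ',']

spans = entity_spans(tokens)
x_span, y_span = spans
words_between = y_span[0] - x_span[1]
print(spans, words_between)
```

A count like this makes it easier to pick a sensible threshold when writing a condition such as "X and Y are within N words of each other".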


## Step 2: Write Explanations

Now, looking at candidates one by one, we can create `Explanation` objects. Each `Explanation` requires 3 things (with an optional 4th):
- A label: An integer (for this task, 1 if X and Y were/are/will soon be married, and 2 otherwise).
- A condition: See below for details.
- A candidate: This will be used by the filter bank inside to check for semantic consistency.
- A name: (Optional) Adding names can be helpful for bookkeeping if you have many explanations floating around. 

The condition should satisfy the following properties:
1. **Complete Sentences**: Form a complete sentence when preceded by "I labeled it \[label\] because..." (i.e., instead of simply the phrase "his wife", it should be a statement like "'his wife' is in the sentence").
2. **X and Y**: Refer to the person who occurs first in the sentence as **X** and the second person as **Y**. (These can be overwritten with custom strings, but for now we'll stick with X and Y).
3. **Valid Primitives**: Utilize primitives supported by the grammar. These include:  
true, false, strings, ints, floats, tuples, lists, sets, and, or, not, any, all, none, =, !=, <, ≤, >, ≥, lowercase, uppercase, capitalized, all caps, starts with, ends with, substring, basic NER tags (person, location, date, number, organization), count, contains, intersection, map, filter, distances in words or characters, relative positions (left/right/between/within).

The rule-based parser is naive, not comprehensive, and can certainly be improved to support more primitives. These are just some of the ones we found to be the most commonly used and easily supported. When tempted to refer to real-world concepts (e.g., the "last name" of X), see if you can capture something similar using the supported primitives (e.g., "the last word of X").

In [5]:
from babble import Explanation
explanation = Explanation(
    name='LF_fiance_between',
    label=1,
    condition='The word "fiance" is between X and Y',
    candidate=candidate,
)

When we call `babbler.apply()`, our explanation is parsed into one or more candidate parses, which are then passed through the filter bank to remove any that fail. The call returns two lists: the parses that passed and the parses that were filtered out.

In [6]:
parses, filtered = babbler.apply(explanation)

Building list of target candidate ids...
All 1 explanations are already linked to candidates.
1 explanation(s) out of 1 were parseable.
3 parse(s) generated from 1 explanation(s).
2 parse(s) remain (1 parse(s) removed by DuplicateSemanticsFilter).
1 parse(s) remain (1 parse(s) removed by ConsistencyFilter).
Applying labeling functions to investigate labeling signature.

1 parse(s) remain (0 parse(s) removed by UniformSignatureFilter: (0 None, 0 All)).
1 parse(s) remain (0 parse(s) removed by DuplicateSignatureFilter).
1 parse(s) remain (0 parse(s) removed by LowestCoverageFilter).


You can view a pseudocode translation of your parse using the `view_parse()` method.

In [7]:
babbler.view_parse(parses[0])

Name: LF_fiance_between_1
Parse: return 1 if 'fiance'.in(text(between([X,Y]))) else 0
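
For intuition, the parse above behaves roughly like the hand-written function below. This is a minimal sketch, not Babble Labble's implementation: the token list, the `x_span`/`y_span` token indices, and the name `lf_fiance_between` are hypothetical, and 0 is assumed to mean "abstain", following the `else 0` in the pseudocode. The pseudocode checks membership in the text between X and Y, which we model here as a substring test.

```python
ABSTAIN = 0  # assumed: 0 means "no label", as in the pseudocode's `else 0`
MARRIED = 1


def lf_fiance_between(tokens, x_span, y_span):
    """Return 1 if 'fiance' occurs in the text between X and Y, else abstain.

    x_span and y_span are hypothetical (start, end) token indices,
    end-exclusive, with X occurring before Y.
    """
    between_text = " ".join(tokens[x_span[1]:y_span[0]])
    return MARRIED if "fiance" in between_text else ABSTAIN


tokens = ['Legendary', 'Russian', 'actor', 'Ivan', 'Krasko', 'married',
          'his', '24-year', '-', 'old', 'fiance', 'Natalia', 'Shevel', ',']
label = lf_fiance_between(tokens, x_span=(3, 5), y_span=(11, 13))
print(label)
```

Note that a substring test like this one would also fire on "fiancee", so the function can label candidates that a whole-word match would skip.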


At this point, if you're confident in the value of your explanation, you can go ahead and add it to the set of parses to keep by calling `babbler.commit()`. But if you'd like to investigate its quality first, continue on to Step 3.

## Step 3: Get Feedback

If you have a labeled dev set, you can evaluate your resulting parse's performance on that set to get an estimate of its accuracy and coverage. You may be surprised at how good/bad/broad/narrow your explanations actually are.

**NOTE:** There is a risk to doing this evaluation, however. The dev set is generally small; be careful not to overfit to it with your explanations! This is especially important if you use the same dev set for explanation validation and hyperparameter tuning.

In [8]:
babbler.analyze(parses)

| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| LF_fiance_between_1 | 0 | 1.0 | 0.009 | 0.0 | 0.0 | 2 | 7 | 0.222222 |


In this case, we see that our explanation yielded a labeling function with rather low accuracy (~22%) and low coverage (~1%).
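
These two numbers follow directly from the `Correct` and `Incorrect` counts. The sketch below reproduces them, assuming a dev set of 1,000 candidates (consistent with 9 labeled candidates at a coverage of 0.009); the function name `lf_stats` is hypothetical.

```python
def lf_stats(n_correct, n_incorrect, n_candidates):
    """Coverage and empirical accuracy of a single labeling function."""
    n_labeled = n_correct + n_incorrect
    coverage = n_labeled / n_candidates          # fraction of candidates labeled
    accuracy = n_correct / n_labeled if n_labeled else float("nan")
    return coverage, accuracy


# Counts taken from the analysis table; dev set size is an assumption.
coverage, accuracy = lf_stats(n_correct=2, n_incorrect=7, n_candidates=1000)
print(f"coverage={coverage:.3f}, accuracy={accuracy:.3f}")
```

Accuracy is computed only over the candidates the labeling function actually labeled, so a function can be very accurate while covering almost nothing, or the reverse.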

You can view examples of candidates your parse labeled correctly or incorrectly for ideas. Once the viewer is instantiated, you can simply rerun the cell with `viewer.view()` to move on to the next candidate.

In [9]:
from babble.utils import CandidateViewer

correct, incorrect = babbler.error_buckets(parses[0])
viewer = CandidateViewer(incorrect)

In [10]:
viewer.view()

Touching : {Kath Rathband} ( pictured at the Colosseum in Rome ) , widow of tragic PC David Rathband , has found happiness with fiancee John McGee , who she will marry next year     Relaxed : The happy couple were pictured relaxing on the sofa of Gogglebox stars   {Steph} and Dom at their B&B    ‘ Everyone is delighted for them,’ a friend told the Sunday People .    ‘

['Touching', ':', '{Kath', 'Rathband}', '(', 'pictured', 'at', 'the', 'Colosseum', 'in', 'Rome', ')', ',', 'widow', 'of', 'tragic', 'PC', 'David', 'Rathband', ',', 'has', 'found', 'happiness', 'with', 'fiancee', 'John', 'McGee', ',', 'who', 'she', 'will', 'marry', 'next', 'year', '   ', 'Relaxed', ':', 'The', 'happy', 'couple', 'were', 'pictured', 'relaxing', 'on', 'the', 'sofa', 'of', 'Gogglebox', 'stars', '\xa0', '{Steph}', 'and', 'Dom', 'at', 'their', 'B&B', '  ', '‘', 'Everyone', 'is', 'delighted', 'for', 'them,’', 'a', 'friend', 'told', 'the', 'Sunday', 'People', '.', '  ', '‘']


RelationMention(doc_id=53957: entities=("Kath Rathband"(10:23), "Steph"(253:258))

If you want to see which parses were filtered and why, there's a helper method for that as well. Because of the simplicity of the parser, even some seemingly simple explanations can be parsed incorrectly or fail to yield any valid parses at all. But be warned: in general, we find that time spent analyzing the parser's performance is rarely as productive as time spent simply producing more labeling functions, possibly varying the way you phrase your explanations or the types of signals you refer to.

In [11]:
babbler.filtered_analysis(filtered)

SUMMARY
2 TOTAL:
0 Unparseable Explanation
1 Duplicate Semantics
1 Inconsistency with Example
0 Uniform Signature
0 Duplicate Signature
0 Lowest Coverage

[#1]: Duplicate Semantics

Parse: return 1 if 'fiance'.in(text(between([X,Y]))) else 0

Reason: This parse is identical to one produced by the following explanation:
	"The word "fiance" is between X and Y"

Semantics: ('.root', ('.label', ('.int', 1), ('.call', ('.in', ('.extract_text', ('.between', ('.list', ('.arg', ('.int', 1)), ('.arg', ('.int', 2)))))), ('.string', 'fiance'))))


[#2]: Inconsistency with Example

Parse: return 1 if 'fiance'.(.eq(z) for all z in [text(X),text(Y)]) else 0

Reason: This parse abstained on its own candidate (RelationMention(doc_id=7997: entities=("Ivan Krasko"(24:35), "Natalia Shevel"(67:81)))

Semantics: ('.root', ('.label', ('.int', 1), ('.call', ('.composite_and', ('.eq',), ('.list', ('.arg_to_string', ('.arg', ('.int', 1))), ('.arg_to_string', ('.arg', ('.int', 2))))), ('.string', 'fiance'))))
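
To make the two filter reasons above concrete, here is a simplified sketch of a duplicate-semantics check followed by a consistency check. It is an illustration under stated assumptions, not the library's filter bank: parses are represented as hypothetical `(name, semantics, function)` tuples, candidates as plain dicts, and 0 is assumed to mean "abstain".

```python
ABSTAIN = 0


def between_text(c):
    """Text between X and Y in a hypothetical dict-based candidate."""
    return " ".join(c["tokens"][c["x"][1]:c["y"][0]])


def entity_text(c, key):
    start, end = c[key]
    return " ".join(c["tokens"][start:end])


def parse_between(c):
    return 1 if "fiance" in between_text(c) else ABSTAIN


def parse_equals(c):
    return 1 if all(entity_text(c, k) == "fiance" for k in ("x", "y")) else ABSTAIN


def filter_parses(parses, candidate):
    """Drop parses with repeated semantics, then parses that abstain
    on the candidate their explanation was written for."""
    removed, unique, seen = [], [], set()
    for name, semantics, fn in parses:
        if semantics in seen:
            removed.append((name, "Duplicate Semantics"))
        else:
            seen.add(semantics)
            unique.append((name, fn))
    kept = []
    for name, fn in unique:
        if fn(candidate) == ABSTAIN:
            removed.append((name, "Inconsistency with Example"))
        else:
            kept.append(name)
    return kept, removed


candidate = {
    "tokens": ['Ivan', 'Krasko', 'married', 'his', 'fiance', 'Natalia', 'Shevel'],
    "x": (0, 2),
    "y": (5, 7),
}
parses = [
    ("between_1", "in(between(X,Y), 'fiance')", parse_between),
    ("between_2", "in(between(X,Y), 'fiance')", parse_between),
    ("equals_1", "all_eq([X,Y], 'fiance')", parse_equals),
]
kept, removed = filter_parses(parses, candidate)
print(kept)
print(removed)
```

As in the summary above, one parse survives, one is dropped as a duplicate, and one is dropped because it fails to label its own example.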



## Step 4: Update Explanations

If an explanation we propose has lower accuracy than we'd like, we can try tightening it up (reducing the number of false positives) by making it more specific. If it has lower coverage than we'd like, one simple way to boost it is to replace keywords with aliases.

As was mentioned in Tutorial 1, aliases are sets of words that can be referred to with a single term. To add aliases to the babbler, we call `babbler.add_aliases` with a dictionary containing key-value pairs corresponding to the name of the alias and the set it refers to.

In [12]:
babbler.add_aliases({'spouse': ['husband', 'wife', 'spouse', 'bride', 'groom', 'fiance']})

Grammar construction complete.
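
To see why an alias tends to raise coverage, consider the toy comparison below. It is only a sketch: the four example sentences are made up, the helper names are hypothetical, and matching is assumed to be a substring test on the text between X and Y, which may differ from how the grammar resolves aliases.

```python
ALIASES = {"spouse": ["husband", "wife", "spouse", "bride", "groom", "fiance"]}


def label_if_between(tokens, x_span, y_span, words):
    """Return 1 if any of `words` occurs between X and Y, else 0 (abstain)."""
    between = " ".join(tokens[x_span[1]:y_span[0]])
    return 1 if any(w in between for w in words) else 0


def coverage(examples, words):
    """Fraction of examples that receive a non-abstain label."""
    labeled = sum(label_if_between(t, x, y, words) != 0 for t, x, y in examples)
    return labeled / len(examples)


# Made-up examples: (tokens, X span, Y span), spans are end-exclusive.
examples = [
    (['X', 'and', 'his', 'wife', 'Y'], (0, 1), (4, 5)),
    (['X', 'married', 'his', 'fiance', 'Y'], (0, 1), (4, 5)),
    (['X', 'met', 'Y'], (0, 1), (2, 3)),
    (['X', ',', 'the', 'groom', ',', 'thanked', 'Y'], (0, 1), (6, 7)),
]

keyword_cov = coverage(examples, ["fiance"])
alias_cov = coverage(examples, ALIASES["spouse"])
print(keyword_cov, alias_cov)
```

A broader alias labels more candidates, but whether accuracy also improves depends on the words you include, so it is worth re-running the analysis after each change.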


In [13]:
explanation = Explanation(
    name='LF_spouse_between',
    label=1,
    condition='A spouse word is between X and Y',
    candidate=candidate,
)
parses, filtered = babbler.apply(explanation)
babbler.analyze(parses)

Flushing all parses from previous explanation set.
Building list of target candidate ids...
All 1 explanations are already linked to candidates.
1 explanation(s) out of 1 were parseable.
6 parse(s) generated from 1 explanation(s).
4 parse(s) remain (2 parse(s) removed by DuplicateSemanticsFilter).
2 parse(s) remain (2 parse(s) removed by ConsistencyFilter).
Applying labeling functions to investigate labeling signature.

2 parse(s) remain (0 parse(s) removed by UniformSignatureFilter: (0 None, 0 All)).
1 parse(s) remain (1 parse(s) removed by DuplicateSignatureFilter).
1 parse(s) remain (0 parse(s) removed by LowestCoverageFilter).


| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| LF_spouse_between_1 | 0 | 1.0 | 0.169 | 0.0 | 0.0 | 110 | 59 | 0.650888 |


We can see that broadening our explanation in this way improved our parse both in coverage and accuracy! We'll go ahead and commit this parse.

In [14]:
babbler.commit()

Added 1 parse(s) from 1 explanations to set. (Total # parses = 1)

Applying labeling functions to split 1

Added 169 labels to split 1: L.nnz = 169, L.shape = (1000, 1).
Applying labeling functions to split 2

Added 170 labels to split 2: L.nnz = 170, L.shape = (1000, 1).


In an ideal world, our parses would all have both high coverage and high accuracy. In practice, however, there is usually a tradeoff. When in doubt, we give a slight edge to accuracy over coverage, since the discriminative model can help with generalization, but it is unlikely to be much more precise than the model that generated its labels.

## Step 5: Apply Label Aggregator

At any point, we can extract our growing label matrices to view the summary statistics of all the parses we've committed so far.

In [15]:
from metal.analysis import lf_summary

Ls = [babbler.get_label_matrix(split) for split in [0,1,2]]
lf_names = [lf.__name__ for lf in babbler.get_lfs()]
lf_summary(Ls[1], Ys[1], lf_names=lf_names)

Retrieved label matrix for split 0: L.nnz = 1300, L.shape = (8000, 1)
Retrieved label matrix for split 1: L.nnz = 169, L.shape = (1000, 1)
Retrieved label matrix for split 2: L.nnz = 170, L.shape = (1000, 1)


| | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| LF_spouse_between_1 | 0 | 1 | 0.169 | 0.0 | 0.0 | 110 | 59 | 0.650888 |
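
With a single labeling function, `Overlaps` and `Conflicts` are necessarily 0.0; they become informative once several functions are committed. The sketch below computes these columns for a tiny made-up label matrix in plain Python. It assumes the usual definitions (overlap: another function also labels the candidate; conflict: another function assigns a different label) and that 0 means "abstain"; it is not the `lf_summary` implementation.

```python
def summarize(L):
    """Per-LF coverage, overlaps, and conflicts for a dense label matrix.

    L is a list of rows (one per candidate), each holding one label per
    labeling function, with 0 meaning "abstain".
    """
    n, m = len(L), len(L[0])
    stats = []
    for j in range(m):
        labeled = overlaps = conflicts = 0
        for row in L:
            if row[j] == 0:
                continue
            labeled += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append({
            "coverage": labeled / n,
            "overlaps": overlaps / n,
            "conflicts": conflicts / n,
        })
    return stats


# Made-up matrix: 4 candidates, 2 labeling functions.
L = [
    [1, 1],  # both label, agree
    [1, 2],  # both label, disagree
    [1, 0],  # only the first labels
    [0, 0],  # neither labels
]
stats = summarize(L)
for j, s in enumerate(stats):
    print(j, s)
```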


Once we've committed parses (i.e., labeling functions) to our babbler, we can use them to train the label aggregator to see how we're doing overall.

In [16]:
from metal import LabelModel
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

[SUMMARY]
Best model: [1]
Best config: {'n_epochs': 500, 'show_plots': False, 'lr': 0.0012223249524949424, 'seed': 123}
Best score: 0.6094182825484763
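
As a rough point of comparison for the score above, it can be useful to know what a plain majority vote over the labeling functions would achieve. The sketch below is a deliberately simpler stand-in, not what `LabelModel` does: it takes an unweighted vote per candidate and computes F1 by hand. The label matrix and gold labels are made up, and sending abstentions and ties to the negative class (2) is an assumption.

```python
from collections import Counter


def majority_vote(row, default=2):
    """Unweighted vote over non-abstain labels; ties and all-abstain rows
    fall back to `default` (assumed here to be the negative class)."""
    votes = Counter(v for v in row if v != 0)
    if not votes:
        return default
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: 5 candidates, 3 labeling functions, 0 = abstain.
L = [
    [1, 1, 0],
    [1, 2, 2],
    [0, 0, 0],
    [2, 0, 1],
    [1, 0, 0],
]
Y = [1, 1, 2, 2, 1]

preds = [majority_vote(row) for row in L]
score = f1_score(Y, preds)
print(preds, round(score, 3))
```

If a trained aggregator does not beat a baseline like this on the dev set, that is usually a sign that more (or more diverse) labeling functions are needed.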


It may be somewhat surprising to see how quickly quality improves with the first few labeling functions you submit. Remember, though, that each labeling function you provide results in tens or hundreds of labels, so your effective training set size can grow quite quickly. As with traditional labels, there will come a point when adding more labeling functions yields diminishing returns, so it's good to check in on the overall quality of your label aggregator every once in a while.

This process of iteratively tweaking 

# Your Turn!

Now that you've seen the process, you can use this space to run your own iterative loop of explanation gathering.

If you need ideas for explanations, you can browse the 200 examples written by graduate students under `tutorial/spouse/data/gradturk_explanations`. Note, however, that these were collected in a non-iterative setting (i.e., the explanations were collected without any feedback on their parseability or performance on a dev set), so many of them have fairly low coverage/accuracy and some may not parse at all.

And remember: some candidates can be really tricky to come up with an explanation for, so feel free to skip them!

In [None]:
from babble import BabbleStream

babbler = BabbleStream(Cs, Ys, balanced=True, shuffled=True, seed=456)

### Collection

In [None]:
from babble.utils import display_candidate

candidate = babbler.next()
display_candidate(candidate)

In [None]:
from babble import Explanation
explanation = Explanation(
    name='',
    label=None,  # replace with 1 (married) or 2 (not married)
    condition='',
    candidate=candidate,
)

In [None]:
parses, filtered = babbler.apply(explanation)

### Analysis

In [None]:
babbler.analyze(parses)

In [None]:
babbler.filtered_analysis(filtered)

In [None]:
babbler.commit()

### Evaluation

In [None]:
from metal.analysis import lf_summary

Ls = [babbler.get_label_matrix(split) for split in [0,1,2]]
lf_names = [lf.__name__ for lf in babbler.get_lfs()]
lf_summary(Ls[1], Ys[1], lf_names=lf_names)

In [None]:
from metal import LabelModel
from metal.tuners import RandomSearchTuner

search_space = {
    'n_epochs': [50, 100, 500],
    'lr': {'range': [0.01, 0.001], 'scale': 'log'},
    'show_plots': False,
}

tuner = RandomSearchTuner(LabelModel, seed=123)

label_aggregator = tuner.search(
    search_space, 
    train_args=[Ls[0]], 
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

If you'd like to save the explanations you've generated, you can use the `ExplanationIO` object to write them to, or read them from, a file.

In [None]:
from babble.utils import ExplanationIO

FILE = "my_explanations.tsv"
exp_io = ExplanationIO()
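# Collect the Explanation objects you want to save (here, just the latest one).
explanations = [explanation]
exp_io.write(explanations, FILE)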
explanations = exp_io.read(FILE)