# Part III: Tradeoffs

In this notebook, we'll explore the pros and cons of a few variations of the Babble Labble framework.

1. Data Programming or Majority Vote
2. Explanations or Traditional Labels
3. Including LFs as features

As with all machine learning tools, no one tool fits all situations; there are always tradeoffs. 
Also, note that the relative performance of each of these variants can vary widely across applications and different sets of labeling functions, so take the results of any single run with a grain of salt.

## 0. Setup

Once again, we need to first load the data (candidates and labels) from the pickle. This time, we'll also load our label matrices and training set predictions from Tutorial 1.

In [1]:
%load_ext autoreload
%autoreload 2


In [2]:
import pickle

DATA_FILE = 'data/tutorial_data.pkl'
with open(DATA_FILE, 'rb') as f:
    Cs, Ys = pickle.load(f)

with open("Ls.pkl", 'rb') as f:
    Ls = pickle.load(f)
    
with open("Y_p.pkl", 'rb') as f:
    Y_p = pickle.load(f)

## 1. Data Programming or Majority Vote

When it comes to label aggregation, there are a variety of ways to reweight and combine the outputs of the labeling functions. Perhaps the simplest approach is to use the majority vote on a per-candidate basis (effectively making the naive assumption that all labeling functions have equal accuracy). While simple, this is often an effective baseline.
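To make the baseline concrete, here is a minimal NumPy sketch of per-candidate majority vote. It is not the `MajorityLabelVoter` implementation; it assumes a dense label matrix in which `0` means "abstain" and classes are numbered `1..k`, and it returns `0` on ties, which may differ from how the library breaks ties. The names `majority_vote` and `L_toy` are illustrative only.

```python
import numpy as np

def majority_vote(L, tie_value=0):
    """Per-row majority vote over a dense label matrix.

    L has shape (n_candidates, n_lfs). Entries are class labels 1..k,
    with 0 meaning the LF abstained (an assumption of this sketch).
    Rows with a tie or with no votes at all return `tie_value`.
    """
    L = np.asarray(L)
    k = int(L.max())  # assumes at least one non-abstain vote in L
    # counts[i, c - 1] = number of LFs that voted class c on candidate i
    counts = np.stack([(L == c).sum(axis=1) for c in range(1, k + 1)], axis=1)
    top = counts.max(axis=1)
    winners = counts.argmax(axis=1) + 1
    is_tie = (counts == top[:, None]).sum(axis=1) > 1
    return np.where(is_tie | (top == 0), tie_value, winners)

L_toy = np.array([
    [1, 1, 2],  # two votes for class 1
    [0, 2, 2],  # two votes for class 2, one abstain
    [1, 2, 0],  # one vote each: a tie
    [0, 0, 0],  # no votes
])
preds = majority_vote(L_toy)
```

Every vote counts equally here, which is exactly the "equal accuracy" assumption mentioned above.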

As described in the [VLDB paper](https://arxiv.org/abs/1711.10160) on Snorkel, the expected benefit of learning the accuracies of the labeling functions with data programming decreases in the regimes of very low and very high label density. When label density is low (i.e., few LFs and/or low-coverage LFs), there are few conflicts to resolve and minimal overlaps from which to learn. When label density is very high, it can be shown that under certain conditions majority vote converges to an optimal solution, so long as the average labeling function accuracy is better than random.
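One rough way to see which regime a label matrix falls in is to measure its label density (the mean number of non-abstaining votes per candidate) along with how often LFs actually conflict. The sketch below assumes a dense matrix with `0` for "abstain"; if the matrices in `Ls` are SciPy sparse, which this notebook does not state explicitly, they would need a `.toarray()` call first. The helper names are hypothetical.

```python
import numpy as np

def label_density(L):
    """Mean number of non-abstaining LF votes per candidate (0 = abstain)."""
    L = np.asarray(L)
    return float((L != 0).sum(axis=1).mean())

def conflict_rate(L):
    """Fraction of candidates on which at least two LFs disagree."""
    L = np.asarray(L)
    # Number of distinct non-abstain labels cast on each candidate
    n_distinct = np.array([len(set(row[row != 0])) for row in L])
    return float((n_distinct > 1).mean())

L_toy = np.array([
    [1, 1, 2],
    [0, 2, 2],
    [1, 2, 0],
    [0, 0, 0],
])
density = label_density(L_toy)
conflicts = conflict_rate(L_toy)
```

A density close to zero, or a conflict rate close to zero, suggests there is little for a learned label model to resolve beyond what majority vote already does.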

Because many applications of interest occur in the middle regime, and because the data programming label model can effectively reduce to majority vote with sufficient regularization, we tend to use data programming for label aggregation.
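The relationship between the two aggregation schemes can be illustrated with a simplified weighted vote. This is a sketch, not the MeTaL label model: it assumes a binary task with labels `1` and `2` (`0` = abstain), conditionally independent LFs, and accuracies that are simply given as inputs, whereas data programming estimates them without ground truth. Under those assumptions, each vote is weighted by the log-odds of its LF's accuracy, and when all accuracies are equal the result matches majority vote.

```python
import numpy as np

def weighted_vote(L, accuracies):
    """Accuracy-weighted vote for a binary task (labels 1 and 2, 0 = abstain).

    Each LF's vote is weighted by log(a / (1 - a)), the log-odds of its
    assumed accuracy a. Returns 0 when the weighted votes cancel exactly.
    """
    L = np.asarray(L)
    acc = np.asarray(accuracies, dtype=float)
    # Map class 1 -> +1, class 2 -> -1, abstain -> 0
    signed = np.where(L == 1, 1.0, np.where(L == 2, -1.0, 0.0))
    weights = np.log(acc / (1.0 - acc))
    scores = signed @ weights
    return np.where(scores > 0, 1, np.where(scores < 0, 2, 0))

L_toy = np.array([
    [1, 2, 2],
    [1, 1, 2],
    [2, 1, 1],
])
# Equal accuracies: identical to majority vote
equal = weighted_vote(L_toy, [0.7, 0.7, 0.7])
# One highly accurate LF can overrule two weaker ones
skewed = weighted_vote(L_toy, [0.95, 0.6, 0.6])
```

In the first and third rows, the single high-accuracy LF outvotes the other two once the weights differ, which is the kind of conflict resolution that majority vote cannot perform.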

In [3]:
# Majority-vote baseline, scored on the dev split (index 1)
from metal import MajorityLabelVoter

print("MajorityVoter")
mv = MajorityLabelVoter(seed=123, verbose=False)
_ = mv.score(Ls[1], Ys[1], metric='f1')

MajorityVoter
F1: 0.508


Indeed, for our sample set of labeling functions, data programming outperforms majority vote (69.7% vs. 50.8% F1).

## 2. Explanations or Traditional Labels

We can also consider when it's worthwhile to use traditional (manually generated) training labels rather than weakly supervised (programmatically generated) ones. Since a weakly supervised training set will almost by definition have imperfect labels, if you have a large number of ground truth labels to train on, then use them! Weak supervision makes more sense in situations where labeled data is sparse and/or hard to collect, or when you have the ability to create a much larger training set out of unlabeled data (e.g., 100 "perfect" training labels may not perform as well as 100k "good enough" labels that were automatically generated from labeling functions).

Other aspects to consider:
* static vs dynamic: If the data distribution shifts over time or the task requirements change even slightly, a hand-labeled training set can quickly depreciate in value, as it no longer accurately reflects what you want your model to learn. If your training data is automatically generated, however, you can modify or add a small number of labeling functions to "reshape" your dataset in the appropriate way; no tedious relabeling required.
* label provenance: While we often treat training data creation as a black box process, in actuality, there are often bugs even in manual label collection (e.g., crowdsource workers of varying quality, systematic biases, etc.). We've written about how to debug training data in these blog posts [1](https://dawn.cs.stanford.edu/2018/08/30/debugging2/), [2](https://dawn.cs.stanford.edu/2018/06/21/debugging/).

Finally, most training labels are of approximately equal value. That is, we'd expect a classifier trained with one set of 500 randomly selected labels to achieve approximately the same performance as one trained with any other set of 500 randomly selected labels. But it's worth asking:  

**What is the value of a labeling function?**  
It depends. A labeling function may:

* Label one example: `Label 1 if the candidate ID is 8675309`
* Label one distant supervision pair: `Label 1 if X is "Barack" and Y is "Michelle"`
* Label a whole database-worth: `Label 1 if the tuple of X and Y is in my known_spouses dictionary`
* Label based on a feature (1 or 1000s): `Label 2 if the last word of X is different than the last word of Y`
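Written out as code, the four examples above might look like the following sketch. The `Candidate` namedtuple, its field names, and the contents of `known_spouses` are hypothetical stand-ins (the real candidate class in this tutorial differs), and `0` is assumed to mean "abstain".

```python
from collections import namedtuple

# Hypothetical stand-in for a candidate; the tutorial's real class differs.
Candidate = namedtuple("Candidate", ["mention_id", "x", "y"])

ABSTAIN = 0
known_spouses = {("Barack Obama", "Michelle Obama")}  # illustrative only

def lf_single_id(c):
    # Labels exactly one example
    return 1 if c.mention_id == 8675309 else ABSTAIN

def lf_single_pair(c):
    # Labels one distant supervision pair
    return 1 if (c.x, c.y) == ("Barack", "Michelle") else ABSTAIN

def lf_known_spouses(c):
    # Labels everything found in a dictionary of known pairs
    return 1 if (c.x, c.y) in known_spouses else ABSTAIN

def lf_different_last_word(c):
    # Labels based on a feature; assumes x and y are non-empty strings
    return 2 if c.x.split()[-1] != c.y.split()[-1] else ABSTAIN

lfs = [lf_single_id, lf_single_pair, lf_known_spouses, lf_different_last_word]
cands = [
    Candidate(8675309, "Jane Doe", "John Doe"),
    Candidate(1, "Barack Obama", "Michelle Obama"),
    Candidate(2, "Ann Lee", "Bob Ray"),
]
# One row per candidate, one column per labeling function
L = [[lf(c) for lf in lfs] for c in cands]
```

Each function has the same shape (candidate in, label or abstain out), yet the number of candidates each one can label ranges from one to potentially thousands.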

And it isn't just quantity (coverage) that matters; a labeling function that contributes to many labels may be "worth less" in our application than one with lower coverage but higher accuracy. And one that captures a new type of signal not reflected in our current set of LFs will also have relatively higher value than one re-using the same type of signal (e.g., the same keyword list) over and over. The upshot of this is that we can't simply say "An explanation/labeling function is worth this many labels."
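Coverage and accuracy can both be measured directly whenever a small labeled set is available. The sketch below assumes a dense label matrix with `0` for "abstain" and a vector of gold labels; `lf_summary` is a hypothetical helper, not part of Babble Labble or MeTaL.

```python
import numpy as np

def lf_summary(L, Y):
    """Per-LF coverage and empirical accuracy (0 = abstain).

    Coverage is the fraction of candidates an LF votes on; accuracy is the
    fraction of its non-abstain votes that match the gold label. LFs that
    never vote get an accuracy of NaN.
    """
    L = np.asarray(L)
    Y = np.asarray(Y)
    voted = L != 0
    coverage = voted.mean(axis=0)
    correct = (L == Y[:, None]) & voted
    n_votes = voted.sum(axis=0)
    accuracy = np.divide(
        correct.sum(axis=0), n_votes,
        out=np.full(L.shape[1], np.nan), where=n_votes > 0,
    )
    return coverage, accuracy

L_toy = np.array([
    [1, 0],
    [1, 2],
    [2, 0],
    [1, 0],
])
Y_toy = np.array([1, 2, 2, 1])
coverage, accuracy = lf_summary(L_toy, Y_toy)
```

In this toy example the first LF votes on every candidate but is right only three times out of four, while the second votes once and is correct, so neither number alone tells you which one is "worth more".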

Here we train our same discriminative model using 1000 traditional labels (100x the number of explanations in our sample set).

In [4]:
from metal.contrib.featurizers.ngram_featurizer import RelationNgramFeaturizer

NUM_LABELS = 1000

featurizer = RelationNgramFeaturizer()
Xs_ts = []
Xs_ts.append(featurizer.fit_transform(Cs[0][:NUM_LABELS]))
Xs_ts.append(featurizer.transform(Cs[1]))
Xs_ts.append(featurizer.transform(Cs[2]))

X_train = Xs_ts[0]
Y_train = Ys[0][:NUM_LABELS]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bradenjh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
from metal.tuners import RandomSearchTuner
from metal.metrics import metric_score

from babble.disc_model import LogisticRegressionWrapper

search_space = {
    'C': {'range': [0.0001, 1000], 'scale': 'log'},
    'penalty': ['l1', 'l2'],
}

tuner = RandomSearchTuner(LogisticRegressionWrapper, seed=123)

disc_model = tuner.search(
    search_space, 
    train_args=[X_train, Y_train],
    X_dev=Xs_ts[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

scores = disc_model.score(Xs_ts[1], Ys[1], metric=['precision', 'recall', 'f1'])

[SUMMARY]
Best model: [3]
Best config: {'penalty': 'l2', 'C': 531.4535315103664, 'seed': 125}
Best score: 0.6319018404907976
Precision: 0.769
Recall: 0.536
F1: 0.632


For this particular set of LFs and this particular set of manual labels, the 10 explanations resulted in a better classifier than 1000 labels (69.3% vs 63.2%). The exact multiplicative factor for any particular LF set will vary (and is not linear, as both collecting more manual labels and collecting more labeling functions experience diminishing returns after a point).

## 3. Including LFs as Features

Once we have collected user explanations, there are a number of ways this extra information can be used. In our paper, we described using these explanations as functions for generating training data. Another option is to use them as essentially hand-crafted features, treating the label matrix as a feature matrix instead. Not surprisingly, these features tend to be highly relevant for their respective tasks. However, as we described in Tutorial 1, there may still be good reasons for not including them. For example:
* We may want to make sure our classifier generalizes beyond the signals described by the explanations.
* We may want to capitalize on representation learning, using the larger training set generated by using them as functions.
* We may be in a cross-modal setting, where the features we have at training time are different from the features that our classifier will have access to at deployment time.

Regardless, we find that even in situations where we do want to include the labeling function outputs as features, we can usually achieve additional quality by using them as labeling functions as well, thanks to the larger training set and the access to additional features relevant to the task at hand.

### LFs as features only

First, we consider using the labeling function outputs as our only features.

In [6]:
import numpy as np
from data.sample_explanations import explanations

candidate_ids = [exp.candidate for exp in explanations]
indices = []
for c1 in candidate_ids:
    for i, c2 in enumerate(Cs[0]):
        if c1 == c2.mention_id:
            indices.append(i)
            break
            
X_train = Ls[0][indices, :]
Y_train = np.array([exp.label for exp in explanations])

In [7]:
from metal.tuners import RandomSearchTuner
from metal.metrics import metric_score

from babble.disc_model import LogisticRegressionWrapper

search_space = {
    'C': {'range': [0.0001, 1000], 'scale': 'log'},
    'penalty': ['l1', 'l2'],
}

tuner = RandomSearchTuner(LogisticRegressionWrapper, seed=123)

disc_model = tuner.search(
    search_space, 
    train_args=[X_train, Y_train],
    X_dev=Ls[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

scores = disc_model.score(Ls[1], Ys[1], metric=['precision', 'recall', 'f1'])

[SUMMARY]
Best model: [12]
Best config: {'penalty': 'l1', 'C': 0.2863047498381121, 'seed': 134}
Best score: 0.6217948717948719
Precision: 0.808
Recall: 0.505
F1: 0.622


As expected, these hand-engineered (or shall we say "natural-language-engineered"?) features get us pretty far. But in situations where we do want to give our discriminative model access to the labeling function outputs as features, this approach can nearly always be beaten by combining the two uses for labeling functions: using them to make the larger training set, and then also providing them as features.

### LFs as features and labelers

In [8]:
from metal.contrib.featurizers.ngram_featurizer import RelationNgramFeaturizer

featurizer = RelationNgramFeaturizer(min_df=3)
featurizer.fit(Cs[0])
Xs = [featurizer.transform(C) for C in Cs]


In [9]:
from scipy.sparse import hstack, csr_matrix

Xs_new = []
for i in [0,1,2]:
    X_new = csr_matrix(hstack([Ls[i], Xs[i]]))
    Xs_new.append(X_new)

In [10]:
from metal.tuners import RandomSearchTuner
from metal.metrics import metric_score

from babble.disc_model import LogisticRegressionWrapper

search_space = {
    'C': {'range': [0.0001, 1000], 'scale': 'log'},
    'penalty': ['l1', 'l2'],
}

tuner = RandomSearchTuner(LogisticRegressionWrapper, seed=123)

disc_model = tuner.search(
    search_space, 
    train_args=[Xs_new[0], Y_p],
    X_dev=Xs_new[1], Y_dev=Ys[1], 
    max_search=20, verbose=False, metric='f1')

scores = disc_model.score(Xs_new[1], Ys[1], metric=['precision', 'recall', 'f1'])

[SUMMARY]
Best model: [8]
Best config: {'penalty': 'l1', 'C': 1.2667309424422641, 'seed': 130}
Best score: 0.7102272727272727
Precision: 0.781
Recall: 0.651
F1: 0.710


## Conclusions

This concludes the tutorial! 

If you'd like to stay up to date on the latest tools we're working on in the Snorkel weak supervision ecosystem, we post regular updates to the landing page at [snorkel.stanford.edu](https://hazyresearch.github.io/snorkel/).