# Reef: Overcoming the Barrier to Labeling Training Data

This is the first version of the README. Let us know if there are any errors!

Reef is an automated system for labeling training data based on a small labeled dataset. Reef utilizes ideas from program synthesis to automatically generate a set of interpretable heuristics that are then used to label unlabeled training data efficiently.

## Installation

Reef uses Python 2. The Python package requirements are listed in `requirements.txt` and can be installed with `pip install -r requirements.txt`. If you have Snorkel installed, you can set the corresponding flag to `True` to use it for learning heuristic accuracies; otherwise, a simple version of learning heuristic accuracies is included in this repo as well.

## Reef Workflow Overview

The inputs to Reef are the following:

- A labeled dataset, which contains a numerical feature matrix and a vector of ground-truth labels (currently, only binary classification is supported)
- An unlabeled dataset, which contains only a numerical feature matrix
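
For concreteness, here is a minimal sketch of what these inputs might look like, assuming numpy arrays; all names and sizes are illustrative:

```python
import numpy as np

# Small labeled dataset: numerical feature matrix + binary ground-truth labels.
X_labeled = np.random.rand(100, 20)            # 100 labeled points, 20 features
y_labeled = np.random.randint(0, 2, size=100)  # binary labels in {0, 1}

# Larger unlabeled dataset: features only; Reef generates its labels.
X_unlabeled = np.random.rand(5000, 20)
```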

The following is the workflow Reef uses to label training data automatically. The overall process is encoded in `[1] generate_reef_labels.ipynb` and the main file `program_synthesis/heuristic_generator.py`.

1. Using the labeled dataset, Reef generates heuristics such as decision trees or small logistic regression models. The synthesis code is in `program_synthesis/synthesizer.py` (see the synthesis sketch after this list).
   1. A heuristic is generated for each possible combination of c features, where c is the cardinality. For example, with c=1 and 10 features, 10 heuristics are generated.
   2. For each generated heuristic, a beta parameter is calculated. This represents the minimum confidence level at which the heuristic will assign a label, and it is chosen by maximizing the F1 score on the labeled dataset.
2. These heuristics are passed to a pruner that selects the best heuristic by maximizing a combination of its F1 score on the labeled dataset and its diversity, measured by how many points it labels that previously selected heuristics do not (see the pruning sketch after this list).
3. The selected heuristic and the previously chosen heuristics are then passed to the verifier, which learns accuracies for the heuristics based on the labels they assign to the unlabeled dataset (see the verification sketch after this list).
4. Finally, Reef calculates the probabilistic labels the heuristics assign to the labeled dataset and passes data points with low-confidence labels back to the synthesizer. This procedure is repeated iteratively.
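
A minimal sketch of the synthesis step (1) is below. It is not the exact code in `program_synthesis/synthesizer.py`; the function names, the tree-depth choice, and the beta grid are illustrative assumptions. One shallow decision tree is fit per combination of c features, and each heuristic's beta is chosen by sweeping confidence thresholds and keeping the one that maximizes F1 on the labeled set:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

def generate_heuristics(X_labeled, y_labeled, cardinality=1):
    """Fit one shallow decision tree per combination of `cardinality` features."""
    heuristics = []
    for idx in combinations(range(X_labeled.shape[1]), cardinality):
        clf = DecisionTreeClassifier(max_depth=cardinality)
        clf.fit(X_labeled[:, list(idx)], y_labeled)
        heuristics.append((list(idx), clf))
    return heuristics

def find_beta(clf, idx, X_labeled, y_labeled):
    """Choose the confidence threshold (beta) that maximizes F1 on the
    labeled set, letting the heuristic abstain on low-confidence points."""
    probs = clf.predict_proba(X_labeled[:, idx])[:, 1]   # P(y = 1) per point
    best_beta, best_f1 = 0.0, -1.0
    for beta in np.linspace(0.0, 0.45, 10):
        # Assign 1/0 only when the margin from 0.5 exceeds beta; -1 = abstain.
        votes = np.where(probs >= 0.5 + beta, 1,
                np.where(probs <= 0.5 - beta, 0, -1))
        mask = votes != -1
        if not mask.any():
            continue
        f1 = f1_score(y_labeled[mask], votes[mask])
        if f1 > best_f1:
            best_beta, best_f1 = beta, f1
    return best_beta
```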
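
A similar sketch of the pruning step (2): each candidate is scored by the average of its F1 on the labeled set and a Jaccard-style diversity term that rewards labeling points the already-selected heuristics have not touched. The `label` method, the diversity measure, and the equal weighting are assumptions, not the repo's exact scoring:

```python
from sklearn.metrics import f1_score

def prune(candidates, X_labeled, y_labeled, covered):
    """Pick the candidate heuristic maximizing F1 + diversity.

    covered: boolean mask of points already labeled by selected heuristics.
    Each candidate's label() returns votes in {0, 1, -1}, with -1 = abstain.
    """
    best, best_score = None, -1.0
    for h in candidates:
        votes = h.label(X_labeled)
        labeled = votes != -1
        f1 = f1_score(y_labeled[labeled], votes[labeled]) if labeled.any() else 0.0
        # Diversity: fraction of newly labeled points among all points
        # covered by either this heuristic or the previously selected ones.
        union = float((labeled | covered).sum())
        diversity = (labeled & ~covered).sum() / union if union else 0.0
        score = 0.5 * (f1 + diversity)
        if score > best_score:
            best, best_score = h, score
    return best
```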
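
Finally, a sketch of the verification and iteration steps (3-4). With Snorkel installed, Reef can use its generative model to learn the heuristic accuracies; the stand-in below simply weights each heuristic's vote by its log-odds accuracy, which matches the simple version shipped in this repo in spirit only:

```python
import numpy as np

def probabilistic_labels(L, accuracies):
    """L: (n_points, n_heuristics) vote matrix in {0, 1, -1} (-1 = abstain).
    accuracies: estimated accuracy of each heuristic, in (0.5, 1).
    Returns P(y = 1 | votes) under a naive weighted-vote model."""
    weights = np.log(accuracies / (1.0 - accuracies))       # log-odds per heuristic
    signed = np.where(L == 1, 1, np.where(L == 0, -1, 0))   # map votes to {+1, -1, 0}
    return 1.0 / (1.0 + np.exp(-signed.dot(weights)))       # sigmoid of weighted vote

# Toy example: 4 points x 2 heuristics, with estimated accuracies.
L_train = np.array([[1, 1], [1, -1], [0, 1], [0, 0]])
accs = np.array([0.8, 0.7])
probs = probabilistic_labels(L_train, accs)

# Points whose label stays near 0.5 are low confidence; Reef feeds them
# back to the synthesizer and repeats the loop (threshold is illustrative).
low_confidence = np.abs(probs - 0.5) < 0.05
```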

## Tutorial

The tutorial notebooks are based on a text-based plot classification dataset. We go through generating heuristics with Reef and then train a simple LSTM model to see how an end model trained with Reef-generated labels compares to an end model trained with ground-truth labels.
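
As a rough illustration of the second notebook's idea (assuming Keras; the actual code in `lstm/` may differ, and all sizes are illustrative), an end model can be trained directly on Reef's probabilistic labels, since binary cross-entropy accepts soft targets:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Toy stand-ins: padded token sequences and Reef's probabilistic labels.
X_train_tokens = np.random.randint(1, 20000, size=(500, 100))
reef_probs = np.random.rand(500)   # soft labels in [0, 1] from the verifier

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # vocab size, embedding dim
    LSTM(64),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Binary cross-entropy accepts probabilistic targets, so Reef's soft
# labels can be used directly, without thresholding them to 0/1.
model.fit(X_train_tokens, reef_probs, epochs=5, batch_size=32)
```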