Snorkel

Summer School in Snorkel, Weak Supervision & Software 2.0


Hazy Research

Programming Software 2.0 with Weak Supervision

In the last few years, deep learning models have simultaneously achieved high quality on conventionally challenging tasks and become easy-to-use commodity tools. These factors, combined with their ease of deployment compared to traditional software, have led to deep learning models replacing production software stacks not only in traditional machine learning-driven products such as translation and search, but also in many previously heuristic-based applications. This new mode of software construction and deployment has been called Software 2.0. A key bottleneck in the construction of Software 2.0 applications is the need for large, high-quality training sets for each task.

As labeling training data increasingly becomes one of the most central ways in which developers interact with, and program, this new Software 2.0 stack, an emerging area of work focuses on weak supervision techniques for generating labeled training data more efficiently using higher-level, more agile interfaces. For concreteness, this tutorial focuses on Snorkel, a system that enables users to shape, create, and manage training data for Software 2.0 stacks. In Snorkel applications, instead of tediously hand-labeling individual data items, a user implicitly defines large training sets by writing programs, called labeling functions, that assign labels to subsets of data points, albeit noisily. This idea of using multiple, imperfect sources of labels builds on previous work in distant supervision and extends it to handle a more diverse range of noisier, biased, and potentially correlated sources.
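To make the idea of a labeling function concrete, here is a minimal sketch in plain Python. It does not use the Snorkel API: the label values (`1`, `-1`, `0` for abstain), the function names, and the toy spouse-relation task are all assumptions made for illustration.

```python
import re

# Label convention assumed for this sketch: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


def lf_marriage_keyword(sentence):
    """Vote positive if the sentence contains a marriage-related keyword."""
    pattern = r"\b(married|wife|husband|spouse)\b"
    return POSITIVE if re.search(pattern, sentence, re.IGNORECASE) else ABSTAIN


def lf_family_keyword(sentence):
    """Vote negative if the sentence mentions another kind of family relation."""
    pattern = r"\b(brother|sister|father|mother)\b"
    return NEGATIVE if re.search(pattern, sentence, re.IGNORECASE) else ABSTAIN


# Hypothetical collection of labeling functions for this toy task.
LABELING_FUNCTIONS = [lf_marriage_keyword, lf_family_keyword]


def apply_lfs(sentences, lfs=LABELING_FUNCTIONS):
    """Return a label matrix: one row per sentence, one column per labeling function."""
    return [[lf(sentence) for lf in lfs] for sentence in sentences]


sentences = [
    "Alice married Bob.",
    "Alice and her brother Bob.",
    "Alice met Bob.",
]
label_matrix = apply_lfs(sentences)
```

Each labeling function covers only part of the data and abstains elsewhere, so the resulting label matrix is sparse and noisy. Resolving its votes into training labels is the job of the later modeling steps.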

In this tutorial, we focus on a basic introduction to the Snorkel paradigm, its interface and workflow, and its motivating context and theory.

Lecture & Workshop Materials

  1. LECTURE: Software 2.0 Intro Slides

  2. LECTURE: Snorkel Overview Slides

  3. INTERACTIVE: Writing Labeling Functions

    1. Snorkel API: We introduce Candidate and Context objects (documents, sentences) and then show how to interact with candidates using the Snorkel helper function API.

    2. Writing Labeling Functions: We discuss how to explore our training data, write labeling functions, and use labeling function factories to autogenerate LFs from simple dictionaries and regular expressions.

  4. LECTURE: Data Programming Theory Slides

  5. INTERACTIVE: Writing Labeling Functions

    1. Training the Generative Model: We discuss how to unify the supervision provided by the labeling functions from the previous notebook using a generative model.

    2. Training the Discriminative Model: Using the output of the generative model above, we train a noise-aware discriminative model (here a deep neural network) to make predictions over the candidates, and we evaluate it on the development candidate set.

    3. Working with Images: Snorkel isn't limited to just text-based classification problems. In this tutorial, we show how Snorkel can be used for computer vision tasks.
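The two modeling steps in the second interactive session can be sketched in plain Python under strong simplifying assumptions. The sketch assumes the labeling functions are conditionally independent, the class prior is uniform, and each function's accuracy is already known. Snorkel's generative model instead estimates these accuracies from unlabeled data, so this is only a stand-in. All function names here are hypothetical.

```python
import math


def probabilistic_label(votes, accuracies):
    """Return P(y = +1 | votes) for one candidate.

    Assumes votes in {-1, 0, +1} (0 = abstain), independent labeling
    functions, a uniform class prior, and known accuracies in (0, 1).
    Under these assumptions the log-odds is a weighted sum of the votes.
    """
    log_odds = sum(
        vote * math.log(acc / (1.0 - acc))
        for vote, acc in zip(votes, accuracies)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def noise_aware_loss(predicted_prob, soft_label, eps=1e-12):
    """Expected cross-entropy of a prediction under a probabilistic label.

    Rather than training against a hard 0/1 target, the discriminative
    model is penalized in expectation over the soft label.
    """
    return -(
        soft_label * math.log(predicted_prob + eps)
        + (1.0 - soft_label) * math.log(1.0 - predicted_prob + eps)
    )


# Assumed accuracies for two hypothetical labeling functions.
accuracies = [0.9, 0.6]

p_conflict = probabilistic_label([1, -1], accuracies)  # the more accurate LF wins
p_abstain = probabilistic_label([0, 0], accuracies)    # no evidence, stays at the prior
p_single = probabilistic_label([1, 0], accuracies)     # one vote from the 0.9 LF
```

A discriminative model trained with `noise_aware_loss` is rewarded for matching the confidence of the combined label, so candidates with conflicting or missing votes contribute weaker training signal.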

Advanced Tutorials

These are useful additional tutorials for advanced Snorkel features.

  1. Data Preprocessing: How to preprocess a corpus of documents and initialize a Snorkel database.

  2. Model Tuning: Model tuning through grid search.

  3. BRAT Annotator: How to construct a validation set of human annotated data using BRAT (Brat Rapid Annotation Tool).
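As a rough illustration of what tuning by grid search involves, the following sketch tries every combination in a small hyperparameter grid and keeps the best-scoring one. It is generic Python, not the Snorkel tuning interface. The `train_and_score` callback, the parameter names, and the toy scoring function are hypothetical placeholders for training a model and evaluating it on a development set.

```python
from itertools import product


def grid_search(train_and_score, param_grid):
    """Evaluate every parameter combination and return the best one.

    train_and_score: callable taking keyword arguments and returning a
        score where higher is better (for example, development-set F1).
    param_grid: dict mapping parameter names to lists of candidate values.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(lr, dropout):
    """Stand-in for model training; peaks at lr=0.01, dropout=0.25."""
    return -((lr - 0.01) ** 2) - (dropout - 0.25) ** 2


best_params, best_score = grid_search(
    toy_score,
    {"lr": [0.001, 0.01, 0.1], "dropout": [0.0, 0.25, 0.5]},
)
```

The cost grows with the product of the grid sizes, so in practice the grid is kept small or replaced by random search over the same ranges.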

Further Reading on Weak Supervision

Below are some links on Snorkel and related projects, as well as on the broader spectrum of weak supervision work in the community. For more links, see the Snorkel home page.

  1. Weak Supervision: The New Programming Language for Software 2.0
  2. Weak Supervision: A Survey Blog Post
  3. A Recent NIPS 2017 Workshop on Weak Supervision
  4. Exploiting Building Blocks of Data to Efficiently Create Training Sets
  5. HoloClean: Data Cleaning using Weak Supervision
  6. BabbleLabble: Using Natural Language to Label Training Data
  7. Structure Learning: Handling Correlated Sources Automatically, [Tutorial]