# Data programming: Training data without hand labeling

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"

## Contents

0. [Overview](#Overview)
0. [Set-up](#Set-up)
0. [Motivation](#Motivation)  
0. [The data programming model](#The-data-programming-model)
0. [Basic implementation](#Basic-implementation)
0. [Simple example: cheese vs. disease](#Simple-example:-cheese-vs.-disease)
  0. [Cheese/disease data](#Cheese/disease-data)
  0. [Cheese/disease labeling functions](#Cheese/disease-labeling-functions)
  0. [Applying the cheese/disease labelers](#Applying-the-cheese/disease-labelers)
  0. [Fitting the generative model to obtain cheese/disease labels](#Fitting-the-generative-model-to-obtain-cheese/disease-labels)
  0. [Training discriminative models for cheese/disease prediction](#Training-discriminative-models-for-cheese/disease-prediction)
0. [In-depth example: Stanford Sentiment Treebank](#In-depth-example:-Stanford-Sentiment-Treebank)
  0. [SST training set](#SST-training-set)
  0. [Lexicon-based labeling functions](#Lexicon-based-labeling-functions)
  0. [Other SST labeling function ideas](#Other-SST-labeling-function-ideas)
  0. [Applying the SST labeling functions](#Applying-the-SST-labeling-functions)
  0. [Fitting the SST generative model](#Fitting-the-SST-generative-model)
  0. [Direct assessment of the inferred labels against the gold ones](#Direct-assessment-of-the-inferred-labels-against-the-gold-ones)
  0. [Fitting a discriminative model on the noisy labels](#Fitting-a-discriminative-model-on-the-noisy-labels)
0. [Extra-credit bake-off](#Extra-credit-bake-off)

## Overview

This notebook provides an overview of the __data programming__ model pioneered by [Ratner et al. 2016](https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly):

* This model synthesizes a bunch of noisy labeling functions into a set of (binary) supervised labels for examples. These labels are then used for __training__. 

* Thus, on this model, one need only have gold labels for assessment, thereby greatly reducing the burden of labeling examples.

* The researchers open-sourced their code as [Snorkel](https://github.com/HazyResearch/snorkel). For ease of use and exploration, we'll work with a simplified version derived from [this excellent blog post](https://hazyresearch.github.io/snorkel/blog/dp_with_tf_blog_post.html). This is implemented in our course repository as `tf_snorkel_lite.py`.

* Project teams that find this direction useful are encouraged to use the real Snorkel, as it will better handle the complex relationships that inevitably arise in a set of real labeling functions.

## Set-up

The set-up steps are [the same as those required for working with the Stanford Sentiment Treebank materials](sst_01_overview.ipynb#Set-up), since we'll be revisiting that dataset as an in-depth use-case. Make sure you've done a recent pull of the repository so that you have the latest code release.

In [2]:
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from tf_snorkel_lite import TfSnorkelGenerative, TfLogisticRegression
import sst

## Motivation

Have newer methods reduced the need for labels? Has crowdsourcing made it easy enough to get labels at scale?

### Types of learning

1. __Supervised learning__: Individual examples _from the domain you care about_ labeled in a way that you think/assume/hope is aligned with your actual real-world objective. The model objective is to minimize error between predicted and actual.

2. __Distantly supervised learning__: Exactly like supervised learning, but with individual examples _from a domain that is different from the one you care about_.

3. __Semi-supervised learning__: A fundamentally supervised method that can make use of unlabeled data. 

4. __Reinforcement learning__: The data are in some sense labeled, but not at the level of individual examples. The model objective is essentially as in supervised learning.

5. __Unsupervised learning__: No labels that you can make use of directly. The model objective is thus set independently of the data but is presumably tied to something intuitive.

In almost all domains right now, __effective learning is supervised learning__ – somewhere in 1–4. However, representations from unsupervised learning are very common as inputs to supervised deep learning models.

### The cost of labeling

[Ratner et al. 2016](https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly):
> In many applications, we would like to use machine learning, but we face the following challenges: 
>
> (i) hand-labeled training data is not available, and is prohibitively expensive to obtain in sufficient quantities as it requires expensive domain expert labelers; 
>
> (ii) related external knowledge bases are either unavailable or insufficiently specific, precluding a traditional distant supervision or co-training approach; 
>
> (iii) application specifications are in flux, changing the model we ultimately wish to learn.

In addition, hand labeling is highly repetitive: annotators end up registering the same judgment over and over, example by example.

Point (iii) is subtle but very important: labels tend to be brittle, useful only for a narrow range of tasks, and thus they can quickly become irrelevant where one's scientific or business goals are evolving.

### Example: SNLI

SNLI ([Bowman et al. 2015](http://aclweb.org/anthology/D/D15/D15-1075.pdf)) represents one of the largest labeling efforts in NLP to date. It provides reasonable coverage for a __very__ narrow domain. The most frequent complaint is that it is too specialized.


### Example: I2B2

From [Uzuner 2009](https://academic.oup.com/jamia/article-abstract/16/4/561/766997):
 
> To define the Obesity Challenge task, two experts from the Massachusetts General Hospital Weight Center studied 50 (25 each) random pilot discharge summaries from the Partners HealthCare Research Patient Data Repository.
>
> [...]
>
> The data for the challenge were annotated by two obesity experts from the Massachusetts General Hospital Weight Center. The experts were given a _textual task_, which asked them to classify each disease (see list of diseases above) as Present, Absent, Questionable, or Unmentioned based on explicitly documented information in the discharge summaries [...]. The experts were also given an _intuitive task_, which asked them to classify each disease as Present, Absent, or Questionable by applying their intuition and judgment to information in the discharge summaries,

Extrapolate these costs, in money and time, to the 1M+ records we'd need for reasonable coverage of obesity patient experiences.

### Example: THYME

From [Styler et al. 2017](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657277/):

> The THYME colon cancer corpus, which includes clinical notes and pathology reports for 35 patients diagnosed with colon cancer for a total of 107 documents. Each note was annotated by a pair of graduate or undergraduate students in Linguistics at the University of Colorado, then adjudicated by a domain expert.

Again, extrapolate these costs to a dataset that would provide reasonable coverage.

## The data programming model

1. Suppose we have some raw set of $m$ examples $T$. To help keep the concepts straight, assume that these are just raw examples, not representations for machine learning.
<br /><br />
2. We write a set of $n$ labeling functions $\Lambda$:
    * Each $\lambda \in \Lambda$ maps each $t \in T$ to a label in $\{-1, 0, 1\}$. 
    
    * These labeling functions need not be mutually consistent.
    
    * We expect each $\lambda$ to be high precision and low recall. We hope that $\Lambda$ in aggregate is high precision and high recall.
<br /><br /> 
3. Think of $\Lambda$ as mapping each $t$ to a vector of labels of dimension $n$ – e.g., $\Lambda(t) = [-1, 1, 1, 0]$. Let $\Lambda(T)$ be the $m \times n$ matrix of these representations.
<br /><br />
4. We fit a generative model to $\Lambda(T)$ that returns a binary vector $\widehat{y}$ of length $m$. These are the labels for the examples in $T$ we'll use for __training__.
<br /><br />
5. From here, it's just supervised learning as usual. A feature function will map $T$ to a matrix of representations $X$, and you can pick your favorite supervised model. It will learn from $(X, \widehat{y})$. In this way, you're doing supervised learning without any actual labeled data!
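To make steps 2 through 4 concrete, here is a minimal sketch that builds $\Lambda(T)$ for a few toy strings and then aggregates the votes by simple majority. Majority vote is only a stand-in baseline here, not the generative model of step 4, and the labeling functions and examples (`has_exclamation`, `has_sad_word`, `T_toy`) are hypothetical.

```python
def has_exclamation(t):
    # Hypothetical labeler: an exclamation mark suggests the positive class.
    return 1 if "!" in t else 0

def has_sad_word(t):
    # Hypothetical labeler: a tiny negative word list.
    return -1 if {"bad", "awful"} & set(t.lower().split()) else 0

def apply_all(T, labelers):
    # Lambda(T): one row per example, one column per labeling function.
    return [[lf(t) for lf in labelers] for t in T]

def majority_vote(row):
    # Sum the votes; ties and all-abstain rows stay at 0 (no label).
    total = sum(row)
    return (total > 0) - (total < 0)

T_toy = ["great fun!", "awful plot", "awful but fun!", "a film"]
L_toy = apply_all(T_toy, [has_exclamation, has_sad_word])
votes = [majority_vote(row) for row in L_toy]
```

The third example gets conflicting votes and the fourth gets none, so majority vote leaves both unlabeled. The generative model is meant to do better than this by weighting the labeling functions differently.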

## Basic implementation

Our implementation is in `tf_snorkel_lite.py`, as `TfSnorkelGenerative`. It works well, but it is mainly for illustrative purposes. As noted above, its primary limitation is that it makes the "naive Bayes" assumption that the labeling functions are independent. Since the real-world labeling functions you want to write will likely have many complex dependencies between them, this is strictly speaking an incorrect model. (In practice, as with naive Bayes classifiers, the model might nonetheless work well!)
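The independence assumption has a convenient consequence. Suppose each labeling function abstains at a rate that does not depend on the true class, votes correctly with some fixed probability when it does vote, and the two classes are equally likely a priori. Then the posterior for an example is a logistic function of a weighted vote, where each weight is the log-odds of that function's accuracy. The sketch below illustrates this with hand-picked accuracies. The name `independent_posterior` and the accuracy values are assumptions for illustration; this is not the code in `tf_snorkel_lite.py`, which learns its parameters from $\Lambda(T)$.

```python
import math

def independent_posterior(row, accuracies):
    """P(y = 1 | row) when labeling functions vote independently.

    row: votes in {-1, 0, 1}
    accuracies: assumed P(vote is correct | vote != 0), one per function
    """
    # Each non-abstaining vote shifts the log-odds by +/- log(a / (1 - a)).
    log_odds = sum(v * math.log(a / (1 - a)) for v, a in zip(row, accuracies))
    return 1 / (1 + math.exp(-log_odds))

acc = [0.8, 0.8, 0.9]  # assumed accuracies, not learned
p_agree = independent_posterior([1, 1, 0], acc)
p_conflict = independent_posterior([1, -1, 0], acc)
p_abstain = independent_posterior([0, 0, 0], acc)
```

Two equally accurate functions that disagree cancel out, so `p_conflict` is 0.5, the same as when every function abstains.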

## Simple example: cheese vs. disease

Let's start with a toy example modeled on the cheese/disease problem that is distributed with the [Stanford MaxEnt classifier](https://nlp.stanford.edu/software/classifier.shtml).

### Cheese/disease data

The first three examples are diseases, and the rest are cheeses:

In [3]:
T = ["gastroenteritis", "gaucher disease", "blue sclera",
     "cure nantais", "charolais", "devon blue"]

In [4]:
y = [1, 1, 1, 0, 0, 0]

### Cheese/disease labeling functions

The first two positively label diseases:

In [5]:
def contains_biological_word(text):
    disease_words = {'disease', 'syndrome', 'cure'}
    return 1.0 if {w for w in disease_words if w in text} else 0.0

In [6]:
def ends_in_itis(text):
    """Positively label diseases"""
    return 1.0 if text.endswith('itis') else 0.0

These positively label cheeses:

In [7]:
def sounds_french(text):
    return -1.0 if text.endswith('ais') else 0.0

In [8]:
def contains_color_word(text):
    colors = {'red', 'blue', 'purple'}
    return -1.0 if {w for w in colors if w in text} else 0.0

### Applying the cheese/disease labelers

We apply all the labeling functions to form the $\Lambda(T)$ matrix [described in the model overview above](#The-data-programming-model):

In [9]:
def apply_labelers(T, labelers):
    return np.array([[l(t) for l in labelers] for t in T])

In [10]:
labelers = [contains_biological_word, ends_in_itis,
            sounds_french, contains_color_word]

In [11]:
L = apply_labelers(T, labelers)

Here's a look at $\Lambda(T)$:

In [12]:
pd.DataFrame(L, columns=[x.__name__ for x in labelers], index=T)

| | contains_biological_word | ends_in_itis | sounds_french | contains_color_word |
|---|---|---|---|---|
| gastroenteritis | 0.0 | 1.0 | 0.0 | 0.0 |
| gaucher disease | 1.0 | 0.0 | 0.0 | 0.0 |
| blue sclera | 0.0 | 0.0 | 0.0 | -1.0 |
| cure nantais | 1.0 | 0.0 | -1.0 | 0.0 |
| charolais | 0.0 | 0.0 | -1.0 | 0.0 |
| devon blue | 0.0 | 0.0 | 0.0 | -1.0 |


### Fitting the generative model to obtain cheese/disease labels

Now we get to the heart of it – using `TfSnorkelGenerative` to synthesize these label-function vectors into a single set of (probabilistic) labels:

In [13]:
snorkel = TfSnorkelGenerative(max_iter=100)

In [14]:
snorkel.fit(L)

Iteration 100: loss: 5.951983451843262

These are the predicted probabilistic labels, along with their non-probabilistic counterparts (derived from mapping scores above 0.5 to 1 and scores at or below 0.5 to 0):

In [15]:
pred_proba = snorkel.predict_proba(L)

In [16]:
pred = snorkel.predict(L)

In [17]:
df = pd.DataFrame({'texts':T, 'true': y, 'predict_proba': pred_proba})

df

| | predict_proba | texts | true |
|---|---|---|---|
| 0 | 0.916934 | gastroenteritis | 1 |
| 1 | 0.836139 | gaucher disease | 1 |
| 2 | 0.083066 | blue sclera | 1 |
| 3 | 0.5 | cure nantais | 0 |
| 4 | 0.163861 | charolais | 0 |
| 5 | 0.083066 | devon blue | 0 |
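The mapping from probabilistic to hard labels described above can be sketched as follows. This only illustrates the thresholding rule stated in the text, not the internals of `TfSnorkelGenerative.predict`, and `to_hard_labels` is a hypothetical helper.

```python
def to_hard_labels(probs, threshold=0.5):
    # Scores strictly above the threshold map to 1; everything else,
    # including an exact tie at the threshold, maps to 0.
    return [1 if p > threshold else 0 for p in probs]

# The predicted probabilities shown in the table above.
probs = [0.916934, 0.836139, 0.083066, 0.5, 0.163861, 0.083066]
hard = to_hard_labels(probs)
```

Note that `cure nantais`, with a score of exactly 0.5, lands in class 0 only because of how ties are broken.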


So we did pretty well. Only `blue sclera` tripped this model up. If we wanted to address that, we could write a labeling function to correct it. But let's retain this mistake to see what impact it has.

### Training discriminative models for cheese/disease prediction

At this point, it's just training classifiers as usual. The only difference is that we're using the potentially noisy labels created by the model.

To round it out, I define a feature function `character_ngram_phi`:

In [18]:
def character_ngram_phi(s, n=4):
    chars = list(s)
    chars = ["<w>"] + chars + ["</w>"]
    data = []
    for i in range(len(chars)-n+1):
        data.append("".join(chars[i: i+n]))
    return Counter(data)
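As a quick check of what this feature function produces, the sketch below restates it compactly and applies it to a hypothetical input, `"brie"`. Each boundary marker counts as a single symbol in the sliding window.

```python
from collections import Counter

def character_ngram_phi(s, n=4):
    # Same logic as the cell above: pad with boundary symbols,
    # then slide a window of n symbols over the sequence.
    chars = ["<w>"] + list(s) + ["</w>"]
    return Counter("".join(chars[i: i+n]) for i in range(len(chars) - n + 1))

feats_brie = character_ngram_phi("brie")
```

For a four-letter word this yields three features: one anchored at the start of the word, one at the end, and the word itself.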

Then we create a feature matrix in the usual way:

In [19]:
vec = DictVectorizer(sparse=False)

feats = [character_ngram_phi(s) for s in T]

X = vec.fit_transform(feats)

And then we fit a model. The real data programming way is to fit this model with the predicted probability values rather than the 1/0 versions of them. The `sklearn` class `LogisticRegression` doesn't support this, but this is an easy extension of [our core TensorFlow framework](tensorflow_models.ipynb):

In [20]:
mod = TfLogisticRegression(max_iter=5000, l2_penalty=0.1)

In [21]:
mod.fit(X, pred_proba)

Iteration 5000: loss: 0.47204923629760745

<tf_snorkel_lite.TfLogisticRegression at 0x11e1607b8>
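To illustrate what fitting with probabilistic labels involves, here is a minimal NumPy sketch of logistic regression trained by gradient descent on cross-entropy against soft targets. It is a simplified stand-in, not the `TfLogisticRegression` implementation; the function name `fit_soft_logreg`, the learning rate, and the toy data are all assumptions.

```python
import numpy as np

def fit_soft_logreg(X, p, eta=0.5, max_iter=2000, l2_penalty=0.0):
    """Gradient descent on cross-entropy between sigmoid(Xw + b) and soft targets p."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        q = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # The gradient has the same form as with hard labels; the targets
        # simply take values in [0, 1] instead of {0, 1}.
        err = q - p
        w -= eta * (X.T @ err / len(p) + l2_penalty * w)
        b -= eta * err.mean()
    return w, b

# Two indicator features, with soft targets that disagree slightly within each group.
X_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
p_toy = np.array([0.9, 0.8, 0.1, 0.2])
w, b = fit_soft_logreg(X_toy, p_toy)
q_toy = 1.0 / (1.0 + np.exp(-(X_toy @ w + b)))
```

On this toy data the fitted probabilities for the two groups approach the mean of their soft targets (0.85 and 0.15), which is the sense in which the model respects the uncertainty in the labels rather than rounding it away.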

In [22]:
cd_pred = mod.predict(X)

In [23]:
df['predicted'] = cd_pred

In [24]:
df

| | predict_proba | texts | true | predicted |
|---|---|---|---|---|
| 0 | 0.916934 | gastroenteritis | 1 | 1 |
| 1 | 0.836139 | gaucher disease | 1 | 1 |
| 2 | 0.083066 | blue sclera | 1 | 0 |
| 3 | 0.5 | cure nantais | 0 | 0 |
| 4 | 0.163861 | charolais | 0 | 0 |
| 5 | 0.083066 | devon blue | 0 | 0 |


That looks good, but the model's ability to generalize seems not so great:

In [25]:
tests = ['maconnais', 'dermatitis']

X_test = vec.transform([character_ngram_phi(s) for s in tests])

In [26]:
mod.predict(X_test)

[0, 0]

We can also use a standard `sklearn` `LogisticRegression` on the 1/0 labels. It works better for the test cases:

In [27]:
lr = LogisticRegression()

lr.fit(X, pred)

lr.predict(X_test)

array(['negative', 'positive'], dtype='<U8')

## In-depth example: Stanford Sentiment Treebank

The toy illustration shows how the model works and suggests it should work. Let's see how we do in practice by returning to the Stanford Sentiment Treebank (SST) – but this time without using any of the training labels!

### SST training set

Here we just load in the SST training data:

In [28]:
sst_train = list(sst.train_reader(class_func=sst.binary_class_func))

We'll keep the training labels as `sst_train_y` for a comparison, but they won't be used for training!

In [29]:
sst_train_texts, sst_train_y = zip(*sst_train)

### Lexicon-based labeling functions

The `vsmdata` distribution contains an excellent multidimensional sentiment lexicon, `Ratings_Warriner_et_al.csv`. The following function loads it into a DataFrame.

In [30]:
def load_warriner_lexicon(src_filename, df=None):
    """Read in 'Ratings_Warriner_et_al.csv' and optionally restrict its 
    vocabulary to items in `df`.
    
    Parameters
    ----------
    src_filename : str
        Full path to 'Ratings_Warriner_et_al.csv'
    df : pd.DataFrame or None
        If this is given, then its index is intersected with the 
        vocabulary from the lexicon, and we return a lexicon 
        containing only values in both vocabularies.
        
    Returns
    -------
    pd.DataFrame
    
    """
    lexicon = pd.read_csv(src_filename, index_col=0)
    lexicon = lexicon[['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum']]
    lexicon = lexicon.set_index('Word').rename(
        columns={'V.Mean.Sum': 'Valence', 
                 'A.Mean.Sum': 'Arousal', 
                 'D.Mean.Sum': 'Dominance'})
    if df is not None:
        shared_vocab = sorted(set(lexicon.index) & set(df.index))
        lexicon = lexicon.loc[shared_vocab]
    return lexicon

In [31]:
lex = load_warriner_lexicon(
    os.path.join('vsmdata', 'Ratings_Warriner_et_al.csv'))

The lexicon contains scores, rather than classes, so I create positive and negative sets from the words that are one standard deviation above and below the mean, respectively:

In [32]:
sd_high = lex['Valence'].mean() + lex['Valence'].std()

In [33]:
sd_low = lex['Valence'].mean() - lex['Valence'].std()

In [34]:
pos_words = set(lex[lex['Valence'] > sd_high].index)

In [35]:
neg_words = set(lex[lex['Valence'] < sd_low].index)

In [36]:
def lex_pos_labeler(tree):
    return 1 if set(tree.leaves()) & pos_words else 0    

In [37]:
def lex_neg_labeler(tree):
    return -1 if set(tree.leaves()) & neg_words else 0    

### Other SST labeling function ideas

* More lexicon-based features: http://sentiment.christopherpotts.net/lexicons.html

* Position-sensitive lexicon features. For example, perhaps core lexicon features should be reversed if there is a preceding negation or a following _but_.

* Features for near-neighbors of lexicon words, in a VSM derived from, say, `imdb5` or `imdb20` from our VSM unit.

* Features identifying specific actors and directors, building in assumptions that their movies are good or bad.

* Negations like _not_, _never_, _no one_, and _nothing_ as signals of negativity in the evaluative sense ([Potts 2010](https://journals.linguisticsociety.org/proceedings/index.php/SALT/article/view/2565)); universal quantifiers like _always_, _all_, and _every_ as signals of positivity.
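As one illustration of the position-sensitive idea, the sketch below flips the polarity of a lexicon hit when the immediately preceding token is a negation. It is only a starting point: the `NEGATIONS` list, the function name, and the toy word sets are hypothetical, and the one-token window is a deliberate simplification.

```python
NEGATIONS = {"not", "never", "n't", "no"}  # hypothetical list

def negation_aware_labeler(tokens, pos_words, neg_words):
    """Vote +1/-1 from lexicon hits, flipping a hit that directly follows a negation."""
    score = 0
    for i, tok in enumerate(tokens):
        tok = tok.lower()
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (tok in pos_words) - (tok in neg_words)
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            polarity = -polarity
        score += polarity
    # Collapse the summed score to a vote in {-1, 0, 1}.
    return (score > 0) - (score < 0)

toy_pos, toy_neg = {"good", "charming"}, {"dull", "bad"}
v1 = negation_aware_labeler("a charming film".split(), toy_pos, toy_neg)
v2 = negation_aware_labeler("not good at all".split(), toy_pos, toy_neg)
v3 = negation_aware_labeler("never dull".split(), toy_pos, toy_neg)
```

With SST trees, `tree.leaves()` would supply the token list, and the function could be wrapped so that it takes a single `tree` argument like the other labeling functions.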

### Applying the SST labeling functions

In [38]:
sst_train_labels = apply_labelers(
    sst_train_texts, 
    [lex_neg_labeler, lex_pos_labeler])

### Fitting the SST generative model

In [39]:
nb = TfSnorkelGenerative(max_iter=1000)

nb.fit(sst_train_labels)

sst_train_predicted_y = nb.predict(sst_train_labels)

Iteration 1000: loss: 2.0862767696380615

### Direct assessment of the inferred labels against the gold ones

Since we have the labels, we can see how we did in reconstructing them:

In [40]:
print(classification_report(sst_train_y, sst_train_predicted_y))

             precision    recall  f1-score   support

   negative       0.60      0.62      0.61      3310
   positive       0.64      0.62      0.63      3610

avg / total       0.62      0.62      0.62      6920



Pretty good! With more labeling functions we could do better. It's tempting to hill-climb on this directly, but that's not especially realistic. However, it does suggest that, when doing data programming, one does well to have labels that are used strictly to improve the labeling functions (which can be run on a much larger dataset to create the training data).
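If a small gold-labeled set is reserved for improving the labeling functions, one simple diagnostic is each function's coverage (how often it votes) and its accuracy when it does vote. The sketch below is a hypothetical helper along those lines; it assumes gold labels coded as -1/1, which would require mapping the SST `'negative'`/`'positive'` strings first.

```python
def labeler_report(label_matrix, gold):
    """Coverage and accuracy-when-voting for each column of a {-1, 0, 1} matrix.

    gold: true labels coded as -1/1 (an assumed coding for this sketch).
    """
    n_funcs = len(label_matrix[0])
    report = []
    for j in range(n_funcs):
        # Keep only the examples on which function j does not abstain.
        votes = [(row[j], y) for row, y in zip(label_matrix, gold) if row[j] != 0]
        coverage = len(votes) / len(gold)
        accuracy = sum(v == y for v, y in votes) / len(votes) if votes else None
        report.append({"coverage": coverage, "accuracy": accuracy})
    return report

L_dev = [[1, 0], [1, -1], [0, -1], [0, 0]]
gold_dev = [1, 1, -1, 1]
stats = labeler_report(L_dev, gold_dev)
```

Here the first function votes on half of the examples and is always right, while the second is right only half of the time it votes, which would make it the natural candidate for revision.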

### Fitting a discriminative model on the noisy labels

And now we slip back into the usual SST classifier workflow. As a reminder, `unigrams_phi` gets 0.77 average F1 on the `dev` set when we train on the actual gold labels. Can we approach that performance by writing excellent labeling functions?

In [41]:
def unigrams_phi(tree):
    return Counter(tree.leaves())    

In [42]:
train = sst.build_dataset(
    sst.train_reader, 
    phi=unigrams_phi,
    class_func=sst.binary_class_func)

Now we swap the true labels for the predicted ones:

In [43]:
train['y'] = sst_train_predicted_y

We assess against the `dev` set, which is unchanged – that is, for assessment, we use the gold labels:

In [44]:
dev = sst.build_dataset(
    sst.dev_reader,
    phi=unigrams_phi,
    class_func=sst.binary_class_func,
    vectorizer=train['vectorizer'])

In the cheese/disease example, `LogisticRegression` worked best, so we'll continue to use it:

In [45]:
mod = LogisticRegression()

In [46]:
mod.fit(train['X'], sst_train_predicted_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [47]:
snorkel_dev_preds = mod.predict(dev['X'])

In [48]:
print(classification_report(dev['y'], snorkel_dev_preds))

             precision    recall  f1-score   support

   negative       0.60      0.59      0.60       428
   positive       0.61      0.62      0.61       444

avg / total       0.61      0.61      0.61       872



At this point, we might [return to writing more labeling functions](#Other-SST-labeling-function-ideas), in the hope of improving our dev-set results. We got this far with only two simple lexicon-based labeling functions, so there is reason to be optimistic that we can train effective models without showing our models any gold labels!

## Extra-credit bake-off

This is a fast, optional bake-off intended to be done in class on May 2:
    
__Question__: How good an F1 score can you get with the function call in [Direct assessment of the inferred labels against the gold ones](#Direct-assessment-of-the-inferred-labels-against-the-gold-ones) above? This just compares the actual gold labels in the train set against the ones you're creating with data programming.

__To submit__:

1. Your average F1 score from this assessment.
1. A description of the labeling functions you wrote to get this score.

To get full credit, you just need to write at least one new labeling function and try it out.

Submission URL: https://goo.gl/forms/MtyQHoWDHmU5oEyt1

The close-time for this is May 2, 11:59 pm.

In [49]:
import utils
vsmdata_home = 'vsmdata'

glove_home = os.path.join(vsmdata_home, 'glove.6B')
glove_lookup = utils.glove2dict(os.path.join(glove_home, 'glove.6B.300d.txt'))

In [50]:
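# Import the submodule explicitly; a bare `import scipy` is not guaranteed
# to make `scipy.spatial.distance` available.
import scipy.spatial.distance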

In [51]:
lex

| Word | Valence | Arousal | Dominance |
|---|---|---|---|
| aardvark | 6.26 | 2.41 | 4.27 |
| abalone | 5.30 | 2.65 | 4.95 |
| abandon | 2.84 | 3.73 | 3.32 |
| abandonment | 2.63 | 4.95 | 2.64 |
| abbey | 5.85 | 2.20 | 5.00 |
| abdomen | 5.43 | 3.68 | 5.15 |
| abdominal | 4.48 | 3.50 | 5.32 |
| abduct | 2.42 | 5.90 | 2.75 |
| abduction | 2.05 | 5.33 | 3.02 |
| abide | 5.52 | 3.26 | 5.33 |


In [53]:
lex2glove = {}
for w in lex.index:
    w = str(w).lower()
    if w in glove_lookup:
        lex2glove[w] = glove_lookup[w]

In [92]:
'yuletide' in lex['Valence']

True

In [113]:
def negation_labeler(tree):
    return -1 if set(tree.leaves()) & negation_set else 0

def universal_quantifier_labeler(tree):
    return 1 if set(tree.leaves()) & set(['always', 'all', 'every']) else 0

negation_set = set(['not', 'never', 'no one', 'nothing', 'no', 'neither', 'nobody', 'none', 'nowhere', 'nor'])
v_high_small = lex['Valence'].mean() + 0.8 * lex['Valence'].std()
v_low_small = lex['Valence'].mean() - 0.8 * lex['Valence'].std()

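def nn_word_label(tree):
    """Vote by the valence of each word's nearest lexicon neighbor in GloVe space."""
    label = 0
    words = set([x.lower() for x in tree.leaves()])
    for w in words:
        if w in lex2glove:
            # Linear scan for the nearest neighbor, computing each distance once.
            # Caution: `w` is itself a key of `lex2glove`, so it will normally
            # come out as its own nearest neighbor (distance 0).
            min_dist = np.inf
            nn_w = ''
            for k in lex2glove:
                dist = scipy.spatial.distance.cosine(lex2glove[w], lex2glove[k])
                if dist < min_dist:
                    nn_w = k
                    min_dist = dist
            if nn_w in lex['Valence']:
                if lex['Valence'][nn_w] > sd_high:
                    label += 1
                if lex['Valence'][nn_w] < sd_low:
                    label -= 1
    # Collapse the running count to a vote in {-1, 0, 1}.
    if label > 0:
        return 1
    elif label < 0:
        return -1
    else:
        return 0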
    
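def lex_mean_pos(tree):
    # Despite the name, this uses the maximum valence among the lexicon words.
    # Only tokens that appear in the lexicon as-is pass the filter, so
    # capitalized tokens are skipped. With no lexicon words, `m` is NaN
    # and the comparison below is False, so the function abstains.
    words = [x.lower() for x in tree.leaves() if x in lex.index]
    m = lex['Valence'][words].max()
    return 1 if m > v_high_small else 0

def lex_mean_neg(tree):
    # Mirror image of `lex_mean_pos`, using the minimum valence.
    words = [x.lower() for x in tree.leaves() if x in lex.index]
    m = lex['Valence'][words].min()
    return -1 if m < v_low_small else 0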
    
def lex_pos_labeler_with_neg(tree):
    words = set([x.lower() for x in tree.leaves()])
    label = 1 if words & pos_words else 0
    # Flip the vote when an odd number of distinct negation words is present.
    label *= -1 if len(words & negation_set)%2==1 else 1
    return label

def lex_neg_labeler_with_neg(tree):
    words = set([x.lower() for x in tree.leaves()])
    label = -1 if words & neg_words else 0
    # Flip the vote when an odd number of distinct negation words is present.
    label *= -1 if len(words & negation_set)%2==1 else 1
    return label

In [None]:
%%time
sst_train_labels = apply_labelers(
    sst_train_texts, 
    [lex_mean_pos, lex_mean_neg])

nb = TfSnorkelGenerative(max_iter=1000)
nb.fit(sst_train_labels)
sst_train_predicted_y = nb.predict(sst_train_labels)

print(classification_report(sst_train_y, sst_train_predicted_y))

Iteration 863: loss: 2.1044075489044197

In [108]:
%%time
sst_train_labels = apply_labelers(
    sst_train_texts, 
    [lex_pos_labeler_with_neg, lex_neg_labeler_with_neg, lex_mean_pos, lex_mean_neg])

nb = TfSnorkelGenerative(max_iter=1000)
nb.fit(sst_train_labels)
sst_train_predicted_y = nb.predict(sst_train_labels)

print(classification_report(sst_train_y, sst_train_predicted_y))

Iteration 1000: loss: 8.439771652221687

             precision    recall  f1-score   support

   negative       0.61      0.63      0.62      3310
   positive       0.65      0.63      0.64      3610

avg / total       0.63      0.63      0.63      6920

CPU times: user 38.6 s, sys: 1.12 s, total: 39.7 s
Wall time: 40.6 s


In [84]:
%%time
sst_train_labels = apply_labelers(
    sst_train_texts, 
    [lex_pos_labeler_with_neg, lex_neg_labeler_with_neg])

nb = TfSnorkelGenerative(max_iter=1000)
nb.fit(sst_train_labels)
sst_train_predicted_y = nb.predict(sst_train_labels)

print(classification_report(sst_train_y, sst_train_predicted_y))

Iteration 1000: loss: 2.0862767696380615

             precision    recall  f1-score   support

   negative       0.59      0.69      0.64      3310
   positive       0.66      0.57      0.61      3610

avg / total       0.63      0.62      0.62      6920

CPU times: user 30.1 s, sys: 825 ms, total: 30.9 s
Wall time: 33.3 s
