# Empirical Distribution Predictors

We may be losing a lot of information in the annotations by condensing them into a single number. Instead, we can train a model to predict the empirical distribution that the annotations form over the answer choices, by minimizing the cross-entropy between the predicted and the empirical distributions. This is essentially softmax classification, but off-the-shelf implementations don't let you pass a distribution as a training label, so we have to roll our own in TensorFlow.
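To make the objective concrete, here is a minimal NumPy sketch of softmax regression trained against distribution-valued labels. It is not the `ED_CLF` implementation used below (that one is written in TensorFlow and its internals are not shown here); the function names, the toy data, and the plain full-batch gradient descent are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between target distributions and softmax(logits).

    `targets` holds one probability distribution per row, not a class index.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

def fit_softmax_regression(X, Y, epochs=500, lr=0.5):
    """Full-batch gradient descent on the soft-label cross-entropy."""
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        P = softmax(X @ W + b)
        grad_logits = (P - Y) / n  # gradient of the mean loss w.r.t. the logits
        W -= lr * (X.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b

# Toy data (illustrative values only): 4 comments, 2 features, 2 answer choices
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
Y = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.0, 1.0]])

W, b = fit_softmax_regression(X, Y)
P = softmax(X @ W + b)
```

When every row of `targets` is one-hot, `soft_cross_entropy` reduces to the usual classification loss, which is why the plurality baseline and the empirical-distribution model can share the same training code.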

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf
from ngram import *
from baselines import *

In [3]:
data_filename = '../../data/v4_annotated/annotated_onion_layer_5_rows_0_to_5000_raters_20.csv'
d = load_cf_labels(data_filename)
d = tidy_labels(d)
d = d.dropna(subset=['attack'])
# Shuffle the rows
d = d.iloc[np.random.permutation(d.shape[0])]

In [4]:
ngram_feature_pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range = (1,6), max_features = 5000)),
    ('tfidf', TfidfTransformer(sublinear_tf=True,norm='l2')),
])

# Fit Plurality

Fit a softmax regression to each comment's most common annotation.
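As a rough illustration of what a plurality label is, the sketch below picks the most common annotation per comment. It is an assumption about what the `plurality` helper does, not its actual code; `plurality_label` and the toy `ratings` are hypothetical.

```python
from collections import Counter

def plurality_label(annotations):
    """Most common annotation for one comment; ties go to the value seen first."""
    return Counter(annotations).most_common(1)[0][0]

# Hypothetical ratings: comment id -> binary 'attack' judgments from each rater
ratings = {
    "c1": [1, 1, 0, 1],
    "c2": [0, 0, 1, 0, 0],
}
labels = {cid: plurality_label(votes) for cid, votes in ratings.items()}

# Two-column target in the [y, 1 - y] layout that the softmax expects
targets = {cid: [y, 1 - y] for cid, y in labels.items()}
```

Note that the minority votes vanish entirely from `targets`, which is exactly the information loss the empirical-distribution model is meant to avoid.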

In [5]:
labels = plurality(d['attack'])

data = get_labeled_comments(d, labels)
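train, test = split(data, test_size = 0.2)

# Label columns are everything except the comment text 'x'; take the first one.
# (.loc replaces .ix, which was removed from pandas.)
y_train = train.loc[:, train.columns != 'x'].values[:, 0]
y_test = test.loc[:, test.columns != 'x'].values[:, 0]
# Expand the binary label into a two-column target for the softmax.
y_train = np.array([y_train, 1 - y_train]).T
y_test = np.array([y_test, 1 - y_test]).T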

ngram_feature_extractor = ngram_feature_pipeline.fit(train['x'])
X_train = ngram_feature_extractor.transform(train['x'])
X_test = ngram_feature_extractor.transform(test['x'])

ED_CLF(X_train,
        y_train,
        X_test,
        y_test,
        training_epochs = 300,
        batch_size = 200,
        display_step = 15)

Epoch: 0001 cost= 0.826558864
Accuracy: 0.55903
Accuracy: 0.541542
Epoch: 0016 cost= 0.645986644
Accuracy: 0.666333
Accuracy: 0.647648
Epoch: 0031 cost= 0.538006697
Accuracy: 0.730865
Accuracy: 0.693694
Epoch: 0046 cost= 0.461819182
Accuracy: 0.776138
Accuracy: 0.733734
Epoch: 0061 cost= 0.405223148
Accuracy: 0.823162
Accuracy: 0.747748
Epoch: 0076 cost= 0.361157354
Accuracy: 0.851926
Accuracy: 0.762763
Epoch: 0091 cost= 0.325631864
Accuracy: 0.875688
Accuracy: 0.76977
Epoch: 0106 cost= 0.296221656
Accuracy: 0.892196
Accuracy: 0.775776
Epoch: 0121 cost= 0.271312231
Accuracy: 0.907704
Accuracy: 0.77978
Epoch: 0136 cost= 0.249868573
Accuracy: 0.917459
Accuracy: 0.777778
Epoch: 0151 cost= 0.231153783
Accuracy: 0.929715
Accuracy: 0.785786
Epoch: 0166 cost= 0.214671709
Accuracy: 0.938469
Accuracy: 0.788789
Epoch: 0181 cost= 0.199999733
Accuracy: 0.945473
Accuracy: 0.790791
Epoch: 0196 cost= 0.186850882
Accuracy: 0.953227
Accuracy: 0.791792
Epoch: 0211 cost= 0.174963344
Accuracy: 0.957229
Ac

# Fit Empirical Distribution
Fit a softmax regression to the empirical distribution of annotations over answer choices.
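For intuition, an empirical distribution can be computed as the fraction of raters who picked each answer choice. The sketch below is a hypothetical stand-in, not the `empirical_dist` helper itself; in particular it ignores the `w` argument, whose meaning is not spelled out here.

```python
from collections import Counter

def empirical_distribution(annotations, choices):
    """Fraction of annotations that picked each answer choice, in `choices` order."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in choices]

# Hypothetical ratings: comment id -> binary 'attack' judgments from each rater
ratings = {
    "c1": [1, 1, 0, 1],
    "c2": [0, 0, 1, 0, 0],
}
choices = [1, 0]  # column order: attack, not attack
dists = {cid: empirical_distribution(votes, choices) for cid, votes in ratings.items()}
```

Each row of the training target is now a full distribution rather than a one-hot vector, so disagreement among raters is preserved in the label.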

In [6]:
labels = empirical_dist(d['attack'], w = 0.5)
data = get_labeled_comments(d, labels)
train, test = split(data, test_size = 0.2)

# Keep every label column: each row is a distribution over answer choices.
y_train = train.loc[:, train.columns != 'x'].values
y_test = test.loc[:, test.columns != 'x'].values

ngram_feature_extractor = ngram_feature_pipeline.fit(train['x'])
X_train = ngram_feature_extractor.transform(train['x'])
X_test = ngram_feature_extractor.transform(test['x'])

ED_CLF(X_train,
        y_train,
        X_test,
        y_test,
        training_epochs = 300,
        batch_size = 200,
        display_step = 15)

Epoch: 0001 cost= 0.860558868
Accuracy: 0.646573
Accuracy: 0.668669
Epoch: 0016 cost= 0.715179900
Accuracy: 0.674087
Accuracy: 0.671672
Epoch: 0031 cost= 0.643569165
Accuracy: 0.731616
Accuracy: 0.698699
Epoch: 0046 cost= 0.596841331
Accuracy: 0.775388
Accuracy: 0.71972
Epoch: 0061 cost= 0.564251946
Accuracy: 0.809155
Accuracy: 0.733734
Epoch: 0076 cost= 0.540559257
Accuracy: 0.833667
Accuracy: 0.742743
Epoch: 0091 cost= 0.522752478
Accuracy: 0.855678
Accuracy: 0.760761
Epoch: 0106 cost= 0.509014375
Accuracy: 0.872436
Accuracy: 0.772773
Epoch: 0121 cost= 0.498218831
Accuracy: 0.884942
Accuracy: 0.771772
Epoch: 0136 cost= 0.489562381
Accuracy: 0.896198
Accuracy: 0.777778
Epoch: 0151 cost= 0.482511312
Accuracy: 0.905203
Accuracy: 0.77978
Epoch: 0166 cost= 0.476710386
Accuracy: 0.913957
Accuracy: 0.780781
Epoch: 0181 cost= 0.471877582
Accuracy: 0.922961
Accuracy: 0.786787
Epoch: 0196 cost= 0.467798205
Accuracy: 0.928964
Accuracy: 0.785786
Epoch: 0211 cost= 0.464327448
Accuracy: 0.933967
A