# Bake-off: Word-level entailment with neural networks

In [2]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018"

## Contents

0. [Overview](#Overview)
0. [Set-up](#Set-up)
0. [Data](#Data)
  0. [Edge disjoint](#Edge-disjoint)
  0. [Word disjoint](#Word-disjoint)
  0. [Word disjoint and balanced](#Word-disjoint-and-balanced)
0. [Baseline](#Baseline)
  0. [Representing words: vector_func](#Representing-words:-vector_func)
  0. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  0. [Classifier model](#Classifier-model)
  0. [Baseline results](#Baseline-results)
0. [Bake-off submission](#Bake-off-submission)

## Overview

__Problem__: Word-level natural language inference.

Training examples are labeled word pairs $((w_{L}, w_{R}), y)$, where $y$ is one of the following relations:

* __synonym__: very roughly identical meanings; symmetric
* __hyponym__: e.g., _puppy_ is a hyponym of _dog_
* __hypernym__:  e.g., _dog_ is a hypernym of _puppy_
* __antonym__: semantically opposed within a domain; symmetric

The dataset is due to [Bowman et al. 2015](https://arxiv.org/abs/1406.1827). See [below](#Data) for details on how it was processed for this bake-off.

## Set-up

0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/).

0. Make sure you have [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of pretrained GloVe vectors downloaded and unzipped, and that `glove_home` below is pointing to it.

0. Make sure `wordentail_filename` below is pointing to the full path for `nli_wordentail_bakeoff_data.json`, which is included in [the nlidata.zip archive](http://web.stanford.edu/class/cs224u/data/nlidata.zip).

In [1]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
import tensorflow as tf
from tf_shallow_neural_classifier import TfShallowNeuralClassifier
import nli
import utils

In [2]:
nlidata_home = 'nlidata'

wordentail_filename = os.path.join(
    nlidata_home, 'nli_wordentail_bakeoff_data.json')

glove_home = os.path.join("vsmdata", "glove.6B")

## Data

As noted above, the dataset was originally released by [Bowman et al. 2015](https://arxiv.org/abs/1406.1827), who derived it from [WordNet](https://wordnet.princeton.edu) using some heuristics (and thus it might contain some errors or unintuitive pairings).

I've processed the data into three different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.
* `word_disjoint_balanced`: Like `word_disjoint`, but with each word appearing at most one time as the left word and at most one time on the right for a given relation type.

These are progressively harder problems:

* For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.

* For `word_disjoint_balanced`, the model can't even learn that some terms tend to appear more on the left or the right. This might be a step too far. For example, appearing more on the right for `hypernym` corresponds in a deep way with being a more general term, which is a non-trivial lexical property that we want our models to learn.
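To make the difference between edge-disjoint and word-disjoint concrete, here is a minimal sketch on toy data. The helper names `vocab_of` and `edge_overlap` and the toy examples are hypothetical; they are not the `nli` functions used below, whose implementations aren't shown in this notebook.

```python
def vocab_of(examples):
    """Set of all words appearing in either position of any example."""
    return {w for (left, right), _ in examples for w in (left, right)}

def edge_overlap(train, dev):
    """Word pairs (labels ignored) that occur in both portions."""
    train_edges = {tuple(pair) for pair, _ in train}
    dev_edges = {tuple(pair) for pair, _ in dev}
    return train_edges & dev_edges

# Toy data in the same [[left, right], label] format as the bake-off file:
toy_train = [[['puppy', 'dog'], 'hyponym'], [['dog', 'animal'], 'hyponym']]

# Edge-disjoint from `toy_train`, but it reuses the words 'puppy' and 'animal':
toy_dev_edge = [[['puppy', 'animal'], 'hyponym']]

# Word-disjoint from `toy_train` (and hence edge-disjoint as well):
toy_dev_word = [[['kitten', 'cat'], 'hyponym']]

print(edge_overlap(toy_train, toy_dev_edge))
print(vocab_of(toy_train) & vocab_of(toy_dev_edge))
print(vocab_of(toy_train) & vocab_of(toy_dev_word))
```

With the first dev set, a model can still benefit from what it learned about individual words during training; with the second, it has to rely entirely on its input representations.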

In [3]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the three splits plus a list giving the vocabulary for the entire dataset:

In [4]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint', 'word_disjoint_balanced'])

### Edge disjoint

In [4]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; all three have this same format:

In [5]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['archived', 'records'], 'synonym'],
 [['stage', 'station'], 'synonym'],
 [['engineers', 'design'], 'hypernym'],
 [['save', 'book'], 'hypernym'],
 [['match', 'supply'], 'hypernym']]

Let's test to make sure no edges are shared between `train` and `dev`:

In [6]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [7]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

4769

This is a large percentage of the entire vocab (4769 of 6560 words, about 73%):

In [8]:
len(wordentail_data['vocab'])

6560

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [96]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()

In [11]:
label_distribution('edge_disjoint')

synonym     8865
hypernym    6475
hyponym     1044
antonym      629
Name: 1, dtype: int64
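To put a number on the imbalance, the counts above can be normalized. This is a small sketch that hard-codes the `edge_disjoint` train counts shown above; the variable names are just for illustration.

```python
# Counts copied from the `edge_disjoint` train distribution above:
train_counts = {'synonym': 8865, 'hypernym': 6475, 'hyponym': 1044, 'antonym': 629}

total = sum(train_counts.values())
proportions = {label: count / total for label, count in train_counts.items()}
majority_label = max(train_counts, key=train_counts.get)

for label, p in proportions.items():
    print("{:10s} {:.3f}".format(label, p))
print("majority label:", majority_label)
```

A classifier that always guessed the majority label would be right on roughly half of the training examples while never predicting `hyponym` or `antonym`, which is worth keeping in mind when reading the per-class scores later on.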

### Word disjoint

In [12]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [13]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [14]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the overall number of training examples is smaller (10,616 vs. 17,013):

In [15]:
label_distribution('word_disjoint')

synonym     5610
hypernym    3993
hyponym      627
antonym      386
Name: 1, dtype: int64

There is still an important bias in the data: some words appear much more often than others, and in specific positions. For example, the very general term `part` appears on the right in a large number of cases, many of them `hypernym`.

In [16]:
[[ex, y] for ex, y in wordentail_data['word_disjoint']['train'] 
 if ex[1] == 'part']

[[['frames', 'part'], 'hypernym'],
 [['heaven', 'part'], 'hypernym'],
 [['pan', 'part'], 'synonym'],
 [['middle', 'part'], 'hypernym'],
 [['shared', 'part'], 'synonym'],
 [['shares', 'part'], 'synonym'],
 [['ended', 'part'], 'hypernym'],
 [['twin', 'part'], 'synonym'],
 [['meal', 'part'], 'synonym'],
 [['bit', 'part'], 'hypernym'],
 [['sections', 'part'], 'synonym'],
 [['capacity', 'part'], 'hypernym'],
 [['beginning', 'part'], 'hypernym'],
 [['divorce', 'part'], 'hypernym'],
 [['paradise', 'part'], 'hypernym'],
 [['ends', 'part'], 'hypernym'],
 [['reduced', 'part'], 'hypernym'],
 [['units', 'part'], 'hypernym'],
 [['corner', 'part'], 'hypernym'],
 [['air', 'part'], 'hypernym'],
 [['section', 'part'], 'synonym'],
 [['something', 'part'], 'synonym'],
 [['reduce', 'part'], 'hypernym'],
 [['some', 'part'], 'synonym'],
 [['heavy', 'part'], 'hypernym'],
 [['segment', 'part'], 'hypernym'],
 [['share', 'part'], 'synonym'],
 [['hat', 'part'], 'hypernym'],
 [['maria', 'part'], 'hypernym'],
 ...]

These tabulations suggest that a classifier could do well just by learning where words tend to appear:

In [17]:
def count_label_position_instances(split, pos=0):
    examples = wordentail_data[split]['train']    
    return pd.Series([(ex[pos], label) for ex, label in examples]).value_counts()

In [18]:
count_label_position_instances('word_disjoint', pos=0).head()

(forms, hypernym)       9
(items, synonym)        8
(questions, synonym)    8
(lots, synonym)         8
(question, synonym)     8
dtype: int64

In [19]:
count_label_position_instances('word_disjoint', pos=1).head()

(be, hypernym)        51
(take, hypernym)      39
(alter, hypernym)     38
(person, hypernym)    33
(modify, hypernym)    32
dtype: int64

### Word disjoint and balanced

To see how much our models are leveraging the uneven distribution of words across the left and right positions, we also have a split in which each word $w$ appears in at most one item $((w, w_{R}), y)$ and at most one item $((w_{L}, w), y)$.

The following tests establish that the dataset has the desired properties:

In [20]:
wordentail_data['word_disjoint_balanced'].keys()

dict_keys(['dev', 'train'])

In [21]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint_balanced')

0

In [22]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint_balanced')

0

In [23]:
[[ex, y] for ex, y in wordentail_data['word_disjoint_balanced']['train'] 
 if ex[1] == 'part']

[[['frames', 'part'], 'hypernym'], [['pan', 'part'], 'synonym']]

In [24]:
count_label_position_instances('word_disjoint_balanced', pos=0).head()

(import, antonym)          1
(strip, hypernym)          1
(state, synonym)           1
(subsequently, synonym)    1
(life, synonym)            1
dtype: int64

In [25]:
count_label_position_instances('word_disjoint_balanced', pos=1).head()

(distance, synonym)    1
(chief, synonym)       1
(sign, hypernym)       1
(visual, synonym)      1
(please, synonym)      1
dtype: int64
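The balance property can also be checked directly rather than by eyeballing the head of the counts. Here is a minimal sketch on toy data; `max_position_count` is a hypothetical helper, not part of `nli`.

```python
from collections import Counter

def max_position_count(examples, pos):
    """Largest number of times any (word, label) combination occurs with
    the word in position `pos` (0 = left, 1 = right)."""
    counts = Counter((pair[pos], label) for pair, label in examples)
    return max(counts.values()) if counts else 0

# 'part' appears twice on the right, but with different labels, so this
# still satisfies the per-relation balance condition:
toy_balanced = [[['frames', 'part'], 'hypernym'], [['pan', 'part'], 'synonym']]

# Adding a second (…, 'part') hypernym pair violates it:
toy_unbalanced = toy_balanced + [[['heaven', 'part'], 'hypernym']]

print(max_position_count(toy_balanced, pos=1))
print(max_position_count(toy_unbalanced, pos=1))
```

For the real data, one would expect `max_position_count(wordentail_data['word_disjoint_balanced']['train'], pos)` to be 1 for both positions if the split has the advertised property.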

## Baseline

Even in deep learning, __feature representation is the most important thing and requires care!__
For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [5]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [6]:
# Any of the files in glove.6B will work here:
glove50_src = os.path.join(glove_home, 'glove.6B.50d.txt')

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE50 = utils.glove2dict(glove50_src)

def glove50vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE50.get(w, randvec(w, n=50))

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [7]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here.
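As a starting point for that experimentation, here are a few alternative combination functions. These are sketches of my own naming (`vec_diff`, `vec_average`, `vec_concat_diff_prod`), not functions provided by the repository; any function that maps two vectors to one vector should fit the same slot.

```python
import numpy as np

def vec_diff(u, v):
    """Element-wise difference; sensitive to the order of the two words."""
    return np.asarray(u) - np.asarray(v)

def vec_average(u, v):
    """Element-wise mean; gives the same result for (u, v) and (v, u)."""
    return (np.asarray(u) + np.asarray(v)) / 2.0

def vec_concat_diff_prod(u, v):
    """Concatenation of u, v, their difference, and their element-wise
    product, giving an input four times the size of one word vector."""
    u, v = np.asarray(u), np.asarray(v)
    return np.concatenate((u, v, u - v, u * v))

u = np.array([1.0, 2.0])
v = np.array([3.0, 5.0])
print(vec_diff(u, v))
print(vec_average(u, v))
print(vec_concat_diff_prod(u, v))
```

One design consideration: `hyponym` and `hypernym` are distinguished only by word order, so a symmetric combination like `vec_average` on its own cannot tell `(u, v)` from `(v, u)`.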

### Classifier model

For a baseline model, I chose `TfShallowNeuralClassifier` with a pretty large hidden layer and a correspondingly high number of iterations. 

In [8]:
net = TfShallowNeuralClassifier(hidden_dim=200, max_iter=500)
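For intuition about what a model of this shape computes, here is a NumPy sketch of a one-hidden-layer forward pass. It is an illustration only: the ReLU activation and the parameter names mirror the dropout subclass defined later in this notebook, the weights are random rather than trained, and the softmax output is my assumption about how class probabilities would be obtained.

```python
import numpy as np

def shallow_forward(x, W_xh, b_h, W_hy, b_y):
    """Forward pass of a one-hidden-layer classifier: ReLU hidden layer
    followed by a softmax over the output classes."""
    hidden = np.maximum(0.0, x @ W_xh + b_h)
    logits = hidden @ W_hy + b_y
    # Subtract the row-wise max before exponentiating for numerical stability:
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.RandomState(0)
input_dim, hidden_dim, n_classes = 100, 200, 4  # two 50-d vectors concatenated

x = rng.uniform(-1.0, 1.0, size=(3, input_dim))          # 3 toy examples
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(scale=0.1, size=(hidden_dim, n_classes))
b_y = np.zeros(n_classes)

probs = shallow_forward(x, W_xh, b_h, W_hy, b_y)
print(probs.shape)
```

With concatenated 50-dimensional inputs, `hidden_dim=200` means the hidden layer is twice as wide as the input.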

### Baseline results

The following puts the above pieces together, using `vector_func=glove50vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint` and `word_disjoint_balanced`!

First, we build the dataset:

In [9]:
X = nli.build_bakeoff_dataset(
    wordentail_data, 
    vector_func=glove50vec,
    vector_combo_func=vec_concatenate)

And then we run the experiment with `nli.bakeoff_experiment`. This trains and tests on all three splits, and additionally trains on `word_disjoint`'s `train` portion and tests on `word_disjoint_balanced`'s `dev` portion, to see what distribution of examples is more effective for this balanced evaluation.

Since the bake-off focus is `word_disjoint`, you might want to run just that evaluation. To do that, use:

In [11]:
nli.bakeoff_experiment(X, net, conditions=['word_disjoint'])

Iteration 500: loss: 9.8291873931884778

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.55      0.41      0.47      1594
    hyponym       0.20      0.01      0.02       275
    synonym       0.58      0.79      0.67      2229

avg / total       0.52      0.57      0.53      4248
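When comparing systems, it matters how the per-class F1 scores are averaged. The sketch below recomputes two common averages from the rounded per-class numbers in the report above. The arithmetic suggests that the `avg / total` row is weighted by support, though that is an inference from the numbers rather than something stated in this notebook.

```python
# Per-class F1 and support copied from the `word_disjoint` report above:
f1 = {'antonym': 0.00, 'hypernym': 0.47, 'hyponym': 0.02, 'synonym': 0.67}
support = {'antonym': 150, 'hypernym': 1594, 'hyponym': 275, 'synonym': 2229}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print("macro F1:    {:.2f}".format(macro_f1))
print("weighted F1: {:.2f}".format(weighted_f1))
```

The gap between the two reflects how much the weighted figure is carried by `synonym` and `hypernym`; the macro average penalizes the near-zero scores on the two rare classes much more heavily.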



This will run the complete evaluation:

In [33]:
nli.bakeoff_experiment(X, net)



edge_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       392
   hypernym       0.57      0.45      0.50      4310
    hyponym       0.45      0.04      0.07       710
    synonym       0.59      0.79      0.68      5930

avg / total       0.55      0.59      0.55     11342



Iteration 6: loss: 3.706857800483703674

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.42      0.47      1594
    hyponym       0.14      0.01      0.02       275
    synonym       0.58      0.78      0.67      2229

avg / total       0.52      0.57      0.53      4248



Iteration 1: loss: 13.39137732982635585

word_disjoint_balanced
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.48      0.27      0.35       511
    hyponym       0.29      0.02      0.03       118
    synonym       0.55      0.84      0.66       831

avg / total       0.47      0.53      0.47      1575



Iteration 500: loss: 9.7942168712615977

word_disjoint_balanced, training on word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       115
   hypernym       0.47      0.41      0.43       511
    hyponym       0.18      0.02      0.03       118
    synonym       0.58      0.78      0.66       831

avg / total       0.47      0.54      0.49      1575



## Bake-off submission

__The goal__: achieve the highest average F1 score on __word_disjoint__.

__Submit:__

* Your score on the `word_disjoint` split.
* A description of the method you used: 
   * Your approach to representing words.
   * Your approach to combining them into inputs.
   * The model you used for predictions.
   
__Submission URL__: https://goo.gl/forms/CizXwS3kfPjsThxA3   

__Notes:__

* For the methods, the only requirement is that they differ in some way from the baseline above. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

* You must train only on the `train` split. No outside training instances can be brought in. You can, though, bring in outside information via your input vectors, as long as this information is not from `dev` or `edge_disjoint`. 

* You can also augment your training data. For example, if `((A, B), synonym)` is a training instance, then so should be `((B, A), synonym)`. Similarly, if `((A, B), hyponym)` and `((B, C), hyponym)` are training cases, then so should be `((A, C), hyponym)`.

* Since the evaluation is for `word_disjoint`, you're not going to get very far with random input vectors! A GloVe featurizer is defined above. Feel free to look around for new word vectors on the Web, or even train your own using our VSM notebooks.

* You're not required to stick to `TfShallowNeuralClassifier`. For instance, you could create deeper feed-forward networks, change how they optimize, etc. As long as you have `fit` and `predict` methods with the same input and output types as our networks, you should be able to use `bakeoff_experiment`. For notes on how to extend the TensorFlow models included in this repository, see [tensorflow_models.ipynb](tensorflow_models.ipynb).
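The two augmentation ideas in the notes above (reversing symmetric relations and following transitive chains) can be sketched on toy data as follows. The function names and the toy examples are hypothetical, and only a single transitive step is taken.

```python
def augment_symmetric(examples, symmetric_labels=('synonym', 'antonym')):
    """Add ((B, A), y) for every ((A, B), y) whose label is symmetric."""
    seen = {(tuple(pair), label) for pair, label in examples}
    for (left, right), label in list(seen):
        if label in symmetric_labels:
            seen.add(((right, left), label))
    return seen

def augment_transitive(examples, label='hyponym'):
    """Add ((A, C), label) whenever ((A, B), label) and ((B, C), label)
    are both present. Only one step is taken, not the full closure."""
    seen = {(tuple(pair), lab) for pair, lab in examples}
    successors = {}
    for (left, right), lab in seen:
        if lab == label:
            successors.setdefault(left, set()).add(right)
    for a, bs in successors.items():
        for b in bs:
            for c in successors.get(b, ()):
                if c != a:  # skip degenerate (A, A) pairs
                    seen.add(((a, c), label))
    return seen

toy = [[['couch', 'sofa'], 'synonym'],
       [['puppy', 'dog'], 'hyponym'],
       [['dog', 'animal'], 'hyponym']]

print(sorted(augment_symmetric(toy)))
print(sorted(augment_transitive(toy)))
```

Both functions return sets of `((left, right), label)` tuples, so the result would need to be converted back to the list format of the data file before being passed to `nli.build_bakeoff_dataset`. Per the rules above, only the `train` portion should be augmented.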

In [10]:
from collections import Counter
import numpy as np
from tf_rnn_classifier import TfRNNClassifier

In [23]:
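def vec_double_concatenate(u, v):
    """Concatenate `u`, `v`, and `u` again into a single np.array."""
    return np.concatenate((u, v, u))

def vec_add(u, v):
    """Element-wise sum of `u` and `v`."""
    return np.array(u) + np.array(v)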

In [39]:
def augment_data(wordentail_data):
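    """Return a new dataset whose `train` portions are augmented with
    reversed synonym and antonym pairs, the converse of each
    hyponym/hypernym pair, and pairs reachable in one transitive step.
    All other portions are passed through unchanged."""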
    aug_wordentail_data = {}
    aug_wordentail_data['vocab'] = wordentail_data['vocab']
    for disjoint_type in ['edge_disjoint', 'word_disjoint', 'word_disjoint_balanced']:
        aug_wordentail_data[disjoint_type] = {}
        for split in wordentail_data[disjoint_type]:
            print(disjoint_type, split)
            if split != 'train':
                aug_wordentail_data[disjoint_type][split] = wordentail_data[disjoint_type][split]
            else:
                aug_wordentail_data[disjoint_type][split] = set()
                hyponyms = {}
                hyper_hypo_count = 0
                for pair in wordentail_data[disjoint_type][split]:
                    w1, w2 = pair[0]
                    aug_wordentail_data[disjoint_type][split].add((w1, w2, pair[1]))
                    if pair[1] == 'synonym':
                        aug_wordentail_data[disjoint_type][split].add((w2, w1, 'synonym'))
                    elif pair[1] == 'hyponym':
                        if w1 not in hyponyms:
                            hyponyms[w1] = []
                        hyponyms[w1].append(w2)
                    elif pair[1] == 'hypernym':
                        if w2 not in hyponyms:
                            hyponyms[w2] = []
                        hyponyms[w2].append(w1)
                    elif pair[1] == 'antonym':
                        aug_wordentail_data[disjoint_type][split].add((w2, w1, 'antonym'))
                print(len(hyponyms.keys()), np.sum([len(x) for x in hyponyms.values()]))
                # move one level down in hyponym each step
                last_data_size = 0
                curr_level_hyponyms = hyponyms.copy()
                count = 0
                while (last_data_size != len(aug_wordentail_data[disjoint_type][split])):
                    print(last_data_size, len(aug_wordentail_data[disjoint_type][split]))
                    last_data_size = len(aug_wordentail_data[disjoint_type][split])
                    next_level_hyponyms = {}
                    for k in curr_level_hyponyms:
                        next_level_hyponyms[k] = []
                        for v in curr_level_hyponyms[k]:
                            aug_wordentail_data[disjoint_type][split].add((k, v, 'hyponym'))
                            aug_wordentail_data[disjoint_type][split].add((v, k, 'hypernym'))
                            if v in hyponyms:
                                next_level_hyponyms[k] += hyponyms[v]
                    count += 1
                    if count == 2:
                        break
                    curr_level_hyponyms = next_level_hyponyms
                aug_wordentail_data[disjoint_type][split] = [((w1, w2), rel) for w1, w2, rel in aug_wordentail_data[disjoint_type][split]]
    return aug_wordentail_data

In [59]:
def resample_data(wordentail_data, aug_wordentail_data, total_sample_size, evenly_sample=False):
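    # NOTE: `dict.copy()` is shallow, so the inner per-split dicts are shared
    # with `aug_wordentail_data`. The assignment to `new_data[...]['train']`
    # below therefore also overwrites `aug_wordentail_data[...]['train']`.
    new_data = aug_wordentail_data.copy()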
    # sample proportionally
    for disjoint_type in ['edge_disjoint', 'word_disjoint', 'word_disjoint_balanced']:
        type_portion = pd.DataFrame(wordentail_data[disjoint_type]['train'])[1].value_counts(normalize=True).to_dict()
        if evenly_sample:
            for k in type_portion:
                type_portion[k] = 1./len(type_portion)
        aug_type_count = pd.DataFrame(aug_wordentail_data[disjoint_type]['train'])[1].value_counts().to_dict()
        sample_pool = dict([(t, []) for t in type_portion])
        for pair in aug_wordentail_data[disjoint_type]['train']:
            sample_pool[pair[1]].append(pair)
        samples = np.array([])
        samples = samples.reshape(len(samples), 2)
        for k in type_portion:
            sample_num = int(type_portion[k] * total_sample_size)
            if sample_num <= aug_type_count[k]:
                sample_idx = np.random.choice(len(sample_pool[k]), size=sample_num, replace=False)
                samples = np.concatenate((samples, np.array(sample_pool[k])[sample_idx]))
            else:
                sample_idx = np.random.choice(len(sample_pool[k]), size=int(sample_num-aug_type_count[k]), replace=True)
                samples = np.concatenate((samples, sample_pool[k], np.array(sample_pool[k])[sample_idx]))
        new_data[disjoint_type]['train'] = samples
    return new_data

In [60]:
aug_wordentail_data = augment_data(wordentail_data)

edge_disjoint dev
edge_disjoint train
2030 7519
0 25590
25590 32921
word_disjoint dev
word_disjoint train
1262 4620
0 15657
15657 20071
word_disjoint_balanced dev
word_disjoint_balanced train
1080 1196
0 4572
4572 5742


In [61]:
resample_aug_wordentail_data = resample_data(wordentail_data, aug_wordentail_data, 
                                             total_sample_size = 40000, evenly_sample=False)

In [50]:
pd.DataFrame(wordentail_data['word_disjoint']['train'])[1].value_counts()

synonym     5610
hypernym    3993
hyponym      627
antonym      386
Name: 1, dtype: int64

In [62]:
pd.DataFrame(aug_wordentail_data['word_disjoint']['train'])[1].value_counts()

synonym     21137
hypernym    15045
hyponym      2362
antonym      1454
Name: 1, dtype: int64

In [63]:
pd.DataFrame(resample_aug_wordentail_data['word_disjoint']['train'])[1].value_counts()

synonym     21137
hypernym    15045
hyponym      2362
antonym      1454
Name: 1, dtype: int64

In [15]:
# wordnet synonym smoothing for glove
lexicon_filename = os.path.join(nlidata_home, 'wordnet-synonyms+.txt')

import re
isNumber = re.compile(r'\d+.*')

def norm_word(word):
    if isNumber.search(word.lower()):
        return '---num---'
    elif re.sub(r'\W+', '', word) == '':
        return '---punc---'
    else:
        return word.lower()

def read_lexicon(filename):
    """Map each head word (first token on a line) to its listed synonyms."""
    lexicon = {}
    with open(filename, 'r') as f:
        for line in f:
            words = line.lower().strip().split()
            if not words:
                continue  # skip blank lines
            lexicon[norm_word(words[0])] = [norm_word(word) for word in words[1:]]
    return lexicon

lexicon = read_lexicon(lexicon_filename)

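def glove50vec_synonym(w):
    """Return `w`'s GloVe vector shifted toward the mean GloVe vector of
    its WordNet synonyms, scaled by `weight`."""
    original = glove50vec(w)
    embedding_lookup = GLOVE50
    weight = 0.2

    if w in lexicon:
        vec_syn_all = [embedding_lookup[ww] for ww in lexicon[w] if ww in embedding_lookup]
        if len(vec_syn_all) == 0:
            return original
        vec_syn_mean = np.sum(vec_syn_all, axis=0) / len(vec_syn_all)
        return original + weight * vec_syn_mean
    else:
        # NOTE: words missing from the lexicon get a random vector here, even
        # when they have a GloVe vector; returning `original` may have been
        # the intent.
        return randvec(w, n=50)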

In [16]:
def get_vocab(X, n_words=None):
    wc = Counter([w for pair in X for w in pair[0]])
    wc = wc.most_common(n_words) if n_words else wc.items()
    vocab = {w for w, c in wc}
    vocab.add("$UNK")
    return sorted(vocab)
vocab = get_vocab(resample_aug_wordentail_data['word_disjoint']['train'])
embedding = np.array([glove50vec(w) for w in vocab])

In [32]:
# rnn_mod = TfRNNClassifier(
#             vocab, 
#             eta=0.05,
#             batch_size=10,
#             embed_dim=10,
#             hidden_dim=10,
#             max_length=10, 
#             max_iter=2,
#             embedding=embedding,
#             cell_class=tf.nn.rnn_cell.LSTMCell,
#             hidden_activation=tf.nn.relu,
#             train_embedding=False)

rnn_mod = TfRNNClassifier(vocab, 
                        hidden_dim=50,
                        embed_dim=50,
                        eta=0.05,
                        cell_class=tf.nn.rnn_cell.LSTMCell,
                        hidden_activation=tf.nn.relu,
                        max_iter=10)
    
nli.bakeoff_experiment(aug_X, rnn_mod, conditions=['word_disjoint'])

Iteration 10: loss: 19.852337062358856

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.00      0.00      0.00      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.52      1.00      0.69      2229

avg / total       0.28      0.52      0.36      4248



In [72]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=300, max_iter=3000)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 3000: loss: 10.319462835788727

word_disjoint
             precision    recall  f1-score   support

    antonym       0.09      0.03      0.05       150
   hypernym       0.48      0.48      0.48      1594
    hyponym       0.14      0.07      0.10       275
    synonym       0.58      0.64      0.61      2229

avg / total       0.50      0.52      0.51      4248



In [71]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=100, max_iter=1000)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 1000: loss: 25.755892276763916

word_disjoint
             precision    recall  f1-score   support

    antonym       0.33      0.01      0.01       150
   hypernym       0.51      0.46      0.48      1594
    hyponym       0.41      0.03      0.05       275
    synonym       0.59      0.73      0.65      2229

avg / total       0.54      0.56      0.53      4248



In [70]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=100, max_iter=500)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 500: loss: 30.179321885108948

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.51      0.42      0.46      1594
    hyponym       0.25      0.00      0.01       275
    synonym       0.58      0.76      0.66      2229

avg / total       0.51      0.56      0.52      4248



In [69]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=100, max_iter=300)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 300: loss: 32.748709917068486

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.53      0.41      0.46      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.58      0.78      0.67      2229

avg / total       0.50      0.56      0.52      4248



In [68]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=50, max_iter=1000)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 1000: loss: 27.987963557243347

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.50      0.45      0.48      1594
    hyponym       0.53      0.03      0.06       275
    synonym       0.59      0.73      0.65      2229

avg / total       0.53      0.56      0.52      4248



In [67]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=50, max_iter=500)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 500: loss: 31.674897015094757

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.52      0.46      0.48      1594
    hyponym       0.67      0.01      0.01       275
    synonym       0.59      0.75      0.66      2229

avg / total       0.54      0.56      0.53      4248



In [66]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifier(hidden_dim=50, max_iter=300)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 300: loss: 33.572845458984375

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.52      0.41      0.46      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.58      0.78      0.66      2229

avg / total       0.50      0.56      0.52      4248



In [65]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifierWithDropout(hidden_dim=300, max_iter=3000)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 3000: loss: 26.672155380249023

word_disjoint
             precision    recall  f1-score   support

    antonym       0.12      0.01      0.01       150
   hypernym       0.51      0.41      0.46      1594
    hyponym       0.27      0.01      0.03       275
    synonym       0.58      0.76      0.66      2229

avg / total       0.52      0.56      0.52      4248



In [64]:
aug_reweight_X = nli.build_bakeoff_dataset(
        resample_aug_wordentail_data, 
        vector_func=glove50vec_synonym,
        vector_combo_func=vec_concatenate)
model = TfShallowNeuralClassifierWithDropout(hidden_dim=300, max_iter=3000)
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 3000: loss: 27.134022831916816

word_disjoint
             precision    recall  f1-score   support

    antonym       0.20      0.01      0.01       150
   hypernym       0.48      0.41      0.44      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.57      0.74      0.64      2229

avg / total       0.49      0.54      0.50      4248



In [173]:
nli.bakeoff_experiment(aug_X, net, conditions=['word_disjoint'])

Iteration 500: loss: 17.443665981292725

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.55      0.46      0.50      1594
    hyponym       0.17      0.00      0.01       275
    synonym       0.59      0.77      0.67      2229

avg / total       0.53      0.58      0.54      4248





In [34]:
class TfShallowNeuralClassifierWithDropout(TfShallowNeuralClassifier):
    def __init__(self, hidden_dim=50, keep_prob=0.8, **kwargs):
        self.hidden_dim = hidden_dim
        self.keep_prob = keep_prob
        super(TfShallowNeuralClassifierWithDropout, self).__init__(**kwargs)        
                    
    def build_graph(self):
        # All the parameters of `TfShallowNeuralClassifier`:
        self.define_parameters()
        
        # Same hidden layer:
        self.hidden = tf.nn.relu(
            tf.matmul(self.inputs, self.W_xh) + self.b_h)
        
        # Drop-out on the hidden layer:
        self.tf_keep_prob = tf.placeholder(tf.float32)
        dropout_layer = tf.nn.dropout(self.hidden, self.tf_keep_prob)
        
        # `dropout_layer` instead of `hidden` to define full model:
        self.model = tf.matmul(dropout_layer, self.W_hy) + self.b_y            
                
    def train_dict(self, X, y):
        return {self.inputs: X, self.outputs: y, 
                self.tf_keep_prob: self.keep_prob}
    
    def test_dict(self, X):
        # No dropout at test-time, hence `self.tf_keep_prob: 1.0`
        return {self.inputs: X, self.tf_keep_prob: 1.0}

In [216]:
nli.bakeoff_experiment(aug_reweight_X, model, conditions=['word_disjoint'])

Iteration 300: loss: 25.792278885841377

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.50      0.37      0.42      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.57      0.79      0.66      2229

avg / total       0.49      0.55      0.51      4248





In [242]:
nli.bakeoff_experiment(aug_X, net, conditions=['word_disjoint'])

Iteration 500: loss: 34.357860445976266

word_disjoint
             precision    recall  f1-score   support

    antonym       0.11      0.41      0.17       150
   hypernym       0.51      0.49      0.50      1594
    hyponym       0.12      0.40      0.18       275
    synonym       0.60      0.33      0.43      2229

avg / total       0.52      0.40      0.43      4248



In [243]:
model = TfShallowNeuralClassifierWithDropout(hidden_dim=200, max_iter=500)
nli.bakeoff_experiment(aug_X, model, conditions=['word_disjoint'])

Iteration 500: loss: 38.666366517543794

word_disjoint
             precision    recall  f1-score   support

    antonym       0.09      0.36      0.14       150
   hypernym       0.51      0.50      0.51      1594
    hyponym       0.12      0.41      0.18       275
    synonym       0.61      0.30      0.40      2229

avg / total       0.52      0.39      0.42      4248



In [70]:
nli.bakeoff_experiment(X, model, conditions=['word_disjoint'])

Iteration 1000: loss: 9.622217893600464

word_disjoint
             precision    recall  f1-score   support

    antonym       0.00      0.00      0.00       150
   hypernym       0.54      0.38      0.45      1594
    hyponym       0.00      0.00      0.00       275
    synonym       0.57      0.80      0.67      2229

avg / total       0.50      0.57      0.52      4248

