# Homework 4: Word-level entailment with neural networks

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
  1. [Edge disjoint](#Edge-disjoint)
  1. [Word disjoint](#Word-disjoint)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [1 point]](#Alternatives-to-concatenation-[1-point])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

<img src="fig/wordentail-diagram.png" width=600 alt="wordentail-diagram.png" />

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [2]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import nli
import utils

In [3]:
DATA_HOME = 'data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

## Data

I've processed the data into two different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.

These are very different problems. For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.
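To make the distinction concrete, here is a minimal sketch on made-up toy data. The helper names `vocab` and `edges` are hypothetical and are not part of the `nli` module; the toy pairs are invented for illustration and do not come from the dataset.

```python
# Toy data in the same [[left, right], label] format as the real splits.
toy_train = [[['puppy', 'dog'], 1], [['dog', 'animal'], 1], [['dog', 'chair'], 0]]
toy_dev = [[['puppy', 'animal'], 1], [['chair', 'dog'], 0]]

def vocab(split):
    """Set of all words occurring in a split (hypothetical helper)."""
    return {w for (pair, _) in split for w in pair}

def edges(split):
    """Set of ordered (left, right) word pairs in a split (hypothetical helper)."""
    return {tuple(pair) for (pair, _) in split}

edge_overlap = edges(toy_train) & edges(toy_dev)
vocab_overlap = vocab(toy_train) & vocab(toy_dev)

print(len(edge_overlap))   # number of pairs shared by the two toy splits
print(len(vocab_overlap))  # number of words shared by the two toy splits
```

These toy splits are edge-disjoint but not word-disjoint: no ordered pair is repeated, yet every dev word was already seen in training.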

In [4]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the two splits plus a list giving the vocabulary for the entire dataset:

In [5]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint'])

### Edge disjoint

In [6]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; all three have this same format:

In [7]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['sweater', 'stroke'], 0],
 [['constipation', 'hypovolemia'], 0],
 [['disease', 'inflammation'], 0],
 [['herring', 'animal'], 1],
 [['cauliflower', 'outlook'], 0]]

Let's test to make sure no edges are shared between `train` and `dev`:

In [8]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [9]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

2916

This is a large percentage of the entire vocab:

In [10]:
len(wordentail_data['vocab'])

8470

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge for learning. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [11]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()

In [12]:
label_distribution('edge_disjoint')

0    14650
1     2745
Name: 1, dtype: int64

### Word disjoint

In [13]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [14]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [15]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the overall number of examples is a bit smaller:

In [16]:
label_distribution('word_disjoint')

0    7199
1    1349
Name: 1, dtype: int64

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [17]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [18]:
# Any of the files in glove.6B will work here:

glove_dim = 50

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [19]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those) – there's lots of space for experimentation here; [homework question 2](#Alternatives-to-concatenation-[1-point]) below pushes you to do some exploration.
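As a rough illustration of that design space, the sketch below defines two alternatives. The names `vec_diff` and `vec_concat_diff` are hypothetical, chosen here for illustration; they are not defined elsewhere in the course code.

```python
import numpy as np

def vec_diff(u, v):
    """Element-wise difference; the output has the same dimension as `u`."""
    return u - v

def vec_concat_diff(u, v):
    """Concatenation of `u`, `v`, and `u - v` (three times the input dimension)."""
    return np.concatenate((u, v, u - v))

# Small made-up vectors, just to show the resulting shapes.
u = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, 0.5, 0.5])
print(vec_diff(u, v))
print(vec_concat_diff(u, v).shape)
```

Note that the choice of combination function changes the input dimension `m`: difference keeps it at the word-vector dimension, while the concatenated variant triples it.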

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [20]:
net = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint`!

In [21]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Finished epoch 100 of 100; error is 0.022924087708815932

              precision    recall  f1-score   support

           0       0.92      0.94      0.93      1910
           1       0.41      0.36      0.38       239

   micro avg       0.87      0.87      0.87      2149
   macro avg       0.66      0.65      0.65      2149
weighted avg       0.86      0.87      0.87      2149



## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for inference tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects the 'edge_disjoint' and 'word_disjoint' versions of our task.

For this problem, submit code for the following:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. Code for looping over the two conditions 'word_disjoint' and 'edge_disjoint' and the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the condition's 'train' portion and assess on its 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.

1. Print out the percentage-wise increase in macro-F1 that `vec_concatenate` delivers over `hypothesis_only` for each of the two conditions. For example, if `hypothesis_only` returns 0.5 for condition `C` and `vec_concatenate` delivers 0.75 for `C`, then you'd report a 50% increase for `C`. The values you need are stored in the dictionary returned by `nli.wordentail_experiment`, with key 'macro-F1'. Please use two digits of precision for the increases.
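The arithmetic in the last item can be sketched as follows. The function name `pct_increase` is a hypothetical helper introduced here; the two scores are the illustrative values from the example, not experimental results.

```python
def pct_increase(baseline, improved):
    """Relative increase of `improved` over `baseline`, as a percentage."""
    return 100.0 * (improved - baseline) / baseline

# hypothesis_only = 0.5 and vec_concatenate = 0.75, as in the example.
print("{0:.2f}%".format(pct_increase(0.5, 0.75)))  # prints 50.00%
```

Dividing by the `hypothesis_only` score makes it the reference point, so the figure reads as how much the premise adds on top of the hypothesis alone.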

In [22]:
from sklearn.linear_model import LogisticRegression
def hypothesis_only(u, v):
    """Ignore the premise vector `u` and return the hypothesis vector `v`."""
    return v

conditions = ['word_disjoint', 'edge_disjoint']

In [24]:
for c in conditions:
    concat_experiment = nli.wordentail_experiment(
        train_data=wordentail_data[c]['train'],
        assess_data=wordentail_data[c]['dev'], 
        model=LogisticRegression(), 
        vector_func=glove_vec,
        vector_combo_func=vec_concatenate)
    hypo_experiment = nli.wordentail_experiment(
        train_data=wordentail_data[c]['train'],
        assess_data=wordentail_data[c]['dev'], 
        model=LogisticRegression(), 
        vector_func=glove_vec,
        vector_combo_func=hypothesis_only)
    print(c + " increase{0:.2f}".format((concat_experiment['macro-F1'] - hypo_experiment['macro-F1']) / hypo_experiment['macro-F1']))



              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1910
           1       0.47      0.14      0.22       239

   micro avg       0.89      0.89      0.89      2149
   macro avg       0.68      0.56      0.58      2149
weighted avg       0.85      0.89      0.86      2149





              precision    recall  f1-score   support

           0       0.89      0.99      0.94      1910
           1       0.36      0.05      0.09       239

   micro avg       0.88      0.88      0.88      2149
   macro avg       0.63      0.52      0.51      2149
weighted avg       0.83      0.88      0.84      2149

word_disjoint increase0.13




              precision    recall  f1-score   support

           0       0.88      0.97      0.92      7376
           1       0.58      0.23      0.33      1321

   micro avg       0.86      0.86      0.86      8697
   macro avg       0.73      0.60      0.62      8697
weighted avg       0.83      0.86      0.83      8697





              precision    recall  f1-score   support

           0       0.87      0.98      0.92      7376
           1       0.59      0.19      0.29      1321

   micro avg       0.86      0.86      0.86      8697
   macro avg       0.73      0.59      0.61      8697
weighted avg       0.83      0.86      0.83      8697

edge_disjoint increase0.03


### Alternatives to concatenation [1 point]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore a simple alternative. 

For this problem, submit code for the following:

1. A new potential value for `vector_combo_func` that does something different from concatenation. Options include, but are not limited to, element-wise addition, difference, and multiplication. These can be combined with concatenation if you like.
1. Include a use of `nli.wordentail_experiment` in the same configuration as the one in [Baseline results](#Baseline-results) above, but with your new value of `vector_combo_func`.

In [23]:
def vec_add(u, v):
    """Element-wise sum of np.array instances `u` and `v`."""
    #con = np.concatenate((u, v))
    #return np.concatenate((con, np.multiply(u, v)))
    return np.add(u, v)

In [38]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_add)

Finished epoch 100 of 100; error is 0.9082470536231995

              precision    recall  f1-score   support

           0       0.91      0.90      0.91      1910
           1       0.29      0.31      0.30       239

   micro avg       0.84      0.84      0.84      2149
   macro avg       0.60      0.61      0.60      2149
weighted avg       0.84      0.84      0.84      2149



### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `define_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier, so no activation function is applied to it.)
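For intuition, the equations above can be traced in plain NumPy. This is only a sketch of the forward pass as written, not the PyTorch solution: the function name `dropout_forward`, the choice of `np.tanh` for $f$, and the toy dimensions are all assumptions made for illustration. Note also that `torch.nn.Dropout` additionally rescales the surviving values by `1 / (1 - p)` during training and does nothing in evaluation mode, which this sketch does not reproduce.

```python
import numpy as np

def dropout_forward(x, W1, b1, W2, b2, dropout_prob, f=np.tanh, rng=None):
    """Forward pass for the graph above, for a single input vector `x`."""
    rng = np.random.default_rng() if rng is None else rng
    h1 = x @ W1 + b1
    # r1 ~ Bernoulli(1 - dropout_prob): each entry is 1 with that probability.
    r1 = rng.binomial(1, 1.0 - dropout_prob, size=h1.shape)
    d1 = r1 * h1          # zero out the dropped hidden units
    h2 = f(d1)            # activation applied after dropout, as specified
    h3 = h2 @ W2 + b2     # unnormalized scores for the softmax classifier
    return h3, r1

# Toy dimensions (assumed): 4 inputs, 6 hidden units, 2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 2)), np.zeros(2)
h3, r1 = dropout_forward(x, W1, b1, W2, b2, dropout_prob=0.5, rng=rng)
print(h3.shape, r1)
```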

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

The following code starts this sub-class for you, so that you can concentrate on `define_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

In [24]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)
    
    def define_graph(self):
        """Complete this method!
        
        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you 
        write yourself, as in `torch_rnn_classifier`, or the output of 
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.
        
        """
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Dropout(self.dropout_prob), 
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_))
    


In [40]:
_ = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=TorchDeepNeuralClassifier(), 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Finished epoch 100 of 100; error is 2.451948955655098

              precision    recall  f1-score   support

           0       0.90      0.99      0.94      1910
           1       0.61      0.15      0.25       239

   micro avg       0.89      0.89      0.89      2149
   macro avg       0.75      0.57      0.60      2149
weighted avg       0.87      0.89      0.87      2149



### Your original system [4 points]

This is a simple dataset, but our focus on the 'word_disjoint' condition ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. It doesn't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

Keep in mind that, for the bake-off evaluation, the 'edge_disjoint' portions of the data are off limits. You can, though, train on the combination of the 'word_disjoint' 'train' and 'dev' portions. You are free to use different pretrained word vectors and the like. Please do not introduce additional entailment datasets into your training data, though.

Please embed your code in this notebook so that we can rerun it.

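### Notes
> Please install `pytorch-pretrained-bert` and `imblearn` before proceeding.
>
> You will also need to download our pretrained model from https://drive.google.com/open?id=1THvgq3HdqOZet14IBgbd0wgg515gYS3s and place it in the directory passed to `BertForSequenceClassification.from_pretrained` (`./models/` in the evaluation code).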

In [25]:
_train_dataset = wordentail_data['word_disjoint']['train']
_dev_dataset = wordentail_data['word_disjoint']['dev']

## Additional imports
import csv, os, json, random, logging, sys, pickle, time
from tqdm import trange
from tqdm._tqdm_notebook import tqdm_notebook as tqdm
import pandas as pd
import torch
from sklearn.metrics import *
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig, WEIGHTS_NAME, CONFIG_NAME

from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from collections import *
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report

## Helpers
logging.basicConfig(format = '%(asctime)s - %(levelname)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

class dotdict(dict):
    def __getattr__(self, name):
        return self[name]
    
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()



class CondProcessor(DataProcessor):
    """Processor for the word-level entailment data."""
    def __init__(self):
        pass
        
    def get_train_examples(self):
        """See base class."""
        return self._create_examples(_train_dataset, "train")

    def get_dev_examples(self):
        """See base class."""
        return self._create_examples(_dev_dataset, "dev")
    
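    def get_test_examples(self):
        """See base class."""
        # `_test_dataset` is not defined in this notebook; assign the
        # bake-off test data to it before calling this method.
        return self._create_examples(_test_dataset, "test")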
    
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, data, set_type):
        """Creates examples for the training and dev sets."""

        sents = np.array([d[0] for d in data])
        labels = np.array([d[1] for d in data])
        examples = []
        if set_type == "train":
            logging.info("getting oversampled train data")
            ros = RandomOverSampler(random_state=42, sampling_strategy=0.4)
            X_res, y_res = ros.fit_resample(np.arange(len(sents)).reshape(-1, 1), labels)
            X_res = sents[X_res.reshape(-1)]

            logging.info(Counter(y_res))
        else:
            logging.info("getting dev data")
            X_res = sents
            y_res = labels
        i = 0
        for e, l in zip(X_res, y_res):
            text_a = e[0]
            text_b = e[1]
            label = l
            guid = "%s-%s" % (set_type, i)
            i += 1
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
      
        return examples



def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

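def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, show_exp=False):
    """Convert `InputExample`s into padded `InputFeatures` for BERT.

    Each example is tokenized, truncated to fit `max_seq_length`,
    wrapped in [CLS]/[SEP] markers, and zero-padded. If `show_exp`
    is True, the first five examples are logged.
    """
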
    label_map = {label : i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[str(example.label)]
        if ex_index < 5 and show_exp:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features


Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
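The input layout that `convert_examples_to_features` builds for a word pair can be illustrated without loading BERT. The sketch below is a simplification under stated assumptions: `build_pair_inputs` is a hypothetical helper, it works on token strings rather than vocabulary ids, it uses `"[PAD]"` as a stand-in for the zero padding id, and it treats each word as a single wordpiece, which the real tokenizer does not guarantee.

```python
def build_pair_inputs(tokens_a, tokens_b, max_seq_length):
    """Lay out a token pair as [CLS] a [SEP] b [SEP], then pad to length."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first word, and its [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(tokens)
    pad = max_seq_length - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_seq_length; truncate first")
    return (tokens + ["[PAD]"] * pad,
            segment_ids + [0] * pad,
            input_mask + [0] * pad)

tokens, segment_ids, input_mask = build_pair_inputs(["herring"], ["animal"], 6)
print(tokens)
print(segment_ids)
print(input_mask)
```

With a maximum length of 6 and three special tokens, only three positions remain for the wordpieces of both words, so a pair whose words split into more pieces would be shortened by `_truncate_seq_pair`.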


In [45]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
n_gpu = torch.cuda.device_count()
processor = CondProcessor()
label_list = processor.get_labels()

tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

num_labels = 2

# Prepare model
cache_dir = os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_-1')
model = BertForSequenceClassification.from_pretrained('./models/', # change this to the dir of pretrained model 
          cache_dir=cache_dir,
          num_labels = num_labels)
model.to(device)
eval_examples = processor.get_dev_examples()
eval_features = convert_examples_to_features(
    eval_examples, label_list, 6, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=32)

model.eval()
eval_loss = 0
nb_eval_steps, nb_eval_examples = 0, 0
preds = []
labels = []
for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    with torch.no_grad():
        tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
        logits = model(input_ids, segment_ids, input_mask)

    logits = logits.detach().cpu().numpy()
    label_ids = label_ids.to('cpu').numpy()
    pred = np.argmax(logits, axis=1)
    labels.append(label_ids)
    preds.append(pred)
    eval_loss += tmp_eval_loss.mean().item()

    nb_eval_examples += input_ids.size(0)
    nb_eval_steps += 1
    del input_ids, input_mask, segment_ids, label_ids, tmp_eval_loss

f1 = f1_score(np.concatenate(labels), np.concatenate(preds), average="macro")
logger.info("Dev F1 %s" % f1)

05/05/2019 17:00:56 - INFO -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /Users/baidi/.pytorch_pretrained_bert/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
05/05/2019 17:00:56 - INFO -   loading archive file ./models/
05/05/2019 17:00:56 - INFO -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 28996
}

05/05/2019 17:01:22 - INFO -   getting dev data
05/05/2019 17:09:36 - IN

In [46]:
print("macro f1-score: {}".format(f1))

macro f1-score: 0.7580615447598253


In [None]:
# ## Train
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# n_gpu = torch.cuda.device_count()
# best_f1 = 0
# for learning_rate in [5E-6, 8E-6, 1E-5, 2E-5]:
#         args = dotdict({"data_dir": './data/bert/', 
#                 "bert_model": "bert-large-cased",
#                 "output_dir": "./models/",
#                 "model_save_pth": "./models/bert_classification.pth",
#                 "seed": 28,
#                 "train_batch_size": 32,
#                 "num_train_epochs": 8,
#                 "eval_batch_size": 128,
#                 "do_lower_case": False,
#                 "do_train": True,
#                 "do_eval": True,
#                 "max_seq_length": 6,
#                 "gradient_accumulation_steps": 1,
#                 "task_name": "test",
#                 "local_rank": -1,
#                 "warmup_proportion": 0.1,
#                 "fp16": False,
#                 "cache_dir": "./tmp/",
#                 "learning_rate": learning_rate})



#         if torch.cuda.is_available():
#             torch.cuda.empty_cache()
#         logging.info(args)




#         processor = CondProcessor()
#         label_list = processor.get_labels()

#         tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

#         num_labels = 2
#         train_examples = None
#         num_train_optimization_steps = None
#         if args.do_train:
#             train_examples = processor.get_train_examples()
#             num_train_optimization_steps = int(
#                 len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs
#             if args.local_rank != -1:
#                 num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

#         # Prepare model
#         cache_dir = args.cache_dir if args.cache_dir else os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_{}'.format(args.local_rank))
#         model = BertForSequenceClassification.from_pretrained(args.bert_model,
#                   cache_dir=cache_dir,
#                   num_labels = num_labels)
# #         for i in model.bert.embeddings.parameters():
# #             i.requires_grad = False
        
#         if n_gpu > 1:
#             model = torch.nn.DataParallel(model)
#         model.to(device)


#         param_optimizer = list(model.named_parameters())
#         no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
#         optimizer_grouped_parameters = [
#             {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
#             {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
#             ]


#         optimizer = BertAdam(optimizer_grouped_parameters,
#                              lr=args.learning_rate,
#                              warmup=args.warmup_proportion,
#                              t_total=num_train_optimization_steps)
        
#         train_f1s = []
#         eval_f1s = []
#         global_step = 0
#         nb_tr_steps = 0
#         tr_loss = 0
#         patience = 0
#         if args.do_train:


#             for _ in trange(int(args.num_train_epochs)):

#                 logger.info("do train")

#                 train_features = convert_examples_to_features(
#                 train_examples, label_list, args.max_seq_length, tokenizer)
#                 all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
#                 all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
#                 all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
#                 all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
#                 train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
#                 if args.local_rank == -1:
#                     train_sampler = RandomSampler(train_data)
#                 else:
#                     train_sampler = DistributedSampler(train_data)
#                 train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, drop_last = True, pin_memory=True)


#                 model.train()
#                 tr_loss = 0
#                 nb_tr_examples, nb_tr_steps = 0, 0

#                 for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):

#                     batch = tuple(t.to(device, non_blocking=True) for t in batch)

#                     input_ids, input_mask, segment_ids, label_ids = batch

#                     loss = model(input_ids, segment_ids, input_mask, label_ids) #???
#                     if n_gpu > 1:
#                         loss = loss.mean() # mean() to average on multi-gpu.
#                     loss.backward()

#                     tr_loss += loss.item()
#                     nb_tr_examples += input_ids.size(0)
#                     nb_tr_steps += 1
#                     optimizer.step()
#                     optimizer.zero_grad()
#                     global_step += 1
#                     del input_ids, input_mask, segment_ids, label_ids, batch, loss

#                 logger.info("tr loss: %s" % tr_loss)

          


#                 if args.do_eval:
#                     logger.info("do eval")

#                     eval_examples = processor.get_dev_examples()
#                     eval_features = convert_examples_to_features(
#                         eval_examples, label_list, args.max_seq_length, tokenizer)
#                     all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
#                     all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
#                     all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
#                     all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
#                     eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
#                     # Run prediction for full data
#                     eval_sampler = SequentialSampler(eval_data)
#                     eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

#                     model.eval()
#                     eval_loss = 0
#                     nb_eval_steps, nb_eval_examples = 0, 0
#                     preds = []
#                     labels = []
#                     for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
#                         input_ids = input_ids.to(device)
#                         input_mask = input_mask.to(device)
#                         segment_ids = segment_ids.to(device)
#                         label_ids = label_ids.to(device)

#                         with torch.no_grad():
#                             tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
#                             logits = model(input_ids, segment_ids, input_mask)

#                         logits = logits.detach().cpu().numpy()
#                         label_ids = label_ids.to('cpu').numpy()
#                         pred = np.argmax(logits, axis=1)
#                         labels.append(label_ids)
#                         preds.append(pred)
#                         eval_loss += tmp_eval_loss.mean().item()

#                         nb_eval_examples += input_ids.size(0)
#                         nb_eval_steps += 1
#                         del input_ids, input_mask, segment_ids, label_ids, tmp_eval_loss

#                     f1 = f1_score(np.concatenate(labels), np.concatenate(preds), average="macro")
#                     logger.info("F1 %s" % f1)
#                     eval_f1s.append(f1)
#                     if f1 > best_f1:

#                         best_f1 = f1
#                         assert type(eval_loss) == float
#                         logger.info("Best F1 %s" % best_f1)
#                         result = {'eval_loss': eval_loss,
#                                   'eval_f1': f1,
#                                   'global_step': global_step}


#                         model_to_save = model.module if hasattr(model, 'module') else model 
#                         output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME)
#                         torch.save(model_to_save.state_dict(), output_model_file)
#                         output_config_file = os.path.join(args.output_dir, CONFIG_NAME)
#                         with open(output_config_file, 'w') as f:
#                             f.write(model_to_save.config.to_json_string())
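
The commented-out loop above sizes the optimizer schedule from the number of training examples. As a rough worked example of that arithmetic, the sketch below plugs in the batch size, accumulation steps, epoch count, and warmup proportion from the `args` dictionary. The figure of 10,000 training examples is a made-up assumption (the real count depends on the oversampled data), and `optimization_steps` is a helper name introduced only for this illustration.

```python
def optimization_steps(n_examples, batch_size, accumulation_steps, epochs):
    """Total optimizer updates, computed the same way as in the loop above."""
    steps_per_epoch = int(n_examples / batch_size / accumulation_steps)
    return steps_per_epoch * epochs


n_examples = 10_000        # hypothetical size of the oversampled training set
total_steps = optimization_steps(
    n_examples, batch_size=32, accumulation_steps=1, epochs=8)

# warmup_proportion is handed to BertAdam as the fraction of t_total
# during which the learning rate is still ramping up.
warmup_steps = int(total_steps * 0.1)

print(total_steps, warmup_steps)
```

Because the `DataLoader` is built with `drop_last=True`, the truncated per-epoch count matches the number of batches that actually run when `gradient_accumulation_steps` is 1.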

                            

## Bake-off [1 point]

The goal of the bake-off is to achieve the highest macro-average F1 score on __word_disjoint__, on a test set that we will make available at the start of the bake-off on May 6. The announcement will go out on Piazza. To enter, you'll be asked to run `nli.bake_off_evaluation` on the output of your chosen `nli.wordentail_experiment` run. 

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187250

The cells below this one constitute your bake-off entry.

The rules described in the [Your original system](#Your-original-system-[4-points]) homework question are also in effect for the bake-off.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on May 8. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.
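
Macro-average F1 gives each class equal weight regardless of how many examples it has, which matters whenever one label is much rarer than the other. The following is a minimal plain-Python sketch of the metric, for intuition only. The notebook itself relies on `sklearn.metrics.f1_score` with `average="macro"`; the helper names and toy labels below are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0, 0, 0, 0, 1, 1]          # toy gold labels
y_pred = [0, 0, 0, 1, 1, 0]          # toy system output
always_zero = [0] * len(y_true)      # majority-class guesser

print(macro_f1(y_true, y_pred))      # 0.625
print(macro_f1(y_true, always_zero)) # 0.4
```

On these toy labels, a system that always predicts `0` is right on four of six examples but scores only 0.4 macro-F1, because its F1 on class `1` is zero.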

In [33]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.
test_data_filename = os.path.join(
            'data', 'nlidata', 'nli_wordentail_bakeoff_data-test.json')
with open(test_data_filename) as f:
    wordentail_data = json.load(f)

In [34]:
# Install pytorch-pretrained-bert and imblearn before running this cell.
# The fine-tuned model must also be downloaded from
# https://drive.google.com/open?id=1THvgq3HdqOZet14IBgbd0wgg515gYS3s
# and placed in the directory passed to `from_pretrained` in the next cell.

_test_dataset = wordentail_data['word_disjoint']['test']

## Additional imports
import csv, os, json, random, logging, sys, pickle, time
from tqdm import trange
from tqdm._tqdm_notebook import tqdm_notebook as tqdm
import pandas as pd
import torch
from sklearn.metrics import classification_report, f1_score
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from pytorch_pretrained_bert.modeling import BertForSequenceClassification, BertConfig, WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.optimization import BertAdam
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

## Helpers
logging.basicConfig(format = '%(asctime)s - %(levelname)s -   %(message)s', 
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

class dotdict(dict):
    def __getattr__(self, name):
        return self[name]
    
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()



class CondProcessor(DataProcessor):
    """Processor for the word-entailment data (adapted from the MRPC processor)."""
    def __init__(self):
        pass
        
    def get_train_examples(self):
        """See base class."""
        return self._create_examples(_train_dataset, "train")

    def get_dev_examples(self):
        """See base class."""
        return self._create_examples(_dev_dataset, "dev")
    
    def get_test_examples(self):
        """See base class."""
        return self._create_examples(_test_dataset, "test")
    
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, data, set_type):
        """Creates examples for the train, dev, and test sets."""

        sents = np.array([d[0] for d in data])
        labels = np.array([d[1] for d in data])
        examples = []
        if set_type == "train":
            logger.info("getting oversampled train data")
            # Randomly duplicate minority-class examples until the
            # minority/majority ratio reaches 0.4. Row indices are resampled
            # and then used to look the word pairs back up.
            ros = RandomOverSampler(random_state=42, sampling_strategy=0.4)
            X_res, y_res = ros.fit_resample(np.arange(len(sents)).reshape(-1, 1), labels)
            X_res = sents[X_res.reshape(-1)]

            logger.info(Counter(y_res))
        else:
            X_res = sents
            y_res = labels
        i = 0
        for e, l in zip(X_res, y_res):
            text_a = e[0]
            text_b = e[1]
            label = l
            guid = "%s-%s" % (set_type, i)
            i += 1
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
      
        return examples



def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, show_exp=False):

    label_map = {label : i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[str(example.label)]
        if ex_index < 5 and show_exp:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_id))
    return features
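

With `max_seq_length` set to 6, three positions go to `[CLS]` and the two `[SEP]` markers, which leaves three wordpieces to share between the two words. The sketch below mirrors the layout logic of `convert_examples_to_features` on already-tokenized input, so it runs without BERT. The name `layout_pair`, the `[PAD]` placeholder strings, and the example wordpieces are illustrative assumptions rather than real tokenizer output, and both words are assumed to yield at least one piece.

```python
def truncate_pair(tokens_a, tokens_b, max_length):
    """Drop trailing pieces from the longer list until the pair fits."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
    return tokens_a, tokens_b


def layout_pair(tokens_a, tokens_b, max_seq_length=6):
    """Return (tokens, segment_ids, input_mask) padded to max_seq_length."""
    # Reserve three positions for [CLS], [SEP], [SEP].
    tokens_a, tokens_b = truncate_pair(tokens_a, tokens_b, max_seq_length - 3)
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)
    # Pad every list out to the fixed length; padding is masked out.
    n_pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    segment_ids += [0] * n_pad
    input_mask += [0] * n_pad
    return tokens, segment_ids, input_mask


# One piece per word: five real positions plus one pad.
print(layout_pair(["dog"], ["animal"]))
# A word split into three pieces loses its last piece to fit the budget.
print(layout_pair(["jack", "##son", "##ville"], ["city"]))
```

The second call shows the cost of such a short sequence length: when the two words together produce more than three wordpieces, trailing pieces are silently dropped.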



In [35]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
n_gpu = torch.cuda.device_count()
processor = CondProcessor()
label_list = processor.get_labels()

tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

num_labels = 2

# Prepare model
cache_dir = os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), 'distributed_-1')
# './models/' is the directory holding the downloaded fine-tuned model; change it as needed.
model = BertForSequenceClassification.from_pretrained('./models/',
          cache_dir=cache_dir,
          num_labels = num_labels)
model.to(device)
eval_examples = processor.get_test_examples()
# The literal 6 is max_seq_length, the same value as in the training configuration.
eval_features = convert_examples_to_features(
    eval_examples, label_list, 6, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=32)

model.eval()
eval_loss = 0
nb_eval_steps, nb_eval_examples = 0, 0
preds = []
labels = []
for input_ids, input_mask, segment_ids, label_ids in tqdm(eval_dataloader):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

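    with torch.no_grad():
        # This model class returns the loss when labels are passed and the
        # logits otherwise, hence the two forward passes.
        tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
        logits = model(input_ids, segment_ids, input_mask)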

    logits = logits.detach().cpu().numpy()
    label_ids = label_ids.to('cpu').numpy()
    pred = np.argmax(logits, axis=1)
    labels.append(label_ids)
    preds.append(pred)
    eval_loss += tmp_eval_loss.mean().item()

    nb_eval_examples += input_ids.size(0)
    nb_eval_steps += 1
    del input_ids, input_mask, segment_ids, label_ids, tmp_eval_loss

f1 = f1_score(np.concatenate(labels), np.concatenate(preds), average="macro")
logger.info("test: macro-F1 %s" % f1)

05/06/2019 15:38:45 - INFO -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /Users/baidi/.pytorch_pretrained_bert/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
05/06/2019 15:38:46 - INFO -   loading archive file ./models/
05/06/2019 15:38:46 - INFO -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 28996
}



05/06/2019 15:42:28 - INFO -   test: macro-F1 0.7825426994039539





In [None]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
0.7825426994039539