<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this notebook, we run `TapasForSequenceClassification`, a PyTorch/Transformers implementation of the [TAPAS algorithm](https://arxiv.org/abs/2004.02349) by Google AI, on the test set of [TabFact](https://github.com/wenhuchen/Table-Fact-Checking), a large dataset for table entailment that is included in the HuggingFace [datasets library](https://github.com/huggingface/datasets).

* Paper (a follow-up to the original TAPAS paper): https://arxiv.org/abs/2010.00571
* TabFact paper: https://arxiv.org/abs/1909.02164

## Setting up environment

Make sure to set the runtime to GPU.
We install Transformers from the `tapas_v4` branch, plus the soft dependency `torch-scatter`:

In [None]:
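# remove any previous clone, then install Transformers from the tapas_v4 branch
! rm -r transformers
! git clone -b tapas_v4 https://github.com/NielsRogge/transformers.git
# each "!" command runs in its own subshell, so a separate "cd" would have no effect
! pip install ./transformers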

rm: cannot remove 'transformers': No such file or directory
Cloning into 'transformers'...
remote: Enumerating objects: 53835, done.
remote: Total 53835 (delta 0), reused 0 (delta 0), pack-reused 53835
Receiving objects: 100% (53835/53835), 40.44 MiB | 29.81 MiB/s, done.
Resolving deltas: 100% (37444/37444), done.
Processing ./transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [None]:
! pip install torch-scatter==latest+cu101 -f https://pytorch-geometric.com/whl/torch-1.7.0.html

Looking in links: https://pytorch-geometric.com/whl/torch-1.7.0.html
Collecting torch-scatter==latest+cu101
  Downloading https://s3.eu-central-1.amazonaws.com/pytorch-geometric.com/whl/torch-1.7.0/torch_scatter-latest%2Bcu101-cp36-cp36m-linux_x86_64.whl (11.9MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.5


We install the datasets library from source:

In [None]:
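# remove any previous clone, then install datasets from source
! rm -r datasets
! git clone https://github.com/huggingface/datasets.git
# each "!" command runs in its own subshell, so a separate "cd" would have no effect
! pip install ./datasets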

rm: cannot remove 'datasets': No such file or directory
Cloning into 'datasets'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 24374 (delta 2), reused 2 (delta 0), pack-reused 24360
Receiving objects: 100% (24374/24374), 38.86 MiB | 35.28 MiB/s, done.
Resolving deltas: 100% (9181/9181), done.
Processing ./datasets
Collecting pyarrow>=0.17.1
  Downloading https://files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/f7/73/826b19f3594756cb1c6c23d2fbd8ca6a77a9cd3b650c9dec5acc85004c38/xxhash-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (242kB)
Building wheels for collected packages

## Loading the model




In [None]:
from transformers import TapasForSequenceClassification
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = TapasForSequenceClassification.from_pretrained("google/tapas-base-finetuned-tabfact")
model.to(device)

TapasForSequenceClassification(
  (tapas): TapasModel(
    (embeddings): TapasEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(1024, 768)
      (token_type_embeddings_0): Embedding(3, 768)
      (token_type_embeddings_1): Embedding(256, 768)
      (token_type_embeddings_2): Embedding(256, 768)
      (token_type_embeddings_3): Embedding(2, 768)
      (token_type_embeddings_4): Embedding(256, 768)
      (token_type_embeddings_5): Embedding(256, 768)
      (token_type_embeddings_6): Embedding(10, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): TapasEncoder(
      (layer): ModuleList(
        (0): TapasLayer(
          (attention): TapasAttention(
            (self): TapasSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=Tr

## Preparing the data

Here we read in the test set of the TabFact dataset. 

In [None]:
from datasets import load_dataset

dataset = load_dataset('tab_fact', 'tab_fact', split='test')

Downloading and preparing dataset tab_fact/tab_fact (download: 187.41 MiB, generated: 121.30 MiB, post-processed: Unknown size, total: 308.71 MiB) to /root/.cache/huggingface/datasets/tab_fact/tab_fact/1.0.0/bd64c4ee1b4127f8377f1817669219ec36aaf65cb8c78d7c995902e25ef362b6...


Dataset tab_fact downloaded and prepared to /root/.cache/huggingface/datasets/tab_fact/tab_fact/1.0.0/bd64c4ee1b4127f8377f1817669219ec36aaf65cb8c78d7c995902e25ef362b6. Subsequent calls will reuse this data.


In [None]:
dataset.column_names

['id', 'label', 'statement', 'table_caption', 'table_id', 'table_text']

Each example in the TabFact dataset is a statement about a table, and the label indicates whether the statement is supported (1) or refuted (0) by the contents of the table. So it's a binary classification problem. 

Let's visualize an example:

In [None]:
import pandas as pd

# let's take the first example of the test set
example = dataset[0]
id2label = {0: "REFUTES", 1: "SUPPORTS"}

data = example['table_text']

# convert table_text into a Pandas dataframe:
# the first line holds the '#'-separated column names, the other lines hold the rows
lines = data.split('\n')
table = pd.DataFrame([x.split('#') for x in lines[1:-1]], columns=lines[0].split('#'))
display(table)
print("")
print("Statement:", example['statement'])
print("Label:", id2label[example['label']])

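| | tournament | wins | top - 5 | top - 10 | top - 25 | events | cuts made |
|---|---|---|---|---|---|---|---|
| 0 | masters tournament | 0 | 1 | 2 | 4 | 4 | 4 |
| 1 | us open | 0 | 2 | 3 | 4 | 6 | 5 |
| 2 | the open championship | 1 | 2 | 2 | 2 | 3 | 3 |
| 3 | pga championship | 0 | 0 | 1 | 2 | 5 | 4 |
| 4 | totals | 1 | 5 | 8 | 12 | 18 | 16 |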



Statement: tony lema be in the top 5 for the master tournament , the us open , and the open championship
Label: SUPPORTS


We wrap the logic that turns the `table_text` column into a Pandas dataframe in a function:

In [None]:
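def read_text_as_pandas_table(table_text: str) -> pd.DataFrame:
    # the first line holds the '#'-separated column names, the other lines hold the rows
    lines = table_text.split('\n')
    table = pd.DataFrame([x.split('#') for x in lines[1:-1]], columns=lines[0].split('#')).fillna('')
    # TapasTokenizer expects every cell value to be a string
    table = table.astype(str)
    return table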
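
To make the `table_text` format concrete, here is a minimal sketch that applies the same slicing without pandas. The sample string is a hand-written assumption modeled on the table above (a header line, `#`-separated cells, and a trailing newline), and `parse_table_text` is a hypothetical helper, not part of any library:

```python
from typing import List, Tuple


def parse_table_text(table_text: str) -> Tuple[List[str], List[List[str]]]:
    """Split a '#'-separated table string into (header, rows)."""
    lines = table_text.split("\n")
    header = lines[0].split("#")
    # [1:-1] skips the header and the last element, which is the empty
    # string produced by the trailing newline in this sample
    rows = [line.split("#") for line in lines[1:-1]]
    return header, rows


# hand-written sample, assumed to mirror the dataset's layout
sample = "tournament#wins#top - 5\nmasters tournament#0#1\nus open#0#2\n"
header, rows = parse_table_text(sample)
print(header)
print(rows)
```

Note that if a string had no trailing newline, the `[1:-1]` slice would silently drop the last data row, so it is worth checking this on a few examples.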

Let's check if TapasTokenizer can prepare the data correctly:

In [None]:
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact")

# test on the first example
example = dataset[0]
inputs = tokenizer(table=read_text_as_pandas_table(example['table_text']),
                   queries=example['statement'],
                   padding='max_length')
inputs

{'input_ids': [101, 4116, 3393, 2863, 2022, 1999, 1996, 2327, 1019, 2005, 1996, 3040, 2977, 1010, 1996, 2149, 2330, 1010, 1998, 1996, 2330, 2528, 102, 2977, 5222, 2327, 1011, 1019, 2327, 1011, 2184, 2327, 1011, 2423, 2824, 7659, 2081, 5972, 2977, 1014, 1015, 1016, 1018, 1018, 1018, 2149, 2330, 1014, 1016, 1017, 1018, 1020, 1019, 1996, 2330, 2528, 1015, 1016, 1016, 1016, 1017, 1017, 14198, 2528, 1014, 1014, 1015, 1016, 1019, 1018, 21948, 1015, 1019, 1022, 2260, 2324, 2385, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
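
The long run of zeros at the end of `input_ids` is padding. As a rough sketch of what `padding='max_length'` does conceptually, assuming a maximum length of 512 (the sequence length used later in this notebook) and a padding id of 0, one could write the following. `pad_to_max_length` is a hypothetical helper and not the tokenizer's actual implementation:

```python
from typing import List, Tuple


def pad_to_max_length(
    ids: List[int], max_length: int = 512, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Right-pad ids with pad_id and build the matching attention mask."""
    if len(ids) > max_length:
        # this sketch does not handle truncation
        raise ValueError("sequence is longer than max_length")
    num_padding = max_length - len(ids)
    padded_ids = ids + [pad_id] * num_padding
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [1] * len(ids) + [0] * num_padding
    return padded_ids, attention_mask


# a short, made-up sequence of token ids
short_ids = [101, 4116, 3393, 2863, 102]
padded_ids, attention_mask = pad_to_max_length(short_ids)
print(len(padded_ids), sum(attention_mask))
```

Padding every example to the same length is what later allows the examples to be stacked into fixed-size batches.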

Now let's use the `.map()` functionality of `datasets` to tokenize the entire test split and prepare it for the model. Note that we tokenize each table-statement pair independently (we don't set `batched=True`):

In [None]:
from datasets import Features, Sequence, ClassLabel, Value, Array2D

# we need to define the features ourselves as the token_type_ids of TAPAS are different from those of BERT 
features = Features({
    'attention_mask': Sequence(Value(dtype='int64')),
    'id': Value(dtype='int32'),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'label': ClassLabel(names=['refuted', 'entailed']),
    'statement': Value(dtype='string'),
    'table_caption': Value(dtype='string'),
    'table_id': Value(dtype='string'),
    'table_text': Value(dtype='string'),
    'token_type_ids': Array2D(dtype="int64", shape=(512, 7))
})
test = dataset.map(
    lambda e: tokenizer(
        table=read_text_as_pandas_table(e['table_text']),
        queries=e['statement'],
        truncation=True,
        padding='max_length',
    ),
    features=features,
)


Let's create a PyTorch dataloader based on this:

In [None]:
# map to PyTorch tensors and only keep columns we need
test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
# create PyTorch dataloader
test_dataloader = torch.utils.data.DataLoader(test, batch_size=4)
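
As a quick sanity check on the dataloader, the number of batches follows from the size of the test split. The sketch below assumes the split holds 12,779 examples (the number of items processed by `.map()` in the original run):

```python
import math

num_examples = 12779  # assumed size of the TabFact test split
batch_size = 4

num_batches = math.ceil(num_examples / batch_size)
# examples left for the final batch when the split is not a multiple of batch_size
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

Under that assumption the final batch is smaller than 4, so any check that hard-codes a batch dimension of 4 only holds for full batches.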

We can verify that everything was created correctly, for example by checking the tensor shapes of the first batch (the commented-out line decodes the `input_ids` of its first example):

In [None]:
# let's check the first batch
batch = next(iter(test_dataloader))
assert batch["input_ids"].shape == (4, 512)
assert batch["attention_mask"].shape == (4, 512)
assert batch["token_type_ids"].shape == (4, 512, 7)
#tokenizer.decode(batch["input_ids"][0])
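
The last dimension of `token_type_ids` has size 7 because TAPAS uses seven kinds of token type ids instead of BERT's single segment id. The sketch below pairs the names documented for `TapasTokenizer` with the embedding sizes printed for the model above. The dictionary itself is only an illustration, not an object exposed by the library:

```python
# token type channels in the order used by TAPAS, with the sizes of the
# corresponding token_type_embeddings_0..6 tables printed for the model
TOKEN_TYPE_CHANNELS = {
    "segment_ids": 3,
    "column_ids": 256,
    "row_ids": 256,
    "prev_labels": 2,
    "column_ranks": 256,
    "inv_column_ranks": 256,
    "numeric_relations": 10,
}

max_length = 512  # padded sequence length used in this notebook
batch_size = 4

# shape expected for a full batch of token_type_ids
expected_shape = (batch_size, max_length, len(TOKEN_TYPE_CHANNELS))
print(expected_shape)
```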



## Run evaluation

Now we can compute the accuracy of TAPAS on the test set of TabFact with just a few lines of code!

In [None]:
from datasets import load_metric

accuracy = load_metric("accuracy")

# evaluation mode disables dropout (from_pretrained already sets it; kept explicit here)
model.eval()

for batch in test_dataloader:
    # get the inputs
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    token_type_ids = batch["token_type_ids"].to(device)
    labels = batch["label"].to(device)

    # forward pass, without tracking gradients since we only evaluate
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
    model_predictions = outputs.logits.argmax(-1)

    # add the predictions and labels of this batch to the metric
    accuracy.add_batch(predictions=model_predictions, references=labels)

final_score = accuracy.compute()


In [None]:
print(final_score)

{'accuracy': 0.7711871038422412}
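
For reference, accuracy is simply the fraction of predictions that match the labels. Below is a minimal pure-Python sketch of the argmax-then-compare step on made-up logits. `argmax` and `accuracy_from_logits` are hypothetical helpers, and the numbers are illustrative, not model outputs:

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# made-up logits for 4 statements, columns = (refuted, entailed)
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```

Assuming the test split has 12,779 examples, the reported accuracy of about 0.7712 corresponds to 9,855 correctly classified statements.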
