Why does TAPAS perform worse than reported? #2

Closed
FeiWang96 opened this issue Mar 1, 2021 · 7 comments

Comments


FeiWang96 commented Mar 1, 2021

Hi, nice tutorials!

Thank you for adding TAPAS to huggingface/transformers. It is really helpful.

However, according to your Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb, tapas-base-finetuned-tabfact reaches 77.1 on the test set, while the paper reports 78.5. What accounts for the performance drop?

Thank you!


NielsRogge commented Mar 1, 2021

Hi,

> Thank you for adding TAPAS to huggingface/transformers. It is really helpful.

Thank you :)

Actually, good question. I first thought I had forgotten to put the model in evaluation mode, but models are set to .eval() by default in Transformers. So I'm not sure.
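
A quick way to verify that default, for what it's worth:

from transformers import TapasForSequenceClassification

# from_pretrained returns the model in eval mode, so dropout is already disabled
model = TapasForSequenceClassification.from_pretrained("google/tapas-base-finetuned-tabfact")
print(model.training)  # False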


JasperGuo commented Apr 3, 2021

Thanks for your great contributions around TAPAS.

I have pinpointed the reason for the lower performance (77.1): it is caused by the lemmatized statements in the Hugging Face TabFact dataset. TAPAS was not trained on lemmatized statements. By replacing the lemmatized statements with the original ones, I got 79.1% accuracy on the test set.

The original statements are available at the following link:
https://github.com/wenhuchen/Table-Fact-Checking/tree/master/collected_data
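
Roughly, rebuilding test.jsonl with the original statements could look like the sketch below. The file names (r1_training_all.json, r2_training_all.json, test_id.json, the all_csv directory) and the collected_data schema (table_id -> [statements, labels, caption]) are assumptions based on that repository, so adjust them to whatever you actually find there.

import csv
import json
import os

def load_table(table_id, csv_dir="all_csv"):
    # the TabFact csv files use '#' as the column delimiter (assumption; check the repo)
    with open(os.path.join(csv_dir, table_id)) as f:
        return list(csv.reader(f, delimiter="#"))  # first row is the header

# the collected_data files map table_id -> [statements, labels, caption] (assumed schema)
collected = {}
for name in ("r1_training_all.json", "r2_training_all.json"):
    with open(name) as f:
        collected.update(json.load(f))

with open("test_id.json") as f:  # ids of the official test split (assumed file name)
    test_ids = json.load(f)

with open("test.jsonl", "w") as f:
    for table_id in test_ids:
        statements, labels, caption = collected[table_id]
        table_text = load_table(table_id)
        for statement, label in zip(statements, labels):
            f.write(json.dumps({
                "table_id": table_id,
                "table_caption": caption,
                "table_text": table_text,
                "statement": statement,  # original, non-lemmatized statement
                "label": label,          # 1 = entailed, 0 = refuted
            }) + "\n")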

@NielsRogge

Great, thank you @JasperGuo!

Do you have any code that you can share?


JasperGuo commented Apr 7, 2021

Sure. Here is the raw train, validation, and test data.

Download the data and run the following script; it should reproduce the 79.1% accuracy.

from typing import List
import torch
import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification
from datasets import load_dataset, load_metric, Features, Sequence, ClassLabel, Value, Array2D

def prepare_official_data_loader():
    tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-finetuned-tabfact')
    # schema of the tokenized examples; TAPAS token type ids have 7 channels per token
    features = Features({
        'attention_mask': Sequence(Value(dtype='int64')),
        'input_ids': Sequence(feature=Value(dtype='int64')),
        'label': ClassLabel(names=['refuted', 'entailed']),
        'statement': Value(dtype='string'),
        'table_caption': Value(dtype='string'),
        'table_id': Value(dtype='string'),
        'token_type_ids': Array2D(dtype="int64", shape=(512, 7))
    })
    test_set = load_dataset('json', data_files={'test': 'test.jsonl'}, split='test')

    def _format_pd_table(table_text: List) -> pd.DataFrame:
        # the first row of table_text is the header, the remaining rows are the cells
        df = pd.DataFrame(columns=table_text[0], data=table_text[1:])
        df = df.astype(str)
        return df

    test = test_set.map(
        lambda e: tokenizer(table=_format_pd_table(e['table_text']), queries=e['statement'],
                            truncation=True,
                            padding='max_length'),
        features=features,
        remove_columns=['table_text'],
    )
    # map to PyTorch tensors and only keep columns we need
    test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
    # create PyTorch dataloader
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=4)

    return test_dataloader

def evaluate():
    accuracy = load_metric("accuracy")
    test_dataloader = prepare_official_data_loader()
    # sanity check the shapes of the first batch
    batch = next(iter(test_dataloader))
    assert batch["input_ids"].shape == (4, 512)
    assert batch["attention_mask"].shape == (4, 512)
    assert batch["token_type_ids"].shape == (4, 512, 7)

    # Evaluate
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = TapasForSequenceClassification.from_pretrained('google/tapas-base-finetuned-tabfact')
    model.to(device)

    number_processed = 0
    total = len(test_dataloader) * batch["input_ids"].shape[0]  # number of batches * batch size (upper bound if the last batch is smaller)
    for batch in test_dataloader:
        # get the inputs
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        labels = batch["label"].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                        labels=labels)
        model_predictions = outputs.logits.argmax(-1)

        # add metric
        accuracy.add_batch(predictions=model_predictions, references=labels)

        number_processed += batch["input_ids"].shape[0]
        print(f"Processed {number_processed} / {total} examples")

    final_score = accuracy.compute()
    print(final_score)

if __name__ == '__main__':
    evaluate()


FeiWang96 commented Apr 8, 2021

Thank you! It is really helpful. @JasperGuo


qshi95 commented May 6, 2021

@JasperGuo Can the data be further split into simple_test, complex_test, and small_test? Thank you!

@FeiWang96

> @JasperGuo Can the data be further split into simple_test, complex_test, and small_test? Thank you!

You can filter them by table id.
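
For example, something along these lines (a sketch; the id list file name simple_test_id.json is an assumption based on the original repository):

import json
from datasets import load_dataset

# table id list from the original Table-Fact-Checking repo (file name is an assumption)
with open("simple_test_id.json") as f:
    simple_ids = set(json.load(f))

test_set = load_dataset("json", data_files={"test": "test.jsonl"}, split="test")

# keep only the statements whose table belongs to the simple test subset
simple_test = test_set.filter(lambda example: example["table_id"] in simple_ids)
print(len(simple_test))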
