Why does TAPAS perform worse than reported? #2

Closed
FeiWang96 opened this issue Mar 1, 2021 · 7 comments

Comments


FeiWang96 commented Mar 1, 2021

Hi, nice tutorials!

Thank you for adding TAPAS to huggingface/transformers. It is really helpful.

However, according to your Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb, tapas-base-finetuned-tabfact reaches 77.1 on the test set, while the paper reports 78.5. What accounts for the performance drop?

Thank you!


NielsRogge commented Mar 1, 2021

Hi,

> Thank you for adding TAPAS to huggingface/transformers. It is really helpful.

Thank you :)

Actually, good question. I first thought I had forgotten to put the model in evaluation mode, but models are set to .eval() by default in Transformers. So I'm not sure.
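
A quick way to verify that default, for what it's worth:

from transformers import TapasForSequenceClassification

# from_pretrained returns the model in eval mode, so dropout is already disabled
model = TapasForSequenceClassification.from_pretrained("google/tapas-base-finetuned-tabfact")
print(model.training)  # False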


JasperGuo commented Apr 3, 2021

Thanks for your great contributions around TAPAS.

I have pinpointed the reason for the lower performance (77.1): it is caused by the lemmatized statements in the Hugging Face TabFact dataset. TAPAS was not trained on lemmatized statements. By replacing the lemmatized statements with the original ones, I got 79.1% accuracy on the test set.

The original statements are available at the following link:
https://github.com/wenhuchen/Table-Fact-Checking/tree/master/collected_data
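
Roughly, rebuilding test.jsonl with the original statements could look like the sketch below. The file names (r1_training_all.json, r2_training_all.json, test_id.json, the all_csv directory) and the collected_data schema (table_id -> [statements, labels, caption]) are assumptions based on that repository, so adjust them to whatever you actually find there.

import csv
import json
import os

def load_table(table_id, csv_dir="all_csv"):
    # the TabFact csv files use '#' as the column delimiter (assumption; check the repo)
    with open(os.path.join(csv_dir, table_id)) as f:
        return list(csv.reader(f, delimiter="#"))  # first row is the header

# the collected_data files map table_id -> [statements, labels, caption] (assumed schema)
collected = {}
for name in ("r1_training_all.json", "r2_training_all.json"):
    with open(name) as f:
        collected.update(json.load(f))

with open("test_id.json") as f:  # ids of the official test split (assumed file name)
    test_ids = json.load(f)

with open("test.jsonl", "w") as f:
    for table_id in test_ids:
        statements, labels, caption = collected[table_id]
        table_text = load_table(table_id)
        for statement, label in zip(statements, labels):
            f.write(json.dumps({
                "table_id": table_id,
                "table_caption": caption,
                "table_text": table_text,
                "statement": statement,  # original, non-lemmatized statement
                "label": label,          # 1 = entailed, 0 = refuted
            }) + "\n")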

@NielsRogge

Great, thank you @JasperGuo!

Do you have any code that you can share?


JasperGuo commented Apr 7, 2021

Sure. Here is the raw train, validation, and test data.

Download the data and run the following script; it should reproduce the 79.1% accuracy.

from typing import List
import torch
import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification
from datasets import load_dataset, load_metric, Features, Sequence, ClassLabel, Value, Array2D

def prepare_official_data_loader():
    tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-finetuned-tabfact')
    # schema of the tokenized examples; TAPAS token type ids have 7 channels per token
    features = Features({
        'attention_mask': Sequence(Value(dtype='int64')),
        'input_ids': Sequence(feature=Value(dtype='int64')),
        'label': ClassLabel(names=['refuted', 'entailed']),
        'statement': Value(dtype='string'),
        'table_caption': Value(dtype='string'),
        'table_id': Value(dtype='string'),
        'token_type_ids': Array2D(dtype="int64", shape=(512, 7))
    })
    test_set = load_dataset('json', data_files={'test': 'test.jsonl'}, split='test')

    def _format_pd_table(table_text: List) -> pd.DataFrame:
        # the first row of table_text is the header, the remaining rows are the cells
        df = pd.DataFrame(columns=table_text[0], data=table_text[1:])
        df = df.astype(str)
        return df

    test = test_set.map(
        lambda e: tokenizer(table=_format_pd_table(e['table_text']), queries=e['statement'],
                            truncation=True,
                            padding='max_length'),
        features=features,
        remove_columns=['table_text'],
    )
    # map to PyTorch tensors and only keep columns we need
    test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
    # create PyTorch dataloader
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=4)

    return test_dataloader

def evaluate():
    accuracy = load_metric("accuracy")
    test_dataloader = prepare_official_data_loader()
    # sanity check the shapes of the first batch
    batch = next(iter(test_dataloader))
    assert batch["input_ids"].shape == (4, 512)
    assert batch["attention_mask"].shape == (4, 512)
    assert batch["token_type_ids"].shape == (4, 512, 7)

    # Evaluate
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = TapasForSequenceClassification.from_pretrained('google/tapas-base-finetuned-tabfact')
    model.to(device)

    number_processed = 0
    total = len(test_dataloader) * batch["input_ids"].shape[0]  # number of batches * batch size (upper bound if the last batch is smaller)
    for batch in test_dataloader:
        # get the inputs
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        labels = batch["label"].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                        labels=labels)
        model_predictions = outputs.logits.argmax(-1)

        # add metric
        accuracy.add_batch(predictions=model_predictions, references=labels)

        number_processed += batch["input_ids"].shape[0]
        print(f"Processed {number_processed} / {total} examples")

    final_score = accuracy.compute()
    print(final_score)

if __name__ == '__main__':
    evaluate()


FeiWang96 commented Apr 8, 2021

Thank you! It is really helpful. @JasperGuo


qshi95 commented May 6, 2021

@JasperGuo Can the data be further split into simple_test, complex_test, and small_test? Thank you!

@FeiWang96

> @JasperGuo Can the data be further split into simple_test, complex_test, and small_test? Thank you!

You can filter them by table id.
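
For example, something along these lines (a sketch; the id list file name simple_test_id.json is an assumption based on the original repository):

import json
from datasets import load_dataset

# table id list from the original Table-Fact-Checking repo (file name is an assumption)
with open("simple_test_id.json") as f:
    simple_ids = set(json.load(f))

test_set = load_dataset("json", data_files={"test": "test.jsonl"}, split="test")

# keep only the statements whose table belongs to the simple test subset
simple_test = test_set.filter(lambda example: example["table_id"] in simple_ids)
print(len(simple_test))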
