ValueError: Unsupported dataset schema #449 #529

Closed
marwanomar1 opened this issue Sep 20, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@marwanomar1

I am running adversarial training on NLP models and I am getting the error "ValueError: Unsupported dataset schema" when I run the following code:
import textattack
import transformers
from textattack.datasets import HuggingFaceDataset

model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# We only use DeepWordBugGao2018 for demonstration purposes.
attack = textattack.attack_recipes.DeepWordBugGao2018.build(model_wrapper)
train_dataset = HuggingFaceDataset('squad', split='train')
eval_dataset = HuggingFaceDataset('squad', split='validation')

# Train for 3 epochs with 1 initial clean epoch, 1000 adversarial examples per epoch, a learning rate of 5e-5, and an effective batch size of 32 (8x4).
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

trainer = textattack.Trainer(
    model_wrapper,
    "classification",
    attack,
    train_dataset,
    eval_dataset,
    training_args,
)
trainer.train()
@jxmorris12

@jxmorris12
Collaborator

I suggested a fix that you haven't tried yet:

A quick diagnosis tells me you should be using our HuggingFaceDataset class to wrap the dataset instead of importing it directly from huggingface datasets. So in the code you posted, your dataset initializations might look something like:

from textattack.datasets import HuggingFaceDataset

train_dataset = HuggingFaceDataset('squad', split='train')
eval_dataset = HuggingFaceDataset('squad', split='validation')

@jxmorris12 added the bug (Something isn't working) label Sep 22, 2021
@marwanomar1
Author

Thank you, Jack. Things are working now. When I try the yelp dataset in the same code above, it shows that training will take several days to complete, because the dataset has about 560,000 examples.

Is it possible to reduce the number of examples to about 10k so that it runs faster?

@jxmorris12
Collaborator

yes! I would try using the rotten_tomatoes dataset instead. It's much smaller.
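Alternatively, you could cap the number of examples with a split slice (a minimal sketch; it assumes HuggingFaceDataset forwards the split string to datasets.load_dataset, which supports slice syntax like "train[:10000]"):

from textattack.datasets import HuggingFaceDataset

# Load only the first 10k training and 1k test examples (assumption:
# the split string is passed through to datasets.load_dataset unchanged).
train_dataset = HuggingFaceDataset("yelp_polarity", split="train[:10000]")
eval_dataset = HuggingFaceDataset("yelp_polarity", split="test[:1000]")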

@marwanomar1
Author

Great. Many thanks. I really appreciate it.

@marwanomar1
Author

I am running the following code to test IMDB on a WordCNN model, and it gives me this error: NameError: name 'model_wrapper' is not defined

!pip install textattack
!pip install -U tensorflow-text

import json
import os

import torch
from torch import nn as nn
from torch.nn import functional as F

import textattack
from textattack.model_args import TEXTATTACK_MODELS
from textattack.models.helpers import GloveEmbeddingLayer
from textattack.models.helpers.utils import load_cached_state_dict
from textattack.shared import utils

# We only use DeepWordBugGao2018 for demonstration purposes.

attack = textattack.attack_recipes.DeepWordBugGao2018.build(model_wrapper)
train_dataset = textattack.datasets.HuggingFaceDataset("imdb", split="train")
eval_dataset = textattack.datasets.HuggingFaceDataset("imdb", split="test")

# Train for 3 epochs with 1 initial clean epoch, 1000 adversarial examples per epoch, a learning rate of 5e-5, and an effective batch size of 32 (8x4).

training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)
trainer = textattack.Trainer(
    model_wrapper,
    "classification",
    attack,
    train_dataset,
    eval_dataset,
    training_args,
)
trainer.train()

@jxmorris12

@jxmorris12
Collaborator

Uhh, yeah, you still need this piece of code:

import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)
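Since the original snippet targets a WordCNN model, here is a sketch of the analogous wrapper for TextAttack's word-level CNN (assumptions: "cnn-imdb" is a valid built-in checkpoint name for WordCNNForClassification.from_pretrained, and the model exposes a tokenizer attribute):

from textattack.models.helpers import WordCNNForClassification

# Load TextAttack's pretrained word-level CNN for IMDB ("cnn-imdb" is an
# assumed built-in checkpoint name) and wrap it for attacks/training.
cnn = WordCNNForClassification.from_pretrained("cnn-imdb")
model_wrapper = textattack.models.wrappers.PyTorchModelWrapper(cnn, cnn.tokenizer)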

@marwanomar1
Author

That worked. Many thanks!

@marwanomar1
Author

I ran the training on LSTM using this command: textattack train --model-name-or-path lstm --dataset yelp_polarity --epochs 50 --learning-rate 1e-5
Now I want to know which command to use to attack this same model I just trained. I want to attack it with TextFooler.

@jxmorris12
Collaborator

Pretty sure you have to create a model wrapper file and use the --model-from-file argument to textattack attack. Or you could just write a script that loads the model and runs attacks in the script.
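A minimal sketch of such a wrapper file (assumptions: --model-from-file expects the file to define a module-level variable named model holding a ModelWrapper, LSTMForClassification.from_pretrained accepts a local checkpoint path, and the path below is a hypothetical placeholder):

# my_model.py -- hypothetical wrapper file for the trained LSTM checkpoint.
import textattack
from textattack.models.helpers import LSTMForClassification

# Hypothetical path to the checkpoint directory saved by `textattack train`.
lstm = LSTMForClassification.from_pretrained("./outputs/<timestamp>/best_model")

# `textattack attack --model-from-file my_model.py` is assumed to look for
# this module-level `model` variable.
model = textattack.models.wrappers.PyTorchModelWrapper(lstm, lstm.tokenizer)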

@marwanomar1
Author

When I try to run an attack using my saved model, I use this command: !textattack attack --recipe textfooler --num-examples 100 --model ./outputs/2021-09-15-06-37-33-327512/best_model --dataset-from-huggingface imdb --dataset-split test

but it gives me this error: ValueError: Error: unsupported TextAttack model ./outputs/2021-09-15-06-37-33-327512/best_model

Do you know what could be going wrong?

@jxmorris12

@jxmorris12
Collaborator

You're using --model, not --model-from-file; I think that's the problem!
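With the hypothetical wrapper file sketched above (my_model.py), the corrected invocation might look like:

!textattack attack --recipe textfooler --num-examples 100 --model-from-file my_model.py --dataset-from-huggingface imdb --dataset-split test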

@marwanomar1
Author

I am trying to run an attack on a pretrained, fine-tuned model as follows:
!textattack attack --model cardiffnlp/twitter-roberta-base-offensive --recipe deepwordbug --num-examples 10

but it's giving me the following error:
ValueError: Must supply pretrained model or dataset

I am not sure why it would not take the pretrained model above. Is there anything I am doing wrong here?

@jxmorris12
