How to train VQA on my custom data? #73

Closed
xiaoqiang-lu opened this issue Apr 18, 2022 · 11 comments

@xiaoqiang-lu

Hello! I am trying to fine-tune OFA-large on VQA with a custom dataset, following the fine-tuning instructions in the repo. I have checked my .tsv and .pkl files several times, and they match the format of the sample you provided. But after running "bash train_vqa_distributed.sh", the terminal just prints:

total_num_updates 40000
warmup_updates 1000
lr 5e-5
patch_image_size 480

GPU usage rises to a certain value, then suddenly drops to zero, and the program exits. I am training on a single server with 2 GPUs. Looking forward to your reply, and thanks for sharing your work!

@yangapku yangapku self-assigned this Apr 19, 2022
@yangapku
Member

Hi, could you please provide the exact script you ran on your machine and the type of your GPU cards? I will check it in my environment.

@yangapku
Member

Moreover, for fine-tuning on custom VQA-formatted data, please also refer to the recent issue #76 for more information.

@xiaoqiang-lu
Author

Thanks for your reply! At first I was using two 3080 Ti cards; I have now replaced them with four V100 cards, but the same problem occurs. The script on my machine:

GPUS_PER_NODE=4
WORKER_CNT=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=8214
export RNAK=0

The rest is unchanged. I also made my own ans2label.pkl file.
Here is part of my .tsv file, without the imgbase64 field.
image
Here is part of my .pkl file.
image

@yangapku
Member

yangapku commented Apr 22, 2022

Hi, have you checked the path of $log_file defined in your training script? The running log is saved to this file rather than printed to stdout. The program may have ended for another reason, which may be recorded in the log. Please share more information if you find this log file.

@xiaoqiang-lu
Author

Thanks! It seems that a problem with my images is causing this. I am using the code from your reply in issue #56 to generate imgbase64.
image
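
The thread does not record what was wrong with the image field. As a general sanity check, the sketch below tests whether a base64 string decodes cleanly and starts with a JPEG or PNG signature. It is not the code from issue #56; `sniff_image` is a hypothetical helper, and limiting the check to JPEG and PNG is an assumption.

```python
import base64
import binascii
from typing import Optional

# File signatures ("magic bytes") for the two formats checked here.
MAGIC = {b"\xff\xd8\xff": "jpeg", b"\x89PNG\r\n\x1a\n": "png"}


def sniff_image(b64_field: str) -> Optional[str]:
    """Return 'jpeg' or 'png' if the field decodes to bytes with that signature, else None."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(b64_field, validate=True)
    except (binascii.Error, ValueError):
        return None
    for magic, kind in MAGIC.items():
        if raw.startswith(magic):
            return kind
    return None


# Round trip on synthetic bytes (a PNG signature followed by filler, not a real image).
fake_png = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"filler").decode("utf-8")
print(sniff_image(fake_png))
print(sniff_image("not base64!!"))
```

A `None` result points to a field that is not valid base64 or not one of the two formats checked; a match on the signature alone does not prove the image is fully decodable.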

@xiaoqiang-lu
Author

I have solved the above problem, but another problem has occurred.
image

@yangapku
Member

yangapku commented Apr 22, 2022

Hi, please check whether the fields of the input data line that caused this error correspond to the specified selected_cols. By default, selected_cols is set to 0,5,2,3,4 in the script, which sequentially fetches the 0th (uniq_id), 5th (image), 2nd (question), 3rd (answer info), and 4th (predict_objects) fields from each input TSV line. If any of the fields mismatch, errors may occur.
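
One way to run this check is to confirm that every TSV line has enough tab-separated fields for the largest index in selected_cols. The following is a minimal sketch, not code from the repo; `find_bad_lines` is a hypothetical helper and the sample field values are placeholders.

```python
from typing import Iterable, List, Tuple


def find_bad_lines(lines: Iterable[str], selected_cols: str = "0,5,2,3,4") -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for every line too short for selected_cols."""
    # The largest selected index determines the minimum number of fields.
    needed = max(int(c) for c in selected_cols.split(",")) + 1
    bad = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < needed:
            bad.append((line_no, len(fields)))
    return bad


# The first line has six fields; the second has five, so index 5 does not exist.
sample = [
    "q1\timg1\twhat is it?\tANSWER\tOBJECTS\tBASE64",
    "q2\twhat is it?\tANSWER\tOBJECTS\tBASE64",
]
print(find_bad_lines(sample))  # prints [(2, 5)]
```

For a real dataset, the same function can be passed an open file handle, since iterating over a text file yields its lines.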

@xiaoqiang-lu
Author

I have checked the input data line, and it is the same as the example. I printed column_l and its length; column_l is correct: [img_id, imgbase64, question, answer, objects].
image

@yangapku
Member

yangapku commented Apr 22, 2022

Hi, I think there is a misunderstanding of how each data line is organized. As mentioned in the readme, each line of the TSV file contains the fields in this exact order: question-id, image-id, question, answer (with confidence), predicted object labels, and image base64 string. There are thus 6 fields in total in the TSV file (the image-id field is not used). With selected_cols=0,5,2,3,4, the program sequentially fetches the 0th (question-id), 5th (image), 2nd (question), 3rd (answer info), and 4th (predict_objects) fields from each input TSV line, resulting in a sample that is further processed in the __getitem__ method of VqaGenDataset.
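
To make the layout concrete, here is a minimal sketch that reorders a five-field row of the form [img_id, imgbase64, question, answer, objects] into the six-field order described above, then shows what selected_cols=0,5,2,3,4 would pick from it. `to_six_fields` is a hypothetical helper, the question-id passed in is an assumption (any unique id of your choosing), and the selection line only imitates the indexing; it is not the repo's reader.

```python
def to_six_fields(five_field_line: str, question_id: str) -> str:
    """Reorder [img_id, imgbase64, question, answer, objects] into the six-field layout."""
    img_id, img_b64, question, answer, objects = five_field_line.rstrip("\n").split("\t")
    # Target order: question-id, image-id, question, answer, predicted objects, image base64.
    return "\t".join([question_id, img_id, question, answer, objects, img_b64])


row = to_six_fields("img1\tBASE64\twhat is it?\tANS\tOBJ", question_id="0")
fields = row.split("\t")
print(fields)   # ['0', 'img1', 'what is it?', 'ANS', 'OBJ', 'BASE64']

# Imitate the column selection: indices 0, 5, 2, 3, 4 in that order.
picked = [fields[int(i)] for i in "0,5,2,3,4".split(",")]
print(picked)   # ['0', 'BASE64', 'what is it?', 'ANS', 'OBJ']
```

The tuple unpacking raises a ValueError on any input row that does not have exactly five fields, which surfaces malformed rows early.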

@yangapku
Member

yangapku commented Apr 22, 2022

By the way, when preparing the dataset TSV file, I would also recommend splitting each original training sample that has more than one golden answer into multiple samples, each containing only one of the answers. This takes full advantage of the supervision from all ground-truth answers of the training samples. Otherwise, only the golden answer with the highest confidence score is used as supervision.
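
A minimal sketch of that split is shown below. It assumes that multiple answers in the answer field are joined by `&&` and that each is written as `confidence|!+answer`; this thread does not state the encoding, so verify it against your own TSV. `split_answers` is a hypothetical helper. It keeps each answer's original confidence and leaves the question-id unchanged, both of which are choices of this sketch rather than requirements confirmed by the maintainers.

```python
from typing import List


def split_answers(line: str, answer_col: int = 3, sep: str = "&&") -> List[str]:
    """Turn one TSV line with several answers into one line per answer."""
    fields = line.rstrip("\n").split("\t")
    # Drop empty pieces so a stray separator does not create an empty sample.
    answers = [a for a in fields[answer_col].split(sep) if a]
    out = []
    for answer in answers:
        new_fields = list(fields)
        new_fields[answer_col] = answer  # keep exactly one answer in this copy
        out.append("\t".join(new_fields))
    return out


# Placeholder row in the six-field order; the answer encoding is the assumption named above.
line = "q1\timg1\twhat color?\t0.6|!+red&&0.3|!+dark red\tOBJ\tBASE64"
for new_line in split_answers(line):
    print(new_line.split("\t")[3])
```

A line with a single answer passes through unchanged, so the function can be applied to every row of the training file.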

@hieptran1812

Thanks! It seems to be a problem with my image that is causing this, I am using the code you replied to in issue #56 for imgbase64. image

How did you resolve this problem? I'm having the same problem. Thanks.
