
Custom dataset training failed due to IndexError: list index out of range #91

Closed
AI-EnabledSoftwareEngineering-AISE opened this issue May 2, 2022 · 7 comments

AI-EnabledSoftwareEngineering-AISE commented May 2, 2022

I organized my dataset in a TSV file as you described. I used this code to base64-encode the images:

from PIL import Image
from io import BytesIO
import base64

img = Image.open(fn)                       # fn: path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)    # re-encode in the image's original format
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)   # note: b64encode returns bytes, not str
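
For anyone reproducing this: since b64encode returns bytes, the value needs a decode before it goes into a text TSV. A minimal sketch of assembling one row (uniq_id, image_id, and caption are assumed variables; the label column is a single space):

base64_str = base64.b64encode(byte_data).decode('utf-8')  # bytes -> str for the TSV
row = '\t'.join([str(uniq_id), str(image_id), caption, ' ', base64_str])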

Then I organized the data in a TSV file with these columns: uniq-id, image-id, caption, predicted object labels (empty string), image base64 string. The size of my data frame is 5 × 47899. But while the data is loading it reports: caption_stage1_train.tsv slice_id 1 row count 24100 total row count 48200 slice_id 1 seek offset 24100.

2022-05-02 00:17:19 - trainer.py[line:124] - INFO: detected shared parameter: encoder.embed_images.conv1.bias <- decoder.output_projection.bias
2022-05-02 00:17:19 - utils.py[line:759] - INFO: ***********************CUDA enviroments for all 2 workers***********************
2022-05-02 00:17:19 - utils.py[line:765] - INFO: rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2022-05-02 00:17:19 - utils.py[line:765] - INFO: rank   1: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2022-05-02 00:17:19 - utils.py[line:767] - INFO: ***********************CUDA enviroments for all 2 workers***********************
2022-05-02 00:17:19 - train.py[line:145] - INFO: training on 2 devices (GPUs/TPUs)
2022-05-02 00:17:19 - train.py[line:151] - INFO: max tokens per device = None and max sentences per device = 8
2022-05-02 00:17:19 - trainer.py[line:458] - INFO: Preparing to load checkpoint ../../checkpoints/ofa_base.pt
2022-05-02 00:17:24 - trainer.py[line:309] - INFO: NOTE: your device does NOT support faster training with --fp16 or --amp, please switch to FP32 which is likely to be faster
2022-05-02 00:17:24 - trainer.py[line:619] - INFO: Loaded checkpoint ../../checkpoints/ofa_base.pt (epoch 48 @ 0 updates)
2022-05-02 00:17:24 - trainer.py[line:639] - INFO: loading train data for epoch 1
local datafile /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 0 begin to initialize row_count and line_idx-to-offset mapping
local datafile /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 1 begin to initialize row_count and line_idx-to-offset mapping
local datafile /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 1 finished initializing row_count and line_idx-to-offset mapping
file /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 1 row count 24100 total row count 48200
local datafile /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 0 finished initializing row_count and line_idx-to-offset mapping
file /raid/AISSEL/Hamed/datasets/caption_data/caption_stage1_train.tsv slice_id 0 row count 24100 total row count 48200
slice_id 0 seek offset 0
Total steps 3770, warmup steps 226, warmup_factor 0.004424778761061947
2022-05-02 00:18:04 - trainer.py[line:703] - INFO: begin training epoch 1
2022-05-02 00:18:04 - train.py[line:296] - INFO: Start iterating over samples
slice_id 1 seek offset 24100
Total steps 3770, warmup steps 226, warmup_factor 0.004424778761061947
Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "../../train.py", line 297, in train
    for i, samples in enumerate(progress):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/logging/progress_bar.py", line 261, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/data/iterators.py", line 509, in _chunk_iterator
    for x in itr:
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/data/iterators.py", line 56, in __next__
    x = next(self._itr)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/data/iterators.py", line 637, in __next__
    raise item
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/fairseq/data/iterators.py", line 567, in run
    for item in self._source:
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/XXXX/ofa/OFA/data/mm_data/caption_dataset.py", line 117, in __getitem__
    uniq_id, image, caption = self.dataset[index]
  File "/home/XXXX/ofa/OFA/data/file_dataset.py", line 106, in __getitem__
    column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
  File "/home/XXXX/ofa/OFA/data/file_dataset.py", line 106, in <listcomp>
    column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]
IndexError: list index out of range
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13615 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13614) of binary: /home/XXXX/.conda/envs/ofa/bin/python
Traceback (most recent call last):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-02_00:18:09
  host      : hartley
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13614)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@AI-EnabledSoftwareEngineering-AISE (Author) commented:

I tried with your dataset and it did not give an index-out-of-range error. I also tried with a 1000-row sample of my dataset, and it again gave me the out-of-range error. To create the TSV file I use this command:

df_train.to_csv(f'{raid_path}/caption_stage1_train.tsv', sep="\t", index=False, header=False)
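
Side note on the counts above: the frame has 47899 rows, yet the loader reports 48200 lines in total. The extra ~300 lines suggest embedded newlines (or tabs) inside some field, which file_dataset.py then treats as extra rows. A quick check, assuming the caption column is named 'caption':

# Count captions containing characters that would corrupt a raw TSV line.
bad = df_train['caption'].str.contains(r'[\t\r\n]', regex=True, na=False)
print(bad.sum(), "captions contain tab/newline characters")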

Also, my dataframe looks like this:
[screenshot: dataframe preview, 2022-05-02 1:06 AM]

It is strange to me; what is happening to the data?
[screenshot: loaded-data preview, 2022-05-02 1:09 AM]

@AI-EnabledSoftwareEngineering-AISE (Author) commented:

Could a set of characters in the caption column be causing the problem? Did your dataset have any issues like this? If so, could you please send me your data-cleaning code?

@logicwong (Member) commented:

@AI-EnabledSoftwareEngineering-AISE Hi, I would recommend processing the data as follows (a combined sketch follows the list):

  1. Set the predicted object labels to ' ' or any other character; do not let them be NaN.
  2. Delete any '\t' in the caption, e.g. caption = caption.replace('\t', ' ').
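
A combined sketch of both steps, assuming a df_train with columns named 'caption' and 'label' (names hypothetical):

import csv

df_train['label'] = df_train['label'].fillna(' ')  # 1. no NaN labels
df_train['caption'] = df_train['caption'].str.replace(r'[\t\r\n]', ' ', regex=True)  # 2. no tabs/newlines
# Write without quoting so every raw line splits cleanly on '\t'.
df_train.to_csv('caption_stage1_train.tsv', sep='\t', index=False, header=False, quoting=csv.QUOTE_NONE)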

@AI-EnabledSoftwareEngineering-AISE (Author) commented:

Thank you, I solved it by removing all special characters from the captions:

def remove_special(input_string):
    # Keep only spaces and alphanumeric characters; drop everything else.
    return "".join(ch for ch in input_string if ch == " " or ch.isalnum())
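
Applied over the frame, e.g. (column name assumed):

df_train['caption'] = df_train['caption'].apply(remove_special)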

@zzhanghub commented:

Hi, I encountered the same problem, list index out of range, at the line column_l = [dtype(column_l[col_id]) for col_id, dtype in zip(self.selected_col_ids, self.dtypes)]. There are no special characters in my captions. I see that several other issues mention this problem. Do you have any fix for it 😭?
Thank you!

#131
#94

@shengjie1980 commented (replying to @zzhanghub above):

You can modify line 53 of data/file_dataset.py from `fp = open(self.file_path, "r")` to `fp = open(self.file_path, "rb")`, and modify line 62 from `offset += len(line.encode('utf-8'))` to `offset += len(line)`.
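
For context, file_dataset.py builds a line-index-to-byte-offset map and later seek()s to those offsets. A sketch of why the binary-mode version stays consistent (a simplification, not the repo's exact code; file_path is assumed):

# In text mode, universal-newline handling translates '\r\n' (and lone '\r') to '\n',
# so len(line.encode('utf-8')) can disagree with the bytes actually on disk; later
# seek() calls then land mid-row and the split yields too few columns (the IndexError).
offset = 0
with open(file_path, 'rb') as fp:  # binary mode: bytes exactly as they are on disk
    for line in fp:
        offset += len(line)        # exact byte length, so seek(offset) stays row-aligned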

@zzhanghub commented:

@shengjie1980
Great, it works! Thank you!
Could you explain the reason behind this?
