Preprocess with debug flag fails. #1544

Closed
6 of 8 tasks
amitagh opened this issue Apr 19, 2024 · 6 comments · Fixed by #1548
Labels
bug Something isn't working

Comments

amitagh commented Apr 19, 2024

Please check that this issue hasn't been reported before.

  • I searched previous bug reports and didn't find any similar reports.

Expected Behavior

Preprocess with debug flag should work.
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug

Current behaviour

It gives an error. I have a JSON file where each example is of the form {"text": <text_str>}.
I am doing pretraining with LoRA for a non-English language.

[2024-04-19 09:05:02,918] [DEBUG] [axolotl.log:61] [PID:2346] [RANK:0] max_input_len: 600
Dropping Long Sequences (num_proc=2): 100% 17/17 [00:00<00:00, 99.19 examples/s]
Add position_id column (Sample Packing) (num_proc=2): 100% 17/17 [00:00<00:00, 70.88 examples/s]
[2024-04-19 09:05:03,502] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:2346] [RANK:0] Saving merged prepared dataset to disk... /content/d538aae6e42c7df428d20d3ff2685ad0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/content/src/axolotl/src/axolotl/cli/__init__.py", line 397, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 424, in load_tokenized_prepared_datasets
    dataset.save_to_disk(prepared_ds_path)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
    fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 363, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 316, in _un_chain
    if "::" in path
TypeError: argument of type 'PosixPath' is not iterable

Steps to reproduce

Use a JSON file where each example is of the form {"text": <text_str>}.

Preprocess with the debug flag:
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug
This produces the error above.
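For reference, a hypothetical minimal dataset file in the format described above (the issue does not say whether the file is a single JSON array or JSON Lines; this sketch writes JSON Lines, one object per line, and the example strings are placeholders):

```python
import json

# Each example is an object with a single "text" field, as described above.
examples = [
    {"text": "First training example."},
    {"text": "Second training example."},
]

# Write one JSON object per line (JSONL).
with open("test_txt_data.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read it back to confirm the structure survives a round trip.
with open("test_txt_data.json", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), list(rows[0]))  # 2 ['text']
```

This matches the `type: completion` / `field: text` dataset entry in the config below.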

Config yaml

base_model: google/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./test_txt_data.json
    type: completion
    field: text
dataset_prepared_path: data/last_run_prepared
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out

adapter: lora
lora_model_dir:

gpu_memory_limit: 76

sequence_len: 1100
sample_packing: true
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules: 
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_target_linear: true
lora_fan_in_fan_out:

save_safetensors: True

gradient_accumulation_steps: 2
micro_batch_size: 10
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 500
xformers_attention:
flash_attention: True

evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

Latest

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
amitagh added the bug label Apr 19, 2024
Napuh (Contributor) commented Apr 19, 2024

+1

Napuh (Contributor) commented Apr 19, 2024

Update:

downgrading datasets to 2.15.0 seems to work for me.
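The downgrade described above can be applied by pinning the package with pip (version number taken from this comment; no other assumptions):

```shell
# Workaround reported in this thread: pin datasets to 2.15.0.
pip install "datasets==2.15.0"
```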

jorge-tromero commented
+1

FrankRuis (Contributor) commented
See my change in #1548: wrapping prepared_ds_path with str() in the dataset.save_to_disk(prepared_ds_path) call in src/axolotl/utils/data/sft.py fixes this without downgrading any packages.
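A minimal sketch of the root cause and this fix, outside of axolotl (the path value is illustrative, taken from dataset_prepared_path in the config above; the membership test mirrors the `"::" in path` check that fsspec's _un_chain performs in the traceback):

```python
from pathlib import Path

# axolotl passes a PosixPath to dataset.save_to_disk(); newer fsspec
# then does a string membership test on it, which a Path cannot support.
prepared_ds_path = Path("data/last_run_prepared")

try:
    "::" in prepared_ds_path  # what fsspec's _un_chain effectively does
    raised = False
except TypeError as err:
    raised = True
    print(err)  # argument of type 'PosixPath' is not iterable

# The fix from #1548: hand the path over as a plain string instead.
print("::" in str(prepared_ds_path))  # False; the membership test now works
```

The same str() conversion is what the patched dataset.save_to_disk(str(prepared_ds_path)) call performs.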

monk1337 (Contributor) commented

> 2.15.0

It worked for me as well!

qiyuangong added a commit to qiyuangong/ipex-llm that referenced this issue Apr 23, 2024
qiyuangong added a commit to intel-analytics/ipex-llm that referenced this issue Apr 23, 2024
* Downgrade datasets to 2.15.0 to address axolotl prepare issue OpenAccess-AI-Collective/axolotl#1544

Thanks to @kwaa for providing the solution in #10821 (comment)
qiyuangong commented

> Update: downgrading datasets to 2.15.0 seems to work for me.

Works for me. :)

6 participants