Preprocess with debug flag fails. #1544

Closed
6 of 8 tasks
amitagh opened this issue Apr 19, 2024 · 6 comments · Fixed by #1548
Labels
bug Something isn't working

Comments

amitagh commented Apr 19, 2024

Please check that this issue hasn't been reported before.

  • I searched previous bug reports and didn't find any similar reports.

Expected Behavior

Preprocess with debug flag should work.
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug

Current behaviour

It gives an error. I have a JSON file where each example is of the form {"text": <text_str>}.
I am doing pretraining with LoRA for a non-English language.

[2024-04-19 09:05:02,918] [DEBUG] [axolotl.log:61] [PID:2346] [RANK:0] max_input_len: 600
Dropping Long Sequences (num_proc=2): 100% 17/17 [00:00<00:00, 99.19 examples/s]
Add position_id column (Sample Packing) (num_proc=2): 100% 17/17 [00:00<00:00, 70.88 examples/s]
[2024-04-19 09:05:03,502] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:2346] [RANK:0] Saving merged prepared dataset to disk... /content/d538aae6e42c7df428d20d3ff2685ad0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/content/src/axolotl/src/axolotl/cli/__init__.py", line 397, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 424, in load_tokenized_prepared_datasets
    dataset.save_to_disk(prepared_ds_path)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
    fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 363, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 316, in _un_chain
    if "::" in path
TypeError: argument of type 'PosixPath' is not iterable

Steps to reproduce

Use a JSON file where each example is of the form {"text": <text_str>}.

Preprocess with the debug flag:
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug
This produces the error above.
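For reference, a hypothetical minimal dataset file in the format described above (the issue does not say whether the file is a single JSON array or JSON Lines; this sketch writes JSON Lines, one object per line, and the example strings are placeholders):

```python
import json

# Each example is an object with a single "text" field, as described above.
examples = [
    {"text": "First training example."},
    {"text": "Second training example."},
]

# Write one JSON object per line (JSONL).
with open("test_txt_data.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read it back to confirm the structure survives a round trip.
with open("test_txt_data.json", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), list(rows[0]))  # 2 ['text']
```

This matches the `type: completion` / `field: text` dataset entry in the config below.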

Config yaml

base_model: google/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./test_txt_data.json
    type: completion
    field: text
dataset_prepared_path: data/last_run_prepared
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out

adapter: lora
lora_model_dir:

gpu_memory_limit: 76

sequence_len: 1100
sample_packing: true
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules: 
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_target_linear: true
lora_fan_in_fan_out:

save_safetensors: True

gradient_accumulation_steps: 2
micro_batch_size: 10
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 500
xformers_attention:
flash_attention: True

evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

Latest

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
amitagh added the bug label Apr 19, 2024
Napuh (Contributor) commented Apr 19, 2024

+1

Napuh (Contributor) commented Apr 19, 2024

Update:

downgrading datasets to 2.15.0 seems to work for me.
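The downgrade described above can be applied by pinning the package with pip (version number taken from this comment; no other assumptions):

```shell
# Workaround reported in this thread: pin datasets to 2.15.0.
pip install "datasets==2.15.0"
```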

jorge-tromero commented
+1

FrankRuis (Contributor) commented
See my change in #1548: wrapping prepared_ds_path with str() in the dataset.save_to_disk(prepared_ds_path) call in src/axolotl/utils/data/sft.py fixes this without downgrading any packages.
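A minimal sketch of the root cause and this fix, outside of axolotl (the path value is illustrative, taken from dataset_prepared_path in the config above; the membership test mirrors the `"::" in path` check that fsspec's _un_chain performs in the traceback):

```python
from pathlib import Path

# axolotl passes a PosixPath to dataset.save_to_disk(); newer fsspec
# then does a string membership test on it, which a Path cannot support.
prepared_ds_path = Path("data/last_run_prepared")

try:
    "::" in prepared_ds_path  # what fsspec's _un_chain effectively does
    raised = False
except TypeError as err:
    raised = True
    print(err)  # argument of type 'PosixPath' is not iterable

# The fix from #1548: hand the path over as a plain string instead.
print("::" in str(prepared_ds_path))  # False; the membership test now works
```

The same str() conversion is what the patched dataset.save_to_disk(str(prepared_ds_path)) call performs.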

monk1337 (Contributor) commented

> 2.15.0

It worked for me as well!

qiyuangong added a commit to qiyuangong/ipex-llm that referenced this issue Apr 23, 2024
qiyuangong added a commit to intel-analytics/ipex-llm that referenced this issue Apr 23, 2024
* Downgrade datasets to 2.15.0 to address axolotl prepare issue OpenAccess-AI-Collective/axolotl#1544

Thanks to @kwaa for providing the solution in #10821 (comment)
qiyuangong commented

> Update: downgrading datasets to 2.15.0 seems to work for me.

Works for me. :)

6 participants