
Expected is_sm80 to be true, but got false #101

Closed
awaelchli opened this issue Apr 5, 2023 · 14 comments
@awaelchli
Member

awaelchli commented Apr 5, 2023

I tried running the finetuning scripts on a 3090 GPU and got this error:

/home/adrian/repositories/lightning-llama/lit_llama/model.py:43: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
  ).to(complex_dtype)
Traceback (most recent call last):
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 201, in <module>
    main()
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 67, in main
    train(fabric, model, optimizer, train_data, val_data)
  File "/home/adrian/repositories/lightning-llama/finetune_adapter.py", line 97, in train
    fabric.backward(loss / gradient_accumulation_steps)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 365, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/amp.py", line 70, in backward
    super().backward(tensor, model, *args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 81, in backward
    tensor.backward(*args, **kwargs)
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/adrian/anaconda3/envs/lit-llama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected is_sm80 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

This was on the branch for #100, where I added the EmptyInitOnDevice() context manager. It looks like the conversion to complex_dtype causes problems in the backward pass.

Both

python finetune_lora.py

and

python finetune_adapter.py

fail with this error.

@lantiga
Collaborator

lantiga commented Apr 5, 2023

@t-vi does this ring any bells?

@t-vi
Contributor

t-vi commented Apr 5, 2023

Unfortunately not, but I'll be sure to dig into it.

@lantiga
Collaborator

lantiga commented Apr 5, 2023

If we could get rid of that complex op in the RoPE implementation and still match the results, it would unblock a ton (see test_rope.py).

@t-vi
Contributor

t-vi commented Apr 5, 2023

There is a known and fixed upstream bug about this check, maybe try a nightly?

@t-vi
Contributor

t-vi commented Apr 5, 2023

But I can expand the RoPE to use reals if that helps. That gets rid of the ComplexHalf warning, too.
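
For reference, a minimal sketch of what a real-valued equivalent could look like, assuming the current code treats adjacent feature pairs as complex numbers and multiplies by cached complex exponentials (function and argument names below are illustrative, not the existing lit-llama API):

    import torch

    def build_rope_cache_real(seq_len: int, n_elem: int, base: int = 10000, device=None):
        # One rotation frequency per feature pair, one angle per sequence position.
        theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, device=device).float() / n_elem))
        angles = torch.outer(torch.arange(seq_len, device=device).float(), theta)
        return torch.cos(angles), torch.sin(angles)  # each of shape (seq_len, n_elem // 2)

    def apply_rope_real(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        # x: (..., seq_len, head_dim). Treat (even, odd) feature pairs as (real, imag)
        # and rotate each pair: (x1 + i*x2) * (cos + i*sin), written out with real ops.
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return out.flatten(-2).type_as(x)

Whether this matches the complex implementation element-for-element depends on how that implementation interleaves the pairs, so test_rope.py would be the arbiter.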

@awaelchli
Member Author

Thanks @t-vi
Indeed nightly worked! So that seems unrelated to the complex issue then? It might just show up in that line the first time?

@lantiga
Collaborator

lantiga commented Apr 6, 2023

Closing as nightly has solved it and we reference the workaround in the README.

@AurelienSaussay

AurelienSaussay commented Apr 14, 2023

Hi all, I was still encountering this error with PyTorch nightly (as of 2023-04-13) on an A10 while running LoRA finetuning.

As a temporary fix, I found that disabling the flash attention backend for the scaled dot-product attention around the loss computation resolves the issue. In finetune_lora.py, simply replace lines 93-96 with:

        with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16) as autocast, torch.backends.cuda.sdp_kernel(enable_flash=False) as disable:
            input_ids, targets = get_batch(fabric, train_data)
            logits = model(input_ids)
            loss = loss_fn(logits, targets)
            fabric.backward(loss)

@lantiga
Collaborator

lantiga commented Apr 14, 2023

Oh interesting, thanks for bringing this up @AurelienSaussay

@lantiga
Collaborator

lantiga commented Apr 14, 2023

I imagine the same issue comes up with LLaMA-Adapter on an A10. Can you confirm?

@lantiga
Collaborator

lantiga commented Apr 14, 2023

Also, the autocast part should already be taken care of by model, optimizer = fabric.setup(model, optimizer), and disabling flash attention can also be done globally, which avoids modifying the code:

torch.backends.cuda.enable_flash_sdp(False)

Can you confirm, @awaelchli?

@awaelchli
Member Author

Yes, I downgraded to torch 2.0 and was able to prevent the issue with torch.backends.cuda.enable_flash_sdp(False) as well.

@lantiga
Collaborator

lantiga commented Apr 14, 2023

So let's add this line (commented out) to the scripts and mention in the README that it should be uncommented if this error comes up.
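
A sketch of how that could look near the top of the finetuning scripts (comment wording and placement are just a suggestion):

    # Uncomment the next line if you hit
    # "RuntimeError: Expected is_sm80 to be true, but got false" in the backward pass;
    # the flash attention kernels can trip this check on non-A100 GPUs (e.g. 3090, A10).
    # torch.backends.cuda.enable_flash_sdp(False)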

@Lingeswaran-S

Lingeswaran-S commented Jul 17, 2023

Can anyone help me with this?

{'eval_interval': 600, 'save_interval': 1000, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.003, 'batch_size': 128.0, 'micro_batch_size': 2, 'gradient_accumulation_iters': 64.0, 'epoch_size': 50000, 'num_epochs': 5, 'max_iters': 125000, 'weight_decay': 0.02, 'warmup_steps': 781.0}
Global seed set to 1337
Loading model 'checkpoints/stabilityai/stablelm-tuned-alpha-3b/lit_model.pth' with {'org': 'stabilityai', 'name': 'stablelm-tuned-alpha-3b', 'block_size': 4096, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 50688, 'n_layer': 16, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 16384, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 2125248
Number of non trainable parameters: 3637051392
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/lingeswaran/0/AI/privateChat/testFalcon/lit-gpt/finetune/adapter_v2.py:305 in │
│ │
│ 302 │ │
│ 303 │ from jsonargparse.cli import CLI │
│ 304 │ │
│ ❱ 305 │ CLI(setup) │
│ 306 │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/env/lib/python3.10/site-packages/jsonargparse/_cli │
│ .py:85 in CLI │
│ │
│ 82 │ │ │ return parser │
│ 83 │ │ cfg = parser.parse_args(args) │
│ 84 │ │ cfg_init = parser.instantiate_classes(cfg) │
│ ❱ 85 │ │ return _run_component(component, cfg_init) │
│ 86 │ │
│ 87 │ subcommands = parser.add_subcommands(required=True) │
│ 88 │ comp_dict = {c.name: c for c in components} │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/env/lib/python3.10/site-packages/jsonargparse/_cli │
│ .py:147 in _run_component │
│ │
│ 144 def run_component(component, cfg): │
│ 145 │ cfg.pop("config", None) │
│ 146 │ if not inspect.isclass(component): │
│ ❱ 147 │ │ return component(**cfg) │
│ 148 │ subcommand = cfg.pop("subcommand") │
│ 149 │ if not subcommand: │
│ 150 │ │ return component(**cfg) │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/lit-gpt/finetune/adapter_v2.py:82 in setup │
│ │
│ 79 │ logger = step_csv_logger(out_dir.parent, out_dir.name, flush_logs_every_n_steps=log │
│ 80 │ fabric = L.Fabric(devices=fabric_devices, strategy=strategy, precision=precision, lo │
│ 81 │ fabric.print(hparams) │
│ ❱ 82 │ fabric.launch(main, data_dir, checkpoint_dir, out_dir) │
│ 83 │
│ 84 │
│ 85 def main(fabric: L.Fabric, data_dir: Path, checkpoint_dir: Path, out_dir: Path): │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/env/lib/python3.10/site-packages/lightning/fabric/ │
│ fabric.py:789 in launch │
│ │
│ 786 │ │ │ │ f"To use the {type(self.strategy).__name__} strategy, .launch() need │
│ 787 │ │ │ │ " that contains the code to launch in processes." │
│ 788 │ │ │ ) │
│ ❱ 789 │ │ return self._wrap_and_launch(function, self, *args, **kwargs) │
│ 790 │ │
│ 791 │ def call(self, hook_name: str, *args: Any, **kwargs: Any) -> None: │
│ 792 │ │ """Trigger the callback methods with the given name and arguments. │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/env/lib/python3.10/site-packages/lightning/fabric/ │
│ fabric.py:871 in _wrap_and_launch │
│ │
│ 868 │ │ to_run = partial(self._wrap_with_setup, to_run) │
│ 869 │ │ if (launcher := self._strategy.launcher) is not None: │
│ 870 │ │ │ return launcher.launch(to_run, *args, **kwargs) │
│ ❱ 871 │ │ return to_run(*args, **kwargs) │
│ 872 │ │
│ 873 │ def _wrap_with_setup(self, to_run: Callable, *args: Any, **kwargs: Any) -> Any: │
│ 874 │ │ self._strategy.setup_environment() │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/env/lib/python3.10/site-packages/lightning/fabric/ │
│ fabric.py:876 in _wrap_with_setup │
│ │
│ 873 │ def _wrap_with_setup(self, to_run: Callable, *args: Any, **kwargs: Any) -> Any: │
│ 874 │ │ self._strategy.setup_environment() │
│ 875 │ │ with _replace_dunder_methods(DataLoader, "dataset"), _replace_dunder_methods(Bat │
│ ❱ 876 │ │ │ return to_run(*args, **kwargs) │
│ 877 │ │
│ 878 │ def _move_model_to_device(self, model: nn.Module, optimizers: List[Optimizer]) -> nn │
│ 879 │ │ initial_device = next(model.parameters(), torch.tensor(0)).device │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/lit-gpt/finetune/adapter_v2.py:121 in main │
│ │
│ 118 │ model, optimizer = fabric.setup(model, optimizer) │
│ 119 │ │
│ 120 │ train_time = time.time() │
│ ❱ 121 │ train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed │
│ 122 │ fabric.print(f"Training time: {(time.time()-train_time):.2f}s") │
│ 123 │ │
│ 124 │ # Save the final checkpoint at the end of training │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/lit-gpt/finetune/adapter_v2.py:140 in train │
│ │
│ 137 │ speed_monitor: SpeedMonitor, │
│ 138 ) -> None: │
│ 139 │ tokenizer = Tokenizer(checkpoint_dir) │
│ ❱ 140 │ max_seq_length, longest_seq_length, longest_seq_ix = get_max_seq_length(train_data) │
│ 141 │ │
│ 142 │ validate(fabric, model, val_data, tokenizer, longest_seq_length) # sanity check │
│ 143 │
│ │
│ /home/lingeswaran/0/AI/privateChat/testFalcon/lit-gpt/finetune/adapter_v2.py:283 in │
│ get_max_seq_length │
│ │
│ 280 def get_max_seq_length(data: List[Dict]) -> Tuple[int, int, int]: │
│ 281 │ # find out the minimum max_seq_length required during fine-tuning (saves memory!) │
│ 282 │ lengths = [len(d["input_ids"]) for d in data] │
│ ❱ 283 │ max_seq_length = max(lengths) │
│ 284 │ longest_seq_ix = lengths.index(max_seq_length) │
│ 285 │ # support easy override at the top of the file │
│ 286 │ return ( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence
