New CLI #1437

Merged 37 commits from new-cli into main on May 31, 2024
Conversation

@rasbt (Collaborator) commented May 24, 2024

To solve some of the most common user pain points, we want to rethink the LitGPT CLI to make it more intuitive. After the change, the CLI would work as follows:

# litgpt [action] [model]
litgpt  download  meta-llama/Meta-Llama-3-8B-Instruct
litgpt  chat      meta-llama/Meta-Llama-3-8B-Instruct
litgpt  finetune  meta-llama/Meta-Llama-3-8B-Instruct
litgpt  pretrain  meta-llama/Meta-Llama-3-8B-Instruct
litgpt  serve     meta-llama/Meta-Llama-3-8B-Instruct

and for advanced users, there will be additional finetuning options:

# litgpt [action]            [model]

litgpt  finetune             meta-llama/Meta-Llama-3-8B-Instruct
litgpt  finetune_full        meta-llama/Meta-Llama-3-8B-Instruct
litgpt  finetune_lora        meta-llama/Meta-Llama-3-8B-Instruct
litgpt  finetune_adapter     meta-llama/Meta-Llama-3-8B-Instruct
litgpt  finetune_adapter_v2  meta-llama/Meta-Llama-3-8B-Instruct

Todos

  • Implement changes
  • Update all docs
  • Deprecate "old way"

Approach

This will be tackled in two steps:

  • introduce positional args but keep the existing usage working
  • remove the checkpoint_dir argument and introduce a root_dir argument with root_dir="checkpoints" as the default value

E.g., if someone uses

litgpt  finetune  meta-llama/Meta-Llama-3-8B-Instruct --root_dir="checkpoints"

It will load the checkpoint from checkpoints/meta-llama/Meta-Llama-3-8B-Instruct.

To use a model from a custom path in a completely different location, e.g., tmp/my_model, one can run:

litgpt  finetune  tmp/my_model --root_dir="."
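
For illustration, the resolution described above boils down to joining the positional model argument onto root_dir. The following is a minimal sketch under that assumption; the helper name resolve_checkpoint_dir is hypothetical and not an actual LitGPT function:

from pathlib import Path

def resolve_checkpoint_dir(model: str, root_dir: str = "checkpoints") -> Path:
    # Hypothetical helper: the checkpoint directory is simply root_dir joined
    # with the positional model argument.
    return Path(root_dir) / model

print(resolve_checkpoint_dir("meta-llama/Meta-Llama-3-8B-Instruct"))  # checkpoints/meta-llama/Meta-Llama-3-8B-Instruct
print(resolve_checkpoint_dir("tmp/my_model", root_dir="."))           # tmp/my_model (relative to the current directory)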

@rasbt rasbt added enhancement New feature or request breaking change labels May 24, 2024
@rasbt rasbt marked this pull request as draft May 24, 2024 16:01
@rasbt (Collaborator, Author) commented May 24, 2024

Sorry to bother you with this @awaelchli, but I've been banging my head against this for 2 hours and can't figure it out. I just want to get litgpt finetune_lora to work like litgpt finetune (and litgpt finetune lora) did before. It does work fine, but at the end, it appends the Unrecognized arguments error:

litgpt finetune_lora --checkpoint_dir checkpoints/EleutherAI/pythia-14m --train.max_steps 1 --train.max_seq_len 512  --optimizer SGD
{'checkpoint_dir': PosixPath('checkpoints/EleutherAI/pythia-14m'),
 'data': None,
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'optimizer': {'class_path': 'torch.optim.SGD',
               'init_args': {'dampening': 0.0,
                             'differentiable': False,
                             'foreach': None,
                             'lr': 0.001,
                             'maximize': False,
                             'momentum': 0.0,
                             'nesterov': False,
                             'weight_decay': 0.01}},
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=1,
                    global_batch_size=16,
                    micro_batch_size=1,
                    lr_warmup_steps=100,
                    lr_warmup_fraction=None,
                    epochs=5,
                    max_tokens=None,
                    max_steps=1,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Using bfloat16 Automatic Mixed Precision (AMP)
Seed set to 1337
Number of trainable parameters: 24,576
Number of non-trainable parameters: 14,067,712
The longest sequence length in the train data is 512, the model's maximum sequence length is 512 and context length is 512
Validating ...
Epoch 1 | iter 1 step 0 | loss train: 6.456, val: n/a | iter time: 218.23 ms
Epoch 1 | iter 2 step 0 | loss train: 6.077, val: n/a | iter time: 42.61 ms
Epoch 1 | iter 3 step 0 | loss train: 6.065, val: n/a | iter time: 23.38 ms
Epoch 1 | iter 4 step 0 | loss train: 5.940, val: n/a | iter time: 52.27 ms
Epoch 1 | iter 5 step 0 | loss train: 5.975, val: n/a | iter time: 21.69 ms
Epoch 1 | iter 6 step 0 | loss train: 6.002, val: n/a | iter time: 21.96 ms
Epoch 1 | iter 7 step 0 | loss train: 5.824, val: n/a | iter time: 16.52 ms
Epoch 1 | iter 8 step 0 | loss train: 5.840, val: n/a | iter time: 22.35 ms
Epoch 1 | iter 9 step 0 | loss train: 5.814, val: n/a | iter time: 22.57 ms
Epoch 1 | iter 10 step 0 | loss train: 5.789, val: n/a | iter time: 22.63 ms
Epoch 1 | iter 11 step 0 | loss train: 5.842, val: n/a | iter time: 22.41 ms
Epoch 1 | iter 12 step 0 | loss train: 5.873, val: n/a | iter time: 22.48 ms
Epoch 1 | iter 13 step 0 | loss train: 5.893, val: n/a | iter time: 22.48 ms
Epoch 1 | iter 14 step 0 | loss train: 5.868, val: n/a | iter time: 23.18 ms
Epoch 1 | iter 15 step 0 | loss train: 5.879, val: n/a | iter time: 22.29 ms
Epoch 1 | iter 16 step 1 | loss train: 5.929, val: n/a | iter time: 31.64 ms (step)
Training time: 20.27s
Memory used: 0.25 GB
Validating ...
Final evaluation | val loss: 5.967 | val ppl: 390.511
Saving LoRA weights to '/teamspace/studios/this_studio/out/finetune/lora/final/lit_model.pth.lora'
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]] [--checkpoint_dir CHECKPOINT_DIR] [--out_dir OUT_DIR] [--precision PRECISION] [--quantize QUANTIZE] [--devices DEVICES]
              [--lora_r LORA_R] [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT] [--lora_query {true,false}] [--lora_key {true,false}] [--lora_value {true,false}]
              [--lora_projection {true,false}] [--lora_mlp {true,false}] [--lora_head {true,false}] [--data.help CLASS_PATH_OR_NAME] [--data DATA] [--train CONFIG]
              [--train.save_interval SAVE_INTERVAL] [--train.log_interval LOG_INTERVAL] [--train.global_batch_size GLOBAL_BATCH_SIZE] [--train.micro_batch_size MICRO_BATCH_SIZE]
              [--train.lr_warmup_steps LR_WARMUP_STEPS] [--train.lr_warmup_fraction LR_WARMUP_FRACTION] [--train.epochs EPOCHS] [--train.max_tokens MAX_TOKENS]
              [--train.max_steps MAX_STEPS] [--train.max_seq_length MAX_SEQ_LENGTH] [--train.tie_embeddings {true,false,null}] [--train.max_norm MAX_NORM] [--train.min_lr MIN_LR]
              [--eval CONFIG] [--eval.interval INTERVAL] [--eval.max_new_tokens MAX_NEW_TOKENS] [--eval.max_iters MAX_ITERS] [--eval.initial_validation {true,false}]
              [--optimizer OPTIMIZER] [--logger_name {wandb,tensorboard,csv}] [--seed SEED]
error: Unrecognized arguments: finetune_lora

I am not sure where it comes from; it's bizarre to me. I checked the kwargs and there doesn't seem to be such an argument. I must be doing something wrong with jsonargparse, but I can't figure out where the issue lies, since litgpt finetune uses exactly the same finetune_lora_fn function and doesn't have this issue.

@awaelchli (Member) commented:

At the end of training, we save the hyperparameters to the checkpoint dir for reproducibility. This is done in this function:

litgpt/litgpt/utils.py, lines 436 to 458 in 221b7ef:

def save_hyperparameters(function: callable, checkpoint_dir: Path) -> None:
    """Captures the CLI parameters passed to `function` without running `function` and saves them to the checkpoint."""
    from jsonargparse import capture_parser

    # TODO: Make this more robust
    # This hack strips away the subcommands from the top-level CLI
    # to parse the file as if it was called as a script
    known_commands = [
        ("finetune", "full"),
        ("finetune", "lora"),
        ("finetune", "adapter"),
        ("finetune", "adapter_v2"),
        ("finetune",),
        ("pretrain",),
    ]
    for known_command in known_commands:
        unwanted = slice(1, 1 + len(known_command))
        if tuple(sys.argv[unwanted]) == known_command:
            sys.argv[unwanted] = []

    parser = capture_parser(lambda: CLI(function))
    config = parser.parse_args()
    parser.save(config, checkpoint_dir / "hyperparameters.yaml", overwrite=True)

If the names of the commands change, the code there needs to be adapted a bit. I can help with this.
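
For reference, one possible adaptation is to add the new single-token subcommand names from this PR to known_commands so they are also stripped from sys.argv before re-parsing. This is a sketch only, not necessarily what the merged code does:

known_commands = [
    # new single-token subcommands introduced by this PR
    ("finetune_full",),
    ("finetune_lora",),
    ("finetune_adapter",),
    ("finetune_adapter_v2",),
    # previous two-token subcommands, kept while the old usage is being deprecated
    ("finetune", "full"),
    ("finetune", "lora"),
    ("finetune", "adapter"),
    ("finetune", "adapter_v2"),
    ("finetune",),
    ("pretrain",),
]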

@rasbt (Collaborator, Author) commented May 28, 2024

Oh, this might be exactly what's causing the issue I was encountering! Thanks a lot, I think I should be able to fix it!

@rasbt (Collaborator, Author) commented May 28, 2024

Thanks, @awaelchli! It actually didn't occur to me to check this file. I think I've got it now, thanks!

Review comments on litgpt/scripts/convert_hf_checkpoint.py and litgpt/__main__.py (resolved)
@rasbt rasbt marked this pull request as ready for review May 31, 2024 13:53
Review comments on litgpt/config.py, tests/Untitled-1.ipynb, README.md, tutorials/evaluation.md, tests/test_pretrain.py, and tests/test_utils.py (resolved)
rasbt and others added 6 commits May 31, 2024 09:10
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
@rasbt rasbt merged commit 3fa17fb into main May 31, 2024
9 checks passed
@rasbt rasbt deleted the new-cli branch May 31, 2024 16:28
@awaelchli awaelchli mentioned this pull request Jun 1, 2024