
Alit/griffin #9021

Merged
merged 70 commits into main from alit/griffin on May 3, 2024
Changes from all commits
Commits
70 commits
b0e3c89
add init griffin
JRD971000 Apr 2, 2024
55a2bd2
Merge branch 'main' into alit/griffin
JRD971000 Apr 2, 2024
4cffa72
remove unnecessary imports
JRD971000 Apr 2, 2024
67a8a4c
add sft
JRD971000 Apr 2, 2024
1d7e22d
add sft model init
JRD971000 Apr 4, 2024
e91fc85
add text gen strategy for Griffin no cache
JRD971000 Apr 4, 2024
23b76d7
test SFT
JRD971000 Apr 5, 2024
8e47c88
minor fix to config
Apr 5, 2024
624cd9a
fix logprob output issue
JRD971000 Apr 6, 2024
7713181
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 6, 2024
8fc52e4
sft WS fixed
JRD971000 Apr 9, 2024
81df3e2
replace trainer in conversion script
JRD971000 Apr 9, 2024
479f5a8
Merge branch 'main' into alit/griffin
JRD971000 Apr 9, 2024
390af24
Revert "Fix PTL2.2 saving multiple `*-last.ckpt` checkpoints in resum…
JRD971000 Apr 10, 2024
5f4cde3
Revert "FSDP update to PTL 2.2 (#8658)"
JRD971000 Apr 10, 2024
dbd9670
Merge branch 'main' into alit/griffin
JRD971000 Apr 11, 2024
c9aa338
init dist opt
JRD971000 Apr 11, 2024
fdae74e
add peft
JRD971000 Apr 16, 2024
380b285
fix generate script
Apr 17, 2024
ce86fba
convert to HF format
JRD971000 Apr 19, 2024
c9d7556
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 19, 2024
88ab6c2
further cleanups
JRD971000 Apr 22, 2024
93a8367
minor fix
JRD971000 Apr 22, 2024
6b4ccb4
minor fix
JRD971000 Apr 22, 2024
f8585b3
more refactoring
JRD971000 Apr 23, 2024
1780766
rebase with main
JRD971000 Apr 23, 2024
a17f32f
remove local path from config
JRD971000 Apr 23, 2024
4121f8a
undo unnecessary changes
JRD971000 Apr 23, 2024
cff3f6b
remove pretraining
JRD971000 Apr 23, 2024
75df1ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 23, 2024
0fb9770
fix val param sync
Apr 25, 2024
f7ef6a3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2024
bd50e8d
minor fix
Apr 25, 2024
cc42199
fix self
Apr 25, 2024
0d6a319
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2024
7bf55f1
Addressing MR comments
JRD971000 Apr 25, 2024
82508ee
resolve conflict and address review
JRD971000 Apr 25, 2024
4af7319
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2024
949bdf7
Merge branch 'main' into alit/griffin
JRD971000 Apr 25, 2024
c87bba8
code ql fixed
JRD971000 Apr 25, 2024
c6c23f1
Merge branch 'main' into alit/griffin
JRD971000 Apr 25, 2024
fbbc779
more code ql
JRD971000 Apr 25, 2024
3a5d554
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2024
ae02650
address comments
JRD971000 Apr 25, 2024
685c09e
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 25, 2024
9f24940
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 25, 2024
1611cc3
add jenkins
JRD971000 Apr 25, 2024
e3c0ad1
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 25, 2024
6fe52fd
remove jenkins momentarily
JRD971000 Apr 25, 2024
f4474d5
add reqs for griffin
JRD971000 Apr 26, 2024
9395461
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 26, 2024
d138509
add req test
JRD971000 Apr 26, 2024
e806714
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 26, 2024
67cf0dd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 26, 2024
02b11c1
add reqs to nlp
JRD971000 Apr 26, 2024
833bb72
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 Apr 26, 2024
638ddb2
add reqs to nlp
JRD971000 Apr 26, 2024
221a9f2
Merge branch 'main' into alit/griffin
ericharper Apr 26, 2024
ace539d
replace torch scan
Apr 29, 2024
38f9c0a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 29, 2024
3e896b2
Merge branch 'main' into alit/griffin
ericharper May 1, 2024
93721c1
jit fusion for embedding decoder
JRD971000 May 1, 2024
f2056c2
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 May 1, 2024
f2cc289
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 1, 2024
a55a771
jit fusion for embedding decoder
JRD971000 May 1, 2024
111602c
Merge branch 'main' into alit/griffin
ericharper May 1, 2024
183e556
add fix to rglru
JRD971000 May 2, 2024
97895f1
Merge branch 'alit/griffin' of https://github.com/NVIDIA/NeMo into al…
JRD971000 May 2, 2024
7f22c1c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 2, 2024
2aeb30d
Merge branch 'main' into alit/griffin
ericharper May 2, 2024
168 changes: 168 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_griffin_config.yaml
@@ -0,0 +1,168 @@
name: megatron_griffin
restore_from_path: null # used when starting from a .nemo file

trainer:
  devices: 1
  num_nodes: 1
  accelerator: gpu
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  use_distributed_sampler: False
  max_epochs: -1 # PTL default. In practice we don't usually train for more than 1 epoch.
  max_steps: 100000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  val_check_interval: 100
  limit_val_batches: 50
  limit_test_batches: 500
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False

exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: megatron_griffin
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    filename: 'megatron_griffin--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}


model:
  restore_from_path: null
  # model parallelism
  micro_batch_size: 2
  global_batch_size: 2
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  virtual_pipeline_model_parallel_size: null
  vocab_size: 256000
  # model architecture
  encoder_seq_length: 512
  max_position_embeddings: ${.encoder_seq_length}
  position_embedding_type: 'rope' # Position embedding type. Options: ['learned_absolute', 'rope', 'alibi', 'kerple', 'xpos', 'sandwich']; xpos and sandwich are experimental.
  logits_soft_cap: 30.0 # soft cap applied to the output logits, as in RecurrentGemma
  num_layers: 26
  gated_linear_unit: True
  window_size: [1024, 0] # local attention window [left, right] for the local attention layers
  num_query_groups: 1 # 1 query group corresponds to multi-query attention
  attention_dropout: 0.0
  hidden_dropout: 0.0
  hidden_size: 2560
  bias_activation_fusion: True
  ffn_hidden_size: 7680 # Transformer FFN hidden size. Usually 4 * hidden_size.
  num_attention_heads: 10
  transformer_block_type: pre_ln
  init_method_std: 0.02 # Standard deviation of the zero-mean normal distribution used for weight initialization.
  kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null.
  apply_query_key_layer_scaling: True # scale Q * K^T by 1 / layer-number.
  normalization: RMSNorm
  layernorm_epsilon: 1e-6
  rotary_interleaved: False
  layernorm_zero_centered_gamma: True
  make_vocab_size_divisible_by: 128 # Pad the vocab size to be divisible by this value for computation efficiency.
  pre_process: True # add embedding
  post_process: True # add pooler
  megatron_legacy: False

  tokenizer:
    library: 'huggingface'
    type: 'google/recurrentgemma-2b'
    model: null
    vocab_file: null
    merge_file: null
    sentencepiece_legacy: False
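    # Note: 'google/recurrentgemma-2b' uses the 256k-entry Gemma tokenizer, which matches vocab_size: 256000 above.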

  # precision
  native_amp_init_scale: 4294967296 # 2 ** 32
  native_amp_growth_interval: 1000
  fp32_residual_connection: False # Move residual connections to fp32
  fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

  # Megatron O2-style half-precision
  megatron_amp_O2: False # Enable O2-level automatic mixed precision using main parameters
  grad_allreduce_chunk_size_mb: 125
  grad_div_ar_fusion: False

  # miscellaneous
  seed: 1234
  use_cpu_initialization: False # Init weights on the CPU (slow for large models)
  onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
  gradient_as_bucket_view: True # PyTorch DDP argument. Allocate gradients in a contiguous bucket to save memory (less fragmentation and buffer memory)

  ## Activation Checkpointing
  # NeMo Megatron supports 'selective' activation checkpointing where only the memory-intensive part of attention is checkpointed.
  # These memory-intensive activations are also less compute-intensive, which makes activation checkpointing more efficient for LLMs (20B+).
  # See "Reducing Activation Recomputation in Large Transformer Models" (https://arxiv.org/abs/2205.05198) for more details.
  # 'full' will checkpoint the entire transformer layer.
  activations_checkpoint_granularity: null # 'selective' or 'full'
  activations_checkpoint_method: null # 'uniform', 'block'
  # 'uniform' divides the total number of transformer layers into chunks and checkpoints the input activation
  # of each chunk at the specified granularity. When used with 'selective', 'uniform' checkpoints all attention blocks in the model.
  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity.
  activations_checkpoint_num_layers: null
  # When using 'uniform', this creates groups of transformer layers to checkpoint. Usually set to 1. Increase to save more memory.
  # When using 'block', this will checkpoint the first activations_checkpoint_num_layers per pipeline stage.
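  # Example (illustrative): setting activations_checkpoint_granularity: 'full',
  # activations_checkpoint_method: 'uniform', and activations_checkpoint_num_layers: 1
  # would checkpoint every transformer layer; leaving all three null, as here, disables activation recompute.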
  num_micro_batches_with_partial_activation_checkpoints: null
  # This feature is valid only when used with pipeline-model-parallelism.
  # When an integer value is provided, it sets the number of micro-batches where only a partial number of Transformer layers get checkpointed
  # and recomputed within a window of micro-batches. The rest of the micro-batches in the window checkpoint all Transformer layers. The size of the window is
  # set by the maximum outstanding micro-batch backpropagations, which varies at different pipeline stages. The number of partial layers to checkpoint
  # per micro-batch is set by 'activations_checkpoint_num_layers' with 'activations_checkpoint_method' of 'block'.
  # This feature enables using activation checkpointing at a fraction of micro-batches up to the point of full GPU memory usage.
  activations_checkpoint_layers_per_pipeline: null
  # This feature is valid only when used with pipeline-model-parallelism.
  # When an integer value (rounded down when a float is given) is provided, it sets the number of Transformer layers to skip checkpointing at later
  # pipeline stages. For example, 'activations_checkpoint_layers_per_pipeline' of 3 makes pipeline stage 1 checkpoint 3 fewer layers than
  # stage 0, stage 2 checkpoint 6 fewer layers than stage 0, and so on. This is possible because later pipeline stages
  # use less GPU memory with fewer outstanding micro-batch backpropagations. Used with 'num_micro_batches_with_partial_activation_checkpoints',
  # this feature removes most of the activation checkpoints at the last pipeline stage, which is the critical execution path.
  sequence_parallel: False

  data:
    # Path to data must be specified by the user.
    # Can override from the CLI: "model.data.data_prefix=[.5,/raid/data/pile/my-gpt3_00_text_document,.5,/raid/data/pile/my-gpt3_01_text_document]"
    # Or see the example below:
    # data_prefix:
    #   - .5
    #   - /raid/data/pile/my-gpt3_00_text_document
    #   - .5
    #   - /raid/data/pile/my-gpt3_01_text_document
    data_prefix: [1.0, /path/to/data]
    index_mapping_dir: null # path to save index mapping .npy files; by default will save in the same location as data_prefix
    data_impl: mmap
    splits_string: 900,50,50
    seq_length: ${model.encoder_seq_length}
    skip_warmup: True
    num_workers: 0
    dataloader_type: single # cyclic, LDDL
    reset_position_ids: False # Reset position ids after end-of-document token
    reset_attention_mask: False # Reset attention mask after end-of-document token
    eod_mask_loss: False # Mask loss for the end-of-document tokens
    masked_lm_prob: 0.15 # Probability of replacing a token with mask.
    short_seq_prob: 0.1 # Probability of producing a short sequence.
    ceil_to_power_2: True

  optim:
    name: fused_adam
    lr: 2e-4
    weight_decay: 0.01
    betas:
      - 0.9
      - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 500
      constant_steps: 50000
      min_lr: 2e-5
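
For reference, the interpolated fields in this config (model.max_position_embeddings, model.data.seq_length, and exp_manager.checkpoint_callback_params.model_parallel_size) can be sanity-checked outside of a full NeMo run with plain OmegaConf. The snippet below is a minimal sketch, not part of this PR: it assumes OmegaConf >= 2.1 and registers a stand-in "multiply" resolver, since ${multiply:...} is provided by NeMo rather than being an OmegaConf built-in.

# Minimal sketch: resolve the interpolations in megatron_griffin_config.yaml.
# Assumes the config path below and a stand-in "multiply" resolver
# (NeMo registers its own equivalent when it is imported).
from omegaconf import OmegaConf

OmegaConf.register_new_resolver("multiply", lambda x, y: x * y, replace=True)

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_griffin_config.yaml")
OmegaConf.resolve(cfg)

print(cfg.model.max_position_embeddings)  # 512, from ${.encoder_seq_length}
print(cfg.model.data.seq_length)          # 512, from ${model.encoder_seq_length}
print(cfg.exp_manager.checkpoint_callback_params.model_parallel_size)  # 1, i.e. TP * PP

Individual values can also be overridden on the command line in the usual dotted Hydra/OmegaConf style, e.g. model.data.data_prefix=[1.0,/path/to/data], as noted in the data section above.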