Multi machine pre-training hung #111

Open
BUPTAnderson opened this issue Jun 16, 2023 · 1 comment

```bash
deepspeed --hostfile=hostfile pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 \
                      --pretrained_model_path models/llama-7b.bin \
                      --dataset_path $OUTPUT_DATASET_PATH --spm_model_path $LLaMA_PATH/tokenizer.model \
                      --config_path models/llama/7b_config.json \
                      --output_model_path models/llama_zh_7b \
                      --world_size 8 --data_processor lm --deepspeed_checkpoint_activations \
                      --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 24
```

The hostfile has 4 V100 machines:

```
1.1.1.1 slots=8
2.2.2.2 slots=8
3.3.3.3 slots=8
4.4.4.4 slots=8
```
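
Before digging into the engine itself, it can help to rule out basic inter-node connectivity. Here is a minimal sanity-check sketch (my addition, not part of TencentPretrain; the file name `all_reduce_check.py` is hypothetical) that runs one collective across all 32 ranks using the same launcher and hostfile:

```python
# all_reduce_check.py -- hypothetical sanity-check script (not part of TencentPretrain).
# Launch it the same way as the real job:
#   deepspeed --hostfile=hostfile all_reduce_check.py
import os

import deepspeed
import torch

deepspeed.init_distributed()                # initializes torch.distributed (NCCL by default on GPU)
local_rank = int(os.environ["LOCAL_RANK"])  # set by the deepspeed launcher
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
torch.distributed.all_reduce(x)             # hangs here too if inter-node NCCL is broken
print(f"rank {torch.distributed.get_rank()}/{torch.distributed.get_world_size()}: "
      f"all_reduce ok, value={x.item()}")
```

If every rank prints `value=32.0`, the NCCL transport between the nodes is working and the hang is somewhere higher in the stack.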

The master machine 1.1.1.1 prints the following log and then the pre-training hangs:

```
1.1.1.1: [2023-06-16 15:32:01,673] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
1.1.1.1: [2023-06-16 15:32:01,674] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
1.1.1.1: [2023-06-16 15:32:01,674] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
1.1.1.1: [2023-06-16 15:32:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
1.1.1.1: [2023-06-16 15:32:01,692] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
1.1.1.1: [2023-06-16 15:32:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
1.1.1.1: [2023-06-16 15:32:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
1.1.1.1: [2023-06-16 15:32:01,791] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
1.1.1.1: [2023-06-16 15:32:01,791] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.73 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:01,792] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 37.82 GB, percent = 15.1%
1.1.1.1: [2023-06-16 15:32:01,794] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
1.1.1.1: [2023-06-16 15:32:01,794] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50,000,000
1.1.1.1: [2023-06-16 15:32:01,856] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
1.1.1.1: [2023-06-16 15:32:01,857] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:01,857] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 37.82 GB, percent = 15.1%
1.1.1.1: Parameter Offload: Total persistent parameters: 266240 in 65 params
1.1.1.1: [2023-06-16 15:32:01,941] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
1.1.1.1: [2023-06-16 15:32:01,941] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:01,941] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 37.82 GB, percent = 15.1%
1.1.1.1: [2023-06-16 15:32:02,007] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions
1.1.1.1: [2023-06-16 15:32:02,008] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:02,008] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 37.82 GB, percent = 15.1%
1.1.1.1: [2023-06-16 15:32:10,831] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 1
1.1.1.1: [2023-06-16 15:32:10,832] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:10,832] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 47.35 GB, percent = 18.9%
1.1.1.1: [2023-06-16 15:32:10,894] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions
1.1.1.1: [2023-06-16 15:32:10,894] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:10,895] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 47.35 GB, percent = 18.9%
1.1.1.1: [2023-06-16 15:32:11,059] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions
1.1.1.1: [2023-06-16 15:32:11,059] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:11,059] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 48.14 GB, percent = 19.2%
1.1.1.1: [2023-06-16 15:32:11,247] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
1.1.1.1: [2023-06-16 15:32:11,248] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:11,248] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 58.49 GB, percent = 23.3%
1.1.1.1: [2023-06-16 15:32:13,026] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | init_optimizer_state: 1731.81
1.1.1.1: [2023-06-16 15:32:13,148] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
1.1.1.1: [2023-06-16 15:32:13,149] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.73 GB         Max_CA 1 GB
1.1.1.1: [2023-06-16 15:32:13,149] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 68.91 GB, percent = 27.5%
1.1.1.1: [2023-06-16 15:32:13,149] [INFO] [stage3.py:392:_setup_for_real_optimizer] optimizer state initialized
1.1.1.1: [2023-06-16 15:32:15,376] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
1.1.1.1: [2023-06-16 15:32:15,377] [INFO] [utils.py:786:see_memory_usage] MA 0.93 GB         Max_MA 1.42 GB         CA 1.67 GB         Max_CA 2 GB
1.1.1.1: [2023-06-16 15:32:15,377] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 73.03 GB, percent = 29.1%
1.1.1.1: [2023-06-16 15:32:15,377] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
1.1.1.1: [2023-06-16 15:32:15,377] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
1.1.1.1: [2023-06-16 15:32:15,377] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f347c617df0>
1.1.1.1: [2023-06-16 15:32:15,378] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:955:print] DeepSpeedEngine configuration:
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   activation_checkpointing_config  {
1.1.1.1:     "partition_activations": false,
1.1.1.1:     "contiguous_memory_optimization": false,
1.1.1.1:     "cpu_checkpointing": false,
1.1.1.1:     "number_checkpoints": null,
1.1.1.1:     "synchronize_checkpoint_boundary": false,
1.1.1.1:     "profile": false
1.1.1.1: }
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   amp_enabled .................. False
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   amp_params ................... False
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   autotuning_config ............ {
1.1.1.1:     "enabled": false,
1.1.1.1:     "start_step": null,
1.1.1.1:     "end_step": null,
1.1.1.1:     "metric_path": null,
1.1.1.1:     "arg_mappings": null,
1.1.1.1:     "metric": "throughput",
1.1.1.1:     "model_info": null,
1.1.1.1:     "results_dir": "autotuning_results",
1.1.1.1:     "exps_dir": "autotuning_exps",
1.1.1.1:     "overwrite": true,
1.1.1.1:     "fast": true,
1.1.1.1:     "start_profile_step": 3,
1.1.1.1:     "end_profile_step": 5,
1.1.1.1:     "tuner_type": "gridsearch",
1.1.1.1:     "tuner_early_stopping": 5,
1.1.1.1:     "tuner_num_trials": 50,
1.1.1.1:     "model_info_path": null,
1.1.1.1:     "mp_size": 1,
1.1.1.1:     "max_train_batch_size": null,
1.1.1.1:     "min_train_batch_size": 1,
1.1.1.1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03,
1.1.1.1:     "min_train_micro_batch_size_per_gpu": 1,
1.1.1.1:     "num_tuning_micro_batch_sizes": 3
1.1.1.1: }
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   bfloat16_enabled ............. False
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   checkpoint_parallel_write_pipeline  False
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   checkpoint_tag_validation_enabled  True
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   checkpoint_tag_validation_fail  False
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f347c6175e0>
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   communication_data_type ...... None
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
1.1.1.1: [2023-06-16 15:32:15,379] [INFO] [config.py:959:print]   curriculum_enabled_legacy .... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   curriculum_params_legacy ..... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   data_efficiency_enabled ...... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   dataloader_drop_last ......... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   disable_allgather ............ False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   dump_state ................... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_enabled ........... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_gas_boundary_resolution  1
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_layer_name ........ bert.encoder.layer
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_layer_num ......... 0
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_max_iter .......... 100
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_stability ......... 1e-06
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_tol ............... 0.01
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   eigenvalue_verbose ........... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   elasticity_enabled ........... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   flops_profiler_config ........ {
1.1.1.1:     "enabled": true,
1.1.1.1:     "profile_step": 1,
1.1.1.1:     "module_depth": -1,
1.1.1.1:     "top_modules": 3,
1.1.1.1:     "detailed": true,
1.1.1.1:     "output_file": null
1.1.1.1: }
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   fp16_auto_cast ............... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   fp16_enabled ................. True
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   fp16_master_weights_and_gradients  False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   global_rank .................. 0
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   grad_accum_dtype ............. None
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   gradient_accumulation_steps .. 1
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   gradient_clipping ............ 0.0
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   gradient_predivide_factor .... 1.0
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   initial_dynamic_scale ........ 65536
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   load_universal_checkpoint .... False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   loss_scale ................... 0
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   memory_breakdown ............. False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   mics_hierarchial_params_gather  False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   mics_shard_size .............. -1
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
1.1.1.1: [2023-06-16 15:32:15,380] [INFO] [config.py:959:print]   nebula_config ................ {
1.1.1.1:     "enabled": false,
1.1.1.1:     "persistent_storage_path": null,
1.1.1.1:     "persistent_time_interval": 100,
1.1.1.1:     "num_of_version_in_retention": 2,
1.1.1.1:     "enable_nebula_load": true,
1.1.1.1:     "load_path": null
1.1.1.1: }
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   optimizer_legacy_fusion ...... False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   optimizer_name ............... adam
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   optimizer_params ............. {'lr': 1e-05, 'weight_decay': 0.01}
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   pld_enabled .................. False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   pld_params ................... False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   prescale_gradients ........... False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   scheduler_name ............... None
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   scheduler_params ............. None
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   sparse_attention ............. None
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   sparse_gradients_enabled ..... False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   steps_per_print .............. 100
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   train_batch_size ............. 32
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   train_micro_batch_size_per_gpu  1
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   use_node_local_storage ....... False
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   wall_clock_breakdown ......... True
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   world_size ................... 32
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   zero_allow_untested_optimizer  True
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   zero_enabled ................. True
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   zero_force_ds_cpu_optimizer .. True
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:959:print]   zero_optimization_stage ...... 3
1.1.1.1: [2023-06-16 15:32:15,381] [INFO] [config.py:945:print_user_config]   json = {
1.1.1.1:     "gradient_accumulation_steps": 1,
1.1.1.1:     "train_micro_batch_size_per_gpu": 1,
1.1.1.1:     "steps_per_print": 100,
1.1.1.1:     "optimizer": {
1.1.1.1:         "type": "Adam",
1.1.1.1:         "params": {
1.1.1.1:             "lr": 1e-05,
1.1.1.1:             "weight_decay": 0.01
1.1.1.1:         }
1.1.1.1:     },
1.1.1.1:     "flops_profiler": {
1.1.1.1:         "enabled": true,
1.1.1.1:         "profile_step": 1,
1.1.1.1:         "module_depth": -1,
1.1.1.1:         "top_modules": 3,
1.1.1.1:         "detailed": true
1.1.1.1:     },
1.1.1.1:     "fp16": {
1.1.1.1:         "enabled": true,
1.1.1.1:         "loss_scale": 0,
1.1.1.1:         "loss_scale_window": 1000,
1.1.1.1:         "hysteresis": 2,
1.1.1.1:         "min_loss_scale": 1
1.1.1.1:     },
1.1.1.1:     "zero_optimization": {
1.1.1.1:         "stage": 3,
1.1.1.1:         "offload_param": {
1.1.1.1:             "device": "cpu",
1.1.1.1:             "pin_memory": true
1.1.1.1:         },
1.1.1.1:         "offload_optimizer": {
1.1.1.1:             "device": "cpu",
1.1.1.1:             "pin_memory": true
1.1.1.1:         }
1.1.1.1:     },
1.1.1.1:     "activation_checkpointing": {
1.1.1.1:         "partition_activations": false,
1.1.1.1:         "contiguous_memory_optimization": false,
1.1.1.1:         "cpu_checkpointing": false
1.1.1.1:     },
1.1.1.1:     "wall_clock_breakdown": false,
1.1.1.1:     "zero_allow_untested_optimizer": true
1.1.1.1: }
```
@BUPTAnderson (Author):

Single-machine pre-training with TencentPretrain works fine on every V100 machine, and pre-training with FastChat on the same 4 nodes also works.
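
Since the log above stops right after the DeepSpeed engine finishes initializing, a next step could be to find out which call the ranks are actually blocked in. A hedged sketch, assuming pretrain.py can be edited (faulthandler is in the Python standard library; the 300-second interval is arbitrary):

```python
# Hypothetical snippet near the top of pretrain.py (my addition, not in the repo):
# make every rank dump its Python stack to stderr every 5 minutes, so the call
# it is stuck in shows up in each node's log.
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=300, repeat=True, file=sys.stderr)
```

Exporting `NCCL_DEBUG=INFO` on all four nodes before launching would likewise show whether the ranks ever finish setting up the inter-node rings.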
