# Step 3: Training a LoRA Adapter

This notebook showcases LoRA fine-tuning of the base model obtained in step 2 on the dataset that we curated in step 1.

## Setup and Requirements
Before proceeding, please ensure you have completed the notebooks for steps 1 and 2. You will need to install one dependency to follow along. Execute the following cell before getting started.

In [1]:
! pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[notice] A new release of pip is available: 24.1.2 -> 24.3.1
[notice] To update, run: python -m pip install --upgrade pip


Let's also specify the base model name that we will use for fine-tuning. This should be the same model you downloaded/converted in step 2.

In [2]:
model_to_use = "google/gemma-2-2b"

---
# Sanity Checking

Let's do a quick sanity check to ensure we have all the pieces needed before moving forward.

In [3]:
import os

model_name = model_to_use.split('/')[-1].lower()

# The path to the model checkpoint, and also the data directory containing the training, validation, and test data.
nemo_model_fp = os.path.abspath(f"models/{model_name}.nemo")
data_dir = "data/split"

# The directory where the results will be stored.
result_dir = os.path.abspath("results")
os.makedirs(result_dir, exist_ok=True)

# Sanity checks
assert os.path.exists(nemo_model_fp), f"The model checkpoint at '{nemo_model_fp}' does not exist. Please ensure the model was downloaded successfully."
assert os.path.exists(data_dir), f"The data directory '{data_dir}' does not exist. Please ensure the data was prepared successfully."

train_fp = os.path.abspath(f"{data_dir}/train.jsonl")
val_fp = os.path.abspath(f"{data_dir}/val.jsonl")

# Sanity checks
assert os.path.exists(train_fp), f"The training data at '{train_fp}' does not exist. Please ensure the data was prepared successfully."
assert os.path.exists(val_fp), f"The validation data at '{val_fp}' does not exist. Please ensure the data was prepared successfully."

#
# Set the environment variables (needed for executing the next cell)
#
%env BASE_MODEL=$nemo_model_fp
%env DATA_DIR=$data_dir
%env TRAIN_DS=$train_fp
%env VAL_DS=$val_fp
%env RESULT_DIR=$result_dir

print(f"\n{'#'*80}")
print("All checks passed. You are ready to go!")
print(f"    Base model file: {nemo_model_fp}")
print(f"    Data directory: {data_dir}")
print(f"    Results: {result_dir}")

env: BASE_MODEL=/root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo
env: DATA_DIR=data/split
env: TRAIN_DS=/root/ODSC-Hackathon-Repository/data/split/train.jsonl
env: VAL_DS=/root/ODSC-Hackathon-Repository/data/split/val.jsonl
env: RESULT_DIR=/root/ODSC-Hackathon-Repository/results

################################################################################
All checks passed. You are ready to go!
    Base model file: /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo
    Data directory: data/split
    Results: /root/ODSC-Hackathon-Repository/results
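
The checks above confirm only that the files exist. A content check can be sketched as follows. It assumes each non-blank line of `train.jsonl` and `val.jsonl` is a JSON object. The `input` and `output` field names are an assumption (this notebook does not state the schema), so adjust them to match what step 1 produced. The helper name `check_jsonl` is hypothetical.

```python
import json


def check_jsonl(path, required_keys=("input", "output")):
    """Return the number of records in a JSONL file.

    Raises ValueError if a line is not a JSON object or lacks a required key.
    `required_keys` is an assumed schema, not one confirmed by this notebook.
    """
    n_records = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no} is not valid JSON: {err}") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no} is not a JSON object")
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys {missing}")
            n_records += 1
    return n_records
```

In this notebook it could be called as `check_jsonl(train_fp)` and `check_jsonl(val_fp)` once the sanity-check cell has run.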


---
# Model Training

With all the sanity checks passing, it is time to start model training.
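
The batch-size overrides passed to the training script determine how many samples each optimizer step consumes. The arithmetic below is a small sketch of that relationship. The variable names are illustrative, and the values are copied from the overrides used in this notebook.

```python
# Values copied from the training overrides used in this notebook.
micro_batch_size = 1    # model.micro_batch_size
global_batch_size = 10  # model.global_batch_size
devices = 1             # trainer.devices
tp_size = 1             # model.tensor_model_parallel_size
pp_size = 1             # model.pipeline_model_parallel_size
max_steps = 1000        # trainer.max_steps

# With no tensor or pipeline parallelism, the single GPU is one data-parallel rank.
data_parallel_size = devices // (tp_size * pp_size)

# Micro-batches accumulated per optimizer step.
num_microbatches = global_batch_size // (micro_batch_size * data_parallel_size)

# Total samples consumed over the whole run.
samples_seen = max_steps * global_batch_size

print(num_microbatches)  # 10
print(samples_seen)      # 10000
```

Both numbers can be compared against the training log, which reports the number of microbatches and the `consumed_samples` counter.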

> NOTE: Running the following cell will remove any previously trained model!
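
If an earlier run is worth keeping, one option is to move the results directory aside before executing the training cell. This is a sketch rather than part of the original workflow. The helper name `backup_dir` and the timestamped naming scheme are assumptions.

```python
import os
import shutil
import time


def backup_dir(path):
    """Move `path` to a timestamped sibling directory and return the new path.

    Returns None if `path` is not an existing directory.
    """
    if not os.path.isdir(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = f"{path.rstrip(os.sep)}-backup-{stamp}"
    shutil.move(path, target)
    return target
```

In this notebook it could be called as `backup_dir(result_dir)` right before the training cell.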

In [4]:
%%bash

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear cached mem-map index files (a "No such file" error here is harmless on a first run)
rm $DATA_DIR/*idx*
# Remove the results of any prior run
rm -r $RESULT_DIR

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${RESULT_DIR} \
    exp_manager.explicit_log_dir=${RESULT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=200 \
    trainer.max_steps=1000 \
    trainer.gradient_clip_val=0.3 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=10 \
    model.restore_from_path=${BASE_MODEL} \
    model.data.train_ds.num_workers=0 \
    model.data.train_ds.add_bos=True \
    model.data.validation_ds.num_workers=0 \
    model.data.train_ds.file_names=[${TRAIN_DS}] \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=[${VAL_DS}] \
    model.peft.peft_scheme=${SCHEME}

rm: cannot remove 'data/split/*idx*': No such file or directory


[NeMo I 2024-10-28 07:07:10 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-28 07:07:10 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 1000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 0.3
    exp_manager:
      explicit_log_dir: /root/ODSC-Hackathon-Repository/results
      exp_dir: /root/ODSC-Hackathon-Repository/results
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validat

[NeMo W 2024-10-28 07:07:10 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


[NeMo I 2024-10-28 07:07:10 exp_manager:450] ExpManager schema
[NeMo I 2024-10-28 07:07:10 exp_manager:451] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo E 2024-10-28 07:07:10 exp_manager:910] exp_manager received explicit_log_dir: /root/ODSC-Hackathon-Repository/results and at least one of exp_dir: /root/ODSC-Hackathon-Repository/results, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-10-28 07:07:10 exp_manager:837] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/root/ODSC-Hackathon-Repository/results/checkpoints. Training from scratch.


[NeMo I 2024-10-28 07:07:10 exp_manager:509] Experiments will be logged at /root/ODSC-Hackathon-Repository/results
[NeMo I 2024-10-28 07:07:10 exp_manager:1063] TensorboardLogger has been set up


[NeMo W 2024-10-28 07:07:10 exp_manager:1201] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-10-28 07:07:10 exp_manager:646] TFLOPs per sec per GPU will be calculated, conditioned on supported models. Defaults to -1 upon failure.


[NeMo W 2024-10-28 07:07:16 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:07:16 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:07:16 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:07:16 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:07:16 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 07:07:16 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-28 07:07:16 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-28 07:07:16 megatron_init:325] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-28 07:07:16 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-28 07:07:16 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-28 07:07:16 megatron_init:339] All context parallel group ranks: [[0]]
[NeMo I 2024-10-28 07:07:16 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-28 07:07:16 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-28 07:07:16 megatron_init:348] All model parallel group ranks: [[0]]
[NeMo I 2024-10-28 07:07:16 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-28 07:07:16 megatron_init:361] All tensor model parallel group ranks: 

[NeMo I 2024-10-28 07:07:16 megatron_base_model:604] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


[NeMo I 2024-10-28 07:07:28 nlp_overrides:1374] Model MegatronGPTSFTModel was successfully restored from /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo.
[NeMo I 2024-10-28 07:07:28 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-10-28 07:07:28 nlp_adapter_mixins:245] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 2.6 B  | train
    ------------------------------------------------
    0         Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,457.368Total estimated model params size (MB)
    452       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-10-28 07:07:31 nlp_adapter_mixins:250] After adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 2.6 B  | train
    -------------

[NeMo W 2024-10-28 07:07:31 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-28 07:07:31 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-28 07:07:32 megatron_gpt_sft_model:836] Building GPT SFT validation datasets.
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:528] Processing 1 data files using 2 workers
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:494] Building indexing for fn = /root/ODSC-Hackathon-Repository/data/split/val.jsonl
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:506] Saving idx file = /root/ODSC-Hackathon-Repository/data/split/val.jsonl.idx.npy
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:508] Saving metadata file = /root/ODSC-Hackathon-Repository/data/split/val.jsonl.idx.info
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:543] Time building 1 / 1 mem-mapped files: 0:00:00.101332
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:528] Processing 1 data files using 2 workers
[NeMo I 2024-10-28 07:07:32 text_memmap_dataset:543] Time building 0 / 1 mem-mapped files: 0:00:00.089192
[NeMo I 2024-10-28 07:07:32 text

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-28 07:07:32 megatron_base_model:1230] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 1000.


[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-10-28 07:07:32 adapter_mixins:495] Unfrozen adapter : lora_kqv_


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 2.6 B  | train
------------------------------------------------
5.3 M     Trainable params
2.6 B     Non-trainable params
2.6 B     Total params
10,478.667Total estimated model params size (MB)
582       Modules in train mode
0         Modules in eval mode
[NeMo W 2024-10-28 07:07:33 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    


Sanity Checking: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 07:07:33 num_microbatches_calculator:228] setting number of microbatches to constant 10
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00,  0.76it/s]
[NeMo I 2024-10-28 07:07:35 num_microbatches_calculator:228] setting number of microbatches to constant 10


[NeMo W 2024-10-28 07:07:35 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 07:07:35 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 07:07:35 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 202

Epoch 0: :  20%|██        | 200/1000 [05:56<23:45, reduced_train_loss=1.150, global_step=199.0, consumed_samples=2e+3, train_step_timing in s=1.800]  
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 07:13:32 num_microbatches_calculator:228] setting number of microbatches to constant 10

Validation:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|          | 1/96 [00:00<01:27,  1.08it/s]
Validation DataLoader 0:   2%|▏         | 2/96 [00:01<01:27,  1.08it/s]
Validation DataLoader 0:   3%|▎         | 3/96 [00:02<01:29,  1.04it/s]
Validation DataLoader 0:   4%|▍         | 4/96 [00:03<01:27,  1.05it/s]
Validation DataLoader 0:   5%|▌         | 5/96 [00:04<01:26,  1.05it/s]
Validation DataLoader 0:   6%|▋         | 6/96 [00:05<01:26,  1.04it/s]
Validation DataLoader 0:   7%|▋         | 7/96 [00:06<01:25,  1.04it/s]
Validation DataLoader 0:   8%|▊         | 8/96 

Metric val_loss improved. New best score: 1.100
Epoch 0, global step 200: 'validation_loss' reached 1.10000 (best 1.10000), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.100-step=200-consumed_samples=2000.0.ckpt' as top 1
[NeMo W 2024-10-28 07:15:03 nlp_overrides:625] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :  40%|████      | 400/1000 [13:23<20:05, reduced_train_loss=1.160, global_step=399.0, consumed_samples=4e+3, train_step_timing in s=1.740, val_loss=1.100]  
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 07:20:59 num_microbatches_calculator:228] setting number of microbatches to constant 10

Validation:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|          | 1/96 [00:00<01:27,  1.09it/s]
Validation DataLoader 0:   2%|▏         | 2/96 [00:01<01:26,  1.09it/s]
Validation DataLoader 0:   3%|▎         | 3/96 [00:02<01:25,  1.08it/s]
Validation DataLoader 0:   4%|▍         | 4/96 [00:03<01:25,  1.07it/s]
Validation DataLoader 0:   5%|▌         | 5/96 [00:04<01:24,  1.07it/s]
Validation DataLoader 0:   6%|▋         | 6/96 [00:05<01:24,  1.06it/s]
Validation DataLoader 0:   7%|▋         | 7/96 [00:06<01:23,  1.06it/s]
Validation DataLoader 0:   8%|▊

Metric val_loss improved by 0.071 >= min_delta = 0.001. New best score: 1.029
Epoch 0, global step 400: 'validation_loss' reached 1.02901 (best 1.02901), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.029-step=400-consumed_samples=4000.0.ckpt' as top 1


Epoch 0: :  40%|████      | 400/1000 [14:54<22:21, reduced_train_loss=1.160, global_step=399.0, consumed_samples=4e+3, train_step_timing in s=1.740, val_loss=1.030]
[NeMo I 2024-10-28 07:22:30 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.100-step=200-consumed_samples=2000.0.ckpt
[NeMo I 2024-10-28 07:22:31 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.100-step=200-consumed_samples=2000.0-last.ckpt
Epoch 0: :  60%|██████    | 600/1000 [20:51<13:54, reduced_train_loss=0.686, global_step=599.0, consumed_samples=6e+3, train_step_timing in s=1.770, val_loss=1.030]  
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 07:28:26 num_microbatches_calculator:228] setting number of microbatches to constant 10

Validation:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|  

Metric val_loss improved by 0.048 >= min_delta = 0.001. New best score: 0.981
Epoch 0, global step 600: 'validation_loss' reached 0.98120 (best 0.98120), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.981-step=600-consumed_samples=6000.0.ckpt' as top 1


Epoch 0: :  60%|██████    | 600/1000 [22:22<14:54, reduced_train_loss=0.686, global_step=599.0, consumed_samples=6e+3, train_step_timing in s=1.770, val_loss=0.981]
[NeMo I 2024-10-28 07:29:58 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.029-step=400-consumed_samples=4000.0.ckpt
[NeMo I 2024-10-28 07:29:58 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.029-step=400-consumed_samples=4000.0-last.ckpt
Epoch 0: :  80%|████████  | 800/1000 [28:18<07:04, reduced_train_loss=0.961, global_step=799.0, consumed_samples=8e+3, train_step_timing in s=1.770, val_loss=0.981]  
Validation: |          | 0/? [00:00<?, ?it/s]
[NeMo I 2024-10-28 07:35:53 num_microbatches_calculator:228] setting number of microbatches to constant 10

Validation:   0%|          | 0/96 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|  

Metric val_loss improved by 0.027 >= min_delta = 0.001. New best score: 0.954
Epoch 0, global step 800: 'validation_loss' reached 0.95373 (best 0.95373), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.954-step=800-consumed_samples=8000.0.ckpt' as top 1


Epoch 0: :  80%|████████  | 800/1000 [29:48<07:27, reduced_train_loss=0.961, global_step=799.0, consumed_samples=8e+3, train_step_timing in s=1.770, val_loss=0.954]
[NeMo I 2024-10-28 07:37:24 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.981-step=600-consumed_samples=6000.0.ckpt
[NeMo I 2024-10-28 07:37:24 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.981-step=600-consumed_samples=6000.0-last.ckpt
Validation DataLoader 0:  86%|████████▋ | 83/96 [01:17<00:12,  1.06it/s]
Validation DataLoader 0:  88%|████████▊ | 84/96 [01:18<00:11,  1.06it/s]
Validation DataLoader 0:  89%|████████▊ | 85/96 [01:19<00:10,  1.06it/s]
Validation DataLoader 0:  90%|████████▉ | 86/96 [01:20<00:09,  1.06it/s]
Validation Dat

Metric val_loss improved by 0.005 >= min_delta = 0.001. New best score: 0.949
Epoch 0, global step 1000: 'validation_loss' reached 0.94917 (best 0.94917), saving model to '/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.949-step=1000-consumed_samples=10000.0.ckpt' as top 1


Epoch 0: : 100%|██████████| 1000/1000 [37:14<00:00, reduced_train_loss=0.876, global_step=999.0, consumed_samples=1e+4, train_step_timing in s=1.840, val_loss=0.949]
[NeMo I 2024-10-28 07:44:50 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.954-step=800-consumed_samples=8000.0.ckpt
[NeMo I 2024-10-28 07:44:51 nlp_overrides:609] Removing checkpoint: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.954-step=800-consumed_samples=8000.0-last.ckpt


`Trainer.fit` stopped: `max_steps=1000` reached.


Epoch 0: : 100%|██████████| 1000/1000 [37:15<00:00, reduced_train_loss=0.876, global_step=999.0, consumed_samples=1e+4, train_step_timing in s=1.840, val_loss=0.949]
[NeMo I 2024-10-28 07:44:51 perf_metrics:87] TFLOPs per sec per GPU=-1.00


[NeMo E 2024-10-28 07:44:51 perf_metrics:85] Failed to calculate TFLOPs per sec per GPU.
    FLOPs measurement not supported for finetuning jobs
Restoring states from the checkpoint path at /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.949-step=1000-consumed_samples=10000.0.ckpt
Restored all states from the checkpoint at /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.949-step=1000-consumed_samples=10000.0.ckpt
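
The validation loss at each check can be pulled out of the training log with a regular expression. The sketch below works on shortened copies of the checkpoint messages from the output above. The pattern and the helper name `parse_validation_loss` are assumptions based on the message format seen in this run, not a NeMo API.

```python
import re

# Shortened copies of the checkpoint messages printed during this run.
LOG = """\
Epoch 0, global step 200: 'validation_loss' reached 1.10000 (best 1.10000)
Epoch 0, global step 400: 'validation_loss' reached 1.02901 (best 1.02901)
Epoch 0, global step 600: 'validation_loss' reached 0.98120 (best 0.98120)
Epoch 0, global step 800: 'validation_loss' reached 0.95373 (best 0.95373)
Epoch 0, global step 1000: 'validation_loss' reached 0.94917 (best 0.94917)
"""

PATTERN = re.compile(r"global step (\d+): 'validation_loss' reached ([0-9.]+)")


def parse_validation_loss(text):
    """Return a list of (global_step, validation_loss) pairs found in `text`."""
    return [(int(step), float(loss)) for step, loss in PATTERN.findall(text)]


history = parse_validation_loss(LOG)
for step, loss in history:
    print(f"step {step:>4}: validation_loss = {loss:.5f}")
```

In this run the loss falls at every check, from 1.10000 at step 200 to 0.94917 at step 1000, with smaller improvements toward the end.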


---
# Inference and Submission


To make a submission, run inference with your model on the test dataset at `data/split/submission.jsonl`.

> NOTE: This dataset was generated as part of Step 1. Please ensure it exists before proceeding.

To do this, set the variable pointing to your submission data file in the cell below, then execute the final cell.

The inference results will be written to the `results/inference` folder.

In [5]:
test_fp = os.path.abspath(f"{data_dir}/submission.jsonl")
assert os.path.exists(test_fp), f"The submission data at '{test_fp}' does not exist. Please ensure the data was prepared successfully."

# Path to the LoRA adapter saved by the training cell (test_fp is already absolute).
adapter_fp = f"{result_dir}/checkpoints/megatron_gpt_peft_lora_tuning.nemo"
os.makedirs(f"{result_dir}/inference", exist_ok=True)

print(f"Inference set: {test_fp}")
print(f"Trained adapter: {adapter_fp}")
test_filename = os.path.basename(test_fp)


%env TEST_DS=$test_fp
%env TEST_FP=$test_filename
%env TRAINED_ADAPTER=$adapter_fp

Inference set: /root/ODSC-Hackathon-Repository/data/split/submission.jsonl
Trained adapter: /root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning.nemo
env: TEST_DS=/root/ODSC-Hackathon-Repository/data/split/submission.jsonl
env: TEST_FP=submission.jsonl
env: TRAINED_ADAPTER=/root/ODSC-Hackathon-Repository/results/checkpoints/megatron_gpt_peft_lora_tuning.nemo
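
The cell above asserts that the submission file exists but does not check the trained adapter. A minimal sketch of an equivalent check follows. `require_path` is a hypothetical helper, not part of the original notebook.

```python
import os


def require_path(path, description):
    """Return the absolute form of `path`, raising FileNotFoundError if it does not exist."""
    abs_path = os.path.abspath(path)
    if not os.path.exists(abs_path):
        raise FileNotFoundError(
            f"The {description} at '{abs_path}' does not exist. "
            "Please ensure the previous steps completed successfully."
        )
    return abs_path
```

In this notebook it could be called as `require_path(adapter_fp, "trained adapter")` before launching inference.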


In [6]:
%%bash

# This is where the inference results will be stored.
OUTPUT_DIR="results/inference/infer-$TEST_FP"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear up cached mem-map file
rm $DATA_DIR/*idx*

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${BASE_MODEL} \
    model.peft.restore_from_path=${TRAINED_ADAPTER} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    inference.greedy=True \
    model.data.test_ds.file_names=[${TEST_DS}] \
    model.data.test_ds.names=["infer"] \
    model.data.test_ds.global_batch_size=16 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=32 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.data.test_ds.output_file_path_prefix=$OUTPUT_DIR \
    model.data.test_ds.write_predictions_to_file=True



[NeMo I 2024-10-28 07:53:16 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-10-28 07:53:16 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-10-28 07:53:16 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-10-28 07:53:22 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-10-28 07:53:22 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-10-28 07:53:22 megatron_init:325] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-10-28 07:53:22 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-10-28 07:53:22 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-10-28 07:53:22 megatron_init:339] All context parallel group ranks: [[0]]
[NeMo I 2024-10-28 07:53:22 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-10-28 07:53:22 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-10-28 07:53:22 megatron_init:348] All model parallel group ranks: [[0]]
[NeMo I 2024-10-28 07:53:22 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-10-28 07:53:22 megatron_init:361] All tensor model parallel group ranks: 

[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 07:53:22 tokenizer_utils:197] Getting SentencePiece with model: /tmp/tmp6r3lmtsl/7785645eb8594f67b5e7f32b1fee7d65_tokenizer.model
[NeMo I 2024-10-28 07:53:22 megatron_base_model:604] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.


[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-10-28 07:53:22 megatron_base_model:1189] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-10-28 07:53:38 nlp_overrides:1374] Model MegatronGPTSFTModel was successfully restored from /root/ODSC-Hackathon-Repository/models/gemma-2-2b.nemo.
[NeMo I 2024-10-28 07:53:38 nlp_adapter_mixins:245] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 2.6 B  | train
    -------------------------------------------
    0         Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,457.368Total estimated model params size (MB)
    451       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-10-28 07:53:41 nlp_adapter_mixins:250] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 2.6 B  | train
    -------------------------------------------
    5.3 M     Trainable params
    2.6 B     Non-trainable params
    2.6 B     Total params
    10,478.6

[NeMo W 2024-10-28 07:53:41 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-10-28 07:53:41 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-10-28 07:53:41 megatron_gpt_sft_model:828] Building GPT SFT test datasets.
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:116] Building data files
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:528] Processing 1 data files using 6 workers
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:494] Building indexing for fn = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:506] Saving idx file = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl.idx.npy
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:508] Saving metadata file = /root/ODSC-Hackathon-Repository/data/split/submission.jsonl.idx.info
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:543] Time building 1 / 1 mem-mapped files: 0:00:00.230623
[NeMo I 2024-10-28 07:53:41 text_memmap_dataset:528] Processing 1 data files using 6 workers
[NeMo I 2024-10-28 07:53:42 text_memmap_dataset:543] Time building 0 / 1 mem-mapped files: 0:00:00.191846
[NeMo I 2024-10-2

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-10-28 07:53:42 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of microbatches to constant 16
Testing DataLoader 0:   0%|          | 0/313 [00:00<?, ?it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 16
Testing DataLoader 0:   0%|          | 1/313 [00:13<1:07:46,  0.08it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 16
Testing DataLoader 0:   1%|          | 2/313 [00:27<1:10:19,  0.07it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 16
Testing DataLoader 0:   1%|          | 3/313 [00:37<1:05:17,  0.08it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 16
Testing DataLoader 0:   1%|▏         | 4/313 [00:49<1:03:43,  0.08it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 16
Testing DataLoader 0:   2%|▏         | 5/313 [00:59<1:01:26,  0.08it/s]setting number of microbatches to constan

[NeMo W 2024-10-28 08:54:08 megatron_gpt_sft_model:677] No training data found, reconfiguring microbatches based on validation batch sizes.


setting number of microbatches to constant 16


[NeMo W 2024-10-28 08:54:08 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 08:54:08 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss_infer', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-10-28 08:54:08 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    


Testing DataLoader 0: 100%|██████████| 313/313 [1:00:25<00:00,  0.09it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │     10.21175479888916     │
│      test_loss_infer      │     10.21175479888916     │
│         val_loss          │     10.21175479888916     │
└───────────────────────────┴───────────────────────────┘


The predictions are written to a JSONL file under `results/inference`. Please send us this file as your final submission.

Let's inspect the first two lines of that file as a sanity check:

In [7]:
! head -n 2 results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl

{"input": "Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.\n\nTITLE:\nFairness in Punishment for Reckless Behavior\n\nQUESTION:\nIs it justifiable to have significantly different penalties for individuals who engage in reckless behavior, depending on the outcome of their actions, or should the focus be on the level of recklessness itself, regardless of the consequences?", "pred": " criminal-law,punishment", "label": " ", "filename": "submission.jsonl"}
{"input": "Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.\n\nTITLE:\nCan a Promise Be Binding Without a Tangible Exchange?\n\nQUESTION:\nIn what situations can a commitment or promise be deemed legally enforceable, even if no direct benefit or tangible item is exchanged b
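
Beyond eyeballing two lines, it can help to check every prediction against the format that the prompt asks for: lowercase tags, ordered lexicographically, separated by commas. The following is a minimal sketch rather than part of the official workflow. The helper names `check_tags` and `validate_lines` are hypothetical, and the sketch assumes that each record carries a `pred` field as in the output above and that Python's default string ordering is an acceptable stand-in for "lexicographic".

```python
import json
import os

def check_tags(pred: str) -> list[str]:
    """Return the format problems found in one predicted tag string."""
    problems = []
    tags = [tag.strip() for tag in pred.strip().split(",")]
    if any(not tag for tag in tags):
        problems.append("empty tag")
    if any(tag != tag.lower() for tag in tags):
        problems.append("not lowercase")
    # Assumption: plain code-point ordering counts as lexicographic order.
    if tags != sorted(tags):
        problems.append("not sorted")
    return problems

def validate_lines(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to the problems found in that record."""
    bad = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        problems = check_tags(record["pred"])
        if problems:
            bad[line_no] = problems
    return bad

if __name__ == "__main__":
    # Path taken from the cell above; skipped silently if the file is absent.
    path = "results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            issues = validate_lines(f)
        print(f"{len(issues)} record(s) with formatting problems")
```

An empty result from `validate_lines` only means the predictions are well-formed; it says nothing about whether the predicted tags are correct.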

---
# Freeing Memory and Other Resources

As always, it is a good idea to free up all allocated resources when you are done. Please execute the following cell to do so.

Alternatively, restart the kernel by navigating to `Kernel > Restart Kernel` (if using Jupyter Notebook), or by clicking the `Restart` button in VS Code.

In [8]:
exit(0)
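
If you would rather keep the kernel alive than run the cell above, a lighter-weight option is to drop references to large objects and ask Python (and PyTorch, if it is installed) to release what they hold. This is only a sketch under assumptions: `release_memory` is a hypothetical helper, the variable names passed to it are placeholders, and it can only free memory owned by the notebook's own process. Restarting the kernel remains the more thorough option.

```python
import gc

def release_memory(namespace: dict, names: list[str]) -> list[str]:
    """Delete the given variable names from `namespace` and reclaim memory.

    Returns the names that were actually present and removed.
    """
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]
            removed.append(name)
    gc.collect()  # collect objects that are now unreachable
    try:
        import torch
        if torch.cuda.is_available():
            # Return cached, unused GPU blocks to the driver.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # PyTorch not installed; nothing GPU-side to release
    return removed
```

In a notebook this might be called as `release_memory(globals(), ["model", "trainer"])`, where `model` and `trainer` stand in for whatever large objects your session actually created.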