
### Step 2: Fine-tune the teacher on the dataset

NeMo Framework includes a standard Python script, [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py), for training a model. Once the model is downloaded and the dataset is ready, fine-tuning the teacher model with NeMo amounts to running this script.

We fine-tune the unpruned model on our dataset to correct for the distribution shift between this dataset and the data the model was originally trained on. According to the [blog](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) and [tech report](https://arxiv.org/pdf/2408.11796), experiments showed that without this correction, the teacher provides suboptimal guidance on the dataset during distillation.

For this demonstration, the training run is capped at `STEPS` steps, and validation runs every `VAL_INTERVAL` steps.
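
As a quick sanity check on these two settings, the sketch below lists the steps at which validation would run. The helper name `validation_steps` is illustrative only, and the sketch assumes validation fires on every multiple of `VAL_INTERVAL` up to and including `STEPS`, using the values from this step's training cell (`STEPS=30`, `VAL_INTERVAL=10`).

```python
def validation_steps(max_steps: int, val_interval: int) -> list[int]:
    """Return the training steps at which a validation pass is expected.

    Assumption: validation runs after every `val_interval` training steps,
    including the final step when it is an exact multiple.
    """
    return list(range(val_interval, max_steps + 1, val_interval))


STEPS = 30         # same value as in the training cell
VAL_INTERVAL = 10  # same value as in the training cell

print(validation_steps(STEPS, VAL_INTERVAL))  # [10, 20, 30]
```

With the demonstration values, the run validates three times. Raising `STEPS` without changing `VAL_INTERVAL` adds validation passes proportionally.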

> `NOTE:` In the `%%bash` cell below, set the paths to your pre-processed train, validation, and test data files, as well as the path to the teacher `.nemo` model.
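
Before launching the training cell, it can help to confirm that these inputs actually exist. The following is a minimal pre-flight sketch, not part of NeMo: the helper `missing_inputs` is hypothetical, and it assumes each data prefix refers to a Megatron indexed dataset stored as a `<prefix>.bin` and `<prefix>.idx` pair relative to the current working directory.

```python
from pathlib import Path


def missing_inputs(model_path: str, data_prefixes: list[str]) -> list[str]:
    """Return the expected input paths that do not exist on disk."""
    expected = [Path(model_path)]
    for prefix in data_prefixes:
        # Assumption: each prefix maps to a <prefix>.bin / <prefix>.idx pair.
        expected += [Path(f"{prefix}.bin"), Path(f"{prefix}.idx")]
    return [str(path) for path in expected if not path.exists()]


# Values mirror the training cell; adjust them if your paths differ.
MODEL = "/workspace/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo"
DATA_PREFIXES = [
    "wikitext_tokenized_train_text_document",
    "wikitext_tokenized_val_text_document",
    "wikitext_tokenized_test_text_document",
]

missing = missing_inputs(MODEL, DATA_PREFIXES)
print("Missing inputs:", missing if missing else "none")
```

An empty result means every file the training command refers to was found; any listed path should be fixed before starting the run.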

In [None]:
%%bash 

export CUDA_DEVICE_MAX_CONNECTIONS=1

# Set path(s) if different:

MODEL="/workspace/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo"

# Can change these to accommodate resources:

TENSOR_PARALLEL_SIZE=8
NODES=1
MICRO_BATCH_SIZE=4

# Don't change the following:

EXPERIMENT_DIR="distill_trainings"
EXPERIMENT_NAME="megatron_llama_ft"

DATA_TRAIN='wikitext_tokenized_train_text_document'
DATA_VAL='wikitext_tokenized_val_text_document'
DATA_TEST='wikitext_tokenized_test_text_document'

STEPS=30
GLOBAL_BATCH_SIZE=128

LOG_INTERVAL=1
VAL_INTERVAL=10
NUM_VAL_BATCHES=5

LR=1e-4
MIN_LR=1e-5
WARMUP_STEPS=2

cmd="torchrun --nproc-per-node=${TENSOR_PARALLEL_SIZE}"

${cmd} /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path /opt/NeMo/examples/nlp/language_modeling/conf/ \
    --config-name megatron_llama_distill.yaml \
    \
    name=${EXPERIMENT_NAME} \
    \
    exp_manager.exp_dir=${EXPERIMENT_DIR} \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    \
    trainer.max_steps=${STEPS} \
    trainer.log_every_n_steps=${LOG_INTERVAL} \
    trainer.val_check_interval=${VAL_INTERVAL} \
    trainer.limit_val_batches=${NUM_VAL_BATCHES} \
    +trainer.num_sanity_val_steps=0 \
    \
    trainer.precision=bf16 \
    trainer.devices=${TENSOR_PARALLEL_SIZE} \
    trainer.num_nodes=${NODES} \
    \
    "model.data.data_prefix={train:[1.0,$DATA_TRAIN],validation:[$DATA_VAL],test:[$DATA_TEST]}" \
    \
    model.restore_from_path=${MODEL} \
    +model.dist_ckpt_load_strictness=log_all \
    \
    ~model.tokenizer \
    +model.tokenizer='{library: huggingface, type: meta-llama/Meta-Llama-3.1-8B, use_fast: True}' \
    \
    model.tensor_model_parallel_size=${TENSOR_PARALLEL_SIZE} \
    model.sequence_parallel=True \
    model.micro_batch_size=${MICRO_BATCH_SIZE} \
    model.global_batch_size=${GLOBAL_BATCH_SIZE} \
    \
    model.encoder_seq_length=8192 \
    model.num_layers=32 \
    model.hidden_size=4096 \
    model.ffn_hidden_size=14336 \
    model.num_attention_heads=32 \
    model.hidden_dropout=0.0 \
    model.attention_dropout=0.0 \
    model.apply_query_key_layer_scaling=True \
    model.normalization='rmsnorm' \
    model.bias=False \
    model.activation='fast-swiglu' \
    model.position_embedding_type='rope' \
    model.share_embeddings_and_output_weights=False \
    model.num_query_groups=8 \
    ++model.scale_positional_embedding=True \
    ++model.rotary_base=500000.0 \
    \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=${LR} \
    model.optim.sched.min_lr=${MIN_LR} \
    model.optim.sched.warmup_steps=${WARMUP_STEPS}

This creates a fine-tuned teacher model named `megatron_llama_ft.nemo` in `./distill_trainings/megatron_llama_ft/checkpoints/`. We'll use it later as the teacher for distillation.

> `NOTE:` This script takes at least 20 minutes to run, depending on the GPU.
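
The batch-size settings in the training command also determine how much gradient accumulation happens per optimizer step, which is useful to know when adjusting `MICRO_BATCH_SIZE` or `TENSOR_PARALLEL_SIZE` for your hardware. The sketch below works through that arithmetic. The helper `batch_plan` is hypothetical, and it assumes the usual Megatron-style relationship (global batch = micro batch × data-parallel size × accumulation steps), no pipeline parallelism, and sequences filled to the full `encoder_seq_length`.

```python
def batch_plan(global_batch: int, micro_batch: int, gpus: int,
               tensor_parallel: int) -> tuple[int, int]:
    """Return (data_parallel_size, gradient_accumulation_steps).

    Assumption: pipeline parallelism is 1, so the model-parallel group
    size equals the tensor-parallel size.
    """
    if gpus % tensor_parallel != 0:
        raise ValueError("GPU count must be divisible by the tensor-parallel size")
    data_parallel = gpus // tensor_parallel
    samples_per_pass = micro_batch * data_parallel
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro batch x data-parallel size")
    return data_parallel, global_batch // samples_per_pass


# Values taken from the training cell.
TENSOR_PARALLEL_SIZE = 8
NODES = 1
MICRO_BATCH_SIZE = 4
GLOBAL_BATCH_SIZE = 128
SEQ_LENGTH = 8192
STEPS = 30

gpus = TENSOR_PARALLEL_SIZE * NODES  # trainer.devices x trainer.num_nodes
dp, accum = batch_plan(GLOBAL_BATCH_SIZE, MICRO_BATCH_SIZE, gpus, TENSOR_PARALLEL_SIZE)
tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LENGTH

print(f"data-parallel size:      {dp}")                       # 1
print(f"accumulation steps:      {accum}")                    # 32
print(f"tokens per step (max):   {tokens_per_step}")          # 1048576
print(f"tokens over run (max):   {tokens_per_step * STEPS}")  # 31457280
```

Under these assumptions, all 8 GPUs form a single tensor-parallel group, so each optimizer step accumulates 32 micro-batches. Lowering `MICRO_BATCH_SIZE` to fit memory raises the accumulation count but leaves the effective batch size unchanged.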