# Training with InstructLab

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Contact for questions and technical support: IBM.Research.JupyterLab@ibm.com
<li>Provenance: IBM Research
<li>Version: 1.0.8
<li>Release date: 2024-11-14
<li>Compute requirements: GPU: estimated 30 minutes (19 min for cell 12 and 10 min for cell 15)
<li>Memory requirements: 16 GB
<li>Notebook set: InstructLab
</ul>

# Select Viewing Option 

This notebook was optimized for viewing the output in a separate panel:
- If you would like to see the separate panel output, set `dual_screen` to *True* in the first line of the next cell and follow the steps below. 
- If you want the output inline with the notebook cells, set `dual_screen` to *False*. If you are running with the output inline with the notebook, please run the notebook cell by cell so that options can be selected.

If you set `dual_screen` to *True*, perform the following:
1. Right click on the same cell and select **Create New View for Output**.
1. Drag the new **Output View** panel to the right side of the JupyterLab.
1. Hide the File Browser by toggling the File Browser icon on the top left of the JupyterLab.
1. To run the notebook, click on a notebook code cell, then from the top menu select *Kernel->Restart Kernel and Run All Cells*.
1. Select options and the *Continue* button to progress with the notebook.

In [None]:
dual_screen = False

from IPython.display import Image, display
import ipywidgets as widgets
from ipynb_pause import flow

H1 = "<p style='font-family:IBM Plex Sans;font-size:28px'>"
H2 = "<p style='font-family:IBM Plex Sans;font-size:24px'>"
Norm = "<p style='font-family:IBM Plex Sans;font-size:20px'>"
Small = "<p style='font-family:IBM Plex Sans;font-size:17px'>"
Ex = "<p style='font-family:IBM Plex Sans;font-size:20px;font-style:italic'>"

out = widgets.Output(layout={'border': '1px solid black'})
run=flow.display_mode(mode=dual_screen, output=out, color='darkblue')
if dual_screen:
    display(out)

# Summary

This notebook demonstrates InstructLab, a model-agnostic open source AI project that facilitates contributions to Large Language Models (LLMs).

This notebook is part of a sequential notebook set. Before using this notebook, please ensure that you have run the first notebook in this set: [Configuring InstructLab](./00_configuring_InstructLab.ipynb).

In this notebook, we will demonstrate the following:
- Querying the LLM before training
- Creating a question and answer data file
- Generating synthetic data for training
- Training the LLM with the generated data

**Note:** This notebook must be run with a GPU. If you are not running with a GPU, select File->Hub Control Panel->Stop My Server, then Start My Server, and then select GPU Session.

# Table of Contents

* [Step 0. Introduction](#I2_intro)
* [Step 1. Import Libraries and Check Configuration](#I2_import)
* [Step 2. Specify the Data for this Run](#I2_data)
* [Step 3. Create the Taxonomy Data Repository](#I2_taxonomy)
* [Step 4. Generate Synthetic Data](#I2_generate)
* [Step 5. Train the Model](#I2_train)
* [Conclusion](#I1_conclusion)
* [Learn More](#I1_learn)

<a id="I2_intro"></a>
# Step 0. Introduction

This notebook provides a template for running InstructLab in Python.

In [None]:
if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 0. Introduction"))
        display(Image(filename='data/images/Flow.png',width=1000))
else: 
    display(Image(filename='data/images/Flow.png',width=1200))
run.pause()

<a id="I2_import"></a>
# Step 1. Import Libraries and Check Configuration

## 1.1. Imports and configuration

This code cell also checks for GPU availability. This notebook requires a GPU to run in a reasonable time.

Check that InstructLab version 0.23.1 is installed properly and is configured for using a GPU.

The first line of the 'InstructLab' section should read
```
instructlab.version: 0.23.1
```
and the last line should read
```
llama_cpp_python.supports_gpu_offload: True
```
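
As a minimal sketch, the two expected lines can also be checked programmatically once the report text has been captured. The helper name `check_system_info` and the abbreviated sample report are assumptions for illustration; the real `ilab system info` output contains many more lines.

```python
def check_system_info(report: str, expected_version: str = "0.23.1") -> dict:
    """Check a captured system report for the two lines described above."""
    values = {}
    for line in report.splitlines():
        # Split each "key: value" line at the first colon.
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return {
        "version_ok": values.get("instructlab.version") == expected_version,
        "gpu_offload_ok": values.get("llama_cpp_python.supports_gpu_offload") == "True",
    }


# Abbreviated, hypothetical report text used only to exercise the helper.
sample_report = """InstructLab:
  instructlab.version: 0.23.1
llama_cpp_python:
  llama_cpp_python.supports_gpu_offload: True
"""
result = check_system_info(sample_report)
print(result)
```

If either value is `False`, the installation or GPU configuration should be reviewed before continuing.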

In [None]:
# standard imports
import os
import subprocess
import torch
import json

os.environ['NUMEXPR_MAX_THREADS'] = '64'
il_data_path= '/home/jovyan/.local/share/instructlab/datasets/'
with open('config.json', 'r') as f:
    jsonData = json.load(f)
with open('instructlab.json', 'r') as f:
    jsonState = json.load(f)    

# torch and cuda version check
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]

if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 1. Import Libraries and Check Configuration"))
        display(widgets.HTML(H2+"1.1 Imports and configuration"))
        display(widgets.HTML(Norm+"Perform imports and see if a GPU is available"))
        print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
        if not torch.cuda.is_available():
            display(widgets.HTML(Norm+"ERROR: GPU not in configuration, Please restart with a GPU"))
            run.resume() 
        else:
            os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
            print("GPU is Available\n")
            display(widgets.HTML(Norm+"Check if configuration supports GPU offload"))
            !ilab system info
        
else: 
    print("Imports completed")
    print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
    if not torch.cuda.is_available():
        print("ERROR: GPU not in configuration, Please restart with a GPU")
    else:
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
        print("GPU is Available\n")
        !ilab system info
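
The version handling in the cell above can be illustrated on plain strings. This is a small sketch: `split_torch_version` is a hypothetical helper, and the version strings are made-up examples. Unlike the cell above, it returns `None` for the build tag when the version has no `+` suffix, rather than repeating the whole version string.

```python
def split_torch_version(version: str) -> tuple:
    """Return (major.minor, build_tag) from a version string such as '2.3.1+cu121'."""
    # Everything after the first '+' is the local build tag (for example a CUDA tag).
    release, _, build_tag = version.partition("+")
    major_minor = ".".join(release.split(".")[:2])
    return major_minor, (build_tag or None)


with_tag = split_torch_version("2.3.1+cu121")
without_tag = split_torch_version("2.3.1")
print(with_tag)
print(without_tag)
```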

## 1.2. Optionally test the base model before adding data

At this point you may wish to run an InstructLab server and run queries against the base model. 

This may be useful if you are working with new content and want to query the base model to ascertain the responses before InstructLab training.

After training, the [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb) notebook allows you to ask questions to both the base and InstructLab trained models and compare answers.

In [None]:
if dual_screen:
    with out:
        display(widgets.HTML(H2+"1.2 Optionally test the base model before adding data"))
    
#!ilab model serve --model-path models/merlinite-7b-lab-Q4_K_M.gguf    
    
run.pause()    

<a id="I2_data"></a>
# Step 2. Specify the Data for this Run

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns", and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select your own custom dataset in the cell below.

## 2.1 Optionally, Create your own data set for InstructLab

Follow these steps to add your own dataset:
1. Create your own **qna.yaml** file based on the example qna.yaml files provided in the /data/oscars, /data/quantum and /data/agentic_ai directories. Additional guidance on creating a properly formatted qna.yaml file is found in the [InstructLab taxonomy readme](https://github.com/instructlab/taxonomy).
1. Add your **qna.yaml** and sample **questions.txt** files to the **/data/your_content_1** folder or the **/data/your_content_2** folder.
1. Right click on the **config.json** file and select Open With->Editor. Specify the **qna_location** where your data resides within the Dewey Decimal classification system. Save and close the **config.json** file.
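
To make the role of **qna_location** concrete, here is a minimal sketch of how a use case name resolves to a source file and a taxonomy destination. The `example_config` dictionary is a hypothetical excerpt showing only the keys this notebook reads, and `resolve_qna_paths` is an illustrative helper, not part of InstructLab.

```python
# Hypothetical excerpt of config.json: only the keys this notebook reads are shown,
# and the location reuses the Oscars example path from this notebook.
example_config = {
    "use_cases": {
        "oscars": {
            "qna_location": "knowledge/textbooks/culture/movies/awards/oscars"
        }
    }
}


def resolve_qna_paths(config: dict, use_case: str) -> tuple:
    """Return (source qna.yaml path, destination path inside the taxonomy)."""
    try:
        qna_location = config["use_cases"][use_case]["qna_location"]
    except KeyError as err:
        raise KeyError(f"No qna_location configured for use case '{use_case}'") from err
    source = f"data/{use_case}/qna.yaml"
    destination = f"taxonomy/{qna_location}/qna.yaml"
    return source, destination


source, destination = resolve_qna_paths(example_config, "oscars")
print(source)
print(destination)
```

A custom dataset would add its own entry (for example under `your_content_1`) with a `qna_location` that matches where the content belongs in the taxonomy tree.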


In [None]:
data_set = widgets.ToggleButtons(
    options=['2024 Oscars', 'Quantum', 'Agentic AI', 'Your Content 1', 'Your Content 2'],
    tooltips=['2024 Oscar Awards Ceremony', 'Quantum Roadmap and Patterns', 'Artificial Intelligence Agents', 'Your own uploaded content dataset 1', 'Your own uploaded content dataset 2'],
    description='Dataset:', disabled=False, button_style='', style={"button_width": "auto"}
)
print("\nSelect the QNA dataset to add:")
display(data_set)
if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 2. Choose the Dataset for this Run"))
        display(widgets.HTML(Norm+"<br>Select content for InstructLab processing from the following:"))
        display(data_set)
        display(widgets.HTML(Small+"Note: Choose <b>Your Content 1</b> or <b>Your Content 2</b> if you are providing your own QNA file"))
else:
    print("After choosing your dataset for this run, please select and run the following cell")        
run.pause()

In [None]:
if dual_screen:
    with out:
        display(widgets.HTML(H1+"Step 2. Choose the Dataset for this Run"))
        
if data_set.value=='2024 Oscars':
    use_case="oscars"
elif data_set.value=='Quantum':
    use_case="quantum"
elif data_set.value=='Agentic AI':
    use_case="agentic_ai"
elif data_set.value=='Your Content 1':
    use_case="your_content_1"
elif data_set.value=='Your Content 2':
    use_case="your_content_2"
else:
    use_case="undefined"
    
if use_case=="undefined":   
    print("ERROR: Undefined data set: " + data_set.value + " data")
    if dual_screen:
        with out:
            display(widgets.HTML(Norm+"ERROR: Undefined data set: " + data_set.value + " data"))
    run.pause()
else:
    jsonState["last_use_case"]=data_set.value
    with open('instructlab.json', 'w') as f:
        json.dump(jsonState, f, indent=4)
    qna_file="data/" + use_case + "/qna.yaml"
    qna_location=jsonData["use_cases"][use_case]["qna_location"]            

    print("Using " + data_set.value + " data")
    if dual_screen:
        with out:
            display(widgets.HTML(Norm+"Using " + data_set.value + " data"))     
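
The label-to-directory mapping in the cell above can also be written as a dictionary lookup, which keeps the button labels and directory names in one place. This is an equivalent sketch, not a change to the notebook's behavior; `USE_CASES` and `use_case_for` are illustrative names.

```python
# Button label -> data directory name, mirroring the if/elif chain above.
USE_CASES = {
    "2024 Oscars": "oscars",
    "Quantum": "quantum",
    "Agentic AI": "agentic_ai",
    "Your Content 1": "your_content_1",
    "Your Content 2": "your_content_2",
}


def use_case_for(label: str) -> str:
    """Return the directory name for a button label, or 'undefined' if unknown."""
    return USE_CASES.get(label, "undefined")


print(use_case_for("Quantum"))
print(use_case_for("Something else"))
```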

<a id="I2_taxonomy"></a>
# Step 3. Create the Taxonomy Data Repository
## 3.1 Delete the prior repository and clone the empty taxonomy repository
We start with an empty repository. 

In [None]:
shell_command1 = f"rm -rf taxonomy"
taxonomy_repo=jsonData["taxonomy_repo"]
shell_command2 = f"git clone {taxonomy_repo}"
if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 3. Create the Taxonomy Data Repository"))
        display(widgets.HTML(H2+"3.1 Delete the prior repository and clone the empty taxonomy repository"))
        !{shell_command1}
        !{shell_command2}
else:
    print("Step 3. Create the Taxonomy Data Repository")
    print("3.1 Delete the prior repository and clone the empty taxonomy repository")
    !{shell_command1}
    !{shell_command2}

## 3.2 View the beginning of the QNA file

In [None]:
def print_file_top():
    print_lines=40
    with open(qna_file, 'r') as input_file:
        for line_number, line in enumerate(input_file):
            if line_number >= print_lines:  # line_number starts at 0, so exactly print_lines lines are printed.
                break
            print(line, end="")           

if dual_screen:
    with out:
        display(widgets.HTML(H2+"3.2 View the beginning of the QNA file"))
        print_file_top()
else:
    print("3.2 Show QNA file")
    print_file_top()

## 3.3 Place the QNA file in the proper taxonomy directory

In [None]:
#Should produce !mkdir -p ./taxonomy/knowledge/textbooks/culture/movies/awards/oscars
shell_command1 = f"mkdir -p ./taxonomy/{qna_location}"
shell_command2 = f"cp ./{qna_file} ./taxonomy/{qna_location}/qna.yaml"
if dual_screen:
    with out:
        display(widgets.HTML(H2+"3.3 Place the QNA file in the proper taxonomy directory"))
        display(widgets.HTML(Norm+"Place QNA file in taxonomy as: /taxonomy/"+qna_location+"/qna.yaml"))
        !{shell_command1}
        !{shell_command2}
else:
    print("3.3 Place the QNA file in the proper taxonomy directory")
    print("Place QNA file in taxonomy as: /taxonomy/"+qna_location+"/qna.yaml")
    !{shell_command1}
    !{shell_command2}
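
The same placement can be done without shell commands. The sketch below uses only the Python standard library; `place_qna_file` is a hypothetical helper, and the demo runs in a temporary directory with placeholder file content so that it does not touch the real taxonomy.

```python
import shutil
import tempfile
from pathlib import Path


def place_qna_file(qna_file, taxonomy_root, qna_location) -> Path:
    """Create taxonomy_root/qna_location if needed and copy qna_file there as qna.yaml."""
    dest_dir = Path(taxonomy_root) / qna_location
    dest_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    dest = dest_dir / "qna.yaml"
    shutil.copyfile(qna_file, dest)  # same effect as `cp`
    return dest


# Demo in a throwaway directory; the file content is a placeholder, not a valid qna.yaml.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "qna.yaml"
    src.write_text("# placeholder\n")
    placed = place_qna_file(
        src, Path(tmp) / "taxonomy", "knowledge/textbooks/culture/movies/awards/oscars"
    )
    placed_ok = placed.read_text() == "# placeholder\n"
    print(placed.relative_to(tmp))
```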

## 3.4 Verify the taxonomy
We run the `ilab taxonomy diff` command to verify that the taxonomy is valid and to list the newly added qna.yaml file.

In [None]:
if dual_screen:
    with out:
        display(widgets.HTML(H2+"3.4 Verify the taxonomy"))
        !ilab taxonomy diff
else:
    print("Verify the taxonomy")
    !ilab taxonomy diff
run.pause()

<a id="I2_generate"></a>
# Step 4. Generate Synthetic Data

This step produces synthetic training data, in the form of question-and-answer pairs, from the taxonomy repository. InstructLab uses a large teacher model, such as Mixtral 8x7B, to generate synthetic training data from the manually created examples. That data is then used to train a smaller student model, such as Merlinite 7B or Granite 7B.

You can skip this step and use the previously generated 500 synthetic samples found in the *generated* folder.

## 4.1 Set data generation parameters

### Select pipeline

InstructLab has three primary pipelines that can be used: simple, full, and accelerated:
- The **simple pipeline** runs fast and can be used for initial model and data testing.
- The **full pipeline** runs all of the InstructLab steps and takes more time but produces a better tuned model.
- The **accelerated pipeline** runs the full pipeline processing using a GPU, so it produces a similarly tuned model in a shorter time.

**Note:** If you are running with a new or modified dataset, you may want to use the **simple pipeline** for the first run to verify the configuration.

### Select number of samples to generate

Generating 15 synthetic data samples takes about 19 minutes. You may wish to generate a small number on your first run to verify the QNA dataset format.

To produce **sufficient synthetic data** to focus training on the new material, **about 30 synthetic question-and-answer pairs need to be generated** for each question-and-answer pair provided. This requires a proportionally longer time to generate, but provides better training.
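
The 30-to-1 guideline translates into a simple multiplication. The sketch below is only illustrative arithmetic; the figure of 15 provided pairs is a hypothetical example, and `expected_synthetic_pairs` is not an InstructLab function.

```python
def expected_synthetic_pairs(seed_pairs: int, per_seed: int = 30) -> int:
    """Number of synthetic pairs when `per_seed` are generated for each provided pair."""
    if seed_pairs < 0 or per_seed < 0:
        raise ValueError("counts must be non-negative")
    return seed_pairs * per_seed


# Hypothetical qna.yaml with 15 provided question-and-answer pairs.
total = expected_synthetic_pairs(15)
print(total)  # 450
```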

Before following these instructions, ensure the existing model you are adding skills or knowledge to is still running. Alternatively, ilab data generate can start a server for you if you provide a fully qualified model path via --model.

To generate a synthetic dataset based on your newly added knowledge or skill set in taxonomy repository, run the following command:

    ilab data generate

### **Simple Pipeline**

The Simple Pipeline works solely with Merlinite 7b Lab as the teacher model. The Simple Pipeline is called without GPU acceleration as follows:

    ilab data generate --pipeline simple

### **Full Pipeline**

The Full Pipeline runs the full processing with a GPU. Currently, the Full Pipeline only supports the Mixtral and Mistral Instruct family of models as the teacher model, because only those models' prompt templates are supported.

To generate data with the Full Pipeline using a non-default model (such as Mixtral-8x7B-Instruct-v0.1):

    ilab data generate --model ~/.cache/instructlab/models/mistralai/mixtral-8x7b-instruct-v0.1 --pipeline full --gpus 4

**Note:** Synthetic data generation can take from 15 minutes to 1+ hours to complete, depending on your computing resources.

In [None]:
pipe2 = widgets.ToggleButtons(
    options=['Simple', 'Full with GPU', 'Demo (Use prior data)'],
    tooltips=['Ilab Simple Pipeline', 'Full Pipe running on a GPU', 'Demo Run with previously created data'],
    description='Processing:', disabled=False, button_style='', style={"button_width": "auto"}
)
instr=widgets.ToggleButtons(
    options=['15', '50', '200', '450 (default)', '1000'],
    description='# of QNAs:',
    disabled=False,
    button_style='',
    style={"button_width": "auto"}
)

print("Select Pipeline to use")
display(pipe2)
display(instr)
if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 4. Generate Synthetic Data"))
        display(pipe2)
        display(instr)
else:
    print("After making your selections for data generation, please select and run the following cell")        
run.pause()

## 4.2 Run data generation
Generating 15 synthetic data samples takes about 19 minutes; larger sample counts take longer.

In [None]:
directory = "data/"+ use_case+"/ilab_generated/"
if pipe2.value!='Demo (Use prior data)':
    if instr.value == '15':
        sdg_factor="--sdg-scale-factor 1"
    elif instr.value == '50':
        sdg_factor="--sdg-scale-factor 3"
    elif instr.value == '200':
        sdg_factor="--sdg-scale-factor 13"
    elif instr.value == '450 (default)':
        sdg_factor=""
    else:
        sdg_factor="--sdg-scale-factor 67"
    # Pipeline options: 'Simple', 'Full with GPU'
    if pipe2.value == 'Simple':
        pipeline = 'simple'
        model = 'models/merlinite-7b-lab-Q4_K_M.gguf'
        gpus = '--gpus 1'
    elif pipe2.value == 'Full with GPU':
        pipeline = 'full'
        model = 'models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
        gpus = '--gpus 1'
    else:
        if dual_screen:
            with out:
                display(widgets.HTML(Norm+"ERROR: Undefined pipeline"))
    
    #Remove old data so there is only one test_merlinite and train_merlinite after generation
    !rm -rf /home/jovyan/.local/share/instructlab/datasets/*
    #shell_command = f"ilab --verbose data generate --model {model} --num-cpus 10 {gpus} {sdg_factor} --taxonomy-path taxonomy --pipeline {pipeline} --max-num-tokens 512"
    shell_command = f"ilab data generate --model {model} --num-cpus 10 {gpus} {sdg_factor} --taxonomy-path taxonomy --pipeline {pipeline} --max-num-tokens 512"
    if dual_screen:
        with out:
            display(widgets.HTML(H2+"4.2 Run data generation (over 19 min)"))
            display(widgets.HTML(Norm+"Running:<br> !"+shell_command))
            !{shell_command}
    else:
        print("Generating data")
        print("Running: !"+shell_command)
        !{shell_command}

    #Rename results to  test_gen.jsonl and train_gen.jsonl and move to local data directory
    if not os.path.exists(directory):
        print("Create directory: " + directory)
        !mkdir {directory}
    file_cnt=0
    for dirname in os.listdir(il_data_path):
        date_path=il_data_path+'/'+ dirname + '/'
        for filename in os.listdir(date_path):
            if filename[:6]=='train_':
                train_name= 'train_gen.jsonl'
                print('Renaming '+ filename+ ' to ' + train_name)
                !mv {date_path+filename} {directory+train_name}
                file_cnt+=1
            elif filename[:5]=='test_':
                test_name= 'test_gen.jsonl'
                print('Renaming '+ filename+ ' to ' + test_name)
                !mv {date_path+filename} {directory+test_name}
                file_cnt+=1
    if file_cnt < 2:
        with out:
            display(widgets.HTML(Norm+"ERROR: train_gen.jsonl and/or test_gen.jsonl not created"))
        print("ERROR: train_gen.jsonl and/or test_gen.jsonl not created") 
    elif os.path.getsize(directory+train_name) == 0:
        with out:
            display(widgets.HTML(Norm+"ERROR: train_gen.jsonl file is empty"))
        print("ERROR: train_gen.jsonl file is empty")
    elif os.path.getsize(directory+test_name) == 0:
        with out:
            display(widgets.HTML(Norm+"ERROR: test_gen.jsonl file is empty"))
        print("ERROR: test_gen.jsonl file is empty")
    else:
        with out:
            display(widgets.HTML(Norm+"Training and test files successfully created in: " + directory))
        print("Training and test files successfully created in: " + directory)
        
else:    
    if dual_screen:
        with out:
            display(widgets.HTML(Norm+"Using previously generated data"))
    print("Using previously generated data")
run.pause()   
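
The command assembled in the cell above can be sketched as a small function. It uses only the flags that appear in that cell; `build_generate_command` and `scale_factor_for` are hypothetical helpers. The ratio of 15 samples per scale-factor unit is inferred from the cell's own mapping (15, 50, 200 and 1000 samples map to factors 1, 3, 13 and 67) and is an assumption that depends on how many seed pairs the qna.yaml file contains.

```python
import shlex


def scale_factor_for(target_samples: int, samples_per_unit: int = 15) -> int:
    """Approximate --sdg-scale-factor for a target sample count (assumed ratio)."""
    return max(1, round(target_samples / samples_per_unit))


def build_generate_command(model, pipeline, scale_factor=None, gpus=1,
                           num_cpus=10, taxonomy_path="taxonomy", max_num_tokens=512):
    """Assemble the `ilab data generate` command line used in the cell above."""
    args = ["ilab", "data", "generate", "--model", model,
            "--num-cpus", str(num_cpus), "--gpus", str(gpus)]
    if scale_factor is not None:  # omit the flag to keep the CLI default
        args += ["--sdg-scale-factor", str(scale_factor)]
    args += ["--taxonomy-path", taxonomy_path, "--pipeline", pipeline,
             "--max-num-tokens", str(max_num_tokens)]
    return shlex.join(args)


command = build_generate_command(
    "models/merlinite-7b-lab-Q4_K_M.gguf", "simple", scale_factor_for(15)
)
print(command)
```

Building the argument list first and joining it with `shlex.join` avoids stray double spaces when an optional flag is left out.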

## 4.3 Show examples of generated data

In [None]:
if dual_screen:
    with out:
        display(widgets.HTML(H2+"4.3 Show examples of generated data"))
else:
    print("4.3 Show examples of generated data")

for filename in os.listdir(directory):
    if filename[:9]=='train_gen':
        with open(directory+filename, 'r') as syn_file:
            cnt=0
            for line_number, line in enumerate(syn_file):
                if cnt >= 8:
                    break
                jsonLine= json.loads(line)
                syn_user=jsonLine["user"]
                syn_assist=jsonLine["assistant"]
                #Remove a leading "Answer: " prefix from answers for display
                if syn_assist[:8]=="Answer: ":
                    syn_assist=syn_assist[8:]
                cnt+=1
                if dual_screen:
                    with out:
                        display(widgets.HTML(Norm+"Question: "+syn_user + "<br>Answer: " + syn_assist))
                else:
                    print("\nQuestion: "+syn_user+"\nAnswer: "+syn_assist)
                                    
run.pause()
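
The reading logic in the cell above can be sketched as a reusable function that works on any iterable of JSONL lines. It assumes each record has the `user` and `assistant` keys read by the cell; `read_qa_pairs` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json


def read_qa_pairs(lines, limit=8, prefix="Answer: "):
    """Return up to `limit` (question, answer) tuples from JSONL lines."""
    pairs = []
    for line in lines:
        if len(pairs) >= limit:
            break
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        answer = record["assistant"]
        if answer.startswith(prefix):  # drop the leading label for display
            answer = answer[len(prefix):]
        pairs.append((record["user"], answer))
    return pairs


# Made-up records in the same shape as the generated training file.
sample_lines = [
    '{"user": "Q1?", "assistant": "Answer: A1"}',
    "",
    '{"user": "Q2?", "assistant": "A2"}',
]
pairs = read_qa_pairs(sample_lines)
for question, answer in pairs:
    print("Question:", question, "| Answer:", answer)
```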

<a id="I2_train"></a>
# Step 5. Train the Model

## 5.1 Select the model training pipeline

InstructLab has three primary model training pipelines: simple, full (default), and accelerated. For all of the pipelines, the training time can be limited by adjusting the `--num-epochs` parameter. The maximum number of epochs for running the InstructLab end-to-end workflow is 10.

### **Simple pipeline**

The simple pipeline uses an SFT Trainer on Linux and MLX on macOS. This type of training takes roughly an hour and produces the lowest fidelity model, but it should indicate whether your data is being picked up by the training process. The simple pipeline only works with Merlinite 7b Lab as the teacher model. On this Linux system, the trained model is saved in the models directory as ggml-model-f16.gguf.

The command form is:

    ilab model train --pipeline simple

**Note:** This process will take a little while to complete (time can vary based on hardware and output of ilab data generate but on the order of 5 to 15 minutes)

### **Full pipeline**

The full pipeline uses a custom training loop and data processing functions for the Granite family of models. This loop is optimized for CPU and MPS functionality. On this Linux system, use **--pipeline=full** in combination with **--device=cpu**. On a macOS system, you can use --device=mps or --device=cpu; MPS is optimized for better performance on macOS. The full pipeline only works with the Mixtral and Mistral Instruct family of models as the teacher model. For the full pipeline, the models are saved in the ~/.local/share/instructlab/checkpoints directory. The InstructLab command "ilab model evaluate" can be used to choose the best one.

The command form is:

    ilab model train


**Note:** This process will take a while to complete. If you run for ~8 epochs it will take several hours.

### **Accelerated pipeline**

The accelerated pipeline uses the instructlab-training library, which supports GPU-accelerated and distributed training. The full loop and data processing functions are either pulled directly from or based on the work in this library. For the accelerated pipeline, the models are saved in the ~/.local/share/instructlab/checkpoints directory. The InstructLab command "ilab model evaluate" can be used to choose the best one. Training supports GPU acceleration with Nvidia CUDA or AMD ROCm. Please see the GPU acceleration documentation for more details. At present, hardware acceleration requires a data center GPU or high-end consumer GPU with at least 18 GB free memory.

The command form is:

    ilab model train --pipeline accelerated --device cuda --data-path <path-to-sdg-data>
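
The three command forms above differ only in which options are supplied. As a minimal sketch, they can be assembled from optional arguments; `build_train_command` is a hypothetical helper, and it uses only flags that appear in this notebook's command forms and training cell (`--pipeline`, `--device`, `--model-path`, `--data-path`, `--num-epochs`, `--iters`). The data file name in the example is a placeholder.

```python
import shlex


def build_train_command(pipeline=None, device=None, data_path=None,
                        model_path=None, num_epochs=None, iters=None) -> str:
    """Assemble an `ilab model train` command line, skipping options left as None."""
    args = ["ilab", "model", "train"]
    options = [
        ("--pipeline", pipeline),
        ("--device", device),
        ("--model-path", model_path),
        ("--data-path", data_path),
        ("--num-epochs", num_epochs),
        ("--iters", iters),
    ]
    for flag, value in options:
        if value is not None:
            args += [flag, str(value)]
    return shlex.join(args)


default_cmd = build_train_command()
simple_cmd = build_train_command(pipeline="simple")
accelerated_cmd = build_train_command(
    pipeline="accelerated", device="cuda", data_path="train_gen.jsonl"
)
print(default_cmd)
print(simple_cmd)
print(accelerated_cmd)
```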

### **Multiphase**

Multiphase training trains the model locally in two phases with GPU acceleration. Evaluation is run after each phase, and InstructLab reports which checkpoint performs best. This results in the following workflow:
1. Train the model on knowledge.
1. Evaluate the trained model to find the best checkpoint.
1. Train the model on skills.
1. Evaluate the model to find the best overall checkpoint.

Phase 1 (knowledge training) models are saved in ~/.local/share/instructlab/phased/phase1/checkpoints. Phase 2 (skills training) models are saved in ~/.local/share/instructlab/phased/phase2/checkpoints. Evaluation is run for phase 2 to identify the best checkpoint.

A multiphase training is of the form:

    ilab model train --strategy lab-multiphase --phased-phase1-data <knowledge train messages jsonl> --phased-phase2-data <skills train messages jsonl> -y

**Note:** This command may take 3 or more hours depending on the size of the data and number of training epochs you run.

### **Skills only**

Phase 2 (skills training) models are saved in ~/.local/share/instructlab/phased/phase2/checkpoints. Evaluation is run for phase 2 to identify the best checkpoint.



In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()

pipe3 = widgets.ToggleButtons(
    options=['Simple', 'Full with CPU', 'Accelerated GPU', 'Demo (Use prior data)'],
    tooltips=['Ilab Simple Pipeline', 'Full Pipe running only on a CPU', 'Accelerated pipeline run on a GPU', 'Demo Run with previously created data'],
    description='Processing', disabled=False, button_style='', style={"button_width": "auto"}
)
epoch=widgets.ToggleButtons(
    options=['1', '2', '3', '4', '5', '10', '15'],
    description='Epochs:',
    disabled=False, button_style='', style={"button_width": "auto"}
)
it=widgets.ToggleButtons(
    options=['1', '3', '5','10','20','50','100','200'],
    description='Iterations:',
    disabled=False, button_style='', style={"button_width": "auto"}
)

print("Select to Continue or to Train the model")
#pipe3.value=pipe.value
display(pipe3)
display(epoch)
display(it)
if dual_screen:
    with out:
        out.clear_output()
        display(widgets.HTML(H1+"Step 5. Train the Model"))
        display(widgets.HTML(H2+"5.1 Select the model training pipeline"))
        display(widgets.HTML(Norm+"<br>Select the processing pipeline for this run. Use Simple for the first run on new data to verify configuration."))
        display(pipe3)
        display(epoch)
        display(it)
        display(widgets.HTML(Norm+"<br>"))
else:
    print("After choosing your training options, please select and run the following cell")
run.pause()

## 5.2 Run the model training

Model training takes 10 minutes for 1 epoch and 1 iteration. This minimal training could be used for testing the generation and training for a new set of data.

To produce a higher quality model, more epochs and iterations are needed for refining the model. This will require a proportionally longer time to train the model.

In [None]:
data_path="data/"+ use_case+"/ilab_generated/"
train_data=data_path+"train_gen.jsonl"
model_path="models/instructlab/granite-7b-lab"
#model_path='/home/jovyan/.cache/instructlab/models/instructlab/granite-7b-lab'
trained_model_path="data/"+ use_case+"/new_model/"

#Options: 'Simple', 'Full with CPU', 'Accelerated GPU', 'Demo (Use prior data)'
if pipe3.value=='Demo (Use prior data)':
    if dual_screen:
        with out:
            display(widgets.HTML(H1+"Step 5. Train the Model"))
            display(widgets.HTML(H2+"5.2 Run the model training"))
            display(widgets.HTML(Norm+"Using previously trained data"))
    print("Using previously trained data")
else:
    # Confirm that both generated files exist and are not trivially small.
    # Paths are built from data_path so this cell does not depend on variables
    # that are only set when data generation was run in this session.
    test_data=data_path+"test_gen.jsonl"
    files_ok = all(os.path.isfile(p) and os.path.getsize(p) >= 5 for p in (train_data, test_data))
    if not files_ok:
        with out:
            display(widgets.HTML(Norm+"ERROR: train_gen.jsonl and/or test_gen.jsonl are not present or too small"))
        print("ERROR: train_gen.jsonl and/or test_gen.jsonl are not present or too small")
    
    if not os.path.exists(trained_model_path):
        print("Create directory: " + trained_model_path)
        !mkdir {trained_model_path}
    ep=int(epoch.value)
    its=int(it.value)
    if pipe3.value=='Simple':
        if dual_screen:
            with out:
                display(widgets.HTML(H1+"Step 5. Train the Model"))
                display(widgets.HTML(H2+"5.2 Run model training"))
                display(widgets.HTML(Norm+"Train with the simple pipeline on a CPU"))
        print("Train with the simple pipeline on a CPU")
        shell_command = f"ilab model train --pipeline simple --model-path {model_path} --data-path {data_path} --device cpu --num-epochs {ep} --iters {its}"
    elif pipe3.value=='Full with CPU':
        if dual_screen:
            with out:
                display(widgets.HTML(H1+"Step 5. Train the Model"))
                display(widgets.HTML(H2+"5.2 Run model training"))
                display(widgets.HTML(Norm+"Train with a CPU"))
        print("Train with a CPU")
        shell_command = f"ilab model train --pipeline full --model-path {model_path} --data-path {train_data} --device cpu"
    elif pipe3.value=='Accelerated GPU':
        if dual_screen:
            with out:
                display(widgets.HTML(H1+"Step 5. Train the Model"))
                display(widgets.HTML(H2+"5.2 Run model training"))
                display(widgets.HTML(Norm+"Train with a GPU"))
        print("Train with a GPU")
        shell_command = f"ilab model train --pipeline accelerated --device cuda --model-path {model_path} --data-path {train_data} --num-epochs {ep} --iters {its}"
    if dual_screen:
        with out:
            display(widgets.HTML(Norm+"Running: !"+shell_command))
            !{shell_command}
            if pipe3.value=='Accelerated GPU' or pipe3.value=='Accelerated GPU with 4b Quantization':
                !ilab model evaluate --benchmark mmlu
    if not dual_screen:
        print("Running: !"+shell_command)
        !{shell_command}
        if pipe3.value=='Accelerated GPU' or pipe3.value=='Accelerated GPU with 4b Quantization':
            !ilab model evaluate --benchmark mmlu
    #Move the model to the use_case/new_model directory
    if dual_screen:
        with out:
            display(widgets.HTML(Norm+"Move the trained model to the directory: "+trained_model_path))
    print("Moving the trained model to the directory: "+trained_model_path)
    !mv /home/jovyan/.local/share/instructlab/checkpoints/ggml-model-f16.gguf {trained_model_path}
run.pause()    
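
The file check at the start of the training cell can be sketched as a standalone function that reports which file is the problem. `find_problem_files` is a hypothetical helper; the 5-byte threshold and the file names follow the cell above, and the demo uses a temporary directory with placeholder content.

```python
import os
import tempfile


def find_problem_files(data_dir, names=("train_gen.jsonl", "test_gen.jsonl"), min_bytes=5):
    """Return a list of messages for files that are missing or smaller than min_bytes."""
    problems = []
    for name in names:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name}: missing")
        elif os.path.getsize(path) < min_bytes:
            problems.append(f"{name}: smaller than {min_bytes} bytes")
    return problems


# Demo: only the training file is present, so the test file is reported.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "train_gen.jsonl"), "w") as f:
        f.write('{"user": "Q?", "assistant": "A"}\n')
    problems = find_problem_files(tmp)
print(problems)
```

An empty list means both files are present and large enough to attempt training.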

<a id="I2_test"></a>
# Step 6. Test the Model
## 6.1 Run test on the model to see how it performs

In [None]:
#!ilab model test

## 6.2 Evaluate the model

In [None]:
#ILAB_MODELS_DIR=$HOME/.local/share/instructlab/models
#shell_command=f"ilab model evaluate --benchmark mmlu --model {ILAB_MODELS_DIR}/instructlab/granite-7b-test"
#!{shell_command}

# Run the Inferencing with InstructLab notebook to assess improvements
You have now completed InstructLab training. You can now run the [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb) notebook to ask questions of both the base and InstructLab-trained models and compare their answers.

In [None]:
if dual_screen:
    with out:
        display(widgets.HTML(H1+"You have created a trained model for the " + data_set.value + " data set"))
        display(widgets.HTML(H2+"To assess improvements, run the Inferencing with InstructLab notebook"))
else: 
    print("You have created a trained model for the " + data_set.value + " data set")
    print("To assess improvements, run the Inferencing with InstructLab notebook")
run.resume()

<a id="I1_conclusion"></a>
# Conclusion

This notebook demonstrated utilizing InstructLab for introducing datasets, data generation, model training, and model creation. This notebook produced an InstructLab trained model that can be used for inferencing. The following notebook [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb) can be used to run queries on your model.

<a id="I1_learn"></a>
# Learn More

Proceed to the [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb) notebook to run inferencing on the InstructLab trained model. This will allow you to interact with your model to see how well it performs on queries, both before and after InstructLab training.

This notebook is based on the InstructLab CLI repository available [here](https://github.com/instructlab/instructlab).

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation