# Creating InstructLab Taxonomies

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Questions and support: kochel@us.ibm.com, IBM.Research.JupyterLab@ibm.com
<li>Release date: 2025-03-20
</ul>

# Summary
This Jupyter notebook helps you compile taxonomies for InstructLab, an open source AI project for contributing knowledge and skills to Large Language Models (LLMs). InstructLab uses a novel synthetic data-based alignment tuning method for LLMs introduced in this [paper](https://arxiv.org/abs/2403.01081). The open source InstructLab repository is available [here](https://github.com/instructlab/instructlab) and provides additional documentation on using InstructLab.

The InstructLab method consists of three major components:
* **Taxonomy-driven data curation:**  The taxonomy is a set of training data curated by humans as examples of new knowledge and skills for the model.
* **Large-scale synthetic data generation:** A teacher model is used to generate new examples based on the seed training data. Since synthetic data can vary in quality, InstructLab adds an automated step to refine the example answers, ensuring they are grounded and safe.
* **Iterative model alignment tuning:** The model is retrained based on the synthetic data. The InstructLab method includes two tuning phases: knowledge tuning, followed by skill tuning.

<img src="https://github.com/KenOcheltree/ilab-colab/blob/main/data/images/Flow.png?raw=1" width="800">

InstructLab can take the form of an open source installation or a Red Hat AI InstructLab installation. In this notebook, we will demonstrate the open source version of InstructLab running on Colab with a GPU, broken into the following major sequential sections:
* Configuring InstructLab
* Generating Synthetic Data
* Training with InstructLab
* Inferencing with InstructLab

# Running this Notebook

**IMPORTANT:** This notebook must be run within a Colab GPU runtime. You can check that you are running with a GPU by selecting Runtime->Change runtime type and confirming that a GPU runtime is selected. While this notebook can be started on a free Colab account, the GPUs available with free access do not have sufficient memory to run InstructLab training.
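
To see how much memory the assigned GPU actually has, a minimal sketch along the following lines may help. It assumes PyTorch is importable in the runtime; the helper names `bytes_to_gib` and `describe_gpu` are hypothetical and not part of InstructLab or this notebook.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


def describe_gpu() -> str:
    """Return a one-line description of GPU 0, or the reason none was found."""
    try:
        import torch  # assumption: PyTorch is installed in the runtime
    except ImportError:
        return "PyTorch is not installed, so the GPU cannot be queried this way."
    if not torch.cuda.is_available():
        return "No GPU detected; switch to a GPU runtime."
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {bytes_to_gib(props.total_memory):.1f} GiB"


print(describe_gpu())
```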

You can run this notebook either:
- All at once, by selecting Runtime->Run all
- Cell by cell, by selecting the arrow on each code cell and running them sequentially.

Once the Configure InstructLab section has been run, the other sections of this notebook can be run repeatedly on other data sets.

# Section 1. Configure InstructLab

## Step 1.1 Environment Configuration
Clone the ilab data repository, which contains the pip requirements and data files, and run the pip installs that require a session restart.

**IMPORTANT:** Run the next cell and let it finish, then restart the session. After the restart, run the following cell to specify parameters; you can then run the remainder of the notebook. Ignore any spurious pip install warnings or errors.

After selecting parameters, you can run the remainder of this notebook either:
- All at once, by selecting Runtime->Run cell and below
- Cell by cell, by selecting the arrow on each code cell and running them sequentially.


In [None]:
# Run this cell, then perform the requested Reset
import os
if not os.path.exists("ilab"):
    !git clone https://github.com/KenOcheltree/ilab-test.git ilab

## Step 1.2 Optionally, provide your own InstructLab QNA data set

You can optionally provide your own InstructLab QNA file for processing in this step.

**Note:** You may want to run this notebook with an existing dataset before creating your own to understand the InstructLab flow.

Follow these steps to add your own dataset:
1. Create your own qna.yaml file following the directions on the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. Create a questions.txt file with related sample questions to use on inferencing.
1. Add your qna.yaml and sample questions.txt files to the /content/ilab/data/your_content_1 folder or the /content/ilab/data/your_content_2 folder by dragging and dropping them in the desired folder.
1. Double click on the /content/ilab/config.json file to edit and specify the qna_location where your data resides within the Dewey Decimal classification system. Close and save the config.json file.
1. You can now specify to run with your own data by selecting **Your Content 1** or **Your Content 2** in the next code cell.
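
The config.json edit in the steps above can also be scripted. The sketch below assumes the file uses the `use_cases` and `qna_location` keys that the notebook reads later on; the helper name `set_qna_location` and the taxonomy path in the example comment are hypothetical.

```python
import json
from pathlib import Path


def set_qna_location(config_path, use_case, qna_location):
    """Set use_cases[use_case]["qna_location"] in a config.json-style file.

    Missing "use_cases" or use-case entries are created; other keys are kept.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    entry = config.setdefault("use_cases", {}).setdefault(use_case, {})
    entry["qna_location"] = qna_location
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (the taxonomy path is a made-up placeholder):
# set_qna_location("/content/ilab/config.json", "your_content_1",
#                  "knowledge/example/topic")
```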

## Step 1.3 Select InstructLab Parameters
Run this next cell, select the following parameters, then follow the directions in the next text cell to run the notebook.

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns", and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select your own custom dataset in the cell below.

In [None]:
# Run this second cell to show parameters
import ipywidgets as widgets
#See instructions on placing your hf_token in colab userdata
from google.colab import userdata
hf_token=userdata.get('hf_token')
data_set = widgets.ToggleButtons(
    options=['2024 Oscars', 'Quantum', 'Agentic AI', 'Your Content 1', 'Your Content 2'],
    description='Dataset:', style={"button_width": "auto"}
)
questions=widgets.ToggleButtons(options=['Yes','No'],description='Live Q&A:',style={"button_width":"auto"})
download=widgets.ToggleButtons(options=['Yes','No'],description='Download:',style={"button_width":"auto"}
)
print("\nSelect the Dataset for this run:")
display(data_set)
print("Select what to do with the taxonomy after creation:")
questions.value="Yes"
display(questions)
download.value="No"
display(download)
print("After selecting the parameters, select the next cell and then choose Runtime->Run cell and below")
print("When that run completes, you can come here, choose different parameters and rerun at the next cell with Runtime->Run cell and below")
print("Note: You can also go back and rerun individual sections of the notebook with different parameters.")

## Step 1.4 Complete Environment Set Up and Optionally Run All
This next code cell installs the remainder of the required pip packages and takes about 7 minutes to run.

If you perform **Runtime->Run cell and below** on this cell, the rest of the notebook will take about an hour to run. After running, it will prompt you for questions to pose to the pre-trained and trained models so you can test improvements in the model.

**Note:** Please ignore the pip dependency errors that appear in the output of the pip installs. They do not affect the successful running of InstructLab.

In [None]:
# Run the rest of the notebook by selecting this third cell and choosing "Runtime->Run cell and below"
!pip install -r ilab/requirements.txt
!pip list

Wrap code cell output for ease of reading

In [None]:
from IPython.display import HTML, display
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Step 1.5 Check Starting Configuration
### Check InstructLab Version

Check that InstructLab is installed properly and is configured for using a GPU.

The first line of the 'InstructLab' section gives the InstructLab version.

In [None]:
!ilab system info

<a id="IL1_check"></a>
## Perform Imports and Check for a GPU

This code cell checks for a GPU in the configuration. This notebook requires a GPU in the configuration to run properly.

In [None]:
import os
import torch
from IPython.display import Image, display
from datasets import load_dataset
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json
import subprocess
import shutil
import ruamel.yaml
os.environ['NUMEXPR_MAX_THREADS'] = '64'
Norm = "<p style='font-family:IBM Plex Sans;font-size:20px'>"

notebook_dir='/content/ilab/'
os.chdir(notebook_dir)

## torch and cuda version check
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

if not torch.cuda.is_available():
    print("No GPU in configuration")
else:
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    print("GPU(s) are Available")
    gpus=torch.cuda.device_count()
    if gpus==1:
      gpu_type=torch.cuda.get_device_name(0)
      print("One GPU of Type: ", gpu_type)
    else:
      print("ERROR: More than 1 GPU in configuration: ",gpus)
print("Starting directory: "+ os.getcwd())

<a id="IL1_config"></a>
## Step 1.6 Configure InstructLab

### Create InstructLab config file
The InstructLab configuration is captured in the *config.yaml* file. This step creates the config.yaml file and sets:
- **taxonomy_path = taxonomy** - the root location of the taxonomy is set to the taxonomy folder in instructlab-latest
- **model_path = models/granite-7b-lab-Q4_K_M.gguf** - the default model is set to granite, matching the code cell below

**Note:** If you initialize InstructLab on your own system, it uses the following default directories:
* **Downloaded Models:**  ~/.cache/instructlab/models/ - Contains all downloaded large language models, including the saved output of ones you generate with ilab.
* **Synthetic Data:** ~/.local/share/instructlab/datasets/ - Contains data output from the SDG phase, built on modifications to the taxonomy repository.
* **Taxonomy:** ~/.local/share/instructlab/taxonomy/ - Contains the skill and knowledge data.
* **Training Output:** ~/.local/share/instructlab/checkpoints/ - Contains the output of the training process.
* **config.yaml:** ~/.config/instructlab/config.yaml - Contains the config.yaml file
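
As a rough illustration, the default locations listed above can be resolved against the current user's home directory and checked for existence. The helper name `instructlab_default_dirs` is hypothetical; only the relative paths come from the list above.

```python
from pathlib import Path


def instructlab_default_dirs(home=None):
    """Map each default InstructLab location to an absolute path under `home`."""
    home = Path(home) if home is not None else Path.home()
    return {
        "models": home / ".cache/instructlab/models",
        "datasets": home / ".local/share/instructlab/datasets",
        "taxonomy": home / ".local/share/instructlab/taxonomy",
        "checkpoints": home / ".local/share/instructlab/checkpoints",
        "config": home / ".config/instructlab/config.yaml",
    }


# Report which of the default locations exist on this machine.
for name, path in instructlab_default_dirs().items():
    status = "exists" if path.exists() else "missing"
    print(f"{name:12} {path} ({status})")
```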

In [None]:
#Remove Colab Sample directory
if os.path.exists("sample_data"):
    print("removing sample_data")
    shutil.rmtree("sample_data")
    os.chdir("ilab")

#Initialize ilab
base_dir="/root/"
##Choose the base model as granite or mixtral
model_dir="models"
model_name="granite-7b-lab-Q4_K_M.gguf"
model_path = os.path.join(model_dir, model_name)

taxonomy_path='taxonomy'

## Define the file name
file_name = "config.yaml"
if os.path.exists(file_name):
    os.remove(file_name)
    print(f"ilab was already initialized. {file_name} has been deleted. Reinitialized")
else:
    print(f"ilab was not initialized yet. {file_name} does not exist.")

##Remove old data
if os.path.exists("taxonomy"):
    print("removing taxonomy")
    shutil.rmtree("taxonomy")
if os.path.exists(base_dir+".cache/instructlab"):
    print("removing " + base_dir+".cache/instructlab")
    shutil.rmtree(base_dir+".cache/instructlab")
if os.path.exists(base_dir+".config/instructlab"):
    print("removing " + base_dir+".config/instructlab")
    shutil.rmtree(base_dir+".config/instructlab")
if os.path.exists(base_dir+".local/share/instructlab"):
    print("removing " + base_dir+".local/share/instructlab")
    shutil.rmtree(base_dir+".local/share/instructlab")

print(f"ilab model is {model_path}.")
print('#############################################################')
print(' ')

command = f"""
ilab config init<<EOF
{taxonomy_path}
Y
{model_path}
0
EOF
"""

## Using the ! operator to run the command
!echo "Running ilab config init"
!{command}
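
Outside a notebook, where the `!` operator is not available, the same answers could be piped to the command with `subprocess`. This is only a sketch: it assumes `ilab config init` reads its answers from standard input in the same order as the heredoc above, and the helper names are hypothetical.

```python
import subprocess


def build_init_answers(taxonomy_path, model_path):
    """Return the same four answers, in the same order, as the heredoc above."""
    return "\n".join([taxonomy_path, "Y", model_path, "0"]) + "\n"


def run_config_init(taxonomy_path, model_path):
    """Run `ilab config init`, feeding the answers through standard input."""
    answers = build_init_answers(taxonomy_path, model_path)
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(
        ["ilab", "config", "init"], input=answers, text=True, check=True
    )


# Example call, using the values set in the cell above:
# run_config_init("taxonomy", "models/granite-7b-lab-Q4_K_M.gguf")
```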

### Display the config.yaml file
We examine the base configuration to identify the parameters to change in the next step.

In [None]:
##to copy config.yaml to local directory
!cp /root/.config/instructlab/config.yaml .
!cat config.yaml

<a id="IL2_0"></a>
# Section 2. Create Taxonomy


This section demonstrates creating a taxonomy for InstructLab by placing a question-and-answer data file in the taxonomy repository. It is part of a sequential notebook, so please ensure that you have run Section 1, Configure InstructLab, before running it.


The steps in this section are as follows:
* Step 2.1 Specify the Data for this Run
* Step 2.2 Create the Taxonomy Data Repository

<a id="IL2_data"></a>
## Step 2.1 Specify the Data for this Run

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns" and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select your own custom dataset in the cell below.

### Optionally, Create your own data set for InstructLab

You can optionally provide your own InstructLab QNA file for processing in this step.

Follow these steps to add your own dataset:
1. Create your own qna.yaml file following the directions on the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. Create a questions.txt file with related sample questions to use on inferencing.
1. Add your qna.yaml and sample questions.txt files to the /content/ilab/data/your_content_1 folder or the /content/ilab/data/your_content_2 folder by dragging and dropping them in the desired folder.
1. Double click on the /content/ilab/config.json file to edit and specify the qna_location where your data resides within the Dewey Decimal classification system. Close and save the config.json file.
1. You can now specify to run with your own data by selecting **Your Content 1** or **Your Content 2** in the next code cell.
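
Before selecting your own content, it can be worth confirming that both files landed in the expected folder. The sketch below uses the file names and folder layout from the steps above; the helper name `missing_dataset_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the steps above.
REQUIRED_FILES = ("qna.yaml", "questions.txt")


def missing_dataset_files(data_root, use_case):
    """Return the required files that are absent from data_root/use_case."""
    folder = Path(data_root) / use_case
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Example call for the first custom slot:
# print(missing_dataset_files("/content/ilab/data", "your_content_1"))
```

An empty list means both files are in place.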

In [None]:
print("\nSelect the QNA dataset to add:")
display(data_set)
print("After choosing your dataset, please select and run the following cell")

In [None]:
print("Step 2.1 Specify the Data for this Run")
if data_set.value=='2024 Oscars':
    use_case="oscars"
elif data_set.value=='Quantum':
    use_case="quantum"
elif data_set.value=='Agentic AI':
    use_case="agentic_ai"
elif data_set.value=='Your Content 1':
    use_case="your_content_1"
elif data_set.value=='Your Content 2':
    use_case="your_content_2"
else:
    use_case="undefined"
    print("ERROR: Undefined data set: " + data_set.value + " data")

with open('config.json', 'r') as f:
    jsonData = json.load(f)

if use_case!="undefined":
    qna_file="data/" + use_case + "/qna.yaml"
    qna_location=jsonData["use_cases"][use_case]["qna_location"]
    print("Using " + data_set.value + " data")
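
The selection logic in the cell above could also be written as a lookup table, which keeps the button labels and folder names in one place. This is an alternative sketch, not the notebook's own code; `USE_CASES` and `resolve_use_case` are hypothetical names.

```python
# Button label -> data folder name, mirroring the if/elif chain above.
USE_CASES = {
    "2024 Oscars": "oscars",
    "Quantum": "quantum",
    "Agentic AI": "agentic_ai",
    "Your Content 1": "your_content_1",
    "Your Content 2": "your_content_2",
}


def resolve_use_case(label):
    """Return (use_case, qna_file) for a button label, or raise ValueError."""
    try:
        use_case = USE_CASES[label]
    except KeyError:
        raise ValueError(f"Undefined data set: {label}") from None
    return use_case, f"data/{use_case}/qna.yaml"


# Example call:
# use_case, qna_file = resolve_use_case(data_set.value)
```

Raising an error for an unknown label stops the run early, instead of failing later when `qna_file` is first used.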

<a id="IL2_taxonomy"></a>
## Step 2.2 Create the Taxonomy Data Repository
Delete the prior repository, clone the empty taxonomy repository, and place the QNA file in it.


In [None]:
#Delete the prior repository and clone the empty taxonomy repository
print("Delete the prior repository and clone the empty taxonomy repository")
shell_command1 = f"rm -rf taxonomy"
taxonomy_repo=jsonData["taxonomy_repo"]
shell_command2 = f"git clone {taxonomy_repo}"
!{shell_command1}
!{shell_command2}

#show the QNA file
print("Show the QNA file")
print_lines=40
with open(qna_file, 'r') as input_file:
    for line_number, line in enumerate(input_file):
        if line_number >= print_lines:  # line_number starts at 0, so this prints print_lines lines.
            break
        print(line, end="")

# Place the QNA file in the proper taxonomy directory
print("Place QNA file in taxonomy as: /taxonomy/"+qna_location+"/qna.yaml")
shell_command1 = f"mkdir -p ./taxonomy/{qna_location}"
shell_command2 = f"cp ./{qna_file} ./taxonomy/{qna_location}/qna.yaml"
!{shell_command1}
!{shell_command2}

print("Verify the taxonomy")
!ilab taxonomy diff
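
The `mkdir -p` and `cp` shell commands in the cell above have a pure-Python equivalent, sketched here with `pathlib` and `shutil`. The helper name `place_qna` is hypothetical.

```python
import shutil
from pathlib import Path


def place_qna(qna_file, taxonomy_root, qna_location):
    """Copy qna_file to taxonomy_root/qna_location/qna.yaml and return that path."""
    target_dir = Path(taxonomy_root) / qna_location
    target_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    target = target_dir / "qna.yaml"
    shutil.copyfile(qna_file, target)  # overwrites an existing qna.yaml
    return target


# Example call, using the variables from the cells above:
# place_qna(qna_file, "taxonomy", qna_location)
```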

# Section 5. Download the Trained Model
Now that we have a model trained on our dataset, we can download the trained model for further testing and use.

In [None]:
print("Do you want to download the trained model to your local machine?")
display(download)
print("After making your selection, please select and run the following cell")

Select and run the next cell to download if selected.

In [None]:
from google.colab import files
if download.value=='Yes':
  files.download(trained_model)

<a id="IL3_conclusion"></a>
# Conclusion

This notebook demonstrated using InstructLab to introduce datasets, generate data, train a model, and create a model. It produced an InstructLab-trained model that was available for inferencing and downloading.

<a id="IL3_learn"></a>
# Learn More

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

This notebook is based on the InstructLab CLI repository available [here](https://github.com/instructlab/instructlab).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation