# Creating InstructLab Taxonomies

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Questions and support: kochel@us.ibm.com, IBM.Research.JupyterLab@ibm.com
<li>Release date: 2025-03-20
</ul>

# Summary
This Jupyter notebook helps compile taxonomies for InstructLab, an open source AI project that enables knowledge and skills contributions to Large Language Models (LLMs). This notebook performs the following:
1. Accepts one or more Question and Answer (QNA) files as input
1. Places the QNA files in the desired place in a taxonomy
1. Verifies the taxonomy by running the `ilab taxonomy diff` command
1. Creates a tar file of the taxonomy that can be used for Synthetic Data Generation
1. Downloads the tar file or provides it to the Red Hat AI InstructLab Service

This notebook can be run in the free Colab environment. You can run it in either of two ways:
- Run all cells by selecting Runtime->Run all
- Run cell by cell by selecting the arrow on each code cell and running them sequentially.


# Step 1. Set up Environment and Show Parameters

Clone the ilab data repository containing the pip requirements and data files.


After selecting parameters, you can run the remainder of this notebook in either of two ways:
- Run all remaining cells by selecting Runtime->Run cell and below
- Run cell by cell by selecting the arrow on each code cell and running them sequentially.

## Select InstructLab Parameters

Run the next cell, select the following parameters, then follow the directions in the next text cell to run the notebook.

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns", and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select your own custom dataset in the cell below.

In [None]:
# Install these items first to avoid a later reset
!pip install psutil==7.0.0 pillow==10.4.0 --quiet

import os
os.chdir('/content/')
if os.path.exists("ilab"):
    !rm -rf ilab
!git clone https://github.com/KenOcheltree/ilab-test.git --quiet --recurse-submodules ilab

# Run this second cell to show parameters
import ipywidgets as widgets
data_set = widgets.ToggleButtons(
    options=['2024 Oscars', 'Quantum', 'Agentic AI', 'Your Content 1', 'Your Content 2'],
    description='Dataset:', style={"button_width": "auto"}
)
tar = widgets.ToggleButtons(
    options=['Yes', 'No'],
    description='Create Tar:', style={"button_width": "auto"}
)
download = widgets.ToggleButtons(
    options=['Yes', 'No'],
    description='Download:', style={"button_width": "auto"}
)
print("\nSelect the Dataset for this run:")
display(data_set)
print("Select what to do with the taxonomy after creation:")
tar.value="Yes"
display(tar)
download.value="No"
display(download)
print("After selecting the parameters, select the next cell and then choose Runtime->Run cell and below")

# Step 2. Provide Taxonomy data

You can provide your own InstructLab QNA files for processing in this step. You may want to run this notebook once with an existing dataset before creating your own, to understand the taxonomy creation flow.




Follow these steps to add your own dataset:
1. Create your own qna.yaml files following the directions on the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. Create a questions.txt file with related sample questions to use during inference.
1. Add your qna.yaml and sample questions.txt files to the /content/ilab/data/your_content_1 folder or the /content/ilab/data/your_content_2 folder by dragging and dropping them in the desired folder.
1. Double click on the /content/ilab/config.json file to edit and specify the qna_location where your data resides within the Dewey Decimal classification system. Close and save the config.json file.
1. You can now specify to run with your own data by selecting **Your Content 1** or **Your Content 2** in the next code cell.
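
Before selecting your own content, it can help to confirm that the folder holds the files the steps above call for. The following is a minimal sketch and not part of the original notebook: the helper name `check_content_folder` is hypothetical, and it assumes that QNA file names begin with `qna` (in any letter case) and that the sample questions file is named `questions.txt`.

```python
from pathlib import Path

def check_content_folder(folder):
    """Return a list of problems found in a custom content folder (hypothetical helper)."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"Folder not found: {folder}"]
    problems = []
    names = [p.name for p in folder.iterdir() if p.is_file()]
    # At least one QNA file is expected; match the "qna" prefix case-insensitively
    if not any(n.lower().startswith("qna") for n in names):
        problems.append("No QNA file found (expected a name starting with 'qna')")
    # Assumption: the sample questions file is named exactly questions.txt
    if "questions.txt" not in names:
        problems.append("No questions.txt file found")
    return problems
```

For example, `check_content_folder("/content/ilab/data/your_content_1")` returns an empty list when both kinds of file are present.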

## 1.4 Complete Environment Set Up and Optionally Run All
This code cell installs the remainder of the required pip packages and takes a few minutes to run.

The first half of the cell wraps the code cell output for all following cells for ease of reading.

Check the InstructLab version.

If you perform **Runtime->Run cell and below** on this cell, the rest of the notebook will take about 10 minutes to run. After running, it will present a prompt for providing questions to the pre-trained and trained models to test improvements in the model.


## Step 1.6 Perform Imports and Configure InstructLab

### Create InstructLab config file
The InstructLab configuration is captured in the *config.yaml* file. This step creates the config.yaml file and sets:
- **taxonomy_path = taxonomy** - the root location of the taxonomy is set to the taxonomy folder in instructlab-latest
- **model_path = models/granite-7b-lab-Q4_K_M.gguf** - the default model is set to granite, matching the value used in the code cell below

**Note:** If you initialize InstructLab on your own system, it uses the following default directories:
* **Downloaded Models:**  ~/.cache/instructlab/models/ - Contains all downloaded large language models, including the saved output of ones you generate with ilab.
* **Synthetic Data:** ~/.local/share/instructlab/datasets/ - Contains data output from the SDG phase, built on modifications to the taxonomy repository.
* **Taxonomy:** ~/.local/share/instructlab/taxonomy/ - Contains the skill and knowledge data.
* **Training Output:** ~/.local/share/instructlab/checkpoints/ - Contains the output of the training process.
* **config.yaml:** ~/.config/instructlab/config.yaml - Contains the config.yaml file
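
To see which of these default locations exist on a given machine, the paths can be resolved against the home directory. This is an illustrative sketch only: the dictionary keys and the helper name `default_paths` are hypothetical, and the relative paths are copied from the list above.

```python
from pathlib import Path

# Default InstructLab locations, relative to the user's home directory
DEFAULT_LOCATIONS = {
    "models": ".cache/instructlab/models",
    "datasets": ".local/share/instructlab/datasets",
    "taxonomy": ".local/share/instructlab/taxonomy",
    "checkpoints": ".local/share/instructlab/checkpoints",
    "config": ".config/instructlab/config.yaml",
}

def default_paths(home=None):
    """Resolve the default locations against a home directory (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    return {name: home / rel for name, rel in DEFAULT_LOCATIONS.items()}

if __name__ == "__main__":
    # Report which default locations are present for the current user
    for name, path in default_paths().items():
        status = "exists" if path.exists() else "missing"
        print(f"{name:12} {path} ({status})")
```

In this notebook the home directory is `/root/`, which is why the cleanup code below removes folders under that path.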

In [None]:
# Run the rest of the notebook by selecting this third cell and choosing "Runtime->Run cell and below"

# Wrap Code cell output
from IPython.display import HTML, display
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

#Install the ibmcloud plugin
!curl -fsSL https://clis.cloud.ibm.com/install/linux | sh
!ibmcloud plugin install ilab -f

#Install the rest of the requirements
os.chdir('/content/ilab/')
print("Starting directory: "+ os.getcwd())
!pip install -r requirements.txt --quiet
!ilab system info

from IPython.display import Image, display
#import json
#import subprocess
import shutil
import ruamel.yaml

#Initialize ilab
base_dir="/root/"
taxonomy_path='taxonomy'
model_path = "models/granite-7b-lab-Q4_K_M.gguf"

## Define the file name
file_name = "config.yaml"
if os.path.exists(file_name):
    os.remove(file_name)
    print(f"ilab was already initialized. {file_name} has been deleted. Reinitialized")
else:
    print(f"ilab was not initialized yet. {file_name} does not exist.")

##Remove old data
if os.path.exists("taxonomy"):
    print("removing taxonomy")
    shutil.rmtree("taxonomy")
if os.path.exists(base_dir+".cache/instructlab"):
    print("removing " + base_dir+".cache/instructlab")
    shutil.rmtree(base_dir+".cache/instructlab")
if os.path.exists(base_dir+".config/instructlab"):
    print("removing " + base_dir+".config/instructlab")
    shutil.rmtree(base_dir+".config/instructlab")
if os.path.exists(base_dir+".local/share/instructlab"):
    print("removing " + base_dir+".local/share/instructlab")
    shutil.rmtree(base_dir+".local/share/instructlab")
print("Initialized ilab")
command = f"""
ilab config init<<EOF
{taxonomy_path}
Y
{model_path}
0
EOF
"""

## Using the ! operator to run the command
!echo "Running ilab config init"
!{command}
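
The heredoc above feeds four answers to the interactive `ilab config init` prompts. The same idea can be sketched with `subprocess`, which avoids building a shell string. This is an illustrative sketch: the helper names are hypothetical, and the order of the answers (taxonomy path, `Y`, model path, `0`) is copied from the cell above, so it assumes the prompts appear in that order.

```python
import shutil
import subprocess

def build_init_answers(taxonomy_path, model_path):
    """Return the text piped to the interactive prompts, one answer per line."""
    # Same four answers, in the same order, as the heredoc in the notebook cell
    answers = [taxonomy_path, "Y", model_path, "0"]
    return "\n".join(answers) + "\n"

def run_config_init(taxonomy_path, model_path):
    """Run `ilab config init`, answering its prompts from standard input."""
    if shutil.which("ilab") is None:
        raise RuntimeError("ilab is not on PATH")
    return subprocess.run(
        ["ilab", "config", "init"],
        input=build_init_answers(taxonomy_path, model_path),
        text=True,
        check=False,  # the caller inspects returncode instead of raising
    )
```

Calling `run_config_init("taxonomy", "models/granite-7b-lab-Q4_K_M.gguf")` would mirror the values set in the cell above.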

In [None]:
import json
import subprocess
import time
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.exceptions import ClientError

with open('cloud.json', 'r') as f:
    jsonCloud = json.load(f)

# Check COS Access Here
print("Set up COS storage and check access")

# IBM Cloud Object Storage credentials
cos_id=jsonCloud["cos_id"]
cos_api_key=jsonCloud["cos_api_key"]
cos_bucket=jsonCloud["cos_bucket"]
ibmcloud_region=jsonCloud["ibmcloud_region"]

endpoint_url = f'https://s3.{ibmcloud_region}.cloud-object-storage.appdomain.cloud'
# Current list of auth_endpoints available at https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints
auth_endpoint = 'https://iam.cloud.ibm.com/identity/token'

#Create client
cos = ibm_boto3.client('s3',
                         ibm_api_key_id=cos_api_key,
                         ibm_service_instance_id=cos_id,
                         ibm_auth_endpoint=auth_endpoint,
                         config=Config(signature_version='oauth'),
                         endpoint_url=endpoint_url
                      )
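
The cell above creates the client but does not itself issue a request. One way to confirm access is a lightweight `head_bucket` call, sketched below. The helper name `check_bucket_access` is hypothetical, and the sketch catches a generic `Exception` so that it stays self-contained; in the notebook, `ClientError` from `ibm_botocore.exceptions` would be the more specific type to catch.

```python
def check_bucket_access(cos_client, bucket_name):
    """Return True if the bucket can be reached with the given client (hypothetical helper)."""
    try:
        # head_bucket fetches only metadata; it fails if the bucket does not
        # exist or the credentials do not grant access to it
        cos_client.head_bucket(Bucket=bucket_name)
        return True
    except Exception as error:  # assumption: ClientError is the usual failure type
        print(f"Cannot access bucket {bucket_name}: {error}")
        return False
```

With the names from the cell above, the call would be `check_bucket_access(cos, cos_bucket)`.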

<a id="IL2_0"></a>
# Section 2. Create Taxonomy


This section builds the InstructLab taxonomy from question-and-answer data. It is part of a sequential notebook: before running it, make sure you have run the InstructLab configuration cells in Step 1. The section also describes how to create your own question-and-answer data file.


The steps in this section are as follows:
* Step 2.1 Specify the Data for this Run
* Step 2.2 Place the QNA Files in the Taxonomy
* Step 2.3 Verify the Taxonomy Data Repository

<a id="IL2_data"></a>
## Step 2.1 Specify the Data for this Run

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns" and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select your own custom dataset in the cell below.

### Optionally, Create your own data set for InstructLab

You can optionally provide your own InstructLab QNA file for processing in this step.

Follow these steps to add your own dataset:
1. Create your own qna.yaml file following the directions on the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. Create a questions.txt file with related sample questions to use during inference.
1. Add your qna.yaml and sample questions.txt files to the /content/ilab/data/your_content_1 folder or the /content/ilab/data/your_content_2 folder by dragging and dropping them in the desired folder.
1. Double click on the /content/ilab/config.json file to edit and specify the qna_location where your data resides within the Dewey Decimal classification system. Close and save the config.json file.
1. You can now specify to run with your own data by selecting **Your Content 1** or **Your Content 2** in the next code cell.

In [None]:
# Work with the selected dataset
# Map each widget label to the folder name used under data/
use_cases = {
    '2024 Oscars': 'oscars',
    'Quantum': 'quantum',
    'Agentic AI': 'agentic_ai',
    'Your Content 1': 'your_content_1',
    'Your Content 2': 'your_content_2',
}
use_case = use_cases.get(data_set.value, "undefined")

if use_case == "undefined":
    print("ERROR: Undefined data set: " + data_set.value + " data")
else:
    qna_dir = "data/" + use_case + "/"
    print("Using " + data_set.value + " data")

## Step 2.2 Place the QNA Files in the Taxonomy
Place the QNA files in the proper directories.


In [None]:
# List all of the files in the use_case directory that begin with QNA
print_lines=40
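for f in os.listdir(qna_dir):
    # Match the "qna" prefix case-insensitively, but keep the original name
    # so the file still opens on a case-sensitive file system
    if f.lower().startswith('qna'):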
        qna_file = qna_dir + f
        print("Show the QNA file: " + qna_file)
        with open(qna_file, 'r') as input_file:
            for line_number, line in enumerate(input_file):
                if line_number == 0:
                    words = line.split()
                    print(words)
                    qna_location = words[1]
                if line_number > print_lines:  # line_number starts at 0.
                    break
                print(line_number, line, end="")
        # Place the QNA file in the proper taxonomy directory if it does not already exist
        new_qna_dir = "/taxonomy" + qna_location
        if os.path.exists(os.getcwd()+new_qna_dir):
            print("\nWARNING: QNA file already exists in taxonomy at duplicate location, not inserting")
        else:
            print("\nPlace QNA file in taxonomy as: /taxonomy"+qna_location+"/qna.yaml")
            shell_command1 = f"mkdir -p ./taxonomy{qna_location}"
            shell_command2 = f"cp ./{qna_file} ./taxonomy{qna_location}/qna.yaml"
            !{shell_command1}
            !{shell_command2}
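
The placement logic above can also be written without shell commands. The sketch below is illustrative and not part of the original notebook: the function names are hypothetical, and it relies on the same assumption the cell makes, namely that the second whitespace-separated token on the first line of the QNA file is the taxonomy location (for instance a comment line such as `# /knowledge/...`, which is itself a hypothetical example).

```python
import shutil
from pathlib import Path

def read_qna_location(qna_file):
    """Return the second whitespace-separated token of the file's first line."""
    with open(qna_file, "r") as handle:
        words = handle.readline().split()
    if len(words) < 2:
        raise ValueError(f"No taxonomy location on the first line of {qna_file}")
    return words[1]

def place_qna(qna_file, taxonomy_root="taxonomy"):
    """Copy qna_file to <taxonomy_root>/<location>/qna.yaml unless that folder exists."""
    # Strip slashes so the location is joined below the taxonomy root
    location = read_qna_location(qna_file).strip("/")
    target_dir = Path(taxonomy_root) / location
    if target_dir.exists():
        return None  # like the cell above, do not overwrite an existing entry
    target_dir.mkdir(parents=True)
    target = target_dir / "qna.yaml"
    shutil.copy(qna_file, target)
    return target
```

Returning `None` for an existing location lets the caller print the same duplicate warning that the cell above prints.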

## Step 2.3 Verify the Taxonomy Data Repository
Run diff to verify the taxonomy.

In [None]:
print("Verify the taxonomy")
!ilab taxonomy diff
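
When the notebook is run unattended, it can be useful to capture whether the verification succeeded. The sketch below wraps a command with `subprocess`; the helper name `run_check` is hypothetical, and it assumes that `ilab taxonomy diff` exits with a non-zero status when the taxonomy is invalid, which should be confirmed against the installed InstructLab version.

```python
import shutil
import subprocess

def run_check(command):
    """Run a command; return True/False for success, or None if it is not installed."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True)
    # Show whatever the command reported, preferring standard output
    print(result.stdout or result.stderr)
    return result.returncode == 0
```

For this step the call would be `run_check(["ilab", "taxonomy", "diff"])`.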

# Section 3. Tar and Download the Trained Model
Now that we have a model trained on our dataset, we can download the trained model for further testing and use.
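
The summary lists packaging the taxonomy as a tar file for Synthetic Data Generation. A minimal sketch of that step with Python's `tarfile` module follows. The function name and the `taxonomy.tar.gz` archive name are hypothetical, and gzip compression is an assumption rather than a documented requirement.

```python
import tarfile
from pathlib import Path

def make_taxonomy_tar(taxonomy_dir, tar_path):
    """Pack taxonomy_dir into a gzip-compressed tar archive and return its member names."""
    taxonomy_dir = Path(taxonomy_dir)
    with tarfile.open(tar_path, "w:gz") as archive:
        # arcname keeps paths in the archive relative to the taxonomy folder name
        archive.add(taxonomy_dir, arcname=taxonomy_dir.name)
    # Re-open the archive to list what was actually written
    with tarfile.open(tar_path, "r:gz") as archive:
        return archive.getnames()
```

In this notebook's layout, the call might be `make_taxonomy_tar("/content/ilab/taxonomy", "/content/ilab/taxonomy.tar.gz")`.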

In [None]:
#print("Do you want to download the trained model to your local machine?")
#display(download)
#print("After making your selection, please select and run the following cell")

Select and run the next cell to download if selected.

In [None]:
#from google.colab import files
#if download.value=='Yes':
#  files.download(trained_model)

In [None]:
!pip install ibmcloud

<a id="IL3_conclusion"></a>
# Conclusion

This notebook demonstrated utilizing InstructLab for introducing datasets, data generation, model training, and model creation. This notebook produced an InstructLab trained model that was available for inferencing and downloading.

<a id="IL3_learn"></a>
# Learn More

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

This notebook is based on the InstructLab CLI repository available [here](https://github.com/instructlab/instructlab).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation