# Creating InstructLab Taxonomies

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Questions and support: kochel@us.ibm.com, IBM.Research.JupyterLab@ibm.com
<li>Release date: 2025-05-06
</ul>

# Overview
This Jupyter notebook simplifies the compilation of taxonomies for the Red Hat AI InstructLab (RHAIL) service, an AI project that facilitates knowledge and skills contributions to Large Language Models (LLMs). This notebook performs the following:
1. Accepts one or more question-and-answer (QNA) files as input
1. Performs `yamllint` checks on the QNA files to verify their format
1. Places the QNA files in the desired structure in a taxonomy
1. Verifies the taxonomy by running the `ilab taxonomy diff` command
1. Creates a `tar` file of the taxonomy and provides it to the Red Hat AI InstructLab service for synthetic data generation

This notebook can be run within the free Colab environment.


# Before You Begin

- If you don't have one already, create a [Cloud Object Storage bucket](https://cloud.ibm.com/docs/instructlab?topic=instructlab-storage).
- If you are using the service for the first time, complete the [Assigning access](https://cloud.ibm.com/docs/instructlab?topic=instructlab-iam) steps.

# Step 1. Clone the InstructLab Environment and Select Run Options

The following cell clones an `ilab` data repository that contains the pip requirements and data files, and then presents options for running the notebook.

After selecting the parameters, the remainder of this notebook can be run by either:
- Running all remaining cells by selecting `Runtime`->`Run cell and below`.
- Running each cell sequentially by clicking <img src="./refs/run-cell.png" width=23> **Run cell** next to each code cell.

Run the following cell, select from the following parameters, and then follow the directions in the cell to run the rest of this notebook.

We've provided question and answer files for these datasets:
- "2024 Oscar Awards Ceremony"
- "Quantum Roadmap and Patterns"
- "Artificial Intelligence Agents"
- "Multi-QNA Example": Contains QNA files for Oscars, Quantum, and Agentic AI data sets to show how multiple QNA files can be provided and processed.
- "Your Content 1" or "Your Content 2": Follow the instructions in Step 2.2 to provide your own data.

In [None]:
# Install these items first to avoid a later reset
!pip install psutil==7.0.0 pillow==10.4.0 --quiet

import os
os.chdir('/content/')
if os.path.exists("ilab"):
    !rm -rf ilab
!git clone https://github.com/KenOcheltree/ilab-test.git --quiet --recurse-submodules ilab
#Remove the colab sample_data
if os.path.exists("sample_data"):
    !rm -rf sample_data

# Run this second cell to show parameters
import ipywidgets as widgets
data_set = widgets.ToggleButtons(
    options=['2024 Oscars', 'Quantum', 'Agentic AI', 'Multi-QNA Example', 'Your Content 1', 'Your Content 2'],
    description='Dataset:', style={"button_width": "auto"}
)
print("\nSelect the Dataset for this run:")
display(data_set)
print("After selecting the dataset, select the next cell and then choose Runtime->Run cell and below")

# Step 2. Prepare to Create the Taxonomy

## 2.1 Store Your IBM Cloud and COS Access Credentials in Secrets

When you configure IBM Cloud and Cloud Object Storage (COS) access for use with the Red Hat AI InstructLab service, you must provide the access keys and resource IDs needed to upload your taxonomy. Set the following parameters in the Secrets area.

Click the Secrets icon (it looks like a key) in the sidebar, and give the notebook access to each of these parameters:

- **ibmcloud_key**: An [API key](https://cloud.ibm.com/iam/apikeys) to access your IBM Cloud account.  Example: "XX_XXXXXXXXXXXXXXXXXX"
- **ibmcloud_region**: The IBM Cloud region. Example: `us-east`
- **ibmcloud_resource_group**: The [resource group](https://cloud.ibm.com/account/resource-groups). Example: `default`
- **project_id**: A project identifier. Example: `InstructLab`
- **cos_bucket**: The name of the COS bucket. The bucket is where the taxonomy is stored. If you do not have one yet, the bucket is created for you. Example: `ilabdata`
- **cos_endpoint**: The COS endpoint. Example: `https://s3.us-east.cloud-object-storage.appdomain.cloud`
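
Before continuing, it can help to confirm that every secret in the list above is set. The following is a minimal sketch and is not part of the notebook: `find_missing_secrets` is a hypothetical helper that accepts any lookup function, so in Colab you could pass `userdata.get` from `google.colab`, which is expected to raise an exception for a secret that is missing or not shared with the notebook.

```python
REQUIRED_SECRETS = [
    "ibmcloud_key", "ibmcloud_region", "ibmcloud_resource_group",
    "project_id", "cos_bucket", "cos_endpoint",
]

def find_missing_secrets(lookup, names=REQUIRED_SECRETS):
    """Return the names for which lookup() fails or yields an empty value."""
    missing = []
    for name in names:
        try:
            value = lookup(name)
        except Exception:  # a lookup may raise for unknown or unshared secrets
            value = None
        if not value:
            missing.append(name)
    return missing

# Stand-in lookup for illustration only; in Colab, pass userdata.get instead.
example = {"ibmcloud_key": "XX_XXXX", "ibmcloud_region": "us-east"}
print(find_missing_secrets(example.get))
```

An empty list means that all six secrets were found.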


## 2.2 Optional: Provide Your Own Taxonomy Data

You might want to run this notebook once with an existing data set before creating your own to understand the taxonomy creation flow.

You can provide your own InstructLab QNA file for processing in this step.
1. Create your own `qna.yaml` file by following the directions in the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. After creating your `qna.yaml` file, add a comment in the first line that starts with `# location:` (all lowercase) and specifies the location of the file in the taxonomy. For example, a quantum computing `qna.yaml` file has the following path for the location:
    ```
    # location: /knowledge/information/computer_science/quantum_computing
    ```
1. Add your `qna.yaml` to the `/content/ilab/data/your_content_1` folder or the `/content/ilab/data/your_content_2` folder by dragging and dropping it into the folder.
1. To include multiple `qna.yaml` files in your taxonomy, add a unique identifier `NNN` to the name so that it has the format `qnaNNN.yaml`. Any number of QNA files can be included as long as they have unique names.
1. You can use your own data by selecting **Your Content 1** or **Your Content 2** in the code cell.
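
To check the location comment before running the notebook, a small parser can help. This is a sketch only: `parse_location` is a hypothetical helper that mirrors the first-line check performed in Step 5 (exactly three whitespace-separated words: `#`, `location:`, and the path).

```python
def parse_location(first_line):
    """Return the taxonomy path from a '# location: <path>' comment, or None."""
    words = first_line.split()
    if len(words) == 3 and words[0] == "#" and words[1] == "location:":
        return words[2]
    return None

print(parse_location("# location: /knowledge/information/computer_science/quantum_computing"))
print(parse_location("# Location: /knowledge/x"))  # capitalized keyword is rejected
```

A return value of `None` indicates that the first line would not be accepted as a placement location.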


## 2.3 Complete the Environment Setup

This code cell installs the remaining required pip packages and configures InstructLab. The InstructLab configuration is captured in the `config.yaml` file, which is created for you with `taxonomy_path = taxonomy`. The root location of the taxonomy is set to the taxonomy folder in `instructlab-latest`.

**Note:** 
- This step can take a few minutes to run. If you are running all of the cells at the same time, it can take 10 minutes to run.
- Ignore any pip inconsistency errors or warnings in the installation. They do not affect the running of this notebook.

In [None]:
# You can run the rest of the notebook by selecting this cell and choosing "Runtime->Run cell and below"

# Acquire access to secret keys
from google.colab import userdata

# Wrap Code cell output
from IPython.display import HTML, display
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Install the ibmcloud plugin
!curl -fsSL https://clis.cloud.ibm.com/install/linux | sh
!ibmcloud plugin install ilab -f

# Install the rest of the requirements
os.chdir('/content/ilab/')
print("Starting directory: "+ os.getcwd())
!pip install -r requirements.txt --quiet
!ilab system info

from IPython.display import Image, display
import shutil

# Initialize ilab
base_dir="/root/"
taxonomy_path='taxonomy'
model_path = "models/granite-7b-lab-Q4_K_M.gguf"

# Remove old ilab configuration data
if os.path.exists(base_dir+".config/instructlab"):
    print("removing " + base_dir+".config/instructlab")
    shutil.rmtree(base_dir+".config/instructlab")
if os.path.exists(base_dir+".local/share/instructlab"):
    print("removing " + base_dir+".local/share/instructlab")
    shutil.rmtree(base_dir+".local/share/instructlab")

# Initialize local instructlab install
print("Initializing ilab")
command = f"""
ilab config init<<EOF
{taxonomy_path}
Y
{model_path}
0
EOF
"""
# Using the ! operator to run the command
!echo "Running ilab config init"
!{command}

# Step 3. Initialize Red Hat AI InstructLab Service Access

In [None]:
print("Initialize Red Hat AI InstructLab Service Access")
print("Installing the IBMCloud ilab plugin")
import os
import subprocess
import time

# Pull data from secrets
ibmcloud_key=userdata.get("ibmcloud_key")
ibmcloud_region=userdata.get("ibmcloud_region")
ibmcloud_resource_group=userdata.get("ibmcloud_resource_group")
cos_bucket=userdata.get("cos_bucket")
cos_endpoint=userdata.get("cos_endpoint")
try:
    project_id=userdata.get("project_id")
except Exception:
    # Fall back to the default project name when the secret is unavailable
    project_id="InstructLab"

!ibmcloud config --check-version=false
shell_command = f"ibmcloud login -apikey {ibmcloud_key} -r {ibmcloud_region} -g {ibmcloud_resource_group}"
!{shell_command}

# !ibmcloud resource service-instances --service-name instructlab --long
proj_index=0
response = subprocess.check_output("ibmcloud resource service-instances --service-name instructlab --long", shell=True).decode("utf-8").split()
print("Response: ",response)
for index, word in enumerate(response):
    if word == "GUID:" and response[index+3]==project_id:
        print("Project ID Found")
        proj_index=index+1
        break
if proj_index==0:
    print("Assign project-id")
    response = subprocess.check_output("ibmcloud resource service-instance-create 'InstructLab' instructlab instructlab-pricing-plan "+ibmcloud_region,
        shell=True).decode("utf-8").split()
    for index, word in enumerate(response):
        if word == "GUID:":
            proj_index=index+1
            break
if proj_index==0:
    # Stop here: continuing would read an unrelated word from the response
    raise RuntimeError("ERROR in assigning Project ID")
project_index=response[proj_index]
print("Project Index: " + project_index)

shell_command = f"ibmcloud ilab config set project-id {project_index}"
!{shell_command}

print("Check IBM Cloud COS authorization policies")
!ibmcloud iam authorization-policies


# Step 4. Check the Format of the QNA YAML Files

Running this cell checks the format of the YAML files before they are placed in the taxonomy, for example to ensure that lines are within the permitted length and have no trailing blanks.

For each QNA file, the YAML file checker outputs a `Checking File: QNA.yaml` header followed by a list of errors found in the file. There is no other output besides the header for properly configured files.

**Important:** Rerun the following cell until all of the QNA files pass the `yamllint` test. Otherwise, the synthetic data generation steps fail later.
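
For a quick local pre-check of the two problems mentioned above, a few lines of plain Python are enough. This sketch is not a replacement for `yamllint`: `quick_yaml_checks` is a hypothetical helper, and the 120-character limit is an assumption, since the actual limit is defined in the repository's `yamlrules.yaml`.

```python
def quick_yaml_checks(text, max_length=120):
    """Return (line_number, message) pairs for two common formatting problems."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
        if len(line) > max_length:  # max_length is an assumed value
            problems.append((number, f"line longer than {max_length} characters"))
    return problems

sample = "version: 3 \ndomain: quantum\n"  # first line ends with a stray space
print(quick_yaml_checks(sample))  # [(1, 'trailing whitespace')]
```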


In [None]:
import yamllint
# Select the folder of the dataset
use_cases = {"2024 Oscars": "oscars", "Quantum": "quantum", "Agentic AI": "agentic_ai",
            "Multi-QNA Example": "example","Your Content 1": "your_content_1", "Your Content 2": "your_content_2"}
use_case = use_cases[data_set.value]
qna_dir = "data/" + use_case + "/"
print("Running yaml checker on " + data_set.value + " data in folder " + qna_dir)
for f in os.listdir(qna_dir):
    # Match the prefix case-insensitively but keep the real file name for the path
    if f.lower().startswith('qna'):
        print("Checking File: " + f)
        yaml_file = qna_dir + f
        shell_command = f"yamllint /content/ilab/{yaml_file} -c /content/ilab/yamlrules.yaml"
        !{shell_command}

# Step 5. Create the Taxonomy with the QNA Files
Running this next cell places the QNA files in the proper directories of the taxonomy.

If you want to add more QNA files to the taxonomy after the following cell is run, you can create the necessary levels of directories and add a file named `qna.yaml` directly to the taxonomy.
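
Adding a file by hand amounts to creating the directory levels and copying the file in as `qna.yaml`. The following sketch shows one way to do that with the standard library. `place_qna` is a hypothetical helper, and unlike the notebook cell (which checks whether the target directory exists) it checks for the target file itself.

```python
import shutil
import tempfile
from pathlib import Path

def place_qna(qna_file, taxonomy_root, location):
    """Copy qna_file to <taxonomy_root>/<location>/qna.yaml; return the new path."""
    target_dir = Path(taxonomy_root) / location.strip("/")
    target = target_dir / "qna.yaml"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(qna_file, target)
    return target

# Demonstration in a temporary directory so that no real taxonomy is touched.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "qna001.yaml"
    source.write_text("# location: /knowledge/example\nversion: 3\n")
    placed = place_qna(source, Path(tmp) / "taxonomy", "/knowledge/example")
    print(placed.relative_to(tmp))
```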


In [None]:
# List all of the files in the use_case directory that begin with QNA
print_lines=30
for f in os.listdir(qna_dir):
    # Match the prefix case-insensitively but keep the real file name for the path
    if f.lower().startswith('qna'):
        qna_file = qna_dir + f
        qna_location = None  # Reset per file so a stale location is never reused
        print("Show the QNA file: " + qna_file)
        with open(qna_file, 'r') as input_file:
            for line_number, line in enumerate(input_file):
                if line_number == 0:
                    words = line.split()
                    print("Checking first line of QNA file for placement location: " + line)
                    # Check the length first so an empty first line cannot raise IndexError
                    if len(words) == 3 and words[0] == "#" and words[1] == "location:":
                        qna_location = words[2]
                    else:
                        print("ERROR: Placement location not specified in QNA File: " + qna_file)
                        break
                if line_number > print_lines:  # line_number starts at 0.
                    break
                print(line_number, line, end="")
        # Skip files that do not declare a placement location
        if qna_location is None:
            continue
        # Place the QNA file in the proper taxonomy directory if it does not already exist
        new_qna_dir = "/taxonomy" + qna_location
        if os.path.exists(os.getcwd()+new_qna_dir):
            print("\nWARNING: QNA file already exists in the taxonomy at duplicate location, not inserting")
        else:
            print("\nPlace QNA file in taxonomy as: /taxonomy"+qna_location+"/qna.yaml")
            shell_command1 = f"mkdir -p ./taxonomy{qna_location}"
            shell_command2 = f"cp ./{qna_file} ./taxonomy{qna_location}/qna.yaml"
            !{shell_command1}
            !{shell_command2}

# Step 6. Verify the Taxonomy Data Repository
Run the `ilab taxonomy diff` command to verify the taxonomy. Record any errors from this step and correct them in your QNA files. Then rerun the notebook with the corrected QNA files.

For a properly configured taxonomy, the last line of the output reads:

  "Taxonomy in taxonomy is valid :)"

In [None]:
print("Verify the taxonomy")
!ilab -vvv taxonomy diff --taxonomy-path taxonomy --taxonomy-base empty

# Step 7. Add the Taxonomy to the Cloud

If you receive an error running this code cell:
- Check the credentials that you added to the Colab secrets.
- Verify that the COS bucket is already created.
- Verify that you have the proper access.

After you make the corrections, run the notebook again.
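
One possible source of errors is shell quoting, for example when a name or path contains spaces. As a sketch, the same command can be assembled as an argument list and passed to `subprocess.run(args, check=True)` without `shell=True`. `build_taxonomy_add_command` is a hypothetical helper; the flags are the ones used in the cell below, and the values shown are the example values from Step 2.1.

```python
import shlex

def build_taxonomy_add_command(name, taxonomy_path, cos_endpoint, cos_bucket):
    """Return the 'ibmcloud ilab taxonomy add' call as a list of arguments."""
    return [
        "ibmcloud", "ilab", "taxonomy", "add",
        "--name", name,
        "--taxonomy-path", taxonomy_path,
        "--cos-endpoint", cos_endpoint,
        "--cos-bucket", cos_bucket,
    ]

args = build_taxonomy_add_command(
    "quantum",
    "/content/ilab/taxonomy",
    "https://s3.us-east.cloud-object-storage.appdomain.cloud",
    "ilabdata",
)
# shlex.join shows the equivalent, safely quoted shell command for inspection.
print(shlex.join(args))
```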

In [None]:
set_name=use_case
tax_dir= os.getcwd()+"/taxonomy"
shell_command = f"ibmcloud ilab taxonomy add --name {set_name} --taxonomy-path {tax_dir} \
--cos-endpoint {cos_endpoint} --cos-bucket {cos_bucket}"

print("Add the taxonomy to the cloud")
tax_response = subprocess.check_output(shell_command, shell=True)
print("Taxonomy added")

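response = tax_response.decode("utf-8").split()
tax_id = None
for index, word in enumerate(response):
    # Take the word that follows "id", guarding against "id" being the last word
    if word == "id" and index + 1 < len(response):
        tax_id = response[index+1]
        break

if tax_id is None:
    print("ERROR: taxonomy id not found in the response")
else:
    print("taxonomy id = " + tax_id)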

<a id="IL3_learn"></a>
# Learn More

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation