# Creating InstructLab Taxonomies

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Questions and support: kochel@us.ibm.com, IBM.Research.JupyterLab@ibm.com
<li>Release date: 2025-03-20
</ul>

# Overview
This Jupyter notebook helps you compile taxonomies for InstructLab, an open source AI project that enables knowledge and skills contributions to Large Language Models (LLMs). The notebook performs the following steps:
1. Accepts one or more question-and-answer (QNA) files as input
1. Places the QNA files in the desired place in a taxonomy
1. Verifies the taxonomy by running the `ilab taxonomy diff` command
1. Creates a tar file of the taxonomy that can be used for Synthetic Data Generation
1. Downloads the tar file or provides it to the Red Hat AI InstructLab service

This notebook can be run in the free Colab environment.


# Step 1. Clone the Ilab Environment and Present Run Options

Clone the ilab data repository, which contains the pip requirements and data files.


After selecting parameters, you can run the remainder of this notebook in either of two ways:
- All at once, by selecting Runtime->Run cell and below
- Cell by cell, by selecting the arrow on each code cell and running the cells in order

Run the next cell, select the parameters, then follow the directions in the following cell to run the rest of this notebook.

We've provided question-and-answer files for these datasets: "2024 Oscar Awards Ceremony", "Quantum Roadmap and Patterns" and "Artificial Intelligence Agents". Feel free to choose one of these datasets, or select "Your Content 1" or "Your Content 2" and follow the instructions below to provide your own data.

In [None]:
# Install these items first to avoid a later reset
!pip install psutil==7.0.0 pillow==10.4.0 --quiet

import os
os.chdir('/content/')
if os.path.exists("ilab"):
    !rm -rf ilab
!git clone https://github.com/KenOcheltree/ilab-test.git --quiet --recurse-submodules ilab

# Run this second cell to show parameters
import ipywidgets as widgets
data_set = widgets.ToggleButtons(
    options=['2024 Oscars', 'Quantum', 'Agentic AI', 'Your Content 1', 'Your Content 2'],
    description='Dataset:', style={"button_width": "auto"}
)
print("\nSelect the Dataset for this run:")
display(data_set)
print("After selecting the parameters, select the next cell and then choose Runtime->Run cell and below")

# Step 2. Provide the Taxonomy data

You may want to run this notebook once with an existing dataset before creating your own, to understand the taxonomy creation flow.

You can optionally provide your own InstructLab QNA file for processing in this step. Follow these steps to add your own dataset:
1. Create your own qna.yaml file following the directions on the InstructLab taxonomy [readme](https://github.com/instructlab/taxonomy).
1. After creating your qna.yaml file, add a comment as the first line that starts with *#Location:* and specifies the location of the file in the taxonomy. The notebook matches this marker exactly, including capitalization. For example, a quantum computing qna.yaml file would have the following first line:
```
#Location: /knowledge/information/computer_science/quantum_computing
```
1. Add your qna.yaml file to the /content/ilab/data/your_content_1 folder or the /content/ilab/data/your_content_2 folder by dragging and dropping it into the desired folder.
1. If you want to include multiple QNA files in your taxonomy, add a unique identifier "NNN" to each name so that it has the form qnaNNN.yaml. Any number of QNA files can be included as long as they have unique names.
1. You can now run with your own data by selecting **Your Content 1** or **Your Content 2** in the Dataset selector from Step 1.
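
The location comment can be checked before running the rest of the notebook. The following is a minimal sketch, assuming the same rule that the Step 4 cell applies: the first line must split into exactly two words, and the first word must be `#Location:`. The function name `read_qna_location` is hypothetical and is not part of InstructLab.

```python
from typing import Optional


def read_qna_location(first_line: str) -> Optional[str]:
    """Return the taxonomy location from a '#Location:' comment line, or None if invalid."""
    words = first_line.split()
    # Expect exactly two words: the '#Location:' marker and the path
    if len(words) == 2 and words[0] == "#Location:":
        return words[1]
    return None


print(read_qna_location("#Location: /knowledge/information/computer_science/quantum_computing"))
print(read_qna_location("# some other comment"))
```

A return value of `None` means the Step 4 cell would report that no location was found for that file.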

# Step 3. Complete the Environment Set Up


This code cell installs the remainder of the required pip packages and takes a few minutes to run.

The first half of the cell enables line wrapping of the code cell output for all following cells, for ease of reading.

The InstructLab configuration is captured in the *config.yaml* file. This step creates the config.yaml file and sets **taxonomy_path = taxonomy**, so the root location of the taxonomy is the taxonomy folder in instructlab-latest.

**Note:** Ignore any pip inconsistency errors or warnings in the installation. They are inconsequential to the running of this notebook.

**Note:** If you perform **Runtime->Run cell and below** on this cell, the rest of the notebook will take about 10 minutes to run. After running, it will present a prompt for providing questions to the pre-trained and trained models to test improvements in the model.
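
The cell below answers the prompts of `ilab config init` through a shell here-document. An equivalent way to supply prompt answers from Python is to pass them on standard input. This is a sketch only: `run_with_answers` is a hypothetical helper, and the demonstration uses a small Python child process in place of `ilab` so that it runs without InstructLab installed.

```python
import subprocess
import sys
from typing import List


def run_with_answers(command: List[str], answers: List[str]) -> subprocess.CompletedProcess:
    """Run a command and send each answer on its own line of standard input."""
    stdin_text = "\n".join(answers) + "\n"
    return subprocess.run(command, input=stdin_text, capture_output=True, text=True)


# Stand-in child process that echoes each line it reads from standard input
echo_child = [sys.executable, "-c", "import sys; [print('got', l.strip()) for l in sys.stdin]"]
result = run_with_answers(echo_child, ["taxonomy", "Y", "models/granite-7b-lab-Q4_K_M.gguf", "0"])
print(result.stdout)
```

The four answers mirror the values the cell passes (taxonomy path, a confirmation, the model path, and a profile choice). The order of answers has to match the order of the prompts, which may differ between ilab versions.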

In [None]:
# You can run the rest of the notebook by selecting this cell and choosing "Runtime->Run cell and below"

# Wrap Code cell output
from IPython.display import HTML, display
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Install the ibmcloud plugin
!curl -fsSL https://clis.cloud.ibm.com/install/linux | sh
!ibmcloud plugin install ilab -f

# Install the rest of the requirements
os.chdir('/content/ilab/')
print("Starting directory: "+ os.getcwd())
!pip install -r requirements.txt --quiet
!ilab system info

from IPython.display import Image, display
import shutil
import ruamel.yaml

# Initialize ilab
base_dir="/root/"
taxonomy_path='taxonomy'
model_path = "models/granite-7b-lab-Q4_K_M.gguf"

# Define the file name
file_name = "config.yaml"
if os.path.exists(file_name):
    os.remove(file_name)
    print(f"ilab was already initialized. {file_name} has been deleted. Reinitialized")
else:
    print(f"ilab was not initialized yet. {file_name} does not exist.")

# Remove old data
if os.path.exists("taxonomy"):
    print("removing taxonomy")
    shutil.rmtree("taxonomy")
if os.path.exists(base_dir+".cache/instructlab"):
    print("removing " + base_dir+".cache/instructlab")
    shutil.rmtree(base_dir+".cache/instructlab")
if os.path.exists(base_dir+".config/instructlab"):
    print("removing " + base_dir+".config/instructlab")
    shutil.rmtree(base_dir+".config/instructlab")
if os.path.exists(base_dir+".local/share/instructlab"):
    print("removing " + base_dir+".local/share/instructlab")
    shutil.rmtree(base_dir+".local/share/instructlab")

# Initialize local instructlab install
print("Initializing ilab")
command = f"""
ilab config init<<EOF
{taxonomy_path}
Y
{model_path}
0
EOF
"""
# Using the ! operator to run the command
!echo "Running ilab config init"
!{command}

# Step 3. Initialize Red Hat AI InstructLab Access
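
The cell below reads IBM Cloud and Cloud Object Storage credentials from a `cloud.json` file in the working directory. A minimal sketch of validating that file before use follows. The key names are the ones the cell reads; `load_cloud_settings` and the placeholder values are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ["ibmcloud_key", "ibmcloud_region", "ibmcloud_resource",
                 "cos_id", "cos_api_key", "cos_bucket"]


def load_cloud_settings(path: str) -> dict:
    """Load the settings file and fail early if any required key is missing."""
    with open(path, "r") as f:
        settings = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError("Missing keys in " + path + ": " + ", ".join(missing))
    return settings


# Demonstration with placeholder values written to a temporary file
example = {key: "placeholder" for key in REQUIRED_KEYS}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "cloud.json")
    with open(example_path, "w") as f:
        json.dump(example, f)
    settings = load_cloud_settings(example_path)
print(sorted(settings))
```

Failing early on a missing key gives a clearer message than a `KeyError` raised partway through the login sequence.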

In [None]:
print("Initialize Red Hat AI InstructLab Access")
print("Installing the IBMCloud ilab plugin as needed")

import os
import json
import subprocess
import time
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.exceptions import ClientError

# Replace the following with the use of secrets
with open('cloud.json', 'r') as f:
    jsonCloud = json.load(f)
ibmcloud_key=jsonCloud["ibmcloud_key"]
ibmcloud_region=jsonCloud["ibmcloud_region"]
ibmcloud_resource=jsonCloud["ibmcloud_resource"]
cos_id=jsonCloud["cos_id"]
cos_api_key=jsonCloud["cos_api_key"]
cos_bucket=jsonCloud["cos_bucket"]

!ibmcloud config --check-version=false
shell_command = f"ibmcloud login -apikey {ibmcloud_key} -r {ibmcloud_region} -g {ibmcloud_resource}"
!{shell_command}

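# Look up the GUID of an existing InstructLab service instance
list_command = "ibmcloud resource service-instances --service-name instructlab --long"

def find_guid_index(words):
    """Return the index of the word after the first "GUID:" label, or 0 if absent."""
    for index, word in enumerate(words[:-1]):
        if word == "GUID:":
            return index + 1
    return 0

response = subprocess.check_output(list_command, shell=True).decode("utf-8").split()
proj_index = find_guid_index(response)
if proj_index == 0:
    print("Assign project-id")
    !ibmcloud resource service-instance-create 'instructlab' instructlab instructlab-pricing-plan us-east
    # Query again so that the newly created instance appears in the listing
    response = subprocess.check_output(list_command, shell=True).decode("utf-8").split()
    proj_index = find_guid_index(response)
if proj_index == 0:
    # Stop here rather than continue with an invalid project id
    raise RuntimeError("ERROR in assigning Project ID")
project_id = response[proj_index]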

shell_command = f"ibmcloud ilab config set project-id {project_id}"
!{shell_command}

print("Check IBM Cloud COS authorization policies")
!ibmcloud iam authorization-policies

# Set up COS Access Here
print("Set up COS storage and check access")
endpoint_url = f'https://s3.{ibmcloud_region}.cloud-object-storage.appdomain.cloud'
# Current list of auth_endpoints is at https://control.cloud-object-storage.cloud.ibm.com/v2/endpoints
auth_endpoint = 'https://iam.cloud.ibm.com/identity/token'
#Create client
cos = ibm_boto3.client('s3',
                         ibm_api_key_id=cos_api_key,
                         ibm_service_instance_id=cos_id,
                         ibm_auth_endpoint=auth_endpoint,
                         config=Config(signature_version='oauth'),
                         endpoint_url=endpoint_url
                      )

# Step 4. Create the Taxonomy with the QNA Files
Place the QNA files in the proper directories of the taxonomy.
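
As a sketch of the placement rule, the location from the first line of the QNA file is appended to the taxonomy root, and the file is stored there as `qna.yaml`. The helper name `taxonomy_target` is hypothetical, and the rejection of relative locations and `..` segments is an added safety assumption rather than something the notebook does.

```python
from pathlib import PurePosixPath


def taxonomy_target(taxonomy_root: str, qna_location: str) -> PurePosixPath:
    """Return the path where a QNA file with the given location would be stored."""
    location = PurePosixPath(qna_location)
    # Assumption: locations are absolute (start with '/') and contain no '..' segments
    if not location.is_absolute() or ".." in location.parts:
        raise ValueError("Unexpected taxonomy location: " + qna_location)
    # Drop the leading '/' so the location is joined beneath the taxonomy root
    return PurePosixPath(taxonomy_root).joinpath(*location.parts[1:]) / "qna.yaml"


print(taxonomy_target("taxonomy", "/knowledge/information/computer_science/quantum_computing"))
```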


In [None]:
# Select the folder for the dataset
use_cases = {"2024 Oscars": "oscars", "Quantum": "quantum", "Agentic AI": "agentic_ai",
            "Your Content 1": "your_content_1", "Your Content 2": "your_content_2"}
use_case = use_cases[data_set.value]
qna_dir = "data/" + use_case + "/"
print("Using " + data_set.value + " data in folder " + qna_dir)

# List all of the files in the use_case directory whose names begin with "qna"
print_lines = 30
for f in os.listdir(qna_dir):
    # Compare case-insensitively but keep the real file name for opening
    if not f.lower().startswith('qna'):
        continue
    qna_file = qna_dir + f
    qna_location = None
    print("Show the QNA file: " + qna_file)
    with open(qna_file, 'r') as input_file:
        for line_number, line in enumerate(input_file):
            if line_number == 0:
                # The first line must be a comment of the form "#Location: /path/in/taxonomy"
                words = line.split()
                print(words)
                if len(words) == 2 and words[0] == "#Location:":
                    qna_location = words[1]
                else:
                    print("ERROR: No specified location found in QNA file: " + qna_file)
                    break
            if line_number > print_lines:  # line_number starts at 0
                break
            print(line_number, line, end="")
    if qna_location is None:
        continue  # Skip files without a valid location line
    # Place the QNA file in the proper taxonomy directory if it does not already exist
    new_qna_dir = "/taxonomy" + qna_location
    if os.path.exists(os.getcwd() + new_qna_dir):
        print("\nWARNING: QNA file already exists in taxonomy at duplicate location, not inserting")
    else:
        print("\nPlace QNA file in taxonomy as: /taxonomy" + qna_location + "/qna.yaml")
        shell_command1 = f"mkdir -p ./taxonomy{qna_location}"
        shell_command2 = f"cp ./{qna_file} ./taxonomy{qna_location}/qna.yaml"
        !{shell_command1}
        !{shell_command2}

# Step 5. Verify the Taxonomy Data Repository
Run `ilab taxonomy diff` to verify the taxonomy. Note any errors reported in this step, correct them in your QNA files, and then rerun the notebook with the corrected files.
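
To keep a record of the errors reported in this step, the command output can be captured and saved to a file. The following is a sketch: `run_and_record` and the log file name are hypothetical, and a small Python child process stands in for `ilab taxonomy diff` so that the example runs without InstructLab installed.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_and_record(command: List[str], log_path: str) -> int:
    """Run a command, write its combined output to log_path, and return the exit code."""
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    with open(log_path, "w") as log_file:
        log_file.write(result.stdout)
    return result.returncode


# Stand-in command; with InstructLab installed this would be ["ilab", "taxonomy", "diff"]
stand_in = [sys.executable, "-c", "print('taxonomy check output')"]
log_path = os.path.join(tempfile.gettempdir(), "taxonomy_diff.log")
exit_code = run_and_record(stand_in, log_path)
print("exit code:", exit_code)
```

Review the saved log for reported errors before rerunning the notebook.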

In [None]:
print("Verify the taxonomy")
!ilab taxonomy diff

# Step 6. Add the Taxonomy to the Cloud
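
This step passes the local taxonomy to the `ibmcloud ilab taxonomy add` command and reads the taxonomy id from its output. As a sketch, the command can also be assembled as an argument list, which avoids shell quoting problems when a path or name contains spaces. The helper names, the example values, and the sample output string are hypothetical; the flags are the ones used in the cell below.

```python
from typing import List, Optional


def build_taxonomy_add_command(name: str, taxonomy_path: str, cos_endpoint: str,
                               cos_id: str, cos_bucket: str) -> List[str]:
    """Assemble the argument list for 'ibmcloud ilab taxonomy add'."""
    return ["ibmcloud", "ilab", "taxonomy", "add",
            "--name", name,
            "--taxonomy-path", taxonomy_path,
            "--cos-endpoint", cos_endpoint,
            "--cos-id", cos_id,
            "--cos-bucket", cos_bucket]


def find_word_after(words: List[str], key: str) -> Optional[str]:
    """Return the word that follows the first occurrence of key, or None."""
    for index, word in enumerate(words[:-1]):
        if word == key:
            return words[index + 1]
    return None


command = build_taxonomy_add_command(
    "quantum", "/content/ilab/taxonomy",
    "https://s3.us-east.cloud-object-storage.appdomain.cloud",
    "example-cos-id", "example-bucket")
print(command)

# Hypothetical output text, used only to illustrate the parsing step
sample_output = "Taxonomy id 1234-abcd added"
print(find_word_after(sample_output.split(), "id"))
```

An argument list like this can be passed to `subprocess.check_output` without `shell=True`.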


In [None]:
set_name=use_case
tax_dir= os.getcwd()+"/taxonomy"
shell_command = f"ibmcloud ilab taxonomy add --name {set_name} --taxonomy-path {tax_dir} \
--cos-endpoint https://s3.us-east.cloud-object-storage.appdomain.cloud \
--cos-id {cos_id} \
--cos-bucket {cos_bucket}"

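print("Add the taxonomy to the cloud")
tax_response = subprocess.check_output(shell_command, shell=True)
print("Taxonomy added")

# The taxonomy id is the word that follows "id" in the command output
response = tax_response.decode("utf-8").split()
tax_id = None
for index, word in enumerate(response[:-1]):
    if word == "id":
        tax_id = response[index + 1]
        break
if tax_id is None:
    raise RuntimeError("Could not find the taxonomy id in the command output")

print("taxonomy id = " + tax_id)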

<a id="IL3_learn"></a>
# Learn More

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

This notebook is based on the InstructLab CLI repository available [here](https://github.com/instructlab/instructlab).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation