# Configuring InstructLab

<ul>
<li>Contributors: InstructLab team and IBM Research Technology Education team
<li>Contact for questions and technical support: IBM.Research.JupyterLab@ibm.com
<li>Provenance: IBM Research
<li>Version: 1.0.9
<li>Release date: 2024-11-14
<li>Compute requirements: GPU: estimated 4 to 15 minutes (up to 4 min for cell 6, 8 min for cell 7, and 5 min for cell 8)
<li>Memory requirements: 16 GB
<li>Notebook set: InstructLab
</ul>

# Summary
This notebook set demonstrates InstructLab, an open source AI project that facilitates knowledge and skills contributions to Large Language Models (LLMs). InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081). The open source InstructLab repository is available [here](https://github.com/instructlab/instructlab) and provides additional documentation on using InstructLab.

InstructLab can be instantiated in several forms, depending on the processing capabilities available: as an open source installation or as a Red Hat AI InstructLab installation. The open source installation can run on a range of hardware, from a laptop to a build-your-own (BYO) server instance running on a Virtual Machine (VM). The figure below shows the available instantiations of InstructLab.

<img src="./data/images/experiences.png" width="800">

In this notebook set, we will be demonstrating both the open source version running on a VM server and Red Hat Enterprise Linux AI InstructLab running on an IBM Cloud Server.

The open source version running on a server is demonstrated in the following notebooks that are run sequentially:
- [Configuring InstructLab](./00_configuring_InstructLab.ipynb)
- [Training with InstructLab](./01_training_with_InstructLab.ipynb)
- [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb)

The Red Hat Enterprise Linux AI InstructLab is demonstrated running as a service in the IBM Cloud by running the following notebooks:
- [Configuring InstructLab](./00_configuring_InstructLab.ipynb)
- [Training with Red Hat AI InstructLab Service](./03_training_with_RH_AI_InstructLab_Service.ipynb)
- [Inferencing with Redhat-AI-InstructLab Trained Model](./04_inferencing_with_RH_AI_InstructLab_Service.ipynb)

**Note:** The **Configuring InstructLab** notebook is run before the other Red Hat AI InstructLab notebooks to ensure that the *granite 7b model* is installed for inference comparisons as the base model.

## Configuring InstructLab

This notebook demonstrates the configuration of InstructLab. The InstructLab method consists of three major components:
* **Taxonomy-driven data curation:**  The taxonomy is a set of training data curated by humans as examples of new knowledge and skills for the model.
* **Large-scale synthetic data generation:** A teacher model is used to generate new examples based on the seed training data. Recognizing that synthetic data can vary in quality, the InstructLab method adds an automated step to refine the example answers, ensuring they are grounded and safe.
* **Iterative model alignment tuning:** The model is retrained based on the synthetic data. The InstructLab method includes two tuning phases: knowledge tuning, followed by skill tuning.

In this notebook, we will demonstrate the following:
1. Checking the InstructLab installation
2. Configuring InstructLab for use
3. Installing InstructLab LLM models

**Note:** This notebook must be run within a GPU session. If you are not running with a GPU, please select **File->Hub Control Panel->Stop My Server**, then **Start My Server** and then select a GPU Session.

# Table of Contents
* <a href="#I0_preconfig">Step 0. Environment Preconfiguration</a>
* <a href="#I0_init">Step 1. Check the Starting Configuration</a>
* <a href="#I0_config">Step 2. Configure InstructLab</a>
* <a href="#I0_down">Step 3. Download Models</a>
* <a href="#I0_conclusion">Conclusion</a>
* <a href="#I0_learn">Learn more</a>

<a id="I0_preconfig"></a>
# Step 0. Environment Preconfiguration
This step has already been performed for this JupyterLab environment and requires no work by the user. This information is provided in case the user is setting up their own environment on their own laptop or server.

The full steps for a direct installation are [here](https://github.com/instructlab/instructlab).



In [None]:
!python --version

In [None]:
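# Optional: uninstall an existing InstructLab installation first.
# The heredoc answers pip's confirmation prompt with "Y".
command = """
pip uninstall instructlab<<EOF
Y
EOF
"""

# Uncomment the lines below to run the command with the ! operator
#!echo "Uninstalling instructlab"
#!{command}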

In [None]:
!pip install instructlab

<a id="I0_init"></a>
# Step 1. Check the Starting Configuration

## 1.1 Check for a GPU

This code cell checks that a GPU is available, which is required to run this notebook.

If the cell reports that no GPU is in the configuration, perform the following steps:
1. Select File->Hub Control Panel.
1. On the Hub Control Panel, select the blue "Stop My Server" button.
1. Then select "Start My Server" and choose Session with a GPU.
1. Check at the upper right that the kernel being run is "conda-instructlab-latest". If not, select the kernel and switch to the correct kernel.

In [None]:
# standard imports
import os
import torch

# torch and CUDA version check
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
# torch.version.cuda is None for CPU-only builds of PyTorch
CUDA_VERSION = torch.version.cuda
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

if not torch.cuda.is_available():
    print("No GPU in configuration")
else:
    # Allow the CUDA caching allocator to use expandable segments
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    print("GPU is Available")
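
PyTorch version strings sometimes carry a local build tag after a `+` sign, and sometimes do not. The sketch below shows one way to split such a string into its major.minor part and its optional build tag without importing torch. The function name `split_torch_version` and the sample version strings are illustrative assumptions, not values taken from this environment.

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split a version string into (major.minor, local build tag or None)."""
    # Everything after the first "+" is the local build tag, if present
    public, _, local = version.partition("+")
    major_minor = ".".join(public.split(".")[:2])
    return major_minor, (local or None)


# Illustrative version strings (assumptions, not read from this environment)
print(split_torch_version("2.3.1+cu121"))
print(split_torch_version("2.3.1"))
```

Returning `None` when no tag is present avoids reporting the whole version string as if it were a CUDA tag.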

## 1.2 Check InstructLab Version and GPU Offload

Check that InstructLab version 0.23.1 is installed and configured to use a GPU.

The first line of the 'InstructLab' section of the output should read:
```
instructlab.version: 0.23.1
```

In [None]:
!ilab system info
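
The version check can also be done programmatically rather than by eye. The sketch below searches a block of text for the `instructlab.version:` line and compares the value to the expected version. The helper name `find_instructlab_version` is hypothetical, and the sample text contains only the single line quoted above, not the full command output.

```python
import re
from typing import Optional

EXPECTED_VERSION = "0.23.1"


def find_instructlab_version(system_info: str) -> Optional[str]:
    """Return the value of the 'instructlab.version:' line, or None if absent."""
    match = re.search(
        r"^\s*instructlab\.version:\s*(\S+)", system_info, flags=re.MULTILINE
    )
    return match.group(1) if match else None


# Stand-in text holding only the line quoted in this notebook
sample_output = "instructlab.version: 0.23.1\n"

found = find_instructlab_version(sample_output)
if found == EXPECTED_VERSION:
    print("InstructLab version matches:", found)
else:
    print("Unexpected InstructLab version:", found)
```

In a notebook, the text could be captured with `info = !ilab system info` and joined with `"\n".join(info)` before being passed to the helper.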

<a id="I0_config"></a>
# Step 2. Configure InstructLab

## 2.1 Create InstructLab config file
The InstructLab configuration is captured in the *config.yaml* file. This step creates the config.yaml file and sets:
- **taxonomy_path = taxonomy** - the root location of the taxonomy is set to the taxonomy folder in instructlab-latest
- **model_path = models/granite-7b-lab-Q4_K_M.gguf** - the default model is set to the quantized granite model

**Note:** If you initialize InstructLab on your own system, it uses the following default directories:
* **Downloaded Models:**  ~/.cache/instructlab/models/ - Contains all downloaded large language models, including the saved output of ones you generate with ilab.
* **Synthetic Data:** ~/.local/share/instructlab/datasets/ - Contains data output from the SDG phase, built on modifications to the taxonomy repository.
* **Taxonomy:** ~/.local/share/instructlab/taxonomy/ - Contains the skill and knowledge data.
* **Training Output:** ~/.local/share/instructlab/checkpoints/ - Contains the output of the training process.
* **config.yaml:** ~/.config/instructlab/config.yaml - Contains the config.yaml file
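
The default locations listed above can be expressed relative to a home directory. The following sketch builds them with `pathlib` and reports which ones currently exist. The function name `default_instructlab_dirs` and the dictionary keys are illustrative choices, and the paths are simply those from the list, not values queried from InstructLab.

```python
from pathlib import Path


def default_instructlab_dirs(home: Path) -> dict:
    """Map a short label to each default InstructLab location under `home`."""
    share = home / ".local" / "share" / "instructlab"
    return {
        "models": home / ".cache" / "instructlab" / "models",
        "datasets": share / "datasets",
        "taxonomy": share / "taxonomy",
        "checkpoints": share / "checkpoints",
        "config": home / ".config" / "instructlab" / "config.yaml",
    }


# Report which default locations exist for the current user
for name, path in default_instructlab_dirs(Path.home()).items():
    status = "exists" if path.exists() else "missing"
    print(f"{name:12} {path} ({status})")
```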

In [None]:
import os
import shutil

base_dir = "/root/"
model_dir = "models"
# Choose the base model as granite or mixtral;
# here the quantized granite model is used
model_name = "granite-7b-lab-Q4_K_M.gguf"

model_path = os.path.join(model_dir, model_name)
taxonomy_path = "taxonomy"

# Remove any local copy of config.yaml left by a previous initialization
file_name = "config.yaml"
if os.path.exists(file_name):
    os.remove(file_name)
    print(f"ilab was already initialized. {file_name} has been deleted. Reinitializing.")
else:
    print(f"ilab was not initialized yet. {file_name} does not exist.")

# Remove old data from earlier runs
old_dirs = [
    "taxonomy",
    base_dir + ".cache/instructlab",
    base_dir + ".config/instructlab",
    base_dir + ".local/share/instructlab",
]
for old_dir in old_dirs:
    if os.path.exists(old_dir):
        print("removing " + old_dir)
        shutil.rmtree(old_dir)

print(f"ilab model is {model_path}.")
print('#############################################################')
print(' ')

command = f"""
ilab config init<<EOF
{taxonomy_path}
Y
{model_path}
4
1
EOF
"""

# Using the ! operator to run the command
!echo "Running ilab config init"
!{command}
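
The cell above answers the interactive prompts by feeding lines to the command through a shell heredoc. The same idea can be sketched in plain Python with `subprocess.run` and its `input` argument. The helper name `run_with_answers` is hypothetical, and a small Python one-liner stands in for the real command so the example runs anywhere. Substituting `["ilab", "config", "init"]` is an assumption that has not been verified here.

```python
import subprocess
import sys


def run_with_answers(command, answers):
    """Run `command`, sending one answer per line on its standard input."""
    stdin_text = "\n".join(answers) + "\n"
    return subprocess.run(
        command, input=stdin_text, text=True, capture_output=True, check=False
    )


# Stand-in command that prints the lines it receives on standard input
echo_stdin = [
    sys.executable,
    "-c",
    "import sys; print(sys.stdin.read().splitlines())",
]

# The same answers that the heredoc supplies, in the same order
answers = ["taxonomy", "Y", "models/granite-7b-lab-Q4_K_M.gguf", "4", "1"]
result = run_with_answers(echo_stdin, answers)
print(result.stdout)
```

Passing the answers as a list keeps them next to the code that sets the paths, and `result.returncode` can be checked afterwards.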

## 2.2 Display the config.yaml file
Display the base configuration to identify the parameters that will be changed in the next step.

In [None]:
!ilab config show

In [None]:
# Copy config.yaml to the local directory
!cp /root/.config/instructlab/config.yaml .
!cat config.yaml

## 2.3 Customize LLM Models and copy to notebook for use

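This cell changes the models used in the generate stage. It sets the mistral model as the teacher model for the generate step by updating `generate.model` and `generate.teacher.model_path`. The `base_model_path` variable names the granite-7b-lab model, but the cell does not write it to the config.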

If you want to customize other models for generation or the training phase, you would specify the models in this step.

This step specifies that the models to be used will be from this notebook's models directory.

In [None]:
#Use ruamel.yaml to load the yaml file to preserve comments
import ruamel.yaml
yaml = ruamel.yaml.YAML()
with open('config.yaml', 'r') as file:
    config = yaml.load(file)

# Update the config to use the same models, changing only the directory
teacher_model_path = "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
base_model_path = "models/instructlab/granite-7b-lab"
#judge_model_path = "models/prometheus-eval/prometheus-8x7b-v2.0"

#config['evaluate']['mt_bench']['judge_model'] = judge_model_path
#config['evaluate']['mt_bench_branch']['judge_model'] = judge_model_path
config['generate']['model'] = teacher_model_path
config['generate']['teacher']['model_path']= teacher_model_path
#config['train']['phased_mt_bench_judge']=judge_model_path

# Save the updated config.yaml file
yaml.default_flow_style=False
with open('config.yaml', 'w') as file:
    yaml.dump(config, file)

# Copy the config file to .config/instructlab/, where InstructLab reads it
!cp config.yaml {base_dir}.config/instructlab/

print("Updated config.yaml successfully.\n")
!cat config.yaml
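
Assigning into nested keys fails with a bare `KeyError` if a parent section is missing from the loaded file. The sketch below wraps the assignment in a small helper that names the missing section. The helper `set_nested` is hypothetical, and a trimmed plain dictionary stands in for the loaded config.yaml. It is assumed, not verified here, that the mapping returned by ruamel.yaml behaves the same way for these operations.

```python
def set_nested(config: dict, dotted_key: str, value) -> None:
    """Set config[a][b][c] = value for 'a.b.c'; fail if a parent is missing."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        if key not in node:
            raise KeyError(f"missing section '{key}' in '{dotted_key}'")
        node = node[key]
    node[leaf] = value


# Trimmed stand-in for the loaded config.yaml (illustrative values only)
config = {"generate": {"model": "old", "teacher": {"model_path": "old"}}}

teacher_model_path = "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
set_nested(config, "generate.model", teacher_model_path)
set_nested(config, "generate.teacher.model_path", teacher_model_path)
print(config)
```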

<a id="I0_down"></a>
# Step 3. Download Models
The models that will be used in the InstructLab processing are downloaded in this step. Additional steps can be added if other models are used in processing.

## 3.1. Download the merlinite and mistral-7b-instruct-v0.2.Q4_K_M models

The merlinite model will be used as the teacher model for the simple pipeline in the [Training with InstructLab](./01_training_with_InstructLab.ipynb) notebook.

The mistral-7b-instruct-v0.2.Q4_K_M model will be used as the teacher model for the full pipeline in the same notebook.

The granite-7b-lab-Q4_K_M.gguf model is a quantized version of the granite-7b-lab model.

In [None]:
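models_dir = "models"
# Hugging Face access token; left empty here
hf_token = ""
# Quote the token so that an empty value is still passed as a single argument
!ilab model download --hf-token "{hf_token}" --model-dir {models_dir}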

In [None]:
!ls
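
Instead of inspecting the `ls` output by eye, the download can be checked with a short script. The sketch below lists which expected GGUF files are absent from a models directory. The helper name `missing_models` is hypothetical. The expected list is an assumption assembled from file names that appear elsewhere in this notebook, and the actual download may place files differently.

```python
from pathlib import Path

# File names taken from this notebook; the list itself is an assumption
EXPECTED_GGUF_FILES = [
    "granite-7b-lab-Q4_K_M.gguf",
    "merlinite-7b-lab-Q4_K_M.gguf",
    "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
]


def missing_models(models_dir, expected=EXPECTED_GGUF_FILES):
    """Return the expected file names that are not present in `models_dir`."""
    root = Path(models_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_models("models")
if missing:
    print("Missing models:", missing)
else:
    print("All expected models found")
```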

## 3.2. Optionally download the granite 7b safetensors model

Download the *granite-7b-lab* model. The *granite-7b-lab* model is used as:
* The default base model for training in the [Training with InstructLab](./01_training_with_InstructLab.ipynb) notebook.
* The default base model for inferencing comparisons in the [Inferencing with InstructLab](./02_inferencing_with_InstructLab.ipynb) notebook.
* The base model for inferencing comparisons in the [Inferencing with Redhat-AI-InstructLab Trained Model](./04_inferencing_with_RH_AI_InstructLab_Service.ipynb) notebook.

In [None]:
#!ilab model download --repository instructlab/granite-7b-lab --hf-token {hf_token} --model-dir {models_dir}

## 3.3. Optionally download the prometheus-8x7b-v2.0 model
The *prometheus-8x7b-v2.0* model is used as a judge model for multi-phase training and benchmark evaluation. This model is not required for simple or full training.

In [None]:
#!ilab model download --repository prometheus-eval/prometheus-8x7b-v2.0 --hf-token {hf_token} --model-dir {models_dir}
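
The download commands in this step differ only in the repository, and all of them pass a token that may be empty. The sketch below assembles the command line from those parts and leaves out `--hf-token` when no token is set. The helper name `build_download_command` is hypothetical, and only flags that already appear in this notebook are used. Whether a particular repository can be downloaded without a token is not checked here.

```python
import shlex


def build_download_command(model_dir, repository=None, hf_token=""):
    """Assemble an `ilab model download` command line as a shell-safe string."""
    args = ["ilab", "model", "download"]
    if repository:
        args += ["--repository", repository]
    if hf_token:
        # Only pass the flag when a token is actually provided
        args += ["--hf-token", hf_token]
    args += ["--model-dir", model_dir]
    return shlex.join(args)


print(build_download_command("models"))
print(build_download_command("models", repository="instructlab/granite-7b-lab"))
print(
    build_download_command(
        "models", repository="prometheus-eval/prometheus-8x7b-v2.0"
    )
)
```

In a notebook, the returned string could be run with `!{command}`, in the same way as the other cells.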

<a id="I0_conclusion"></a>
# Conclusion

This notebook demonstrated setting up the InstructLab environment to be ready for introducing datasets for data generation, training and model creation.

<a id="I0_learn"></a>
# Learn More

Proceed to run the [Training with InstructLab](./01_training_with_InstructLab.ipynb) notebook to introduce datasets, perform synthetic data generation and train InstructLab models.

InstructLab uses a novel synthetic data-based alignment tuning method for Large Language Models introduced in this [paper](https://arxiv.org/abs/2403.01081).

This notebook is based on the open source InstructLab CLI repository available [here](https://github.com/instructlab/instructlab).

Contact us by email to ask questions, discuss potential use cases, or schedule a technical deep dive. The contact email is IBM.Research.JupyterLab@ibm.com.

© 2025 IBM Corporation