- Have a variable pointing at the `qna.yaml` (initially `""`)
- Pull the document from the GitHub path or file system path given in the `qna.yaml`
- Chunk this document using docling v2
- Once you have these chunks, use the SDG functions to do the `icl_mapping`
- Have an input that goes into `generate_data`
- Have synthetically generated data

## How Chunking works in SDG

In [6]:
from instructlab.sdg.utils.taxonomy import read_taxonomy
from pathlib import Path


yaml_path = Path('./qna.yaml')


data = read_taxonomy(
  taxonomy=yaml_path,
  taxonomy_base='origin/main',
  yaml_rules=None,
  document_output_dir=None
)

Error retrieving documents: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v -- https://github.com/rhai-code/instructlab_knowledge None
  stderr: 'fatal: destination path 'None' already exists and is not an empty directory.
'


TaxonomyReadingException: Exception Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v -- https://github.com/rhai-code/instructlab_knowledge None
  stderr: 'fatal: destination path 'None' already exists and is not an empty directory.
' raised in qna.yaml
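
The error text suggests that the `None` passed as `document_output_dir` ended up as a literal clone destination named `None`, and that a non-empty directory by that name already existed when this cell ran. One possible workaround is to hand `read_taxonomy` a fresh, empty directory on every run. The sketch below assumes (not verified here) that `read_taxonomy` clones directly into the directory it is given; `fresh_output_dir` is a hypothetical helper, not part of `instructlab.sdg`.

```python
import tempfile
from pathlib import Path


def fresh_output_dir(prefix: str = "sdg-docs-") -> Path:
    """Return a newly created, empty directory for cloned source documents."""
    # mkdtemp always creates a new directory, so re-running a cell never
    # collides with whatever an earlier clone left behind.
    return Path(tempfile.mkdtemp(prefix=prefix))


output_dir = fresh_output_dir()
print(output_dir, list(output_dir.iterdir()))

# Hypothetical usage, mirroring the call above (not executed here):
# data = read_taxonomy(
#     taxonomy=Path("./qna.yaml"),
#     taxonomy_base="origin/main",
#     yaml_rules=None,
#     document_output_dir=output_dir,
# )
```

Temporary directories created this way are not removed automatically, so they may need cleaning up once the documents are no longer needed.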

In [7]:
data[0]


{'questions_and_answers': [{'question': 'What is InstructLab?\n',
   'answer': 'InstructLab is an open source AI project\nthat faciliates contributions to Large Language Models (LLMs).\n'},
  {'question': 'Can anyone contribute to InstructLab?\n',
   'answer': 'Yes, the community welcomes everyone\ninterested in generative AI.\n'},
  {'question': 'What is the mission of InstructLab?\n',
   'answer': 'We are on a mission to let anyone\nshape generative AI by enabling contributed\nupdates to existing LLMs in an accessible way.\nOur community welcomes all those who\nwould like to help us enable everyone\nto shape the future of generative AI.\n'}],
 'context': 'InstructLab is a model-agnostic open source AI project that facilitates\ncontributions to Large Language Models (LLMs).\nWe are on a mission to let anyone shape generative\nAI by enabling contributed updates to existing\nLLMs in an accessible way. Our community welcomes all those who\nwould like to help us enable everyone to shape\n

Inspecting the contents of the processed data, we see that there are a few keys:

```
dict_keys([
  'questions_and_answers',
  'context',
  'taxonomy_path',
  'documents',
  'filepaths',
  'domain',
  'document_outline',
])
```
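
To get a quick feel for what sits behind each key before printing everything, a small summary helper can be handy. This is a minimal sketch: `summarize_sample` is a hypothetical helper, and the `example` dictionary uses made-up stand-in values rather than real `read_taxonomy` output.

```python
def summarize_sample(sample: dict) -> dict:
    """Map each key to a short description of the value's type and size."""
    summary = {}
    for key, value in sample.items():
        if isinstance(value, (str, list, tuple, dict)):
            # Sized values: report the length alongside the type name.
            summary[key] = f"{type(value).__name__}, len={len(value)}"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in sample with made-up values; a real one would be data[0].
example = {
    "questions_and_answers": [{"question": "q?", "answer": "a"}],
    "context": "some context",
    "documents": ["# A markdown document"],
    "domain": None,
}

for key, description in summarize_sample(example).items():
    print(f"{key}: {description}")
```

In the notebook, the same helper could be called as `summarize_sample(data[0])`.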

In [17]:
sample = data[0]
sep_length = 400

# context from the qna
print(sample['context'])
print('-' * sep_length)

# path in the taxonomy
print(sample['taxonomy_path'])
print('-' * sep_length)
# documents list
print(sample['documents'])

# filepaths
print('-' * sep_length)
print(sample['filepaths'])


# domain
print('-' * sep_length)
print(sample['domain'])

# document_outline
print('-' * sep_length)
print(sample['document_outline'])

InstructLab is a model-agnostic open source AI project that facilitates
contributions to Large Language Models (LLMs).
We are on a mission to let anyone shape generative
AI by enabling contributed updates to existing
LLMs in an accessible way. Our community welcomes all those who
would like to help us enable everyone to shape
the future of generative AI.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/->Users->osilkin->Programming->labrador->sdg->notebooks
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It looks like a lot of these fields are taken directly from the `qna.yaml` file. However, one that seems to be unique is the `documents` field. Let's take a closer look:

In [19]:
len(sample['documents'])

1

It's a list of documents. We only provided a single document source, so it makes sense that the list holds just one document. Let's print it out to see what it contains:

In [22]:
from IPython.display import display, Markdown


# Display it as rendered markdown
display(Markdown(sample['documents'][0]))

# Welcome to the 🐶 InstructLab Project

![Banner](https://github.com/instructlab/.github/blob/main/assets/instructlab-banner.png)
InstructLab is a model-agnostic open source AI project that facilitates contributions to Large Language Models (LLMs).

We are on a mission to let anyone shape generative AI by enabling contributed updates to existing LLMs in an accessible way.

**Our community welcomes all those who would like to help us enable ***everyone*** to shape the future of generative AI.**

## Why InstructLab

There are many projects rapidly embracing and extending permissively licensed AI models, but they are faced with three main challenges:

* Contribution to LLMs is not possible directly. They show up as forks, which forces consumers to choose a “best-fit” model that isn’t easily extensible. Also, the forks are expensive for model creators to maintain.
* The ability to contribute ideas is limited by a lack of AI/ML expertise. One has to learn how to fork, train, and refine models to see their idea move forward. This is a high barrier to entry.
* There is no direct community governance or best practice around review, curation, and distribution of forked models.

**InstructLab is here to solve these problems.**

The project enables community contributors to add additional "skills" or "knowledge" to a particular model.

InstructLab's model-agnostic technology gives model upstreams with sufficient infrastructure resources the ability to create regular builds of their open source licensed models not by rebuilding and retraining the entire model but by composing new skills into it.

Take a look at "lab-enhanced" models on the [InstructLab Hugging Face page](https://huggingface.co/instructlab).

## Get Started with InstructLab

* Check out the [Community README](https://github.com/instructlab/community/blob/main/README.md) to get started with using and contributing to the project.
* You may wish to read through the [project's FAQ](https://github.com/instructlab/community/blob/main/FAQ.md) to get more familiar with all aspects of InstructLab.
* If you want to jump right in, head to the [`ilab` documentation](https://github.com/instructlab/instructlab/blob/main/README.md) to get InstructLab set up and running.
* Learn more about the [skills and knowledge](https://github.com/instructlab/taxonomy/blob/main/README.md) you can add to models.
* You can find all the ways to collaborate with project maintainers and your fellow users of InstructLab beyond GitHub by visiting our [project collaboration](https://github.com/instructlab/community/blob/main/Collaboration.md) page.
* When you are ready to make a contribution to the project, please take a few minutes to look over our [contribution guidelines](https://github.com/instructlab/community/blob/main/CONTRIBUTING.md) to ensure your contribution is aligned with the project policies.

## Community Meetings

For folks getting started with all things InstructLab, it may be easiest for you to join one of our community meetings and speak with project maintainers and other InstructLab collaborators live. You can find details on all of our community meetings, including our open office hours each Thursday, in our detailed [Project Meetings documentation](https://github.com/instructlab/community/blob/main/Collaboration.md#project-meetings).

Everyone is welcome and encouraged to attend if they will find value in joining. Please note that some meetings are recorded and the recordings [published in our project YouTube channel](https://www.youtube.com/@InstructLab/playlists). The meeting host will advise all attendees if the meeting is being recorded. If you prefer to join camera off or dial in via phone so as to not be actively recorded and/or you prefer not to be on camera during meetings, that is absolutely no problem.

## Code of Conduct

Participation in all aspects of the InstructLab community (including but not limited to community meetings, mailing lists, real-time chat, and the project GitHub repos) is governed by our [Code of Conduct](https://github.com/instructlab/community/blob/main/CODE_OF_CONDUCT.md).

## Quick Links

### Governance

See the [project governance document](https://github.com/instructlab/community/blob/main/GOVERNANCE.md) for an overview of how InstructLab project operates.

### Security

Security policies and practices, including reporting vulnerabilities, can be found in our [security document](https://github.com/instructlab/community/blob/main/SECURITY.md).

### Read the Paper

InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "lab" in InstructLab 🥼 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [1].

[1] Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions)

## Acknowledgements

The InstructLab project is sponsored by Red Hat.

InstructLab was originally created by engineers from Red Hat and IBM Research.

The infrastructure used to regularly train models based on new contributions from the community is donated and maintained by IBM.

Unsurprisingly, it contains the markdown content of the InstructLab README file. 

Now let's create a separate `qna.yaml` file using an example cat shelter PDF.

In [32]:
from instructlab.sdg.utils.taxonomy import read_taxonomy
from pathlib import Path


yaml_path = Path('./cat-shelter-example/qna.yaml')


data = read_taxonomy(
  taxonomy=yaml_path,
  taxonomy_base='origin/main',
  yaml_rules=None,
  document_output_dir=None
)



Learnings:

- A `qna.yaml` **must** have 5 context entries.
- Each context entry must consist of a unique set of questions and answers. Duplicates will throw an error.
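
These two rules can be checked before calling `read_taxonomy`. The following is a rough sketch, not the validation that InstructLab itself performs: `check_seed_examples` is a hypothetical helper, it treats 5 as a minimum (an assumption), and it assumes the context entries have already been loaded into a list of dictionaries shaped like the `questions_and_answers` output shown in this notebook.

```python
def check_seed_examples(seed_examples: list[dict], min_contexts: int = 5) -> list[str]:
    """Return human-readable problems found in a list of context entries."""
    problems = []
    if len(seed_examples) < min_contexts:
        problems.append(
            f"expected at least {min_contexts} context entries, found {len(seed_examples)}"
        )

    # Remember which entry first used each (question, answer) pair.
    seen = {}
    for index, example in enumerate(seed_examples):
        for pair in example.get("questions_and_answers", []):
            key = (pair["question"].strip(), pair["answer"].strip())
            if key in seen:
                problems.append(
                    f"entry {index} repeats a question/answer pair from entry {seen[key]}"
                )
            else:
                seen[key] = index
    return problems


# Made-up entries that break both rules.
entries = [
    {"context": "c1", "questions_and_answers": [{"question": "Q1?\n", "answer": "A1\n"}]},
    {"context": "c2", "questions_and_answers": [{"question": "Q1?\n", "answer": "A1\n"}]},
]
for problem in check_seed_examples(entries):
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to fix a `qna.yaml` in a single pass.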

In [33]:
data

[{'questions_and_answers': [{'question': 'What is a feral cat shelter?\n',
    'answer': 'A feral cat shelter is a structure designed to provide protection from inclement weather for feral cat colonies.\n'},
   {'question': 'What is the mission of Alley Cat Allies?\n',
    'answer': 'Alley Cat Allies is a nonprofit organization dedicated to promoting the humane treatment of feral and free-roaming cats.\n'},
   {'question': 'What types of shelters are recommended for feral cat colonies?\n',
    'answer': 'Dog igloos can be used in less harsh climates.\nFor extremely harsh, cold, and wet climates, insulation is advised.\n'}],
  'context': 'Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.\nFollowing are detailed instructions needed to build a feral cat shelter. These building plans are\nrecommended for use throughout the United States. For extremely harsh, cold, and wet climates,\ninsulation (as described) is advised. Other types of shelte

In [34]:
sample = data[0]
print(sample.keys())

dict_keys(['questions_and_answers', 'context', 'taxonomy_path', 'documents', 'filepaths', 'domain', 'document_outline'])


In [36]:
sample

{'questions_and_answers': [{'question': 'What is a feral cat shelter?\n',
   'answer': 'A feral cat shelter is a structure designed to provide protection from inclement weather for feral cat colonies.\n'},
  {'question': 'What is the mission of Alley Cat Allies?\n',
   'answer': 'Alley Cat Allies is a nonprofit organization dedicated to promoting the humane treatment of feral and free-roaming cats.\n'},
  {'question': 'What types of shelters are recommended for feral cat colonies?\n',
   'answer': 'Dog igloos can be used in less harsh climates.\nFor extremely harsh, cold, and wet climates, insulation is advised.\n'}],
 'context': 'Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.\nFollowing are detailed instructions needed to build a feral cat shelter. These building plans are\nrecommended for use throughout the United States. For extremely harsh, cold, and wet climates,\ninsulation (as described) is advised. Other types of shelters, suc

As in the previous example, we have a listing of various items from the `qna.yaml`. But now the `documents` field contains a list of the documents, including our parsed PDF.

In [38]:
from IPython.display import display, Markdown


# Print the raw text instead of rendering it as markdown
# display(Markdown(sample['documents'][0]))
print(sample['documents'][0])

www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

Here we see that the PDF was parsed, but not cleanly: the document's formatting caused problems, such as bullet characters ending up on lines of their own, separated from their list items. This shows that PDFs cannot always be parsed perfectly.
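
One rough way to flag this kind of damage is to count lines that contain nothing but a bullet character. This is only a heuristic sketch with a hypothetical helper name, and the `excerpt` is copied from the parsed output above.

```python
def count_orphan_bullets(text: str, bullet: str = "•") -> int:
    """Count lines that consist solely of a bullet character."""
    return sum(1 for line in text.splitlines() if line.strip() == bullet)


# A few lines copied from the parsed PDF output.
excerpt = (
    "Materials Needed\n"
    "•\n"
    "One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood\n"
    "or waferboard\n"
    "•\n"
    "One 4-ft. x 8-ft. sheet interior paneling or thin plywood\n"
)
print(count_orphan_bullets(excerpt))  # prints 2
```

A non-zero count on `sample['documents'][0]` would hint that list structure was lost during parsing, although it says nothing about which items belong to which bullet.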

In [39]:
for sample in data:
  print(sample['documents'][0], end=('\n' + '=' * 200))


www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

In [41]:
print(data[0]['documents'][0])

www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

In [42]:
from instructlab.sdg.utils.taxonomy import leaf_node_to_samples


chunked_samples = leaf_node_to_samples(
  leaf_node=data,
  taxonomy_path=Path('./cat-shelter-example'),
  server_ctx_size=4096,
  chunk_word_count=1024,
  document_output_dir=Path('./output'),
  model_name='gpt2'
)

Failed to load tokenizer as no valid model was not found at gpt2. Please provide a path to a valid model format. For help on downloading models, run `ilab model download --help`.


ValueError: 
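
The error message indicates that `model_name` is expected to be a path to a valid model rather than a bare name like `gpt2`, so no chunks were produced by this call. To illustrate what a word budget such as `chunk_word_count` might mean, here is a minimal word-based chunker. It is a simplified stand-in under the assumption that the budget caps the number of words per chunk; it is not how `instructlab.sdg` chunks documents, and `chunk_by_words` is a hypothetical name.

```python
def chunk_by_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    # Splitting on whitespace discards the original line breaks and spacing.
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("one two three four five six seven", max_words=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

A chunker like this ignores sentence and section boundaries, which is one reason structure-aware or tokenizer-aware chunking is usually preferred for real documents.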