- have a variable qna.yaml = ""
- pull the document from the github path/file system path in the qna.yaml
- chunk this document using docling v2
- once you have these chunks, use the SDG functions to do the icl_mapping
- have an input that goes into generate_data
- Have synthetically generated data

## How Chunking works in SDG

In [6]:
from instructlab.sdg.utils.taxonomy import read_taxonomy
from pathlib import Path


yaml_path = Path('./qna.yaml')


data = read_taxonomy(
  taxonomy=yaml_path,
  taxonomy_base='origin/main',
  yaml_rules=None,
  document_output_dir=None
)

In [7]:
data[0]


{'questions_and_answers': [{'question': 'What is InstructLab?\n',
   'answer': 'InstructLab is an open source AI project\nthat faciliates contributions to Large Language Models (LLMs).\n'},
  {'question': 'Can anyone contribute to InstructLab?\n',
   'answer': 'Yes, the community welcomes everyone\ninterested in generative AI.\n'},
  {'question': 'What is the mission of InstructLab?\n',
   'answer': 'We are on a mission to let anyone\nshape generative AI by enabling contributed\nupdates to existing LLMs in an accessible way.\nOur community welcomes all those who\nwould like to help us enable everyone\nto shape the future of generative AI.\n'}],
 'context': 'InstructLab is a model-agnostic open source AI project that facilitates\ncontributions to Large Language Models (LLMs).\nWe are on a mission to let anyone shape generative\nAI by enabling contributed updates to existing\nLLMs in an accessible way. Our community welcomes all those who\nwould like to help us enable everyone to shape\n

Inspecting the contents of the processed data, we see that there are a few keys:

```
dict_keys([
  'questions_and_answers',
  'context',
  'taxonomy_path',
  'documents',
  'filepaths',
  'domain',
  'document_outline',
])
```

In [8]:
sample = data[0]
sep_length = 400

# context from the qna
print(sample['context'])
print('-' * sep_length)

# path in the taxonomy
print(sample['taxonomy_path'])
print('-' * sep_length)
# documents list
print(sample['documents'])

# filepaths
print('-' * sep_length)
print(sample['filepaths'])


# domain
print('-' * sep_length)
print(sample['domain'])

# document_outline
print('-' * sep_length)
print(sample['document_outline'])

InstructLab is a model-agnostic open source AI project that facilitates
contributions to Large Language Models (LLMs).
We are on a mission to let anyone shape generative
AI by enabling contributed updates to existing
LLMs in an accessible way. Our community welcomes all those who
would like to help us enable everyone to shape
the future of generative AI.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/->home->ec2-user->sdg->notebooks
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It looks like a lot of these fields are taken directly from the `qna.yaml` file. However; one that seems to be unique is the `documents` field, let's take a closer look:

In [9]:
len(sample['documents'])

1

It looks like it's a list of documents, in our case we only provided a single document source, so it makes sense that there's only one document in the list. Let's print it out to see what it contains

In [10]:
from IPython.display import display, Markdown


# Display it as rendered markdown
display(Markdown(sample['documents'][0]))

# Welcome to the 🐶 InstructLab Project

![Banner](https://github.com/instructlab/.github/blob/main/assets/instructlab-banner.png)
InstructLab is a model-agnostic open source AI project that facilitates contributions to Large Language Models (LLMs).

We are on a mission to let anyone shape generative AI by enabling contributed updates to existing LLMs in an accessible way.

**Our community welcomes all those who would like to help us enable ***everyone*** to shape the future of generative AI.**

## Why InstructLab

There are many projects rapidly embracing and extending permissively licensed AI models, but they are faced with three main challenges:

* Contribution to LLMs is not possible directly. They show up as forks, which forces consumers to choose a “best-fit” model that isn’t easily extensible. Also, the forks are expensive for model creators to maintain.
* The ability to contribute ideas is limited by a lack of AI/ML expertise. One has to learn how to fork, train, and refine models to see their idea move forward. This is a high barrier to entry.
* There is no direct community governance or best practice around review, curation, and distribution of forked models.

**InstructLab is here to solve these problems.**

The project enables community contributors to add additional "skills" or "knowledge" to a particular model.

InstructLab's model-agnostic technology gives model upstreams with sufficient infrastructure resources the ability to create regular builds of their open source licensed models not by rebuilding and retraining the entire model but by composing new skills into it.

Take a look at "lab-enhanced" models on the [InstructLab Hugging Face page](https://huggingface.co/instructlab).

## Get Started with InstructLab

* Check out the [Community README](https://github.com/instructlab/community/blob/main/README.md) to get started with using and contributing to the project.
* You may wish to read through the [project's FAQ](https://github.com/instructlab/community/blob/main/FAQ.md) to get more familiar with all aspects of InstructLab.
* If you want to jump right in, head to the [`ilab` documentation](https://github.com/instructlab/instructlab/blob/main/README.md) to get InstructLab set up and running.
* Learn more about the [skills and knowledge](https://github.com/instructlab/taxonomy/blob/main/README.md) you can add to models.
* You can find all the ways to collaborate with project maintainers and your fellow users of InstructLab beyond GitHub by visiting our [project collaboration](https://github.com/instructlab/community/blob/main/Collaboration.md) page.
* When you are ready to make a contribution to the project, please take a few minutes to look over our [contribution guidelines](https://github.com/instructlab/community/blob/main/CONTRIBUTING.md) to ensure your contribution is aligned with the project policies.

## Community Meetings

For folks getting started with all things InstructLab, it may be easiest for you to join one of our community meetings and speak with project maintainers and other InstructLab collaborators live. You can find details on all of our community meetings, including our open office hours each Thursday, in our detailed [Project Meetings documentation](https://github.com/instructlab/community/blob/main/Collaboration.md#project-meetings).

Everyone is welcome and encouraged to attend if they will find value in joining. Please note that some meetings are recorded and the recordings [published in our project YouTube channel](https://www.youtube.com/@InstructLab/playlists). The meeting host will advise all attendees if the meeting is being recorded. If you prefer to join camera off or dial in via phone so as to not be actively recorded and/or you prefer not to be on camera during meetings, that is absolutely no problem.

## Code of Conduct

Participation in all aspects of the InstructLab community (including but not limited to community meetings, mailing lists, real-time chat, and the project GitHub repos) is governed by our [Code of Conduct](https://github.com/instructlab/community/blob/main/CODE_OF_CONDUCT.md).

## Quick Links

### Governance

See the [project governance document](https://github.com/instructlab/community/blob/main/GOVERNANCE.md) for an overview of how InstructLab project operates.

### Security

Security policies and practices, including reporting vulnerabilities, can be found in our [security document](https://github.com/instructlab/community/blob/main/SECURITY.md).

### Read the Paper

InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "lab" in InstructLab 🥼 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [1].

[1] Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions)

## Acknowledgements

The InstructLab project is sponsored by Red Hat.

InstructLab was originally created by engineers from Red Hat and IBM Research.

The infrastructure used to regularly train models based on new contributions from the community is donated and maintained by IBM.

Unsurprisingly, it contains the markdown content of the InstructLab README file. 

Now let's create a separate `qna.yaml` file using an example cat shelter PDF.

In [11]:
!rm -rf './None'

In [12]:
from instructlab.sdg.utils.taxonomy import read_taxonomy
from pathlib import Path


yaml_path = Path('./cat-shelter-example/qna.yaml')


data = read_taxonomy(
  taxonomy=yaml_path,
  taxonomy_base='origin/main',
  yaml_rules=None,
  document_output_dir=None
)



Learnings:

- A `qna.yaml` **must** have 5 context entries.
- Each context entry must be comprised of a unique set of questions and answers. Duplicates will throw an error

In [13]:
data

[{'questions_and_answers': [{'question': 'What is a feral cat shelter?\n',
    'answer': 'A feral cat shelter is a structure designed to provide protection from inclement weather for feral cat colonies.\n'},
   {'question': 'What is the mission of Alley Cat Allies?\n',
    'answer': 'Alley Cat Allies is a nonprofit organization dedicated to promoting the humane treatment of feral and free-roaming cats.\n'},
   {'question': 'What types of shelters are recommended for feral cat colonies?\n',
    'answer': 'Dog igloos can be used in less harsh climates.\nFor extremely harsh, cold, and wet climates, insulation is advised.\n'}],
  'context': 'Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.\nFollowing are detailed instructions needed to build a feral cat shelter. These building plans are\nrecommended for use throughout the United States. For extremely harsh, cold, and wet climates,\ninsulation (as described) is advised. Other types of shelte

In [14]:
sample = data[0]
print(sample.keys())

dict_keys(['questions_and_answers', 'context', 'taxonomy_path', 'documents', 'filepaths', 'domain', 'document_outline'])


In [15]:
sample

{'questions_and_answers': [{'question': 'What is a feral cat shelter?\n',
   'answer': 'A feral cat shelter is a structure designed to provide protection from inclement weather for feral cat colonies.\n'},
  {'question': 'What is the mission of Alley Cat Allies?\n',
   'answer': 'Alley Cat Allies is a nonprofit organization dedicated to promoting the humane treatment of feral and free-roaming cats.\n'},
  {'question': 'What types of shelters are recommended for feral cat colonies?\n',
   'answer': 'Dog igloos can be used in less harsh climates.\nFor extremely harsh, cold, and wet climates, insulation is advised.\n'}],
 'context': 'Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.\nFollowing are detailed instructions needed to build a feral cat shelter. These building plans are\nrecommended for use throughout the United States. For extremely harsh, cold, and wet climates,\ninsulation (as described) is advised. Other types of shelters, suc

As in the previous example, we have a listing of various items from the `qna.yaml`. But now the `.documents` field contains a list of the documents including our parsed PDF

In [16]:
from IPython.display import display, Markdown


# Display it as rendered markdown
# display(Markdown(sample['documents'][0]))
print(sample['documents'][0])

www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

Here we see the PDF was attempted to be parsed, but we ran into issues due to the formatting of the PDF document, showing that we cannot always perfectly parse documents when dealing with PDFs.

In [17]:
for sample in data:
  print(sample['documents'][0], end=('\n' + '=' * 200))


www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

In [18]:
print(data[0]['documents'][0])

www.alleycat.org • 7920 Norfolk Avenue, Suite 600 • Bethesda, MD 20814-2525 • ©2017
BUILD AN INEXPENSIVE CAT SHELTER
The following instructions are for building an
insulated cat shelter 2 ft. x 3 ft. x 18in. high.
You should be able to buy the materials at a local lumberyard.
An electric saw and screwdriver are highly recommended.
Caution: If you are not experienced with an electric saw, ask a
skilled person to cut the wood and paneling.
Materials Needed
•
One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood
or waferboard
•
One 4-ft. x 8-ft. sheet interior paneling or thin plywood
One package roofing shingles or enough to cover
8-sq. ft. roof
•
Two 2-in. x 3-in. x 6-ft. untreated lumber
Linoleum or other floor tiles (to cover 6-sq. ft. floor)
•
One quart exterior house paint
Two medium hinges (“T” or gate hinges)
•
Fifty 2-in. flat head wood screws or grippers
Four to nine bricks for foundation
•
Small roofing nails (approximately 15)
Fiberglass insulation (1 roll, or enough to c

In [19]:
from instructlab.sdg.utils.taxonomy import leaf_node_to_samples

model_path = '/home/ec2-user/.cache/huggingface/hub/models--mistralai--Mixtral-8x7B-v0.1/snapshots/ffe1a706bacbd5abddc5ff99432ee38f7e0662fb'

chunked_samples = leaf_node_to_samples(
  leaf_node=data,
  taxonomy_path=Path('./cat-shelter-example'),
  server_ctx_size=4096,
  chunk_word_count=1024,
  document_output_dir=Path('./output'),
  model_name=model_path,
)

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 84828.62it/s]
Using CPU. Note: This module is much faster with a GPU.
Using CPU. Note: This module is much faster with a GPU.


generated chunks: ['Alley Cat Allies\n\n# Title: **Caregiving Information**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER**\n\nA lley Cat Allies recommends that feral cat colonies have proper protection from inclement weather. Following are detailed instructions needed to build a feral cat shelter. These building plans are recommended for use throughout the United States. For extremely harsh, cold, and wet climates, insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.\n\n## **The following instructions are for building an insulated cat shelter 2 ft. x 3 ft. x 18in. high.**\n\nYou should be able to buy the materials at a local lumberyard. An electric saw and screwdriver are highly recommended. Caution: If you are not experienced with an electric saw, ask a skilled person to cut the wood and paneling.\n\n## **Materials Needed**\n\n- · One 4-ft. x 1/2-in x 8-ft. 

chunked_samples

In [20]:
chunked_samples
df = chunked_samples.to_pandas()
df.to_excel('chunked_samples.xlsx')

In [21]:
import json
print(json.dumps(chunked_samples[0], indent=2))

{
  "document": "Alley Cat Allies\n\n# Title: **Caregiving Information**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER**\n\nA lley Cat Allies recommends that feral cat colonies have proper protection from inclement weather. Following are detailed instructions needed to build a feral cat shelter. These building plans are recommended for use throughout the United States. For extremely harsh, cold, and wet climates, insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.\n\n## **The following instructions are for building an insulated cat shelter 2 ft. x 3 ft. x 18in. high.**\n\nYou should be able to buy the materials at a local lumberyard. An electric saw and screwdriver are highly recommended. Caution: If you are not experienced with an electric saw, ask a skilled person to cut the wood and paneling.\n\n## **Materials Needed**\n\n- \u00b7 One 4-ft. x 1/2-in x 8-ft

So far I've learned a few things:

**Docling performs advanced file parsing using AI models**


Specifically, they use AI models in order to extract the complex structure of files such as .PDF, .docx, .pptx, which allows it to understand things like:
- complex structure
- table structure detection
- OCR

They accomplish this using model pipelines, e.g.:

```
self.build_pipe = [
    PagePreprocessingModel,
    OCR model,
    LayoutModel,
    TableStructureModel,
    PageAssembleModel
]
```


**The model tokenizer is used during document chunking**:


The model tokenizer is used in order to split up the text segments by their token lengths. This ensures that the sequences don't exceed the model's context length.

Sequences of a lower length are fused together in order to ensure that we don't have very small and atomic text segments.

We make use of chunk size control to ensure that chunks stay within the specified token limits. This also enables us to have consistent chunks.


In [22]:
# now lets inspect the actual document context

# full document
print(len(chunked_samples[0]['document']))
print(len(chunked_samples[0]['icl_document']))

# this is the chunk from context
print(chunked_samples[0]['icl_document'])

# this is simply the first question & answer pair from the `qna.yaml` file
print(chunked_samples[0]['icl_query_1'])
print(chunked_samples[0]['icl_response_1'])

1594
478
Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.
Following are detailed instructions needed to build a feral cat shelter. These building plans are
recommended for use throughout the United States. For extremely harsh, cold, and wet climates,
insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less
harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.

What is a feral cat shelter?

A feral cat shelter is a structure designed to provide protection from inclement weather for feral cat colonies.



In [23]:
# now lets inspect the actual document context

# full document
print(len(chunked_samples[1]['document']))
print(len(chunked_samples[1]['icl_document']))

# this is the chunk from context
print(chunked_samples[1]['icl_document'])

# this is simply the first question & answer pair from the `qna.yaml` file
print(chunked_samples[1]['icl_query_1'])
print(chunked_samples[1]['icl_response_1'])

1594
544
Things to consider before starting your project:
These will help you determine what you need to buy and
how much work will be involved, and also provide a few
helpful hints.
- How many cats do you need to house? This number
  determines how many shelters to build. Keep in mind that
  not all cats are likely to use the shelter, or at least not all at
  the same time. This shelter should probably house no more
  than five to seven cats at once. You can adjust this plan
  to make a larger shelter, or build more than one shelter
  as needed.

How many cats can a feral cat shelter house?

A feral cat shelter should probably house no more
than five to seven cats at once.



In [24]:
for i, sample in enumerate(chunked_samples):
  print(f'[{i}] ================================================================================================================')
  print(f"")
  print(f"icl_query_1: {sample['icl_query_1']}")
  print(f"icl_response_1: {sample['icl_query_1']}")


icl_query_1: What is a feral cat shelter?

icl_response_1: What is a feral cat shelter?


icl_query_1: How many cats can a feral cat shelter house?

icl_response_1: How many cats can a feral cat shelter house?


icl_query_1: How to assemble a feral cat shelter?

icl_response_1: How to assemble a feral cat shelter?


icl_query_1: What is the best way to insulate a feral cat shelter?

icl_response_1: What is the best way to insulate a feral cat shelter?


icl_query_1: How many cats can a feral cat shelter house?

icl_response_1: How many cats can a feral cat shelter house?


icl_query_1: What is a feral cat shelter?

icl_response_1: What is a feral cat shelter?


icl_query_1: How many cats can a feral cat shelter house?

icl_response_1: How many cats can a feral cat shelter house?


icl_query_1: How to assemble a feral cat shelter?

icl_response_1: How to assemble a feral cat shelter?


icl_query_1: What is the best way to insulate a feral cat shelter?

icl_response_1: What is the best 

If we actually inpsect the makeup of these chunked samples, we find that there are duplicates of each question and document:

In [25]:
from collections import Counter

doc_counts = Counter([s['icl_document'] for s in chunked_samples])
icl_query_1_counts = Counter([s['icl_query_1'] for s in chunked_samples])
doc_counts, icl_query_1_counts

(Counter({'Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.\nFollowing are detailed instructions needed to build a feral cat shelter. These building plans are\nrecommended for use throughout the United States. For extremely harsh, cold, and wet climates,\ninsulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less\nharsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.\n': 4,
          'Things to consider before starting your project:\nThese will help you determine what you need to buy and\nhow much work will be involved, and also provide a few\nhelpful hints.\n- How many cats do you need to house? This number\n  determines how many shelters to build. Keep in mind that\n  not all cats are likely to use the shelter, or at least not all at\n  the same time. This shelter should probably house no more\n  than five to seven cats at once. You can adjust this plan\n  to mak

Let's just print out the icl queries & documents where the questions are identical to understand what's happening better

In [26]:
selected_field = 'icl_query_2'
query_1 = chunked_samples[0][selected_field]
for i, sample in enumerate(chunked_samples):
  if sample[selected_field] != query_1:
    continue
  print(f'[{i}] ================================================================================================================')
  print(f"icl_document: {sample['icl_document']}")
  print(f"icl_query_1: {sample['icl_query_1']}")
  print(f"icl_response_1: {sample['icl_query_1']}")
  print(f"icl_query_2: {sample['icl_query_2']}")
  print(f"icl_response_2: {sample['icl_query_2']}")
  print(f"icl_query_3: {sample['icl_query_3']}")
  print(f"icl_response_3: {sample['icl_query_3']}")

icl_document: Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.
Following are detailed instructions needed to build a feral cat shelter. These building plans are
recommended for use throughout the United States. For extremely harsh, cold, and wet climates,
insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less
harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.

icl_query_1: What is a feral cat shelter?

icl_response_1: What is a feral cat shelter?

icl_query_2: What is the mission of Alley Cat Allies?

icl_response_2: What is the mission of Alley Cat Allies?

icl_query_3: What types of shelters are recommended for feral cat colonies?

icl_response_3: What types of shelters are recommended for feral cat colonies?

icl_document: Alley Cat Allies recommends that feral cat colonies have proper protection from inclement weather.
Following are detailed instructions

In [27]:
from instructlab.sdg.utils.taxonomy import map_chunks_to_icls



dataset = map_chunks_to_icls(chunked_samples, data)
dataset

generated chunks: Dataset({
    features: ['document', 'icl_document', 'document_outline', 'domain', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3'],
    num_rows: 20
})


Dataset({
    features: ['document', 'icl_document', 'document_outline', 'domain', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3'],
    num_rows: 100
})

In [28]:
chunked_samples['icl_query_1']

['What is a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'How to assemble a feral cat shelter?\n',
 'What is the best way to insulate a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'What is a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'How to assemble a feral cat shelter?\n',
 'What is the best way to insulate a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'What is a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'How to assemble a feral cat shelter?\n',
 'What is the best way to insulate a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'What is a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n',
 'How to assemble a feral cat shelter?\n',
 'What is the best way to insulate a feral cat shelter?\n',
 'How many cats can a feral cat shelter house?\n']

In [29]:
dataset[:3]

{'document': [{'document': 'Alley Cat Allies\n\n# Title: **Caregiving Information**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER**\n\nA lley Cat Allies recommends that feral cat colonies have proper protection from inclement weather. Following are detailed instructions needed to build a feral cat shelter. These building plans are recommended for use throughout the United States. For extremely harsh, cold, and wet climates, insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.\n\n## **The following instructions are for building an insulated cat shelter 2 ft. x 3 ft. x 18in. high.**\n\nYou should be able to buy the materials at a local lumberyard. An electric saw and screwdriver are highly recommended. Caution: If you are not experienced with an electric saw, ask a skilled person to cut the wood and paneling.\n\n## **Materials Needed**\n\n- · One 4-ft. x 1/2-in 

In [30]:
df = dataset.to_pandas()
df.to_excel("chunked_data.xlsx", index=False)  # For Excel format

No wonder dataset is getting replicated. We simply passed the processed qna file in through `map_chunks_to_icl` twice. 

In [31]:
t =  ['Alley Cat Allies\n\n# Title: **Caregiving Information**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER**\n\nA lley Cat Allies recommends that feral cat colonies have proper protection from inclement weather. Following are detailed instructions needed to build a feral cat shelter. These building plans are recommended for use throughout the United States. For extremely harsh, cold, and wet climates, insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.\n\n## **The following instructions are for building an insulated cat shelter 2 ft. x 3 ft. x 18in. high.**\n\nYou should be able to buy the materials at a local lumberyard. An electric saw and screwdriver are highly recommended. Caution: If you are not experienced with an electric saw, ask a skilled person to cut the wood and paneling.\n\n## **Materials Needed**\n\n- · One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood or waferboard\n\n- · One 4-ft. x 8-ft. sheet interior paneling or thin plywood\n\n- One package roofing shingles or enough to cover 8-sq. ft. roof\n\n- · Two 2-in. x 3-in. x 6-ft. untreated lumber\n\n- Linoleum or other floor tiles (to cover 6-sq. ft. floor)\n\n- · One quart exterior house paint\n\n- Two medium hinges ("T" or gate hinges)\n\n- · Fifty 2-in. flat head wood screws or grippers\n\n- Four to nine bricks for foundation\n\n- · Small roofing nails (approximately 15)\n\n- Fiberglass insulation (1 roll, or enough to cover 14-20 sq. ft.)\n\n## **Tools Needed**\n\n- · Hammer\n\n- Saw\n\n- · Electric screw driver', '- · Angle brace or T-square\n\n- · Staple gun\n\n- Measuring tape\n\n- · Marking pen\n\n## **Things to consider before starting your project**\n\nThese will help you determine what you need to buy and how much work will be involved, and also provide a few helpful hints.\n\n- · How many cats do you need to house? This number detemines how many shelters to build. Keep in mind that not all cats are likely to use the shelter, or at least not all at the same time. This shelter should probably house no more than five to seven cats at once. You can adjust this plan to make a larger shelter, or build more than one shelter as needed.\n\n## **Fact Sheet:**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER, page 2 of 4**\n\n- · Be sure to make the shelter small enough for transport in your vehicle. The shelter size described here will fit in a standard size car trunk with the trunk lid open.\n\n- · If you live in a climate that gets very cold, we recommend that you use insulation as described in the plans.\n\n- · Use only exterior paint to reduce weather exposure, preferably dark green or dark brown to match natural surroundings.\n\n- · Floor should have linoleum or tile square instead of carpet to reduce the chance of flea infestation. Carpets and towels retain moisture and should not be used.\n\n## **Assembly**\n\n**- 1** Cut wood. For easy assembly, cut all wood first, then assemble the shelter. Some pieces may need adjustment after cutting.\n\nCut plywood as shown at right. (This is only enough for one shelter.)\n\nCut paneling as shown at right. One sheet of paneling is enough for two shelters.\n\n**Cut 2-in** x 3-in. x 6-ft. lumber into eight posts and two shelf braces as shown at right.\n\n- · Use screws, not nails, for better durability.\n\n- Roof should be hinged so bedding can be replaced, and for easy access when retrieving kittens who may be in the shelter.', '- · Roof must be slanted to drain off water.\n\n- A wind block should be placed inside the door of the shelter to improve warmth. You may also consider a canvas flap to go over the door.\n\n- · Place wood chips or straw inside for warmth and comfort.\n\n- Blankets, towels, and carpets retain moisture.\n\n## **Fact Sheet:**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER, page 3 of 4**\n\n**- 2** Put side wall A in place on the left of the base and screw front wall and left side wall together using one 17-in. corner post.\n\n**- 3** Position side wall B on the right and attach to front wall using other 17-in. corner post.\n\n**- 4** Position back wall and attach to both side walls using two 11-in. corner posts. Note: Corner posts should rest on top of the base, as should the front, back, and side walls. All posts should be inside of the front, back, and side walls.\n\n**- 5** Turn walls upside down and place the 3-ft. x 2-ft. base on top. Mount base to sides, first screwing down corners then going along edges. Be careful that screws go straight into plywood walls, without protruding through sides.\n\n**- 6** Turn the shelter back to upright position.\n\n**- 7** Cut and staple insulation blueboard to inside of side walls A and B.\n\n**- 8** Attach front and back posts for front and back wall supports. Note that the posts are placed flat against the front and back walls, at right angles to the corner posts, as shown. The post next to the front door should be 5-1/2 inches from the right interior wall to leave room for the wind block.\n\n**- 9** Cut and staple the remaining insulation to the inside of front and back walls.\n\n**- 10** Put the wind block in place and screw it to the front of the shelter, then to bottom (do this from outside in).', '**- 11** For extra cat sleeping room, screw 5-in. shelf braces upright to the center of wind block and left interior wall near the front corner of shelter to support shelf, if desired. Then screw 9-in. x 2-ft. 3.5-in. shelf on top of braces.\n\n**- 12** Place the 2-ft. 7-in. x 3-ft. 3-in. roof on bench and turn shelter upside down. Center shelter on the roof with roof hanging over on all sides. Screw hinges to the underside of the roof and outside the front of the shelter so it will open easily and stand up straight on its own.\n\n**- 13** Turn the shelter back over and attach shingles with roofing nails in an offset pattern to seal against weather. After nailing shingles bend nail points over to avoid injuring cats.\n\nSteps 1-6\n\nSteps 7-8\n\n## **Fact Sheet:**\n\n## **BUILD AN INEXPENSIVE CAT SHELTER, page 4 of 4**\n\n**- 14** Place the vinyl floor tiles inside if desired for extra protection.\n\n**- 15** Paint the shelter (all exposed wood should be painted, including bottom, to protect it from rain and/or snow).\n\n**- 16**When installing the shelter, make sure to set it on top of bricks or other objects to keep it away from ground contact. Also take prevailing winds and exposure into account; placing shelter front facing south often maximizes warmth.\n\nShelter Interior\n\nShelter design and construction drawings by Bill McFadden and Ken Crawford. Shelter illustration by Doug Hall.\n\nNote: You may also cover the interior underside of the roof with fiberglass or plastic foam insulation, but be sure to cover it with plastic or wood. Foam needs to be covered to hold it in place, and uncovered fiberglass will harm cats. You can insulate the shelter with strong plastic to keep out wind, rain, and cold. Leave a small opening for the cats to enter. A flap can be placed over the entrance for added protection.']
len(t)

4

In [32]:
print(('\n' +'='*250).join(t))

Alley Cat Allies

# Title: **Caregiving Information**

## **BUILD AN INEXPENSIVE CAT SHELTER**

A lley Cat Allies recommends that feral cat colonies have proper protection from inclement weather. Following are detailed instructions needed to build a feral cat shelter. These building plans are recommended for use throughout the United States. For extremely harsh, cold, and wet climates, insulation (as described) is advised. Other types of shelters, such as dog igloos, can be used in less harsh climates. Go to www.alleycat.org/BuildaShelter for additional shelter ideas.

## **The following instructions are for building an insulated cat shelter 2 ft. x 3 ft. x 18in. high.**

You should be able to buy the materials at a local lumberyard. An electric saw and screwdriver are highly recommended. Caution: If you are not experienced with an electric saw, ask a skilled person to cut the wood and paneling.

## **Materials Needed**

- · One 4-ft. x 1/2-in x 8-ft. sheet of exterior grade plywood or

In [33]:
print([len(d) for d in t])
print(sum(len(d) for d in t))

[1594, 1843, 1734, 1816]
6987


OKAY 

So I've figured it out:

The `leaf_node_to_samples` function effectively takes a `qna.yaml` file and multiplies all of the document chunks to all of the contexts that we've provided. So if we have a single document in the qna.yaml which produced 3 chunks, and we've provided 5 contexts, then we would come out with a total of 3x5=15 chunked examples.

So we effectively replicate the set of context examples such that a unique combination of chunk_i + example_k exists for all chunks i and all examples k

### Generating knowledge samples

Now that we've obtained our `chunked_samples` which contains a mapping of every chunk from our document into every context example in the qna, we can generate some data samples.

In [49]:
from openai import Client
from instructlab.sdg.generate_data import MODEL_FAMILY_MIXTRAL

client = Client(
  base_url='http://127.0.0.1:8000/v1',
  api_key='default',
)
model_id = '/home/ec2-user/.cache/huggingface/hub/models--mistralai--Mixtral-8x7B-v0.1/snapshots/ffe1a706bacbd5abddc5ff99432ee38f7e0662fb'

from instructlab.sdg.pipeline import PipelineContext, Pipeline

ctx = PipelineContext(
  client=client,
  model_id=model_id,
  model_family=MODEL_FAMILY_MIXTRAL,
  num_instructions_to_generate=500,
  max_num_tokens=4096,
  batch_size=8,
)

knowledge_pipe = Pipeline.from_file(ctx, '/home/ec2-user/sdg/src/instructlab/sdg/pipelines/full/knowledge.yaml')

Dataset({
    features: ['document', 'icl_document', 'document_outline', 'domain', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3'],
    num_rows: 20
})

In [50]:
result = knowledge_pipe.generate(chunked_samples, 'checkpoint_path')

In [51]:
result

Dataset({
    features: [],
    num_rows: 0
})