# LLM data creation

This notebook adapts the llm-data-creation GitHub repository to show how the code works. It focuses on `data_creation_tree.py`, which, according to the published paper, is the most effective strategy for questions that are most similar to the starting data. The code is adapted here to work with the Gemini API.

## Dependency setup

To run the code below, make sure that you have set `GOOGLE_API_KEY` as a Google Colab secret.

In [None]:
!pip install -U google-generativeai==0.8.3



In [None]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

## Query setup

This function sets up the LLM querying, including the model and other configuration attributes.

In [None]:
def api_query(description: str, text: str, model: str = "gemini-1.5-flash") -> str:
    """Send `text` to the model, with `description` as the system instruction, and return the reply text."""
    # Use a separate name so the `model` argument (a model name) is not shadowed.
    llm = genai.GenerativeModel(model, system_instruction=description)
    config = genai.types.GenerationConfig(temperature=1, top_p=0)
    response = llm.generate_content(text, generation_config=config, request_options={"timeout": 1000})
    return response.text
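
The text returned by `api_query` is split into JSON objects later in this notebook with a plain string split. As a hedged alternative, the sketch below shows a more tolerant parser. It assumes, without the source confirming it, that the model may wrap its reply in a Markdown code fence or return either a JSON array or comma-separated objects. The name `parse_examples` is hypothetical and not part of the original repository.

```python
import json
import re
from typing import Dict, List


def parse_examples(text: str) -> List[Dict]:
    """Best-effort extraction of JSON objects from a model reply (hypothetical helper)."""
    cleaned = text.strip()
    # Strip a surrounding Markdown code fence (three backticks, optional language tag).
    cleaned = re.sub(r"^`{3}[a-zA-Z]*\s*|\s*`{3}$", "", cleaned)
    candidates = None
    try:
        # Case 1: the reply is already valid JSON (one object or an array).
        parsed = json.loads(cleaned)
        candidates = parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError:
        try:
            # Case 2: comma-separated objects without enclosing brackets.
            candidates = json.loads("[" + cleaned + "]")
        except json.JSONDecodeError:
            candidates = []
    # Keep only JSON objects; anything else is ignored.
    return [item for item in candidates if isinstance(item, dict)]


fence = "`" * 3
reply = fence + 'json\n{"Question": "Q1"},\n{"Question": "Q2"}\n' + fence
print(parse_examples(reply))
```

A reply that cannot be parsed yields an empty list, so the caller can simply skip it and query again.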


## Prompt setup

### Prompt type

The paper describes two main ways of setting up a prompt, depending on the type of data you wish to generate. The first type, "fix", sets up a question with a fixed set of answer options, and the LLM is asked to always pick from that set. The second type, "variant", lets the LLM write its own list of options for each question and pick the correct one.
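
To make the difference between the two types concrete, here is a minimal sketch of the acceptance rule each type implies, modelled on the filtering done in the generation loop later in this notebook. The function name `is_valid_example` and the sample data are hypothetical and used only for illustration.

```python
from typing import Dict, List


def is_valid_example(example: Dict, root_options: List[str], prompt_type: str) -> bool:
    """Return True if a generated example respects the chosen prompt type."""
    options = example.get("Options", [])
    # Both types: the answer has to be one of the example's own options.
    if example.get("Answer") not in options:
        return False
    if prompt_type == "fix":
        # "fix": the option list must be identical to the root example's list.
        return options == root_options
    # "variant": new options are allowed, but the number of options must match.
    return len(options) == len(root_options)


root_options = ["positive", "negative"]
same_options = {"Question": "How was the film?", "Options": ["positive", "negative"], "Answer": "positive"}
new_options = {"Question": "Which animal barks?", "Options": ["cat", "dog"], "Answer": "dog"}

print(is_valid_example(same_options, root_options, "fix"))      # True
print(is_valid_example(new_options, root_options, "fix"))       # False
print(is_valid_example(new_options, root_options, "variant"))   # True
```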

### Root example

To get the data generation started, a root node example must be set. For this, we will need a question, options and an answer.

### Context

Optionally, a context can be set for the question. The paper explains that setting a context will make the LLM generate a context for answering the question, and it will perform better with reasoning tasks.
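
As an illustration of how an optional context changes the root example, the sketch below serialises an example with and without a `Context` field, using the key order of the "variant" layout. It builds the JSON with `json.dumps` rather than string formatting, so quotes inside a question are escaped automatically. The helper name `root_example_json` and the sample context sentence are assumptions made for this example only.

```python
import json
from typing import List, Optional


def root_example_json(
    question: str, options: List[str], answer: str, context: Optional[str] = None
) -> str:
    """Serialise a root example; the Context key appears only when a context is given."""
    example = {"Question": question}
    if context:
        example["Context"] = context
    example["Options"] = options
    example["Answer"] = answer
    return json.dumps(example, ensure_ascii=False)


sample_options = ["juice", "door", "shed", "supermarket", "cabinet"]
without_context = root_example_json(
    "What liquid can be kept in a large container?", sample_options, "juice"
)
with_context = root_example_json(
    "What liquid can be kept in a large container?",
    sample_options,
    "juice",
    context="Juice is the only liquid among the options.",  # hypothetical context
)
print(without_context)
print(with_context)
```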

In [None]:
prompt_type = "variant" # one of "variant" or "fix"
question = "What liquid can be kept in a large container?"
options = ["juice", "door", "shed", "supermarket", "cabinet"]
answer = "juice" # answer must be one of the options
context = None

## Tree size

The following parameters decide the amount of data to generate.

### Number examples

This parameter sets the number of examples to ask for from the LLM during each iteration.

### Length of train

Length of train sets the total number of examples to generate. This should be more than `num_examples`, and ideally a multiple of `num_examples`.
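
The relationship between the two parameters can be checked with a little arithmetic. The sketch below assumes the best case, in which every query returns exactly `num_examples` accepted examples; rejected or duplicate examples increase both numbers. The helper names are hypothetical.

```python
import math


def min_api_calls(len_train: int, num_examples: int) -> int:
    """Fewest queries needed if every query yields num_examples accepted examples."""
    return math.ceil(len_train / num_examples)


def levels_needed(len_train: int, num_examples: int) -> int:
    """Number of tree levels visited before len_train examples exist (best case)."""
    total, level_size, levels = 0, 1, 0
    while total < len_train:
        level_size *= num_examples  # each seed of the previous level yields num_examples children
        total += level_size
        levels += 1
    return levels


print(min_api_calls(9, 3))   # 3
print(levels_needed(9, 3))   # 2
print(min_api_calls(10, 3))  # 4
```

When `len_train` is not a multiple of `num_examples`, as in the last line, the final query produces more examples than are needed and the surplus is discarded.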

In [None]:
num_examples = 3
len_train = 9

## Generation algorithm

The following code implements the generation algorithm. It uses a tree-based approach: each accepted example becomes the seed for further queries, so the dataset grows level by level from the initial example.
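
Before reading the full implementation, it may help to see the tree expansion in isolation. The sketch below is a simplified, offline version under stated assumptions: `fake_generate` is a hypothetical stand-in for the LLM call, examples are plain strings, and the shuffling and JSON validation of the real code are left out. Unlike the notebook's loop, it stops when a level produces no new examples instead of retrying.

```python
from typing import Callable, List


def grow_tree(root: str, generate: Callable[[str], List[str]], target: int) -> List[str]:
    """Level-by-level expansion: every accepted example seeds the next level."""
    seen = set()              # plays the role of the question history
    dataset: List[str] = []
    frontier = [root]         # seeds of the current level
    while frontier and len(dataset) < target:
        next_frontier = []
        for seed in frontier:
            for child in generate(seed):
                if child in seen:  # skip duplicates
                    continue
                seen.add(child)
                dataset.append(child)
                next_frontier.append(child)
                if len(dataset) == target:
                    return dataset
        frontier = next_frontier
    return dataset


def fake_generate(seed: str) -> List[str]:
    """Hypothetical stand-in for the LLM: derives three children from a seed."""
    return [f"{seed}.{i}" for i in range(3)]


examples = grow_tree("root", fake_generate, target=9)
print(examples)
```

With a target of 9 and three children per seed, the first level contributes three examples and the second level is cut off after six more.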

In [None]:
import json
import random
from textwrap import dedent
from typing import List, Optional


def system_instruction(num_examples: int = 3, prompt_type: str = "fix") -> str:
    if prompt_type == "fix":
        instruction = dedent(
            f"""
            - You are creating {num_examples} more examples that follow the format of the example provided, but with a different content.
            - The created examples **must** all have different answers.
            - The created examples **must** have the same options as the provided example.
            - The output **must** be in unnumbered JSON format.
            """
        )
    else:
        instruction = dedent(
            f"""
            - You are creating {num_examples} more examples that follow the format of the example provided, but with a different content.
            - The created examples **must** all have different answers.
            - The output **must** be in unnumbered JSON format.
            """
        )
    return instruction


def example_instruction(
    label_space: List[str],
    question: str,
    answer: str,
    context: Optional[str],
    prompt_type: str = "fix",
) -> str:
    if prompt_type == "fix":
        if context:
            prompt = dedent(
                f"""
                "Options": {json.dumps(label_space)},
                "Answer": "{answer}",
                "Question": "{question}",
                "Context": "{context}"
                """
            )
        else:
            prompt = dedent(
                f"""
                "Options": {json.dumps(label_space)},
                "Answer": "{answer}",
                "Question": "{question}"
                """
            )
    elif prompt_type == "variant":
        if context:
            prompt = dedent(
                f"""
                "Question": "{question}",
                "Context": "{context}",
                "Options": {json.dumps(label_space)},
                "Answer": "{answer}"
                """
            )
        else:
            prompt = dedent(
                f"""
                "Question": "{question}",
                "Options": {json.dumps(label_space)},
                "Answer": "{answer}"
                """
            )

    return "{" + prompt + "}"

if __name__ == "__main__":

    system_inst = system_instruction(num_examples=num_examples, prompt_type=prompt_type)

    example_inst = example_instruction(
        label_space=options,
        question=question,
        answer=answer,
        context=context,
        prompt_type=prompt_type,
    )

    print(system_inst)

    history = set()
    count = 0
    depth = 0

    prev_tree = [example_inst]
    next_tree = []

    while count < len_train:
        print(f"size of previous tree: {len(prev_tree)}")
        random.shuffle(prev_tree)
        for tree_example_inst in prev_tree:
            text_result = api_query(description=system_inst, text=tree_example_inst)
            output = text_result.strip().split("\n},\n{")
            temp_dataset = []
            try:
                assert len(output) == num_examples
                for x in output:
                    x = x.strip()
                    if x[0] != "{":
                        x = "{" + x
                    if x[-1] != "}":
                        x = x + "}"
                    json_instance = json.loads(x)

                    if len(json_instance["Options"]) != len(options):
                        continue
                    else:
                        if prompt_type == "fix" and json_instance["Options"] != options:
                            continue

                    if json_instance["Question"] in history:
                        continue

                    if json_instance["Answer"] not in json_instance["Options"]:
                        continue

                    history.add(json_instance["Question"])

                    data_instance = {
                        "question": json_instance["Question"],
                        "context": json_instance.get("Context"),
                        "options": json_instance["Options"],
                        "label": json_instance["Answer"],
                    }

                    temp_dataset.append(data_instance)

                for x in temp_dataset:
                    next_tree.append(
                        example_instruction(
                            label_space=x["options"],
                            question=x["question"],
                            answer=x["label"],
                            context=x["context"],
                            prompt_type=prompt_type,
                        )
                    )
                    print(json.dumps(x, ensure_ascii=False) + "\n")
                    count += 1
                    if count == len_train:
                        break

                print(f"depth: {depth}, count: {count}")
                if count == len_train:
                    break

            except Exception as e:  # pylint: disable=broad-exception-caught
                print(e)

        if count == len_train:
            break

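        # Nothing was accepted at this level, so query the same seeds again.
        # Note: this repeats indefinitely if every reply keeps being rejected.
        if len(next_tree) == 0:
            continue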

        prev_tree = next_tree
        next_tree = []
        depth += 1

    print("\n")


- You are creating 3 more examples that follow the format of the example provided, but with a different content.
- The created examples **must** all have different answers.
- The output **must** be in unnumbered JSON format.

size of previous tree: 1
{"question": "What is the opposite of hot?", "context": null, "options": ["cold", "warm", "cool", "freezing", "ice"], "label": "cold"}

{"question": "What is the name of the animal that barks?", "context": null, "options": ["cat", "dog", "bird", "fish", "snake"], "label": "dog"}

{"question": "What is the name of the planet we live on?", "context": null, "options": ["Mars", "Venus", "Jupiter", "Saturn", "Earth"], "label": "Earth"}

depth: 0, count: 3
size of previous tree: 3
{"question": "What is the opposite of up?", "context": null, "options": ["down", "left", "right", "forward", "backward"], "label": "down"}

{"question": "What is the opposite of happy?", "context": null, "options": ["sad", "angry", "excited", "tired", "bored"], "label