# Generation

To build a chatbot app, we need a set of questions and answers. This is helpful for evaluating different prompt engineering techniques and different app design choices.

In this notebook we take a closer look at prompting the model by giving it better context:
* using available data of W&B user questions
* using the documentation files to generate better answers

In [1]:
import os
import random

from pathlib import Path
from pprint import pprint
from getpass import getpass

from rich.markdown import Markdown
import pandas as pd

from tqdm import tqdm

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential, # for exponential backoff
)  

import torch
import transformers
import wandb
from wandb.integration.huggingface import autolog


# Set Llama2 API key 

To get the key, go to your Hugging Face account and copy it from your Access Tokens.

In [2]:
# Set LLAMA2 API key environment variable
if os.getenv("LLAMA2_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["LLAMA2_API_KEY"] = getpass("Paste your Llama2 key from your Hugging Face settings \n")

assert os.getenv("LLAMA2_API_KEY", "").startswith("hf_"), "This doesn't look like a valid HuggingFace llama2 key"
print("Llama2 API key configured")

# Get the HF auth token
hf_auth = os.getenv("LLAMA2_API_KEY", "")

Please enter password in the VS Code prompt at the top of your VS Code window!
Llama2 API key configured


# Start W&B logging

`autolog` is a convenience function for logging results to W&B.

In [3]:
autolog({"project":"llmapps", "job_type": "generation"})

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: d-oliver-cort (doc93). Use `wandb login --relogin` to force relogin


# Generating synthetic support questions

In [3]:
# Define llama2 model to load
model_id = 'meta-llama/Llama-2-7b-chat-hf'

In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
    )



In [6]:
# Set quantization configuration to load large model with less GPU memory
# - this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# This configuration object uses the model configuration from Hugging Face 
# to set different model parameters
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# Download and initialize the model 
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


In [7]:
pipeline_generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
)

In [8]:
# completion_with_backoff
# - this decorator retries the generation call with random exponential backoff
#   (waits capped at 60 s, up to 6 attempts) whenever the call raises an error
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(message, **kwargs):
    return pipeline_generate_text(message, **kwargs)
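
To make the retry behaviour more concrete, here is a rough standard-library sketch of what a retry decorator with random exponential backoff does. It is not tenacity's implementation: `retry_with_backoff`, its parameters and the injected `sleep` argument are made up for illustration, and the exact wait distribution that tenacity uses may differ.

```python
import random
import time
from functools import wraps


def retry_with_backoff(max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Retry the wrapped function, waiting a random, growing delay between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    # The upper bound doubles with each attempt, capped at max_wait
                    upper = min(max_wait, min_wait * 2 ** attempt)
                    sleep(random.uniform(min_wait, upper))
        return wrapper
    return decorator


calls = {"n": 0}
waits = []  # record the delays instead of actually sleeping


@retry_with_backoff(max_attempts=4, sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky()
print(result, calls["n"], len(waits))
```

Passing `sleep=waits.append` keeps the example fast: the delays are collected in a list rather than slept.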

## Zero Shot prompting

With zero-shot prompting we give the model no examples or context, only the instruction.

In [9]:
# special tokens used by llama 2 chat
B_INST = "[INST] "
E_INST = "[/INST]"
B_SYS  = "<<SYS>>\n"
E_SYS  = "\n<</SYS>>\n\n"


# Define the behaviour, qualities of the LLM
system_prompt = """You are a helpful assistant."""
# Define what we ask the LLM to do
user_message = """Generate a support question from a W&B user"""

# create the system message
system_prompt = f"<s>{B_INST}{B_SYS}{system_prompt}{E_SYS}"
# Create user message
user_prompt = f"{user_message}{E_INST}"
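
The string assembly above can be wrapped in a small helper so that the system and user parts are always combined the same way. This is only a sketch: `build_llama2_prompt` is a hypothetical name (it is not part of this notebook or of `transformers`), and it reuses the marker strings defined in the cell above.

```python
# Marker strings copied from the cell above
B_INST = "[INST] "
E_INST = "[/INST]"
B_SYS = "<<SYS>>\n"
E_SYS = "\n<</SYS>>\n\n"


def build_llama2_prompt(system_text, user_text):
    """Assemble a single-turn Llama 2 chat prompt (hypothetical helper)."""
    return f"<s>{B_INST}{B_SYS}{system_text}{E_SYS}{user_text}{E_INST}"


prompt = build_llama2_prompt(
    "You are a helpful assistant.",
    "Generate a support question from a W&B user",
)
print(prompt)
```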

In [10]:
def generate_and_print(system_prompt, user_prompt, n=5):
    message = system_prompt + user_prompt

    responses = completion_with_backoff(
        message,
        do_sample=True,           # Whether or not to use sampling 
        repetition_penalty=1.1,   # without this output begins repeating
        num_return_sequences = n,
        max_new_tokens=50
        )
    # pprint(responses)
    for response in responses:
        response = response['generated_text']
        generation = response[response.find('[/INST]')+len('[/INST]'):]
        display(Markdown(generation))
    
# generate_and_print displays the generations and returns None, so there is nothing to assign
generate_and_print(system_prompt, user_prompt)
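
The prompt-stripping step inside `generate_and_print` can also be isolated into its own function. One detail worth knowing: `str.find` returns -1 when the marker is missing, so the slice used above would then start at index 6 and silently drop the first characters. The sketch below, with the hypothetical name `extract_generation`, falls back to the full text in that case.

```python
E_INST = "[/INST]"


def extract_generation(generated_text, marker=E_INST):
    """Return the text after the first prompt marker, or the full text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


print(extract_generation("<s>[INST] hi [/INST] Hello there"))
print(extract_generation("no marker here"))
```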

The 5 generated responses above are quite generic. 

## Few Shot prompting 

With few-shot prompting we give the model several examples.
* We read some user-submitted questions (from a Discord server) listed in the file `examples.txt`.
* This file contains questions, some spanning multiple lines, separated by tabs (`\t`).

In [21]:
delimiter = "\t" # tab separated queries
with open("../data/examples.txt", "r") as file:
    data = file.read()
    real_queries = data.split(delimiter)

pprint(f"We have {len(real_queries)} real queries:")  
Markdown(f"Sample one: \n\"{random.choice(real_queries)}\"")

'We have 228 real queries:'
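
Splitting on tabs can leave blank entries behind, for example when the file ends with a trailing tab or newline. A minimal sketch of a cleanup step follows. It assumes a made-up in-memory sample rather than the real `examples.txt`, and `split_queries` is a hypothetical helper name.

```python
def split_queries(raw_text, delimiter="\t"):
    """Split raw text on the delimiter and drop blank entries (hypothetical helper)."""
    return [q.strip() for q in raw_text.split(delimiter) if q.strip()]


# Made-up sample in the same tab-separated layout, including one multiline query
sample = "can I make a project public?\thow do I log a table?\nI get a TypeError\t\n"
queries = split_queries(sample)
print(len(queries), queries)
```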


For a Few Shot prompt:
* we can now add a few of those real user questions to the prompt, to guide our model to produce synthetic questions like those.

In [14]:
def generate_few_shot_prompt(queries, n=3):
    prompt = "Generate a support question from a W&B user\n" +\
        "Below you will find a few examples of real user queries:\n"
    # sample without replacement so the same example is never repeated
    for query in random.sample(queries, n):
        prompt += query + "\n"
    prompt += "Let's start!"
    return prompt

generation_prompt = generate_few_shot_prompt(real_queries)
# Get user prompt in llama2 format
generation_prompt = f"{generation_prompt}{E_INST}"

# Print the prompt to be fed into the LLM
Markdown(generation_prompt)

In [15]:
# Print prompt passed into LLM
pprint(system_prompt + generation_prompt)

('<s>[INST] <<SYS>>\n'
 'You are a helpful assistant.\n'
 '<</SYS>>\n'
 '\n'
 'Generate a support question from a W&B user\n'
 'Below you will find a few examples of real user queries:\n'
 'can I make a project publicly viewable?\n'
 'The integration with gradio works within a jupyter notebook but not with the '
 'same code run as a python script. Why?\n'
 'how do I fix an error with wandb Table construction from pandas dataframe: '
 'TypeError: Data row contained incompatible types\n'
 "Let's start![/INST]")


Below we can see how the Llama2 LLM performs when a few examples (context) are added to the user prompt.

In [15]:
generate_and_print(system_prompt, user_prompt=generation_prompt)

The questions the LLM produces when given some example questions are a bit more diverse than with zero-shot prompting, but we can go further to widen the range of questions it produces.

To avoid filler openings like "Sure, ...", we could add an instruction to the prompt such as: "Just give me the question, don't give me extra text"
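
As a sketch of that idea, the few-shot prompt builder could append the extra instruction before the closing line. Whether this actually suppresses the filler text has not been tested here. The function name `generate_few_shot_prompt_strict`, the `seed` parameter and the sample queries are all illustrative assumptions.

```python
import random


def generate_few_shot_prompt_strict(queries, n=3, seed=None):
    """Few-shot prompt with an explicit output-format instruction (illustrative)."""
    rng = random.Random(seed)  # a seeded generator makes the sampled examples reproducible
    examples = rng.sample(queries, n)  # sample without replacement to avoid duplicates
    lines = [
        "Generate a support question from a W&B user",
        "Below you will find a few examples of real user queries:",
        *examples,
        "Just give me the question, don't give me extra text.",
        "Let's start!",
    ]
    return "\n".join(lines)


# Made-up stand-ins for entries of `real_queries`
sample_queries = [
    "can I make a project publicly viewable?",
    "how do I resume a crashed run?",
    "why is my sweep not starting?",
    "how do I log a confusion matrix?",
]
strict_prompt = generate_few_shot_prompt_strict(sample_queries, n=3, seed=0)
print(strict_prompt)
```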

## Add Context & Response

We want to be able to answer questions for which documentation is available. As long as we have the documentation for a specific user question, we should be able to answer that question.

To evaluate the model, we want to make sure that whenever documentation is available, the answer is correct for that question.
* So why not use documentation to also generate synthetic questions?

To do this, the folder `../docs_sample` contains several examples of wandb docs. The dataset of questions will be limited to what is covered in these docs.

In [18]:
# check if directory exists, if not, create it and download the files, e.g if running in colab
if not os.path.exists("../docs_sample/"):
  !git clone https://github.com/wandb/edu.git
  !cp -r edu/llm-apps-course/docs_sample ../

In [16]:
def find_md_files(directory):
    "Find all markdown files in a directory and return their content and path"
    md_files = []
    for file in Path(directory).rglob("*.md"):
        with open(file, 'r', encoding='utf-8') as md_file:
            content = md_file.read()
        md_files.append((file.relative_to(directory), content))
    return md_files

documents = find_md_files('../docs_sample/')
len(documents)

11

Check that the documents are not too long for our context window (prompt), by computing the number of tokens in each document.

In [17]:
tokens_per_document = [len(tokenizer.encode(document)) for _, document in documents]
pprint(tokens_per_document)

[1172, 3173, 432, 672, 960, 1519, 3555, 2059, 3010, 2512, 5369]


Some of the documents are too long (we don't need that much text in our prompt). For those documents, we'll extract a random chunk to inspire the LLM to generate more questions.
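
To see how many documents are affected, we can compare the token counts printed above against a chunk budget. The sketch below assumes a budget of 512 tokens, and `MAX_CHUNK_TOKENS` is a name introduced here for illustration.

```python
# Token counts printed above, one per document
tokens_per_document = [1172, 3173, 432, 672, 960, 1519, 3555, 2059, 3010, 2512, 5369]
MAX_CHUNK_TOKENS = 512  # assumed budget for the documentation fragment in the prompt

# Indices of documents that do not fit in the budget and need chunking
needs_chunking = [i for i, n in enumerate(tokens_per_document) if n > MAX_CHUNK_TOKENS]
print(f"{len(needs_chunking)} of {len(tokens_per_document)} documents exceed {MAX_CHUNK_TOKENS} tokens")
```

With these counts only the third document (432 tokens) fits without chunking.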

In [18]:
# extract a random chunk from a document
def extract_random_chunk(document, max_tokens=512):
    tokens = tokenizer.encode(document)
    if len(tokens) <= max_tokens:
        return document
    start = random.randint(0, len(tokens) - max_tokens)
    end = start + max_tokens
    return tokenizer.decode(tokens[start:end])
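
To check the chunking logic without loading the model tokenizer, the same function can be run against a toy tokenizer. This is only a sketch: `WhitespaceTokenizer` is a made-up stand-in that treats every whitespace-separated word as one token, and the function takes the tokenizer as an argument here, unlike the cell above.

```python
import random


class WhitespaceTokenizer:
    """Toy stand-in for the Hugging Face tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def extract_random_chunk(document, tokenizer, max_tokens=512):
    tokens = tokenizer.encode(document)
    if len(tokens) <= max_tokens:
        return document  # short documents are returned unchanged
    start = random.randint(0, len(tokens) - max_tokens)
    return tokenizer.decode(tokens[start:start + max_tokens])


toy_tokenizer = WhitespaceTokenizer()
document = " ".join(f"w{i}" for i in range(20))
chunk = extract_random_chunk(document, toy_tokenizer, max_tokens=5)
print(chunk)
```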

Now we use the extracted chunk to create a question that can be answered by the document. This way we generate questions that our current documentation is capable of answering.

In [19]:
def generate_context_prompt(chunk):
    prompt = "Generate a support question from a W&B user.\n" +\
        "Start your answer with 'Question:'. Don't answer the generated question.\n" +\
        "The question should be answerable by provided fragment of W&B documentation.\n" +\
        "Don't mention the W&B documentation in your answer.\n" +\
        "Below you will find a fragment of W&B documentation:\n" +\
        chunk + "\n" +\
        "Let's start!"
    return prompt

chunk = extract_random_chunk(documents[0][1])
generation_prompt = generate_context_prompt(chunk)
# Follow llama2 prompt format
generation_prompt = f"{generation_prompt}{E_INST}"

In [50]:
Markdown(generation_prompt)

Let's generate 3 possible questions:

In [34]:
generate_and_print(system_prompt, generation_prompt, n=3)

Some of the questions above seem artificial (less related to wandb and more about a specific coding concept). Further prompt engineering can improve this.

## Level 5 prompt structure

This prompt structure has a complex directive that includes:
* Description of high-level goal 
* Few-shot examples
* Detailed bulleted list of sub-tasks 
* An explicit statement asking the LLM to explain its output
* Guidelines on how LLM output will be evaluated


### System and user templates

Here we attempt to create a prompt that follows these Level 5 directions. We split the prompt into:
* **System template** (system message) - instructing model to get into a specific role
* **User template** (input from the user)

In [10]:
# read system_template.txt file into a string
with open("../data/system_template.txt", "r") as file:
    system_prompt = file.read()

# Follow llama2 prompt format
system_prompt = f"<s>{B_INST}{B_SYS}{system_prompt}{E_SYS}"

In [11]:
pprint(system_prompt)

('<s>[INST] <<SYS>>\n'
 'You are a creative assistant with the goal to generate a synthetic dataset '
 'of Weights & Biases (W&B) user questions.\n'
 "W&B users are asking these questions to a bot, so they don't know the answer "
 "and their questions are grounded in what they're trying to achieve. \n"
 'We are interested in questions that can be answered by W&B documentation. \n'
 "But the users don't have access to this documentation, so you need to "
 "imagine what they're trying to do and use according language.\n"
 '<</SYS>>\n'
 '\n')


In [12]:
# read prompt_template.txt file into a string (filled in later with str.format)
with open("../data/prompt_template.txt", "r") as file:
    prompt_template = file.read()

# Follow llama2 prompt format
prompt_template = f"{prompt_template}{E_INST}"

In [13]:
Markdown(prompt_template)

In the above prompt template, we:
* State that we provide examples of real user questions (this is the **few shot** part of the prompt); the examples themselves still need to be filled in through the `QUESTIONS` placeholder
* Provide a fragment of W&B docs as inspiration for the synthetic questions and as the source of the answers; the fragment still needs to be filled in through the `CHUNK` placeholder
* Provide further information to guide the model's answer

Below, we fill in the template using:
* Example questions from **[examples.txt](../data/examples.txt)**
* Example documentation from **[docs_sample](../docs_sample/)**
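
The filling step relies on Python's `str.format`. The sketch below shows the mechanism on a made-up template. The real text lives in `../data/prompt_template.txt` and is not reproduced here, so only the placeholder names `QUESTIONS` and `CHUNK` are taken from this notebook.

```python
# Hypothetical template text, NOT the contents of ../data/prompt_template.txt
template = (
    "Here are some examples of real user questions:\n"
    "{QUESTIONS}\n"
    "Here is a fragment of W&B documentation:\n"
    "{CHUNK}\n"
    "Reply in the form {{CONTEXT, QUESTION, ANSWER}}."  # doubled braces stay literal
)

filled = template.format(
    QUESTIONS="how do I log a table?\ncan I make a project public?",
    CHUNK="Use wandb.Table to log tabular data.",
)
print(filled)
```

Any literal braces in a template must be doubled, otherwise `str.format` treats them as placeholders and raises an error.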

In [22]:
def generate_context_prompt(chunk, n_questions=3):
    # Randomly sample n questions from list real_queries
    questions = '\n'.join(random.sample(real_queries, n_questions))
    user_prompt = prompt_template.format(QUESTIONS=questions, CHUNK=chunk)
    return user_prompt

user_prompt = generate_context_prompt(chunk)

In [40]:
Markdown(user_prompt)

Now we ask the model to generate context, question and answer triples.

In [23]:
def generate_questions(documents, n_questions=3, n_generations=5):
    questions = []
    for _, document in tqdm(documents):
        # Extract random chunk from a W&B document
        chunk = extract_random_chunk(document)
        # Fill in prompt_template with example questions and docs chunk
        user_prompt = generate_context_prompt(chunk, n_questions)
        # Pass system_prompt and user_prompt to LLM
        message = system_prompt + user_prompt
        # display(Markdown(message))
        # pprint(message)
        
        # Produce n responses from input prompt
        responses = completion_with_backoff(
            message,
            do_sample=True,           # Whether or not to use sampling 
            repetition_penalty=1.1,   # without this output begins repeating
            num_return_sequences = n_generations,
            max_new_tokens=512
            )

        # Remove asked prompt from responses and append to list
        questions.extend([response['generated_text'][response['generated_text'].find('[/INST]')+len('[/INST]'):] for response in responses])
    return questions

In [24]:
# function to parse model generation and extract CONTEXT, QUESTION and ANSWER
def parse_generation(generation):
    lines = generation.split("\n")
    context = []
    question = []
    answer = []
    flag = None
    
    for line in lines:
        if "CONTEXT:" in line:
            flag = "context"
            line = line.replace("CONTEXT:", "").strip()
        elif "QUESTION:" in line:
            flag = "question"
            line = line.replace("QUESTION:", "").strip()
        elif "ANSWER:" in line:
            flag = "answer"
            line = line.replace("ANSWER:", "").strip()

        if flag == "context":
            context.append(line)
        elif flag == "question":
            question.append(line)
        elif flag == "answer":
            answer.append(line)

    context = "\n".join(context)
    question = "\n".join(question)
    answer = "\n".join(answer)
    return context, question, answer
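
To illustrate the input format the parser expects, here is a compact regex-based variant run on a made-up generation. It is an alternative sketch, not the function above: `parse_sections` is a hypothetical name, and unlike `parse_generation` it keeps only the last occurrence if a label appears more than once.

```python
import re

SECTION_RE = re.compile(r"(CONTEXT|QUESTION|ANSWER):")


def parse_sections(generation):
    """Split a generation into (context, question, answer) using the three labels."""
    # re.split with a capturing group yields [preamble, label1, body1, label2, body2, ...]
    parts = SECTION_RE.split(generation)
    sections = {"CONTEXT": "", "QUESTION": "", "ANSWER": ""}
    for label, body in zip(parts[1::2], parts[2::2]):
        sections[label] = body.strip()
    return sections["CONTEXT"], sections["QUESTION"], sections["ANSWER"]


# Made-up generation in the expected layout
sample = (
    "Sure!\n"
    "CONTEXT: A new user wants to log tabular data.\n"
    "QUESTION: How do I log a table?\n"
    "ANSWER: Use wandb.Table.\nIt accepts a dataframe."
)
context, question, answer = parse_sections(sample)
print(question)

# A generation without any labels parses to three empty strings
print(parse_sections("I cannot help with that."))
```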

In [43]:
# Generate questions using LLM
generations = generate_questions([documents[0]], n_questions=3, n_generations=5)

100%|██████████| 1/1 [00:38<00:00, 38.14s/it]


In [44]:
# Extract CONTEXT, QUESTION and ANSWER from a generation
context, question, answer = parse_generation(generations[0])

In [45]:
display(Markdown(generations[0]))
print('context -----------')
display(Markdown(context))
# pprint(context)
print('question-----------')
display(Markdown(question))
# pprint(question)
print('answer-----------')
display(Markdown(answer))
# pprint(answer)


context -----------


question-----------


answer-----------


* The generated text above is split into `Context`, `Question`, `Answer`
* The question looks better than with previous approaches

Now that we have verified that the function works, we can run it in a loop to generate questions.

Because we want a large dataset of synthetic questions for our model evaluation, below:
* we save the LLM generations into a dataframe and a csv,
* we log this as a W&B Table and save the csv as a W&B Artifact

In [25]:
parsed_generations = []
generations = generate_questions(documents, n_questions=3, n_generations=5)
for generation in generations:
    context, question, answer = parse_generation(generation)
    parsed_generations.append({"context": context, "question": question, "answer": answer})

# Convert parsed_generations to a pandas dataframe and save it locally
df = pd.DataFrame(parsed_generations)
df.to_csv('generated_examples.csv', index=False)

# Log df as a table to W&B for interactive exploration
wandb.log({"generated_examples": wandb.Table(dataframe=df)})

# Log csv file as an artifact to W&B for later use
artifact = wandb.Artifact("generated_examples", type="dataset")
artifact.add_file("generated_examples.csv")
wandb.log_artifact(artifact)

100%|██████████| 11/11 [10:30<00:00, 57.36s/it]


<Artifact generated_examples>

In [26]:
# Finish wandb run
wandb.finish()



In [36]:
df[:10]

| Unnamed: 0 | context | question | answer |
|---|---|---|---|
| 0 | \nThe user is a beginner user of Weights & Bia... | How do I create a new version of my dataset in... | To create a new version of your dataset in Wei... |
| 1 | A Weights & Biases (W&B) user who is relativel... | How do I version my datasets in Weights & Bias... | Great question! Versioning datasets in Weights... |
| 2 | A Weights & Biases (W&B) user is working with ... | "How do I go about tagging my runs in Weights ... | Sure, W&B provides a powerful tagging system t... |
| 3 | A new W&B user is trying to learn how to use t... | How do I tag multiple runs in Wandb? Is there ... | Great question! In W&B, you can tag multiple r... |
| 4 | | | |
| 5 | A user is trying to improve the quality of the... | How can I use W&B's artifact versioning featur... | Great question! W&B's artifact versioning feat... |
| 6 | The user is a data scientist who is working on... | How can I automatically version my dataset aft... | "Great question! In W&B, you can automatically... |
| 7 | A user named Sarah is trying to improve the qu... | How can I refine my dataset to address common ... | Great question, Sarah! W&B provides an automat... |
| 8 | \nA beginner user of Weights & Biases is attem... | "How do I properly version my datasets and Art... | Great question! Properly versioning your datas... |
| 9 | A user has been using W&B to create a dataset ... | "Hey there, guys! I'm having some trouble with... | "Hi there! It sounds like you're experiencing ... |
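
Row 4 in the output above is empty, presumably because the parser found none of the expected labels in that generation. A minimal sketch for dropping such rows before saving follows. `is_complete` is a hypothetical helper and the records are made up, but they have the same shape as the entries of `parsed_generations`.

```python
def is_complete(record):
    """True when context, question and answer are all non-empty after stripping."""
    return all(str(record.get(key) or "").strip() for key in ("context", "question", "answer"))


# Made-up records in the same shape as `parsed_generations`
records = [
    {"context": "A new W&B user", "question": "How do I tag runs?", "answer": "Use tags."},
    {"context": "", "question": "", "answer": ""},  # parser found no labels
]
complete = [r for r in records if is_complete(r)]
print(f"kept {len(complete)} of {len(records)} records")
```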
