# Question-Answer Data Generator

In [1]:
!pip install https://ragsample.blob.core.windows.net/wheels/azure_ai_tools-0.0.1-py3-none-any.whl
!pip install azureml-rag azure-ai-ml azureml-core
!pip install "langchain==0.0.276" "nltk==3.8.1" pandas wikipedia
!pip install "promptflow[azure]" "promptflow-tools==0.1.0.b5" promptflow_vectordb

## Set up Azure OpenAI Connection

Update workspace details:

In [None]:
%%writefile config.json
{
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group_name>",
    "workspace_name": "<workspace_name>"
}

Update Azure OpenAI details:

In [1]:
aoai_endpoint = "https://<aoai-endpoint>.openai.azure.com/" # Replace with resource endpoint URL
aoai_key = "<aoai-key>"  # Replace with resource key
completion_deployment_name = "gpt-4"  # Replace with deployment name of the chat completion model
completion_model_name = "gpt-4"  # Replace with chat model name: gpt-4, gpt-35-turbo
embedding_deployment_name = "text-embedding-ada-002"  # Replace with deployment name for text-embedding-ada-002 model
aoai_connection_name = "azure_open_ai_connection"

In [2]:
# Update the Azure OpenAI details in the sample Prompt flow files
chat_qna_file = "./flows/bring_your_own_data_chat_qna/flow.dag.yaml"
with open(chat_qna_file) as f:
    chat_qna_flow_dag = f.read()
chat_qna_flow_dag = chat_qna_flow_dag.replace("deployment_name: 'text-embedding-ada-002'", f"deployment_name: '{embedding_deployment_name}'") \
    .replace("deployment_name: 'gpt-4'", f"deployment_name: '{completion_deployment_name}'") \
    .replace("connection: 'azure_open_ai_connection'", f"connection: '{aoai_connection_name}'")
with open(chat_qna_file, "w") as f:
    f.write(chat_qna_flow_dag)

eval_file = "./flows/qna_gpt_similarity_eval/flow.dag.yaml"
with open(eval_file) as f:
    eval_flow_dag = f.read()
eval_flow_dag = eval_flow_dag.replace("deployment_name: 'gpt-4'", f"deployment_name: '{completion_deployment_name}'") \
    .replace("connection: 'azure_open_ai_connection'", f"connection: '{aoai_connection_name}'")
with open(eval_file, "w") as f:
    f.write(eval_flow_dag)
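
The string replacements above do nothing when a pattern is absent, for example if the cell is rerun after the file was already updated. A minimal sketch of a helper that reports patterns that never matched is shown below; `apply_replacements`, the sample text, and the deployment names are hypothetical and not part of the sample flows.

```python
def apply_replacements(text, replacements):
    """Apply each (old, new) pair to text; return the result and the patterns that never matched."""
    missing = []
    for old, new in replacements:
        if old not in text:
            missing.append(old)
            continue
        text = text.replace(old, new)
    return text, missing


# Toy stand-in for a flow.dag.yaml fragment (not the real sample file).
sample_dag = "deployment_name: 'gpt-4'\nconnection: 'azure_open_ai_connection'\n"
updated, missing = apply_replacements(
    sample_dag,
    [
        ("deployment_name: 'gpt-4'", "deployment_name: 'my-gpt4-deployment'"),
        ("deployment_name: 'text-embedding-ada-002'", "deployment_name: 'my-embedding'"),
    ],
)
print(updated)
print("Patterns not found:", missing)
```

A non-empty list of missing patterns is a hint that the file no longer contains the default values and may need to be edited by hand.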

In [3]:
from azureml.core import Workspace
from azureml.rag.utils.connections import get_connection_by_name_v2, create_connection_v2

ws = Workspace.from_config()

try:
    aoai_connection = get_connection_by_name_v2(ws, aoai_connection_name)
except Exception:  # lookup failed; assume the connection does not exist yet
    print("Creating connection...")
    aoai_connection = create_connection_v2(
        workspace=ws,
        name=aoai_connection_name,
        category="AzureOpenAI",
        target=aoai_endpoint,
        auth_type="ApiKey",
        credentials={
            "key": aoai_key,
        },
        metadata={"ApiType": "azure", "ApiVersion": "2023-05-15"},
    )
print(aoai_connection["name"])

azure_open_ai_connection


## Generate QA

Initialize a QA data generator by passing in the Azure OpenAI details for your gpt-4 or gpt-35-turbo deployment.
We'll use it to generate different types of QA pairs from sample text.

Supported QA types:

|Type|Description|
|--|--|
|SHORT_ANSWER|Short answer QAs have answers that are only a few words long. These are generally relevant details from the text, such as dates, names, and statistics.|
|LONG_ANSWER|Long answer QAs have answers that are one or more sentences long, e.g. questions whose answer is a definition: What is a {topic_from_text}?|
|BOOLEAN|Boolean QAs have answers that are either True or False.|
|SUMMARY|Summary QAs have questions that ask for a summary of the text's title in a limited number of words. This type generates just one QA.|
|CONVERSATION|Conversation QAs have questions that may reference words or ideas from previous QAs, e.g. if the previous conversation was about some topicX from the text, the next question might refer to it without naming it: How does **it** compare to topicY?|

In [4]:
from azure.ai.tools.synthetic.qa import QADataGenerator, QAType

# For granular logs you may set DEBUG log level:
# import logging
# logging.basicConfig(level=logging.DEBUG)

model_config = dict(
    api_base=aoai_connection["properties"]["target"],
    api_key=aoai_connection["properties"]["credentials"]["key"],
    deployment=completion_deployment_name,
    model=completion_model_name,
    max_tokens=2000,
)

qa_generator = QADataGenerator(model_config=model_config)

In [5]:
import wikipedia

wiki_title = wikipedia.search("Leonardo da vinci")[0]
wiki_page = wikipedia.page(wiki_title)
text = wiki_page.summary[:700]
text

'Leonardo di ser Piero da Vinci (15 April 1452 – 2 May 1519) was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, engineer, scientist, theorist, sculptor, and architect. While his fame initially rested on his achievements as a painter, he also became known for his notebooks, in which he made drawings and notes on a variety of subjects, including anatomy, astronomy, botany, cartography, painting, and paleontology. Leonardo is widely regarded to have been a genius who epitomized the Renaissance humanist ideal, and his collective works comprise a contribution to later generations of artists matched only by that of his younger contemporary Michelangelo.Born ou'

In [54]:
result = qa_generator.generate(
    text=text,
    qa_type=QAType.SHORT_ANSWER,  # Feel free to change the QA type
    num_questions=5,
)
for question, answer in result["question_answers"]:
    print(f"Q: {question}")
    print(f"A: {answer}")

Q: When was Leonardo da Vinci born?
A: 15 April 1452
Q: When did Leonardo da Vinci pass away?
A: 2 May 1519
Q: What was Leonardo da Vinci's full name?
A: Leonardo di ser Piero da Vinci
Q: Which period was Leonardo da Vinci an Italian polymath of?
A: High Renaissance
Q: Who was Leonardo da Vinci's younger contemporary with a significant contribution to later generations of artists?
A: Michelangelo


## Generate QA from files

Files may contain text longer than the model's context length, so they need to be split into smaller chunks. The split should not fall mid-sentence, because partial sentences can lead to improper QAs. We'll use LangChain's `NLTKTextSplitter` to handle both issues.
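
To illustrate the idea of sentence-aware chunking, here is a minimal standard-library sketch. It is not how `NLTKTextSplitter` is implemented: it assumes sentences end with `.`, `!` or `?` followed by whitespace, and it counts whitespace-separated words as a rough stand-in for model tokens. The function name `chunk_by_sentence` is hypothetical.

```python
import re


def chunk_by_sentence(text, max_words):
    """Greedily pack whole sentences into chunks of at most max_words words.

    Assumptions: sentences end with '.', '!' or '?' followed by whitespace, and
    whitespace-separated words stand in for model tokens.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Compute targets run training jobs. They can be local or remote. "
    "Clusters scale out. Instances suit development."
)
chunks = chunk_by_sentence(sample, max_words=10)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch a single sentence longer than `max_words` is kept whole as its own oversized chunk rather than being cut.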

We'll generate QAs to use later in Promptflow's Bulk Test.

We'll read sample markdown files from the `data/data_generator_texts` directory.

In [6]:
import os
import glob

texts_glob = os.path.join("data", "data_generator_texts", "**", "*")
files = glob.glob(texts_glob, recursive=True)
files = [file for file in files if os.path.isfile(file)]
files

['data\\data_generator_texts\\text-1.md',
 'data\\data_generator_texts\\text-2.md',
 'data\\data_generator_texts\\text-3.md']

In [9]:
import nltk

# download pre-trained Punkt tokenizer for sentence splitting
nltk.download("punkt")

In [10]:
from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding for gpt-4 and gpt-35-turbo
    chunk_size=300,  # number of tokens to split on
    chunk_overlap=0,
)
texts = []
for file in files:
    with open(file, encoding="utf-8") as f:  # assumes the sample files are UTF-8 encoded
        data = f.read()
    texts += text_splitter.split_text(data)
print(f"Number of texts after splitting: {len(texts)}")

Number of texts after splitting: 7


We'll use the `qa_generator.generate_async()` method to make concurrent requests to Azure OpenAI.

In [12]:
import asyncio
from collections import Counter

concurrency = 3  # number of concurrent calls
sem = asyncio.Semaphore(concurrency)

async def generate_async(text):
    async with sem:
        return await qa_generator.generate_async(
            text=text,
            qa_type=QAType.LONG_ANSWER,
            num_questions=3,  # Number of questions to generate per text
        )

results = await asyncio.gather(*[generate_async(text) for text in texts],
                               return_exceptions=True)

question_answer_list = []
token_usage = Counter()
for result in results:
    if isinstance(result, Exception):
        raise result  # exception raised inside generate_async()
    question_answer_list.append(result["question_answers"])
    token_usage += result["token_usage"]

print("Successfully generated QAs")
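
The cell above re-raises the first exception it finds, which discards the results that did succeed. As an alternative, the sketch below keeps successes and records failures separately. `fake_generate` is a hypothetical stand-in for the real generation call so that the example runs without an Azure OpenAI deployment.

```python
import asyncio


async def fake_generate(text):
    """Hypothetical stand-in for a QA generation call; fails on empty text."""
    await asyncio.sleep(0.01)
    if not text:
        raise ValueError("empty text")
    return {"question_answers": [(f"What is {text}?", f"{text} is a sample.")]}


async def run_all(texts, concurrency=3):
    sem = asyncio.Semaphore(concurrency)  # created inside the running event loop

    async def bounded(text):
        async with sem:  # at most `concurrency` calls are in flight at once
            return await fake_generate(text)

    results = await asyncio.gather(*(bounded(t) for t in texts), return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [(t, r) for t, r in zip(texts, results) if isinstance(r, Exception)]
    return successes, failures


successes, failures = asyncio.run(run_all(["alpha", "", "beta"]))
print(f"{len(successes)} succeeded, {len(failures)} failed")
for text, error in failures:
    print(f"Failed for {text!r}: {error}")
```

Inside a notebook, where an event loop is already running, `await run_all(...)` would replace the `asyncio.run(...)` call.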

In [13]:
# preview generated QAs
for question, answer in question_answer_list[0][:3]:
    print(f"Q: {question}")
    print(f"A: {answer}")

Q: What is a compute target in Azure Machine Learning and what are its typical uses in a model development lifecycle?
A: A compute target is a designated compute resource or environment where you run your training script or host your service deployment, which can be your local machine or a cloud-based compute resource. In a typical model development lifecycle, you start by developing and experimenting on a small amount of data using your local environment, then scale up to larger data or do distributed training using training compute targets, and finally deploy the model to a web hosting environment using deployment compute targets.


In [14]:
print(f"Token usage: {token_usage}")

Token usage: Counter({'total_tokens': 5507, 'prompt_tokens': 4794, 'completion_tokens': 713})


## Promptflow Bulk Test and Evaluation

Promptflow's Bulk Test and Evaluation lets you test and evaluate flows. It requires data in a specific format, which we'll prepare from our generated QAs. We'll then use this data to run a Bulk Test and Evaluation for the Chat QnA flow.

In [15]:
import json
from collections import defaultdict

data_dict = defaultdict(list)
for question_answers in question_answer_list:
    chat_history = []
    for question, answer in question_answers:
        # QnA columns:
        data_dict["question"].append(question)
        data_dict["ground_truth"].append(answer)  # Consider generated answer as the ground truth

        # Chat QnA columns:
        data_dict["chat_history"].append(json.dumps(chat_history))
        data_dict["chat_input"].append(question)
        chat_history.append({"inputs": {"chat_input": question}, "outputs": {"chat_output": answer}})

In [16]:
import os
import pandas as pd

output_dir = "./data/data_generator_output/"
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, "qa_data.jsonl")

In [17]:
data_df = pd.DataFrame(data_dict, columns=list(data_dict.keys()))
data_df.to_json(output_file, lines=True, orient="records")

# If file already exists:
# data_df = pd.read_json(output_file, lines=True)
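
Before submitting a run, it can help to confirm that every row in the JSONL file has the columns built above. The following is a minimal sketch using only the standard library. `validate_jsonl` is a hypothetical helper, and the toy rows are written to a temporary file rather than the real `qa_data.jsonl`.

```python
import json
import os
import tempfile

REQUIRED_COLUMNS = {"question", "ground_truth", "chat_history", "chat_input"}


def validate_jsonl(path, required=REQUIRED_COLUMNS):
    """Return a list of problems found; an empty list means every row looks usable."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            missing = required - row.keys()
            if missing:
                problems.append(f"line {line_no}: missing {sorted(missing)}")
                continue
            # chat_history is stored as a JSON-encoded string, so it must decode to a list.
            if not isinstance(json.loads(row["chat_history"]), list):
                problems.append(f"line {line_no}: chat_history is not a list")
    return problems


# Toy rows written to a temporary file (not the real qa_data.jsonl).
rows = [
    {"question": "Q1", "ground_truth": "A1", "chat_history": "[]", "chat_input": "Q1"},
    {"question": "Q2", "ground_truth": "A2", "chat_input": "Q2"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
    problems = validate_jsonl(sample_path)
print(problems)
```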

In [19]:
from promptflow.azure import PFClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception:  # fall back to interactive login if no default credential works
    credential = InteractiveBrowserCredential()

pf = PFClient.from_config(credential=credential)  # uses config.json created at the beginning of the notebook

**Note**: You'll need to create a compute instance runtime if you don't have one already. You can create it from: `Prompt flow > Runtime > Create`.

In [20]:
runtime = "ci-runtime"

In [21]:
from azure.core.exceptions import ServiceResponseError

def pf_run_with_retries(*args, **kwargs):
    retries = 0
    while True:
        try:
            base_run = pf.run(*args, **kwargs)
        except ServiceResponseError as e:
            retries += 1
            if retries < 3:
                print(f"Request failed: {e}. Retrying...")
                continue
            raise
        return base_run
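
The helper above retries immediately after a failure. If the service needs time to recover, a delay that grows between attempts may work better. Below is a generic sketch of retrying with exponential backoff. `call_with_retries` and its parameters are hypothetical, and `ConnectionError` stands in for `ServiceResponseError` so that the example runs without Azure packages.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep, **kwargs):
    """Call func, retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on as e:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed: {e}. Retrying in {delay:.1f}s...")
            sleep(delay)


# Toy function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = call_with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` would apply.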

### Bulk Test

We'll create a Bulk Test for a **Chat QnA flow**. This flow contains a sample Vector Index built from the Compute Target documentation. For the Bulk Test we'll use our generated QAs. Each question is fed to the flow as `chat_input`, and the flow returns its answer as `outputs.chat_output`, which can then be compared with the `ground_truth` (the originally generated answer).

In [24]:
base_run = pf_run_with_retries(
    flow="./flows/bring_your_own_data_chat_qna/",
    data=output_file,
    column_mapping={
        "chat_history": "${data.chat_history}",
        "chat_input": "${data.chat_input}",
    },
    runtime=runtime,
)

print("Visualize outputs:")
pf.visualize(base_run)

In [74]:
pf.stream(base_run)

In [33]:
from IPython.display import display

try:
    details = pf.get_details(base_run)
    details = details.merge(
        data_df[["chat_input", "ground_truth"]],
        left_on="inputs.chat_input",
        right_on="chat_input",
        how="inner",
    )
    details = details.drop(["inputs.chat_history", "chat_input"], axis=1)
    display(details.head(5))
except ValueError as e:
    # TODO: fix issue in promptflow library that causes ValueError in pf.get_details()
    print(f"Retrieving details failed with: {e}")
    print("Use the 'Visualize outputs' link printed above to see outputs.")

| |inputs.chat_input|outputs.chat_output|ground_truth|
|--|--|--|--|
|0|What is the retirement date for H-series virtu...|The retirement date for H-series virtual machi...|H-series virtual machine series will be retire...|
|1|What can you do with Azure Machine Learning co...|With Azure Machine Learning compute, you can s...|With Azure Machine Learning compute, you can c...|
|2|What is the purpose of a Docker container in A...|In Azure Machine Learning inference, a Docker ...|When performing inference in Azure Machine Lea...|
|3|What are the capabilities of Compute cluster a...|Compute clusters are multi-node compute target...|Compute cluster supports single- or multi-node...|
|4|What is a compute target in Azure Machine Lear...|A compute target in Azure Machine Learning is ...|A compute target is a designated compute resou...|


### Bulk Evaluate

Now we'll create a Bulk Evaluation for the Chat QnA flow above. We'll evaluate `inputs.answer` (returned by the flow) against `inputs.ground_truth` using the **QnA GPT Similarity** metric, which produces a score from 1 to 5 depending on how similar `inputs.answer` is to `inputs.ground_truth`.

In [35]:
eval_run = pf_run_with_retries(
    flow="./flows/qna_gpt_similarity_eval/",
    data=output_file,
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "ground_truth": "${data.ground_truth}",
        "answer": "${run.outputs.chat_output}",
    },
    runtime=runtime,
)

print("Visualize outputs:")
pf.visualize(eval_run)

In [None]:
pf.stream(eval_run)

In [36]:
eval_details = pf.get_details(eval_run)

In [38]:
for _, row in eval_details.iterrows():
    print("Question:", row["inputs.question"])
    print("Answer:", row["inputs.answer"])
    print("Expected answer:", row["inputs.ground_truth"])
    print("Similarity score:", row["outputs.gpt_similarity"])
    print()
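
Per-row scores are easier to interpret with a summary. The sketch below computes a mean and a histogram for 1 to 5 scores. `summarize_scores` is a hypothetical helper and the input values are made up for illustration; they are not results from an actual run. It assumes scores may arrive as numbers, numeric strings, or missing values.

```python
from collections import Counter
from statistics import mean


def summarize_scores(raw_scores):
    """Return (mean, histogram) for 1-5 scores, skipping values that are not numeric."""
    scores = []
    for value in raw_scores:
        try:
            scores.append(float(value))
        except (TypeError, ValueError):
            continue  # e.g. None or a non-numeric model reply
    histogram = Counter(int(score) for score in scores)
    return (mean(scores) if scores else None), histogram


# Hypothetical scores, not results from an actual run.
average, histogram = summarize_scores([5, "4", 3, None, "5"])
print(f"Mean similarity: {average:.2f}")
for score in sorted(histogram):
    print(f"{score}: {histogram[score]}")
```

With the evaluation results loaded, a call such as `summarize_scores(eval_details["outputs.gpt_similarity"])` would be the equivalent, assuming that column holds the scores as in the loop above.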