# Generate Questions and Answers from your data

## Objective

Use the QADataGenerator to generate high-quality questions and answers from your data using LLMs.

This tutorial requires the following Azure AI resources:

- Azure OpenAI Service: you can apply for access [here](https://go.microsoft.com/fwlink/?linkid=2222006)
- An Azure AI Studio project: go to [aka.ms/azureaistudio](https://aka.ms/azureaistudio) to create a project

## Time

You should expect to spend 5-10 minutes running this sample. 

## About this example

Large Language Models (LLMs) can help you create question and answer datasets from your existing data sources. These datasets can be useful for various tasks, such as testing your retrieval capabilities, evaluating and improving your RAG workflows, tuning your prompts and more. In this sample, we will explore how to use the QADataGenerator to generate high-quality questions and answers from your data using LLMs.

This sample is useful to developers and data scientists who need data to develop, evaluate, or improve RAG workflows.



### Data

In this sample we will use data from two sources. First, we will use text from a Wikipedia article. Second, we will generate QAs from files in the `data/data_generator_texts` folder in this repo.

## Before you begin



### Installation

Install the following packages required to execute this notebook. 



In [None]:
# Install the packages
%pip install azure-identity azure-ai-generative
%pip install wikipedia langchain nltk unstructured

### Parameters

Let's initialize some variables. You can find the values for `subscription_id`, `resource_group_name` and `project_name` on the Project Overview page in AI Studio. Replace the items in <> with the values for your project.

In [None]:
# project details
subscription_id: str = "<your-subscription-id>"
resource_group_name: str = "<your-resource-group>"
project_name: str = "<your-project-name>"

should_cleanup: bool = False

## Connect to your project

To start with let us create a config file with your project details. This file can be used in this sample or other samples to connect to your workspace. 

In [None]:
import json
from pathlib import Path

config = {
    "subscription_id": subscription_id,
    "resource_group": resource_group_name,
    "project_name": project_name,
}

p = Path("config.json")

with p.open(mode="w") as file:
    file.write(json.dumps(config))
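
Before connecting, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch and not part of the SDK: `find_placeholders` is a hypothetical helper, and it assumes that unfilled values still follow the `<...>` pattern used in the parameters cell.

```python
import re
from typing import Dict, List

# Matches values that are still template placeholders such as "<your-project-name>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_placeholders(config: Dict[str, str]) -> List[str]:
    """Return the keys whose values are empty or still look like <placeholders>."""
    return [
        key
        for key, value in config.items()
        if not value or PLACEHOLDER.match(value)
    ]


# Illustrative values only; the subscription ID below is made up.
sample_config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "<your-resource-group>",
    "project_name": "",
}
unfilled = find_placeholders(sample_config)
print(unfilled)
```

Calling `find_placeholders(config)` on the dictionary built above and checking that the result is empty is one way to catch a forgotten value before the connection step fails.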

Let us connect to the project.

In [None]:
from azure.ai.resources.client import AIClient
from azure.identity import DefaultAzureCredential

# connects to project defined in the first config.json found in this or parent folders
client = AIClient.from_config(DefaultAzureCredential())

## Retrieve Azure OpenAI details
We will use an Azure OpenAI service to access the LLM. Let us get its connection details from your project.

In [None]:
# Get the default Azure OpenAI connection for your project
default_aoai_connection = client.get_default_aoai_connection()
default_aoai_connection.set_current_environment()

## Generate QA
Initialize a QA data generator by passing in your Azure OpenAI details for your gpt-4 or gpt-35-turbo deployment.
We'll use it to generate different types of QA for sample text.

Supported QA types:

|Type|Description|
|--|--|
|SHORT_ANSWER|Short answer QAs have answers that are only a few words long. These words are generally relevant details from text like dates, names, statistics, etc.|
|LONG_ANSWER|Long answer QAs have answers that are one or more sentences long. For example, questions whose answer is a definition: What is a {topic_from_text}?|
|BOOLEAN|Boolean QAs have answers that are either True or False.|
|SUMMARY|Summary QAs have questions that ask for a summary for the text's title in a limited number of words. Only one QA is generated for this type.|
|CONVERSATION|Conversation QAs have questions that might reference words or ideas from previous QAs. For example, if the previous conversation was about some topicX from the text, the next question might reference it without using its name: How does **it** compare to topicY?|

In [None]:
from azure.ai.generative.synthetic.qa import QADataGenerator, QAType

# For granular logs you may set DEBUG log level:
# import logging
# logging.basicConfig(level=logging.DEBUG)

model_config = {
    "deployment": "gpt-35-turbo",
    "model": "gpt-35-turbo",
    "max_tokens": 2000,
}

qa_generator = QADataGenerator(model_config=model_config)

### Generate QA from raw text
In this example we use a Wikipedia article as raw text to generate different types of question and answer pairs.

In [None]:
import wikipedia

wiki_title = wikipedia.search("Leonardo da vinci")[0]
wiki_page = wikipedia.page(wiki_title)
text = wiki_page.summary[:700]
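
The fixed slice `[:700]` can cut the summary in the middle of a sentence. As a rough sketch of one alternative, the hypothetical helper `trim_to_sentence` below keeps only whole sentences within the limit. It assumes a sentence ends at `.`, `!` or `?` followed by whitespace, which is a simplification and will misjudge abbreviations.

```python
import re

# A sentence end is ., ! or ? followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def trim_to_sentence(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, ending on a sentence boundary if one exists."""
    if len(text) <= limit:
        return text
    ends = [m.end() for m in SENTENCE_END.finditer(text) if m.end() <= limit]
    # Fall back to a hard cut when no sentence boundary fits inside the limit.
    return text[: ends[-1]] if ends else text[:limit]


sample = "Leonardo was a painter. He was also an engineer. He kept many notebooks."
trimmed = trim_to_sentence(sample, limit=50)
print(trimmed)
```

With the notebook's variables, this would be used as `text = trim_to_sentence(wiki_page.summary, 700)`.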

In [None]:
# Try out with different QATypes like LONG_ANSWER or CONVERSATION
qa_type = QAType.SHORT_ANSWER

result = qa_generator.generate(
    text=text,
    qa_type=qa_type,
    num_questions=5,
)
for question, answer in result["question_answers"]:
    print(f"Q: {question}")
    print(f"A: {answer}")

### Generate QA from files
To generate QA from files, we need to consider two aspects: the file type and the text length. Different file types may require different loaders to extract the raw text. Also, the text length may exceed the model's context limit, which can affect the QA generation performance. Therefore, we use LangChain's `UnstructuredFileLoader` and `NLTKTextSplitter` to handle these issues. `UnstructuredFileLoader` can read various file types and convert them to raw text. `NLTKTextSplitter` divides the text into smaller chunks that fit the model's context. It also avoids splitting the text in the middle of a sentence, which can result in incorrect QAs, so each chunk consists of complete sentences.

We'll read sample files from `data/data_generator_texts` folder to generate QAs. 

In [None]:
texts_glob = Path("data", "data_generator_texts")
# recursively collect every regular file under the folder
files = [file for file in texts_glob.glob("**/*") if file.is_file()]

Let us load the files and split the text into chunks.

In [None]:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import NLTKTextSplitter
import nltk

# download pre-trained Punkt tokenizer for sentence splitting
nltk.download("punkt")

text_splitter = NLTKTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding for gpt-4 and gpt-35-turbo
    chunk_size=300,  # number of tokens to split on
    chunk_overlap=0,
)
texts = []
for file in files:
    loader = UnstructuredFileLoader(file)
    docs = loader.load()
    data = docs[0].page_content
    texts += text_splitter.split_text(data)
print(f"Number of texts after splitting: {len(texts)}")
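
To make the idea of sentence-aware chunking concrete, here is a simplified standard-library sketch. It is not how `NLTKTextSplitter` is implemented: it splits sentences with a naive regular expression instead of the Punkt tokenizer and counts words instead of tokens. The name `chunk_sentences` is hypothetical.

```python
import re
from typing import List


def chunk_sentences(text: str, max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
chunks = chunk_sentences(sample, max_words=5)
print(chunks)
```

Because a chunk is only closed between sentences, no chunk ever ends mid-sentence. The trade-off is that chunk sizes vary, and a very long sentence can exceed the budget.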

#### Generate QA asynchronously
To speed up file processing, we can use the `generate_async` method of `QADataGenerator`. This method lets us send multiple chunks of text to the API in parallel and retrieve the results asynchronously. This way, we avoid waiting for each chunk to be processed sequentially and reduce the overall latency. A semaphore caps the number of concurrent calls.

In [None]:
import asyncio
from collections import Counter
from typing import Dict

concurrency = 3  # number of concurrent calls
sem = asyncio.Semaphore(concurrency)

qa_type = QAType.CONVERSATION


async def generate_async(text: str) -> Dict:
    async with sem:
        return await qa_generator.generate_async(
            text=text,
            qa_type=qa_type,
            num_questions=3,  # Number of questions to generate per text
        )


results = await asyncio.gather(*[generate_async(text) for text in texts], return_exceptions=True)

question_answer_list = []
token_usage = Counter()
for result in results:
    if isinstance(result, Exception):
        raise result  # exception raised inside generate_async()
    question_answer_list.append(result["question_answers"])
    token_usage += result["token_usage"]

print("Successfully generated QAs")
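
The cell above stops at the first failed chunk. If partial results are acceptable, one option is to separate successes from failures instead. The sketch below shows the same semaphore-plus-`gather` pattern as a standalone script. `fake_generate` is a hypothetical stand-in for the real generation call, and the `total_tokens` key is an assumed name used only for illustration.

```python
import asyncio
from collections import Counter
from typing import Any, Dict, List, Tuple


async def fake_generate(text: str) -> Dict[str, Any]:
    """Stand-in for a real generation call; fails on empty input."""
    await asyncio.sleep(0.01)  # simulate network latency
    if not text:
        raise ValueError("empty text")
    return {
        "question_answers": [(f"What is '{text}'?", text)],
        "token_usage": {"total_tokens": len(text)},  # assumed key name
    }


async def run_bounded(
    texts: List[str], concurrency: int
) -> Tuple[List[Dict[str, Any]], List[Tuple[int, Exception]]]:
    """Run fake_generate over texts with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(text: str) -> Dict[str, Any]:
        async with sem:
            return await fake_generate(text)

    outcomes = await asyncio.gather(*(worker(t) for t in texts), return_exceptions=True)
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    # Keep the input index with each error so failed chunks can be retried.
    failures = [(i, o) for i, o in enumerate(outcomes) if isinstance(o, Exception)]
    return successes, failures


# In a notebook, use `await run_bounded(...)` instead of asyncio.run(...).
successes, failures = asyncio.run(run_bounded(["alpha", "", "gamma"], concurrency=2))
usage = Counter()
for item in successes:
    usage.update(item["token_usage"])
print(len(successes), [i for i, _ in failures], dict(usage))
```

Whether to raise on the first error or collect failures depends on the use case: raising is safer when every chunk matters, while collecting avoids discarding completed work.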

### Save the generated data for later use
Let us save the generated QAs in a format that prompt flow can read (for evaluation and batch runs).

In [None]:
output_file = "generated_qa.jsonl"
qa_generator.export_to_file(output_file, qa_type, question_answer_list)
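
A `.jsonl` file holds one JSON object per line, so the export can be inspected with the standard library. This is a minimal sketch: `read_jsonl` is a hypothetical helper, and the demo file and its field names are placeholders rather than the schema that `export_to_file` actually writes.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo on a throwaway file; the field names here are placeholders, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        '{"question": "Q1", "answer": "A1"}\n{"question": "Q2", "answer": "A2"}\n',
        encoding="utf-8",
    )
    rows = read_jsonl(demo_path)

print(len(rows), sorted(rows[0].keys()))
```

To look at the real export, call `read_jsonl(Path(output_file))` and check the keys of the first row before wiring the file into a flow.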

## How to use in prompt flow

To use the generated data in prompt flow, refer to the [documentation](https://learn.microsoft.com/azure/ai-studio/how-to/generate-data-qa?#using-the-generated-data-in-prompt-flow).

## Cleaning up

To clean up all Azure ML resources used in this example, you can delete the individual resources you created in this tutorial.

If you made a resource group specifically to run this example, you could instead [delete the resource group](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/delete-resource-group).

In [None]:
if should_cleanup:
    # add clean up steps if needed
    pass