# Example of generating QAs for an ML book using Azure OpenAI

### Before running the code

You will need the following packages installed (`python-dotenv` provides the `dotenv` module imported below):
```
    pip install langchain pandas unstructured python-dotenv
```

Also, make sure you have a `.env` file in the root directory of this project with the following parameter values:
```
    api_key="YOUR_API_KEY"
    endpoint="YOUR_END_POINT"
    deployment_id="YOUR_DEPLOYMENT_ID"
    model_version="YOUR_MODEL_VERSION"
```
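
Before running the notebook, it can help to confirm that the four values above are actually visible to the process. The sketch below is a minimal check, assuming the environment variable names match the `.env` keys listed above; `missing_keys` is a hypothetical helper written for illustration and is not part of uniflow.

```python
import os

# Keys expected in .env, copied from the list above.
REQUIRED_KEYS = ["api_key", "endpoint", "deployment_id", "model_version"]


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    # Call load_dotenv() first so the .env values are present in os.environ.
    missing = missing_keys(os.environ)
    if missing:
        print("Missing .env values:", ", ".join(missing))
    else:
        print("All required .env values are set.")
```

Passing the mapping in as an argument keeps the check easy to try with a plain dictionary instead of the real environment.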

### Load packages

In [2]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [3]:
import os
import pandas as pd
from dotenv import load_dotenv
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import ExtractHTMLConfig, TransformAzureOpenAIConfig
from uniflow.flow.flow_factory import FlowFactory
from uniflow.op.model.model_config import AzureOpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()

True

In [4]:
FlowFactory.list()

{'extract': ['ExtractHTMLFlow',
  'ExtractImageFlow',
  'ExtractIpynbFlow',
  'ExtractMarkdownFlow',
  'ExtractPDFFlow',
  'ExtractTxtFlow'],
 'transform': ['TransformAzureOpenAIFlow',
  'TransformCopyFlow',
  'TransformHuggingFaceFlow',
  'TransformLMQGFlow',
  'TransformOpenAIFlow'],
 'rater': ['RaterFlow']}

### Prepare the input data

Set file name

In [5]:
html_file = "22.11_information-theory.html"

Set current directory and input data directory

In [6]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)

Load the HTML file via ExtractClient

In [7]:
input_data = [{"filename": input_file}]
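
The input is a list of `{"filename": ...}` dictionaries, so the same shape could in principle describe several files. The sketch below builds such a list for every HTML file in a directory. `build_input_data` is a hypothetical helper, and whether `ExtractClient.run` processes multiple entries the same way as one is an assumption based only on the list format used here.

```python
from pathlib import Path


def build_input_data(input_dir, pattern="*.html"):
    """Return one {"filename": ...} entry per matching file, sorted by path."""
    return [
        {"filename": str(path)}
        for path in sorted(Path(input_dir).glob(pattern))
    ]
```

For example, `build_input_data(os.path.join(dir_cur, "data", "raw_input"))` would produce entries in the same format as `input_data` above.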

In [8]:
extract_client = ExtractClient(ExtractHTMLConfig())

In [9]:
extract_output = extract_client.run(input_data)

100%|██████████| 1/1 [00:00<00:00,  1.75it/s]


### Prepare input dataset

In [10]:
guided_prompt = PromptTemplate(
        instruction="Generate one question and its corresponding answer based on context. Following the format of the examples below to include the same context, question, and answer in the response.",
        few_shot_prompt=[
            Context(
                context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
                question="Who published A Mathematical Theory of Communication in 1948?",
                answer="Claude E. Shannon.",
            )
        ]
)

In [11]:
# Keep only paragraphs longer than 200 characters and wrap each one in a Context
data = [Context(context=p) for p in extract_output[0]["output"][0]["text"] if len(p) > 200]

In [12]:
# Keep only the last two contexts
data = data[-2:]
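
The two cells above first drop paragraphs of 200 characters or fewer and then keep the last two that remain. The sketch below reproduces that selection on plain strings so the effect of the threshold is easy to see. `select_paragraphs` and its parameters are hypothetical names used only for illustration.

```python
def select_paragraphs(paragraphs, min_chars=200, keep_last=2):
    """Keep paragraphs longer than `min_chars`, then only the last `keep_last`.

    `keep_last` is expected to be at least 1.
    """
    long_enough = [p for p in paragraphs if len(p) > min_chars]
    return long_enough[-keep_last:]


paragraphs = ["short", "a" * 201, "b" * 250, "c" * 300, "d" * 200]
selected = select_paragraphs(paragraphs)
print([len(p) for p in selected])  # prints [250, 300]
```

Note that a paragraph of exactly 200 characters is excluded, because the comparison is strict.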

### Run ModelFlow


In [13]:
config = TransformAzureOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=AzureOpenAIModelConfig(response_format={"type": "json_object"}),
)
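
Setting `response_format={"type": "json_object"}` asks the model to return its reply as a JSON string, which then has to be turned into context, question, and answer fields. The sketch below shows one way to do that parsing by hand with the standard library. `parse_qa` is a hypothetical helper for illustration and does not describe how uniflow parses replies internally.

```python
import json

EXPECTED_FIELDS = ("context", "question", "answer")


def parse_qa(content):
    """Parse a JSON reply and return its QA fields, or None if it is malformed."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None
    # Reject replies that are not objects or that lack an expected field.
    if not isinstance(parsed, dict) or any(f not in parsed for f in EXPECTED_FIELDS):
        return None
    return {field: parsed[field] for field in EXPECTED_FIELDS}


reply = (
    '{\n  "context": "Assume that the test word has 4.5 letters.",'
    '\n  "question": "What is the assumed average length of a word?",'
    '\n  "answer": "4.5 letters."\n}'
)
qa = parse_qa(reply)
print(qa["answer"])  # prints 4.5 letters.
```

Returning `None` for malformed replies lets the caller decide whether to retry or skip the item instead of failing the whole run.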

In [14]:
client = TransformClient(config)

In [15]:
output = client.run(data)

Making API call with data: {"instruction": "Generate one question and its corresponding answer based on context. Following the 


  0%|          | 0/2 [00:00<?, ?it/s]

Received response: {'id': 'chatcmpl-8wfYPuuML62dfnDVp00OrBomSSxRX', 'object': 'chat.completion', 'created': 1708993449, 'model': 'gpt-4', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': '{\n  "context": "Assume that the test word has 4.5 letters, how many bits of randomness per character do you observe now?",\n  "question": "What is the assumed average length of a word in the provided context?",\n  "answer": "4.5 letters."\n}'}}], 'usage': {'prompt_tokens': 756, 'completion_tokens': 61, 'total_tokens': 817}, 'system_fingerprint': 'fp_8abb16fa4e'}


 50%|█████     | 1/2 [01:02<01:02, 62.41s/it]

Making API call with data: {"instruction": "Generate one question and its corresponding answer based on context. Following the 
Received response: {'id': 'chatcmpl-8wfZQwNbmjoJejxd5OAlz0aXladUz', 'object': 'chat.completion', 'created': 1708993512, 'model': 'gpt-4', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': '{\n  "context": "22.11. Information Theory\\n22.11.1. Information\\n22.11.1.1. Self-information\\n22.11.2. Entropy\\n22.11.2.1. Motivating Entropy\\n22.11.2.2. Definition\\n22.11.2.3. Interpretations\\n22.11.2.4. Properties of Entropy\\n22.11.3. Mutual Information\\n22.11.3.1. Joint Entropy\\n22.11.3.2. Conditional Entropy\\n22.11.3.3. Mutual Information\\n22.11.3.4. Properties of Mutual Information\\n22.11.3.5. Pointwise Mutual Information\\n22.11.3.6. Applications of Mutual Information\\n22.11.4. Kullback–Leibler Divergence\\n22.11.4.1. Definition\\n22.11.4.2. KL Divergence Properties\\n22.11.4.3. Example\\n22.11.5. Cross-Entropy

100%|██████████| 2/2 [02:12<00:00, 66.38s/it]


In [16]:
output

[{'output': [{'response': [{'context': 'Assume that the test word has 4.5 letters, how many bits of randomness per character do you observe now?',
      'question': 'What is the assumed average length of a word in the provided context?',
      'answer': '4.5 letters.'}],
    'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7ff340e32290>},
 {'output': [{'response': [{'context': '22.11. Information Theory\n22.11.1. Information\n22.11.1.1. Self-information\n22.11.2. Entropy\n22.11.2.1. Motivating Entropy\n22.11.2.2. Definition\n22.11.2.3. Interpretations\n22.11.2.4. Properties of Entropy\n22.11.3. Mutual Information\n22.11.3.1. Joint Entropy\n22.11.3.2. Conditional Entropy\n22.11.3.3. Mutual Information\n22.11.3.4. Properties of Mutual Information\n22.11.3.5. Pointwise Mutual Information\n22.11.3.6. Applications of Mutual Information\n22.11.4. Kullback–Leibler Divergence\n22.11.4.1. Definition\n22.11.4.2. KL Divergence Properties\n22.11.4.3. Example\n22.11.5. Cross-Entropy\n22.1

### Format result into pandas table

In [17]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item['output']:
        for response in i['response']:
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

df = pd.DataFrame({
    'context': contexts,
    'question': questions,
    'answer': answers
})

# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | context | question | answer |
|---|---|---|---|
| 0 | Assume that the test word has 4.5 letters, how many bits of randomness per character do you observe now? | What is the assumed average length of a word in the provided context? | 4.5 letters. |
| 1 | 22.11. Information Theory 22.11.1. Information 22.11.1.1. Self-information 22.11.2. Entropy 22.11.2.1. Motivating Entropy 22.11.2.2. Definition 22.11.2.3. Interpretations 22.11.2.4. Properties of Entropy 22.11.3. Mutual Information 22.11.3.1. Joint Entropy 22.11.3.2. Conditional Entropy 22.11.3.3. Mutual Information 22.11.3.4. Properties of Mutual Information 22.11.3.5. Pointwise Mutual Information 22.11.3.6. Applications of Mutual Information 22.11.4. Kullback–Leibler Divergence 22.11.4.1. Definition 22.11.4.2. KL Divergence Properties 22.11.4.3. Example 22.11.5. Cross-Entropy 22.11.5.1. Formal Definition 22.11.5.2. Properties 22.11.5.3. Cross-Entropy as An Objective Function of Multi-class Classification 22.11.6. Summary 22.11.7. Exercises | What is considered as an objective function of multi-class classification in information theory? | Cross-Entropy. |
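
The formatting loop indexes `response['context']`, `response['question']`, and `response['answer']` directly, so it would raise a `KeyError` if a reply were missing a field. The sketch below is a more tolerant variant, assuming the nested `output` / `response` layout printed earlier; `flatten_output` is a hypothetical helper and the sample data is made up for illustration.

```python
def flatten_output(output):
    """Collect one row per QA response, using None for any missing field."""
    rows = []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                if not isinstance(response, dict):
                    continue  # skip replies that were not parsed into a dict
                rows.append({
                    "context": response.get("context"),
                    "question": response.get("question"),
                    "answer": response.get("answer"),
                })
    return rows


sample_output = [
    {"output": [{"response": [{"context": "c1", "question": "q1", "answer": "a1"}]}]},
    {"output": [{"response": [{"context": "c2", "question": "q2"}]}]},
    {"output": [{"response": ["unparsed text"]}]},
]
rows = flatten_output(sample_output)
print(len(rows))  # prints 2
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to get the same three columns as above.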
