# Example of generating QAs for an ML book using Azure OpenAI

### Before running the code

Make sure you have a `.env` file in the root directory of this project with the following parameter values:
```
    AZURE_API_KEY="YOUR_API_KEY"
    AZURE_ENDPOINT="YOUR_ENDPOINT"
    AZURE_API_VERSION="YOUR_API_VERSION"
    AZURE_DEPLOYMENT_NAME="YOUR_DEPLOYMENT_NAME"
```
`AZURE_API_KEY`, `AZURE_ENDPOINT`, and `AZURE_DEPLOYMENT_NAME` are available in your Azure OpenAI portal. Supported `AZURE_API_VERSION` values are listed [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#chat-completions).
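Before running the cells below, it can help to confirm that all four values are actually visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of Uniflow, and it assumes the variables have already been loaded into the environment (for example by `load_dotenv()`).

```python
import os

# The four variable names listed in the .env file above.
REQUIRED_VARS = (
    "AZURE_API_KEY",
    "AZURE_ENDPOINT",
    "AZURE_API_VERSION",
    "AZURE_DEPLOYMENT_NAME",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing values in the environment:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

If any name is reported as missing, check that the `.env` file sits in the directory the notebook is started from.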

### Load packages

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import pandas as pd
from dotenv import load_dotenv
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import ExtractHTMLConfig, TransformAzureOpenAIConfig
from uniflow.flow.flow_factory import FlowFactory
from uniflow.op.model.model_config import AzureOpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()

True

In [3]:
FlowFactory.list()

{'extract': ['ExtractHTMLFlow',
  'ExtractImageFlow',
  'ExtractIpynbFlow',
  'ExtractMarkdownFlow',
  'ExtractPDFFlow',
  'ExtractTxtFlow'],
 'transform': ['TransformAzureOpenAIFlow',
  'TransformCopyFlow',
  'TransformGoogleFlow',
  'TransformGoogleMultiModalModelFlow',
  'TransformHuggingFaceFlow',
  'TransformLMQGFlow',
  'TransformOpenAIFlow'],
 'rater': ['RaterFlow']}

### Prepare the input data

Set file name

In [4]:
html_file = "22.11_information-theory.html"

Set current directory and input data directory

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)

Load the HTML file via `ExtractClient`

In [6]:
input_data = [{"filename": input_file}]

In [7]:
extract_client = ExtractClient(ExtractHTMLConfig())

In [8]:
extract_output = extract_client.run(input_data)

100%|██████████| 1/1 [00:00<00:00,  3.52it/s]


### Prepare the prompt and input contexts

In [9]:
guided_prompt = PromptTemplate(
        instruction="Generate one question and its corresponding answer based on context. Following the format of the examples below to include the same context, question, and answer in the response.",
        few_shot_prompt=[
            Context(
                context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
                question="Who published A Mathematical Theory of Communication in 1948?",
                answer="Claude E. Shannon.",
            )
        ]
)

In [10]:
# Keep only paragraphs longer than 200 characters as contexts
data = [Context(context=p) for p in extract_output[0]["output"][0]["text"] if len(p) > 200]
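The length filter in the cell above can be tried in isolation to see which paragraphs survive. This is a small sketch on stand-in strings rather than real extracted text; the helper `keep_long_paragraphs` and the `MIN_CHARS` constant are hypothetical names introduced here, and the 200-character threshold is the one used in the notebook.

```python
MIN_CHARS = 200  # threshold used in the notebook cell above


def keep_long_paragraphs(paragraphs, min_chars=MIN_CHARS):
    """Keep paragraphs strictly longer than `min_chars` characters."""
    return [p for p in paragraphs if len(p) > min_chars]


# Stand-in paragraphs: a short heading, one long paragraph, and one
# paragraph of exactly 200 characters (which the strict `>` drops).
paragraphs = ["22.11.1. Information", "x" * 201, "y" * 200]
kept = keep_long_paragraphs(paragraphs)
print(len(kept))
```

Because the comparison is strict, a paragraph of exactly 200 characters is excluded. Short headings are dropped as well, but a long flattened block such as a table of contents can still pass a pure length check.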

### Run ModelFlow


In [11]:
config = TransformAzureOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=AzureOpenAIModelConfig(response_format={"type": "json_object"}),
)
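Setting `response_format={"type": "json_object"}` asks the model to reply with a JSON object rather than free text. The sketch below shows how such a reply could be parsed and checked with the standard library. The `raw_reply` string is a made-up example modeled on the few-shot prompt, and `parse_qa` is a hypothetical helper; Uniflow's own parsing may differ.

```python
import json

# Hypothetical raw reply, shaped like the few-shot example in the prompt.
raw_reply = (
    '{"context": "In 1948, Claude E. Shannon published A Mathematical '
    'Theory of Communication.", '
    '"question": "Who published A Mathematical Theory of Communication in 1948?", '
    '"answer": "Claude E. Shannon."}'
)

EXPECTED_KEYS = {"context", "question", "answer"}


def parse_qa(raw):
    """Return the parsed reply if it is a JSON object with all three fields, else None."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not EXPECTED_KEYS <= parsed.keys():
        return None
    return parsed


qa = parse_qa(raw_reply)
print(qa["question"], "->", qa["answer"])
```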

In [12]:
client = TransformClient(config)

In [13]:
# Keep only the last five contexts to keep this example run short
data = data[-5:]

In [14]:
output = client.run(data)

100%|██████████| 5/5 [01:29<00:00, 17.88s/it]


### Format result into pandas table

In [15]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item['output']:
        for response in i['response']:
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

df = pd.DataFrame({
    'context': contexts,
    'question': questions,
    'answer': answers
})

# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | context | question | answer |
|---|---|---|---|
| 0 | Since in maximum likelihood estimation, we are maximizing the objective function \(l(\theta)\) by having \(\pi_{j} = p_{\theta} (y_{ij} \mid \mathbf{x}_i)\). Therefore, for any multi-class classification, maximizing the above log-likelihood function \(l(\theta)\) is equivalent to minimizing the CE loss \(\textrm{CE}(y, \hat{y})\). | What is equivalent to minimizing the CE loss in multi-class classification? | Maximizing the log-likelihood function \(l(\theta)\). |
| 1 | To test the above proof, let's apply the built-in measure NegativeLogLikelihood. Using the same labels and preds as in the earlier example, we will get the same numerical loss as the previous example up to the 5 decimal place. | What measure is used to test the proof and achieve the same numerical loss as a previous example? | NegativeLogLikelihood. |
| 2 | Information theory is a field of study about encoding, decoding, transmitting, and manipulating information. Entropy is the unit to measure how much information is presented in different signals. KL divergence can also measure the divergence between two distributions. Cross-entropy can be viewed as an objective function of multi-class classification. Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood function. | What is the objective of minimizing cross-entropy loss in multi-class classification? | Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood function. |
| 3 | In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here. | What concept did Claude E. Shannon introduce for the first time in his 1948 article? | The concept of information entropy. |
| 4 | 22.11. Information Theory 22.11.1. Information 22.11.1.1. Self-information 22.11.2. Entropy 22.11.2.1. Motivating Entropy 22.11.2.2. Definition 22.11.2.3. Interpretations 22.11.2.4. Properties of Entropy 22.11.3. Mutual Information 22.11.3.1. Joint Entropy 22.11.3.2. Conditional Entropy 22.11.3.3. Mutual Information 22.11.3.4. Properties of Mutual Information 22.11.3.5. Pointwise Mutual Information 22.11.3.6. Applications of Mutual Information 22.11.4. Kullback–Leibler Divergence 22.11.4.1. Definition 22.11.4.2. KL Divergence Properties 22.11.4.3. Example 22.11.5. Cross-Entropy 22.11.5.1. Formal Definition 22.11.5.2. Properties 22.11.5.3. Cross-Entropy as An Objective Function of Multi-class Classification 22.11.6. Summary 22.11.7. Exercises | What is considered as an objective function of multi-class classification according to the context? | Cross-Entropy. |
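The nested loop in the formatting cell assumes every response contains `context`, `question`, and `answer`. A slightly more defensive variant is sketched below: it skips responses that are not dictionaries or that lack a field, so a single malformed reply does not raise a `KeyError`. The output shape is inferred from the loop above, `flatten_qa` is a hypothetical helper, and `sample_output` is made-up stand-in data rather than real model output.

```python
QA_FIELDS = ("context", "question", "answer")


def flatten_qa(output):
    """Flatten the nested transform output into a list of row dictionaries.

    Responses that are not dicts or that miss any QA field are skipped.
    """
    rows = []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                if not isinstance(response, dict):
                    continue
                if all(field in response for field in QA_FIELDS):
                    rows.append({field: response[field] for field in QA_FIELDS})
    return rows


# Stand-in data: one complete response, one missing fields, one non-dict.
sample_output = [
    {"output": [{"response": [
        {"context": "c1", "question": "q1", "answer": "a1"},
        {"question": "q2"},
    ]}]},
    {"output": [{"response": ["not a dict"]}]},
]

rows = flatten_qa(sample_output)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` to build the same table as above.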


## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/tree/main/example/transform)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>