# Example of generating QAs for an ML book (using self-instruct)
Source: https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html

### Before running the code

You will need to have the following packages installed:
```
pip install langchain pandas unstructured python-dotenv
```

Also, make sure you have a `.env` file containing your OpenAI API key in the root directory of this project:
```
OPENAI_API_KEY=YOUR_API_KEY
```
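Before calling the API, it can help to confirm that the key was actually picked up. The following is a minimal sketch and not part of the original notebook: `has_openai_key` is a hypothetical helper, and treating the literal `YOUR_API_KEY` as "not set" is an assumption based on the placeholder shown above.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY looks set (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # Assumption: the placeholder from the .env template counts as "not set".
    return bool(key) and key != "YOUR_API_KEY"


if not has_openai_key():
    print("OPENAI_API_KEY is missing; check the .env file in the project root.")
```

In the notebook, a check like this would go right after `load_dotenv()`, since that call is what copies the `.env` entries into the process environment.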

## Load packages

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from langchain.document_loaders import UnstructuredHTMLLoader
from dotenv import load_dotenv
from uniflow.op.prompt import Context

load_dotenv()

True

## Prepare the input data

Uncomment one of the HTML files below to use it as the sample input for the self-instruct flow.

In [3]:
#html_file = "do_things_that_dont_scale.html" #from http://paulgraham.com/ds.html
#html_file = "makers_schedule_managers_schedule.html" #from http://www.paulgraham.com/makersschedule.html
#html_file = "life_is_short.html" #http://www.paulgraham.com/vb.html
html_file = "22.11_information-theory.html"

Set current directory and input data directory.

In [4]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)

In [5]:
loader = UnstructuredHTMLLoader(input_file)
pages = loader.load_and_split()

## Prepare input dataset

In [6]:
data = [
    Context(context=p)
    for p in pages[2].page_content.split("\n\n")
    if len(p) > 200
]
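The list comprehension above splits the content of `pages[2]` on blank lines and keeps only paragraphs longer than 200 characters. The same filter on plain strings can be sketched as follows; `split_paragraphs` and `min_chars` are illustrative names, not part of uniflow.

```python
def split_paragraphs(text, min_chars=200):
    """Split text on blank lines and keep paragraphs longer than min_chars."""
    return [p for p in text.split("\n\n") if len(p) > min_chars]


# Three paragraphs: 10, 250, and exactly 200 characters long.
sample = "short line\n\n" + "x" * 250 + "\n\n" + "y" * 200
paragraphs = split_paragraphs(sample)
print(len(paragraphs))  # → 1: only the 250-character paragraph passes the filter
```

Note that the comparison is strict, so a paragraph of exactly 200 characters is dropped. Short fragments such as headings and captions are filtered out the same way.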

In [7]:
data = data[-3:]
data


[Context(context='Any notion of information we develop must conform to this intuition.\r\nIndeed, in the next sections we will learn how to compute that these\r\nevents have \\(0\\textrm{ bits}\\), \\(2\\textrm{ bits}\\),\r\n\\(~5.7\\textrm{ bits}\\), and \\(~225.6\\textrm{ bits}\\) of\r\ninformation respectively.'),
 Context(context='If we read through these thought experiments, we see a natural idea. As\r\na starting point, rather than caring about the knowledge, we may build\r\noff the idea that information represents the degree of surprise or the\r\nabstract possibility of the event. For example, if we want to describe\r\nan unusual event, we need a lot information. For a common event, we may\r\nnot need much information.'),
 Context(context='In 1948, Claude E. Shannon published A Mathematical Theory of\r\nCommunication (Shannon, 1948) establishing the theory of\r\ninformation. In his article, Shannon introduced the concept of\r\ninformation entropy for the first time. We will begi

## Run ModelFlow

In [8]:
config = TransformOpenAIConfig()
client = TransformClient(config)

In [9]:
output = client.run(data)

In [10]:
output

[{'output': [{'response': ['question: How many bits of information do the events have?\nanswer: 0 bits, 2 bits, 5.7 bits, and 225.6 bits respectively.'],
    'error': 'No errors.'}],
  'root': <uniflow.node.node.Node at 0x19e0f3d3c10>},
 {'output': [{'response': ['question: How does information represent the degree of surprise or abstract possibility of an event?\nanswer: Information represents the degree of surprise or abstract possibility of an event by requiring more information to describe an unusual event and less information for a common event.'],
    'error': 'No errors.'}],
  'root': <uniflow.node.node.Node at 0x19e0f3e8370>},
 {'output': [{'response': ['question: Who introduced the concept of information entropy for the first time?\nanswer: Claude E. Shannon.'],
    'error': 'No errors.'}],
  'root': <uniflow.node.node.Node at 0x19e0f3e9cf0>}]
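Each element of `output` holds a list under `'output'`, whose entries carry a `'response'` list and an `'error'` string. A sketch of flattening that shape into `(response, error)` pairs is shown here; `flatten_responses` is a hypothetical helper, and `sample_output` is a hand-written stand-in that mirrors the structure printed above (without the `'root'` nodes).

```python
def flatten_responses(results):
    """Yield (response_text, error) pairs from a list shaped like `output`."""
    for item in results:
        for entry in item.get("output", []):
            for text in entry.get("response", []):
                yield text, entry.get("error")


# Hand-written stand-in with the same shape as the output printed above.
sample_output = [
    {
        "output": [
            {
                "response": [
                    "question: Who introduced the concept of information "
                    "entropy for the first time?\nanswer: Claude E. Shannon."
                ],
                "error": "No errors.",
            }
        ]
    },
]

pairs = list(flatten_responses(sample_output))
print(pairs[0][1])  # → No errors.
```

Iterating over every entry, rather than indexing `[0]` directly, avoids an `IndexError` if a response list ever comes back empty.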

## Format result into pandas table

In [11]:
# Extract the question and answer from each response into a DataFrame
import re

questions = []
answers = []

prompt_keys = [
    "question",
    "answer",
]

for item in output:
    d = item['output'][0]['response'][0]
    # Each response has the form "question: ...\nanswer: ..."
    segments = d.split("\n")

    # Remove only the leading "question:" / "answer:" label. str.strip(label)
    # would treat the label as a set of characters and could eat into the text.
    question = re.sub(rf"^\s*{re.escape(prompt_keys[0])}:\s*", "", segments[-2])
    answer = re.sub(rf"^\s*{re.escape(prompt_keys[1])}:\s*", "", segments[-1])
    questions.append(question.strip())
    answers.append(answer.strip())

df = pd.DataFrame({
    'question': questions,
    'answer': answers
})

# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | question | answer |
|---|---|---|
| 0 | How many bits of information do the events have? | 0 bits, 2 bits, 5.7 bits, and 225.6 bits respectively. |
| 1 | How does information represent the degree of surprise or abstract possibility of an event? | Information represents the degree of surprise or abstract possibility of an event by requiring more information to describe an unusual event and less information for a common event. |
| 2 | Who introduced the concept of information entropy for the first time? | Claude E. Shannon. |
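The table above contains only questions and answers. If the source paragraph is wanted alongside each pair, one option is to match inputs and parsed results by position. This sketch assumes the flow returns one result per input, in input order, which the notebook does not state explicitly; `attach_contexts` is a hypothetical helper that works on plain strings and dicts.

```python
def attach_contexts(contexts, qa_pairs):
    """Pair each context string with its question/answer dict by position."""
    if len(contexts) != len(qa_pairs):
        raise ValueError("expected one question/answer pair per context")
    return [
        {"context": ctx, "question": qa["question"], "answer": qa["answer"]}
        for ctx, qa in zip(contexts, qa_pairs)
    ]


# Illustrative input; the context string is shortened for the example.
rows = attach_contexts(
    ["In 1948, Claude E. Shannon published ..."],
    [
        {
            "question": "Who introduced the concept of information entropy for the first time?",
            "answer": "Claude E. Shannon.",
        }
    ],
)
print(rows[0]["answer"])  # → Claude E. Shannon.
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` to get a three-column table.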
