# Generating QAs from a Jupyter Notebook

In this example, we show how to generate question-answer (QA) pairs from a given Jupyter notebook.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
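
Before making any API calls, it can help to fail fast when the key is missing. The sketch below uses only the standard library. `require_env` is a hypothetical helper name, not part of uniflow, and it assumes the key has already been loaded into the process environment (for example by `load_dotenv()`, which this notebook calls later).

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration only; it is not part of uniflow.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

A typical call would be `api_key = require_env("OPENAI_API_KEY")`, placed right after the environment has been loaded.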

### Import dependencies
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import re
import pandas as pd
from dotenv import load_dotenv
from pprint import pprint

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

from langchain.document_loaders import NotebookLoader

load_dotenv()

True

### Prepare the input data
First, we need to pre-process the given Jupyter notebook `model.ipynb` into text chunks that we can feed into the model. We will use `NotebookLoader` from langchain.

In [3]:
dir_cur = os.getcwd()
jupyter_notebook_file = "model.ipynb"
input_file = os.path.join(dir_cur, jupyter_notebook_file)

In [4]:
loader = NotebookLoader(input_file,
                        include_outputs=True,
                        max_output_length=1000,
                        remove_newline=True)
raw_content = loader.load()
raw_content[0].page_content


'\'markdown\' cell: \'[\'# Notebook for ModelFlow \', \'\', "In this example, we will show you how to generate question-answers (QAs) from give text strings using OpenAI\'s models via uniflow\'s [ModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L11).", \'\', \'### Before running the code\', \'\', \'You will need to `uniflow` conda environment to run this notebook. You can set up the environment following the instruction: https://github.com/CambioML/uniflow/tree/main#installation.\', \'\', \'Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see this [instruction](https://github.com/CambioML/uniflow/tree/main#api-keys)\', \'\', \'### Update system path\']\'\n\n \'code\' cell: \'[\'%reload_ext autoreload\', \'%autoreload 2\', \'\', \'import sys\', \'\'
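
To see where this format comes from, recall that an `.ipynb` file is plain JSON with a `cells` list, and each cell stores its `source` as a list of lines. The sketch below is an approximation inferred from the output above, not langchain's actual implementation. `render_cells` is a hypothetical name, and the sketch ignores cell outputs entirely.

```python
import json


def render_cells(notebook_json: str) -> str:
    """Render every cell as "'<type>' cell: '<list of source lines>'" (illustrative only)."""
    notebook = json.loads(notebook_json)
    rendered = []
    for cell in notebook["cells"]:
        # Drop newline characters, mimicking the effect of remove_newline=True seen above.
        lines = [line.replace("\n", "") for line in cell["source"]]
        # Formatting a list inside an f-string yields its Python repr, e.g. "['# Title', ...]".
        rendered.append(f"'{cell['cell_type']}' cell: '{lines}'\n\n")
    return "".join(rendered)


# A made-up two-cell notebook, for illustration only.
toy_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "source": ["x = 1"]},
    ]
})
print(render_cells(toy_notebook))
```

Because each cell's lines are printed as a Python list, the separator between lines inside a cell is the literal sequence `', '`.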

As you can see above, the loaded notebook content is quite messy. Let's split it at each markdown header!

In [5]:
def split_string_based_on_markdown_header(s, markdown_symbol, code_symbol):
    """
    Split a string on the markdown and code cell markers, then regroup the
    pieces so that the string is split at each markdown header.

    :param s: The string to be split.
    :param markdown_symbol: The marker that introduces a markdown cell.
    :param code_symbol: The marker that introduces a code cell.
    :return: A list of strings, split at markdown headers.
    """
    # Create a regular expression pattern from the tokens, ensuring to escape special characters
    pattern = '|'.join(re.escape(token) for token in [markdown_symbol, code_symbol])

    # Use re.split() to split the string, but keep the tokens in the result
    strings = re.split(f'(?={pattern})', s)

    # Further process the list
    processed = []
    for chunk in strings:
        # Markdown cells: split further so that each header starts a new entry
        if chunk.startswith(markdown_symbol):
            # Cell lines are rendered as a Python list, so they are separated by "', '"
            parts = chunk.split("', '")

            # Process each part
            new_list = []
            for part in parts:
                # A part starting with "#" (any header level) begins a new entry
                if part.startswith("#"):
                    new_list.append(part)
                # Otherwise, it is appended to the previous part
                elif new_list:
                    new_list[-1] += "\n" + part
                else:
                    # If it's the first part and doesn't start with "#", it's added as is
                    new_list.append(part)

            # Add the processed parts to the main list
            processed.extend(new_list)

        # Code cells are appended to the most recent entry so they stay with their header
        elif chunk.startswith(code_symbol):
            if processed:
                processed[-1] += chunk
            else:
                # The notebook starts with a code cell, so there is no entry to attach it to
                processed.append(chunk)
        # Keep any other non-empty text as its own entry
        elif chunk.replace("\n", ""):
            processed.append(chunk)

    return processed
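
The first step of the function relies on a zero-width lookahead, `(?=...)`, so that `re.split` cuts before each marker and the marker stays attached to the chunk that follows it. A minimal sketch on a toy string, assuming Python 3.7 or later (where `re.split` accepts patterns that match the empty string); the sample text is made up for illustration:

```python
import re

tokens = ["'markdown' cell: '", "'code' cell: '"]
pattern = "|".join(re.escape(token) for token in tokens)

sample = "'markdown' cell: '['# A']'\n\n'code' cell: '['x = 1']'\n\n"

# A lookahead consumes no characters, so every marker is kept in the result.
# The match at position 0 produces a leading empty string, which the function
# above discards in its final branch.
chunks = re.split(f"(?={pattern})", sample)
for chunk in chunks:
    print(repr(chunk))
```

Joining the chunks back together reproduces the original string exactly, which is what makes it safe to re-attach code cells to the preceding markdown entry afterwards.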


Now, with the `split_string_based_on_markdown_header` function, let's split the loaded notebook. The result is organized much more semantically, as shown below.

In [6]:
markdown_symbol = """\'markdown\' cell: \'"""
code_symbol = """\'code\' cell: \'"""
content_splited_by_header = split_string_based_on_markdown_header(
    raw_content[0].page_content,
    markdown_symbol=markdown_symbol,
    code_symbol=code_symbol,
)
for j in content_splited_by_header:
    print("================= New Header ================")
    print(j)

'markdown' cell: '['# Notebook for ModelFlow 
', "In this example, we will show you how to generate question-answers (QAs) from give text strings using OpenAI's models via uniflow's [ModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L11).", '
### Before running the code

You will need to `uniflow` conda environment to run this notebook. You can set up the environment following the instruction: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see this [instruction](https://github.com/CambioML/uniflow/tree/main#api-keys)

### Update system path']'

 'code' cell: '['%reload_ext autoreload', '%autoreload 2', '', 'import sys', '', 'sys.path.append(".")', 'sys.path.append("..")', 'sys.path.appen

### Run Uniflow on the notebook content (with prompt)

Now we can extract knowledge from the given Jupyter notebook via Uniflow! First, we need to define a [PromptTemplate](https://github.com/CambioML/uniflow/blob/main/uniflow/schema.py#L57), which includes an instruction and a list of examples for the LLM to do few-shot learning.

In [None]:
guided_prompt = PromptTemplate(
    instruction="""If there is a code cell, generate one question given the markdown cell and its corresponding \
answer based on code cell and its output. If there is no code cell, generate one question and its corresponding \
answer based on context. Following the format of the examples below to include the same context, question, and \
answer in the response.""",
    few_shot_prompt=[
        Context(
            context="""'markdown' cell: '['### Use LLM to generate data in Uniflow. \
In this example, we use the base `Config` defaults with the [OpenAIModelConfig] to generate questions and answers.']' \
'code' cell: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']'""",
            question="""How to use LLM to generate data in Uniflow""",
            answer="""We can use the Uniflow's default [OpenAIModelConfig] to generate questions and answers with code: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']'""",
        )
    ]
)
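
Conceptually, a few-shot prompt is the instruction followed by the worked examples and then the new input. The sketch below illustrates that idea with plain dataclasses. `Example` and `build_prompt` are hypothetical names and the line-based layout is an assumption; this is not how uniflow actually serializes a `PromptTemplate`.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Example:
    """Stand-in for one few-shot example (hypothetical, not uniflow's Context)."""
    context: str
    question: str
    answer: str


def build_prompt(instruction: str, examples: List[Example], new_context: str) -> str:
    """Lay out the instruction, each example, and the new input on separate lines."""
    lines = [f"instruction: {instruction}"]
    for example in examples:
        # Each example is shown in the same JSON shape we want the model to return.
        lines.append(json.dumps(asdict(example)))
    # The new input carries only the context; the model fills in the rest.
    lines.append(json.dumps({"context": new_context}))
    return "\n".join(lines)


demo_prompt = build_prompt(
    "Generate one question and answer from the context.",
    [Example(context="c", question="q", answer="a")],
    "new context",
)
print(demo_prompt)
```

Showing the examples in the target output shape is what nudges the model to answer in that same shape.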

In this example, we use the [`OpenAIModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) to select the default LLM for generating questions and answers. If you want to use open-source models, you can replace `TransformOpenAIConfig` and `OpenAIModelConfig` with `HuggingfaceConfig` and [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

Now we pass our `guided_prompt` to `TransformOpenAIConfig` so that our customized instruction and examples are used instead of the `uniflow` defaults.

We also want to get the response in the `json` format instead of the `text` default, so we set the `response_format` to `json_object`.

In [8]:
config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)
client = TransformClient(config)
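
Requesting JSON makes each reply machine-readable, but it is still worth validating before use. A minimal sketch, assuming a reply is a JSON object with `context`, `question`, and `answer` keys as in the few-shot example; `parse_qa` is a hypothetical helper, not a uniflow function.

```python
import json
from typing import Dict, Optional

REQUIRED_KEYS = ("context", "question", "answer")


def parse_qa(raw_reply: str) -> Optional[Dict[str, str]]:
    """Return the QA fields from a JSON reply, or None if the reply is unusable."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        # The reply was not valid JSON at all.
        return None
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        # Valid JSON, but not the shape we asked for (e.g. an error object).
        return None
    return {key: str(data[key]) for key in REQUIRED_KEYS}
```

Returning `None` for malformed replies lets downstream code skip them instead of failing part-way through a batch.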

Since each example in the `guided_prompt` is wrapped in `Context`, we need to apply the same format to our input data.

In [9]:
data = [Context(context=p) for p in content_splited_by_header if len(p) > 10]
output = client.run(data)

  0%|          | 0/11 [00:00<?, ?it/s]

In [10]:
output

[{'output': [{'response': [{'error': 'No code cell found'}],
    'error': 'No errors.'}],
  'root': <uniflow.node.node.Node at 0x24aef2c3970>},
 {'output': [{'response': [{'response': [{'context': "'markdown' cell: '['### Use LLM to generate data in Uniflow.\n            In this example, we use the base `Config` defaults with the [OpenAIModelConfig] to generate questions and answers.']'\n            'code' cell: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']'",
        'question': 'How to use LLM to generate data in Uniflow',
        'answer': "We can use the Uniflow's default [OpenAIModelConfig] to generate questions and answers with code: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']'"},
       {'context': '### Before running the code\n\nYou will need to `uniflow` conda environment to run this notebook. You can set up the environment following the instruction: https://github.com/CambioML/uniflow/tree/main#installa

### Reformat the output into a pandas table

The output is a bit messy, so let's restructure it into a pandas DataFrame.

In [11]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame([{'context': response['context'],
                    'question': response['question'],
                    'answer': response['answer']}
                   for item in output
                   for i in item['output']
                   for response in i['response'] if 'context' in response])

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df


| | context | question | answer |
|---|---|---|---|
| 0 | 'markdown' cell: '['### Use LLM to generate data in Uniflow.  In this example, we use the base `Config` defaults with the [OpenAIModelConfig] to generate questions and answers.']'  'code' cell: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']' | How to use LLM to generate data in Uniflow | We can use the Uniflow's default [OpenAIModelConfig] to generate questions and answers with code: '['config = Config(model_config=OpenAIModelConfig())', 'client = Client(config)']' |
| 1 | 'markdown' cell: '['Next, for the given raw text strings `raw_context_input` above, we convert them to the `Context` class to be processed by `uniflow`.']'  'code' cell: '['', 'data = [', ' Context(context=c)', ' for c in raw_context_input', ']']' | How do we convert the raw text strings to the Context class for processing by uniflow? | We can convert the raw text strings to the Context class for processing by uniflow using the following code: '['', 'data = [', ' Context(context=c)', ' for c in raw_context_input', ']']' |
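
To make the nested comprehension easier to follow, here is the same flattening logic applied to a small mock `output` built from plain lists and dicts. The mock data is invented for illustration and only mirrors the general shape seen above; entries whose response lacks a `context` key (such as the error entry) are skipped.

```python
# Invented data that mimics the structure of `output`; not real model results.
mock_output = [
    {"output": [{"response": [{"error": "No code cell found"}], "error": "No errors."}]},
    {"output": [{"response": [
        {"context": "ctx 1", "question": "q 1", "answer": "a 1"},
        {"context": "ctx 2", "question": "q 2", "answer": "a 2"},
    ], "error": "No errors."}]},
]

# Walk item -> run -> response and keep only responses that look like QA pairs.
rows = [
    {key: response[key] for key in ("context", "question", "answer")}
    for item in mock_output
    for run in item["output"]
    for response in run["response"]
    if "context" in response
]
print(rows)
```

Each dict in `rows` becomes one row of the DataFrame, which is why the table can have fewer rows than there were input chunks.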


## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
