# Example of generating QAs for a Paul Graham Essay
**Source:** http://www.paulgraham.com/makersschedule.html

**Description:** A famous essay by Paul Graham about the difference between the schedules of managers and makers.

### Before running the code

You will need to have the following packages installed:
```
pip install nougat-ocr pandas pypdf
```

Also, make sure you have a .env file with your OpenAI API key in the root directory of this project.
```
OPENAI_API_KEY=YOUR_API_KEY
```

### Load Packages

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import pandas as pd
from dotenv import load_dotenv
from uniflow.flow.client import TransformClient, ExtractClient
from uniflow.flow.config import ExtractPDFConfig, NougatModelConfig
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()

True
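
A `True` return value from `load_dotenv()` only indicates that at least one variable was loaded from the `.env` file, not that the key you need is among them. The following is a minimal sketch of an explicit check; the helper name `require_env` is hypothetical and is not part of uniflow or python-dotenv.

```python
import os


def require_env(name, env=None):
    """Return the value of `name`, or raise if it is missing or blank.

    `require_env` is a hypothetical helper written for this walkthrough.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example with an explicit mapping, so the check can be tried without a real key.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "placeholder-value"}))
```

In the notebook itself, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would fail early instead of at the first API request.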

### Prepare the input data

In [3]:
pdf_file = "makers_schedule_managers_schedule.pdf"

Set current directory and input data directory.

In [4]:
dir_cur = os.getcwd()
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)
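
If the PDF is not where the notebook expects it, the extraction step fails later with a less obvious error. As a hedged alternative, the same path can be built with `pathlib` and checked up front; `resolve_input` is a hypothetical helper name, and the `data/raw_input` layout is taken from the cell above.

```python
from pathlib import Path


def resolve_input(base_dir, file_name):
    """Build <base_dir>/data/raw_input/<file_name> and fail early if it is absent.

    `resolve_input` is a hypothetical helper, not part of uniflow.
    """
    path = Path(base_dir) / "data" / "raw_input" / file_name
    if not path.is_file():
        raise FileNotFoundError(f"Expected input PDF at {path}")
    return str(path)
```

With this helper, `input_file = resolve_input(os.getcwd(), pdf_file)` would replace the `os.path.join` call.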

### Load and split the pdf

In [5]:
pdf_directory = [
    {"pdf": input_file},
]

extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=128,  # When batch_size > 1, nougat runs on CUDA; otherwise it runs on CPU
    )
)

nougat_client = ExtractClient(extract_config)

pdf_output = nougat_client.run(pdf_directory)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:19<00:00, 19.78s/it]


Pre-process the extracted output: the PDF text arrives as individual lines, so merge consecutive lines into chunks of at most 1,000 tokens each. Tokens are approximated here by whitespace-separated words.

In [6]:
def count_tokens(text):
    # Assuming each word is a token, this function counts the number of tokens in the text.
    return len(text.split())

def recreate_string(pdf_output):
    recreated_output = []
    current_element = ""
    for line in pdf_output:
        line = line.rstrip('\n')  # Remove the trailing newline character
        if current_element:
            temp_element = current_element + " " + line
        else:
            temp_element = line

        if count_tokens(temp_element) <= 1000:
            current_element = temp_element
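        else:
            # Flush the finished chunk; skip the empty chunk that would
            # otherwise be emitted when the very first line exceeds the limit.
            if current_element:
                recreated_output.append(current_element)
            current_element = line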

    if current_element:
        recreated_output.append(current_element)

    return recreated_output

page_contents = recreate_string(pdf_output[0]['output'][0]['text'])

print(page_contents)

['[MISSING_PAGE_EMPTY:1] capacity. A small decrease in morale is enough to kill them off. Each type of schedule works fine by itself. Problems arise when they meet. Since most powerful people operate on the manager\'s schedule, they\'re in a position to make everyone resonate at their frequency if they want to. But the smarter ones restrain themselves, if they know that some of the people working for them need long chunks of time to work in. Our case is an unusual one. Nearly all investors, including all VCs I know, operate on the manager\'s schedule. But Y Combinator runs on the maker\'s schedule. Rtm and Trevor and I do because we always have, and Jessica does too, mostly, because she\'s gotten into sync with us. I wouldn\'t be surprised if there start to be more companies like us. I suspect founders may increasingly be able to resist, or at least postpone, turning into managers, just as a few decades ago they started to be able to resist switching from jeans to suits. How do we mana
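
To see how the greedy merge behaves, it helps to run the same logic with a much smaller limit. The sketch below is a parameterized variant of `recreate_string`; the name `chunk_lines` and the `max_tokens` argument are introduced here for illustration and are not part of the notebook or of uniflow.

```python
def chunk_lines(lines, max_tokens):
    """Greedily merge lines into chunks of at most `max_tokens` words."""
    chunks, current = [], ""
    for line in lines:
        line = line.rstrip("\n")
        candidate = f"{current} {line}" if current else line
        if len(candidate.split()) <= max_tokens:
            # The line still fits, so keep growing the current chunk.
            current = candidate
        else:
            # The line does not fit: close the current chunk and start a new one.
            if current:
                chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks


lines = ["one two three\n", "four five\n", "six seven eight nine\n", "ten\n"]
print(chunk_lines(lines, max_tokens=5))
# ['one two three four five', 'six seven eight nine ten']
```

Note that a single line longer than the limit is kept as its own oversized chunk rather than being split, which matches the behavior of the original function.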

In [13]:
guided_prompt = PromptTemplate(
    instruction="Generate one question and its corresponding answer based on the context. Follow the format of the examples below and include the same context, question, and answer fields in the response.",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon."
        ),
    ]
)

data = [Context(context=page_contents[0])]

In [8]:
data

[Context(context='[MISSING_PAGE_EMPTY:1] capacity. A small decrease in morale is enough to kill them off. Each type of schedule works fine by itself. Problems arise when they meet. Since most powerful people operate on the manager\'s schedule, they\'re in a position to make everyone resonate at their frequency if they want to. But the smarter ones restrain themselves, if they know that some of the people working for them need long chunks of time to work in. Our case is an unusual one. Nearly all investors, including all VCs I know, operate on the manager\'s schedule. But Y Combinator runs on the maker\'s schedule. Rtm and Trevor and I do because we always have, and Jessica does too, mostly, because she\'s gotten into sync with us. I wouldn\'t be surprised if there start to be more companies like us. I suspect founders may increasingly be able to resist, or at least postpone, turning into managers, just as a few decades ago they started to be able to resist switching from jeans to suits

In [9]:
config = TransformOpenAIConfig(
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)
client = TransformClient(config)
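
Setting `response_format={"type": "json_object"}` asks the OpenAI API to return a JSON object serialized as text. The output shown later in this notebook is already parsed into dictionaries, so the sketch below is only an illustration of the kind of validation that is useful when handling such a reply yourself. The names `parse_qa` and `REQUIRED_KEYS` are hypothetical.

```python
import json

# Fields the few-shot example asks the model to return (assumed schema).
REQUIRED_KEYS = ("context", "question", "answer")


def parse_qa(raw):
    """Parse one JSON reply and check that it carries the expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Reply is missing fields: {missing}")
    return {key: data[key] for key in REQUIRED_KEYS}


reply = '{"context": "c", "question": "q?", "answer": "a"}'
print(parse_qa(reply))
# {'context': 'c', 'question': 'q?', 'answer': 'a'}
```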

In [10]:
output = client.run(data)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]


In [11]:
output

[{'output': [{'response': [{'context': "When you're operating on the manager's schedule you can do something you'd never want to do on the maker's: you can have speculative meetings. You can meet someone just to get to know one another. If you have an empty slot in your schedule, why not? Maybe it will turn out you can help one another in some way.",
      'question': "What can you do on the manager's schedule that you would never want to do on the maker's schedule?",
      'answer': 'You can have speculative meetings and meet someone just to get to know one another.'}],
    'error': 'No errors.'}],
  'root': <uniflow.node.Node at 0x7fcd1cd68970>}]

In [12]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item['output']:
        for response in i['response']:
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

df = pd.DataFrame({
    'context': contexts,
    'question': questions,
    'answer': answers
})

# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

styled_df = df.head().style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

| | context | question | answer |
|---|---|---|---|
| 0 | When you're operating on the manager's schedule you can do something you'd never want to do on the maker's: you can have speculative meetings. You can meet someone just to get to know one another. If you have an empty slot in your schedule, why not? Maybe it will turn out you can help one another in some way. | What can you do on the manager's schedule that you would never want to do on the maker's schedule? | You can have speculative meetings and meet someone just to get to know one another. |
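
The extraction loop in cell 12 assumes every response carries all three keys and would raise a `KeyError` on a malformed item. A minimal defensive sketch follows, run against a hand-written sample in the same nested shape as `output`; `flatten_qa` is a hypothetical helper, and the sample data is invented purely to exercise it.

```python
QA_KEYS = ("context", "question", "answer")


def flatten_qa(output):
    """Collect complete QA dicts from the nested output, skipping malformed ones."""
    rows = []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                if isinstance(response, dict) and all(k in response for k in QA_KEYS):
                    rows.append({k: response[k] for k in QA_KEYS})
    return rows


# Invented sample: one complete response and one that lacks an answer.
sample = [
    {
        "output": [
            {
                "response": [
                    {"context": "c1", "question": "q1", "answer": "a1"},
                    {"context": "c2", "question": "q2"},
                ],
                "error": "No errors.",
            }
        ]
    }
]
print(flatten_qa(sample))
# [{'context': 'c1', 'question': 'q1', 'answer': 'a1'}]
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.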
