# Example of generating a self-instruct dataset from Paul Graham's essays
Source: http://www.paulgraham.com/articles.html

### Load packages

In [1]:
import os
import pandas as pd
import sys
sys.path.append(os.path.join(os.getcwd(), os.pardir, os.pardir))
from uniflow.client import Client
from uniflow.flow.constants import (OUTPUT_NAME, INPUT_FILE, QAPAIR_DF_KEY, OUTPUT_FILE)

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
  from .autonotebook import tqdm as notebook_tqdm


### Prepare the input data

Uncomment one of the HTML files below to use it as the sample input for the self-instruct flow.

In [2]:
# html_file = "do_things_that_dont_scale.html"  # from http://paulgraham.com/ds.html
# html_file = "makers_schedule_managers_schedule.html"  # from http://www.paulgraham.com/makersschedule.html
html_file = "life_is_short.html"  # from http://www.paulgraham.com/vb.html
# html_file = "22.11_information-theory.html"

Set the current directory and the input data directory.

In [3]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)
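
Because the flow takes a few minutes to run, it can help to confirm that the input path resolves to a real file first. The following is a minimal sketch; `check_input_file` is a hypothetical helper and not part of uniflow.

```python
import os


def check_input_file(path: str) -> str:
    """Return the absolute path if it points to an existing file.

    Hypothetical helper (not part of uniflow): raises FileNotFoundError
    early so a typo in the file name is caught before the model loads.
    """
    abs_path = os.path.abspath(path)
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"Input HTML file not found: {abs_path}")
    return abs_path
```

In the notebook this could be called as `input_file = check_input_file(input_file)` right after the path is built.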

### Run the Self Instructed Gen Flow

Note that this cell will take a few minutes to run (especially if you are on a single-GPU machine).

In [4]:

# Initiate flow
client = Client("flow_self_instructed_gen")

# Run flow
input_dict = {INPUT_FILE: input_file}
input_list = [input_dict]
output_list = client.run(input_list)
output_dict = output_list[0]

print(f"output_dict keys: {output_dict.keys()}")

INFO [preprocess_html_op]: Starting Preprocess HTML...
INFO [preprocess_html_op]: Preprocess HTML Complete!
INFO [si_model_inf_op]: Initializing SIModelInfOp...
INFO [si_model_inf_op]: 1. Initializing model...
Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.18s/it]
INFO [si_model_inf_op]: 2. Initializing pipeline...
INFO [si_model_inf_op]: 3. Creating LangChain LLMChain...
INFO [si_model_inf_op]: SIModelInfOp initialization Complete!
INFO [si_model_inf_op]: Starting SIModelInfOp transform...
INFO [si_model_inf_op]: Processing page 1 of 3...
INFO [si_model_inf_op]: === processed page 1 | total questions generated: 4 ===
INFO [si_model_inf_op]: Processing page 2 of 3...
INFO [si_model_inf_op]: === processed page 2 | total questions generated: 5 ===
INFO [si_model_inf_op]: Processing page 3 of 3...
INFO [si_model_inf_op]: === processed page 3 | total questions generated: 6 ===
INFO [si_model_inf_op]: SIModelInfOp transform complete!
INFO [data_output_si_op]: Starting Data

output_dict keys: dict_keys(['output', 'root'])


### Print out the results

In [5]:
# number of output nodes
len(output_dict[OUTPUT_NAME])

1

In [6]:
# output dictionary keys
output_dict[OUTPUT_NAME][0].keys()

dict_keys(['QApair_df', 'output_file'])

In [7]:
# output file path
output_dict[OUTPUT_NAME][0][OUTPUT_FILE]

'/home/ubuntu/uniflow/example/self_instructed_ft/data/output/output_self_instructed_data.csv'
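
The saved CSV can be reloaded later without rerunning the flow. The sketch below assumes the file's first column is an unnamed row index followed by `Question` and `Answer` columns, which is what the `Unnamed: 0` header in the preview suggests; `load_qa_pairs` is a hypothetical helper name.

```python
import pandas as pd


def load_qa_pairs(csv_path) -> pd.DataFrame:
    """Reload the generated question-answer pairs from the output CSV.

    Assumption: the first column holds the row index written by pandas.
    Passing index_col=0 uses it as the index instead of surfacing it as
    an extra "Unnamed: 0" column.
    """
    return pd.read_csv(csv_path, index_col=0)
```

For example, `load_qa_pairs(output_dict[OUTPUT_NAME][0][OUTPUT_FILE])` would read back the file at the path printed above.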

In [8]:
# Set this option to None to display full contents of each column
pd.set_option('display.max_colwidth', None)

# Show up to the first 50 generated question-answer pairs.
output_dict[OUTPUT_NAME][0][QAPAIR_DF_KEY][:50]

| Unnamed: 0 | Question | Answer |
|---|---|---|
| 0 | What is the author's opinion on whether life is short or not?[Page 0] | The author believes that life is short. |
| 1 | How did having children change the author's perspective on the length of life?[Page 0] | Having children made the author realize that life is indeed short because it helped them convert time into discrete quantities. They were able to count the number of weekends spent with their child and the number of times they experienced certain events like Christmas magic. |
| 2 | Does knowing that life is short make a difference to the author?[Page 0] | Yes, knowing that life is short makes a big difference to the author. It gives greater weight to arguments such as "Life is too short for X". It also helps the author identify things that are unnecessary and wasteful, which they refer to as "bullshit", and eliminating those things from their lives. |
| 3 | What kinds of activities does the author consider to be "bullshit"?[Page 0] | The author considers activities such as unnecessary meetings, pointless disputes, bureaucracy, posturing, dealing with other people's mistakes, traffic jams, and addictive but unrewarding pastimes to be "bullshit". These activities either get forced upon us or trick us into doing them. |
| 4 | What is the author's opinion on defending oneself?[Page 1] | The author believes that it's better most of the time not to defend oneself, as counterintuitive as it may feel. He argues that people who attack others are literally taking their lives. |
| 5 | What is the main idea of the passage?[Page 2] | The main idea of the passage is to relentlessly prune bullshit, prioritize doing things that matter, and savor the time you have. This is because life is short. |
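
Each generated question ends with a `[Page N]` tag. For fine-tuning it may be useful to move that tag into its own column. The following is a rough sketch under that assumption; `split_page_tag` and the `Page` column name are hypothetical and not part of uniflow.

```python
import pandas as pd


def split_page_tag(df: pd.DataFrame) -> pd.DataFrame:
    """Move a trailing "[Page N]" tag from Question into a Page column.

    Hypothetical post-processing step. Rows without a tag keep their
    question text unchanged and get a missing value in Page.
    """
    out = df.copy()
    # Capture the question text and the page number separately.
    parts = out["Question"].str.extract(
        r"^(?P<text>.*?)\s*\[Page (?P<page>\d+)\]\s*$"
    )
    # Nullable integer dtype so untagged rows can hold <NA>.
    out["Page"] = pd.to_numeric(parts["page"]).astype("Int64")
    # Fall back to the original text where no tag was found.
    out["Question"] = parts["text"].fillna(out["Question"])
    return out
```

Applied to the DataFrame above, this would leave the question text without the tag and record the page number separately.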
