# Example of generating QAs for an ML book (using self-instruct)
Source: https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html

### Load packages

In [1]:
import os
import pandas as pd
import sys
sys.path.append(os.path.join(os.getcwd(), os.pardir, os.pardir))
from uniflow.client import Client
from uniflow.flow.constants import (OUTPUT_NAME, INPUT_FILE, QAPAIR_DF_KEY, OUTPUT_FILE)



### Prepare the input data

Uncomment one of the HTML files below to use it as the sample input for the self-instruct flow.

In [2]:
#html_file = "do_things_that_dont_scale.html" #from http://paulgraham.com/ds.html
#html_file = "makers_schedule_managers_schedule.html" #from http://www.paulgraham.com/makersschedule.html
#html_file = "life_is_short.html" #http://www.paulgraham.com/vb.html
html_file = "22.11_information-theory.html"

Set current directory and input data directory.

In [3]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)
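
Because the flow below takes several minutes, it can help to confirm that the input file exists before starting it. The following is a minimal sketch, assuming the `data/raw_input` layout used in the cell above; `resolve_input_file` is a hypothetical helper written for this example and is not part of uniflow.

```python
from pathlib import Path


def resolve_input_file(base_dir, file_name):
    """Return the absolute path of an input file, or raise if it is missing.

    Hypothetical helper (not part of uniflow). Assumes inputs live under
    <base_dir>/data/raw_input/, as in the cell above.
    """
    path = Path(base_dir) / "data" / "raw_input" / file_name
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    return str(path.resolve())
```

In this notebook it could be called as `input_file = resolve_input_file(os.getcwd(), html_file)`, so a misspelled file name fails immediately instead of partway through the flow.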

### Run the Self Instructed Gen Flow

Note: this cell takes about 5 minutes to run on a 4-GPU machine, and considerably longer on a single-GPU machine.

In [4]:

# Initialize the flow
client = Client("SelfInstructedGenFlow")

# Run flow
input_dict = {INPUT_FILE: input_file}
input_list = [input_dict]
output_list = client.run(input_list)
output_dict = output_list[0]

print(f"output_dict keys: {output_dict.keys()}")

INFO [preprocess_html_op]: Starting Preprocess HTML...
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
INFO [preprocess_html_op]: Preprocess HTML Complete!
INFO [si_model_inf_op]: Initializing SIModelInfOp...
INFO [si_model_inf_op]: 1. Initializing model...
Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.26s/it]
INFO [si_model_inf_op]: 2. Initializing pipeline...
INFO [si_model_inf_op]: 3. Creating LangChain LLMChain...
INFO [si_model_inf_op]: SIModelInfOp initialization Complete!
INFO [si_model_inf_op]: Starting SIModelInfOp transform...
INFO [si_model_inf_op]: Processing page 1 of 13...
INFO [si_model_inf_op]: === processed page 1 | total questions generated: 1 ===
INFO [si_model_inf_op]: Processing page 2 of 13...
INFO [si_model_inf_op]:

output_dict keys: dict_keys(['output', 'root'])


### Print out the results

In [5]:
# number of output nodes
len(output_dict[OUTPUT_NAME])

1

In [6]:
# output dictionary keys
output_dict[OUTPUT_NAME][0].keys()

dict_keys(['QApair_df', 'output_file'])

In [7]:
# output file path
output_dict[OUTPUT_NAME][0][OUTPUT_FILE]

'/home/ubuntu/uniflow/example/self_instructed_ft/data/output/output_self_instructed_data.csv'
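
The saved CSV can also be read back outside the notebook, for example in a downstream script that does not depend on pandas. The sketch below uses only the standard library and assumes the file has `Question` and `Answer` columns, as in the table displayed later in this notebook; `load_qa_pairs` is a hypothetical name, not a uniflow function.

```python
import csv


def load_qa_pairs(csv_path):
    """Read Question/Answer rows from a CSV file into a list of dicts.

    Assumes the columns are named "Question" and "Answer"; any other
    columns (such as an unnamed index column) are ignored.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            {"Question": row["Question"], "Answer": row["Answer"]}
            for row in csv.DictReader(f)
        ]
```

Opening the file with `newline=""` lets the `csv` module handle answers that contain embedded line breaks inside quoted fields.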

In [9]:
# Set this option to None to display full contents of each column
pd.set_option('display.max_colwidth', None)

# Show up to the first 50 generated question-answer pairs.
output_dict[OUTPUT_NAME][0][QAPAIR_DF_KEY][:50]

Unnamed: 0,Question,Answer
0,What is linear regression?[Page 0],"Linear regression is a statistical method used to model the relationship between two variables, where the dependent variable is predicted based on the independent variable using a straight line. It is commonly used for predictive modeling tasks such as forecasting house prices or stock market trends. In this book, we will discuss how to implement linear regression from scratch and use it for various applications."
1,What are some common techniques used in pretraining natural language processing models?[Page 1],"Some common techniques used in pretraining natural language processing models include word embedding (such as word2vec), approximate training, subword embedding, and fine-tuning pretrained models like BERT.\n ==End==\n\n 16. Natural Language Processing: Applications"
2,How can sentiment analysis be performed using neural networks?[Page 1],"Sentiment analysis can be performed using both recurrent neural networks (RNNs) and convolutional neural networks (CNNs). RNNs are commonly used for sequence-level tasks such as text classification, while CNNs are better suited for token-level tasks such as sentiment analysis.\n ==End==\n\n 17. Reinforcement Learning"
3,What is the difference between value iteration and Q-learning in reinforcement learning?[Page 1],"Value iteration and Q-learning are two popular methods for solving Markov decision processes (MDPs) in reinforcement learning. Value iteration computes the expected future reward by iteratively improving an estimate of the value function, while Q-learning updates the action-value function directly based on observed rewards and next states.\n ==End==\n\n 18. Gaussian Processes"
4,What is the purpose of Gaussian process priors in Gaussian process regression?[Page 1],"Gaussian process priors are used to model prior knowledge or beliefs about the underlying function being estimated in Gaussian process regression. They help to incorporate domain expertise or previous experience into the modeling process, which can improve the accuracy and efficiency of the model.\n ==End==\n\n 19. Hyperparameter Optimization"
5,What is the role of hyperparameters in deep learning models?[Page 1],"Hyperparameters are parameters that control the behavior of deep learning models but are not learned during training. These include learning rate, batch size, number of layers, and activation functions, among others. Properly tuning these hyperparameters can significantly impact the performance of a deep learning model.\n ==End==\n\n 20. Generative Adversarial Networks"
6,What is the main goal of generative adversarial networks (GANs)?[Page 1],"The main goal of GANs is to generate new data samples that are similar to a given dataset. This is achieved through a two-player game where a generator network creates fake samples and a discriminator network tries to distinguish them from real samples. By alternating between generating and discriminating, the generator learns to produce more realistic samples over time.\n ==End==\n\n 21. Recommender Systems"
7,What is matrix factorization in recommender systems?[Page 1],"Matrix factorization is a technique used in recommender systems to decompose user-item interaction matrices into lower-dimensional latent factors. This allows for more efficient storage and retrieval of recommendations, as well as improved prediction accuracy. Common types of matrix factorization include singular value decomposition (SVD) and alternative least squares (ALS).\n ==End==\n\n 22. Appendix: Mathematics for Deep Learning"
8,What is the chain rule in calculus?[Page 1],"The chain rule is a fundamental concept in calculus that relates differentiation of composite functions. It allows us to compute the derivative of a composition of two functions by applying the derivative rule to each component separately. Specifically, if f(x) = g(h(x)), then f'(x) = g"
9,What is the formula for calculating self-information?[Page 3],"The formula for calculating self-information is given by I(X) = - log\_2(p), where X is the event being observed and p is its probability.\n \n ===End==="
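
The raw output above carries two kinds of residue: each question ends with a `[Page N]` tag, and most answers end with an `==End==` marker followed by leftover text such as the next chapter title. A minimal cleanup sketch follows. The regular expressions are inferred from the sample rows shown here and are an assumption about the generated text, not a documented uniflow format; both function names are hypothetical.

```python
import re

# Patterns inferred from the sample output above (assumptions, not a
# documented format): a trailing "[Page N]" tag on questions, and an
# "==End==" or "===End===" marker on answers.
PAGE_TAG = re.compile(r"\[Page (\d+)\]\s*$")
END_MARKER = re.compile(r"\s*={2,}End={2,}.*", re.DOTALL)


def split_page_tag(question):
    """Split 'What is X?[Page 3]' into ('What is X?', 3).

    Returns (question, None) when no page tag is present.
    """
    match = PAGE_TAG.search(question)
    if match is None:
        return question.strip(), None
    return question[:match.start()].strip(), int(match.group(1))


def strip_end_marker(answer):
    """Drop the end marker and any text that follows it."""
    return END_MARKER.sub("", answer).strip()
```

On the DataFrame these could be applied with `df["Answer"].map(strip_end_marker)`. Answers that were cut off before any marker appeared, such as row 8, pass through unchanged, so truncated answers still need a separate check.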
