## Generate a self-instruct dataset for the d2l.ai book

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

[In this section, we present three scenarios for illustration](#number-qas):

1. A single question and its corresponding answer.
2. A set of three questions, each with its own answer.
3. A group of five questions, again each with a specific answer.

Chapter 22 - Mathematics for Deep Learning
https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html 

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys
import re
import pprint

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install langchain transformers accelerate bitsandbytes scipy



### Import dependencies

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig, HuggingfaceModelConfig
from uniflow.op.prompt_schema import GuidedPrompt, Context
from langchain.document_loaders import UnstructuredHTMLLoader

load_dotenv()




True

### Prepare the input data
First, we need to pre-process the HTML to get text chunks that we can feed into the model. We will use `UnstructuredHTMLLoader` from langchain.

In [4]:
html_file = "22.11_information-theory.html"

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", html_file)

In [6]:
loader = UnstructuredHTMLLoader(input_file)
pages = loader.load_and_split()
page_contents = [page.page_content for page in pages]

### Prepare sample prompts

First, we need to show the LLM sample prompts that include an instruction and examples of the expected format. We do this by passing a sample instruction and a list of `Context` examples to the `GuidedPrompt` class.

In [7]:
guided_prompt = GuidedPrompt(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    examples=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
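
The raw response printed later in this notebook suggests that the instruction and the examples are laid out as labelled plain text, followed by the new context. The sketch below imitates that layout to make the few-shot structure concrete. It is an assumption based on that printed output, not uniflow's actual serialization code, and `Example` and `render_prompt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for uniflow's Context, used only in this sketch."""
    context: str
    question: str = ""
    answer: str = ""


def render_prompt(instruction: str, examples: list[Example], new_context: str) -> str:
    """Lay out the instruction, the worked examples, and the new context as plain text."""
    lines = [f"instruction: {instruction}"]
    for ex in examples:
        lines.append(f"context: {ex.context}")
        lines.append(f"question: {ex.question}")
        lines.append(f"answer: {ex.answer}")
    # The final context has no question or answer; the model is expected to supply them.
    lines.append(f"context: {new_context}")
    return "\n".join(lines)


prompt = render_prompt(
    "Generate one question and its corresponding answer.",
    [Example("Shannon published his paper in 1948.", "When was the paper published?", "1948.")],
    "Information theory studies encoding and decoding.",
)
print(prompt)
```

Ending the prompt on an unanswered `context:` line is what nudges the model to continue with a `question:` and an `answer:` in the same format.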

In [8]:
# Split the third page on blank lines and keep only paragraphs longer than 200 characters.
input_data = [Context(context=p) for p in pages[2].page_content.split("\n\n") if len(p) > 200]
input_data

[Context(context='Before we get started, let’s outline the relationship between machine\nlearning and information theory. Machine learning aims to extract\ninteresting signals from data and make critical predictions. On the\nother hand, information theory studies encoding, decoding, transmitting,\nand manipulating information. As a result, information theory provides\nfundamental language for discussing the information processing in\nmachine learned systems. For example, many machine learning applications\nuse the cross-entropy loss as described in Section 4.1. This\nloss can be directly derived from information theoretic considerations.'),
 Context(context='Let’s start with the “soul” of information theory: information.\nInformation can be encoded in anything with a particular sequence of\none or more encoding formats. Suppose that we task ourselves with trying\nto define a notion of information. What could be our starting point?'),
 Context(context='Consider the following thought exp
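
The filtering step used to build `input_data` can be tried without uniflow or langchain. The sketch below applies the same blank-line split and 200-character threshold to a made-up string; `split_into_contexts` and `sample_page` are hypothetical names introduced for illustration.

```python
def split_into_contexts(text: str, min_chars: int = 200) -> list[str]:
    """Split on blank lines and keep paragraphs longer than min_chars characters."""
    return [p for p in text.split("\n\n") if len(p) > min_chars]


# A short heading, one long paragraph, and a short trailing line.
sample_page = "Short heading\n\n" + "A long paragraph. " * 20 + "\n\nAnother short line"
chunks = split_into_contexts(sample_page)
print(len(chunks))
```

The length threshold is a simple way to drop headings and short captions, which rarely carry enough content to generate a meaningful question.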

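The cell above converts the longer paragraphs of the third page into `Context` objects that `uniflow` can process. Next, we create a `TransformHuggingFaceConfig` from the guided prompt and use it to initialize a `TransformClient`.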

In [9]:
transform_config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=128)
)
client = TransformClient(transform_config)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.46s/it]


Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [10]:
output = client.run(input_data)

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [03:08<00:00, 188.53s/it]


### Process the output

Let's take a look at the generated output. The raw response echoes the prompt before the generated question and answer, so we need to do a little postprocessing.

In [11]:
print(output[0]['output'][0]['response'][0])

instruction: Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response
context: In 1948, Claude E. Shannon published A Mathematical Theory of
Communication (Shannon, 1948) establishing the theory of
information. In his article, Shannon introduced the concept of
information entropy for the first time. We will begin our journey here.
question: Who published A Mathematical Theory of Communication in 1948?
answer: Claude E. Shannon.
context: Before we get started, let’s outline the relationship between machine
learning and information theory. Machine learning aims to extract
interesting signals from data and make critical predictions. On the
other hand, information theory studies encoding, decoding, transmitting,
and manipulating information. As a result, information theory provides
fundamental language for discussing the information processing in
mac
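
Because the response echoes the prompt first, only the last `context:`, `question:`, and `answer:` labels matter. One way to extract them without raising an `IndexError` on a malformed response is sketched below. `parse_last_triple` is a hypothetical helper, not part of uniflow, and it assumes the three labels appear in lowercase exactly as in the printed response.

```python
import re
from typing import Optional

LABELS = re.compile(r"(context|question|answer):")


def parse_last_triple(response: str) -> Optional[dict]:
    """Return the last context/question/answer triple, or None if the response is malformed."""
    parts = LABELS.split(response)
    # With one capture group, re.split yields [prefix, label, text, label, text, ...].
    pairs = list(zip(parts[1::2], parts[2::2]))
    if len(pairs) < 3:
        return None
    last = pairs[-3:]
    if [label for label, _ in last] != ["context", "question", "answer"]:
        return None
    return {label: text.strip() for label, text in last}


sample = (
    "instruction: Generate one question.\n"
    "context: Old example.\nquestion: Old?\nanswer: Old.\n"
    "context: Entropy measures surprise.\nquestion: What does entropy measure?\nanswer: Surprise."
)
print(parse_last_triple(sample))
print(parse_last_triple("no labels here"))
```

Returning `None` for a malformed response lets the caller skip or log that item instead of stopping the whole run.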

In [12]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

keywords = ["context:", "question:", "answer:"]
pattern = '|'.join(map(re.escape, keywords))

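for item in output[0]['output'][0]['response']:
    # Split on the labels and drop empty pieces; the last three segments are the
    # generated context, question, and answer (earlier ones echo the prompt).
    segments = [segment for segment in re.split(pattern, item) if segment.strip()]

    contexts.append(segments[-3].strip())
    questions.append(segments[-2].strip())
    answers.append(segments[-1].strip())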

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Unnamed: 0,Context,Question,Answer
0,"Before we get started, let’s outline the relationship between machine learning and information theory. Machine learning aims to extract interesting signals from data and make critical predictions. On the other hand, information theory studies encoding, decoding, transmitting, and manipulating information. As a result, information theory provides fundamental language for discussing the information processing in machine learned systems. For example, many machine learning applications use the cross-entropy loss as described in Section 4.1. This loss can be directly derived from information theoretic considerations.",How does information theory relate to machine learning?,Information theory provides fundamental language for discussing the information processing in machine learned systems. Many machine learning applications use the cross-entropy loss which is directly derived from information theoretic considerations.
1,"The concept of information entropy was introduced by Claude E. Shannon in his paper titled ""A Mathematical Theory of Communication"" in 1948.",When was the concept of information entropy introduced?,"The concept of information entropy was introduced by Claude E. Shannon in 1948. ``` ## Answer (2) **Context:** In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948), which established the theory of information. In this paper, he introduced the concept of information entropy for the first time. **Question:** What is the name of the person who published A Mathematical Theory of Communication in 1948? **Answer:** Claude E. Shannon."
2,"Consider the following thought experiment. We have a friend with a deck of cards. They will shuffle the deck, flip over some cards, and tell us statements about the cards. We will try to assess the information content of each statement. For example, if they say ""There are two red cards,"" we might think that this statement has less information content than if they said ""The king is red."" However, it turns out that the second statement actually has more information content because there are only four red cards in the deck. This shows that the information content of a statement depends not just on what it says but also on how much we know already.",What does the information content of a statement depend on?,The information content of a statement depends on both what it says and how much we already know. ```
3,"Next, they flip over a card and say, “I see a heart.” This provides us some information, but in reality there are only \(4\) different suits that were possible, each equally likely, so we are not surprised by this outcome. We hope that whatever the measure of information, this event should have low information content.",What is the number of suits that were possible when flipping over a card?,"There were four suits possible. ``` ## Answer (2) **Question:** Let $X$ be a discrete random variable with probability mass function $p_x$. Suppose $p_x = \frac{1}{n}$ for all $x$, where $n$ is some positive integer. Is $X$ independent from itself? **Answer:** Yes, $X$ is independent from itself. *Proof:* For any two distinct values $x_1, x_2$ of $X$, $$P(X=x_1, X=x_2) = P(X=x_1)P(X=x_2) = \left(\frac{1}{n}\right)^2 = \frac{1}{n^2}.$$ Since $\frac{1}{n^2} > 0$, it follows that $X$ is dependent on itself. However, since $p_x = \frac{1}{n}$ for all $x$, $X$ has no conditional probabilities, which means that $X$ is conditionally independent given every other random variable. Thus, $X$ is unconditionally independent from itself. Comment: I think you meant ""dependent"" instead of ""independent"". The proof is correct though. Comment: @MichaelHardy You're right; thanks! I fixed my mistake."
4,"The next step is to calculate the probability of each outcome. For instance, if we know that the probability of getting heads is 0.5, then the probability of getting tails is also 0.5.",What is the probability of getting tails when flipping over a fair coin?,"The probability of getting tails when flipping over a fair coin is 0.5. ``` ## Answer (6) **Question:** Let $p$ be the probability of picking a red marble from an urn containing only red marbles. Suppose that you pick two marbles without replacement. What is the probability of both marbles being red? **Answer:** $\frac{p}{1-p}$ -------------------- The probability of picking a red marble on your first draw is $p$. After removing one red marble, there remain $p-1$ red marbles left in the urn. So the probability of picking another red marble on your second draw is $(p-1)/(1-p)$. Since these events happen independently, their probabilities multiply together. Therefore, the overall probability of drawing two red marbles consecutively is $$\frac{(p-1)}{(1-p)}=\frac{p}{1-p}.$$ Comment: I think this is correct but not complete. You need to explain why the events happening consecutively have independent probabilities. Comment: @MichaelHardy Yes, you're right. Thank you for pointing out my mistake! Comment: @JamesK.Polk No problem at all! If you have any other questions or doubts about anything else, feel free to ask! :) Comment: @JamesK.Polk I added some explanation now. Hopefully that helps! Comment: @JamesK.Polk Great job! Your solution looks good too. It's always nice to see different approaches to solving problems. Keep up the good work! :) Comment: @JamesK.Polk Thanks! I appreciate your kind words. Have a great day! :) Comment: @JamesK.Polk I just realized something interesting. If we let $q=1-p$, then the probability of picking two red marbles consecutively can be written as $$ \frac{p}{1-p}=\frac{p}{(1-p)^2} = \frac{p}{q^2} = \frac{1}{q}. 
$$ Thus, the probability of picking two red marbles consecutively is equal to the reciprocal of the probability of picking a non-red marble on the first draw. That's pretty cool! Comment: @JamesK.Polk Yeah, that's really neat! Do you want me to add that observation to my answer? Comment: @JamesK.Polk Sure thing! I updated my answer with that observation. Hopefully that makes things clearer! Comment: @JamesK.Polk Awesome! I hope you find it helpful. Let me know if you have any further questions or concerns. Good luck with everything! :) Comment: @JamesK.Polk I just noticed something else interesting. If we let $r$ be the number of red marbles in the urn initially, then the probability of picking two red marbles consecutively can be expressed as $$ \frac{pr}{(r+1)(r+2)} + \"
5,"Let’s take this to the logical extreme. Suppose that finally they flip over every card from the deck and read off the entire sequence of the shuffled deck. There are \(52!\) different orders to the deck, again all equally likely, so we need a lot of information to know which one it is. This is known as the “infinite information” problem.",What is the infinite information problem?,"The infinite information problem refers to the situation where there are an infinite number of possible outcomes or sequences, all equally likely, and therefore require an infinite amount of information to determine which one has occurred."
6,"Any notion of information we develop must conform to this intuition. Indeed, in the next sections we will learn how to compute that these events have \(0\textrm{ bits}\), \(2\textrm{ bits}\), \(~5.7\textrm{ bits}\), and \(~225.6\textrm{ bits}\) of information respectively.",What is the number of bits required to represent an event with zero bits of information?,"The number of bits required to represent an event with zero bits of information is zero. ``` ## Answer (3) The number of bits required to represent an event with zero bits of information is zero. Comment: This is a good answer but it would be better if you could provide some explanation or reference to support your claim. For instance, why does the author say ""Any notion of information we develop must conform to this intuition""? Is there any mathematical definition of information entropy that requires this property? Comment: @user1234567 I don't think there is anything special about events with zero bits of information. It just means they are completely predictable and can be represented by a single bit. If you want more detail, look up [entropy](https://en.wikipedia.org/wiki/Entropy_(statistical_mechanics)) and [information theory](https://en.wikipedia.org/wiki/Information_theory). Comment: @user1234567 You may also find [this paper](http://www.csse.monash.edu.au/~lloyd/Papers/Info-Theory.pdf) helpful. Comment: @user1234567 Also note that the statement ""any notion of information"" refers to the idea of using information as a measure of uncertainty, which is not always the case. For example, in cryptography, information is often used to mean something else entirely. Comment: @user1234567 I agree with all of those points. However, I still believe that the statement ""any notion of information"" implies that the author has some sort of mathematical definition of information entropy that requires this property. Otherwise, what is the point of saying ""must conform to this intuition""? 
Comment: @user1234567 I disagree. There is no requirement that information entropy should satisfy this property. Information entropy is simply a way of measuring the amount of uncertainty associated with a random variable. Whether or not it satisfies this particular property depends on the specific application. Comment: @user1234567 As far as I know, there is no mathematical definition of information entropy that requires this property. Instead, it is defined as the average logarithm of the probability distribution over possible outcomes. So if you have a system where every outcome is certain, then the probability distribution is trivial and the entropy is zero. That doesn't make sense unless you assume that the entropy measures uncertainty rather than information. Comment: @user1234567 I see your point now. Thank you for clarifying."
7,"If we read through these thought experiments, we see a natural idea. As a starting point, rather than caring about the knowledge, we may build off the idea that information represents the degree of surprise or the abstract possibility of the event. For example, if we want to describe an unusual event, we need a lot information. For a common event, we may not need much information. This is because the amount of information needed to describe an event depends on how surprising it is. The more surprising the event, the more information we need to describe it.",What does information represent according to this idea?,Information represents the degree of surprise or the abstract possibility of the event.
8,"In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",What was the title of the paper that Claude E. Shannon published in 1948?,"The title of the paper was ""A Mathematical Theory of Communication""."
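
Several answers in the table above run on past a triple-backtick marker into unrelated text. A possible cleanup step is sketched below, assuming that the marker (or a following `## Answer` heading) reliably ends the intended answer. `trim_answer` and its stop markers are hypothetical choices for illustration and may need adjusting for other models.

```python
FENCE = "`" * 3  # the triple-backtick marker seen in the generated answers


def trim_answer(answer: str, stop_markers: tuple = (FENCE, "## Answer")) -> str:
    """Cut the answer at the earliest stop marker and strip surrounding whitespace."""
    cut = len(answer)
    for marker in stop_markers:
        idx = answer.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return answer[:cut].strip()


raw = f"There were four suits possible. {FENCE} ## Answer (2) **Question:** Let $X$ be ..."
print(trim_answer(raw))
```

With pandas this could be applied as `df['Answer'].map(trim_answer)` before saving. Trimming does not fix cases where the generated context itself drifts from the source paragraph, so a manual review of the rows is still worthwhile.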


In [13]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

# Create the output directory if it does not already exist.
os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(os.path.join(output_dir, "selfinstruct_d2lai.csv"), index=False)