# Creating a synthetic Q&A dataset


In this notebook, we use the 'davinci-instruct-beta-v3' model, which is designed to follow instructions. We first use it to generate questions from a provided context, and then use it again to answer those questions from the same context.

## Installing Dependencies

In [1]:
!pip install openai
!pip install tiktoken

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.1
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)

In [2]:
import openai
import pandas as pd

In [3]:
openai.api_key = '<API Key>'

## Read the data

In [4]:
df = pd.read_csv("/content/drive/MyDrive/DAMG7245/pdf_content_openai.csv")

In [5]:
df.head()

Unnamed: 0,num_tokens,content
0,993,OMB APPROVAL OMB Number: 3235-0554 Expires: F...
1,997,This collection of information has been review...
2,986,The exchange consents that service of any civi...
3,900,Exhibit D Describe the manner of operation of ...


## Create questions based on the context

Use the davinci-instruct model to create a set of plausible questions about the information found in the SEC form. We set the temperature to 0 here, but you can experiment with a higher temperature to generate a wider range of questions.

The temperature parameter influences the randomness and creativity of the generated text. A higher temperature value results in more diverse and creative outputs, while a lower value, such as 0, makes the output more focused and deterministic.

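To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a handful of made-up token scores. It illustrates the general technique only; the scores and the function name `softmax_with_temperature` are assumptions for illustration, and the notebook does not show how the API samples internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # A temperature of 0 is usually treated as "always pick the top score",
        # which this formula cannot express directly.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # nearly all mass on the top token
high = softmax_with_temperature(logits, 2.0)  # mass spread across all tokens

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With a low temperature the top-scoring token dominates, which is why low settings give repeatable questions; with a high temperature the alternatives become much more likely to be sampled.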
In [6]:
# Define a function to generate questions based on a given context.
def get_questions(context):
    try:
        # Create a completion request to the OpenAI engine.
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            # Use the context as the basis for generating questions.
            prompt=f"Write questions based on the text below. The text has been parsed from a SEC public form\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,  # Control randomness (0 means deterministic output).
            max_tokens=200,  # Limit the length of the response.
            top_p=1,  # Nucleus sampling threshold (1 keeps the full token distribution).
            frequency_penalty=0,  # Adjust the frequency penalty.
            presence_penalty=0,  # Adjust the presence penalty.
            stop=["\n\n"]  # Specify when to stop generating text (at double line breaks).
        )

        # Extract and return the generated questions from the response.
        return response['choices'][0]['text']
    except Exception as e:
        # Handle exceptions and print an error message.
        print(e)
        return ""

In [7]:
# Apply the get_questions function to the 'content' column of the DataFrame and store the results in a new 'questions' column
df['questions'] = df.content.apply(get_questions)

# Prepend a "1." to each generated question
df['questions'] = "1." + df.questions

# Print the questions generated for the first row (for demonstration purposes)
print(df[['questions']].values[0][0])

1. What is the purpose of Form 1-N?
2. What is the contact employee for the Security Futures Product Exchange?
3. What is the format of Form 1-N?
4. How many copies of Form 1-N are required to be filed?
5. What is the Paperwork Reduction Act Disclosure for Form 1-N?
6. What is the estimated burden hours for Form 1-N?


The prompt is passed through the 'get_questions' function to generate questions for each content row in the dataframe. A few example questions are shown above.

## Create answers based on the context

Use the davinci-instruct model to answer the questions built above, passing in the same SEC form text as context.

In [8]:
# Define a function named 'get_answers' that takes a 'row' as input
def get_answers(row):
    try:
        # Create a response using the OpenAI Completion API
        response = openai.Completion.create(
            engine="davinci-instruct-beta-v3",
            # Construct a prompt with the provided 'content', 'questions', and placeholders for answers
            prompt=f"Write answer based on the text below\n\nText: {row.content}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=200,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        # Extract and return the text content from the response
        return response['choices'][0]['text']
    except Exception as e:
        # If an exception occurs, print the error message and return an empty string
        print(e)
        return ""

In [13]:
# Apply the 'get_answers' function to each row in the DataFrame 'df' using 'apply' along the rows (axis=1)
df['answers'] = df.apply(get_answers, axis=1)

# Add a prefix "1." to each answer in the 'answers' column
df['answers'] = "1." + df['answers']

# Drop rows with missing (NaN) values, reset the index, and drop the previous index column
df = df.dropna().reset_index().drop('index', axis=1)

# Print the 'answers' from the first row of the DataFrame
print(df[['answers']].values[0][0])

1. Form 1-N is the form for notice of registration as a national securities exchange for the sole purpose of trading security futures products pursuant to Section 6(g) of the Securities Exchange Act of 1934.
2. The contact employee for the Security Futures Product Exchange is the individual listed on the Execution Page (Page 1) of Form 1-N as the contact employee.
3. Form 1-N is typed and must include an Execution Page (Page 1) with original manual signatures.
4. An exchange filing Form 1-N must submit one original and two copies of Form 1-N to the Securities and Exchange Commission, Division of Market Regulation, Office of Market Supervision, 450 Fifth Street, NW, Washington, DC 20549.
5. Security Futures Product Exchanges are required to update certain information filed on Form 1-N on a periodic basis.
6. It is estimated that an exchange will spend approximately 31 hours completing the initial application on Form 1-N


Above are the answers to the questions, generated by the same model from the SEC form context. The answers are sensible given the context passed in, so the model works fairly well for this task.
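Because questions and answers are each stored as one numbered block of text, it can help to line them up by number before using them downstream. The sketch below is one possible way to do that; the helper names `numbered_items` and `pair_questions_and_answers` are hypothetical, the sample strings are shortened for illustration, and the parsing assumes each item sits on a single line.

```python
import re

def numbered_items(text):
    """Map item number -> item text for a '1. ...' style list (one item per line)."""
    matches = re.findall(r"^\s*(\d+)\.\s*(.+)$", text, flags=re.MULTILINE)
    return {int(number): body.strip() for number, body in matches}

def pair_questions_and_answers(questions, answers):
    """Pair items that share a number; unanswered questions are skipped."""
    q_items = numbered_items(questions)
    a_items = numbered_items(answers)
    return [(q_items[n], a_items[n]) for n in sorted(q_items) if n in a_items]

# Illustrative strings in the same shape as the 'questions' and 'answers' columns.
questions = (
    "1. What is the purpose of Form 1-N?\n"
    "2. What is the format of Form 1-N?\n"
    "3. How many copies of Form 1-N are required to be filed?"
)
# Only two answers here, as could happen if generation stopped at max_tokens.
answers = (
    "1. Form 1-N is the form for notice of registration.\n"
    "2. Form 1-N is typed."
)

pairs = pair_questions_and_answers(questions, answers)
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}\n")
```

Matching on the item number rather than on position means a truncated answer block drops only the unanswered questions instead of shifting every pair.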

In [14]:
df.head()

Unnamed: 0,num_tokens,content,questions,answers
0,993,OMB APPROVAL OMB Number: 3235-0554 Expires: F...,1. What is the purpose of Form 1-N?\n2. What i...,1. Form 1-N is the form for notice of registra...
1,997,This collection of information has been review...,1. What is the name of the Security Futures Pr...,1. The Security Futures Product Exchange is th...
2,986,The exchange consents that service of any civi...,1. What is the name of the form being filed?\n...,1. The name of the form being filed is Form 1-...
3,900,Exhibit D Describe the manner of operation of ...,1. What is the means of access to the System?\...,1. The means of access to the System is throug...


## Save the Q&A dataset

We store this dataframe as a CSV file to use in the next notebook.

In [15]:
df.to_csv("/content/drive/MyDrive/DAMG7245/pdf_content_openai_qa.csv", index=False)

## Search Content

We will now create a search function that serves the following purposes:

1. Accepts a user query and a DataFrame containing text and embedding columns.
2. Utilizes the OpenAI API to embed the user query.
3. Ranks the text entries based on the distance between the query embedding and text embeddings.
4. Returns two lists:
   - The most relevant N texts, ordered by their relevance.
   - The respective relevance scores for each of these texts.
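The ranking step can be tried without any API call. The sketch below uses hand-written three-dimensional vectors in place of real embeddings and a plain-Python cosine similarity in place of `scipy`; the function names and the toy corpus are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_relatedness(query_vec, texts_and_vecs, top_n=2):
    """Return the top_n texts and their scores, most related first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in texts_and_vecs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_n]
    return [text for text, _ in top], [score for _, score in top]

# Toy "embeddings"; real ones from the API have far more dimensions.
corpus = [
    ("fees", [0.2, 1.0, 0.0]),
    ("trading", [1.0, 0.1, 0.0]),
    ("signatures", [0.0, 0.0, 1.0]),
]
query_vec = [1.0, 0.0, 0.0]

texts, scores = rank_by_relatedness(query_vec, corpus, top_n=2)
print(texts)
print([round(s, 3) for s in scores])
```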

In [16]:
# calculate embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023

# Define a function to generate embeddings for a single text
def generate_embeddings(text):
    # Create embeddings for the given 'text' using the specified EMBEDDING_MODEL
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=text)

    # Extract the embeddings from the API response and join them as a comma-separated string
    return ', '.join(map(str, response['data'][0]['embedding']))

# Apply the 'generate_embeddings' function to each row in the 'content' column of the DataFrame 'df'
df['embeddings'] = df['content'].apply(generate_embeddings)

In [17]:
df.head()

Unnamed: 0,num_tokens,content,questions,answers,embeddings
0,993,OMB APPROVAL OMB Number: 3235-0554 Expires: F...,1. What is the purpose of Form 1-N?\n2. What i...,1. Form 1-N is the form for notice of registra...,"-0.023155031725764275, -0.012439599260687828, ..."
1,997,This collection of information has been review...,1. What is the name of the Security Futures Pr...,1. The Security Futures Product Exchange is th...,"-0.01856633462011814, 0.008737890981137753, -0..."
2,986,The exchange consents that service of any civi...,1. What is the name of the form being filed?\n...,1. The name of the form being filed is Form 1-...,"-0.021929476410150528, -0.005438805092126131, ..."
3,900,Exhibit D Describe the manner of operation of ...,1. What is the means of access to the System?\...,1. The means of access to the System is throug...,"-0.01889314502477646, 0.005769156385213137, -0..."


Above are the embeddings for each content row in the dataframe.
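Since each embedding is stored as a comma-separated string, it has to be parsed back into numbers before any distance can be computed. The short sketch below shows that round trip on a made-up three-value vector, alongside a JSON-based alternative; the JSON variant is a suggestion, not something the notebook does.

```python
import ast
import json

# A made-up, very short "embedding" for illustration.
embedding = [-0.0231, -0.0124, 0.0087]

# Same storage format as generate_embeddings: values joined by ", ".
as_text = ", ".join(map(str, embedding))

# literal_eval reads the bare comma-separated values as a tuple of floats.
parsed = ast.literal_eval(as_text)
print(type(parsed).__name__, list(parsed) == embedding)

# Alternative: JSON keeps the brackets, so the value comes back as a list.
as_json = json.dumps(embedding)
restored = json.loads(as_json)
print(type(restored).__name__, restored == embedding)
```

Either form works with the cosine distance used later, since it accepts any sequence of floats.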

In [18]:
import ast
# convert embeddings from comma-separated str back to a sequence of floats (literal_eval returns a tuple)
df['embeddings'] = df['embeddings'].apply(ast.literal_eval)

In [19]:
from scipy import spatial
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["content"], relatedness_fn(query_embedding, row["embeddings"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [20]:
strings, relatednesses = strings_ranked_by_relatedness("trading of security futures products", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.847


'Exhibit D Describe the manner of operation of the System involving trading of security futures products.  This description should include the following:  1. The means of access to the System.  2. Procedures governing entry and display of quotations and orders in the System.  3. Procedures governing the execution, reporting, clearance, and settlement of transactions in connection with the System. 4. Proposed fees.  5. Procedures for ensuring compliance with System usage guidelines.   Form 1-N Page 3U.S. SECURITIES AND EXCHANGE COMMISSION WASHINGTON, D.C.  20549 FORM AND AMENDMENTS FOR NOTICE OF REGISTRATION AS A NATIONAL SECURITIES EXCHANGE FOR THE SOLE PURPOSE OF TRADING SECURITY FUTURES PRODUCTS PURSUANT TO SECTION 6(g) OF THE EXCHANGE ACTOFFICIAL OFFICIAL USE USE ONLY 6. The hours of operation of the System, and the date on which the exchange intends to commence operation of the System. 7. Attach a copy of the users’ manual. Exhibit E A list of the officers, governors, or persons pe

relatedness=0.820




relatedness=0.819


'OMB APPROVAL OMB Number: 3235-0554 Expires:  February 2 8, 2023 Estimated average burden hours per response. . . . . . . 31 SEC 2568 (11-02)UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 Form 1-N FORM AND AMENDMENTS FOR NOTICE OF REGISTRATION AS A NATIONAL SECURITIES EXCHANGE FOR THE SOLE PURPOSE OF TRADING SECURITY FUTURES PRODUCTS PURSUANT TO SECTION 6(g) OF THE EXCHANGE ACT Persons who pot entially are to respond to the collection of information contained in this form are not required to respond unless the form displays a currently valid OMB control number.FORM 1-N INSTRUCTIONS A.  GENERAL INSTRUCTIONS  1. Form 1-N is the form for notice of registration as a national securities exchange for the sole purpose of trading security fu tures products (“Security Futures Product Exchange”) pursuant to Section 6(g) of the Securities Exchange Act of 1934 (“Exchange Act”) . 2.UPDATING  - A Security Futures Product Exchange must file amendments to Form 1-N in accordanc

relatedness=0.811


'The exchange consents that service of any civil action brought by or notice of any proceeding before the Securities and Exchange Commission in connection with the exchange’s activities may be given by registered or certified mail or confirmed telegram to the exchange’s contact employee at the main address, or mailing addressif different, given in Items 2 and 3.  The undersigned, being first duly sworn, deposes and says that he/she has executed this form on behalf of, and with the authorityof, said exchange.  The undersigned and the exchange represent that the information and statements contained herein, including exhibits, schedules, or other documents attached hereto, and other information filed herewith, all of which are made a part hereof, are current, true, and complete. Date: (MM/DD/YY) (Name of Exchange) By: (Signature) (Printed Name and Title) Subscribed and sworn before me this day of by (Month) This page must always be completed in full with original, manual signature and not

## Answer questions based on the context provided

Here, we establish an "ask" function that:

1. Accepts a user query.
2. Searches for text that is pertinent to the query.
3. Inserts the retrieved text into a message intended for GPT.
4. Sends the message to GPT.
5. Returns the response given by GPT.
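The message-building step has to respect a token budget: ranked texts are appended until the next one would push the message over the limit. Here is a minimal sketch of that loop. It assumes a stand-in `count_tokens` that counts whitespace-separated words rather than real model tokens, and `build_message` is a hypothetical name used only for this illustration.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())

def build_message(introduction, ranked_texts, question, token_budget):
    """Append ranked texts to the introduction until the budget would be exceeded."""
    message = introduction
    for text in ranked_texts:
        candidate = f'\n\nExcerpt:\n"""\n{text}\n"""'
        # Stop before the first text that does not fit alongside the question.
        if count_tokens(message + candidate + question) > token_budget:
            break
        message += candidate
    return message + question

# Texts are assumed to be ordered from most to least related.
ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]

msg = build_message(
    introduction="Answer from the excerpts.",
    ranked_texts=ranked,
    question="\n\nQuestion: What is alpha?",
    token_budget=20,
)
print(msg)
```

Because the texts arrive sorted by relatedness, stopping at the first one that does not fit keeps the most relevant material and leaves room in the context window for the model's reply.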

In [21]:
# Import the 'tiktoken' library
import tiktoken

# Define the chat model to be used
GPT_MODEL = "gpt-3.5-turbo"

# Function to count the number of tokens in a given text
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Function to create a message for GPT with relevant source texts
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    # Rank source texts by relevance to the query
    strings, relatednesses = strings_ranked_by_relatedness(query, df)

    # Introduction for the message
    introduction = 'Use the below excerpts from an SEC form to answer the subsequent question. If the answer cannot be found in the excerpts, write "I could not find an answer."'

    # Question to be asked
    question = f"\n\nQuestion: {query}"

    # Initialize the message with the introduction
    message = introduction

    # Iterate through the ranked source texts
    for string in strings:
        next_article = f'\n\nSEC form excerpt:\n"""\n{string}\n"""'
        # Check if adding the next article to the message exceeds the token budget
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article

    # Combine the message with the question
    return message + question

# Function to interact with GPT, answer a query, and return the response
def ask(
    query: str,
    df: pd.DataFrame = df,  # Assuming 'df' is a global DataFrame
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    # Create a message for GPT
    message = query_message(query, df, model=model, token_budget=token_budget)

    # Print the message if requested
    if print_message:
        print(message)

    # Define a conversation for GPT
    messages = [
        {"role": "system", "content": "You answer questions about SEC forms."},
        {"role": "user", "content": message},
    ]

    # Generate a response from GPT
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )

    # Extract and return the response message
    response_message = response["choices"][0]["message"]["content"]
    return response_message

In [22]:
ask('What is this form for?')

'This form is for the notice of registration as a national securities exchange for the sole purpose of trading security futures products pursuant to Section 6(g) of the Exchange Act.'

As observed above, the model is able to find suitable text in the context and answer the relevant questions asked.