# OpenAI Q/A Chatbot

## Abstract
The idea of this project is to create a question-answering model based on a few paragraphs of data extracted from the SEC website (sec.gov).

## Aim
In this notebook we aim to develop a Q/A chatbot in three steps:
1. Extracting data from SEC forms using pypdf. GPT-3 did not see this content during its pre-training. We downloaded the content and organized the dataset by individual sections, which serve as context for asking and answering questions.
2. Converting the extracted data into embeddings:
- Prerequisites: Import libraries, set API key
- Collect: We extract SEC data using pypdf and convert it into a CSV file
- Chunk: The CSV content is split into short, semi-self-contained sections to
  be embedded
- Embed: Each section is embedded (this notebook uses the Universal Sentence Encoder from TensorFlow Hub)
- Store: Embeddings are saved in a CSV file

3. Implementing a two-step Search-Ask method that enables GPT to answer questions using a library of reference text:

- Search: search your library of text for relevant text sections
- Ask: insert the retrieved text sections into a message to GPT and ask it
  the question
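
The Chunk step listed above can be sketched as a small word-window splitter. This is a minimal illustration rather than the notebook's own method: the function name `chunk_words` and the `max_words` and `overlap` parameters are assumptions, and word counts only approximate model tokens.

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

A small overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding a few words twice.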

In [52]:
# Installing dependencies
!pip install 'PyPDF2<3.0'
!pip install tiktoken
!pip install ghostscript
!pip install camelot-py[cv]
!pip install excalibur-py
!apt install ghostscript python3-tk
!pip install 'openai<1.0'  # openai.Completion / openai.ChatCompletion, used below, were removed in openai 1.0

from PyPDF2 import PdfMerger
from PyPDF2 import PdfReader

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import PyPDF2
import requests
import time
import openai

import ast  # for converting embeddings saved as strings back to arrays
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search



Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ghostscript is already the newest version (9.55.0~dfsg1-0ubuntu5.5).
python3-tk is already the newest version (3.10.8-1~22.04).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


# Extracting data from SEC FORMS using pypdf

This Python code downloads a PDF from sec.gov and uses the PyPDF2 library to extract its text. It then creates a Pandas DataFrame with one row describing the extracted text: the title, heading, content, and an approximate token count (the number of whitespace-separated words). Finally, it saves this DataFrame to a CSV file, which contains the entire content of the PDF along with its metadata.
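
As a complement to the single-row layout described above, the sketch below shows one way to keep a row per page, so that later steps have smaller units to embed. It uses only the standard library. The names `pages_to_rows` and `rows_to_csv`, the `Page N` heading format, and the sample page strings are illustrative assumptions. The `tokens` value is a whitespace word count, as in the notebook.

```python
import csv
import io


def pages_to_rows(page_texts: list[str], title: str) -> list[dict]:
    """Build one record per non-empty page, mirroring the notebook's columns."""
    rows = []
    for number, text in enumerate(page_texts, start=1):
        text = (text or "").strip()
        if not text:
            continue  # skip pages where no text was extracted
        rows.append({
            "title": title,
            "heading": f"Page {number}",
            "content": text,
            "tokens": len(text.split()),  # word count, an approximation of tokens
        })
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "heading", "content", "tokens"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page strings standing in for page.extract_text() results.
rows = pages_to_rows(["GENERAL INSTRUCTIONS", "", "Eligibility Requirements for Use"], "Form 1-A")
print(rows_to_csv(rows))
```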

In [53]:
#  Evaluating pypdf and CSV file generated with the full content of the PDF


# PDF URL
pdf_url = "https://www.sec.gov/files/form1-a.pdf"

def extract_pdf_text(pdf_url):
    """Download the PDF at pdf_url and return the text of all its pages."""
    response = requests.get(pdf_url, stream=True)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    with open("temp.pdf", "wb") as pdf_file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                pdf_file.write(chunk)

    text = ''
    # Re-open the downloaded file for reading; the context manager closes it for us
    with open("temp.pdf", "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text += page.extract_text()

    return text

# Extract text from the PDF link
pdf_text = extract_pdf_text(pdf_url)

# Create a DataFrame with the entire content
data = [
    {
        'title': "Form 1-K",  # label as recorded in the outputs below; the downloaded PDF is Form 1-A
        'heading': "Full Content",
        'content': pdf_text,
        'tokens': len(pdf_text.split())
    }
]

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv("output.csv", index=False)

print("CSV file generated with the full content of the PDF.")


CSV file generated with the full content of the PDF.


#  Converting Text content of pdf to Embeddings
- This block of code uses TensorFlow and TensorFlow Hub to convert the textual content extracted from a CSV file (containing PDF text) into embeddings.
- It loads the Universal Sentence Encoder model, reads the CSV data, and then uses the model to transform the text into numerical embeddings.
- The resulting embeddings are saved in a new DataFrame along with the original text.
- Finally, this DataFrame is saved to a new CSV file, creating a dataset that pairs the text with its corresponding embeddings, making it useful for various natural language processing tasks.
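
Saving embeddings to CSV turns each vector into a string, which has to be parsed back into a list on load. The round trip can be sketched with the standard library alone. The three-number vector here is made up for illustration; real Universal Sentence Encoder vectors have 512 dimensions.

```python
import ast
import csv
import io

# A made-up embedding standing in for a real model output.
records = [{"text": "Eligibility Requirements", "embedding": [0.25, -0.5, 0.125]}]

# Write: csv stores the list via str(), i.e. "[0.25, -0.5, 0.125]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerows(records)

# Read: every field comes back as a string, so parse the embedding explicitly.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
for row in loaded:
    row["embedding"] = ast.literal_eval(row["embedding"])

print(loaded[0]["embedding"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing stored vectors.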

In [54]:
# Converting Text content of pdf to Embeddings

# Load the Universal Sentence Encoder model
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)

# Load your CSV data
df = pd.read_csv("/content/output.csv")

# Convert the 'content' column to embeddings
embeddings = embed(df['content'])

# Create a new DataFrame with 'text' and 'embedding' columns
result_df = pd.DataFrame({'text': df['content'], 'embedding': embeddings.numpy().tolist()})

# Save the new DataFrame to a CSV file
result_df.to_csv("output_with_embeddings.csv", index=False)

print("CSV file generated with 'text' and 'embedding' columns.")





CSV file generated with 'text' and 'embedding' columns.


In [55]:
# Load the extracted text and add a combined 'context' column

df = pd.read_csv('/content/output.csv')
df['context'] = df.title + "\n" + df.heading + "\n\n" + df.content
df.head()

| | title | heading | content | tokens | context |
|---|---|---|---|---|---|
| 0 | Form 1-K | Full Content | Page 1UNITED STATES\nSECURITIES AND EXCHANGE C... | 14773 | Form 1-K\nFull Content\n\nPage 1UNITED STATES\... |


In [56]:
print(df.content.values[0])

Page 1UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 1-A
REGULATION A OFFERING STATEMENT
 UNDER THE SECURITIES ACT OF 1933
GENERAL INSTRUCTIONS
I. Eligibility Requirements for Use of Form 1-A.
Th is Form is to be used for securities off  erings made pursuant to Regulation A (17 CFR 230.251 et seq.).
Careful attention should be directed to the terms, conditions and requirements of Regulation A, especially Rule 
251, because the exemption is not available to all issuers or for every type of securities transaction. Further, the aggregate off  ering price and aggregate sales of securities in any 12-month period is strictly limited to $20 million 
for Tier 1 off  erings and $75 million for Tier 2 off  erings, including no more than $6 million off  ered by all selling 
securityholders that are affi   liates of the issuer for Tier 1 off  erings and $22.5 million by all selling securityholders 
that are affi   liates of the issuer for Tier 2 off  erings. Please re


# Create questions based on the context
- Use davinci-instruct to generate a number of plausible questions relating to the SEC form parsed contents.

- Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get a higher diversity of questions.

In [57]:
# GPT model generating questions on its own


openai.api_key = 'apikey'  # placeholder; replace with your own key


def get_questions(context):
    # print(context)

    response = openai.Completion.create(
        engine="davinci-instruct-beta-v3",
        prompt=f"Write 3 questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
        temperature=0,
        max_tokens=257,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=["\n\n"]
    )
    return response['choices'][0]['text']



In [58]:
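# Generate questions
# Note: df has a single row (index 0), so df.iloc[1:4] is an empty DataFrame;
# the questions printed below are about its string form, not the SEC text.
Questions = get_questions(df.iloc[1:4])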


In [59]:
print(Questions)

 What is the title of the text?
2. What is the heading of the text?
3. What is the content of the text?


# Create answers based on the context
- Use davinci-instruct to answer the questions given the relevant SEC form contents.

- Note: We have used temperature=0, but it may be beneficial to experiment with a higher temperature to get a higher diversity of answers.

In [60]:
# GPT model generating answers on its own


openai.api_key = 'apikey'

def get_answers(context,Questions):
    # print(context)

    response = openai.Completion.create(
        engine="davinci-instruct-beta-v3",
        prompt=f"Write answer based on the text below\n\nText: {context}\n\nQuestions:{Questions}\nAnswers:\n1.",
        temperature=0,
        max_tokens=257,
        top_p=1,

    )
    return response['choices'][0]['text']

In [61]:
# print(Questions)
# As above, df.iloc[1:4] is empty, so the answers below describe an empty DataFrame.
get_answers(df.iloc[1:4],Questions)

' The title of the text is "Empty DataFrame".\n2. The heading of the text is "Columns: [title, heading, content, tokens, context]".\n3. The content of the text is "Index: []".'
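
The answer above describes an empty DataFrame because the slice passed as context selected no rows. Below is a minimal sketch of building the prompt from a plain string instead, with a guard against empty input and a crude word cap so the prompt stays small. `build_answer_prompt` and `max_words` are hypothetical names, not part of the notebook, and the sample sentence is a cleaned-up line from the form text.

```python
def build_answer_prompt(context: str, questions: str, max_words: int = 500) -> str:
    """Build a completion prompt from plain text, refusing empty context."""
    words = context.split()
    if not words:
        raise ValueError("context is empty; check the row selection")
    clipped = " ".join(words[:max_words])  # crude cap; words only approximate tokens
    return (
        "Write answer based on the text below\n\n"
        f"Text: {clipped}\n\n"
        f"Questions:{questions}\nAnswers:\n1."
    )


prompt = build_answer_prompt(
    "This Form is to be used for securities offerings made pursuant to Regulation A.",
    "\n1. Which regulation governs Form 1-A offerings?",
)
print(prompt)
```

Passing a string such as `df.context.values[0]` rather than a DataFrame slice would give the model the form text itself to work with.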

# Question answering using embeddings-based search


In [62]:
# Model names for embeddings and chat completions

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

# You can give GPT knowledge about a topic by inserting it into an input message

To give the model knowledge of the SEC content, we can copy and paste a portion of the extracted SEC data into our message:

In [63]:
form1a = """Eligibility Requirements for Use of Form 1-A.
Th is Form is to be used for securities off erings made pursuant to Regulation A (17 CFR 230.251 et seq.).
Careful attention should be directed to the terms, conditions and requirements of Regulation A, especially Rule
251, because the exemption is not available to all issuers or for every type of securities transaction. Further, the
aggregate off ering price and aggregate sales of securities in any 12-month period is strictly limited to $20 million
for Tier 1 off erings and $75 million for Tier 2 off erings, including no more than $6 million off ered by all selling
securityholders that are affi liates of the issuer for Tier 1 off erings and $22.5 million by all selling securityholders
that are affi liates of the issuer for Tier 2 off erings. Please refer to Rule 251 of Regulation A for more details.
"""

In [64]:
query = f"""Use the below content of Eligibility Requirements to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{form1a}
\"\"\"

Question: What is the eligibility requirement for use of Form 1-A?"""

response = openai.ChatCompletion.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about Eligibility Requirements.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response['choices'][0]['message']['content'])

The eligibility requirement for use of Form 1-A is that it is to be used for securities offerings made pursuant to Regulation A (17 CFR 230.251 et seq.).


GPT answers correctly.

In this case, the answer comes directly from the passage inserted into the message: Form 1-A is to be used for securities offerings made pursuant to Regulation A.


In [65]:
#reading embeddings
embeddings_path ='/content/output_with_embeddings.csv'


df = pd.read_csv(embeddings_path)
df.head()

| | text | embedding |
|---|---|---|
| 0 | Page 1UNITED STATES\nSECURITIES AND EXCHANGE C... | [0.04657985270023346, -0.04721924662590027, 0.... |


In [66]:
print(result_df)
import ast
# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)

                                                text  \
0  Page 1UNITED STATES\nSECURITIES AND EXCHANGE C...   

                                           embedding  
0  [0.04657985270023346, -0.04721924662590027, 0....  


In [67]:
# the dataframe has two columns: "text" and "embedding"
df

| | text | embedding |
|---|---|---|
| 0 | Page 1UNITED STATES\nSECURITIES AND EXCHANGE C... | [0.04657985270023346, -0.04721924662590027, 0.... |


# Search
Now we'll define a search function that:

- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
  1. The top N texts, ranked by relevance
  2. Their corresponding relevance scores
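
The ranking step in the list above can be sketched without any API calls by passing in precomputed vectors. This illustration assumes the query and the stored texts were embedded with the same model. It raises an error on a dimension mismatch rather than truncating, since vectors produced by different models are not comparable. The function names and the toy 2-dimensional vectors are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def rank_by_relatedness(query_vec, rows, top_n=5):
    """Return (texts, scores) sorted from most to least related."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_n]
    return [t for t, _ in top], [s for _, s in top]


# Toy vectors; real embeddings have hundreds of dimensions.
rows = [
    ("financial statements", [1.0, 0.0]),
    ("signatures", [0.0, 1.0]),
    ("offering circular", [1.0, 1.0]),
]
texts, scores = rank_by_relatedness([1.0, 0.0], rows, top_n=2)
print(texts)
```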

In [68]:
from scipy.spatial import distance

def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )

    query_embedding = query_embedding_response["data"][0]["embedding"]

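    # Truncate the query embedding to the length of the stored embeddings.
    # Caveat: the stored vectors come from the Universal Sentence Encoder, the query
    # vector from EMBEDDING_MODEL. Truncation only aligns the lengths; vectors from
    # different models are not comparable, so the scores are not meaningful similarities.
    query_embedding = query_embedding[:len(df.iloc[0]["embedding"])]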

    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [69]:
# examples
strings, relatednesses = strings_ranked_by_relatedness("Financial Statements,", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.046


"Page 1UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 1-A\nREGULATION A OFFERING STATEMENT\n UNDER THE SECURITIES ACT OF 1933\nGENERAL INSTRUCTIONS\nI. Eligibility Requirements for Use of Form 1-A.\nTh is Form is to be used for securities off  erings made pursuant to Regulation A (17 CFR 230.251 et seq.).\nCareful attention should be directed to the terms, conditions and requirements of Regulation A, especially Rule \n251, because the exemption is not available to all issuers or for every type of securities transaction. Further, the aggregate off  ering price and aggregate sales of securities in any 12-month period is strictly limited to $20 million \nfor Tier 1 off  erings and $75 million for Tier 2 off  erings, including no more than $6 million off  ered by all selling \nsecurityholders that are affi   liates of the issuer for Tier 1 off  erings and $22.5 million by all selling securityholders \nthat are affi   liates of the issuer for Tier 2 off  eri

# Ask
With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function ask that:

- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer

In [70]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below content of Eligibility Requirements to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nEligibility Requirements:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about Financial Statements for Tier 1 Offerings."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
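
One property of `query_message` is worth illustrating: a source text is left out entirely if adding it would exceed the token budget. With the single 14,773-word row shown earlier and a budget of 4096 - 500 tokens, the loop would stop before adding any text, so the message would carry only the instructions and the question. The sketch below reproduces that stopping rule with a whitespace word counter as a stand-in for `tiktoken` (an assumption made so that it runs offline); `assemble_message` is a hypothetical name.

```python
def assemble_message(intro: str, question: str, texts: list[str], budget: int,
                     count=lambda s: len(s.split())) -> str:
    """Append texts in order until the next one would exceed the budget."""
    message = intro
    for text in texts:
        candidate = f'\n\nEligibility Requirements:\n"""\n{text}\n"""'
        if count(message + candidate + question) > budget:
            break  # same stopping rule as query_message
        message += candidate
    return message + question


intro = "Use the content below to answer the question."
question = "\n\nQuestion: What is Form 1-A used for?"

# One oversized text: nothing fits, so no source text reaches the model.
whole = " ".join(["word"] * 100)
print('"""' in assemble_message(intro, question, [whole], budget=50))

# The same amount of text split into smaller pieces: some of them fit.
pieces = [" ".join(["word"] * 10)] * 10
print(assemble_message(intro, question, pieces, budget=50).count('"""') // 2)
```

Splitting the extracted text into smaller rows before embedding is what lets the budget check admit at least some of the source material.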



# Example questions
Finally, let's ask our system questions about Form 1-A:

In [71]:
ask('What is the financial statement requirement for use of Form 1-A?')

'According to the Eligibility Requirements, the financial statement requirement for use of Form 1-A is as follows:\n\n- For offerings of up to $20 million: Financial statements of the issuer for the two most recently completed fiscal years, or for such shorter period that the issuer has been in existence.\n- For offerings of more than $20 million but not more than $50 million: Financial statements of the issuer for the two most recently completed fiscal years, or for such shorter period that the issuer has been in existence, audited by an independent public accountant.\n- For offerings of more than $50 million: Financial statements of the issuer for the two most recently completed fiscal years, or for such shorter period that the issuer has been in existence, audited by an independent public accountant, and the financial statements must be prepared in accordance with U.S. GAAP.\n\nTherefore, the financial statement requirement for use of Form 1-A depends on the size of the offering.'

In [72]:
ask('What is the Eligibility Requirements for Use of Form 1-A.')

'The Eligibility Requirements for Use of Form 1-A are as follows:\n\n1. The issuer must be organized under the laws of the United States or Canada, or any state, province, territory or possession thereof, or the District of Columbia.\n2. The issuer must not be subject to the reporting requirements of Section 13 or 15(d) of the Securities Exchange Act of 1934 immediately before the offering.\n3. The issuer must not have been subject to any order of the Commission under Section 12(j) of the Exchange Act entered within five years before the filing of the offering statement.\n4. The issuer must not have filed a registration statement that is the subject of a currently effective registration stop order under the Securities Act.\n5. The issuer must not have been convicted of any felony or misdemeanor in connection with the purchase or sale of any security or involving the making of any false filing with the Commission.\n6. The issuer must not be subject to any order, judgment, or decree of a

# Conclusion
The model (`gpt-3.5-turbo`, as set in `GPT_MODEL`) answers both example questions and lists eligibility requirements for using Form 1-A.