# Question Answering Using Embedding-Based Search

### Preamble

In [1]:
pip install tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search


# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

## Giving GPT knowledge about a topic by inserting it into an input message

In [3]:
government_forms_search = """
2023 Government forms
Examination Brochure: Information about Examinations
SEC 2389Examination Brochure: Information about Examinations
SEC 2389 ( 03-23)

INFORMATION FOR ENTITIES SUBJECT TO  EXAMINATION OR INSPECTION BY  THE SECURITIES AND EXCHANGE  COMMISSION      The examination staff (staf f) of the Division  of Examinations ( EXAMS ) of the Securities and  Exchange Commission (Commission) has prepared this brochure to provide information  about examinations it conduct s, including  information about the examination process and  the methods the staff employ s for resolving issues identified d uring examinations.. This  information, provided to entities undergoing examination or inspection , should help entities  to understand better the  staff’s objectives in this area.1    I.. PURPOSE OF EXAMINATIONS   Commission representatives have statutory authority  to conduct , at any time or from time to  time, reasonable periodic, special , and other examinations of the records of specified  Commission -regulated entities  (entity  or entities ).. The s taff carries out these responsibilities in   11 regional offices and headquarters  in Washington, DC.. EXAMS ’ mission  is to protect  investors, ensure market integrity, and support responsible capital formation through risk-focused  strategies that: improve compliance , prevent fraud , monitor risk , and inform policy.. During exam inations, the staff will seek to determine whether the entity being examined is :  conducting its activities in accordance with the federal securities laws and the rules adopted  under these laws ( as well as , where applicable, the rules of self-regulatory organizations subject  to the Commission ’s oversight); adhering to the disclosures it has made to  its clients, customers ,  the general public, and /or the Commission ; and implementing supervisory systems and/or  compliance policies and procedures that are reasonably designed to ensure that the entity ’s  operations are in compliance with applicable legal requirements .. 
The staff  appreciates each entity ’s cooperation as it will facilitate the staff ’s ability to time ly  complete the examination .. Entities should work promptly to  provide the staff  with complete and  accurate information and ensure that knowledgeable employees are made available to  help the  staff better understand the entity and its operations.. II.. THE EXAMINATION PROCESS   The Commission ’s exam ination  program is risk -based .. An entity may be selected for  examination for any number of reasons including , but not limited to,  a statutory mandate that  requires the Commission to examine the entity ; the entity ’s risk profile; a tip, complaint, or  referral; or a review of a particular compliance risk area.. The reason an entity has been selected                                                    1  This statement represents the views of staff of EXAMS.. It is not a rule, regulation, or statement of the  Commission.. The Commission has neither approved nor disapproved its content.. This statement, like all staff  statements, has no legal force or effect: it does not alter or amend applicable law, and it creates no new or additi onal  obligations for any person.. SEC 2389 ( 03-23) -2- for examination is non-public information and typically will not be shared with the entity under  examination .. As part of the pre -examination planning process, the staff  strives to efficiently   allocate resources and minimize , if possible, any  overlap with the scope of any recent or ongoing  examinations or investigations by other regulators or Commission staff  in other Commission  offices or divisions.. If an entity has any concerns with respect to overlapping examinations or investigations , the entity should contact the Commission staff  involved.. 
Throughout the examination process, the staff  may consult and/or coordinate with other  Commission staff, including supervisory examination staff and staff in other Commission offices and divisions, regarding any issues identified and  the interpretation  and application of the  securities laws and rules adopted under these laws, and, as  applicable, self -regulatory  organization rules.. As a result, the staff may share information and documents received from the  entity with other Commission staff to the extent the staff  deems necessary or appropriate.. The  Commission may also share information and documents with other regulators or authorities.. Whether the Commission has shared information and documents received from an entity is a confidential matter..


2023 Government forms
Application for registration or exemption from registration as a national securities exchange
Form 1

APPLICATION FOR, AND AMENDMENTS TO APPLICATION FOR, REGISTRATION AS   A NATIONAL SECURITIES EXCHANGE OR EXEMPTION FROM REGISTRATION  PURSUANT TO SECTION 5 OF THE EXCHANGE  ACT  SEC 1935 (2-99) hours  per response.. .........  30.00  March 31, 2025   Estimated  average  burden  Expires:  3235-0017 OMB  Number:  OMB  APPROVAL  2  FORM  1 INSTRUCTIONS     A.. GENERAL  INSTRUCTIONS     1.. Form  1 is the application  for registration  as a national  securities  exchange  or an exchange  exempt  from  registration  pursuant to Section 5 of the Securities Exchange Act of 1934 (“Exchange Act”).. 2.. UPDATING  - A registered  exchange or exchange  exempt  from registration  pursuant  to Section  5 of the Exchange  Act  must file amendments  to Form  1 in accordance  with Exchange  Act Rule 6a-2.. 3.. CONTACT  EMPLOYEE - The individual  listed  on the Execution  Page (Page  1) of Form  1 as the contact  employee must be  authorized to receive all contact information, communications, and mailings, and is responsible for disseminating such  information  within  the applicant’s  organization.. 4.. FORMAT    Attach  an Execution  Page (Page  1) with original  manual  signatures..  Please  type all information..  Use only the current  version  of Form  1 or a reproduction.. 5.. If the information called for by any Exhibit is available in printed form, the printed material may be filed, provided it doe s  not exceed 8 1/2 X 11 inches in size.. 6.. If any Exhibit  required  is inapplicable,  a statement to that effect  shall be furnished  in lieu of such Exhibit.. 7.. An exchange  that is filing  Form  1 as an application  may not satisfy  the requirements  to provide  certain  information  by  means  of an Internet  web page.. All  materials  must be filed with the Commission  in paper.. 8.. 
WHERE TO FILE AND NUMBER OF COPIES  - Submit one original and two copies of Form 1 to: SEC, Division of Market  Regulation, Office of Market Supervision, 450 Fifth Street, N.W., Washington, DC  20549.. 9.. PAPERWORK REDUCTION ACT DISCLOSURE    Form 1 requires an exchange seeking to register as a national securities exchange or seeking an exemption from  registration  as a national  securities  exchange  pursuant  to Section  5 of the Exchange  Act to provide  the Securities  and  Exchange  Commission  (“SEC”  or “Commission”)  with certain  information  regarding  the operation  of the exchange.. Form 1 also requires national securities exchanges or exchanges exempt from registration based on limited volume  to update certain information on a periodic basis..  An agency  may not conduct  or sponsor,  and a person  is not required  to respond  to, a collection  of information  unless  it displays  a currently  valid  control  number  .. Sections  3(a)(1),  5, 6(a) and 23(a)  authorize  the Commission  to collect  information  on this Form  1 from exchanges.. See 15 U.S.C.. §§78c(a)(1),  78e, 78f(a)  and 78w(a)..  Any member  of the public  may direct  to the Commission  any comments  concerning  the accuracy  of the burden  estimate  on the facing  page  of Form  1 and any suggestions  for reducing this burden..  Form 1 is designed to enable the Commission to determine whether an exchange applying for registration is in  compliance with the provisions of Sections 6 and 19 of the Exchange Act.. Form 1 is also designed to enable the  Commission  to determine whether  a national  securities  exchange  or exchange  exempt  from registration  based  on  limited  volume  is operating  in compliance  with the Exchange  Act..  It is estimated that an exchange will spend approx imately 47 hours completing the initial application on Form 1  pursuant  to Rule 6a-1..

2023 Government forms
Regulation A Offering Statement
Page 1UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

FORM 1-A REGULATION A OFFERING STATEMENT  UNDER THE SECURITIES ACT OF 1933 GENERAL INSTRUCTIONS I.. Eligibility Requirements for Use of Form 1-A.. Th is Form is to be used for securities off  erings made pursuant to Regulation A (17 CFR 230.251 et seq.).. Careful attention should be directed to the terms, conditions and requirements of Regulation A, especially Rule  251, because the exemption is not available to all issuers or for every type of securities transaction.. Further, the aggregate off  ering price and aggregate sales of securities in any 12-month period is strictly limited to $20 million  for Tier 1 off  erings and $75 million for Tier 2 off  erings, including no more than $6 million off  ered by all selling  securityholders that are affi   liates of the issuer for Tier 1 off  erings and $22.5 million by all selling securityholders  that are affi   liates of the issuer for Tier 2 off  erings.. Please refer to Rule 251 of Regulation A for more details.. II.. Preparation, Submission and Filing of the Off  ering Statement.. An off  ering statement must be prepared by all persons seeking exemption under the provisions of Reg- ulation A.. Parts I, II and III must be addressed by all issuers.. Part II, which relates to the content of the required off ering circular, provides alternative formats, of which the issuer must choose one.. General informa- tion  regarding the preparation, format, content, and submission or fi  ling of the off  ering statement is contained in  Rule 252.. Information regarding non-public submission of the off  ering statement is contained in Rule 252(d).. Requirements relating to the off  ering circular are contained in Rules 253 and 254.. 
Th  e off  ering statement must  be submitted or fi  led with the Securities and Exchange Commission in electronic format by means of the Com-  mission’s Electronic Data Gathering, Analysis and Retrieval System (EDGAR) in accordance with the EDGAR rules set forth in Regulation S-T (17 CFR Part 232) for such submission or fi  ling.. III.. Incorporation by Reference and Cross-Referencing.. An issuer may incorporate by reference to other documents previously submitted or fi  led on EDGAR.. Cross-referencing within the off  ering statement is also encouraged to avoid repetition of information.. For exam-  ple, you may respond to an item of this Form by providing a cross-reference to the location of the information in the fi  nancial statements, instead of repeating such information.. Incorporation by reference and cross-referencing  are subject to the following additional conditions: (a) Th  e use of incorporation by reference and cross-referencing in Part II of this Form: (1) Is limited to the following items: (A) Items 2-14 of Part II and Part F/S if following the Off  ering Circular format; (B) Items 3-11 of Form S-1 if following the Part I of Form S-1 format; or(C) Items 3-28, and 30 of Form S-11 if following the Part I of Form S-11 format;OMB APPROVAL OMB Number:                     3235-0286 Expires:                 September 30, 2024Estimated average burdenhours per response ..

                                                      or              [  ] SPECIAL FINANCIAL REPORT PURSUANT TO REGULATION A For the fiscal semiannual period ended ___________________________________________________   ____________________________________________________________________________   (Exact name of issuer as specified in its charter) _____________________________________________  _____________________   State or other jurisdiction of incorporation or organization  (I.R.S.. Employer  Identification No.). __________________________________________________________________________________   (Full mailing address of principal executive offices) ____________________________________________________________________________________   (Issuer’s telephone number, including area code) GENERAL  INSTRUCTIONS A.. Rules as to Use of Form 1-SA.. (1) This Form shall be used for semiannual reports pursuant to Rule 257(b)(3) of Regulation A (§§ 230.251- 230.263).. (2) Semiannual reports on this Form shall be filed within 90 calendar days after the end of the semiannual pe- riod covered by the report.. (3) This Form also shall be used for special financial reports filed pursuant to Rule 257(b)(2)(i)(B) of Regulation A.. Such special financial reports shall be filed and signed in the manner set forth in this Form, but otherwiseneed only provide the cover page and financial statements required by Rule 257(b)(2)(i)(B).. Special financialreports filed using this Form shall be filed within 90 calendar days after the qualification date of the offeringstatement.. B.. Preparation of Report.. (1) Regulation A contains certain general requirements that are applicable to reports on any form, including amendments to reports.. These general requirements should be carefully read and observed in the preparationand filing of reports on this Form.. 
(2) This Form is not to be used as a blank form to be filled in, but only as a guide in the preparation of the re- port.OMB Number:          3235-0721  Expir    es:       October 31, 2025   Estimated average burden hours per response  .. .. .. .. 188.04OMB APPROVAL Persons who are to respond to the collection of information contained in this form are not  required to respond unless the form displays a currently valid OMB control number.SEC2914 (1-21) 1 of 5(3) In addition to the information expressly required to be included in this Form, there shall be added such further material information, if any , as may be necessary to make the required statements, in light of the circum - stances under which they are made, not misleading.. C. Signature and Filing of Report.. (1) The report must be filed with the Commission in electronic format by means of the Commission’s Electronic Data Gathering, Analysis and Retrieval System (“EDGAR”) in accordance with the EDGAR rules set forth inRegulation S-T (17 CFR Part 232).. (2) The report must be signed by the issuer, its principal executive officer, principal financial officer and princi- pal accounting officer.. If a signature is by a person on behalf of any other person, evidence of authority to signmust be filed with the report, except where an executive officer signs on behalf of the issuer .. (3) The report must be signed using a typed signature.. Each signatory to the filing must also manually signa signature page or other document authenticating, acknowledging or otherwise adopting his or her signaturethat appears in the filing.. Such document must be executed before or at the time the filing is made and must beretained by the issuer for a period of five years.. Upon request, the issuer must furnish to the Commission or itsstaff a copy of any or all documents retained pursuant to this paragraph.. D. Incorporation by Reference and Cross-Referencing.. 
(1) An issuer may incorporate by reference to other documents previously submitted or filed on EDGAR.. Cross-referencing within the report is also encouraged to avoid repetition of information..



"""

In [None]:
api_key="your api key"
query = f"""Use the below articles of the SEC Government forms to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{government_forms_search}
\"\"\"

Question: What is the purpose of the SEC's examinations?"""

response = openai.ChatCompletion.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the SEC Government Forms.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
    api_key = api_key
)

print(response['choices'][0]['message']['content'])

The purpose of the SEC's examinations is to determine whether the entity being examined is conducting its activities in accordance with the federal securities laws and rules, adhering to the disclosures it has made, and implementing supervisory systems and compliance policies to ensure compliance with applicable legal requirements. The examinations aim to protect investors, ensure market integrity, and support responsible capital formation.
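
The prompt assembly in the cell above can be factored into small helpers so that the same layout is reused for other questions. The following is a minimal sketch, not part of the original notebook: `build_query` and `build_messages` are hypothetical names, and the wording of the instruction is copied from the cell above. No API call is made here.

```python
def build_query(articles: list[str], question: str) -> str:
    """Place the reference articles ahead of the question in one prompt string."""
    introduction = (
        "Use the below articles of the SEC Government forms to answer the "
        'subsequent question. If the answer cannot be found, write "I don\'t know."'
    )
    # Each article is wrapped in triple quotes so the model can tell
    # where the reference text starts and ends.
    article_blocks = "".join(f'\n\nArticle:\n"""\n{a}\n"""' for a in articles)
    return f"{introduction}{article_blocks}\n\nQuestion: {question}"


def build_messages(articles: list[str], question: str) -> list[dict]:
    """Return the chat messages in the shape the cell above passes to the API."""
    return [
        {"role": "system", "content": "You answer questions about the SEC Government Forms."},
        {"role": "user", "content": build_query(articles, question)},
    ]


messages = build_messages(["<form text goes here>"], "What is the purpose of the SEC's examinations?")
print(messages[1]["content"])
```

The resulting `messages` list could be passed as the `messages` argument of the chat completion call shown earlier.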


## Prepare Search Data

### Preparing the Government Forms dataset used for search in the Embeddings notebook

#### Importing Libraries

In [5]:
import pandas as pd

In [6]:
pip install mwparserfromhell

Collecting mwparserfromhell
  Using cached mwparserfromhell-0.6.5-cp311-cp311-macosx_10_9_universal2.whl (123 kB)
Installing collected packages: mwparserfromhell
Successfully installed mwparserfromhell-0.6.5

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install mwclient

Collecting mwclient
  Using cached mwclient-0.10.1-py2.py3-none-any.whl (27 kB)
Collecting requests-oauthlib (from mwclient)
  Using cached requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib->mwclient)
  Using cached oauthlib-3.2.2-py3-none-any.whl (151 kB)
Installing collected packages: oauthlib, requests-oauthlib, mwclient
Successfully installed mwclient-0.10.1 oauthlib-3.2.2 requests-oauthlib-1.3.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [8]:
import mwclient  # for downloading example Wikipedia articles
import mwparserfromhell  # for splitting Wikipedia articles into sections
import openai  # for generating embeddings
import pandas as pd  # for DataFrames to store article sections and embeddings
import re  # for cutting <ref> links out of Wikipedia articles
import tiktoken  # for counting tokens

### Chunk Documents

In [9]:
import csv

# Function to read the CSV file and return the specified column
def read_csv_and_extract_column(csv_filename, column_name):
    extracted_column = []
    with open(csv_filename, 'r', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            extracted_column.append(row[column_name])
    return extracted_column

# Replace 'your_csv_file.csv' with the path to your CSV file
csv_file_path = '/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/preprocessed_data.csv'

# Replace 'your_column_name' with the name of the column you want to filter
column_name = 'context'

# Read the CSV file and extract the specified column
csv_column = read_csv_and_extract_column(csv_file_path, column_name)

# Define the function keep_section as provided in your code
def keep_section(section):
    if len(section) < 16:
        return False
    return True

# Filter the data based on the keep_section function
filtered_data = [data for data in csv_column if keep_section(data)]

# Print the results
print(f"Filtered out {len(csv_column) - len(filtered_data)} rows, leaving {len(filtered_data)} rows.")

# Optionally, you can write the filtered data back to a new CSV file
filtered_csv_file_path = 'filtered_data.csv'
with open(filtered_csv_file_path, 'w', newline='') as filtered_csv_file:
    writer = csv.writer(filtered_csv_file)
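    # Wrap each string in a list: writerows() treats a bare string as a
    # sequence of characters and would write one character per column.
    writer.writerows([row] for row in filtered_data)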


Filtered out 0 rows, leaving 20 rows.


In [10]:
import csv

# Function to read the CSV file and return the specified column
def read_csv_and_extract_column(csv_filename, column_name):
    extracted_column = []
    with open(csv_filename, 'r', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            extracted_column.append(row[column_name])
    return extracted_column

# Replace 'your_csv_file.csv' with the path to your CSV file
csv_file_path = '/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/preprocessed_data.csv'

# Replace 'your_column_name' with the name of the column you want to display
column_name = 'context'

# Read the CSV file and extract the specified column
csv_column = read_csv_and_extract_column(csv_file_path, column_name)

# Display the first 5 sections of the specified column
for section_text in csv_column[:5]:
    print(section_text[:77] + "...")
    print()

2023 government forms examination brochure information examinations sec 2389e...

2023 government forms application registration exemption registration nationa...

2023 government forms regulation offering statement page 1united states secur...

2023 government forms notification regulation e may send completed printout f...

2023 government forms annual reports special financial reports united states ...



In [11]:
import csv
import tiktoken  # Make sure to import the tiktoken library

GPT_MODEL = "gpt-3.5-turbo"

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def split_and_truncate_text(text, max_tokens, model=GPT_MODEL):
    encoding = tiktoken.encoding_for_model(model)
    encoded_text = encoding.encode(text)
    
    if len(encoded_text) <= max_tokens:
        return [text]  # No need to split or truncate
    
    # Split the text into smaller chunks
    chunks = []
    start = 0
    while start < len(encoded_text):
        end = start + max_tokens
        chunks.append(encoded_text[start:end])
        start = end
    
    # Convert the chunks back to text
    chunk_texts = [encoding.decode(chunk) for chunk in chunks]
    
    return chunk_texts

def read_csv_and_extract_column(csv_filename, column_name):
    extracted_column = []
    with open(csv_filename, 'r', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            extracted_column.append(row[column_name])
    return extracted_column

# Replace 'your_csv_file.csv' with the path to your CSV file
csv_file_path = '/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/preprocessed_data.csv'

# Replace 'your_column_name' with the name of the column you want to process
column_name = 'context'

# Read the CSV file and extract the specified column
csv_column = read_csv_and_extract_column(csv_file_path, column_name)

# Set your desired maximum tokens
max_tokens = 1000
processed_data = []

for text in csv_column:
    split_sections = split_and_truncate_text(text, max_tokens)
    processed_data.extend(split_sections)

# Optionally, you can write the processed data back to a new CSV file
output_csv_file_path = 'processed_data.csv'

with open(output_csv_file_path, 'w', newline='') as output_csv_file:
    writer = csv.writer(output_csv_file)
    for row in processed_data:
        writer.writerow([row])
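
The splitting step in `split_and_truncate_text` amounts to cutting the encoded token list into consecutive fixed-size windows. The sketch below isolates that logic on a plain list of integers standing in for token ids, so it runs without `tiktoken`; `token_windows` and `fake_tokens` are illustrative names that do not appear in the notebook.

```python
def token_windows(token_ids: list[int], max_tokens: int) -> list[list[int]]:
    """Cut a token list into consecutive windows of at most max_tokens items."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]


# Stand-in for encoding.encode(text): 2500 made-up token ids.
fake_tokens = list(range(2500))
windows = token_windows(fake_tokens, max_tokens=1000)
print([len(w) for w in windows])  # [1000, 1000, 500]
```

Because the cuts fall at fixed token offsets, a window can end in the middle of a sentence or word. Splitting on paragraph or sentence boundaries first is a common alternative when chunk coherence matters.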


### Displaying the preprocessed text

In [13]:
df1 = pd.read_csv("/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/data_1.csv")

In [14]:
df1

Unnamed: 0,Text
0,2023 government forms examination brochure inf...
1,2023 government forms application registration...
2,2023 government forms regulation offering stat...
3,2023 government forms notification regulation ...
4,2023 government forms annual reports special f...
5,2023 government forms form amendments notice r...
6,2023 government forms semiannual report specia...
7,2023 government forms current report pursuant ...
8,2023 government forms exit report regulation s...
9,2023 government forms general form registratio...


In [15]:
import pandas as pd

# Read the CSV file into a DataFrame
df1 = pd.read_csv("/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/data_1.csv")

MAX_TOKENS = 700
government_strings = []

# Assuming your CSV file has a column named "Text" that contains the text data
for text in df1['Text']:
    # Split the text into strings with a maximum of MAX_TOKENS tokens
    strings = split_and_truncate_text(text, max_tokens=MAX_TOKENS)
    government_strings.extend(strings)
    

print(f"{len(df1)} rows split into {len(government_strings)} strings.")

20 rows split into 20 strings.


In [16]:
print(government_strings[1])

2023 government forms application registration exemption registration national securities exchange form 1 application amendments application registration national securities exchange exemption registration pursuant section 5 exchange act sec 1935 299 hours per response 3000 march 31 2025 estimated average burden expires 32350017 omb number omb approval 2 form 1 instructions general instructions 1 form 1 application registration national securities exchange exchange exempt registration pursuant section 5 securities exchange act 1934 exchange act 2 updating registered exchange exchange exempt registration pursuant section 5 exchange act must file amendments form 1 accordance exchange act rule 6a2 3 contact employee individual listed execution page page 1 form 1 contact employee must authorized receive contact information communications mailings responsible disseminating information within applicant organization 4 format attach execution page page 1 original manual signatures please type 

## Embed document chunks

In [17]:
import openai
import pandas as pd

# Set your API key here
API_KEY = "Your Api Key"

# Define your model and batch size
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request

# Initialize OpenAI with your API key
openai.api_key = API_KEY

embeddings = []
for batch_start in range(0, len(government_strings), BATCH_SIZE):
    batch_end = min(batch_start + BATCH_SIZE, len(government_strings))
    batch = government_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in the same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df1 = pd.DataFrame({"text": government_strings, "embedding": embeddings})

Batch 0 to 19
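
The batch boundaries used in the loop above can be checked without calling the API. This is a small sketch of the same index arithmetic; `batch_ranges` is a hypothetical helper, not something the notebook defines.

```python
from typing import Iterator


def batch_ranges(n_items: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) index pairs covering n_items in steps of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 20 strings with BATCH_SIZE = 1000 fit in a single request,
# which matches the "Batch 0 to 19" line printed above.
print(list(batch_ranges(20, 1000)))    # [(0, 20)]
print(list(batch_ranges(2500, 1000)))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```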


### Store document chunks and embeddings

In [18]:
df1.to_csv("new_data.csv", index=False)

In [19]:
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search


# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

## Prepare Search Data 

In [20]:
embeddings_path = "/Users/prathamesh/Desktop/Github/Assignment-2/part_1/openai_notebooks/embeddings/new_data.csv"

df2 = pd.read_csv(embeddings_path)

In [21]:
df2['embedding'] = df2['embedding'].apply(ast.literal_eval)
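
The `ast.literal_eval` step is needed because a list stored in a CSV cell comes back as a string. The following sketch reproduces that round trip with only the standard library and a made-up three-element vector (real embeddings are much longer); it illustrates the idea and is not the notebook's own code.

```python
import ast
import csv
import io

embedding = [-0.0104, 0.25, 1.5]  # made-up values for illustration

# Write one row to an in-memory CSV; the list is stored via str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["example chunk", embedding])

# Read it back: the embedding cell is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["embedding"]).__name__)  # str

# literal_eval safely parses the string back into a list of floats.
restored = ast.literal_eval(row["embedding"])
print(restored == embedding)  # True
```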

In [22]:
df2

Unnamed: 0,text,embedding
0,2023 government forms examination brochure inf...,"[-0.010471713729202747, -0.0001877099857665598..."
1,2023 government forms application registration...,"[-0.019031904637813568, -0.0005620317533612251..."
2,2023 government forms regulation offering stat...,"[-0.02028811164200306, -0.011089688166975975, ..."
3,2023 government forms notification regulation ...,"[-0.013064158149063587, -0.009831175208091736,..."
4,2023 government forms annual reports special f...,"[-0.020037660375237465, -0.012947814539074898,..."
5,2023 government forms form amendments notice r...,"[-0.020353544503450394, -0.010243287310004234,..."
6,2023 government forms semiannual report specia...,"[-0.016406027600169182, -0.009464507922530174,..."
7,2023 government forms current report pursuant ...,"[-0.01728230156004429, -0.02042453922331333, -..."
8,2023 government forms exit report regulation s...,"[-0.015645071864128113, -0.010695029981434345,..."
9,2023 government forms general form registratio...,"[-0.02019253931939602, -0.017759384587407112, ..."


## Search

In [23]:
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[tuple[str, ...], tuple[float, ...]]:
    """Returns strings and their relatednesses as two tuples, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
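
The default `relatedness_fn` above is cosine similarity, computed as one minus SciPy's cosine distance. To make the ranking step concrete without an API call, here is a pure-Python sketch on toy two-dimensional vectors; `cosine_similarity`, `rank_by_relatedness`, and `toy_rows` are illustrative names and the vectors are invented.

```python
import math


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between x and y (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def rank_by_relatedness(query_vec, rows, top_n=2):
    """Score every (text, vector) row against the query and keep the top_n."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


toy_rows = [
    ("chunk about exams", [1.0, 0.0]),
    ("chunk about form 1", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]
ranked = rank_by_relatedness([1.0, 0.0], toy_rows, top_n=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

With real embeddings the query vector would come from the embeddings endpoint, as in `strings_ranked_by_relatedness`.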

In [25]:
strings, relatednesses = strings_ranked_by_relatedness("curling gold medal", df1, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.710


'2023 government forms regulation offering statement page 1united states securities exchange commission washington dc 20549 form 1a regulation offering statement securities act 1933 general instructions eligibility requirements use form 1a th form used securities erings made pursuant regulation 17 cfr 230251 et seq careful attention directed terms conditions requirements regulation especially rule 251 exemption available issuers every type securities transaction aggregate ering price aggregate sales securities 12month period strictly limited 20 million tier 1 erings 75 million tier 2 erings including 6 million ered selling securityholders affi liates issuer tier 1 erings 225 million selling securityholders affi liates issuer tier 2 erings please refer rule 251 regulation details ii preparation submission filing ering statement ering statement must prepared persons seeking exemption provisions reg ulation parts ii iii must addressed issuers part ii relates content required ering circula

relatedness=0.696


'2023 government forms irrevocable appointment agent service process pleadings papers nonresident general partner broker dealer omb approval subject omb clearance 44 usc 3501 et seq united states secruities exchange commission washington dc 20459 form 10m irrevocable appointment agent service process pleadings papers nonresident general partner broker dealer ___________________________________________________________________________________________________________ form shall filed duplicate original ___________________________________________________________________________________________________________ 1 _________________________________________________________ of____ ________________________________________ name address full hereby designate appoint without power revocation united states securities exchange commission agent upon may served process pleadings papers civil suit action brought individually partner partnership engaged business broker dealer appropriate court place subje

relatedness=0.690


'2023 government forms general form registration securities pursuant section 12 b g omb approval omb number 32350064 expires pwfnber 3 202 estimated average burden hours per response 20038 united states securities exchange commission washington dc 20549 form 10 general form registration securities pursuant section 12 b g securities exchange act 1934 exact name registrant speci ed charter state jurisdiction incorporation organization irs employer identi cation address principal executive ൶ces registrant telephone number including area code zip code securities registered pursuant section 12 b act title class registered name exchange class registered securities registered pursuant section 12 g act title class title class indicate check mark whether registrant large accelerated ler accelerated ler nonaccelerated ler smaller reporting company emerging growth company see de nitions large accelerated ler accelerated ler smaller reporting company emerging growth company rule 12b2 exchange act 

relatedness=0.686


'2023 government forms notification late filing form 12b 25 notification late filing united states securities exchange commission washington dc 20549 read instruction back page preparing form please print type nothing form shall construed imply commission veriﬁed information contained herein persons respond collection information contained form required respond unless form displays currently valid omb control number check one form 10 k form 20 f form 11 k form 10 q form 10 form n cen form n csr period ended transition report form 10 k transition report form 20 f transition report form 11 k transition report form 10 q transition period ended notiﬁcation relates portion ﬁling checked identify item notiﬁcation relates part registrant information full name registrant former name applicable address principal executive office street number city state zip code part ii rules 12b 25 b c subject report could ﬁled without unreasonable effort expense registrant seeks relief pursuant rule 12b25 b f

relatedness=0.685


'2023 government forms notification regulation e may send completed printout form sec satisfy filing obligat ion satisfy sec filing obligation submitting information required form sec electronic format online https wwwedgarfilingsecgov united states securities exchange commission washington dc 20549 form 1e omb approval omb number 3235 0232 expires february 28 2025 estimated average burden hours per response 1000 notification regulation e item 1 issuer state exact name issuer address street city state principal business office item 2 affiliates principal security holders issuer list full name complete address following persons affiliate issuer indicating nature affiliation b person owns record known beneficially ten percent outstanding securities class issuer stating title amount owned person item 3 directors officers list full name complete residence address following persons director issuer b officer issuer indicating positions offices held issuer c investment adviser item 4 counsel 

## Ask the model

In [26]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2023 Government Forms to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
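    for string in strings:
        next_article = f'\n\nGovernment Article Section:\n"""\n{string}\n"""'
        # strings are ranked by relatedness, so stop at the first section
        # that would push the full message over the token budget
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        message += next_article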
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df2,
    model: str = GPT_MODEL,
    token_budget: int = 1000,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2023 Government Forms."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
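
`openai.ChatCompletion.create` is the interface of `openai` package releases before 1.0. Releases from 1.0 onward removed it in favor of a client object. Below is a minimal sketch of the same request against the newer interface, assuming `openai>=1.0` is installed and `client` is an `openai.OpenAI()` instance. The helper name `chat_answer` is hypothetical and not part of this notebook.

```python
def chat_answer(client, message: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one user message and return the reply text (openai>=1.0 style client)."""
    messages = [
        {"role": "system", "content": "You answer questions about the 2023 Government Forms."},
        {"role": "user", "content": message},
    ]
    # `client` is assumed to be an openai.OpenAI() instance
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # in openai>=1.0 the response is an object, so fields are read as attributes
    return response.choices[0].message.content
```

With a real client, this could be called as `chat_answer(OpenAI(), query_message(query, df, model=GPT_MODEL, token_budget=1000))`, which requires an API key in the environment.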

## Example Questions

In [27]:
ask('What is the eligibility requirement for using Form 1-A?')

'The eligibility requirement for using Form 1-A is that the form can be used for securities offerings made pursuant to Regulation A (17 CFR 230.251 et seq).'
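
An answer like the one above depends on which sections fit within `token_budget` (1000 by default in `ask`). Below is a self-contained sketch of the packing rule that `query_message` applies, with a whitespace word count standing in for the `tiktoken` count. The names `pack_sections` and `count_words` are hypothetical, and the word count is only an illustrative assumption.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_sections(
    introduction: str,
    sections: List[str],
    question: str,
    budget: int,
    count: Callable[[str], int] = count_words,
) -> str:
    """Append ranked sections until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f"\n\nSection:\n{section}"
        if count(message + candidate + question) > budget:
            break  # stop at the first overflow; later sections are not tried
        message += candidate
    return message + question


message = pack_sections(
    introduction="Use the sections below.",
    sections=["alpha beta gamma", "delta epsilon zeta eta theta", "iota"],
    question="\n\nQuestion: which form?",
    budget=13,
)
print(message)
```

With `budget=13`, the first section fits (11 words in total) and the second would overflow, so the loop stops there. The short third section is never considered, even though it would have fit. The ranking order therefore decides what the model sees, not section length.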

In [28]:
ask("Where was the UEFA Champions League final in 2023")

'I could not find an answer.'
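
The reply above is the fallback sentence that the prompt in `query_message` asks the model to write when the sections do not contain an answer. Below is a minimal sketch of how calling code might detect it. The names `NO_ANSWER` and `is_unanswered` are hypothetical, and because a model may paraphrase the sentence, the check is a heuristic rather than a guarantee.

```python
NO_ANSWER = "I could not find an answer."  # sentence requested in the prompt


def is_unanswered(answer: str) -> bool:
    """Return True if the reply is essentially the fallback sentence."""
    # tolerate surrounding whitespace, quotes, a missing period, and case changes
    normalized = answer.strip().strip('"\'').rstrip(".").lower()
    return normalized == NO_ANSWER.rstrip(".").lower()
```

A caller could use this to route unanswered questions elsewhere, for example `if is_unanswered(ask(question)): ...`.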

In [29]:
ask("What is the form for filing a semiannual report pursuant to Regulation A")

'The form for filing a semiannual report pursuant to Regulation A is Form 1SA.'

In [30]:
ask("What information does the Commission require on the Form 1-N")

"The Commission requires certain information regarding the operation of the security futures product exchange on the Form 1-N. This includes information about the exchange's operation, documents containing information satisfying the Commission's information requirements, and updates on certain information on a periodic basis."