# Embedding SEC forms for search

This notebook shows how we prepared a dataset of SEC forms for search, adapting the Wikipedia-article workflow that accompanies [Question_answering_using_embeddings.ipynb](Question_answering_using_embeddings.ipynb).

Procedure:

0. Prerequisites: Import libraries, set API key (if needed)
1. Collect: We load a prepared CSV of SEC forms that includes the text extracted from each form's PDF
2. Embed: Each form's extracted text is embedded with the OpenAI API
3. Store: Embeddings are saved in a CSV file (for large datasets, use a vector database)

## 0. Prerequisites

### Import libraries

In [3]:
# imports
import mwclient  # for downloading example Wikipedia articles
import mwparserfromhell  # for splitting Wikipedia articles into sections
import openai  # for generating embeddings
import pandas as pd  # for DataFrames to store article sections and embeddings
import re  # for cutting <ref> links out of Wikipedia articles
import tiktoken  # for counting tokens


Install any missing libraries with `pip install` in your terminal. E.g.,

```zsh
pip install openai
```

(You can also do this in a notebook cell with `!pip install openai`.)

If you install any libraries, be sure to restart the notebook kernel.

### Set API key (if needed)

Note that the OpenAI library will try to read your API key from the `OPENAI_API_KEY` environment variable. If you haven't already, set this environment variable by following [these instructions](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [11]:
import os
import getpass
import openai

# Prompt for the key at run time so that it is never written into the notebook
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
openai.api_key = os.getenv("OPENAI_API_KEY")
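
If the key might be missing or blank, a small guard can fail early with a clear message instead of at the first API call. The sketch below is illustrative only: `require_api_key` is a hypothetical helper, not part of the OpenAI library, and it accepts any mapping so it can be tried without touching the real environment.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before calling the API")
    return key
```

In the notebook this could be used as `openai.api_key = require_api_key()`.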

## 1. Collect documents

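In this example, we load a prepared dataset of SEC forms from `form_qa.csv`. Each row holds a form's metadata, the text extracted from its PDF (`pyPDF_extraction`), a token count, and questions and answers about the form.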

In [4]:
df = pd.read_csv('form_qa.csv')

In [5]:
df

Unnamed: 0,Number,Description,Last_Updated,SEC_Number,Topic,PDF_Link,pyPDF_extraction,tokens,questions,answers
0,1,Application for registration or exemption from...,Feb. 1999,SEC1935,Self-Regulatory Organizations,https://www.sec.gov/files/form1-e.pdf,You may not send a completed printout of this ...,1581,1) What is the purpose of Form 1-E mentioned i...,1.The purpose of Form 1-E mentioned in the tex...
1,1-A,Regulation A Offering Statement (PDF),Sept. 2021,SEC486,"Securities Act of 1933, Small Businesses",https://www.sec.gov/files/form1-k.pdf,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,2000,1. What is the purpose of Form 1-K?\n2. What a...,1.The purpose of Form 1-K is to provide an ann...
2,1-E,Notification under Regulation E (PDF),Aug. 2001,SEC1807,"Investment Company Act of 1940, Small Business...",https://www.sec.gov/files/form1-n.pdf,OMB APPROVAL OMB Number 3235 0554 Expires Febr...,2000,1. What is the purpose of Form 1-N?\n\n2. How ...,1.The purpose of Form 1-N is to serve as a not...
3,1-K,Annual Reports and Special Financial Reports (...,Sept. 2021,SEC2913,"Securities Act of 1933, Small Businesses",https://www.sec.gov/files/form1-sa.pdf,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,2000,1. What is the purpose of Form 1 SA?\n2. How o...,1.The purpose of Form 1 SA is to file semiannu...
4,1-N,Form and amendments for notice of registration...,Dec. 2013,SEC2568,"Securities Exchange Act of 1934, Self-Regulato...",https://www.sec.gov/files/form1-u.pdf,OMB APPROVAL OMB Number 3235 0722 Expires Dece...,2000,1. What is the purpose of Form 1-U?\n\n2. What...,1.The purpose of Form 1-U is to file a current...
...,...,...,...,...,...,...,...,...,...,...
144,X-17A-5 Part I,"FOCUS Report, Part I (PDF)",Apr-03,SEC1705,"Security-Based Swap Entities, Broker-Dealers",https://www.sec.gov/files/formx-17a-5_2c-instr...,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,1170,1. Who is required to file Part IIC of the FOC...,1.Firms that are regulated by a prudential reg...
145,X-17A-5 Part II,"FOCUS Report, Part II Instructions (PDF)",Oct. 2021,SEC1695A,"Security-Based Swap Entities, Broker-Dealers",https://www.sec.gov/files/formx-17a-5_2c.pdf,Form X 17A 5 FOCUS Report Part IIC Cover Page ...,2000,1. How many types of entities can file the For...,1.There are two types of entities that can fil...
146,X-17A-5 Part II,"FOCUS Report, Part II (PDF)",Oct. 2021,SEC1695,"Security-Based Swap Entities, Broker-Dealers",https://www.sec.gov/files/formx-17a-5_3.pdf,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,920,1. What information is required to be included...,1
147,X-17A-5 Part IIA,FOCUS Report Part IIa (PDF),Nov. 2018,SEC1696,"Security-Based Swap Entities, Broker-Dealers",https://www.sec.gov/files/formx-17a-5_schedi.pdf,OMB APPROVAL OMB Number 3235 0123 Expires Octo...,2000,,1.I dont know the answer


## 2. Embed document chunks

Now that we've loaded the extracted text for each form, we can compute an embedding for each string.

(For large embedding jobs, use a script like [api_request_parallel_processor.py](api_request_parallel_processor.py) to parallelize requests while throttling to stay under rate limits.)
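
For a small, sequential job like the one in this notebook, retrying a failed request after a growing delay is often enough to stay under rate limits. The following is a minimal sketch of exponential backoff with jitter, not the approach taken by the parallel processor script. `call_with_backoff` and its parameters are hypothetical names, and it catches every exception for brevity; in practice you would likely retry only on rate-limit errors.

```python
import random
import time


def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `request()` and retry on failure with exponential backoff plus jitter.

    `request` is any zero-argument callable. `sleep` is injectable so the
    waiting behavior can be tested without real delays.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

With the loop in this notebook, the request could be wrapped as `call_with_backoff(lambda: openai.Embedding.create(model=EMBEDDING_MODEL, input=batch))`.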

In [14]:
extracted_text = df['pyPDF_extraction'].tolist()
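
Empty cells in a CSV load as `NaN` (a float) in pandas, and such values, like blank strings, are unlikely to be accepted as embedding inputs. As a precaution, the sketch below lists the positions of entries that are not usable text. `find_unusable_texts` is a hypothetical helper written in plain Python for illustration.

```python
def find_unusable_texts(texts):
    """Return the positions of entries that are not non-empty strings."""
    return [
        i
        for i, text in enumerate(texts)
        if not isinstance(text, str) or not text.strip()
    ]
```

Calling `find_unusable_texts(extracted_text)` before the embedding loop returns an empty list when every row has text; any reported positions can be dropped or filled in first.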

In [15]:
# calculate embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
BATCH_SIZE = 1000  # you can submit up to 2048 embedding inputs per request

embeddings = []
for batch_start in range(0, len(extracted_text), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = extracted_text[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df1 = pd.DataFrame({"text": extracted_text, "embedding": embeddings})


Batch 0 to 999
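
Note that the progress message is computed from `BATCH_SIZE`, so the final batch reports an end index past the last item (999 here, although the DataFrame has 148 rows). One way to get accurate ranges is to clamp the end of each batch to the number of items. The sketch below assumes a hypothetical helper named `batch_ranges`.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) pairs covering range(n_items).

    `end` is exclusive and never exceeds `n_items`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)
```

The loop could then iterate with `for start, end in batch_ranges(len(extracted_text), BATCH_SIZE)` and slice `extracted_text[start:end]`.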


In [16]:
df1

Unnamed: 0,text,embedding
0,You may not send a completed printout of this ...,"[-0.005231186281889677, -0.002754298970103264,..."
1,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"[-0.012306703254580498, -0.0033481954596936703..."
2,OMB APPROVAL OMB Number 3235 0554 Expires Febr...,"[-0.02113576978445053, -0.003132110694423318, ..."
3,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"[-0.002609275048598647, -0.009246379137039185,..."
4,OMB APPROVAL OMB Number 3235 0722 Expires Dece...,"[-0.020553508773446083, -0.009975481778383255,..."
...,...,...
144,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"[-0.007044337224215269, -0.006628572475165129,..."
145,Form X 17A 5 FOCUS Report Part IIC Cover Page ...,"[-0.0035341274924576283, 0.003870225977152586,..."
146,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,"[-0.003206119639798999, -0.0044323778711259365..."
147,OMB APPROVAL OMB Number 3235 0123 Expires Octo...,"[-0.015941224992275238, -0.0038643358275294304..."


## 3. Store document chunks and embeddings

Because this example only uses 148 strings, we'll store them in a CSV file.

(For larger datasets, use a vector database, which will be more performant.)

In [18]:
# save document chunks and embeddings

SAVE_PATH = "form_embeddings.csv"

df1.to_csv(SAVE_PATH, index=False)
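
One thing to keep in mind when reading the file back: CSV stores each embedding as the text of a Python list, so it has to be parsed into numbers again before use. The sketch below shows this with the standard library only. `parse_embedding` and `load_rows` are hypothetical helpers, and the column names match the `text` and `embedding` columns of `df1`.

```python
import ast
import csv
import io


def parse_embedding(cell):
    """Convert a stringified list such as "[0.1, -0.2]" back into a list of floats."""
    return [float(value) for value in ast.literal_eval(cell)]


def load_rows(csv_file):
    """Read rows with "text" and "embedding" columns from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [
        {"text": row["text"], "embedding": parse_embedding(row["embedding"])}
        for row in reader
    ]


# Small in-memory example in the same layout that to_csv produces
sample = io.StringIO('text,embedding\n"hello","[0.1, -0.2, 3]"\n')
rows = load_rows(sample)
```

With pandas, the equivalent step is to call `pd.read_csv(SAVE_PATH)` and then apply `ast.literal_eval` to the `embedding` column.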
