# Embedding & Retrieval

© Advanced Analytics, Amir Ben Haim, 2024

<br>
<br>
<hr class="dotted">
<br>
<br>

## <u>We'll cover</u>

- File input - use PDF files as inputs to the OpenAI API
- Vector Stores - with Retrieval
- Vector Embedding & Vector DB

<br>
<br>
<hr class="dotted">
<br>
<br>

## Setup

<br></br>

### <u>API Keys</u>

To use OpenAI models through the API, you first need to generate an API key.
<br></br>
<u>Follow these steps to generate an API key with OpenAI:</u>
- Go to <a href="https://platform.openai.com/apps">https://platform.openai.com/apps</a> and sign up with your email address or connect your Google account.
- Go to View API Keys on the left side of your Personal Account Settings
- Select Create new secret key
- The API access to OPENAI is a paid service
- You have to set up billing
- You don’t need ChatGPT Plus - The API and ChatGPT subscriptions are billed separately
<br></br>
<p style="background-color:Tomato"> Make sure you read the Pricing information before experimenting</p>
<p style="background-color:Tomato">Once you add your API key, make sure to not share it with anyone! The API key should remain private</p>
<p style="background-color:Tomato">Use the <code>.env</code> file for your API key</p>

<br></br>

### <u>pip install</u>

```powershell
pip install openai
pip install python-dotenv
pip install PyPDF2
pip install chromadb
```

<br></br>

### <u>API Key Setup</u>

Before calling the OpenAI API, load your API key from the `.env` file:

In [1]:
from openai import OpenAI

In [3]:
from dotenv import load_dotenv
import os

load_dotenv()  # Loads variables from .env

openai_key = os.getenv("OPENAI_API_KEY")
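
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` may only surface later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of `python-dotenv` or the OpenAI SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example with a placeholder variable (hypothetical name, not a real key)
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```

In the notebook itself, the same check could be applied to `OPENAI_API_KEY` right after `load_dotenv()`.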

<br>
<br>
<hr class="dotted">
<br>
<br>

## File inputs - UPDATED (with `responses` api)

<p style="font-size:25px"><u>Use PDF files as inputs to the OpenAI API</u></p>

- OpenAI models with vision capabilities can also accept PDF files as input

- Provide PDFs either as **Base64-encoded data** or as **file IDs** obtained after uploading files to the `/v1/files` endpoint

- To help models understand PDF content, both the extracted text and images are used as input

- File size limitations - You can upload up to 100 pages and 32MB of total content in a single request to the API, across multiple file inputs

- Supported models - Only models that support both text and image inputs, such as gpt-4o, gpt-4o-mini, or o1, can accept PDF files as input
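
The size limits above can be checked locally before sending a request. The sketch below is only an illustration: it assumes the limits apply to the totals across all files in one request and that 1 MB means 1024 * 1024 bytes, and the helper name `within_request_limits` is hypothetical:

```python
MAX_PAGES = 100                      # page limit quoted above
MAX_TOTAL_BYTES = 32 * 1024 * 1024   # 32 MB, assuming 1 MB = 1024 * 1024 bytes


def within_request_limits(files: list[tuple[int, int]]) -> bool:
    """files holds one (page_count, size_in_bytes) pair per PDF in the request."""
    total_pages = sum(pages for pages, _ in files)
    total_bytes = sum(size for _, size in files)
    return total_pages <= MAX_PAGES and total_bytes <= MAX_TOTAL_BYTES


# One small PDF: the page count is assumed for illustration, the size is 99,884 bytes
print(within_request_limits([(2, 99_884)]))
```

The page count would have to come from a PDF library such as PyPDF2 and the size from `os.path.getsize`.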

<br></br>

### <u>Uploading files</u>

In the example below, we first upload a PDF using the Files API

In [5]:
client = OpenAI()

file = client.files.create(
    file=open("Basic_Woodworking_Course_Syllabus.pdf", "rb"),  # "rb" means read binary mode. Used to read files like PDFs or images
    purpose="user_data"
)

file

FileObject(id='file-SLeVeDWT9jbgjZKAH14xD9', bytes=99884, created_at=1750839934, filename='Basic_Woodworking_Course_Syllabus.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None)

In [6]:
type(file)

openai.types.file_object.FileObject

In [7]:
file.id

'file-SLeVeDWT9jbgjZKAH14xD9'

In [8]:
file.filename

'Basic_Woodworking_Course_Syllabus.pdf'

In [9]:
file.status

'processed'

<br></br>

### <u>API request to the model</u>

Reference the file ID in an API request to the model

In [10]:
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {"role": "user",
        "content":
                [{"type": "input_file","file_id": file.id},
                {"type": "input_text", "text": "What is the instructor name?"}]
        }
    ]
)

print(response.output_text)

The instructor's name is **Michael Carpenter**.


<br></br>

### <u>Where is the file saved?</u>

- Uploaded to OpenAI's servers
- Associated with your OpenAI account [Storage](https://platform.openai.com/storage/files)
- Not saved locally; the file object is just a Python reference to metadata

<p style="font-size:25px"><u>To list your uploaded files</u></p>

In [11]:
files = client.files.list()

for f in files.data:
    print(f.id)
    print(f.filename)
    print()

file-SLeVeDWT9jbgjZKAH14xD9
Basic_Woodworking_Course_Syllabus.pdf

file-4taMwcgHqkNZ6G7jtvA525
Basic_Woodworking_Course_Syllabus.pdf

file-WpJftGKNAwWuGPvBJPuwFV
Basic_Woodworking_Course_Syllabus.pdf

file-HF1dngdc9u2eEbgwcPy2yC
Basic_Woodworking_Course_Syllabus.pdf

file-DNQF2rKBJkuaPhk2WH7WyF
Basic_Woodworking_Course_Syllabus.pdf

file-7vBDyJFQ5wLAVNq7i8tVhy
Basic_Woodworking_Course_Syllabus.pdf

file-RgKy1bGsp1jZpqUvT7HsJD
Python Developer job Description.pdf

file-2bLPsbpuhAo6DYDPc522cE
step_metrics.csv

file-HoXyN1vHy4Lhmo2pnN4Q7T
hotel_review_sentiment_training.jsonl

file-3nzAfdeoMr3wefpXYu65TJ
ecommerce_complaints_training.jsonl

file-9U3oiMqWpGEwvcWNHJ4aZu
hotel_review_sentiment_test.jsonl

file-2FJ9Z5ygMKPobC3VJx48WE
ecommerce_complaints_test.jsonl



In [12]:
files.data

[FileObject(id='file-SLeVeDWT9jbgjZKAH14xD9', bytes=99884, created_at=1750839934, filename='Basic_Woodworking_Course_Syllabus.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None),
 FileObject(id='file-4taMwcgHqkNZ6G7jtvA525', bytes=99884, created_at=1750839581, filename='Basic_Woodworking_Course_Syllabus.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None),
 FileObject(id='file-WpJftGKNAwWuGPvBJPuwFV', bytes=99884, created_at=1750839464, filename='Basic_Woodworking_Course_Syllabus.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None),
 FileObject(id='file-HF1dngdc9u2eEbgwcPy2yC', bytes=99884, created_at=1750839399, filename='Basic_Woodworking_Course_Syllabus.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None),
 FileObject(id='file-DNQF2rKBJkuaPhk2WH7WyF', bytes=99884, created_at=1750839318, filename='

<p style="font-size:25px"><u>To delete a file</u></p>

In [13]:
client.files.delete(file.id)

FileDeleted(id='file-SLeVeDWT9jbgjZKAH14xD9', deleted=True, object='file')

In [14]:
files = client.files.list()

for f in files.data:
    print(f.id)
    print(f.filename)
    print()

file-4taMwcgHqkNZ6G7jtvA525
Basic_Woodworking_Course_Syllabus.pdf

file-WpJftGKNAwWuGPvBJPuwFV
Basic_Woodworking_Course_Syllabus.pdf

file-HF1dngdc9u2eEbgwcPy2yC
Basic_Woodworking_Course_Syllabus.pdf

file-DNQF2rKBJkuaPhk2WH7WyF
Basic_Woodworking_Course_Syllabus.pdf

file-7vBDyJFQ5wLAVNq7i8tVhy
Basic_Woodworking_Course_Syllabus.pdf

file-RgKy1bGsp1jZpqUvT7HsJD
Python Developer job Description.pdf

file-2bLPsbpuhAo6DYDPc522cE
step_metrics.csv

file-HoXyN1vHy4Lhmo2pnN4Q7T
hotel_review_sentiment_training.jsonl

file-3nzAfdeoMr3wefpXYu65TJ
ecommerce_complaints_training.jsonl

file-9U3oiMqWpGEwvcWNHJ4aZu
hotel_review_sentiment_test.jsonl

file-2FJ9Z5ygMKPobC3VJx48WE
ecommerce_complaints_test.jsonl



<br></br>

### <u>API request to the model - with Base64-encoded</u>

You can also send a PDF to the model as Base64-encoded data instead of a file ID

<u><b>What Is Base64 Encoding?</b></u>
<br>
Base64 is a way to convert binary data (like a PDF) into a text string that can safely be embedded in JSON or HTML.
<br>
This allows APIs that accept only JSON inputs to receive file contents.
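
To make the idea concrete, here is a small round trip on a few made-up bytes (a stand-in for real PDF content). It shows that the encoded form is plain ASCII text, that decoding restores the original bytes, and that the text is roughly a third longer than the binary input:

```python
import base64

raw = b"%PDF-1.7 demo bytes \x00\xff"   # stand-in for real PDF content

encoded = base64.b64encode(raw).decode("ascii")   # bytes -> ASCII text
decoded = base64.b64decode(encoded)               # text -> original bytes

print(encoded)
print(decoded == raw)
print(len(raw), len(encoded))
```

Every 3 input bytes become 4 output characters, which is worth keeping in mind when working close to a request size limit.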

In [15]:
import base64


with open("Basic_Woodworking_Course_Syllabus.pdf", "rb") as f:  # "rb" means read binary mode. Used to read files like PDFs or images
    data = f.read()

base64_string = base64.b64encode(data).decode("utf-8")

In [16]:
base64_string

'JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbikgL1N0cnVjdFRyZWVSb290IDI0IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vTWV0YWRhdGEgMTAxIDAgUi9WaWV3ZXJQcmVmZXJlbmNlcyAxMDIgMCBSPj4NCmVuZG9iag0KMiAwIG9iag0KPDwvVHlwZS9QYWdlcy9Db3VudCAyL0tpZHNbIDMgMCBSIDIxIDAgUl0gPj4NCmVuZG9iag0KMyAwIG9iag0KPDwvVHlwZS9QYWdlL1BhcmVudCAyIDAgUi9SZXNvdXJjZXM8PC9Gb250PDwvRjEgNSAwIFIvRjIgMTIgMCBSL0YzIDE0IDAgUi9GNCAxOSAwIFI+Pi9FeHRHU3RhdGU8PC9HUzEwIDEwIDAgUi9HUzExIDExIDAgUj4+L1Byb2NTZXRbL1BERi9UZXh0L0ltYWdlQi9JbWFnZUMvSW1hZ2VJXSA+Pi9NZWRpYUJveFsgMCAwIDU5NS4zMiA4NDEuOTJdIC9Db250ZW50cyA0IDAgUi9Hcm91cDw8L1R5cGUvR3JvdXAvUy9UcmFuc3BhcmVuY3kvQ1MvRGV2aWNlUkdCPj4vVGFicy9TL1N0cnVjdFBhcmVudHMgMD4+DQplbmRvYmoNCjQgMCBvYmoNCjw8L0ZpbHRlci9GbGF0ZURlY29kZS9MZW5ndGggMjcxNz4+DQpzdHJlYW0NCnicxVtLiyQ3Er439H/I48xCa/R+QJFQmTVtvNgH44E9GB+Msee0w9r7/2H1zlCmVJ10SazxUNVZylQoFPHFp4jI6dPP//nt23S5fPpx/f424U8//Pbt6/Thj28v3//wcZ6n5bZOfz0/YYTdf1orMuFJGIEYnTQnyNDp7z+en/71j+nb89Py5fnp0yuZiEBGTl/+fH5yo/Fkr1CFyKQER2b68m8

In [17]:
type(base64_string)

str

In [18]:
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {"role": "user",
        "content":
                [{"type": "input_file", "filename": "Basic_Woodworking_Course_Syllabus.pdf", "file_data": f"data:application/pdf;base64,{base64_string}"},
                {"type": "input_text", "text": "What is opening date?"}]
        }
    ]
)

print(response.output_text)

The opening date (start date) of the Basic Woodworking Course - Beginner Level is **Monday, May 6, 2024**.


<br>
<br>
<hr class="dotted">
<br>
<br>

## File inputs - (with `chat.completions` api)

<br></br>

### <u>Extract PDF Data</u>

Extract actual readable text from the PDF before sending it

In [19]:
from PyPDF2 import PdfReader

# Extract text from each page, skipping pages with no extractable text
reader = PdfReader("Basic_Woodworking_Course_Syllabus.pdf")
page_texts = (page.extract_text() for page in reader.pages)
text = "\n".join(t for t in page_texts if t)

In [20]:
print(text)

Basic Woodworking Course - Beginner Level  
 
 Course Overview  
This hands -on course introduces you to the fundamentals of woodworking. 
Learn how to safely use  tools, work with wood, and build small personal 
projects with confidence.  
 
What You Will Learn  
1. Introduction to Woodworking - Types of wood, understanding grain, 
moisture, and hardness  
 
2. Workshop Safety - Personal protective equipment (PPE), safe handling 
of tools  
 
3. Basic Hand Tools - Handsaws, chisels, hammers, measuring tools  
 
4. Power Tools - Drills, jigsaws, sanders (introductory level)  
 
5. Wood Joinery - Butt joints, lap joints, dowels, screws  
 
6. Project Building - Guided construction of a small stool, shelf, or toolbox  
 
7. Sanding & Finishing - Surface prep and applying finishes (oil, varnish, 
etc.) 
 
8. Final Project - Build your own small piece with instructor support  
 
Instructor Details  
Name: Michael Carpenter  
Age: 52  
Experience: 25+ years as a master woodworker and certif

In [21]:
type(text)

str

<br></br>

### <u>API request to the model</u>

In [22]:
response = client.chat.completions.create(
    model="gpt-4.1",
    messages = [
        {"role": "user", "content": f"{text}\n\nWhat is the instructor's name?"}]
)

print(response.choices[0].message.content)

The instructor's name is Michael Carpenter.


<br>
<br>
<hr class="dotted">
<br>
<br>

## Vector Stores - UPDATED (with `responses` api)

- Vector stores are the containers that power semantic search for the Retrieval API and the Assistants API file search tool

- When you add a file to a vector store it will be automatically **chunked, embedded, and indexed**

- Vector stores contain `vector_store_file` objects, which are backed by a `file` object

- <u>Object Type</u>
  - `file` - Represents content uploaded through the **Files API**
  - `vector_store` - Container for searchable files
  - `vector_store.file` - Wrapper type specifically representing a file that has been chunked and embedded

- Limits - The maximum file size is 512 MB, and each file can contain at most 5,000,000 tokens (computed automatically when you attach the file)
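
To picture what "chunked" means, the sketch below splits a token list into fixed-size windows that overlap with their predecessor. It is an illustration only, not OpenAI's implementation: whitespace-separated words stand in for model tokens, and the function name `chunk_tokens` is hypothetical. The outputs later in this notebook report `max_chunk_size_tokens=800` and `chunk_overlap_tokens=400`; tiny values are used here so the result is easy to read:

```python
def chunk_tokens(tokens: list[str], max_chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of max_chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = "a b c d e f g h i j".split()   # whitespace words stand in for model tokens
for chunk in chunk_tokens(words, max_chunk_size=4, overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.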

<br></br>

- Supported Files:

| File Format | MIME Type                                                       |
|-------------|------------------------------------------------------------------|
| .c          | text/x-c                                                         |
| .cpp        | text/x-c++                                                       |
| .cs         | text/x-csharp                                                    |
| .css        | text/css                                                         |
| .doc        | application/msword                                               |
| .docx       | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| .go         | text/x-golang                                                    |
| .html       | text/html                                                        |
| .java       | text/x-java                                                      |
| .js         | text/javascript                                                  |
| .json       | application/json                                                 |
| .md         | text/markdown                                                    |
| .pdf        | application/pdf                                                  |
| .php        | text/x-php                                                       |
| .pptx       | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| .py         | text/x-python / text/x-script.python                             |
| .rb         | text/x-ruby                                                      |
| .sh         | application/x-sh                                                 |
| .tex        | text/x-tex                                                       |
| .ts         | application/typescript                                           |
| .txt        | text/plain                                                       |



<br></br>

### <u>Vector store operations</u>

#### **Create vector store**

In [23]:
# Create file
file = client.files.create(
    file=open("Basic_Woodworking_Course_Syllabus.pdf", "rb"),  # "rb" means read binary mode. Used to read files like PDFs or images
    purpose="user_data"
)


# Create a vector store and attach the file
client.vector_stores.create(
    name="Courses",
    file_ids=[file.id]
)

VectorStore(id='vs_685bb3421dc081918c114228e6db28de', created_at=1750840130, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=1, total=1), last_active_at=1750840130, metadata={}, name='Courses', object='vector_store', status='in_progress', usage_bytes=0, expires_after=None, expires_at=None)

#### **List vector stores**

In [24]:
client.vector_stores.list()

SyncCursorPage[VectorStore](data=[VectorStore(id='vs_685bb3421dc081918c114228e6db28de', created_at=1750840130, file_counts=FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1), last_active_at=1750840130, metadata={}, name='Courses', object='vector_store', status='completed', usage_bytes=3060, expires_after=None, expires_at=None)], has_more=False, object='list', first_id='vs_685bb3421dc081918c114228e6db28de', last_id='vs_685bb3421dc081918c114228e6db28de')

In [25]:
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

vs_685bb3421dc081918c114228e6db28de
Courses



#### **Retrieve vector store**

In [26]:
client.vector_stores.retrieve(
    vector_store_id=v.id
)

VectorStore(id='vs_685bb3421dc081918c114228e6db28de', created_at=1750840130, file_counts=FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1), last_active_at=1750840130, metadata={}, name='Courses', object='vector_store', status='completed', usage_bytes=3060, expires_after=None, expires_at=None)

#### **Update vector store**

In [27]:
client.vector_stores.update(
    vector_store_id=v.id,
    name="Courses updated"
)

VectorStore(id='vs_685bb3421dc081918c114228e6db28de', created_at=1750840130, file_counts=FileCounts(cancelled=0, completed=1, failed=0, in_progress=0, total=1), last_active_at=1750840149, metadata={}, name='Courses updated', object='vector_store', status='completed', usage_bytes=3060, expires_after=None, expires_at=None)

In [28]:
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

vs_685bb3421dc081918c114228e6db28de
Courses



#### **Delete vector store**

In [29]:
client.vector_stores.delete(
    vector_store_id=v.id
)

VectorStoreDeleted(id='vs_685bb3421dc081918c114228e6db28de', deleted=True, object='vector_store.deleted')

In [31]:
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

<br></br>

### <u>Vector store file operations</u>

- Some operations, like `create` for `vector_store.file`, are <u>**asynchronous**</u> and may take time to complete
- Helper functions such as `create_and_poll` block until the operation has finished

<u>Synchronous API</u>

- Waits for the response before moving on
- Blocking
- Example: `response = api.call()` (waits for the result)

<u>Asynchronous API</u>

- Sends request and moves on
- Non-blocking
- Uses callbacks, promises, or async/await
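
The blocking behaviour of a helper like `create_and_poll` can be pictured as a polling loop around an asynchronous operation. This is a generic sketch, not the SDK's implementation; `poll_until_done` and `fetch_status` are hypothetical names, and the `in_progress` and `completed` status strings are the ones that appear in the outputs in this notebook:

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], str], interval: float = 0.5, timeout: float = 30.0) -> str:
    """Call fetch_status repeatedly until it stops returning 'in_progress' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status != "in_progress":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("operation still in progress after timeout")
        time.sleep(interval)


# Simulated operation that completes on the third check
statuses = iter(["in_progress", "in_progress", "completed"])
print(poll_until_done(lambda: next(statuses), interval=0.01))
```

With a real client, `fetch_status` would wrap a retrieve call and return its `status` field.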

#### **Create vector store file**

In [32]:
# Create a new, empty vector store
# (the previous one was deleted above)
client.vector_stores.create(
    name="Courses",
)

VectorStore(id='vs_685bb381b1848191a3d7b1441a33c814', created_at=1750840193, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), last_active_at=1750840193, metadata={}, name='Courses', object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None)

In [33]:
# Get vector_stores id
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

vs_685bb381b1848191a3d7b1441a33c814
Courses



In [34]:
# Create vector store file
client.vector_stores.files.create_and_poll(
    vector_store_id=v.id,
    file_id=file.id
)

VectorStoreFile(id='file-SYjmvqvwEqo6QnJdPsp1mf', created_at=1750840209, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

#### **Upload vector store file**

In [35]:
client.vector_stores.files.upload_and_poll(
    vector_store_id=v.id,
    file=open("Basic_Woodworking_Course_Syllabus.pdf", "rb")
)

VectorStoreFile(id='file-HcTg5sE1sPiJSEjDySC2yZ', created_at=1750840260, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

#### **Retrieve vector store file**

In [36]:
client.vector_stores.files.retrieve(
    vector_store_id=v.id,
    file_id=file.id
)

VectorStoreFile(id='file-SYjmvqvwEqo6QnJdPsp1mf', created_at=1750840209, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

#### **Update vector store file**

<u>Attributes</u>

- Each `vector_store.file` can have associated `attributes`, a dictionary of values that can be referenced when performing **semantic search** with **attribute filtering**
- The dictionary can have at most 16 keys, with a limit of 256 characters each
- You can update a file's `attributes`, or set them when adding the file to the `vector_store`
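
A quick local check against the limits above can catch mistakes before the API call. The sketch assumes that "256 characters each" applies to both keys and string values, which is an interpretation rather than something this notebook confirms; `validate_attributes` is a hypothetical helper name:

```python
MAX_KEYS = 16      # limit quoted above
MAX_CHARS = 256    # limit quoted above, applied here to keys and string values (assumption)


def validate_attributes(attributes: dict) -> list[str]:
    """Return a list of problems; an empty list means the dictionary passes these checks."""
    problems = []
    if len(attributes) > MAX_KEYS:
        problems.append(f"too many keys: {len(attributes)} > {MAX_KEYS}")
    for key, value in attributes.items():
        if len(str(key)) > MAX_CHARS:
            problems.append(f"key too long: {str(key)[:20]}...")
        if isinstance(value, str) and len(value) > MAX_CHARS:
            problems.append(f"value too long for key {key!r}")
    return problems


print(validate_attributes({"Country": "USA", "Category": "Marketing"}))
```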

In [37]:
client.vector_stores.files.update(
    vector_store_id=v.id,
    file_id=file.id,
    attributes={"Country": "USA",
                "Category":"Marketing"} 
)

VectorStoreFile(id='file-SYjmvqvwEqo6QnJdPsp1mf', created_at=1750840209, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={'Country': 'USA', 'Category': 'Marketing'}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

In [38]:
file_attr = client.vector_stores.files.retrieve(
    vector_store_id=v.id,
    file_id=file.id
)

file_attr.attributes

{'Country': 'USA', 'Category': 'Marketing'}

#### **List vector store files**

In [39]:
# List each vector store and its file count
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print(v.file_counts.total)
    print("\n\n")

vs_685bb381b1848191a3d7b1441a33c814
Courses
2





In [40]:
# File count of the last vector store listed (here, the only one)
v.file_counts.total

2

In [41]:
# Get all files from the vector store
x = client.vector_stores.files.list(
    vector_store_id=v.id
)

In [42]:
x.data

[VectorStoreFile(id='file-HcTg5sE1sPiJSEjDySC2yZ', created_at=1750840260, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')),
 VectorStoreFile(id='file-SYjmvqvwEqo6QnJdPsp1mf', created_at=1750840209, last_error=None, object='vector_store.file', status='completed', usage_bytes=3060, vector_store_id='vs_685bb381b1848191a3d7b1441a33c814', attributes={'Country': 'USA', 'Category': 'Marketing'}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))]

In [43]:
# Get all files from the vector store
file_from_vector_store= client.vector_stores.files.list(
    vector_store_id=v.id
)


for fiv in file_from_vector_store:
    print(fiv.id)

file-HcTg5sE1sPiJSEjDySC2yZ
file-SYjmvqvwEqo6QnJdPsp1mf


#### **Delete vector store file**

In [44]:
client.vector_stores.files.delete(
    vector_store_id=v.id,
    file_id=fiv.id
)

VectorStoreFileDeleted(id='file-SYjmvqvwEqo6QnJdPsp1mf', deleted=True, object='vector_store.file.deleted')

In [45]:
# Get all files from the vector store
file_from_vector_store= client.vector_stores.files.list(
    vector_store_id=v.id
)


for fiv in file_from_vector_store:
    print(fiv.id)

file-HcTg5sE1sPiJSEjDySC2yZ


In [46]:
# Refresh `v` so it reflects the deletion, then check the file count
for v in client.vector_stores.list():
    pass  # after the loop, `v` is the last (here, the only) vector store

v.file_counts.total

1

<br></br>

### <u>Retrieval</u>

- Search your data using semantic similarity (semantic search)

- Get semantically similar results even when they match few or no keywords

![Vector Search Flow](retrieval_diagram.png)

<br></br>
<br></br>

#### **Setup**

<br>

For a clean start, we'll delete every `vector_store` and then create a new one with our example file

In [47]:
# before Deleting vector_store
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

vs_685bb381b1848191a3d7b1441a33c814
Courses



In [48]:
# Deleting all vector_store
for v in client.vector_stores.list():
    client.vector_stores.delete(
        vector_store_id=v.id
    )

In [49]:
# After Deleting vector_store
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

<br>

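For a clean start, we'll also delete all uploaded `files`. Note that this removes every file stored in the account, not only the PDFs uploaded in this notebook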

In [50]:
# before Deleting files
files = client.files.list()

for f in files.data:
    print(f.id)
    print(f.filename)
    print()

file-HcTg5sE1sPiJSEjDySC2yZ
Basic_Woodworking_Course_Syllabus.pdf

file-SYjmvqvwEqo6QnJdPsp1mf
Basic_Woodworking_Course_Syllabus.pdf

file-4taMwcgHqkNZ6G7jtvA525
Basic_Woodworking_Course_Syllabus.pdf

file-WpJftGKNAwWuGPvBJPuwFV
Basic_Woodworking_Course_Syllabus.pdf

file-HF1dngdc9u2eEbgwcPy2yC
Basic_Woodworking_Course_Syllabus.pdf

file-DNQF2rKBJkuaPhk2WH7WyF
Basic_Woodworking_Course_Syllabus.pdf

file-7vBDyJFQ5wLAVNq7i8tVhy
Basic_Woodworking_Course_Syllabus.pdf

file-RgKy1bGsp1jZpqUvT7HsJD
Python Developer job Description.pdf

file-2bLPsbpuhAo6DYDPc522cE
step_metrics.csv

file-HoXyN1vHy4Lhmo2pnN4Q7T
hotel_review_sentiment_training.jsonl

file-3nzAfdeoMr3wefpXYu65TJ
ecommerce_complaints_training.jsonl

file-9U3oiMqWpGEwvcWNHJ4aZu
hotel_review_sentiment_test.jsonl

file-2FJ9Z5ygMKPobC3VJx48WE
ecommerce_complaints_test.jsonl



In [51]:
# Deleting files
for f in files.data:
    client.files.delete(f.id)

In [52]:
# After Deleting files
files = client.files.list()

for f in files.data:
    print(f.id)
    print(f.filename)
    print()

<br>

Creating a `vector_store` and a `file`

In [53]:
# Create file
file = client.files.create(
    file=open("Basic_Woodworking_Course_Syllabus.pdf", "rb"),  # "rb" means read binary mode. Used to read files like PDFs or images
    purpose="user_data"
)


# Create a vector store and attach the file
client.vector_stores.create(
    name="Courses",
    file_ids=[file.id]
)

VectorStore(id='vs_685bb470fa90819188bc0f7380b07c06', created_at=1750840432, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=1, total=1), last_active_at=1750840432, metadata={}, name='Courses', object='vector_store', status='in_progress', usage_bytes=0, expires_after=None, expires_at=None)

In [58]:
# Get vector_stores id
for v in client.vector_stores.list():
    print(v.id)
    print(v.name)
    print()

vs_685bb470fa90819188bc0f7380b07c06
Courses



In [55]:
v.file_counts.total

1

**Ready to go**

#### **Semantic search**

- A technique that leverages vector embeddings to surface semantically relevant results

- Includes results with few or no shared keywords, which classical search techniques might miss

- Semantic search is powered by `vector_store`

- A response contains at most 10 results by default; you can raise this to as many as 50 with the `max_num_results` parameter

In [59]:
# Perform search query to get results
results = client.vector_stores.search(
    vector_store_id=v.id,
    query="How many different subjects, we will learn in the course ?",
)

In [60]:
results

SyncPage[VectorStoreSearchResponse](data=[VectorStoreSearchResponse(attributes={}, content=[Content(text='Basic Woodworking Course - Beginner Level \n\n \n\n Course Overview \n\nThis hands-on course introduces you to the fundamentals of woodworking. \n\nLearn how to safely use tools, work with wood, and build small personal \n\nprojects with confidence. \n\n \n\nWhat You Will Learn \n\n1. Introduction to Woodworking - Types of wood, understanding grain, \n\nmoisture, and hardness \n\n \n\n2. Workshop Safety - Personal protective equipment (PPE), safe handling \n\nof tools \n\n \n\n3. Basic Hand Tools - Handsaws, chisels, hammers, measuring tools \n\n \n\n4. Power Tools - Drills, jigsaws, sanders (introductory level) \n\n \n\n5. Wood Joinery - Butt joints, lap joints, dowels, screws \n\n \n\n6. Project Building - Guided construction of a small stool, shelf, or toolbox \n\n \n\n7. Sanding & Finishing - Surface prep and applying finishes (oil, varnish, \n\netc.) \n\n \n\n8. Final Project 

In [61]:
results.to_dict()

{'data': [{'attributes': {},
   'content': [{'text': 'Basic Woodworking Course - Beginner Level \n\n \n\n Course Overview \n\nThis hands-on course introduces you to the fundamentals of woodworking. \n\nLearn how to safely use tools, work with wood, and build small personal \n\nprojects with confidence. \n\n \n\nWhat You Will Learn \n\n1. Introduction to Woodworking - Types of wood, understanding grain, \n\nmoisture, and hardness \n\n \n\n2. Workshop Safety - Personal protective equipment (PPE), safe handling \n\nof tools \n\n \n\n3. Basic Hand Tools - Handsaws, chisels, hammers, measuring tools \n\n \n\n4. Power Tools - Drills, jigsaws, sanders (introductory level) \n\n \n\n5. Wood Joinery - Butt joints, lap joints, dowels, screws \n\n \n\n6. Project Building - Guided construction of a small stool, shelf, or toolbox \n\n \n\n7. Sanding & Finishing - Surface prep and applying finishes (oil, varnish, \n\netc.) \n\n \n\n8. Final Project - Build your own small piece with instructor support

#### **Query rewriting**

- Some query styles yield better results than others, so the API can automatically rewrite your query before searching
- Enable this by passing `rewrite_query=True` when performing a search

In [62]:
# Perform search query to get results
results = client.vector_stores.search(
    vector_store_id=v.id,
    query="How many different subjects, we will learn in the course ?",
    rewrite_query=True
)

#### **Synthesizing responses**

- After running a search, you will usually want to synthesize an answer from the results

- To do so, pass the search results and the original query to a model, which returns a response grounded in those results

In [63]:
# Perform search query to get results

user_query = "How many different subjects, we will learn in the course ?"

results = client.vector_stores.search(
    vector_store_id=v.id,
    query=user_query,
    rewrite_query=True
)

In [64]:
# Define a format_results FUNCTION
def format_results(results):
    formatted_results = ''
    for result in results:
        formatted_result = f"<result file_id='{result.file_id}'>"
        for part in result.content:
            formatted_result += f"<content>{part.text}</content>"
        formatted_results += formatted_result + "</result>"
    return f"<sources>{formatted_results}</sources>"

In [65]:
# Use FUNCTION
formatted_results = format_results(results.data)

formatted_results

"<sources><result file_id='file-4MyEPKVwfntyPgEfjdVnok'><content>Basic Woodworking Course - Beginner Level \n\n \n\n Course Overview \n\nThis hands-on course introduces you to the fundamentals of woodworking. \n\nLearn how to safely use tools, work with wood, and build small personal \n\nprojects with confidence. \n\n \n\nWhat You Will Learn \n\n1. Introduction to Woodworking - Types of wood, understanding grain, \n\nmoisture, and hardness \n\n \n\n2. Workshop Safety - Personal protective equipment (PPE), safe handling \n\nof tools \n\n \n\n3. Basic Hand Tools - Handsaws, chisels, hammers, measuring tools \n\n \n\n4. Power Tools - Drills, jigsaws, sanders (introductory level) \n\n \n\n5. Wood Joinery - Butt joints, lap joints, dowels, screws \n\n \n\n6. Project Building - Guided construction of a small stool, shelf, or toolbox \n\n \n\n7. Sanding & Finishing - Surface prep and applying finishes (oil, varnish, \n\netc.) \n\n \n\n8. Final Project - Build your own small piece with instruc

In [66]:
type(formatted_results)

str

In [67]:
# Plain-text alternative to formatted_results (not used in the request below)
plain_text_sources = '\n'.join('\n'.join(c.text for c in result.content) for result in results.data)


completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Produce a concise answer to the query based on the provided sources."
        },
        {
            "role": "user",
            "content": f"Sources: {formatted_results}\n\nQuery: '{user_query}'"
        }
    ],
)

print(completion.choices[0].message.content)

You will learn eight different subjects in the course.


<br>
<br>
<hr class="dotted">
<br>
<br>

## Vector Embedding & Vector DB - UPDATED

<br></br>

### <u>Vector embeddings</u>

- OpenAI’s text embeddings measure the relatedness of text strings
  - `text-embedding-3-small` - Small embedding model
  - `text-embedding-3-large` - Large embedding model
  - `text-embedding-ada-002` - Older embedding model

- An embedding is a vector (list) of floating point numbers

- The distance between two vectors measures their relatedness
  - Small distances suggest high relatedness
  - Large distances suggest low relatedness
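
The text above does not say which distance is meant; cosine distance is one common choice and is used here as an assumption. The sketch compares made-up 3-dimensional vectors (not real embeddings) to show that vectors pointing in similar directions have a small distance:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Smaller values mean the vectors are more closely related."""
    return 1.0 - cosine_similarity(a, b)


# Toy 3-dimensional vectors (made up for illustration, not real embeddings)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
car = [0.0, 0.1, 0.9]

print(cosine_distance(cat, kitten))   # small: similar direction
print(cosine_distance(cat, car))      # large: nearly orthogonal
```

The same functions work unchanged on the lists returned by the embeddings endpoint.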

#### **Get embeddings**

Send your text string to the **embeddings API endpoint** along with the embedding model name

In [68]:
response = client.embeddings.create(
    input="Let's test embedding here",
    model="text-embedding-3-small"
)

vector = response.data[0].embedding
print(vector)

[-0.004876371938735247, -0.02793268673121929, 0.03972188010811806, -0.03569341450929642, -0.001951289246790111, -0.011722546070814133, 0.027651287615299225, 0.02634795941412449, -0.006068620830774307, 0.022097332403063774, 0.041321419179439545, -0.008227257989346981, -0.008930758573114872, -0.02640720084309578, -0.023060018196702003, 0.015062323771417141, 0.0017448673024773598, 0.031072523444890976, -0.004261734429746866, 0.025666674599051476, 0.01766897924244404, -0.04250626266002655, 0.013136953115463257, 0.0355156846344471, 0.015299293212592602, -0.035100992769002914, 0.009093674831092358, 0.045616477727890015, 0.050178125500679016, -0.02442258782684803, 0.009397290647029877, -0.03335334733128548, -0.005579872522503138, -0.007968072779476643, -0.037352193146944046, 0.021164268255233765, 0.020986542105674744, 0.051777664572000504, -0.060782477259635925, 0.01802443340420723, 0.036907877773046494, 0.003369398880749941, -0.04049202799797058, 0.011122719384729862, -0.00550581980496645, 0

In [69]:
type(vector)

list

In [70]:
len(vector)

1536

In [71]:
response

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.004876371938735247, -0.02793268673121929, 0.03972188010811806, -0.03569341450929642, -0.001951289246790111, -0.011722546070814133, 0.027651287615299225, 0.02634795941412449, -0.006068620830774307, 0.022097332403063774, 0.041321419179439545, -0.008227257989346981, -0.008930758573114872, -0.02640720084309578, -0.023060018196702003, 0.015062323771417141, 0.0017448673024773598, 0.031072523444890976, -0.004261734429746866, 0.025666674599051476, 0.01766897924244404, -0.04250626266002655, 0.013136953115463257, 0.0355156846344471, 0.015299293212592602, -0.035100992769002914, 0.009093674831092358, 0.045616477727890015, 0.050178125500679016, -0.02442258782684803, 0.009397290647029877, -0.03335334733128548, -0.005579872522503138, -0.007968072779476643, -0.037352193146944046, 0.021164268255233765, 0.020986542105674744, 0.051777664572000504, -0.060782477259635925, 0.01802443340420723, 0.036907877773046494, 0.003369398880749941, -0.04049202799797

In [72]:
type(response)

openai.types.create_embedding_response.CreateEmbeddingResponse

In [73]:
response.usage

Usage(prompt_tokens=5, total_tokens=5)

In [74]:
response.model

'text-embedding-3-small'

- The response contains the embedding vector
- The response contains additional metadata
- You can extract the embedding vector and save it in a vector database
- By default, the length of the embedding vector is `1536` for `text-embedding-3-small` or `3072` for `text-embedding-3-large`
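
The points above can be wrapped in a small helper. This is a sketch, not part of the original notebook: `embed_texts` is a hypothetical name, and `client` is assumed to be an OpenAI client like the one used in the cells above. It also shows the optional `dimensions` argument, which asks a `text-embedding-3` model for vectors shorter than the default length.

```python
def embed_texts(client, texts, model="text-embedding-3-small", dimensions=None):
    """Return one embedding (list of floats) per input string.

    `client` is assumed to be an OpenAI client (hypothetical helper).
    `dimensions` optionally requests shorter vectors from text-embedding-3 models.
    """
    kwargs = {"input": texts, "model": model}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions
    response = client.embeddings.create(**kwargs)
    # Sort by index so the output order matches the input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Example usage (requires a valid API key, so it is left commented out):
# vectors = embed_texts(client, ["first text", "second text"], dimensions=256)
# print(len(vectors), len(vectors[0]))
```

Shorter vectors take less storage in a vector database, at the cost of some accuracy.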

<br></br>

### <u>Vector DB</u>

#### **Chroma**

- Chroma is an AI-native open-source vector database

- You can run it on your machine

- You can use a hosted version - Chroma Cloud

<br>

- **In Python, you can run a Chroma server in-memory and connect to it with the ephemeral client**
  - This is a great tool for experimenting with different embedding functions and retrieval techniques in a Python notebook
  - If you don't need data persistence, the ephemeral client is a good choice for getting up and running with Chroma

<u>We'll use Chroma's **ephemeral client** for simplicity</u>

- It starts a Chroma server in-memory, <u>so any data you ingest will be lost when your program terminates</u>

- You can use the **persistent client** or run Chroma in **client-server** mode <u>if you need data persistence</u>

- `chromadb.EphemeralClient()` starts Chroma in-memory and returns a client connected to it; the `chromadb.Client()` call used below also creates an in-memory client when called with its default settings



#### **Create a Chroma Client**

In [75]:
import chromadb
chroma_client = chromadb.Client()

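If this cell raises `ModuleNotFoundError: No module named 'chromadb'`, install the package first with `pip install chromadb` (see the Setup section) and rerun the cell.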

#### **Create a collection**

- Collections are where you'll store your embeddings, documents, and any additional metadata

- Collections index your embeddings and documents, and enable efficient retrieval and filtering

- You can create a collection with a name

In [None]:
collection = chroma_client.create_collection(name="my_collection")

In [None]:
collection

In [None]:
type(collection)

#### **Add some text documents to the collection**

- Chroma will store your text and handle embedding and indexing automatically

- You can also customize the embedding model

- **You must provide unique string IDs for your documents**

In [None]:
collection.add(
    documents=[
        "This is a document about apples",
        "This is a document about cats"
    ],
    ids=["id1", "id2"]
)
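
The IDs in the cell above were written by hand. For larger document sets, one possible approach is to derive each ID from the document text, as in the sketch below. The `make_ids` helper and the `doc` prefix are hypothetical names introduced here for illustration; they are not part of Chroma.

```python
import hashlib

def make_ids(documents, prefix="doc"):
    """Build one deterministic string ID per document from a hash of its text."""
    ids = []
    for text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        ids.append(f"{prefix}-{digest}")
    # Identical texts hash to the same ID, which would violate uniqueness
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate documents would produce duplicate IDs")
    return ids

docs = ["This is a document about apples", "This is a document about cats"]
ids = make_ids(docs)
print(ids)  # could be passed as ids=ids to collection.add(...)
```

Because the IDs depend only on the text, re-running the helper on the same documents gives the same IDs.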

#### **Query the collection**

- You can query the collection with a list of query texts

- Chroma will return the n most similar results

- If `n_results` is not provided, Chroma will return **10** results by default

- Here we only added **2** documents, so we set `n_results=2`

In [None]:
results = collection.query(
    query_texts=["This is a query document about fruits"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

print(results)

In [None]:
results

#### **Inspect Results**

<u>**Very Important**</u>

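- By default, **Chroma uses squared L2 (Euclidean) distance**; cosine distance applies only when the collection is configured for it (for example, by creating it with `metadata={"hnsw:space": "cosine"}`, depending on your Chroma version)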

- Cosine similarity produces a value between -1 and 1

- **Cosine similarity IS NOT A DISTANCE, but a similarity**

<br>

- Chroma converts it to a distance by using:
  - **cosine distance = 1 - cosine similarity**
  - **cosine distance ranges from 0 to 2**
    - 0 = identical
    - 1 = unrelated
    - 2 = opposite meaning
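
The three reference points of cosine distance can be checked with a few lines of plain Python. This is a minimal sketch using two-dimensional vectors picked by hand; it does not call Chroma.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

v = [1.0, 0.0]
same_direction = cosine_distance(v, [2.0, 0.0])   # identical direction
orthogonal = cosine_distance(v, [0.0, 3.0])       # unrelated
opposite = cosine_distance(v, [-1.0, 0.0])        # opposite direction
print(same_direction, orthogonal, opposite)
```

Note that vector length does not matter here: only the direction of the vectors affects the result.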

From the query above, you can see that our query about `fruits` is semantically closest to the document about `apples`
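
The result of `collection.query()` is a dictionary of nested lists, with one inner list per query text. The sketch below pairs each returned ID with its document and distance. The `rows_for_first_query` helper is a hypothetical name, and the distance values in `example_results` are invented for illustration rather than copied from a real run.

```python
def rows_for_first_query(results):
    """Pair up ids, documents, and distances for the first query text."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    dists = results["distances"][0]
    return list(zip(ids, docs, dists))

# Hypothetical result in the shape returned by collection.query();
# the distance values are invented for illustration.
example_results = {
    "ids": [["id1", "id2"]],
    "documents": [[
        "This is a document about apples",
        "This is a document about cats",
    ]],
    "distances": [[1.04, 1.55]],
}

rows = rows_for_first_query(example_results)
for doc_id, doc, dist in rows:
    print(f"{doc_id}  {dist:.2f}  {doc}")
```

Results are ordered from the smallest distance to the largest, so the first row is the best match.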

#### **Use Chroma with OpenAI's Embedding models**

In [None]:
# Strings to embed
text = ["some string A", "some string B", "some string C"]

# Get one embedding per string from OpenAI
response = client.embeddings.create(
    input=text,
    model="text-embedding-3-small"
)

# Extract all embeddings into a list (one vector per input string)
embeddings = [d.embedding for d in response.data]

# Create the collection, or reuse it if it already exists
collection = chroma_client.get_or_create_collection(name="my_collection_openai")

# Store the strings and their embeddings in Chroma
collection.add(
    documents=text,
    embeddings=embeddings,
    ids=["A", "B", "C"]    # one unique ID per string
)

In [None]:
embeddings
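
Because this collection was filled with OpenAI vectors, a query should be embedded with the same model and passed as `query_embeddings`. Passing `query_texts` instead would use the collection's own embedding function, whose vectors are likely to have a different length. The sketch below assumes the `client` and `collection` objects from the cells above; `query_with_openai` is a hypothetical helper name.

```python
def query_with_openai(client, collection, question, n_results=2,
                      model="text-embedding-3-small"):
    """Embed `question` with OpenAI, then search `collection` by vector."""
    response = client.embeddings.create(input=question, model=model)
    query_vector = response.data[0].embedding
    return collection.query(
        query_embeddings=[query_vector],  # vectors, not query_texts
        n_results=n_results,
    )

# Example usage (needs the `client` and `collection` objects defined earlier,
# plus a valid API key, so it is left commented out):
# results = query_with_openai(client, collection, "some string like A")
# print(results["documents"][0])
```

The model name must match the one used when the documents were embedded, otherwise the distances are not meaningful.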

#### **Why use Chroma with OpenAI's Embedding models?**

- Chroma (the open-source library/server) is free and open source

- If you use a cloud service or managed version, there may be costs

- In recent versions, Chroma automatically calls a default embedding model for you if you have not ***provided embeddings explicitly***

- That embedding is computed on the client side by the Python library (the HTTP/server API does not do it for you), and it is designed for convenience/testing

- The default is an open-source model (`all-MiniLM-L6-v2` from Sentence Transformers), which is smaller and simpler than OpenAI's embedding models