# 01 Embeddings and Vector Database

In this notebook, you will learn about embeddings and how to store, retrieve, and search them in a vector database.

### What are Embeddings?

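An embedding is a list of floating-point numbers (a vector) that represents a piece of text, such as a word, a sentence, or a document chunk. Embedding models are trained so that texts with similar meaning map to vectors that lie close together, which makes it possible to compare texts numerically.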
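
Closeness between two embeddings is commonly measured with cosine similarity. The sketch below is a minimal illustration using only the standard library. The three-dimensional vectors are invented for this example and do not come from any real embedding model, which would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these toy values, "cat" scores much closer to "kitten" than to "car", which is the property a real embedding model is trained to produce.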

The code cells below include an example of tokenization with tiktoken, which converts text into the token IDs that a model then maps to embeddings.



### LLMs vs compatible embeddings

Here are some other types of embeddings and their respective compatible models


### Tokens

### Converting inputs into Embeddings

In [None]:
# Example code for tokenization with tiktoken
# importlib.metadata must be imported explicitly; a bare "import importlib" does not load it
import importlib.metadata
from pathlib import Path

import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))



# Define the vocabulary size and embedding dimension
vocab_size = 10000
embedding_dim = 100
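
The two values above describe the shape of an embedding table: one row per token ID, with `embedding_dim` numbers in each row. The sketch below shows, under simplifying assumptions, how token IDs could be turned into vectors by row lookup and then averaged into one vector per text. The table is filled with random numbers (a trained model learns these values), the token IDs are hypothetical, and mean pooling is only one simple way to combine token vectors, not necessarily what any particular embedding model does.

```python
import random

# Same illustrative sizes as in the cell above
vocab_size = 10000
embedding_dim = 100

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One row per token ID; real models learn these values during training
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed_tokens(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


def mean_pool(vectors):
    """Average token vectors into a single fixed-size vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


token_ids = [17, 256, 9999]  # hypothetical IDs, each below vocab_size
token_vectors = embed_tokens(token_ids)
text_vector = mean_pool(token_vectors)
print(len(token_vectors), len(text_vector))
```

Whatever the number of input tokens, the pooled vector always has `embedding_dim` entries, which is what allows texts of different lengths to be compared.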

In [None]:
filepath = Path("../../backend/assets/test/copypasta.txt")
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")

text = filepath.read_text()

enc.encode(text)

In [12]:
from pathlib import Path

import pymupdf
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.milvus import Milvus

# Load the PDF document and extract the text from every page.
# The context manager closes the document when the block ends.
pdf_path = Path("../../backend/assets/test/test.pdf")
with pymupdf.open(pdf_path) as pdf:
    docs = "".join(page.get_text() for page in pdf)

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
all_splits = text_splitter.split_text(docs)
len(all_splits)

19
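
The `chunk_size` and `chunk_overlap` parameters used above control how long each chunk is and how many characters neighbouring chunks share. The sketch below illustrates the idea with a simplified fixed-width splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(no_overlap)
print(with_overlap)
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a sentence that answers a question straddles a chunk boundary, at the cost of storing more chunks.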

In [None]:
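import textwrap

from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()


# Use the "gpt-3.5-turbo-instruct" model for text completions
def generate_response_with_chatgpt(prompt):
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # Choose an appropriate completions model
        prompt=prompt,
        max_tokens=150,
        n=1,
        stop=None,
        temperature=0.7,
    )
    return response.choices[0].text.strip()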


# filename = "national-capitals.pdf"
# pdf_text = extract_text_from_pdf(filename)
# Assumption: reuse the text extracted from test.pdf in the earlier cell
pdf_text = docs

print("Ready - ask questions or exit with q/Q:")
while True:
    user_query = input("==> ")
    if user_query.lower().strip() == "q":
        break
    prompt = pdf_text + "\n\n" + user_query
    response = generate_response_with_chatgpt(prompt)
    print("Response:\n")
    for line in textwrap.wrap(response, width=70):
        print(line)
    print("-" * 10)

## Storing Embeddings in Vector Databases

Instead of loading a document directly through the ChatUI, you may want to build RAG (retrieval-augmented generation) prompts from an existing, maintained dataset.
This dataset may contain post-processed real-time data that you have gathered using other systems.

To do that, we need a vector database. One of the most common applications used to introduce the concept of a vector database is OpenSearch.
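
At its core, a vector database stores vectors alongside their source text and returns the stored items whose vectors are closest to a query vector. The toy class below sketches that idea with a brute-force scan in plain Python. `TinyVectorStore` is a hypothetical name, the two-dimensional vectors are hand-made, and real systems such as OpenSearch or Milvus use approximate indexes rather than scanning every item.

```python
import math


class TinyVectorStore:
    """A toy in-memory stand-in for a vector database (hypothetical, for illustration)."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=2):
        """Return the k stored texts whose vectors are most similar to query_vector."""
        scored = [
            (self._cosine(query_vector, vector), text)
            for text, vector in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm


store = TinyVectorStore()
# Hand-made 2-dimensional vectors; a real system would get these from an embedding model
store.add("Paris is the capital of France", [0.9, 0.1])
store.add("Berlin is the capital of Germany", [0.8, 0.3])
store.add("Bananas are rich in potassium", [0.1, 0.9])

results = store.search([1.0, 0.0], k=2)
print(results)
```

The brute-force scan costs one similarity computation per stored item, which is why dedicated vector databases build indexes once a collection grows large.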

## Milvus

For LLM applications, we will introduce Milvus.

There are various methods of deploying Milvus. In this scenario, we will use Milvus Lite.

[Learn more here](https://milvus.io/docs/install-overview.md)

In [None]:
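embeddings = OpenAIEmbeddings()

# Assumption: Milvus Lite stores its data in a local file, so the URI is a file path.
# Both values below are placeholders; a remote Milvus server would also need a token.
URI = "./milvus_demo.db"
COLLECTION_NAME = "embeddings_demo"
connection_args = {"uri": URI}

# all_splits is a list of strings, so use from_texts rather than from_documents.
# from_texts is a class method, so there is no need to construct a Milvus instance first.
vector_store = Milvus.from_texts(
    all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args=connection_args,
    drop_old=True,
)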
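
Once the chunks are stored, a RAG prompt is usually built from the few chunks most similar to the question, rather than from the entire document text. The sketch below shows one possible way to assemble such a prompt in plain Python. The function name `build_rag_prompt`, the prompt wording, and the character budget are all assumptions made for illustration; in LangChain the chunks would typically come from a call such as `vector_store.similarity_search(query, k=...)`.

```python
def build_rag_prompt(question, retrieved_chunks, max_context_chars=2000):
    """Join retrieved chunks into a context block and append the question."""
    context_parts = []
    used = 0
    for chunk in retrieved_chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before the context grows past the budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks, standing in for results returned by a vector store search
chunks = [
    "Milvus is an open-source vector database.",
    "Milvus Lite runs inside a Python process and stores data in a local file.",
]
prompt = build_rag_prompt("What is Milvus Lite?", chunks)
print(prompt)
```

Capping the context size keeps the prompt within the model's input limit, which matters more as the stored collection grows.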