# Vectorstores and Embeddings

In [81]:
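import openai

# Placeholder only: load the real key from an environment variable or a
# secrets manager instead of hardcoding it in the notebook.
openai.api_key = "xxx"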

In [47]:
# !pip install -U langchain-community

In [48]:
# !pip install pypdf

Loading and Parsing PDF Documents

In [49]:
from langchain.document_loaders import PyPDFLoader

# Load PDFs
pdf_loaders = [
    # Two different source PDFs (messy, real-world data); no file is listed twice
    PyPDFLoader("/content/sample_data/sample_textbook.pdf"),
    PyPDFLoader("/content/sample_data/crime_data.pdf")
]
document_list = []
for pdf_loader in pdf_loaders:
    document_list.extend(pdf_loader.load())

Configuring Text Splitter for Document Segmentation

In [50]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100
)

In [51]:
document_chunks = doc_splitter.split_documents(document_list)

In [52]:
len(document_chunks)

31
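
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks are usually shorter than `chunk_size`. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample: 1500 characters of repeating lowercase letters
sample_text = "".join(chr(ord("a") + i % 26) for i in range(1500))
chunks = sliding_window_chunks(sample_text)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [600, 600, 500]
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-100:] == chunks[1][:100])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.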

## Embeddings
Let's take our splits and embed them.

In [82]:
import os
os.environ["OPENAI_API_KEY"] = "xxx"

Generating Embeddings Using OpenAIEmbeddings

In [54]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings()

Embedding Sample Sentences for Similarity Testing

In [55]:
query1 = "The stock market experienced a significant surge."
query2 = "The financial markets saw a major rise in value."
query3 = "The weather today is extremely cold and rainy."

In [56]:
# !pip install tiktoken

In [57]:
embedding1 = embedding_model.embed_query(query1)
embedding2 = embedding_model.embed_query(query2)
embedding3 = embedding_model.embed_query(query3)

In [58]:
import numpy as np

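Calculating Cosine Similarity Between Embeddings

OpenAI embeddings are normalized to unit length, so the plain dot product used below equals the cosine similarity.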

In [59]:
np.dot(embedding1, embedding2)

np.float64(0.9367377828059265)

In [60]:
np.dot(embedding1, embedding3)

np.float64(0.730183633491168)

In [61]:
np.dot(embedding2, embedding3)

np.float64(0.7200210410390958)
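
To see why a dot product can stand in for cosine similarity once vectors have unit length, here is a minimal sketch in plain Python. The toy vectors and the helper names (`dot`, `normalize`, `cosine_similarity`) are hypothetical and chosen for illustration; they are not real embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors that are NOT unit length (each has length 5)
toy_a = [3.0, 4.0, 0.0]
toy_b = [4.0, 3.0, 0.0]

raw_dot = dot(toy_a, toy_b)                           # depends on vector length
cos_sim = cosine_similarity(toy_a, toy_b)             # length-independent
unit_dot = dot(normalize(toy_a), normalize(toy_b))    # matches cos_sim

print(raw_dot, cos_sim, unit_dot)
```

For vectors that are not normalized, the raw dot product mixes direction and magnitude, so it should not be read as a cosine score.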

## Vectorstores

In [62]:
# !pip install chromadb

In [63]:
from langchain.vectorstores import Chroma

In [64]:
vectorstore_directory = 'docs/chroma/'

In [69]:
!rm -rf ./docs/chroma  # remove any old database files; path matches vectorstore_directory

In [72]:
vector_database = Chroma.from_documents(
    documents=document_chunks,
    embedding=embedding_model,
    persist_directory=None  # Use None for in-memory store
)

In [73]:
print(vector_database._collection.count())

31


Performing Similarity Search for Query

In [74]:
user_query = "What are the latest advancements in AI technologies?"
results = vector_database.similarity_search(user_query, k=3)

In [75]:
len(results)

3

In [76]:
results[0].page_content

"and mental health outcomes of children and adolescents. Dev Psychopathol \n2009; 21(1):227-259. \n2. Takagi D, Ken'ichi I, Kawachi I. Neighborhood social capital and crime"
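
The top hit above is a fragment of a reference list, not an answer about AI. A top-k search returns the k nearest chunks even when none of them is actually relevant. The sketch below illustrates this with a brute-force search over toy 2-D vectors. It assumes cosine similarity as the score (the actual store may use a different distance), and `top_k`, `min_score`, and the vectors are hypothetical names and values, not part of the Chroma API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=3, min_score=None):
    """Return up to k (score, index) pairs, best first.

    min_score is an optional cutoff (an assumption for illustration):
    hits scoring below it are dropped instead of being returned.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    hits = scored[:k]
    if min_score is not None:
        hits = [(score, idx) for score, idx in hits if score >= min_score]
    return hits


# Toy corpus: most vectors point away from the query direction
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.7, 0.7]]
toy_query = [0.0, 1.0]

# Without a cutoff, k results always come back, however weak the match
print([idx for _, idx in top_k(toy_query, toy_docs, k=3)])                 # [3, 2, 1]
# With a cutoff, only the reasonably close document survives
print([idx for _, idx in top_k(toy_query, toy_docs, k=3, min_score=0.5)])  # [3]
```

Inspecting scores alongside the returned documents is one way to notice when a query falls outside what the index covers.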

In [77]:
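# Note: persist() is only meaningful for a store created with a persist_directory.
# This store was created with persist_directory=None (in-memory). Chroma 0.4+ also
# persists automatically when a directory is set.
vector_database.persist()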

## Failure modes
Basic similarity search is easy to set up and handles many queries well.

But some failure modes can creep in.

Here are some edge cases that can arise; we'll fix them in the next class.

Testing Similarity Search with Another Query

In [78]:
new_query = "How do quantum computers function?"
retrieved_docs = vector_database.similarity_search(new_query, k=5)

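When the same PDF is indexed more than once, the results contain duplicate chunks, with the first two hits identical.

Semantic search fetches the most similar documents, but does not enforce diversity.

No file is loaded twice in this notebook, so the complete metadata entries below are four different pages of sample_textbook.pdf rather than identical entries. They still show a failure: none of the retrieved pages is about quantum computing, yet the search returns k results anyway.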
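
One common way to enforce diversity is maximal marginal relevance (MMR), which trades relevance to the query against similarity to results already chosen. The sketch below is a minimal pure-Python illustration on toy vectors, not the notebook's method. `mmr_select`, `lambda_mult`, and the vectors are hypothetical names and values. LangChain vector stores offer a `max_marginal_relevance_search` method for the same purpose, but its internals may differ from this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, penalizing similarity to earlier picks."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Highest similarity to anything already selected (0 if nothing yet)
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


toy_docs = [
    [1.0, 0.0],  # doc 0
    [1.0, 0.0],  # doc 1: exact duplicate of doc 0
    [0.0, 1.0],  # doc 2: different content, somewhat less relevant
]
toy_query = [0.8, 0.6]

# Plain similarity ranking returns the duplicate pair
plain_top2 = sorted(
    range(len(toy_docs)), key=lambda i: cosine(toy_query, toy_docs[i]), reverse=True
)[:2]
print(plain_top2)                              # [0, 1]
# MMR skips the duplicate in favor of a different document
print(mmr_select(toy_query, toy_docs, k=2))    # [0, 2]
```

With `lambda_mult` close to 1 the selection approaches plain similarity ranking; lower values weight diversity more heavily.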

In [79]:
for doc in retrieved_docs:
    print(doc.metadata)

{'creationdate': '', 'creator': 'PyPDF', 'page': 3, 'page_label': '4', 'producer': 'Prince 20150210 (www.princexml.com)', 'source': '/content/sample_data/sample_textbook.pdf', 'title': 'Anatomy of the Somatosensory System', 'total_pages': 4}
{'creationdate': '', 'creator': 'PyPDF', 'page': 0, 'page_label': '1', 'producer': 'Prince 20150210 (www.princexml.com)', 'source': '/content/sample_data/sample_textbook.pdf', 'title': 'Anatomy of the Somatosensory System', 'total_pages': 4}
{'creationdate': '', 'creator': 'PyPDF', 'page': 2, 'page_label': '3', 'producer': 'Prince 20150210 (www.princexml.com)', 'source': '/content/sample_data/sample_textbook.pdf', 'title': 'Anatomy of the Somatosensory System', 'total_pages': 4}
{'creationdate': '', 'creator': 'PyPDF', 'page': 1, 'page_label': '2', 'producer': 'Prince 20150210 (www.princexml.com)', 'source': '/content/sample_data/sample_textbook.pdf', 'title': 'Anatomy of the Somatosensory System', 'total_pages': 4}
{'creationdate': '', 'creator': 

Displaying Content of a Retrieved Document

In [80]:
print(retrieved_docs[4].page_content)

sidenotes are shown in the outside
margin (on the left or right, depending
on whether the page is left or right).
Also, figures are floated to the top/
bottom of the page. Wide content, like
the table and Figure 3, intrude into the
outside margins.
or polymodal receptors. Polymodal receptors respond not
only to intense mechanical stimuli, but also to heat and
to noxious chemicals. These receptors respond to minute
punctures of the epithelium, with a response magnitude
that depends on the degree of tissue deformation. They al-
so respond to temperatures in the range of 40–60°C, and
