![intro-to-vector-databases-queries](../assets/intro-vector-db-query.png)

---


### Learning objective:
By the end of this lesson, students will be able to load, split, and embed documents, then store and query them within a vector database.
 

### About:  
In this lesson, we introduce the concept of vector databases and apply them to question and answer chains. 

### Prerequisites:
- Introduction to LangChain
- Advanced Prompt Development Module



### Contents
1. [Imports](#imports)
1. [Vector Database Workflow](#vec-db)
    1. [Load Documents](#load-docs)
    1. [Split Documents](#split-docs)
    1. [Embed and Store Documents](#embed-docs)
    1. [Vector Database Queries](#query) 

### Activities
1. [Try it!](#try-it)
1. [Lab](#lab)
    



### References

- [Chroma](https://www.trychroma.com/)
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
---


### Installs
pip install chromadb

<a id='vec-db'></a>
## Vector Databases 
A vector database stores text embeddings created from your documents and lets you query them.
- Lets you build applications on your own custom docs
- Can be computationally expensive, depending on the volume of your data


### Key Steps
1. Load raw text data from a source (e.g., websites, articles).
2. Split the text into smaller documents (chunks) for the language model.
3. Embed the documents.
    - As a reminder, text embeddings are numerical representations of how text is used in your documents.
    - You can use a third-party model to create your text embeddings (available through OpenAI, Hugging Face, Cohere, etc.) and create them using ```.embed_documents``` in [LangChain](https://python.langchain.com/docs/modules/data_connection/text_embedding/).
    - We will combine this step with creating the vector database using Chroma.
4. Store the embeddings in a vector database (also called a vector store).
    - You can use a third-party database along with LangChain tooling for this purpose, including:
        - Chroma (what we will use)
        - FAISS
        - Lance
        - [source](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
5. Query the vector store as needed.
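
To make the idea of an embedding concrete, here is a minimal sketch that uses only the Python standard library. It is not how OpenAI or any other provider computes embeddings: `toy_embed` is a hypothetical helper that simply counts words over a fixed vocabulary, and the three sample sentences are made up for illustration. The point is only that once text becomes a vector, "similar text" can be measured with cosine similarity.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a vector of word counts over a fixed vocabulary.

    Toy stand-in for a learned embedding model (assumption for illustration).
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample texts: two about aqueducts, one about a meeting
texts = [
    "romans built aqueducts and roads",
    "aqueducts carried water to roman cities",
    "the meeting covered quarterly revenue",
]
vocabulary = sorted({word for text in texts for word in text.split()})
vectors = [toy_embed(text, vocabulary) for text in texts]

sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"aqueduct texts: {sim_related:.2f}, aqueducts vs. revenue: {sim_unrelated:.2f}")
```

A learned embedding model would also place "romans" and "roman" close together, which word counts cannot do. That is one reason real workflows rely on a trained model rather than counting words.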





<a id='imports'></a>
### Imports

In [1]:
from langchain_openai import ChatOpenAI #openai chatbot
from langchain_core.prompts import ChatPromptTemplate #template for chat prompts
from langchain_core.output_parsers import StrOutputParser #output parser for string output 

# loaders and splitters 
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

#new imports 
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


<a id='load-docs'></a>
### Load Our Docs 
We will load text from the Wikipedia article on ancient Rome ([source](https://en.wikipedia.org/wiki/Ancient_Rome)).

In [2]:
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Ancient_Rome")
pages = loader.load()
print(pages)

[Document(page_content='\n\n\n\nAncient Rome - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Early Italy and the founding of Rome\n\n\n\n\n\n\n\n2Kingdom\n\n\n\n\n\n\n\n3Republic\n\n\n\nToggle Republic subsection\n\n\n\n\n\n3.1Punic Wars\n\n\n\n\n\n\n\n\n\

<a id='split-docs'></a>
### Split Our Text and Docs 
Now we will split our text into smaller chunks for processing by the language model.

In [3]:
#split 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
split_docs = text_splitter.split_documents(pages)
print(split_docs)

[Document(page_content='Ancient Rome - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Early Italy and the founding of Rome\n\n\n\n\n\n\n\n2Kingdom\n\n\n\n\n\n\n\n3Republic\n\n\n\nToggle Republic subsection\n\n\n\n\n\n3.1Punic Wars\n\n\n\n\n\n\n\n\n\n4Late R
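
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraph breaks, newlines, and spaces first); `split_with_overlap` is a hypothetical helper that cuts fixed-width character windows so the overlap is easy to see.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears with some of its context in the next chunk.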

<a id='embed-docs'></a>
### Create Text Embeddings and Store in Chroma Vector Database
Now we will create text embeddings using OpenAI embeddings through LangChain and will store them in [Chroma](https://www.trychroma.com/).

In [4]:
db = Chroma.from_documents(split_docs, OpenAIEmbeddings(openai_api_key="...", model="text-embedding-3-large"))
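
Conceptually, a vector store keeps each chunk next to its embedding and ranks the chunks against an embedded query. The sketch below illustrates that idea with the standard library only. `ToyVectorStore`, `toy_embed`, and the sample sentences are hypothetical and are not part of Chroma or LangChain. A real store also adds persistence and indexing.

```python
import math
from collections import Counter


def toy_embed(text):
    """Sparse word-count vector; a toy stand-in for a learned embedding model."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs, ranks by cosine similarity."""

    def __init__(self):
        self._rows = []

    @classmethod
    def from_texts(cls, texts):
        store = cls()
        for text in texts:
            # Embed once at insert time and keep the vector next to the text
            store._rows.append((toy_embed(text), text))
        return store

    def similarity_search(self, query, k=4):
        # Embed the query the same way, then return the k closest texts
        query_vector = toy_embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore.from_texts([
    "The Colosseum and the Pantheon were monumental structures.",
    "Romulus killed Remus and became the sole founder.",
    "Aqueducts supplied fountains with fresh drinking water.",
])
results = store.similarity_search("Which monumental structures were built?", k=1)
print(results[0])
```

The `k` argument here plays the role of a top-k setting: it caps how many of the closest chunks come back from a query.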

<a id='query'></a>
### Query Database

There are a variety of search methods that we can use to query our vector database. Similarity search is common, and we will use it in our lab today with the ```.similarity_search()``` method.

#### Query Methods
1. [Similarity Search](https://python.langchain.com/docs/modules/data_connection/vectorstores/#similarity-search)
1. [Maximum Marginal Relevance Retrieval (MMR) ](https://python.langchain.com/docs/modules/data_connection/vectorstores/#maximum-marginal-relevance-search-mmr)
1. [Top K](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore#specifying-top-k) 
1. Additional info: [Link](https://python.langchain.com/docs/modules/data_connection/vectorstores/#similarity-search-1)
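
To illustrate how MMR differs from plain similarity search, here is a small standard-library sketch. It is one common formulation, not necessarily the exact code any library runs: at each step it picks the candidate with the best trade-off between relevance to the query and redundancy with what has already been selected. The vectors are made up, and `mmr_select` and its `lambda_mult` parameter are names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks by relevance only; lower values penalise
    candidates that resemble documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.3],    # relevant
    [1.0, 0.32],   # near-duplicate of the first document
    [1.0, -0.6],   # somewhat less relevant, but points in a different direction
]
by_similarity = sorted(
    range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
)[:2]
by_mmr = mmr_select(query_vec, doc_vecs, k=2)
print("top-2 by similarity:", by_similarity)  # [0, 1]
print("top-2 by MMR:", by_mmr)                # [0, 2]
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the document that adds something different.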



In [5]:
query = "What structures were built in ancient Rome?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Culture[edit]
Main article: Culture of ancient Rome
The seven hills of Rome
Life in ancient Rome revolved around the city of Rome, located on seven hills. The city had a vast number of monumental structures like the Colosseum, the Trajan's Forum and the Pantheon. It had theatres, gymnasiums, marketplaces, functional sewers, bath complexes complete with libraries and shops, and fountains with fresh drinking water supplied by hundreds of miles of aqueducts. Throughout the territory under the control of ancient Rome, residential architecture ranged from modest houses to country villas.


<a id='try-it'></a>
### Try it
Try writing your own query about ancient Rome.

In [6]:
query = "Who founded Rome?"
docs = db.similarity_search(query)
print(docs[0].page_content)

The Romans themselves had a founding myth, attributing their city to Romulus and Remus, offspring of Mars and a princess of the mythical city of Alba Longa.[7] The sons, sentenced to death, were rescued by a wolf and returned to restore the Alban king and found a city. After a dispute, Romulus killed Remus and became the city's sole founder. The area of his initial settlement on the Palatine Hill was later known as Roma Quadrata ("Square Rome"). The story dates at least to the third century, and the later Roman antiquarian Marcus Terentius Varro placed the city's foundation to 753 BC.[8] Another legend, recorded by Greek historian Dionysius of Halicarnassus, says that Prince Aeneas led a group of Trojans on a sea voyage to found a new Troy after the Trojan War. They landed on the banks of the Tiber River and a woman travelling with them, Roma, torched their ships to prevent them leaving again. They named the settlement after her.[9] The Roman poet Virgil recounted this legend in his


<a id='lab'></a>
## Lab
We will use the meeting notes stored in my_docs. These notes were generated by GPT and will be loaded using the file directory loader in LangChain. [LangChain Docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory)

Objectives:

1. Load the documents in the my_docs folder, then split, embed, and store them in a Chroma vector database
1. Use a similarity search to find how revenue was discussed in our prior meetings

In [7]:
# load
notes_loader = DirectoryLoader('assets/my_docs')
meeting_notes = notes_loader.load()
len(meeting_notes)

5

In [8]:
#split with recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
)
split_notes = text_splitter.split_documents(meeting_notes)
print(split_notes)

[Document(page_content='Meeting Notes - July 20, 2021\n\nDiscussed monthly sales targets and progress towards meeting them\n\nReviewed marketing campaigns and analyzed their effectiveness\n\nIdentified potential leads and discussed strategies for converting them into customers\n\nDiscussed customer complaints and identified areas for improvement in customer service\n\nAgreed on action items and assigned responsibilities for addressing the issues raised', metadata={'source': 'assets/my_docs/document_3.txt'}), Document(page_content='Meeting Notes - May 5, 2021\n\nReviewed customer feedback and identified areas for improvement\n\nBrainstormed ideas for new product features and enhancements\n\nAssigned development tasks to team members and set deadlines\n\nDiscussed marketing strategy to promote the new features\n\nAgreed on next steps and scheduled a demo session for the updated product', metadata={'source': 'assets/my_docs/document_2.txt'}), Document(page_content='Meeting Notes - March 1

In [10]:
# create a vector database called db_notes
db_notes = Chroma.from_documents(split_notes, OpenAIEmbeddings(openai_api_key="...", model="text-embedding-3-large"))

In [11]:
# Similarity Search Query 
query = "What was discussed about revenue?"
notes_docs = db_notes.similarity_search(query)
print(notes_docs[0].page_content)

Meeting Notes - November 30, 2021

Reviewed financial reports and analyzed the company's performance

Discussed cost-cutting measures to improve profitability

Brainstormed ideas for new revenue streams and market expansion

Discussed employee satisfaction and proposed initiatives for boosting morale

Agreed on next steps and scheduled a strategic planning session for the upcoming year.
