![intro-to-vector-databases-queries](../assets/intro-vector-db-query.png)

---


### Learning objective:
By the end of this lesson, students will be able to load, split, and embed documents, then store and query them within a vector database.
 

### About:  
In this lesson, we introduce the concept of vector databases and apply them to question-and-answer chains.

### Prerequisites:
- Introduction to LangChain
- Advanced Prompt Development Module



### Contents
1. [Imports](#imports)
1. [Vector Database Workflow](#vec-db)
    1. [Load Documents](#load-docs)
    1. [Split Documents](#split-docs)
    1. [Embed and Store Documents](#embed-docs)
    1. [Vector Database Queries](#query) 

### Activities
1. [Try it!](#try-it)
1. [Lab](#lab)
    



### References

- [Chroma](https://www.trychroma.com/)
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
---


### Installs
pip install chromadb

<a id='vec-db'></a>
## Vector Databases 
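Vector databases store text embeddings created from documents and let you query them by similarity.
- They let you build language model applications on top of your own custom docs
- Creating and storing the embeddings can be computationally expensive, depending on the volume of your data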


### Key Steps
1. Load raw text data from a source (e.g., websites, articles, etc.)
2. Split the loaded text into smaller documents (chunks) for the language model
3. Embed the documents
    - As a reminder, text embeddings are numerical representations of how free text is used in your documents
    - You can use a third-party model to create your text embeddings (available through OpenAI, Hugging Face, Cohere, etc.) and create them using ```.embed_documents``` in [LangChain](https://python.langchain.com/docs/modules/data_connection/text_embedding/)
    - We will combine this step with creating the vector database using Chroma
4. Store the embeddings in a vector database (also called a vector store)
    - You can use a third-party database along with LangChain tooling for this purpose, including:
        - Chroma (what we will use)
        - FAISS
        - Lance
        - [source](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
5. Query the vector store as needed
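
To make the five steps concrete before we use real tooling, here is a minimal sketch in plain Python. It is a toy under stated assumptions: the `embed` function is a hypothetical word-count stand-in for a real embedding model, the "vector store" is just a Python list, and the chunks are hard-coded rather than loaded from a source. None of these names come from LangChain or Chroma.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A stand-in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Steps 1-2: pretend these chunks came from loading and splitting a source
chunks = [
    "romans built aqueducts and roads",
    "the senate governed the republic",
    "gladiators fought in the colosseum",
]

# Steps 3-4: embed each chunk and store (vector, text) pairs together
store = [(embed(chunk), chunk) for chunk in chunks]


def query_store(question, k=1):
    """Step 5: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


top_chunks = query_store("who built roads", k=1)
print(top_chunks)
```

A real embedding model captures meaning rather than exact word overlap, but the shape of the workflow is the same: the question is embedded with the same function as the documents, and the store returns the closest vectors.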





<a id='imports'></a>
### Imports

In [None]:
from langchain_openai import ChatOpenAI #openai chatbot
from langchain_core.prompts import ChatPromptTemplate #template for chat prompts
from langchain_core.output_parsers import StrOutputParser #output parser for string output 

# loaders and splitters 
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

#new imports 
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


<a id='load-docs'></a>
### Load Our Docs 
We will load text from the Wikipedia article on ancient Rome ([source](https://en.wikipedia.org/wiki/Ancient_Rome)).

In [None]:
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Ancient_Rome")
pages = loader.load()
print(pages)

<a id='split-docs'></a>
### Split Our Text and Docs 
Now we will split our text into smaller chunks for processing by the language model.

In [None]:
#split 
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
split_docs = text_splitter.split_documents(pages)
print(split_docs)
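
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below splits a short string with a fixed-width sliding window. This is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraphs, lines, and spaces. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width character splitter (simplified illustration only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)
```

Each chunk repeats the last two characters of the previous one. That overlap is what keeps a sentence that straddles a chunk boundary from losing its context entirely.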

<a id='embed-docs'></a>
### Create Text Embeddings and Store in Chroma Vector Database
Now we will create text embeddings using OpenAI embeddings through LangChain and store them in [Chroma](https://www.trychroma.com/).

In [None]:
db = Chroma.from_documents(split_docs, OpenAIEmbeddings())

<a id='query'></a>
### Query Database

There are a variety of search methods that we can use to query our vector database. Similarity search is the most common, and we will use it in our lab today with the ```.similarity_search()``` method.

#### Query Methods
1. [Similarity Search](https://python.langchain.com/docs/modules/data_connection/vectorstores/#similarity-search)
1. [Maximum Marginal Relevance Retrieval (MMR)](https://python.langchain.com/docs/modules/data_connection/vectorstores/#maximum-marginal-relevance-search-mmr)
1. [Top K](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore#specifying-top-k) 
1. Additional info: [Link](https://python.langchain.com/docs/modules/data_connection/vectorstores/#similarity-search-1)
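
The difference between plain similarity search (top K) and MMR is easiest to see on a tiny example. The sketch below is a plain-Python illustration on hand-made 2D vectors, not Chroma's or LangChain's implementation. The function names are hypothetical, and the `lambda_mult` parameter name is chosen only to echo the option of the same name in LangChain's MMR search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k):
    """Return indices of the k vectors most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query_vec, vectors[i]),
                    reverse=True)
    return ranked[:k]


def mmr(query_vec, vectors, k, lambda_mult=0.5):
    """Greedy MMR: trade off relevance to the query against redundancy
    with the results that were already selected."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, vectors[i])
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_vectors = [
    [1.0, 0.1],  # 0: closest to the query
    [1.0, 0.0],  # 1: near-duplicate of vector 0
    [0.2, 1.0],  # 2: less relevant, but different from 0 and 1
]
toy_query = [1.0, 0.2]

print(top_k(toy_query, toy_vectors, k=2))
print(mmr(toy_query, toy_vectors, k=2))
```

Top K returns the two near-duplicates, while MMR keeps the best match and then prefers the vector that adds something new. With `lambda_mult=1.0` the redundancy term vanishes and MMR behaves like plain top K.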



In [None]:
query = "What structures were built in ancient Rome?"
docs = db.similarity_search(query)
print(docs[0].page_content)

<a id='try-it'></a>
### Try it
Try writing your own query about ancient Rome.

<a id='lab'></a>
## Lab
We will use the meeting notes stored in my_docs. These notes were generated by GPT and will be loaded using the file directory loader in LangChain. [LangChain Docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory)

Objectives:

1. Load my_documents from the file folder, then split, embed, and store the documents in a Chroma vector database
1. Use a similarity search to find how revenue was discussed in our prior meetings

In [None]:
# load


In [None]:
#split with recursive character text splitter


In [None]:
# create a vector database called db_notes


In [None]:
# Similarity Search Query 
