# Vector Store as Retriever
* Find the text chunks whose embeddings are most similar to your question.

## Setup

#### After you download the code from the GitHub repository to your computer
In terminal:
* cd project_name
* pyenv local 3.11.4
* poetry install
* poetry shell

#### To open the notebook with Jupyter Notebooks
In terminal:
* jupyter lab

Go to the folder of notebooks and open the right notebook.

#### To see the code in Visual Studio Code or your editor of choice
* open Visual Studio Code or your editor of choice.
* open the project-folder
* open the 005-retrievers.py file

## Create your .env file
* In the GitHub repo we have included a file named .env.example
* Rename that file to .env. This is where you will add your confidential API keys. Remember to include:
* OPENAI_API_KEY=your_openai_api_key
* LANGCHAIN_TRACING_V2=true
* LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
* LANGCHAIN_API_KEY=your_langchain_api_key
* LANGCHAIN_PROJECT=your_project_name

We will call our LangSmith project **005-retrievers**.

## Track operations
From now on, we can track the operations **and the cost** of this project from LangSmith:
* [smith.langchain.com](https://smith.langchain.com)

## Connect with the .env file located in the same directory as this notebook

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install python-dotenv

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
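
Reading the key with `os.environ["OPENAI_API_KEY"]` raises a bare `KeyError` when the variable is missing. As a minimal sketch, a small helper can fail with a more readable message instead. The helper name `require_env` and the placeholder variable `DEMO_API_KEY` are assumptions made for illustration; they are not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file before running the notebook."
        )
    return value


# Demo with a placeholder variable so the sketch runs without real credentials.
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
print(demo_key[:3] + "...")  # print only a prefix, never the full key
```

In the notebook, the same helper could be called with `"OPENAI_API_KEY"` right after `load_dotenv`.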

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [3]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [4]:
#!pip install langchain-openai

* NOTE: We will use OpenAI by default because its models are among the strongest currently available. You will see how to connect with open-source LLMs like Llama 3 or Mistral in a later lesson.

## Vector databases (aka vector stores): store and search embeddings
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/).
* See the list of vector stores [here](https://python.langchain.com/v0.1/docs/integrations/vectorstores/).

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain-chroma

In [77]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
loaded_document = TextLoader('./data/state_of_the_union.txt').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())
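
To build intuition for the splitting step, here is a simplified sketch of character-based chunking: split the text on blank lines, then greedily merge the pieces until the next one would exceed `chunk_size`. This is an approximation written for illustration, not the `CharacterTextSplitter` implementation; the function name `split_into_chunks` and the sample text are assumptions.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece fits, or it is a single oversized piece that we keep whole.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph, a bit longer."
chunks = split_into_chunks(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, so each paragraph ends up in exactly one chunk.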

In [79]:
question = "What did the president say about the John Lewis Voting Rights Act?"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
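
Conceptually, a similarity search embeds the question, compares it with every stored chunk vector, and returns the closest chunks. The sketch below illustrates that ranking with cosine similarity. It is a toy: word counts stand in for the dense vectors an embedding model would produce, and the names `embed`, `cosine_similarity`, `toy_similarity_search`, and the sample strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real model returns dense float vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Short made-up sample strings, not the actual chunks of the speech.
chunks = [
    "pass the john lewis voting rights act",
    "justice breyer thank you for your service",
    "we will lower the cost of prescription drugs",
]
top = toy_similarity_search("what about the voting rights act", chunks, k=1)
print(top[0])
```

A real vector store follows the same idea but typically uses an index so that it does not have to compare the query against every stored vector.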


## Vector Stores vs. Retrievers

1. **Purpose and Functionality**:
   - **Vector Stores**: These are specialized databases designed to store information in the form of vectors (high-dimensional data points that represent text or other information). Vector stores are primarily used for quickly searching and retrieving similar vectors based on a query vector. They are focused on efficiently handling similarity comparisons between the stored vectors and any query vector.
   - **Retrievers**: Retrievers are more general tools that use various methods, including vector stores, to find and return relevant documents or information in response to a query. A retriever doesn't necessarily store the information itself but knows how to access and retrieve it when needed.

2. **Storage vs. Retrieval**:
   - **Vector Stores**: As the name implies, these are primarily concerned with storing data in a structured way that makes it fast and efficient to perform similarity searches.
   - **Retrievers**: While they may utilize storage systems like vector stores, retrievers are more focused on the act of fetching the right information in response to a user's query. Their main job is to provide the end-user with the most relevant information or documents based on the input they receive.

3. **Flexibility**:
   - **Vector Stores**: These are somewhat limited in their scope to handling tasks that involve similarity searches within the stored vectors. They are a specific tool for specific types of data retrieval tasks.
   - **Retrievers**: They can be designed to use different back-end systems (like vector stores or other databases) and can be part of larger systems that may involve more complex data processing or response generation.

In summary, vector stores in LangChain are about how information is stored and efficiently accessed based on similarity, while retrievers are about using various methods (including vector stores) to actively fetch and return the right information in response to diverse queries.

## Retriever: returns documents given a question

1. **Retriever: returns documents given a question** - A retriever is a tool that provides specific pieces of information or documents when you ask a question or make a query.

2. **A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store.** - A retriever works by taking a question that doesn't have a fixed format (an unstructured query) and finding relevant documents based on that question. It's a broad tool, more versatile than a vector store, which is just one way to organize information to make retrieval efficient.

3. **A retriever does not need to be able to store documents, only to return (or retrieve) them.** - The main job of a retriever is to find and return documents when asked; it doesn't have to store these documents itself. It can find documents stored elsewhere.

4. **Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.** - Although many retrievers use a system called a vector store to help find the right documents quickly (a vector store organizes information into a format that's easy to search through), there are other ways to build a retriever that don't rely on this method.

Overall, a retriever helps you find information you need from a large amount of data by searching through documents, and it can use different methods to do this efficiently.
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/).
* See the list of third-party retrievers [here](https://python.langchain.com/v0.1/docs/integrations/retrievers/).
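
To illustrate that a retriever only needs to return documents, and does not have to be backed by a vector store, here is a minimal sketch of a keyword-based retriever that stores nothing itself. The `Retriever` protocol, the `KeywordRetriever` class, and the sample corpus are hypothetical and written for this example; only the `invoke(query)` method name mirrors the call used later in this notebook.

```python
from typing import Callable, Protocol


class Retriever(Protocol):
    """Anything with invoke(query) -> documents counts as a retriever here."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Hypothetical retriever that stores nothing: it searches a corpus owned by someone else."""

    def __init__(self, fetch_corpus: Callable[[], list[str]], k: int = 2):
        self._fetch_corpus = fetch_corpus  # callable returning the current documents
        self._k = k

    def invoke(self, query: str) -> list[str]:
        terms = set(query.lower().split())
        # Score each document by how many query words it shares.
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self._fetch_corpus()]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[: self._k]]


def answer_context(retriever: Retriever, question: str) -> str:
    """Code written against the interface works with any retriever implementation."""
    return "\n".join(retriever.invoke(question))


external_store = ["Retrievers return documents.", "Vector stores hold embeddings.", "Unrelated note."]
retriever = KeywordRetriever(lambda: external_store)
print(answer_context(retriever, "what do retrievers return"))
```

Because `answer_context` depends only on `invoke`, the keyword retriever could be swapped for a vector-store-backed one without changing the calling code.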

#### Vector store as retriever

In [81]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/state_of_the_union.txt")

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install faiss-cpu

In [83]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loaded_document = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

embeddings = OpenAIEmbeddings()

vector_db = FAISS.from_documents(chunks_of_text, embeddings)

In [84]:
retriever = vector_db.as_retriever()

## Differences between `.similarity_search` and `.as_retriever()`
Both methods involve finding the most relevant text based on a query, but they are structured differently and may offer different functionalities based on their implementation.

#### `.similarity_search`

This method directly performs a similarity search against a vector database, which in the first code snippet is managed by the `Chroma` class. The process includes:
- Embedding the input query using the same model that was used to embed the document chunks.
- Searching the vector database for the closest vectors to the query's embedding.
- Returning the most relevant chunks based on their semantic similarity to the query.

This method is straightforward and typically used when you need to quickly find and retrieve the text segments that best match the query.

#### `.as_retriever()`

This method involves a different approach:
1. **Retriever Setup**: In the second code snippet, `vector_db.as_retriever()` converts the vector database (managed by `FAISS` in this case) into a retriever object. This object abstracts the similarity search into a retriever model that can be used in more complex retrieval-augmented generation (RAG) tasks.
2. **Invoke Method**: The `invoke()` function on the retriever is then used to perform the query. This method can be part of a larger system where the retriever is integrated with other components (like a language model) to generate answers or further process the retrieved documents.

#### Key Differences

- **Flexibility**: `.as_retriever()` provides a more flexible interface that can be integrated into larger, more complex systems, potentially combining retrieval with generation (like in RAG setups). This method is beneficial in applications where the retrieved content might be used as input for further processing or answer generation.
- **Backend Implementation**: While `.similarity_search` directly accesses the vector database, `.as_retriever()` encapsulates this access within a retriever object, which holds the search settings (for example, the number of results to return) and exposes them through the standard `invoke()` interface.
- **Use Cases**: The direct `.similarity_search` is the more straightforward choice for simple query-to-document retrieval tasks. In contrast, `.as_retriever()` suits scenarios requiring additional steps after retrieval, like feeding the retrieved information into a language model for generating coherent and context-aware responses.

Both methods are useful, but their appropriateness depends on the specific requirements of your application, such as whether you need straightforward retrieval or a more complex retrieval-augmented generation process.
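
The relationship between the two calls can be sketched with a toy store: the retriever is a thin wrapper that remembers its search settings and forwards `invoke(query)` to the store's search method. The classes `ToyVectorStore` and `ToyRetriever` are stand-ins written for this example, and word overlap replaces real embedding similarity; the sketch mirrors the shape of the interface and is not LangChain's implementation.

```python
from typing import Optional


class ToyVectorStore:
    """Stand-in for a vector store; word overlap replaces real embedding similarity."""

    def __init__(self, texts: list[str]):
        self._texts = list(texts)

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        terms = set(query.lower().split())
        ranked = sorted(
            self._texts,
            key=lambda t: len(terms & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, search_kwargs: Optional[dict] = None) -> "ToyRetriever":
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Thin wrapper: same search, exposed through a single invoke(query) method."""

    def __init__(self, store: ToyVectorStore, search_kwargs: dict):
        self._store = store
        self._search_kwargs = search_kwargs  # settings fixed when the retriever is created

    def invoke(self, query: str) -> list[str]:
        return self._store.similarity_search(query, **self._search_kwargs)


store = ToyVectorStore([
    "judge jackson was nominated",
    "pass the voting rights act",
    "thank you justice breyer",
])
direct = store.similarity_search("who was nominated", k=1)
via_retriever = store.as_retriever(search_kwargs={"k": 1}).invoke("who was nominated")
print(direct == via_retriever)
```

In LangChain itself, `as_retriever` also accepts a `search_kwargs` dictionary, so a call such as `vector_db.as_retriever(search_kwargs={"k": 2})` limits the retriever to two documents per query.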

#### Simple use without LCEL

In [93]:
response = retriever.invoke("what did he say about ketanji brown jackson?")

In [94]:
len(response)

4

In [95]:
response[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': './data/state_of_the_union.txt'})
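
Retrieved documents are usually handed to an LLM together with the question. As a minimal sketch of that hand-off without LCEL, the helper below joins the `page_content` of each document into a single context string and places it in a prompt. The `Doc` dataclass is a stand-in for LangChain's `Document`, and `format_docs`, `build_prompt`, and the prompt wording are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join the text of all retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{format_docs(docs)}\n\n"
        f"Question: {question}"
    )


docs = [
    Doc(
        "I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.",
        {"source": "./data/state_of_the_union.txt"},
    ),
    Doc("One of our nation's top legal minds."),
]
prompt = build_prompt("what did he say about ketanji brown jackson?", docs)
print(prompt)
```

The resulting string could then be sent to a chat model; that call is omitted here because it needs an API key.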

In [90]:
response

"He said that Ketanji Brown Jackson is one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence."

## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 005-retrievers.py
* In terminal, make sure you are in the directory of the file and run:
    * python 005-retrievers.py