# FAISS


- Author: [Jeongeun Lim](https://www.linkedin.com/in/jeongeun-lim-808978188/)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb)


## Overview


`FAISS` is a library designed for the efficient similarity search and clustering of dense vectors. It provides robust algorithms for searching vector sets of any size, including those that may not fit entirely in `RAM`.


In addition to the core search functionality, `FAISS` includes support code for evaluation and parameter tuning, making it a versatile tool for various applications in machine learning and artificial intelligence.


----
Key Benefits:


- Efficient Large-Scale Search:
`FAISS` ensures fast and accurate vector searches, even with millions of high-dimensional vectors.


- Memory Optimization:
Offers advanced quantization techniques that reduce memory usage, typically in exchange for a small loss in search accuracy.


- Customizable Search Accuracy:
Users can fine-tune parameters to balance between search accuracy and speed according to specific requirements.


- Versatile Applications:
From machine learning to AI-powered recommendation systems, `FAISS` supports a wide range of use cases.


---- 
Implementation Steps:


To effectively integrate `FAISS` into your workflow, follow these steps:


1. Data Preparation:
Prepare and normalize your data, ensuring vectors are in a dense representation format.


2. Index Creation:
Select and build a `FAISS` index based on your dataset size and performance requirements. Common options include `IndexFlat` for brute-force search and `IVF` for scalable, inverted-file-based search.


3. Index Training (if needed):
For certain indices, such as `IVF` or `PQ`, train the index with representative data samples to optimize performance.


4. Search Execution:
Use the index to search for nearest neighbors, leveraging optional GPU acceleration for faster performance.


5. Evaluation and Tuning:
Test and evaluate the performance of your index, adjusting parameters like quantization levels or clustering size for improved results.
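
To make the idea of a brute-force (flat) index concrete, here is a minimal pure-Python sketch of an exhaustive nearest-neighbor search. It is not the `FAISS` implementation: the function names and the toy 2-dimensional vectors are hypothetical and chosen only for illustration. It ranks by squared Euclidean distance, which is also what `IndexFlatL2` reports.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Compare the query with every stored vector and return the k closest
    entries as (position, squared_distance) pairs, best match first."""
    scored = [(pos, squared_l2(vec, query)) for pos, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

nearest = flat_search(toy_vectors, query=[0.9, 0.1], k=2)
print(nearest)  # positions 1 and 0 are the two closest vectors
```

Because every stored vector is visited, the cost grows linearly with the dataset size. Approximate index types such as `IVF` exist to avoid that full scan.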


### Table of Contents


- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Load a Sample Dataset](#load-a-sample-dataset)
- [Create a VectorStore](#create-a-vectorstore)
    - [Create a FAISS VectorStore(from_documents)](#create-a-faiss-vectorstorefrom_documents)
    - [Create a FAISS VectorStore(from_texts)](#create-a-faiss-vectorstorefrom_texts)
- [Similarity Search](#similarity-search)
- [Data Addition Methods](#data-addition-methods)
- [Delete Documents](#delete-documents)
- [Local Persistence](#local-persistence)
- [FAISS Object Merge (Merge From)](#faiss-object-merge-merge-from)
- [Convert to Searcher (as_retriever)](#convert-to-searcher-as_retriever)


### References


- [LangChain : Faiss](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Faiss Docs](https://faiss.ai/)


----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- See [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "FAISS",
    }
)

Environment variables have been set successfully.


Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Load a Sample Dataset
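This section prepares sample documents and splits them into smaller chunks with `RecursiveCharacterTextSplitter`. The commented-out cell shows the same flow for text files loaded with LangChain's `TextLoader`; the active cell builds the `Document` objects inline.
The resulting documents are then ready for embedding and storage in a `FAISS` vector store.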

In [None]:
"""
To be replaced with a fixed sample dataset in the future
"""

# from langchain_community.document_loaders import TextLoader
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# # Define the text splitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# # Load the text files -> convert them into List[Document]
# loader1 = TextLoader("data/nlp-keywords.txt")
# loader2 = TextLoader("data/finance-keywords.txt")

# # Split the documents
# split_doc1 = loader1.load_and_split(text_splitter)
# split_doc2 = loader2.load_and_split(text_splitter)

# # Check the number of documents
# len(split_doc1), len(split_doc2)

In [5]:
"""
To be replaced with a fixed sample dataset in the future
"""

from uuid import uuid4
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Define the dataset
document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)
document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)
document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)
document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)
document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)
document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)
document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)
document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)
document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)
document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

# Define the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)

# Split documents into smaller chunks and create a new list
split_documents = []
for doc in documents:
    split_content = text_splitter.split_text(doc.page_content)
    for chunk in split_content:
        split_documents.append(Document(page_content=chunk, metadata=doc.metadata))

# Generate a unique UUID for each split document
uuids = [str(uuid4()) for _ in range(len(split_documents))]

# Add the split documents to the VectorStore
# db.add_documents(documents=split_documents, ids=uuids)

# Verify the result (Print the number of split documents)
print(f"Number of split documents: {len(split_documents)}")

Number of split documents: 18


## Create a VectorStore

Key Initialization Parameters:

- Indexing Parameters
    - `embedding_function` (Embeddings): The embedding function to be used.
- Client Parameters
    - `index` (Any): The FAISS index to be used.
    - `docstore` (Docstore): The document store to be utilized.
    - `index_to_docstore_id` (Dict[int, str]): A mapping from the index to document store IDs.

**[Note]** 

- `FAISS` is a high-performance library for vector search and clustering.
- This class integrates `FAISS` with LangChain's VectorStore interface.
- By combining the `embedding function`, `FAISS index`, and `document store`, you can build an efficient vector search system.
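
As a rough illustration of how these pieces cooperate, the sketch below uses plain dictionaries as stand-ins for the document store and the index-to-ID mapping. The IDs, texts, and the `resolve_hits` helper are hypothetical; the real classes carry more logic. The point is only the lookup chain: index position, then document ID, then document.

```python
# Toy stand-ins for the docstore and the mapping (hypothetical IDs and texts)
docstore = {
    "id-a": "I had chocolate chip pancakes",
    "id-b": "The weather forecast for tomorrow",
    "id-c": "Building an exciting new project",
}
index_to_docstore_id = {0: "id-a", 1: "id-b", 2: "id-c"}


def resolve_hits(positions):
    """Translate index positions returned by a vector search into stored documents."""
    return [docstore[index_to_docstore_id[pos]] for pos in positions]


# Suppose the vector index reported positions 2 and 0 as the best matches
print(resolve_hits([2, 0]))
```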

In [6]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

# Embedding
embeddings = OpenAIEmbeddings()

# Calculate the size of the embedding dimension
dimension_size = len(embeddings.embed_query("hello world"))
print(dimension_size)

1536


In [7]:
# Create a FAISS vector store
db = FAISS(
    embedding_function=OpenAIEmbeddings(),
    index=faiss.IndexFlatL2(dimension_size),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

### Create a FAISS VectorStore(from_documents)

The `from_documents` class method creates a FAISS vector store using a list of documents and an embedding function.

- Parameters:
    - `documents` (List[Document]): A list of documents to be added to the vector store.
    - `embedding` (Embeddings): The embedding function to be used.
    - `**kwargs`: Additional keyword arguments.

- How It Works:
    1. Extracts the text content (`page_content`) and metadata from the list of documents.
    2. Calls the `from_texts` method using the extracted text and metadata.

- Return Value:
    - `VectorStore`: An instance of the vector store initialized with the provided documents and embeddings.

**[Note]** 
- This method internally calls the `from_texts` method to create the vector store.
- The `page_content` of each document is used as text, while `metadata` is used as the document's metadata.
- Additional configurations can be passed through `kwargs`.

In [8]:
# Create a FAISS vector store from the documents
db = FAISS.from_documents(documents=split_documents, embedding=OpenAIEmbeddings())

In [9]:
# Check the document store IDs
db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda'}

In [10]:
# Check the stored documents (document ID -> Document)
db.docstore._dict

{'d7e03a33-c8d5-4139-a9a5-0ee287908e1a': Document(id='d7e03a33-c8d5-4139-a9a5-0ee287908e1a', metadata={'source': 'tweet'}, page_content='I had chocolate chip pancakes and scrambled eggs'),
 '9a727937-e0af-4003-a3a8-4997f8dddf00': Document(id='9a727937-e0af-4003-a3a8-4997f8dddf00', metadata={'source': 'tweet'}, page_content='eggs for breakfast this morning.'),
 '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6': Document(id='9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6', metadata={'source': 'news'}, page_content='The weather forecast for tomorrow is cloudy and'),
 'd8e0189a-17df-4ee4-860c-2c838e404c02': Document(id='d8e0189a-17df-4ee4-860c-2c838e404c02', metadata={'source': 'news'}, page_content='and overcast, with a high of 62 degrees.'),
 '77422209-888e-4f2e-b656-85a347ba56f7': Document(id='77422209-888e-4f2e-b656-85a347ba56f7', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain -'),
 '557c5ff3-6303-43e3-8919-ac78081e16b8': Document(id='557c5ff3-6303-43e3-8919-ac7

### Create a FAISS VectorStore(from_texts)

The `from_texts` class method creates a FAISS vector store using a list of texts and an embedding function.

- Parameters:
    - `texts` (List[str]): A list of texts to be added to the vector store.
    - `embedding` (Embeddings): The embedding function to use.
    - `metadatas` (Optional[List[dict]]): A list of metadata. Default is None.
    - `ids` (Optional[List[str]]): A list of document IDs. Default is None.
    - `**kwargs`: Additional keyword arguments.

- How It Works:
    1. Texts are embedded using the provided embedding function.
    2. The `__from` method is called with the embedded vectors to create a `FAISS` instance.

- Return Value:
    - `FAISS`: The created FAISS vector store instance.

**[Note]**
- This method provides a user-friendly interface, handling document embedding, in-memory document storage, and `FAISS` database initialization in a single step.
- It’s a convenient way to get started quickly.

**[Caution]**
- Be mindful of memory usage when processing a large number of texts.
- When using metadata or IDs, ensure they are provided as lists with the same length as the text list.
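
One way to guard against mismatched inputs is to validate them before building the store. The helper below is a hypothetical convenience function, not part of LangChain; it only checks that `metadatas` and `ids`, when given, line up with `texts`, and that the IDs are unique.

```python
def check_aligned(texts, metadatas=None, ids=None):
    """Raise ValueError if metadatas or ids do not match texts one-to-one."""
    n = len(texts)
    if metadatas is not None and len(metadatas) != n:
        raise ValueError(f"Expected {n} metadata entries, got {len(metadatas)}")
    if ids is not None:
        if len(ids) != n:
            raise ValueError(f"Expected {n} ids, got {len(ids)}")
        if len(set(ids)) != len(ids):
            raise ValueError("ids must be unique")


texts = ["Hello, it's nice to meet you.", "My name is Teddy."]
metadatas = [{"source": "text document"}, {"source": "text document"}]
ids = ["doc1", "doc2"]

# Passes silently when everything lines up
check_aligned(texts, metadatas=metadatas, ids=ids)
```

Calling such a check first means a length mismatch is reported before any embedding requests are sent.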

In [13]:
# Create using a list of strings
db2 = FAISS.from_texts(
    ["Hello, it's nice to meet you.", "My name is Teddy."],
    embedding=OpenAIEmbeddings(),
    metadatas=[{"source": "text document"}, {"source": "text document"}],
    ids=["doc1", "doc2"],
)

# Stored content
db2.docstore._dict

{'doc1': Document(id='doc1', metadata={'source': 'text document'}, page_content="Hello, it's nice to meet you."),
 'doc2': Document(id='doc2', metadata={'source': 'text document'}, page_content='My name is Teddy.')}

## Similarity Search

The `similarity_search` method allows you to search for documents most similar to a given query.

- Parameters:
    - `query` (str): The search query text for finding similar documents.
    - `k` (int): The number of documents to return. Default is 4.
    - `filter` (Optional[Union[Callable, Dict[str, Any]]]): A metadata filtering function or dictionary. Default is None.
    - `fetch_k` (int): The number of documents to retrieve before applying filtering. Default is 20.
    - `**kwargs`: Additional keyword arguments.

- Returns:
    - `List[Document]`: A list of documents most similar to the query.

- How It Works:
    1. Internally calls the `similarity_search_with_score` method to search for documents along with their similarity scores.
    2. Extracts and returns only the documents from the results, excluding the scores.

- Key Features:
    - The `filter` parameter enables metadata-based filtering.
    - The `fetch_k` parameter allows control over the number of documents retrieved before filtering, ensuring enough documents remain after filtering.

- Considerations:
    - Search performance heavily depends on the quality of the embedding model used.
    - In large datasets, it is important to adjust the values of `k` and `fetch_k` to balance search speed and accuracy.
    - For complex filtering needs, pass a custom function to the `filter` parameter for fine-grained control.

- Optimization Tips:
    - Cache the results for frequently used queries to improve the speed of repeated searches.
    - Avoid setting `fetch_k` too high, as it may slow down search performance. Experiment to find an appropriate value.
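
The interaction between `k`, `fetch_k`, and `filter` is easier to see in a small simulation. The sketch below assumes the candidates are already ranked by similarity and applies the filter afterwards; `filtered_search` and the sample data are hypothetical, and it models only dictionary-style filters with exact-match values.

```python
def filtered_search(ranked_docs, k=4, fetch_k=20, metadata_filter=None):
    """Take the top fetch_k candidates, drop those failing the filter, return at most k."""
    candidates = ranked_docs[:fetch_k]
    if metadata_filter is not None:
        candidates = [
            doc
            for doc in candidates
            if all(doc["metadata"].get(key) == value for key, value in metadata_filter.items())
        ]
    return candidates[:k]


# Candidates sorted best match first (hypothetical ranking)
ranked = [
    {"page_content": "The stock market is down", "metadata": {"source": "news"}},
    {"page_content": "Robbers broke into the city bank", "metadata": {"source": "news"}},
    {"page_content": "Building an exciting new project", "metadata": {"source": "tweet"}},
    {"page_content": "The weather forecast for tomorrow", "metadata": {"source": "news"}},
    {"page_content": "LangGraph is the best framework", "metadata": {"source": "tweet"}},
    {"page_content": "Is the new iPhone worth the price?", "metadata": {"source": "website"}},
]

# With a small fetch_k, only one tweet survives the filter even though k=2
narrow = filtered_search(ranked, k=2, fetch_k=3, metadata_filter={"source": "tweet"})
# A larger fetch_k leaves enough candidates to fill k
wide = filtered_search(ranked, k=2, fetch_k=6, metadata_filter={"source": "tweet"})
print(len(narrow), len(wide))
```

This is why a restrictive filter may call for a larger `fetch_k`: the filter can only keep documents that were fetched in the first place.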

In [14]:
# Similarity Search
db.similarity_search("Tell me about the LangChain project.")

[Document(id='77422209-888e-4f2e-b656-85a347ba56f7', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain -'),
 Document(id='8a089124-0f45-419b-8c96-a8a6aacd5c0f', metadata={'source': 'tweet'}, page_content='LangGraph is the best framework for building'),
 Document(id='fd6a6f48-eda2-4753-93e9-aacb991b0519', metadata={'source': 'website'}, page_content='Read this review to find out.'),
 Document(id='e203e4b6-4703-43b3-ac46-b46b9e825c00', metadata={'source': 'tweet'}, page_content='building stateful, agentic applications!')]

In [15]:
# Specify the value of k (number of documents to return)
db.similarity_search("Tell me about the LangChain project.", k=2)

[Document(id='77422209-888e-4f2e-b656-85a347ba56f7', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain -'),
 Document(id='8a089124-0f45-419b-8c96-a8a6aacd5c0f', metadata={'source': 'tweet'}, page_content='LangGraph is the best framework for building')]

In [16]:
# Use a filter to narrow results based on metadata
db.similarity_search(
    "Tell me about the LangChain project.", filter={"source": "tweet"}, k=2
)

[Document(id='77422209-888e-4f2e-b656-85a347ba56f7', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain -'),
 Document(id='8a089124-0f45-419b-8c96-a8a6aacd5c0f', metadata={'source': 'tweet'}, page_content='LangGraph is the best framework for building')]

## Data Addition Methods
This section describes how to add data to a `FAISS` vector store using either documents or texts. These methods accept different input types, so you can populate the vector store from whichever form your data is already in.

### Add from Document (add_documents)
The `add_documents` method allows you to add or update documents in the vector store.

- Parameters:
    - `documents` (List[Document]): A list of Document objects to be added to the vector store.
    - `**kwargs`: Additional keyword arguments.

- Return Value:
    - `List[str]`: A list of IDs for the added texts.

- Functionality:
    1. Extracts text content and metadata from the documents.
    2. Calls the `add_texts` method to perform the actual addition process.

- Key Features:
    - Convenient for handling Document objects directly.
    - Includes ID handling logic to ensure the uniqueness of the documents.
    - Operates based on the `add_texts` method, promoting code reusability.

In [17]:
from langchain_core.documents import Document

# Specify page_content and metadata
db.add_documents(
    [
        Document(
            page_content="Hello! This time, I will add a new document.",
            # Metadata specifying the source of the document
            metadata={"source": "mydata.txt"},
        )
    ],
    # Unique ID for the new document
    ids=["new_doc1"],
)

['new_doc1']

In [19]:
# Verify the added data by performing a similarity search
db.similarity_search("hello", k=1)

[Document(id='new_doc1', metadata={'source': 'mydata.txt'}, page_content='Hello! This time, I will add a new document.')]

### Add from text (add_texts)


The `add_texts` method provides the functionality to embed texts and add them to the vector store.


- Parameters:
    - `texts` (Iterable[str]): An iterable of texts to be added to the vector store.
    - `metadatas` (Optional[List[dict]]): A list of metadata associated with the texts (optional).
    - `ids` (Optional[List[str]]): A list of unique identifiers for the texts (optional).
    - `**kwargs`: Additional keyword arguments.


- Return Value:
    - `List[str]`: A list of IDs of the texts added to the vector store.


- How it works:
    1. The input texts iterable is converted into a list.
    2. The `_embed_documents` method is used to embed the texts.
    3. The `__add` method is called to add the embedded texts to the vector store.

In [20]:
# Add new text data
db.add_texts(
    [
        "This time, we're adding text data.",
        "This is the second text data being added.",
    ],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["new_doc2", "new_doc3"],
)

['new_doc2', 'new_doc3']

In [21]:
# Check the added data
db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda',
 18: 'new_doc1',
 19: 'new_doc2',
 20: 'new_doc3'}

## Delete Documents


The `delete` method is used to remove documents from the vector store based on their specified IDs.


- Parameters:
    - `ids` (Optional[List[str]]): A list of document IDs to delete.
    - `**kwargs`: Additional keyword arguments (not utilized in this method).


- Return Value:
    - `Optional[bool]`: Returns True if the deletion is successful, False if it fails, or None if the functionality is not implemented.


- How It Works:
    1. Validates the provided IDs.
    2. Finds the indices corresponding to the IDs to be deleted.
    3. Removes the entries with the given IDs from the `FAISS` index.
    4. Deletes the documents associated with the IDs from the document store.
    5. Updates the index-to-ID mapping.


- Key Features:
    - Ensures precise document management using ID-based deletion.
    - Performs deletion on both the `FAISS` index and the document store for consistency.
    - Maintains data integrity by reordering the index after deletion.


- Caution:
    - Deletion is irreversible, so it should be done with care.
    - The method lacks concurrency control, which requires caution in multi-threaded environments.
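
The bookkeeping behind these steps can be sketched with plain Python containers. This is a simplified model under stated assumptions, not the library code: a list stands in for the `FAISS` index, a dictionary for the document store, and `delete_by_ids` is a hypothetical helper that returns new containers instead of mutating the store.

```python
def delete_by_ids(vectors, docstore, index_to_docstore_id, ids):
    """Remove the given IDs and renumber the remaining positions from 0."""
    id_to_position = {doc_id: pos for pos, doc_id in index_to_docstore_id.items()}
    missing = set(ids) - set(id_to_position)
    if missing:
        raise ValueError(f"Unknown ids: {sorted(missing)}")

    positions_to_drop = {id_to_position[doc_id] for doc_id in ids}
    kept_positions = [pos for pos in sorted(index_to_docstore_id) if pos not in positions_to_drop]

    new_vectors = [vectors[pos] for pos in kept_positions]
    new_mapping = {
        new_pos: index_to_docstore_id[old_pos]
        for new_pos, old_pos in enumerate(kept_positions)
    }
    new_docstore = {doc_id: doc for doc_id, doc in docstore.items() if doc_id not in set(ids)}
    return new_vectors, new_docstore, new_mapping


# Toy store with three entries (hypothetical vectors and texts)
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
docstore = {"a": "first", "b": "second", "c": "third"}
mapping = {0: "a", 1: "b", 2: "c"}

vectors, docstore, mapping = delete_by_ids(vectors, docstore, mapping, ["b"])
print(mapping)  # the entry for "c" moves from position 2 to position 1
```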

In [22]:
# Add data for deletion
ids = db.add_texts(
    ["Adding data for deletion.", "This is the second data entry for deletion."],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["delete_doc1", "delete_doc2"],
)

# Verify the IDs of the added data
print(ids)

['delete_doc1', 'delete_doc2']


Pass the IDs to the `delete` method to remove the corresponding documents.

In [23]:
# Delete by IDs
db.delete(ids)

True

In [24]:
# Output the result after deletion
db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda',
 18: 'new_doc1',
 19: 'new_doc2',
 20: 'new_doc3'}

## Local Persistence


### Save Local


The `save_local` method enables saving the `FAISS` index, document store, and index-to-document ID mapping to the local disk.


- Parameters:
    - `folder_path` (str): The folder path where the data will be saved.
    - `index_name` (str): The name of the index file to be saved (default: "index").


- How It Works:
    1. Creates the specified folder path (ignored if it already exists).
    2. Saves the `FAISS` index as a separate file.
    3. Stores the document store and index-to-document ID mapping in pickle format.


- Usage Considerations:
    - Write permissions are required for the specified save path.
    - For large datasets, significant storage space and time may be required.
    - Be mindful of potential security risks associated with using pickle.

In [25]:
# Save to local disk
db.save_local(folder_path="faiss_db", index_name="faiss_index")

### Load Local


The `load_local` class method allows loading a `FAISS` index, document store, and index-to-document ID mapping saved on the local disk.


- Parameters:
    - `folder_path` (str): The folder path where the saved files are located.
    - `embeddings` (Embeddings): The embedding object used for generating queries.
    - `index_name` (str): The name of the index file to load (default: "index").
    - `allow_dangerous_deserialization` (bool): Whether to allow deserialization of pickle files (default: False).


- Returns:
    - `FAISS`: The loaded `FAISS` object.


- How It Works:
    1. Ensures deserialization risks are considered and requires explicit user permission.
    2. Loads the `FAISS` index separately.
    3. Uses pickle to deserialize the document store and index-to-document ID mapping.
    4. Creates and returns a `FAISS` object using the loaded data.
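
The pickle-related steps can be illustrated with the standard library alone. The sketch below saves and restores only a toy document store and mapping; the `save_sidecar` and `load_sidecar` helpers and the `.pkl` file name are assumptions made for illustration, and the vector index itself is left out. The explicit opt-in flag mirrors the idea behind `allow_dangerous_deserialization`.

```python
import pickle
import tempfile
from pathlib import Path


def save_sidecar(folder_path, index_name, docstore, index_to_docstore_id):
    """Pickle the docstore and mapping into <folder_path>/<index_name>.pkl."""
    folder = Path(folder_path)
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    with open(folder / f"{index_name}.pkl", "wb") as f:
        pickle.dump((docstore, index_to_docstore_id), f)


def load_sidecar(folder_path, index_name, allow_dangerous_deserialization=False):
    """Unpickle the docstore and mapping, but only after an explicit opt-in."""
    if not allow_dangerous_deserialization:
        raise ValueError("Refusing to unpickle: only load files you created or trust.")
    with open(Path(folder_path) / f"{index_name}.pkl", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save_sidecar(tmp, "faiss_index", {"new_doc1": "Hello!"}, {0: "new_doc1"})
    restored_docstore, restored_mapping = load_sidecar(
        tmp, "faiss_index", allow_dangerous_deserialization=True
    )

print(restored_mapping)
```

Unpickling can execute arbitrary code embedded in the file, which is why loading should be limited to files from a trusted source.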

In [26]:
# Load the saved data
loaded_db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Verify the loaded data
loaded_db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda',
 18: 'new_doc1',
 19: 'new_doc2',
 20: 'new_doc3'}

## FAISS Object Merge (Merge From)


The `merge_from` method allows merging another `FAISS` object into the current `FAISS` object.


- Parameters:
    - `target` (`FAISS`): The target `FAISS` object to be merged into the current object.


- How It Works:
    1. Checks if the document stores are compatible for merging.
    2. Assigns new indices to the incoming documents based on the length of the existing index.
    3. Merges the `FAISS` index.
    4. Extracts documents and ID information from the target `FAISS` object.
    5. Adds the extracted information to the current document store and index-to-document ID mapping.


- Key Features:
    - Merges the indices, document stores, and index-to-document ID mappings of two `FAISS` objects.
    - Maintains continuity of index numbering during the merge.
    - Ensures compatibility of document stores before proceeding with the merge.


- Cautions:
    - The structure of the target `FAISS` object must be compatible with the current object.
    - Be cautious of duplicate IDs, as the current implementation does not check for duplicates.
    - If an exception occurs during the merge process, it may leave the system in a partially merged state.
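
The renumbering step can be sketched with dictionaries standing in for the two stores. `merge_mappings` is a hypothetical helper, and the explicit duplicate-ID check is an addition of this sketch rather than a description of the library's behavior.

```python
def merge_mappings(current_mapping, current_docstore, target_mapping, target_docstore):
    """Append the target's entries after the current ones, shifting their positions."""
    overlap = set(current_docstore) & set(target_docstore)
    if overlap:
        raise ValueError(f"Duplicate ids: {sorted(overlap)}")

    offset = len(current_mapping)  # incoming positions start after the existing ones
    merged_mapping = dict(current_mapping)
    for pos, doc_id in sorted(target_mapping.items()):
        merged_mapping[offset + pos] = doc_id

    merged_docstore = {**current_docstore, **target_docstore}
    return merged_mapping, merged_docstore


# Toy stores (hypothetical IDs and texts)
db_mapping = {0: "new_doc1", 1: "new_doc2", 2: "new_doc3"}
db_docs = {"new_doc1": "one", "new_doc2": "two", "new_doc3": "three"}
db2_mapping = {0: "x1", 1: "x2"}
db2_docs = {"x1": "alpha", "x2": "beta"}

merged_mapping, merged_docs = merge_mappings(db_mapping, db_docs, db2_mapping, db2_docs)
print(merged_mapping)  # positions 3 and 4 now point at "x1" and "x2"
```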

In [28]:
# Load the saved data
db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Create a new FAISS vector store
db2 = FAISS.from_documents(documents=split_documents, embedding=OpenAIEmbeddings())

# Check the data in db
db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda',
 18: 'new_doc1',
 19: 'new_doc2',
 20: 'new_doc3'}

In [29]:
# Check the data in db2
db2.index_to_docstore_id

{0: 'd49df9e3-dfc6-409b-a4f9-bd6bd39dd9e3',
 1: 'bebe6ee4-d20b-4bb6-b158-cb4d57a299da',
 2: 'b818c6d2-0cdc-4e27-b0d7-f0785df46d6a',
 3: '4e084641-b3d5-4ca9-9dbc-3351b1557d82',
 4: 'e1f9d47c-44a9-446a-9099-ded8c035b774',
 5: 'fe8dde72-e51b-4a93-86f8-cd89e9a394a5',
 6: '5485f7d6-9bc1-4105-910e-02da99121941',
 7: '74f01d67-66e8-40ea-b26c-a312c4a0d757',
 8: 'eb6966fe-57f8-49ee-8d07-617f661a04cd',
 9: 'f82e34df-65e7-450b-ae3a-f722da200b60',
 10: '13e95ae8-df7d-4405-8de9-906ec94f3952',
 11: 'f75e5201-d3ee-4d9f-987f-b2f8ca38a646',
 12: '15ef205e-bfeb-4654-bee7-42ddef7ab7e5',
 13: 'e2dcce5a-4cd4-4a93-ab4f-bb03f408e00a',
 14: '4763a9c6-41fb-4b13-94bd-516f2262223b',
 15: 'edfb52b2-fe9d-48b6-9abf-8e9f998b441e',
 16: '9bb69c8c-d73a-4812-9a7a-f610a232894c',
 17: 'edcf1a0f-6929-430b-bc60-d588080c3f59'}

Use `merge_from` to combine the two databases.

In [30]:
# Merge db + db2
db.merge_from(db2)

# Check the merged data
db.index_to_docstore_id

{0: 'd7e03a33-c8d5-4139-a9a5-0ee287908e1a',
 1: '9a727937-e0af-4003-a3a8-4997f8dddf00',
 2: '9eb165f9-ab7a-4a8d-8fc2-547c575d5cf6',
 3: 'd8e0189a-17df-4ee4-860c-2c838e404c02',
 4: '77422209-888e-4f2e-b656-85a347ba56f7',
 5: '557c5ff3-6303-43e3-8919-ac78081e16b8',
 6: '43751c22-f479-4ce9-9c92-83b4a6bea494',
 7: 'dc2f4d57-fdd9-43c4-b355-4b188a89ce5b',
 8: 'f7a3f798-4459-45f0-9064-d91cf31bb466',
 9: 'f8cc3819-4f71-4256-9d9a-9b6fba8712c6',
 10: 'c5c40a21-ce67-447d-819c-d6971ffcb2d6',
 11: 'fd6a6f48-eda2-4753-93e9-aacb991b0519',
 12: 'ce25d341-1c44-4a7a-82b1-5d5ff139c137',
 13: '8a089124-0f45-419b-8c96-a8a6aacd5c0f',
 14: 'e203e4b6-4703-43b3-ac46-b46b9e825c00',
 15: 'fa0f3f69-3754-4ccf-8d97-096553677cdc',
 16: '02b8f2ad-2aa7-4b26-930b-762ab6b0204e',
 17: 'c261874c-5393-4d7c-80a0-e67c5f115fda',
 18: 'new_doc1',
 19: 'new_doc2',
 20: 'new_doc3',
 21: 'd49df9e3-dfc6-409b-a4f9-bd6bd39dd9e3',
 22: 'bebe6ee4-d20b-4bb6-b158-cb4d57a299da',
 23: 'b818c6d2-0cdc-4e27-b0d7-f0785df46d6a',
 24: '4e084641

## Convert to Searcher (as_retriever)


The `as_retriever` method creates a `VectorStoreRetriever` object based on the current vector store.


- Parameters:
    - `**kwargs`: Keyword arguments passed to the search function.
    - `search_type` (Optional[str]): Type of search to perform (`"similarity"`, `"mmr"`, or `"similarity_score_threshold"`).
    - `search_kwargs` (Optional[Dict]): Additional keyword arguments for the search function.


- Return Value:
    - `VectorStoreRetriever`: A retriever object based on the vector store.


- Key Features:
    - Supports Multiple Search Types
        - `"similarity"`: Default similarity-based search.
        - `"mmr"`: Maximal Marginal Relevance search.
        - `"similarity_score_threshold"`: Similarity threshold-based search.
    - Customizable Search Parameters
        - `k`: Number of documents to return.
        - `score_threshold`: Similarity score threshold.
        - `fetch_k`: Number of documents fetched for the MMR algorithm.
        - `lambda_mult`: Parameter to adjust diversity in `MMR`.
        - `filter`: Filter documents based on metadata.


- Usage Considerations:
    - Choose appropriate search types and parameters to balance the quality and diversity of search results.
    - Adjust `fetch_k` and `k` values for large datasets to balance performance and accuracy.
    - Use the filter option to search only documents that match specific conditions.


- Optimization Tips:
    - For `MMR` searches, increase `fetch_k` and adjust `lambda_mult` to balance diversity and relevance.
    - Use threshold-based search to return only highly relevant documents.


- Cautions:
    - Improper parameter settings may impact search performance or result quality.
    - High `k` values on large datasets can significantly increase search time. By default, similarity search returns 4 documents (`k=4`).

In [33]:
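# Create a new FAISS vector store.
# Note: the documents are passed in twice (split_documents + split_documents),
# so each chunk is stored twice and duplicates appear in the results below.
db = FAISS.from_documents(
    documents=split_documents + split_documents,
    embedding=OpenAIEmbeddings(),
)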

The default retriever returns 4 documents.

In [34]:
# Convert to retriever
retriever = db.as_retriever()

# Perform search
retriever.invoke("What can you tell me about iPhones?")

[Document(id='0a176035-9a36-4cdb-ab70-35306af9b254', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this'),
 Document(id='344fc8d3-bf2f-4493-9166-b9a0a0ea0a00', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this'),
 Document(id='baab5057-277d-4893-9937-f233cb707eec', metadata={'source': 'website'}, page_content='Read this review to find out.'),
 Document(id='a294ed5d-dad6-49ed-a718-4888e1b489ce', metadata={'source': 'website'}, page_content='Read this review to find out.')]

Use `MMR` search to retrieve more documents with higher diversity. The relevant parameters are:


- `k`: The number of documents to return (default: 4)
- `fetch_k`: The number of documents to pass to the `MMR` algorithm (default: 20)
- `lambda_mult`: Adjusts the diversity of `MMR` results (range: 0 to 1, default: 0.5; 0 gives maximum diversity, 1 gives minimum diversity)

In [35]:
# Perform MMR search
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "lambda_mult": 0.25, "fetch_k": 10}
)
# Invoke search with a query
retriever.invoke("What can you tell me about iPhones?")

[Document(id='0a176035-9a36-4cdb-ab70-35306af9b254', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this'),
 Document(id='b3571c49-6256-472b-9ffa-c1f5f733bd2c', metadata={'source': 'tweet'}, page_content='building stateful, agentic applications!'),
 Document(id='9daebe2f-0bfc-4ebb-bf84-001f040afd85', metadata={'source': 'website'}, page_content='The top 10 soccer players in the world right now.'),
 Document(id='926d5ce9-6c0b-480e-8278-c69c517af9b6', metadata={'source': 'tweet'}, page_content='- come check it out!'),
 Document(id='baab5057-277d-4893-9937-f233cb707eec', metadata={'source': 'website'}, page_content='Read this review to find out.'),
 Document(id='344fc8d3-bf2f-4493-9166-b9a0a0ea0a00', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this')]

Fetch 10 candidate documents for the `MMR` algorithm, but return only the top 2.

In [36]:
# Perform MMR search, return only the top 2 documents
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10})
retriever.invoke("What can you tell me about iPhones?")

[Document(id='0a176035-9a36-4cdb-ab70-35306af9b254', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this'),
 Document(id='b3571c49-6256-472b-9ffa-c1f5f733bd2c', metadata={'source': 'tweet'}, page_content='building stateful, agentic applications!')]
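
To make the roles of `fetch_k` and `lambda_mult` more concrete, the following is a minimal pure-Python sketch of greedy MMR selection. It is not the code LangChain runs internally; the function names `cosine` and `mmr_select` and the toy 2-dimensional vectors are assumptions made for illustration. The `candidates` list stands in for the `fetch_k` nearest vectors, and `k` is the number of results kept.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=4, lambda_mult=0.5):
    """Greedy MMR sketch (hypothetical helper, not a LangChain API).

    `candidates` plays the role of the fetch_k pre-fetched vectors.
    Returns the indices of up to k selected candidates.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected item
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy vectors (illustrative only)
query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],  # identical to the query
    [1.0, 0.0],  # exact duplicate of candidate 0
    [0.6, 0.8],  # less relevant, but different from candidate 0
]

relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
more_diverse = mmr_select(query, candidates, k=2, lambda_mult=0.25)
print(relevance_only)  # [0, 1]
print(more_diverse)    # [0, 2]
```

With `lambda_mult=1.0` the duplicate is selected because only relevance counts. With `lambda_mult=0.25` the duplicate is penalized and the less similar but distinct candidate is chosen instead, which mirrors the mixed results seen in the `MMR` outputs above.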

Search only for documents whose similarity score is at or above a given threshold.

In [37]:
# Perform threshold-based similarity search
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)

retriever.invoke("What can you tell me about iPhones?")

[Document(id='0a176035-9a36-4cdb-ab70-35306af9b254', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this'),
 Document(id='344fc8d3-bf2f-4493-9166-b9a0a0ea0a00', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this')]
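
Conceptually, threshold-based search scores each candidate and keeps only those that reach `score_threshold`. The sketch below illustrates that filtering step in plain Python. The helper name `filter_by_threshold` and the scores are hypothetical values chosen for illustration, not scores measured from the vector store above, and it assumes relevance scores are normalized so that higher means more similar.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, score) pairs whose score reaches the threshold.

    Hypothetical helper for illustration; assumes higher score = more similar.
    """
    return [(text, score) for text, score in scored_docs if score >= score_threshold]


# Made-up relevance scores, for illustration only
scored_docs = [
    ("Is the new iPhone worth the price? Read this", 0.83),
    ("Read this review to find out.", 0.74),
    ("The top 10 soccer players in the world right now.", 0.61),
]

kept = filter_by_threshold(scored_docs, score_threshold=0.8)
print(kept)
```

Raising the threshold shrinks the result set, and a threshold that no document reaches yields an empty list, so it is worth checking the scores your embedding model actually produces before fixing a value.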

Retrieve only the most similar document

In [38]:
# Perform search to retrieve the most similar single document with k=1
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("What can you tell me about iPhones?")

[Document(id='0a176035-9a36-4cdb-ab70-35306af9b254', metadata={'source': 'website'}, page_content='Is the new iPhone worth the price? Read this')]

Apply specific metadata filters

In [41]:
# Apply filter based on metadata and retrieve top 2 documents
retriever = db.as_retriever(search_kwargs={"filter": {"source": "news"}, "k": 2})
retriever.invoke("What is the weather forecast for tomorrow?")

[Document(id='f9d46fec-ad45-459c-8103-815574b168d6', metadata={'source': 'news'}, page_content='The weather forecast for tomorrow is cloudy and'),
 Document(id='12d76ef2-5345-4e14-bde3-16870b6de65d', metadata={'source': 'news'}, page_content='The weather forecast for tomorrow is cloudy and')]
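
The effect of a metadata filter can be sketched in plain Python as follows. This is a simplified illustration that assumes exact-match filtering applied to a pre-fetched candidate list, which is then truncated to `k`; the helper names `matches_filter` and `filtered_top_k` and the ranking order of the sample documents are assumptions, not LangChain APIs or real search output.

```python
def matches_filter(metadata, metadata_filter):
    """True if every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_top_k(ranked_docs, metadata_filter, k=4, fetch_k=20):
    """Hypothetical helper: filter the first fetch_k ranked docs, keep up to k."""
    candidates = ranked_docs[:fetch_k]
    kept = [doc for doc in candidates if matches_filter(doc["metadata"], metadata_filter)]
    return kept[:k]


# Documents in an assumed ranking order (most similar first)
ranked_docs = [
    {"page_content": "The weather forecast for tomorrow is cloudy and", "metadata": {"source": "news"}},
    {"page_content": "building stateful, agentic applications!", "metadata": {"source": "tweet"}},
    {"page_content": "The weather forecast for tomorrow is cloudy and", "metadata": {"source": "news"}},
    {"page_content": "Is the new iPhone worth the price? Read this", "metadata": {"source": "website"}},
]

results = filtered_top_k(ranked_docs, {"source": "news"}, k=2)
for doc in results:
    print(doc["metadata"]["source"], "|", doc["page_content"])
```

Under this assumption, a small `fetch_k` combined with a selective filter can return fewer than `k` documents, because filtering happens only within the pre-fetched candidates. For example, `filtered_top_k(ranked_docs, {"source": "news"}, k=2, fetch_k=1)` returns a single document.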