# FAISS

- Author: [Jeongeun Lim](https://www.linkedin.com/in/jeongeun-lim-808978188/)
- Design: []()
- Peer Review : 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb)

## Overview

`FAISS` is a library designed for efficient similarity search and clustering of dense vectors. It provides algorithms for searching vector sets of any size, including those that may not fit entirely in `RAM`.

In addition to the core search functionality, `FAISS` includes support code for evaluation and parameter tuning, making it a versatile tool for various applications in machine learning and artificial intelligence.


----
Key Benefits:

- Efficient Large-Scale Search:
`FAISS` supports fast nearest-neighbor search over millions of high-dimensional vectors, using either exact or approximate indexes.

- Memory Optimization:
Offers quantization techniques that compress vectors to reduce memory usage, typically at a small cost in search accuracy.

- Customizable Search Accuracy:
Users can fine-tune parameters to balance between search accuracy and speed according to specific requirements.

- Versatile Applications:
From machine learning to AI-powered recommendation systems, `FAISS` supports a wide range of use cases.
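
To make the memory trade-off concrete, the following back-of-the-envelope sketch compares raw `float32` storage with product-quantization (PQ) codes. The dataset size (1 million vectors), the dimension (1536), and the PQ settings (64 sub-quantizers at 8 bits each) are illustrative assumptions, and the estimate ignores codebook and index overhead.

```python
def flat_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Bytes needed to store every vector uncompressed as float32."""
    return num_vectors * dim * bytes_per_float


def pq_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Bytes needed when each vector is stored as one short code per sub-quantizer."""
    return num_vectors * num_subquantizers * bits_per_code // 8


# Assumed workload: 1 million vectors of dimension 1536
num_vectors = 1_000_000
dim = 1536

flat = flat_bytes(num_vectors, dim)
pq = pq_bytes(num_vectors, num_subquantizers=64)

print(f"float32 storage: {flat / 1e9:.2f} GB")
print(f"PQ codes:        {pq / 1e6:.0f} MB")
print(f"compression:     {flat // pq}x")
```

Under these assumptions the codes are 96 times smaller than the raw vectors. The price is that distances are computed on compressed approximations rather than on the original values.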


---- 
Implementation Steps:

To effectively integrate `FAISS` into your workflow, follow these steps:

1. Data Preparation:
Prepare and normalize your data, ensuring vectors are in a dense representation format.

2. Index Creation:
Select and build a `FAISS` index based on your dataset size and performance requirements. Common options include `IndexFlat` for brute-force search or `IVF` for scalable inverted-file search.

3. Index Training (if needed):
For certain indices, such as `IVF` or `PQ`, train the index with representative data samples to optimize performance.

4. Search Execution:
Use the index to search for nearest neighbors, leveraging optional GPU acceleration for faster performance.

5. Evaluation and Tuning:
Test and evaluate the performance of your index, adjusting parameters like quantization levels or clustering size for improved results.
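
Steps 1, 2, and 4 can be sketched without the library. The code below is a pure-Python stand-in for the exhaustive comparison that a flat L2 index performs; it is not the `FAISS` API, and the function names and toy vectors are hypothetical. Step 3 is skipped because a flat index needs no training.

```python
import math


def l2_normalize(vec):
    """Step 1: scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(stored, query, k):
    """Step 4: compare the query with every stored vector and keep the k closest."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    return scored[:k]


# Step 2: the "index" here is simply the list of normalized vectors
raw_vectors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
stored = [l2_normalize(v) for v in raw_vectors]

query = l2_normalize([2.0, 0.1])
results = flat_search(stored, query, k=2)

for distance, position in results:
    print(f"position={position} squared_distance={distance:.4f}")
```

The query points almost along the first axis, so the vector at position 1 ranks first and the one at position 0 second. Exhaustive search like this is exact but scales linearly with the number of stored vectors, which is the cost that trained indexes such as `IVF` aim to reduce.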


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Load a Sample Dataset](#load-a-sample-dataset)
- [Create a VectorStore](#create-a-vectorstore)
- [Create a FAISS VectorStore(from_documents)](#create-a-faiss-vectorstorefrom_documents)

### References

- [LangChain Docs Faiss](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Faiss Docs](https://faiss.ai/)

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup tools, helper functions, and utilities for tutorials.
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "FAISS",
    }
)

Environment variables have been set successfully.


You can alternatively set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Load a Sample Dataset
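This section builds a small in-memory sample dataset of `Document` objects and splits them into smaller chunks with `RecursiveCharacterTextSplitter`. The first cell, currently commented out, shows the equivalent file-based approach using LangChain's `TextLoader`.
The resulting documents are then ready for embedding and storage in a `FAISS` vector store.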

In [None]:
"""
This cell will be replaced by a fixed sample dataset in the future.
"""

# from langchain_community.document_loaders import TextLoader
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# # Define the text splitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# # Load the text files and convert them to List[Document]
# loader1 = TextLoader("data/nlp-keywords.txt")
# loader2 = TextLoader("data/finance-keywords.txt")

# # Split the documents
# split_doc1 = loader1.load_and_split(text_splitter)
# split_doc2 = loader2.load_and_split(text_splitter)

# # Check the number of documents
# len(split_doc1), len(split_doc2)

In [5]:
"""
This cell will be replaced by a fixed sample dataset in the future.
"""

from uuid import uuid4
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Define the dataset
document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)
document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)
document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)
document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)
document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)
document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)
document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)
document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)
document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)
document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

# Define the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)

# Split documents into smaller chunks and create a new list.
# Each chunk gets its own copy of the metadata so chunks do not share one dict.
split_documents = []
for doc in documents:
    split_content = text_splitter.split_text(doc.page_content)
    for chunk in split_content:
        split_documents.append(
            Document(page_content=chunk, metadata=dict(doc.metadata))
        )

# Generate a unique UUID for each split document
uuids = [str(uuid4()) for _ in range(len(split_documents))]

# Add the split documents to the VectorStore
# (uncomment once `db` has been created in the "Create a VectorStore" section)
# db.add_documents(documents=split_documents, ids=uuids)

# Verify the result (Print the number of split documents)
print(f"Number of split documents: {len(split_documents)}")

Number of split documents: 18
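
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as newlines and spaces, so its chunks and counts will differ. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = (
    "The weather forecast for tomorrow is cloudy and overcast, "
    "with a high of 62 degrees."
)
chunks = sliding_chunks(sample, chunk_size=50, chunk_overlap=10)

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Each window begins with the last 10 characters of the previous one, which is what the overlap setting is for: text that straddles a boundary still appears whole in at least one chunk.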


## Create a VectorStore

Key Initialization Parameters:

- Indexing Parameters
    - `embedding_function` (Embeddings): The embedding function to be used.
- Client Parameters
    - `index` (Any): The FAISS index to be used.
    - `docstore` (Docstore): The document store to be utilized.
    - `index_to_docstore_id` (Dict[int, str]): A mapping from the index to document store IDs.

**[Note]** 

- `FAISS` is a high-performance library for vector search and clustering.
- This class integrates `FAISS` with LangChain's VectorStore interface.
- By combining the `embedding function`, `FAISS index`, and `document store`, you can build an efficient vector search system.
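
The way these three pieces cooperate can be illustrated with a toy model. The variable names below mirror the initialization parameters, but the code is a pure-Python sketch rather than the LangChain implementation, and the 2-dimensional vectors are hand-made stand-ins for real embeddings.

```python
vectors = []               # stand-in for the FAISS index: position i holds vector i
docstore = {}              # document ID -> document text
index_to_docstore_id = {}  # index position -> document ID


def add(doc_id, text, vector):
    """Store the vector at the next index position and remember which document it belongs to."""
    position = len(vectors)
    vectors.append(vector)
    docstore[doc_id] = text
    index_to_docstore_id[position] = doc_id


def search(query_vector, k):
    """Rank index positions by squared L2 distance, then map positions back to documents."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], query_vector)),
    )
    return [docstore[index_to_docstore_id[i]] for i in ranked[:k]]


add("doc-a", "pancakes for breakfast", [0.9, 0.1])
add("doc-b", "stock market is down", [0.1, 0.9])

top = search([0.8, 0.2], k=1)
print(top)
```

The index only knows about vectors and their integer positions. The mapping and the document store are what turn a position returned by the search into a document.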

In [6]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

# Embedding
embeddings = OpenAIEmbeddings()

# Calculate the size of the embedding dimension
dimension_size = len(embeddings.embed_query("hello world"))
print(dimension_size)

1536


In [7]:
# Create an empty FAISS vector store.
# Reuse `embeddings` so the index dimension matches the embedding model.
db = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(dimension_size),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

## Create a FAISS VectorStore(from_documents)

The `from_documents` class method creates a FAISS vector store using a list of documents and an embedding function.

- Parameters:
    - `documents` (List[Document]): A list of documents to be added to the vector store.
    - `embedding` (Embeddings): The embedding function to be used.
    - `**kwargs`: Additional keyword arguments.

- How It Works:
    1. Extracts the text content (`page_content`) and metadata from the list of documents.
    2. Calls the `from_texts` method using the extracted text and metadata.

- Return Value:
    - `VectorStore`: An instance of the vector store initialized with the provided documents and embeddings.

**[Note]**
- This method internally calls the `from_texts` method to create the vector store.
- The `page_content` of each document is used as text, while `metadata` is used as the document's metadata.
- Additional configurations can be passed through `kwargs`.
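
The delegation described above can be sketched with toy classes. `ToyDocument` and `ToyVectorStore` are hypothetical stand-ins, not LangChain classes; the sketch only shows the extract-then-delegate pattern and skips the embedding step entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyVectorStore:
    def __init__(self, texts, metadatas):
        self.texts = texts
        self.metadatas = metadatas

    @classmethod
    def from_texts(cls, texts, embedding, metadatas=None, **kwargs):
        # A real store would embed the texts here; this sketch only keeps them.
        texts = list(texts)
        if metadatas is None:
            metadatas = [{} for _ in texts]
        return cls(texts, list(metadatas))

    @classmethod
    def from_documents(cls, documents, embedding, **kwargs):
        # 1. Extract the text content and metadata from each document
        texts = [doc.page_content for doc in documents]
        metadatas = [doc.metadata for doc in documents]
        # 2. Delegate to from_texts, forwarding any extra keyword arguments
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)


docs = [
    ToyDocument("hello", {"source": "tweet"}),
    ToyDocument("world", {"source": "news"}),
]
store = ToyVectorStore.from_documents(docs, embedding=None)
print(store.texts)
print(store.metadatas)
```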

In [None]:
# Create a FAISS vector store from the documents
db = FAISS.from_documents(documents=split_documents, embedding=OpenAIEmbeddings())

In [None]:
# Check the document store IDs
db.index_to_docstore_id

In [None]:
# Inspect the stored documents (docstore ID -> Document); `_dict` is a private attribute
db.docstore._dict