# Example code for creating a Faiss index

Efficient retrieval and matching of text from a large corpus are crucial for building a Q&A system. One common method is to convert the text into vector representations and build an index over these vectors, which enables fast retrieval based on the similarity between vectors. This example demonstrates the process of splitting documents into small chunks, using an embedding store to convert the text into vectors, and generating a Faiss index.
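
To make the idea of similarity-based retrieval concrete, here is a minimal sketch in plain Python. It is not how Faiss works internally: it simply scores every stored vector against the query with cosine similarity, which is the brute-force version of what an index speeds up. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have far more dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], vectors: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings", invented for this sketch.
toy_vectors = {
    "chunk about training": [0.9, 0.1, 0.0],
    "chunk about deployment": [0.1, 0.9, 0.0],
    "chunk about billing": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.0]

results = top_k(query_vector, toy_vectors, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

A linear scan like this becomes slow as the number of vectors grows, which is why a dedicated index is used in the rest of this example.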

## Install promptflow-vectordb SDK

In [None]:
%pip install promptflow-vectordb

## Import required libraries

In [None]:
import os
from typing import List
import urllib.request
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter

from promptflow_vectordb.core.contracts import (
    EmbeddingModelType,
    StorageType,
    StoreCoreConfig,
)
from promptflow_vectordb.core.embeddingstore_core import EmbeddingStoreCore

## Prepare your data
For convenience, a few Azure Machine Learning documentation webpages are selected here as sample data. You can replace them with your own dataset.


In [None]:
URL_PREFIX = "https://learn.microsoft.com/en-us/azure/machine-learning/"
URL_NAME_LIST = [
    "tutorial-azure-ml-in-a-day",
    "overview-what-is-azure-machine-learning",
    "concept-v2",
]

Download the data to a local path.

In [None]:
local_file_path = os.path.join(os.getcwd(), "data")
os.makedirs(local_file_path, exist_ok=True)
for url_name in URL_NAME_LIST:
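    # URL_PREFIX ends with "/", so plain concatenation yields a valid URL.
    # os.path.join is meant for file paths and would insert "\" on Windows.
    url = URL_PREFIX + url_name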
    destination_path = os.path.join(local_file_path, url_name)
    urllib.request.urlretrieve(url, destination_path)
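
URLs are better assembled with a URL-aware helper than with file-path functions such as `os.path.join`, which uses the platform's path separator. The sketch below shows one possible approach with the standard library's `urllib.parse.urljoin`. The helper name `build_url` is hypothetical and not part of this example's SDK.

```python
from urllib.parse import urljoin


def build_url(prefix: str, name: str) -> str:
    """Join a base URL and a page name into a full URL."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so make sure the trailing slash is present before joining.
    if not prefix.endswith("/"):
        prefix += "/"
    return urljoin(prefix, name)


URL_PREFIX = "https://learn.microsoft.com/en-us/azure/machine-learning/"
url = build_url(URL_PREFIX, "concept-v2")
print(url)
```

The trailing-slash check matters because `urljoin` follows relative-reference resolution rules: without it, the last segment of the base URL would be replaced rather than extended.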

## Configure and create an embedding store
The promptflow-vectordb SDK supports multiple types of embedding models (Azure OpenAI, OpenAI) and multiple types of store paths (local path, HTTP URL, Azure blob). This example configures an embedding store with an Azure OpenAI embedding model and a local store path.

Refer to [create a resource and deploy a model using Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal) to set up an Azure OpenAI (AOAI) embedding model deployment. Different embedding models return output vectors of different dimensions. We recommend deploying the `text-embedding-ada-002` model, which returns vectors of dimension 1536.

To use the AOAI model, store `Azure_OpenAI_MODEL_ENDPOINT` and `Azure_OpenAI_MODEL_API_KEY` as environment variables.
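
Because the configuration cell reads both variables with `os.environ[...]`, a missing variable surfaces as a bare `KeyError`. A small pre-check, sketched below, can give a clearer message. The helper `find_missing_env_vars` is a hypothetical name introduced for this illustration; only the two variable names come from this example.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_ENV_VARS = ("Azure_OpenAI_MODEL_ENDPOINT", "Azure_OpenAI_MODEL_API_KEY")


def find_missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or set to an empty string."""
    return [name for name in names if not env.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Set these environment variables before creating the store: "
          + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment as a parameter keeps the check easy to test with a plain dictionary.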

In [None]:
MODEL_API_VERSION = "2023-05-15"
MODEL_DEPLOYMENT_NAME = "text-embedding-ada-002"
DIMENSION = 1536

# Configure an embedding store to store the index file.
store_path = os.path.join(os.getcwd(), "faiss_index_store")
config = StoreCoreConfig.create_config(
    storage_type=StorageType.LOCAL,
    store_identifier=store_path,
    model_type=EmbeddingModelType.AOAI,
    model_api_base=os.environ["Azure_OpenAI_MODEL_ENDPOINT"],
    model_api_key=os.environ["Azure_OpenAI_MODEL_API_KEY"],
    model_api_version=MODEL_API_VERSION,
    model_name=MODEL_DEPLOYMENT_NAME,
    dimension=DIMENSION,
    create_if_not_exists=True,
)
store = EmbeddingStoreCore(config)

## Split documents into chunks, embed the chunks, and create a Faiss index

In [None]:
def get_file_chunks(file_name: str) -> List[str]:
    """Read a downloaded HTML file and split its visible text into chunks."""
    with open(file_name, "r", encoding="utf-8") as f:
        page_content = f.read()
    # Use BeautifulSoup to parse the HTML content and keep only the text.
    soup = BeautifulSoup(page_content, "html.parser")
    text = soup.get_text(" ", strip=True)
    # Target chunks of about 1024 characters, with 10 characters of overlap.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=10)
    return splitter.split_text(text)
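
To illustrate what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain fixed-width sliding window. This is a simplified stand-in, not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, lines, and spaces. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.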

When chunks are inserted into the embedding store, they are converted into embeddings and a Faiss index is generated under the store path.

In [None]:
for root, _, files in os.walk(local_file_path):
    for file in files:
        each_file_path = os.path.join(root, file)

        # Split the file into chunks.
        chunks = get_file_chunks(each_file_path)
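        # Build one metadata dict per chunk. A comprehension creates a
        # separate dict for each entry instead of repeating a shared one.
        if URL_PREFIX is not None:
            # Concatenate rather than os.path.join, which is for file paths.
            source = URL_PREFIX + file
            metadatas = [{"title": file, "source": source} for _ in chunks]
        else:
            metadatas = [{"title": file} for _ in chunks]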

        # Embed the chunks and generate the index in the embedding store.
        # If your data is large, inserting too many chunks at once may cause
        # a rate limit error. See the following link for possible solutions:
        # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/quotas-limits
        store.batch_insert_texts(chunks, metadatas)
        print(f"Created index for {file} successfully.\n")
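
One way to reduce the chance of rate limit errors on large inputs is to insert in smaller batches and retry with a growing pause. The sketch below is an assumption-laden illustration: `insert_in_batches` is a hypothetical helper, the insert function is passed in as a callable, and it catches a broad `Exception` because the exact error type raised on rate limiting is not specified here. Narrow the exception type in real use.

```python
import time
from typing import Callable, Dict, List, Sequence


def insert_in_batches(
    insert_fn: Callable[[List[str], List[Dict]], object],
    chunks: Sequence[str],
    metadatas: Sequence[Dict],
    batch_size: int = 16,
    max_retries: int = 3,
    pause_seconds: float = 1.0,
) -> int:
    """Call insert_fn on slices of at most batch_size items; return the count inserted."""
    if len(chunks) != len(metadatas):
        raise ValueError("chunks and metadatas must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    inserted = 0
    for start in range(0, len(chunks), batch_size):
        batch_chunks = list(chunks[start:start + batch_size])
        batch_metadatas = list(metadatas[start:start + batch_size])
        for attempt in range(max_retries + 1):
            try:
                insert_fn(batch_chunks, batch_metadatas)
                break
            except Exception:  # assumption: rate limit errors arrive as exceptions
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                time.sleep(pause_seconds * (2 ** attempt))  # exponential backoff
        inserted += len(batch_chunks)
    return inserted


# Demo with a stand-in insert function that only records batch sizes.
recorded_batch_sizes: List[int] = []
demo_total = insert_in_batches(
    lambda texts, metas: recorded_batch_sizes.append(len(texts)),
    ["a", "b", "c", "d", "e"],
    [{"title": "demo"} for _ in range(5)],
    batch_size=2,
    pause_seconds=0,
)
print(demo_total, recorded_batch_sizes)
```

In this notebook the helper could be called as `insert_in_batches(store.batch_insert_texts, chunks, metadatas)`. The batch size and pause values are placeholders to tune against your deployment's quota.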

## Next step
Now you have successfully created a Faiss index. To build a complete Q&A system, you can use the [Faiss Index Lookup tool](https://aka.ms/faiss_index_lookup_tool) to search for relevant texts in the created index with [Azure Machine Learning Prompt Flow](https://aka.ms/AMLPromptflow).