# Create a FAISS based Vector Index for Document Retrieval with AzureML

We'll walk through setting up an AzureML Pipeline which uploads local data, processes the data into chunks, embeds the chunks, and creates a LangChain-compatible FAISS Vector Index.

## Get client for AzureML Workspace

The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace created by the main.bicep script; the job will run in this workspace.

`MLClient` is how you interact with AzureML.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azureml.core import Workspace

credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential=credential)

ws = Workspace(
    subscription_id=ml_client.subscription_id,
    resource_group=ml_client.resource_group_name,
    workspace_name=ml_client.workspace_name,
)
print(ml_client)
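
`MLClient.from_config` locates the workspace through a `config.json` file containing `subscription_id`, `resource_group`, and `workspace_name`. As a minimal sketch, the file can be checked before connecting; the helper name `load_workspace_config` is hypothetical and not part of the SDK.

```python
import json
from pathlib import Path

# The three fields a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Load a workspace config.json and verify that the required fields are present."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"{path} is missing required keys: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}
```

Calling `load_workspace_config("config.json")` before `MLClient.from_config` surfaces a missing or empty field as a readable error.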

## Which Datasource?

We'll be using the Contoso Dental dataset, which is a collection of questions and answers from the Contoso dental practice. The dataset is available in the `data` folder of this repo.


In [None]:

local_path = "../data/contoso-dental.xls"

We will register the file as an AzureML Data asset, which is a reference to the .xls data in the Datastore.

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

v1="initial"

my_data = Data(
    name="contoso-dental-clinic",
    description="Dental Clinic data",
    path=local_path,
    type=AssetTypes.URI_FILE,
)

ml_client.data.create_or_update(my_data)

## Which Embeddings Model to use?

We will be using Azure OpenAI's `text-embedding-ada-002` model to embed the dataset. Here are some factors that might influence your decision:

### OpenAI

OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on their Embeddings model `text-embedding-ada-002`. It can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly.
If you have an existing **Azure OpenAI** Instance you can connect it to AzureML. The main limitation when using `text-embedding-ada-002` is the cost and quota available for the model. Otherwise it provides high quality embeddings across a wide array of text domains while being simple to use.



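A Workspace may already have an automatically created `Default_AzureOpenAI` connection. This notebook looks up the connection named in `aoai_connection_name` (`azure-openai-conn` by default).

If you would rather use a different existing Azure OpenAI connection, change `aoai_connection_name` below.
If you would rather use an existing Azure OpenAI resource but don't have a connection created, set the `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` environment variables; the `except` branch in the cell below then creates the connection under `aoai_connection_name`.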

In [None]:
import os

aoai_connection_name = "azure-openai-conn"
aoai_connection = None

In [None]:
from azureml.rag.utils.connections import (
    get_connection_by_name_v2,
    create_connection_v2,
)

try:
    aoai_connection = get_connection_by_name_v2(ws, aoai_connection_name)
except Exception:
    # No connection with that name could be loaded, so create a new one.
    # The endpoint and API key of your AOAI resource can be found in the Azure Portal.
    # They are read from environment variables here.

    target = os.environ["AZURE_OPENAI_ENDPOINT"]  # example: 'https://<endpoint>.openai.azure.com/'
    key = os.environ["AZURE_OPENAI_KEY"]
    apiVersion = "2023-10-01-preview"

    aoai_connection = create_connection_v2(
        workspace=ws,
        name=aoai_connection_name,
        category="AzureOpenAI",
        target=target,
        auth_type="ApiKey",
        credentials={"key": key},
        metadata={"ApiType": "azure", "ApiVersion": apiVersion},
    )
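
The cell above reads `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` with `os.environ[...]`, which raises a bare `KeyError` when a variable is unset. A small sketch of a friendlier check follows; `require_env` is a hypothetical helper, not part of `azureml.rag`.

```python
import os


def require_env(*names):
    """Return the values of the named environment variables, failing early if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    return tuple(os.environ[name] for name in names)
```

With this helper, `target, key = require_env("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY")` reports every missing variable in one message.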


Now that your Workspace has a connection to Azure OpenAI, we will make sure the `text-embedding-ada-002` model has been deployed and is ready for inference. We will be using the `text-embedding-ada-002` deployment you created earlier with the main.bicep script.

If there is no deployment for the embeddings model, the cell below prints a pointer to the resource page. [Follow these instructions](https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#deploy-a-model) to deploy a model with Azure OpenAI.

In [None]:
from azureml.rag.utils.deployment import infer_deployment

aoai_embedding_model_name = "text-embedding-ada-002"
oai_completion_model_name = "gpt-35-turbo"


try:
    aoai_embedding_deployment_name = infer_deployment(
        aoai_connection, aoai_embedding_model_name
    )
    print(
        f"Deployment name in AOAI workspace for model '{aoai_embedding_model_name}' is '{aoai_embedding_deployment_name}'"
    )
except Exception as e:
    # No deployment could be inferred: report why, then point at the resource page.
    print(f"Could not infer a deployment for '{aoai_embedding_model_name}': {e}")
    if "ResourceId" in aoai_connection["properties"]["metadata"]:
        aoai_resource_url = f"https://portal.azure.com/resource/{aoai_connection['properties']['metadata']['ResourceId']}/overview"
        print(
            f"Please create a deployment for this model by following the deploy instructions on the resource page: {aoai_resource_url}"
        )
    else:
        print(
            f"Please create a deployment for this model by following the deploy instructions on the resource page for '{aoai_connection['properties']['target']}' in Azure Portal."
        )

Finally we will combine the deployment and model information into the URI form which the AzureML embeddings components expect as input.

In [None]:
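# The deployment name from the environment overrides the one inferred above.
aoai_embedding_deployment_name = os.environ["TEXT_EMBEDDING_DEPLOYMENT_NAME"]

# The URI names the deployment and the embeddings model it serves.
embeddings_model_uri = f"azure_open_ai://deployment/{aoai_embedding_deployment_name}/model/{aoai_embedding_model_name}"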

In [None]:
print(embeddings_model_uri)
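
The URI follows the shape `azure_open_ai://deployment/<deployment>/model/<model>`. The sketch below builds and parses that shape so a mismatch is easy to spot before submitting the pipeline; both helper names are hypothetical, and the sample deployment name is a placeholder.

```python
import re

# Matches azure_open_ai://deployment/<deployment>/model/<model>
_URI_PATTERN = re.compile(
    r"^azure_open_ai://deployment/(?P<deployment>[^/]+)/model/(?P<model>[^/]+)$"
)


def build_embeddings_model_uri(deployment_name, model_name):
    """Combine a deployment name and a model name into the embeddings model URI."""
    return f"azure_open_ai://deployment/{deployment_name}/model/{model_name}"


def parse_embeddings_model_uri(uri):
    """Split an embeddings model URI back into (deployment_name, model_name)."""
    match = _URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not an azure_open_ai embeddings model URI: {uri}")
    return match.group("deployment"), match.group("model")


# "my-embedding-deployment" is a made-up deployment name for illustration.
example_uri = build_embeddings_model_uri("my-embedding-deployment", "text-embedding-ada-002")
```

Parsing the URI and checking that the model part is `text-embedding-ada-002` catches the case where a completion model name was passed by mistake.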

## Setup Pipeline to process data into Index

AzureML [Pipelines](https://learn.microsoft.com/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2) connect together multiple [Components](https://learn.microsoft.com/azure/machine-learning/concept-component?view=azureml-api-2). Each Component defines inputs, code that consumes the inputs and outputs produced from the code. To process your data for embedding and indexing we will chain together multiple components each performing their own step of the workflow.

In [None]:
ml_registry = MLClient(credential=credential, registry_name="azureml")

# Clones git repository to output folder of pipeline, by default this will be on the default Workspace Datastore `workspaceblobstore`
git_clone_component = ml_registry.components.get("llm_rag_git_clone", label="latest")
# Walks input folder according to provided glob pattern (all files by default: '**/*') and attempts to open them, extract text chunks and further chunk if necessary to fit within provided `chunk_size`.
crack_and_chunk_component = ml_registry.components.get(
    "llm_rag_crack_and_chunk", label="latest"
)
# Reads input folder of files containing chunks and their metadata as batches, in parallel, and generates embeddings for each chunk. Output format is produced and loaded by `azureml.rag.embeddings.EmbeddingsContainer`.
generate_embeddings_component = ml_registry.components.get(
    "llm_rag_generate_embeddings", label="latest"
)
# Reads input folder produced by `azureml.rag.embeddings.EmbeddingsContainer.save()` and inserts all documents (chunk, metadata, embedding_vector) into a Faiss index and in-memory document store. Writes an MLIndex yaml detailing the index and embeddings model information.
create_faiss_index_component = ml_registry.components.get(
    "llm_rag_create_faiss_index", label="latest"
)
# Takes a uri to a storage location where an MLIndex yaml is stored and registers it as an MLIndex Data asset in the AzureML Workspace.
register_mlindex_component = ml_registry.components.get(
    "llm_rag_register_mlindex_asset", label="latest"
)

Each Component has documentation which provides an overall description of the Component's purpose and each of the inputs/outputs.
For example, we can understand what `crack_and_chunk` does by inspecting the Component definition.

In [None]:
print(crack_and_chunk_component)

Below, a Pipeline is built by defining a Python function which chains together the inputs and outputs of the above components. Arguments to the function are inputs to the Pipeline itself, and the return value is a dictionary defining the outputs of the Pipeline.

In [None]:
from azure.ai.ml import Input, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities._job.pipeline._io import PipelineInput
from typing import Optional


#def use_automatic_compute(component, instance_count=1, instance_type="Standard_E8s_v3"):
def use_automatic_compute(component, instance_count=1, instance_type="Standard_DS12_v2"):
    """Configure input `component` to use automatic compute with `instance_count` and `instance_type`.

    This avoids the need to provision a compute cluster to run the component.
    """
    component.set_resources(
        instance_count=instance_count,
        instance_type=instance_type,
        properties={"compute_specification": {"automatic": True}},
    )
    return component


def optional_pipeline_input_provided(pipeline_input: Optional[PipelineInput]):
    """Checks if optional pipeline inputs are provided."""
    return pipeline_input is not None and pipeline_input._data is not None


# If you have an existing compute cluster you want to use instead of automatic compute, uncomment the following line and replace `dedicated_cpu_compute` with the name of your cluster.
# Also comment out the `component.set_resources` call in `use_automatic_compute` above and the `default_compute="serverless"` line below.
# @pipeline(compute=dedicated_cpu_compute)
@pipeline(default_compute="serverless")
def local_to_faiss(
    input_data: Input,
    embeddings_model: str,
    asset_name: str,
    #branch_name: str = None,
    chunk_size: int = 1024,
    data_source_glob: str = None,
    data_source_url: str = None,
    document_path_replacement_regex: str = None,
    #git_connection_id=None,
    aoai_connection_id=None,
    embeddings_container=None,
):
    """Pipeline to generate embeddings for an `input_data` source and create a Faiss index."""

    # Open the input files and split their text into chunks of at most `chunk_size`.
    crack_and_chunk = crack_and_chunk_component(
        input_data=input_data,
        input_glob=data_source_glob,
        chunk_size=chunk_size,
        # data_source_url=my_data.datastore,
        document_path_replacement_regex=document_path_replacement_regex,
    )
    use_automatic_compute(crack_and_chunk)

    generate_embeddings = generate_embeddings_component(
        chunks_source=crack_and_chunk.outputs.output_chunks,
        embeddings_model=embeddings_model,
    )
    use_automatic_compute(generate_embeddings)
    if optional_pipeline_input_provided(aoai_connection_id):
        generate_embeddings.environment_variables[
            "AZUREML_WORKSPACE_CONNECTION_ID_AOAI"
        ] = aoai_connection_id
    if optional_pipeline_input_provided(embeddings_container):
        # If provided, `embeddings_container` is expected to be a URI to a folder; the folder can be empty.
        # Each sub-folder is generated by a `generate_embeddings_component` run and can be reused for subsequent embeddings runs.
        generate_embeddings.outputs.embeddings = Output(
            type="uri_folder", path=f"{embeddings_container.path}/{{name}}"
        )

    create_faiss_index = create_faiss_index_component(
        embeddings=generate_embeddings.outputs.embeddings,
    )
    use_automatic_compute(create_faiss_index)

    register_mlindex = register_mlindex_component(
        storage_uri=create_faiss_index.outputs.index, asset_name=asset_name
    )
    use_automatic_compute(register_mlindex)
    return {
        "mlindex_asset_uri": create_faiss_index.outputs.index,
        "mlindex_asset_id": register_mlindex.outputs.asset_id,
    }
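
To make the data flow through the pipeline concrete, here is a toy sketch of the same three stages using plain Python stand-ins. It is not how the AzureML components are implemented: chunking counts whitespace-separated words rather than model tokens, `toy_embed` is a placeholder for a real embeddings model, and the "index" is just parallel lists.

```python
def crack_and_chunk(documents, chunk_size=1024):
    """Split each document into chunks of at most `chunk_size` whitespace-separated words."""
    chunks = []
    for source, text in documents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk_text = " ".join(words[start:start + chunk_size])
            chunks.append({"source": source, "text": chunk_text})
    return chunks


def generate_embeddings(chunks, embed):
    """Attach an embedding vector to every chunk using the supplied `embed` callable."""
    return [{**chunk, "embedding": embed(chunk["text"])} for chunk in chunks]


def create_index(embedded_chunks):
    """Store vectors and their documents side by side, position by position."""
    return {
        "vectors": [chunk["embedding"] for chunk in embedded_chunks],
        "documents": [
            {"source": chunk["source"], "text": chunk["text"]} for chunk in embedded_chunks
        ],
    }


def toy_embed(text):
    """Placeholder embedding: [character count, word count]. Not a real model."""
    return [float(len(text)), float(len(text.split()))]


# Each stage consumes the previous stage's output, mirroring the pipeline wiring.
chunks = crack_and_chunk({"faq.txt": "a b c d e"}, chunk_size=2)
index = create_index(generate_embeddings(chunks, toy_embed))
```

The nesting of the calls mirrors how each component's output in `local_to_faiss` is passed as the next component's input.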

Now we can create the Pipeline Job by calling the `@pipeline` annotated function and providing input arguments.
`asset_name` will be used when registering the MLIndex Data Asset produced by the `register_mlindex` component in the pipeline. This is how you can refer to the MLIndex within AzureML.



In [None]:
asset_name = "dental_faiss_mlindex"
data_source_glob = "**/contoso-dental.xls"

In [None]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

pipeline_job = local_to_faiss(
    input_data=Input(
        type=AssetTypes.URI_FOLDER, path="../data/"
    ),  # This will upload the data folder to the default Workspace Datastore `workspaceblobstore`
    data_source_glob=data_source_glob,
    data_source_url=my_data.path,
    embeddings_model=embeddings_model_uri,
    # ID of the Azure OpenAI connection loaded or created earlier
    aoai_connection_id=aoai_connection["id"],
    # Each run will save the latest embeddings to a subfolder under this path. Runs load the latest embeddings from the container and reuse any unchanged chunk embeddings.
    embeddings_container=Input(
        type="uri_folder",
        path=f"azureml://datastores/workspaceblobstore/paths/embeddings/{asset_name}",
    ),
    # Name of asset to register MLIndex under
    asset_name=asset_name,
)

# By default AzureML Pipelines will reuse the output of previous component Runs when inputs have not changed.
# If you want to rerun the Pipeline every time so that any changes to upstream data sources are processed, uncomment the line below.
# pipeline_job.settings.force_rerun = True  # Rerun each time so that no step is served from cache, if the intent is to ingest the latest data.
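
The embeddings container lets later runs reuse embeddings for chunks that have not changed. One common way to get that behaviour is to key cached vectors by a hash of the chunk text. The sketch below shows the idea only and is an assumption about the general technique, not the `azureml.rag` implementation; `embed_with_cache` is a hypothetical name.

```python
import hashlib


def embed_with_cache(chunks, embed, cache):
    """Embed `chunks`, calling `embed` only for text whose hash is not already in `cache`."""
    vectors = []
    for text in chunks:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            # Only new or changed chunks reach the (potentially costly) embedding call.
            cache[key] = embed(text)
        vectors.append(cache[key])
    return vectors
```

Passing the same `cache` dictionary to a second call means only new or edited chunks trigger an embedding request, which matters when the embeddings model is billed per token.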

Finally we add some properties to `pipeline_job` which ensure the Index generation progress and final Artifact appear in the PromptFlow Vector Index UI.

In [None]:
# These properties let in-progress index generations be listed in the UI. The UI adds them automatically for jobs it creates.
pipeline_job.properties["azureml.mlIndexAssetName"] = asset_name
pipeline_job.properties["azureml.mlIndexAssetKind"] = "faiss"
pipeline_job.properties["azureml.mlIndexAssetSource"] = "Uri Folder"

## Submit Pipeline

The output of each step in the pipeline can be inspected via the Workspace UI. Click the link under 'Details Page' after running the cell below.

In [None]:
running_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="local_to_faiss"
)
running_pipeline_job

In [None]:
ml_client.jobs.stream(running_pipeline_job.name)
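
`ml_client.jobs.stream` blocks while it prints logs. If you would rather wait quietly, a polling loop over `jobs.get(...).status` is one option. This is a sketch: `wait_for_job` is a hypothetical helper, and the set of terminal status strings is an assumption that should be checked against the statuses your jobs actually report.

```python
import time

# Assumed terminal status strings; verify against your workspace before relying on them.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(client, job_name, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll `client.jobs.get(job_name).status` until it is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.jobs.get(job_name).status
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job_name} is still {status} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)
```

For example, `wait_for_job(ml_client, running_pipeline_job.name)` returns the final status string once the job finishes.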

## Use MLIndex with PromptFlow

To use the MLIndex in PromptFlow, pass the asset_id to the `Vector Index Lookup` Tool. To always use the newest version, replace the trailing version segment (for example `versions/2`) with `versions/latest`.

In [None]:
asset_id = f"azureml:/{ml_client.data.get(asset_name, label='latest').id}"
print(asset_id)

In [None]:
# `str.replace` returns a new string, so assign the result to keep the lowercase `resourcegroups` form.
asset_id = asset_id.replace("resourceGroups", "resourcegroups")
print(asset_id)
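
Swapping a pinned version such as `versions/2` for `versions/latest` can be done with a small string helper. The sketch below assumes the asset ID ends with a `/versions/<n>` segment; `with_version` is a hypothetical name and the sample ID is a made-up placeholder.

```python
import re


def with_version(asset_id, version="latest"):
    """Return `asset_id` with its trailing `/versions/<n>` segment replaced by `version`."""
    updated, count = re.subn(
        r"/versions/[^/]+$", lambda _match: f"/versions/{version}", asset_id
    )
    if count == 0:
        raise ValueError(f"No trailing '/versions/<n>' segment found in: {asset_id}")
    return updated


# Placeholder ID for illustration only.
sample_id = (
    "azureml:/subscriptions/000/resourcegroups/my-rg/providers/"
    "Microsoft.MachineLearningServices/workspaces/my-ws/data/dental_faiss_mlindex/versions/2"
)
latest_id = with_version(sample_id)
```

Raising on an ID without a version segment avoids silently passing an unchanged string to the tool.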