In [None]:
%pip install azure-ai-ml
%pip install -U 'azureml-rag[faiss]>=0.1.13'
# If using hugging_face embeddings add `hugging_face` extra, e.g. `azureml-rag[faiss,hugging_face]`

# Create a FAISS based Vector Index for Incremental Document Retrieval with AzureML for unstructured data

In this notebook, we'll walk through setting up an AzureML Pipeline which pulls some unstructured data from a couple of docx files, chunks it, incrementally embeds the chunks and creates a LangChain compatible FAISS Vector Index. By unstructured data, we mean data in files with the extensions `.pdf`, `.ppt`, `.pptx`, `.doc`, `.docx`, `.xls` and `.xlsx`.


### Get client for AzureML Workspace
The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

If you don't have a Workspace yet, see [here to create one](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources?view=azureml-api-2).

Enter your Workspace details below. Running this cell will write a `workspace.json` file to the current folder.

In [None]:
%%writefile workspace.json
{
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group_name>",
    "workspace_name": "<workspace_name>"
}

`MLClient` is how you interact with AzureML from Python. The cell below creates one from `workspace.json`.

In [17]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline
from azureml.core import Workspace

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work
    credential = InteractiveBrowserCredential()

try:
    ml_client = MLClient.from_config(credential=credential, path="workspace.json")
except Exception as ex:
    raise Exception(
        "Failed to create MLClient from config file. Please modify and then run the above cell with your AzureML Workspace details."
    ) from ex
    # ml_client = MLClient(
    #     credential=credential,
    #     subscription_id="",
    #     resource_group_name="",
    #     workspace_name=""
    # )

ws = Workspace(
    subscription_id=ml_client.subscription_id,
    resource_group=ml_client.resource_group_name,
    workspace_name=ml_client.workspace_name,
)
ml_client
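If the cell above raises, one likely cause is a placeholder left in `workspace.json`. The following is a minimal, standard-library sketch of a check for that. The helper name `find_config_problems` and the sample values `my-rg` and `my-ws` are hypothetical, and the check is not part of the `azure-ai-ml` SDK.

```python
import json

# Keys written by the `%%writefile workspace.json` cell.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def find_config_problems(config_text: str) -> list[str]:
    """Return human-readable problems found in a workspace config (hypothetical helper)."""
    config = json.loads(config_text)
    problems = []
    for key in REQUIRED_KEYS:
        value = str(config.get(key, ""))
        if not value:
            problems.append(f"missing value for '{key}'")
        elif value.startswith("<") and value.endswith(">"):
            # Values such as "<subscription_id>" are the unedited template.
            problems.append(f"placeholder left in '{key}': {value}")
    return problems


# Example: one placeholder was left untouched.
template = (
    '{"subscription_id": "<subscription_id>", '
    '"resource_group": "my-rg", "workspace_name": "my-ws"}'
)
print(find_config_problems(template))
```

To check the real file, pass the contents of `workspace.json` (for example via `open("workspace.json").read()`) instead of the sample string.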

### Which Embeddings Model to use?
There are currently two supported Embedding options: OpenAI's `text-embedding-ada-002` embedding model or HuggingFace embedding models. Here are some factors that might influence your decision:

#### OpenAI
OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on their Embeddings model `text-embedding-ada-002`. It can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly. If you have an existing Azure OpenAI Instance you can connect it to AzureML; if you don't, AzureML provisions a default one for you called `Default_AzureOpenAI`. The main limitation when using `text-embedding-ada-002` is the cost/quota available for the model. Otherwise it provides high quality embeddings across a wide array of text domains while being simple to use.

#### HuggingFace
HuggingFace hosts many different models capable of embedding text into vectors. The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embeddings models on a few axes. Not all models ranked can be run locally (e.g. `text-embedding-ada-002` is on the list), though many can, and there is a range of larger and smaller models. When embedding with HuggingFace the model is loaded locally for inference, which will potentially impact your choice of compute resources.

**NOTE**: The default PromptFlow Runtime does not come with HuggingFace model dependencies installed, so Indexes created using HuggingFace embeddings will not work in PromptFlow by default. **Pick OpenAI if you want to use PromptFlow**.

#### For this example, we will be using an OpenAI embedding model

We can use the automatically created `Default_AzureOpenAI` connection.

If you would rather use an existing Azure OpenAI connection then change `aoai_connection_name` below. If you would rather use an existing Azure OpenAI resource, but don't have a connection created, modify `aoai_connection_name` and the details under the `# Create New Connection` code comment, or navigate to the `PromptFlow` section in your AzureML Workspace and use the Connections create UI flow.

In [33]:
aoai_connection_name = "Default_AzureOpenAI"
aoai_connection_id = None

In [None]:
from azureml.rag.utils.connections import (
    get_connection_by_name_v2,
    create_connection_v2,
)

try:
    aoai_connection = get_connection_by_name_v2(ws, aoai_connection_name)
except Exception as ex:
    # Create New Connection
    # Modify the details below to match the `Endpoint` and API key of your AOAI resource, these details can be found in Azure Portal
    raise RuntimeError(
        "Have you entered your AOAI resource details below? If so, delete me!"
    )
    target = "<target>"  # example: 'https://<endpoint>.openai.azure.com/'
    key = "<key>"
    apiVersion = "2023-03-15-preview"
    if key == "<key>":
        raise RuntimeError(f"Please provide a valid key for the Azure OpenAI service")
    if target == "<target>":
        raise RuntimeError(
            f"Please provide a valid target for the Azure OpenAI service"
        )
    if apiVersion == "<api_version>":
        raise RuntimeError(
            f"Please provide a valid api-version for the Azure OpenAI service"
        )
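    # Keep the created connection itself so that later cells can read its properties.
    # The original code indexed the return value with ["id"], so it is assumed to be the connection dict.
    aoai_connection = create_connection_v2(
        workspace=ws,
        name=aoai_connection_name,
        category="AzureOpenAI",
        target=target,
        auth_type="ApiKey",
        credentials={"key": key},
        metadata={"ApiType": "azure", "ApiVersion": apiVersion},
    )

aoai_connection_id = aoai_connection["id"]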

Now that your Workspace has a connection to Azure OpenAI, we will make sure the `text-embedding-ada-002` model has been deployed and is ready for inference. If there is no deployment for the embeddings model, the cell below prints a pointer to the resource page; [follow these instructions](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#deploy-a-model) to deploy a model with Azure OpenAI.

In [None]:
from azureml.rag.utils.deployment import infer_deployment

aoai_embedding_model_name = "text-embedding-ada-002"
try:
    aoai_embedding_deployment_name = infer_deployment(
        aoai_connection, aoai_embedding_model_name
    )
    print(
        f"Deployment name in AOAI workspace for model '{aoai_embedding_model_name}' is '{aoai_embedding_deployment_name}'"
    )
except Exception as e:
    print(
        f"Deployment name in AOAI workspace for model '{aoai_embedding_model_name}' is not found."
    )
    if "ResourceId" in aoai_connection["properties"]["metadata"]:
        aoai_resource_url = f"https://portal.azure.com/resource/{aoai_connection['properties']['metadata']['ResourceId']}/overview"
        print(
            f"Please create a deployment for this model by following the deploy instructions on the resource page: {aoai_resource_url}"
        )
    else:
        print(
            f"Please create a deployment for this model by following the deploy instructions on the resource page for '{aoai_connection['properties']['target']}' in Azure Portal."
        )

Finally, we will combine the deployment and model information into the URI form which the AzureML embeddings components expect as input.

In [37]:
embeddings_model_uri = f"azure_open_ai://deployment/{aoai_embedding_deployment_name}/model/{aoai_embedding_model_name}"
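To make the shape of this URI concrete, here is a small standard-library sketch that splits such a string back into its parts. The helper `split_model_uri` and the deployment name `my-embedding-deployment` are hypothetical. This is not how `azureml.rag` parses the URI internally; it only illustrates the `azure_open_ai://deployment/<deployment>/model/<model>` layout built above.

```python
import re

# Layout assumed from the f-string above: <kind>://deployment/<deployment>/model/<model>
MODEL_URI_PATTERN = re.compile(
    r"(?P<kind>[a-z_]+)://deployment/(?P<deployment>[^/]+)/model/(?P<model>[^/]+)"
)


def split_model_uri(uri: str) -> dict:
    """Split an embeddings model URI into kind, deployment and model (hypothetical helper)."""
    match = MODEL_URI_PATTERN.fullmatch(uri)
    if match is None:
        raise ValueError(f"Unexpected embeddings model URI: {uri!r}")
    return match.groupdict()


example_uri = (
    "azure_open_ai://deployment/my-embedding-deployment/model/text-embedding-ada-002"
)
parts = split_model_uri(example_uri)
print(parts)
```

A regular expression is used rather than `urllib.parse.urlparse` because the underscores in `azure_open_ai` are not valid URL scheme characters, so `urlparse` would not recognise the prefix as a scheme.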

### Setup Pipeline to process data into Index
AzureML [Pipelines](https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2) connect together multiple [Components](https://learn.microsoft.com/en-us/azure/machine-learning/concept-component?view=azureml-api-2). Each Component defines inputs, code that consumes the inputs and outputs produced from the code. Pipelines themselves can have inputs, and outputs produced by connecting together individual sub Components. To process your data for embedding and indexing we will chain together multiple components each performing their own step of the workflow.

The Components are published to a [Registry](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-registries?view=azureml-api-2&tabs=cli), `azureml`, which you should have access to after signing up to the Generative AI Private Preview. It can be accessed from any Workspace as long as your Tenant has been granted access. In the below cell we get the Component Definitions from the `azureml` registry.

In [35]:
ml_registry = MLClient(credential=credential, registry_name="azureml")

# Clones git repository to output folder of pipeline, by default this will be on the default Workspace Datastore `workspaceblobstore`
git_clone_component = ml_registry.components.get("llm_rag_git_clone", label="latest")
# Walks input folder according to provided glob pattern (all files by default: '**/*') and attempts to open them, extract text chunks and further chunk if necessary to fit within provided `chunk_size`.
crack_and_chunk_component = ml_registry.components.get(
    "llm_rag_crack_and_chunk", label="latest"
)
# Reads input folder of files containing chunks and their metadata as batches, in parallel, and generates embeddings for each chunk. Output format is produced and loaded by `azureml.rag.embeddings.EmbeddingsContainer`.
generate_embeddings_component = ml_registry.components.get(
    "llm_rag_generate_embeddings", label="latest"
)
# Reads an input folder produced by `azureml.rag.embeddings.EmbeddingsContainer.save()` and pushes all documents (chunk, metadata, embedding_vector) into a FAISS index. Writes an MLIndex yaml detailing the index and embeddings model information.
create_faiss_index_component = ml_registry.components.get(
    "llm_rag_create_faiss_index", label="latest"
)
# Takes a uri to a storage location where an MLIndex yaml is stored and registers it as an MLIndex Data asset in the AzureML Workspace.
register_mlindex_component = ml_registry.components.get(
    "llm_rag_register_mlindex_asset", label="latest"
)

Each Component has documentation which provides an overall description of the Component's purpose and each of the inputs/outputs. For example, we can understand what `crack_and_chunk` does by inspecting the Component definition.

In [None]:
crack_and_chunk_component

Below, a Pipeline is built by defining a Python function which chains together the above Components' inputs and outputs. Arguments to the function are inputs to the Pipeline itself and the return value is a dictionary defining the outputs of the Pipeline.

In [38]:
from azure.ai.ml.entities._job.pipeline._io import PipelineInput
from typing import Optional


def use_automatic_compute(component, instance_count=1, instance_type="Standard_E8s_v3"):
    """Configure input `component` to use automatic compute with `instance_count` and `instance_type`.

    This avoids the need to provision a compute cluster to run the component.
    """
    component.set_resources(
        instance_count=instance_count,
        instance_type=instance_type,
        properties={"compute_specification": {"automatic": True}},
    )
    return component


def optional_pipeline_input_provided(input: Optional[PipelineInput]):
    """Checks if optional pipeline inputs are provided."""
    return input is not None and input._data is not None


# If you have an existing compute cluster you want to use instead of automatic compute, uncomment the following line, replace `dedicated_cpu_compute` with the name of your cluster.
# Also comment out the `component.set_resources` line in `use_automatic_compute` above and the `default_compute='serverless'` line below.
# @pipeline(compute=dedicated_cpu_compute)
@pipeline(default_compute="serverless")
def urifolder_to_faiss(
    input_data: Input,
    embeddings_model: str,
    asset_name: str,
    data_source_glob: str = None,
    data_source_url: str = None,
    document_path_replacement_regex: str = None,
    chunk_size: int = 1024,
    aoai_connection_id=None,
    embeddings_container=None,
):
    crack_and_chunk = crack_and_chunk_component(
        input_data=input_data,
        input_glob=data_source_glob,
        chunk_size=chunk_size,
        data_source_url=data_source_url,
        document_path_replacement_regex=document_path_replacement_regex,
    )
    use_automatic_compute(crack_and_chunk)

    generate_embeddings = generate_embeddings_component(
        chunks_source=crack_and_chunk.outputs.output_chunks,
        embeddings_container=embeddings_container,
        embeddings_model=embeddings_model,
    )
    use_automatic_compute(generate_embeddings)
    if optional_pipeline_input_provided(aoai_connection_id):
        generate_embeddings.environment_variables[
            "AZUREML_WORKSPACE_CONNECTION_ID_AOAI"
        ] = aoai_connection_id

    if optional_pipeline_input_provided(embeddings_container):
        # If provided, `embeddings_container` is expected to be a URI to an 'embeddings container' folder.
        # Each folder under this folder is generated by a `generate_embeddings_component` run and can be reused for subsequent embeddings runs.
        generate_embeddings.outputs.embeddings = Output(
            type="uri_folder", path=f"{embeddings_container.path}/{{name}}"
        )

    create_faiss_index = create_faiss_index_component(
        embeddings=generate_embeddings.outputs.embeddings,
    )
    use_automatic_compute(create_faiss_index)
    register_mlindex = register_mlindex_component(
        storage_uri=create_faiss_index.outputs.index, asset_name=asset_name
    )
    use_automatic_compute(register_mlindex)
    return {
        "mlindex_asset_uri": create_faiss_index.outputs.index,
        "mlindex_asset_id": register_mlindex.outputs.asset_id,
    }

Now we can create the Pipeline Job by calling the `@pipeline` annotated function and providing input arguments. `asset_name` will be used when registering the MLIndex Data Asset produced by the `register_mlindex` component in the pipeline. This is how you can refer to the MLIndex within AzureML. For this job, the input data is an asset of type `URI_FOLDER` that lives in the default Datastore of the AzureML Workspace being used. The folder contains two files:

1. MSFT_FY23Q1_10Q.docx
2. MSFT_FY23Q2_10Q.docx

These files contain Microsoft's quarterly financial reports for the first and second quarter of the 2023 fiscal year. They are publicly available and are also available in the `data` folder relative to the location of this notebook file. In the pipeline job below, we use only the first file: the glob pattern `**/*[Q1]_10Q.docx` filters out the second one. We do this to demonstrate the incremental embedding capability. In a later run, we will include both files and see how it affects the scenario.
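The reason `**/*[Q1]_10Q.docx` selects only the first file is that `[Q1]` is a character class matching a single `Q` or `1`, and the character just before `_10Q.docx` is `1` in one file name and `2` in the other. The sketch below illustrates this with Python's `fnmatch` module. The helper `matching` is hypothetical, and the glob implementation used by the `crack_and_chunk` component is assumed, not confirmed, to treat character classes the same way.

```python
from fnmatch import fnmatchcase

files = ["MSFT_FY23Q1_10Q.docx", "MSFT_FY23Q2_10Q.docx"]


def matching(pattern: str) -> list[str]:
    """Return the file names matched by the file-name part of a glob (hypothetical helper)."""
    # Only the last path segment is compared; the `**/` prefix just allows any folder depth.
    name_pattern = pattern.split("/")[-1]
    return [name for name in files if fnmatchcase(name, name_pattern)]


first_run = matching("**/*[Q1]_10Q.docx")    # class contains Q and 1
second_run = matching("**/*[Q12]_10Q.docx")  # class contains Q, 1 and 2
print(first_run)
print(second_run)
```

With the two file names above, the first pattern keeps only the Q1 report, while a class that also contains `2` keeps both.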

Here are screenshots of the two tables from the financial report that our model will be referencing to answer questions.

Q1
![image-alt-text](data/General_and_administrative_q1.png)

Q2
![image-alt-text](data/General_and_administrative_q2.png)

In [None]:
asset_name = "microsoft-earnings-fy23"
data_source_glob = "**/*[Q1]_10Q.docx"

pipeline_job = urifolder_to_faiss(
    input_data=Input(
        type=AssetTypes.URI_FOLDER, path="data/"
    ),  # This will upload the data folder to the default Workspace Datastore `workspaceblobstore`
    data_source_glob=data_source_glob,
    embeddings_model=embeddings_model_uri,
    asset_name=asset_name,
    aoai_connection_id=aoai_connection_id,
    embeddings_container=Input(
        type="uri_folder",
        path=f"azureml://datastores/workspaceblobstore/paths/embeddings/{asset_name}",
    ),
)

**Note**: By default AzureML Pipelines will reuse the output of previous component Runs when inputs have not changed.
If you want to rerun the Pipeline each time, so that any changes to upstream data sources are processed and the `crack_and_chunk` component isn't served from cache, set `pipeline_job.settings.force_rerun = True` before submitting the job.

Finally we add some properties to `pipeline_job` which ensure the Index generation progress and final Artifact appear in the PromptFlow Vector Index UI.

In [None]:
# These are added so that in progress index generations can be listed in UI, this tagging is done automatically by UI.
pipeline_job.properties["azureml.mlIndexAssetName"] = asset_name
pipeline_job.properties["azureml.mlIndexAssetKind"] = "faiss"
pipeline_job.properties["azureml.mlIndexAssetSource"] = "Uri Folder"

### Submit Pipeline
**In case of any errors see** TROUBLESHOOT.md.

The output of each step in the pipeline can be inspected via the Workspace UI, click the link under 'Details Page' after running the below cell.

In [None]:
running_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="incremental_embedding_with_table"
)
running_pipeline_job

In [None]:
ml_client.jobs.stream(running_pipeline_job.name)

### Use Index with langchain
The Data Asset produced by the AzureML Pipeline above contains a yaml file named `MLIndex` which contains all the information needed to use the FAISS index. For instance, if an AOAI deployment was used to embed the documents, the details of that deployment and a reference to the secret are there. This allows easy loading of the MLIndex into a langchain retriever. If you have not deployed `gpt-35-turbo` on your Azure OpenAI resource, the below cell will fail with an error indicating that the `API deployment for this resource does not exist`. Follow the previous instructions for deploying `text-embedding-ada-002` to deploy `gpt-35-turbo`, and note the deployment name you choose: use the same name as below, or update the model URI in the cell if you chose a different one.

In [48]:
from azureml.rag.mlindex import MLIndex
from langchain.chains import RetrievalQA
from azureml.rag.models import init_llm, parse_model_uri


model_config = parse_model_uri(
    "azure_open_ai://deployment/gpt-35-turbo/model/gpt-35-turbo"
)
model_config["api_base"] = aoai_connection["properties"]["target"]
model_config["key"] = aoai_connection["properties"]["credentials"]["key"]
model_config["temperature"] = 0.3

In [49]:
retriever = MLIndex(
    ml_client.data.get(asset_name, label="latest")
).as_langchain_retriever()
question = "In the three months ended September 30, 2022, what were the expenses of Microsoft in the 'General and administrative' segment and as a percentage of the revenue?"
retriever.get_relevant_documents(question)
qa = RetrievalQA.from_chain_type(
    llm=init_llm(model_config), chain_type="stuff", retriever=retriever
)
qa.run(question)

"In the three months ended September 30, 2022, Microsoft's expenses in the 'General and administrative' segment were $1,398 million. As a percentage of the revenue, it was 3%."

Amazing! The model was able to successfully retrieve information that was only available in a table in the provided Word document and answer the question accordingly. The Q1 screenshot shown earlier in this notebook contains the table for reference.

In [50]:
question = "In the three months ended December 31, 2022, what were the expenses of Microsoft in the 'General and administrative' segment and as a percentage of the revenue?"
retriever.get_relevant_documents(question)
qa = RetrievalQA.from_chain_type(
    llm=init_llm(model_config), chain_type="stuff", retriever=retriever
)
qa.run(question)

"The financial information for the three months ended December 31, 2022 is not provided in the given context. The latest financial information available in the context is for the three months ended September 30, 2022. In that period, the expenses of Microsoft in the 'General and administrative' segment were $1,398 million, and as a percentage of revenue, it was 3%."

Note from the response that the model is able to retrieve the closest document to answer the question and is smart enough to determine that it doesn't have the requested information. Now let's give it more information and see how it reacts to the same question. To do so, we update the `data_source_glob` to `**/*[Q12]_10Q.docx` so that it matches both Word documents.

In [None]:
data_source_glob = "**/*[Q12]_10Q.docx"
pipeline_job = urifolder_to_faiss(
    input_data=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/msft_earnings_fy23",
    ),
    data_source_glob=data_source_glob,
    embeddings_model=embeddings_model_uri,
    asset_name=asset_name,
    aoai_connection_id=aoai_connection_id,
    embeddings_container=Input(
        type="uri_folder",
        path=f"azureml://datastores/workspaceblobstore/paths/embeddings/{asset_name}",
    ),
)

running_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="incremental_embedding_with_table"
)
running_pipeline_job

In [None]:
ml_client.jobs.stream(running_pipeline_job.name)

In [56]:
retriever = MLIndex(
    ml_client.data.get(asset_name, label="latest")
).as_langchain_retriever()
# Define the question before using it, so this cell does not depend on the earlier cell's variable.
question = "In the three months ended December 31, 2022, what were the expenses of Microsoft in the 'General and administrative' segment and as a percentage of the revenue?"
retriever.get_relevant_documents(question)
qa = RetrievalQA.from_chain_type(
    llm=init_llm(model_config), chain_type="stuff", retriever=retriever
)
qa.run(question)

"In the three months ended December 31, 2022, Microsoft's General and administrative expenses were $2,337 million and represented 4% of the revenue."

This time, when asked the same question, the model was able to answer the question correctly because our new index contained the information about the second quarter financial data. 

Something that's not visible to the eye but happens in the backend is incremental embedding. What this means is that, during the second MLIndex creation, we don't re-embed the files that were already embedded during the first MLIndex creation. This can be verified from the user logs of the `LLM - Generate Embeddings Parallel` component of the corresponding job. Here is a snippet of it that shows the embedding of the first quarter being skipped.

`[2023-06-30 14:04:46] INFO     azureml.rag.azureml.rag.embeddings - Processing document: MSFT_FY23Q1_10Q.docx0 (embeddings.py:646)
INFO:azureml.rag.azureml.rag.embeddings:Processing document: MSFT_FY23Q1_10Q.docx0`

`[2023-06-30 14:04:46] INFO     azureml.rag.azureml.rag.embeddings - Skip embedding document MSFT_FY23Q1_10Q.docx0 as it has not been modified since last embedded (embeddings.py:670)`

The benefit is that embedding generation on subsequent runs is much faster, which is most noticeable and most helpful when you have a large number of files.
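The idea behind the skipped document in the log can be sketched in a few lines of plain Python. This is an illustration only, not the `azureml.rag` implementation: the cache layout, the `fake_embed` stand-in, the shortened file names and the modification-time comparison are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CachedEmbedding:
    modified_at: float   # modification time of the source when it was embedded
    vector: list[float]  # the stored embedding


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embeddings call (hypothetical)."""
    return [float(len(text))]


def embed_incrementally(documents: dict, cache: dict) -> list[str]:
    """Embed only new or modified documents and return the names that were embedded.

    `documents` maps a name to a (modified_at, text) tuple.
    """
    embedded = []
    for name, (modified_at, text) in documents.items():
        cached = cache.get(name)
        if cached is not None and cached.modified_at >= modified_at:
            continue  # not modified since last embedded, so reuse the cached vector
        cache[name] = CachedEmbedding(modified_at, fake_embed(text))
        embedded.append(name)
    return embedded


cache = {}
first = embed_incrementally({"Q1.docx": (1.0, "first quarter")}, cache)
second = embed_incrementally(
    {"Q1.docx": (1.0, "first quarter"), "Q2.docx": (2.0, "second quarter")}, cache
)
print(first, second)
```

In this sketch the second call embeds only `Q2.docx`, mirroring how the second pipeline run skips the first quarter's document.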


