# Create an Index from Various Sources

## Objective

This notebook demonstrates the following:

- Create an index from
    - Your local files/folders
    - Git Repo
    - Remote sources like S3, OneLake

This tutorial requires the following Azure AI services and resources:

- Access to Azure OpenAI Service - you can apply for access [here](https://go.microsoft.com/fwlink/?linkid=2222006)
- An Azure Cognitive Search service - you can create one by following the instructions [here](https://learn.microsoft.com/azure/search/search-create-service-portal)
- An Azure AI Studio project - go to [aka.ms/azureaistudio](https://aka.ms/azureaistudio) to create a project
- A connection to the Azure Cognitive Search service in your project


## Time

You should expect to spend 15-30 minutes running this sample.

## About this example

This sample shows how to create an index from different sources: local files, a Git repo, and cloud storage URLs. In every scenario the resulting index is stored in an Azure Cognitive Search service.

This sample is useful for developers and data scientists who wish to use their own data to create an index for use in the Retrieval Augmented Generation (RAG) pattern.

### Data

For this sample we will use data from the blob https://azuremlexamples.blob.core.windows.net/datasets/product-info/

## Before you begin



### Installation

Install the following packages required to execute this notebook. 



In [None]:
# Install the packages
!pip3 install azure-identity azure-ai-generative azure-ai-resources

### Parameters

In [None]:
# project details
subscription_id: str = "<your-subscription-id>"
resource_group_name: str = "<your-resource-group>"
project_name: str = "<your-project-name>"

# Azure Cognitive Search Connection
acs_connection_name: str = "<your-acs-connection>"

# model used for embedding
embedding_model_deployment: str = "text-embedding-ada-002"

# names of indexes we will create
local_index_local_files_index_name = "local-index-local-files-index"
cloud_index_git_index_name = "cloud-index-git-index"
cloud_index_remote_url_index_name = "cloud-index-remote-url-index"
cloud_index_local_files_index_name = "cloud-index-local-files-index"

should_cleanup: bool = False

## Connect to your project

To start, let us create a config file with your project details. This file can be used in this sample or in other samples to connect to your project. You can find the required details on the Project Overview page in AI Studio.

In [None]:
import json
from pathlib import Path

config = {
    "subscription_id": subscription_id,
    "resource_group": resource_group_name,
    "project_name": project_name,
}

p = Path("config.json")

with p.open(mode="w") as file:
    file.write(json.dumps(config))
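
Because the config file is built from the placeholder parameters defined earlier, it is easy to write a file that still contains `<your-...>` values. The following is a minimal sketch of a check for that, using only the standard library. The helper name `unfilled_keys` and the example values are hypothetical and are not part of the Azure AI SDK.

```python
import re
from typing import Dict, List

# Matches values such as "<your-subscription-id>" that were never replaced.
PLACEHOLDER_PATTERN = re.compile(r"^<[^<>]*>$")


def unfilled_keys(settings: Dict[str, str]) -> List[str]:
    """Return the keys whose values are empty or still a <placeholder> (hypothetical helper)."""
    missing = []
    for key, value in settings.items():
        text = str(value).strip()
        if not text or PLACEHOLDER_PATTERN.match(text):
            missing.append(key)
    return sorted(missing)


# Example values only; substitute the `config` dictionary built above.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "my-resource-group",
    "project_name": "",
}
missing_keys = unfilled_keys(example_config)
print(missing_keys)
```

Running the same check on `config` before connecting can surface a forgotten parameter earlier than a failed authentication call would.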

Let us connect to the project.

In [None]:
from azure.ai.resources.client import AIClient
from azure.identity import DefaultAzureCredential

# connects to project defined in the first config.json found in this or parent folders
client = AIClient.from_config(DefaultAzureCredential())
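
The comment in the cell above says the client uses the first `config.json` found in this folder or one of its parents. The sketch below illustrates that kind of upward search with the standard library only. It is not the SDK's actual lookup code, and the helper name `find_config` is hypothetical; it may be handy for checking which file would be picked up when several exist.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "config.json") -> Optional[Path]:
    """Return the nearest `filename` in `start` or any parent folder (hypothetical helper)."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        candidate = folder / filename
        if candidate.is_file():
            return candidate
    return None


# Demonstrate with a throwaway folder tree so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    nested = root / "samples" / "index"
    nested.mkdir(parents=True)
    (root / "config.json").write_text(json.dumps({"project_name": "demo-project"}))

    found = find_config(nested)
    found_project = json.loads(found.read_text())["project_name"]
    print(found_project)
```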

## Retrieve Azure OpenAI and Cognitive Search Connections
We will use an Azure OpenAI service to access the LLM and embedding model, and an Azure Cognitive Search service to store the index. Let us get the details of both connections from your project.

In [None]:
# Get the default Azure OpenAI connection for your project
default_aoai_connection = client.get_default_aoai_connection()
default_aoai_connection.set_current_environment()

# Get the Azure Cognitive Search connection by name
default_acs_connection = client.connections.get(acs_connection_name)
default_acs_connection.set_current_environment()

## 1. Build an Index locally from local files or folders

You can build an index from your local files or folders. We will build an index using the `build_index` function. This will create an index on the machine where this sample is run. The local index can then be added/registered to your AI Studio project.

You can index files of type `.md, .txt, .html, .htm, .py, .doc, .docx, .ppt, .pptx, .pdf, .xls, .xlsx`. All other file types will be ignored.
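
To preview which files in a folder would be picked up, the extension list above can be applied with a few lines of standard-library Python. This is only a sketch: the helper name `split_by_support` is hypothetical, and the case-insensitive comparison is an assumption rather than documented SDK behavior.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Extensions listed in the text above; everything else is skipped.
SUPPORTED_EXTENSIONS = {
    ".md", ".txt", ".html", ".htm", ".py", ".doc", ".docx",
    ".ppt", ".pptx", ".pdf", ".xls", ".xlsx",
}


def split_by_support(folder: Path) -> Tuple[List[str], List[str]]:
    """Return (indexable, ignored) relative file paths under `folder` (hypothetical helper)."""
    indexable, ignored = [], []
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(folder).as_posix()
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            indexable.append(relative)
        else:
            ignored.append(relative)
    return indexable, ignored


# Demonstrate on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["product_info_1.md", "notes.txt", "logo.png", "data.csv"]:
        (folder / name).write_text("placeholder")
    indexable_files, ignored_files = split_by_support(folder)

print("indexable:", indexable_files)
print("ignored:", ignored_files)
```

Pointing the helper at `data/product-info/` instead of the temporary folder gives a quick inventory before running the build.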

> In this notebook, we will use Azure Cognitive Search (ACS) as the index store for all our scenarios. You could also use FAISS or Pinecone as the index store.

### 1.1 Build the Index locally
The step below chunks and embeds your documents locally and then adds them to an index in the Azure Cognitive Search service.
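
For readers new to the term, "chunking" means splitting each document into smaller pieces before embedding. The sketch below shows one simple strategy, fixed-size character windows with overlap. It is an illustration only: the SDK's own chunking strategy is not described in this notebook and is likely different, and the function name and the `chunk_size` and `overlap` values are assumptions chosen for the example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split `text` into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = (
    "Chunking splits a long document into smaller pieces "
    "so that each piece fits the embedding model."
)
chunks = chunk_text(sample_text, chunk_size=40, overlap=10)
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.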

In [None]:
from azure.ai.resources.operations import LocalSource, ACSOutputConfig
from azure.ai.generative.index import build_index

# build the index
acs_index = build_index(
    output_index_name=local_index_local_files_index_name,  # name of your index
    vector_store="azure_cognitive_search",  # the type of vector store - in our case it is ACS
    # we are using ada 002 for embedding
    embeddings_model=f"azure_open_ai://deployment/{embedding_model_deployment}/model/text-embedding-ada-002",
    index_input_config=LocalSource(input_data="data/product-info/"),  # the location of your file/folders
    acs_config=ACSOutputConfig(
        acs_index_name=local_index_local_files_index_name
        + "-store",  # the name of the index store inside the azure cognitive search service
    ),
)

### 1.2 Register the index
Register the index so that it shows up in the AI Studio Project.

In [None]:
client.indexes.create_or_update(acs_index)

## 2. Build an index on the Cloud

You can build an index directly on the cloud (your AI Studio project) from local files or folders, as well as from remote sources such as a Git repo, [OneLake](https://learn.microsoft.com/fabric/onelake/onelake-overview), [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide), and generic cloud URLs.

In this section we will use the `build_ml_index_on_cloud` function. This function will create an index directly in your AI Studio project by running a job to perform the required steps directly in your project.

> In this notebook, we will use Azure Cognitive Search (ACS) as the index store for all our scenarios. You could also use FAISS or Pinecone as the index store.

### 2.1 Build an index on cloud from a git repo

Let us build an index from the Rust book GitHub repository (`rust-lang/book`).

#### 2.1.1 Configure the source

Let us configure the Git repo that supplies the data. In this case we are using a public repo. If you need to use a private repo, you could add a **New Connection** of type `Git` in AI Studio and pass its name as `git_connection_id`.

In [None]:
from azure.ai.resources.operations import GitSource

git_config = GitSource(git_url="https://github.com/rust-lang/book.git", git_branch_name="main", git_connection_id="")

#### 2.1.2 Configure the index store

Let us configure the index name and the connection to Azure Cognitive Search.

In [None]:
from azure.ai.resources.operations import ACSOutputConfig

index_output_config = ACSOutputConfig(
    acs_index_name=cloud_index_git_index_name
    + "-store",  # the name of the index store inside the azure cognitive search service
    acs_connection_id=default_acs_connection.id,
)

#### 2.1.3 Build the index

We will use the `build_ml_index_on_cloud` function. This function will create an index directly in your AI Studio project by running a job to perform the required steps directly in your project. The output of this cell will provide a link to the job which will create the index. Click on the link to track status. You need to wait for the job to complete before using the index.
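
If you prefer to wait in code instead of watching the job link, a generic polling loop is one option. The sketch below is deliberately independent of the SDK: `get_status` is a stand-in callable that you would need to wire to whatever job-status call your environment provides, and the terminal status names are assumptions. This notebook does not show how to query the job status programmatically.

```python
import time
from typing import Callable

# Assumed names for states in which the job has stopped running.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll `get_status` until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Simulated job that reports three successive states.
fake_states = iter(["Queued", "Running", "Completed"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0.01, timeout_seconds=5)
print(final_state)
```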

In [None]:
client.build_ml_index_on_cloud(
    output_index_name=cloud_index_git_index_name,
    vector_store="azure_cognitive_search",
    embeddings_model="text-embedding-ada-002",
    aoai_connection_id=default_aoai_connection.id,
    data_source_url="https://github.com/rust-lang/book/blob/main",
    input_source=git_config,
    acs_config=index_output_config,
)

### 2.2 Build an index on cloud from storage URLs

Let us build an index from storage URLs (cloud locations). You can build an index from the following types of storage locations:

|Location|URL Examples|
|--|--|
|Blob|`wasb[s]://<container_name>@<account_name>.blob.core.windows.net/<path_to_folder>`|
|OneLake (Lakehouse)|`abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/<LakehouseName>.Lakehouse/Files/<path_to_folder>`|
|OneLake (Warehouse)|`abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/<warehouseName>.warehouse/Files/<path_to_folder>`|
|Amazon S3 (linked as a OneLake shortcut)|`abfss://<workspace-name>@onelake.dfs.fabric.microsoft.com/<LakehouseName>.Lakehouse/Files/<path_to_S3_folder>`|
|ADLS|`abfss://<filesystem>@<accountname>.dfs.core.windows.net/<path_to_folder>`|

You will need to ensure that either you or your AI Studio project has access to these resources in order to read the data.
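
The URL patterns in the table share one shape: `scheme://<container>@<host>/<path>`. As a rough sketch, the standard library's `urlsplit` can break such a URL into its parts, which is a quick way to catch a mistyped container or account name before submitting a job. The helper name `describe_storage_url` is hypothetical, and `my-workspace` and `MyLakehouse` are placeholder values.

```python
from typing import Dict
from urllib.parse import urlsplit


def describe_storage_url(url: str) -> Dict[str, str]:
    """Break a wasb[s]:// or abfss:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme not in {"wasb", "wasbs", "abfss"}:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.username or not parts.hostname:
        raise ValueError("expected scheme://<container>@<host>/<path>")

    # Classify the location by its host name, following the table above.
    host = parts.hostname
    if host.endswith(".blob.core.windows.net"):
        kind = "Blob"
    elif host == "onelake.dfs.fabric.microsoft.com":
        kind = "OneLake"
    elif host.endswith(".dfs.core.windows.net"):
        kind = "ADLS"
    else:
        kind = "unknown"

    return {
        "kind": kind,
        "container": parts.username,  # the part before "@"
        "host": host,
        "path": parts.path.lstrip("/"),
    }


blob = describe_storage_url(
    "wasbs://datasets@azuremlexamples.blob.core.windows.net/product-info"
)
lake = describe_storage_url(
    "abfss://my-workspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/docs"
)
print(blob)
print(lake)
```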

#### 2.2.1 Configure the source

In this notebook we use a publicly accessible blob URL because it is simple to set up and needs no specific user permissions. Let us configure the blob URL.

In [None]:
remote_source = "wasbs://datasets@azuremlexamples.blob.core.windows.net/product-info"

#### 2.2.2 Configure the index store
Let us configure the index name and the connection to Azure Cognitive Search.

In [None]:
from azure.ai.resources.operations import ACSOutputConfig

index_output_config = ACSOutputConfig(
    acs_index_name=cloud_index_remote_url_index_name
    + "-store",  # the name of the index store inside the azure cognitive search service
    acs_connection_id=default_acs_connection.id,
)

#### 2.2.3 Build the index

We will use the `build_ml_index_on_cloud` function. This function will create an index directly in your AI Studio project by running a job to perform the required steps directly in your project. The output of this cell will provide a link to the job which will create the index. Click on the link to track status. You need to wait for the job to complete before using the index.

Since we are using a publicly accessible storage location, we will not configure the identity. However, you can create an index using your own credentials (those of the person submitting the command) by passing a [UserIdentityConfiguration](https://learn.microsoft.com/python/api/azure-ai-ml/azure.ai.ml.useridentityconfiguration), as in the commented-out lines of the cell below. This is useful when only you have access to the storage location.

In [None]:
# from azure.ai.ml import UserIdentityConfiguration

client.build_ml_index_on_cloud(
    output_index_name=cloud_index_remote_url_index_name,
    vector_store="azure_cognitive_search",
    embeddings_model="text-embedding-ada-002",
    aoai_connection_id=default_aoai_connection.id,
    data_source_url="https://azuremlexamples.blob.core.windows.net/datasets/product-info",
    input_source=remote_source,
    acs_config=index_output_config,
    # identity=UserIdentityConfiguration(),
)

### 2.3 Build an index on cloud from local files

Let us build an index on the cloud from local files or folders. In this case the index is created directly on the cloud rather than on the local machine.

#### 2.3.1 Configure the source

Use the local files or folders as the source.

In [None]:
from azure.ai.resources.operations import LocalSource

local_source = LocalSource(input_data="data/product-info/")

#### 2.3.2 Configure the index store
Let us configure the index name and the connection to Azure Cognitive Search.

In [None]:
from azure.ai.resources.operations import ACSOutputConfig

index_output_config = ACSOutputConfig(
    acs_index_name=cloud_index_local_files_index_name
    + "-store",  # the name of the index store inside the azure cognitive search service
    acs_connection_id=default_acs_connection.id,
)

#### 2.3.3 Build the index

We will use the `build_ml_index_on_cloud` function. This function will create an index directly in your AI Studio project by running a job to perform the required steps directly in your project. The output of this cell will provide a link to the job which will create the index. Click on the link to track status. You need to wait for the job to complete before using the index.

In [None]:
# from azure.ai.ml import UserIdentityConfiguration

client.build_ml_index_on_cloud(
    output_index_name=cloud_index_local_files_index_name,
    vector_store="azure_cognitive_search",
    embeddings_model="text-embedding-ada-002",
    aoai_connection_id=default_aoai_connection.id,
    input_source=local_source,
    acs_config=index_output_config,
)

## Consuming an Index

Any of these indexes can be consumed in the same way. Refer to the [Retrieval Augmented Generation (RAG) using Azure AI SDK](../rag-e2e/rag-qna.ipynb) notebook for details on consuming the index.

## Cleaning up

To clean up all Azure ML resources used in this example, you can delete the individual resources you created in this tutorial.

If you made a resource group specifically to run this example, you could instead [delete the resource group](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/delete-resource-group).

In [None]:
if should_cleanup:
    # {{TODO: Add resource cleanup}}
    pass