# Using SLMs and LLMs together for advanced query processing with LlamaIndex, the Azure AI model catalog, and tracing capabilities

In this notebook, you will learn how to use `llama-index` with models from the Azure AI model catalog, deployed to Azure AI Foundry or Azure Machine Learning, to build advanced query routing in a RAG application. You will also learn how to use tracing to understand what your code is doing.

In [None]:
import nest_asyncio

nest_asyncio.apply()

## 1. Prerequisites

To run this tutorial, you need one of the following setups:

1. Using GitHub Models:

    1. You can use the [GitHub Models](https://github.com/marketplace/models) endpoint, including the free tier experience.
    2. Use the endpoint `https://models.inference.ai.azure.com` along with your GitHub Token.

1. Using Azure AI Foundry:

    1. Create an [Azure subscription](https://azure.microsoft.com).
    2. Create an Azure AI hub resource as explained at [How to create and manage an Azure AI Studio hub](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-azure-ai-resource).
    3. Deploy an SLM, which is used to determine the query processor to use. In this example we use a `Phi-3-mini-4k-instruct` deployment.
    4. Deploy an LLM, which is used to generate summaries of the answers. In this example we use a `Mistral-Large-2407` deployment.
    5. Deploy an embeddings model. In this example we use a `Cohere Embed V3` deployment. 

        * You can follow the instructions at [Add and configure models to Azure AI model inference service](https://learn.microsoft.com/azure/ai-studio/ai-services/how-to/create-model-deployments).
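
The code cells later in this notebook read the endpoint and key from the `AZURE_INFERENCE_ENDPOINT` and `AZURE_INFERENCE_CREDENTIAL` environment variables. The following is a minimal sketch of how you might resolve those values for either option. The helper name `resolve_inference_config` and the `GITHUB_TOKEN` variable name are assumptions made for illustration; they are not part of any SDK.

```python
import os

# GitHub Models endpoint quoted in the prerequisites above.
GITHUB_MODELS_ENDPOINT = "https://models.inference.ai.azure.com"


def resolve_inference_config(env=None):
    """Return (endpoint, credential) for the model clients in this notebook.

    Hypothetical helper: prefers an explicit Azure AI endpoint and key, and
    falls back to GitHub Models when only a token is available.
    GITHUB_TOKEN is an assumed variable name.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_INFERENCE_ENDPOINT")
    credential = env.get("AZURE_INFERENCE_CREDENTIAL")
    if endpoint and credential:
        return endpoint, credential

    token = env.get("GITHUB_TOKEN")
    if token:
        return GITHUB_MODELS_ENDPOINT, token

    raise RuntimeError(
        "Set AZURE_INFERENCE_ENDPOINT and AZURE_INFERENCE_CREDENTIAL, "
        "or set GITHUB_TOKEN to use GitHub Models."
    )
```

Failing early with a clear message tends to be easier to debug than a `KeyError` raised deep inside a later cell.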

You need the following packages:

```bash
pip install -U llama-index llama-index-llms-azure-inference llama-index-embeddings-azure-inference azure-ai-projects
```

Note that to configure instrumentation, you need to install the OpenTelemetry extension for Azure AI Inference SDK:

```bash
pip install -U "azure-ai-inference[opentelemetry]" azure-monitor-opentelemetry opentelemetry-semantic-conventions-ai
```

## 2. Get the connection string to Application Insights

You can use the tracing capabilities in Azure AI Foundry by creating a tracer. Logs are stored in Azure Application Insights, where they can be queried at any time, so you need a connection string to that resource. Each AI hub has an Azure Application Insights resource created for it. You can get the connection string in **either** of the following ways:

### Using the connection string directly

In [None]:
import os

application_insights_connection_string = os.environ["AZURE_APPINSIGHT_CONNECTION"]
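
Before passing the value on, it can help to check that it looks like an Application Insights connection string, which is a semicolon-separated list of `Key=Value` pairs that includes an `InstrumentationKey`. The sketch below is an optional sanity check; both function names are hypothetical and the check is deliberately loose.

```python
def parse_connection_string(conn_str):
    """Split 'Key1=Value1;Key2=Value2' into a dict, ignoring empty segments."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip()] = value.strip()
    return parts


def check_app_insights_connection(conn_str):
    """Return the parsed parts, or raise if no InstrumentationKey is present."""
    parts = parse_connection_string(conn_str)
    if "InstrumentationKey" not in parts:
        raise ValueError("Connection string has no InstrumentationKey")
    return parts
```

For example, `check_app_insights_connection(application_insights_connection_string)` raises a `ValueError` if the environment variable holds something else, such as a bare key.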

### Using the Azure AI Foundry SDK

You can also get the connection string to Application Insights by using the Azure AI Foundry SDK along with the connection string to the project, as follows:

Install the Azure AI Foundry SDK:

```bash
pip install azure-ai-projects
```

In [None]:
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<your-project-connection-string>",
)

application_insights_connection_string = project_client.telemetry.get_connection_string()

> You can find the project connection string on the landing page of your project.

## 3. Configure instrumentation

In [None]:
from azure.ai.inference.tracing import AIInferenceInstrumentor
from azure.core.settings import settings
from azure.monitor.opentelemetry import configure_azure_monitor

settings.tracing_implementation = "opentelemetry"
configure_azure_monitor(connection_string=application_insights_connection_string)
AIInferenceInstrumentor().instrument(enable_content_recording=True)

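Next, configure LlamaIndex instrumentation for OpenTelemetry. The `LlamaIndexInstrumentor` class comes from the `openinference-instrumentation-llama-index` package, which is not included in the install commands above and must be installed separately: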

In [None]:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

instrumentor = LlamaIndexInstrumentor()
instrumentor.instrument()

## 4. Creating a RAG application

In the following example, we will create a RAG application that uses multiple models.

In [None]:
import os
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import SummaryIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.selectors import LLMSingleSelector

from llama_index.llms.azure_inference import AzureAICompletionsModel
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

Let's instantiate an LLM:

In [None]:
llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="mistral-large-2407",
)

And an embeddings model:

In [None]:
embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="cohere-embed-v3-english",
)

We configure these models as the defaults in our application:

In [None]:
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

### Load data

In this example, we will use documents from Paul Graham's essays. The data is stored locally in the repository:

In [None]:
documents = SimpleDirectoryReader("data/paul_graham").load_data()

Generate the nodes based on the basic configuration for chunking:

In [None]:
nodes = Settings.node_parser.get_nodes_from_documents(documents)
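
To build intuition for what chunking produces, here is a deliberately simplified sketch that splits text into overlapping windows of words. It is not the LlamaIndex node parser, which works on tokens and respects sentence boundaries; the function name `chunk_words` and its parameters are illustrative assumptions.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current window already reached the end of the text
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which tends to help retrieval.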

In this simple example, we will store the documents in memory so we don't need a vector database:

In [None]:
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

### Summary index

Let's first create a summary index. We use this index to answer complex queries from the user that require going through many documents.

In [None]:
summary_index = SummaryIndex(nodes, storage_context=storage_context)

In [None]:
summarize_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=True,
)
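
The idea behind a tree-style summarization can be sketched in plain Python: summarize groups of chunks, then summarize the summaries, until a single answer remains. This is a simplification under stated assumptions. The actual `tree_summarize` mode packs chunks according to the model's context window rather than a fixed group size, and `summarize` here is a stand-in for an LLM call.

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Reduce `chunks` to one text by summarizing groups of `fan_in` at a time.

    `summarize` is any callable that takes a list of strings and returns one
    string. In a real pipeline it would call an LLM.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""

    level = list(chunks)
    while True:
        # Summarize each group at the current level to form the next level.
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def fake_summarize(texts):
    """Placeholder summarizer that only shows how texts were grouped."""
    return "(" + "+".join(texts) + ")"
```

With the placeholder summarizer, `tree_summarize(["a", "b", "c", "d"], fake_summarize)` shows the grouping explicitly, which makes the number of model calls easy to reason about.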

### Vector index

Now let's create a vector index. We use this index to answer simple queries from the user.

In [None]:
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

In [None]:
vector_query_engine = vector_index.as_query_engine()
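
Conceptually, a vector query embeds the question and returns the nodes whose embeddings are most similar to it. The sketch below illustrates that step with cosine similarity over toy vectors. The function names, the dictionary layout, and the value of `k` are assumptions for illustration and do not reflect how `VectorStoreIndex` is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=2):
    """Return the ids of the `k` nodes most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), node_id)
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node_id for _, node_id in scored[:k]]
```

Only the retrieved nodes are sent to the LLM, which is why this path is cheaper than summarizing every document.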

### Assemble our query tools

We wrap the two query engines created above in two tools. The RAG system selects which one to use based on the complexity of the query:

In [None]:
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summarize_query_engine,
    description="Useful for summarization questions related to Paul Graham essay on What I Worked On.",
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific context from Paul Graham essay on What I Worked On.",
)

To help the RAG pipeline decide whether a query is simple or complex, we use another language model. Because this classification task is itself simple, a small language model (SLM) is enough:

In [None]:
slm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="phi-3-mini-4k-instruct",
)

Configure the router:

In [None]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(llm=slm),
    query_engine_tools=[
        summary_tool,
        vector_tool,
    ],
)
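
The routing pattern itself is small: show the selector the tool descriptions, let it pick one, and forward the query to the chosen engine. The following sketch mimics that flow without any model. `Tool`, `route`, and `keyword_selector` are hypothetical names, and the keyword rule is only a stand-in for the decision the SLM makes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def route(query: str, tools: List[Tool], choose: Callable[[str, List[str]], int]) -> str:
    """Ask `choose` for a tool index based on the descriptions, then run that tool."""
    index = choose(query, [tool.description for tool in tools])
    if not 0 <= index < len(tools):
        raise ValueError(f"Selector returned an invalid index: {index}")
    return tools[index].run(query)


def keyword_selector(query: str, descriptions: List[str]) -> int:
    """Stand-in for the SLM: picks the first tool for summarization requests."""
    return 0 if "summar" in query.lower() else 1


tools = [
    Tool("summary", "Useful for summarization questions.", lambda q: f"summary: {q}"),
    Tool("vector", "Useful for retrieving specific context.", lambda q: f"vector: {q}"),
]
```

Validating the selector's answer matters more with a small model, since an SLM is more likely than an LLM to return a choice outside the offered options.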

Let's see how this works:

In [None]:
response = query_engine.query("What did Paul Graham do after RISD?")
print(str(response))

## 5. Inspect traces

Traces appear in the portal as follows:

![](docs/inference/tracing/llamaindex-tracing.png)