# Using Llama-Index with models in the Azure AI model catalog

In this notebook, you learn how to use `llama-index` with models from the Azure AI model catalog deployed to Azure AI Foundry.

## 1. Prerequisites

To run this tutorial, you need one of the following:

1. GitHub Models:

    1. You can use the [GitHub Models](https://github.com/marketplace/models) endpoint, including the free tier experience.
    2. Use the endpoint `https://models.inference.ai.azure.com` along with your GitHub token.

2. Azure AI Foundry:

    1. Create an [Azure subscription](https://azure.microsoft.com).
    2. Create an Azure AI hub resource as explained at [How to create and manage an Azure AI Studio hub](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-azure-ai-resource).
    3. Deploy a model that supports the [Azure AI model inference API](https://aka.ms/azureai/modelinference). This example uses a `Mistral-Large` deployment.

        * You can follow the instructions at [Add and configure models to Azure AI model inference service](https://learn.microsoft.com/azure/ai-studio/ai-services/how-to/create-model-deployments).

## 2. Install dependencies

Ensure you have `llama-index` installed:

```bash
pip install llama-index
```

Models deployed to Azure AI Foundry or Azure Machine Learning can be used with `llama-index` in two ways:

- **Using the Azure AI model inference API:** All models deployed to Azure AI Foundry and Azure Machine Learning support the Azure AI model inference API, which offers a common set of capabilities for most models in the catalog. Because the API is the same for all models, switching from one model to another only requires changing the model deployment being used. No further code changes are required. When working with `llama-index`, install the extensions `llama-index-llms-azure-inference` and `llama-index-embeddings-azure-inference`.
- **Using the model provider's specific API:** Some models, like OpenAI, Cohere, or Mistral, offer their own set of APIs and extensions for `llama-index`. Those extensions may include functionality specific to the model, so they are suitable if you want to take advantage of it. When working with `llama-index`, install the extension specific to the model you want to use, like `llama-index-llms-openai` or `llama-index-llms-cohere`.

This example uses the Azure AI model inference API, so install the following packages:

```bash
pip install -U llama-index-llms-azure-inference
pip install -U llama-index-embeddings-azure-inference
```
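Before continuing, it can help to confirm that both extensions are importable in the current environment. The following is a minimal sketch that uses only the standard library. The helper names `is_installed` and `missing_modules` are hypothetical, and the module paths are the ones imported later in this notebook.

```python
import importlib.util

# Module paths taken from the imports used later in this notebook.
REQUIRED_MODULES = [
    "llama_index.llms.azure_inference",
    "llama_index.embeddings.azure_inference",
]


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ModuleNotFoundError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def missing_modules(module_names=REQUIRED_MODULES):
    """List the modules from `module_names` that cannot be located."""
    return [name for name in module_names if not is_installed(name)]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", missing or "none")
```

If the list is not empty, rerun the `pip install` commands above in the same environment as the notebook kernel.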

## 3. Set environment variables

Follow these steps to get the information you need from the model you want to use:

1. Go to the [Azure AI Foundry portal](https://ai.azure.com/) or [Azure Machine Learning studio](https://ml.azure.com), depending on the product you are using.

2. Go to **Models + Endpoints** (**Endpoints** in Azure Machine Learning) and select the model you deployed as indicated in the prerequisites.

3. Copy the endpoint URL and the key.
    
> If your model was deployed with Microsoft Entra ID support, you don't need a key.

This example stores both the endpoint URL and the key in the following environment variables:

```bash
export AZURE_INFERENCE_ENDPOINT="<your-model-endpoint-goes-here>"
export AZURE_INFERENCE_CREDENTIAL="<your-key-goes-here>"
```
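A missing or still-templated variable usually surfaces later as an obscure connection error, so it can be worth failing early. The following sketch reads the two variables above and treats empty or `<...>` placeholder values as unset. The helper names `_clean` and `load_inference_config` are hypothetical. The key is optional here because deployments with Microsoft Entra ID support don't need one.

```python
import os
from typing import Mapping, Optional, Tuple


def _clean(value: Optional[str]) -> Optional[str]:
    """Return None for unset, empty, or still-templated ("<...>") values."""
    if value is None:
        return None
    value = value.strip()
    if not value or (value.startswith("<") and value.endswith(">")):
        return None
    return value


def load_inference_config(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Read the endpoint URL and the optional key from `env`."""
    endpoint = _clean(env.get("AZURE_INFERENCE_ENDPOINT"))
    if endpoint is None:
        raise RuntimeError(
            "AZURE_INFERENCE_ENDPOINT is not set to a real endpoint URL."
        )
    # The key may legitimately be absent (Microsoft Entra ID deployments).
    key = _clean(env.get("AZURE_INFERENCE_CREDENTIAL"))
    return endpoint, key
```

Calling `load_inference_config()` with no arguments reads from the process environment. Passing a dictionary instead makes the function easy to exercise in isolation.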

## 4. Connect to your deployment and endpoint

To use LLMs deployed in Azure AI Foundry or Azure Machine Learning, you need the endpoint and credentials to connect to them. The parameter `model_name` is not required for endpoints serving a single model, like Managed Online Endpoints or Serverless API Endpoints.

```python
import os
from llama_index.llms.azure_inference import AzureAICompletionsModel

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="mistral-large-2407",
)
```
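Because `model_name` is only needed for endpoints that serve more than one model, and `api_version` may only be needed for some models, one option is to assemble the constructor arguments conditionally. The following is a minimal sketch. The helper `build_client_kwargs` is hypothetical, and the argument names are the ones used in this notebook.

```python
from typing import Any, Dict, Optional


def build_client_kwargs(
    endpoint: str,
    credential: Any,
    model_name: Optional[str] = None,
    api_version: Optional[str] = None,
) -> Dict[str, Any]:
    """Build constructor arguments, leaving out optional ones that are unset."""
    kwargs: Dict[str, Any] = {"endpoint": endpoint, "credential": credential}
    if model_name is not None:
        # Only needed when the endpoint serves more than one model.
        kwargs["model_name"] = model_name
    if api_version is not None:
        # Only passed when the deployment requires it.
        kwargs["api_version"] = api_version
    return kwargs


# Intended use (requires the packages installed above):
# llm = AzureAICompletionsModel(**build_client_kwargs(endpoint, key))
```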

> If you are using OpenAI models, the parameter `api_version` may be required in the constructor.

Alternatively, if your endpoint supports Microsoft Entra ID, you can use the following code to create the client:

```python
from azure.identity import DefaultAzureCredential

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model_name="mistral-large-2407",
)
```
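If the same code has to run against both key-based and Microsoft Entra ID deployments, one possible approach is to prefer the key when it is set and fall back to a token credential otherwise. The following is a sketch under that assumption. The helper `pick_credential` is hypothetical, and the token credential is supplied as a factory so that it is only created when needed.

```python
from typing import Any, Callable, Mapping


def pick_credential(
    env: Mapping[str, str],
    token_credential_factory: Callable[[], Any],
) -> Any:
    """Return the API key from `env` if present, else a token credential."""
    key = env.get("AZURE_INFERENCE_CREDENTIAL", "").strip()
    if key:
        return key
    # No key configured: build the token credential lazily.
    return token_credential_factory()


# Intended use (requires azure-identity):
# import os
# from azure.identity import DefaultAzureCredential
# credential = pick_credential(os.environ, DefaultAzureCredential)
```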

> Note: When using Microsoft Entra ID, make sure that the endpoint was deployed with that authentication method and that you have the required permissions to invoke it.

If you plan to call the model asynchronously, it's a best practice to use the asynchronous version of the credential:

```python
from azure.identity.aio import (
    DefaultAzureCredential as DefaultAzureCredentialAsync,
)

llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredentialAsync(),
    model_name="mistral-large-2407",
)
```
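With an asynchronous client in place, several prompts can be sent concurrently. The following is a minimal sketch, assuming the client exposes `acomplete`, the asynchronous counterpart of `complete` in the LlamaIndex LLM interface. The helper `complete_many` is hypothetical and works with any object that provides that coroutine.

```python
import asyncio
from typing import Any, List, Sequence


async def complete_many(llm: Any, prompts: Sequence[str]) -> List[str]:
    """Run one completion per prompt concurrently and return the texts."""
    # gather() preserves the order of the awaitables it is given.
    responses = await asyncio.gather(*(llm.acomplete(p) for p in prompts))
    # str() is used the same way print(response) is used in this notebook.
    return [str(r) for r in responses]


# In a script:   texts = asyncio.run(complete_many(llm, ["Hello", "Goodbye"]))
# In a notebook: texts = await complete_many(llm, ["Hello", "Goodbye"])
```

Notebooks already run an event loop, which is why `await` is used there instead of `asyncio.run`.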

## 4. Use LLM models

Use the `complete` method for text completion. The `complete` method is also available for models of type `chat-completions`. In those cases, your input text is converted to a message with `role="user"`. Use `stream_complete` to receive the response incrementally.

```python
response = llm.complete("The sky is a beautiful blue and")
print(response)
```

```python
response = llm.stream_complete("The sky is a beautiful blue and")
for r in response:
    print(r.delta, end="")
```
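When streaming, you often want the full text as well as the incremental output. The following sketch accumulates the `delta` of each chunk, as the loop above does, and returns the joined result. The helper `collect_stream` is hypothetical, and it treats a missing or `None` delta as an empty string.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any], echo: bool = False) -> str:
    """Join the `delta` of every streamed chunk into one string."""
    parts = []
    for chunk in chunks:
        # Some chunks may carry no new text; treat that as an empty delta.
        delta = getattr(chunk, "delta", None) or ""
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)


# Intended use:
# text = collect_stream(llm.stream_complete("The sky is"), echo=True)
```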

For chat completion models, use `chat`:

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a pirate with colorful personality."),
    ChatMessage(role="user", content="Hello"),
]

response = llm.chat(messages)
print(response)
```

```python
response = llm.stream_chat(messages)
for r in response:
    print(r.delta, end="")
```

Rather than adding the same parameters to each chat or completion call, you can set them on the client instance.

```python
llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="mistral-large-2407",
    temperature=0.0,
    model_kwargs={"top_p": 1.0},
)
```
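Conceptually, parameters set on the client act as defaults for every call. The following sketch shows that idea with plain dictionaries. It is only an illustration, not the library's implementation, and it assumes that a value passed for a single call should take precedence over the instance-level default. The helper `merge_parameters` is hypothetical.

```python
from typing import Any, Dict, Mapping, Optional


def merge_parameters(
    defaults: Mapping[str, Any],
    overrides: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine instance-level defaults with per-call overrides."""
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(overrides or {})  # per-call values win (assumption)
    return merged


defaults = {"temperature": 0.0, "top_p": 1.0}
print(merge_parameters(defaults, {"top_p": 0.9}))
```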

To pass parameters that the Azure AI model inference API doesn't support ([reference](https://learn.microsoft.com/en-us/azure/ai-studio/reference/reference-model-inference-chat-completions.md)) but that the underlying model does, use the `model_extras` argument. The following example passes the parameter `safe_prompt`, which is only available for Mistral models.

```python
llm = AzureAICompletionsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    temperature=0.0,
    model_kwargs={"model_extras": {"safe_prompt": True}},
)
```
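When standard and model-specific parameters arrive in one flat dictionary, they can be separated before building `model_kwargs`. The following is a minimal sketch. The helper `to_model_kwargs` is hypothetical, and the set of standard parameter names is supplied by the caller, who should take it from the API reference linked above.

```python
from typing import Any, Collection, Dict, Mapping


def to_model_kwargs(
    params: Mapping[str, Any],
    standard: Collection[str],
) -> Dict[str, Any]:
    """Nest every parameter not listed in `standard` under "model_extras"."""
    kwargs: Dict[str, Any] = {k: v for k, v in params.items() if k in standard}
    extras = {k: v for k, v in params.items() if k not in standard}
    if extras:
        kwargs["model_extras"] = extras
    return kwargs


# "top_p" is treated as standard here; "safe_prompt" is Mistral-specific.
print(to_model_kwargs({"top_p": 1.0, "safe_prompt": True}, standard={"top_p"}))
```

The result has the same shape as the `model_kwargs` values used in the two examples above.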

## 5. Use embeddings models

You can connect to an embeddings model in the same way you create an LLM client. In the following example, we set the environment variables again, this time pointing to an embeddings model:

```bash
export AZURE_INFERENCE_ENDPOINT="<your-model-endpoint-goes-here>"
export AZURE_INFERENCE_CREDENTIAL="<your-key-goes-here>"
```

```python
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

embed_model = AzureAIEmbeddingsModel(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=os.environ["AZURE_INFERENCE_CREDENTIAL"],
    model_name="cohere-embed-v3-english",
)
```

Then configure your session to use the embeddings model:

```python
from llama_index.core import Settings

Settings.embed_model = embed_model
```
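Embeddings are typically compared with a similarity measure. The following sketch computes cosine similarity between two vectors using only the standard library. The helper `cosine_similarity` is hypothetical. The commented usage assumes the embeddings client exposes `get_text_embedding`, as LlamaIndex embedding models generally do.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity with a zero vector is undefined; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Intended use:
# v1 = embed_model.get_text_embedding("The sky is blue")
# v2 = embed_model.get_text_embedding("The ocean is blue")
# print(cosine_similarity(v1, v2))
```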

## 6. Configure the models to be used by your code

You can use the LLM or embeddings model client individually in the code you develop with `llama-index`, or you can configure the entire session using the `Settings` options. Configuring the session has the advantage that all your code uses the same models for all operations.

```python
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
```

However, there are scenarios where you want to use a general model for most operations but a specific one for a given task. In those cases, it's useful to set the LLM or embeddings model for each `llama-index` construct. The following example sets a specific model:

```python
from llama_index.core.evaluation import RelevancyEvaluator

relevancy_evaluator = RelevancyEvaluator(llm=llm)
```

In general, you use a combination of both strategies.
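The combination of both strategies can be pictured as a fallback: a component uses the model passed to it explicitly, and otherwise the session-wide one. The following sketch illustrates that pattern with a stand-in class. `SessionDefaults` and `resolve_llm` are hypothetical names, not part of `llama-index`.

```python
from typing import Any, Optional


class SessionDefaults:
    """Stand-in for a session-wide configuration object."""

    def __init__(self, llm: Optional[Any] = None) -> None:
        self.llm = llm


def resolve_llm(explicit: Optional[Any], defaults: SessionDefaults) -> Any:
    """Prefer the explicitly passed model, else the session default."""
    if explicit is not None:
        return explicit
    if defaults.llm is None:
        raise ValueError("No LLM was passed and no session default is set.")
    return defaults.llm


session = SessionDefaults(llm="general-model")
print(resolve_llm(None, session))              # falls back to the default
print(resolve_llm("evaluator-model", session)) # explicit model wins
```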