# Foundry Local

<img src="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/media/architecture/foundry-local-arch.png">

Foundry Local is an on-device AI inference solution offering performance, privacy, customization, and cost advantages. It integrates seamlessly into your existing workflows and applications through an intuitive CLI, SDK, and REST API.

## Key features
- On-Device Inference: Run models locally on your own hardware, reducing your costs while keeping all your data on your device.
- Model Customization: Select from preset models or use your own to meet specific requirements and use cases.
- Cost Efficiency: Eliminate recurring cloud service costs by using your existing hardware, making AI more accessible.
- Seamless Integration: Connect with your applications through an SDK, API endpoints, or the CLI, with easy scaling to Azure AI Foundry as your needs grow.

## Use cases
Foundry Local is ideal for scenarios where:
- You want to keep sensitive data on your device.
- You need to operate in environments with limited or no internet connectivity.
- You want to reduce cloud inference costs.
- You need low-latency AI responses for real-time applications.
- You want to experiment with AI models before deploying to a cloud environment.

## Documentation
- https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/what-is-foundry-local
- https://github.com/microsoft/Foundry-Local/releases
- https://github.com/microsoft/Foundry-Local/tree/main/docs

## Note

Foundry Local is available in preview. Public preview releases provide early access to features that are in active deployment.
Features, approaches, and processes can change or have limited capabilities before General Availability (GA).

In [None]:
#%pip install foundry-local-sdk

In [2]:
import json
import openai
import os
import pandas as pd
import requests

from foundry_local import FoundryLocalManager

## List of models

In [3]:
manager = FoundryLocalManager()
manager

<foundry_local.api.FoundryLocalManager at 0x1207cfb60>

In [4]:
manager.is_service_running()

True

In [5]:
manager.service_uri

'http://localhost:5273'

In [6]:
manager.endpoint

'http://localhost:5273/v1'
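The two values above suggest that the OpenAI-compatible endpoint is the service URI with a `/v1` suffix, and the REST API section of this notebook appends `/chat/completions` to that endpoint. Here is a minimal sketch of that URL composition, assuming the layout stays as observed here. The helper name `chat_completions_url` is hypothetical and not part of the SDK.

```python
def chat_completions_url(service_uri: str) -> str:
    """Build the chat-completions URL from the service URI.

    Assumes the endpoint is `<service_uri>/v1`, as observed in this notebook.
    """
    endpoint = service_uri.rstrip("/") + "/v1"
    return endpoint + "/chat/completions"


print(chat_completions_url("http://localhost:5273"))
```

In real code, reading `manager.endpoint` is the safer choice, since the port shown here is only what this particular run reported.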

In [7]:
# List available models in the catalog
catalog = manager.list_catalog_models()
print(f"Available models in the catalog: {catalog}")

Available models in the catalog: [FoundryModelInfo(alias=phi-4, id=Phi-4-generic-gpu, runtime=webgpu, file_size=8570 MB, license=MIT), FoundryModelInfo(alias=phi-4, id=Phi-4-generic-cpu, runtime=cpu, file_size=10403 MB, license=MIT), FoundryModelInfo(alias=mistral-7b-v0.2, id=mistralai-Mistral-7B-Instruct-v0-2-generic-gpu, runtime=webgpu, file_size=4167 MB, license=apache-2.0), FoundryModelInfo(alias=mistral-7b-v0.2, id=mistralai-Mistral-7B-Instruct-v0-2-generic-cpu, runtime=cpu, file_size=4167 MB, license=apache-2.0), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-cpu, runtime=cpu, file_size=2590 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-128k, id=Phi-3-mini-128k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3-mini-128k, id=Phi-3-mini-128k-instruct-generic-cpu, runtime=cpu, file_s

In [8]:
for idx, item in enumerate(catalog, start=1):
    print(f"{idx}: {item}\n")

1: alias='phi-4' id='Phi-4-generic-gpu' version='1' runtime=<ExecutionProvider.WEBGPU: 'WebGpuExecutionProvider'> uri='azureml://registries/azureml/models/Phi-4-generic-gpu/versions/1' file_size_mb=8570 prompt_template={'system': '<|system|>\n{Content}<|im_end|>', 'user': '<|user|>\n{Content}<|im_end|>', 'assistant': '<|assistant|>\n{Content}<|im_end|>', 'prompt': '<|user|>\n{Content}<|im_end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'

2: alias='phi-4' id='Phi-4-generic-cpu' version='1' runtime=<ExecutionProvider.CPU: 'CPUExecutionProvider'> uri='azureml://registries/azureml/models/Phi-4-generic-cpu/versions/1' file_size_mb=10403 prompt_template={'system': '<|system|>\n{Content}<|im_end|>', 'user': '<|user|>\n{Content}<|im_end|>', 'assistant': '<|assistant|>\n{Content}<|im_end|>', 'prompt': '<|user|>\n{Content}<|im_end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'

3: alia

In [9]:
data = []

for idx, item in enumerate(catalog, start=1):
    data.append({
        "index": idx,
        "alias": item.alias,
        "id": item.id,
        "version": item.version,
        "runtime": str(item.runtime),
        "uri": item.uri,
        "file_size_mb": item.file_size_mb,
        "prompt_template": item.prompt_template,
        "provider": item.provider,
        "publisher": item.publisher,
        "license": item.license,
        "task": item.task
    })

df = pd.DataFrame(data)
df

,index,alias,id,version,runtime,uri,file_size_mb,prompt_template,provider,publisher,license,task
0,1,phi-4,Phi-4-generic-gpu,1,WebGpuExecutionProvider,azureml://registries/azureml/models/Phi-4-gene...,8570,"{'system': '<|system|> {Content}<|im_end|>', '...",AzureFoundry,Microsoft,MIT,chat-completion
1,2,phi-4,Phi-4-generic-cpu,1,CPUExecutionProvider,azureml://registries/azureml/models/Phi-4-gene...,10403,"{'system': '<|system|> {Content}<|im_end|>', '...",AzureFoundry,Microsoft,MIT,chat-completion
2,3,mistral-7b-v0.2,mistralai-Mistral-7B-Instruct-v0-2-generic-gpu,1,WebGpuExecutionProvider,azureml://registries/azureml/models/mistralai-...,4167,"{'prompt': '[INST] {Content} [/INST]', 'assist...",AzureFoundry,Microsoft,apache-2.0,chat-completion
3,4,mistral-7b-v0.2,mistralai-Mistral-7B-Instruct-v0-2-generic-cpu,2,CPUExecutionProvider,azureml://registries/azureml/models/mistralai-...,4167,"{'system': '<s>', 'user': '[INST] {Content} [/...",AzureFoundry,Microsoft,apache-2.0,chat-completion
4,5,phi-3.5-mini,Phi-3.5-mini-instruct-generic-gpu,1,WebGpuExecutionProvider,azureml://registries/azureml/models/Phi-3.5-mi...,2211,{'prompt': '<|user|> {Content}<|end|> <|assist...,AzureFoundry,Microsoft,MIT,chat-completion
5,6,phi-3.5-mini,Phi-3.5-mini-instruct-generic-cpu,1,CPUExecutionProvider,azureml://registries/azureml/models/Phi-3.5-mi...,2590,{'prompt': '<|user|> {Content}<|end|> <|assist...,AzureFoundry,Microsoft,MIT,chat-completion
6,7,phi-3-mini-128k,Phi-3-mini-128k-instruct-generic-gpu,1,WebGpuExecutionProvider,azureml://registries/azureml/models/Phi-3-mini...,2181,"{'system': '<|system|> {Content}<|end|>', 'use...",AzureFoundry,Microsoft,MIT,chat-completion
7,8,phi-3-mini-128k,Phi-3-mini-128k-instruct-generic-cpu,2,CPUExecutionProvider,azureml://registries/azureml/models/Phi-3-mini...,2600,"{'system': '<|system|> {Content}<|end|>', 'use...",AzureFoundry,Microsoft,MIT,chat-completion
8,9,phi-3-mini-4k,Phi-3-mini-4k-instruct-generic-gpu,1,WebGpuExecutionProvider,azureml://registries/azureml/models/Phi-3-mini...,2181,"{'system': '<|system|> {Content}<|end|>', 'use...",AzureFoundry,Microsoft,MIT,chat-completion
9,10,phi-3-mini-4k,Phi-3-mini-4k-instruct-generic-cpu,2,CPUExecutionProvider,azureml://registries/azureml/models/Phi-3-mini...,2590,"{'system': '<|system|> {Content}<|end|>', 'use...",AzureFoundry,Microsoft,MIT,chat-completion
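
The table shows that one alias can map to several model IDs, typically a GPU and a CPU variant. The sketch below groups variants by alias using only the standard library. The rows are stand-in data copied from the output above so the example runs without the service, and `variants_by_alias` is a hypothetical helper name.

```python
from collections import defaultdict

# Stand-in rows copied from the catalog output above: (alias, id, file_size_mb).
rows = [
    ("phi-4", "Phi-4-generic-gpu", 8570),
    ("phi-4", "Phi-4-generic-cpu", 10403),
    ("mistral-7b-v0.2", "mistralai-Mistral-7B-Instruct-v0-2-generic-gpu", 4167),
    ("mistral-7b-v0.2", "mistralai-Mistral-7B-Instruct-v0-2-generic-cpu", 4167),
    ("phi-3.5-mini", "Phi-3.5-mini-instruct-generic-gpu", 2211),
    ("phi-3.5-mini", "Phi-3.5-mini-instruct-generic-cpu", 2590),
]


def variants_by_alias(rows):
    """Map each alias to the list of (model_id, size_mb) variants that share it."""
    grouped = defaultdict(list)
    for alias, model_id, size_mb in rows:
        grouped[alias].append((model_id, size_mb))
    return dict(grouped)


grouped = variants_by_alias(rows)
for alias, variants in grouped.items():
    print(alias, "->", [model_id for model_id, _ in variants])
```

With a live catalog, the same rows could be built as `[(m.alias, m.id, m.file_size_mb) for m in catalog]`, using the attributes already read in the cell above.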


In [10]:
# Print the ID of every model in the catalog
for item in catalog:
    print(item.id)

Phi-4-generic-gpu
Phi-4-generic-cpu
mistralai-Mistral-7B-Instruct-v0-2-generic-gpu
mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
Phi-3.5-mini-instruct-generic-gpu
Phi-3.5-mini-instruct-generic-cpu
Phi-3-mini-128k-instruct-generic-gpu
Phi-3-mini-128k-instruct-generic-cpu
Phi-3-mini-4k-instruct-generic-gpu
Phi-3-mini-4k-instruct-generic-cpu
deepseek-r1-distill-qwen-14b-generic-gpu
deepseek-r1-distill-qwen-7b-generic-gpu
qwen2.5-0.5b-instruct-generic-gpu
qwen2.5-0.5b-instruct-generic-cpu
qwen2.5-1.5b-instruct-generic-gpu
qwen2.5-1.5b-instruct-generic-cpu
qwen2.5-coder-0.5b-instruct-generic-gpu
qwen2.5-coder-0.5b-instruct-generic-cpu
qwen2.5-coder-7b-instruct-generic-gpu
qwen2.5-coder-7b-instruct-generic-cpu
qwen2.5-coder-1.5b-instruct-generic-gpu
qwen2.5-coder-1.5b-instruct-generic-cpu
Phi-4-mini-instruct-generic-gpu
Phi-4-mini-reasoning-generic-gpu
Phi-4-mini-reasoning-generic-cpu
qwen2.5-14b-instruct-generic-cpu
qwen2.5-7b-instruct-generic-gpu
qwen2.5-7b-instruct-generic-cpu
qwen2.5-co

In [11]:
print("Number of models in the catalog =", len(catalog))

Number of models in the catalog = 30
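
The model IDs listed above end in `-gpu` or `-cpu`. The sketch below counts variants per device from that suffix. It assumes the naming convention inferred from this output holds. The `runtime` field of each catalog entry is the more reliable source, and `device_of` is a hypothetical helper. The IDs are a small sample copied from the list above.

```python
from collections import Counter


def device_of(model_id: str) -> str:
    """Return the trailing device tag ('gpu' or 'cpu') of a model ID, else 'unknown'."""
    suffix = model_id.rsplit("-", 1)[-1].lower()
    return suffix if suffix in {"gpu", "cpu"} else "unknown"


# Sample of IDs copied from the output above (not the full catalog).
sample_ids = [
    "Phi-4-generic-gpu",
    "Phi-4-generic-cpu",
    "deepseek-r1-distill-qwen-14b-generic-gpu",
    "deepseek-r1-distill-qwen-7b-generic-gpu",
    "qwen2.5-0.5b-instruct-generic-cpu",
]

counts = Counter(device_of(model_id) for model_id in sample_ids)
print(counts)
```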


## Testing

In [12]:
alias = "phi-3.5-mini"

In [13]:
# Download the model to the local cache, then load it into the running service
manager.download_model(alias)
model_info = manager.load_model(alias)
print(f"Model info:\n{model_info}")

Model info:
alias='phi-3.5-mini' id='Phi-3.5-mini-instruct-generic-gpu' version='1' runtime=<ExecutionProvider.WEBGPU: 'WebGpuExecutionProvider'> uri='azureml://registries/azureml/models/Phi-3.5-mini-instruct-generic-gpu/versions/1' file_size_mb=2211 prompt_template={'prompt': '<|user|>\n{Content}<|end|>\n<|assistant|>', 'assistant': '<|assistant|>\n{Content}<|end|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'
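
The `prompt_template` field in the model info shows how a user message is wrapped before it reaches the model. The sketch below fills the `{Content}` placeholder by hand, assuming it is a literal token to substitute. The service presumably applies the template itself when you call the chat-completions API, so this is for illustration only, and `render` is a hypothetical helper.

```python
# Template copied from the model info above.
prompt_template = {
    "prompt": "<|user|>\n{Content}<|end|>\n<|assistant|>",
    "assistant": "<|assistant|>\n{Content}<|end|>",
}


def render(template: str, content: str) -> str:
    """Substitute the user text for the {Content} placeholder."""
    # str.replace is used instead of str.format so braces in the content are left alone.
    return template.replace("{Content}", content)


rendered = render(prompt_template["prompt"], "What is 3.1415?")
print(rendered)
```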


In [14]:
# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache:\n{local_models}")

Models in cache:
[FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT), FoundryModelInfo(alias=phi-4-mini-reasoning, id=Phi-4-mini-reasoning-generic-gpu, runtime=webgpu, file_size=3225 MB, license=MIT)]


In [15]:
# List loaded models
loaded = manager.list_loaded_models()
print(f"Models running in the service:\n{loaded}")

# Unload a model
manager.unload_model(alias)

Models running in the service:
[FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT)]
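
Loading and unloading are paired operations, so a context manager can make sure the model is unloaded even if the code in between raises. The sketch below relies only on the `load_model` and `unload_model` calls used in this notebook. `loaded_model` is a hypothetical helper, and `RecordingManager` is a stand-in so the example runs without the service.

```python
from contextlib import contextmanager


@contextmanager
def loaded_model(manager, alias):
    """Load `alias`, yield its model info, and unload it on exit."""
    info = manager.load_model(alias)
    try:
        yield info
    finally:
        manager.unload_model(alias)


class RecordingManager:
    """Stand-in for FoundryLocalManager that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def load_model(self, alias):
        self.calls.append(("load", alias))
        return {"alias": alias}

    def unload_model(self, alias):
        self.calls.append(("unload", alias))


fake = RecordingManager()
with loaded_model(fake, "phi-3.5-mini") as info:
    print("loaded:", info["alias"])
print(fake.calls)
```

With the real SDK, `loaded_model(manager, alias)` would be used the same way, with `manager` being the `FoundryLocalManager` instance created earlier.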


In [16]:
# Streaming

alias = "phi-3.5-mini"

manager = FoundryLocalManager(alias)

client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{
        "role": "user",
        "content": "What is 3.1415?"
    }],
    stream=True,
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

 The number 3.1415 is a decimal approximation of the mathematical constant pi (π), which represents the ratio of a circle' extruded circumference to its diameter. Pi is an irrational number, meaning it has an infinite number of non-repeating decimals. The value 3.1415 is often used as a close estimate for practical calculations, but for more precise work, more digits of pi are used. For example, in scientific and engineering contexts, pi is often approximated to several decimal places, such as 3.14159 or even more accurately with the help of computer algorithms.
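
The loop above prints each chunk as it arrives. If the full text is also needed afterwards, the deltas can be accumulated. The sketch below assumes chunks shaped like the ones read above (`chunk.choices[0].delta.content`). `collect_stream` and `fake_chunk` are hypothetical helpers, and the chunks are stand-ins so the example runs without the service.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content is not None:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk(" The number"), fake_chunk(None), fake_chunk(" 3.1415")]
full_text = collect_stream(fake_stream)
print(full_text.strip())
```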

In [19]:
# No streaming mode
resp = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{
        "role": "user",
        "content": "What is the capital of Canada?"
    }],
)

resp

ChatCompletion(id='chat.id.2594', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The capital of Canada is Ottawa. Located in the province of Ontario, Ottawa is not only the political center of the country but also home to many national institutions, including Parliament Hill, where the Senate and House of Commons meet. The city was chosen as the capital by Queen Victoria in 1857 due to its strategic location near the geographical and linguistic divide between English-speaking Canada West and French-speaking Canada East. Ottawa is the fourth largest city in the country and is known for its rich history and diverse culture.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], name=None, tool_call_id=None), delta={'role': 'assistant', 'content': ' The capital of Canada is Ottawa. Located in the province of Ontario, Ottawa is not only the political center of the country but also home to man

In [20]:
print(resp.choices[0].message.content)

 The capital of Canada is Ottawa. Located in the province of Ontario, Ottawa is not only the political center of the country but also home to many national institutions, including Parliament Hill, where the Senate and House of Commons meet. The city was chosen as the capital by Queen Victoria in 1857 due to its strategic location near the geographical and linguistic divide between English-speaking Canada West and French-speaking Canada East. Ottawa is the fourth largest city in the country and is known for its rich history and diverse culture.


In [21]:
# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache:\n{local_models}")

Models in cache:
[FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT), FoundryModelInfo(alias=phi-4-mini-reasoning, id=Phi-4-mini-reasoning-generic-gpu, runtime=webgpu, file_size=3225 MB, license=MIT)]


In [22]:
print("Number of models in cache =", len(local_models))

Number of models in cache = 3


In [23]:
# List loaded models
loaded = manager.list_loaded_models()
print(f"Models running in the service:\n{loaded}")

# Unload a model
manager.unload_model(alias)

Models running in the service:
[FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT)]


## REST API

In [24]:
# Note: this is a full model ID (the CPU variant) rather than a short alias
alias = "mistralai-Mistral-7B-Instruct-v0-2-generic-cpu"

manager = FoundryLocalManager(alias)
url = manager.endpoint + "/chat/completions"

payload = {
    "model": manager.get_model_info(alias).id,
    "messages": [{
        "role": "user",
        "content": "What is Azure?",
    }]
}

headers = {"Content-Type": "application/json"}

# Send the request and fail early on an HTTP error status
response = requests.post(url, headers=headers, data=json.dumps(payload))
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

 Azure is a cloud computing platform and set of services offered by Microsoft. It provides a range of solutions including those for computing, storage, networking, databases, analytics, artificial intelligence, and Internet of Things (IoT). Azure allows individuals, businesses, and organizations to build, deploy, and manage applications and workloads through a global network of Microsoft-managed data centers. With Azure, users can benefit from the flexibility of the cloud without having to manage the underlying infrastructure, enabling them to focus on their applications and business logic.


In [25]:
response.json()

{'model': None,
 'choices': [{'delta': {'role': 'assistant',
    'content': ' Azure is a cloud computing platform and set of services offered by Microsoft. It provides a range of solutions including those for computing, storage, networking, databases, analytics, artificial intelligence, and Internet of Things (IoT). Azure allows individuals, businesses, and organizations to build, deploy, and manage applications and workloads through a global network of Microsoft-managed data centers. With Azure, users can benefit from the flexibility of the cloud without having to manage the underlying infrastructure, enabling them to focus on their applications and business logic.',
    'name': None,
    'tool_call_id': None,
    'function_call': None,
    'tool_calls': []},
   'message': {'role': 'assistant',
    'content': ' Azure is a cloud computing platform and set of services offered by Microsoft. It provides a range of solutions including those for computing, storage, networking, databases, an
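
The response body above carries the answer under both `message` and `delta`, and the text starts with a leading space. The sketch below reads the content defensively, assuming the body keeps the shape shown in this output. `extract_content` is a hypothetical helper, and the sample dictionary is a shortened stand-in rather than the real response.

```python
def extract_content(body: dict) -> str:
    """Return the assistant text of the first choice, or '' if none is present."""
    choices = body.get("choices") or []
    if not choices:
        return ""
    first = choices[0]
    # Prefer `message`; fall back to `delta`, which this service also fills in.
    part = first.get("message") or first.get("delta") or {}
    return (part.get("content") or "").strip()


# Shortened stand-in mirroring the structure of the output above.
sample = {
    "model": None,
    "choices": [{
        "delta": {"role": "assistant", "content": " Azure is a cloud computing platform."},
        "message": {"role": "assistant", "content": " Azure is a cloud computing platform."},
    }],
}

print(extract_content(sample))
```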

In [26]:
# List models in cache
local_models = manager.list_cached_models()
print(f"Models in cache:\n{local_models}")

Models in cache:
[FoundryModelInfo(alias=phi-3-mini-4k, id=Phi-3-mini-4k-instruct-generic-gpu, runtime=webgpu, file_size=2181 MB, license=MIT), FoundryModelInfo(alias=mistral-7b-v0.2, id=mistralai-Mistral-7B-Instruct-v0-2-generic-cpu, runtime=cpu, file_size=4167 MB, license=apache-2.0), FoundryModelInfo(alias=phi-3.5-mini, id=Phi-3.5-mini-instruct-generic-gpu, runtime=webgpu, file_size=2211 MB, license=MIT), FoundryModelInfo(alias=phi-4-mini-reasoning, id=Phi-4-mini-reasoning-generic-gpu, runtime=webgpu, file_size=3225 MB, license=MIT)]


In [27]:
# Print the full details of each cached model
for cached_model in local_models:
    print(cached_model)
    print()

alias='phi-3-mini-4k' id='Phi-3-mini-4k-instruct-generic-gpu' version='1' runtime=<ExecutionProvider.WEBGPU: 'WebGpuExecutionProvider'> uri='azureml://registries/azureml/models/Phi-3-mini-4k-instruct-generic-gpu/versions/1' file_size_mb=2181 prompt_template={'system': '<|system|>\n{Content}<|end|>', 'user': '<|user|>\n{Content}<|end|>', 'assistant': '<|assistant|>\n{Content}<|end|>', 'prompt': '<|user|>\n{Content}<|end|>\n<|assistant|>'} provider='AzureFoundry' publisher='Microsoft' license='MIT' task='chat-completion'

alias='mistral-7b-v0.2' id='mistralai-Mistral-7B-Instruct-v0-2-generic-cpu' version='2' runtime=<ExecutionProvider.CPU: 'CPUExecutionProvider'> uri='azureml://registries/azureml/models/mistralai-Mistral-7B-Instruct-v0-2-generic-cpu/versions/2' file_size_mb=4167 prompt_template={'system': '<s>', 'user': '[INST]\n{Content}\n[/INST]', 'assistant': '{Content}</s>', 'prompt': '[INST]\n{Content}\n[/INST]'} provider='AzureFoundry' publisher='Microsoft' license='apache-2.0' tas

In [28]:
print("Number of models in cache =", len(local_models))

Number of models in cache = 4
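
Beyond the count, the disk space taken by the cache can be estimated by summing the reported file sizes. The sketch below uses the four sizes copied from the cache listing above as stand-in data, so it runs without the service.

```python
# Model ID -> file size in MB, copied from the cache listing above.
cached_sizes_mb = {
    "Phi-3-mini-4k-instruct-generic-gpu": 2181,
    "mistralai-Mistral-7B-Instruct-v0-2-generic-cpu": 4167,
    "Phi-3.5-mini-instruct-generic-gpu": 2211,
    "Phi-4-mini-reasoning-generic-gpu": 3225,
}

total_mb = sum(cached_sizes_mb.values())
print(f"{len(cached_sizes_mb)} models in cache, {total_mb} MB in total")
```

With a live manager, the same total would be `sum(m.file_size_mb for m in local_models)`, using the `file_size_mb` attribute read earlier in this notebook.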


## CLI

In [29]:
!foundry -h

[?1h=Description:
  Foundry Local CLI: Run AI models on your device.
  
  🚀 Getting started:
  
     1. To view available models: foundry model list
     2. To run a model: foundry model run <model>
  
     EXAMPLES:
         foundry model run phi-3-mini-4k

Usage:
  foundry [command] [options]

Options:
  -?, -h, --help  Show help and usage information
  --version       Show version information

Commands:
  model    Discover, run and manage models
  cache    Manage the local cache
  service  Manage the local model inference service



In [30]:
!foundry --version

[?1h=0.3.9267.42993


In [31]:
!foundry model list

[?1h=Alias                          Device     Task               File Size    License      Model ID            
-----------------------------------------------------------------------------------------------
phi-4                          GPU        chat-completion    8.37 GB      MIT          Phi-4-generic-gpu   
                               CPU        chat-completion    10.16 GB     MIT          Phi-4-generic-cpu   
--------------------------------------------------------------------------------------------------------
mistral-7b-v0.2                GPU        chat-completion    4.07 GB      apache-2.0   mistralai-Mistral-7B-Instruct-v0-2-generic-gpu
                               CPU        chat-completion    4.07 GB      apache-2.0   mistralai-Mistral-7B-Instruct-v0-2-generic-cpu
-------------------------------------------------------------------------------------------------------------------------------------
phi-3.5-mini                   GPU        chat-completion    2.16 
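
The sizes in this listing appear to be the SDK's `file_size_mb` values divided by 1024 and rounded to two decimals (8570 MB becomes 8.37 GB, 10403 MB becomes 10.16 GB). The sketch below reproduces that conversion. It is an inference from the two outputs in this notebook rather than documented behavior, and `mb_to_gb` is a hypothetical helper.

```python
def mb_to_gb(size_mb: int) -> float:
    """Convert megabytes to gigabytes using a factor of 1024, rounded to 2 decimals."""
    return round(size_mb / 1024, 2)


# Sizes in MB copied from the catalog output earlier in this notebook.
for model_id, size_mb in [
    ("Phi-4-generic-gpu", 8570),
    ("Phi-4-generic-cpu", 10403),
    ("mistralai-Mistral-7B-Instruct-v0-2-generic-gpu", 4167),
]:
    print(f"{model_id}: {mb_to_gb(size_mb):.2f} GB")
```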

In [32]:
!foundry model info phi-4-mini-reasoning

[?1h=Alias                          Device     Task               File Size    License      Model ID            
phi-4-mini-reasoning           GPU        chat-completion    3 GB         MIT          Phi-4-mini-reasoning-generic-gpu


In [33]:
!foundry service restart

[?1h=Restarting service...
🔴 Service is stopped.
🟢 Service is Started on http://localhost:5273, PID 36129!


In [34]:
!foundry service stop

[?1h=🔴 Service is stopped.
