# Mastering Llama Stack: A Comprehensive Guide to Building AI Systems with APIs

## Overview

In this tutorial, we will explore Meta AI’s Llama Stack, a standardized API framework designed to simplify the development of generative AI applications. The Llama Stack API offers building blocks that cover the entire development lifecycle, including model training, fine-tuning, inference, and evaluation.

These modular APIs are designed to interconnect seamlessly, allowing developers to create scalable AI solutions that span both local and cloud environments.

The Llama Stack is composed of a variety of APIs, each serving a specific function. These APIs include capabilities such as inference, safety, memory, agentic systems, evaluation, post-training processes, synthetic data generation and reward scoring.

Each of these APIs consists of multiple REST endpoints, which serve as distinct access points for performing specific actions or retrieving data within the Llama Stack.

API providers deliver the functional support for Llama Stack APIs by offering concrete implementations that bring these APIs into action. For example, the inference API can be implemented using libraries like torch, vLLM, or TensorRT.

Providers can either run locally or act as a pointer to a remote REST service, such as a cloud-based or dedicated inference provider, which enables flexible deployment across different environments.

A Llama Stack Distribution integrates APIs and providers into a cohesive framework for developers, enabling the flexibility to mix local and remote providers. This allows smaller models to run locally while larger models can leverage cloud providers. Regardless of the infrastructure, the higher-level APIs remain unchanged, ensuring a consistent development experience. This approach supports smooth transitions between different use cases, such as servers and mobile services, making it ideal for developing scalable generative AI applications.
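The idea of mixing local and remote providers behind one unchanged API can be illustrated with a small sketch. This is not Llama Stack's actual code: the class names `InferenceProvider`, `LocalProvider`, `RemoteProvider`, and `Distribution`, as well as the example URL, are hypothetical and exist only to show the routing pattern.

```python
from dataclasses import dataclass
from typing import Protocol


class InferenceProvider(Protocol):
    """Hypothetical interface: anything that can complete a prompt."""

    def complete(self, model: str, prompt: str) -> str: ...


@dataclass
class LocalProvider:
    """Stands in for an in-process implementation running on local hardware."""

    def complete(self, model: str, prompt: str) -> str:
        return f"[local:{model}] {prompt}"


@dataclass
class RemoteProvider:
    """Stands in for a pointer to a remote REST service."""

    base_url: str

    def complete(self, model: str, prompt: str) -> str:
        # A real provider would send an HTTP request to base_url here.
        return f"[remote:{model}] {prompt}"


@dataclass
class Distribution:
    """Routes each model to a provider; callers never see which one is used."""

    routes: dict[str, InferenceProvider]

    def complete(self, model: str, prompt: str) -> str:
        return self.routes[model].complete(model, prompt)


# Smaller model served locally, larger model delegated to a (made-up) remote URL.
dist = Distribution(
    routes={
        "Meta-Llama3.1-8B-Instruct": LocalProvider(),
        "Meta-Llama3.1-70B-Instruct": RemoteProvider("https://inference.example.com"),
    }
)
print(dist.complete("Meta-Llama3.1-8B-Instruct", "Hello"))
print(dist.complete("Meta-Llama3.1-70B-Instruct", "Hello"))
```

The calling code is identical for both models; only the routing table decides where the work happens, which is the property the distribution concept relies on.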

## Prerequisites

<li> Python 3.10 installed. </li>

<li> Access to a GPU (e.g. an NVIDIA GPU) if working with larger models like Meta-Llama3.1-70B. </li>

<li> Basic familiarity with large language models (LLMs) and API integration. </li>
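
The first two prerequisites can be checked from Python before installing anything. The sketch below is only a heuristic: the helper names are made up for this example, and finding `nvidia-smi` on the `PATH` suggests an NVIDIA driver is installed but says nothing about how much VRAM is available.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # version named in the prerequisites


def python_ok(version=None) -> bool:
    """Return True if the given (or current) interpreter version is >= 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def nvidia_smi_available() -> bool:
    """Heuristic only: True if the nvidia-smi tool is found on PATH."""
    return shutil.which("nvidia-smi") is not None


print(f"Python >= 3.10: {python_ok()}")
print(f"nvidia-smi found: {nvidia_smi_available()}")
```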

## Setting Up the Environment and Installing Llama Stack

### Installing

Install the Llama Stack package using `pip`:

```shell
pip install llama-stack
```

Alternatively, clone the repository from GitHub and install from source: 

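```shell
# Create a working directory and clone the repository into it
mkdir -p ~/local
cd ~/local
git clone git@github.com:meta-llama/llama-stack.git

# Create and activate an isolated conda environment with Python 3.10
conda create -n stack python=3.10
conda activate stack

# Install the package in editable mode from the cloned source
cd llama-stack
pip install -e .
```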
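
After either installation route, a quick sanity check can confirm that the package and its command-line tool are visible. This is a minimal sketch using only the standard library; it assumes the distribution is registered under the name `llama-stack` (the name used with `pip`) and that the CLI executable is called `llama`, as in the commands used throughout this tutorial.

```python
import shutil
from importlib import metadata


def installed_version(package: str = "llama-stack"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cli_path(name: str = "llama"):
    """Return the full path of the CLI executable, or None if it is not on PATH."""
    return shutil.which(name)


print("llama-stack version:", installed_version())
print("llama CLI location:", cli_path())
```

If either value prints as `None`, the active environment is probably not the one in which the package was installed.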

### Using the Llama CLI

The Llama CLI is a core tool that allows you to manage models, build distributions, and run servers. You can explore the available commands by running:

```shell
llama --help
```

The CLI supports subcommands such as: 
<li> <code>download</code> : For downloading models from Meta or HuggingFace. </li>
<li> <code>model</code> : To list or describe available foundation models. </li>
<li> <code>stack</code> : To build and run a Llama Stack server. </li>

#### Working with models in Llama Stack

Listing the available models: use the <code>model list</code> subcommand to see the available models along with their hardware requirements:

```shell
llama model list
```
The output lists each model, such as **Meta-Llama3.1-8B** and **Meta-Llama3.1-70B**, together with its Hugging Face repository, **context length**, and the number of **GPUs** (and VRAM per GPU) it requires.

```shell
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Model Descriptor                      | HuggingFace Repo                            | Context Length | Hardware Requirements      |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-8B                      | meta-llama/Meta-Llama-3.1-8B                | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-70B                     | meta-llama/Meta-Llama-3.1-70B               | 128K           | 8 GPUs, each >= 20GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B:bf16-mp8           |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B                    | meta-llama/Meta-Llama-3.1-405B-FP8          | 128K           | 8 GPUs, each >= 70GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B:bf16-mp16          | meta-llama/Meta-Llama-3.1-405B              | 128K           | 16 GPUs, each >= 70GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-8B-Instruct             | meta-llama/Meta-Llama-3.1-8B-Instruct       | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-70B-Instruct            | meta-llama/Meta-Llama-3.1-70B-Instruct      | 128K           | 8 GPUs, each >= 20GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct:bf16-mp8  |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct           | meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 128K           | 8 GPUs, each >= 70GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Meta-Llama-3.1-405B-Instruct     | 128K           | 16 GPUs, each >= 70GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Llama-Guard-3-8B                      | meta-llama/Llama-Guard-3-8B                 | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Llama-Guard-3-8B:int8-mp1             | meta-llama/Llama-Guard-3-8B-INT8            | 128K           | 1 GPU, each >= 10GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Prompt-Guard-86M                      | meta-llama/Prompt-Guard-86M                 | 128K           | 1 GPU, each >= 1GB VRAM    |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
```
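
When this listing needs to be filtered programmatically, for example to find the models that fit on a single GPU, the ASCII table can be parsed with a few lines of Python. The sketch below is an illustration rather than an official interface: `parse_model_table` is a hypothetical helper, it assumes the table layout shown above, and `SAMPLE` is a shortened copy of that output.

```python
def parse_model_table(text: str) -> list[dict[str, str]]:
    """Parse the ASCII table into a list of {column header: cell} dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the +----+ border lines
    ]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """\
+-----------------------------+-------------------------------+----------------+----------------------------+
| Model Descriptor            | HuggingFace Repo              | Context Length | Hardware Requirements      |
+-----------------------------+-------------------------------+----------------+----------------------------+
| Meta-Llama3.1-8B            | meta-llama/Meta-Llama-3.1-8B  | 128K           | 1 GPU, each >= 20GB VRAM   |
+-----------------------------+-------------------------------+----------------+----------------------------+
| Meta-Llama3.1-70B           | meta-llama/Meta-Llama-3.1-70B | 128K           | 8 GPUs, each >= 20GB VRAM  |
+-----------------------------+-------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B:bf16-mp8 |                               | 128K           | 8 GPUs, each >= 120GB VRAM |
+-----------------------------+-------------------------------+----------------+----------------------------+
"""

models = parse_model_table(SAMPLE)
# Keep only the models whose requirement column starts with "1 GPU,".
single_gpu = [
    m["Model Descriptor"]
    for m in models
    if m["Hardware Requirements"].startswith("1 GPU,")
]
print(single_gpu)
```

Rows with an empty repository column, such as the `bf16-mp8` variant, are kept with an empty string in that field.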

<li> Download the model

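To download models, use the <code>download</code> subcommand. Here’s an example of downloading **Meta-Llama3.1-8B-Instruct**, where <code>META_URL</code> is a placeholder for the signed download URL that Meta provides once your access request is approved: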

```shell
llama download --source meta --model-id Meta-Llama3.1-8B-Instruct --meta-url META_URL
```

For larger models, such as **Meta-Llama3.1-70B-Instruct**, make sure you have access to sufficient **GPU** resources:

```shell
llama download --source meta --model-id Meta-Llama3.1-70B-Instruct --meta-url META_URL
```
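
To script several downloads, the same command can be assembled and launched from Python. The sketch below only reuses the flags that appear in the commands above; the function names are hypothetical, and `META_URL` remains a placeholder that you must replace with your own URL.

```python
import shlex
import subprocess


def build_download_command(model_id: str, meta_url: str) -> list[str]:
    """Build the argument list for `llama download`, mirroring the commands above."""
    return [
        "llama", "download",
        "--source", "meta",
        "--model-id", model_id,
        "--meta-url", meta_url,
    ]


def download(model_id: str, meta_url: str) -> None:
    """Run the download; raises CalledProcessError if the CLI exits with an error."""
    subprocess.run(build_download_command(model_id, meta_url), check=True)


# Only print the command here; call download(...) to actually run it.
cmd = build_download_command("Meta-Llama3.1-8B-Instruct", "META_URL")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if the URL contains characters such as `&` or `?`.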

#### Building and Configuring Llama Stack Distributions

<li> Building a distribution

The Llama Stack lets you build distributions for running AI applications locally or in the cloud. For example, we can build a distribution using the **Meta-Llama3.1-8B-Instruct** model.

Create a config file:

```yaml
# 8b-instruct-build.yaml
name: 8b-instruct
distribution_spec:
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
image_type: conda
```

Then, build the distribution:

```shell
llama stack build ./8b-instruct-build.yaml
```
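
If you maintain several distributions, generating the build file from a small Python structure can reduce copy-and-paste mistakes. The following sketch reproduces the config shown above; `render_build_config` and `write_build_config` are hypothetical helpers, and the hand-written rendering assumes that all values are plain strings that need no YAML quoting (a YAML library would be the more robust choice).

```python
from pathlib import Path


def render_build_config(name: str, providers: dict[str, str], image_type: str = "conda") -> str:
    """Render a build config with the same keys as the YAML example above."""
    lines = [f"name: {name}", "distribution_spec:", "  providers:"]
    lines += [f"    {api}: {provider}" for api, provider in providers.items()]
    lines.append(f"image_type: {image_type}")
    return "\n".join(lines) + "\n"


def write_build_config(path: str, text: str) -> None:
    """Write the rendered config to disk."""
    Path(path).write_text(text, encoding="utf-8")


config_text = render_build_config(
    "8b-instruct",
    {
        "inference": "meta-reference",
        "memory": "meta-reference-faiss",
        "safety": "meta-reference",
    },
)
print(config_text)
# write_build_config("8b-instruct-build.yaml", config_text)  # uncomment to save
```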

<li> Configuring the distribution

After building the distribution, configure it for your specific requirements, such as model parameters and memory options:

```shell
llama stack configure ~/.llama/distributions/conda/8b-instruct-build.yaml
```

This command will guide you through configuring each API, including the inference model, the Safety API, and the Agentic System API.

#### Running the Llama Stack Server

Once the distribution is built and configured, start the server to run the API:

```shell
llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml
```
You should see logs indicating that the APIs, including inference, memory, and agentic system, are now active.
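
In scripts, it can help to wait until the server accepts connections before sending requests. The sketch below polls a TCP port using only the standard library. It assumes the server listens on `localhost:5000`, the address used by the `curl` examples in this tutorial; adjust it if your configuration differs. An open port only shows that something is listening, not that every API is ready.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def wait_for_server(host: str = "localhost", port: int = 5000,
                    attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll the port up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_server()` right after launching `llama stack run` in another process returns `True` as soon as the port opens, or `False` after roughly 30 seconds with the defaults.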



#### Using the APIs:

##### Inference API:

With the Inference API, you can run queries on the model. Here’s an example using the Meta-Llama3.1-8B-Instruct model: 

```shell
curl -X POST http://localhost:5000/inference/completion \
  -H 'Content-Type: application/json' \
  -d '{"model": "Meta-Llama3.1-8B-Instruct", "input": "Explain quantum computing."}'
```
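
The same request can be sent from Python with the standard library. This sketch mirrors the `curl` command above exactly (same URL, header, and JSON fields) and makes no assumption about the response format, so `send` simply returns the raw body as text. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # host and port taken from the curl example


def build_completion_request(model: str, prompt: str,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/inference/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_completion_request("Meta-Llama3.1-8B-Instruct", "Explain quantum computing.")
print(req.get_method(), req.full_url)
# print(send(req))  # uncomment once the server is running
```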

##### Safety API:

The Safety API helps check that text going into or coming out of the model is safe by passing it through a safeguard model such as Llama Guard. Invoke the Llama Guard model as follows:

```shell
curl -X POST http://localhost:5000/safety/run_shields \
  -H 'Content-Type: application/json' \
  -d '{"model": "Llama-Guard-3-8B", "input": "Analyze this text for harmful content."}'
```
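
When screening text from application code, it is worth handling the case where the server is down or returns an error. The sketch below sends the same payload as the `curl` command above and returns `None` on failure instead of raising. The helper names are hypothetical, and the response is returned as raw text because its exact schema is not covered here.

```python
import json
import urllib.error
import urllib.request

SHIELD_URL = "http://localhost:5000/safety/run_shields"  # from the curl example


def shield_payload(text: str, model: str = "Llama-Guard-3-8B") -> bytes:
    """Encode the JSON body used by the curl command above."""
    return json.dumps({"model": model, "input": text}).encode("utf-8")


def run_shield(text: str, url: str = SHIELD_URL, timeout: float = 30.0):
    """POST one text to the shield endpoint; return the raw body, or None on failure."""
    request = urllib.request.Request(
        url,
        data=shield_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, TimeoutError) as exc:
        # URLError covers HTTP error statuses and refused connections.
        print(f"shield request failed: {exc}")
        return None
```

A caller can then treat `None` as "could not verify" and decide whether to block or retry, rather than letting an unreachable safety service crash the application.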

## Conclusion 

Llama Stack is open source. The repository containing its specifications and implementations is available on GitHub, so developers can contribute to and collaborate on the platform. Its goal is to provide a standardized set of APIs for building and deploying generative AI applications.

In this tutorial, we covered how to set up and configure Llama Stack for working with LLMs, including downloading models, building distributions, and running inference servers. Because the higher-level APIs stay the same across local and cloud providers, you can move your applications between environments while keeping them flexible and scalable.