# Level 0: Getting Started with Llama Stack

This notebook will help you set up your environment for this tutorial. Specifically, we will cover installing the necessary libraries, configuring essential parameters, and connecting to a Llama Stack server.

## Prerequisites


Ensure you have access to a [Llama Stack](https://llama-stack.readthedocs.io/en/latest/) server.

If you need to set one up, follow the guide below that matches your environment:

* [Local](../../../local_setup_guide.md) setup guide for a laptop.
* [Remote](../../../kubernetes/llama-stack/README.md) setup guide for an OpenShift cluster.

## Setting the Environment Variables

Rename or copy the [`.env.example`](../../../.env.example) file to create a new file called `.env`. We've included as many reasonable defaults as possible to get you started, but use this file to make any customizations your environment needs, such as the location of the Llama Stack server endpoint or your personal [Tavily](https://app.tavily.com) API key for web search.

```bash
cp .env.example .env
```

### Environment variables required for all demos
- `REMOTE_BASE_URL`: the URL of the remote Llama Stack server.
- `TEMPERATURE` (optional): the temperature to use during inference. Defaults to 0.0.
- `TOP_P` (optional): the top_p parameter to use during inference. Defaults to 0.95.
- `MAX_TOKENS` (optional): the maximum number of tokens that can be generated in the completion. The code in this notebook falls back to 9192 if the variable is unset.
- `STREAM` (optional): set this to `False` to disable streaming of the model/agent output. Any other value, or leaving the variable unset, enables streaming in this notebook.
- `VDB_PROVIDER`: the vector DB provider to be used. Must be supported by Llama Stack. For this demo, we use Milvus Lite, which is our preferred solution.
- `VDB_EMBEDDING`: the embedding model to be used for ingestion and retrieval. For this demo, we use all-MiniLM-L6-v2.
- `VDB_EMBEDDING_DIMENSION` (optional): the dimension of the embedding. Defaults to 384.
- `VECTOR_DB_CHUNK_SIZE` (optional): the chunk size for the vector DB. Defaults to 512.
- `REMOTE_OCP_MCP_URL`: the URL for your OpenShift MCP server. If the client does not find the tool registered to the Llama Stack instance, it will use this URL to register the OpenShift tool.
- `REMOTE_SLACK_MCP_URL`: the URL for your Slack MCP server. If the client does not find the tool registered to the Llama Stack instance, it will use this URL to register the Slack tool.
- `USE_PROMPT_CHAINING`: controls whether the prompt is split into several separate prompts, one per step, or sent in a single turn.
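
A quick sanity check after creating `.env` can catch missing values early. The sketch below is a hypothetical helper, not part of the tutorial code. It assumes that a few of the variables not marked optional above are the ones worth checking, which may not hold for every demo.

```python
import os

# Assumption: a subset of the variables listed above without "(optional)".
REQUIRED_VARS = [
    "REMOTE_BASE_URL",
    "VDB_PROVIDER",
    "VDB_EMBEDDING",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"REMOTE_BASE_URL": "http://llamastack:8321", "VDB_PROVIDER": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a mapping checks the real process environment, so it is most useful after `load_dotenv()` has run.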

## Necessary Imports

In [1]:
!pip install dotenv llama_stack_client fire

Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting llama_stack_client
  Downloading llama_stack_client-0.2.12-py3-none-any.whl.metadata (15 kB)
Collecting fire
  Downloading fire-0.7.1-py3-none-any.whl.metadata (5.8 kB)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting distro<2,>=1.7.0 (from llama_stack_client)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting pyaml (from llama_stack_client)
  Downloading pyaml-25.7.0-py3-none-any.whl.metadata (12 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading llama_stack_client-0.2.12-py3-none-any.whl (340 kB)
Downloading fire-0.7.1-py3-none-any.whl (115 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading pyaml-25.7.0-py3-none-any.whl (26 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, pyaml, fire, distro, dotenv, llama_

In [8]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

## Setting Up the Server Connection

Establish the connection to your Llama Stack server.

_Note: A Tavily search API key is required for some of our demos and must be provided to the client upon initialization. If you do not have one, you can set one up for free at https://app.tavily.com_

In [9]:
base_url = os.getenv("REMOTE_BASE_URL", "http://llamastack:8321")

# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)

print("Connected to Llama Stack server")

Connected to Llama Stack server


## Initializing the Inference Parameters

Fetch the inference-related parameters from the corresponding environment variables and convert them to the format Llama Stack expects.

In [10]:
temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 9192))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value other than the exact string 'False' is treated as True
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Inference Parameters:
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 2048}
	stream: False
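
The same logic can be packaged as a function that takes the environment as a mapping, which makes it easy to try different settings without editing `.env`. This is only a sketch: `build_inference_params` is a hypothetical name, and the fallback values simply mirror the cell above.

```python
def build_inference_params(env):
    """Build (sampling_params, stream) from a mapping of environment variables."""
    temperature = float(env.get("TEMPERATURE", 0.0))
    if temperature > 0.0:
        top_p = float(env.get("TOP_P", 0.95))
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        # a temperature of 0 selects greedy decoding, so TOP_P is not read
        strategy = {"type": "greedy"}

    sampling_params = {
        "strategy": strategy,
        "max_tokens": int(env.get("MAX_TOKENS", 9192)),
    }
    # as in the cell above, anything other than the exact string "False"
    # enables streaming
    stream = env.get("STREAM", "True") != "False"
    return sampling_params, stream


print(build_inference_params({"MAX_TOKENS": "2048", "STREAM": "False"}))
print(build_inference_params({"TEMPERATURE": "0.7"}))
```

In the notebook itself, `build_inference_params(os.environ)` would reproduce the values computed by the cell above.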


Now, let's use the Llama Stack inference API to greet our LLM.

In [11]:
message = UserMessage(
    content="Hi, how are you?",
    role="user",
)
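# .completion_message is only available on a non-streaming response,
# so streaming is disabled for this call regardless of the STREAM setting
client.inference.chat_completion(
    model_id="qwen",
    messages=[message],
    sampling_params=sampling_params,
    stream=False,
).completion_message.content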

INFO:httpx:HTTP Request: POST http://llamastack:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


"<think>\nLet me think about how to respond to this friendly greeting. First, I should acknowledge the greeting in a warm and welcoming way. I want to make sure my response is positive and engaging.\n\nI should express that I'm doing well, as that's a common and appropriate response. Then, I can add a bit of personality by mentioning my enthusiasm for our conversation. This helps create a more natural and friendly tone.\n\nI should also invite them to share how they're doing, as that's a good way to keep the conversation flowing. I want to be open and approachable, so I'll phrase it in a way that makes them feel comfortable sharing.\n\nLet me check if there's anything else I should consider. The response should be concise but not too short, and it should maintain a professional yet personable tone. I don't want to overcomplicate things or add unnecessary information at this stage.\n\nOverall, I think a simple, cheerful response that acknowledges their greeting and invites further conve

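The response above begins with the model's reasoning inside a `<think>` block. If only the final answer is wanted, the reasoning can be stripped before display. The sketch below assumes the model closes the block with `</think>`; the output above is truncated, so that is not confirmed here. `strip_think` is a hypothetical helper.

```python
import re

# non-greedy match so that each <think>...</think> pair is removed separately
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text):
    """Remove closed <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


sample = "<think>\nThe user greeted me.\n</think>\n\nHello! I'm doing well."
print(strip_think(sample))
```

Text with an unclosed `<think>` tag is returned unchanged apart from whitespace trimming, which avoids silently discarding a truncated response.
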
## Bonus: Customizing LLM Responses

Let's explore how to customize the LLM's response style. The cell below asks a simple question that any LLM should be able to answer. We've provided two versions of the prompt: a regular one and a pirate-themed one. Try commenting out the regular prompt and uncommenting the pirate version to see how the prompt changes the personality of the response! If you have already run the cell, run it again after editing to see the change.

Feel free to:
- Switch between the two prompts by commenting/uncommenting
- Change the question to anything you'd like
- Create your own personality styles (try a medieval knight, a robot, or a Shakespeare character!)
- Experiment with different prompting techniques

In [12]:
# Feel free to change this question to anything you'd like!
# Keep exactly one of the prompts below uncommented:

# Regular version:
prompt = "What is the capital of France?"

# Pirate version (uncomment this line and comment out the line above):
# prompt = "Please answer the following question as if you were a pirate captain: What is the capital of France?"

# Create the message
message = UserMessage(
    content=prompt,
    role="user",
)

# Get the response
# streaming is disabled here because .completion_message is read below
response = client.inference.chat_completion(
    model_id="qwen",
    messages=[message],
    sampling_params=sampling_params,
    stream=False,
)

print(response.completion_message.content)

INFO:httpx:HTTP Request: POST http://llamastack:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


<think>
Okay, so the user is asking, "What is the capital of France?" Hmm, I need to make sure I get this right. Let me think. I remember from school that France is a country in Europe, and their capital is a major city. I think it's Paris. Wait, is that correct? Let me verify. I've heard of Paris being known for the Eiffel Tower and the Louvre Museum. Yeah, that sounds right. But wait, sometimes people confuse capitals with other big cities. For example, some countries have capitals that aren't their largest cities, but in France's case, Paris is both the capital and the largest city. I don't think there's any other city that's commonly mistaken for the capital. Let me think of other European capitals. Germany's is Berlin, Italy's is Rome, Spain's is Madrid. So France's should be Paris. I'm pretty confident about that. But just to be thorough, maybe I should recall some historical context. Paris has been the capital for a long time, right? Even during different regimes and governments
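
Instead of hand-editing the prompt for each personality, the persona can be a parameter. A minimal sketch, where `persona_prompt` is a hypothetical helper that follows the wording of the pirate prompt above:

```python
def persona_prompt(question, persona=None):
    """Return the question, optionally prefixed with a persona instruction."""
    if persona is None:
        return question
    return (
        f"Please answer the following question as if you were {persona}: {question}"
    )


question = "What is the capital of France?"
for persona in (None, "a pirate captain", "a medieval knight"):
    print(persona_prompt(question, persona))
```

The returned string can be passed as `content` when building the `UserMessage` in the cell above.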

## Next

Now that we've set up our tutorial environment, let's get started building with Llama Stack! The next notebook will teach you how to build a [Simple RAG](./Level1_simple_RAG.ipynb) application.

#### Any Feedback?

If you have any feedback on this or any other notebook in this demo series, we'd love to hear it! Please go to https://www.feedback.redhat.com/jfe/form/SV_8pQsoy0U9Ccqsvk and help us improve our demos.