# Llama Stack Inference Guide

This guide shows how to use Llama Stack's `chat_completion` function to generate text. The examples use the `meta-llama/Llama-3.2-3B-Instruct` model set in the configuration step below.

Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).


### Table of Contents
1. [Quickstart](#quickstart)
2. [Building Effective Prompts](#building-effective-prompts)
3. [Conversation Loop](#conversation-loop)
4. [Conversation History](#conversation-history)
5. [Streaming Responses](#streaming-responses)


## Quickstart

This section walks through each step to set up and make a simple text generation request.



### 0. Configuration
Set up your connection parameters:

In [1]:
HOST = "localhost"  # Replace with your host
PORT = 8321       # Replace with your port
MODEL_NAME='meta-llama/Llama-3.2-3B-Instruct'
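
If you would rather not hard-code the connection parameters, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `LLAMA_STACK_HOST` and `LLAMA_STACK_PORT` and the helper `build_base_url` are hypothetical, not part of Llama Stack.

```python
import os


def build_base_url(host: str, port: int) -> str:
    """Return the HTTP base URL in the form the client setup in this guide uses."""
    return f"http://{host}:{port}"


# Hypothetical environment variable names; fall back to the defaults used in this guide.
HOST = os.environ.get("LLAMA_STACK_HOST", "localhost")
PORT = int(os.environ.get("LLAMA_STACK_PORT", "8321"))

print(build_base_url(HOST, PORT))
```

Keeping the URL construction in one function means that the notebook and a standalone script can share the same configuration logic.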

### 1. Set Up the Client

Import `LlamaStackClient` from Llama Stack's client library and create a client that points at your server:

In [2]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

### 2. Create a Chat Completion Request

Call `chat_completion` with a list of messages that defines the conversation context. Each message is a dictionary with a `role` (such as `system` or `user`) and a `content` string:

In [3]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)

print(response.completion_message.content)

Here is a two-sentence poem about a llama:

With soft fur and gentle eyes,
The llama roams, a gentle surprise.
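
Because every message is a plain dictionary, a small helper can catch malformed messages before they are sent. This is a minimal sketch: `make_message` is a hypothetical helper, and the set of allowed roles is an assumption that may differ between Llama Stack versions.

```python
# Assumed set of roles; check the API reference for your Llama Stack version.
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> dict:
    """Build a chat message dict, rejecting unknown roles and empty content."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if not content.strip():
        raise ValueError("content must not be empty")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are a friendly assistant."),
    make_message("user", "Write a two-sentence poem about llama."),
]
print(messages)
```

The resulting `messages` list has the same shape as the one passed to `chat_completion` above.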


## Building Effective Prompts

Effective prompt creation (often called "prompt engineering") is essential for quality responses. A simple lever is the system message: changing the persona it assigns changes the style of the reply while the user request stays the same.

### Sample Prompt

In [4]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are Shakespeare."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)
print(response.completion_message.content)

"O, fairest llama, with thy fleece so bright,
In Andean hills, thou dost delight."
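
One way to make such prompts repeatable is to assemble the system message from a persona and a list of explicit constraints. The helper below is a hypothetical sketch, not a Llama Stack API; it only builds the `messages` list that `chat_completion` receives.

```python
from typing import Sequence


def build_prompt(persona: str, task: str, constraints: Sequence[str] = ()) -> list:
    """Return a messages list with a persona-based system prompt and one user task."""
    system_parts = [f"You are {persona}."]
    system_parts.extend(constraints)
    return [
        {"role": "system", "content": " ".join(system_parts)},
        {"role": "user", "content": task},
    ]


messages = build_prompt(
    persona="Shakespeare",
    task="Write a two-sentence poem about llama.",
    constraints=["Answer in exactly two sentences.", "Use Early Modern English."],
)
print(messages[0]["content"])
# You are Shakespeare. Answer in exactly two sentences. Use Early Modern English.
```

Stating constraints such as length or register in the system message tends to be more reliable than hoping the model infers them from the persona alone, though results vary by model.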


## Conversation Loop

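To let users send multiple messages in one session, wrap the request in a loop that reads input until the user types `exit`, `quit`, or `bye`. The function is declared `async` so that it can be awaited in a notebook cell, although the client call itself is synchronous. Each turn sends only the latest user message, so the model does not see earlier turns.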

In [5]:
import asyncio
from llama_stack_client import LlamaStackClient
from termcolor import cprint

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

async def chat_loop():
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        message = {"role": "user", "content": user_input}
        response = client.inference.chat_completion(
            messages=[message],
            model_id=MODEL_NAME
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

# Run the chat loop in a Jupyter Notebook cell using await
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())


User>  who are you?


> Response: I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."


User>  what can you do for me?


> Response: I can assist you with a wide range of tasks and provide information on various topics. Here are some examples of what I can do for you:

1. **Answer questions**: I can provide information on various subjects, including science, history, technology, literature, and more.
2. **Generate text**: I can create text based on a prompt or topic, and can even help with writing tasks such as proofreading and editing.
3. **Translate text**: I can translate text from one language to another, including popular languages such as Spanish, French, German, Chinese, and many more.
4. **Summarize content**: I can summarize long pieces of text into shorter, more digestible versions, highlighting the main points and key information.
5. **Offer suggestions**: I can provide suggestions for things like gift ideas, travel destinations, books to read, and more.
6. **Play games**: I can play text-based games with you, such as Hangman, 20 Questions, and Word Jumble.
7. **Chat and converse**: I can

User>  exit


Ending conversation. Goodbye!
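
The exit check in the loop compares the raw input against the exit words, so input with surrounding spaces, such as `exit` followed by a trailing space, would be sent to the model instead of ending the session. A slightly more forgiving check, sketched here with a hypothetical helper `should_exit`, strips whitespace first.

```python
EXIT_COMMANDS = {"exit", "quit", "bye"}


def should_exit(user_input: str) -> bool:
    """Return True when the input, ignoring case and surrounding whitespace, is an exit word."""
    return user_input.strip().lower() in EXIT_COMMANDS


print(should_exit("  Quit "))  # True
print(should_exit("goodbye"))  # False
```

Inside the loop, the condition would then read `if should_exit(user_input):`.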


## Conversation History

Maintaining a conversation history lets the model use context from earlier turns. Accumulate the messages in a list and send the whole list with each request, so that every reply is generated with the full session in view.

In [6]:
async def chat_loop():
    conversation_history = []
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=MODEL_NAME,
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

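        # Append the assistant's reply so the next request includes it as context
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            # Assumption: the client's assistant message schema also requires
            # `stop_reason`; it is copied from the response here.
            "stop_reason": response.completion_message.stop_reason,
        }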
        conversation_history.append(assistant_message)

# Use `await` in the Jupyter Notebook cell to call the function
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())


User>  who are you?


> Response: I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."


User>  can you tell me more about it?


> Response: I'd be happy to tell you more about me.

I'm a type of artificial intelligence model called a large language model. This means I've been trained on a massive dataset of text from the internet, books, and other sources, which allows me to understand and generate human-like language.

Here are some key things about me:

1. **Training data**: My training data consists of a massive corpus of text, which I use to learn patterns and relationships in language. This corpus is sourced from various places, including but not limited to, the internet, books, and user-generated content.
2. **Language understanding**: I can understand natural language, including grammar, syntax, and semantics. I can also comprehend nuances like idioms, colloquialisms, and figurative language.
3. **Language generation**: I can generate human-like text based on the input I receive. This can include answering questions, providing explanations, generating text on a given topic, or even creating stories.

User>  exit


Ending conversation. Goodbye!
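
The bookkeeping in this loop can be isolated into a single function, which makes it easy to check without a running server. In the sketch below, `run_turn` and `fake_complete` are hypothetical names; `fake_complete` stands in for the call to `client.inference.chat_completion` and simply reports how many messages it received.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_turn(history: List[Message], user_input: str,
             complete: Callable[[List[Message]], str]) -> str:
    """Record the user message, get a reply for the full history, and record the reply."""
    history.append({"role": "user", "content": user_input})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: List[Message]) -> str:
    """Stand-in for the model call: reports the size of the context it was given."""
    return f"I can see {len(messages)} message(s)."


history: List[Message] = []
print(run_turn(history, "who are you?", fake_complete))           # I can see 1 message(s).
print(run_turn(history, "can you tell me more?", fake_complete))  # I can see 3 message(s).
print([m["role"] for m in history])
```

The second reply sees three messages because the first user message and its reply are already in the history. With a real client, the assistant entry may need additional fields depending on the client version.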


## Streaming Responses

The `chat_completion` function accepts a `stream` parameter. When it is set to `True`, partial responses are returned progressively as they are generated, so users see output immediately instead of waiting for the entire response.

In [7]:
from llama_stack_client.lib.inference.event_logger import EventLogger

async def run_main(stream: bool = True):
    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

    message = {
        "role": "user",
        "content": 'Write me a 3 sentence poem about llama'
    }
    cprint(f'User> {message["content"]}', 'green')

    response = client.inference.chat_completion(
        messages=[message],
        model_id=MODEL_NAME,
        stream=stream,
    )

    if not stream:
        cprint(f'> Response: {response.completion_message.content}', 'cyan')
    else:
        for log in EventLogger().log(response):
            log.print()

# In a Jupyter Notebook cell, use `await` to call the function
await run_main()
# To run it in a Python file, use this line instead
# asyncio.run(run_main())


User> Write me a 3 sentence poem about llama
Here is a 3 sentence poem about a llama:

With soft fur and gentle eyes,
The llama roams, a peaceful surprise,
In the Andes, its beauty resides.
