# Using Rosie's NVIDIA NIM for LLMs

Rosie has an always-available instance of [NVIDIA NIM for LLMs](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) running [Llama-3.3-70B-Instruct](https://catalog.ngc.nvidia.com/orgs/nim/teams/meta/models/llama-3.3-70b-instruct). You can access it from code running on Rosie.

NIM for LLMs exposes an OpenAI-compatible API, so we can talk to it from Python with the `openai` package.

In [None]:
from openai import AsyncOpenAI

We then create an instance of this class, which lets us talk to the services hosted on NIM. Because this is the asynchronous client, its calls are awaited in the cells below.

In [None]:
client = AsyncOpenAI(
    base_url="http://dh-dgxh100-2.hpc.msoe.edu:8000/v1",
    api_key="not_used",  # this field must be included, but its value is ignored
)

We can list the currently available models by calling `client.models.list()`.

In [None]:
async for model in client.models.list():
    print(model)

We use the completions API to prompt the model for a response.

In [None]:
await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Why is MSOE the best school to study CS?",  # your prompt goes here
)

If we want to get back the response token by token as it is generated, we can use the `stream` parameter.

In [None]:
async for event in await client.completions.create(
        model="meta/llama-3.3-70b-instruct",
        prompt="Why is MSOE the best school to study CS?",
        stream=True,
    ):
    print(event)
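
Printing each event shows the whole response object rather than just the generated text. The following is a minimal sketch of pulling out only the text, assuming each streamed event follows the OpenAI completions schema with the fragment in `choices[0].text`. `event_text` is a hypothetical helper, and the fake events are stand-ins so the snippet runs without contacting the server.

```python
from types import SimpleNamespace


def event_text(event) -> str:
    """Return the text fragment carried by one streamed completions event."""
    # Assumption: events mirror the OpenAI schema, with text in choices[0].text.
    if not event.choices:
        return ""
    return event.choices[0].text or ""


# Stand-in events shaped like real ones (hypothetical data, no server needed).
fake_events = [
    SimpleNamespace(choices=[SimpleNamespace(text="MSOE ")]),
    SimpleNamespace(choices=[SimpleNamespace(text="is ")]),
    SimpleNamespace(choices=[]),  # an event may arrive without any choices
    SimpleNamespace(choices=[SimpleNamespace(text="great.")]),
]

streamed_text = "".join(event_text(event) for event in fake_events)
print(streamed_text)
```

In the streaming loop above, swapping `print(event)` for `print(event_text(event), end="", flush=True)` would print only the text as it arrives.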

We can cap the length of the response by setting the `max_tokens` parameter. This is an upper bound: the model may stop earlier, but it will never generate more than this many tokens. It is a tradeoff between the amount of information we get back and the time it takes to get a response, since a higher limit allows longer responses that take longer to generate.

In [None]:
await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Why is MSOE the best school to study CS?",
    max_tokens=200,
)
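
A response that hits the limit is simply cut off, so it can be useful to check why generation stopped. This is a small sketch, assuming the response follows the OpenAI schema, where `finish_reason` is `"length"` when the token limit was reached and `"stop"` when the model finished on its own. `was_truncated` is a hypothetical helper, and the two responses are fake stand-ins.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    # Assumption: OpenAI-style responses report "length" when max_tokens cut
    # the output short and "stop" when the model ended the text itself.
    return response.choices[0].finish_reason == "length"


# Hypothetical responses standing in for real ones.
cut_off = SimpleNamespace(
    choices=[SimpleNamespace(text="MSOE offers", finish_reason="length")]
)
complete = SimpleNamespace(
    choices=[SimpleNamespace(text="Done.", finish_reason="stop")]
)

print(was_truncated(cut_off), was_truncated(complete))
```

With a real response from `client.completions.create(...)`, a `True` result suggests raising `max_tokens` or asking for a shorter answer.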

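We can control how repetitive the model allows itself to be by setting the `frequency_penalty` parameter. Positive values penalize a token more each time it has already appeared, which discourages repetition; negative values do the opposite and encourage it. In the OpenAI API the value ranges from -2.0 to 2.0.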

In [None]:
print(await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Repeat the word poem.",
    frequency_penalty=-2,
))
print(await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Repeat the word poem.",
    frequency_penalty=2,
))
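
To build intuition for what the penalty does, here is a toy sketch of the adjustment as OpenAI's documentation describes it: the penalty times the number of earlier occurrences is subtracted from a token's logit. Whether the NIM backend applies exactly this formula is an assumption, and the two-token vocabulary, logit values, and `apply_frequency_penalty` helper are made up for illustration.

```python
from collections import Counter


def apply_frequency_penalty(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Subtract penalty * (number of earlier occurrences) from each logit."""
    counts = Counter(generated)
    return {token: logit - penalty * counts[token] for token, logit in logits.items()}


# Hypothetical logits for the next token, after "poem" was generated three times.
logits = {"poem": 2.0, "verse": 1.0}
generated = ["poem", "poem", "poem"]

discouraged = apply_frequency_penalty(logits, generated, penalty=2.0)
encouraged = apply_frequency_penalty(logits, generated, penalty=-2.0)

print(discouraged)  # "poem" drops from 2.0 to -4.0, so "verse" is now preferred
print(encouraged)   # "poem" rises from 2.0 to 8.0, so repetition is even likelier
```

Tokens that have not appeared yet, such as "verse" here, are left untouched in both cases.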

If you need the prompt you gave the model back in the response, you can use the `echo` parameter.

In [None]:
await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Why is MSOE the best school to study CS?",
    echo=True,
)

You can encourage the model to talk about new topics by increasing the `presence_penalty` parameter, which penalizes any token that has already appeared in the text, no matter how many times. However, setting it too high will likely cause the model to veer further off topic than you want.

In [None]:
await client.completions.create(
    model="meta/llama-3.3-70b-instruct",
    prompt="Why is MSOE the best school to study CS?",
    presence_penalty=2,
    max_tokens=100,
)
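
Unlike a penalty that grows with each repetition, the presence penalty is described in OpenAI's documentation as a flat, one-time deduction for any token that has appeared at all. The toy sketch below illustrates that idea; the vocabulary, logit values, and `apply_presence_penalty` helper are hypothetical, and it is an assumption that the NIM backend behaves the same way.

```python
def apply_presence_penalty(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Subtract a flat penalty from every token that has already appeared."""
    seen = set(generated)
    return {
        token: logit - (penalty if token in seen else 0.0)
        for token, logit in logits.items()
    }


# Hypothetical next-token logits; "MSOE" and "campus" were already generated.
logits = {"MSOE": 3.0, "campus": 2.5, "robots": 2.0}
generated = ["MSOE", "MSOE", "campus"]

adjusted = apply_presence_penalty(logits, generated, penalty=2.0)
most_likely = max(adjusted, key=adjusted.get)

print(adjusted)
print(most_likely)
```

"MSOE" appeared twice and "campus" once, yet both lose the same amount, which leaves the unseen token "robots" as the most likely choice. That is the push toward new topics, and also why a large value can drag the text off topic.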

The completions API is for cases where you just want the model to continue a piece of text rather than hold a conversation. If you do want to chat with the model, you can use the chat completions API instead.

In [None]:
messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Why is MSOE the best school to study CS?"}
]
await client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=messages,
)
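
The chat completions API is stateless: the model only sees what is in `messages`, so continuing a conversation means appending the model's reply and the next user turn before calling it again. Here is a minimal sketch of that bookkeeping, assuming the reply text is in `choices[0].message.content` as in the OpenAI schema. `extend_history` is a hypothetical helper and `fake_response` is a stand-in for a real response.

```python
from types import SimpleNamespace


def extend_history(messages: list[dict], response, next_user_message: str) -> list[dict]:
    """Return a new history with the model's reply and the next user turn added."""
    # Assumption: the reply text lives in choices[0].message.content.
    reply = response.choices[0].message.content
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_user_message},
    ]


history = [{"role": "user", "content": "Why is MSOE the best school to study CS?"}]

# Hypothetical response standing in for the result of the API call.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="Small classes.")
        )
    ]
)

history = extend_history(history, fake_response, "Can you give an example?")
for message in history:
    print(message["role"], ":", message["content"])
```

The extended list can then be passed as `messages` in the next `client.chat.completions.create(...)` call.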


Like the completions API, the chat completions API has a `stream` parameter that lets you get the response back piece by piece as it is generated. It also has a `max_tokens` parameter that caps how long the response can be.

In [None]:
async for chunk in await client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=messages,
        stream=True,
        max_tokens=10,
    ):
    print(chunk)  # each item is a chunk object wrapping a piece of the reply
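
Streamed chat items are shaped differently from completions events: in the OpenAI schema the new text sits in `choices[0].delta.content`, which can be `None` (for example on the final item that only reports why generation stopped). The sketch below assumes NIM follows that schema; `chunk_text` is a hypothetical helper and the fake chunks are stand-ins.

```python
from types import SimpleNamespace


def chunk_text(chunk) -> str:
    """Return the new text in one streamed chat chunk, or "" if there is none."""
    # Assumption: text arrives in choices[0].delta.content and may be None.
    if not chunk.choices:
        return ""
    return chunk.choices[0].delta.content or ""


def fake_chunk(content, finish_reason=None):
    """Build a stand-in chunk with the same attribute layout (hypothetical data)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)]
    )


fake_chunks = [
    fake_chunk(""),  # the first chunk often carries only the role
    fake_chunk("MSOE"),
    fake_chunk(" rocks"),
    fake_chunk(None, finish_reason="length"),  # final chunk has no text
]

reply = "".join(chunk_text(chunk) for chunk in fake_chunks)
print(reply)
```

Calling `chunk_text` on each streamed item and printing the result with `end=""` would show the reply as running text instead of a list of objects.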

`frequency_penalty` is also available in the chat completions API.

In [None]:
poem_message = [
    {"role": "user", "content": "Continuously repeat the word poem."},
]


print(await client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=poem_message,
    frequency_penalty=-2,
    max_tokens=30,
))
print(await client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=poem_message,
    frequency_penalty=2,
    max_tokens=30,
))