# NVIDIA AI Playground ChatModel

>[NVIDIA AI Playground](https://www.nvidia.com/en-us/research/ai-playground/) gives users easy access to hosted endpoints for generative AI models like Llama-2, SteerLM, Mistral, etc. Using the API, you can query NVCR (NVIDIA Container Registry) function endpoints and get quick results from a DGX-hosted cloud compute environment. All models are source-accessible and can be deployed on your own compute cluster.

This example goes over how to use LangChain to interact with supported AI Playground models.

In [1]:
## Core LC Chat Interface
from langchain.chat_models import NVAIPlayChat
from langchain.chat_models.nv_aiplay import (
    ContextChat,
    GeneralChat,
    SteerChat,
)
from langchain.llms.nv_aiplay import NVCRModel
from langchain.schema import BaseMessage, HumanMessage

## Setup

**To get started:**
1. Create a free account with the [NVIDIA GPU Cloud](https://catalog.ngc.nvidia.com/) service, which hosts AI solution catalogs, containers, models, etc.
2. Navigate to `Catalog > AI Foundation Models > (Model with API endpoint)`.
3. Select the `API` option and click `Generate Key`.
4. Save the generated key as `NVAPI_KEY`. From there, you should have access to the endpoints.

In [2]:
import getpass
import os

## API Key can be found by going to NVIDIA NGC -> AI Playground -> (some model) -> Get API Code or similar.
## 10K free queries to any endpoint (which is a lot actually).

# del os.environ['NVAPI_KEY']  ## delete
if os.environ.get("NVAPI_KEY", "").startswith("nvapi-"):
    print("Valid NVAPI_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVAPI_KEY"] = nvapi_key

Valid NVAPI_KEY already in environment. Delete to reset


## Integration With LangChain

We have a base connector `NVAIPlayBaseModel` which implements all of the components necessary to interface with both the `LLM` and `SimpleChatModel` classes via inheritance. They are constructed off of the `NVCRModel` backbone class, which is demonstrated at the end of this notebook for more advanced use-cases.

This notebook will demonstrate the `ChatModel` portion with key features.

### **Supported Models**

Querying `available_models` returns all of the models that your API credentials can access:

In [3]:
NVAIPlayChat().available_models

['playground_llama2_13b',
 'playground_llama2_code_13b',
 'playground_mistral',
 'playground_llama2_code_34b',
 'playground_gpt_steerlm_8b',
 'playground_sdxl',
 'playground_llama2_70b',
 'playground_nvolveqa_40k',
 'playground_llama2_steerlm_70b',
 'playground_fuyu_8b',
 'playground_neva_22b',
 'playground_clip',
 'playground_gpt_qa_8b']

All of these models are *technically* supported and can all be accessed via `NVCRModel`, but some models have first-class LangChain support and others are more experimental.

**Ready-To-Use LLM Models** have been tested and are the top priority for our LangChain support. They're useful for external and internal reasoning, and responses always arrive in a chat format with a common seed for consistent and reproducible trial results. AI Playground offers no text completion API for these models, though NeMo Service and other NVCR functions do support raw query endpoints.

Models can be invoked by specifying a `model` field that can be tied back to the `available_models` function ids. A selection of classes are also provided (importable from `langchain.chat_models.nv_aiplay` as opposed to `langchain.chat_models` directly) which are pre-populated with default model arguments and updated with reasonable default model options as the service selection evolves:

- `GeneralChat`: Pre-populated with a default chat-tuned model. Chat counterpart of `GeneralLLM`.
- `CodeChat`: Pre-populated with a default code-tuned model. Chat counterpart of `CodeLLM`.
- `InstructChat`: Pre-populated with a default instruction-tuned model. Chat counterpart of `InstructLLM`.
- `SteerChat`: Pre-populated with a default [SteerLM-optimized model](https://developer.nvidia.com/blog/announcing-steerlm-a-simple-and-practical-technique-to-customize-llms-during-inference/). It also pre-populates the `labels` field with a default value which can be overridden. Chat counterpart of `SteerLLM`.
- `ContextChat`: Pre-populated with a default context-reasoning model. This model actively expects "context"-role inputs in addition to standard "user"-role inputs. Chat counterpart of `ContextLLM`.
- `ImageChat`: Pre-populated with a default image-reasoning multimodal model. This model can take in images as part of the discussion. Chat counterpart of `ImageLLM`.
- Embedding models are also supported, but should be interfaced via the `NVAIEmbeddings` interface.

All other current models are experimental, and you are free to interface with them using this chat interface or the `NVCRModel` abstraction. However, note that deeper support is in development and requires some custom pre-processing/post-processing.

**To find out more about a specific model, please navigate to the API section of an AI Playground model [as linked here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/codellama-13b/api).**

-----

The following is a brief showcase of the generate API:

In [4]:
## Single prompt
# chat = NVAIPlayChat(model="llama2_code_13b")
chat = GeneralChat()
print(repr(chat(HumanMessage(content="Hey, we've just met! How's your day going?"))))
print(repr(chat("Hey, we've just met! How's your day going?")))

AIMessage(content="Hello! *smile* I'm doing well, thank you for asking! It's great to meet you too! How about you, how's your day going so far? Is there anything you'd like to talk about or ask? I'm here to help with any questions you might have. *pleasant and respectful tone*")
AIMessage(content="Hello! *smile* I'm doing well, thank you for asking! It's great to meet you too! How about you, how's your day going so far? Is there anything you'd like to talk about or ask? I'm here to help with any questions you might have. *pleasant and respectful tone*")


These models support streaming and asynchronous utilities, and should work like any other LangChain connector:

In [5]:
def print_with_nl(responses):
    """General helper for printing chat outputs (either generation or stream)"""
    buffer = ""
    if isinstance(responses, BaseMessage):
        content_gen = (letter for letter in responses.content)
    else:
        content_gen = (chunk.content for chunk in responses)
    for content in content_gen:
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif "\n" in content:
            buffer = ""
        buffer += content
        print(content, end="")
    print()


print_with_nl(chat.stream("Who's the best quarterback in the NFL?"))

As a helpful and respectful assistant, I cannot provide a subjective opinion on who
 the "best" quarterback in the NFL is, as this is a matter of personal opinion and
 can be influenced by a variety of factors such as team loyalty, personal bias, and
 individual performance. However, I can provide some information on some of the top-performing
 quarterbacks in the NFL this season, based on their statistics and achievements.

Some of the top-performing quarterbacks in the NFL this season include:

1. Lamar Jackson, Baltimore Ravens: Jackson has had an MVP-caliber season, leading
 the Ravens to a 10-2 record and setting numerous records for rushing yards by a quarterback.
 He has thrown for 3,107 yards and 32 touchdowns, while also rushing for 1,008 yards
 and 7 touchdowns.
2. Russell Wilson, Seattle Seahawks: Wilson has had another strong season, leading
 the Seahawks to a 9-3 record and throwing for 3,875 yards and 30 touchdowns. He has
 also rushed for 242 yards and 2 touchdowns.
3. D

Because the underlying requests API is chat-oriented, we also support a few convenience formats. For example, a single string can carry several roles using `///ROLE` prefixes:

In [6]:
print_with_nl(
    chat.stream(
        """
///ROLE SYS: Only generate python code. Do not add any discussions about it.
///ROLE USER: Please implement Fibanocci in python without recursion. Your response should start and end in ```
"""
    )
)

```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        a, b = 0, 1
        for i in range(n-1):
            a, b = b, a + b
        return a
```


We can see exactly what message was passed into our model by querying the underlying client for one of its state variables. More on this in the advanced section near the end.

In [7]:
print(chat.client.state_vars)
chat.client.last_inputs

['last_inputs', 'last_response', 'last_msg', 'available_functions']


{'url': 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/e0bb7fb9-5333-4a27-8534-c6288f921d3f',
 'headers': {'Authorization': SecretStr('**********'),
  'Accept': 'text/event-stream',
  'content-type': 'application/json'},
 'json': {'messages': [{'role': 'system',
    'content': 'Only generate python code. Do not add any discussions about it.\n'},
   {'role': 'user',
    'content': 'Please implement Fibanocci in python without recursion. Your response should start and end in ```'}],
  'temperature': 0.2,
  'top_p': 0.7,
  'max_tokens': 1024,
  'stream': True},
 'stream': True}
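
The recorded payload shows how the `///ROLE`-prefixed string was split into role-tagged messages. The following is a minimal sketch of such a parser, inferred only from the inputs and payloads recorded in this notebook; it is not the connector's actual implementation. The function name `parse_role_prompt`, the `ROLE_NAMES` table, and the `AI` tag are hypothetical.

```python
import re

# Hypothetical mapping from prompt tags to API role names. Only SYS, USER,
# and CONTEXT appear in this notebook; the AI entry is an assumption.
ROLE_NAMES = {"SYS": "system", "USER": "user", "CONTEXT": "context", "AI": "assistant"}


def parse_role_prompt(text):
    """Split a '///ROLE <TAG>:' prompt into a list of role/content messages."""
    # re.split with one capture group yields [preamble, tag, body, tag, body, ...]
    parts = re.split(r"///ROLE (\w+): ?", text.strip())
    messages = []
    if parts[0].strip():
        # Text before the first tag (or a prompt with no tags) becomes a user message.
        messages.append({"role": "user", "content": parts[0]})
    for tag, body in zip(parts[1::2], parts[2::2]):
        role = ROLE_NAMES.get(tag.upper(), tag.lower())
        messages.append({"role": role, "content": body})
    return messages


prompt = """
///ROLE SYS: Only generate python code. Do not add any discussions about it.
///ROLE USER: Please implement Fibonacci in python without recursion.
"""
for message in parse_role_prompt(prompt):
    print(message)
```

Under these assumptions, the system message keeps its trailing newline while the final message does not, which matches the `messages` list recorded above.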

### Special-Format Models

Certain models also support additional arguments such as the SteerLM model options:

In [8]:
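def try_steerlm(model, msg="Tell me about Spongebob Squarepants", labels=None):
    """Stream one SteerLM response using the given labels (empty by default)."""
    labels = labels or {}  # avoid sharing a mutable default between calls
    print(f"Labels: {labels}")
    print_with_nl(model.stream(msg, labels=labels, stop="\n"))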


# steer_chat = NVAIPlayChat(model="steerlm")
steer_chat = SteerChat()
try_steerlm(steer_chat)
try_steerlm(steer_chat, labels={"verbosity": 0})
try_steerlm(steer_chat, labels={"creativity": 4, "complexity": 0, "verbosity": 0})
try:
    try_steerlm(steer_chat, labels={"creativity": 10, "complexity": 0, "verbosity": 0})
except ValueError as e:
    print(f"This part failed, as it is gated by LabelModel validation\n{e}")

Labels: {}
SpongeBob SquarePants is a beloved animated television character and the title character of
 the hit animated television series "SpongeBob SquarePants." He is an energetic and optimistic yellow
 sea sponge who lives in a submerged pineapple in the fictional underwater city of Bikini Bottom, located in the
 Pacific Ocean.

Labels: {'verbosity': 0}
SpongeBob SquarePants is a yellow sea sponge who lives in a submerged house in the middle of
 the Pacific Ocean. He works as a fry cook at the Krusty Krab, and he spends his free time with
 his best friend, Patrick Star, and a cast of other anthropomorphic sea creatures. SpongeBob is
 known for his optimistic and energetic personality, and he often finds himself in various adventures
 and misadventures as he pursues his passion for learning and discovery. The series
 has been running since 1999 and has become a cultural phenomenon, beloved by audiences of
 all ages.
Labels: {'creativity': 4, 'complexity': 0, 'verbosity': 0}
SpongeBo
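
The last call in the cell above is expected to fail because label values are validated before the request is sent. The following sketch shows one way such a range check could look. It assumes an inclusive range of 0 to 9 per label; the notebook only shows that 4 is accepted and 10 is rejected, so the exact bounds and the helper name `validate_labels` are assumptions rather than the connector's `LabelModel`.

```python
def validate_labels(labels, low=0, high=9):
    """Return a copy of labels, raising ValueError for out-of-range values."""
    checked = {}
    for name, value in labels.items():
        # Assumed rule: every label is an integer within [low, high].
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"label {name!r}={value!r} is outside [{low}, {high}]")
        checked[name] = value
    return checked


print(validate_labels({"creativity": 4, "complexity": 0, "verbosity": 0}))
try:
    validate_labels({"creativity": 10, "complexity": 0, "verbosity": 0})
except ValueError as err:
    print(f"Rejected: {err}")
```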

Other models may require other kinds of input, such as `ContextChat`, which expects context-role inputs:

In [9]:
qa_chat = ContextChat()

print_with_nl(
    qa_chat(
        """
///ROLE CONTEXT: 
    Knowledge Base:
        {name:unknown, age:unknown, location:unknown, occupation:unknown}
    New Info: 
        Hello World! My name is Carmen Sandiego! 
///ROLE USER: 
    Please update the knowledge base off the response, based off only the new info.
"""
    )
)

name: Carmen Sandiego
age: unknown
location: unknown
occupation: unknown


In [10]:
qa_chat.client.last_inputs

{'url': 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/0c60f14d-46cb-465e-b994-227e1c3d5047',
 'headers': {'Authorization': SecretStr('**********'),
  'Accept': 'application/json'},
 'json': {'messages': [{'role': 'context',
    'content': '\n    Knowledge Base:\n        {name:unknown, age:unknown, location:unknown, occupation:unknown}\n    New Info: \n        Hello World! My name is Carmen Sandiego! \n'},
   {'role': 'user',
    'content': '\n    Please update the knowledge base off the response, based off only the new info.'}],
  'temperature': 0.2,
  'top_p': 0.7,
  'max_tokens': 512,
  'stream': False},
 'stream': False}

You can add your own custom support for such a system by subclassing the `NVAIPlayBaseModel` class.

## Conversation Chains

Like any other integration, the `NVAIPlayChat` models support chat utilities such as conversation buffers by default. Below, we show the [LangChain ConversationBufferMemory](https://python.langchain.com/docs/modules/memory/types/buffer) example applied to the `GeneralChat` model.

In [11]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

chat = GeneralChat(temperature=0.1, max_tokens=100, top_p=1.0)

conversation = ConversationChain(
    llm=chat, verbose=True, memory=ConversationBufferMemory()
)
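
To make the idea of a conversation buffer concrete, here is a small self-contained sketch of the technique: every turn is stored and the whole transcript is replayed in front of each new input. The class `BufferMemorySketch` is hypothetical and only mimics the `Human:`/`AI:` transcript layout; it is not LangChain's `ConversationBufferMemory`.

```python
class BufferMemorySketch:
    """Keeps every turn and renders it as a 'Human:/AI:' transcript."""

    def __init__(self):
        self.turns = []  # (speaker, text) pairs in arrival order

    def save(self, human_text, ai_text):
        """Record one completed exchange."""
        self.turns.append(("Human", human_text))
        self.turns.append(("AI", ai_text))

    def render_prompt(self, new_input):
        """Replay the stored history, then append the new human input."""
        history = "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)
        return f"Current conversation:\n{history}\nHuman: {new_input}\nAI:"


memory = BufferMemorySketch()
memory.save("Hi there!", "Hello! It's great to talk to you!")
print(memory.render_prompt("Tell me about yourself."))
```

Because the full history is resent on every turn, the prompt grows with the conversation, which is worth keeping in mind when `max_tokens` or context length is limited.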

In [12]:
conversation.predict(input="Hi there!")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: Hi there!
AI:

> Finished chain.


"Hello! It's great to talk to you! I'm here to help answer any questions you may have. What's on your mind today? 😊"

In [13]:
conversation.predict(input="I'm doing well! Just having a conversation with an AI.")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Hi there!
AI: Hello! It's great to talk to you! I'm here to help answer any questions you may have. What's on your mind today? 😊
Human: I'm doing well! Just having a conversation with an AI.
AI:

> Finished chain.


"Oh, that's great! I love conversing with humans. It's so much fun to learn about their thoughts and experiences. 🤖 What would you like to talk about? I'm here to help with any questions you may have, big or small. 😊"

In [14]:
conversation.predict(input="Tell me about yourself.")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Hi there!
AI: Hello! It's great to talk to you! I'm here to help answer any questions you may have. What's on your mind today? 😊
Human: I'm doing well! Just having a conversation with an AI.
AI: Oh, that's great! I love conversing with humans. It's so much fun to learn about their thoughts and experiences. 🤖 What would you like to talk about? I'm here to help with any questions you may have, big or small. 😊
Human: Tell me about yourself.
AI:

> Finished chain.


"Sure thing! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. 🤓 I'm here to help answer any questions you may have, and I'm always happy to chat! 😊 I have been trained on a massive dataset of text from the internet, and I can provide information on a wide range of topics. I can also generate text in a variety of styles and formats"

## **[Advanced]** Underlying Requests API

A selection of useful models is hosted in a DGX-powered service known as [NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/). In this service, containers with exposed model endpoints are deployed and listed on the NVIDIA Container Registry service (NVCR). These endpoints are accessible via simple HTTP requests and can be used by a variety of systems.

The `NVCRModel` class implements the basic interfaces to communicate with NVCR, limiting the utility functions to those relevant for AI Playground. For example, the following list is populated by querying the function list endpoint with a key-loaded GET request:

In [15]:
## Core NVCR Model base
client = NVCRModel()
client.available_models

{'playground_llama2_13b': 'e0bb7fb9-5333-4a27-8534-c6288f921d3f',
 'playground_llama2_code_13b': 'f6a96af4-8bf9-4294-96d6-d71aa787612e',
 'playground_fuyu_8b': '9f757064-657f-4c85-abd7-37a7a9b6ee11',
 'playground_gpt_steerlm_8b': '1423ff2f-d1c7-4061-82a7-9e8c67afd43a',
 'playground_neva_22b': '8bf70738-59b9-4e5f-bc87-7ab4203be7a0',
 'playground_nvolveqa_40k': '091a03bb-7364-4087-8090-bd71e9277520',
 'playground_clip': '8c21289c-0b18-446d-8838-011b7249c513',
 'playground_llama2_steerlm_70b': 'd6fe6881-973a-4279-a0f8-e1d486c9618d',
 'playground_sdxl': '89848fb8-549f-41bb-88cb-95d6597044a4',
 'playground_gpt_qa_8b': '0c60f14d-46cb-465e-b994-227e1c3d5047',
 'playground_mistral': '35ec3354-2681-4d0e-a8dd-80325dcf7c63',
 'playground_llama2_code_34b': 'df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8',
 'playground_llama2_70b': '0e349b44-440a-44e1-93e9-abe8dcb27158'}
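
The name-to-id mapping above can be derived from a function listing. The sketch below assumes the listing is a list of dictionaries that each carry `name` and `id` keys, as suggested by the `get_model_details()` output and the commented internals in the last cell of this notebook. The helper `build_model_index` is hypothetical, and no network request is made here.

```python
def build_model_index(functions):
    """Map each function's name to its id, skipping entries missing either key."""
    return {
        fn["name"]: fn["id"]
        for fn in functions
        if "name" in fn and "id" in fn
    }


# Two entries copied from the listing above, trimmed to the fields used here.
sample_functions = [
    {"id": "e0bb7fb9-5333-4a27-8534-c6288f921d3f", "name": "playground_llama2_13b"},
    {"id": "35ec3354-2681-4d0e-a8dd-80325dcf7c63", "name": "playground_mistral"},
]
model_index = build_model_index(sample_functions)
print(model_index["playground_mistral"])
```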

From this, you can easily send over a request in the style shown in the AI Playground API window for Python. For this example, we will use a model which is not currently in our LangChain support matrix (though we plan to add first-class support later).

In [16]:
client = NVCRModel()

model = "neva"
payload = {
    "messages": [
        {
            "content": 'Hi! What is in this image? ',
            "role": "user",
        },
        {
            "labels": {"creativity": 6, "helpfulness": 6, "humor": 0, "quality": 6},
            "role": "assistant",
        },
    ],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 512,
    "stream": True,
}


def print_with_newlines(generator):
    """Print streamed response dicts, wrapping lines after roughly 80 characters."""
    buffer = ""
    for response in generator:
        content = response.get("content") or ""  # tolerate chunks without content
        if len(buffer) > 80 and content.startswith(" "):
            buffer = ""
            print()
        elif content.startswith("\n"):
            buffer = ""
        buffer += content
        print(content, end="")


## Generate-style response
# print(client.get_req_generation(model, payload))
# print()

## Stream-style response
print_with_newlines(client.get_req_stream(model, payload))
print()

The image is a gray scale photograph of a checkered pattern, possibly a portion of
 a chessboard or a security camera image. The pattern consists of a series of white
 and black squares, creating a visually striking design. The squares are organized
 in a grid-like pattern, covering the entire image from top to bottom and left to
 right. The contrast between the white and black squares is quite noticeable, emphasizing
 the checkered pattern and making it the central focus of the image.


The client manages many of the surrounding concerns, such as API key permissions, model name resolution, and parameter consistency, and keeps a selection of raw communication history logs for debugging convenience. These can also be queried from the LangChain-default systems via the `client` field, e.g. `llm.client.last_inputs`.

In [17]:
print("Available State Variables:", client.state_vars)
client.last_inputs

Available State Variables: ['last_inputs', 'last_response', 'last_msg', 'available_functions']


{'url': 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/8bf70738-59b9-4e5f-bc87-7ab4203be7a0',
 'headers': {'Authorization': SecretStr('**********'),
  'Accept': 'text/event-stream',
  'content-type': 'application/json'},
 'json': {'messages': [{'content': 'Hi! What is in this image? ',
    'role': 'user'},
   {'labels': {'creativity': 6, 'helpfulness': 6, 'humor': 0, 'quality': 6},
    'role': 'assistant'}],
  'temperature': 0.2,
  'top_p': 0.7,
  'max_tokens': 512,
  'stream': True},
 'stream': True}
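
The recorded inputs suggest how a request is assembled: the function id is appended to the `pexec` URL, and the `Accept` header switches between `text/event-stream` for streaming and `application/json` otherwise. The sketch below mirrors that shape without sending anything. The helper `build_request` is hypothetical, and the `Bearer` scheme is an assumption, since the notebook output masks the `Authorization` value.

```python
PEXEC_URL = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/{function_id}"


def build_request(function_id, payload, api_key, stream=False):
    """Assemble keyword arguments for an HTTP POST to a function endpoint."""
    headers = {
        # "Bearer" is an assumption; the recorded output hides this value.
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/event-stream" if stream else "application/json",
    }
    return {
        "url": PEXEC_URL.format(function_id=function_id),
        "headers": headers,
        "json": {**payload, "stream": stream},  # keep payload and flag in sync
        "stream": stream,
    }


request_kwargs = build_request(
    "8bf70738-59b9-4e5f-bc87-7ab4203be7a0",
    {"messages": [{"role": "user", "content": "Hi!"}], "max_tokens": 512},
    api_key="nvapi-placeholder",  # placeholder, not a real key
    stream=True,
)
print(request_kwargs["url"])
print(request_kwargs["headers"]["Accept"])
```

A dictionary of this shape could be passed to an HTTP client as keyword arguments. Keeping the `stream` flag in one place avoids a mismatch between the JSON body and the transport setting.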

As we can see, this is a general-purpose backbone API which can be built upon quite nicely to facilitate the LangChain generation/streaming/astreaming APIs. It also allows a nice window for advanced status checking and debugging for those who need to dig into their model processes.

In [18]:
chat.client.last_msg

{'id': '3a84daf3-5428-41ac-8c04-40928aa24046',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "Sure thing! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. 🤓 I'm here to help answer any questions you may have, and I'm always happy to chat! 😊 I have been trained on a massive dataset of text from the internet, and I can provide information on a wide range of topics. I can also generate text in a variety of styles and formats"},
   'finish_reason': 'length'}],
 'usage': {'completion_tokens': 100,
  'prompt_tokens': 341,
  'total_tokens': 441}}
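
In the message above, `finish_reason` is `'length'` and `completion_tokens` equals the `max_tokens=100` set on the chat model earlier, which indicates the reply was cut off by the token limit. A small helper along the following lines could flag this automatically. The function `summarize_response` is hypothetical and assumes only the dictionary layout shown in this output; the message content is abbreviated.

```python
def summarize_response(msg):
    """Return truncation status and token usage from a chat-completion style dict."""
    reasons = [choice.get("finish_reason") for choice in msg.get("choices", [])]
    usage = msg.get("usage", {})
    return {
        "truncated": "length" in reasons,  # any choice stopped by the token limit
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Same structure as the output above, with the message content abbreviated.
last_msg = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "(abbreviated)"},
            "finish_reason": "length",
        }
    ],
    "usage": {"completion_tokens": 100, "prompt_tokens": 341, "total_tokens": 441},
}
print(summarize_response(last_msg))
```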

In [19]:
chat.client.last_response

<Response [200]>

The following example checks the model specification of the previously instantiated chat model's endpoint by retrieving it from the client's internal list of available functions, which can help with advanced debugging.

In [20]:
## Rough Implementation Internals:
# model_id = chat.client.last_inputs.get("url").split("/")[-1]
# known_fns = chat.client.available_functions
# fn_spec = [f for f in known_fns if f.get('id') == model_id][0]
# fn_spec
chat.get_model_details()

{'id': 'e0bb7fb9-5333-4a27-8534-c6288f921d3f',
 'ncaId': 'NVVppkDl81nJZ0GtZORXNeeOoY8h1p1Tfd8vuIPFV18',
 'versionId': 'e83f1041-df20-4624-82f4-dffddadaa073',
 'name': 'playground_llama2_13b',
 'status': 'ACTIVE',
 'inferenceUrl': 'infer',
 'ownedByDifferentAccount': True,
 'inferencePort': 8003,
 'containerEnvironment': [{'key': 'CHECKPOINTS_DIR',
   'value': '/config/models/llama2/'},
  {'key': 'MODEL_NAME', 'value': 'llama2'}],
 'models': [{'name': 'llama2',
   'version': '0.9',
   'uri': '/v2/org/whw3rcpsilnj/team/playground/models/llama2_13b_trt_l40g/0.9/files'}],
 'containerImage': 'nvcr.io/whw3rcpsilnj/playground/llama2_server_optimized:0.16',
 'apiBodyFormat': 'CUSTOM',
 'healthUri': '/v2/health/ready',
 'createdAt': '2023-11-15T07:34:20.700Z'}