<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB1: Using NVIDIA NIM API
## Introducing NVIDIA's API catalog
In this lab we are going to use Python to access various models from NVIDIA's API catalog. You can explore the catalog by opening [https://build.nvidia.com/](https://build.nvidia.com/) in your web browser.
You can interact with the models graphically there, but in this workshop we will learn how to access them programmatically. For this, you will need an API key.

## Getting a key from NVIDIA
<font color="red"><b>NOTE:</b> You need to have an NVIDIA API key and save it as a Google Colab secret named "apikey" (the name the code in this lab reads). Do that before proceeding any further.</font>



## NIM API Reference
Another site worth mentioning is the API reference, available at [https://docs.api.nvidia.com/nim/reference/llm-apis](https://docs.api.nvidia.com/nim/reference/llm-apis). It lists the API endpoints that the NIM exposes for each model. For chat models this is typically <b>POST /v1/chat/completions</b>.
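To make the endpoint concrete, here is a minimal sketch of what a raw request to it might look like using only the Python standard library. The base URL is the one used later in this lab; the helper name `build_chat_request` and the `YOUR_API_KEY` placeholder are hypothetical. The request is only built, not sent, so the cell runs without a key.

```python
import json
import urllib.request

BASE_URL = "https://integrate.api.nvidia.com/v1"  # same base URL used later in this lab

def build_chat_request(api_key, model, prompt):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=BASE_URL + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("YOUR_API_KEY", "meta/llama-3.2-3b-instruct", "Hello!")
print(req.get_method(), req.full_url)

# To actually send it (needs a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

In the rest of the lab the openai library builds and sends this kind of request for us.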

## Installing dependencies and loading libraries
The first step is to install the necessary libraries. In this case we only need the openai Python library. Its API has become a de facto industry standard, and many providers, including NVIDIA NIM, expose a compatible interface.

In [None]:
!pip install openai

Now we can import the component we need for this lab.

In [None]:
from openai import OpenAI

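Next we read the key from the Google Colab secret named "apikey" and store it in a variable called "apikey" for future use. The commented-out lines at the top show the alternative of reading it from an environment variable. You can uncomment the "print" command if you want to check that the key has been read correctly. In Python, a comment is created by starting the line with a #, so uncommenting means removing that #.
After that, we'll set a variable for the model we're using.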

In [None]:
#import os
#apikey = os.environ["NVIDIA_API_KEY"]
#change from OS variable import to using Google Colab secret
from google.colab import userdata
apikey = userdata.get('apikey')
#print(apikey)
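If you also run this notebook outside Colab, a small helper can try the Colab secret first and fall back to an environment variable. This is a sketch under assumptions: the function name `load_api_key` is hypothetical, and the secret and variable names simply mirror the ones in the cell above.

```python
import os

def load_api_key(secret_name="apikey", env_name="NVIDIA_API_KEY"):
    """Return the key from a Colab secret if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(secret_name)
    except ImportError:
        # Not running in Colab: fall back to the environment variable (None if unset).
        return os.environ.get(env_name)

key = load_api_key()
# Report only whether a key was found, rather than printing the key itself.
print("key loaded:", key is not None)
```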

Now we'll set a default model. Start with llama-3.2-3b-instruct; once you complete the lab you can come back, change it, and compare the results.
Make sure only one line sets the default, i.e. all other default_model lines should be commented out with a #.

In [None]:
#set the model
default_model = "meta/llama-3.2-3b-instruct"
#default_model = "meta/llama-3.1-405b-instruct"
#default_model = "deepseek-ai/deepseek-r1"
print(default_model)

## Getting our first completion

Let's create a client instance. This client can access all models, so there is no need for a separate client connection for each model.
Notice how we are using the "apikey" variable. The alternative would be to paste your key directly, wrapped in double quotes, but that risks exposing the key if you share the notebook.

In [None]:
client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = apikey
)

<b>IMPORTANT</b>: NVIDIA is hosting the NIMs and exposing them via a REST API. However, the same NIMs are available for download as containers. In that case the only change to the code here is the "base_url" parameter, which will point to the NIM running in your Kubernetes cluster. When running NIMs locally, the "api_key" parameter can be set to anything.
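As a rough illustration of that switch, the sketch below picks connection settings for either the hosted API or a self-hosted NIM. The helper name `client_settings` is hypothetical, and the local address `http://localhost:8000/v1` is an assumption; replace it with the address of your own NIM service.

```python
# Hypothetical helper: choose settings for the hosted API or a self-hosted NIM.
HOSTED_URL = "https://integrate.api.nvidia.com/v1"
LOCAL_URL = "http://localhost:8000/v1"  # assumption: adjust to your own NIM address

def client_settings(api_key, local=False):
    """Return the keyword arguments to pass to OpenAI(...)."""
    if local:
        # A local NIM does not check the key, so any placeholder string works.
        return {"base_url": LOCAL_URL, "api_key": "not-used"}
    return {"base_url": HOSTED_URL, "api_key": api_key}

settings = client_settings("YOUR_API_KEY", local=True)
print(settings["base_url"])

# With the openai package installed, the client would then be created as:
# client = OpenAI(**settings)
```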

Finally, we can send our prompt to a model by using the client we just created. This is done with the "chat.completions.create" method which mimics OpenAI's API <br>
Notice the following:
- we are requesting a specific model with model=default_model
- the syntax of each message: "role" and "content". The role can be "system", "user" or "assistant"
- parameters that modify the behavior of the model.
  - "temperature" controls how random the LLM is when it generates text. Lower temperatures produce more predictable, near-deterministic responses. Conversely, higher temperatures yield more varied and creative responses
  - "max_tokens" is the maximum number of tokens the model may generate in its response. It does not count the input (prompt) tokens. It is a way of capping response length, and therefore cost and latency.
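To see how roles and parameters fit together, here is a small sketch that assembles the arguments for a request with an optional "system" message. The helper name `build_request` and the example system prompt are hypothetical; the resulting dictionary uses only the argument names shown in this lab.

```python
def build_request(model, user_prompt, system_prompt=None,
                  temperature=0.2, max_tokens=1024):
    """Assemble keyword arguments for client.chat.completions.create()."""
    messages = []
    if system_prompt:
        # The system message sets overall behavior and comes before the user turn.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

request = build_request(
    "meta/llama-3.2-3b-instruct",
    "Write a limerick about the wonders of GPU computing.",
    system_prompt="You are a poet who always answers in rhyme.",
)
print([m["role"] for m in request["messages"]])

# In the notebook this could then be sent with:
# completion = client.chat.completions.create(**request)
```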

In [None]:
completion = client.chat.completions.create(
  model=default_model,
  messages=[
      {
          "role":"user",
          "content":"Write a limerick about the wonders of GPU computing."
      }
  ],
  temperature=0.2,
  max_tokens=1024
)

The completion variable contains the entire response from the model. In the next command we show only the actual content of the response message.

In [None]:
print(completion.choices[0].message.content)
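Beyond the message text, the response object also carries metadata such as why generation stopped and how many tokens were used. The sketch below reads those fields; the helper name `summarize` is hypothetical, and the `fake` object is only a stand-in with the same attribute layout so the example runs offline.

```python
from types import SimpleNamespace

def summarize(completion):
    """Pull the reply text, finish reason and token counts out of a completion."""
    choice = completion.choices[0]
    usage = getattr(completion, "usage", None)  # may be missing on some responses
    return {
        "text": choice.message.content,
        # "stop" means the model finished on its own; "length" means max_tokens was hit.
        "finish_reason": choice.finish_reason,
        "prompt_tokens": usage.prompt_tokens if usage else None,
        "completion_tokens": usage.completion_tokens if usage else None,
    }

# Stand-in for a real response, for illustration only.
fake = SimpleNamespace(
    choices=[SimpleNamespace(
        message=SimpleNamespace(content="There once was a GPU..."),
        finish_reason="length",
    )],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=8, total_tokens=20),
)
print(summarize(fake))
```

In the notebook you could call `summarize(completion)` on the real response instead.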

## Streaming response

In the previous code we waited for the whole output to be generated before displaying it on the screen. However, for use cases where the user is interacting live with the system, it is better to stream the tokens as they are generated. If the words arrive faster than a human can read them (roughly 5 words per second), the experience will feel satisfactory.

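Notice how we are setting the "stream" flag and then extracting the content from each of the completion chunks as they arrive. This cell also sets "top_p", which restricts sampling to the most probable tokens whose cumulative probability reaches the given value.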

In [None]:
completion = client.chat.completions.create(
  model=default_model,
  messages=[{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
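If you also want to keep the full text and see how long the first token took, the loop can be wrapped in a small helper. This is a sketch: the name `collect_stream` is hypothetical, the timer starts when iteration begins, and `fake_chunks` merely imitates the shape of streamed chunks so the example runs offline.

```python
import time
from types import SimpleNamespace

def collect_stream(chunks):
    """Print pieces as they arrive; return (full_text, seconds_to_first_token)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        piece = chunk.choices[0].delta.content
        if piece is None:  # e.g. a chunk that only carries the role
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts), first_token_at

def fake_chunk(content):
    """Build a stand-in chunk with the same attribute layout as a streamed one."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])

fake_chunks = [fake_chunk(None), fake_chunk("Hello "), SimpleNamespace(choices=[]), fake_chunk("GPU")]
text, ttft = collect_stream(fake_chunks)
```

In the notebook you could pass the streaming `completion` object in place of `fake_chunks`.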


## Ideas to explore further

To finish this lab you can experiment with other models. Try using the same prompt with different models and compare the quality of the response. Some model suggestions:
- meta/llama-3.1-8b-instruct
- meta/llama-3.1-405b-instruct
- deepseek-ai/deepseek-r1
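One way to run that comparison is to time the same prompt against each model. The sketch below is deliberately generic: `compare_models` and `fake_ask` are hypothetical names, and `fake_ask` is an offline stand-in so the cell runs without a key. The commented-out `ask` shows how the real client from this lab could be plugged in.

```python
import time

def compare_models(ask, models, prompt):
    """Call ask(model, prompt) for each model; record elapsed seconds and the reply."""
    results = []
    for model in models:
        start = time.perf_counter()
        reply = ask(model, prompt)
        elapsed = time.perf_counter() - start
        results.append({"model": model, "seconds": elapsed, "reply": reply})
    return results

# In the notebook, `ask` could wrap the client created earlier, for example:
# def ask(model, prompt):
#     c = client.chat.completions.create(
#         model=model,
#         messages=[{"role": "user", "content": prompt}],
#         max_tokens=1024,
#     )
#     return c.choices[0].message.content

def fake_ask(model, prompt):
    """Offline stand-in that just echoes its inputs."""
    return f"[{model}] reply to: {prompt}"

models = ["meta/llama-3.1-8b-instruct", "meta/llama-3.1-405b-instruct"]
rows = compare_models(fake_ask, models, "Write a limerick about the wonders of GPU computing.")
for row in rows:
    print(f'{row["model"]}: {row["seconds"]:.3f}s')
```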

Things to observe:
- You will notice that larger models tend to have higher latency
- LRMs, or Large Reasoning Models, such as DeepSeek-R1 generate a lot of reasoning tokens. Look at all the "reasoning" DeepSeek-R1 does while creating the little poem
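To look at the reasoning separately from the final poem, you can split the response text. The sketch below assumes the reasoning arrives inside `<think>...</think>` tags in the message content; check a real response first, since this format is an assumption here. The helper name `split_reasoning` and the sample string are hypothetical.

```python
import re

def split_reasoning(text):
    """Separate a <think>...</think> block from the final answer, if one is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the tags is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

sample = "<think>Limericks use an AABBA rhyme scheme.</think>\nThere once was a GPU so fast..."
reasoning, answer = split_reasoning(sample)
print("reasoning words:", len(reasoning.split()))
print("answer:", answer)
```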


## End of Lab1