<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

<br>

# <font color="#76b900">**Notebook 2:** LLM Services and AI Foundation Models</font>

<br>

In this notebook, we will explore LLM services! We'll discuss the reasons for and against deploying LLMs on edge devices alongside ways to deliver powerful models to end users through scalable server deployments like those accessible through the NVIDIA AI Foundation Endpoints.

<br>

### **Learning Objectives:**

- Understanding the pros and cons of running LLM services locally vs in a scalable cloud environment.
- Getting familiar with the AI Foundation Model Endpoint schemes, including:
    - The raw low-level connection interface facilitated by packages like `curl` and `requests`
    - The abstractions created to make this interface function seamlessly with open-sourced software like LangChain.
- Getting comfortable with retrieving LLM generations from the pool of endpoints and being able to select a subset of models to build your software on.

<br>

### **Questions To Think About:**

1. What kind of model access should you give a person developing an LLM stack, and how does it compare to the access you need to provide to end-users of an AI-powered web application?
2. When considering which devices to support, what kinds of rigid assumptions are you making about their local compute resources and what types of fallbacks should you implement?
    - What if you wanted to deliver a JupyterLab interface with access to a private LLM deployment to customers?
    - What if now you wanted to support their local JupyterLab environment with your private LLM deployment?
    - Would anything have to change if you decided to support embedded devices (e.g. Jetson Nano)?
3. **[Harder]** Assume you have Stable Diffusion, Mixtral, and Llama-13B deployed on your own compute instance in a cloud environment sharing the same GPU resource. You currently do not have a business use case for Stable Diffusion, but your teams are experimenting with the other two for LLM applications. Should you remove Stable Diffusion from your deployment?

<br>

### **Notebook Source:**

- This notebook is part of a larger [**NVIDIA Deep Learning Institute**](https://www.nvidia.com/en-us/training/) course titled [**Building RAG Agents with LLMs**](https://learn.next.courses.nvidia.com/courses/course-v1:DLI+S-FX-15+V1/about). If sharing this material, please give credit and link back to the original course.

<br>


----

<br>

## **Part 1**: Getting Large Models Into Your Environment

Recall from the last notebook that our current environment has several microservices running on an allocated cloud instance: `jupyter-notebook-server`, `frontend`, and `milvus` (among others). We briefly discussed these, and you may have noticed that an "LLM Service" was never mentioned despite the course's emphasis on LLMs for retrieval...

In this course, we'll be introducing you to the [**NVIDIA AI Foundation Models and Endpoints service**](https://catalog.ngc.nvidia.com/ai-foundation-models) to use it as a legitimate prototyping LLM service with a natural path towards self-hosted large model deployment.

$$---$$


Across just about every domain, deploying massive deep learning models is a common yet challenging task. Today's models, such as Llama 2 (70B parameters) or mixture-of-experts models like Mixtral 8x7B, are products of advanced training methods, vast data resources, and powerful computing systems. Luckily for us, these models have already been trained, and many use cases can already be achieved with off-the-shelf solutions. The real hurdle, however, lies in effectively hosting these models.

**Deployment Scenarios for Large Models:**

1. **High-End Datacenter Deployment:**
> An uncompressed, unquantized model on a data center stack equipped with GPUs like NVIDIA's [A100](https://www.nvidia.com/en-us/data-center/a100/) or [H100](https://www.nvidia.com/en-us/data-center/h100/) to facilitate fast inference and experimentation.
> - **Pros:** Well suited to scalable deployment and experimentation; this stack can handle large training workflows or support multiple users or models at the same time.  
> - **Cons:** It is inefficient to allocate this resource for each user of your service unless the use cases involve model training/fine-tuning or interfacing with lower-level model components.

2. **Modest Datacenter/Specialized Consumer Hardware Deployment:**
> Quantized and further-optimized models can be run (one or two per instance) on more conservative datacenter GPUs such as [L40](https://www.nvidia.com/en-us/data-center/l40/)/[A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/)/[A10](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) or even on some modern consumer GPUs such as the higher-VRAM [RTX 40-series GPUs](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/).
> - **Pros:** This setup balances inference speed with manageable limitations for single-user applications. These sessions can also be deployed on a per-user basis to run one or two large models at a time with raw access to model internals (even if they need quantization).
> - **Cons:** Deploying an instance for each user is still costly at scale, though it may be justifiable for some niche workloads. Alternatively, assuming that users can access these resources in their local environments is likely unreasonable.

3. **Consumer Hardware Deployment:**
> Though heavily limited in ability to propagate data through a neural network, most consumer hardware does have a graphical user interface (GUI), a web browser with internet access, some amount of memory (can safely assume at least 1 GB), and a decently-powerful CPU.
> - **Pros:** This is a reasonable and inclusive starting assumption when considering what kinds of users your services should support.
> - **Cons:** Most consumer hardware currently cannot run more than one large model locally at a time in any configuration, and running even one model requires significant resource management and optimization.
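
To make the hosting hurdle concrete, the following back-of-the-envelope sketch estimates the memory needed just to hold a model's weights. It assumes the footprint is roughly the parameter count times the bytes per parameter and ignores activations, the KV cache, and framework overhead, so real requirements will be higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory (in GB, with 1 GB = 1e9 bytes) needed to store weights only."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param

for name, size_b in [("Llama 2 13B", 13), ("Llama 2 70B", 70)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(size_b, bits):.1f} GB")
```

Under these assumptions, a 70B model at 16-bit precision needs about 140 GB for weights alone, which is more than a single 80 GB A100 or H100 provides, while a 13B model quantized to 4 bits (about 6.5 GB) fits on many consumer GPUs.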


In this course, your environment will be quite representative of typical consumer hardware; though we can kickstart and prototype with microservices, we are constrained by a CPU-only compute environment that will struggle to run an LLM. While this is a significant limitation, we will still be able to take advantage of fast LLM capabilities via:
- Access to a compute-capable service for hosting large models.
- A streamlined interface for command input and result retrieval.

With our foundation in microservices and port-based connections, we are well-positioned to explore effective interfacing options for getting LLM access for our development environment!

----

<br>

## **Part 2:** Hosted Large Model Services

In our pursuit to provide access to Large Language Models (LLMs) in a resource-constrained environment like ours, characterized by CPU-only instances, we'll evaluate various hosting options:

**Black-Box Hosted Models:**
> Services such as [**OpenAI**](https://openai.com/) offer APIs to interact with black-box models like GPT-4. These powerful, well-integrated services can provide simple interfaces to complex pipelines that automatically track memory, call additional models, and incorporate multimodal interfaces as necessary to simplify typical use scenarios. At the same time, they maintain operational opacity and often lack a straightforward path to self-hosting.
> - **Pros:** Easy to use out-of-the-box with a low barrier to entry for an average user.
> - **Cons:** Black-box deployments suffer from potential privacy concerns, limited customization, and cost implications at scale.

**Self-Hosted Models:**

> Behind the scenes of just about all scaled model deployments is one or more giant models running in a data center with scalable resources and lightning-fast bandwidth at their disposal. Though necessary to deploy large models at scale and maintain strong control over the provided interfaces, these systems often require expertise to set up and generally do not work well for supporting non-developer workflows for only one individual at a time. Such systems are much better for supporting many simultaneous users, multiple models, and custom interfaces.
> - **Pros:** They offer the capability to integrate custom datasets and APIs and are primarily designed to support numerous users concurrently.
> - **Cons:** These setups demand technical expertise to set up and are not as practical for an individual non-developer user.

To get the best of both worlds, we will utilize the [**NVIDIA NGC Service**](https://www.nvidia.com/en-us/gpu-cloud/). NGC offers a suite of developer tools for designing and deploying AI solutions. Central to our needs are the [NVIDIA AI Foundation Models](https://www.nvidia.com/en-us/ai-data-science/foundation-models/), which are pre-tuned and pre-optimized models designed for easy out-of-the-box scalable deployment (as-is or with further customization). Furthermore, NGC hosts accessible model endpoints for querying live foundation models in a [scalable DGX-accelerated compute environment](https://www.nvidia.com/en-us/data-center/dgx-platform/).

Currently, the NGC endpoints offer a slew of complimentary queries: as few as 10,000 for the largest models and many more for the smaller models. This generous quota makes NGC an ideal, virtually free resource for prototyping production-oriented systems and experimenting with new models.

----

<br>

## **Part 3:** Getting Access to AI Foundation Endpoints

To get access to your endpoints, you can visit [https://catalog.ngc.nvidia.com/ai-foundation-models](https://catalog.ngc.nvidia.com/ai-foundation-models). This will take you to a screen with a selection of available models, and may look something like this:

> <img src="https://dli-lms.s3.amazonaws.com/assets/s-fx-15-v1/imgs/ai-playground-ui.png" width=1000px/>

You will see a selection of large models spanning a variety of use-cases including chat, instruction following, image generation, etc. Navigating to a particular model starts you off with a demo application that lets you try out the model and see what it's capable of:

> <img src="https://dli-lms.s3.amazonaws.com/assets/s-fx-15-v1/imgs/ai-playground-demo.png" width=1000px/>

However, we will be primarily interested in the API tab, which will provide us with some boilerplate code to access these models in our own development environment:

> <img src="https://dli-lms.s3.amazonaws.com/assets/s-fx-15-v1/imgs/ai-playground-api.png" width=1000px/>

This will be our actual starting point, since we intend to use these models to build interesting systems throughout the course! Before we do that, we will need to generate a free NVAPI-key by signing into or creating an NGC account, which you will be prompted to do when you click the **"Generate Key"** button. This will give you the necessary `$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC` key with a sufficient query quota to perform extensive experimentation and develop custom solutions!

#### **Storing Your API Key**

After generating your key, **please copy it to your own storage and keep it for future use!**

**In The Course Environment:** We have provided a persistent storage system in the form of the `set_key` and `get_key` routes in the `docker_router` service. By allowing the microservice to store the key, all subsequent notebooks will be able to pull it from the persistent service without requiring you to re-enter it. (It's also just a great opportunity to show how message bodies can be accepted via a port interface, as shown near the end of [`docker_router/docker_router.py`](docker_router/docker_router.py).)

**In Any Other Environment:** If the service isn't running or there is any other problem, the `try` body will raise an exception, and the notebook falls back to prompting you for the key and setting the environment variable for the current notebook runtime.

In [1]:
from getpass import getpass
import requests
import os

hard_reset = False  ## <-- Set to True if you want to reset your NVIDIA_API_KEY
while "nvapi-" not in os.environ.get("NVIDIA_API_KEY", "") or hard_reset:
    ## Try to set NVIDIA_API_KEY as part of docker_router routine.
    ##  When running in course container, this helps to save your API key between sessions.
    try:
        ## On a hard reset, skip the stored key so that you are prompted for a new one
        assert not hard_reset
        response = requests.get("http://docker_router:8070/get_key").json()
        assert response.get('nvapi_key')
    except Exception:
        ## Service unavailable (or no key stored yet): fall back to manual entry
        response = {'nvapi_key' : getpass("NVIDIA API Key: ")}
    os.environ["NVIDIA_API_KEY"] = response.get("nvapi_key")
    try:
        requests.post("http://docker_router:8070/set_key/", json={'nvapi_key' : os.environ["NVIDIA_API_KEY"]}).json()
    except Exception:
        pass  ## Key service unavailable; the key only lives in this runtime
    hard_reset = False

print(f"Retrieved NVIDIA_API_KEY beginning with \"{os.environ.get('NVIDIA_API_KEY')[:9]}...\"")

NVIDIA API Key:  ········


Retrieved NVIDIA_API_KEY beginning with "nvapi-gOp..."


Now that we have our `NVIDIA_API_KEY` loaded into a persistent service, we'll be able to pull it in automatically in subsequent notebooks and set the environment variable accordingly.

----

<br>

## **Part 4: [Exercise]** Trying Out The AI Foundation Endpoints

To start out, let's use the provided `requests` boilerplate from an LLM's API entry. `Llama-2-13B` or `mixtral-8x7b` is a reasonable candidate, but feel free to try your own and consider the pool of active options.

To query the model, you can paste the code for the python routine in the cell below. It is much easier to use the streaming script in Python thanks to some special features of the `requests` library, so we have provided some hints to get that started for you. *Feel free to try the non-streaming way if you have the time and interest.*

**NOTE:** If you would like, you can bypass this exercise by simply clicking the `Execute` button under the code block. Doing this will show what happens when you try to invoke the provided example.


In [5]:
import requests
import json
import os

####################################################################################
## HELPERS

## HINT 1: The following streaming header tosses in your API key from the environment:

headers = {
    "Authorization": f"Bearer {os.environ.get('NVIDIA_API_KEY')}",
    "accept": "text/event-stream",
    "content-type": "application/json",
}

## HINT 2: If you're streaming, you can use print(line.decode("utf-8")) for raw responses
##  For more user-friendly responses, you may want to use get_stream_token(line):
def get_stream_token(entry: bytes):
    """Utility: Coerces out ['choices'][0]['delta']['content'] from the bytestream"""
    if not entry: return ""
    entry = entry.decode('utf-8')
    ## Only "data:" lines carry a JSON chunk; anything else yields no token
    if not entry.startswith('data:'): return ""
    try: entry = json.loads(entry[5:])
    except ValueError: return ""  ## e.g. the final "[DONE]" marker is not valid JSON
    return entry.get('choices', [{}])[0].get('delta', {}).get('content') or ""

####################################################################################
## TODO: Save the invocation URL for the endpoint here
invoke_url = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/df2bee43-fb69-42b9-9ee5-f4eabbeaf3a8"

## TODO: Construct the payload, which will be sent over to the endpoint
payload = {
    "messages": [
    {
      "content": "Write the Fibonacci sequence in Python",
      "role": "user"
    }
  ],
  "temperature": 0.2,
  "top_p": 0.7,
  "max_tokens": 1024,
  "seed": 42,
  "stream": True
}


## Use requests.post to send the header (streaming meta-info) and the payload to the endpoint
## Make sure streaming is enabled, and expect the response to have an iter_lines response.
response = requests.post(invoke_url, headers=headers, json=payload, stream=True)

## If your response is an error message, this will raise an exception in Python
response.raise_for_status()

## If the post request is honored, you should be able to iterate over the streamed lines
for line in response.iter_lines():
    print(get_stream_token(line), end="")
    # if line: print(line.decode("utf-8"))

<br>

**OBSERVATIONS:**

**You may notice that the chat models expect "messages" as input:**

This may be unexpected if you're more used to raw LLM interfaces like those of local HuggingFace models, but it will look pretty standard to users of OpenAI models. By enforcing a restricted interface instead of a raw text completion one, the service can have more control over what the users can do. There are plenty of pros and cons to this interface, with some noteworthy ones below:
- A service might restrict the use of a specific role type or parameter (e.g. system message restriction, priming message to get arbitrary generation, etc).
- A service might enforce custom prompt formats and implement extra options under the hood that rely on the chat interface.
- A service might use stronger assumptions to implement deeper optimizations in the inference pipeline.
- A service might mimic another popular interface to leverage existing ecosystem compatibilities.

All of these are valid reasons, and it's important to consider which interface options are best for your particular use cases when choosing or deploying your own service.
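
As an illustration of the custom-prompt-format point, a service behind a chat interface might flatten the `messages` list into a single prompt string before handing it to the model. The sketch below uses a made-up template (the `<|role|>` tags, the `ALLOWED_ROLES` policy, and the `render_prompt` name are assumptions for this example, not the format used by any particular endpoint) and also shows how a service could reject roles it does not want to support.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # hypothetical service policy

def render_prompt(messages: list[dict]) -> str:
    """Flatten chat messages into one prompt string using an illustrative template."""
    parts = []
    for msg in messages:
        role = msg["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"Unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{msg['content']}\n")
    # End with an open assistant tag so the model continues as the assistant
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = render_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Because the template lives on the server, the service can change it (or the set of allowed roles) without any change to client code, which is one reason a restricted chat interface can be attractive.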

**You may notice that there are two fundamental ways of querying the models:**

You can **invoke without streaming**, in which case the service response will come all at once after it has been computed in full. This is great when you need the entire output of the model before doing anything else; for example, when you want to print out the whole result or use it for downstream tasks. The response body will look something like this:

```json
{
    "id": "d34d436a-c28b-4451-aa9c-02eed2141ed3",
    "choices": [{
        "index": 0,
        "message": { "role": "assistant", "content": "Bonjour! ..." },
        "finish_reason": "stop"
    }],
    "usage": {
        "completion_tokens": 450,
        "prompt_tokens": 152,
        "total_tokens": 602
    }
}
```
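
A minimal sketch of reading such a body follows. It parses a hard-coded copy of the sample response rather than calling the endpoint, so the variable names are illustrative only; with `requests`, `response.json()` would perform the `json.loads` step.

```python
import json

# Sample body copied from the non-streaming response shown above
body = """
{
    "id": "d34d436a-c28b-4451-aa9c-02eed2141ed3",
    "choices": [{
        "index": 0,
        "message": { "role": "assistant", "content": "Bonjour! ..." },
        "finish_reason": "stop"
    }],
    "usage": { "completion_tokens": 450, "prompt_tokens": 152, "total_tokens": 602 }
}
"""

data = json.loads(body)
choice = data["choices"][0]                   # first (and here only) candidate
text = choice["message"]["content"]           # the generated text
finish_reason = choice["finish_reason"]       # why generation ended
total_tokens = data["usage"]["total_tokens"]  # prompt + completion tokens

print(text)
print(f"finish_reason={finish_reason}, total_tokens={total_tokens}")
```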

You can also **invoke with streaming**, in which case the service will send out a series of partial responses (chunks) until a final one is sent out. This is great when you can use the responses of the service as they become available (which is very good for language model components that print the output directly to the user as it gets generated). In this case, the response body will look a lot more like this:

```json
data:{"id":"80553c6c-5243-41a7-a3ea-6d344fc9e636","choices":[{"index":0,"delta":{"role":"assistant","content":"Bon"},"finish_reason":null}]}
data:{"id":"80553c6c-5243-41a7-a3ea-6d344fc9e636","choices":[{"index":0,"delta":{"role":"assistant","content":"j"},"finish_reason":null}]}
...
data:{"id":"80553c6c-5243-41a7-a3ea-6d344fc9e636","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}
data:[DONE]
```
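
As a rough sketch of consuming this format, the snippet below joins the `delta` content of each chunk until the `[DONE]` marker. It runs on hard-coded sample lines (with shortened ids) instead of a live `response.iter_lines()` call, and the `collect_stream` helper is a name invented for this example.

```python
import json

# Sample lines in the shape of the streamed response shown above (ids shortened)
sample_lines = [
    b'data:{"id":"80553c6c","choices":[{"index":0,"delta":{"role":"assistant","content":"Bon"},"finish_reason":null}]}',
    b'',
    b'data:{"id":"80553c6c","choices":[{"index":0,"delta":{"role":"assistant","content":"j"},"finish_reason":null}]}',
    b'data:{"id":"80553c6c","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}',
    b'data:[DONE]',
]

def collect_stream(lines) -> str:
    """Join the 'delta' content of each chunk until the [DONE] marker."""
    tokens = []
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        chunk = line[len("data:"):].strip()
        if chunk == "[DONE]":
            break  # the service signals that the stream is complete
        delta = json.loads(chunk)["choices"][0]["delta"]
        tokens.append(delta.get("content") or "")
    return "".join(tokens)

full_text = collect_stream(sample_lines)
print(full_text)
```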

Both of these options can be done with relative ease using Python's `requests` library, but using the interface as-is will result in a lot of repetitive code. Luckily, we have some systems that make this significantly easier to use and incorporate into larger projects!

----

<br>

## **Part 5:** AI Foundation Models in LangChain

Despite seeming cumbersome, the manual request-oriented API provided by the endpoints is extremely useful for developing arbitrary applications. Whether your goal environment is a Jupyter environment, a node backend, or an embedded edge device, the core commands will be sufficient to construct an interface to make the service work from just about any network-accessible device – given enough time and effort. With that said, we'll need to make some connectors if we want our service to work with other frameworks.

The goal of a **connector** is to convert an arbitrary API from its native core into one that a target code-base would expect. In this course, we'll want to take advantage of LangChain's thriving chain-centric ecosystem, but the raw `requests` API will not take us all the way there. Under the hood, every LangChain chat model that isn't hosted locally has to rely on such an API, but the developer-facing API is a much cleaner [`LLM` or `SimpleChatModel`-style interface](https://python.langchain.com/docs/modules/model_io/) with default parameters and some simple utility functions like `generate` and `stream`. Unfortunately, creating such a connector is well out of scope for this course and requires a decent amount of effort to get right. ***Luckily for us, it already exists!***
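
To make the idea of a connector concrete, here is a toy sketch. It is not how LangChain or the NVIDIA connector is implemented; `ToyChatConnector` and `fake_post` are hypothetical names, and the transport is a stand-in that echoes the prompt so the example runs without network access. The point is only that payload construction and default parameters hide behind simple `invoke` and `stream` methods.

```python
from typing import Callable, Iterator

class ToyChatConnector:
    """Illustrative connector: hides payload construction behind invoke/stream."""

    def __init__(self, post_fn: Callable[[dict], Iterator[str]], temperature: float = 0.2):
        self.post_fn = post_fn          # low-level transport (e.g. a requests wrapper)
        self.temperature = temperature  # default parameter held by the connector

    def _payload(self, prompt: str, stream: bool) -> dict:
        return {
            "messages": [{"role": "user", "content": prompt}],
            "temperature": self.temperature,
            "stream": stream,
        }

    def stream(self, prompt: str) -> Iterator[str]:
        """Yield tokens as the transport produces them."""
        yield from self.post_fn(self._payload(prompt, stream=True))

    def invoke(self, prompt: str) -> str:
        """Return the full response as one string."""
        return "".join(self.post_fn(self._payload(prompt, stream=False)))

def fake_post(payload: dict) -> Iterator[str]:
    """Stand-in transport that echoes the prompt word by word (no network)."""
    for word in payload["messages"][-1]["content"].split():
        yield word + " "

toy_chat = ToyChatConnector(fake_post)
print(toy_chat.invoke("Hello there"))
print(list(toy_chat.stream("Hello there")))
```

Swapping `fake_post` for a function that posts to a real endpoint would leave the calling code unchanged, which is the property that lets a framework treat many different services uniformly.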

In [None]:
## Necessary for Colab, not necessary for course environment
# %pip install -q langchain-nvidia-ai-endpoints

#### **ChatNVIDIA**

To start off our exploration of the LangChain interface, we can begin with the NVIDIA AI Foundation Endpoint LangChain model, `ChatNVIDIA`.

This model is part of the LangChain extended ecosystem and can be installed locally via `pip install langchain-nvidia-ai-endpoints`.

Below, assuming we do not know what models we have at our disposal, we can query the list of available models through the backend client:

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints._common import NVEModel  ## Backend Model

## Using the backbone NVIDIA Endpoints client, which makes the calls as you saw above
NVEModel().available_models

In order to pull in and start using a model, you just need to:
- Import and instantiate your `ChatNVIDIA` instance with a model name.
- Then, you can choose from LangChain's supported chat model interfacing options, chief among them `.invoke`.

Below, find an LLM model you would like to use and supply its name as the `model` argument to your `ChatNVIDIA` constructor. Then, feel free to query it with whatever string you'd like!

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

## NOTE: "playground_" prefix is optional for our client
chat = ChatNVIDIA(model='llama2_13b')
chat.invoke("Hello! How's it going?")

<br>


So... that works, but why? We imported some model called `ChatNVIDIA`, constructed it, and called it. The result was a LangChain `AIMessage`, and allegedly the response came from the AI Foundation endpoint, so where is all of the information about the model URLs, the request structure, and even the API key?

It turns out that under the hood, a background `NVEModel` (NVIDIA Endpoint client) makes all of the primitive GET and POST requests on our behalf! If you inspect it, you will notice that it holds all of the information necessary to make the manual call, and it can even be used to do so via primitive methods like `_get`/`_post` or more specialized (but still low-level) ones like `get_req_generation`/`get_req_astream`/etc. In contrast, the higher-level `ChatNVIDIA` model specifies many default values and implements the components required by LangChain while relying on the lower-level `NVEModel` functionality.  

Of note, both of these components are [pydantic `BaseModel`](https://docs.pydantic.dev/latest/concepts/models/) types, which will keep showing up throughout the course. Details about this construct will be brought up as necessary, but one nice property is that iterating over an instance yields `(field name, value)` pairs, so an instance can be treated much like a dictionary. This means we can print out which variables belong to `ChatNVIDIA` and which ones have been moved into the `NVEModel` client. This should help to further validate that these components operate at different abstraction levels.

In [3]:
## Dictionary comprehension of the form {key: value for key, value in dict.items()}
{k:v for k,v in chat if k != 'client'}


In [None]:
{k:v for k,v in chat.client if k != 'headers_tmpl'}

----

<br>

## **Part 6:** Wrap-Up

The goal of this notebook was to provide some discussion centered around LLM service hosting strategies and introduce you to the AI Foundation Model endpoints. We hope that along the way you have gained an intuitive understanding of how remote LLM systems can be provided and accessed from edge devices!

### <font color="#76b900">**Great Job!**</font>

### **Next Steps:**
1. **Make sure your API key is stored somewhere where you can easily get it.**
2. **[Optional]** Revisit the **"Questions To Think About" Section** at the top of the notebook and think about some possible answers.
3. Continue on to the next video, which will talk about **LangChain and Gradio**.
4. After the video, go on to the corresponding notebook on **LangChain and Gradio**.

<br>
