Original file line number Diff line number Diff line change
@@ -3,7 +3,53 @@ All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.


## [2.3.0] - 2025-10-20
## [2.3] - 2026-01-08

This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.

### Added
- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README

### Changed
- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
- Llama-3.3-Nemotron-Super-49B powers the RAG backend
- Nemotron-3 Nano 30B (FP8) as default for workload sizing
- New Demo section with screenshots demonstrating key features

- **Sizing Recommendation Improvements**
- Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead)
- Improved profile selection logic: picks smallest profile where (profile × 0.95) >= workload
- Better handling of edge cases near profile boundaries

- **GPU Passthrough Logic**
- Automatic passthrough recommendation when workload exceeds max single vGPU profile
- Clearer passthrough examples in RAG context (e.g., 92GB on BSE → 2× BSE GPU passthrough)
- Calculator now returns `vgpu_profile: null` with multi-GPU passthrough recommendation

- **vLLM Local Deployment**
- Updated to vLLM v0.12.0 for proper NemotronH (hybrid Mamba-Transformer) architecture support
- Improved GPU memory utilization calculations for local testing
- Better max-model-len auto-detection (only set when explicitly specified)

- **Chat Improvements**
- Enhanced conversational mode with vGPU configuration context
- Better model extraction from sizing responses for follow-up questions
- Improved context handling for RAG vs inference workload discussions
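
The sizing and passthrough rules listed in the entries above can be sketched in a few lines of shell. This is a minimal illustration and not the advisor's calculator: the function name `pick_profile`, the profile sizes, and the way the passthrough GPU count is derived are assumptions. Only the 95% usable-capacity rule and the fall-back to passthrough with a null profile come from the changelog. Integer arithmetic (`size × 95 >= workload × 100`) is used to avoid floating point.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the smallest profile whose usable capacity
# (95% of its size) covers the workload, else fall back to passthrough.
# Usage: pick_profile WORKLOAD_GB PROFILE_GB...  (integers, ascending order)
pick_profile() {
  local workload=$1; shift
  if (( $# == 0 )); then
    echo "no profiles given" >&2
    return 2
  fi
  local size largest=0
  for size in "$@"; do
    largest=$size
    # size * 0.95 >= workload, written with integers only
    if (( size * 95 >= workload * 100 )); then
      echo "vgpu_profile=${size}GB"
      return 0
    fi
  done
  # No single profile fits. Assumption: use whole GPUs of the largest
  # size, each also limited to 95% usable memory, and round up.
  local usable=$(( largest * 95 ))
  local gpus=$(( (workload * 100 + usable - 1) / usable ))
  echo "vgpu_profile=null passthrough_gpus=${gpus}"
}

pick_profile 22 8 12 24 48   # -> vgpu_profile=24GB
pick_profile 23 8 12 24 48   # -> vgpu_profile=48GB (24 * 0.95 = 22.8 < 23)
pick_profile 92 24 48 96     # -> vgpu_profile=null passthrough_gpus=2
```

The profile sizes in the calls are illustrative values. The second call shows the boundary case: a 23 GB workload does not fit a 24 GB profile once 5% is reserved, so the next size up is chosen.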

### Improved
- **Nemotron Model Integration**
- Default model changed to Nemotron-3 Nano 30B FP8 in configuration wizard
- Nemotron thinking prompt support for enhanced reasoning
- Better model matching for Nemotron variants in calculator

## [2.2] - 2025-11-04

### Changed
- Updated branding from "vGPU Sizing Advisor" to "AI vWS Sizing Advisor" throughout UI and documentation
- Improved user-facing verbiage for better clarity and consistency

## [2.1] - 2025-10-20

This release focuses on local deployment improvements, enhanced workload differentiation, and improved user experience with advanced configuration options.

@@ -52,7 +98,7 @@ This release focuses on local deployment improvements, enhanced workload differe
- Better visual feedback and status indicators
- Improved configuration wizard flow

## [2.2.0] - 2025-10-13
## [2.0] - 2025-10-13

This release focuses on the AI vWS Sizing Advisor with enhanced deployment capabilities, improved user experience, and zero external dependencies for SSH operations.

@@ -137,8 +183,7 @@ This release focuses on the AI vWS Sizing Advisor with enhanced deployment capab
- SSH key-based authentication (more secure than passwords)
- Automatic key generation with proper permissions (700/600)

## [2.1.0] - 2025-05-13

## [1.2] - 2025-05-13

This release reduces the overall GPU requirement for deploying the blueprint. It also improves performance and stability for both Docker-based and Helm-based deployments.

@@ -168,7 +213,7 @@ This release reduces overall GPU requirement for the deployment of the blueprint

A detailed guide is available [here](./docs/migration_guide.md) to help developers migrate from older versions.

## [2.0.0] - 2025-03-18
## [1.1] - 2025-03-18

This release adds support for multimodal documents using [Nvidia Ingest](https://github.com/NVIDIA/nv-ingest), including parsing of PDF, Word, and PowerPoint documents. It also significantly improves accuracy and performance by refactoring the APIs and architecture, and adds a new developer-friendly UI.

@@ -202,7 +247,7 @@ This release adds support for multimodal documents using [Nvidia Ingest](https:/

A detailed guide is available [here](./docs/migration_guide.md) to help developers migrate from older versions.

## [1.0.0] - 2025-01-15
## [1.0] - 2025-01-15

### Added

Original file line number Diff line number Diff line change
@@ -1,18 +1,67 @@
# vGPU Sizing Advisor for AI vWS
# AI vWS Sizing Advisor

<p align="center">
<img src="deployment_examples/example_rag_config.png" alt="AI vWS Sizing Advisor" width="800">
</p>

<p align="center">
<strong>RAG-powered vGPU sizing recommendations for AI Virtual Workstations</strong><br>
Powered by NVIDIA NeMo™ and Nemotron models
</p>

<p align="center">
<a href="https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html">Official Documentation</a> •
<a href="#demo">Demo</a> •
<a href="#deployment">Quick Start</a> •
<a href="./CHANGELOG.md">Changelog</a>
</p>

---

## Overview

vGPU Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for optimal performance and resource efficiency.
AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for optimal performance and resource efficiency.

### Powered by NVIDIA Nemotron

This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:

- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)

### Key Capabilities

Enter your workload requirements and receive validated recommendations including:

- **vGPU Profile** - Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** - vCPUs, GPU memory, system RAM needed
- **Performance Estimates** - Expected latency, throughput, and time to first token
- **Live Testing** - Instantly deploy and validate your configuration locally using vLLM containers
- **vGPU Profile** Recommended profile (e.g., L40S-24Q) based on your workload
- **Resource Requirements** vCPUs, GPU memory, system RAM needed
- **Performance Estimates** Expected latency, throughput, and time to first token
- **Live Testing** Instantly deploy and validate your configuration locally using vLLM containers

The tool differentiates between RAG and inference workloads by accounting for embedding vectors and database overhead. It suggests GPU passthrough when a workload exceeds the largest single vGPU profile.

---

## Demo

### Configuration Wizard

Configure your workload parameters including model selection, GPU type, quantization, and token sizes:

<p align="center">
<img src="deployment_examples/configuration_wizard.png" alt="Configuration Wizard" width="700">
</p>

### Local Deployment Verification

Validate your configuration by deploying a vLLM container locally and comparing actual GPU memory usage against estimates:

<p align="center">
<img src="deployment_examples/local_deployment.png" alt="Local Deployment" width="700">
</p>
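
The comparison described above amounts to a relative error between the estimated and the measured GPU memory. A rough shell sketch follows. `mem_delta_pct` is a hypothetical helper and not part of the tool, and the measured value is assumed to come from `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`, which reports MiB.

```bash
#!/usr/bin/env bash
# Hypothetical helper: signed percentage by which measured GPU memory
# differs from the estimate (positive means the estimate was too low).
# Usage: mem_delta_pct ESTIMATED_MIB ACTUAL_MIB
mem_delta_pct() {
  local estimated=$1 actual=$2
  awk -v e="$estimated" -v a="$actual" \
    'BEGIN { if (e <= 0) exit 1; printf "%.1f\n", (a - e) * 100 / e }'
}

# Example with fixed numbers. On a GPU host the second argument could be
# "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n1)"
mem_delta_pct 20000 21500   # -> 7.5
```

A small positive value suggests the recommended profile has little headroom left, while a negative value means the estimate was conservative.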

---

## Prerequisites

### Hardware
@@ -44,15 +93,17 @@ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
> **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`). User must be in docker group or have socket permissions.

### API Keys
- **NVIDIA Build API Key** (Required) - [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) - [Create token](https://huggingface.co/settings/tokens) for gated models
- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models

---

## Deployment

**1. Clone and navigate:**
```bash
git clone https://github.com/NVIDIA/GenerativeAIExamples.git
cd GenerativeAIExamples/community/vgpu-sizing-advisor
cd GenerativeAIExamples/community/ai-vws-sizing-advisor
```

**2. Set NGC API key:**
@@ -74,28 +125,32 @@ npm install
npm run dev
```

---

## Usage

2. **Select Workload Type:** RAG or Inference
1. **Select Workload Type:** RAG or Inference

3. **Enter Parameters:**
- Model name (e.g., `meta-llama/Llama-2-7b-chat-hf`)
2. **Enter Parameters:**
- Model name (default: **Nemotron-3 Nano 30B FP8**)
- GPU type
- Prompt size (input tokens)
- Response size (output tokens)
- Quantization (FP16, INT8, INT4)
- Quantization (FP16, FP8, INT8, INT4)
- For RAG: Embedding model and vector dimensions

4. **View Recommendations:**
3. **View Recommendations:**
- Recommended vGPU profiles
- Resource requirements (vCPUs, RAM, GPU memory)
- Performance estimates

5. **Test Locally** (optional):
4. **Test Locally** (optional):
- Run local inference with a containerized vLLM server
- View performance metrics
- Compare actual results versus suggested profile configuration

---

## Management Commands

```bash
@@ -120,6 +175,8 @@ The stop script automatically performs Docker cleanup operations:
- Optionally removes dangling images (`--cleanup-images`)
- Optionally removes all data volumes (`--volumes`)

---

## Adding Documents to RAG Context

The tool includes NVIDIA vGPU documentation by default. To add your own:
@@ -134,8 +191,7 @@ curl -X POST -F "file=@./vgpu_docs/your-document.pdf" http://localhost:8082/v1/i

**Supported formats:** PDF, TXT, DOCX, HTML, PPTX
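
Before uploading, it can help to filter files by the supported formats listed above. The sketch below assumes the check is made on the file extension alone. `is_supported_doc` is a hypothetical helper, and the ingestion URL is deliberately left out because it is not shown in full here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only for the formats listed as supported
# (PDF, TXT, DOCX, HTML, PPTX), judged by file extension, case-insensitive.
is_supported_doc() {
  local ext="${1##*.}"
  # Lowercase the extension (tr keeps this portable to older bash).
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|txt|docx|html|pptx) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the files that would be uploaded. Replace echo with the curl
# command from this section once the endpoint is known.
for f in ./vgpu_docs/*; do
  [ -f "$f" ] || continue          # also skips an unmatched glob
  if is_supported_doc "$f"; then
    echo "would upload: $f"
  else
    echo "skipping unsupported file: $f" >&2
  fi
done
```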



---

## License

@@ -145,6 +201,6 @@ Models governed by [NVIDIA AI Foundation Models Community License](https://docs.

---

**Version:** 2.3.0 (October 2025) - See [CHANGELOG.md](./CHANGELOG.md)
**Version:** 2.3 (January 2026) — See [CHANGELOG.md](./CHANGELOG.md)

**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/)
**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/) | [Official Docs](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html)
Original file line number Diff line number Diff line change
@@ -1,3 +1,11 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main ingestor server which is responsible for ingestion
@@ -38,10 +46,14 @@ services:
NGC_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

##===Embedding Model specific configurations===
# Model name - YAML alias to the centralized config at top of file (not overridable by a host env var; edit the anchor instead)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-"nemoretriever-embedding-ms:8000"}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-2048}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-"nemoretriever-embedding-ms:8000"}
# Embedding dimensions - IMPORTANT: Must match your embedding model!
# nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096
# nvidia/nv-embedqa-mistral-7b-v2: 2048
APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-4096}

##===NV-Ingest Connection Configurations=======
APP_NVINGEST_MESSAGECLIENTHOSTNAME: ${APP_NVINGEST_MESSAGECLIENTHOSTNAME:-"nv-ingest-ms-runtime"}
@@ -115,9 +127,10 @@ services:
- AUDIO_INFER_PROTOCOL=grpc
- CUDA_VISIBLE_DEVICES=0
- MAX_INGEST_PROCESS_WORKERS=${MAX_INGEST_PROCESS_WORKERS:-16}
- EMBEDDING_NIM_MODEL_NAME=${EMBEDDING_NIM_MODEL_NAME:-${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-7b-v2}}
# Embedding model - read from the APP_EMBEDDINGS_MODELNAME env var; the fallback below mirrors the centralized config value
- EMBEDDING_NIM_MODEL_NAME=${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1}
# In case of self-hosted embedding model, use the endpoint url as - https://integrate.api.nvidia.com/v1
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-${APP_EMBEDDINGS_SERVERURL-http://nemoretriever-embedding-ms:8000/v1}}
- EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-http://nemoretriever-embedding-ms:8000/v1}
- INGEST_LOG_LEVEL=DEFAULT
- INGEST_EDGE_BUFFER_SIZE=64
# Message client for development
Original file line number Diff line number Diff line change
@@ -1,3 +1,14 @@
# ============================================================================
# CENTRALIZED MODEL CONFIGURATION
# Change these values to use different models throughout the application
# ============================================================================
x-model-config:
# Chat/LLM Model Configuration
llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"

# Embedding Model Configuration
embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

services:

# Main orchestrator server which stitches together all calls to different services to fulfill the user request
@@ -35,25 +46,16 @@ services:
VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}

##===LLM Model specific configurations===
APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# Model name - YAML alias to the centralized config at top of file (not overridable by a host env var; edit the anchor instead)
APP_LLM_MODELNAME: *llm-model
# url on which llm model is hosted. If "", Nvidia hosted API is used
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-""}

##===Query Rewriter Model specific configurations===
APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"meta/llama-3.1-8b-instruct"}
# url on which query rewriter model is hosted. If "", Nvidia hosted API is used
APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm-llama-8b-ms:8000"}
APP_LLM_SERVERURL: ${APP_LLM_SERVERURL:-""}

##===Embedding Model specific configurations===
# Model name - YAML alias to the centralized config at top of file (not overridable by a host env var; edit the anchor instead)
APP_EMBEDDINGS_MODELNAME: *embedding-model
# url on which embedding model is hosted. If "", Nvidia hosted API is used
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-""}
APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}

##===Reranking Model specific configurations===
# url on which ranking model is hosted. If "", Nvidia hosted API is used
APP_RANKING_SERVERURL: ${APP_RANKING_SERVERURL-""}
APP_RANKING_MODELNAME: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
ENABLE_RERANKER: ${ENABLE_RERANKER:-True}
APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-""}

NVIDIA_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}

@@ -65,7 +67,7 @@ services:

# enable multi-turn conversation in the rag chain - this controls conversation history usage
# while doing query rewriting and in LLM prompt
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-False}
ENABLE_MULTITURN: ${ENABLE_MULTITURN:-True}

# enable query rewriting for multiturn conversation in the rag chain.
# This will improve accuracy of the retriever pipeline but increase latency due to an additional LLM call
@@ -139,10 +141,10 @@ services:
context: ../../frontend
dockerfile: ./Dockerfile
args:
# Model name for LLM
NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct}
# Model name for embeddings
NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
# Model name for LLM - pulls from centralized config at top of file
NEXT_PUBLIC_MODEL_NAME: *llm-model
# Model name for embeddings - pulls from centralized config at top of file
NEXT_PUBLIC_EMBEDDING_MODEL: *embedding-model
# Model name for reranking
NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
# URL for rag server container
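
Several values in these compose files move from `${VAR-default}` to `${VAR:-default}`. The two forms differ only when the variable is set but empty, which matters for the settings whose comments say that an empty string selects the NVIDIA-hosted API. Docker Compose documents the same two forms as POSIX shell, so the difference can be checked directly in a shell. The variable name `DEMO_URL` and both helper names below are illustrative only.

```bash
#!/usr/bin/env bash
# Show how the two default-value forms treat unset and empty variables.
show_dash()       { echo "[${DEMO_URL-fallback:8000}]"; }    # default only if unset
show_colon_dash() { echo "[${DEMO_URL:-fallback:8000}]"; }   # default if unset OR empty

unset DEMO_URL
show_dash          # -> [fallback:8000]
show_colon_dash    # -> [fallback:8000]

DEMO_URL=""
show_dash          # -> []
show_colon_dash    # -> [fallback:8000]
```

With the `:-` form, an explicitly empty value falls back to the default instead of being passed through as an empty string.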