NVIDIA · chloecrozier · Jan 8, 2026
diff --git a/community/vgpu-sizing-advisor/.dockerignore → ...unity/ai-vws-sizing-advisor/.dockerignore b/community/vgpu-sizing-advisor/.dockerignore → ...unity/ai-vws-sizing-advisor/.dockerignore
diff --git a/community/vgpu-sizing-advisor/.gitattributes → ...nity/ai-vws-sizing-advisor/.gitattributes b/community/vgpu-sizing-advisor/.gitattributes → ...nity/ai-vws-sizing-advisor/.gitattributes
diff --git a/community/vgpu-sizing-advisor/.gitignore → community/ai-vws-sizing-advisor/.gitignore b/community/vgpu-sizing-advisor/.gitignore → community/ai-vws-sizing-advisor/.gitignore
diff --git a/community/vgpu-sizing-advisor/CHANGELOG.md → community/ai-vws-sizing-advisor/CHANGELOG.md b/community/vgpu-sizing-advisor/CHANGELOG.md → community/ai-vws-sizing-advisor/CHANGELOG.md
@@ -3,7 +3,53 @@ All notable changes to this project will be documented in this file.
 The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
 
 
-## [2.3.0] - 2025-10-20
+## [2.3] - 2026-01-08
+
+This release focuses on improved sizing recommendations, enhanced Nemotron model integration, and comprehensive documentation updates.
+
+### Added
+- **Demo Screenshots** — Added visual examples showcasing the Configuration Wizard, RAG-powered sizing recommendations, and Local Deployment verification
+- **Official Documentation Link** — Added link to [NVIDIA vGPU Docs Hub](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html) in README
+
+### Changed
+- **README Overhaul** — Reorganized documentation to highlight NVIDIA Nemotron models
+  - Llama-3.3-Nemotron-Super-49B powers the RAG backend
+  - Nemotron-3 Nano 30B (FP8) as default for workload sizing
+  - New Demo section with screenshots demonstrating key features
+
+- **Sizing Recommendation Improvements**
+  - Enhanced 95% usable capacity rule for profile selection (5% reserved for system overhead)
+  - Improved profile selection logic: picks the smallest profile where (profile × 0.95) >= workload
+  - Better handling of edge cases near profile boundaries
+
+- **GPU Passthrough Logic**
+  - Automatic passthrough recommendation when workload exceeds max single vGPU profile
+  - Clearer passthrough examples in RAG context (e.g., 92GB on BSE → 2× BSE GPU passthrough)
+  - Calculator now returns `vgpu_profile: null` with multi-GPU passthrough recommendation
+
+- **vLLM Local Deployment**
+  - Updated to vLLM v0.12.0 for proper NemotronH (hybrid Mamba-Transformer) architecture support
+  - Improved GPU memory utilization calculations for local testing
+  - Better max-model-len auto-detection (the option is now set only when explicitly specified)
+
+- **Chat Improvements**
+  - Enhanced conversational mode with vGPU configuration context
+  - Better model extraction from sizing responses for follow-up questions
+  - Improved context handling for RAG vs inference workload discussions
+
+### Improved
+- **Nemotron Model Integration**
+  - Default model changed to Nemotron-3 Nano 30B FP8 in configuration wizard
+  - Nemotron thinking prompt support for enhanced reasoning
+  - Better model matching for Nemotron variants in calculator
+
+## [2.2] - 2025-11-04
+
+### Changed
+- Updated branding from "vGPU Sizing Advisor" to "AI vWS Sizing Advisor" throughout UI and documentation
+- Improved user-facing verbiage for better clarity and consistency
+
+## [2.1] - 2025-10-20
 
 This release focuses on local deployment improvements, enhanced workload differentiation, and improved user experience with advanced configuration options.
 
@@ -52,7 +98,7 @@ This release focuses on local deployment improvements, enhanced workload differe
   - Better visual feedback and status indicators
   - Improved configuration wizard flow
 
-## [2.2.0] - 2025-10-13
+## [2.0] - 2025-10-13
 
 This release focuses on the AI vWS Sizing Advisor with enhanced deployment capabilities, improved user experience, and zero external dependencies for SSH operations.
 
@@ -137,8 +183,7 @@ This release focuses on the AI vWS Sizing Advisor with enhanced deployment capab
 - SSH key-based authentication (more secure than passwords)
 - Automatic key generation with proper permissions (700/600)
 
-## [2.1.0] - 2025-05-13
-
+## [1.2] - 2025-05-13
 
 This release reduces overall GPU requirement for the deployment of the blueprint. It also improves the performance and stability for both docker and helm based deployments.
 
@@ -168,7 +213,7 @@ This release reduces overall GPU requirement for the deployment of the blueprint
 
 A detailed guide is available [here](./docs/migration_guide.md) for easing developers experience, while migrating from older versions.
 
-## [2.0.0] - 2025-03-18
+## [1.1] - 2025-03-18
 
 This release adds support for multimodal documents using [Nvidia Ingest](https://github.com/NVIDIA/nv-ingest) including support for parsing PDFs, Word and PowerPoint documents. It also significantly improves accuracy and perf considerations by refactoring the APIs, architecture as well as adds a new developer friendly UI.
 
@@ -202,7 +247,7 @@ This release adds support for multimodal documents using [Nvidia Ingest](https:/
 
 A detailed guide is available [here](./docs/migration_guide.md) for easing developers experience, while migrating from older versions.
 
-## [1.0.0] - 2025-01-15
+## [1.0] - 2025-01-15
 
 ### Added
 

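Review note on the 2.3 changelog entry: the profile selection and passthrough rules described there can be sketched as follows. This is a minimal illustration, not the advisor's actual calculator. The function name `recommend`, the profile sizes, and the 96 GB per-GPU figure are hypothetical. Applying the 95% rule to passthrough GPUs is also an assumption, chosen because it reproduces the changelog's "92GB → 2× GPU passthrough" example.

```python
import math

# Per the changelog: 5% of each profile is reserved for system overhead.
USABLE_FRACTION = 0.95


def recommend(workload_gb: float, profiles_gb: list[float], gpu_memory_gb: float) -> dict:
    """Pick the smallest profile whose usable capacity covers the workload.

    Falls back to multi-GPU passthrough when no single profile is large enough.
    """
    for size in sorted(profiles_gb):
        if size * USABLE_FRACTION >= workload_gb:
            return {"vgpu_profile": size, "passthrough_gpus": 0}
    # Assumption: whole GPUs in passthrough are held to the same 95% rule.
    gpus = math.ceil(workload_gb / (gpu_memory_gb * USABLE_FRACTION))
    return {"vgpu_profile": None, "passthrough_gpus": gpus}


# Hypothetical profile sizes in GB, for illustration only.
print(recommend(22, [8, 12, 24, 48], 48))   # 24 GB profile: 24 * 0.95 = 22.8 >= 22
print(recommend(23, [8, 12, 24, 48], 48))   # just past the boundary, so the 48 GB profile
print(recommend(92, [24, 48, 96], 96))      # no profile fits, so passthrough with 2 GPUs
```

The second call shows the boundary case the changelog mentions: a 23 GB workload no longer fits a 24 GB profile once the 5% reserve is applied, so the next profile up is selected.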
diff --git a/community/vgpu-sizing-advisor/README.md → community/ai-vws-sizing-advisor/README.md b/community/vgpu-sizing-advisor/README.md → community/ai-vws-sizing-advisor/README.md
@@ -1,18 +1,67 @@
-# vGPU Sizing Advisor for AI vWS
+# AI vWS Sizing Advisor
+
+<p align="center">
+  <img src="deployment_examples/example_rag_config.png" alt="AI vWS Sizing Advisor" width="800">
+</p>
+
+<p align="center">
+  <strong>RAG-powered vGPU sizing recommendations for AI Virtual Workstations</strong><br>
+  Powered by NVIDIA NeMo™ and Nemotron models
+</p>
+
+<p align="center">
+  <a href="https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html">Official Documentation</a> •
+  <a href="#demo">Demo</a> •
+  <a href="#deployment">Quick Start</a> •
+  <a href="./CHANGELOG.md">Changelog</a>
+</p>
+
+---
 
 ## Overview
 
-vGPU Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for optimal performance and resource efficiency.
+AI vWS Sizing Advisor is a RAG-powered tool that helps you determine the optimal NVIDIA vGPU sizing configuration for AI workloads on NVIDIA AI Virtual Workstation (AI vWS). Using NVIDIA vGPU documentation and best practices, it provides tailored recommendations for optimal performance and resource efficiency.
+
+### Powered by NVIDIA Nemotron
+
+This tool leverages **NVIDIA Nemotron models** for intelligent sizing recommendations:
+
+- **[Llama-3.3-Nemotron-Super-49B](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1)** — Powers the RAG backend for intelligent conversational sizing guidance
+- **[Nemotron-3 Nano 30B](https://build.nvidia.com/nvidia/nvidia-nemotron-3-nano-30b-a3b-fp8)** — Default model for workload sizing calculations (FP8 optimized)
+
+### Key Capabilities
 
 Enter your workload requirements and receive validated recommendations including:
 
-- **vGPU Profile** - Recommended profile (e.g., L40S-24Q) based on your workload
-- **Resource Requirements** - vCPUs, GPU memory, system RAM needed
-- **Performance Estimates** - Expected latency, throughput, and time to first token
-- **Live Testing** - Instantly deploy and validate your configuration locally using vLLM containers
+- **vGPU Profile** — Recommended profile (e.g., L40S-24Q) based on your workload
+- **Resource Requirements** — vCPUs, GPU memory, system RAM needed
+- **Performance Estimates** — Expected latency, throughput, and time to first token
+- **Live Testing** — Instantly deploy and validate your configuration locally using vLLM containers
 
 The tool differentiates between RAG and inference workloads by accounting for embedding vectors and database overhead. It intelligently suggests GPU passthrough when jobs exceed standard vGPU profile limits.
 
+---
+
+## Demo
+
+### Configuration Wizard
+
+Configure your workload parameters including model selection, GPU type, quantization, and token sizes:
+
+<p align="center">
+  <img src="deployment_examples/configuration_wizard.png" alt="Configuration Wizard" width="700">
+</p>
+
+### Local Deployment Verification
+
+Validate your configuration by deploying a vLLM container locally and comparing actual GPU memory usage against estimates:
+
+<p align="center">
+  <img src="deployment_examples/local_deployment.png" alt="Local Deployment" width="700">
+</p>
+
+---
+
 ## Prerequisites
 
 ### Hardware
@@ -44,15 +93,17 @@ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
 > **Note:** Docker must be at `/usr/bin/docker` (verified in `deploy/compose/docker-compose-rag-server.yaml`). User must be in docker group or have socket permissions.
 
 ### API Keys
-- **NVIDIA Build API Key** (Required) - [Get your key](https://build.nvidia.com/settings/api-keys)
-- **HuggingFace Token** (Optional) - [Create token](https://huggingface.co/settings/tokens) for gated models
+- **NVIDIA Build API Key** (Required) — [Get your key](https://build.nvidia.com/settings/api-keys)
+- **HuggingFace Token** (Optional) — [Create token](https://huggingface.co/settings/tokens) for gated models
+
+---
 
 ## Deployment
 
 **1. Clone and navigate:**
 ```bash
 git clone https://github.com/NVIDIA/GenerativeAIExamples.git
-cd GenerativeAIExamples/community/vgpu-sizing-advisor
+cd GenerativeAIExamples/community/ai-vws-sizing-advisor
 ```
 
 **2. Set NGC API key:**
@@ -74,28 +125,32 @@ npm install
 npm run dev
 ```
 
+---
+
 ## Usage
 
-2. **Select Workload Type:** RAG or Inference
+1. **Select Workload Type:** RAG or Inference
 
-3. **Enter Parameters:**
-   - Model name (e.g., `meta-llama/Llama-2-7b-chat-hf`)
+2. **Enter Parameters:**
+   - Model name (default: **Nemotron-3 Nano 30B FP8**)
    - GPU type
    - Prompt size (input tokens)
    - Response size (output tokens)
-   - Quantization (FP16, INT8, INT4)
+   - Quantization (FP16, FP8, INT8, INT4)
    - For RAG: Embedding model and vector dimensions
 
-4. **View Recommendations:**
+3. **View Recommendations:**
    - Recommended vGPU profiles
    - Resource requirements (vCPUs, RAM, GPU memory)
    - Performance estimates
 
-5. **Test Locally** (optional):
+4. **Test Locally** (optional):
    - Run local inference with a containerized vLLM server
    - View performance metrics
    - Compare actual results versus suggested profile configuration
 
+---
+
 ## Management Commands
 
 ```bash
@@ -120,6 +175,8 @@ The stop script automatically performs Docker cleanup operations:
 - Optionally removes dangling images (`--cleanup-images`)
 - Optionally removes all data volumes (`--volumes`)
 
+---
+
 ## Adding Documents to RAG Context
 
 The tool includes NVIDIA vGPU documentation by default. To add your own:
@@ -134,8 +191,7 @@ curl -X POST -F "file=@./vgpu_docs/your-document.pdf" http://localhost:8082/v1/i
 
 **Supported formats:** PDF, TXT, DOCX, HTML, PPTX
 
-
-
+---
 
 ## License
 
@@ -145,6 +201,6 @@ Models governed by [NVIDIA AI Foundation Models Community License](https://docs.
 
 ---
 
-**Version:** 2.3.0 (October 2025) - See [CHANGELOG.md](./CHANGELOG.md)
+**Version:** 2.3 (January 2026) — See [CHANGELOG.md](./CHANGELOG.md)
 
-**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/)
+**Support:** [GitHub Issues](https://github.com/NVIDIA/GenerativeAIExamples/issues) | [NVIDIA Forums](https://forums.developer.nvidia.com/) | [Official Docs](https://docs.nvidia.com/vgpu/toolkits/sizing-advisor/latest/intro.html)
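
Review note on the README's Usage section: the quantization options listed there (FP16, FP8, INT8, INT4) mainly change how many bytes each model parameter occupies. The sketch below is a rough, weights-only estimate and is not the advisor's calculation; the real tool also accounts for items such as KV cache and, for RAG, embedding and database overhead. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion: float, quantization: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quantization]


# A 30B-parameter model at the precisions offered in the wizard.
for quant in ("FP16", "FP8", "INT4"):
    print(quant, weight_memory_gb(30, quant))
```

Under this estimate, a 30B-parameter model needs about 60 GB for weights at FP16, 30 GB at FP8, and 15 GB at INT4, which is why the quantization choice can move a workload across vGPU profile boundaries.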
diff --git a/...y/vgpu-sizing-advisor/deploy/compose/.env → ...ai-vws-sizing-advisor/deploy/compose/.env b/...y/vgpu-sizing-advisor/deploy/compose/.env → ...ai-vws-sizing-advisor/deploy/compose/.env
diff --git a/...visor/deploy/compose/accuracy_profile.env → ...visor/deploy/compose/accuracy_profile.env b/...visor/deploy/compose/accuracy_profile.env → ...visor/deploy/compose/accuracy_profile.env
diff --git a/...loy/compose/docker-compose-bootstrap.yaml → ...loy/compose/docker-compose-bootstrap.yaml b/...loy/compose/docker-compose-bootstrap.yaml → ...loy/compose/docker-compose-bootstrap.yaml
diff --git a/...mpose/docker-compose-ingestor-server.yaml → ...mpose/docker-compose-ingestor-server.yaml b/...mpose/docker-compose-ingestor-server.yaml → ...mpose/docker-compose-ingestor-server.yaml
@@ -1,3 +1,11 @@
+# ============================================================================
+# CENTRALIZED MODEL CONFIGURATION
+# Change these values to use different models throughout the application
+# ============================================================================
+x-model-config:
+  # Embedding Model Configuration
+  embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
+
 services:
 
   # Main ingestor server which is responsible for ingestion
@@ -38,10 +46,14 @@ services:
       NGC_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}
 
       ##===Embedding Model specific configurations===
+      # Model name - YAML alias to the centralized config at top of file (edit the anchor there to change it)
+      APP_EMBEDDINGS_MODELNAME: *embedding-model
       # url on which embedding model is hosted. If "", Nvidia hosted API is used
-      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-"nemoretriever-embedding-ms:8000"}
-      APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
-      APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-2048}
+      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-"nemoretriever-embedding-ms:8000"}
+      # Embedding dimensions - IMPORTANT: Must match your embedding model!
+      # nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1: 4096
+      # nvidia/nv-embedqa-mistral-7b-v2: 2048
+      APP_EMBEDDINGS_DIMENSIONS: ${APP_EMBEDDINGS_DIMENSIONS:-4096}
 
       ##===NV-Ingest Connection Configurations=======
       APP_NVINGEST_MESSAGECLIENTHOSTNAME: ${APP_NVINGEST_MESSAGECLIENTHOSTNAME:-"nv-ingest-ms-runtime"}
@@ -115,9 +127,10 @@ services:
       - AUDIO_INFER_PROTOCOL=grpc
       - CUDA_VISIBLE_DEVICES=0
       - MAX_INGEST_PROCESS_WORKERS=${MAX_INGEST_PROCESS_WORKERS:-16}
-      - EMBEDDING_NIM_MODEL_NAME=${EMBEDDING_NIM_MODEL_NAME:-${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-7b-v2}}
+      # Embedding model - read from the APP_EMBEDDINGS_MODELNAME host variable; keep the fallback in sync with the centralized config
+      - EMBEDDING_NIM_MODEL_NAME=${APP_EMBEDDINGS_MODELNAME:-nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1}
       # Incase of self-hosted embedding model, use the endpoint url as - https://integrate.api.nvidia.com/v1
-      - EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-${APP_EMBEDDINGS_SERVERURL-http://nemoretriever-embedding-ms:8000/v1}}
+      - EMBEDDING_NIM_ENDPOINT=${EMBEDDING_NIM_ENDPOINT:-http://nemoretriever-embedding-ms:8000/v1}
       - INGEST_LOG_LEVEL=DEFAULT
       - INGEST_EDGE_BUFFER_SIZE=64
       # Message client for development

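Review note on the interpolation change in this file: several hunks switch from `${VAR-default}` to `${VAR:-default}`. In Compose interpolation, the `-` form falls back to the default only when the variable is unset, while the `:-` form also falls back when it is set but empty. The sketch below emulates that difference in Python; the function `interpolate` is a hypothetical helper, not part of Docker Compose.

```python
from typing import Mapping, Optional


def interpolate(env: Mapping[str, str], name: str, default: str, colon: bool) -> str:
    """Emulate ${NAME-default} (colon=False) and ${NAME:-default} (colon=True)."""
    value: Optional[str] = env.get(name)
    if value is None:
        return default      # unset: both forms use the default
    if colon and value == "":
        return default      # set but empty: only the ':-' form falls back
    return value


default_url = "nemoretriever-embedding-ms:8000"
env = {"APP_EMBEDDINGS_SERVERURL": ""}  # user exports an empty value

print(repr(interpolate(env, "APP_EMBEDDINGS_SERVERURL", default_url, colon=False)))  # ''
print(repr(interpolate(env, "APP_EMBEDDINGS_SERVERURL", default_url, colon=True)))
```

One consequence worth checking: the adjacent comment says an empty server URL selects the NVIDIA hosted API, but with the `:-` form an empty `APP_EMBEDDINGS_SERVERURL` now resolves to the self-hosted default in this file. In the rag-server file the default is already `""`, so the change has no effect there.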
diff --git a/...mpose/docker-compose-nemo-guardrails.yaml → ...mpose/docker-compose-nemo-guardrails.yaml b/...mpose/docker-compose-nemo-guardrails.yaml → ...mpose/docker-compose-nemo-guardrails.yaml
diff --git a/...oy/compose/docker-compose-rag-server.yaml → ...oy/compose/docker-compose-rag-server.yaml b/...oy/compose/docker-compose-rag-server.yaml → ...oy/compose/docker-compose-rag-server.yaml
@@ -1,3 +1,14 @@
+# ============================================================================
+# CENTRALIZED MODEL CONFIGURATION
+# Change these values to use different models throughout the application
+# ============================================================================
+x-model-config:
+  # Chat/LLM Model Configuration
+  llm-model: &llm-model "nvidia/llama-3.3-nemotron-super-49b-v1"
+
+  # Embedding Model Configuration
+  embedding-model: &embedding-model "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"
+
 services:
 
   # Main orchestrator server which stiches together all calls to different services to fulfill the user request
@@ -35,25 +46,16 @@ services:
       VECTOR_DB_TOPK: ${VECTOR_DB_TOPK:-100}
 
       ##===LLM Model specific configurations===
-      APP_LLM_MODELNAME: ${APP_LLM_MODELNAME:-"meta/llama-3.1-8b-instruct"}
+      # Model name - YAML alias to the centralized config at top of file (edit the anchor there to change it)
+      APP_LLM_MODELNAME: *llm-model
       # url on which llm model is hosted. If "", Nvidia hosted API is used
-      APP_LLM_SERVERURL: ${APP_LLM_SERVERURL-""}
-
-      ##===Query Rewriter Model specific configurations===
-      APP_QUERYREWRITER_MODELNAME: ${APP_QUERYREWRITER_MODELNAME:-"meta/llama-3.1-8b-instruct"}
-      # url on which query rewriter model is hosted. If "", Nvidia hosted API is used
-      APP_QUERYREWRITER_SERVERURL: ${APP_QUERYREWRITER_SERVERURL-"nim-llm-llama-8b-ms:8000"}
+      APP_LLM_SERVERURL: ${APP_LLM_SERVERURL:-""}
 
       ##===Embedding Model specific configurations===
+      # Model name - YAML alias to the centralized config at top of file (edit the anchor there to change it)
+      APP_EMBEDDINGS_MODELNAME: *embedding-model
       # url on which embedding model is hosted. If "", Nvidia hosted API is used
-      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL-""}
-      APP_EMBEDDINGS_MODELNAME: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
-
-      ##===Reranking Model specific configurations===
-      # url on which ranking model is hosted. If "", Nvidia hosted API is used
-      APP_RANKING_SERVERURL: ${APP_RANKING_SERVERURL-""}
-      APP_RANKING_MODELNAME: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
-      ENABLE_RERANKER: ${ENABLE_RERANKER:-True}
+      APP_EMBEDDINGS_SERVERURL: ${APP_EMBEDDINGS_SERVERURL:-""}
 
       NVIDIA_API_KEY: ${NGC_API_KEY:?"NGC_API_KEY is required"}
 
@@ -65,7 +67,7 @@ services:
 
       # enable multi-turn conversation in the rag chain - this controls conversation history usage
       # while doing query rewriting and in LLM prompt
-      ENABLE_MULTITURN: ${ENABLE_MULTITURN:-False}
+      ENABLE_MULTITURN: ${ENABLE_MULTITURN:-True}
 
       # enable query rewriting for multiturn conversation in the rag chain.
       # This will improve accuracy of the retrieiver pipeline but increase latency due to an additional LLM call
@@ -139,10 +141,10 @@ services:
       context: ../../frontend
       dockerfile: ./Dockerfile
       args:
-        # Model name for LLM
-        NEXT_PUBLIC_MODEL_NAME: ${APP_LLM_MODELNAME:-meta/llama-3.1-8b-instruct}
-        # Model name for embeddings
-        NEXT_PUBLIC_EMBEDDING_MODEL: ${APP_EMBEDDINGS_MODELNAME:-nvidia/nv-embedqa-mistral-7b-v2}
+        # Model name for LLM - pulls from centralized config at top of file
+        NEXT_PUBLIC_MODEL_NAME: *llm-model
+        # Model name for embeddings - pulls from centralized config at top of file
+        NEXT_PUBLIC_EMBEDDING_MODEL: *embedding-model
         # Model name for reranking
         NEXT_PUBLIC_RERANKER_MODEL: ${APP_RANKING_MODELNAME:-nv-rerank-qa-mistral-4b:1}
         # URL for rag server container