diff --git a/nemo/Evaluator/Live Evaluation/README.md b/nemo/Evaluator/Live Evaluation/README.md new file mode 100644 index 000000000..c515a3722 --- /dev/null +++ b/nemo/Evaluator/Live Evaluation/README.md @@ -0,0 +1,83 @@ +# Live Evaluation Implementation + +This repository demonstrates how to leverage Live Evaluation through NeMo Evaluator Microservice for real-time evaluation of LLM outputs. The example includes both simple string checking and Custom LLM-as-a-Judge evaluation of medical consultation summaries using Llama 3.3 Nemotron Super 49B as the judge. + +## Overview + +Live Evaluation enables real-time evaluation without pre-creating persistent evaluation targets and configurations. The implementation demonstrates two key evaluation types: +- **Simple String Checking**: Direct validation of outputs against expected values +- **Custom LLM-as-a-Judge**: Real-time evaluation of medical summaries for correctness (rated 0-4) + +## Prerequisites + +- Docker and Docker Compose installed +- NVIDIA NGC API key for container access +- NVIDIA API key from build.nvidia.com (for the judge LLM) +- NeMo Microservices Python SDK + +## Project Structure + +The project includes: +- Docker Compose configuration for local NeMo Evaluator deployment +- Jupyter notebook demonstrating live evaluation workflows +- Configuration files for the microservices setup +- Example medical consultation data for evaluation + +## Key Components + +1. **Local Deployment** + - Uses Docker Compose to run NeMo Evaluator locally + - Includes NeMo Data Store for data management + - Configured for development and testing + +2. **Simple String Checking** + - Validates outputs using string comparison + - Supports various comparison operators + - Returns immediate evaluation results + +3. 
**Custom LLM-as-a-Judge** + - Uses Llama 3.3 Nemotron Super 49B as judge + - Custom prompt templates for evaluating correctness + - Regex-based score extraction (0-4 scale) + - Real-time evaluation without persistent configs + +## Setup Instructions + +1. **Login to NGC Container Registry** + ```bash + docker login -u '$oauthtoken' -p YOUR_NGC_KEY_HERE nvcr.io + ``` + +2. **Set Environment Variables** + ```bash + export EVALUATOR_IMAGE=nvcr.io/nvidia/nemo-microservices/evaluator:25.07 + export DATA_STORE_IMAGE=nvcr.io/nvidia/nemo-microservices/datastore:25.07 + export USER_ID=$(id -u) + export GROUP_ID=$(id -g) + ``` + +3. **Start Services** + ```bash + docker compose -f docker_compose.yaml up evaluator -d + ``` + +## Results + +The live evaluation provides immediate feedback including: +- Evaluation status (completed/failed) +- Scores with statistical metrics (mean, count, sum) +- Detailed results for each evaluation metric + +## Dependencies + +See `pyproject.toml` for a complete list of dependencies. Key requirements include: +- datasets>=3.5.0 +- huggingface-hub>=0.30.2 +- nemo-microservices>=1.0.1 +- openai>=1.76.0 + +You can run `uv sync` to produce the required `.venv`! + +## Documentation + +For more detailed information about Live Evaluation, refer to the [official NeMo documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-live.html). diff --git a/nemo/Evaluator/Live Evaluation/docker_compose.yaml b/nemo/Evaluator/Live Evaluation/docker_compose.yaml new file mode 100644 index 000000000..7ca4dad50 --- /dev/null +++ b/nemo/Evaluator/Live Evaluation/docker_compose.yaml @@ -0,0 +1,310 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary +# +# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual +# property and proprietary rights in and to this material, related +# documentation and any modifications thereto. Any use, reproduction, +# disclosure or distribution of this material and related documentation +# without an express license agreement from NVIDIA CORPORATION or +# its affiliates is strictly prohibited. + +# docker compose -f docker_compose.yaml up -d +services: + customizer: + image: ${CUSTOMIZER_IMAGE:-""} + container_name: nemo-customizer + restart: on-failure + ports: + - "8001:8001" + volumes: + - ./customizer:/mount/cfg + # map a path to model if already exists + # otherwise, the model will be automatically downloaded from NGC to /app/models and /app + # Ex: /raid/models/llama-3_1-8b-instruct:/app/models/llama-3_1-8b-instruct + # - /llama-3_1-8b-instruct:/app/models/llama-3_1-8b-instruct + environment: + - CONFIG_PATH=/mount/cfg/customizer_config.yaml + - DB_HOST=nemo-postgresql + - DB_PORT=5432 + - DB_USER=test_user + - DB_PASSWORD=1234 + - DB_NAME=customizer + - PORT=8001 + - NGC_API_KEY=${NGC_API_KEY:-""} + - OTEL_SDK_DISABLED=true + healthcheck: + test: ["CMD", "curl", "http://localhost:8001/v1/health/live"] + interval: 10s + timeout: 3s + retries: 3 + depends_on: + nemo-postgresql: + condition: service_healthy + entity-store: + condition: service_started + data-store: + condition: service_started + networks: + - nemo-ms + deploy: + resources: + reservations: + devices: + - driver: nvidia + capabilities: [gpu] + count: all + shm_size: "1G" + + entity-store: + image: ${ENTITY_STORE_IMAGE:-""} + platform: linux/amd64 + container_name: nemo-entity-store + restart: on-failure + ports: + - "8003:8000" + environment: + - POSTGRES_PASSWORD=1234 + - POSTGRES_USER=test_user + - POSTGRES_HOST=nemo-postgresql + - POSTGRES_DB=entity-store + - BASE_URL_DATASTORE=http://data-store:3000/v1/hf + - 
BASE_URL_NIM=http://nim:8002 + depends_on: + entity-store-initializer: + condition: service_completed_successfully + networks: + - nemo-ms + + entity-store-initializer: + image: ${ENTITY_STORE_IMAGE:-""} + platform: linux/amd64 + working_dir: /app/services/entity-store + environment: + - POSTGRES_PASSWORD=1234 + - POSTGRES_USER=test_user + - POSTGRES_HOST=nemo-postgresql + - POSTGRES_DB=entity-store + depends_on: + nemo-postgresql: + condition: service_healthy + entrypoint: ["/app/.venv/bin/python3", "-m", "scripts.run_db_migration"] + networks: + - nemo-ms + + evaluator: + image: ${EVALUATOR_IMAGE:-""} + container_name: nemo-evaluator + restart: on-failure + ports: + - 7331:7331 + depends_on: + data-store: + condition: service_started + nemo-postgresql: + condition: service_healthy + evaluator-postgres-db-migration: + condition: service_completed_successfully + otel-collector: + condition: service_started + networks: + - nemo-ms + healthcheck: + test: ["CMD", "curl", "http://localhost:7331/health"] + interval: 10s + timeout: 3s + retries: 3 + environment: + MODE: standalone + # Dependencies + POSTGRES_URI: postgresql://test_user:1234@nemo-postgresql:5432/evaluation + ARGO_HOST: none + NAMESPACE: nemo-evaluation + DATA_STORE_URL: http://data-store:3000/v1/hf + EVAL_CONTAINER: ${EVALUATOR_IMAGE} + SERVICE_ACCOUNT: nemo-evaluator-test-workflow-executor + EVAL_ENABLE_VALIDATION: False + # OpenTelemetry environmental variables + OTEL_SERVICE_NAME: nemo-evaluator + OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317 + OTEL_TRACES_EXPORTER: otlp + OTEL_METRICS_EXPORTER: none + OTEL_LOGS_EXPORTER: otlp + OTEL_PYTHON_EXCLUDED_URLS: "health" + OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED: "true" + CONSOLE_LOG_LEVEL: DEBUG + OTEL_LOG_LEVEL: DEBUG + LOG_LEVEL: DEBUG + + evaluator-postgres-db-migration: + image: ${EVALUATOR_IMAGE:-""} + environment: + MODE: standalone + POSTGRES_URI: postgresql://test_user:1234@nemo-postgresql:5432/evaluation + DATA_STORE_URL: none + 
ARGO_HOST: none + NAMESPACE: none + EVAL_CONTAINER: none + LOG_LEVEL: INFO + entrypoint: /bin/sh + command: ["-c", "/app/scripts/run-db-migration.sh"] + depends_on: + nemo-postgresql: + condition: service_healthy + networks: + - nemo-ms + + nemo-postgresql: + image: bitnami/postgresql:16.1.0-debian-11-r20 + container_name: nemo-postgresql + platform: linux/amd64 + restart: unless-stopped + environment: + - POSTGRESQL_VOLUME_DIR=/bitnami/postgresql + - PGDATA=/bitnami/postgresql/data + - POSTGRES_USER=test_user + - POSTGRES_PASSWORD=1234 + - POSTGRES_DATABASE=postgres + # List of databases to create if they do not exist + - DATABASES=entity-store,ndsdb,customizer,evaluation + ports: + - "5432:5432" + volumes: + - nemo-postgresql:/bitnami/postgresql:rw + - ./init_scripts:/docker-entrypoint-initdb.d:ro + networks: + - nemo-ms + healthcheck: + test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DATABASE}"] + interval: 10s + timeout: 3s + retries: 3 + + data-store-volume-init: + image: busybox + command: ["sh", "-c", "chmod -R 777 /nds-data"] + volumes: + - nemo-data-store:/nds-data + restart: no + deploy: + restart_policy: + condition: none + + data-store: + image: ${DATA_STORE_IMAGE:-""} + platform: linux/amd64 + container_name: nemo-data-store + restart: on-failure + environment: + - USER_UID=${USER_ID} # match this to the UID of the owner of the data directory + - USER_GID=${GROUP_ID} # match this to the GID of the owner of the data directory + - APP_NAME=Datastore + - INSTALL_LOCK=true + - DISABLE_SSH=true + - GITEA_WORK_DIR=/nds-data + - GITEA__SERVER__APP_DATA_PATH=/nds-data + - GITEA__DAEMON_USER=git + - GITEA__HTTP_PORT=3000 + - GITEA__APP__NAME=datastore + - GITEA__SERVER__LFS_START_SERVER=true + - GITEA__LFS__SERVE_DIRECT=true + - GITEA__LFS__STORAGE_TYPE=local + - GITEA__LFS_START__SERVER=true + - GITEA__SECURITY__INSTALL_LOCK=true + - GITEA__SERVICE__DEFAULT_ALLOW_CREATE_ORGANIZATION=true + - GITEA__SMTP_ENABLED=false + # Database + - 
GITEA__DATABASE__DB_TYPE=postgres + - GITEA__DATABASE__HOST=nemo-postgresql:5432 + - GITEA__DATABASE__NAME=ndsdb + - GITEA__DATABASE__USER=test_user + - GITEA__DATABASE__PASSWD=1234 + - GITEA__DATABASE_SSL_MODE=disable + volumes: + - nemo-data-store:/nds-data:rw + - /etc/timezone:/etc/timezone:ro + - /etc/localtime:/etc/localtime:ro + ports: + - "3000:3000" + healthcheck: + test: ["CMD", "curl", "http://localhost:3000/v1/health"] + interval: 10s + timeout: 3s + retries: 3 + depends_on: + nemo-postgresql: + condition: service_healthy + data-store-volume-init: + condition: service_completed_successfully + networks: + - nemo-ms + + # Optional NIM requires additional 1 GPU of at least 40GB /v1/health/ready + # nim: + # image: ${NIM_IMAGE:-""} + # container_name: nim + # restart: on-failure + # ports: + # - 8002:8000 + # environment: + # - NGC_API_KEY=${NGC_API_KEY} + # - NIM_SERVER_PORT=8000 + # - NIM_SERVED_MODEL_NAME=${NIM_MODEL_ID} + # - NIM_PEFT_REFRESH_INTERVAL=60 + # - NIM_MAX_GPU_LORAS=1 + # - NIM_MAX_CPU_LORAS=16 + # - NIM_PEFT_SOURCE=http://entity-store:8000 + # runtime: nvidia + # volumes: [] + # # Map a local directory to the cache directory to avoid downloading the model every time + # # Ensure to set write permissions on the local directory for all users: chmod -R a+w /path/to/directory + # # Ex: /raid/nim-cache:/opt/nim/.cache. 
Brev: - /ephemeral/.cache/nim-cache:/opt/nim/.cache + # networks: + # - nemo-ms + # shm_size: 16GB + # user: root + # deploy: + # resources: + # reservations: + # devices: + # - driver: nvidia + # capabilities: [gpu] + # count: all + # healthcheck: + # test: [ + # "CMD", + # "python3", + # "-c", + # "import requests, sys; sys.exit(0 if requests.get('http://localhost:8002/v1/health/live').ok else 1)" + # ] + # interval: 10s + # timeout: 3s + # retries: 20 + # # allow for 60 seconds to download a model and start up + # start_period: 60s + + + ### + # OpenTelemetry Collector (local) + # adapted from https://jessitron.com/2021/08/11/run-an-opentelemetry-collector-locally-in-docker/ + # and https://github.com/open-telemetry/opentelemetry-demo/blob/main/docker-compose.yml + ### + otel-collector: + image: otel/opentelemetry-collector-contrib:0.91.0 + command: ["--config=/etc/otel-collector-config.yaml"] + volumes: + - ./config/otel-collector-config.yaml:/etc/otel-collector-config.yaml + ports: + - "4317:4317" # OTLP over gRPC receiver + - "55679:55679" # UI + networks: + - nemo-ms + +networks: + nemo-ms: + driver: bridge + +volumes: + nemo-data-store: + driver: local + nemo-postgresql: + driver: local diff --git a/nemo/Evaluator/Live Evaluation/init_scripts/create_databases.sh b/nemo/Evaluator/Live Evaluation/init_scripts/create_databases.sh new file mode 100644 index 000000000..3ab9e8a7e --- /dev/null +++ b/nemo/Evaluator/Live Evaluation/init_scripts/create_databases.sh @@ -0,0 +1,29 @@ +#!/bin/bash +set -e + +# DATABASES is a list of database names to create if they do not exist, separated by commas +IFS=',' read -r -a DATABASES <<< ${DATABASES:-"entity-store,ndsdb,customizer,evaluation"} +echo "Creating additional databases: ${DATABASES[*]}" + +until PGPASSWORD=$POSTGRESQL_PASSWORD psql -h localhost -U $POSTGRESQL_USERNAME -d $POSTGRESQL_DATABASE -c '\q'; do + echo "PostgreSQL is unavailable - sleeping" + sleep 1 +done + +echo "PostgreSQL is up - creating NeMo 
microservices databases" + +# Create each database if it doesn't exist +for db in "${DATABASES[@]}"; do + if ! PGPASSWORD=$POSTGRESQL_PASSWORD psql -h localhost -U $POSTGRESQL_USERNAME -lqt | cut -d \| -f 1 | grep -qw $db; then + echo "Creating database: $db" + PGPASSWORD=$POSTGRESQL_PASSWORD psql -v ON_ERROR_STOP=1 -h localhost -U $POSTGRESQL_USERNAME -d $POSTGRESQL_DATABASE <<-EOSQL + CREATE DATABASE "$db"; + GRANT ALL PRIVILEGES ON DATABASE "$db" TO $POSTGRESQL_USERNAME; +EOSQL + echo "Database $db created successfully." + else + echo "Database $db already exists. Skipping creation." + fi +done + +echo "Additional databases created successfully: ${DATABASES[*]}" diff --git a/nemo/Evaluator/Live Evaluation/live_evaluation.ipynb b/nemo/Evaluator/Live Evaluation/live_evaluation.ipynb new file mode 100644 index 000000000..894b7a76d --- /dev/null +++ b/nemo/Evaluator/Live Evaluation/live_evaluation.ipynb @@ -0,0 +1,319 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Live Evaluations with NeMo Evaluator\n", + "\n", + "In the following notebook, we'll be walking through an example of how you can leverage Live Evaluations through NeMo Evaluator Microservice. \n", + "\n", + "Full documentation is available [here](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-live.html)!\n", + "\n", + "In our example - we'll be looking at the following scenarios: \n", + "\n", + "1. Simple String Checking\n", + "2. Custom LLM-as-a-Judge on Synthetically Created Medical Summaries\n", + "\n", + "> NOTE: Currently, live evaluation is only supported with the `custom` evaluation type!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Necessary Configurations\n", + "\n", + "You will need to spin up the NeMo Evaluator Microservice using the `docker_compose.yaml` file provided in this directory. \n", + "\n", + "You can do so with the following commands:\n", + "\n", + "1. 
Login to NVIDIA NGC Container Registry:\n", + "\n", + "```bash\n", + "docker login -u '$oauthtoken' -p YOUR_NGC_KEY_HERE nvcr.io\n", + "```\n", + "\n", + "2. Set up the initial environment variables (make sure you've correctly set up Docker so that it can be run from your user group)\n", + "\n", + "```bash \n", + "export EVALUATOR_IMAGE=nvcr.io/nvidia/nemo-microservices/evaluator:25.07\n", + "export DATA_STORE_IMAGE=nvcr.io/nvidia/nemo-microservices/datastore:25.07\n", + "export USER_ID=$(id -u)\n", + "export GROUP_ID=$(id -g)\n", + "```\n", + "\n", + "3. Spin up NeMo Evaluator Microservice through `docker compose`!\n", + "\n", + "```bash\n", + "docker compose -f docker_compose.yaml up evaluator -d\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### NeMo Microservices Client\n", + "\n", + "Next, let's initialize our NeMo Microservices client through the [Python SDK](https://docs.nvidia.com/nemo/microservices/latest/pysdk/index.html)!\n", + "\n", + "> NOTE: By default, the NeMo Evaluator API will be available at: `http://localhost:7331`. " + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from nemo_microservices import NeMoMicroservices\n", + "\n", + "client = NeMoMicroservices(\n", + " base_url=\"http://localhost:7331\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using NeMo Evaluator Microservice for Live Simple String Checking\n", + "\n", + "We can kick off an evaluation job for simple string checking right away using the `custom` evaluation type, with the `data` subtype!\n", + "\n", + "Let's look at how we do this with the SDK." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Status: completed\n", + "Results: EvaluationResult(job='eval-JoR3GCSrjtRkC9jPFYyMv2', id='evaluation_result-3iaZjDE3a6tt4W7ag3vNsZ', created_at=datetime.datetime(2025, 7, 16, 22, 12, 21, 687891), custom_fields={}, description=None, files_url=None, groups={}, namespace='default', ownership=None, project=None, tasks={'qa': TaskResult(metrics={'accuracy': MetricResult(scores={'string-check': Score(value=1.0, stats=ScoreStats(count=1, max=None, mean=1.0, min=None, stddev=None, stderr=None, sum=1.0, sum_squared=None, variance=None))})})}, updated_at=datetime.datetime(2025, 7, 16, 22, 12, 21, 687893))\n" + ] + } + ], + "source": [ + "# Run a basic string check live evaluation\n", + "response = client.evaluation.live(\n", + " config={\n", + " \"type\": \"custom\",\n", + " \"tasks\": {\n", + " \"qa\": {\n", + " \"type\": \"data\",\n", + " \"metrics\": {\n", + " \"accuracy\": {\n", + " \"type\": \"string-check\",\n", + " \"params\": {\"check\": [\"{{some_output}}\", \"contains\", \"{{expected}}\"]}\n", + " }\n", + " }\n", + " }\n", + " }\n", + " },\n", + " target={\n", + " \"type\": \"rows\",\n", + " \"rows\": [\n", + " {\n", + " \"some_input\": \"Do you agree?\",\n", + " \"some_output\": \"yes\",\n", + " \"expected\": \"yes\"\n", + " }\n", + " ]\n", + " }\n", + ")\n", + "\n", + "print(f\"Status: {response.status}\")\n", + "print(f\"Results: {response.result}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using NeMo Evaluator Microservice for Live Custom LLM-as-a-Judge\n", + "\n", + "We can also extend this to Custom LLM-as-a-Judge using a dataset that we have in our local environment!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We're going to use the [llama-3.3-nemotron-super-49b-v1](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1) as our judge model today.\n", + "\n", + "> NOTE: You can find the API key on `build.nvidia.com` by clicking the green \"Get API Key\" button!" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "\n", + "os.environ[\"NVIDIA_API_KEY\"] = getpass.getpass(\"Enter your NVIDIA API key: \")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To keep things organized, we'll initialize our model object in a separate code cell - but this is going to be provided alongside the rest of our evaluation config when we create it through the SDK!" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "model_target = {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1\",\n", + " \"model_id\": \"nvidia/llama-3.3-nemotron-super-49b-v1\",\n", + " \"api_key\": os.getenv(\"NVIDIA_API_KEY\")\n", + " }\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll do the same for our prompt. Notice that we're able to key into the appropriate fields using the `{{}}` templating. \n", + "\n", + "> NOTE: Since we're using regex to parse the output scores - ensure your output format template is well defined. " + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "correctness_prompt = \"\"\"\n", + "Your task is to determine if the summary correctly reflects the consultation.\n", + "\n", + "CONSULTATION CONTENT: {{content}}\n", + "SUMMARY: {{summary}}\n", + "\n", + "Reply with a score between 0 and 4, where 0 is the worst and 4 is the best. 
You must respond with: \"SCORE: \" only.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, like usual, we can create our custom LLM-as-a-Judge config and target below!\n", + "\n", + "> NOTE: The Live feature currently requires you to create the config and target at call time - this is to ensure low latency responses. " + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Status: completed\n", + "Results: {'correct': Score(value=2.8, stats=ScoreStats(count=5, max=None, mean=2.8, min=None, stddev=None, stderr=None, sum=14.0, sum_squared=None, variance=None))}\n" + ] + } + ], + "source": [ + "response = client.evaluation.live(\n", + " config={\n", + " \"type\": \"custom\",\n", + " \"tasks\": {\n", + " \"correctness\": {\n", + " \"type\": \"data\",\n", + " \"metrics\": {\n", + " \"correctness-likert\": {\n", + " \"type\": \"llm-judge\",\n", + " \"params\": {\n", + " \"model\": model_target,\n", + " \"template\": {\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"detailed thinking off\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": correctness_prompt\n", + " }\n", + " ]\n", + " },\n", + " \"scores\": {\n", + " \"correct\": {\n", + " \"type\": \"int\",\n", + " \"parser\": {\n", + " \"type\": \"regex\",\n", + " \"pattern\": \"SCORE: (\\\\d)\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " },\n", + " target={\n", + " \"type\": \"rows\",\n", + " \"rows\": [\n", + " {\"ID\": \"C001\", \"content\": \"Date: 2025-04-01\\nChief Complaint (CC): Cough and fever\\nHistory of Present Illness (HPI): three‑day history of productive cough with yellow sputum and low‑grade fevers peaking at 38.5 °C. 
He denies pleuritic chest pain but reports mild shortness of breath while climbing stairs.\\nPast Medical History (PMH): History significant for essential hypertension well‑controlled on lisinopril. No prior pulmonary disease. Immunizations up to date.\\nReview of Systems (ROS): Denies unintentional weight loss, night sweats, or hemoptysis. Endorses mild malaise.\\nPhysical Examination (PE): Vital signs: T 38.1 °C, HR 96 bpm, BP 132/80 mm Hg, RR 18, SpO₂ 96 % RA. General: alert, mildly ill‑appearing. Lungs: bronchial breath sounds with crackles at right base. Cardiac: regular rhythm, no murmurs. Abdomen: soft, non‑tender. No peripheral edema.\\nFamily History (FHx): Father with coronary artery disease diagnosed at 55; mother with rheumatoid arthritis.\\nSocial History (SocHx): Non‑smoker, occasional alcohol, works as software engineer, walks 30 minutes daily.\\nAssessment & Plan: Discussed likely community‑acquired pneumonia. Initiated amoxicillin‑clavulanate 875 mg twice daily for 10 days, advised rest, hydration, and proper cough hygiene. Educated patient on red‑flag symptoms such as worsening dyspnea or persistent fever.\\nLabs / Imaging Ordered: CBC with differential, basic metabolic panel, chest radiograph PA/lat.\", \"summary\": \"Patient fine, probs viral.\\nCase Summary: ¯\\\\_(ツ)_/¯\\nDisease Specific Elements: none included because meh.\\nPaste Medical History: unimportant.\"},\n", + " {\"ID\": \"C002\", \"content\": \"Date: 2025-04-02\\nChief Complaint (CC): Sharp substernal chest pain\\nHistory of Present Illness (HPI): sudden‑onset, stabbing chest pain that began one hour prior to arrival while shoveling snow. The pain is non‑radiating, rated 7/10, partially relieved by resting and sitting upright.\\nPast Medical History (PMH): Known hyperlipidemia managed with a statin. 
No previous cardiac events.\\nReview of Systems (ROS): Reports diaphoresis with exertion but no palpitations or presyncope.\\nPhysical Examination (PE): Vitals: BP 148/88 mm Hg, HR 90 bpm, RR 20, afebrile. Cardiovascular: no JVD; S1, S2 normal; no murmurs. Chest wall non‑tender. Lungs clear. Extremities: no edema.\\nFamily History (FHx): Mother with breast cancer diagnosed at 60; no early cardiac deaths in family.\\nSocial History (SocHx): Smokes one pack per day for 15 years, no alcohol; works as mechanic.\\nAssessment & Plan: Obtained ECG and baseline troponin. Administered 325 mg aspirin and nitroglycerin spray with symptomatic improvement. Will repeat troponin in 3 hours and admit to observation for serial cardiac enzymes and potential cardiology consult.\\nLabs / Imaging Ordered: Serial troponins, stat ECG, chest X‑ray, lipid panel.\", \"summary\": \"New Clinical Problem: Sharp substernal chest pain\\nCase Summary: sudden‑onset, stabbing chest pain that began one hour prior to arrival while shoveling snow. The pain is non‑radiating, rated 7/10, partially relieved by resting and sitting upright.\\nDisease Specific Elements: ROS: Reports diaphoresis with exertion but no palpitations or presyncope.. PE: Vitals: BP 148/88 mm Hg, HR 90 bpm, RR 20, afebrile. Cardiovascular: no JVD; S1, S2 normal; no murmurs. Chest….\\nPaste Medical History: Known hyperlipidemia managed with a statin. No previous cardiac events.\"},\n", + " {\"ID\": \"C003\", \"content\": \"Date: 2025-04-03\\nChief Complaint (CC): Diffuse abdominal discomfort\\nHistory of Present Illness (HPI): gradually worsening, cramp‑like abdominal pain that started around the umbilicus six hours ago and has now localized to the right lower quadrant. 
Associated symptoms include nausea and anorexia, with no bowel movements today.\\nPast Medical History (PMH): No significant past medical or surgical history.\\nReview of Systems (ROS): Positive for anorexia and nausea; negative for hematochezia, melena, or dysuria.\\nPhysical Examination (PE): Vitals stable. Abdomen: soft with guarding in RLQ; rebound tenderness present; positive Rovsing sign. Bowel sounds hypoactive.\\nFamily History (FHx): No known hereditary diseases in immediate family.\\nSocial History (SocHx): No tobacco, drinks wine socially; accountant who exercises at gym twice weekly.\\nAssessment & Plan: NPO status, started IV fluids and performed pain management with morphine. Surgical consult requested for possible appendectomy. Ceftriaxone 1 g IV administered as prophylactic antibiotic.\\nLabs / Imaging Ordered: CBC, CMP, CRP, abdominal ultrasound if CT unavailable.\", \"summary\": \"New Clinical Problem: Diffuse abdominal discomfort\\nCase Summary: gradually worsening, cramp‑like abdominal pain that started around the umbilicus six hours ago and has now localized to the right lower quadrant. Associated symptoms include nausea and anorexia, with no bowel movements today.\\nDisease Specific Elements: ROS: Positive for anorexia and nausea; negative for hematochezia, melena, or dysuria.. PE: Vitals stable. Abdomen: soft with guarding in RLQ; rebound tenderness present; positive Rovsing sign. Bowel sounds hypoactive..\\nPaste Medical History: No significant past medical or surgical history.\"},\n", + " {\"ID\": \"C004\", \"content\": \"Date: 2025-04-04\\nChief Complaint (CC): Throbbing frontal headache\\nHistory of Present Illness (HPI): intermittent, pulsating headaches over the last week, predominantly in the frontal region, accompanied by photophobia and phonophobia. 
Over‑the‑counter ibuprofen provides partial relief.\\nPast Medical History (PMH): Migraine headaches since adolescence, currently on low‑dose propranolol.\\nReview of Systems (ROS): Positive photophobia and phonophobia; denies vision changes or weakness.\\nPhysical Examination (PE): Afebrile. Neurological exam intact. Fundoscopic exam shows no papilledema. Neck supple. No focal deficits.\\nFamily History (FHx): Mother suffers from migraines; father healthy.\\nSocial History (SocHx): Former smoker, quit five years ago; IT consultant; enjoys cycling.\\nAssessment & Plan: Increased propranolol long‑acting to 80 mg daily. Encouraged migraine diary and trigger avoidance. Provided information about triptan therapy if headaches persist.\\nLabs / Imaging Ordered: No labs today; MRI brain if headaches worsen.\", \"summary\": \"New Clinical Problem: Throbbing frontal headache\\nCase Summary: intermittent, pulsating headaches over the last week, predominantly in the frontal region, accompanied by photophobia and phonophobia. Over‑the‑counter ibuprofen provides partial relief.\\nDisease Specific Elements: ROS: Positive photophobia and phonophobia; denies vision changes or weakness.. PE: Afebrile. Neurological exam intact. Fundoscopic exam shows no papilledema. Neck supple. No focal deficits..\\nPaste Medical History: Migraine headaches since adolescence, currently on low‑dose propranolol.\"},\n", + " {\"ID\": \"C005\", \"content\": \"Date: 2025-04-05\\nChief Complaint (CC): Progressive exertional shortness of breath\\nHistory of Present Illness (HPI): shortness of breath on exertion for the past two weeks, now noticeable after climbing two flights of stairs. 
She endorses mild orthopnea requiring two pillows at night but denies chest tightness.\\nPast Medical History (PMH): Paroxysmal atrial fibrillation on apixaban; otherwise healthy.\\nReview of Systems (ROS): Denies paroxysmal nocturnal dyspnea; endorses mild ankle swelling.\\nPhysical Examination (PE): Blood pressure 140/90 mm Hg; pulse 88 bpm. Lungs: mild bibasilar crackles. Heart: regular rate. No peripheral edema.\\nFamily History (FHx): Father with congestive heart failure; mother with hypertension.\\nSocial History (SocHx): Never smoked; occasionally consumes beer; office administrator with sedentary lifestyle.\\nAssessment & Plan: Ordered transthoracic echocardiogram and BNP level to evaluate for heart failure exacerbation. Increased furosemide to 40 mg daily and reinforced sodium‑restricted diet.\\nLabs / Imaging Ordered: BNP, basic metabolic panel, echocardiogram.\", \"summary\": \"New Clinical Problem: Progressive exertional shortness of breath\\nCase Summary: shortness of breath on exertion for the past two weeks, now noticeable after climbing two flights of stairs. She endorses mild orthopnea requiring two pillows at night but denies chest tightness.\\nDisease Specific Elements: ROS: Denies paroxysmal nocturnal dyspnea; endorses mild ankle swelling.. PE: Blood pressure 140/90 mm Hg; pulse 88 bpm. Lungs: mild bibasilar crackles. Heart: regular rate. No peripheral edema..\\nPaste Medical History: Paroxysmal atrial fibrillation on apixaban; otherwise healthy.\"},\n", + " ]\n", + " }\n", + ")\n", + "\n", + "print(f\"Status: {response.status}\")\n", + "print(f\"Results: {response.result.tasks[\"correctness\"].metrics[\"correctness-likert\"].scores}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To learn more about the live evaluation feature - please check out [this](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-live.html) documentation!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nemo/Evaluator/Live Evaluation/pyproject.toml b/nemo/Evaluator/Live Evaluation/pyproject.toml new file mode 100644 index 000000000..91764392d --- /dev/null +++ b/nemo/Evaluator/Live Evaluation/pyproject.toml @@ -0,0 +1,15 @@ +[project] +name = "live-evaluation" +version = "0.1.0" +description = "Add your description here" +readme = "README.md" +requires-python = ">=3.13" +dependencies = [ + "datasets>=3.5.0", + "huggingface-hub>=0.30.2", + "jinja2>=3.1.0", + "jsonschema>=4.23.0", + "jupyterlab>=4.4.1", + "nemo-microservices>=1.0.1", + "openai>=1.76.0", +]
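The notebook's `llm-judge` metric relies on a regex parser (`SCORE: (\d)`) to turn each judge reply into an integer, and the result reports `count`, `sum`, and `mean`. The sketch below illustrates that idea locally in plain Python. It is not the Evaluator's implementation: `extract_score` and `summarize` are hypothetical helper names, the use of `re.search` is an assumption about matching behavior, and the sample replies are made up for illustration.

```python
import re
from statistics import mean
from typing import Dict, List, Optional

# Same pattern as the notebook's "parser" config; note it captures one digit only.
SCORE_PATTERN = re.compile(r"SCORE: (\d)")


def extract_score(judge_reply: str) -> Optional[int]:
    """Return the captured digit as an int, or None if the reply does not match.

    Assumption: the pattern may match anywhere in the reply (re.search).
    """
    match = SCORE_PATTERN.search(judge_reply)
    return int(match.group(1)) if match else None


def summarize(replies: List[str]) -> Dict[str, Optional[float]]:
    """Aggregate parsed scores into count, sum, and mean; unparseable replies are skipped."""
    scores = [s for s in (extract_score(r) for r in replies) if s is not None]
    return {
        "count": len(scores),
        "sum": float(sum(scores)),
        "mean": mean(scores) if scores else None,
    }


if __name__ == "__main__":
    # Hypothetical judge replies, not taken from a real run.
    sample_replies = ["SCORE: 0", "SCORE: 4", "The summary is fine.", "SCORE: 3"]
    print(summarize(sample_replies))
```

A reply that drifts from the requested format (the third sample) yields no score in this sketch, which is why the prompt asks the judge to answer with the `SCORE:` prefix only.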