# NVIDIA NIM ASR Model Deployment on Amazon SageMaker AI

## Introduction

**parakeet-ctc-1.1b-en-us** is a 1.1-billion-parameter automatic speech recognition (ASR) model developed by NVIDIA for high-accuracy US English transcription across diverse acoustic conditions. It targets offline speech-to-text applications, which makes it a good fit for developers and researchers who need reliable transcription.

## About the model:

**Architecture:** The model architecture is based on the FastConformer encoder combined with a CTC (Connectionist Temporal Classification) decoder. This proven architecture enables efficient and accurate speech transcription with excellent generalization across different speaking styles and acoustic environments.
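To make the CTC part of the architecture concrete, here is a minimal sketch of the greedy CTC collapsing rule: consecutive repeated labels are merged, then blank symbols are dropped. It only illustrates the decoding rule and is not the decoder shipped in the NIM; the `_` blank symbol and the character-level labels are assumptions chosen for readability.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, used only in this illustration


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blank symbols."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Ten per-frame labels for the word "hello". The blank between the two runs
# of "l" is what lets the double letter survive the merge step.
frames = ["h", "h", "e", "_", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_collapse(frames))  # hello
```

Without the blank between the two runs of `l`, the merge step would fold them into a single letter, which is why CTC needs a dedicated blank label.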

**Model:** Parakeet CTC 1.1B EN-US

**Performance:** The model achieves exceptional word error rate (WER) performance with fast inference speeds on NVIDIA GPUs, providing state-of-the-art accuracy for US English transcription tasks. The larger 1.1B parameter size enables superior handling of challenging audio conditions including accents, background noise, and domain-specific vocabulary.
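For readers who want to measure WER on their own recordings, the sketch below computes it in the conventional way: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a plain illustration, not NVIDIA's evaluation pipeline, and it applies no text normalization (case, punctuation), which real benchmarks usually do. The hypothesis string is made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "it is certainly very like the old portrait"
hypothesis = "it is certainly like the old portrait"  # made-up output, one word dropped
print(word_error_rate(reference, hypothesis))  # 0.125
```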

**Features:**
- High-accuracy transcription for US English speech
- Robust performance across diverse acoustic conditions and speaking styles
- Automatic punctuation and capitalization
- Excellent handling of spontaneous speech, disfluencies, and conversational audio
- Supports various audio formats and sampling rates
- Optimized for NVIDIA GPUs (Ampere, Hopper, Blackwell, Volta architectures)

**Deployment:** The model is deployment-ready for production inference and can be integrated into platforms such as Amazon SageMaker. It is compatible with NVIDIA NeMo toolkit and packaged as a NIM microservice, enabling seamless deployment, scaling, and integration for enterprise speech transcription applications.

## Use Cases:

- **Call Center Analytics** - High-accuracy transcription of customer service calls, sales conversations, and support interactions for quality assurance and analytics.
- **Meeting & Conference Transcription** - Reliable transcription of business meetings, webinars, and conference calls with support for multiple speakers and spontaneous speech.
- **Medical Documentation** - Accurate dictation and transcription for clinical notes, patient consultations, and healthcare documentation where precision is paramount.
- **Legal & Compliance** - Transcription of depositions, court proceedings, and legal interviews where accuracy and reliability are critical.
- **Content Creation & Media** - Transcription of podcasts, interviews, and video content for subtitles, captions, and content accessibility.
- **Enterprise Voice Applications** - Integration into business workflows for voice-enabled documentation, automated note-taking, and voice-command systems.

## Helpful Links

**Free API Access & Prototyping**  
[NVIDIA API Catalog (build.nvidia.com)](https://build.nvidia.com)
- Get free API credits for testing (10K requests)
- Try multiple NVIDIA speech models through browser UI
- Generate API keys for integration into your applications

**Technical Overview & Deployment Guide**  
[NVIDIA Speech AI Models Blog Post](https://developer.nvidia.com/blog)
- Comprehensive overview of all Parakeet models
- Performance benchmarks and use case examples
- Links to deployment options (Riva, NIM, NGC)

In this example we show how to deploy the **Parakeet CTC 1.1B EN-US** NIM from AWS Marketplace on Amazon SageMaker. The NIM simplifies deployment of the ASR model, which is optimized for high-accuracy US English transcription with punctuation and capitalization across diverse acoustic conditions, and which outperforms many available open-source ASR models on common industry benchmarks. The NIM is built on inference engines such as Triton Inference Server and NVIDIA Riva, and it provides low latency, high throughput, metrics export, a standard API, optimized profiles, and enterprise support.

## Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to one of the models listed above.
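
As a rough illustration of the three Marketplace permissions listed above, the sketch below builds an IAM policy document as a Python dictionary. The statement ID and the `"Resource": "*"` scope are assumptions made for the example; review the result with your own security team before attaching anything to a role.

```python
import json

# The three actions named in the prerequisites above.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def build_marketplace_policy(actions=MARKETPLACE_ACTIONS):
    """Return an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowMarketplaceSubscriptions",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",  # assumption: no narrower resource scope is applied
            }
        ],
    }


print(json.dumps(build_marketplace_policy(), indent=2))
```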


## Subscribe to the model package
To subscribe to the model package:
1. Open the model package listing page
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn** displayed. This is the model package ARN that you need to specify when creating a deployable model. Copy the ARN corresponding to your region and specify it in the following cell.
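
Because the ARN has to match the region you deploy in, a quick sanity check before calling SageMaker can save a failed deployment. The sketch below parses a model package ARN and compares its region with the session region. The helper names and the regular expression are illustrative assumptions, not an official AWS validation rule.

```python
import re

# Assumed shape: arn:<partition>:sagemaker:<region>:<12-digit account>:model-package/<name>
_ARN_RE = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):model-package/(?P<name>.+)$"
)


def parse_model_package_arn(arn):
    """Split a model package ARN into partition, region, account, and name."""
    match = _ARN_RE.match(arn.strip())
    if match is None:
        raise ValueError(f"Not a SageMaker model package ARN: {arn!r}")
    return match.groupdict()


def check_arn_region(arn, session_region):
    """Raise if the ARN belongs to a different region than the session."""
    parsed = parse_model_package_arn(arn)
    if parsed["region"] != session_region:
        raise ValueError(
            f"ARN region {parsed['region']} does not match session region {session_region}"
        )
    return parsed


example_arn = (
    "arn:aws:sagemaker:us-east-1:865070037744:model-package/"
    "nvidia-parakeet-1-1b-ctc-en-us-7c08e725839733449bed125e2a10a2c8"
)
print(check_arn_region(example_arn, "us-east-1")["region"])  # us-east-1
```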

In [16]:
import boto3, json, sagemaker, time, os
from sagemaker import get_execution_role, ModelPackage

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
region = sess.region_name

In [17]:
# replace the arn below with the model package arn you want to deploy
nim_package = "nvidia-parakeet-1-1b-ctc-en-us-7c08e725839733449bed125e2a10a2c8" # Need to change 

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{nim_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{nim_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{nim_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{nim_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{nim_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{nim_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{nim_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{nim_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{nim_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{nim_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{nim_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{nim_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{nim_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{nim_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{nim_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{nim_package}",
}

region = boto3.Session().region_name
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]
model_package_arn

'arn:aws:sagemaker:us-east-1:865070037744:model-package/nvidia-parakeet-1-1b-ctc-en-us-7c08e725839733449bed125e2a10a2c8'

## Create the SageMaker Endpoint

We first define the SageMaker model using the specified model package ARN.

In [19]:
# Define the model details
sm_model_name = 'parakeet-ctc-1-1b-en-us'
timestamp = int(time.time())

# Create the SageMaker model
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])


Model Arn: arn:aws:sagemaker:us-east-1:492681118881:model/parakeet-ctc-1-1b-en-us
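
The cell above computes `timestamp` but does not use it, so re-running the notebook with the same fixed model name is likely to fail once a model with that name already exists. One option, sketched below, is to append the timestamp to the base name. The `unique_resource_name` helper is hypothetical, and the 63-character limit and allowed-character pattern are assumptions to verify against the current SageMaker API reference.

```python
import re
import time

# Assumed naming rule: alphanumerics and hyphens, no leading or trailing hyphen.
_NAME_RE = re.compile(r"[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?")


def unique_resource_name(base, timestamp=None, max_len=63):
    """Append a timestamp suffix to base, trimming base to fit max_len."""
    suffix = str(int(time.time() if timestamp is None else timestamp))
    room = max_len - len(suffix) - 1  # space left for base, minus the hyphen
    if room < 1:
        raise ValueError("max_len is too small for the timestamp suffix")
    trimmed = base[:room].rstrip("-")
    name = f"{trimmed}-{suffix}"
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"Invalid resource name: {name!r}")
    return name


print(unique_resource_name("parakeet-ctc-1-1b-en-us", timestamp=1761177414))
# parakeet-ctc-1-1b-en-us-1761177414
```

The same name could then be reused for the model, the endpoint configuration, and the endpoint, as the notebook already does with `sm_model_name`.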


Next we create the endpoint configuration and specify the instance type, in this case `ml.g5.4xlarge` (`ml.g5.12xlarge` is an alternative).

In [20]:
# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'primary',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.4xlarge',  # or ml.g5.12xlarge
            'ModelDataDownloadTimeoutInSeconds': 3600,  # Model download timeout in seconds
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,  # Container startup health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint-config/parakeet-ctc-1-1b-en-us


Using the endpoint configuration above, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [21]:
# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint/parakeet-ctc-1-1b-en-us


In [22]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:492681118881:endpoint/parakeet-ctc-1-1b-en-us
Status: InService
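
The polling loop above exits on any status other than `Creating`, so a failed deployment ends the loop without an error. The sketch below is one way to make that explicit: it raises on a non-`Creating`, non-`InService` status and gives up after a timeout. The `wait_for_endpoint` helper and its parameters are hypothetical; the describe call is passed in so the sketch runs without AWS access.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep, clock=time.monotonic):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = clock() + timeout_seconds
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no failure reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for sm.describe_endpoint so the sketch runs without AWS access.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe_endpoint(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_statuses)}


result = wait_for_endpoint(fake_describe_endpoint, "parakeet-ctc-1-1b-en-us",
                           sleep=lambda seconds: None)
print(result["EndpointStatus"])  # InService
```

In the notebook this would be called as `wait_for_endpoint(sm.describe_endpoint, endpoint_name)`. boto3 also provides a built-in waiter, `sm.get_waiter("endpoint_in_service")`, which may be sufficient if custom logging is not needed.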


## Inference Testing and Validation

### Test Audio File Preparation

Let's point to a sample audio file for testing our NIM ASR endpoint. The cell below only sets the path, so it assumes that `data/test.wav` already exists:

In [23]:
# Path to an existing test audio file (this cell does not create the file)
test_audio_path = "data/test.wav"
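
If no recording is at hand, a synthetic tone can at least exercise the request path end to end, although it contains no speech and will not produce a meaningful transcript. The sketch below writes a short sine tone with the standard-library `wave` module. The 16 kHz, mono, 16-bit PCM format and the `write_sine_wav` helper are assumptions chosen for the example, not requirements stated by the NIM.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, seconds=1.0, freq_hz=440.0, sample_rate=16000, amplitude=0.3):
    """Write a mono 16-bit PCM sine tone and return the number of frames."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for n in range(n_frames):
        sample = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames


# Write to a temporary directory so the sketch does not touch data/test.wav.
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
demo_frames = write_sine_wav(demo_path, seconds=0.5)
print(demo_frames)  # 8000
```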

### Run Inference

### Test Endpoint with Different Protocols

Our NIM ASR endpoint supports multiple inference methods:

1. **Auto-routing** (`/invocations`): Automatically chooses HTTP or gRPC based on file size
2. **Force HTTP** (`X-Amzn-SageMaker-Custom-Attributes: /invocations/http`): Direct HTTP route
3. **Force gRPC** (`X-Amzn-SageMaker-Custom-Attributes: /invocations/grpc`): Direct gRPC route with diarization support
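
The three options above differ only in the `CustomAttributes` value passed to `invoke_endpoint`. A small lookup, sketched below, keeps that choice in one place. The `route_kwargs` helper and the short route names (`auto`, `http`, `grpc`) are hypothetical conveniences; the attribute strings themselves come from the list above.

```python
# Route name -> CustomAttributes value; None means the header is omitted.
ROUTE_ATTRIBUTES = {
    "auto": None,
    "http": "/invocations/http",
    "grpc": "/invocations/grpc",
}


def route_kwargs(route="auto"):
    """Return the extra invoke_endpoint keyword arguments for a route."""
    try:
        attribute = ROUTE_ATTRIBUTES[route]
    except KeyError:
        raise ValueError(f"Unknown route {route!r}; expected one of {sorted(ROUTE_ATTRIBUTES)}")
    return {} if attribute is None else {"CustomAttributes": attribute}


print(route_kwargs("grpc"))  # {'CustomAttributes': '/invocations/grpc'}
```

The returned dictionary can be unpacked into the call, for example `sm_runtime.invoke_endpoint(EndpointName=..., ContentType=..., Body=..., **route_kwargs("http"))`.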

#### Test 1: Auto-routing (Recommended)

In [24]:
# Test 1: Auto-routing inference
import uuid
sm_runtime = boto3.client('sagemaker-runtime')
def test_endpoint_auto_routing(audio_file_path):
    """Test endpoint with auto-routing"""
    print(f"Testing auto-routing with {audio_file_path}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        file_size = len(audio_data)
        print(f"Audio file size: {file_size:,} bytes ({file_size / (1024*1024):.2f} MB)")
        
        # Create multipart form data
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload
        parts = []
        parts.append(f'--{boundary}')
        parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        parts.append('Content-Type: audio/wav')
        parts.append('')
        
        # Join text parts
        text_part = '\r\n'.join(parts) + '\r\n'
        language_part = f'\r\n--{boundary}\r\nContent-Disposition: form-data; name="language_code"\r\n\r\nen-US\r\n--{boundary}--'
        
        # Combine all parts
        payload = text_part.encode() + audio_data + language_part.encode()

        # print("Input:\n")
        # print(payload)
        
        # Invoke endpoint
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload
        )

        print("Result:")
        print(response)
        
        # Parse response
        result = json.loads(response['Body'].read().decode())
        print(f"\nAuto-routing inference successful!")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"Auto-routing test failed: {e}")
        return None

# Run auto-routing test
auto_result = test_endpoint_auto_routing(test_audio_path)

Testing auto-routing with data/test.wav
Audio file size: 237,964 bytes (0.23 MB)
Result:
{'ResponseMetadata': {'RequestId': '04603604-f707-441d-9402-f80f424d2988', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '04603604-f707-441d-9402-f80f424d2988', 'x-amzn-invoked-production-variant': 'primary', 'date': 'Wed, 22 Oct 2025 23:56:54 GMT', 'content-type': 'application/json; charset=utf-8', 'content-length': '135', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json; charset=utf-8', 'InvokedProductionVariant': 'primary', 'Body': <botocore.response.StreamingBody object at 0x7f2933fce1d0>}

Auto-routing inference successful!
Response: {
  "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. "
}
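
The test functions in this notebook each assemble the multipart body by hand. The sketch below factors that into one helper that produces the same byte layout as the auto-routing cell for a file part followed by plain text fields. The `build_multipart` name, its parameters, and the boundary prefix are hypothetical; field values are not escaped, so it is only suitable for simple values such as `en-US`.

```python
import uuid

CRLF = b"\r\n"


def build_multipart(audio_bytes, fields, filename="test.wav",
                    file_content_type="audio/wav", boundary=None):
    """Return (body, content_type) for one file part plus text form fields."""
    boundary = boundary or f"----FormBoundary{uuid.uuid4().hex}"
    delimiter = f"--{boundary}".encode()

    body = bytearray()
    # File part: headers, blank line, raw audio bytes.
    body += delimiter + CRLF
    body += f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode() + CRLF
    body += f"Content-Type: {file_content_type}".encode() + CRLF + CRLF
    body += audio_bytes + CRLF
    # One part per text field, in insertion order.
    for name, value in fields.items():
        body += delimiter + CRLF
        body += f'Content-Disposition: form-data; name="{name}"'.encode() + CRLF + CRLF
        body += str(value).encode() + CRLF
    # Closing delimiter.
    body += delimiter + b"--"
    return bytes(body), f"multipart/form-data; boundary={boundary}"


body, content_type = build_multipart(b"AUDIO", {"language_code": "en-US"},
                                     boundary="demo-boundary")
print(content_type)  # multipart/form-data; boundary=demo-boundary
```

With this helper, extra parameters such as `speaker_diarization` or `max_speakers` become additional entries in `fields` rather than more hand-written boundary lines.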


#### Test 2: Force HTTP Route
Test the HTTP-only route (optimized for files smaller than 5 MB):

In [25]:
# Test 2: Force HTTP route
def test_endpoint_http_route(audio_file_path):
    """Test endpoint with forced HTTP route"""
    print(f"Testing HTTP route with {audio_file_path}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        # Create multipart form data
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload (same as auto-routing)
        parts = []
        parts.append(f'--{boundary}')
        parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        parts.append('Content-Type: audio/wav')
        parts.append('')
        
        text_part = '\r\n'.join(parts) + '\r\n'
        language_part = f'\r\n--{boundary}\r\nContent-Disposition: form-data; name="language_code"\r\n\r\nen-US\r\n--{boundary}--'
        payload = text_part.encode() + audio_data + language_part.encode()
        
        # Invoke endpoint with HTTP route forced
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload,
            CustomAttributes='/invocations/http'  # Force HTTP route
        )
        
        result = json.loads(response['Body'].read().decode())
        print(f"\nHTTP route inference successful!")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"HTTP route test failed: {e}")
        return None

# Run HTTP route test
http_result = test_endpoint_http_route(test_audio_path)

Testing HTTP route with data/test.wav

HTTP route inference successful!
Response: {
  "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. "
}


#### Test 3: Force gRPC Route with Speaker Diarization

Test the gRPC route with speaker diarization capabilities:

In [26]:
# Test 3: Force gRPC route with speaker diarization
def test_endpoint_grpc_route(audio_file_path, enable_diarization=True, max_speakers=4):
    """Test endpoint with forced gRPC route and speaker diarization"""
    print(f"Testing gRPC route with speaker diarization: {enable_diarization}")
    
    try:
        # Read audio file
        with open(audio_file_path, 'rb') as f:
            audio_data = f.read()
        
        # Create multipart form data with diarization parameters
        boundary = f"----WebKitFormBoundary{uuid.uuid4().hex}"
        
        # Build multipart payload with additional parameters
        payload_parts = []
        
        # Audio file
        payload_parts.append(f'--{boundary}')
        payload_parts.append('Content-Disposition: form-data; name="file"; filename="test.wav"')
        payload_parts.append('Content-Type: audio/wav')
        payload_parts.append('')
        
        text_part = '\r\n'.join(payload_parts) + '\r\n'
        
        # Additional parameters
        additional_params = []
        
        # Language code
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="language_code"')
        additional_params.append('')
        additional_params.append('en-US')
        
        # Speaker diarization
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="speaker_diarization"')
        additional_params.append('')
        additional_params.append('true' if enable_diarization else 'false')
        
        # Max speakers
        additional_params.append(f'--{boundary}')
        additional_params.append('Content-Disposition: form-data; name="max_speakers"')
        additional_params.append('')
        additional_params.append(str(max_speakers))
        
        # Close boundary
        additional_params.append(f'--{boundary}--')
        
        params_part = '\r\n'.join(additional_params)
        
        # Combine all parts
        payload = text_part.encode() + audio_data + ('\r\n' + params_part).encode()
        
        # Invoke endpoint with gRPC route forced
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=f'multipart/form-data; boundary={boundary}',
            Body=payload,
            CustomAttributes='/invocations/grpc'  # Force gRPC route
        )
        
        result = json.loads(response['Body'].read().decode())
        print(f"\ngRPC route inference successful!")
        print(f"Speaker diarization enabled: {enable_diarization}")
        print(f"Max speakers: {max_speakers}")
        print(f"Response: {json.dumps(result, indent=2)}")
        return result
        
    except Exception as e:
        print(f"gRPC route test failed: {e}")
        return None

# Run gRPC route test
grpc_result = test_endpoint_grpc_route("data/medical-diarization.wav")

Testing gRPC route with speaker diarization: True

gRPC route inference successful!
Speaker diarization enabled: True
Max speakers: 4
Response: {
  "predictions": [
    {
      "results": [
        {
          "alternatives": [
            {
              "transcript": "Hey Jane, so what brings you into my office today? ",
              "confidence": 0.1833893358707428,
              "words": [
                {
                  "word": "Hey",
                  "start_time": 1.04,
                  "end_time": 1.28,
                  "confidence": 0.14124539494514465,
                  "speaker_tag": 0
                },
                {
                  "word": "Jane,",
                  "start_time": 1.36,
                  "end_time": 1.6,
                  "confidence": 0.0971662625670433,
                  "speaker_tag": 0
                },
                {
                  "word": "so",
                  "start_time": 1.76,
                  "end_time": 1.84,
              

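The word-level output above is verbose, so it can help to fold consecutive words with the same `speaker_tag` into speaker turns. The sketch below does that for the response structure shown in the gRPC test output (`predictions` > `results` > `alternatives` > `words`). The `speaker_turns` helper is hypothetical, and the second speaker's words in the sample are invented for illustration; only the first two words come from the output above.

```python
def speaker_turns(response):
    """Group consecutive words with the same speaker_tag into turns."""
    turns = []
    for prediction in response.get("predictions", []):
        for result in prediction.get("results", []):
            # Use only the top alternative of each result.
            for alternative in result.get("alternatives", [])[:1]:
                for word in alternative.get("words", []):
                    tag = word.get("speaker_tag")
                    if turns and turns[-1]["speaker_tag"] == tag:
                        turns[-1]["words"].append(word["word"])
                        turns[-1]["end_time"] = word["end_time"]
                    else:
                        turns.append({
                            "speaker_tag": tag,
                            "start_time": word["start_time"],
                            "end_time": word["end_time"],
                            "words": [word["word"]],
                        })
    return [
        {
            "speaker_tag": turn["speaker_tag"],
            "start_time": turn["start_time"],
            "end_time": turn["end_time"],
            "text": " ".join(turn["words"]),
        }
        for turn in turns
    ]


sample_response = {
    "predictions": [{"results": [{"alternatives": [{"words": [
        {"word": "Hey", "start_time": 1.04, "end_time": 1.28, "speaker_tag": 0},
        {"word": "Jane,", "start_time": 1.36, "end_time": 1.6, "speaker_tag": 0},
        # The two entries below are invented to show a speaker change.
        {"word": "Hi", "start_time": 2.1, "end_time": 2.3, "speaker_tag": 1},
        {"word": "doctor.", "start_time": 2.4, "end_time": 2.7, "speaker_tag": 1},
    ]}]}]}]
}

for turn in speaker_turns(sample_response):
    print(f"Speaker {turn['speaker_tag']} [{turn['start_time']}-{turn['end_time']}]: {turn['text']}")
```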
### Terminate endpoint and clean up artifacts

In [27]:
sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': '4edda3cb-3515-401f-a268-c1ab99fef3ef',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4edda3cb-3515-401f-a268-c1ab99fef3ef',
   'strict-transport-security': 'max-age=47304000; includeSubDomains',
   'x-frame-options': 'DENY',
   'content-security-policy': "frame-ancestors 'none'",
   'cache-control': 'no-cache, no-store, must-revalidate',
   'x-content-type-options': 'nosniff',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 22 Oct 2025 23:57:35 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}
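
The cell above deletes the model first and the endpoint last. AWS examples more commonly delete the endpoint first, then the endpoint configuration, then the model, so that the billed instance is released as early as possible. The sketch below follows that order and keeps going if one step fails, so a partial cleanup can be re-run. The `cleanup_endpoint_resources` helper is hypothetical, and the catch-all exception handling is a simplification for the example.

```python
def cleanup_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model; return (deleted, errors)."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint config", sm_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    deleted, errors = [], {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
            deleted.append(label)
        except Exception as exc:  # simplification: record the error and continue
            errors[label] = str(exc)
    return deleted, errors


class RecordingClient:
    """Stand-in for the SageMaker client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, **kwargs):
        self.calls.append(("delete_endpoint", kwargs))

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(("delete_endpoint_config", kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(("delete_model", kwargs))


fake_client = RecordingClient()
deleted, errors = cleanup_endpoint_resources(
    fake_client, "parakeet-ctc-1-1b-en-us", "parakeet-ctc-1-1b-en-us", "parakeet-ctc-1-1b-en-us"
)
print(deleted)  # ['endpoint', 'endpoint config', 'model']
```

In the notebook this would be called as `cleanup_endpoint_resources(sm, endpoint_name, endpoint_config_name, sm_model_name)`.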

**🎉 Congratulations!** You have successfully deployed a production-ready NVIDIA NIM ASR solution on Amazon SageMaker.