## What is Riva

NVIDIA **Riva** is a GPU-accelerated SDK for developing speech AI applications. Riva is designed to help you access conversational AI functionalities easily and quickly. With a few commands, you can access the high-performance services through API operations and try demos. Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR, NLP, and TTS. All these AI services are trained with thousands of hours of public and internal datasets to reach high accuracy. You can start using the pretrained models or fine-tune them with your own dataset to further improve model performance. 

The Riva text-to-speech (speech synthesis) skill generates human-like speech. It uses non-autoregressive models to deliver 12x higher performance on NVIDIA A100 GPUs compared with Tacotron 2 and WaveGlow models on NVIDIA V100 GPUs. The service also lets you create a natural custom voice for every brand and virtual assistant from 30 minutes of an actor’s voice data, within a day.

![riva capabilities](https://developer-blogs.nvidia.com/wp-content/uploads/2021/11/riva-services-capabilities-2.png)
![riva pipeline](https://developer-blogs.nvidia.com/wp-content/uploads/2021/11/riva-skills.png)

Riva services are exposed through API operations accessible by `gRPC` endpoints that hide all the complexity. The gRPC API operations are exposed by the API server running in a Docker container. The API server is responsible for processing all incoming and outgoing speech and NLP data.

## Setting up the service

In [3]:
# install related libraries
!apt-get update && apt-get install -y libsndfile-dev
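!pip install nvidia-pyindex          # NVIDIA package index, installed first so later installs can use it
!pip install "tritonclient[grpc]"    # the [grpc] extra pulls in the dependencies of tritonclient.grpc
!pip install librosa
!pip install nvidia-riva-client      # provides the riva.client module imported below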

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease                         
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease                 
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:5 http://security.ubuntu.com/ubuntu focal-security InRelease
Reading package lists... Done                        
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-ubuntu2004-x86_64.list:1
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list:50 and /etc/apt/sources.list.d/cuda-ubuntu2004-x86_64.list:1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'libsndfile1-dev' instead of 'libsndfile-dev'
libsndfile1-dev is already the newest version (1.0.28-7ubuntu0.1).
The following package was automatically installed

In [4]:
# required imports
import io
import IPython.display as ipd
import grpc
import librosa 
import numpy as np
import riva.client

# NLP response message types (optional, used only for type annotations below)
from riva.client.proto.riva_nlp_pb2 import (
    AnalyzeIntentResponse,
    NaturalQueryResponse,
    TextClassResponse,
    TextTransformResponse,
    TokenClassResponse,
)


In [16]:
# Create Riva clients and connect to Riva Speech API server
auth = riva.client.Auth(uri="localhost:50051")

# Service clients for ASR, NLP, and TTS
riva_asr = riva.client.ASRService(auth)
riva_nlp = riva.client.NLPService(auth)
riva_tts = riva.client.SpeechSynthesisService(auth)

## Check Server status via Triton API

For conversational AI applications, it is crucial to keep latency below a given threshold, which means executing inference requests as soon as they arrive. Saturating the GPUs to increase throughput pulls in the opposite direction: it requires larger batches, so inference execution must be delayed until enough requests have arrived to form a bigger batch.
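
The latency/batch-size trade-off can be illustrated with a toy simulation. This is only a sketch and does not reflect how Riva or Triton schedule requests internally: `simulate_batching`, `max_batch_size`, and `max_delay_ms` are hypothetical names, and the arrival times are made up.

```python
def simulate_batching(arrivals_ms, max_batch_size, max_delay_ms):
    """Group request arrival times into batches.

    A batch is dispatched when it is full or when its oldest request
    has waited max_delay_ms. Returns (batches, queue delay per request).
    """
    batches, delays = [], []
    pending = []
    for t in sorted(arrivals_ms):
        # Dispatch the pending batch if its oldest request timed out before t.
        if pending and t - pending[0] > max_delay_ms:
            dispatch = pending[0] + max_delay_ms
            delays.extend(dispatch - a for a in pending)
            batches.append(pending)
            pending = []
        pending.append(t)
        # Dispatch immediately once the batch is full.
        if len(pending) == max_batch_size:
            delays.extend(t - a for a in pending)
            batches.append(pending)
            pending = []
    if pending:  # flush the remainder when its timer expires
        dispatch = pending[0] + max_delay_ms
        delays.extend(dispatch - a for a in pending)
        batches.append(pending)
    return batches, delays


arrivals = [0, 1, 2, 3, 10, 11, 30]  # hypothetical arrival times in ms

for delay in (0, 5):
    batches, delays = simulate_batching(arrivals, max_batch_size=4, max_delay_ms=delay)
    print(f"max_delay={delay} ms: {len(batches)} batches, "
          f"mean queue delay {sum(delays) / len(delays):.2f} ms")
```

With these made-up arrivals, allowing a 5 ms wait reduces the number of batches from 7 to 3, at the cost of a mean queue delay of roughly 2.9 ms per request.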

Riva uses NVIDIA **Triton Inference Server** to serve multiple models for efficient and robust resource allocation, as well as to achieve high performance in terms of high throughput, low latency, and high accuracy. The API server sends inference requests to NVIDIA Triton and receives the results.

**Triton Inference Server** provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.

![Triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/images/arch.jpg?raw=true)

For more details: https://github.com/triton-inference-server/server

In [14]:
from tritonclient.grpc import service_pb2
from tritonclient.grpc import service_pb2_grpc

trt_channel = grpc.insecure_channel("localhost:8001")
grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(trt_channel)

try:
    request = service_pb2.ServerLiveRequest()
    response = grpc_stub.ServerLive(request)
    print("server {}".format(response))
except Exception as ex:
    print(ex)

request = service_pb2.ServerReadyRequest()
response = grpc_stub.ServerReady(request)
print("server {}".format(response))

server live: true

server ready: true
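
When the container is still starting, a single readiness check may fail. One possible approach is to poll until the server reports ready. The helper below is a sketch: `wait_until_ready` and its parameters are hypothetical names, and it takes any zero-argument callable so it can run here without a server.

```python
import time


def wait_until_ready(check, timeout_s=30.0, interval_s=1.0):
    """Poll `check()` until it returns True or `timeout_s` elapses.

    Exceptions raised by `check` are treated as "not ready yet" so that
    a server which is still starting does not abort the loop.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # e.g. connection refused while the container boots
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# With the stub from the cell above, the check could be written as:
#   wait_until_ready(lambda: grpc_stub.ServerReady(service_pb2.ServerReadyRequest()).ready)

# Stand-in check that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0.01))
```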



In [18]:
request = service_pb2.RepositoryIndexRequest()
response = grpc_stub.RepositoryIndex(request)

print("num models: {}\n".format(len(response.models)))
print(response.models)

num models: 53

[name: "conformer-en-US-asr-offline"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-offline-ctc-decoder-cpu-streaming-offline"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-offline-endpointing-streaming-offline"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-offline-feature-extractor-streaming-offline"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-streaming"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-streaming-ctc-decoder-cpu-streaming"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-streaming-endpointing-streaming"
version: "1"
state: "READY"
, name: "conformer-en-US-asr-streaming-feature-extractor-streaming"
version: "1"
state: "READY"
, name: "fastpitch_hifigan_ensemble-English-US"
version: "1"
state: "READY"
, name: "fastpitch_hifigan_ensemble-woojin"
version: "1"
state: "READY"
, name: "intent_slot_detokenizer"
version: "1"
state: "READY"
, name: "intent_slot_label_tokens_weather"
version: "1"
state: 

In [11]:
[i for i in response.models if "punctuation" in i.name]

[name: "riva-punctuation-en-US"
 version: "1"
 state: "READY",
 name: "riva-trt-riva-punctuation-en-US-nn-bert-base-uncased"
 version: "1"
 state: "READY"]
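
With 53 models in the repository, a short summary is often easier to read than the raw listing. The sketch below counts model states and lists matching names. `summarize_models` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the entries of `response.models`, which expose the same `name` and `state` fields.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_models(models, keyword=None):
    """Return (state counts, sorted names) for models whose name contains `keyword`."""
    selected = [m for m in models if keyword is None or keyword in m.name]
    states = Counter(m.state for m in selected)
    return states, sorted(m.name for m in selected)


# Stand-in entries mimicking the fields printed above.
models = [
    SimpleNamespace(name="conformer-en-US-asr-offline", state="READY"),
    SimpleNamespace(name="riva-punctuation-en-US", state="READY"),
    SimpleNamespace(name="riva-trt-riva-punctuation-en-US-nn-bert-base-uncased", state="READY"),
]

states, names = summarize_models(models, keyword="punctuation")
print(dict(states))
print(names)
```

In the notebook, `summarize_models(response.models)` would give the state counts for the whole repository, which makes models that are not `READY` easy to spot.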

---

## 1. NLP Service Examples
For more details, refer to https://github.com/nvidia-riva/python-clients/blob/main/tutorials/NLP.ipynb
- transform_text - map an input string to an output string
- classify_tokens - return a label per input token
- classify_text - return a single label for the input string
- analyze_intent - return the most likely intent as well as slots relevant to that intent 
- natural_query - return answers to the question

### Punctuation and Capitalization

In [43]:
# Use the TextTransform API to run the punctuation model
texts = ["add punctuation to this sentence"]
texts.append("do you have any red nvidia shirts")
texts.append("i need one cpu four gpus and lots of memory "
                "for my new computer it's going to be very cool")
model_name = "riva-punctuation-en-US"

response: TextTransformResponse = riva_nlp.transform_text(input_strings=texts, model_name=model_name)

print("Transformed results are:")
print("\n".join(response.text))

Transformed results are:
Add punctuation to this sentence.
Do you have any red Nvidia shirts?
I need one Cpu, four Gpus and lots of memory for my new computer. It's going to be very cool.


### Token Classification

In [39]:
# Use the TokenClassification API to run a Named Entity Recognition (NER) model
texts = ["Jensen Huang is the CEO of NVIDIA Corporation, " "located in Santa Clara, California"]
model_name = "riva_ner"

response: TokenClassResponse = riva_nlp.classify_tokens(texts, model_name)

print("Named Entities:")
for result in response.results[0].results:
    print(f"  {result.token} ({result.label[0].class_name})")

Named Entities:
  jensen huang (PER)
  nvidia corporation (ORG)
  santa clara (LOC)
  california (LOC)
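
For downstream use it can be handy to group the recognized tokens by entity type. The following is a sketch under the assumption that each entry has the `token` and `label[0].class_name` fields used in the cell above. `group_entities` and `_fake` are hypothetical helpers, and the sample data repeats the output shown above.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(token_results):
    """Map each top label to the list of tokens tagged with it."""
    grouped = defaultdict(list)
    for item in token_results:
        if item.label:  # skip tokens that came back without any label
            grouped[item.label[0].class_name].append(item.token)
    return dict(grouped)


def _fake(token, label):
    # Stand-in for one entry of response.results[0].results
    return SimpleNamespace(token=token, label=[SimpleNamespace(class_name=label)])


sample = [
    _fake("jensen huang", "PER"),
    _fake("nvidia corporation", "ORG"),
    _fake("santa clara", "LOC"),
    _fake("california", "LOC"),
]
print(group_entities(sample))
```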


### Text Classification

In [48]:
# Submit a TextClassRequest for text classification.
# Riva NLP comes with a default text_classification domain called "domain_misty" which consists of 
# 4 classes: meteorology, personality, weather and nomatch

texts = ["Is it going to snow in Burlington, Vermont tomorrow night?", "What causes rain?", "What is your favorite season?"]
model_name = "riva_text_classification_domain"
# If you have deployed a custom model with the `--domain_name` parameter in ServiceMaker's `riva-build` command,
# then you should use "riva_text_classification_<your_input_domain_name>" where <your_input_domain_name> is the name you provided to the domain_name parameter. In this case the domain_name is "domain"


response: TextClassResponse = riva_nlp.classify_text(texts, model_name)
print(response)

results {
  labels {
    class_name: "weather"
    score: 0.997551
  }
}
results {
  labels {
    class_name: "meteorology"
    score: 0.984299
  }
}
results {
  labels {
    class_name: "personality"
    score: 0.98435
  }
}



### Intent Analysis

In [65]:
# The AnalyzeIntent API can be used to query an Intent Slot classifier. The API can leverage a
# text classification model to classify the domain of the input query and then route to the
# appropriate intent slot model.

# Let's first see an example where the domain is known. This skips execution of the domain classifier
# and proceeds directly to the intent/slot model for the requested domain.
query = "How is the humidity in San Francisco?"
options = riva.client.AnalyzeIntentOptions(lang='en-US', domain='weather')
# The <domain_name> is appended to "riva_intent_" to look for a model "riva_intent_<domain_name>". In this example, the model "riva_intent_weather"
# needs to be preloaded in the Riva server. To deploy a custom Joint Intent and Slot model, use the `--domain_name` parameter in ServiceMaker's `riva-build intent_slot` command.

response: AnalyzeIntentResponse = riva_nlp.analyze_intent(query, options)

print("intent name:", response.intent.class_name)
print("intent score:", response.intent.score)
print("domain name:", response.domain.class_name)
print("domain score:", response.domain.score)
print("first slot token:", response.slots[0].token)
print("first slot most probable label name:", response.slots[0].label[0].class_name)
print("first slot most probable label score:", response.slots[0].label[0].score)


intent name: weather.humidity
intent score: 1.0
domain name: weather
domain score: 1.0
first slot token: san francisco ?
first slot most probable label name: weatherplace
first slot most probable label score: 0.9999120235443115


In [75]:
# Below is an example where the input domain is not provided. The call fails with the error shown below unless the relevant intent slot model is preloaded.
query = "Is this pizza too salty, isn't it?"
response: AnalyzeIntentResponse = riva_nlp.analyze_intent(query)

# The input query is first routed to a text classification model called "riva_text_classification_<certain_domain>".
# Then, the output class label of "riva_text_classification_domain" is appended to "riva_intent_" to get the appropriate Intent Slot model to execute for the input query.
# Note: The model "riva_text_classification_<certain_domain>" needs to be loaded into the Riva server and have the appropriate class labels that would invoke the corresponding intent slot model.

print("intent name:", response.intent.class_name)
print("intent score:", response.intent.score)
print("domain name:", response.domain.class_name)
print("domain score:", response.domain.score)
print("first slot token:", response.slots[0].token)
print("first slot most probable label name:", response.slots[0].label[0].class_name)
print("first slot most probable label score:", response.slots[0].label[0].score)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Error: Model riva_intent_nomatch is not a Riva API model, execution cannot be done"
	debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50051 {grpc_message:"Error: Model riva_intent_nomatch is not a Riva API model, execution cannot be done", grpc_status:14, created_time:"2023-01-02T06:47:55.088244296+00:00"}"
>
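
An unhandled RPC failure like the one above stops the notebook cell. Errors raised by a failed call, such as this `_InactiveRpcError`, expose `code()` and `details()`, so the failure can be reported compactly instead. The sketch below uses a hypothetical wrapper `call_or_none` and a stand-in error class so that it runs without a server. In the notebook you would more likely catch `grpc.RpcError` specifically rather than `Exception`.

```python
def call_or_none(fn, *args, **kwargs):
    """Run an RPC-style call; on failure print a short reason and return None."""
    try:
        return fn(*args, **kwargs)
    except Exception as err:
        # getattr keeps this sketch free of a hard dependency on grpc.
        code = getattr(err, "code", None)
        details = getattr(err, "details", None)
        if callable(code) and callable(details):
            print(f"RPC failed: {code()}: {details()}")
        else:
            print(f"call failed: {err!r}")
        return None


class FakeRpcError(Exception):
    """Stand-in that mimics the code()/details() accessors of a gRPC error."""

    def code(self):
        return "StatusCode.UNAVAILABLE"

    def details(self):
        return "Error: Model riva_intent_nomatch is not a Riva API model, execution cannot be done"


def failing_call(query):
    raise FakeRpcError()


response = call_or_none(failing_call, "Is this pizza too salty, isn't it?")
if response is None:
    print("No intent available; check that the required intent slot model is loaded.")
```

In the notebook, the equivalent call would be `call_or_none(riva_nlp.analyze_intent, query)`.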

### Question Answering

In [81]:
qa_query = "How many gigatons of carbon dioxide was released in 2005?"
qa_context = (
    "In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the "
    "2005 drought. The affected region was approximate 1,160,000 square miles (3,000,000 km2) of "
    "rainforest, compared to 734,000 square miles (1,900,000 km2) in 2005. The 2010 drought had three "
    "epicenters where vegetation died off, whereas in 2005 the drought was focused on the southwestern "
    "part. The findings were published in the journal Science. In a typical year the Amazon absorbs 1.5 "
    "gigatons of carbon dioxide; during 2005 instead 5 gigatons were released and in 2010 8 gigatons were "
    "released."
)
response: NaturalQueryResponse = riva_nlp.natural_query(qa_query, qa_context)

answer = response.results[0].answer
score = response.results[0].score
print("The answer is: ")
print(answer)

The answer is: 
5


### Asynchronous calls

Any of the above methods can be used asynchronously by setting the parameter `future=True`. The method then returns a future object instead of a response, and the response is retrieved by calling `result()` on that future.

In [102]:
from time import time
# Demonstrate latency by calling repeatedly.
# NOTE: this is a synchronous API call, so request #N will not be sent until
# response #N-1 is returned. This means latency and throughput will be negatively
# impacted by long-distance & VPN connections

query = "i need one cpu four gpus and lots of memory for my new computer it's going to be very cool"

iterations = 100
# Demonstrate synchronous performance
start_time = time()
for _ in range(iterations):
    nlp_resp = riva_nlp.punctuate_text(query)
end_time = time()
print(f"Time to complete {iterations} synchronous requests: {end_time-start_time} sec.")

# Demonstrate async performance
start_time = time()
futures = []
for _ in range(iterations):
    futures.append(riva_nlp.punctuate_text(query, future=True))
for f in futures:
    f.result()
end_time = time()
print(f"Time to complete {iterations} asynchronous requests: {end_time-start_time} sec.\n")


Time to complete 100 synchronous requests: 0.44115304946899414 sec.
Time to complete 100 asynchronous requests: 0.3341188430786133 sec.
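
The totals above are easier to compare as per-request latency and throughput. The arithmetic below uses only the two timings printed above; results will differ from run to run and from machine to machine.

```python
iterations = 100
sync_s = 0.44115304946899414   # timings printed above
async_s = 0.3341188430786133

for label, total in (("synchronous", sync_s), ("asynchronous", async_s)):
    per_request_ms = total / iterations * 1000
    throughput = iterations / total
    print(f"{label}: {per_request_ms:.2f} ms/request, {throughput:.0f} requests/s")

print(f"speedup: {sync_s / async_s:.2f}x")
```

For this run, that works out to about 4.4 ms versus 3.3 ms per request, a speedup of roughly 1.32x on a local connection. The gap would be expected to widen over higher-latency links, since the synchronous loop pays the round-trip time on every request.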



## 2. ASR Examples
For more details, refer to https://github.com/nvidia-riva/python-clients/blob/main/tutorials/ASR.ipynb

In this release, the Riva Speech API supports single-channel audio in `.wav` (PCM), `.alaw`, `.mulaw`, and `.flac` formats.
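
Before sending a `.wav` file, it can help to confirm that it is single-channel PCM. The sketch below uses Python's standard `wave` module. `describe_wav` is a hypothetical helper, and the demo file is generated on the fly so that the example runs without the sample data.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz, duration_s) of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return wav.getnchannels(), wav.getsampwidth(), rate, frames / rate


# Write one second of 16-bit mono silence as a demo file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

channels, width, rate, duration = describe_wav(demo_path)
print(channels, width, rate, duration)
if channels != 1:
    print("Downmix to mono before sending this file to Riva.")
```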

In [140]:
# This example uses a .wav file with LINEAR_PCM encoding.
path = "./samples/en-US_sample.wav"
audio, sr = librosa.load(path, sr=None)  # sr=None keeps the file's native sample rate
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

In [119]:
offline_config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,                     # Supports LINEAR_PCM, FLAC, MULAW and ALAW audio encodings
    sample_rate_hertz=sr,                                              # Audio will be resampled if necessary
    max_alternatives=1,                                                # How many top-N hypotheses to return
    enable_automatic_punctuation=True,                                 # Add punctuation when end of VAD detected
    audio_channel_count=1,                                             # Mono channel
    verbatim_transcripts=False,
    model="conformer-en-US-asr-offline"                                #  In the case where multiple models might be able to fulfill the client request, one model is selected at random. Y
)
response = riva_asr.offline_recognize(content, offline_config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

ASR Transcript: What is Natural Language Processing? 


Full Response Message:
results {
  alternatives {
    transcript: "What is Natural Language Processing? "
    confidence: -0.999436259
  }
  channel_tag: 1
  audio_processed: 4.8
}
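
The cell above reads only `response.results[0]`. If a response contains several results, one way to obtain the full text is to join the top alternative of each. This is a sketch: `best_transcript` is a hypothetical helper, and the second stand-in result is invented to illustrate the multi-result case.

```python
from types import SimpleNamespace


def best_transcript(results):
    """Concatenate the top alternative of every result, skipping empty ones."""
    parts = [r.alternatives[0].transcript.strip() for r in results if r.alternatives]
    return " ".join(p for p in parts if p)


def _result(text):
    # Stand-in for one entry of response.results
    return SimpleNamespace(alternatives=[SimpleNamespace(transcript=text)])


sample = [
    _result("What is Natural Language Processing? "),
    _result("It is a field of AI. "),           # invented second segment
    SimpleNamespace(alternatives=[]),            # result without a hypothesis
]
print(best_transcript(sample))
```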



## 3. TTS Service Example
For more details, refer to https://github.com/nvidia-riva/python-clients/blob/main/tutorials/TTS.ipynb

Subsequent releases will include added features, including model registration to support multiple languages/voices with the same API. Support for resampling to alternative sampling rates will also be added.

In [168]:
language_code = 'en-US'
sample_rate_hz = 22050
nchannels = 1
sampwidth = 2
text = (
    "The United States of America, commonly known as the United States or America, "
    "is a country primarily located in North America. It consists of 50 states, "
    "a federal district, five major unincorporated territories, 326 Indian reservations, "
    "and nine minor outlying islands."
)
response = riva_tts.synthesize(text, language_code=language_code, sample_rate_hz=sample_rate_hz)

In [169]:
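import os
import wave

offline_output_file = "./outputs/my_offline_synthesized_speech.wav"
os.makedirs(os.path.dirname(offline_output_file), exist_ok=True)  # make sure ./outputs exists
with wave.open(offline_output_file, 'wb') as out_f: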
    out_f.setnchannels(nchannels)
    out_f.setsampwidth(sampwidth)
    out_f.setframerate(sample_rate_hz)
    out_f.writeframesraw(response.audio)

In [170]:
import IPython
IPython.display.Audio(offline_output_file)
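
The length of the synthesized clip follows directly from the number of raw PCM bytes and the parameters set above (`sample_rate_hz`, `sampwidth`, `nchannels`). The helper below is a sketch with a hypothetical name, `pcm_duration_s`. It assumes the audio is uncompressed PCM, as in the cell that writes the `.wav` file.

```python
def pcm_duration_s(num_bytes, sample_rate_hz, sampwidth=2, nchannels=1):
    """Duration in seconds of raw PCM audio of the given byte length."""
    bytes_per_second = sample_rate_hz * sampwidth * nchannels
    return num_bytes / bytes_per_second


# At 22050 Hz, 16-bit mono, 44100 bytes hold exactly one second of audio.
print(pcm_duration_s(44100, 22050))
```

In the notebook, the call would be `pcm_duration_s(len(response.audio), sample_rate_hz, sampwidth, nchannels)`.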