# Contact Center Insights

This notebook demonstrates how to extract insights from contact center calls using NVIDIA Riva and NVIDIA NIM microservices.

NVIDIA's Parakeet CTC 1.1b ASR model transcribes a two-channel recording of a call between two speakers. Llama 3.3 70B, served as an NVIDIA NIM, then processes the transcript to extract key entities and evaluate agent performance, producing insights that can be used to improve contact center operations.

Here is an architecture diagram of the workflow:

![Contact Center Insights Architecture Diagram](./Architecture_Diagram.png)

## Contact Center Insights generation involves two steps:

1. **Transcription with Speaker Diarization**
   - **NVIDIA Riva Integration:** Transcribes incoming audio calls between two speakers using NVIDIA Riva's Parakeet CTC 1.1b ASR model and creates a structured transcript.

2. **Insight Generation**
   - **Entity Extraction:** Extracts key entities such as the customer and agent names and the topic and subtopic of the conversation.
   - **Agent Performance Evaluation:** Evaluates agent performance based on several key metrics.
   - **Combine Insights:** Combines all extracted insights into a structured JSON.

## Content Overview
1. [Install dependencies](#Install-dependencies)
2. [Set required environment variables](#Set-required-environment-variables)
3. [Transcribe Audio](#Transcribe-Audio)
4. [Generate Insights](#Generate-Insights)

# 1. Install dependencies

In [None]:
%pip install -r requirements.txt

# 2. Set required environment variables

In [None]:
import getpass
import os
from dotenv import load_dotenv
from io import BytesIO
from pydub import AudioSegment
import riva.client

load_dotenv()

# validate we have the required variables
REQUIRED_VARIABLES = [
    "NVIDIA_PARAKEET_NIM_API_KEY",
    "NVIDIA_LLAMA_NIM_API_KEY",
]

for var in REQUIRED_VARIABLES:
    if var not in os.environ:
        os.environ[var] = getpass.getpass(f"Enter a value for {var}: ")

# optional variables
os.environ["RIVA_SPEECH_API_SERVER"] = os.getenv("RIVA_SPEECH_API_SERVER", "grpc.nvcf.nvidia.com")

# Look for .wav files in the "audio" directory (sorted so the choice is deterministic)
audio_files = sorted(f for f in os.listdir("audio") if f.endswith(".wav"))
if not audio_files:
    raise FileNotFoundError('No .wav files found in the "audio" directory.')

AUDIO_FILE = audio_files[0]
print(f"Using audio file: {AUDIO_FILE}")

# validate the audio file, it must have two channels
audio = AudioSegment.from_file(f"audio/{AUDIO_FILE}")
if audio.channels != 2:
    raise Exception("Audio file must have exactly two channels.")

# 3. Transcribe Audio

In [2]:
def split_audio_channels(filename: str) -> tuple[BytesIO, BytesIO]:
    """Split the audio file into two channels."""
    audio = AudioSegment.from_file(f"audio/{filename}", format="wav")

    # A two-channel file yields exactly two mono segments: [left, right]
    left_channel, right_channel = audio.split_to_mono()

    left_channel_bytes, right_channel_bytes = BytesIO(), BytesIO()
    left_channel.export(left_channel_bytes, format="wav")
    right_channel.export(right_channel_bytes, format="wav")

    return left_channel_bytes, right_channel_bytes

In [3]:
def transcribe_with_riva(audio_bytes):
    """Transcribe the audio file using Riva Speech API."""

    # Authenticate with Riva Speech API
    auth = riva.client.Auth(
        uri=os.environ["RIVA_SPEECH_API_SERVER"],
        use_ssl=True,
        metadata_args=[
            ['authorization', 'Bearer {}'.format(os.environ["NVIDIA_PARAKEET_NIM_API_KEY"])],
            ['function-id', '1598d209-5e27-4d3c-8079-4751568b1081']
        ]
    )

    # Configure the transcription
    config = riva.client.RecognitionConfig(
        language_code="en-US",
        enable_word_time_offsets=True,      # Enables word timestamps
        max_alternatives=1,                 # Set to 1 for single-best result
        enable_automatic_punctuation=True,
        audio_channel_count=1,              # Each channel is sent as mono audio
    )

    riva_asr = riva.client.ASRService(auth)
    response = riva_asr.offline_recognize(audio_bytes, config)
    
    return response.results

In [4]:
def combine_and_format_results(left_results, right_results):
    """Combine the results from the two channels and format them."""
    
    def extract_transcript(results, speaker_label):
        """Extract the transcript and start time from the first word of each alternative."""

        transcript_results = []
        for result in results:
            for alternative in result.alternatives:
                if not alternative.words:
                    continue  # No word timings, so there is no start time to sort by
                transcript_results.append({
                    'transcript': alternative.transcript,
                    'start_time': alternative.words[0].start_time, # Start time of the first word
                    'speaker': speaker_label
                })

        return transcript_results

    left_results = extract_transcript(left_results, 'Speaker 1')
    right_results = extract_transcript(right_results, 'Speaker 2')

    combined_results = left_results + right_results
    # sort all utterances by start_time
    combined_results.sort(key=lambda x: x['start_time'])

    # convert start_time from milliseconds to hh:mm:ss format
    for result in combined_results:
        seconds = result['start_time'] // 1000
        hours, remainder = divmod(seconds, 3600)
        minutes, seconds = divmod(remainder, 60)
        result['start_time'] = f"{hours:02}:{minutes:02}:{seconds:02}"

    # format the results
    utterances = []
    for result in combined_results:
        utterances.append(f"{result['start_time']} - {result['speaker']}: {result['transcript']}")

    return utterances
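
The merge logic above can be tried out without calling Riva. The following is a minimal sketch, assuming each utterance has already been reduced to a plain `(start_ms, text)` tuple; `format_timestamp` and `interleave` are hypothetical helper names used only for illustration, and the sample utterances are made up.

```python
def format_timestamp(start_ms: int) -> str:
    """Convert a millisecond offset to hh:mm:ss (fractional seconds are dropped)."""
    seconds = start_ms // 1000
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"


def interleave(left, right):
    """Merge two per-channel utterance lists into one transcript ordered by start time."""
    tagged = [(ms, "Speaker 1", text) for ms, text in left]
    tagged += [(ms, "Speaker 2", text) for ms, text in right]
    tagged.sort(key=lambda item: item[0])  # stable sort: ties keep Speaker 1 first
    return [f"{format_timestamp(ms)} - {speaker}: {text}" for ms, speaker, text in tagged]


# Made-up utterances for illustration only
left = [(500, "Thank you for calling."), (9200, "Sure, I can help with that.")]
right = [(4100, "Hi, I have a billing question.")]
lines = interleave(left, right)
for line in lines:
    print(line)
```

Sorting on the raw millisecond value before formatting keeps the ordering exact even when two utterances fall within the same displayed second.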

In [5]:
# split the audio file into two channels
left_channel, right_channel = split_audio_channels(AUDIO_FILE)

# transcribe both channels individually
left_results = transcribe_with_riva(left_channel.getvalue())
right_results = transcribe_with_riva(right_channel.getvalue())

In [6]:
# combine and format the results
CALL_TRANSCRIPT = combine_and_format_results(left_results, right_results)

# 4. Generate Insights

In [None]:
# TODO: Set up NVIDIA LangChain with access to the Llama model

In [None]:
# TODO: Prompt Llama 3.3 70B to extract entities: agent name, customer name, topic
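
One way to approach the entity-extraction step is to ask the model for a JSON object and parse the reply defensively. The sketch below is an assumption about how this could look, not the notebook's implementation: `build_entity_prompt`, `parse_json_reply`, `extract_entities`, and the JSON key names are hypothetical, and the model call is left as any callable that maps a prompt string to a reply string. A stand-in function replaces the real model here.

```python
import json
import re
from typing import Callable

# Hypothetical key names; the notebook only says names, topic, and subtopic are extracted.
ENTITY_KEYS = ["agent_name", "customer_name", "topic", "subtopic"]


def build_entity_prompt(utterances: list[str]) -> str:
    """Build a prompt asking for the entities as a single JSON object."""
    transcript = "\n".join(utterances)
    return (
        "Extract the following fields from the call transcript and reply with "
        f"one JSON object using exactly these keys: {', '.join(ENTITY_KEYS)}. "
        "Use null for anything not mentioned.\n\n"
        f"Transcript:\n{transcript}"
    )


def parse_json_reply(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}', ignoring surrounding prose."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model reply.")
    data = json.loads(match.group(0))
    # Missing keys become None so downstream code sees a fixed shape
    return {key: data.get(key) for key in ENTITY_KEYS}


def extract_entities(utterances: list[str], llm_call: Callable[[str], str]) -> dict:
    """Send the prompt through `llm_call` (any prompt -> text function) and parse the reply."""
    return parse_json_reply(llm_call(build_entity_prompt(utterances)))


# Stand-in for a real model call, used only to show the data flow
def fake_llm(prompt: str) -> str:
    return 'Here you go:\n{"agent_name": "Sam", "customer_name": "Alex", "topic": "Billing"}'


entities = extract_entities(["00:00:00 - Speaker 1: Hi, this is Sam."], fake_llm)
print(entities)
```

Passing the model as a callable keeps the prompt and parsing logic testable offline; in the notebook, `llm_call` would wrap whichever client is configured for Llama 3.3 70B, and `utterances` would be `CALL_TRANSCRIPT`.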

In [None]:
# TODO: Prompt Llama 3.3 70B to generate agent performance metrics
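
The final step, combining the extracted entities and the agent evaluation into one structured JSON, could look like the sketch below. The metric names, the 1 to 5 score scale, the output layout, and the `validate_scores` and `combine_insights` helpers are illustrative assumptions; the notebook does not specify them. The input values are made up and stand in for model output.

```python
import json

# Hypothetical metrics and scale, chosen only for illustration
METRICS = ["greeting", "empathy", "resolution"]
MIN_SCORE, MAX_SCORE = 1, 5


def validate_scores(scores: dict) -> dict:
    """Keep only the known metrics and check that each score is an integer in range."""
    cleaned = {}
    for metric in METRICS:
        value = scores.get(metric)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"Invalid score for {metric!r}: {value!r}")
        cleaned[metric] = value
    return cleaned


def combine_insights(audio_file: str, entities: dict, scores: dict) -> str:
    """Merge entities and validated scores into one JSON document."""
    cleaned = validate_scores(scores)
    insights = {
        "audio_file": audio_file,
        "entities": entities,
        "agent_performance": {
            "scores": cleaned,
            "average": round(sum(cleaned.values()) / len(cleaned), 2),
        },
    }
    return json.dumps(insights, indent=2)


# Made-up values standing in for model output
report = combine_insights(
    "call_001.wav",
    {"agent_name": "Sam", "customer_name": "Alex", "topic": "Billing"},
    {"greeting": 5, "empathy": 4, "resolution": 4},
)
print(report)
```

Validating the scores before serializing means a malformed model reply fails loudly at this step instead of producing a report with missing or out-of-range metrics.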