# Real-Time Transcription and GPT Response using Whisper and OpenAI

### Why is this worth talking about?

A sales agent is only effective if it gives fast feedback while the conversation is still happening. That requires an open stream of audio input that can be processed in real time.

The project is written in Python because its ecosystem gives us the tooling we need out of the box. That lets us extend this example programmatically with additional AI models and audio formatting libraries. Similar options exist for Node.js, but they are not as easy to use.

We chose the local OpenAI Whisper model so that we can pass the streaming bytes to it explicitly. With the newer API-based version we would have to slice the audio into mp3 files before it could be processed, which is not real-time.

## Set Up OpenAI API Client


In [None]:
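from queue import Queue

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY_HERE"
)

# Transcribed lines waiting to be evaluated by GPT.
gpt_queue = Queue()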
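
Hard-coding the key is fine for a quick demo, but reading it from the environment keeps it out of the notebook. The sketch below assumes the key is stored in an environment variable named `OPENAI_API_KEY`. The helper `load_api_key` is a hypothetical name used for illustration and is not part of the OpenAI library.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    load_api_key is a hypothetical helper, shown for illustration only.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before starting.")
    return key
```

The result could then be passed as `api_key=load_api_key()` when constructing the client.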

## Helper functions for GPT queue processing


In [None]:
from time import sleep


def process_gpt_queue():
    """
    Processes lines from the transcription queue with GPT to generate actionable insights.
    """
    while True:
        try:
            if not gpt_queue.empty():
                line = gpt_queue.get()
                stream = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content": """
                            You are an expert sales coaching assistant...
                            """.strip()},
                        {"role": "user", "content": f"Evaluate this line: {line}"}
                    ],
                    stream=True,
                )
                # Print tokens as they arrive instead of waiting for the full reply.
                for chunk in stream:
                    if chunk.choices and chunk.choices[0].delta.content:
                        print(chunk.choices[0].delta.content, end="", flush=True)
                print()  # Finish the streamed response with a newline.
            else:
                # Nothing to process yet; pause briefly to avoid a busy loop.
                sleep(0.25)
        except Exception as e:
            print(f"Error in GPT processing: {e}")

### Existing Code

from threading import Thread

# A daemon thread exits automatically when the main program stops.
gpt_thread = Thread(target=process_gpt_queue, daemon=True)
gpt_thread.start()
# Cue the user that we're ready to go.
print("Model loaded * You can start your call now\n")


We run this in its own thread so that GPT responses can be processed in real time.

Audio capture and the live connection to the source microphone already occupy a thread, so inference needs a separate one.
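
The producer/consumer split described here can be sketched without any audio or API calls. In this minimal example, `fake_analyze` is a hypothetical stand-in for the GPT request, and a `None` sentinel (an assumption, not something the original code uses) tells the worker to stop so the example terminates cleanly.

```python
from queue import Queue, Empty
from threading import Thread

work_queue = Queue()   # lines waiting to be analyzed
results = []           # stand-in for the printed GPT output


def fake_analyze(line: str) -> str:
    """Placeholder for the GPT call; hypothetical, for illustration only."""
    return f"feedback for: {line}"


def worker() -> None:
    """Consume lines until a None sentinel arrives."""
    while True:
        try:
            # Block for up to 0.25 s rather than polling empty() and sleeping.
            line = work_queue.get(timeout=0.25)
        except Empty:
            continue
        if line is None:
            break
        results.append(fake_analyze(line))


thread = Thread(target=worker, daemon=True)
thread.start()

# The main thread stays free to capture audio; here it only enqueues two lines.
work_queue.put("Hello, thanks for taking my call.")
work_queue.put("What does your current process look like?")
work_queue.put(None)  # ask the worker to stop
thread.join()
print(results)
```

Using `get(timeout=...)` is one alternative to checking `empty()` and sleeping; both keep the worker from spinning while the queue is idle.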

### Setting up the audio input

In [None]:
from queue import Queue

import speech_recognition as sr

# The last time a recording was retrieved from the queue.
phrase_time = None

# Thread-safe Queue for passing data from the threaded recording callback.
data_queue = Queue()

# We use SpeechRecognition's Recognizer to record audio because it can detect when speech ends.
recorder = sr.Recognizer()
# args is assumed to hold the parsed command-line options defined elsewhere in the project.
recorder.energy_threshold = args.energy_threshold

# Disable dynamic energy compensation: it lowers the energy threshold so far
# that the recognizer never stops recording.
recorder.dynamic_energy_threshold = False

We need a local store for incoming audio so that it reaches GPT in meaningful chunks. A local file store would work, but we chose a queue. Each spoken line is added to the queue, and once enough time has passed since phrase_time (say 2 seconds) we send the line to GPT.
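
The pause check can be sketched as a small pure function. This is a minimal illustration, assuming `phrase_time` is a `datetime` (or `None` before any audio has arrived). The function name `is_phrase_complete` and the `phrase_timeout` parameter are hypothetical names introduced here.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_phrase_complete(
    phrase_time: Optional[datetime],
    now: datetime,
    phrase_timeout: float = 2.0,  # assumed pause length in seconds
) -> bool:
    """Return True when more than phrase_timeout seconds separate now from phrase_time."""
    if phrase_time is None:
        # No audio has been retrieved yet, so there is no phrase to complete.
        return False
    return now - phrase_time > timedelta(seconds=phrase_timeout)


last_audio = datetime(2024, 1, 1, 12, 0, 0)
print(is_phrase_complete(last_audio, last_audio + timedelta(seconds=1)))  # False
print(is_phrase_complete(last_audio, last_audio + timedelta(seconds=3)))  # True
```

A shorter timeout sends lines to GPT sooner but risks splitting a sentence at a natural pause.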

### Actually processing the audio bytes and generating the transcription

import numpy as np
import torch
import whisper

# Load the Whisper model; `model` is the model name chosen elsewhere in the project.
audio_model = whisper.load_model(model)

# Combine audio data from queue
audio_data = b''.join(data_queue.queue)
data_queue.queue.clear()

# Convert the in-RAM buffer to something the model can use directly without needing a temp file.
# Convert data from 16-bit integers to 32-bit floating point.
# Dividing by 32768 (2**15) scales the samples into the [-1.0, 1.0) range.
audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0

# Read the transcription. Half precision is only used when a CUDA GPU is available.
result = audio_model.transcribe(audio_np, fp16=torch.cuda.is_available())
text = result['text'].strip()

# If we detected a pause between recordings, add a new item to our transcription.
# Otherwise edit the existing one.
if phrase_complete:
    transcription.append(text)
else:
    transcription[-1] = text
# Either way, send the latest text to GPT for evaluation.
gpt_queue.put(text)

This code processes audio input in real time using OpenAI's Whisper model. It captures audio through a microphone, converts the raw 16-bit bytes to a normalized float32 NumPy array, transcribes it to text, and adds each transcribed segment to both a transcription list and a GPT queue for further analysis. The code handles continuous streaming by detecting pauses between phrases and either appending new text or updating the existing transcription.
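
The byte-to-float conversion is easy to check in isolation. The sketch below uses three hand-picked 16-bit samples (made-up values, not real microphone data) to show how dividing by 32768 maps the int16 range onto roughly [-1.0, 1.0).

```python
import numpy as np

# Three 16-bit samples: silence, the most negative value, and the most positive value.
samples = np.array([0, -32768, 32767], dtype=np.int16)
raw_bytes = samples.tobytes()  # the kind of buffer a recording callback hands over

# Reinterpret the bytes as int16, widen to float32, then scale by 2**15.
audio_np = np.frombuffer(raw_bytes, dtype=np.int16).astype(np.float32) / 32768.0

print(audio_np.dtype)  # float32
print(audio_np)
```

The most negative sample lands exactly on -1.0, while the most positive one falls just short of 1.0 because the int16 range is asymmetric.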


## Comparison of Whisper Implementations

| Feature | OpenAI Whisper (Local) | OpenAI API Whisper |
|---------|------------------------|-------------------|
| Setup | Requires local installation and model download | API key only |
| Processing | Local CPU/GPU processing | Cloud-based processing |
| Model Options | Multiple model sizes (tiny to large) | Single optimized model |
| Cost | Free (after download) | Pay per minute of audio |
| Latency | Depends on local hardware | Network-dependent |
| Integration | More code required for audio handling | Simple API calls |
| Customization | Full control over parameters | Limited configuration |
| Dependencies | Requires torch, numpy, etc. | Minimal dependencies |
| Streaming | Raw audio bytes can be passed in directly | Audio must be sliced into files before upload |
| Resource Usage | Uses local system resources | Cloud resources |


## Challenges of Real-Time Streaming Integration

When building real-time streaming applications, several key challenges need to be addressed:

1. **Latency Management**: Balancing the need to buffer enough audio for accurate transcription against real-time responsiveness

2. **Resource Usage**: Managing CPU and memory consumption, especially when processing continuous audio streams

3. **Error Handling**: Gracefully handling network issues, audio device problems, and service interruptions

4. **Queue Management**: Coordinating multiple queues for audio data, transcriptions, and API responses without bottlenecks

5. **State Management**: Tracking the state of ongoing streams and managing transitions between phrases

These challenges require careful consideration of buffer sizes, timeout values, and error recovery strategies to create a smooth user experience.
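
As one illustration of the buffer-size trade-off, a bounded queue can discard the stalest entry when the consumer falls behind, so feedback stays close to the live conversation. This is a sketch of one possible strategy, not what the code above does (it uses unbounded queues). The helper `put_dropping_oldest` is a hypothetical name, and the sketch assumes a single producer thread.

```python
from queue import Queue, Full, Empty


def put_dropping_oldest(q: Queue, item) -> bool:
    """Add item to q; if q is full, discard the oldest entry first.

    Returns True when an older entry had to be dropped.
    Assumes only one thread ever calls this function on q.
    """
    try:
        q.put_nowait(item)
        return False
    except Full:
        try:
            q.get_nowait()  # drop the stalest item to make room
        except Empty:
            pass  # a consumer emptied the queue in the meantime
        q.put_nowait(item)
        return True


pending = Queue(maxsize=2)
for line in ["first", "second", "third"]:
    put_dropping_oldest(pending, line)
print(list(pending.queue))  # ['second', 'third']
```

Dropping old lines trades completeness for latency, which may or may not suit a coaching tool; blocking the producer or growing the buffer are the obvious alternatives.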
