## **Real-time Speech Transcription with GPT-4o-transcribe**

Azure OpenAI supports low-latency, real-time speech recognition with two models:

- **GPT-4o-transcribe**: High accuracy, full power  
- **GPT-4o-mini-transcribe**: Super fast, lighter, lower latency

Both use WebSocket connections for live, streaming transcription—perfect for any app that needs instant speech-to-text.

### **How It Works**

1. **Audio Ingestion**
    - Capture audio from your device mic, phone, or browser
    - Or stream audio from another service (e.g., Azure Communication Services - ACS)

2. **Stream Audio to Azure OpenAI**
    - Open a WebSocket to the GPT-4o-transcribe endpoint
    - Push audio chunks as you record—no need to wait for the whole file

3. **Get Real-Time Transcripts**
    - Receive live text output—see it as you speak
    - Handle partial (interim) and final (confirmed) results

4. **Do More with Your Transcripts**
    - Feed transcripts to your LLM for intent, function calling, or downstream processing
    - Send back to the user as captions, messages, or voice
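
Step 2 can be sketched in a few lines. The event type `input_audio_buffer.append` and the base64 `audio` field are the ones used by the client code later in this notebook; the helper name `make_append_event` and the silent test chunk are illustrative assumptions, not part of any SDK.

```python
import base64
import json


def make_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM16 chunk in the JSON event sent over the WebSocket."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_chunk).decode("utf-8"),
        }
    )


# 1024 frames of 16-bit mono silence (2 bytes per frame), standing in for mic audio.
silent_chunk = b"\x00\x00" * 1024
event = make_append_event(silent_chunk)

# Round-trip the event to confirm the audio survives encoding unchanged.
decoded = json.loads(event)
print(decoded["type"], len(base64.b64decode(decoded["audio"])), "bytes")
```

Each chunk travels as its own small JSON message, which is why the audio can be pushed while recording is still in progress.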

This notebook shows how to leverage *real-time speech transcription* in two scenarios:

- **From Your Phone (Local)**: Record and stream your voice straight to Azure for live transcription
- **From Other Apps (e.g., ACS)**: Pipe in audio streams from calls, meetings, or bots—transcribe them in real-time


In [1]:
import os
import json
import base64
import asyncio
import wave
from datetime import datetime
from typing import Optional, Callable, Dict, Any

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()

True

## **Record and stream your voice straight to Azure for live transcription**

In [2]:
def list_audio_input_devices() -> None:
    """
    Print all available input devices (microphones) for user selection.
    """
    p = pyaudio.PyAudio()
    print("\nAvailable audio input devices:")
    for i in range(p.get_device_count()):
        dev = p.get_device_info_by_index(i)
        if dev["maxInputChannels"] > 0:
            print(f"{i}: {dev['name']}")
    p.terminate()


def choose_audio_device(predefined_index: Optional[int] = None) -> int:
    """
    Return the index of the selected audio input device.
    If predefined_index is provided and valid, use it.
    Otherwise, prompt user if multiple devices are available.
    """
    p = pyaudio.PyAudio()
    try:
        mic_indices = [
            i
            for i in range(p.get_device_count())
            if p.get_device_info_by_index(i)["maxInputChannels"] > 0
        ]
        if not mic_indices:
            raise RuntimeError("❌ No audio input (microphone) devices found.")

        if predefined_index is not None:
            if predefined_index in mic_indices:
                print(f"🎤 Using predefined audio input device: {predefined_index}")
                return predefined_index
            else:
                print(f"Provided index {predefined_index} is not a valid input device.")

        if len(mic_indices) == 1:
            print(f"🎤 Only one audio input device found: {mic_indices[0]}")
            return mic_indices[0]

        print("Available audio input devices:")
        for idx in mic_indices:
            info = p.get_device_info_by_index(idx)
            print(f"  [{idx}]: {info['name']}")
        while True:
            try:
                selection = input(
                    f"Select audio input device index [{mic_indices[0]}]: "
                ).strip()
                if selection == "":
                    return mic_indices[0]
                selected_index = int(selection)
                if selected_index in mic_indices:
                    return selected_index
                print(
                    f"Index {selected_index} is not valid. Please choose from {mic_indices}."
                )
            except ValueError:
                print("Invalid input. Please enter a valid integer index.")

    finally:
        p.terminate()

In [3]:
class AudioRecorder:
    """
    Async audio recorder using PyAudio.
    Allows independent recording (to memory and .wav) and streaming (for STT).
    """

    def __init__(
        self,
        rate: int,
        channels: int,
        format_: int,
        chunk: int,
        device_index: Optional[int] = None,
    ):
        self.rate = rate
        self.channels = channels
        self.format = format_
        self.chunk = chunk
        self.device_index = (
            device_index if device_index is not None else choose_audio_device()
        )
        self.p = pyaudio.PyAudio()
        self.stream = None
        self.frames = []
        self.audio_queue: asyncio.Queue[bytes] = asyncio.Queue()
        self._loop = asyncio.get_event_loop()
        self._running = False

    def start(self) -> None:
        """
        Start the audio stream and begin capturing to the queue.
        """

        def callback(in_data, frame_count, time_info, status):
            self.frames.append(in_data)
            self._loop.call_soon_threadsafe(self.audio_queue.put_nowait, in_data)
            return (None, pyaudio.paContinue)

        self.stream = self.p.open(
            format=self.format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            input_device_index=self.device_index,
            frames_per_buffer=self.chunk,
            stream_callback=callback,
        )
        self._running = True
        self.stream.start_stream()

    def stop(self) -> None:
        """
        Stop and close the stream, release audio resources.
        """
        self._running = False
        if self.stream is not None:
            self.stream.stop_stream()
            self.stream.close()
        self.p.terminate()

    def save_wav(self, filename: str) -> None:
        """
        Save the recorded audio to a .wav file.
        Ensures there is audio data before saving.
        Creates the output directory if it does not exist.
        """
        if not self.frames:
            print("⚠️ No audio recorded. Nothing to save.")
            return
        directory = os.path.dirname(filename)
        if directory:
            os.makedirs(directory, exist_ok=True)
        # The context manager closes the file even if a write fails.
        with wave.open(filename, "wb") as wf:
            wf.setnchannels(self.channels)
            wf.setsampwidth(self.p.get_sample_size(self.format))
            wf.setframerate(self.rate)
            wf.writeframes(b"".join(self.frames))
        print(f"🎙️ Audio saved to {filename}")

In [4]:
class TranscriptionClient:
    """
    Handles async websocket transcription session to Azure OpenAI STT.
    Can be used independently: just supply an asyncio.Queue of audio chunks
    (a None item ends the stream).
    """

    def __init__(
        self,
        url: str,
        headers: dict,
        session_config: Dict[str, Any],
        on_delta: Optional[Callable[[str], None]] = None,
        on_transcript: Optional[Callable[[str], None]] = None,
    ):
        self.url = url
        self.headers = headers
        self.session_config = session_config
        self.ws: Optional[websockets.WebSocketClientProtocol] = None
        self._on_delta = on_delta
        self._on_transcript = on_transcript
        self._running = False
        self._send_task = None
        self._recv_task = None

    async def __aenter__(self):
        try:
            self.ws = await websockets.connect(
                self.url, additional_headers=self.headers
            )
        except TypeError:
            self.ws = await websockets.connect(self.url, extra_headers=self.headers)
        self._running = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._running = False
        if self.ws:
            await self.ws.close()
        if self._send_task:
            self._send_task.cancel()
        if self._recv_task:
            self._recv_task.cancel()

    async def send_json(self, data: dict) -> None:
        if self.ws:
            await self.ws.send(json.dumps(data))

    async def send_audio_chunk(self, audio_data: bytes) -> None:
        audio_base64 = base64.b64encode(audio_data).decode("utf-8")
        await self.send_json(
            {"type": "input_audio_buffer.append", "audio": audio_base64}
        )

    async def start_session(self, rate: int, channels: int) -> None:
        session_config = {
            "type": "transcription_session.update",
            "session": self.session_config,
        }
        await self.send_json(session_config)
        await self.send_json(
            {
                "type": "audio_start",
                "data": {"encoding": "pcm", "sample_rate": rate, "channels": channels},
            }
        )

    async def receive_loop(self) -> None:
        async for message in self.ws:
            try:
                data = json.loads(message)
                event_type = data.get("type", "")
                if event_type == "conversation.item.input_audio_transcription.delta":
                    delta = data.get("delta", "")
                    if delta and self._on_delta:
                        self._on_delta(delta)
                elif (
                    event_type
                    == "conversation.item.input_audio_transcription.completed"
                ):
                    transcript = data.get("transcript", "")
                    if transcript and self._on_transcript:
                        self._on_transcript(transcript)
                elif event_type == "conversation.item.created":
                    transcript = data.get("item", "")
                    if (
                        isinstance(transcript, dict)
                        and "content" in transcript
                        and transcript["content"]
                    ):
                        t = transcript["content"][0].get("transcript")
                        if t and self._on_transcript:
                            self._on_transcript(t)
                    elif transcript and self._on_transcript:
                        self._on_transcript(str(transcript))
            except Exception as e:
                print("❌ Error parsing message:", e)

    async def run(self, audio_chunk_iter: asyncio.Queue, rate: int, channels: int):
        """
        Main loop: configure session, send audio from queue, receive results.
        """
        await self.start_session(rate, channels)
        self._send_task = asyncio.create_task(self._send_audio_loop(audio_chunk_iter))
        self._recv_task = asyncio.create_task(self.receive_loop())
        done, pending = await asyncio.wait(
            [self._send_task, self._recv_task], return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()

    async def _send_audio_loop(self, audio_queue: asyncio.Queue):
        while self._running:
            try:
                audio_data = await audio_queue.get()
                if audio_data is None:
                    break
                await self.send_audio_chunk(audio_data)
            except asyncio.CancelledError:
                break

In [5]:
class AudioTranscriber:
    """
    High-level orchestrator for audio recording and real-time transcription.
    Use as: record only, transcribe only, or chain both (record+transcribe).
    """

    def __init__(
        self,
        url: str,
        headers: dict,
        rate: int,
        channels: int,
        format_: int,
        chunk: int,
        device_index: Optional[int] = None,
    ):
        self.url = url
        self.headers = headers
        self.rate = rate
        self.channels = channels
        self.format = format_
        self.chunk = chunk
        self.device_index = device_index

    async def record(
        self, duration: Optional[float] = None, output_file: Optional[str] = None
    ) -> AudioRecorder:
        """
        Record audio from mic. Returns AudioRecorder.
        Optionally, specify duration (seconds). Use output_file to auto-save.
        """
        recorder = AudioRecorder(
            rate=self.rate,
            channels=self.channels,
            format_=self.format,
            chunk=self.chunk,
            device_index=self.device_index,
        )
        recorder.start()
        print(
            f"Recording{' for ' + str(duration) + ' seconds' if duration else ' (Ctrl+C to stop)'}..."
        )
        try:
            if duration:
                await asyncio.sleep(duration)
            else:
                while True:
                    await asyncio.sleep(0.5)
        except (KeyboardInterrupt, asyncio.CancelledError):
            pass
        finally:
            recorder.stop()
            if output_file:
                recorder.save_wav(output_file)
        return recorder

    async def transcribe(
        self,
        audio_queue: Optional[asyncio.Queue] = None,
        model: str = "gpt-4o-transcribe",
        prompt: Optional[str] = "Respond in English.",
        language: Optional[str] = None,
        noise_reduction: str = "near_field",
        vad_type: str = "server_vad",
        vad_config: Optional[dict] = None,
        on_delta: Optional[Callable[[str], None]] = None,
        on_transcript: Optional[Callable[[str], None]] = None,
        output_wav_file: Optional[str] = None,
    ):
        """
        Run a transcription session with full model/config control.

        If audio_queue is None, creates and uses a live AudioRecorder.

        Args:
            audio_queue: Asyncio queue containing audio chunks to transcribe.
            model: Transcription model to use.
            prompt: Custom prompt for the model.
            language: Language hint for recognition.
            noise_reduction: Type of noise reduction.
            vad_type: Voice activity detection type.
            vad_config: Config dict for VAD.
            on_delta: Callback for interim results.
            on_transcript: Callback for final results.
            output_wav_file: Filename for saving raw microphone audio (if recording).
        """
        recorder = None
        if audio_queue is None:
            recorder = AudioRecorder(
                rate=self.rate,
                channels=self.channels,
                format_=self.format,
                chunk=self.chunk,
                device_index=self.device_index,
            )
            recorder.start()
            audio_queue = recorder.audio_queue

        session_config = {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": model,
                "prompt": prompt,
            },
            "input_audio_noise_reduction": {"type": noise_reduction},
            "turn_detection": {"type": vad_type} if vad_type else None,
        }
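        # Only merge VAD settings when turn detection is enabled (not None).
        if vad_config and session_config["turn_detection"] is not None:
            session_config["turn_detection"].update(vad_config)
        if language:
            session_config["input_audio_transcription"]["language"] = language
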
        async with TranscriptionClient(
            self.url, self.headers, session_config, on_delta, on_transcript
        ) as client:
            try:
                await client.run(audio_queue, self.rate, self.channels)
            except asyncio.CancelledError:
                print("Transcription cancelled.")
            finally:
                if recorder:
                    recorder.stop()
                    if output_wav_file is None:
                        # Default to timestamped file if not provided
                        output_wav_file = (
                            f"microphone_capture_{datetime.now():%Y%m%d_%H%M%S}.wav"
                        )
                    recorder.save_wav(output_wav_file)

#### **🎤 Real-time Audio Transcription with Azure OpenAI & Microphone Input**

Stream your voice, get instant captions, and interact with state-of-the-art transcription AI—all in real time!

**How Does It Work?**

- **AudioRecorder**  
    Captures audio from your microphone, streams it chunk-by-chunk to a queue, and can save the recording as a `.wav` file.
- **TranscriptionClient**  
    Connects to Azure OpenAI via WebSocket, manages the transcription session, streams audio, and receives live transcription results. Calls your handler functions for real-time feedback.
- **AudioTranscriber**  
    The high-level orchestrator. Chains recording and transcription together, so you can focus on your workflow—not the plumbing.

**Result:**  
Speak into your mic and watch as your words are transcribed live, with both interim and final results—perfect for captions, notes, or downstream AI processing.
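
The hand-off between the recorder and the client relies on one pattern: PyAudio invokes its callback on a separate thread, so chunks are pushed into the `asyncio.Queue` with `loop.call_soon_threadsafe`. The sketch below reproduces that pattern with a plain thread instead of a microphone; `demo_bridge` and the fake chunks are assumptions made for illustration.

```python
import asyncio
import threading
from typing import List


async def demo_bridge() -> List[bytes]:
    """Push chunks from a worker thread into an asyncio.Queue, then drain it."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def producer() -> None:
        # Runs outside the event loop, like PyAudio's stream callback.
        for i in range(3):
            loop.call_soon_threadsafe(queue.put_nowait, bytes([i]) * 4)
        # None marks the end of the stream for the consumer.
        loop.call_soon_threadsafe(queue.put_nowait, None)

    threading.Thread(target=producer, daemon=True).start()

    received: List[bytes] = []
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        received.append(chunk)
    return received


chunks = asyncio.run(demo_bridge())
print(f"Received {len(chunks)} chunks")
```

Inside a notebook, where an event loop is already running, call `await demo_bridge()` instead of `asyncio.run(...)`.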

In [6]:
import os
import asyncio
import pyaudio
from dotenv import load_dotenv
from typing import Optional

# Audio configuration constants
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024
AUDIO_INDEX = 0
OUTPUT_AUDIO_FILE = "recordings/test/microphone_output.wav"


def get_env_variable(name: str) -> str:
    """Get environment variable or raise RuntimeError if missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"❌ Required environment variable '{name}' is missing.")
    return value


async def main() -> None:
    """
    Main entry point for real-time transcription.
    Loads environment, configures audio, and starts transcription session.
    """
    load_dotenv()
    try:
        OPENAI_API_KEY = get_env_variable("AZURE_OPENAI_STT_TTS_KEY")
        AZURE_OPENAI_ENDPOINT = get_env_variable("AZURE_OPENAI_STT_TTS_ENDPOINT")
    except RuntimeError as e:
        print(e)
        return

    # Swap only the scheme and drop a trailing slash so the path is not doubled.
    url = f"{AZURE_OPENAI_ENDPOINT.rstrip('/').replace('https://', 'wss://', 1)}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
    headers = {"api-key": OPENAI_API_KEY}
    device_index = choose_audio_device(AUDIO_INDEX)

    transcriber = AudioTranscriber(
        url=url,
        headers=headers,
        rate=RATE,
        channels=CHANNELS,
        format_=FORMAT,
        chunk=CHUNK,
        device_index=device_index,
    )

    def print_delta(delta: str):
        """Prints incremental transcription results."""
        print(delta, end=" ", flush=True)

    def print_transcript(transcript: str):
        """Prints the final transcript."""
        print(f"\n✅ Transcript: {transcript}")

    print(">>> Starting real-time transcription session. Press Ctrl+C to stop.")
    try:
        await transcriber.transcribe(
            model="gpt-4o-transcribe",
            prompt="Respond in English. This is a medical environment.",
            noise_reduction="near_field",
            vad_type="server_vad",
            vad_config={
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 1000,
            },
            on_delta=print_delta,
            on_transcript=print_transcript,
            output_wav_file=OUTPUT_AUDIO_FILE,
        )
    except (KeyboardInterrupt, asyncio.CancelledError):
        print("\n🛑 Interrupted by user. Exiting...")
    except Exception as ex:
        print(f"\n❌ Error: {ex}")

#### **How the code above works**

**Configure Audio:**  
Select your microphone (default or prompt).  
Settings: `RATE=24000`, `CHANNELS=1`, `FORMAT=pyaudio.paInt16`, `CHUNK=1024`.
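
To get a feel for these settings, the arithmetic below works out how much audio each chunk carries. It uses only the constants listed above and assumes 2 bytes per sample, which is what `pyaudio.paInt16` implies.

```python
RATE = 24000      # samples per second
CHANNELS = 1      # mono
SAMPLE_WIDTH = 2  # bytes per sample for 16-bit PCM
CHUNK = 1024      # frames per buffer

bytes_per_chunk = CHUNK * CHANNELS * SAMPLE_WIDTH
chunk_duration_ms = CHUNK / RATE * 1000
chunks_per_second = RATE / CHUNK

print(f"{bytes_per_chunk} bytes per chunk")
print(f"{chunk_duration_ms:.1f} ms of audio per chunk")
print(f"{chunks_per_second:.1f} chunks per second")
```

With these values, each WebSocket message carries roughly 43 ms of audio, so about 23 messages are sent per second.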

**WebSocket URL:**  
Convert endpoint to `wss://.../openai/realtime?...` for Azure real-time transcription.  
Add `intent=transcription` and API version in the URL.
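
One way to build that URL is sketched below with `urllib.parse`, so that only the scheme is rewritten even if the host name happens to contain the text `https`. The helper name `build_realtime_url` and the example host are hypothetical; the path, API version, and `intent` parameter are copied from the notebook code.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_realtime_url(endpoint: str, api_version: str = "2025-04-01-preview") -> str:
    """Turn an https:// resource endpoint into the wss:// realtime transcription URL."""
    parts = urlsplit(endpoint.rstrip("/"))
    # Map only the scheme; the host and path are left untouched.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme, parts.scheme)
    query = urlencode({"api-version": api_version, "intent": "transcription"})
    return urlunsplit((scheme, parts.netloc, parts.path + "/openai/realtime", query, ""))


# Hypothetical endpoint, chosen to show that "https" inside the host name survives.
realtime_url = build_realtime_url("https://my-https-resource.openai.azure.com/")
print(realtime_url)
```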

**AudioTranscriber:**  
Acts as the high-level orchestrator:  
- Records from the mic  
- Streams audio chunks  
- Pipes data to Azure OpenAI in real time

**Handlers:**  
- `print_delta`: Prints live streaming text as you speak  
- `print_transcript`: Prints final transcript (with ✅)

**Run Session:**  
- Notify user that transcription is starting (`Ctrl+C` to exit)
- Start the transcriber with:
    - Model: `gpt-4o-transcribe`
    - Prompt: (e.g., "Respond in English. This is a medical environment.")
    - Noise reduction: `near_field`
    - Voice Activity Detection (VAD): enabled, with custom settings
    - Pass in the handlers for real-time feedback
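
The options above end up in a single `transcription_session.update` event. The following sketch assembles that payload as a plain dictionary, using the same field names as the notebook's `transcribe` method; `build_session_update` is a hypothetical helper, and which fields the service accepts should be checked against the API version in use.

```python
import json
from typing import Any, Dict, Optional


def build_session_update(
    model: str,
    prompt: Optional[str] = None,
    language: Optional[str] = None,
    noise_reduction: Optional[str] = "near_field",
    vad_type: Optional[str] = "server_vad",
    vad_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the session-update event as a plain dict (hypothetical helper)."""
    transcription: Dict[str, Any] = {"model": model}
    if prompt:
        transcription["prompt"] = prompt
    if language:
        transcription["language"] = language

    session: Dict[str, Any] = {
        "input_audio_format": "pcm16",
        "input_audio_transcription": transcription,
        "turn_detection": None,
    }
    if noise_reduction:
        session["input_audio_noise_reduction"] = {"type": noise_reduction}
    if vad_type:
        # Merge VAD tuning only when turn detection is enabled.
        session["turn_detection"] = {"type": vad_type, **(vad_config or {})}
    return {"type": "transcription_session.update", "session": session}


event = build_session_update(
    model="gpt-4o-transcribe",
    prompt="Respond in English. This is a medical environment.",
    vad_config={"threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 1000},
)
print(json.dumps(event, indent=2))
```

Building the dictionary in one place keeps the case "VAD disabled but `vad_config` supplied" from raising an error, because the tuning values are merged only when `vad_type` is set.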

**Streaming Process:**  
- Microphone records your voice  
- Audio is streamed in small chunks instantly to Azure OpenAI via WebSocket  
- Azure transcribes in real time, sending:
    - Deltas (partial text as you speak)
    - Final transcript (when the speech is complete)
- Handlers print both live and final results to your terminal
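
The delta/final split can be illustrated with a small classifier over incoming messages. The two event type strings are the ones handled in `TranscriptionClient.receive_loop`; the sample messages are hand-written for illustration and contain only the fields that code reads, so real server events may carry more.

```python
import json
from typing import Optional, Tuple

DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


def classify_event(message: str) -> Optional[Tuple[str, str]]:
    """Return ("delta", text) or ("final", text); None for any other event."""
    data = json.loads(message)
    event_type = data.get("type", "")
    if event_type == DELTA:
        return ("delta", data.get("delta", ""))
    if event_type == COMPLETED:
        return ("final", data.get("transcript", ""))
    return None


# Hand-written sample messages (illustrative, not captured from the service).
samples = [
    json.dumps({"type": DELTA, "delta": "Hey"}),
    json.dumps({"type": COMPLETED, "transcript": "Hey Heather."}),
    json.dumps({"type": "input_audio_buffer.speech_started"}),
]
results = [classify_event(m) for m in samples]
for result in results:
    print(result)
```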

**Exit Gracefully:**  
- On `Ctrl+C`, the session stops and resources are released cleanly

In [8]:
await main()

🎤 Using predefined audio input device: 0
>>> Starting real-time transcription session. Press Ctrl+C to stop.

✅ Transcript: Hey Heather, this is Pablo, are you working in my region?

✅ Transcript: Oh yeah, it seems they are working.
Transcription cancelled.
🎙️ Audio saved to recordings/test/microphone_output.wav


### **From Other Apps (e.g., ACS)**: Pipe in audio streams from calls, meetings, or bots—transcribe them in real-time


```python
@app.websocket(ACS_WEBSOCKET_PATH)
async def acs_websocket_endpoint(websocket: WebSocket):
    """Handles the bidirectional audio stream for an ACS call, using AOAI streaming STT, and records audio as a WAV file."""
    acs_caller_instance = app.state.acs_caller

    if not acs_caller_instance:
        logger.error("ACS Caller not available. Cannot process ACS audio.")
        return

    await websocket.accept()
    call_connection_id = websocket.headers.get("x-ms-call-connection-id", "UnknownCall")
    logger.info(f"▶ ACS media WebSocket accepted for call {call_connection_id}")

    cm = ConversationManager(auth=False)
    cm.cid = call_connection_id

    # --- AOAI Streaming Setup ---
    AOAI_STT_KEY = os.environ.get("AZURE_OPENAI_STT_TTS_KEY")
    AOAI_STT_ENDPOINT = os.environ.get("AZURE_OPENAI_STT_TTS_ENDPOINT")
    aoai_url = f"{AOAI_STT_ENDPOINT.replace('https', 'wss')}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
    aoai_headers = {"api-key": AOAI_STT_KEY}
    RATE = 16000
    CHANNELS = 1
    FORMAT = 16  # Bits per sample (PCM16); unused here because no microphone is opened

    audio_queue = asyncio.Queue()

    async def on_delta(delta: str):
        await broadcast_message(delta, "User")

    async def on_transcript(transcript: str):
        logger.info(f"[AOAI-Transcript] 🎤🎶🎧📼 {transcript}")
        await broadcast_message(transcript, "User")
        await process_gpt_response(cm, transcript, websocket, is_acs=True)

    # --- Open WAV file for writing ---
    wav_filename = f"acs_audio_{call_connection_id}.wav"
    wav_file = wave.open(wav_filename, "wb")
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(2)  # 16-bit PCM = 2 bytes
    wav_file.setframerate(RATE)

    async def record_audio_chunk(audio_bytes: bytes):
        wav_file.writeframes(audio_bytes)

    transcriber = AudioTranscriber(
        url=aoai_url,
        headers=aoai_headers,
        rate=RATE,
        channels=CHANNELS,
        format_=FORMAT,
        chunk=1024,
        device_index=None,
    )
    transcribe_task = asyncio.create_task(
        transcriber.transcribe(
            audio_queue=audio_queue,
            model="gpt-4o-transcribe",
            prompt="Respond in English. This is a medical environment.",
            noise_reduction="near_field",
            vad_type="server_vad",
            vad_config={
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 2000,
            },
            on_delta=lambda delta: asyncio.create_task(on_delta(delta)),
            on_transcript=lambda t: asyncio.create_task(on_transcript(t)),
        )
    )

    greeted_call_ids = app.state.greeted_call_ids
    if (
        call_connection_id != "UnknownCall"
        and call_connection_id not in greeted_call_ids
    ):
        initial_greeting = "Hello from XMYX Healthcare Company! Before I can assist you, let’s verify your identity. How may I address you?"
        logger.info(f"Playing initial greeting for call {call_connection_id}")
        await broadcast_message(initial_greeting, "Assistant")
        await send_response_to_acs(websocket, initial_greeting)
        cm.hist.append({"role": "assistant", "content": initial_greeting})
        greeted_call_ids.add(call_connection_id)
    else:
        logger.info(
            f"Skipping initial greeting for already greeted call {call_connection_id}"
        )

    try:
        while True:
            try:
                raw_data = await asyncio.wait_for(websocket.receive_text(), timeout=5.0)
                data = json.loads(raw_data)
            except asyncio.TimeoutError:
                if websocket.client_state != WebSocketState.CONNECTED:
                    logger.warning(
                        f"ACS WebSocket {call_connection_id} disconnected while waiting for data."
                    )
                    break
                continue
            except WebSocketDisconnect:
                logger.info(f"ACS WebSocket disconnected for call {call_connection_id}")
                break
            except json.JSONDecodeError:
                logger.warning(
                    f"Received invalid JSON from ACS for call {call_connection_id}"
                )
                continue
            except Exception as e:
                logger.error(
                    f"Error receiving from ACS WebSocket {call_connection_id}: {e}",
                    exc_info=True,
                )
                break

            kind = data.get("kind")
            if kind == "AudioData":
                b64 = data.get("audioData", {}).get("data")
                if b64:
                    audio_bytes = base64.b64decode(b64)
                    await audio_queue.put(audio_bytes)  # AOAI streaming
                    await record_audio_chunk(audio_bytes)  # Write to .wav file
            elif kind == "CallConnected":
                connected_participant_id = (
                    data.get("callConnected", {}).get("participant", {}).get("rawID")
                )
                if (
                    connected_participant_id
                    and call_connection_id not in call_user_raw_ids
                ):
                    call_user_raw_ids[call_connection_id] = connected_participant_id
            elif kind in ("PlayCompleted", "PlayFailed", "PlayCanceled"):
                logger.info(
                    f"Received {kind} event via WebSocket for call {call_connection_id}"
                )

    except WebSocketDisconnect:
        logger.info(f"ACS WebSocket {call_connection_id} disconnected.")
    except Exception as e:
        logger.error(
            f"Unhandled error in ACS WebSocket handler for call {call_connection_id}: {e}",
            exc_info=True,
        )
    finally:
        logger.info(
            f"🧹 Cleaning up ACS WebSocket handler for call {call_connection_id}."
        )
        await audio_queue.put(None)  # End audio for AOAI transcriber
        await transcribe_task         # Flush all transcripts

        try:
            wav_file.close()  # <--- IMPORTANT: Close file so it's readable!
            logger.info(f"WAV file closed: {wav_filename}")
        except Exception as e:
            logger.error(f"Failed to close WAV file: {e}")

        if websocket.client_state == WebSocketState.CONNECTED:
            await websocket.close()
            logger.info(
                f"ACS WebSocket connection closed for call {call_connection_id}"
            )
        if call_connection_id in call_user_raw_ids:
            try:
                del call_user_raw_ids[call_connection_id]
                logger.info(f"Removed call ID mapping for {call_connection_id}")
            except KeyError:
                logger.warning(
                    f"Call ID mapping for {call_connection_id} already removed."
                )
```


#### **Let's walk through the code above**

**1. WebSocket Connection from ACS:**  
ACS connects to your WebSocket endpoint (registered at `ACS_WEBSOCKET_PATH`) and streams audio (base64-encoded inside JSON messages) along with call events.

```python
@app.websocket(ACS_WEBSOCKET_PATH)
async def acs_websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # ... initialization ...
```
**2. Audio & Session Setup:**

- Create an async queue for streaming audio chunks.
- Prepare a `.wav` file for raw audio storage.
- Set up callbacks for delta and final transcript results.

```python
audio_queue = asyncio.Queue()
wav_file = wave.open(wav_filename, "wb")
wav_file.setnchannels(CHANNELS)
wav_file.setsampwidth(2)  # PCM16 = 2 bytes
wav_file.setframerate(RATE)
```
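
The WAV setup can be tried without a live call. This sketch writes two chunks of silence to an in-memory buffer with the same header settings (mono, 2-byte samples, 16 kHz) and reads the result back; the `io.BytesIO` target and the silent chunks are stand-ins chosen for illustration.

```python
import io
import wave

RATE = 16000
CHANNELS = 1
SAMPLE_WIDTH = 2  # 16-bit PCM = 2 bytes per sample

buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(SAMPLE_WIDTH)
    wav_file.setframerate(RATE)
    # Two half-second chunks of silence stand in for decoded call audio.
    for _ in range(2):
        wav_file.writeframes(b"\x00\x00" * 8000)

# Closing the writer finalizes the header, so the data can now be read back.
buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    n_frames = reader.getnframes()
    duration_s = n_frames / reader.getframerate()

print(f"{n_frames} frames, {duration_s:.1f} s")
```

The header is only finalized when the writer is closed, which is why the handler closes `wav_file` in its `finally` block.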

**3. AOAI Transcriber Setup:**

- Instantiate `AudioTranscriber` for the AOAI real-time endpoint over WebSocket.
- Start it as an async task that streams from `audio_queue`.

```python
transcriber = AudioTranscriber(
    url=aoai_url,
    headers=aoai_headers,
    rate=RATE,
    channels=CHANNELS,
    format_=FORMAT,
    chunk=1024,
    device_index=None,
)
transcribe_task = asyncio.create_task(
    transcriber.transcribe(
        audio_queue=audio_queue,
        model="gpt-4o-transcribe",
        prompt="Respond in English. This is a medical environment.",
        noise_reduction="near_field",
        vad_type="server_vad",
        vad_config={"threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 2000},
        on_delta=lambda delta: asyncio.create_task(on_delta(delta)),
        on_transcript=lambda t: asyncio.create_task(on_transcript(t)),
    )
)
```

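**4. Receiving Audio from ACS:**

The main loop parses each incoming ACS JSON message:

- If `kind` is `"AudioData"`, decode the payload, push it to `audio_queue`, and write it to the `.wav` file.
- If it is a call event (`CallConnected`, `PlayCompleted`, `PlayFailed`, `PlayCanceled`), handle it accordingly.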

```python
if kind == "AudioData":
    b64 = data.get("audioData", {}).get("data")
    if b64:
        audio_bytes = base64.b64decode(b64)
        await audio_queue.put(audio_bytes)    # Stream to AOAI
        await record_audio_chunk(audio_bytes) # Write to .wav
```
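
The decoding step can be isolated into a small pure function that is easy to test without a call in progress. `extract_audio_bytes` is a hypothetical helper, and the sample message below is hand-built to mirror only the fields the handler reads (`kind`, `audioData.data`); real ACS messages may include additional fields.

```python
import base64
import json
from typing import Optional


def extract_audio_bytes(raw_message: str) -> Optional[bytes]:
    """Return decoded PCM bytes for an AudioData message, otherwise None."""
    try:
        data = json.loads(raw_message)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("kind") != "AudioData":
        return None
    b64 = data.get("audioData", {}).get("data")
    return base64.b64decode(b64) if b64 else None


# Hand-built sample message for illustration.
pcm = b"\x01\x00\x02\x00"
sample = json.dumps(
    {"kind": "AudioData", "audioData": {"data": base64.b64encode(pcm).decode("utf-8")}}
)

print(extract_audio_bytes(sample))
print(extract_audio_bytes(json.dumps({"kind": "CallConnected"})))
print(extract_audio_bytes("not json"))
```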

**5. Real-Time Streaming to AOAI:**

Audio in `audio_queue` is streamed to AOAI, which returns:

- Deltas: partial text updates
- Final transcripts: confirmed utterances

The handlers forward both to the client/chat.

```python
async def on_delta(delta: str):
    await broadcast_message(delta, "User")

async def on_transcript(transcript: str):
    await broadcast_message(transcript, "User")
    await process_gpt_response(cm, transcript, websocket, is_acs=True)
```
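
Because `TranscriptionClient` calls its handlers synchronously, the full handler wraps each coroutine in `asyncio.create_task`. The sketch below shows one way to do that while keeping a reference to each task, since the event loop holds only weak references to tasks. The names `make_sync_callback` and `demo` are assumptions for illustration, not part of the notebook's code.

```python
import asyncio
from typing import Awaitable, Callable, List, Set


def make_sync_callback(
    handler: Callable[[str], Awaitable[None]], pending: Set[asyncio.Task]
) -> Callable[[str], None]:
    """Adapt an async handler to a sync callback that schedules it as a task."""

    def callback(text: str) -> None:
        task = asyncio.create_task(handler(text))
        pending.add(task)                         # keep a strong reference
        task.add_done_callback(pending.discard)   # drop it once finished

    return callback


async def demo() -> List[str]:
    seen: List[str] = []
    pending: Set[asyncio.Task] = set()

    async def on_transcript(text: str) -> None:
        await asyncio.sleep(0)  # stands in for broadcasting the text
        seen.append(text)

    callback = make_sync_callback(on_transcript, pending)
    callback("hello")
    callback("world")
    # Wait for every scheduled handler before returning.
    await asyncio.gather(*pending)
    return seen


result = asyncio.run(demo())
print(result)
```

Inside a notebook, where an event loop is already running, use `await demo()` instead of `asyncio.run(...)`.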

**6. Session Cleanup:**

On disconnect or error, signal the end of the AOAI stream, wait for the transcription task to finish, and close the `.wav` file.

```python
finally:
    await audio_queue.put(None)   # Signal end to AOAI
    await transcribe_task         # Wait for completion
    wav_file.close()   
```           
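
The shutdown sequence can be sketched in isolation: a `None` sentinel tells the consumer to stop, and the producer then awaits the consumer task. The `asyncio.wait_for` timeout is an addition for this sketch, so that a stuck consumer cannot block cleanup indefinitely; the original handler awaits the task without a timeout.

```python
import asyncio
from typing import List


async def consume(queue: asyncio.Queue, sent: List[bytes]) -> None:
    """Drain the queue until the None sentinel arrives."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        sent.append(chunk)  # stands in for sending the chunk to AOAI


async def demo_shutdown() -> List[bytes]:
    queue: asyncio.Queue = asyncio.Queue()
    sent: List[bytes] = []
    task = asyncio.create_task(consume(queue, sent))
    try:
        for i in range(3):
            await queue.put(bytes([i]))
    finally:
        await queue.put(None)                      # signal end of stream
        await asyncio.wait_for(task, timeout=5.0)  # assumed timeout for this sketch
    return sent


sent_chunks = asyncio.run(demo_shutdown())
print(f"Consumer processed {len(sent_chunks)} chunks before shutdown")
```

Because the sentinel is queued behind the audio, every chunk already in the queue is processed before the consumer exits.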

In [25]:
p = pyaudio.PyAudio()

In [26]:
p.get_sample_size(pyaudio.paInt16)

2

In [11]:
import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv(".env")  # Load environment variables from .env

OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_STT_TTS_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ AZURE_OPENAI_STT_TTS_KEY is missing!")

# WebSocket endpoint for the Azure OpenAI Realtime API (transcription intent)
url = f"{os.environ.get('AZURE_OPENAI_STT_TTS_ENDPOINT').replace('https', 'wss')}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
headers = {"api-key": OPENAI_API_KEY}
# Audio stream parameters (16-bit PCM, 24 kHz mono)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)

In [12]:
def on_open(ws):
    print("Connected! Start speaking...")
    session_config = {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "input_audio_transcription": {
                "model": "gpt-4o-mini-transcribe",
                "prompt": "Respond in English."
            },
            #"input_audio_noise_reduction": {"type": "near_field"},
            "turn_detection": {"type": "server_vad", "threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 200}
        }
    }
    ws.send(json.dumps(session_config))

    def stream_microphone():
        try:
            while ws.keep_running:
                audio_data = stream.read(CHUNK, exception_on_overflow=False)
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')
                ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }))
        except Exception as e:
            print("Audio streaming error:", e)
            ws.close()

    threading.Thread(target=stream_microphone, daemon=True).start()

In [13]:
def on_message(ws, message):
    try:
        data = json.loads(message)
        event_type = data.get("type", "")
        print("Event type:", event_type)
        #print(data)   
        # Stream live incremental transcripts
        if event_type == "conversation.item.input_audio_transcription.delta":
            transcript_piece = data.get("delta", "")
            if transcript_piece:
                print(transcript_piece, end=' ', flush=True)
        if event_type == "conversation.item.input_audio_transcription.completed":
            print(data["transcript"])
        if event_type == "item":
            transcript = data.get("item", "")
            if transcript:
                print("\nFinal transcript:", transcript)

    except Exception as e:
        # Report malformed messages instead of silently discarding them.
        print("Error handling message:", e)

In [14]:
def on_error(ws, error):
    print("WebSocket error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Disconnected from server.")
    stream.stop_stream()
    stream.close()
    audio_interface.terminate()

In [15]:
print("Connecting to OpenAI Realtime API...")
ws_app = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws_app.run_forever()

Connecting to OpenAI Realtime API...
Connected! Start speaking...
Event type: transcription_session.created
Event type: transcription_session.updated
Event type: input_audio_buffer.speech_started
Event type: input_audio_buffer.speech_stopped
Event type: input_audio_buffer.committed
Event type: conversation.item.created
Event type: conversation.item.input_audio_transcription.failed
WebSocket error: 
Disconnected from server.


True