### Overview

The Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. See the [Live API page](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) for more details.

This tutorial demonstrates the following simple examples to help you get started with the Live API in Vertex AI using the Google Gen AI SDK.

1. Using Gemini 2.0 Flash
   1. Text-to-text generation
   2. Text-to-audio generation
   3. Text-to-audio conversation
   4. Google Search
   5. Audio transcription
   6. Voice Activity Detection (VAD)
2. Using Gemini 2.5 Flash native audio dialog
   1. Proactive audio
   2. Affective dialog


In [53]:
from pathlib import Path

from IPython.display import Audio, Markdown, display
from google.genai.types import (
    AudioTranscriptionConfig,
    AutomaticActivityDetection,
    Content,
    EndSensitivity,
    GoogleSearch,
    LiveConnectConfig,
    Part,
    PrebuiltVoiceConfig,
    ProactivityConfig,
    RealtimeInputConfig,
    SpeechConfig,
    StartSensitivity,
    Tool,
    ToolCodeExecution,
    Blob,
    VoiceConfig,
    Modality,
)
import numpy as np

In [31]:
from google import genai
from dotenv import load_dotenv
import os

load_dotenv()

client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


In [35]:
# Google AI Studio offers more live-related models
for model in client.models.list(config={"query_base": True}):
    if "live" in model.name or "audio" in model.name:
        print(model.name)

models/gemini-2.5-flash-preview-native-audio-dialog
models/gemini-2.5-flash-exp-native-audio-thinking-dialog
models/gemini-2.0-flash-live-001
models/gemini-live-2.5-flash-preview
models/gemini-2.5-flash-live-preview


In [None]:
MODEL_ID = "gemini-live-2.5-flash-preview"

### Example 1: Text-to-text generation

**Notes**

- A session represents a single WebSocket connection between the client and the server.
- A session configuration includes the model, generation parameters, system instructions, and tools.
  - response_modalities accepts TEXT or AUDIO.
- After a new session is initiated, the session can exchange messages with the server to
  - Send text, audio, or video to the server.
  - Receive audio, text, or function call responses from the server.
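- When sending messages to the server, set end_of_turn to True to indicate that the server content generation should start with the currently accumulated prompt. Otherwise, the server awaits additional messages before starting generation. In send_client_content, which the cells in this tutorial use, the corresponding argument is turn_complete, and it defaults to True.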
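
The receive side of this exchange can be factored into a small helper. The following is a minimal sketch rather than part of the SDK: `collect_text` and `fake_stream` are hypothetical names, and the sketch assumes each server message exposes `text` and `server_content.turn_complete`, as the cells in this tutorial do. Stand-in objects replace a real session so the logic can be run offline.

```python
import asyncio
from types import SimpleNamespace


async def collect_text(messages):
    """Join text chunks from an async stream of messages until the turn completes."""
    chunks = []
    async for message in messages:
        if getattr(message, "text", None):
            chunks.append(message.text)
        # server_content can be absent on some messages, so guard before reading it.
        server_content = getattr(message, "server_content", None)
        if server_content is not None and getattr(server_content, "turn_complete", False):
            break
    return "".join(chunks)


async def fake_stream():
    """Stand-in for session.receive(); yields objects shaped like server messages."""
    yield SimpleNamespace(text="Hello", server_content=SimpleNamespace(turn_complete=False))
    yield SimpleNamespace(text=", world", server_content=SimpleNamespace(turn_complete=False))
    yield SimpleNamespace(text=None, server_content=SimpleNamespace(turn_complete=True))


result = asyncio.run(collect_text(fake_stream()))
print(result)
```

Inside a notebook, where an event loop is already running, call `await collect_text(session.receive())` instead of `asyncio.run(...)`.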


In [37]:
# async with creates and manages an asynchronous session, ensuring the connection is closed automatically after use
async with client.aio.live.connect(
    model=MODEL_ID,
    config=LiveConnectConfig(response_modalities=[Modality.TEXT]),
) as session:
    text_input = "嗨, Gemini 你在嗎？"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    response = []

    async for message in session.receive():
        if message.text:
            response.append(message.text)

    display(Markdown(f"**Response >** {''.join(response)}"))

**Input:** 嗨, Gemini 你在嗎？

**Response >** 哈囉！我在喔。有什麼我可以幫你的嗎？

### Example 2: Text-to-audio generation

- You send a text prompt and receive a model response in audio.
- Notes
  - The Live API supports the following voices: Puck, Charon, Kore, Fenrir, Aoede
- To specify a voice, set the voice_name within the speech_config object, as part of your session configuration.
- To specify a language, set the language_code within the speech_config as part of your session configuration. See the supported languages. (Don't forget to set language_code; results are noticeably better with it.)


In [41]:
voice_name = "Zephyr"  # @param ["Aoede", "Puck", "Charon", "Kore", "Fenrir", "Leda", "Orus", "Zephyr"]

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="cmn-CN",  # currently the only Chinese option
    ),
    system_instruction=Content(
        role="model",
        parts=[
            Part(text="You are a helpful assistant and 只會用台灣的中文和語氣來回覆")
        ],
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hi, 你在嗎？今天心情如何？"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    audio_data = []
    async for message in session.receive():
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

**Input:** Hi, 你在嗎？今天心情如何？
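
To keep a response outside the notebook, the collected audio can be written to a WAV file. This is a minimal sketch using only the standard library. `save_pcm_as_wav` is a hypothetical helper, and it assumes the audio is mono, 16-bit little-endian PCM at 24 kHz, which matches the `np.int16` dtype and `rate=24000` used above. The helper takes raw byte chunks (for example each `part.inline_data.data`) rather than NumPy arrays, and the demo uses a synthetic tone instead of a real response.

```python
import math
import struct
import wave

SAMPLE_RATE_HZ = 24000  # assumed output rate, as in Audio(..., rate=24000)
BYTES_PER_SAMPLE = 2    # 16-bit samples


def save_pcm_as_wav(pcm_chunks, path, sample_rate=SAMPLE_RATE_HZ):
    """Write raw 16-bit mono PCM byte chunks to a WAV file; return the duration in seconds."""
    pcm = b"".join(pcm_chunks)
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(BYTES_PER_SAMPLE)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)


# Synthetic stand-in for model audio: half a second of a 440 Hz tone.
n_samples = SAMPLE_RATE_HZ // 2
tone = struct.pack(
    f"<{n_samples}h",
    *(
        int(0.2 * 32767 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE_HZ))
        for i in range(n_samples)
    ),
)

duration = save_pcm_as_wav([tone, tone], "response.wav")
print(f"Wrote {duration:.2f} s of audio")
```

In the cell above, the same helper could be fed a list of `part.inline_data.data` values collected in the receive loop.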

### Example 3: Text-to-audio conversation

- Step 1: You set up a conversation with the API that allows you to send text prompts and receive audio responses.
- Notes
  - While the model keeps track of in-session interactions, explicit session history accessible through the API isn't available yet. When a session is terminated the corresponding context is erased.
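
Because the context is erased when a session ends, one option is to keep a transcript on the client and replay it when a new session starts. The class below is a hypothetical sketch, not an SDK feature. It stores turns as plain dictionaries in the `role`/`parts` shape, on the assumption that `send_client_content` accepts a list of such turns. For audio responses, the model's text would have to come from output transcription.

```python
class ConversationLog:
    """Client-side record of the text exchanged in a session (hypothetical helper)."""

    def __init__(self):
        self._turns = []

    def add(self, role, text):
        """Append one turn; role is 'user' or 'model'."""
        if role not in ("user", "model"):
            raise ValueError(f"unexpected role: {role!r}")
        self._turns.append({"role": role, "parts": [{"text": text}]})

    def as_turns(self):
        """Return a shallow copy of the turns, oldest first."""
        return [dict(turn) for turn in self._turns]


log = ConversationLog()
log.add("user", "Good morning")
log.add("model", "Good morning! How can I help?")
turns = log.as_turns()
print(turns)
```

In a new session, the stored turns could then be replayed before the next prompt with something like `await session.send_client_content(turns=log.as_turns(), turn_complete=False)`. Treat that call as an assumption to verify against the SDK version in use.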


In [None]:
config = LiveConnectConfig(
    response_modalities=["AUDIO"], speech_config=SpeechConfig(language_code="cmn-CN")
)


async def main() -> None:
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:

        async def send() -> "str | None":
            text_input = input("Input > ")
            if text_input.lower() in ("q", "quit", "exit"):
                return None  # Return None to signal the end of the conversation
            await session.send_client_content(
                turns=Content(role="user", parts=[Part(text=text_input)])
            )
            return text_input  # Return the text the user entered

        async def receive(user_input: str) -> None:
            display(Markdown(f"**Input >** {user_input}"))

            audio_data = []

            async for message in session.receive():
                if (
                    message.server_content.model_turn
                    and message.server_content.model_turn.parts
                ):
                    for part in message.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_data.append(
                                np.frombuffer(part.inline_data.data, dtype=np.int16)
                            )

                if message.server_content.turn_complete:
                    display(Markdown("**Response >**"))
                    display(
                        Audio(np.concatenate(audio_data), rate=24000, autoplay=True)
                    )
                    break

            return

        while True:
            user_input = await send()
            if user_input is None:  # None means the user asked to quit
                print("Exiting chat.")
                break
            # Pass the user's text to the receive function
            await receive(user_input)

In [45]:
await main()

**Input >** 早安

**Response >**

**Input >** 我想請假，幫我想理由

**Response >**

**Input >** 我想要請病假，單純不想上班

**Response >**

Exiting chat.


### Example: Google Search

- The google_search tool lets the model conduct Google searches. For example, try asking it about events that are too recent to be in the training data.


In [None]:
config = LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[Tool(google_search=GoogleSearch())],
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "請問目前的時間？(UTC + 8)以及台北的天氣狀況？"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    final_response = []

    print("--- Full model response process ---")
    async for message in session.receive():
        # Check for server_content, the container for all model responses
        if message.server_content:
            # 1. Handle code generated by the model (executable_code)
            model_turn = message.server_content.model_turn
            if model_turn:
                for part in model_turn.parts:
                    if part.executable_code:
                        print(
                            f"⚙️ [Model-generated code]:\n```python\n{part.executable_code.code}\n```"
                        )

                    # 2. Handle the code execution result (code_execution_result)
                    if part.code_execution_result:
                        print(
                            f"📊 [Code execution result]:\n{part.code_execution_result.output}\n"
                        )

            # 3. Handle the final text reply
            if message.text:
                # To avoid duplicate output, collect the text here and display it all at the end
                final_response.append(message.text)

    print("--- Final combined reply ---")
    # Finally, join all text fragments and display them
    display(Markdown(f"**Response >** {''.join(final_response)}"))

**Input:** 請問目前的時間？(UTC + 8)以及台北的天氣狀況？

--- Full model response process ---

⚙️ [Model-generated code]:
```python
print(google_search.search(queries=["台北天氣", "目前UTC+8時間"]))
```

📊 [Code execution result]:
Looking up information on Google Search.

--- Final combined reply ---


**Response >** 目前時間 (UTC+8) 為 **2025年7月23日星期三上午11:59** [1]。

至於台北的天氣狀況：

目前台北的天氣是**多雲轉晴**，氣溫約 **34°C (94°F)**，感覺像 **42°C (107°F)**，濕度約 **54%**，降雨機率約 **0%** [2]。

未來幾天的天氣預報如下：
*   **今天 (7月23日，星期三)**：白天多雲轉晴，晚上局部多雲，白天降雨機率15%，晚上20%。氣溫約27°C到35°C。濕度約73% [2, 3]。
*   **明天 (7月24日，星期四)**：白天小雨，晚上有雨，白天降雨機率45%，晚上55%。氣溫約26°C到31°C。濕度約83% [2]。
*   **後天 (7月25日，星期五)**：白天小雨，晚上多雲，白天降雨機率50%，晚上20%。氣溫約26°C到28°C。濕度約89% [2]。

### Example: Audio transcription


In [None]:
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=AudioTranscriptionConfig(),
    output_audio_transcription=AudioTranscriptionConfig(),
    speech_config=SpeechConfig(language_code="cmn-CN"),
)


async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? 在嗎？我想請假，幫我想個合理的理由"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    audio_data = []
    input_transcription = []
    output_transcription = []

    async for message in session.receive():
        if (
            message.server_content.input_transcription
            and message.server_content.input_transcription.text
        ):
            input_transcription.append(message.server_content.input_transcription.text)
        if (
            message.server_content.output_transcription
            and message.server_content.output_transcription.text
        ):
            output_transcription.append(
                message.server_content.output_transcription.text
            )
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if input_transcription:
        display(Markdown(f"**Input transcription >** {''.join(input_transcription)}"))

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

    if output_transcription:
        display(Markdown(f"**Output transcription >** {''.join(output_transcription)}"))

**Input:** Hello? 在嗎？我想請假，幫我想個合理的理由

**Output transcription >** 你說你想請假，我可以幫你想個合理的理由。你希望請假的理由是個人事務、健康問題，還是其他原因呢？

### Example: Voice Activity Detection (VAD)

- Voice Activity Detection (VAD) allows the model to recognize when a person is speaking. This is essential for creating natural conversations, as it allows a user to interrupt the model at any time.
- By default, the model automatically performs voice activity detection on a continuous audio input stream. Voice activity detection can be configured with the realtimeInputConfig.automaticActivityDetection field of the setup message.
- When voice activity detection detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a message to report the interruption.
- When the audio stream is paused for more than a second (for example, because the user switched off the microphone), an audioStreamEnd event should be sent to flush any cached audio. The client can resume sending audio data at any time.
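
Automatic activity detection works on a continuous stream, so a live client would normally send audio in small pieces rather than as one blob. The sketch below shows one way to split raw PCM into fixed-duration chunks. `iter_pcm_chunks` is a hypothetical helper, and it assumes mono 16-bit PCM at 16 kHz, matching the `audio/pcm;rate=16000` MIME type used in this example. The 100 ms chunk size is an illustrative choice, not a documented requirement.

```python
INPUT_SAMPLE_RATE_HZ = 16000  # matches mime_type "audio/pcm;rate=16000"
BYTES_PER_SAMPLE = 2          # assumed 16-bit mono samples


def iter_pcm_chunks(pcm, chunk_ms=100, sample_rate=INPUT_SAMPLE_RATE_HZ):
    """Yield consecutive slices of raw PCM bytes, each covering about chunk_ms of audio."""
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_bytes = samples_per_chunk * BYTES_PER_SAMPLE  # stays aligned to whole samples
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence as a stand-in for microphone input.
silence = bytes(INPUT_SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_pcm_chunks(silence))
print(len(chunks), len(chunks[0]))
```

Each chunk could then be sent with `session.send_realtime_input(media=Blob(data=chunk, mime_type="audio/pcm;rate=16000"))`, followed by `session.send_realtime_input(audio_stream_end=True)` when the stream pauses, as described above.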


In [50]:
# Download an example audio file
URL = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/hello_are_you_there.pcm"
!wget -q $URL -O sample.pcm

In [54]:
# Configure automatic activity detection
config = LiveConnectConfig(
    response_modalities=["TEXT"],
    realtime_input_config=RealtimeInputConfig(
        automatic_activity_detection=AutomaticActivityDetection(
            disabled=False,  # default
            start_of_speech_sensitivity=StartSensitivity.START_SENSITIVITY_LOW,
            end_of_speech_sensitivity=EndSensitivity.END_SENSITIVITY_LOW,
            prefix_padding_ms=20,
            silence_duration_ms=100,
        )
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    audio_bytes = Path("sample.pcm").read_bytes()

    await session.send_realtime_input(
        media=Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
    )

    # if stream gets paused, send:
    # await session.send_realtime_input(audio_stream_end=True)

    response = []
    async for message in session.receive():
        if message.server_content.interrupted is True:
            # The model generation was interrupted
            response.append("The session was interrupted")

        if message.text:
            response.append(message.text)

    display(Markdown(f"**Response >** {''.join(response)}"))

**Response >** Yes, I can hear you clearly. How can I help you today?

### Native Audio

- 'gemini-2.5-flash-preview-native-audio-dialog'
- Enhanced voice quality and adaptability
- Introducing proactive audio
- Introducing affective dialog


In [None]:
# These features are only available with a v1alpha client
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"), http_options={"api_version": "v1alpha"}
)

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


In [59]:
MODEL_ID = "gemini-2.5-flash-preview-native-audio-dialog"

### Example 1: Proactive audio

- When proactive audio is enabled, the model only responds when it's relevant. The model generates text transcripts and audio responses proactively only for queries directed to the device, and does not respond to non-device directed queries.


In [60]:
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=AudioTranscriptionConfig(),
    output_audio_transcription=AudioTranscriptionConfig(),
    proactivity=ProactivityConfig(proactive_audio=True),
)


async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    audio_data = []
    input_transcription = []
    output_transcription = []

    async for message in session.receive():
        if (
            message.server_content.input_transcription
            and message.server_content.input_transcription.text
        ):
            input_transcription.append(message.server_content.input_transcription.text)
        if (
            message.server_content.output_transcription
            and message.server_content.output_transcription.text
        ):
            output_transcription.append(
                message.server_content.output_transcription.text
            )
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if input_transcription:
        display(Markdown(f"**Input transcription >** {''.join(input_transcription)}"))

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

    if output_transcription:
        display(Markdown(f"**Output transcription >** {''.join(output_transcription)}"))

**Input:** Hello? Gemini are you there?

**Output transcription >** Yes, I am here. How can I help you today?

### Example 2: Affective dialog

- When affective dialog is enabled, the model can understand and respond appropriately to users' emotional expressions for more nuanced conversations.


In [61]:
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    enable_affective_dialog=True,
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there? It's really a good day!"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    audio_data = []
    async for message in session.receive():
        if (
            message.server_content.model_turn
            and message.server_content.model_turn.parts
        ):
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

**Input:** Hello? Gemini are you there? It's really a good day!