In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Getting Started with the Multimodal Live API using Gen AI SDK


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fmultimodal-live-api%2Fintro_multimodal_live_api_genai_sdk.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview
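The Multimodal Live API enables low-latency, bidirectional interactions with Gemini. Input can be text, audio, or video, and output can be text or audio. This tutorial walks through simple examples of using the Multimodal Live API in Vertex AI.

This tutorial covers:
- Text-to-text generation
- Text-to-audio generation
- Text-to-audio conversation
- Function calling
- Code execution
- Google Search

See the [Multimodal Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more information.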

## Getting Started

### Install Google Gen AI SDK for Python


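First, install the required package. In a notebook, commands prefixed with `%` are IPython magic commands; `%pip` runs `pip` so that the package is installed into the environment of the current kernel.
The package installed here is `google-genai`, which lets us call Gemini from Python.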

In [1]:
%pip install --upgrade --quiet google-genai

Note: you may need to restart the kernel to use updated packages.
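
After restarting the kernel, it can be useful to confirm that the package is actually visible to the interpreter. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if google-genai is installed in this environment.
print("google-genai:", installed_version("google-genai") or "not installed")
```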


### Authenticate your notebook environment
If you are running this notebook on Google Colab, authenticate your environment first.

In [None]:
import sys

# Authenticate only when running inside Google Colab; other environments
# are assumed to already have Google Cloud credentials configured.
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Import libraries

Next, import the required libraries.


In [2]:
import os
from IPython.display import Audio, Markdown, display
from google import genai
from google.genai.types import (
    FunctionDeclaration,
    GoogleSearch,
    LiveConnectConfig,
    PrebuiltVoiceConfig,
    SpeechConfig,
    Tool,
    ToolCodeExecution,
    VoiceConfig,
)
import numpy as np

### Set Google Cloud project information and create client


First, create a project in the [Google Cloud Console](https://console.cloud.google.com/welcome), then [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Next, replace `[your-project-id]` below with your project ID; this selects the project to use. The cell after that creates a client bound to that project.

In [8]:
PROJECT_ID = "[your-project-id]"  # @param {type: "string"}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

In [4]:
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
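
The cells above fall back to the `GOOGLE_CLOUD_PROJECT` environment variable when the placeholder is left unchanged. If that variable is also unset, the problem would only surface later, when a request is made. The sketch below shows one way to fail early instead; `resolve_project_id` and `PLACEHOLDER` are hypothetical names introduced for illustration, not part of the SDK.

```python
import os
from typing import Mapping

PLACEHOLDER = "[your-project-id]"  # the placeholder value used in this notebook


def resolve_project_id(configured: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the configured project ID, falling back to GOOGLE_CLOUD_PROJECT."""
    if configured and configured != PLACEHOLDER:
        return configured
    fallback = env.get("GOOGLE_CLOUD_PROJECT", "")
    if not fallback:
        # Fail here rather than when the first request is sent.
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return fallback
```

With this helper, `PROJECT_ID = resolve_project_id(PROJECT_ID)` could replace the `if` block before the client is created.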

### Use the Gemini 2.0 Flash model

This tutorial uses Gemini 2.0 Flash; change the model ID to suit your use case.

In [5]:
MODEL_ID = "gemini-2.0-flash-exp"

## Use the Multimodal Live API

The Multimodal Live API is a stateful API that uses [WebSockets](https://en.wikipedia.org/wiki/WebSocket) to establish a bidirectional connection. Unlike a regular HTTP request, the connection is opened once and then stays open in both directions, so there is no need to send repeated requests. This section shows basic examples of how to use the Multimodal Live API for text-to-text and text-to-audio generation.

### **Example 1**: Text-to-text generation

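This example sends a text message and receives a text reply.

**Notes**
 - A `Session` represents a single WebSocket connection.
 - Once connected, you can send `audio`, `text`, or `video`; the response can be `audio`, `text`, or a `function call`.
 - `response_modalities` sets the response type to `TEXT` or `AUDIO`.
 - Set `end_of_turn` to `True` to have the model respond based on the data sent so far; otherwise it waits for more data.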


In [28]:
config = LiveConnectConfig(response_modalities=["TEXT"])

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send(input=text_input, end_of_turn=True)

    response = []

    async for message in session.receive():
        if message.text:
            response.append(message.text)

    display(Markdown(f"**Response >** {''.join(response)}"))

**Input:** Hello? Gemini are you there?

**Response >** Yes, I'm here. What would you like to talk about today?
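
The receive loop in this example collects streamed text chunks and joins them once the turn is over. The sketch below reproduces that pattern offline so it can be tried without a network connection. `FakeSession` and `FakeMessage` are hypothetical stand-ins written for this illustration; they are not SDK types and only mimic the `text` attribute used above plus a simple end-of-turn flag.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a server message; not an SDK type."""

    text: Optional[str] = None
    turn_complete: bool = False


class FakeSession:
    """Hypothetical stand-in for a live session that streams a reply in chunks."""

    def __init__(self, chunks: list[str]) -> None:
        self._chunks = chunks

    async def receive(self) -> AsyncIterator[FakeMessage]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield FakeMessage(text=chunk)
        yield FakeMessage(turn_complete=True)  # final message carries no text


async def collect_text(session: FakeSession) -> str:
    """Join all text chunks received until the turn completes."""
    parts = []
    async for message in session.receive():
        if message.text:
            parts.append(message.text)
        if message.turn_complete:
            break
    return "".join(parts)
```

In a notebook cell this can be run with `await collect_text(FakeSession(["Yes, ", "I'm here."]))`; in a plain script, wrap the call in `asyncio.run(...)`.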


### **Example 2**: Text-to-audio generation

This example sends a text message and receives an audio reply.

**Notes**
- The Multimodal Live API supports the following voices:
  - Puck
  - Charon
  - Kore
  - Fenrir
  - Aoede
- Set `voice_name` in `speech_config` to choose a voice.
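
Because `voice_name` is a plain string, a typo would only be noticed once the request is sent. As a small sketch, the name could be checked locally against the voices listed in the notes above. The tuple `SUPPORTED_VOICES` and the helper `normalize_voice_name` are hypothetical and simply mirror that list, which may change as the API evolves.

```python
# Voice names copied from the notes above; treat this list as an assumption.
SUPPORTED_VOICES = ("Puck", "Charon", "Kore", "Fenrir", "Aoede")


def normalize_voice_name(name: str) -> str:
    """Return the canonical spelling of a voice name, or raise ValueError."""
    by_lower = {voice.lower(): voice for voice in SUPPORTED_VOICES}
    try:
        return by_lower[name.strip().lower()]
    except KeyError:
        expected = ", ".join(SUPPORTED_VOICES)
        raise ValueError(f"Unknown voice {name!r}; expected one of {expected}") from None
```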

In [None]:
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name="Aoede",
            )
        )
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send(input=text_input, end_of_turn=True)

    audio_data = []
    async for message in session.receive():
        if message.server_content.model_turn:
            for part in message.server_content.model_turn.parts:
                if part.inline_data:
                    audio_data.append(
                        np.frombuffer(part.inline_data.data, dtype=np.int16)
                    )

    if audio_data:
        display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))
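
The example above treats the returned audio as 16-bit samples (`np.int16`) played at 24000 Hz. To keep the audio instead of only playing it, the raw bytes could be wrapped in a WAV container with the standard library. This is a sketch under two assumptions that the notebook does not state: the audio is mono and little-endian. The helper name `pcm_chunks_to_wav` is hypothetical.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], rate: int = 24000) -> bytes:
    """Wrap raw 16-bit PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # assumption: mono audio
        wav_file.setsampwidth(2)  # 2 bytes per sample, matching np.int16
        wav_file.setframerate(rate)
        wav_file.writeframes(b"".join(chunks))
    return buffer.getvalue()
```

To use it with the loop above, collect the raw `part.inline_data.data` bytes in a list rather than NumPy arrays, then write the returned bytes to a `.wav` file.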

### **Example 3**: Text-to-audio conversation

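**Step 1**: Define a conversation function that sends text through the API and receives audio replies.

**Notes**
- The model keeps track of interactions within the current session, but the conversation history is erased once the session ends and cannot be retrieved through the API.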

In [26]:
config = LiveConnectConfig(response_modalities=["AUDIO"])


async def main() -> None:
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:

        async def send() -> bool:
            text_input = input("Input > ")
            if text_input.lower() in ("q", "quit", "exit"):
                return False
            await session.send(input=text_input, end_of_turn=True)
            return True

        async def receive() -> None:

            audio_data = []

            async for message in session.receive():
                if message.server_content.model_turn:
                    for part in message.server_content.model_turn.parts:
                        if part.inline_data:
                            audio_data.append(
                                np.frombuffer(part.inline_data.data, dtype=np.int16)
                            )

                if message.server_content.turn_complete:
                    display(Markdown("**Response >**"))
                    display(
                        Audio(np.concatenate(audio_data), rate=24000, autoplay=True)
                    )
                    break

            return

        while True:
            if not await send():
                break
            await receive()

**Step 2**: Run the conversation. Type your prompts, or enter `q`, `quit`, or `exit` to leave.

In [None]:
await main()

### **Example 4**: Function calling

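With function calling, you write a description of a function and pass that description to the model. When a request matches the description, the model returns a function call together with the arguments to pass in.

**Notes**:
- All functions must be declared at the start of the session as part of the tool definitions.
- Currently, the API supports only one tool.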

In [None]:
get_current_weather = FunctionDeclaration(
    name="get_current_weather",
    description="Get current weather in the given location",
    parameters={
        "type": "OBJECT",
        "properties": {
            "location": {
                "type": "STRING",
            },
        },
    },
)

config = LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[Tool(function_declarations=[get_current_weather])],
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Get the current weather in Santa Clara, San Jose and Mountain View"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send(input=text_input, end_of_turn=True)

    async for message in session.receive():
        if message.tool_call:
            for function_call in message.tool_call.function_calls:
                display(Markdown(f"**FunctionCall >** {str(function_call)}"))

**Input:** Get the current weather in Santa Clara, San Jose and Mountain View

**FunctionCall >** id=None args={'location': 'Santa Clara'} name='get_current_weather'

**FunctionCall >** id=None args={'location': 'San Jose'} name='get_current_weather'

**FunctionCall >** id=None args={'location': 'Mountain View'} name='get_current_weather'
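
The model only returns the function calls; running them is left to the caller. A minimal sketch of that step follows, assuming each call exposes `name` and `args` as in the output above. The handler `lookup_weather`, the `HANDLERS` registry, and `dispatch` are hypothetical names, the weather data is a placeholder, and sending the results back to the model is not shown.

```python
from types import SimpleNamespace
from typing import Any, Callable


def lookup_weather(location: str) -> dict[str, Any]:
    """Hypothetical local implementation; returns canned data, not real weather."""
    return {"location": location, "forecast": "unknown (placeholder)"}


# Maps the declared function name to the local Python callable that serves it.
HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_current_weather": lookup_weather,
}


def dispatch(function_calls: list[Any]) -> list[dict[str, Any]]:
    """Run each requested function locally and pair the result with its name."""
    results = []
    for call in function_calls:
        handler = HANDLERS.get(call.name)
        if handler is None:
            results.append({"name": call.name, "error": "no handler registered"})
            continue
        results.append({"name": call.name, "response": handler(**call.args)})
    return results


# Stand-ins shaped like the FunctionCall output shown above (name and args only).
calls = [
    SimpleNamespace(name="get_current_weather", args={"location": "Santa Clara"}),
    SimpleNamespace(name="get_current_weather", args={"location": "San Jose"}),
]
for result in dispatch(calls):
    print(result)
```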

### **Example 5**: Code Execution

You can use the API's code execution capability to generate and run Python code.

In this example, we pass `code_execution` in a `Tool` so that it is initialized when the session starts.

In [30]:
config = LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[Tool(code_execution=ToolCodeExecution())],
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = "Write code to calculate the 15th fibonacci number then find the nearest palindrome to it"
    display(Markdown(f"**Input:** {text_input}"))

    await session.send(input=text_input, end_of_turn=True)

    response = []

    async for message in session.receive():
        if message.text:
            response.append(message.text)
        # Guard against messages without a model turn before reading its parts.
        if message.server_content and message.server_content.model_turn:
            for part in message.server_content.model_turn.parts:
                if part.executable_code:
                    display(
                        Markdown(
                            f"""
**Executable code:**
```py
{part.executable_code.code}
```
"""
                        )
                    )

    display(Markdown(f"**Response >** {''.join(response)}"))

**Input:** Write code to calculate the 15th fibonacci number then find the nearest palindrome to it

**Response >** Okay, I understand. Here's the plan:

1.  **Calculate the 15th Fibonacci number:** I'll use a simple iterative approach for this.
2.  **Find the nearest palindrome:** Once I have the Fibonacci number, I'll create a function to find the nearest palindrome. This will involve checking both smaller and larger numbers to see which palindrome is closest.

Here's the Python code:

```python
def fibonacci(n):
  """Calculates the nth Fibonacci number."""
  a, b = 0, 1
  for _ in range(n):
    a, b = b, a + b
  return a

def nearest_palindrome(num):
  """Finds the nearest palindrome to a given number."""
  num_str = str(num)
  length = len(num_str)

  # Helper function to generate palindrome
  def generate_palindrome(number_str, length, even):
    if even:
      return number_str + number_str[::-1]
    else:
      return number_str + number_str[:-1][::-1]
  
  # Create a palindrome from the first half of the number
  first_half = num_str[:(length + 1) // 2]
  
  # Generate even and odd length palindromes based on the first half
  even_palindrome  = int(generate_palindrome(first_half, length, length % 2 == 0))

  #Generate smaller and larger values of the first half
  smaller_half = str(int(first_half) -1)
  larger_half = str(int(first_half) + 1)
  
  #Account for the edge cases
  if smaller_half == '-1':
    smaller_palindrome = 0
  else:
    smaller_palindrome  = int(generate_palindrome(smaller_half, length, length % 2 == 0))

  larger_palindrome  = int(generate_palindrome(larger_half, length, length % 2 == 0))

  # Find the differences
  diff_smaller = abs(num - smaller_palindrome)
  diff_larger = abs(num - larger_palindrome)
  diff_even = abs(num - even_palindrome)

  # Return the closest palindrome
  if diff_smaller <= diff_larger and diff_smaller <= diff_even:
    return smaller_palindrome
  elif diff_larger <= diff_smaller and diff_larger <= diff_even:
    return larger_palindrome
  else:
    return even_palindrome
  
# Calculate the 15th Fibonacci number
fib_15 = fibonacci(15)
print(f"The 15th Fibonacci number is: {fib_15}")

# Find the nearest palindrome
nearest_pal = nearest_palindrome(fib_15)
print(f"The nearest palindrome to {fib_15} is: {nearest_pal}")
```

This code first calculates the 15th Fibonacci number, which is 610. Then, the `nearest_palindrome` function generates potential palindromes smaller, equal and larger than the number and finds the closest one. Finally, the code prints the results.
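
The response above states that the 15th Fibonacci number is 610. As a cross-check that does not depend on the model's own code, the following sketch recomputes the value and finds the nearest palindrome by brute force, searching outward one step at a time. The tie-breaking rule (prefer the smaller candidate) is an assumption made here, not something the prompt specifies.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, with fibonacci(0) == 0."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def is_palindrome(value: int) -> bool:
    """Check whether the decimal digits read the same in both directions."""
    text = str(value)
    return text == text[::-1]


def nearest_palindrome(value: int) -> int:
    """Search outward from value; ties go to the smaller candidate."""
    distance = 0
    while True:
        for candidate in (value - distance, value + distance):
            if candidate >= 0 and is_palindrome(candidate):
                return candidate
        distance += 1


fib_15 = fibonacci(15)
print(fib_15, nearest_palindrome(fib_15))  # prints: 610 606
```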


### **Example 6**: Google Search

The `google_search` tool lets the model use Google Search. We can test it with a question about an event too recent to appear in the training data.


In [31]:
config = LiveConnectConfig(
    response_modalities=["TEXT"],
    tools=[Tool(google_search=GoogleSearch())],
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    text_input = (
        "Tell me about the largest earthquake in California the week of Dec 5 2024?"
    )
    display(Markdown(f"**Input:** {text_input}"))

    await session.send(input=text_input, end_of_turn=True)

    response = []

    async for message in session.receive():
        if message.text:
            response.append(message.text)

    display(Markdown(f"**Response >** {''.join(response)}"))

**Input:** Tell me about the largest earthquake in California the week of Dec 5 2024?

**Response >** The largest earthquake in California during the week of December 5, 2024, occurred on December 5th. It had a magnitude of 7.0 and struck off the coast of Northern California, about 54 miles southwest of Eureka, in Humboldt County. A tsunami warning was issued for parts of California and Oregon but was later canceled. There were reports of non-structural impacts and some structural damage in Humboldt County.


## What's next

- Learn how to [build a web application that enables you to use your voice and camera to talk to Gemini 2.0 through the Multimodal Live API.](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/multimodal-live-api/websocket-demo-app)
- See the [Multimodal Live API reference docs](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live).
- See the [Google Gen AI SDK reference docs](https://googleapis.github.io/python-genai/).
- Explore other notebooks in the [Google Cloud Generative AI GitHub repository](https://github.com/GoogleCloudPlatform/generative-ai).

## What We're Actually Doing Next
 - Try out the app [built on the Gemini Multimodal Live API](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/websocket-demo-app/README.md)