
# Multimodal Live API

The Multimodal Live API enables low-latency, bidirectional voice and video interactions with Gemini. With it, you can give end users natural, human-like voice conversations and let them interrupt the model's responses using voice commands. The model can process text, audio, and video input, and it can provide text and audio output.

# Capabilities
The Multimodal Live API includes the following key capabilities:

*   Multimodality: The model can see, hear, and speak.
*   Low-latency real-time interaction: The model provides fast responses.
*   Session memory: The model retains memory of all interactions within a single session, recalling previously heard or seen information.
*   Support for function calling, code execution, and Search as a tool: Enables integration with external services and data sources.
*   Automated voice activity detection (VAD): The model can accurately recognize when the user begins and stops speaking. This allows for natural, conversational interactions and lets users interrupt the model at any time.

You can try the Multimodal Live API in Google AI Studio.





# Get started

Multimodal Live API is a stateful API that uses WebSockets.

This section shows an example of how to use the Multimodal Live API for text-to-text generation, using Python 3.9+.

# Install the Gemini API library
To install the google-genai package, use the following pip command:

In [None]:
pip install google-genai



# Import dependencies
To import dependencies:

In [None]:
from google import genai



# Send and receive a text message

In [None]:
import asyncio
import os

import nest_asyncio
from google import genai

# Apply nest_asyncio so asyncio.run() works inside the notebook's running event loop
nest_asyncio.apply()

# Read the API key from an environment variable instead of hard-coding it.
# GEMINI_API_KEY is an assumed variable name; use whatever name you set.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"], http_options={'api_version': 'v1alpha'})
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}

async def main():
    # Open a Live API session and chat until the user types "exit"
    async with client.aio.live.connect(model=model_id, config=config) as session:
        while True:
            message = input("User> ")
            if message.lower() == "exit":
                break
            await session.send(input=message, end_of_turn=True)

            # Stream the model's reply for this turn, skipping non-text responses
            async for response in session.receive():
                if response.text is None:
                    continue
                print(response.text, end="")

if __name__ == "__main__":
    # nest_asyncio lets asyncio.run() nest inside the notebook's event loop,
    # and the cell finishes once the user types "exit".
    asyncio.run(main())



User> hello
Hello! How can I help you today?
User> what can u help me in 
I can help you with a wide variety of tasks! Here are some of the things I can do:

**General Knowledge & Information:**

*   **Answer your questions:** I can access and process information from the real world through Google Search and keep my response consistent with search results. I can answer questions on a vast range of topics, from historical events to scientific concepts to current events.
*   **Provide definitions and explanations:** If you need clarification on a word, concept, or idea, I can help break it down for you.
*   **Offer summaries of articles or texts:** I can quickly condense large amounts of text into a concise overview.
*   **Research topics:** If you need information on a particular subject, I can help you gather relevant details and resources.
*   **Stay updated on current events:** I have access to the latest news and can provide you with summaries of what's happening around the world.




In [None]:
# Install nest_asyncio, which the text example above imports.
# Run this cell before that example if the package is missing.
!pip install nest_asyncio



# Integration guide
This section describes how integration works with Multimodal Live API.

# Sessions
A WebSocket connection establishes a session between the client and the Gemini server.

After a client initiates a new connection, the session can exchange messages with the server to:

*   Send text, audio, or video to the Gemini server.
*   Receive audio, text, or function call requests from the Gemini server.


The session configuration is sent in the first message after connection. A session configuration includes the model, generation parameters, system instructions, and tools.

See the following example configuration. Note that the name casing in SDKs may vary. You can look up the Python SDK configuration options here.




In [None]:
{
  "model": string,
  "generationConfig": {
    "candidateCount": integer,
    "maxOutputTokens": integer,
    "temperature": number,
    "topP": number,
    "topK": integer,
    "presencePenalty": number,
    "frequencyPenalty": number,
    "responseModalities": [string],
    "speechConfig": object
  },
  "systemInstruction": string,
  "tools": [object]
}

For more information, see BidiGenerateContentSetup.
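
As a rough illustration of the configuration above, the following sketch assembles a session configuration as a Python dict and serializes it to JSON. The helper name `build_setup_message` and the sample values are assumptions made for this example. The field names follow the schema shown above, and the configuration is wrapped in a `setup` field as described in the "Send messages" section.

```python
import json

def build_setup_message(model, response_modalities, system_instruction=None,
                        temperature=None, tools=None):
    """Assemble a session configuration using the field names listed above."""
    generation_config = {"responseModalities": list(response_modalities)}
    if temperature is not None:
        generation_config["temperature"] = temperature

    setup = {"model": model, "generationConfig": generation_config}
    if system_instruction is not None:
        setup["systemInstruction"] = system_instruction
    if tools:
        setup["tools"] = list(tools)
    # The configuration travels in the "setup" field of the first client message.
    return {"setup": setup}

setup_message = build_setup_message(
    model="gemini-2.0-flash-exp",          # model ID used earlier in this guide
    response_modalities=["TEXT"],
    system_instruction="Answer briefly.",  # illustrative value
    temperature=0.7,                       # illustrative value
)
payload = json.dumps(setup_message)  # JSON text to send as the first message
```

Optional fields are left out of the payload when they are not set, so the server-side defaults apply.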

# Send messages
Messages are JSON-formatted objects exchanged over the WebSocket connection.

To send a message, the client must send a JSON object over an open WebSocket connection. The JSON object must have exactly one of the fields from the following object set:

In [None]:
{
  "setup": BidiGenerateContentSetup,
  "clientContent": BidiGenerateContentClientContent,
  "realtimeInput": BidiGenerateContentRealtimeInput,
  "toolResponse": BidiGenerateContentToolResponse
}
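
The "exactly one field" rule can be checked on the client before a message is sent. The sketch below is one possible check, not part of any SDK; the function name `validate_client_message` is hypothetical, and the field names come from the object set above.

```python
CLIENT_MESSAGE_FIELDS = {"setup", "clientContent", "realtimeInput", "toolResponse"}

def validate_client_message(message):
    """Return the name of the single message field, or raise ValueError."""
    unknown = [key for key in message if key not in CLIENT_MESSAGE_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    present = [key for key in message if key in CLIENT_MESSAGE_FIELDS]
    if len(present) != 1:
        raise ValueError(f"expected exactly one message field, found {len(present)}")
    return present[0]

# A message with a single known field passes the check.
kind = validate_client_message({"clientContent": {"turns": [], "turnComplete": True}})
```

Running the check before each send catches malformed messages locally rather than relying on the server to reject them.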


# Supported client messages
See the supported client messages in the following table:

| Message | Description |
| --- | --- |
| BidiGenerateContentSetup | Session configuration to be sent in the first message |
| BidiGenerateContentClientContent | Incremental content update of the current conversation delivered from the client |
| BidiGenerateContentRealtimeInput | Real-time audio or video input |
| BidiGenerateContentToolResponse | Response to a ToolCallMessage received from the server |


# Receive messages
To receive messages from Gemini, listen for the WebSocket 'message' event, and then parse the result according to the definition of the supported server messages.

See the following:

In [None]:
# Receive messages in Python with the websockets library.
# The client connects to a WebSocket endpoint and handles each incoming frame.

import asyncio
import websockets

# Placeholder address; replace it with the WebSocket endpoint you connect to.
URI = "ws://localhost:8765"

async def handler(websocket):
    # Iterate over incoming frames until the connection closes
    async for message in websocket:
        if isinstance(message, bytes):
            # Process a binary frame
            print("Received binary data:", message)
        else:
            # Process a text frame (for example, a JSON message)
            print("Received text data:", message)

async def main():
    # Open a client connection and process messages as they arrive
    async with websockets.connect(URI) as websocket:
        await handler(websocket)

if __name__ == "__main__":
    asyncio.run(main())


Server messages will have exactly one of the fields from the following object set:

In [None]:
{
  "setupComplete": BidiGenerateContentSetupComplete,
  "serverContent": BidiGenerateContentServerContent,
  "toolCall": BidiGenerateContentToolCall,
  "toolCallCancellation": BidiGenerateContentToolCallCancellation
}
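
Because each server message carries exactly one of these fields, a receiver can branch on whichever field is present. The following sketch shows one way to do that. The function name `classify_server_message` is hypothetical, and decoding binary frames as UTF-8 JSON is an assumption rather than documented behavior.

```python
import json

SERVER_MESSAGE_FIELDS = ("setupComplete", "serverContent", "toolCall", "toolCallCancellation")

def classify_server_message(raw):
    """Parse one frame and return (field_name, field_value)."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")  # assumption: binary frames contain UTF-8 JSON
    message = json.loads(raw)
    present = [name for name in SERVER_MESSAGE_FIELDS if name in message]
    if len(present) != 1:
        raise ValueError(f"expected exactly one server field, found {present}")
    return present[0], message[present[0]]

# Example frame with made-up content.
kind, body = classify_server_message('{"toolCall": {"example": true}}')
```

The returned field name can then drive a dispatch table or an `if`/`elif` chain in the receive loop.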

# Supported server messages

| Message | Description |
| --- | --- |
| BidiGenerateContentSetupComplete | Sent in response to a BidiGenerateContentSetup message from the client when setup is complete |
| BidiGenerateContentServerContent | Content generated by the model in response to a client message |
| BidiGenerateContentToolCall | Request for the client to run the function calls and return the responses with the matching IDs |
| BidiGenerateContentToolCallCancellation | Sent when a function call is canceled due to the user interrupting model output |


# Incremental content updates
Use incremental updates to send text input, establish session context, or restore session context. For short contexts, you can send turn-by-turn interactions to represent the exact sequence of events. For longer contexts, it's recommended to provide a single message summary to free up the context window for follow-up interactions.

See the following example context message:

In [None]:
{
  "clientContent": {
    "turns": [
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "user"
      },
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "model"
      }
    ],
    "turnComplete": true
  }
}

Note that while content parts can be of a functionResponse type, BidiGenerateContentClientContent shouldn't be used to provide a response to the function calls issued by the model. BidiGenerateContentToolResponse should be used instead. BidiGenerateContentClientContent should only be used to establish previous context or provide text input to the conversation.
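
As an illustration of the context message format above, the sketch below builds a `clientContent` message from a list of `(role, text)` pairs. The helper name `build_context_message` and the sample turns are assumptions made for this example.

```python
def build_context_message(turns, turn_complete=True):
    """Build a clientContent message from (role, text) pairs."""
    content_turns = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        content_turns.append({"role": role, "parts": [{"text": text}]})
    return {"clientContent": {"turns": content_turns, "turnComplete": turn_complete}}

# Short contexts can be replayed turn by turn.
context_message = build_context_message([
    ("user", "What is the capital of France?"),
    ("model", "Paris."),
])
```

For a longer history, the same helper can be called with a single turn that contains a summary instead of the full transcript.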

# Streaming audio and video

# Function calling
All functions must be declared at the start of the session by sending tool definitions as part of the BidiGenerateContentSetup message.

See the Function calling tutorial to learn more about function calling.

From a single prompt, the model can generate multiple function calls and the code necessary to chain their outputs. This code executes in a sandbox environment, generating subsequent BidiGenerateContentToolCall messages. The execution pauses until the results of each function call are available, which ensures sequential processing.

The client should respond with BidiGenerateContentToolResponse.
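
The sketch below shows one way a client might turn a tool call into a tool response. The payload field names (`functionCalls`, `functionResponses`, `id`, `name`, `args`, `response`) are assumptions that this guide does not spell out, and `get_weather` is a hypothetical local function. Check the BidiGenerateContentToolCall and BidiGenerateContentToolResponse references before relying on them.

```python
def get_weather(city):
    """Hypothetical local function the model is allowed to call."""
    return {"city": city, "forecast": "sunny"}

LOCAL_FUNCTIONS = {"get_weather": get_weather}

def build_tool_response(tool_call):
    """Run each requested function and echo its ID back in the response."""
    responses = []
    for call in tool_call.get("functionCalls", []):
        function = LOCAL_FUNCTIONS.get(call["name"])
        if function is None:
            result = {"error": f"unknown function: {call['name']}"}
        else:
            result = function(**call.get("args", {}))
        responses.append({"id": call["id"], "name": call["name"], "response": result})
    return {"toolResponse": {"functionResponses": responses}}

# Example tool call with made-up values.
sample_call = {"functionCalls": [{"id": "call-1", "name": "get_weather", "args": {"city": "Oslo"}}]}
tool_response = build_tool_response(sample_call)
```

Echoing the ID of each call lets the server match every response to the request that produced it.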

Audio inputs and audio outputs negatively impact the model's ability to use function calling.

# Audio formats
Multimodal Live API supports the following audio formats:

*   Input audio format: Raw 16-bit PCM audio at 16 kHz, little-endian
*   Output audio format: Raw 16-bit PCM audio at 24 kHz, little-endian
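
To make these formats concrete, the following sketch computes chunk sizes and converts floating-point samples to 16-bit little-endian PCM using only the standard library. It assumes mono audio, which the list above does not state, and the helper names are hypothetical.

```python
import struct

INPUT_SAMPLE_RATE_HZ = 16_000   # input rate from the list above
OUTPUT_SAMPLE_RATE_HZ = 24_000  # output rate from the list above
BYTES_PER_SAMPLE = 2            # 16-bit samples

def chunk_size_bytes(sample_rate_hz, duration_ms, channels=1):
    """Number of bytes in a PCM chunk of the given duration (mono by default)."""
    return sample_rate_hz * duration_ms // 1000 * BYTES_PER_SAMPLE * channels

def floats_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = [max(-1.0, min(1.0, sample)) for sample in samples]
    ints = [int(round(sample * 32767)) for sample in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)

input_chunk_bytes = chunk_size_bytes(INPUT_SAMPLE_RATE_HZ, 100)    # 100 ms of input audio
output_chunk_bytes = chunk_size_bytes(OUTPUT_SAMPLE_RATE_HZ, 100)  # 100 ms of output audio
pcm = floats_to_pcm16le([0.0, 1.0, -1.0])
```

At these rates, 100 ms of mono audio is 3,200 bytes on input and 4,800 bytes on output.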


# System instructions
You can provide system instructions to better control the model's output and specify the tone and sentiment of audio responses.

System instructions are added to the prompt before the interaction begins and remain in effect for the entire session.

System instructions can only be set at the beginning of a session, immediately following the initial connection. To provide further input to the model during the session, use incremental content updates.

# Interruptions
Users can interrupt the model's output at any time. When voice activity detection (VAD) detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a BidiGenerateContentServerContent message to report the interruption.

In addition, the Gemini server discards any pending function calls and sends a BidiGenerateContentToolCallCancellation message with the IDs of the canceled calls.
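
On the client, an interruption usually means dropping audio that has been received but not yet played. The sketch below illustrates that idea. It assumes the interruption is reported through an `interrupted` flag inside `serverContent`, and that canceled call IDs arrive in an `ids` list inside a `toolCallCancellation` message. Both field names are assumptions not confirmed by this guide, and the `PlaybackState` class is hypothetical.

```python
from collections import deque

class PlaybackState:
    """Tracks queued audio and in-flight function calls on the client."""

    def __init__(self):
        self.audio_queue = deque()  # audio chunks waiting to be played
        self.pending_calls = {}     # function call ID -> call details

    def handle_server_message(self, message):
        content = message.get("serverContent")
        if content and content.get("interrupted"):
            # Stop playback: discard audio the user has not heard yet.
            self.audio_queue.clear()
            return "interrupted"
        cancellation = message.get("toolCallCancellation")
        if cancellation:
            # Forget function calls the server no longer wants results for.
            for call_id in cancellation.get("ids", []):
                self.pending_calls.pop(call_id, None)
            return "cancelled"
        return "ignored"

state = PlaybackState()
state.audio_queue.extend([b"chunk-1", b"chunk-2"])
state.pending_calls["call-1"] = {"name": "example"}
first = state.handle_server_message({"serverContent": {"interrupted": True}})
second = state.handle_server_message({"toolCallCancellation": {"ids": ["call-1"]}})
```

Clearing the queue immediately keeps the client from talking over the user after the server has already discarded the rest of the response.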

# Voices
Multimodal Live API supports the following voices: Aoede, Charon, Fenrir, Kore, and Puck.

To specify a voice, set the voiceName within the speechConfig object, as part of your session configuration.

See the following JSON representation of a speechConfig object:

In [None]:
{
  "voiceConfig": {
    "prebuiltVoiceConfig": {
      "voiceName": "VOICE_NAME"
    }
  }
}
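
As a small sketch of the structure above, the following helper fills in `voiceName` and rejects names outside the supported list. The function name `build_speech_config` is hypothetical, and `"AUDIO"` is assumed to be the response modality value for spoken output.

```python
SUPPORTED_VOICES = ("Aoede", "Charon", "Fenrir", "Kore", "Puck")

def build_speech_config(voice_name):
    """Return a speechConfig object for one of the supported voices."""
    if voice_name not in SUPPORTED_VOICES:
        raise ValueError(f"unsupported voice: {voice_name!r}")
    return {"voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}}

# speechConfig sits inside generationConfig in the session configuration.
generation_config = {
    "responseModalities": ["AUDIO"],  # assumed value for audio output
    "speechConfig": build_speech_config("Kore"),
}
```

Validating the name on the client gives a clear error before the session is opened.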


# Limitations
Consider the following limitations of Multimodal Live API and Gemini 2.0 when you plan your project.

# Client authentication
Multimodal Live API only provides server-to-server authentication and isn't recommended for direct client use. Client input should be routed through an intermediate application server for secure authentication with the Multimodal Live API.

For web and mobile app deployments, you can explore options from:

*   Daily
*   LiveKit




# Conversation history
While the model keeps track of in-session interactions, conversation history isn't stored. When a session ends, the corresponding context is erased.

In order to restore a previous session or provide the model with historic context of user interactions, the application should maintain its own conversation log and use a BidiGenerateContentClientContent message to send this information at the start of a new session.
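
One possible shape for such an application-side log is sketched below. The `ConversationLog` class is hypothetical. It stores turns as plain dicts, serializes them to JSON for storage, and produces a `clientContent` message in the format shown under "Incremental content updates".

```python
import json

class ConversationLog:
    """Application-side record of a conversation that outlives a session."""

    def __init__(self, turns=None):
        self.turns = list(turns or [])

    def record(self, role, text):
        self.turns.append({"role": role, "text": text})

    def dumps(self):
        """Serialize the log so it can be stored between sessions."""
        return json.dumps(self.turns)

    @classmethod
    def loads(cls, data):
        return cls(json.loads(data))

    def to_restore_message(self, turn_complete=True):
        """Build the message to send at the start of a new session."""
        turns = [{"role": t["role"], "parts": [{"text": t["text"]}]} for t in self.turns]
        return {"clientContent": {"turns": turns, "turnComplete": turn_complete}}

log = ConversationLog()
log.record("user", "My name is Ada.")
log.record("model", "Nice to meet you, Ada.")
restored = ConversationLog.loads(log.dumps())  # simulate a save and reload
restore_message = restored.to_restore_message()
```

Where the serialized log is stored (file, database, cache) is left to the application.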

# Maximum session duration
Session duration is limited to 15 minutes for audio-only sessions and to 2 minutes for sessions with audio and video. When the session duration exceeds the limit, the connection is terminated.

The model is also limited by the context size. Sending large chunks of content alongside the video and audio streams may result in earlier session termination.
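
A client can track elapsed time to wind a session down before the server closes it. The sketch below is a minimal timer built on the limits stated above. The `SessionTimer` class and its 30-second default margin are assumptions made for illustration.

```python
import time

AUDIO_ONLY_LIMIT_S = 15 * 60   # audio sessions: up to 15 minutes
AUDIO_VIDEO_LIMIT_S = 2 * 60   # audio and video sessions: up to 2 minutes

class SessionTimer:
    """Tracks how much of the session duration limit remains."""

    def __init__(self, uses_video, clock=time.monotonic):
        self.limit_s = AUDIO_VIDEO_LIMIT_S if uses_video else AUDIO_ONLY_LIMIT_S
        self._clock = clock
        self._started_at = clock()

    def remaining_s(self):
        elapsed = self._clock() - self._started_at
        return max(0.0, self.limit_s - elapsed)

    def should_wrap_up(self, margin_s=30.0):
        """True when less than margin_s seconds remain."""
        return self.remaining_s() <= margin_s

timer = SessionTimer(uses_video=False)  # starts counting when the session opens
```

Passing the clock in as a parameter keeps the timer easy to test without waiting in real time.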

# Voice activity detection (VAD)
The model automatically performs voice activity detection (VAD) on a continuous audio input stream. VAD is always enabled, and its parameters aren't configurable.

# Token count
Token count isn't supported.

# Rate limits
The following rate limits apply:

*   3 concurrent sessions per API key
*   4M tokens per minute
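
To stay within the concurrent session limit, an application can gate session start-up behind a semaphore. The sketch below uses `asyncio.Semaphore` for this. The helper `run_limited` is hypothetical and takes zero-argument callables that each return a coroutine running one session.

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 3  # limit stated above, per API key

async def run_limited(session_factories, limit=MAX_CONCURRENT_SESSIONS):
    """Run all sessions, with at most `limit` of them active at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:
            # The session coroutine is created only once a slot is free.
            return await factory()

    return await asyncio.gather(*(guarded(factory) for factory in session_factories))
```

This caps concurrency within a single process only. If several processes share one API key, they need a shared limiter.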