##### Copyright 2024 Google LLC.

In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemini API: Audio Quickstart

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Audio.ipynb"><img src="../images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

This notebook provides an example of how to prompt Gemini 1.5 Flash using an audio file. In this case, you'll use a [sound recording](https://www.jfklibrary.org/asset-viewer/archives/jfkwha-006) of President John F. Kennedy’s 1961 State of the Union address.

### Install dependencies

In [2]:
!pip install -q -U "google-genai>=0.0.1"


In [3]:
from google import genai
from google.genai import types

from IPython import display

### Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. Outside Colab, the cell falls back to the `GOOGLE_API_KEY` environment variable. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [4]:
try:
    from google.colab import userdata

    GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
except ImportError:
    import os

    GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
client = genai.Client(api_key=GOOGLE_API_KEY)

## Upload an audio file with the File API

To use an audio file in your prompt, you must first upload it using the [File API](../quickstarts/File_API.ipynb).


In [5]:
URL = "https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3"

In [6]:
!curl -q $URL -o sample.mp3

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.8M  100 39.8M    0     0  37.0M      0  0:00:01  0:00:01 --:--:-- 37.1M
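
Outside a notebook, where the `!curl` shell escape is unavailable, the same download can be sketched with Python's standard library. The helper name `download_file` is hypothetical and not part of the Gemini SDK; it simply streams the response body to disk.

```python
import shutil
import urllib.request


def download_file(url: str, dest: str) -> str:
    """Stream the resource at `url` to the local path `dest` and return `dest`.

    Hypothetical helper for illustration; not part of the google-genai SDK.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so a ~40 MB file is never held in memory at once.
        shutil.copyfileobj(response, out)
    return dest
```

For example, `download_file(URL, "sample.mp3")` would stand in for the `curl` cell above.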


In [7]:
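# Upload the local audio file with the File API.
your_file = client.files.upload(path="sample.mp3")

# Wrap the uploaded file's URI and MIME type in a Part so it can be passed in `contents`.
your_file = types.Part.from_uri(
    file_uri=your_file.uri, mime_type=your_file.mime_type
)  # TODO delete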

## Use the file in your prompt

In [8]:
prompt = "Listen carefully to the following audio file. Provide a brief summary."
response = client.models.generate_content(
    model="models/gemini-1.5-flash", contents=[prompt, your_file]
)
display.Markdown(response.text)

I'm sorry, I can't play audio. 

## Inline Audio

For small requests, you can inline the audio data directly in the request, as you can with images. Use PyDub to trim the audio down to its first 10 seconds:

In [9]:
!pip install -Uq pydub


In [10]:
from pydub import AudioSegment

In [11]:
sound = AudioSegment.from_mp3("sample.mp3")

In [12]:
sound[:10000]  # slices are in ms

Add it to the list of parts in the prompt:

In [13]:
response = client.models.generate_content(
    model="models/gemini-1.5-flash",
    contents=[
        "Please transcribe this recording:",
        types.Part.from_bytes(
            mime_type="audio/mp3",
            data=sound[:10000].export().read(),  # TODO: Part.from_file
        ),
    ],
)

In [14]:
display.Markdown(response.text)

The President's State of the Union address to a joint session of the Congress from the rostrum of the House of Representatives while 
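
Whether a request counts as "small" depends on its total size. As a rough sketch, the check below picks between inlining and the File API based on the size of the audio file on disk. The 20 MB threshold is an assumption based on the request-size limit described in the Gemini API documentation, not something stated in this notebook, so verify it against the current docs. The helper name `should_inline` is hypothetical.

```python
import os

# Assumed limit on total request size for inline data; check the current
# Gemini API documentation, since this value is not taken from the notebook.
ASSUMED_INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def should_inline(path: str, limit_bytes: int = ASSUMED_INLINE_LIMIT_BYTES) -> bool:
    """Return True if the file at `path` is smaller than the inline limit.

    Hypothetical helper: it only looks at the file size, so leave headroom
    for the prompt text and request encoding overhead.
    """
    return os.path.getsize(path) < limit_bytes
```

Under that assumption, the full `sample.mp3` (about 39.8 MB according to the download output) would go through the File API, while a 10-second clip is small enough to inline.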

## Count audio tokens

To count the number of tokens in your audio file, pass it to `count_tokens`:

In [15]:
client.models.count_tokens(model="models/gemini-1.5-flash", contents=[your_file])

CountTokensResponse(total_tokens=83553, cached_content_token_count=None)
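
As a sanity check on that number, the Gemini API documentation states that audio is represented at 32 tokens per second. Assuming that rate (it comes from the docs, not from this notebook), the token count can be converted back into an approximate duration. The function name `approx_duration` is hypothetical.

```python
# Assumed rate from the Gemini API documentation: 32 tokens per second of audio.
TOKENS_PER_SECOND = 32


def approx_duration(total_tokens: int) -> tuple[int, int]:
    """Convert an audio token count to an approximate (minutes, seconds)."""
    seconds = round(total_tokens / TOKENS_PER_SECOND)
    return divmod(seconds, 60)


minutes, seconds = approx_duration(83553)
print(f"{minutes} min {seconds} s")  # 43 min 31 s
```

That works out to roughly 43 and a half minutes of audio for the uploaded file.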

## Next Steps
### Useful API references:

For more details about the Gemini API's [audio capabilities](https://ai.google.dev/gemini-api/docs/audio), see the documentation.

To learn more about the File API, check its [API reference](https://ai.google.dev/api/files) or the [File API](https://github.com/google-gemini/cookbook/blob/main/quickstarts/File_API.ipynb) quickstart.

### Related examples

For more ideas on what the Gemini API can do with audio files, check out this example:
* Share [Voice memos](https://github.com/google-gemini/cookbook/blob/main/examples/Voice_memos.ipynb) with Gemini API and brainstorm ideas

### Continue your discovery of the Gemini API

Have a look at the [Audio](../quickstarts/Audio.ipynb) quickstart to learn about another type of media file, then learn more about [prompting with media files](https://ai.google.dev/tutorials/prompting_with_media) in the docs, including the supported formats and maximum length for audio files.
