<a href="https://colab.research.google.com/github/ParasBhardava/DSA/blob/main/get_started_with_gemini_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemini API: Gemini Text-to-speech


The Gemini API can transform text input into single-speaker or multi-speaker audio (a podcast-like experience, as in [NotebookLM](https://notebooklm.google.com/)). This notebook provides an example of how to control the *Text-to-speech* (TTS) capability of the Gemini model and guide its style, accent, pace, and tone.

Before diving into the code, you should try this capability on [AI Studio](https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-tts).

**Note that the TTS model can only do TTS. It does not have the reasoning capabilities of the Gemini models, so you can ask for things like "say this in that style", but not "tell me why the sky is blue".** If that's what you want, you should use the [Live API](./Get_started_LiveAPI.ipynb) instead.

The [documentation](https://ai.google.dev/gemini-api/docs/audio-generation) is also a good place to start discovering the TTS capability.

## Setup

### Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication ![image](https://storage.googleapis.com/generativeai-downloads/images/colab_icon16.png)](../quickstarts/Authentication.ipynb) for an example.

In [None]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

### Install and initialize the SDK


In [None]:
!pip install -U -q "google-genai>=1.16.0" # 1.16 is needed for multi-speaker audio


In [None]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

### Select a model

Audio-out is only supported by the "`tts`" models, `gemini-2.5-flash-preview-tts` and `gemini-2.5-pro-preview-tts`.

For more information about all Gemini models, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).


In [None]:
MODEL_ID = "gemini-2.5-flash-preview-tts" # @param ["gemini-2.5-flash-preview-tts","gemini-2.5-pro-preview-tts"] {"allow-input":true, isTemplate: true}

Next, create helper functions to save the model's audio output and play it back in the notebook:

In [None]:
# @title Helper functions (just run that cell)

import contextlib
import wave
from IPython.display import Audio

file_index = 0

@contextlib.contextmanager
def wave_file(filename, channels=1, rate=24000, sample_width=2):
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        yield wf

def play_audio_blob(blob):
    global file_index
    file_index += 1

    fname = f'audio_{file_index}.wav'
    with wave_file(fname) as wav:
        wav.writeframes(blob.data)

    return Audio(fname, autoplay=True)

def play_audio(response):
    return play_audio_blob(response.candidates[0].content.parts[0].inline_data)
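
The defaults in `wave_file` (one channel, 24000 Hz, 2-byte samples) also determine how long a raw PCM payload plays. The following is a minimal sketch of that arithmetic, assuming the blob really is 16-bit mono PCM at 24 kHz. `pcm_duration_seconds` is a hypothetical helper, not part of the SDK.

```python
def pcm_duration_seconds(num_bytes, channels=1, rate=24000, sample_width=2):
    """Return the playback length in seconds of a raw PCM payload."""
    # One frame holds one sample per channel.
    bytes_per_frame = channels * sample_width
    if num_bytes % bytes_per_frame:
        raise ValueError("payload does not contain a whole number of frames")
    return num_bytes / bytes_per_frame / rate


print(pcm_duration_seconds(96000))  # 2.0 seconds at the defaults
```

With a real response you could call it as `pcm_duration_seconds(len(blob.data))`, under the same assumptions about the audio format.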

## Generate a simple audio output

Let's start with something simple:

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="Say 'hello, my name is Gemini!'",
  config={"response_modalities": ['Audio']},
)

The generated output is in the response's `inline_data` and, as you can see, it is indeed audio data.

In [None]:
blob = response.candidates[0].content.parts[0].inline_data
print(blob.mime_type)
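
The MIME type may carry the audio parameters as well. Here is a sketch of reading the sample rate from it, assuming a value shaped like `audio/L16;codec=pcm;rate=24000`. That shape is an assumption, so compare it with what the cell above prints. `parse_audio_mime_type` is a hypothetical helper.

```python
def parse_audio_mime_type(mime_type, default_rate=24000):
    """Split a MIME type such as 'audio/L16;codec=pcm;rate=24000' into parts."""
    base, *params = [piece.strip() for piece in mime_type.split(";")]
    options = {}
    for param in params:
        key, sep, value = param.partition("=")
        if sep:  # keep only well-formed key=value parameters
            options[key.strip().lower()] = value.strip()
    # Fall back to the helper's default rate when none is declared.
    rate = int(options.get("rate", default_rate))
    return {"format": base, "rate": rate, "options": options}


info = parse_audio_mime_type("audio/L16;codec=pcm;rate=24000")
print(info["format"], info["rate"])
```

The parsed rate could then be passed to `wave_file(fname, rate=info["rate"])` instead of relying on the hard-coded default.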

To listen to the generated audio in Colab, use the helper function to write the output to a file and play it.

In [None]:
play_audio_blob(blob)

Note that the model can only do TTS, so you should always tell it to "say", "read", "TTS" something, otherwise it won't do anything.

## Control how the model speaks

There are 30 built-in voices and 24 supported languages, which gives you plenty of combinations to try.

### Choose a voice

Choose a voice among the 30 different ones. You can find their characteristics in the [documentation](https://ai.google.dev/gemini-api/docs/speech-generation#voices).

In [None]:
voice_name = "Sadaltager" # @param ["Zephyr", "Puck", "Charon", "Kore", "Fenrir", "Leda", "Orus", "Aoede", "Callirhoe", "Autonoe", "Enceladus", "Iapetus", "Umbriel", "Algieba", "Despina", "Erinome", "Algenib", "Rasalgethi", "Laomedeia", "Achernar", "Alnilam", "Schedar", "Gacrux", "Pulcherrima", "Achird", "Zubenelgenubi", "Vindemiatrix", "Sadachbia", "Sadaltager", "Sulafat"]

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="""Say "I am a very knowledgeable model, especially when using grounding", wait 5 seconds then say "Don't you think?".""",
  config={
      "response_modalities": ['Audio'],
      "speech_config": {
          "voice_config": {
              "prebuilt_voice_config": {
                  "voice_name": voice_name
              }
          }
      }
  },
)

play_audio(response)
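
The nested config dictionary is easy to mistype, so it can help to build it in one place. This is a minimal sketch that reproduces the same dictionary shape used in the cell above. `single_voice_config` is a hypothetical helper name, and the function does not check that the voice actually exists.

```python
def single_voice_config(voice_name):
    """Build the dict form of a single-speaker audio config."""
    if not voice_name:
        raise ValueError("voice_name must be a non-empty string")
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {"voice_name": voice_name}
            }
        },
    }


config = single_voice_config("Kore")
print(config["speech_config"]["voice_config"]["prebuilt_voice_config"])
```

The result can be passed as `config=single_voice_config(voice_name)` to `client.models.generate_content`.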

### Change the language

Just tell the model to speak in a certain language and it will. The [documentation](https://ai.google.dev/gemini-api/docs/speech-generation#languages) lists all the supported ones.

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="""
    Read this in French:

    Les chaussettes de l'archiduchesse sont-elles sèches ? Archi-sèches ?
    Un chasseur sachant chasser doit savoir chasser sans son chien.
  """,
  config={"response_modalities": ['Audio']},
)

play_audio(response)

### Prompt the model to speak in certain ways

You can control style, tone, accent, and pace using natural language prompts, for example:

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="""
    Say in a spooky whisper:
    "By the pricking of my thumbs...
    Something wicked this way comes!"
  """,
  config={"response_modalities": ['Audio']},
)

play_audio(response)

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="""
    Read this disclaimer in as fast a voice as possible while remaining intelligible:

    [The author] assumes no responsibility or liability for any errors or omissions in the content of this site.
    The information contained in this site is provided on an 'as is' basis with no guarantees of completeness, accuracy, usefulness or timeliness
  """,
  config={"response_modalities": ['Audio']},
)

play_audio(response)
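
Since the model needs an explicit speaking instruction, the style prompts above all follow the same pattern: a verb, an optional style, then the text. Here is a sketch of a small prompt builder for that pattern. `build_tts_prompt` and its parameters are hypothetical, and the exact wording that works best is something to experiment with.

```python
import textwrap


def build_tts_prompt(text, style=None, verb="Say"):
    """Prefix text with an explicit speaking instruction and optional style."""
    instruction = f"{verb} {style}:" if style else f"{verb}:"
    # Remove the indentation that triple-quoted strings pick up in code.
    return f"{instruction}\n{textwrap.dedent(text).strip()}"


prompt = build_tts_prompt(
    """
    "By the pricking of my thumbs...
    Something wicked this way comes!"
    """,
    style="in a spooky whisper",
)
print(prompt)
```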

## Multi-speaker

The TTS model can also read discussions between two speakers (like the [NotebookLM](https://notebooklm.google.com) podcast feature). You just need to tell it that there are two speakers:

In [None]:
response = client.models.generate_content(
  model=MODEL_ID,
  contents="""
    Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy:

    Speaker1: So... what's on the agenda today?
    Speaker2: You're never going to guess!
  """,
  config={"response_modalities": ['Audio']},
)

play_audio(response)
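
When the dialogue comes from data rather than a hand-written string, the `Speaker: line` layout used above can be generated. This is a minimal sketch. `format_dialogue` is a hypothetical helper, and the layout simply mirrors the prompt in the previous cell.

```python
def format_dialogue(turns, direction=None):
    """Render (speaker, line) pairs as 'Speaker: line' rows for a TTS prompt."""
    rows = [f"{speaker}: {line}" for speaker, line in turns]
    # Optional style direction goes first, followed by a blank line.
    header = [direction.rstrip(":") + ":", ""] if direction else []
    return "\n".join(header + rows)


prompt = format_dialogue(
    [
        ("Speaker1", "So... what's on the agenda today?"),
        ("Speaker2", "You're never going to guess!"),
    ],
    direction="Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy",
)
print(prompt)
```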

You can also select a voice for each participant and pass their names to the model.

But first let's generate a discussion between two scientists:

In [None]:
transcript = client.models.generate_content(
    model='gemini-2.5-flash',
    contents="""
      Hi, please generate a short (like 100 words) transcript that reads like
      it was clipped from a podcast by excited herpetologists, Dr. Claire and
      her assistant, the young Aurora.
    """
  ).text

print(transcript)

Then let's have the TTS model render the conversation using the voices you want.

In [None]:
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
                types.SpeakerVoiceConfig(
                    speaker='Dr. Claire',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Sulafat',
                        )
                    )
                ),
                types.SpeakerVoiceConfig(
                    speaker='Aurora',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Leda',
                        )
                    )
                ),
            ]
        )
    )
)

response = client.models.generate_content(
  model=MODEL_ID,
  contents="TTS the following conversation between a very excited Dr. Claire and her assistant, the young Aurora: "+transcript,
  config=config,
)

play_audio(response)
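
The typed configuration above is verbose, and the examples that follow repeat it with different names. Here is a sketch of a builder that produces the same structure as a plain dictionary from a `{speaker: voice_name}` mapping. It assumes the SDK accepts the dict form for multi-speaker settings the same way it accepts the single-voice dict used earlier, which is worth verifying. `multi_speaker_config` is a hypothetical helper.

```python
def multi_speaker_config(voices):
    """Build the dict form of a multi-speaker config from {speaker: voice_name}."""
    if len(voices) != 2:
        # Assumption: the notebook only describes two-speaker dialogues.
        raise ValueError("expected exactly two speakers")
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "multi_speaker_voice_config": {
                "speaker_voice_configs": [
                    {
                        "speaker": speaker,
                        "voice_config": {
                            "prebuilt_voice_config": {"voice_name": voice}
                        },
                    }
                    for speaker, voice in voices.items()
                ]
            }
        },
    }


config_dict = multi_speaker_config({"Dr. Claire": "Sulafat", "Aurora": "Leda"})
speakers = config_dict["speech_config"]["multi_speaker_voice_config"]["speaker_voice_configs"]
print([entry["speaker"] for entry in speakers])
```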

## 🎥 Example 1: Epic Movie Trailer Voice

Let's create a dramatic movie trailer announcer voice - the kind you hear in blockbuster previews!

In [None]:
# Epic Movie Trailer Voice
response = client.models.generate_content(
    model=MODEL_ID,
    contents="""
        Say this in a deep, dramatic movie trailer voice with epic pauses:

        "In a world... where artificial intelligence has changed everything...
        One API... will transform how you create audio content...
        Gemini Text-to-Speech...
        Coming to a developer near you...
        This summer."
    """,
    config={"response_modalities": ['Audio']},
)
play_audio(response)

## 🎙️ Example 2: Podcast Debate

Let's create a fun podcast-style debate between two speakers with contrasting personalities - one enthusiastic, one skeptical!

In [None]:
# Podcast Debate - Multi-speaker with different personalities
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
                types.SpeakerVoiceConfig(
                    speaker='Alex',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Puck'  # Enthusiastic voice
                        )
                    )
                ),
                types.SpeakerVoiceConfig(
                    speaker='Morgan',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Fenrir'  # Skeptical voice
                        )
                    )
                )
            ]
        )
    )
)
response = client.models.generate_content(
    model=MODEL_ID,
    contents="""
        TTS this podcast debate. Make Alex super enthusiastic and excited,
        and Morgan sound skeptical and unimpressed:

            Alex: Oh my gosh, have you TRIED the new AI text-to-speech? It's AMAZING!
            Morgan: Meh. I've heard TTS before. They all sound robotic.
            Alex: No no no, this one is different! It can do emotions, accents, multiple speakers!
            Morgan: Sure... that's what they always say.
            Alex: Listen to THIS! It can even whisper and shout!
            Morgan: Okay, I'll admit... that's actually pretty impressive.
            Alex: I TOLD you! The future is HERE!
        """,
    config=config,
)

play_audio(response)
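
The speaker labels in the prompt are meant to line up with the `speaker` names in the config. A quick check before sending the request can catch a typo in either place. This is a sketch only: `unconfigured_speakers` is a hypothetical helper, and it assumes each turn is written as `Name: text` on its own line.

```python
import re


def unconfigured_speakers(transcript, configured):
    """Return speaker labels found in the transcript but missing from the config."""
    # A label is the text before the first colon on a line, followed by spoken text.
    labels = re.findall(r"^[ \t]*([^:\n]+):[ \t]+\S", transcript, flags=re.MULTILINE)
    missing = []
    for label in labels:
        name = label.strip()
        if name not in configured and name not in missing:
            missing.append(name)
    return missing


script = """Make Alex excited and Morgan skeptical:

    Alex: Have you tried it?
    Morgan: Meh.
    Sam: Who invited me?
"""
print(unconfigured_speakers(script, {"Alex", "Morgan"}))  # ['Sam']
```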

## 😃 Example 3: Comedy Show with Different Voice Personalities


In [None]:
# Example 3: Comedy Show with Different Voice Personalities
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
                types.SpeakerVoiceConfig(
                    speaker='Comedian',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Algieba'
                        )
                    )
                ),
                types.SpeakerVoiceConfig(
                    speaker='Audience',
                    voice_config=types.VoiceConfig(
                        prebuilt_voice_config=types.PrebuiltVoiceConfig(
                            voice_name='Enceladus'
                        )
                    )
                )
            ]
        )
    )
)

response = client.models.generate_content(
    model=MODEL_ID,
    contents="""TTS this comedy show. Make the comedian funny and dramatic, audience excited:

    Comedian: So I tried to learn Python programming...
    Audience: How'd that go?
    Comedian: Let's just say I got 404 errors and 500 problems!
    Audience: HAHAHAHA!
    Comedian: My code was so broken, it needed a life coach!
    """,
    config=config,
)
play_audio(response)

## 🛸 Example 4: Aliens Visiting Earth


In [None]:
speaker_configs = [
    types.SpeakerVoiceConfig(
        speaker='Alien1',
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
        ),
    ),
    types.SpeakerVoiceConfig(
        speaker='Alien2',
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Umbriel')
        ),
    ),
]
cfg = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=speaker_configs
        )
    ),
)
resp = client.models.generate_content(
    model=MODEL_ID,
    contents=(
        "Alien1: Greetings, Earth creature! We come in peace! "
        "Alien2: Indeed! Your pizza is fascinating! "
        "Alien1: Yes, we will take your pizza technology back to our planet!"
    ),
    config=cfg,
)
play_audio(resp)

## 🤖 Example 5: Funny robot commentary - shows voice selection flexibility


In [None]:
r = client.models.generate_content(
    model=MODEL_ID,
    contents=(
        "Robot1 sounds mechanical. Robot2 sounds playful. "
        "Robot1: This pizza is inefficient. Robot2: But tasty! "
        "Robot1: Indeed. We shall study it."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker='Robot1',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Algieba')
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker='Robot2',
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Puck')
                        ),
                    ),
                ]
            )
        ),
    ),
)
play_audio(r)

In [None]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents="""
    # AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline,
but inside, it is blindingly bright. The red "ON AIR" tally light is blazing.
Jaz is standing up, not sitting, bouncing on the balls of their heels to the
rhythm of a thumping backing track. Their hands fly across the faders on a
massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake
up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is
always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated
vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks
with a "bouncing" cadence. High-speed delivery with fluid transitions: no dead
air, no gaps.

Add raining sounds in the background!

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any
script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
Yes, massive vibes in the studio! You are locked in and it is absolutely
popping off in London right now. If you're stuck on the tube, or just sat
there pretending to work... stop it. Seriously, I see you. Turn this up!
We've got the project roadmap landing in three, two... let's go!
    """,
    config={"response_modalities": ['Audio']},
)
play_audio(response)

In [None]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents="""
    Read this disclaimer in as fast a voice as possible while remaining intelligible, keep the voice deep:

    Someone has to pay for the bills, so let's hear from our sponsor!""",
    config={
        "response_modalities": ['Audio'],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {
                    "voice_name": "Enceladus"
                }
            }
        }
    },
)
play_audio(response)