# Text-to-Audio

Working with audio using AI has never been easier!
Spring-AI provides a clean,
Kotlin-friendly way to both transcribe audio to text and convert text to speech.
In this tutorial,
we'll explore both capabilities using OpenAI's audio models through Spring-AI's intuitive API.

Let's start by adding the necessary dependency:

```
%useLatestDescriptors
%use spring-ai-openai
```

Next, we'll set up our API key for authentication:

```kotlin
val apiKey = System.getenv("OPENAI_API_KEY") ?: "YOUR_OPENAI_API_KEY"
```
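The `?:` fallback lets the cell succeed even when `OPENAI_API_KEY` is unset; the placeholder string is then sent to OpenAI and the first request fails with an authentication error. A minimal fail-fast sketch follows; `resolveApiKey` is a hypothetical helper, not part of Spring-AI:

```kotlin
// Hypothetical helper (not part of Spring-AI): read the API key from an
// environment map and fail fast instead of falling back to a placeholder.
fun resolveApiKey(
    env: Map<String, String> = System.getenv(),
    name: String = "OPENAI_API_KEY"
): String {
    val key = env[name]?.trim().orEmpty()
    require(key.isNotEmpty()) {
        "Environment variable $name is not set; export it before running this notebook."
    }
    return key
}
```

With this helper the cell above would become `val apiKey = resolveApiKey()`. Passing the environment in as a parameter keeps the function testable without touching real environment variables.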

## Audio Transcription

First, let's look at converting audio to text (transcription). This is perfect for creating subtitles, transcribing meetings, or processing voice commands:

```kotlin
import org.springframework.core.io.FileSystemResource

// Set up the OpenAI Audio API
val openAiAudioApi = OpenAiAudioApi.builder().apiKey(apiKey).build()

// Create our transcription model
val openAiAudioTranscriptionModel = OpenAiAudioTranscriptionModel(openAiAudioApi)

// Configure transcription options
val transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
    .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT)
    .temperature(0f) // Temperature 0 gives the most deterministic results
    .build()

// Load our audio file (this example assumes a WAV file of the Harvard sentences)
val audioFile = FileSystemResource("data/harvard.wav")

// Create and execute our transcription request
val transcriptionRequest = AudioTranscriptionPrompt(audioFile, transcriptionOptions)
val response = openAiAudioTranscriptionModel.call(transcriptionRequest)
```

Now let's see the transcription result:

```kotlin
response.result.output
```

```
The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.
```
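Before transcribing your own recordings, it can help to check the file locally instead of waiting for the API to reject it. The sketch below assumes the limits given in OpenAI's speech-to-text documentation at the time of writing (uploads up to 25 MB, treated here as 25 × 1024 × 1024 bytes, in formats such as FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WebM); verify them against the current docs. `checkAudioForTranscription` is a hypothetical helper:

```kotlin
import java.io.File

// Assumed limits, taken from OpenAI's speech-to-text docs; verify before relying on them.
val MAX_TRANSCRIPTION_BYTES = 25L * 1024 * 1024
val SUPPORTED_AUDIO_EXTENSIONS =
    setOf("flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm")

// Hypothetical pre-flight check: returns a list of problems (empty if none were found).
fun checkAudioForTranscription(file: File): List<String> {
    if (!file.isFile) return listOf("File not found: ${file.path}")

    val problems = mutableListOf<String>()
    val ext = file.extension.lowercase()
    if (ext !in SUPPORTED_AUDIO_EXTENSIONS) {
        problems += "Unsupported extension '.$ext'"
    }
    if (file.length() > MAX_TRANSCRIPTION_BYTES) {
        problems += "File is ${file.length()} bytes, above the assumed 25 MB limit"
    }
    return problems
}
```

In the notebook you could call `checkAudioForTranscription(audioFile.file)` before building the `AudioTranscriptionPrompt`; an empty list means no problems were found.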


## Text-to-Speech Generation

Next, let's explore the reverse: converting text to speech.
This is great for creating voiceovers,
accessibility features, or interactive voice applications:

```kotlin
import org.springframework.ai.audio.tts.TextToSpeechPrompt

// Create our speech model
val openAiAudioSpeechModel = OpenAiAudioSpeechModel(openAiAudioApi)

// Configure speech options
val speechOptions = OpenAiAudioSpeechOptions.builder()
    .voice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY) // Choose the voice type
    .speed(1.0) // Normal speaking speed
    .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3) // Get MP3 format
    .model(OpenAiAudioApi.TtsModel.TTS_1.value) // Using TTS-1 model
    .build()

// Prepare our text to be converted to speech
val prompt = """
Black holes represent one of the most fascinating phenomena in our universe.
When a massive star dies,
it can collapse under its own gravity to form a singularity - a point where space-time curves infinitely and the laws of physics as we know them cease to function.
What's particularly interesting is the event horizon,
the boundary marking the point of no return. Once anything crosses this threshold,
be it light or matter, it cannot escape the black hole's gravitational pull.
Recent breakthroughs have allowed scientists to actually photograph these cosmic behemoths, confirming theories that have existed for decades.
"""

// Create and execute our speech request
val speechRequest = TextToSpeechPrompt(prompt, speechOptions)
val response = openAiAudioSpeechModel.call(speechRequest)
```
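The black-hole text is short, but OpenAI's speech endpoint caps the input length (4096 characters according to OpenAI's API reference at the time of writing; treat this as an assumption). For longer scripts, one approach is to split the text at sentence boundaries and synthesize each chunk separately. `chunkForTts` is a hypothetical helper sketching that idea:

```kotlin
// Assumed limit: OpenAI's speech endpoint accepts at most 4096 input characters.
val MAX_TTS_CHARS = 4096

// Hypothetical helper: split text into chunks of at most maxChars characters,
// breaking after sentence-ending punctuation (. ! ?) where possible.
fun chunkForTts(text: String, maxChars: Int = MAX_TTS_CHARS): List<String> {
    require(maxChars > 0) { "maxChars must be positive" }
    // Split after ., ! or ? followed by whitespace, keeping the punctuation.
    val sentences = text.trim()
        .split(Regex("(?<=[.!?])\\s+"))
        .filter { it.isNotBlank() }

    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (sentence in sentences) {
        // A single sentence longer than the limit is hard-split into fixed-size pieces.
        val pieces = if (sentence.length > maxChars) sentence.chunked(maxChars) else listOf(sentence)
        for (piece in pieces) {
            // Account for the joining space when the current chunk is not empty.
            val extra = if (current.isEmpty()) piece.length else piece.length + 1
            if (current.length + extra > maxChars) {
                chunks += current.toString()
                current.clear()
            }
            if (current.isNotEmpty()) current.append(' ')
            current.append(piece)
        }
    }
    if (current.isNotEmpty()) chunks += current.toString()
    return chunks
}
```

Each chunk would then be wrapped in its own `TextToSpeechPrompt` and passed to `openAiAudioSpeechModel.call`, yielding one audio clip per chunk.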

Now let's play the generated audio directly in the notebook:

```kotlin
@file:OptIn(ExperimentalEncodingApi::class)

import kotlin.io.encoding.Base64
import kotlin.io.encoding.ExperimentalEncodingApi

// Convert the audio bytes to Base64 for embedding in HTML
val audioBytes = response.result.output
val base64Audio = Base64.encode(audioBytes)

// Create an audio player widget ("audio/mpeg" is the standard MIME type for MP3)
HTML("""
    <audio controls>
        <source src="data:audio/mpeg;base64,$base64Audio" type="audio/mpeg">
        Your browser does not support the audio element.
    </audio>
""")
```

Let's also save the audio to a file:

```kotlin
import java.io.File

val filePath = "data/black_holes.mp3"
File(filePath).writeBytes(audioBytes)

println("Audio saved to $filePath")
```

```
Audio saved to data/black_holes.mp3
```
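To confirm that what landed on disk really is MP3 data (for example after experimenting with other `responseFormat` values), you can inspect its first few bytes. MP3 files typically start either with an ID3v2 tag (the ASCII bytes `ID3`) or directly with an MPEG audio frame header, whose first 11 bits are all set. `looksLikeMp3` is a hypothetical heuristic, not a full validator:

```kotlin
// Hypothetical heuristic: does this byte array start like an MP3 file?
fun looksLikeMp3(bytes: ByteArray): Boolean {
    // Case 1: an ID3v2 tag at the start of the file.
    if (bytes.size >= 3 &&
        bytes[0] == 'I'.code.toByte() &&
        bytes[1] == 'D'.code.toByte() &&
        bytes[2] == '3'.code.toByte()
    ) {
        return true
    }
    // Case 2: an MPEG audio frame header, whose first 11 bits (frame sync) are all 1.
    return bytes.size >= 2 &&
        (bytes[0].toInt() and 0xFF) == 0xFF &&
        (bytes[1].toInt() and 0xE0) == 0xE0
}
```

For the file saved above, the check would be `looksLikeMp3(File(filePath).readBytes())`.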


This notebook shows how little code Spring-AI needs for audio work.
The library builds the HTTP requests to OpenAI and parses the responses for you,
leaving you to focus on the creative aspects of your application.

You can experiment with different voice options
(_ALLOY_, _ECHO_, _FABLE_, _ONYX_, _NOVA_, and _SHIMMER_),
adjust speech speed, and try different text content to see how the model performs.
For transcription, you might try different audio files and languages to test the model's capabilities.
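When adjusting speed, note that OpenAI's text-to-speech API reference documents a range of 0.25 to 4.0, with 1.0 as the default (an assumption to re-check against the current docs). A small hypothetical guard like the one below catches out-of-range values before a request is sent:

```kotlin
// Assumed range from OpenAI's text-to-speech API reference (default 1.0).
val TTS_SPEED_RANGE = 0.25..4.0

// Hypothetical guard: reject out-of-range speeds before building the speech options.
fun validatedSpeed(speed: Double): Double {
    require(speed in TTS_SPEED_RANGE) {
        "speed must be between ${TTS_SPEED_RANGE.start} and ${TTS_SPEED_RANGE.endInclusive}, was $speed"
    }
    return speed
}
```

In the speech cell this would read `.speed(validatedSpeed(1.25))` inside the `OpenAiAudioSpeechOptions` builder.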

Spring-AI's consistent API makes it easy to integrate these audio features into larger applications,
combining them with other AI capabilities like text generation or image creation for truly multimedia experiences.