# SPEECH TO TEXT

**OPENAI AUDIO TRANSCRIPTIONS:** Transcribes audio into the input language.

Args:
1. **file:** The audio file object (not the file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

2. **model:** ID of the model to use. Only whisper-1 (which is powered by OpenAI's open-source Whisper V2 model) is currently available.
3. **language:** The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.

4. **prompt:** An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.

5. **response_format:** The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.

6. **temperature:** The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

7. **timestamp_granularities:** The timestamp granularities to populate for this transcription.
`response_format` must be set to `verbose_json` to use timestamp granularities.

Either or both of these options are supported: `word` or `segment`. Note: There
is no additional latency for segment timestamps, but generating word timestamps
incurs additional latency.

8. **extra_headers:** Send extra headers

9. **extra_query:** Add additional query parameters to the request

10. **extra_body:** Add additional JSON properties to the request

11. **timeout:** Override the client-level default timeout for this request, in seconds

REF: https://platform.openai.com/docs/guides/speech-to-text
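
The constraints in the argument list above can be checked before a request is sent. The following is a minimal sketch, not part of the OpenAI library: `build_transcription_params` is a hypothetical helper that validates the file extension, `response_format`, `temperature`, and the rule that timestamp granularities require `verbose_json`, then returns keyword arguments for `client.audio.transcriptions.create`.

```python
from pathlib import Path

# Allowed values copied from the argument list above.
SUPPORTED_AUDIO_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
SUPPORTED_RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
SUPPORTED_GRANULARITIES = {"word", "segment"}


def build_transcription_params(
    filename,
    response_format="json",
    language=None,
    prompt=None,
    temperature=0.0,
    timestamp_granularities=None,
):
    """Hypothetical helper: validate arguments and return keyword arguments.

    The file object itself is passed separately when making the request.
    """
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(f"Unsupported audio format: {extension!r}")
    if response_format not in SUPPORTED_RESPONSE_FORMATS:
        raise ValueError(f"Unsupported response_format: {response_format!r}")
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")

    params = {
        "model": "whisper-1",
        "response_format": response_format,
        "temperature": temperature,
    }
    if language is not None:
        params["language"] = language  # ISO-639-1 code, e.g. "en"
    if prompt is not None:
        params["prompt"] = prompt
    if timestamp_granularities:
        unknown = set(timestamp_granularities) - SUPPORTED_GRANULARITIES
        if unknown:
            raise ValueError(f"Unsupported granularities: {sorted(unknown)}")
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities requires response_format='verbose_json'")
        params["timestamp_granularities"] = list(timestamp_granularities)
    return params
```

The returned dictionary could then be unpacked into the request, for example `client.audio.transcriptions.create(file=audio_file, **params)`.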


NOTE: Add your OpenAI API key to the Colab notebook's secrets (as `OPENAI_API_KEY`) and upload the audio file to transcribe (`test.mp3`) to the notebook's files.



In [1]:
import openai
from google.colab import userdata
from IPython.display import Audio, display

In [2]:
# Setting up the client
openai_api = userdata.get('OPENAI_API_KEY')
client = openai.Client(api_key=openai_api)

In [3]:
# Open the file in binary mode; the context manager closes it after the request
with open("test.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)  # By default, the response format is json, with the raw text in `.text`.

The fire that warms us can also consume us. It is not the fault of the fire.


**TEXT OUTPUT**

To get the response as plain text, set `response_format` to `text`. The request would look like the following:

In [4]:
with open("test.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",  # return the transcript as a plain string
    )
print(transcript)

The fire that warms us can also consume us. It is not the fault of the fire.



**Timestamps**

The `timestamp_granularities[]` parameter enables a more structured, timestamped JSON output, with timestamps at the segment level, the word level, or both.

*This enables word-level precision for transcripts and video edits, so that specific frames tied to individual words can be removed.*

`response_format` must be set to `verbose_json` to use timestamp granularities.

In [5]:
with open("test.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",  # required for timestamp granularities
        timestamp_granularities=["word"],
    )

print(transcript.words)

[TranscriptionWord(end=0.23999999463558197, start=0.0, word='The'), TranscriptionWord(end=0.4000000059604645, start=0.23999999463558197, word='fire'), TranscriptionWord(end=0.699999988079071, start=0.4000000059604645, word='that'), TranscriptionWord(end=0.9399999976158142, start=0.699999988079071, word='warms'), TranscriptionWord(end=1.1799999475479126, start=0.9399999976158142, word='us'), TranscriptionWord(end=1.340000033378601, start=1.1799999475479126, word='can'), TranscriptionWord(end=1.7200000286102295, start=1.340000033378601, word='also'), TranscriptionWord(end=2.0999999046325684, start=1.7200000286102295, word='consume'), TranscriptionWord(end=2.5399999618530273, start=2.0999999046325684, word='us'), TranscriptionWord(end=3.059999942779541, start=3.0199999809265137, word='It'), TranscriptionWord(end=3.2200000286102295, start=3.059999942779541, word='is'), TranscriptionWord(end=3.380000114440918, start=3.2200000286102295, word='not'), TranscriptionWord(end=3.680000066757202, s
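
The word timestamps printed above can be turned into subtitle cues. The sketch below is an illustration only: `format_srt_time` and `words_to_srt` are hypothetical helpers, the sample values are rounded from the output above, and the fixed number of words per cue is an arbitrary choice.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=5):
    """Group (word, start, end) tuples into numbered SRT cues."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_srt_time(chunk[0][1])   # start of the first word in the cue
        end = format_srt_time(chunk[-1][2])    # end of the last word in the cue
        text = " ".join(word for word, _, _ in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"


# Sample values rounded from the word-level output above.
sample_words = [
    ("The", 0.0, 0.24),
    ("fire", 0.24, 0.40),
    ("that", 0.40, 0.70),
    ("warms", 0.70, 0.94),
    ("us", 0.94, 1.18),
]
print(words_to_srt(sample_words, max_words=3))
```

With a real response, the input tuples could be built as `[(w.word, w.start, w.end) for w in transcript.words]`, using the attribute names shown in the output above.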

# TEXT TO SPEECH

**OPENAI AUDIO API:** Generates audio from the input text.
The Audio API provides a `speech` endpoint based on OpenAI's TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:

* Narrate a written blog post
* Produce spoken audio in multiple languages
* Give real-time audio output using streaming

Args:
1. **input:** The text to generate audio for. The maximum length is 4096 characters.

2. **model:** One of the available TTS models:
      * tts-1
      * tts-1-hd

3. **voice:** The voice to use when generating the audio. Supported voices are
      * alloy
      * echo
      * fable
      * onyx
      * nova
      * shimmer
  
* Previews of the voices are available in the [Text to Speech Guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options)


4. **response_format:** The format of the output audio. Supported formats are mp3 (default), opus, aac, flac, wav, and pcm.
      * Opus: For internet streaming and communication, low latency.
      * AAC: For digital audio compression, preferred by YouTube, Android, iOS.
      * FLAC: For lossless audio compression, favored by audio enthusiasts for archiving.
      * WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding overhead.
      * PCM: Similar to WAV but containing the raw samples in 24kHz (16-bit signed, little-endian), without the header.

5. **speed:** The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.

6. **extra_headers:** Send extra headers

7. **extra_query:** Add additional query parameters to the request

8. **extra_body:** Add additional JSON properties to the request

9. **timeout:** Override the client-level default timeout for this request, in seconds

REF: https://platform.openai.com/docs/guides/text-to-speech
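
Because `input` is limited to 4096 characters, longer text has to be split across several requests. The following is a minimal sketch of one way to do that. `split_for_tts` is a hypothetical helper, and the sentence-boundary regular expression is a simplifying assumption that will not handle every abbreviation or language.

```python
import re

MAX_TTS_INPUT_CHARS = 4096  # limit stated for the `input` argument above


def split_for_tts(text, limit=MAX_TTS_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible; a single sentence
    longer than `limit` is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one chunk on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two! Three?", limit=9))
```

Each chunk could then be sent as the `input` of a separate speech request, and the resulting audio files played or joined in order.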

In [6]:
text = "Mindfulness is a technique that involves being aware of the present moment without judgment."

The `speech` endpoint takes three key inputs: the `model`, the `input` text that should be turned into audio, and the `voice` to use for the audio generation. A simple request would look like the following:

In [7]:
audio = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text,
)
speech_file_path = "audio.mp3"
audio.stream_to_file(speech_file_path)
display(Audio(speech_file_path, autoplay=True))
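
If `response_format="pcm"` is requested instead of the default mp3, the result is raw samples without a header, as described in the argument list. The sketch below wraps such bytes in a WAV container using Python's standard `wave` module. `pcm_to_wav_bytes` is a hypothetical helper, the sample rate and 16-bit width come from the format description above, and the single (mono) channel is an assumption not stated in this document.

```python
import io
import wave


def pcm_to_wav_bytes(pcm_bytes, sample_rate=24_000, sample_width=2, channels=1):
    """Wrap raw little-endian PCM samples in a WAV header.

    sample_width=2 means 16-bit samples; channels=1 (mono) is an assumption.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# One second of silence stands in for real PCM output here.
silence = b"\x00\x00" * 24_000
wav_bytes = pcm_to_wav_bytes(silence)
print(wav_bytes[:4], len(wav_bytes) > len(silence))
```

The resulting bytes can be written to a `.wav` file or passed to a player that expects a WAV header.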


