# Speech to text

The Groq API offers fast speech-to-text through OpenAI-compatible endpoints, enabling near-real-time transcription and translation.

In [1]:
!pip install groq

In [2]:
from groq import Groq

In [None]:
# For Google Colab: read the API key from Colab's user secrets
from google.colab import userdata
from IPython.display import Audio, display
# Retrieve the API key (the client itself is created in a later cell)
groq_api = userdata.get('GROQ_API_KEY')

In [3]:
# On a local machine: load the Groq API key from a .env file
from dotenv import load_dotenv
import os

# Load environment variables from the .env file
load_dotenv()

# Retrieve the API key from the environment
groq_api = os.getenv("GROQ_API_KEY")
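
`os.getenv` returns `None` when the variable is unset. A minimal sketch of failing early with a clear message instead; the helper name `require_env` is hypothetical and not part of the Groq SDK or python-dotenv:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value

# Usage (after load_dotenv()):
# groq_api = require_env("GROQ_API_KEY")
```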

In [4]:

client = Groq(api_key=groq_api)

In [6]:
print(os.getcwd())

/Users/ashish/Desktop/vettura-genai/Codes


In [7]:
audio_file = "./Week_2/Day_1/transcription.mp3"
with open(audio_file, "rb") as file:
    transcription = client.audio.transcriptions.create(
      file=(audio_file, file.read()),
      model="whisper-large-v3-turbo",
      # prompt="",  # Optional
      # response_format="json",  # Optional
      # language="en",  # Optional
      # temperature=0.0  # Optional
    )
    print(transcription.text)


 The fire that warms us can also consume us. It is not the fault of the fire.


## Text output

To get the response as plain text, set `response_format` to `text`. The call then returns the transcript as a string rather than an object, so print `transcription` directly:

In [8]:
audio_file = "./Week_2/Day_1/transcription.mp3"
with open(audio_file, "rb") as file:
    transcription = client.audio.transcriptions.create(
      file=(audio_file, file.read()),
      model="whisper-large-v3-turbo",
      # prompt="",  # Optional
      response_format="text",  # Optional
      # language="en",  # Optional
      # temperature=0.0  # Optional
    )
    print(transcription)


 The fire that warms us can also consume us. It is not the fault of the fire.
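
The transcription cells in this notebook repeat the same open-and-send pattern and differ only in their optional arguments. A minimal sketch of wrapping that pattern in a function; `transcribe` and `transcript_text` are hypothetical helper names, and the only SDK call used is `client.audio.transcriptions.create` as shown in the cells above:

```python
def transcribe(client, path, model="whisper-large-v3-turbo", **options):
    """Send an audio file to the transcription endpoint and return the raw response."""
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            file=(path, f.read()),
            model=model,
            **options,  # e.g. response_format="text", language="en"
        )

def transcript_text(response):
    """Return the transcript string for either response shape.

    With response_format="text" the response is already a string;
    otherwise it is read from the .text attribute, as in the cells above.
    """
    return response if isinstance(response, str) else response.text

# Usage:
# print(transcript_text(transcribe(client, "./Week_2/Day_1/transcription.mp3",
#                                  response_format="text")))
```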


## Timestamps of segments
Set `response_format` to `verbose_json` to receive timestamps for audio segments.


In [9]:
audio_file = "./Week_2/Day_1/transcription.mp3"
with open(audio_file, "rb") as file:
    transcription = client.audio.transcriptions.create(
      file=(audio_file, file.read()),
      model="whisper-large-v3-turbo",
      # prompt="",  # Optional
      response_format="verbose_json",  # Required to get segment timestamps
      # language="en",  # Optional
      # temperature=0.0  # Optional
    )
    print(transcription.segments)




[{'id': 0, 'seek': 0, 'start': 0, 'end': 2.44, 'text': ' The fire that warms us can also consume us.', 'tokens': [50365, 440, 2610, 300, 1516, 2592, 505, 393, 611, 14732, 505, 13, 50487], 'temperature': 0, 'avg_logprob': -0.11128976, 'compression_ratio': 1.0555556, 'no_speech_prob': 1.0499434e-11}, {'id': 1, 'seek': 0, 'start': 2.88, 'end': 4.26, 'text': ' It is not the fault of the fire.', 'tokens': [50509, 467, 307, 406, 264, 7441, 295, 264, 2610, 13, 50578], 'temperature': 0, 'avg_logprob': -0.11128976, 'compression_ratio': 1.0555556, 'no_speech_prob': 1.0499434e-11}]
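
The raw segment list is hard to read. A minimal sketch of turning it into timestamped lines, assuming each segment is a dict with `start`, `end`, and `text` keys as in the output above (other SDK versions may return objects instead); `format_timestamp` and `format_segments` are hypothetical helper names:

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS.ss."""
    minutes, secs = divmod(float(seconds), 60)
    return f"{int(minutes):02d}:{secs:05.2f}"

def format_segments(segments) -> list[str]:
    """Return one '[start -> end] text' line per segment."""
    return [
        f"[{format_timestamp(seg['start'])} -> {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    ]

# Segment fields copied from the output above (other keys omitted)
segments = [
    {"start": 0, "end": 2.44, "text": " The fire that warms us can also consume us."},
    {"start": 2.88, "end": 4.26, "text": " It is not the fault of the fire."},
]
for line in format_segments(segments):
    print(line)
# [00:00.00 -> 00:02.44] The fire that warms us can also consume us.
# [00:02.88 -> 00:04.26] It is not the fault of the fire.
```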


# Audio translation API

Translates audio into English.

Parameters:
1. **file**  
   - `string` (Required)  
   - The audio file object (not the file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

2. **model**  
   - `string` (Required)  
   - ID of the model to use. Only `whisper-large-v3` is currently available.

3. **prompt**  
   - `string` (Optional)  
   - An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.

4. **response_format**  
   - `string` (Optional)  
   - Defaults to `json`  
   - The format of the translation output, in one of these options: `json`, `text`, or `verbose_json`.

5. **temperature**  
   - `number` (Optional)  
   - Defaults to `0`  
   - The sampling temperature, between `0` and `1`. Higher values like `0.8` will make the output more random, while lower values like `0.2` will make it more focused and deterministic. If set to `0`, the model will use log probability to automatically adjust the temperature until certain thresholds are met.
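
The constraints in the parameter list above can be checked locally before a request is sent. A minimal sketch, assuming the file format can be inferred from the file extension; `check_translation_args` is a hypothetical helper, not part of the Groq SDK, and the allowed values are copied from the list above:

```python
from pathlib import Path

SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "verbose_json"}

def check_translation_args(filename, response_format="json", temperature=0.0):
    """Return a list of problems found; an empty list means the arguments look valid."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported audio format: {extension or '(none)'}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unknown response_format: {response_format}")
    if not 0 <= temperature <= 1:
        problems.append(f"temperature must be between 0 and 1, got {temperature}")
    return problems

# Usage:
# problems = check_translation_args("./Week_2/Day_1/translation.m4a")
# if problems:
#     raise ValueError("; ".join(problems))
```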


In [10]:
filename = "./Week_2/Day_1/translation.m4a"
with open(filename, "rb") as file:
    translation = client.audio.translations.create(
      file=(filename, file.read()),
      model="whisper-large-v3",
      prompt="Specify context or spelling",  # Optional
      response_format="json",  # Optional
      temperature=0.0  # Optional
    )
    print(translation.text)



 The capital of West Bengal is located on the banks of the Huggali River, 180 km from the border of the Bengal Khadi.
