# Speech2Text

**Information**

The **Bark model** by Suno is a text-to-audio generation model. Unlike traditional text-to-speech models, Bark is fully generative: it produces realistic, multilingual speech with natural prosody, can mimic accents based on the input text, and goes beyond speech by capturing nonverbal cues such as laughter and sighs, background and environmental sounds, and even music. Suno provides pretrained model checkpoints that support real-time audio generation on GPUs and also run on CPUs, and the model is available for commercial use.

Remark: Bark model information is based on suno-ai/bark, developed and maintained by Suno: https://suno.com/


***
**Coding sources**

* Hugging Face model page: https://huggingface.co/suno/bark-small
* Hugging Face documentation: https://huggingface.co/docs/transformers/model_doc/bark


***
**Aim of the code template**

Mimic Google's advanced Speech-to-Text API by (i) generating an audio file (Text2Speech), (ii) transcribing the audio file (Speech2Text), and (iii) improving the transcription with two LLMs; see the Google API: https://cloud.google.com/speech-to-text/?hl=en

# Transcribe your audio file

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device:", device)

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    # Optional: pass generate_kwargs={"task": "translate", "language": "german"} to translate
    # German speech into English text. Without it, Whisper detects the spoken language
    # automatically and transcribes in that same language.
)

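# Load audio data from your local "dialog_suno.wav" file
audio_file = "../Text2Speech/dialog_suno.wav"
audio_input, sample_rate = sf.read(audio_file, dtype="float32")

# The pipeline expects single-channel audio; average the channels if the file is stereo
if audio_input.ndim > 1:
    audio_input = audio_input.mean(axis=1)

# Process and transcribe the audio data
# (the pipeline resamples to 16 kHz when needed, which requires torchaudio)
result = pipe({"array": audio_input, "sampling_rate": sample_rate})
print(result["text"])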

device: cpu


ImportError: torchaudio is required to resample audio samples in AutomaticSpeechRecognitionPipeline. The torchaudio package can be installed through: `pip install torchaudio`.
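
The error above appears because the pipeline resamples any input whose sampling rate differs from the 16 kHz that Whisper's feature extractor expects, and it uses torchaudio for that step. Installing torchaudio is the direct fix. As an alternative, the following is a minimal sketch that converts the audio to mono 16 kHz before it reaches the pipeline, so no resampling is triggered. The helper name `to_mono_16k` is hypothetical, and the sketch uses plain linear interpolation without an anti-aliasing filter, so it is cruder than a proper resampler.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate expected by Whisper's feature extractor


def to_mono_16k(audio, sample_rate):
    """Down-mix to mono and resample to 16 kHz by linear interpolation.

    Hypothetical helper: a rough stand-in for a real resampler.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile returns multi-channel audio as (frames, channels)
        audio = audio.mean(axis=1)
    if sample_rate == TARGET_RATE:
        return audio
    # Number of output samples that covers the same duration at 16 kHz
    n_out = int(round(audio.shape[0] / sample_rate * TARGET_RATE))
    old_t = np.arange(audio.shape[0]) / sample_rate  # original time stamps
    new_t = np.arange(n_out) / TARGET_RATE           # target time stamps
    return np.interp(new_t, old_t, audio).astype(np.float32)
```

With this helper, the pipeline call would become `pipe({"array": to_mono_16k(audio_input, sample_rate), "sampling_rate": 16_000})`.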

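
Because the pipeline is created with `return_timestamps=True`, a successful run returns a `"chunks"` list next to `"text"`, where each chunk carries a `"timestamp"` pair and its own `"text"`. The sketch below shows one way to print those chunks line by line. The function name `format_chunks` is hypothetical, and the sample data is invented for illustration; it is not output from the model.

```python
def format_chunks(chunks):
    """Render pipeline chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end time can be missing (None), e.g. for the final chunk
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)


# Invented sample in the shape of result["chunks"]
example = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " How are you?"},
]
print(format_chunks(example))
```

In the notebook, the call would be `print(format_chunks(result["chunks"]))`.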

# Improve the transcription of your audio file

## Environment Setup

Get API key(s):

In [6]:
import os
import sys

# Assuming 'src' sits two directory levels above this notebook
path_to_src = os.path.join('../..','src')  # Moves two levels up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
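
The `API_key` module itself is not shown here; the code further down only relies on it exposing `hugging_api_key`. If you prefer not to keep keys in a source file, one possible alternative is to read the key from an environment variable. This is a minimal sketch: the function name `load_api_key` and the variable name `HUGGINGFACE_API_KEY` are assumptions chosen for illustration.

```python
import os


def load_api_key(env_var="HUGGINGFACE_API_KEY"):
    """Return the key stored in the given environment variable.

    The variable name is a hypothetical default, not a fixed convention.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return value
```

The returned string could then be passed as `InferenceClient(token=load_api_key())`.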

Code to improve your transcribed audio file:

In [None]:
from huggingface_hub import InferenceClient
import textwrap


# Initialize client
client = InferenceClient(token=key.hugging_api_key)

# Use the Whisper transcription from above as input
transcription = result["text"]

# Create prompts
system_content = "You are a helpful assistant whose task is to improve the transcription of an audio file. Focus on possible spelling errors and missing words."
user_content = f"""
    Check the transcription of the following audio file. Only provide the improved transcription:

    {transcription}
"""

# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_tokens=500,
    stream=False,
    temperature=0.0,
)


# Accessing the text in the output object
text = output.choices[0].message.content

# Printing the output in a more readable format
print('\n'.join(textwrap.wrap(text, 100)))
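
The aim stated at the top mentions two LLMs, while the cell above calls one. A possible way to chain several models is to pass the text through a list of chat functions, each receiving the previous result. The sketch below is an assumption about how that could look, not the template's own method: `build_messages` and `refine` are hypothetical names, and each chat function is expected to take a message list and return the model's reply as a string.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant whose task is to improve the transcription "
    "of an audio file. Focus on possible spelling errors and missing words."
)


def build_messages(transcription):
    """Build the system and user messages for one correction pass."""
    user_content = (
        "Check the transcription of the following audio file. "
        "Only provide the improved transcription:\n\n"
        f"{transcription.strip()}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


def refine(transcription, chat_fns):
    """Pass the text through each chat function in turn."""
    text = transcription
    for chat in chat_fns:
        # Each function maps a message list to the reply text
        text = chat(build_messages(text)).strip()
    return text
```

A chat function for the model used above could wrap the existing call, for example `lambda messages: client.chat_completion(messages=messages, model="meta-llama/Meta-Llama-3.1-70B-Instruct", max_tokens=500, temperature=0.0).choices[0].message.content`; a second function of the same shape would point at whichever second model you choose.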