# Text2Speech

**Information**



OpenAI's speech-to-text model, Whisper, was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356).

The **Whisper large-v3-turbo model** is a streamlined, faster version of OpenAI's Whisper large-v3. It reduces the number of decoder layers from 32 to 4, which improves speed with minimal quality loss. Trained on over 5 million hours of labeled data, the model is robust in zero-shot speech recognition and translation across diverse datasets and domains.


Remark: The model is based on [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), which was developed by OpenAI.

***
**Coding sources**

* Hugging Face model page: https://huggingface.co/openai/whisper-large-v3-turbo
* Hugging Face documentation: https://huggingface.co/docs/transformers/main/en/model_doc/whisper


***
**Aim of the code template**

Mimic Google's Advanced Speech-to-Text API by (a) transcribing an audio file and (b) improving the transcription with two LLMs. See the Google API: https://cloud.google.com/speech-to-text/?hl=en

# Load your own audio file

**Text**:

In [2]:
text = """
Person A: So, let's brainstorm potential applications for Large Language Models (LLMs). With their ability to process and generate human-like text, there are tons of possibilities. What comes to mind first?

Person B: Definitely customer support. LLMs could handle a large volume of basic inquiries, like troubleshooting and FAQs, 24/7. This would free up human agents to focus on more complex issues.

Person A: Agreed. They’d also be great for content creation. Think of generating marketing copy, blogs, or even personalized emails. It could save so much time and maintain brand voice consistently.

Person B: Right, and education too. LLMs could serve as tutors, explaining concepts in various ways until a student understands. Interactive and responsive learning!

Person A: Another area is healthcare. They could assist in medical documentation or patient pre-screening, which could speed up processes in busy clinics.

Person B: Also, research. Analyzing large datasets, summarizing reports, or even helping draft papers. Researchers would save hours.

Person A: Exactly. There’s huge potential in every industry. Our focus should be on balancing productivity gains with ethical considerations.

Person B: Agreed. We need to ensure transparency and control, especially with sensitive information.

Person A: Let’s start drafting specific use cases for each sector.

Person B: Sounds like a plan!
"""

In [3]:
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")

voice_preset = "v2/en_speaker_6"

inputs = processor(
    text=text,
    return_tensors="pt",
    voice_preset=voice_preset
)

speech_values = model.generate(**inputs, do_sample=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
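Bark is usually described as producing only around 13 seconds of audio per `generate` call, so a dialog as long as the one above may be cut off. The following is a minimal sketch of one possible workaround, not part of the original notebook: split the text at sentence boundaries and synthesize each chunk separately. The helper names `split_into_chunks` and `synthesize_long_text` and the `max_chars=200` limit are assumptions chosen for illustration.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text into chunks of roughly max_chars, breaking only at sentence ends."""
    # Collapse all whitespace so blank lines between speaker turns do not matter.
    normalized = " ".join(text.split())
    sentences = re.split(r"(?<=[.!?])\s+", normalized)

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text, model, processor, voice_preset="v2/en_speaker_6"):
    """Generate one waveform per chunk and concatenate them into a single array."""
    import numpy as np  # imported lazily so the splitter alone needs only the stdlib

    pieces = []
    for chunk in split_into_chunks(text):
        inputs = processor(text=chunk, return_tensors="pt", voice_preset=voice_preset)
        audio = model.generate(**inputs, do_sample=True)
        pieces.append(audio.cpu().numpy().squeeze())
    return np.concatenate(pieces)
```

With the `model` and `processor` loaded above, `synthesize_long_text(text, model, processor)` would return a NumPy array that can be played or saved in the same way as a single Bark output. A single sentence longer than `max_chars` is kept whole rather than split further.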


In [8]:
from IPython.display import Audio
import soundfile as sf

# play within Jupyter notebook:
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

In [5]:
# Save as a WAV file
sf.write("dialog_suno.wav", speech_values.cpu().numpy().squeeze(), sampling_rate)
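The notebook never shows the speech-to-text step that produces the transcription improved later on. A minimal sketch of that step, assuming the Whisper large-v3-turbo checkpoint described at the top and the `transformers` automatic-speech-recognition pipeline, could look as follows. The function name `transcribe_file` is hypothetical, and decoding a file path this way assumes `ffmpeg` is installed.

```python
MODEL_ID = "openai/whisper-large-v3-turbo"


def transcribe_file(audio_path, model_id=MODEL_ID):
    """Return the transcript of an audio file as a plain string."""
    # Imported lazily: loading transformers and the model weights is expensive.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # return_timestamps=True lets the pipeline handle clips longer than 30 seconds.
    result = asr(audio_path, return_timestamps=True)
    return result["text"].strip()
```

Calling `transcription = transcribe_file("dialog_suno.wav")` would then provide the `transcription` variable that the prompt in the next section expects.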

# Improve the transcription of your audio file

## Environment Setup

Get API key(s):

In [6]:
import os
import sys

# Assuming 'src' sits two levels above the notebook's working directory
path_to_src = os.path.join('../..', 'src')  # resolves to ../../src

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
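The `API_key` module itself is not shown. From its use below, it is assumed to expose a string attribute `hugging_api_key`. As an alternative sketch that avoids keeping the key in a source file, the token could be read from an environment variable. The helper `load_hf_token` is hypothetical, and `HF_TOKEN` is assumed as the variable name.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

The returned value could be passed as `token=load_hf_token()` wherever `key.hugging_api_key` is used.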

Code to improve your transcribed audio file:

In [7]:
from huggingface_hub import InferenceClient
import textwrap


# Initialize client
client = InferenceClient(token=key.hugging_api_key)

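# Create prompts
# NOTE: `transcription` must hold the speech-to-text output of the audio file.
# It has to be defined before this cell runs, otherwise the cell fails with
# the NameError shown below.
system_content = "You are a helpful assistant whose task is to improve the transcription of an audio file. Focus on possible spelling errors and missing words."
user_content = f"""
    Check the transcription of the following audio file. Only provide the improved transcription:

    {transcription}
"""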

# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_tokens=500,
    stream=False,
    temperature=0.0,
)


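# Access the text in the output object.
# A new variable name keeps the original dialog in `text` from being overwritten.
improved_transcription = output.choices[0].message.content

# Print the output wrapped to 100 characters per line
print('\n'.join(textwrap.wrap(improved_transcription, 100)))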

NameError: name 'transcription' is not defined
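Because the audio in this notebook is synthesized from a known dialog, the effect of the LLM correction can be checked against that source text. The sketch below is one simple way to do this, using only the standard library. The function name `word_similarity` is hypothetical, and the score is a rough word-level similarity, not a standard word error rate.

```python
import difflib


def word_similarity(reference, hypothesis):
    """Return a similarity score in [0, 1] between two texts, compared word by word."""
    # Lowercase and split on whitespace so casing and line breaks are ignored.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    return difflib.SequenceMatcher(None, ref_words, hyp_words).ratio()
```

Computing the score once for the raw transcription and once for the LLM output, each against the original dialog, gives a rough indication of whether the correction step helped.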