# Text2Speech

**Information**

OpenAI's Speech2Text model, Whisper, was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356).

The **Whisper large-v3-turbo model** is a streamlined, faster version of OpenAI's Whisper large-v3. It reduces the number of decoder layers from 32 to 4, which improves speed with minimal quality loss. Trained on over 5 million hours of labeled data, the model is robust in zero-shot speech recognition and translation across diverse datasets and domains.


Remark: The model is based on [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3), which was developed by OpenAI.

***
**Coding sources**

* Hugging Face model page: https://huggingface.co/openai/whisper-large-v3-turbo
* Hugging Face documentation: https://huggingface.co/docs/transformers/main/en/model_doc/whisper


***
**Aim of the code template**

Mimic Google's Advanced Speech-to-Text API by (i) generating an audio file (Text2Speech), (ii) transcribing the audio file (Speech2Text), and (iii) improving the transcription with two LLMs; see the Google API: https://cloud.google.com/speech-to-text/?hl=en

# Generate your own audio file

Code is based on: https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb

**Text**:

In [1]:
speaker_lookup = {"Samantha": "v2/en_speaker_9", "John": "v2/en_speaker_2"}

# Script generated by ChatGPT
script = """
Samantha: So, let's brainstorm potential applications for Large Language Models (LLMs). With their ability to process and generate human-like text, there are tons of possibilities. What comes to mind first?

John: Definitely customer support. LLMs could handle a large volume of basic inquiries, like troubleshooting and FAQs, 24/7. This would free up human agents to focus on more complex issues.

Samantha: Agreed. They’d also be great for content creation. Think of generating marketing copy, blogs, or even personalized emails. It could save so much time and maintain brand voice consistently.

John: Right, and education too. LLMs could serve as tutors, explaining concepts in various ways until a student understands. Interactive and responsive learning!

Samantha: Another area is healthcare. They could assist in medical documentation or patient pre-screening, which could speed up processes in busy clinics.

John: Also, research. Analyzing large datasets, summarizing reports, or even helping draft papers. Researchers would save hours.

Samantha: Exactly. There’s huge potential in every industry. Our focus should be on balancing productivity gains with ethical considerations.

John: Agreed. We need to ensure transparency and control, especially with sensitive information.

Samantha: Let’s start drafting specific use cases for each sector.

John: Sounds like a plan!
"""

script = script.strip().split("\n")
script = [s.strip() for s in script if s]
script

["Samantha: So, let's brainstorm potential applications for Large Language Models (LLMs). With their ability to process and generate human-like text, there are tons of possibilities. What comes to mind first?",
 'John: Definitely customer support. LLMs could handle a large volume of basic inquiries, like troubleshooting and FAQs, 24/7. This would free up human agents to focus on more complex issues.',
 'Samantha: Agreed. They’d also be great for content creation. Think of generating marketing copy, blogs, or even personalized emails. It could save so much time and maintain brand voice consistently.',
 'John: Right, and education too. LLMs could serve as tutors, explaining concepts in various ways until a student understands. Interactive and responsive learning!',
 'Samantha: Another area is healthcare. They could assist in medical documentation or patient pre-screening, which could speed up processes in busy clinics.',
 'John: Also, research. Analyzing large datasets, summarizing repor
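
The cell above turns the raw script into a list of `Speaker: text` lines, which are later split into speaker and text. As a minimal sketch of that parsing step, assuming every non-empty line starts with a `Speaker: ` prefix, the hypothetical helper `parse_script` below (not part of the original notebook) splits at the first `": "` only, so a colon inside the spoken text does not break the unpacking:

```python
def parse_script(raw: str) -> list[tuple[str, str]]:
    """Split a 'Speaker: text' script into (speaker, text) pairs.

    Hypothetical helper for illustration; the notebook itself uses
    a plain list of strings.
    """
    turns = []
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between dialogue turns
        # partition() splits at the FIRST ": " only
        speaker, sep, text = line.partition(": ")
        if not sep:
            raise ValueError(f"Line has no 'Speaker: ' prefix: {line!r}")
        turns.append((speaker, text))
    return turns


demo = """
Samantha: Note: this line contains a second colon.

John: Sounds like a plan!
"""
turns = parse_script(demo)
print(turns)
```

Each resulting pair can be passed on directly: the first element selects the voice preset, the second is the text to synthesize.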

In [2]:
from transformers import AutoProcessor, AutoModel
import numpy as np

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")
sampling_rate = model.generation_config.sample_rate

silence = np.zeros(int(0.5*sampling_rate)) # half a second silence


pieces = []

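for line in script:
    # Split at the first ": " only, so colons inside the text are preserved
    speaker, text = line.split(": ", 1)
    inputs = processor(
        text=text,
        voice_preset=speaker_lookup[speaker],
    )
    speech_values = model.generate(**inputs, do_sample=True)
    # Append the generated speech followed by a short pause
    pieces += [speech_values, silence.copy()]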

  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

In [6]:
len(pieces)

20
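
The count of 20 follows from the ten dialogue lines, each of which contributes one speech clip and one half-second pause. The sketch below illustrates that bookkeeping with synthetic arrays instead of real Bark output. The sampling rate of 24 kHz and the clip lengths are assumptions made for illustration; the notebook reads the actual rate from `model.generation_config`.

```python
import numpy as np

sampling_rate = 24_000  # assumed value; the notebook takes it from the model config
silence = np.zeros(int(0.5 * sampling_rate), dtype=np.float32)  # half a second

# Stand-ins for generated speech, shaped (1, n_samples) like a batch of one
fake_clips = [
    np.ones((1, sampling_rate), dtype=np.float32),      # 1.0 s of "speech"
    np.ones((1, 2 * sampling_rate), dtype=np.float32),  # 2.0 s of "speech"
]

pieces = []
for clip in fake_clips:
    # squeeze() drops the batch dimension so every piece is 1-D
    pieces += [clip.squeeze(), silence.copy()]

audio = np.concatenate(pieces)
duration_s = len(audio) / sampling_rate
print(len(pieces), duration_s)
```

With two clips the list holds four pieces, and the total duration is the sum of the clip lengths plus half a second of silence per clip.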

In [3]:
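pieces_out = []
for p in pieces:
    if isinstance(p, np.ndarray):
        # The silence segments are already 1-D NumPy arrays
        pieces_out.append(p)
    else:
        # Bark returns a tensor of shape (1, n_samples): move to CPU and flatten
        pieces_out.append(p.cpu().numpy().squeeze())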

In [5]:
from IPython.display import Audio

# play within Jupyter notebook:
Audio(np.concatenate(pieces_out), rate=sampling_rate)

In [8]:
import soundfile as sf

# Concatenate once, then save as WAV and as MP3
# (MP3 output depends on the installed libsndfile supporting MP3)
audio = np.concatenate(pieces_out)
sf.write("dialog_suno.wav", audio, sampling_rate)
sf.write("dialog_suno.mp3", audio, sampling_rate)