Notebook 4: TTS Workflow
---
Our podcast transcript is now ready, so we can generate the audio for the podcast.

In this notebook, we will first learn how to generate audio with the suno/bark and parler-tts/parler-tts-mini-v1 models.

After that, we will use the output from Notebook 3 to generate the complete podcast.

Step 1: Setting Up Your Environment for Machine Learning
---
In this tutorial, we'll start by installing the necessary libraries for our machine learning project. 

We'll use the following commands to install optimum, flash-attn, transformers, the two TTS libraries, and the audio utilities:

* Install the optimum library for optimizing and deploying machine learning models --> pip3 install optimum
* Install packaging, which is a prerequisite for building flash-attn --> pip install packaging
* Install ninja, which is a prerequisite for building flash-attn --> pip install ninja
* Install the flash-attn library for efficient attention mechanisms in deep learning models (https://github.com/Dao-AILab/flash-attention) --> pip install -U flash-attn --no-build-isolation
* Install the transformers library for natural language processing tasks, specifically version 4.43.3 --> pip install transformers==4.43.3
* Install parler_tts for text-to-audio generation --> pip install git+https://github.com/username/parler_tts.git
* Install the Bark model by Suno AI for text-to-audio generation --> pip install git+https://github.com/suno-ai/bark.git
* Install pydub for working with audio files --> pip install pydub
* Install ffmpeg for dealing with audio files:
     
     sudo apt update

     sudo apt install ffmpeg
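
Before moving on, it can help to confirm that the installs above actually landed in the active environment. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumed from the install list (some packages are imported under a different name than the one used with pip).

```python
import importlib.util
import shutil

# Import names assumed from the install list above (hypothetical check list)
REQUIRED_MODULES = ["optimum", "flash_attn", "transformers", "parler_tts", "bark", "pydub"]

def find_missing_modules(module_names):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")

# pydub relies on the ffmpeg binary being on PATH for MP3 export
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```

If anything is reported as missing, rerun the matching install command before importing the libraries.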

In [1]:
#Importing necessary libraries

import IPython.display as ipd
import torch
import soundfile as sf
import json
import numpy as np
import sys
import os

from IPython.display import Audio
from tqdm import tqdm
from transformers import BarkModel, AutoProcessor, AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration
from bark import generate_audio
from flash_attn import flash_attn_func

Testing the Audio Generation
---
The audio for the podcast is generated by two text-to-speech models: Parler and Bark (from Suno).

Note the subtle differences in prompting:

* Parler: Takes a description prompt that sets the speaker profile and speaking pace.
* Bark: Accepts expression cues such as [sigh] and [laughs] inside the text. See the TTS_Notes.md file for more notes on the experiments that were run for this notebook.

Please set device = "cuda" below if you're using a single GPU node.
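
Because only Bark is described as accepting bracketed expression cues, you may want to remove them from a line before sending the same text to Parler. The sketch below shows one way to do that with a regular expression. The helper `strip_expression_tags` is hypothetical, and it assumes that cues always appear in square brackets.

```python
import re

# Matches bracketed expression cues such as [sigh] or [laughs]
EXPRESSION_TAG = re.compile(r"\s*\[[^\[\]]+\]")

def strip_expression_tags(text):
    """Remove bracketed cues and tidy the whitespace left behind."""
    without_tags = EXPRESSION_TAG.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()

bark_line = "Exactly! [sigh] And the distillation part is next."
parler_line = strip_expression_tags(bark_line)
print(parler_line)
```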

## Step 2: Generating Synthetic Speech with ParlerTTS & Bark Using Custom Text and Voice Descriptions

### Step 2.1: Parler Model

We start with the Parler model and generate a short segment in the voice of the speaker Laura.

This code block demonstrates how to generate synthetic audio using a text-to-speech (TTS) model from the ParlerTTS library. It involves loading the TTS model, providing a text input along with a description of the voice style, and generating audio output. The process is detailed below.

##### Code Overview and Purpose:
* Setup Device: Determines whether to use a GPU (CUDA) if available, or the CPU if not.
* Load Model and Tokenizer: Loads a pretrained TTS model (ParlerTTSForConditionalGeneration) and a tokenizer compatible with it. These tools convert text into tokens the model can process to generate speech.
* Define Text and Description: Specifies the content to be spoken and describes the voice style.
* Tokenize Inputs: Converts the text and voice description into token IDs that the model can understand.
* Generate Audio: Passes the tokenized text and description to the model, which produces synthetic audio.
* Play Audio: Plays the generated audio output in the notebook.

##### End Result
The final output is an audio array that contains synthesized speech based on the given text, presented with the specified voice characteristics. The audio is then played back, allowing users to hear the model’s vocal interpretation of the input text.

In [2]:

# Set up device for model processing
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load TTS model and tokenizer from pretrained resources
# The model converts text to speech, while the tokenizer processes the input text
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define the text prompt and the vocal style description
# 'text_prompt' contains the words to be spoken, and 'description' sets the tone/style of the voice
text_prompt = """
Exactly! And the distillation part is where you take a LARGE-model, and compress it down into a smaller, 
more efficient model that can run on devices with limited resources.
"""
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording 
that almost has no background noise.
"""

# Tokenize the description and text prompt
# 'input_ids' represents the vocal style, and 'prompt_input_ids' represents the text to be spoken
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

# Generate audio using the TTS model
# 'generation' produces an array of audio samples based on the input text and voice description
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

# Convert the generated audio tensor to a NumPy array for playback
audio_arr = generation.cpu().numpy().squeeze()

# Play the generated audio within the notebook environment
ipd.Audio(audio_arr, rate=model.config.sampling_rate)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
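
The warning above asks for an explicit `attention_mask`. A minimal sketch of one way to address it is shown below: keep the full tokenizer output and pass both masks to `generate`. The wrapper name `generate_parler_audio` is hypothetical, and the `attention_mask` / `prompt_attention_mask` keyword arguments are assumed to be accepted by the installed parler_tts version, so check its documentation before relying on this.

```python
def generate_parler_audio(model, tokenizer, description, text, device):
    """Generate Parler audio while passing attention masks explicitly."""
    # Keep the whole encoding (not just input_ids) so the masks stay available
    description_inputs = tokenizer(description, return_tensors="pt").to(device)
    prompt_inputs = tokenizer(text, return_tensors="pt").to(device)

    generation = model.generate(
        input_ids=description_inputs.input_ids,
        attention_mask=description_inputs.attention_mask,
        prompt_input_ids=prompt_inputs.input_ids,
        prompt_attention_mask=prompt_inputs.attention_mask,
    )
    # Move the result to the CPU and drop the batch dimension for playback
    return generation.cpu().numpy().squeeze()
```

With the objects from the cell above, this could be called as `generate_parler_audio(model, tokenizer, description, text_prompt, device)`.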


### Step 2.2: Bark Model

##### Code Overview and Purpose
This code block uses the Bark model, a text-to-speech synthesis model, to generate a nuanced audio output based on a custom text prompt. The code includes:

1. Setting up a specific voice preset and sampling rate.
2. Loading the required processor and model, both of which are configured for efficient GPU use.
3. Processing the input text prompt to fit the selected voice characteristics.
4. Customizing the generation process by setting temperatures for expressive, natural-sounding speech synthesis.

##### End Result
The output is an audio array containing the synthetic voice reciting the provided text prompt, which is played back in the notebook. The voice is shaped by a specific voice preset, and the temperature settings make the delivery more expressive.


Notes:

* We will set the voice_preset to our favorite speaker
* This time we can include expression prompts inside our generation prompt
* Note that you can CAPITALISE words to make the model emphasise them
* You can add hyphens to make the model pause on certain words
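
The notes above can be applied programmatically when preparing many lines. This is a small sketch under stated assumptions: `emphasise` and `add_cue` are hypothetical helpers, not part of Bark, and they only edit the prompt string before it is handed to the processor.

```python
import re

def emphasise(text, words):
    """Upper-case whole-word matches of `words` (case-insensitive)."""
    for word in words:
        pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
        text = pattern.sub(lambda match: match.group(0).upper(), text)
    return text

def add_cue(text, cue, after):
    """Insert a bracketed cue such as [sigh] after the first occurrence of `after`."""
    return text.replace(after, f"{after} [{cue}]", 1)

line = "Exactly! And the distillation part is where you take a large model."
prompt = add_cue(emphasise(line, ["large"]), "sigh", "Exactly!")
print(prompt)
```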

In [3]:
# Define the voice preset and audio sampling rate
# 'voice_preset' selects a specific synthetic voice style, and 'sampling_rate' defines the audio quality
voice_preset = "v2/en_speaker_6"
sampling_rate = 24000
device = "cuda"  # Assumes a CUDA GPU is available (the model is loaded in float16)

# Load the text processing model
# The processor will convert text into the format required by the Bark model
processor = AutoProcessor.from_pretrained("suno/bark")

# Load the Bark TTS model with specific settings for precision and speed
# This model can generate expressive audio based on text and custom presets
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)

# Define the text prompt to be converted into speech
# The prompt includes [sigh] to add expressive elements to the generated speech
text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress it down 
into a smaller, more efficient model that can run on devices with limited resources.
"""

# Process the text prompt with the specified voice preset for input to the model
# 'inputs' contains tokenized text and voice style settings
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

# Generate synthetic speech from the processed inputs
# 'temperature' and 'semantic_temperature' control the expressiveness of the generated speech
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)

# Play the generated audio in the notebook environment
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Step 3: Bringing it together: Making the Podcast
---
##### Code Overview and Purpose
This code block handles the process of loading a .pkl file, typically containing text data, and preparing two TTS (Text-to-Speech) models, Bark and ParlerTTS, for subsequent audio generation tasks. It allows users to select the .pkl file through a file dialog, handles errors if no file is selected, and sets up the required models for generating audio from text data.

##### End Result
The final output loads text data from a .pkl file into a variable and initializes the Bark and ParlerTTS models, which will be used to convert this text into audio in later steps. This step also defines the voice profile for one of the speakers.
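
To make the expected file contents concrete, here is a minimal round-trip sketch: it pickles a short transcript string to a temporary file and loads it back, which mirrors what the loading cell does with the output of Notebook 3. The sample transcript and the file name are hypothetical. Note that `pickle.load` can execute arbitrary code, so only load .pkl files you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

# Hypothetical transcript in the same string form that the later cells print
sample_text = '[("Speaker 1", "Welcome to the show!"), ("Speaker 2", "Glad to be here.")]'

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_podcast.pkl")  # hypothetical file name
    # Write the transcript string to disk as a pickle
    with open(sample_path, "wb") as file:
        pickle.dump(sample_text, file)
    # Read it back the same way the notebook cell does
    with open(sample_path, "rb") as file:
        loaded_text = pickle.load(file)

print(type(loaded_text).__name__, len(loaded_text))
```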

In [6]:
# Import necessary libraries for file handling and GUI
import pickle  # For loading .pkl files
import tkinter as tk  # For file dialog window
from tkinter import filedialog  # For opening the file dialog

# Function to open a file dialog box for selecting a .pkl file
def select_file():
    root = tk.Tk()
    root.withdraw()  # Hide the main Tkinter window
    # Open a dialog box for selecting .pkl files
    file_path = filedialog.askopenfilename(
        title="Select a .pkl file",
        filetypes=[("Pickle files", "*.pkl")]
    )
    return file_path

# Attempt to get the .pkl file path via the dialog; if no file is selected, prompt for direct input
try:
    file_path = select_file()  # Opens the file dialog
    if not file_path:  # If the user cancels or closes the dialog without selecting a file
        raise FileNotFoundError("No file selected.")
except Exception as e:
    # If no file is selected or dialog fails, request path input from the user
    print(f"Error: {e}")
    file_path = input("Please enter the path to your .pkl file: ")

# Load the specified .pkl file containing text data for speech generation
with open(file_path, 'rb') as file:
    PODCAST_TEXT = pickle.load(file)  # Loads the content of the .pkl file into 'PODCAST_TEXT'

# Set up Bark model and processor for audio generation
# 'bark_processor' prepares text for the Bark model, which generates audio
bark_processor = AutoProcessor.from_pretrained("suno/bark")
bark_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda")
bark_sampling_rate = 24000  # Sampling rate (in Hz) of the audio produced by Bark

# Set up Parler TTS model and tokenizer for additional voice synthesis options
parler_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to("cuda")
parler_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define voice profile description for the first speaker
# 'speaker1_description' specifies characteristics like tone, pace, and recording style
speaker1_description = """
Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording 
that almost has no background noise.
"""

# Initialize lists to store generated audio segments and their sampling rates
generated_segments = []  # Will store audio segments generated by the models
sampling_rates = []  # Keeps track of each segment's sampling rate

# Specify the device for model processing
device = "cuda"  # Assumes a CUDA GPU is available; the models above were moved to "cuda"

Step 4: Functions for Audio Generation Using the Parler and Bark Text-to-Speech Models
---
#### Step 4.1: Function for Generating Audio for Speaker 1 Using ParlerTTS

##### Code Overview and Purpose
This code block defines a function, generate_speaker1_audio, which leverages the ParlerTTS model to generate audio based on input text for a specific speaker profile. This function uses predefined characteristics from speaker1_description to produce audio that matches the intended tone and style for "Speaker 1."

##### End Result
The function returns an audio array and the sampling rate, which can be used to play or further process the generated audio for Speaker 1. This audio output will be shaped according to the expressive, dramatic characteristics defined in the speaker1_description.

In [7]:
# Define a function to generate audio for Speaker 1 using ParlerTTS
def generate_speaker1_audio(text):
    """Generate audio using ParlerTTS for Speaker 1"""
    
    # Convert the speaker description into tokens that the model can understand
    # 'input_ids' represent the voice style, described by 'speaker1_description'
    input_ids = parler_tokenizer(speaker1_description, return_tensors="pt").input_ids.to(device)
    
    # Tokenize the input text that will be converted into speech
    # 'prompt_input_ids' represents the text to be spoken by Speaker 1
    prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
    
    # Generate audio using ParlerTTS with the given description and text
    # 'generation' produces an array of audio samples based on the input text and voice description
    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    
    # Convert the generated audio tensor to a NumPy array for easier playback and processing
    audio_arr = generation.cpu().numpy().squeeze()
    
    # Return the generated audio array and the model's sampling rate
    return audio_arr, parler_model.config.sampling_rate

#### Step 4.2: Function for Generating Audio for Speaker 2 Using Bark Model

##### Code Overview and Purpose
This code block defines a function, generate_speaker2_audio, which uses the Bark model to synthesize audio for Speaker 2 based on the input text. The function applies a specific voice preset and temperature settings, creating a distinct, expressive audio output tailored to Speaker 2’s vocal style.

##### End Result
The function outputs an audio array and sampling rate, which can be played or processed further. This array contains the synthesized voice, customized to match the characteristics of the selected voice preset for Speaker 2.

In [8]:
# Define a function to generate audio for Speaker 2 using the Bark model
def generate_speaker2_audio(text):
    """Generate audio using Bark for Speaker 2"""
    
    # Process the input text with a specific voice preset
    # 'inputs' contains the tokenized text for Bark, using voice preset "v2/en_speaker_6"
    inputs = bark_processor(text, voice_preset="v2/en_speaker_6").to(device)
    
    # Generate the audio using Bark's TTS model
    # 'temperature' and 'semantic_temperature' control expressiveness and natural flow of speech
    speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    
    # Convert the generated audio tensor to a NumPy array for playback and further processing
    audio_arr = speech_output[0].cpu().numpy()
    
    # Return the generated audio array and the predefined sampling rate for Bark model
    return audio_arr, bark_sampling_rate


Step 5: Converting Numpy Array to AudioSegment for Playback and Export
---
#### Code Overview and Purpose
This code block provides a utility function, numpy_to_audio_segment, which converts a NumPy array containing audio data into an AudioSegment object. The AudioSegment format, from the pydub library, allows easy manipulation and export of audio files in various formats. The function processes the audio data by converting it to a 16-bit PCM format and then creating a WAV file in memory.

#### End Result
The function returns an AudioSegment object that can be directly played back, exported to various file formats, or further processed. This allows for flexible handling of generated audio data in formats compatible with many media applications.
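
The float-to-PCM step can be illustrated without any third-party packages. The sketch below swaps scipy and pydub for Python's built-in `wave` module, so it is an illustration of the same idea rather than the notebook's implementation. The helper `floats_to_wav_bytes` is hypothetical; it clips each sample to [-1, 1], scales it to the int16 range, and writes a mono WAV file into memory.

```python
import io
import math
import struct
import wave

def floats_to_wav_bytes(samples, sampling_rate):
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV held in memory."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # keep values inside int16 range
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()

# One second of a 440 Hz tone at 24 kHz as stand-in audio data
tone = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
wav_bytes = floats_to_wav_bytes(tone, 24000)
print(len(wav_bytes), "bytes of WAV data")
```

Clipping before scaling matters because a sample outside [-1, 1] would otherwise wrap around when cast to 16 bits and produce an audible click.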

In [14]:
# Import necessary modules for in-memory byte stream and audio handling
import io  # For in-memory byte streams to handle audio without saving to disk
import numpy as np  # For handling and processing audio data as arrays
from scipy.io import wavfile  # For writing audio data to a WAV format
from pydub import AudioSegment  # For audio manipulation and format conversion

def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert numpy array to AudioSegment"""
    
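    # Convert audio data from float (-1 to 1) to 16-bit PCM format (int16)
    # Cast to float32 first (Bark returns float16 here) and clip to [-1, 1] so that
    # full-scale samples cannot overflow int16 after scaling
    audio_float = np.clip(np.asarray(audio_arr, dtype=np.float32), -1.0, 1.0)
    audio_int16 = (audio_float * 32767).astype(np.int16)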
    
    # Create a WAV file in memory using an in-memory byte stream (BytesIO)
    # This avoids writing to disk and allows for efficient in-memory processing
    byte_io = io.BytesIO()
    wavfile.write(byte_io, sampling_rate, audio_int16)
    byte_io.seek(0)  # Move to the beginning of the stream for reading
    
    # Convert the in-memory WAV data to an AudioSegment object
    # AudioSegment supports playback, manipulation, and export to other formats (e.g., MP3, OGG)
    return AudioSegment.from_wav(byte_io)


In [15]:
# Print the resulting podcast text
PODCAST_TEXT 

'[\n    ("Speaker 1", "Welcome to \'The Writing Life\', where we explore the world of writing and share tips and tricks to help you improve your craft. I\'m your host, and today we\'re joined by a seasoned writer and educator who\'s worked with authors, poets, and journalists. Let\'s dive right in! What draws you to writing, and how do you approach the creative process?"),\n    ("Speaker 2", "Hmm, I think I\'ve always been fascinated by the power of words to convey emotion and meaning. But, umm, how do you approach writing, exactly?"),\n    ("Speaker 1", "Well, I think it\'s all about understanding your audience and purpose. Whether you\'re writing a novel, essay, or poem, it\'s essential to consider who your readers are and what they want to take away from your work."),\n    ("Speaker 2", "That makes sense. But, umm, what about tone? How do you convey a tone through writing? I\'ve always struggled with this one."),\n    ("Speaker 1", "Tone is a great topic! Think of it like music - yo

Step 6: Safely Converting Text Data to Python Objects Using ast.literal_eval
---
#### Code Overview and Purpose
This code block uses Python’s ast (Abstract Syntax Trees) module to convert the PODCAST_TEXT, a string representation of a Python data structure, into an actual Python object. The ast.literal_eval function is used to evaluate this string safely, avoiding potential security risks associated with eval. This is particularly useful when loading serialized data in string format that represents dictionaries, lists, or other literal structures.

#### End Result
The result of this operation is a Python object derived from PODCAST_TEXT, which can now be manipulated as a native Python structure (like a list or dictionary), allowing further processing or analysis.
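
A small, self-contained sketch of this behaviour follows. The transcript string is a hypothetical stand-in for PODCAST_TEXT, and `is_valid_transcript` is an illustrative helper for checking the parsed shape before any audio is generated.

```python
import ast

# Hypothetical stand-in for PODCAST_TEXT
transcript_string = '[("Speaker 1", "Hello!"), ("Speaker 2", "Hi there.")]'
segments = ast.literal_eval(transcript_string)

def is_valid_transcript(obj):
    """Check that obj is a list of (speaker, text) pairs of strings."""
    return isinstance(obj, list) and all(
        isinstance(item, tuple)
        and len(item) == 2
        and all(isinstance(part, str) for part in item)
        for item in obj
    )

print(is_valid_transcript(segments))

# Anything that is not a plain literal (for example a function call) is rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("Non-literal input rejected:", rejected)
```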

In [16]:
# Import the ast module for safe evaluation of strings containing Python literals
import ast

# Convert the string in PODCAST_TEXT to a Python object safely
# 'ast.literal_eval' only evaluates strings containing Python literals (e.g., lists, dicts, numbers)
# This function avoids the security risks of 'eval' by only parsing basic data structures
parsed_data = ast.literal_eval(PODCAST_TEXT)

# 'parsed_data' now holds a Python object derived from PODCAST_TEXT, ready for further use

[('Speaker 1',
  "Welcome to 'The Writing Life', where we explore the world of writing and share tips and tricks to help you improve your craft. I'm your host, and today we're joined by a seasoned writer and educator who's worked with authors, poets, and journalists. Let's dive right in! What draws you to writing, and how do you approach the creative process?"),
 ('Speaker 2',
  "Hmm, I think I've always been fascinated by the power of words to convey emotion and meaning. But, umm, how do you approach writing, exactly?"),
 ('Speaker 1',
  "Well, I think it's all about understanding your audience and purpose. Whether you're writing a novel, essay, or poem, it's essential to consider who your readers are and what they want to take away from your work."),
 ('Speaker 2',
  "That makes sense. But, umm, what about tone? How do you convey a tone through writing? I've always struggled with this one."),
 ('Speaker 1',
  "Tone is a great topic! Think of it like music - you can convey a specific 

Step 7: Generating and Concatenating Audio Segments for Podcast Production
---
#### Code Overview and Purpose
This code block generates and concatenates audio segments for a podcast using two synthetic voices, “Speaker 1” and “Speaker 2.” The code iterates through the text segments in PODCAST_TEXT, identifies the speaker, generates audio accordingly, and then combines the segments into a single audio file.

#### End Result
The final output is a continuous audio file, final_audio, containing all the generated segments combined in sequence. This final audio can be exported as a single file, forming a complete podcast episode or audio presentation.
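
One possible variation on this loop is to look up each speaker's generator in a dictionary, which makes an unexpected speaker label fail loudly and leaves room for more than two speakers. The sketch below is an assumption-laden illustration: `build_podcast` is a hypothetical helper, and the stand-in generators and converter return plain Python values so that it runs without any TTS model loaded.

```python
def build_podcast(segments, generators, to_segment):
    """Generate each segment with its speaker's function and join the results in order."""
    final_audio = None
    for speaker, text in segments:
        if speaker not in generators:
            raise KeyError(f"No generator registered for {speaker!r}")
        audio_arr, rate = generators[speaker](text)
        piece = to_segment(audio_arr, rate)
        # The first piece starts the result; later pieces are appended with +
        final_audio = piece if final_audio is None else final_audio + piece
    return final_audio

# Stand-in generators (placeholder sampling rates) and a stand-in converter
demo_generators = {
    "Speaker 1": lambda text: (f"parler:{text}", 44100),
    "Speaker 2": lambda text: (f"bark:{text}", 24000),
}
demo_audio = build_podcast(
    [("Speaker 1", "Hi"), ("Speaker 2", "Hello")],
    demo_generators,
    lambda audio_arr, rate: [audio_arr],
)
print(demo_audio)
```

In the notebook, the dictionary would map the labels to generate_speaker1_audio and generate_speaker2_audio, and the converter would be numpy_to_audio_segment.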

In [17]:
# Initialize the final audio segment to None; this will store the complete concatenated audio
final_audio = None

# Loop through each segment in PODCAST_TEXT, converting each into audio based on speaker identity
for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc="Generating podcast segments", unit="segment"):
    
    # Check which speaker to use for the audio generation
    if speaker == "Speaker 1":
        # Generate audio for Speaker 1
        audio_arr, rate = generate_speaker1_audio(text)
    else:  # Assumes the other speaker is Speaker 2
        # Generate audio for Speaker 2
        audio_arr, rate = generate_speaker2_audio(text)
    
    # Convert the generated audio NumPy array to an AudioSegment object
    # 'numpy_to_audio_segment' handles the conversion to an AudioSegment format
    audio_segment = numpy_to_audio_segment(audio_arr, rate)
    
    # Add the current segment to the final audio
    if final_audio is None:
        # If this is the first segment, initialize 'final_audio' with 'audio_segment'
        final_audio = audio_segment
    else:
        # Append the segment to the existing audio, creating a continuous audio file
        final_audio += audio_segment

Generating podcast segments:   3%|▎         | 1/29 [00:14<06:35, 14.13s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.

Step 8: Saving the Final Podcast Audio as an MP3 File
---
#### Code Overview and Purpose
This code block saves the final_audio object, which contains the complete concatenated podcast audio, as an MP3 file. It first checks that final_audio has been created, then opens a file dialog for the user to specify the save location and filename. If the user selects a location, the audio is exported in MP3 format; otherwise, a message is displayed indicating that the save operation was canceled.

#### End Result
The result of this code is an MP3 file saved to the specified location, containing the entire podcast audio generated in previous steps. This allows for easy storage and sharing of the generated podcast.
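
On a remote or headless machine the Tkinter dialog may not be available, so a plain path can be used instead. The following is a minimal sketch with a hypothetical helper, `resolve_output_path`, and a hypothetical default file name. It falls back to the working directory when no path is given and makes sure the file ends in .mp3.

```python
from pathlib import Path

def resolve_output_path(chosen_path, default_name="podcast.mp3"):
    """Return a usable .mp3 path, falling back to a default in the working directory."""
    path = Path(chosen_path) if chosen_path else Path.cwd() / default_name
    # Enforce the .mp3 extension so the export format matches the file name
    if path.suffix.lower() != ".mp3":
        path = path.with_suffix(".mp3")
    return path

print(resolve_output_path(""))
print(resolve_output_path("episodes/episode_01"))
```

The returned path can then be passed to `final_audio.export(...)` in place of the dialog result.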

In [18]:
# Ensure that final_audio has been generated before attempting to save
if final_audio:
    # Initialize Tkinter for the file save dialog
    root = tk.Tk()
    root.withdraw()  # Hide the main Tkinter window, showing only the file dialog

    # Open a "Save As" dialog to let the user choose where to save the audio file
    file_path = filedialog.asksaveasfilename(
        defaultextension=".mp3",  # Default to .mp3 extension if none provided
        filetypes=[("MP3 files", "*.mp3")],  # Only show MP3 file options
        title="Save the podcast audio as"  # Title of the save dialog window
    )

    # Check if a file path was selected
    if file_path:
        # Save the final audio as an MP3 file at the specified location
        # 'bitrate' sets the audio quality, and 'parameters' adjusts encoding options
        final_audio.export(file_path, 
                           format="mp3", 
                           bitrate="192k", 
                           parameters=["-q:a", "0"])  # High-quality encoding
    else:
        # Inform the user if they canceled the save operation
        print("Save operation canceled.")
else:
    # Notify the user if no audio was generated to save
    print("No audio segments were generated.")

Suggested Next Steps:
---
* Experiment with the prompts: feel free to experiment with the SYSTEM_PROMPT in the notebooks.
* Extend the workflow beyond two speakers.
* Test other TTS models.
* Experiment with speech enhancer models as a fifth step in the workflow.