### Testing Audio Generation with the Parler (`parler_tts`) and Bark Models
`parler_tts` is a text-to-speech (TTS) library developed by Hugging Face.

##### Text-To-Speech
[ipsilondev/parler_tts](https://huggingface.co/ipsilondev/parler_tts)

[parler-tts/parler-tts-mini-v1](https://huggingface.co/parler-tts/parler-tts-mini-v1)

Install Parler-TTS from source: `uv pip install git+https://github.com/huggingface/parler-tts.git`

In [None]:
#!pip3 install optimum
#!pip install -U flash-attn --no-build-isolation
#!pip install transformers==4.43.3

In [None]:
from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm

from transformers import BarkModel, AutoProcessor, AutoTokenizer
import torch
import json
import numpy as np
from parler_tts import ParlerTTSForConditionalGeneration

### Parler Model
Let's try the Parler model first and generate a short segment in speaker Laura's voice.

In [None]:
## Set up device
device = "cuda" if torch.cuda.is_available() else "cpu"

## Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = """
Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
# text_prompt → What to say
# description → How to say it (like giving instructions, mood, emotion, pacing, tone)
description = """ 
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""

## Tokenize the inputs into numerical tensors (what the model expects) and move them to the GPU or CPU via .to(device)
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
# input_ids: tells the model how the speaker should sound.
# prompt_input_ids: tells the model what the speaker should say.
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)
# Together, they allow the model to generate expressive speech in the style you describe.

## Generate audio
# model.generate(...) produces speech audio from the description and the text
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze() # .squeeze() removes any extra dimensions.

# The model returns a PyTorch tensor (on the GPU when one is used).
# We move it to the CPU with .cpu().
# Then we convert it to a NumPy array, which is easier to work with.
# End result: audio_arr is a 1D array of audio samples, just like the data you would read from a .wav file.

# Play audio in notebook
ipd.Audio(audio_arr, rate=model.config.sampling_rate)

### 🎙️ Bark Model

Amazing! Let's try the same with **Bark** now:

- We will set the `voice_preset` to our favorite speaker.
- This time, we can include **expression prompts** inside our generation prompt.
- Note:
  - We can **CAPITALIZE** words to make the model **emphasize** them.
  - We can use **hyphens** (`-`) to make the model **pause** on certain words.

Example:
> "HELLO - my name is ChatGPT and I'm EXCITED - to talk with you!"

Bark interprets these cues to generate more **natural-sounding and expressive speech**.
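
As a small illustration of the cues above, the sketch below marks up a plain sentence before it is passed to Bark. The helper `add_bark_cues` and its parameters are hypothetical and exist only for this example; it simply uppercases chosen words and appends a hyphen after others, and makes no claim about how strongly Bark reacts to each cue.

```python
import re


def add_bark_cues(text, emphasize=(), pause_after=()):
    """Uppercase selected words and add a ' -' pause marker after others.

    Hypothetical helper for illustration; not part of Bark or transformers.
    """
    emph = {w.lower() for w in emphasize}
    pause = {w.lower() for w in pause_after}
    out = []
    for token in text.split():
        # Compare on the word without trailing punctuation ("you!" -> "you")
        core = re.sub(r"\W+$", "", token).lower()
        if core in emph:
            token = token.upper()
        out.append(token)
        if core in pause:
            out.append("-")
    return " ".join(out)


prompt = add_bark_cues(
    "Hello my name is Bark and I'm excited to talk with you!",
    emphasize=["hello", "excited"],
    pause_after=["hello", "excited"],
)
print(prompt)
```

The resulting string can be used as the `text_prompt` in the Bark cells that follow.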


In [None]:
device = "cuda:7"  # Pins Bark to the 8th GPU on a multi-GPU machine (indices start at 0); change this to match your hardware.
# device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("suno/bark")  # Text-processing pipeline that matches the Bark model:
# it tokenizes the input text, applies the voice preset, and formats everything for the model.

## 🔧 Bark TTS Model Loading Options: Speed & Optimization Guide

### ✅ Default Model (Standard Setup)
```python
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
```
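- Uses Bark as-is, in float16 precision (roughly half the memory of float32).
- Needs no extra packages. float16 is intended for GPU; on CPU, float32 is the safer choice.
- 🐢 Slower inference than the options below, but safe and reliable.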

---

### ⚡ Option 1: BetterTransformer
```python
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
model = model.to_bettertransformer()
```
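- Uses Hugging Face's `BetterTransformer` integration for optimized transformer layers.
- ⚡ Typically speeds up inference; the exact gain depends on the GPU and batch size.
- ✅ May also reduce memory use.
- Requires **PyTorch 2.x** and the `optimum` package (`pip install optimum`).
- 🧪 Simple to use, with a good balance of speed and ease.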

---

### ⚡⚡ Option 2: FlashAttention 2
```python
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
```
- Enables **FlashAttention-2**: super fast & memory efficient attention.
- ⚡⚡ Highest performance, especially for large inputs or batching.
- Requires:
  - ✅ PyTorch 2.x
  - ✅ CUDA >= 11.7
  - ✅ Compatible GPU (Ampere or newer: A100, 3090, 4090, etc.)
  - ✅ `flash-attn` package installed and compiled correctly
- 🔧 Advanced setup — best for power users.

---

### 📌 Recommendation:
| Use Case | Suggested Setup |
|----------|-----------------|
| ✅ Just getting started | Default setup |
| ⚙️ Want speed with ease | `to_bettertransformer()` |
| 🧠 Have modern GPU + need max speed | `flash_attention_2` |
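
The table above can be turned into a small decision helper. This is a minimal sketch, not an official API: the function `pick_bark_setup` and its arguments are hypothetical, and the threshold assumes that "Ampere or newer" corresponds to CUDA compute capability 8.0 or higher.

```python
def pick_bark_setup(has_cuda, compute_capability=(0, 0),
                    has_flash_attn=False, has_optimum=False):
    """Return which loading option from the table fits the environment.

    Hypothetical helper: the returned labels mirror the three setups above.
    """
    # FlashAttention-2 needs a CUDA GPU (assumed: capability >= 8.0) and the flash-attn package
    if has_cuda and has_flash_attn and tuple(compute_capability) >= (8, 0):
        return "flash_attention_2"
    # BetterTransformer needs the optimum package
    if has_optimum:
        return "bettertransformer"
    # Otherwise fall back to the plain from_pretrained call
    return "default"


print(pick_bark_setup(True, (8, 6), has_flash_attn=True, has_optimum=True))
print(pick_bark_setup(True, (7, 5), has_flash_attn=True, has_optimum=True))
print(pick_bark_setup(False))
```

In a live session the inputs could come from `torch.cuda.is_available()`, `torch.cuda.get_device_capability()`, and `importlib.util.find_spec("flash_attn") is not None`.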



In [None]:
#model =  model.to_bettertransformer()
#model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)#.to_bettertransformer()

In [None]:
voice_preset = "v2/en_speaker_6" # selects a predefined voice from Bark’s library of synthetic voices
sampling_rate = 24000 #  sets the audio playback/sample rate, i.e., how many samples per second the audio contains

# 24 kHz is high enough to sound natural and expressive.
# It’s also more efficient (smaller file sizes) than 44.1kHz used in CD audio.
# Bark models are trained on 24kHz, so this is the native sample rate
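
To make the sample-rate comments concrete, here is a short arithmetic sketch. The helper `audio_stats` is hypothetical, and the sizes assume uncompressed mono audio at 16 bits (2 bytes) per sample.

```python
def audio_stats(num_samples, sampling_rate, bytes_per_sample=2):
    """Return (duration in seconds, raw size in bytes) for mono PCM audio."""
    duration_s = num_samples / sampling_rate
    size_bytes = num_samples * bytes_per_sample
    return duration_s, size_bytes


# Ten seconds of audio at Bark's 24 kHz versus CD-style 44.1 kHz
bark_stats = audio_stats(24_000 * 10, 24_000)
cd_stats = audio_stats(44_100 * 10, 44_100)
print(bark_stats)  # (10.0, 480000)
print(cd_stats)    # (10.0, 882000)
```

The same duration needs fewer samples at 24 kHz, which is why the raw audio is smaller.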

In [None]:
# text_prompt: the sentence the model should speak.
text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

# The actual audio generation happens here
speech_output = model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)  # Play the generated audio

### Bringing it together: Making the Podcast

In [None]:
import pickle

with open('../data/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

In [None]:
bark_processor = AutoProcessor.from_pretrained("suno/bark")
bark_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:3")
bark_sampling_rate = 24000

In [None]:
parler_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to("cuda:3")
parler_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

In [None]:
speaker1_description = """
Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.
"""

In [None]:
generated_segments = []
sampling_rates = []  # We'll need to keep track of sampling rates for each segment

In [None]:
device="cuda:3"

### Generate the Audio of Speaker_1 `Parler Model`

In [None]:
def generate_speaker1_audio(text):
    """Generate audio using ParlerTTS for Speaker 1"""
    input_ids = parler_tokenizer(speaker1_description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    return audio_arr, parler_model.config.sampling_rate

### Generate the Audio of Speaker_2 `Bark Model`

In [None]:
def generate_speaker2_audio(text):
    """Generate audio using Bark for Speaker 2"""
    inputs = bark_processor(text, voice_preset="v2/en_speaker_6").to(device)
    speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio_arr = speech_output[0].cpu().numpy()
    return audio_arr, bark_sampling_rate

# `numpy_to_audio_segment` Utility Function Explanation

`numpy_to_audio_segment` is a handy utility to convert a raw audio signal in a NumPy array into a format that can be easily manipulated or played using the pydub library’s `AudioSegment` class.

- **`audio_arr`**: A NumPy array representing audio waveform samples, usually floating-point values between -1 and 1.

- **`sampling_rate`**: The sample rate (in Hz), e.g., 24000 or 16000, that specifies how many samples per second the audio has.

---

## Step-by-Step Breakdown

1. **Convert to 16-bit PCM format**  
   - Audio samples usually range between -1 and 1.  
   - Multiplying by 32767 scales samples to the 16-bit integer range.  
   - `.astype(np.int16)` converts the samples to 16-bit signed integers, which is the standard format for WAV files.

2. **Create an in-memory WAV file**  
   - `io.BytesIO()` creates an in-memory binary stream (no disk I/O).  
   - `wavfile.write()` writes the audio data to this stream as a WAV file with the specified sampling rate.  
   - `byte_io.seek(0)` resets the stream position to the beginning for reading.

3. **Load as `AudioSegment`**  
   - `AudioSegment.from_wav(byte_io)` reads the in-memory WAV data and returns an `AudioSegment` object.  
   - This object can be used for audio manipulation, playback, and exporting to other formats.
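
Steps 1 and 2 can also be tried without any third-party packages. The sketch below is a standard-library variant (using `wave` and `struct` instead of NumPy and SciPy), so it is not the function used in this notebook; the helper name `floats_to_wav_bytes` is hypothetical. It scales float samples to 16-bit PCM, writes them to an in-memory WAV stream, and reads the header back.

```python
import io
import math
import struct
import wave


def floats_to_wav_bytes(samples, sampling_rate):
    """Write float samples in [-1, 1] to an in-memory 16-bit mono WAV stream."""
    # Step 1: clip to [-1, 1] and scale to the signed 16-bit range
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    # Step 2: write a WAV file into memory instead of onto disk
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wf.setframerate(sampling_rate)
        wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    buf.seek(0)                     # rewind so the stream can be read
    return buf


# One second of a 440 Hz tone at 24 kHz
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(24_000)]
wav_io = floats_to_wav_bytes(tone, 24_000)

with wave.open(wav_io, "rb") as wf:
    info = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())
print(info)  # (1, 2, 24000, 24000)
```

A stream produced this way is the kind of input that step 3 hands to `AudioSegment.from_wav`.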

---

## Why Use This?

- Many TTS or audio generation models output raw NumPy arrays.  
- `pydub.AudioSegment` provides powerful audio manipulation and export features.  
- This function bridges the raw model output to a flexible and easy-to-use audio format.

---


In [None]:
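import io

import numpy as np
from pydub import AudioSegment
from scipy.io import wavfile


def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert a NumPy array of float samples to a pydub AudioSegment."""
    # Cast to float32 (Bark runs in float16 here) and clip to [-1, 1] so scaling cannot overflow int16
    audio_float = np.clip(np.asarray(audio_arr, dtype=np.float32), -1.0, 1.0)
    # Convert to 16-bit PCM
    audio_int16 = (audio_float * 32767).astype(np.int16)

    # Create WAV file in memory
    byte_io = io.BytesIO()
    wavfile.write(byte_io, sampling_rate, audio_int16)
    byte_io.seek(0)

    # Convert to AudioSegment
    return AudioSegment.from_wav(byte_io)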

In [None]:
PODCAST_TEXT

# What is `import ast` and `ast.literal_eval(PODCAST_TEXT)`?

- **`import ast`**:  
  Imports Python's Abstract Syntax Trees (AST) module, which helps parse and analyze Python code programmatically.

- **`ast.literal_eval()`**:  
  A safe way to evaluate a string containing a Python literal or container (like strings, numbers, tuples, lists, dicts, booleans, and `None`) into its corresponding Python object.  
  Unlike `eval()`, it **only evaluates literals** and does **not execute arbitrary code**, so it's much safer.
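
The safety difference can be seen in a short sketch. The wrapper `parse_literal` is a hypothetical helper for this example; it returns `None` when the string is not a plain literal, relying on `ast.literal_eval` raising `ValueError` or `SyntaxError` in that case.

```python
import ast


def parse_literal(text):
    """Return the parsed Python literal, or None if text is not a pure literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


# A list of tuples is a literal, so it is parsed into a real list
ok = parse_literal("[('Speaker 1', 'Hello!')]")

# A function call is code, not a literal, so it is rejected instead of executed
rejected = parse_literal("__import__('os').getcwd()")

print(ok)
print(rejected)
```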

---

## In this context:

- `PODCAST_TEXT` is likely a string that looks like a Python list or dictionary, for example:  
  `"[('Speaker 1', 'Hello!'), ('Speaker 2', 'Hi there!')]"`

- Running `ast.literal_eval(PODCAST_TEXT)` converts this string into an actual Python list of tuples (or whatever structure is represented), so we can work with it as a real Python object.

---

## Why use it?

- When we load data from files (such as `.pkl`, `.txt`, or sometimes `.json`) that store Python data structures as strings, we need to convert those strings back into Python objects.
- `literal_eval` lets us do this safely and easily.

---

## Example:

```python
import ast

s = "[('Speaker 1', 'Hello!'), ('Speaker 2', 'Hi there!')]"
data = ast.literal_eval(s)
print(data)
# Output: [('Speaker 1', 'Hello!'), ('Speaker 2', 'Hi there!')]
print(type(data))
# Output: <class 'list'>
```


In [None]:
import ast
ast.literal_eval(PODCAST_TEXT)

### Generating the Final Podcast

This code implements a Text-to-Speech (TTS) pipeline that:

- Takes a podcast script divided into multiple segments, each assigned to a different speaker.
- Uses two separate TTS models to generate speech audio for each speaker’s text:
  - One model for Speaker 1’s voice.
  - Another model for Speaker 2’s voice.
- Converts the raw audio outputs (NumPy arrays) into `AudioSegment` objects for easier manipulation.
- Concatenates all audio segments in sequence to form a continuous podcast audio track.
- Produces a final combined audio file ready for playback or distribution.

In short: **it automatically creates a multi-voice podcast episode from written text by synthesizing and merging the voices.**
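
Because generation is slow, it can help to check the parsed script before any audio is synthesized. The sketch below is one possible pre-flight check, not part of the original pipeline; the function `validate_script` is hypothetical and assumes the script is a list of `(speaker, text)` pairs labelled "Speaker 1" or "Speaker 2".

```python
def validate_script(segments, speakers=("Speaker 1", "Speaker 2")):
    """Return a list of human-readable problems found in the script."""
    problems = []
    for i, item in enumerate(segments):
        # Each entry should unpack into exactly (speaker, text)
        if not (isinstance(item, (tuple, list)) and len(item) == 2):
            problems.append(f"segment {i}: expected a (speaker, text) pair")
            continue
        speaker, text = item
        if speaker not in speakers:
            problems.append(f"segment {i}: unknown speaker {speaker!r}")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"segment {i}: empty or non-string text")
    return problems


good = [("Speaker 1", "Welcome to the show."), ("Speaker 2", "Glad to be here!")]
bad = [("Speaker 3", "Hi"), ("Speaker 1", "   ")]
print(validate_script(good))
print(validate_script(bad))
```

This matters because the generation loop sends every label other than "Speaker 1" to the Speaker 2 model, so a mistyped label would not raise an error on its own.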


In [None]:
final_audio = None

for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc="Generating podcast segments", unit="segment"):
    if speaker == "Speaker 1":
        audio_arr, rate = generate_speaker1_audio(text)
    else:  # Speaker 2
        audio_arr, rate = generate_speaker2_audio(text)
    
    # Convert to AudioSegment (pydub converts segments to a common sample rate when they are concatenated below)
    audio_segment = numpy_to_audio_segment(audio_arr, rate)
    
    # Add to final audio
    if final_audio is None:
        final_audio = audio_segment
    else:
        final_audio += audio_segment

### Output the Podcast

In [None]:
final_audio.export("../data/outputs/_podcast.mp3", 
                  format="mp3", 
                  bitrate="192k",
                  parameters=["-q:a", "0"])

# Difference Between Raw Audio and AudioSegment

## Raw Audio
- Basic, low-level audio data represented as a NumPy array.
- Contains sound samples (numbers) typically ranging from -1 to 1.
- Just the waveform data without metadata or easy playback support.
- Output format from many TTS or audio generation models.
- Needs extra processing to play, edit, or save as audio files.

## AudioSegment (from pydub)
- A high-level Python object that wraps raw audio data.
- Includes metadata like sample rate, channels, duration.
- Supports easy playback, editing (cut, join, fade), and exporting (MP3, WAV).
- Automatically handles audio format and sample rate conversions.
- Makes audio manipulation and saving straightforward and user-friendly.

---

## Summary
| Aspect        | Raw Audio (NumPy Array)         | AudioSegment (pydub Object)         |
|---------------|--------------------------------|------------------------------------|
| Data Type    | Array of audio samples (numbers) | Object with audio data + metadata  |
| Usage       | Low-level waveform data           | Easy editing, playback, exporting  |
| Playback    | Requires additional tools         | Can play and export directly       |
| Editing     | Difficult, manual processing      | Built-in functions for edits       |

**AudioSegment is used to convert raw audio arrays into a convenient format for working with sound in Python.**
