## Notebook 4: TTS Workflow

The podcast transcript is now ready, so we can generate the audio for the podcast.

In this notebook, we will first learn how to generate audio using the `suno/bark` and `parler-tts/parler-tts-mini-v1` models.

After that, we will use the output from Notebook 3 to generate the complete podcast.

Note: Please feel free to extend this notebook with newer models. The two above were chosen after some tests with a sample prompt.

In [None]:
import numpy as np
import requests
from tqdm import tqdm
import ast
from pydub import AudioSegment
from io import BytesIO
import os

%env DEEPGRAM_API_KEY=YOUR_API_KEY

env: DEEPGRAM_API_KEY=YOUR_API_KEY


## Bringing it together: Making the Podcast

Now that we understand each piece, we can use the complete pipeline to generate the entire podcast.

Let's load the pickle file from earlier and proceed:

In [51]:
import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

In [52]:
PODCAST_TEXT

'[\n    ("Speaker 1", "Welcome to today\'s explosive episode of \'AI Revolution Unleashed!\' where we\'re about to dive headfirst into the uncharted territories of knowledge distillation for Large Language Models. Today\'s topic is literally a ticking time bomb that\'s going to change the AI landscape forever. Think of it as the secret ingredient to unleashing the true potential of AI. So, fasten your seatbelt, and get ready for the wildest ride of your life!"),\n    ("Speaker 2", "Umm, what exactly is knowledge distillation? I mean, I\'ve heard of it before, but I\'m not entirely sure what it does hm"),\n    ("Speaker 1", "Ah, great question! Knowledge distillation is the process of transferring knowledge from a large, pre-trained model – think of it as the \'teacher\' – to a smaller, more efficient model – the \'student.\' This technique enables us to harness the power of massive language models without the need for astronomical computational resources and data. Essentially, it\'s li

We often argue that data structures aren't very useful in everyday work. This time, however, the knowledge comes in handy.

The pickle file holds a single string, so we will parse it into a list of `(speaker, text)` tuples with the help of `ast.literal_eval()`:
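As a minimal sketch of what this parsing step does, the example below runs `ast.literal_eval()` on a tiny, made-up transcript and checks the shape of the result. The `sample_text` string and the `parse_transcript` helper are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import ast

# Hypothetical miniature transcript in the same shape as PODCAST_TEXT
sample_text = '[("Speaker 1", "Welcome to the show!"), ("Speaker 2", "Umm, thanks for having me.")]'

def parse_transcript(raw):
    """Parse a transcript string into a list of (speaker, text) tuples."""
    # literal_eval only evaluates Python literals, never arbitrary code
    segments = ast.literal_eval(raw)
    if not isinstance(segments, list):
        raise ValueError("Expected a list of (speaker, text) tuples")
    for item in segments:
        is_pair = isinstance(item, tuple) and len(item) == 2
        if not (is_pair and all(isinstance(part, str) for part in item)):
            raise ValueError(f"Malformed segment: {item!r}")
    return segments

segments = parse_transcript(sample_text)
print(type(segments).__name__, len(segments))
print(segments[0])
```

Unlike `eval()`, `ast.literal_eval()` rejects anything that is not a plain literal (a function call, for example), which makes it a reasonable choice for text produced by a language model.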

In [53]:
import ast
ast.literal_eval(PODCAST_TEXT)

[('Speaker 1',
  "Welcome to today's explosive episode of 'AI Revolution Unleashed!' where we're about to dive headfirst into the uncharted territories of knowledge distillation for Large Language Models. Today's topic is literally a ticking time bomb that's going to change the AI landscape forever. Think of it as the secret ingredient to unleashing the true potential of AI. So, fasten your seatbelt, and get ready for the wildest ride of your life!"),
 ('Speaker 2',
  "Umm, what exactly is knowledge distillation? I mean, I've heard of it before, but I'm not entirely sure what it does hm"),
 ('Speaker 1',
  "Ah, great question! Knowledge distillation is the process of transferring knowledge from a large, pre-trained model – think of it as the 'teacher' – to a smaller, more efficient model – the 'student.' This technique enables us to harness the power of massive language models without the need for astronomical computational resources and data. Essentially, it's like downloading the bra

#### Generating the Final Podcast with Deepgram Text to Speech

Finally, we can loop over the list of tuples and use our helper functions to generate the audio:

In [None]:
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]
OUTPUT_FOLDER = "podcast_segments"  # Folder to save audio segments
VOICE_1 = "aura-helios-en"  # Speaker 1 voice model
VOICE_2 = "aura-asteria-en"  # Speaker 2 voice model
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# Function to call Deepgram TTS API
def call_deepgram_tts(text, model):
    url = f"https://api.deepgram.com/v1/speak?model={model}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {DEEPGRAM_API_KEY}"
    }
    data = {
        "text": text
    }
    response = requests.post(url, headers=headers, json=data)
    
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"Deepgram TTS API error: {response.status_code} - {response.text}")

# Save each speaker's segment as an MP3 file
def save_speaker_audio(text, model, filename):
    audio_data = call_deepgram_tts(text, model)
    with open(filename, "wb") as f:
        f.write(audio_data)


## Generate the segments and combine them into a single file

In [55]:
# Main code for generating and saving podcast segments
for index, (speaker, text) in enumerate(tqdm(ast.literal_eval(PODCAST_TEXT), desc="Generating podcast segments", unit="segment")):
    model = VOICE_1 if speaker == "Speaker 1" else VOICE_2
    filename = os.path.join(OUTPUT_FOLDER, f"segment_{index + 1}.mp3")
    save_speaker_audio(text, model, filename)

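# Combine all segments in numeric order; a plain sorted() is lexicographic
# and would place segment_10.mp3 before segment_2.mp3
segment_files = sorted(
    (f for f in os.listdir(OUTPUT_FOLDER) if f.startswith("segment_") and f.endswith(".mp3")),
    key=lambda f: int(f[len("segment_"):-len(".mp3")]),
)
combined_audio = None
for filename in segment_files:
    segment = AudioSegment.from_mp3(os.path.join(OUTPUT_FOLDER, filename))
    if combined_audio is None:
        combined_audio = segment
    else:
        combined_audio += segment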

Generating podcast segments:   0%|          | 0/17 [00:00<?, ?segment/s]

Generating podcast segments: 100%|██████████| 17/17 [00:10<00:00,  1.68segment/s]
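The loop above sends each speaker turn to the API in a single request. TTS endpoints commonly cap the length of one request, so a very long turn may need to be split first. The sketch below shows one way to do that at sentence boundaries. The `split_text` helper and the `MAX_CHARS` value of 2000 are assumptions for illustration; check the provider's documentation for the actual limit.

```python
import re

MAX_CHARS = 2000  # assumed per-request limit; verify against the provider's docs

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, breaking after sentences where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_text("One. Two! Three? Four.", max_chars=10))
```

Each chunk could then be passed to `save_speaker_audio` as its own segment file, keeping the same voice for every chunk of a turn.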


### Output the Podcast

We can now save this as an MP3 file:

In [56]:
# Export the final combined audio
combined_audio.export("final_podcast.mp3", format="mp3", bitrate="48k")

<_io.BufferedRandom name='final_podcast.mp3'>

### Suggested Next Steps:

- Experiment with the prompts: try changing the SYSTEM_PROMPT in the earlier notebooks
- Extend the workflow beyond two speakers
- Test other TTS models
- Experiment with speech enhancer models as a fifth step
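As a starting point for extending the workflow beyond two speakers, the voice selection could move from an `if`/`else` to a lookup table. This is only a sketch: the `VOICE_BY_SPEAKER` mapping, the `voice_for` helper, the "Speaker 3" label, and its voice name are assumptions for illustration, and the transcript prompts would also need to produce a third speaker.

```python
# The first two entries are the voices used in this notebook.
# "Speaker 3" and its voice name are assumptions for illustration only.
VOICE_BY_SPEAKER = {
    "Speaker 1": "aura-helios-en",
    "Speaker 2": "aura-asteria-en",
    "Speaker 3": "aura-luna-en",
}
DEFAULT_VOICE = "aura-asteria-en"  # mirrors the original fallback to VOICE_2

def voice_for(speaker):
    """Return the voice model for a speaker label, falling back to the default."""
    return VOICE_BY_SPEAKER.get(speaker, DEFAULT_VOICE)

for label in ("Speaker 1", "Speaker 3", "Narrator"):
    print(label, "->", voice_for(label))
```

In the generation loop, `model = voice_for(speaker)` would then replace the two-way conditional.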

In [57]:
#fin