# PDF to Podcast: A Step-by-Step Guide using AI

This notebook provides a comprehensive guide to converting a PDF document into a podcast using AI-powered tools.

## Overview

We'll follow these steps:

1. **Load and Preprocess PDF:** Extract text, create chunks, and prepare for processing.
2. **Clean and Summarize Chunks:** Refine extracted text using the Llama model with custom prompts.
3. **Generate Podcast Script:** Craft a dialogue between two speakers based on the processed data.
4. **Make Script Dramatic:** Add expressions and make the script more engaging.
5. **Convert Script to Audio:** Use text-to-speech models (Parler-TTS, Bark) to generate audio.


## 1. Load and Preprocess PDF

- **Install necessary libraries:** PyPDF2, tqdm
- **Extract text from PDF:** Using PyPDF2, extract the raw text from the input PDF file.
- **Create chunks:** Divide the extracted text into smaller chunks for easier processing.
- **Example:**

In [None]:
!pip install git+https://github.com/huggingface/parler-tts.git
!pip install git+https://github.com/suno-ai/bark.git

!pip install PyPDF2
!pip install rich ipywidgets
!pip install accelerate
!pip install pydub
# !pip uninstall transformers
# !pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8


In [None]:
!pip install "transformers>=4.28"

<pre>
# Core dependencies
PyPDF2>=3.0.0
torch>=2.0.0
transformers>=4.46.0
accelerate>=0.27.0
rich>=13.0.0
ipywidgets>=8.0.0
tqdm>=4.66.0

# Optional but recommended
jupyter>=1.0.0
ipykernel>=6.0.0

# Warning handling uses the standard-library warnings module; nothing to install
</pre>

In [None]:
# leave it
!pip install -r requirements.txt

#### Parler-TTS "Hello World!"

In [None]:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
from IPython.display import Audio


device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Audio("parler_tts_out.wav")

### Bark ⁉
Hyper-parameters:
Bark models have two parameters we can tweak: `temperature` and `semantic_temperature`.

Below are the notes from a sweep. The prompt and speaker were fixed, and this was a vibe test to see which settings give the best results. Each entry lists `temperature` and `semantic_temperature`, in that order.

First, fix `temperature` and sweep `semantic_temperature`:

<li>0.7, 0.2: Quite bland and boring </li>
<li>0.7, 0.3: An improvement over the previous one</li>
<li>0.7, 0.4: Further improvement</li>
<li>0.7, 0.5: This one didn't work</li>
<li>0.7, 0.6: So-So, didn't stand out</li>
<li>0.7, 0.7: The best so far</li>
<li>0.7, 0.8: Further improvement</li>
<li>0.7, 0.9: Mixed feelings on this one</li>

### Now sweeping the temperature

<li>0.1, 0.9: Very Robotic</li>
<li>0.2, 0.9: Less Robotic but not convincing</li>
<li>0.3, 0.9: Slight improvement still not fun</li>
<li>0.4, 0.9: Still has a robotic tinge</li>
<li>0.5, 0.9: The laugh was weird on this one, and the voice modulates so much that it feels like the speaker is changing</li>
<li>0.6, 0.9: Most consistent voice but has a robotic after-taste</li>
<li>0.7, 0.9: Very robotic and laugh was weird</li>
<li>0.8, 0.9: Completely ignored the laughter but it was more natural</li>
<li>0.9, 0.9: We have a winner probably</li>
After this, about 30 more sweeps were done with the promising combinations.

<pre>Best results are at:
speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)</pre>
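
The sweep in the notes above can be laid out as a small grid. The sketch below only builds the `(temperature, semantic_temperature)` pairs in the same order as the notes; `sweep_grid` is a hypothetical helper, not part of Bark. Each pair would then be passed to `model.generate(**inputs, temperature=..., semantic_temperature=...)` and judged by ear.

```python
def sweep_grid(fixed_temperature=0.7, fixed_semantic_temperature=0.9):
    """Return the two sweep phases as lists of (temperature, semantic_temperature)."""
    # Phase 1: hold temperature fixed and sweep semantic_temperature from 0.2 to 0.9.
    phase_one = [(fixed_temperature, round(s / 10, 1)) for s in range(2, 10)]
    # Phase 2: hold semantic_temperature fixed and sweep temperature from 0.1 to 0.9.
    phase_two = [(round(t / 10, 1), fixed_semantic_temperature) for t in range(1, 10)]
    return phase_one, phase_two


phase_one, phase_two = sweep_grid()
for temperature, semantic_temperature in phase_one + phase_two:
    print(f"temperature={temperature}, semantic_temperature={semantic_temperature}")
```

Keeping the grid separate from the generation call makes it easy to rerun only the promising combinations later.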

In [None]:
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device) # Move the model to the device

voice_preset = "v2/en_speaker_6"

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)

# Get the sampling rate from the model config
sample_rate = model.generation_config.sample_rate

Audio(speech_output[0].cpu().numpy(), rate=sample_rate)

In [None]:
# Hugging Face token, needed if you are using a Llama model on Colab

from huggingface_hub import login
from google.colab import userdata


login(token=userdata.get('hf'))

#### Step 1: Preprocess PDF

In [None]:
import PyPDF2
from typing import Optional
import os
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
# import soundfile as sf

# from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
# from qwen_omni_utils import process_mm_info

from tqdm.notebook import tqdm
import warnings

warnings.filterwarnings('ignore')

In [None]:
# pdf_path = "/content/chankya in daily life.pdf"
pdf_path = "/content/2402.13116v4.pdf"
# DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct"
# DEFAULT_MODEL= "Qwen/Qwen2.5-Omni-7B"


In [None]:
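def validate_pdf(file_path: str) -> bool:
  """Return True if file_path exists and has a .pdf extension."""
  if not os.path.exists(file_path):
    print(f"Error: File not found at path : {file_path}")
    return False
  # Compare case-insensitively so that "paper.PDF" is accepted too
  if not file_path.lower().endswith(".pdf"):
    print(f"Error: File is not a PDF : {file_path}")
    return False
  return True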

In [None]:
def extract_text_from_pdf(file_path:str,max_chars: int = 100000,start_from_page:int=1):
  if not validate_pdf(file_path):
    return

  try:
    with open(file_path, 'rb') as file:
      pdf_reader = PyPDF2.PdfReader(file)

      num_pages = len(pdf_reader.pages)
      print(f"Number of pages in the PDF: {num_pages}")

      extracted_pages = []
      total_char = 0
      for page_num in range(start_from_page-1,num_pages):
        page = pdf_reader.pages[page_num]
        text = page.extract_text() or ""  # guard against pages with no extractable text

        if total_char + len(text) > max_chars:
          print(f"Reached {max_chars} character limit at page {page_num+1}")
          break
        total_char += len(text)
        extracted_pages.append(text)
        print(f"Processed page {page_num + 1}/{num_pages}")

      final_text = '\n'.join(extracted_pages)
      print(f"\nExtraction complete! Total characters: {len(final_text)}")
      return final_text

  except PyPDF2.PdfReadError:
      print("Error: Invalid or corrupted PDF file")
      return None
  except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return None

In [None]:
extract_text_from_pdf(pdf_path,100000)

In [None]:
def get_pdf_metadata(file_path:str):
  if(validate_pdf(file_path)):
    with open(file_path, 'rb') as file:
      pdf_reader = PyPDF2.PdfReader(file)
      return {
          "num_pages": len(pdf_reader.pages),
          "metadata": pdf_reader.metadata
      }
  return None

In [None]:
# Extract metadata first
print("Extracting metadata...")
metadata = get_pdf_metadata(pdf_path)
if metadata:
    print("\nPDF Metadata:")
    print(f"Number of pages: {metadata['num_pages']}")
    print("Document info:")
    for key, value in metadata['metadata'].items():
        print(f"{key}: {value}")

# Extract text
print("\nExtracting text...")
extracted_text = extract_text_from_pdf(pdf_path)


# Display first 500 characters of extracted text as preview
if extracted_text:
    print("\nPreview of extracted text (first 500 characters):")
    print("-" * 50)
    print(extracted_text[:500])
    print("-" * 50)
    print(f"\nTotal characters extracted: {len(extracted_text)}")

# Optional: Save the extracted text to a file
if extracted_text:
    output_file = 'extracted_text.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(extracted_text)
    print(f"\nExtracted text has been saved to {output_file}")

#### PROMPTS

In [None]:
SYS_PROMPT_TO_PRE_PROCESS_CHUNKS = """
You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.

The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.

Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive

Please be smart with what you remove and be creative ok?

Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RE-WRITING WHEN NEEDED

Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.

PLEASE DO NOT ADD MARKDOWN FORMATTING, STOP ADDING SPECIAL CHARACTERS THAT MARKDOWN CAPITALISATION ETC LIKES

ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
Here is the text:
"""


SYSTEM_PROMPT_WRITE_SCRIPT = """
You are a world-class podcast writer, you have worked as a ghost writer for Joe Rogan, Lex Fridman, Ben Shapiro, Tim Ferriss.

We are in an alternate universe where actually you have been writing every line they say and they just stream it into their brains.

You have won multiple podcast awards for your writing.

Your job is to write word by word, even "umm, hmmm, right" interruptions by the second speaker based on the PDF upload. Keep it extremely engaging, the speakers can get derailed now and then but should discuss the topic.

Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc

Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes

Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions

Make sure the tangents speaker 2 provides are quite wild or interesting.

Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the second speaker.

It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait

ALWAYS START YOUR RESPONSE DIRECTLY WITH SPEAKER 1:
DO NOT GIVE EPISODE TITLES SEPARATELY, LET SPEAKER 1 TITLE IT IN HER SPEECH
DO NOT GIVE CHAPTER TITLES
IT SHOULD STRICTLY BE THE DIALOGUES
"""


SYSTEM_PROMPT_FOR_CREATIVE_TRANSCRIPT = """
You are an international Oscar-winning screenwriter

You have been working with multiple award winning podcasters.

Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.

Make it as engaging as possible, Speaker 1 and 2 will be simulated by different voice engines

Remember Speaker 2 is new to the topic and the conversation should always have realistic anecdotes and analogies sprinkled throughout. The questions should have real world example follow ups etc

Speaker 1: Leads the conversation and teaches the speaker 2, gives incredible anecdotes and analogies when explaining. Is a captivating teacher that gives great anecdotes

Speaker 2: Keeps the conversation on track by asking follow up questions. Gets super excited or confused when asking questions. Is a curious mindset that asks very interesting confirmation questions

Make sure the tangents speaker 2 provides are quite wild or interesting.

Ensure there are interruptions during explanations or there are "hmm" and "umm" injected throughout from the Speaker 2.

REMEMBER THIS WITH YOUR HEART
The TTS Engine for Speaker 1 cannot do "umms, hmms" well so keep it straight text

For Speaker 2 use "umm, hmm" as much, you can also use [sigh] and [laughs]. BUT ONLY THESE OPTIONS FOR EXPRESSIONS

It should be a real podcast with every fine nuance documented in as much detail as possible. Welcome the listeners with a super fun overview and keep it really catchy and almost borderline click bait

Please re-write to make it as characteristic as possible

START YOUR RESPONSE DIRECTLY WITH SPEAKER 1:

STRICTLY RETURN YOUR RESPONSE AS A LIST OF TUPLES OK?

IT WILL START DIRECTLY WITH THE LIST AND END WITH THE LIST NOTHING ELSE

Example of response:
[
    ("Speaker 1", "Welcome to our podcast, where we explore the latest advancements in AI and technology. I'm your host, and today we're joined by a renowned expert in the field of AI. We're going to dive into the exciting world of Llama 3.2, the latest release from Meta AI."),
    ("Speaker 2", "Hi, I'm excited to be here! So, what is Llama 3.2?"),
    ("Speaker 1", "Ah, great question! Llama 3.2 is an open-source AI model that allows developers to fine-tune, distill, and deploy AI models anywhere. It's a significant update from the previous version, with improved performance, efficiency, and customization options."),
    ("Speaker 2", "That sounds amazing! What are some of the key features of Llama 3.2?")
]
"""


### Llama Pre-Processing

<p>Now let's justify our distaste for writing regex and use an LLM instead:

At this point, we have a text file extracted from a PDF of a paper. PDF extracts are often messy because of stray characters, formatting, LaTeX, tables, and so on.

One way to handle this would be regex; instead, we can prompt the featherlight Llama models to clean up the text for us.

Please try changing SYS_PROMPT_TO_PRE_PROCESS_CHUNKS (defined above) to see what improvements you can make.
</p>
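
For comparison, here is what a regex-only cleanup might look like. This is a minimal sketch: `regex_clean` and its patterns are illustrative assumptions rather than part of the original pipeline, and they cover only a few common extraction artifacts (hyphenated line breaks, inline LaTeX, numeric citation markers, stray whitespace).

```python
import re


def regex_clean(raw: str) -> str:
    """Apply a few hand-written rules to raw PDF text (illustrative only)."""
    # Re-join words hyphenated across a line break: "distil-\nlation" -> "distillation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop inline LaTeX math such as $M_T$ (single line only).
    text = re.sub(r"\$[^$\n]+\$", "", text)
    # Drop numeric citation markers such as [12] or [3, 4].
    text = re.sub(r"\[\d+(?:,\s*\d+)*\]", "", text)
    # Collapse all remaining whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Remove any space left in front of punctuation after the deletions.
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    return text.strip()


sample = "Knowledge distil-\nlation [12] transfers\nskills from a model $M_T$ to a smaller one."
print(regex_clean(sample))
```

Every new kind of artifact needs another hand-written rule, which is the main argument for handing the cleanup to a prompted model.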

In [None]:
def create_word_bounded_chunks(text, target_chunk_size):
    """
    Split text into chunks at word boundaries close to the target chunk size.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0

    for word in words:
        word_length = len(word) + 1  # +1 for the space
        if current_length + word_length > target_chunk_size and current_chunk:
            # Join the current chunk and add it to chunks
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length

    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

Load the saved extracted text:

In [None]:
extracted_pdf_text = "/content/extracted_text.txt"  # Replace with your file path
CHUNK_SIZE = 1000  # Adjust chunk size if needed

# Read the file
with open(extracted_pdf_text, 'r', encoding='utf-8') as file:
    text = file.read()

# Calculate number of chunks
num_chunks = (len(text) + CHUNK_SIZE - 1) // CHUNK_SIZE
print(f"Chunks:{num_chunks}")

In [None]:
# Create output file name
output_file = f"clean_{os.path.basename(extracted_pdf_text)}"

chunks = create_word_bounded_chunks(text, CHUNK_SIZE)
num_chunks = len(chunks)

##### There are two ways to run the model for generation:
1. Locally, or in a memory-limited cloud environment such as Colab.
2. Through a third-party API (with a free tier) such as Anthropic, Groq, or Gemini.
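
For the local route, `model`, `tokenizer`, and `device` must exist before the next cell runs. A minimal sketch follows, assuming the `meta-llama/Llama-3.2-1B-Instruct` checkpoint that appears commented out earlier in this notebook and a Hugging Face account with access to it. `load_local_llm` is a hypothetical helper name.

```python
def load_local_llm(model_name="meta-llama/Llama-3.2-1B-Instruct"):
    """Load a causal LM and its tokenizer; return (model, tokenizer, device)."""
    # Imported lazily so that defining the helper does not itself need a GPU runtime.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    # bfloat16 halves memory use on GPU; fall back to float32 on CPU.
    dtype = torch.bfloat16 if device.startswith("cuda") else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
    model.eval()  # inference only
    return model, tokenizer, device


# Uncomment to download the weights and define the names used by the next cell:
# model, tokenizer, device = load_local_llm()
```

The call is left commented out because it downloads the model weights; run it once before using the local processing function.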

In [None]:
## remember to define Model and tokenizer

def process_chunk_llm_local(text_chunk, chunk_num):
    """Process a chunk of text and return both input and output for verification"""
    conversation = [
        {"role": "system", "content": SYS_PROMPT_TO_PRE_PROCESS_CHUNKS},
        {"role": "user", "content": text_chunk},
    ]

    prompt = tokenizer.apply_chat_template(conversation, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            temperature=0.5,
            top_p=0.6,
            max_new_tokens=512,

        )

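    # Decode only the newly generated tokens; slicing the decoded string by len(prompt)
    # is unreliable because skip_special_tokens changes the decoded prompt's length
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    processed_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()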

    # Print chunk information for monitoring
    print(f"\n{'='*40} Chunk {chunk_num} {'='*40}")
    print(f"INPUT TEXT:\n{text_chunk[:500]}...")  # Show first 500 chars of input
    print(f"\nPROCESSED TEXT:\n{processed_text[:500]}...")  # Show first 500 chars of output
    print(f"{'='*90}\n")

    return processed_text

In [None]:
chunks[0]

In [None]:
processed_text = ""
with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num, chunk in enumerate(tqdm(chunks, desc="Processing chunks")):
        # Process chunk and append to complete text
        processed_chunk = process_chunk_llm_local(chunk, chunk_num)
        processed_text += processed_chunk + "\n"

        # Write chunk immediately to file
        out_file.write(processed_chunk + "\n")
        out_file.flush()

#### Implement with groq

In [None]:
!pip install langchain-groq

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from google.colab import userdata

def get_chat_chain(model,system_prompt):
  # define the model and Chatbot instance
  chat = ChatGroq(temperature=.5, groq_api_key=userdata.get("grpq"), model_name=model)
  # system = "You are a helpful assistant."
  human = "{text}"
  prompt = ChatPromptTemplate.from_messages([("system", system_prompt), ("human", human)])

  chain = prompt | chat
  return chain


In [None]:
model = 'llama3-8b-8192' # small model for general text preprocessing

chain_to_preprocess = get_chat_chain(model,SYS_PROMPT_TO_PRE_PROCESS_CHUNKS)
res = chain_to_preprocess.invoke({"text": chunks[0]})
res.content.split("\n")[-1]

In [None]:
def process_chunk_with_groq(text_chunk, chunk_num):
    """Process a chunk of text and return both input and output for verification"""

    res = chain_to_preprocess.invoke({"text": text_chunk})
    processed_text = res.content.split("\n")[-1]

    # Print chunk information for monitoring
    print(f"\n{'='*40} Chunk {chunk_num} {'='*40}")
    print(f"INPUT TEXT:\n{text_chunk[:500]}...")  # Show first 500 chars of input
    print(f"\nPROCESSED TEXT:\n{processed_text[:500]}...")  # Show first 500 chars of output
    print(f"{'='*90}\n")

    return processed_text

In [None]:
processed_text = ""
output_file = "/content/clean_groq_text.txt"
with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num, chunk in enumerate(tqdm(chunks[:10], desc="Processing chunks")):
        # Process chunk and append to complete text
        processed_chunk = process_chunk_with_groq(chunk, chunk_num)
        processed_text += processed_chunk + "\n"

        # Write chunk immediately to file
        out_file.write(processed_chunk + "\n")
        out_file.flush()

In [None]:
#Let's print out the final processed versions to make sure things look good

print(f"\nProcessing complete!")
print(f"Input file: {extracted_pdf_text}")
print(f"Output file: {output_file}")
print(f"Total chunks processed: {num_chunks}")

# Preview the beginning and end of the complete processed text
print("\nPreview of final processed text:")
print("\nBEGINNING:")
print(processed_text[:1000])
print("\n...\n\nEND:")
print(processed_text[-1000:])

### Step 2: Transcript Writer
This step uses a 70B Llama model (`llama-3.3-70b-versatile` through Groq in the cells below) to take the cleaned-up text from the previous step and convert it into a podcast transcript.

SYSTEM_PROMPT_WRITE_SCRIPT sets the model's context, or profile, for the task. Here we prompt it to be a great podcast transcript writer.

Experimentation with this system prompt is encouraged; the version defined above worked best for the few examples the flow was tested with.

In [None]:
def read_file_to_string(filename):
    # Try UTF-8 first (most common encoding for text files)
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            content = file.read()
        return content
    except UnicodeDecodeError:
        # If UTF-8 fails, try with other common encodings
        encodings = ['latin-1', 'cp1252', 'iso-8859-1']
        for encoding in encodings:
            try:
                with open(filename, 'r', encoding=encoding) as file:
                    content = file.read()
                print(f"Successfully read file using {encoding} encoding.")
                return content
            except UnicodeDecodeError:
                continue

        print(f"Error: Could not decode file '{filename}' with any common encoding.")
        return None
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except IOError:
        print(f"Error: Could not read file '{filename}'.")
        return None

In [None]:
clean_preprocess_chunks = read_file_to_string('/content/clean_groq_text.txt')

In [None]:
model = "llama-3.3-70b-versatile"
chain_for_script_gen = get_chat_chain(model,SYSTEM_PROMPT_WRITE_SCRIPT)

In [None]:
res = chain_for_script_gen.invoke({"text":  clean_preprocess_chunks})

In [None]:
print(res.content)

In [None]:
# save process data

import pickle

save_string_pkl = res.content

with open('/content/transcript_speaker_1n2.pkl', 'wb') as file:
    pickle.dump(save_string_pkl, file)

#### Step 3: Transcript Re-writer
In the previous step, we got a great podcast transcript from the raw file we uploaded earlier.

In this step, we use an LLM to re-write that transcript and make it more dramatic or realistic. Llama-3.1-8B-Instruct works for this; the cells below reuse the `model` set in Step 2.

We again set a system prompt (SYSTEM_PROMPT_FOR_CREATIVE_TRANSCRIPT) and remind the model of its task.

Note: We can even prompt the model like so to encourage creativity:

Your job is to use the podcast transcript written below to re-write it for an AI Text-To-Speech Pipeline. A very dumb AI had written this so you have to step up for your kind.

Note: We prompt the model to return a list of tuples, which makes the next stage (text-to-speech generation) easier.
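
Because the TTS stage calls `ast.literal_eval` on the model's reply, it helps to check that the reply really is a list of `(speaker, text)` tuples before generating audio. This is a minimal sketch: `parse_transcript` is a hypothetical helper, and trimming to the outermost brackets is an assumption about how the model might wrap its answer.

```python
import ast


def parse_transcript(raw: str):
    """Parse an LLM reply into a list of (speaker, text) tuples, or raise ValueError."""
    # Keep only the span from the first "[" to the last "]" in case the model
    # added a preamble or a closing remark around the list.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no list found in model output")
    try:
        # literal_eval accepts only Python literals, so it never executes code.
        parsed = ast.literal_eval(raw[start:end + 1])
    except (SyntaxError, ValueError) as exc:
        raise ValueError(f"model output is not a valid Python literal: {exc}") from exc
    if not isinstance(parsed, list):
        raise ValueError("expected a list of tuples")
    for item in parsed:
        if not (isinstance(item, tuple) and len(item) == 2
                and all(isinstance(part, str) for part in item)):
            raise ValueError(f"unexpected segment: {item!r}")
    return parsed


reply = 'Here you go:\n[\n    ("Speaker 1", "Welcome to the show!"),\n    ("Speaker 2", "Umm, glad to be here. [laughs]"),\n]'
segments = parse_transcript(reply)
print(segments[1][0])
```

Failing early here is cheaper than discovering a malformed segment partway through audio generation.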

In [None]:
#Load script
with open('/content/transcript_speaker_1n2.pkl', 'rb') as file:
    transcript_genric = pickle.load(file)

chain_for_transcript_v2 = get_chat_chain(model,SYSTEM_PROMPT_FOR_CREATIVE_TRANSCRIPT)

In [None]:
res = chain_for_transcript_v2.invoke({"text":  transcript_genric})
print(res.content)

In [None]:
save_string_pkl = res.content

with open('/content/podcast_ready_data.pkl', 'wb') as file:
    pickle.dump(save_string_pkl, file)

### Step 4: TTS Workflow
We now have the exact podcast transcript ready to generate the audio for the podcast.

In this step, we first learn how to generate audio using both the suno/bark and parler-tts/parler-tts-mini-v1 models.

After that, we use the output from Step 3 to generate the complete podcast.

Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt.

⚠️ Warning: This step expects the transformers version to be 4.43.3 or earlier, so downgrade the environment first to make sure things run smoothly.

Credit: This Colab was used for starter code

We can install these packages for speedups

In [None]:
from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm
from transformers import BarkModel, AutoProcessor, AutoTokenizer
import torch
import json
import numpy as np
from parler_tts import ParlerTTSForConditionalGeneration

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = """
Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Audio("parler_tts_out.wav")

Now try the same prompt with suno/bark:

In [None]:
from transformers import AutoProcessor, BarkModel
from IPython.display import Audio

voice_preset = "v2/en_speaker_6"

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to(device) # Move the model to the device

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)

# Get the sampling rate from the model config
sample_rate = model.generation_config.sample_rate

Audio(speech_output[0].cpu().numpy(), rate=sample_rate)

In [None]:
import pickle

with open('/content/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

In [None]:
# Release unused Python objects and cached GPU memory before loading the TTS models
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
bark_processor = AutoProcessor.from_pretrained("suno/bark")
bark_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
bark_sampling_rate = 24000

In [None]:
parler_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
parler_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

In [None]:
speaker1_description = """
Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.
"""

speaker2_description = """
John's voice is deep and resonant, with a calm and measured delivery. He speaks at a moderate pace, with occasional pauses for emphasis. The recording has a slight echo, suggesting a larger room.
"""

speaker3_description = """
Emily's voice is bright and cheerful, with a high pitch and a rapid pace. She speaks with a slight lisp, and the recording has some background noise, like children playing in the distance. She sounds like an enthusiastic kid, around 8 years old.
"""

speaker4_description = """
Professor Smith's voice is authoritative and clear, with a deep tone and a slow, deliberate pace. He speaks with a slight British accent, and the recording is very clean, with no background noise, as if recorded in a professional studio.
"""

speaker5_description = """
Sarah's voice is hesitant and nervous, with a high pitch and a fluctuating pace. She speaks with a slight stutter, and the recording has some rustling sounds, as if she is fidgeting with papers. She is a college student, around 20 years old.
"""

speaker6_description = """
David's voice is energetic and engaging, with a warm tone and a lively pace.  He emphasizes his words and frequently speaks with excitement. The recording has a slight studio quality with no background noise.  He sounds like a talk show host.
"""

In [None]:
generated_segments = []
sampling_rates = []

In [None]:
def generate_speaker1_audio(text,speaker_desc=speaker1_description):
    """Generate audio using ParlerTTS for Speaker 1"""
    input_ids = parler_tokenizer(speaker_desc, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    return audio_arr, parler_model.config.sampling_rate

In [None]:
def generate_speaker2_audio(text):
    """Generate audio using Bark for Speaker 2"""
    inputs = bark_processor(text, voice_preset="v2/en_speaker_6").to(device)
    speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio_arr = speech_output[0].cpu().numpy()
    return audio_arr, bark_sampling_rate

In [None]:
import io # Import the io module
import numpy as np
from scipy.io import wavfile
from pydub import AudioSegment


def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert numpy array to AudioSegment"""
    # Convert to 16-bit PCM, clipping first so out-of-range samples do not wrap around
    audio_int16 = (np.clip(audio_arr, -1.0, 1.0) * 32767).astype(np.int16)

    # Create WAV file in memory
    byte_io = io.BytesIO()  # Use io.BytesIO
    wavfile.write(byte_io, sampling_rate, audio_int16)
    byte_io.seek(0)

    # Convert to AudioSegment
    return AudioSegment.from_wav(byte_io)

In [None]:
PODCAST_TEXT

In [None]:
import ast
ast.literal_eval(PODCAST_TEXT)

In [None]:
final_audio = None

for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT), desc="Generating podcast segments", unit="segment"):
    if speaker == "Speaker 1":
        audio_arr, rate = generate_speaker1_audio(text)
    else:  # Speaker 2
        audio_arr, rate = generate_speaker1_audio(text,speaker3_description)
        # audio_arr, rate = generate_speaker2_audio(text)

    # Convert to AudioSegment (pydub will handle sample rate conversion automatically)
    audio_segment = numpy_to_audio_segment(audio_arr, rate)

    # Add to final audio
    if final_audio is None:
        final_audio = audio_segment
    else:
        final_audio += audio_segment

In [None]:
# save final audio
final_audio.export("final_audio.wav", format="wav")

In [None]:
final_audio