# Generate Podcast Video from Transcript - V1 (Final Step)

## Overview

This notebook outlines the final step in generating a podcast video from a PDF document. It takes the podcast transcript and audio (generated in previous steps) and creates a video using the Wan2.1 1.3B model.

## Key Points

1. **Last Step:** This is the final step in the "Generate Podcast from PDF V1" series, where we generate a video from the transcript data.
2. **Model:** We utilize the Wan2.1 1.3B model for video generation.
3. **Data:**
   - Podcast transcript (generated from the previous step)
   - Podcast audio (generated from a previous step)
4. **Groq API Key:** A Groq API key is required for prompt generation. Store it as a Colab secret named `grpq`; the notebook reads it with `google.colab.userdata.get("grpq")`.
5. **Hardware:** This notebook is designed to run on an A100 GPU (40 GB) with 64GB of RAM. You might be able to run it on a Colab Pro T4 with 24GB of RAM by adjusting memory settings, but performance may vary.

## Steps

1. **Install Libraries:** Install necessary libraries, including `diffusers`, `moviepy`, `langchain-groq`, and others.
2. **Load Data:** Load the podcast transcript and audio data.
3. **Prompt Generation:** Use Langchain and Groq to generate creative prompts for video generation based on the transcript.
4. **Video Generation:** Utilize the Wan2.1 1.3B model with the generated prompts to create video clips.
5. **Combine Clips:** Merge the individual video clips and audio to create the final podcast video.

## Usage

1. Provide the path to your podcast transcript and audio files.
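2. Add your Groq API key as a Colab secret named `grpq` (Secrets panel in the left sidebar) and grant this notebook access to it. The code reads it with `google.colab.userdata.get("grpq")`.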
3. Execute the notebook cells to generate the podcast video.

## Notes

- Make sure to enable GPU acceleration in Colab.
- Adjust video generation parameters like `height`, `width`, and `guidance_scale` as needed.
- The checkpoint file allows resuming the generation process if interrupted.
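
The checkpoint mentioned above is a small text file holding the index of the next segment to generate. The sketch below shows one way to read and write such a file defensively. The helper names `read_checkpoint` and `write_checkpoint` are hypothetical and not part of the notebook, and the write-then-rename step is an added assumption meant to avoid a half-written file when the runtime is interrupted.

```python
import os
import tempfile

def read_checkpoint(path):
    """Return the index of the next segment to generate, or 0 when no usable checkpoint exists."""
    try:
        with open(path, "r") as f:
            return max(0, int(f.read().strip()))
    except (FileNotFoundError, ValueError):
        return 0

def write_checkpoint(path, next_index):
    """Write to a temporary file first, then swap it in, so an interruption cannot leave a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(next_index))
    os.replace(tmp_path, path)

# Demo in a throwaway directory (the notebook itself uses /content/generation_checkpoint.txt)
demo_path = os.path.join(tempfile.mkdtemp(), "generation_checkpoint.txt")
first_read = read_checkpoint(demo_path)   # no file exists yet
write_checkpoint(demo_path, 5)
second_read = read_checkpoint(demo_path)
```

A missing or corrupted checkpoint falls back to 0, which restarts generation from the first segment instead of raising an error.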

In [None]:
# Clone the Wan 2.1 repo to get its requirements.txt file
!git clone https://github.com/Wan-Video/Wan2.1.git

In [None]:
## Install all dependencies
!pip install -r Wan2.1/requirements.txt
!pip install --upgrade "diffusers[torch]"
# The cells below use the moviepy 1.x API (`moviepy.editor`, `set_audio`, `subclip`), so pin below 2.0
!pip install "moviepy<2" accelerate --quiet
!apt-get install -y ffmpeg
!pip install langchain-groq

### Load the transcript (generated in the previous step), common imports, and constants

In [None]:
import os
import torch
import numpy as np
import tempfile
from tqdm import tqdm
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips
from PIL import Image
# from diffusers import StableDiffusionPipeline
import ast
import pickle

# Load podcast transcript
with open('/content/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

# Temporary directory for intermediate files (the main cell later switches to /content/generated_clips)
output_dir = tempfile.mkdtemp()

# Select the device used by the video pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Path to the final podcast audio; export `final_audio` from the previous step to this path
final_audio_path = "/content/_podcast.mp3"

segments = ast.literal_eval(PODCAST_TEXT)
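
The later cells unpack each entry of `segments` as a `(speaker, text)` pair, so the pickled transcript is presumably a string containing a Python list of two-item tuples. The sketch below checks that shape up front, assuming that layout. The sample string and the `parse_segments` helper are hypothetical and only illustrate the expected structure.

```python
import ast

# Hypothetical sample mirroring the assumed layout: a string holding a list of (speaker, text) tuples
sample_text = '[("Speaker 1", "Welcome to the show."), ("Speaker 2", "Glad to be here!")]'

def parse_segments(raw):
    """Parse the transcript string and check that every entry is a (speaker, text) pair."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, (list, tuple)):
        raise ValueError("Transcript must evaluate to a list of segments")
    checked = []
    for index, item in enumerate(parsed):
        if not (isinstance(item, (list, tuple)) and len(item) == 2):
            raise ValueError(f"Segment {index} is not a (speaker, text) pair")
        speaker, text = item
        checked.append((str(speaker), str(text)))
    return checked

sample_segments = parse_segments(sample_text)
```

Failing early here is cheaper than discovering a malformed segment after several clips have already been generated.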

#### Function to generate a video from a prompt (480p or 720p; use the 14B model for higher-quality output)

In [None]:
import torch
from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

# Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
flow_shift = 3.0 # 5.0 for 720P, 3.0 for 480P
scheduler = UniPCMultistepScheduler(prediction_type='flow_prediction', use_flow_sigmas=True, num_train_timesteps=1000, flow_shift=flow_shift)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.scheduler = scheduler
pipe.to(device)

def generate_video(prompt, negative_prompt, num_frames, output_dir, output_file,
                   height=480, width=960, guidance_scale=5.0):
    """Generate one clip from a text prompt, save it at 16 fps, and return the file path."""
    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=height,
        width=width,
        num_frames=num_frames,
        guidance_scale=guidance_scale,
    ).frames[0]
    output_path = f"{output_dir}/{output_file}"
    export_to_video(output, output_path, fps=16)
    return output_path
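
The frame count passed to `generate_video` is derived from a duration in seconds. As far as I know, the Wan pipeline in Diffusers expects `num_frames - 1` to be divisible by 4 and defaults to 81 frames (about 5 seconds at 16 fps); treat both numbers as assumptions to verify against the Diffusers documentation. The hypothetical helper below sketches how a duration could be mapped to a frame count of that form, capped at a maximum.

```python
def snap_num_frames(seconds, fps=16, max_frames=81):
    """Return the largest frame count of the form 4*k + 1 that does not exceed
    seconds * fps (rounded) or max_frames."""
    target = max(1, int(round(seconds * fps)))
    target = min(target, max_frames)
    k = (target - 1) // 4  # round down so the result never exceeds the target
    return 4 * k + 1

short_clip = snap_num_frames(3)    # 3 s at 16 fps
long_clip = snap_num_frames(10)    # capped at max_frames
```

Segments longer than the cap would then need the generated clip to be stretched or looped to cover the audio, which is a design choice this sketch leaves open.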


In [None]:
prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
generate_video(prompt,negative_prompt,88,"/content","dog_cat_cook.mp4")

#### Function to generate a video prompt from the transcript
*** You can fine-tune the prompt further as needed. ***

In [None]:
from google.colab import userdata
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

model = 'llama3-70b-8192'

chat = ChatGroq(temperature=.5, groq_api_key=userdata.get("grpq"), model_name=model)


def get_prompt_to_generate_video_clip(speaker, text):
  SYSTEM_PROMPT = """
    You are a creative content creator and prompt engineer, and you are using a tool to generate a podcast video.
    Your job is to write a creative prompt for the Wan2.1 1.3B text-to-video model from a transcript segment spoken by Speaker 1 or Speaker 2.
    Analyze the transcript and write a prompt that would drive engagement on social media.
    Return only a JSON object containing "prompt" and "negative_prompt", with nothing else before or after it, like this:
    {{
      "prompt":"A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window.",
      "negative_prompt":"Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
    }}
    """
  human = "speaker:{speaker} , transcript:{transcript}"
  prompt = ChatPromptTemplate.from_messages([("system", SYSTEM_PROMPT), ("human", human)])

  chain_for_prompt_gen = prompt | chat

  res = chain_for_prompt_gen.invoke({"speaker":speaker,"transcript":  text})

  # Parse the JSON response and keep only the two expected keys
  import json
  try:
    prompt_data = json.loads(res.content)
    # Default to empty strings when a key is missing from the response
    return {
      "prompt": prompt_data.get("prompt", ""),
      "negative_prompt": prompt_data.get("negative_prompt", ""),
    }
  except (json.JSONDecodeError, AttributeError) as e:
    print(f"Error decoding JSON: {e}")
    print(f"Raw response: {res.content}")
    return {"prompt": "", "negative_prompt": ""}  # Empty prompts signal failure to the caller
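
Chat models sometimes wrap their JSON in extra text or a Markdown code fence, in which case a plain `json.loads` on the whole reply fails. The sketch below is one tolerant alternative, assuming the reply contains a single JSON object somewhere in it. The helper name `parse_prompt_json` is hypothetical and not part of the notebook.

```python
import json
import re

def parse_prompt_json(raw):
    """Extract "prompt" and "negative_prompt" from a reply that may wrap the JSON in extra text."""
    empty = {"prompt": "", "negative_prompt": ""}
    # Grab everything from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    # Missing keys fall back to empty strings
    return {key: str(data.get(key, "")) for key in empty}

wrapped_reply = 'Sure! Here it is: {"prompt": "a sunny studio", "negative_prompt": "blurry"} Enjoy.'
parsed_reply = parse_prompt_json(wrapped_reply)
```

An empty `prompt` in the result can still be treated as a failure by the caller, as the main loop already does.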

In [None]:
get_prompt_to_generate_video_clip(segments[0][0],segments[0][1])

#### Main function to generate Video from transcript segments.

In [None]:
import os
import gc
import json
import torch
import traceback
from tqdm import tqdm
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

# Setup
output_dir = "/content/generated_clips"
os.makedirs(output_dir, exist_ok=True)
checkpoint_file = "/content/generation_checkpoint.txt"
final_output_path = "/content/podcast_video.mp4"

# Load final audio
audio_clip = AudioFileClip(final_audio_path)
audio_duration = audio_clip.duration

# Check where we left off
start_index = 0
if os.path.exists(checkpoint_file):
    with open(checkpoint_file, "r") as f:
        start_index = int(f.read().strip())

print(f"🔁 Resuming from segment {start_index}")

# Generate video per segment
segment_start = 0
video_clips = []

for i, (speaker, text) in enumerate(tqdm(segments, desc="Generating video clips", unit="clip")):
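    if i < start_index:
        # Segment finished in an earlier run: reload its clip and advance the audio offset
        prev_duration = min(max(3, len(text.split()) * 0.3), audio_duration - segment_start)
        prev_path = os.path.join(output_dir, f"clip_{i:03d}.mp4")
        if os.path.exists(prev_path):
            prev_audio = audio_clip.subclip(segment_start, segment_start + prev_duration)
            video_clips.append(VideoFileClip(prev_path).set_duration(prev_duration).set_audio(prev_audio))
        segment_start += prev_duration
        continue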

    try:
        print(f"\n🎙️ Generating prompt for segment {i}")
        prompt_data = get_prompt_to_generate_video_clip(speaker, text)

        if not prompt_data["prompt"]:
            raise ValueError("Prompt generation failed or returned empty.")

        # Estimate segment duration from audio
        segment_duration = max(3, len(text.split()) * 0.3)
        if segment_start + segment_duration > audio_duration:
            segment_duration = audio_duration - segment_start

        print(f"🕒 Segment duration: {segment_duration:.2f}s")

        # Generate video
        clip_name = f"clip_{i:03d}.mp4"
        clip_path = os.path.join(output_dir, clip_name)
        print(f"🎥 Generating video clip: {clip_name}")

        result_path = generate_video(
            prompt=prompt_data["prompt"],
            negative_prompt=prompt_data["negative_prompt"],
            num_frames=int(segment_duration * 16),  # FPS = 16
            output_dir=output_dir,
            output_file=clip_name
        )

        # Attach audio segment
        video = VideoFileClip(result_path).set_duration(segment_duration)
        audio_segment = audio_clip.subclip(segment_start, segment_start + segment_duration)
        video = video.set_audio(audio_segment)

        video_clips.append(video)

        # Save checkpoint
        with open(checkpoint_file, "w") as f:
            f.write(str(i + 1))

        # Cleanup GPU memory
        torch.cuda.empty_cache()
        gc.collect()

        segment_start += segment_duration

    except Exception as e:
        print(f"❌ Failed to process segment {i}: {e}")
        traceback.print_exc()
        break  # Stop so we can resume later from same segment

# Finalize video if we have clips
if video_clips:
    final_video = concatenate_videoclips(video_clips, method="compose")
    final_video.write_videofile(final_output_path, fps=24, audio_codec="aac")
    print(f"\n✅ Final podcast video saved at: {final_output_path}")
else:
    print("⚠️ No video clips were generated.")
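
The loop above estimates each segment's length from its word count (0.3 seconds per word, with a 3-second minimum), so the estimates can drift away from the real audio length. One possible refinement, sketched below, rescales the estimates so that they sum to the measured audio duration. It assumes speech is paced roughly evenly across segments, and the helper name `estimate_durations` is hypothetical.

```python
def estimate_durations(texts, audio_duration, seconds_per_word=0.3, min_seconds=3.0):
    """Scale word-count estimates so the segment durations sum to audio_duration."""
    raw = [max(min_seconds, len(text.split()) * seconds_per_word) for text in texts]
    total = sum(raw)
    if total == 0:
        return []
    scale = audio_duration / total
    return [duration * scale for duration in raw]

# Two hypothetical segments: 3 words and 20 words, with 18 seconds of audio
demo_durations = estimate_durations(["one two three", "word " * 20], 18.0)
```

This keeps the relative lengths of the segments while guaranteeing that the last clip ends where the audio ends.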


#### Fallback logic if the notebook fails at any point
*** Run this only if the cell above fails partway through (after at least one clip was generated). ***

In [None]:
import os
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

# Setup
output_dir = "/content/generated_clips"
final_output_path = "/content/podcast_video_partial.mp4"

# Load all available video clips
clips = sorted([
    os.path.join(output_dir, f)
    for f in os.listdir(output_dir)
    if f.endswith(".mp4")
])

print(f"✅ Found {len(clips)} video clips.")

video_clips = [VideoFileClip(clip) for clip in clips]

# Load full audio
audio_clip = AudioFileClip(final_audio_path)

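# Estimate the audio length covered by the clips found on disk (one clip per segment)
total_duration = 0
for speaker, text in segments[:len(video_clips)]:
    words = len(text.split())
    segment_duration = max(3, words * 0.3)
    total_duration += segment_duration

# Never cut past the end of the audio
total_duration = min(total_duration, audio_clip.duration)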

print(f"🎵 Total duration for audio cut: {total_duration:.2f} seconds")

# Cut the audio to match the video length
audio_clip = audio_clip.subclip(0, total_duration)

# Concatenate video clips
final_video = concatenate_videoclips(video_clips, method="compose")

# Set the trimmed audio
final_video = final_video.set_audio(audio_clip)

# Export the final video
final_video.write_videofile(final_output_path, fps=24, audio_codec="aac")

print(f"\n✅ Final partial podcast video saved at: {final_output_path}")


## Conclusion

This notebook demonstrates a workflow for generating podcast videos from transcripts and audio using the Wan2.1 1.3B model. While the current implementation provides a basic framework, there are several areas for improvement and further exploration.

## TODO

1. **Fine-tune Prompts:** Experiment with different prompt engineering techniques to generate more engaging and creative video content. Consider using more detailed descriptions, specifying camera angles, or incorporating emotions.

2. **Incorporate PDF Content:** Extend the pipeline to analyze the original PDF document, including images and text from specific sections or pages. This would allow for more context-aware video generation and potentially include relevant visuals in the final output.

3. **Explore Other Models:** Investigate alternative video generation models like Lightricks/LTX-Video and compare their performance and output quality to Wan2.1 1.3B. This could lead to improved video quality or more diverse creative options.

4. **Create a Complete Pipeline:** Develop a streamlined pipeline that takes a PDF document as input and automatically generates a complete podcast video, including transcript extraction, audio generation, and video creation. This would make the process more user-friendly and accessible.

5. **Develop API or Gradio App:** Create an API or a Gradio app to expose the functionality of the pipeline to a wider audience. This would allow users to easily generate podcast videos without needing to interact directly with the code.

By addressing these TODO items, we can significantly enhance the capabilities of this workflow and create more compelling and informative podcast videos.