📄 Autonomous AI Video Storytelling System – Documentation Report

Team Members: Amin Niaziardekani (3119828), Swapnaneel Sarkhel (3119924), Ajdin Buljko (3119754)

🎯 Objective


This project implements a fully autonomous AI video storytelling pipeline. It generates two short narrated videos, each based on a randomly selected theme. Every video has 6 scenes and includes:

    A machine-generated story (using Mistral-7B),

    AI-generated narration (Bark),

    Matching ambient sound effects (Stable Audio Open),

    Background music (Stable Audio Open),

    AI-generated images (Stable Diffusion),

    Subtitle overlays and transitions (MoviePy).

⚙️ Program Flow Overview

1. Theme Selection

Two random themes are selected from a predefined list:

all_themes = ["dark fantasy", "medieval", "post-apocalyptic", "sci-fi", "steampunk", "cyberpunk", "mythical adventure"]
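
As a minimal sketch of this step, two distinct themes can be drawn with `random.sample`. The helper name `pick_themes` and its optional `seed` argument are assumptions added for illustration; the pipeline itself samples directly from the list without a fixed seed.

```python
import random

# Theme list copied from the report.
all_themes = ["dark fantasy", "medieval", "post-apocalyptic", "sci-fi",
              "steampunk", "cyberpunk", "mythical adventure"]

def pick_themes(themes, count=2, seed=None):
    """Return `count` distinct themes (hypothetical helper; `seed` only aids reproducibility)."""
    rng = random.Random(seed)
    # sample() draws without replacement, so the same theme is never picked twice
    return rng.sample(themes, count)

selected_themes = pick_themes(all_themes, seed=42)
print(selected_themes)
```

Sampling without replacement is what keeps the two videos on different themes.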

2. Story Generation

    A structured 6-sentence story is generated using Mistral-7B-Instruct-v0.3.

    The story has fixed roles:

        Setting

        Main Character

        Conflict

        Climax

        Resolution

        Emotional Ending
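
The mapping from generated text to the six roles can be sketched as follows. The splitting rule (split on periods, drop fragments of 10 characters or fewer) mirrors the one used in the code further down; the helper name `split_into_roles` and the sample text are assumptions for illustration only.

```python
ROLES = ["Setting", "Main Character", "Conflict",
         "Climax", "Resolution", "Emotional Ending"]

def split_into_roles(text, min_len=10):
    """Split raw story text on periods and pair the first six sentences with their roles."""
    sentences = [s.strip() for s in text.split(".") if len(s.strip()) > min_len]
    if len(sentences) < len(ROLES):
        raise ValueError("Too few sentences")
    return list(zip(ROLES, sentences[:len(ROLES)]))

# Placeholder story text, not model output.
sample = (
    "The city slept under a red sky. Mara walked the empty streets alone. "
    "A siren broke the silence behind her. She ran toward the collapsing bridge. "
    "The far shore welcomed her with lights. She never looked back again."
)
for role, sentence in split_into_roles(sample):
    print(f"{role}: {sentence}")
```

Splitting on periods is a simple heuristic: abbreviations or decimal numbers in the model output would be cut in the wrong place, which is one reason the generator retries when fewer than six usable sentences come back.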

3. Prompt Generation

Using the story:

    Image prompts are crafted for Stable Diffusion.

    Ambient prompts are crafted for Stable Audio based only on the soundscape.

    A story title is generated using the same model.
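
Because the language model tends to echo the instruction and add labels or list numbering, the raw output has to be cleaned before it is passed on as a prompt. The sketch below shows that cleanup in isolation; the function name `clean_generated_prompt` and the sample strings are assumptions, while the regular expressions follow the ones used in the code further down.

```python
import re

def clean_generated_prompt(raw_output, prompt):
    """Remove the echoed instruction, label prefixes, and leading list numbering."""
    text = raw_output.replace(prompt, "").strip()
    # Drop labels the model sometimes prepends
    text = re.sub(r"(Title:|Description:|Scene Description:)", "", text,
                  flags=re.IGNORECASE).strip()
    # Drop leading numbering or bullets such as "1." or "-"
    text = re.sub(r"^\s*[\d.\-•]+\s*", "", text)
    return text.strip()

# Placeholder strings, not real model output.
prompt = "Describe this scene visually in one sentence."
raw = prompt + "\n\nDescription: 1. A ruined tower under a blood moon, dramatic lighting"
cleaned = clean_generated_prompt(raw, prompt)
print(cleaned)
```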

4. Audio & Visual Generation

    Narration: Each sentence is spoken using Bark, then post-processed with gain normalization, fade-in and fade-out, and an occasional slight speed-up.

    SFX: Stable Audio generates ambient sound based on each sentence's audio prompt.

    Music: A full 30–45 second theme-appropriate ambient track is generated and faded in/out.

5. Image Generation

    One image per sentence is generated using Stable Diffusion with cinematic styling and character consistency.
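
One ingredient of reproducible image generation is a seed derived from the theme name, as the code further down does with an MD5 hash. The sketch below isolates that derivation; the function name `theme_seed` is an assumption. A fixed seed makes runs repeatable, but on its own it does not guarantee that a character looks the same across all six images.

```python
import hashlib

def theme_seed(theme_name):
    """Map a theme name to a stable 32-bit seed."""
    digest = hashlib.md5(theme_name.encode()).hexdigest()
    # Reduce the 128-bit hash to the range accepted by torch.manual_seed-style seeds
    return int(digest, 16) % (2**32)

print(theme_seed("sci-fi"), theme_seed("medieval"))
```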

6. Scene Composition

Each of the 6 scenes is built using:

    One image,

    One subtitle (story sentence),

    One narration audio,

    One ambient background audio (lower volume).
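
The "lower volume" of the ambient track is a gain expressed in decibels (the code further down uses -15 dB for the SFX), and short ambient clips are repeated until they cover the narration. The arithmetic behind both can be sketched in plain Python; the helper names `db_to_ratio` and `loops_needed` are assumptions for illustration.

```python
def db_to_ratio(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)

def loops_needed(clip_ms, target_ms):
    """Number of repetitions of a clip needed to cover at least target_ms."""
    if clip_ms >= target_ms:
        return 1
    return target_ms // clip_ms + 1

# -15 dB leaves roughly 18% of the original amplitude
print(round(db_to_ratio(-15), 4))
# a 6 s ambient clip under a 14 s narration is repeated 3 times, then cut to length
print(loops_needed(6000, 14000))
```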

All audio is merged using pydub, and all visuals are composed using MoviePy.

7. Video Compilation

    A title card is generated.

    All scenes are concatenated into a final video.

    The background music is added beneath narration + ambient sound.

    Fade transitions and timing are synchronized.
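
Since each scene lasts as long as its mixed audio, the final timeline is the title card followed by the scene durations back to back. The sketch below computes where each scene starts and how long the whole video runs; the 3-second title card matches the code further down, while the helper name `scene_start_times` and the sample durations are assumptions.

```python
from itertools import accumulate

def scene_start_times(scene_durations, title_seconds=3.0):
    """Return (start time of each scene, total video length) in seconds."""
    # Running total that begins after the title card
    marks = list(accumulate(scene_durations, initial=title_seconds))
    return marks[:-1], marks[-1]

# Placeholder durations for three scenes
starts, total = scene_start_times([4.0, 5.0, 3.5])
print(starts, total)
```

The total is also the length to which the background music has to be looped or trimmed.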

8. Metadata Logging

A .txt log file is created for each theme that documents:

    Title

    Story

    Image prompts

    Ambient prompts

    Music prompt

    Timestamp
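
A minimal sketch of such a log writer is shown below. It is a simplified stand-in for the `save_generation_log` function defined further down: the name `write_log`, the plain-ASCII headings, and all sample values are assumptions, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import os
import tempfile
from datetime import datetime

def write_log(path, theme, title, story, image_prompts, ambient_prompts, music_prompt):
    """Write one plain-text log containing the fields listed above."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    sections = [
        ("Story", story),
        ("Image prompts", image_prompts),
        ("Ambient prompts", ambient_prompts),
    ]
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Generated on: {timestamp}\n")
        f.write(f"Theme: {theme}\n")
        f.write(f"Title: {title}\n")
        for heading, lines in sections:
            f.write(f"\n{heading}:\n")
            for i, line in enumerate(lines, 1):
                f.write(f"  {i}. {line}\n")
        f.write(f"\nMusic prompt:\n  {music_prompt}\n")

# Usage with placeholder values
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "sci-fi_log.txt")
    write_log(log_path, "sci-fi", "Placeholder Title",
              ["Sentence one", "Sentence two"],
              ["Image prompt one", "Image prompt two"],
              ["Ambient prompt one", "Ambient prompt two"],
              "Placeholder music prompt")
    with open(log_path, encoding="utf-8") as f:
        log_text = f.read()
print(log_text)
```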

📁 Output Files

For each theme (e.g., sci-fi), the following directories and files are created:

    images_sci-fi/ → 6 scene images

    audio_sci-fi/ → 6 narration files

    sfx_sci-fi/ → 6 ambient audio tracks

    stable_music_sci-fi.wav → Background music

    sci-fi_story_video.mp4 → Final video (around 30 seconds; the exact length depends on the narration)

    sci-fi_log.txt → Metadata log
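
The naming scheme can be sketched as a small function. In the code further down, spaces in a theme name are replaced with underscores before paths are built, so "dark fantasy" becomes `dark_fantasy`; the helper name `output_layout` and the dictionary keys are assumptions for illustration.

```python
def output_layout(theme):
    """Return the output directories and files expected for one theme."""
    slug = theme.replace(" ", "_")
    return {
        "images": f"images_{slug}/",
        "narration": f"audio_{slug}/",
        "sfx": f"sfx_{slug}/",
        "music": f"stable_music_{slug}.wav",
        "video": f"{slug}_story_video.mp4",
        "log": f"{slug}_log.txt",
    }

for kind, path in output_layout("dark fantasy").items():
    print(f"{kind}: {path}")
```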

In [None]:
# RUN THIS ONCE AND RESTART SESSION
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip uninstall -y sentence-transformers
!pip install --quiet git+https://github.com/Stability-AI/stable-audio-tools.git
!pip install --upgrade --quiet --no-deps \
  protobuf==5.29.1 \
  einops
!pip install transformers==4.41.2
!pip install diffusers==0.27.2
!pip install bitsandbytes==0.43.1
!pip install accelerate==0.30.1
!pip install moviepy==1.0.3
!pip install soundfile==0.12.1
!pip install pydub==0.25.1
!pip install git+https://github.com/suno-ai/bark.git
!pip install peft==0.9.0
!pip install huggingface_hub==0.25.2
!apt-get install -y imagemagick
!apt-get install -y fonts-dejavu-core
!pip install numpy==1.26.4

In [None]:
# RUN AND LOG IN WITH THE ACCESS TOKEN
from huggingface_hub import login
login()  # Will prompt for token

In [None]:
# clear_device_cache() must be patched into peft manually because of a version conflict between peft and transformers

!sed -i '25i import gc\nimport torch\ndef clear_device_cache():\n    gc.collect()\n    if torch.cuda.is_available():\n        torch.cuda.empty_cache()\n        torch.cuda.synchronize()\n' /usr/local/lib/python3.11/dist-packages/peft/utils/loftq_utils.py




In [None]:
# STEP 1: STORY GENERATOR
import gc
import random
import warnings
from datetime import datetime

import torch
import torchaudio
from pydub.silence import detect_nonsilent
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig, AutoProcessor, AutoModel

def save_generation_log(theme, title, story, image_prompts, ambient_prompts, music_prompt=None, seed=None):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    log_path = f"{theme.replace(' ', '_')}_log.txt"

    with open(log_path, "w", encoding="utf-8") as f:
        f.write(f"📅 Generated on: {timestamp}\n")
        f.write(f"🎬 Theme: {theme}\n")
        f.write(f"📘 Title: {title}\n\n")
        f.write("📝 Story:\n")
        for i, line in enumerate(story, 1):
            f.write(f"  {i}. {line}\n")

        f.write("\n🎨 Image Prompts:\n")
        for i, line in enumerate(image_prompts, 1):
            f.write(f"  {i}. {line}\n")

        f.write("\n🌫️ Ambient Prompts:\n")
        for i, line in enumerate(ambient_prompts, 1):
            f.write(f"  {i}. {line}\n")

        if music_prompt:
            f.write(f"\n🎵 Music Prompt:\n  {music_prompt}\n")
        if seed is not None:  # a seed of 0 is valid and should still be logged
            f.write(f"🎲 Music Seed: {seed}\n")

    print(f"🗂️ Log saved to: {log_path}")

def clear_device_cache():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

class StoryGenerator:
    def __init__(self):
        print("\U0001F4D6 Loading Mistral model for storytelling...")
        model_name = "mistralai/Mistral-7B-Instruct-v0.3"

        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4"
        )

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quant_config,
            trust_remote_code=True,
            low_cpu_mem_usage=True,   # helps avoid .to()
            device_map=None           # <-- this is key, don't use 'auto'
        )

        self.generator = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            pad_token_id=self.tokenizer.eos_token_id
        )
    def is_strong_ending(self, sentence):
        keywords = [
            "finally", "at last", "was never the same", "legacy", "the end",
            "peace", "reborn", "crowned", "fell silent", "the world changed"
        ]
        sentence = sentence.lower()
        return any(kw in sentence for kw in keywords)

    def generate_story(self, theme):
        seed = int(datetime.now().timestamp()) % (2**32 - 1)
        torch.manual_seed(seed)
        random.seed(seed)

        prompt = (
            f"Write a 6-sentence {theme} story. "
            "Sentence 1 introduces the setting, "
            "sentence 2 introduces the main character, "
            "sentence 3 shows the conflict, "
            "sentence 4 shows a dramatic moment, "
            "sentence 5 begins resolution, "
            "and sentence 6 concludes the story with a twist or emotional end."
        )

        last_valid_story = None

        for attempt in range(5):
            try:
                print(f"\n✨ Generating a {theme} story (Attempt {attempt+1})...")
                raw_output = self.generator(
                    prompt,
                    max_new_tokens=300,
                    do_sample=True,
                    temperature=1.0,
                    top_k=50,
                    top_p=0.92,
                    repetition_penalty=1.2
                )[0]['generated_text']

                output = raw_output.replace(prompt, "").strip()
                story_sentences = [s.strip() for s in output.split('.') if len(s.strip()) > 10][:6]

                if len(story_sentences) < 6:
                    raise ValueError("Too few sentences")

                # Remember every valid story so it can be used if all later attempts fail
                last_valid_story = story_sentences

                if attempt == 0 and not self.is_strong_ending(story_sentences[-1]):
                    print("⚠️ Ending not strong enough. Trying one more time...")
                    continue

                # Either a strong ending on the first try OR any valid later attempt
                print("\n🖋️ Final Story:")
                for i, s in enumerate(story_sentences, 1):
                    print(f"{i}. {s}")
                return story_sentences

            except Exception as e:
                print(f"  ❌ Error during generation attempt {attempt+1}: {e}")
                continue

        if last_valid_story:
            print("\n✅ Using last generated story (ending may be weak):")
            for i, s in enumerate(last_valid_story, 1):
                print(f"{i}. {s}")
            return last_valid_story

        print("⚠️ Story generation failed. Using fallback.")
        return self.get_fallback_story(theme)



    def get_fallback_story(self, theme):
        if theme == "dark fantasy":
            return [
                "The sky cracked open as a blood moon rose",
                "Whispers echoed through the forest of forgotten souls",
                "A knight stepped forth, cloaked in shadow and dread",
                "He drew a blade that pulsed with ancient sorrow",
                "Demons bowed as he passed, hailing their king reborn",
                "The world trembled as his eyes met the stars"
            ]
        elif theme == "medieval":
            return [
                "In a quiet village, a squire found a hidden scroll",
                "The scroll spoke of a sword buried beneath the king’s keep",
                "He set off with nothing but hope and a borrowed horse",
                "The castle gates creaked as he slipped inside at night",
                "Steel met fire in the chamber of the forgotten king",
                "He emerged with the blade, and the sun crowned him in light"
            ]
        else:
            return ["A story could not be generated."] * 6

def generate_story_title(theme, generator, story_sentences):
    story_text = " ".join(story_sentences)
    prompt = (
        f"Given the following {theme} story:\n\n"
        f"{story_text}\n\n"
        "Generate a short and creative title (max 5 words) that captures the essence or mood of this story."
    )

    response = generator(
        prompt,
        max_new_tokens=20,
        do_sample=True,
        temperature=0.8,
        top_p=0.9
    )[0]["generated_text"]

    title = response.replace(prompt, "").strip().strip('"').strip(".")
    print(f"\n📘 Title for {theme}: {title}")
    return title


# AUTOMATIC AMBIENT PROMPT GENERATION

import re
import random

def generate_ambient_descriptions(story_lines, theme, generator, max_retries=2):
    print(f"\n🌫️ Generating ambient sound prompts for each sentence...")
    if story_lines[0].strip().lower().startswith("a story could not be generated"):
        print("⚠️ Skipping ambient SFX prompts due to fallback story.")
        return [""] * 6

    ambient_prompts = []
    tone_choices = ["ominous", "peaceful", "mysterious", "chaotic", "tense", "dreamlike", "dystopian"]
    sound_keywords = ["echo", "wind", "rustle", "creak", "drip", "flames", "howl", "chant", "crackling", "thunder", "footsteps"]

    for idx, line in enumerate(story_lines):
        retries = 0
        description = ""

        while retries <= max_retries:
            tone = random.choice(tone_choices)
            prompt = (
                f"Given the following {theme} story scene:\n\n"
                f"\"{line}\"\n\n"
                f"Describe the ambient background sounds for this scene in one rich sentence. "
                f"Include only audio-related elements, not visual or narrative content. "
                f"The soundscape should feel {tone}."
            )

            try:
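                # The pipeline returns the prompt followed by the completion, so strip the echoed prompt
                output = generator(prompt, max_new_tokens=100)[0]["generated_text"].replace(prompt, "").strip()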
            except Exception as e:
                print(f"  ❌ Error during generation: {e}")
                retries += 1
                continue

            # Clean and validate
            cleaned_output = re.sub(r"^ambient\s*\d*[:\.\-\s]*", "", output, flags=re.IGNORECASE)
            cleaned_output = re.sub(r"^\s*[\d\.\-\•]+\s*", "", cleaned_output).strip()

            if (
                len(cleaned_output) < 30 or
                "thas" in cleaned_output.lower() or
                not any(kw in cleaned_output.lower() for kw in sound_keywords)
            ):
                print(f"  ⚠️ Invalid or bland output on attempt {retries+1}: {cleaned_output}")
                retries += 1
                continue

            if len(cleaned_output) > 250:
                cleaned_output = cleaned_output[:250].rsplit(".", 1)[0] + "."

            description = cleaned_output
            # ✅ PRINT HERE
            print(f"  🎧 Final Ambient Prompt {idx+1}: {description}")
            break

        if not description:
            print(f"  ❌ Failed to get valid ambient prompt for line {idx + 1}. Using fallback.")
            description = "Eerie wind through trees, distant animal cries, and unsettling silence in the air."

        print(f"  ✅ Ambient {idx+1}: {description}")
        ambient_prompts.append(description)

    return ambient_prompts





# AUTOMATIC IMAGE PROMPT GENERATION

import re

def generate_image_prompts(story_lines, theme, generator, max_retries=2):
    print(f"\n🖼 Generating improved image prompts for each sentence...")
    image_prompts = []

    full_story = " ".join(story_lines)
    art_style = "in cinematic digital art style, ultra-detailed, dramatic lighting, consistent character design"

    for idx, line in enumerate(story_lines):
        retries = 0
        image_prompt = ""

        while retries <= max_retries:
            prompt = (
                f"This is a {theme} story:\n\n"
                f"{full_story}\n\n"
                f"Scene {idx + 1}: \"{line}\"\n\n"
                f"Describe this scene visually in one sentence. "
                f"Focus on atmosphere, consistent characters and environment, and ensure it fits the ongoing narrative. "
                f"The description should be suitable for AI image generation — {art_style}. "
                f"Only return the raw visual prompt."
            )

            try:
                output_raw = generator(
                    prompt,
                    max_new_tokens=100,
                    do_sample=True,
                    temperature=0.9,
                    top_p=0.95
                )[0]["generated_text"]

                output = output_raw.replace(prompt, "").strip()
                cleaned = re.sub(r"(Title:|Description:|Scene Description:)", "", output, flags=re.IGNORECASE).strip()
                cleaned = re.sub(r"^\s*[\d\.\-\•]+\s*", "", cleaned)

                if len(cleaned) < 40 or "title" in cleaned.lower() or "description" in cleaned.lower():
                    print(f"  ⚠️ Retry {retries+1}: {cleaned}")
                    retries += 1
                else:
                    image_prompt = cleaned
                    break

            except Exception as e:
                print(f"  ❌ Error on attempt {retries+1}: {str(e)}")
                retries += 1

        if not image_prompt:
            print(f"  ❌ Failed to get image prompt for line {idx + 1}. Using fallback.")
            image_prompt = f"A dramatic {theme} scene with consistent characters, {art_style}"

        print(f"  🎨 Prompt {idx+1}: {image_prompt}")
        image_prompts.append(image_prompt)

    return image_prompts



# STEP 2: SCENE CLASS

import os
import textwrap

import numpy as np
from moviepy.editor import AudioFileClip, ImageClip, CompositeVideoClip, ColorClip
from PIL import Image, ImageDraw, ImageFont
from pydub import AudioSegment

def create_text_image(text, size=(1280, 720), base_fontsize=40, font_path="/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf"):
    img = Image.new("RGBA", size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    font_size = base_fontsize
    font = ImageFont.truetype(font_path, font_size)

    # Wrap text into lines of approx. 60 characters
    wrapper = textwrap.TextWrapper(width=60)
    lines = wrapper.wrap(text)

    # Recalculate font size if lines are too wide
    while True:
        too_wide = any(draw.textbbox((0, 0), line, font=font)[2] > size[0] - 100 for line in lines)
        if not too_wide or font_size <= 12:
            break
        font_size -= 2
        font = ImageFont.truetype(font_path, font_size)

    # Calculate total text height
    line_height = draw.textbbox((0, 0), "A", font=font)[3]
    total_text_height = len(lines) * (line_height + 10)

    # Starting Y position
    y_start = size[1] - total_text_height - 50

    # Draw each line centered
    for line in lines:
        line_width = draw.textbbox((0, 0), line, font=font)[2]
        x = (size[0] - line_width) // 2
        draw.text((x, y_start), line, font=font, fill=(255, 255, 255, 255))
        y_start += line_height + 10

    return np.array(img)




class Scene:
    def __init__(self, index, sentence, image_path, narration_path, sfx_path, duration=None):
        self.index = index
        self.sentence = sentence
        self.image_path = image_path
        self.audio_path = narration_path  # Bark narration
        self.sfx_path = sfx_path          # Stable Audio SFX
        self.duration = duration  # overwritten with the mixed-audio length in get_video_clip()

    def get_video_clip(self, size=(1280, 720)):
        try:
            print(f"🎞️ Building video clip for Scene {self.index + 1} with image: {self.image_path}")

            # --- Mix ambient SFX and narration into one track ---
            narration = AudioSegment.from_file(self.audio_path).apply_gain(+10)
            sfx = AudioSegment.from_file(self.sfx_path).apply_gain(-15)

            # Loop SFX to match or exceed narration length
            if len(sfx) < len(narration):
                loops = (len(narration) // len(sfx)) + 1
                sfx = sfx * loops
            sfx = sfx[:len(narration)]

            mixed_audio = narration.overlay(sfx)

            # FIX: Use mixed audio duration
            self.duration = round(len(mixed_audio) / 1000.0, 2)

            # Save temp audio
            temp_audio_path = f"temp_scene_audio_{self.index+1}.wav"
            mixed_audio.export(temp_audio_path, format="wav")

            # --- Load image or fallback to black ---
            if not os.path.exists(self.image_path):
                print(f"⚠️ Image not found: {self.image_path}. Using black background.")
                image_clip = ColorClip(size=size, color=(0, 0, 0)).set_duration(self.duration)
            else:
                image_clip = ImageClip(self.image_path).set_duration(self.duration)
                image_clip = image_clip.resize(height=size[1])

            # --- Text subtitle overlay ---
            text_img = create_text_image(self.sentence, size=image_clip.size)
            text_clip = ImageClip(np.array(text_img)).set_duration(self.duration).set_position(("center", "bottom")).set_opacity(1)

            # --- Composite video and set audio ---
            composite = CompositeVideoClip([image_clip, text_clip]).set_duration(self.duration).fadein(0.5).fadeout(0.5)
            composite = composite.set_audio(AudioFileClip(temp_audio_path))

            return composite

        except Exception as e:
            print(f"❌ Error creating clip for Scene {self.index + 1}: {e}")
            # self.duration is still None if the failure happened before the audio was mixed,
            # so fall back to an assumed 5-second clip to give MoviePy a valid duration.
            duration = self.duration or 5.0
            fallback = ColorClip(size=size, color=(0, 0, 0)).set_duration(duration)
            text_img = create_text_image(self.sentence, size=size)
            text_clip = ImageClip(np.array(text_img)).set_duration(duration).set_position(("center", "bottom")).set_opacity(1)
            fallback = CompositeVideoClip([fallback, text_clip]).set_duration(duration).fadein(0.5).fadeout(0.5)
            return fallback



# STEP 3: ENHANCED AUDIO GENERATION

from bark import generate_audio, preload_models
from pydub import effects
import soundfile as sf
import os

def generate_bark_audio_only(sentences, theme_name="story", voice=None):
    print(f"\n🗣️ Generating Bark narration with realism for theme: {theme_name}")
    preload_models()

    narration_dir = f"audio_{theme_name}"
    os.makedirs(narration_dir, exist_ok=True)
    audio_paths = []

    voices = ["v2/en_speaker_6", "v2/en_speaker_1", "v2/en_speaker_3"]
    voice = voice or random.choice(voices)

    for i, sentence in enumerate(sentences):
        narration_path = f"{narration_dir}/sentence_{i+1}.wav"
        if os.path.exists(narration_path):
            print(f"  ✅ Skipping existing narration {i+1}")
        else:
            print(f"  🎤 Generating expressive narration {i+1}: {sentence}")

            # Only surrounding whitespace is stripped; no extra pauses or ellipses are inserted
            processed = sentence.strip()

            narration = generate_audio(processed, history_prompt=voice)
            sf.write(narration_path, narration, 24000)

            # Load and post-process
            narr_seg = AudioSegment.from_file(narration_path)
            narr_seg = effects.normalize(narr_seg)
            narr_seg = narr_seg.fade_in(500).fade_out(500)

            # Optional: subtle speed variation
            if random.random() < 0.3:
                narr_seg = narr_seg.speedup(playback_speed=1.05)

            narr_seg.export(narration_path, format="wav")

        audio_paths.append(narration_path)

    return audio_paths



import os
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

stable_audio_model = None
stable_audio_config = None
device = "cuda" if torch.cuda.is_available() else "cpu"

def load_stable_audio_model():
    global stable_audio_model, stable_audio_config
    if stable_audio_model is None:
        print("🎧 Loading Stable Audio Open model...")
        stable_audio_model, stable_audio_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
        stable_audio_model = stable_audio_model.to(device)

def generate_stable_audio(prompt, output_path, duration=6.0, steps=100, cfg_scale=7.0, seed=None):
    load_stable_audio_model()

    if seed is not None:
        torch.manual_seed(seed)

    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": duration
    }]

    output = generate_diffusion_cond(
        stable_audio_model,
        steps=steps,
        cfg_scale=cfg_scale,
        conditioning=conditioning,
        sample_size=stable_audio_config["sample_size"],
        sigma_min=0.3,
        sigma_max=500,
        sampler_type="dpmpp-3m-sde",
        device=device
    )

    output = rearrange(output, "b d n -> d (b n)")
    normalized = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
    # Convert tensor to audio segment
    temp_path = "temp_output.wav"
    torchaudio.save(temp_path, normalized, stable_audio_config["sample_rate"])

    # Load with pydub and trim
    audio = AudioSegment.from_file(temp_path)
    trimmed = trim_silence(audio)
    trimmed.export(output_path, format="wav")
    os.remove(temp_path)


def generate_stable_audio_sfx(prompts, theme_name="story"):
    print(f"\n🔊 Generating ambient SFX with Stable Audio for theme: {theme_name}")
    sfx_dir = f"sfx_{theme_name}"
    os.makedirs(sfx_dir, exist_ok=True)
    sfx_paths = []

    for i, prompt in enumerate(prompts):
        output_path = os.path.join(sfx_dir, f"sfx_{i+1}.wav")
        if os.path.exists(output_path):
            print(f"  ✅ Skipping existing SFX {i+1}")
        else:
            print(f"  🔊 Generating SFX {i+1}: {prompt}")
            generate_stable_audio(prompt, output_path, duration=6.0, seed=i)
        sfx_paths.append(output_path)

    return sfx_paths


def generate_background_music(prompt, duration_sec, theme_name="story"):
    print(f"\n🎵 Generating background music for theme: {theme_name}")
    output_path = f"stable_music_{theme_name}.wav"

    # Generate a random seed each time for diversity
    seed = int(datetime.now().timestamp()) % (2**32 - 1)

    # First generation attempt
    generate_stable_audio(prompt, output_path, duration=duration_sec, steps=300, seed=seed)

    # Load and check volume
    music = AudioSegment.from_file(output_path)
    if music.dBFS < -35:
        print("⚠️ Music too quiet. Regenerating with different seed...")
        seed = random.randint(0, 99999)
        generate_stable_audio(prompt, output_path, duration=duration_sec, steps=300, seed=seed)
        music = AudioSegment.from_file(output_path)

    # Normalize and apply gain
    music = effects.normalize(music).apply_gain(-10)
    music.export(output_path, format="wav")

    return output_path

def trim_silence(audio_segment, silence_thresh=-40, min_silence_len=300):
    #Trim leading and trailing silence from an audio segment.
    non_silents = detect_nonsilent(audio_segment, min_silence_len, silence_thresh)
    if non_silents:
        start_trim = non_silents[0][0]
        end_trim = non_silents[-1][1]
        return audio_segment[start_trim:end_trim]
    return audio_segment

#step 4

from diffusers import StableDiffusionPipeline
from PIL import Image

import hashlib

def generate_images(prompts, theme_name="story", model_name="runwayml/stable-diffusion-v1-5"):
    print(f"\n🖼️ Generating images for theme: {theme_name}")
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Generate consistent seed per theme
    seed = int(hashlib.md5(theme_name.encode()).hexdigest(), 16) % (2**32)
    generator = torch.manual_seed(seed)

    pipe = StableDiffusionPipeline.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        safety_checker=None
    ).to(device)
    pipe.enable_attention_slicing()

    os.makedirs(f"images_{theme_name}", exist_ok=True)
    image_paths = []

    for i, prompt in enumerate(prompts):
        path = f"images_{theme_name}/frame_{i+1}.png"
        if os.path.exists(path):
            print(f"  ✅ Skipping existing image {i+1}")
        else:
            final_prompt = f"{prompt}, cinematic digital art, consistent character design, ultra-detailed, dramatic light"
            print(f"  🎨 Generating image {i+1}: {final_prompt}")
            image = pipe(
                final_prompt,
                num_inference_steps=30,
                guidance_scale=7.5,
                generator=generator,  # use the per-theme seeded generator explicitly
            ).images[0]
            image.save(path)
        image_paths.append(path)

    del pipe
    torch.cuda.empty_cache()
    return image_paths

# step 5 making video scenes
from moviepy.editor import concatenate_videoclips, AudioFileClip, TextClip, CompositeAudioClip
from pydub import AudioSegment, effects

def build_video_from_scenes(scenes, theme_name, title_text, bg_music_path=None):
    print(f"\n🎬 Building video for theme: {theme_name}")

    # 1. Create video clips with embedded narration + SFX
    video_clips = []
    for scene in scenes:
        clip = scene.get_video_clip()
        if scene.index == len(scenes) - 1:
            clip = clip.fadeout(1.5)
        video_clips.append(clip)

    # 2. Title card
    bg = Image.new("RGBA", video_clips[0].size, (0, 0, 0, 255))
    text_img = create_text_image(title_text, size=video_clips[0].size, base_fontsize=70)
    bg.paste(Image.fromarray(text_img), (0, 0), Image.fromarray(text_img))
    title_clip = ImageClip(np.array(bg)).set_duration(3).fadein(1)

    # 3. Combine video
    full_clips = [title_clip] + video_clips
    final_video = concatenate_videoclips(full_clips, method="compose")

    # 4. Add background music beneath existing audio
    if bg_music_path:
        print("🎼 Adding background music under narration...")

        main_audio = final_video.audio
        # Normalize first, then attenuate: normalizing afterwards would undo the -22 dB gain
        bg_music = effects.normalize(AudioSegment.from_file(bg_music_path)).apply_gain(-22)

        # Repeat music to match video duration, then fade the full track once
        video_duration_ms = int(final_video.duration * 1000)
        bg_music_loop = (bg_music * ((video_duration_ms // len(bg_music)) + 1))[:video_duration_ms]
        bg_music_loop = bg_music_loop.fade_in(3000).fade_out(3000)

        # Save and reload as AudioFileClip
        music_temp_path = "temp_music.wav"
        bg_music_loop.export(music_temp_path, format="wav")
        music_audio = AudioFileClip(music_temp_path).set_duration(final_video.duration)

        # Mix it under the narration+SFX
        final_audio = CompositeAudioClip([main_audio, music_audio])
        final_video = final_video.set_audio(final_audio)

    # 5. Export video
    output_path = f"{theme_name.replace(' ', '_')}_story_video.mp4"
    final_video.write_videofile(output_path, fps=24)
    print(f"✅ Video saved: {output_path}")

    # Cleanup
    if os.path.exists("temp_music.wav"):
        os.remove("temp_music.wav")


def main():
    clear_device_cache()
    warnings.filterwarnings("ignore")

    # 🎭 Pick 2 random themes
    all_themes = [
        "dark fantasy",
        "medieval",
        "post-apocalyptic",
        "sci-fi",
        "steampunk",
        "cyberpunk",
        "mythical adventure"
    ]
    selected_themes = random.sample(all_themes, 2)

    for theme in selected_themes:
        print(f"\n====== 🌍 Processing Theme: {theme.upper()} ======")
        story_gen = StoryGenerator()
        # 1. Generate story and prompts
        story = story_gen.generate_story(theme)
        ambient_prompts = generate_ambient_descriptions(story, theme, story_gen.generator)
        image_prompts = generate_image_prompts(story, theme, story_gen.generator)
        title = generate_story_title(theme, story_gen.generator, story)

        del story_gen.model
        del story_gen.tokenizer
        del story_gen.generator
        del story_gen
        clear_device_cache()

        # 2. Generate ambient SFX, images, and narration
        sfx_paths = generate_stable_audio_sfx(ambient_prompts, theme_name=theme.replace(" ", "_"))
        image_paths = generate_images(image_prompts, theme_name=theme.replace(" ", "_"))
        audio_paths = generate_bark_audio_only(story, theme_name=theme.replace(" ", "_"))

        clear_device_cache()

        # 3. Generate background music
        music_prompts = {
            "dark fantasy": "A haunting, atmospheric orchestral piece with slow rising strings and distant percussion. Dark fantasy mood.",
            "medieval": "A light medieval melody with lutes, flutes, and soft drums. Noble and adventurous feel.",
            "post-apocalyptic": "Ambient industrial textures with distant booms and eerie melodies. Gritty and desolate.",
            "sci-fi": "Futuristic synth waves with subtle pulses and cosmic ambience. High-tech and mysterious.",
            "steampunk": "Clockwork percussion, brass, and mechanical rhythm. Adventurous and vintage.",
            "cyberpunk": "Glitchy, dark electronic beats with echoing sirens and synth pads. Gritty and neon-lit.",
            "mythical adventure": "Ethereal chimes, soft choirs, and ambient strings. Magical and uplifting."
        }
        music_prompt = music_prompts.get(theme, "An ambient orchestral piece with subtle fantasy elements.")
        music_path = generate_background_music(music_prompt, duration_sec=45, theme_name=theme.replace(" ", "_"))

        clear_device_cache()

        # 4. Save metadata
        save_generation_log(
            theme=theme,
            title=title,
            story=story,
            image_prompts=image_prompts,
            ambient_prompts=ambient_prompts,
            music_prompt=music_prompt
        )

        # 5. Create scenes
        scenes = [
            Scene(i, story[i], image_paths[i], audio_paths[i], sfx_paths[i])
            for i in range(6)
        ]
        build_video_from_scenes(scenes, theme.replace(" ", "_"), title, bg_music_path=music_path)

        clear_device_cache()

if __name__ == "__main__":
    main()
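
The scene list in main() indexes all four asset lists with a fixed range(6), so a stage that returned fewer files would only surface as an IndexError at composition time. Below is a minimal sketch of a stricter assembly step. The Scene dataclass is a stand-in whose field names are assumed from the argument order used in main(), and build_scenes is a hypothetical helper, not part of the original program.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Scene:
    # Stand-in for the report's Scene class; field names are assumptions.
    index: int
    text: str
    image_path: str
    audio_path: str
    sfx_path: str


def build_scenes(
    story: Sequence[str],
    image_paths: Sequence[str],
    audio_paths: Sequence[str],
    sfx_paths: Sequence[str],
) -> List[Scene]:
    """Pair each story sentence with its assets, failing early on a count mismatch."""
    lengths = {
        "story": len(story),
        "image_paths": len(image_paths),
        "audio_paths": len(audio_paths),
        "sfx_paths": len(sfx_paths),
    }
    # Every stage must have produced exactly one asset per sentence.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Mismatched asset counts: {lengths}")
    return [
        Scene(i, text, image, audio, sfx)
        for i, (text, image, audio, sfx) in enumerate(
            zip(story, image_paths, audio_paths, sfx_paths)
        )
    ]
```

With this helper, a missing narration or image file would be reported by name and count before any video rendering starts.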



📖 Loading Mistral model for storytelling...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


✨ Generating a post-apocalyptic story (Attempt 1)...
⚠️ Ending not strong enough. Trying one more time...

✨ Generating a post-apocalyptic story (Attempt 2)...

🖋️ Final Story:
1. In the desolate wasteland of what used to be New York City, Ava scavenged for resources each day, trying her best to survive alone among the ruins
2. One fateful afternoon, she stumbled upon a strange, unfamiliar sound coming from an abandoned subway tunnel
3. Suddenly, a group of raiders emerged, brandishing weapons, ready to claim their latest prize
4. With a surge of adrenaline coursing through her veins, Ava lunged at them, surprising them with unexpected ferocity, overpowering them in mere moments
5. As she stood triumphantly over the vanquished gang, she discovered something extraordinary - not just food and water as usual, but a note detailing the location of another survivor nearby
6. Unable to contain her excitement, Ava set out in search of this long-lost companion, hope sparkling in her eyes

🌫️ G

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


  🎨 Prompt 2: A desolate New York City wasteland, an abandoned subway tunnel bathed in eerie, dim light with scavenger Ava silhouetted against the grungy, broken subway walls, listening to a strange, unfamiliar sound emanating from the darkness within.
  🎨 Prompt 3: A desolate, post-apocalyptic subway tunnel, filled with darkness and despair, suddenly lit up by the flickering light of several menacing raiders armed with weapons, ready to pounce on their next target. The tunnel walls are lined with graffiti, remnants of a bygone era, while the ground is littered with debris. The protagonist Ava, with unkempt hair and grime-streaked face, is suddenly
  🎨 Prompt 4: A gritty, post-apocalyptic scene featuring a determined, battle-hardened Ava amidst ruins, fiercely striking down a group of raiders with dramatic, cinematic lighting.
  🎨 Prompt 5: Ava, her figure illuminated by a streak of sunlight filtering through the dusty tunnel, stands victoriously over a group of defeated raiders, their

  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 2: Given the following post-apocalyptic story scene:

"One fateful afternoon, she stumbled upon a strange, unfamiliar sound coming from an abandoned subway tunnel"

Describe the ambient background sounds for this scene in one rich sentence.
1877085076


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 3: Given the following post-apocalyptic story scene:

"Suddenly, a group of raiders emerged, brandishing weapons, ready to claim their latest prize"

Describe the ambient background sounds for this scene in one rich sentence.
2107363746


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 4: Given the following post-apocalyptic story scene:

"With a surge of adrenaline coursing through her veins, Ava lunged at them, surprising them with unexpected ferocity, overpowering them in mere moments"

Describe the ambient background sounds for th.
4256259846


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 5: Given the following post-apocalyptic story scene:

"As she stood triumphantly over the vanquished gang, she discovered something extraordinary - not just food and water as usual, but a note detailing the location of another survivor nearby"

Describe.
1115146539


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 6: Given the following post-apocalyptic story scene:

"Unable to contain her excitement, Ava set out in search of this long-lost companion, hope sparkling in her eyes"

Describe the ambient background sounds for this scene in one rich sentence.
119607995


  0%|          | 0/100 [00:00<?, ?it/s]
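
Each SFX entry in the log above begins with the instruction text ("Given the following ... story scene") rather than a sound description, which suggests the ambient prompt passed to Stable Audio may still contain the echoed instruction. The following is a minimal sketch of one way to remove such an echo. strip_prompt_echo is a hypothetical helper, and it assumes the generated text starts with the exact prompt string.

```python
def strip_prompt_echo(generated: str, prompt: str) -> str:
    """Return only the newly generated text, dropping a leading copy of the prompt."""
    text = generated.strip()
    prompt = prompt.strip()
    # Text-generation models often return the prompt followed by the completion.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    # Remove surrounding whitespace and stray wrapping quotes.
    return text.strip().strip('"').strip()


prompt = (
    "Given the following post-apocalyptic story scene:\n\n"
    '"Suddenly, a group of raiders emerged"\n\n'
    "Describe the ambient background sounds for this scene in one rich sentence."
)
echoed = prompt + "\nDistant metal clangs echo through a dripping tunnel."
print(strip_prompt_echo(echoed, prompt))
```

If the story generator is a Hugging Face text-generation pipeline, passing return_full_text=False in the call is another option that should avoid the echo at the source.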


🖼️ Generating images for theme: post-apocalyptic


Loading pipeline components...:   0%|          | 0/6 [00:00<?, ?it/s]

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Token indices sequence length is longer than the specified maximum sequence length for this model (97 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['subway tunnel " how would, cinematic digital art, consistent ch

  🎨 Generating image 1: "A lone figure, Ava, clad in worn-out combat gear, navigates the desolate, post-apocalyptic New York City, where decaying skyscrapers cast long, ominous shadows over a barren wasteland bathed in a blood-red sunset."

Scene 2: "One fateful afternoon, she stumbled upon a strange, unfamiliar sound coming from an abandoned subway tunnel"

How would, cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]

  🎨 Generating image 2: A desolate New York City wasteland, an abandoned subway tunnel bathed in eerie, dim light with scavenger Ava silhouetted against the grungy, broken subway walls, listening to a strange, unfamiliar sound emanating from the darkness within., cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['and grime - streaked face, is suddenly, cinematic digital art, consistent character design, ultra - detailed, dramatic light']


  🎨 Generating image 3: A desolate, post-apocalyptic subway tunnel, filled with darkness and despair, suddenly lit up by the flickering light of several menacing raiders armed with weapons, ready to pounce on their next target. The tunnel walls are lined with graffiti, remnants of a bygone era, while the ground is littered with debris. The protagonist Ava, with unkempt hair and grime-streaked face, is suddenly, cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]

  🎨 Generating image 4: A gritty, post-apocalyptic scene featuring a determined, battle-hardened Ava amidst ruins, fiercely striking down a group of raiders with dramatic, cinematic lighting., cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['alizing possibility of another survivor nearby., cinematic digital art, consistent character design, ultra - detailed, dramatic light']


  🎨 Generating image 5: Ava, her figure illuminated by a streak of sunlight filtering through the dusty tunnel, stands victoriously over a group of defeated raiders, their bodies littering the grungy subway platform, the desolate New York City skyline visible in the distance through the tunnel's arch. In her hand, a crumpled piece of paper glows in the dim light, hinting at the tantalizing possibility of another survivor nearby., cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]

  🎨 Generating image 6: "A lone figure, Ava, bathed in the soft glow of a setting sun, ventures into the desolate, overgrown ruins of a city, her eyes aglow with hope and determination as she searches for another survivor.", cinematic digital art, consistent character design, ultra-detailed, dramatic light


  0%|          | 0/30 [00:00<?, ?it/s]
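
The CLIP warnings above show that for long prompts the part cut off at 77 tokens is the style suffix ("cinematic digital art, consistent character design, ..."), so those images are generated without the shared styling. Below is a minimal sketch of trimming the scene description first so the suffix always fits. fit_prompt and rough_token_count are hypothetical helpers. The default counter is a crude word-and-punctuation approximation, not the real CLIP tokenizer, so in practice a counter backed by the pipeline's own tokenizer should be passed in.

```python
import re
from typing import Callable

# Style suffix as it appears at the end of the prompts in the log above.
STYLE_SUFFIX = (
    "cinematic digital art, consistent character design, "
    "ultra-detailed, dramatic light"
)
CLIP_MAX_TOKENS = 77  # limit reported in the warning; includes start/end tokens


def rough_token_count(text: str) -> int:
    """Approximate token count: one token per word or punctuation mark (assumption)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def fit_prompt(
    description: str,
    suffix: str = STYLE_SUFFIX,
    max_tokens: int = CLIP_MAX_TOKENS - 2,  # reserve start and end tokens
    count: Callable[[str], int] = rough_token_count,
) -> str:
    """Drop trailing words from the description until description + suffix fits."""
    words = description.split()
    budget = max_tokens - count(", " + suffix)
    while words and count(" ".join(words)) > budget:
        words.pop()
    trimmed = " ".join(words).rstrip(",.;: ")
    return f"{trimmed}, {suffix}" if trimmed else suffix
```

Trimming from the end of the description keeps the opening of each scene prompt and the full style suffix, which is where the character-consistency wording lives.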


🗣️ Generating Bark narration with realism for theme: post-apocalyptic
  🎤 Generating expressive narration 1: In the desolate wasteland of what used to be New York City, Ava scavenged for resources each day, trying her best to survive alone among the ruins


100%|██████████| 575/575 [00:10<00:00, 53.33it/s]
100%|██████████| 29/29 [00:38<00:00,  1.32s/it]


  🎤 Generating expressive narration 2: One fateful afternoon, she stumbled upon a strange, unfamiliar sound coming from an abandoned subway tunnel


100%|██████████| 425/425 [00:07<00:00, 53.73it/s]
100%|██████████| 22/22 [00:27<00:00,  1.27s/it]


  🎤 Generating expressive narration 3: Suddenly, a group of raiders emerged, brandishing weapons, ready to claim their latest prize


100%|██████████| 351/351 [00:05<00:00, 58.68it/s]
100%|██████████| 18/18 [00:23<00:00,  1.29s/it]


  🎤 Generating expressive narration 4: With a surge of adrenaline coursing through her veins, Ava lunged at them, surprising them with unexpected ferocity, overpowering them in mere moments


100%|██████████| 619/619 [00:11<00:00, 55.64it/s]
100%|██████████| 31/31 [00:40<00:00,  1.31s/it]


  🎤 Generating expressive narration 5: As she stood triumphantly over the vanquished gang, she discovered something extraordinary - not just food and water as usual, but a note detailing the location of another survivor nearby


100%|██████████| 712/712 [00:12<00:00, 55.92it/s]
100%|██████████| 36/36 [00:46<00:00,  1.30s/it]


  🎤 Generating expressive narration 6: Unable to contain her excitement, Ava set out in search of this long-lost companion, hope sparkling in her eyes


100%|██████████| 429/429 [00:07<00:00, 53.96it/s]
100%|██████████| 22/22 [00:27<00:00,  1.27s/it]



🎵 Generating background music for theme: post-apocalyptic
1180422165


  0%|          | 0/300 [00:00<?, ?it/s]

🗂️ Log saved to: post-apocalyptic_log.txt

🎬 Building video for theme: post-apocalyptic
🎞️ Building video clip for Scene 1 with image: images_post-apocalyptic/frame_1.png
🎞️ Building video clip for Scene 2 with image: images_post-apocalyptic/frame_2.png
🎞️ Building video clip for Scene 3 with image: images_post-apocalyptic/frame_3.png
🎞️ Building video clip for Scene 4 with image: images_post-apocalyptic/frame_4.png
🎞️ Building video clip for Scene 5 with image: images_post-apocalyptic/frame_5.png
🎞️ Building video clip for Scene 6 with image: images_post-apocalyptic/frame_6.png
🎼 Adding background music under narration...
Moviepy - Building video post-apocalyptic_story_video.mp4.
MoviePy - Writing audio in post-apocalyptic_story_videoTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video post-apocalyptic_story_video.mp4
Moviepy - Done !
Moviepy - video ready post-apocalyptic_story_video.mp4
✅ Video saved: post-apocalyptic_story_video.mp4

📖 Loading Mistral model for storytelling...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


✨ Generating a mythical adventure story (Attempt 1)...
⚠️ Ending not strong enough. Trying one more time...

✨ Generating a mythical adventure story (Attempt 2)...

🖋️ Final Story:
1. In the mystic realm of Elarion, where magic flowed like rivers, lived a humble blacksmith named Rennik
2. The enchanting land was torn asunder by an ancient sorcerer's dark spell, and its denizens were scattered to the winds
3. One day, while shaping molten metal into armaments for his people, an otherworldly artifact fell from above—a rare crystal imbued with dormant celestial power
4. As soon as he grasped it, visions flooded his mind: memories of grand battles, forgotten victories, and unfulfilled legacies
5. With this newfound knowledge, his resolve solidified; now clad in shining armor and wielding a celestial sword, Rennik embarked on an epic quest to restore peace to Elarion, promising the long-dead heroes eternal rest if their lost spirits would guide him
6. But when he finally reached the heart 

  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 2: Given the following mythical adventure story scene:

"The enchanting land was torn asunder by an ancient sorcerer's dark spell, and its denizens were scattered to the winds"

Describe the ambient background sounds for this scene in one rich sentence.
2863134664


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 3: Given the following mythical adventure story scene:

"One day, while shaping molten metal into armaments for his people, an otherworldly artifact fell from above—a rare crystal imbued with dormant celestial power"

Describe the ambient background sou.
1362487168


  0%|          | 0/100 [00:00<?, ?it/s]

  🔊 Generating SFX 4: Given the following mythical adventure story scene:

"As soon as he grasped it, visions flooded his mind: memories of grand battles, forgotten victories, and unfulfilled legacies"

Describe the ambient background sounds for this scene in one rich sen.
2677273075


  0%|          | 0/100 [00:00<?, ?it/s]