# Setup

In [1]:
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

In [2]:
# Read the OpenAI API key as hidden input so it is not echoed in the notebook
import getpass
openai_key = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key:  ········


In [3]:
import os
os.environ["OPENAI_API_KEY"] = openai_key

# Read Your Story

In [4]:
import os
from pathlib import Path
data_dir = Path(os.getcwd()) / "data" / "0"
parts = []
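# os.listdir returns entries in arbitrary order; sort so the story parts are read in sequence
files = sorted(f for f in os.listdir(data_dir) if f.endswith('.txt'))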
logger.info(f"Files in data directory: {files}")
for f in files:
    with open(data_dir / f, 'r', encoding='utf8') as file:
        content = file.read()
        parts.append(content)
logger.info(f"Read {len(parts)} parts from the story files.")


INFO:__main__:Files in data directory: ['0.txt', '1.txt', '2.txt', '3.txt']
INFO:__main__:Read 4 parts from the story files.
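The parts are joined in listing order later on, so file order determines story order. Plain string sorting works for `0.txt` to `3.txt`, but it would place `10.txt` before `2.txt` once a story has more than ten parts. Below is a minimal sketch of a numeric sort key, assuming the files keep the `<number>.txt` naming seen above; `story_part_key` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def story_part_key(name: str):
    """Sort key: numeric stems first (by integer value), other names after (alphabetically)."""
    stem = Path(name).stem
    if stem.isdecimal():
        return (0, int(stem), "")
    return (1, 0, stem)


names = ["10.txt", "2.txt", "0.txt", "1.txt"]
ordered = sorted(names, key=story_part_key)
print(ordered)
```

Passing `key=story_part_key` to the `sorted(...)` call in the reading cell would apply the same ordering there.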


In [5]:
parts[1]

'The Arrival\nIt was past 8:00 PM when Riya, Kabir, and Mehul entered the silent city of Bansipur.\nA faint mist curled along the cracked roads, the streetlamps flickering as if unsure they wanted to stay lit.\nThey had been driving for hours, lost after taking a wrong turn from the highway. Fuel gauge—dangerously low.\nBansipur looked deserted, except for one building at the end of the main road:\na bright, glowing sign that read “Mehta Super Mart – Always Open.”\nRiya gave a nervous laugh.\n“Creepy or not, we need snacks. And water. And… maybe a map.”\nThey parked, the sound of their car engine echoing too loudly in the empty street.\nThe sliding doors of the supermarket opened with a slow hiss, though no one stood behind the counter.\nInside, the air was too cold for summer.\nThe lights buzzed overhead, but the aisles were perfectly stocked—cereal boxes lined like soldiers, canned goods gleaming, fruits unnaturally shiny.\nMehul called out, “Hello? Anyone here?”\nOnly the sound of t

# Scenes Creation

In [23]:
prompt = """
You are a YouTube Shorts storyboard generator, cinematic scene visual prompt writer, and animation planner.

I will give you a story.  
Your task is to split it into coherent scenes for a short vertical video (9:16 aspect ratio) and ensure narration, visuals, and motion match **exactly**.

---

## Scene Planning Rules
- **Scene chunking:** End a scene whenever the location, main subject, or time changes. Never mix two locations or actions in one scene.
- **Narration timing:** 
  - Target total video duration: {target_duration_sec} seconds
  - Narration pace: {words_per_second} words/sec
  - Each scene duration: 6–12 seconds
- **Continuity:** Narration and visuals must represent the *same* moment in time.
- **Perspective:** Visual prompts must always specify the exact camera POV (e.g., “point-of-view from driver’s seat,” “over-the-shoulder from Riya,” “low-angle looking up at supermarket sign”).
- **Sound:** Narration output must also suggest background music or ambient sound effects that match the mood.
- **Consistency:** Characters, props, vehicles, and environment must remain visually consistent across all scenes unless narration explicitly changes them.
- **Aspect Ratio:** Always assume vertical 9:16 framing.

---

## Visual Prompt Requirements
Each `visual_prompt` MUST be structured into **three layers**:

- **Foreground (in-frame, closest to viewer):** e.g., dashboard fuel gauge, character’s hand, weapon, face.  
- **Midground (main subject/action):** e.g., characters walking, car interior, main interaction.  
- **Background (environment/context):** e.g., supermarket sign, skyline, mountains, horizon.  

Rules:
- If a layer is not present, explicitly state: **“None.”**
- Always anchor the camera POV (driver’s seat view, over-the-shoulder, low-angle, bird’s-eye, etc.).
- For text on signs, posters, or screens, specify: **“in-frame, clear, sharp, legible text displaying EXACTLY: ‘YOUR TEXT HERE’.”**
- Include cinematic style (ultra-realistic, anime, film noir, etc.).
- Define lighting (golden hour, neon glow, candlelight, etc.).
- Specify camera angle and lens type.
- Describe colors and mood (gritty, vibrant, muted, pastel, etc.).
- Mention textures and atmospheric details (mist, rain, dust, sparks, lens flare).
- Add time of day and weather if relevant.
- Optionally reference artistic styles (photographer, painter, director).

---

## Animation Planning
For each scene, pick the most suitable animation type based on the visual structure:

- **Ken Burns:** Slow zoom/pan for close-up or still images.
- **Parallax:** For layered depth (foreground/midground/background separation).
- **Cinemagraph:** For subtle looping elements (flickering lights, mist, fire, rain).
- **Dolly Zoom / Push In / Pan:** For dramatic tension or reveals.
- **Static:** For moments meant to feel still and frozen.

---

## Output
Return a valid JSON object with the following keys:

- `profile` (object) with:
  - `geographic_location`: {{ "country": str, "specific_location": str }}
  - `time_period`: {{ "era": str, "time_of_day": str }}
  - `weather`: {{ "condition": str, "details": str }}
  - `ethnicity`: str
  - `mood`: str
  - `characters`: list of objects with:
    - `name`: str (or "Unnamed" if not specified)
    - `role`: str (hero, narrator, bystander, etc.)
    - `visual_features`: str (clothing, hair, build, etc.)
    - `psychological_features`: str (personality, emotions, motivations)

- `scenes`: array of objects, each containing:
  - `scene_index` (int): Scene number starting at 0.
  - `narration_text` (str): Voiceover narration.
  - `subtitle_text` (str): Short text (≤12 words) for on-screen subtitle.
  - `visual_prompt` (str): Cinematic AI description, strictly layered:
        Foreground: ...
        Midground: ...
        Background: ...
        (plus cinematic style, lighting, camera POV, colors, textures, etc.)
  - `background_music` (str): Suggested BGM/ambient sound.
  - `animation_type` (str): One of ["Ken Burns", "Parallax", "Cinemagraph", "Dolly Zoom", "Static"].
  - `duration_sec` (int): Scene length in seconds.

---

Story:
{story_text}

"""

In [24]:
from langchain_core.prompts import ChatPromptTemplate
template = ChatPromptTemplate.from_template(prompt)

In [25]:
from pydantic import BaseModel, Field
from typing import List, Optional

class GeographicLocation(BaseModel):
    country: Optional[str] = Field(None, description="Likely country where the story takes place")
    specific_location: Optional[str] = Field(None, description="Specific setting of the story")

class TimePeriod(BaseModel):
    era: Optional[str] = Field(None, description="Era or historical context")
    time_of_day: Optional[str] = Field(None, description="Time of day in which the story occurs")

class Weather(BaseModel):
    condition: Optional[str] = Field(None, description="Weather condition (rainy, sunny, snowy)")
    details: Optional[str] = Field(None, description="Extra weather details (mist, storm, clear sky)")

class Character(BaseModel):
    name: str = Field(..., description="Character name or 'Unnamed'")
    role: str = Field(..., description="Role in the story (hero, bystander, narrator)")
    visual_features: Optional[str] = Field(None, description="Appearance and clothing")
    psychological_features: Optional[str] = Field(None, description="Personality and emotions")

class Profile(BaseModel):
    geographic_location: GeographicLocation
    time_period: TimePeriod
    weather: Weather
    ethnicity: Optional[str] = Field(None, description="Ethnicity of main characters if implied")
    mood: Optional[str] = Field(None, description="Overall mood or tone of the scenes")
    characters: List[Character] = Field(default_factory=list, description="List of main characters")

class Scene(BaseModel):
    scene_index: int
    narration_text: str
    subtitle_text: str
    visual_prompt: str
    background_music: str
    animation_type: str
    duration_sec: int

class StoryBoard(BaseModel):
    profile: Profile
    scenes: List[Scene]

In [26]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.2,
    openai_api_key=openai_key,
)

llm = llm.with_structured_output(StoryBoard)  # Parse responses into the StoryBoard model

In [27]:
chain = template | llm

In [28]:
result = chain.invoke({
    "story_text": "\n\n".join(parts),
    "target_duration_sec": 60,
    "words_per_second": 3
})


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [29]:
len(result.scenes)

11
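The prompt asks for 6 to 12 seconds per scene and a 60 second total, yet 11 scenes at a minimum of 6 seconds each already add up to 66 seconds, so the model cannot satisfy both rules here. A small check along the following lines could surface such conflicts before any assets are generated. This is only a sketch: `SceneStub` is a hypothetical stand-in for the `Scene` fields being checked, and `tolerance_sec` is an assumed allowance that does not come from the original prompt.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_ANIMATIONS = {"Ken Burns", "Parallax", "Cinemagraph", "Dolly Zoom", "Static"}


@dataclass
class SceneStub:
    """Hypothetical stand-in carrying only the Scene fields that are checked."""
    scene_index: int
    narration_text: str
    subtitle_text: str
    animation_type: str
    duration_sec: int


def check_storyboard(scenes, target_duration_sec=60, words_per_second=3.0,
                     tolerance_sec=10) -> List[str]:
    """Return human-readable violations of the rules stated in the prompt."""
    problems = []
    for s in scenes:
        if not 6 <= s.duration_sec <= 12:
            problems.append(f"scene {s.scene_index}: duration {s.duration_sec}s outside 6-12s")
        if len(s.subtitle_text.split()) > 12:
            problems.append(f"scene {s.scene_index}: subtitle longer than 12 words")
        if s.animation_type not in ALLOWED_ANIMATIONS:
            problems.append(f"scene {s.scene_index}: unknown animation '{s.animation_type}'")
        # Time needed to speak the narration at the configured pace
        spoken_sec = len(s.narration_text.split()) / words_per_second
        if spoken_sec > s.duration_sec:
            problems.append(
                f"scene {s.scene_index}: narration needs ~{spoken_sec:.1f}s "
                f"but scene lasts {s.duration_sec}s"
            )
    total = sum(s.duration_sec for s in scenes)
    if abs(total - target_duration_sec) > tolerance_sec:
        problems.append(
            f"total duration {total}s is more than {tolerance_sec}s away "
            f"from target {target_duration_sec}s"
        )
    return problems
```

Since the function only reads attributes, calling `check_storyboard(result.scenes)` on the structured output should work as well.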

In [30]:
result.profile

Profile(geographic_location=GeographicLocation(country='India', specific_location='Bansipur'), time_period=TimePeriod(era='Contemporary', time_of_day='Night'), weather=Weather(condition='Misty', details='Faint mist curling along the roads'), ethnicity='South Asian', mood='Eerie and suspenseful', characters=[Character(name='Riya', role='Hero', visual_features='Long black hair, wearing a denim jacket and jeans', psychological_features='Cautious, slightly nervous but determined'), Character(name='Kabir', role='Hero', visual_features='Short hair, wearing a hoodie and cargo pants', psychological_features='Observant, slightly skeptical'), Character(name='Mehul', role='Hero', visual_features='Curly hair, wearing a t-shirt and backpack', psychological_features='Curious, slightly anxious')])

In [31]:
result.scenes

[Scene(scene_index=0, narration_text='It was past 8:00 PM when Riya, Kabir, and Mehul entered the silent city of Bansipur. A faint mist curled along the cracked roads, the streetlamps flickering as if unsure they wanted to stay lit.', subtitle_text='Entering Bansipur at 8:00 PM.', visual_prompt='Foreground: Dashboard fuel gauge showing low fuel. Midground: Riya, Kabir, and Mehul in the car, looking out the window. Background: Misty roads with flickering streetlamps. Cinematic style: Ultra-realistic. Lighting: Dim, with streetlamp flickers. Camera POV: Over-the-shoulder from Riya. Colors: Muted, with a hint of orange from streetlights. Textures: Mist and cracked roads. Time of day: Night.', background_music='Eerie ambient tones with distant car engine hum.', animation_type='Parallax', duration_sec=10),
 Scene(scene_index=1, narration_text="They had been driving for hours, lost after taking a wrong turn from the highway. Fuel gauge—dangerously low. Bansipur looked deserted, except for on

In [15]:
from tabulate import tabulate

def to_profile_tab(profile: Profile) -> str:
    rows = [
        ["Geographic Location", f"{profile.geographic_location.country or ''}, {profile.geographic_location.specific_location or ''}"],
        ["Time Period", f"Era: {profile.time_period.era or ''}, Time: {profile.time_period.time_of_day or ''}"],
        ["Weather", f"{profile.weather.condition or ''} ({profile.weather.details or ''})"],
        ["Ethnicity", profile.ethnicity or ""],
        ["Mood", profile.mood or ""],
    ]
    
    # Add characters as a sub-table
    char_rows = []
    for c in profile.characters:
        char_rows.append([
            c.name,
            c.role,
            c.visual_features or "",
            c.psychological_features or ""
        ])
    
    table_str = tabulate(rows, headers=["Attribute", "Value"], tablefmt="grid")
    
    if char_rows:
        char_table = tabulate(
            char_rows, 
            headers=["Name", "Role", "Visual Features", "Psychological Features"], 
            tablefmt="grid"
        )
        table_str += "\n\nCharacters:\n" + char_table
    
    return table_str


In [16]:
profile_str = to_profile_tab(result.profile)

In [17]:
print(profile_str)


+---------------------+----------------------------------------+
| Attribute           | Value                                  |
+=====================+========================================+
| Geographic Location | India, Bansipur                        |
+---------------------+----------------------------------------+
| Time Period         | Era: Modern Day, Time: Night           |
+---------------------+----------------------------------------+
| Weather             | Misty (Faint mist curling along roads) |
+---------------------+----------------------------------------+
| Ethnicity           | South Asian                            |
+---------------------+----------------------------------------+
| Mood                | Eerie and suspenseful                  |
+---------------------+----------------------------------------+

Characters:
+--------+--------+-------------------------------------------------+--------------------------+
| Name   | Role   | Visual Features                                 | Psychological Features   |
| Riya   | He

# Image Generation

In [18]:
endpoint = "http://34.228.224.128:8000/generate"

In [19]:
import requests
import json
import base64
from typing import Optional

def generate_image(prompt: str, image_path: Optional[str] = None) -> bytes:
    """POST the prompt (plus an optional base64-encoded reference image) and return the image bytes."""
    data = {"prompt": prompt}

    # Attach the reference image, if one is given, as a base64 string
    if image_path:
        with open(image_path, "rb") as f:
            data["image"] = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(endpoint,
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'})

    if response.status_code != 200:
        raise RuntimeError(f"Request failed: {response.status_code} {response.text}")

    # The response body is JSON carrying the image as base64 under 'image_base64'
    img = response.json()
    return base64.b64decode(img['image_base64'])


In [20]:
import os
from uuid import uuid4
from pathlib import Path

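# Use a unique run id so each run writes to its own stage directory
# (named run_id to avoid shadowing the built-in id)
run_id = str(uuid4())

stage_dir = Path(os.getcwd()) / "stage" / run_id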


In [21]:
import base64
def generate_scene(idx: int):
    """Generate a scene image based on the visual prompt."""
    scene = result.scenes[idx]
    prompt = f"""
        Profile:
        {profile_str}
        Scene:
        {scene.visual_prompt}
    """

    logger.info(f"Generating image for scene {idx} with prompt: {prompt}")

    # Generate the image, passing the previous scene's image as a reference when it exists
    previous_img_path = stage_dir / f"scene_{idx - 1}.png" if idx > 0 else None
    if previous_img_path and previous_img_path.exists():
        logger.info(f"Using image from scene {idx - 1} as reference: {previous_img_path}")
        image_bytes = generate_image(prompt, image_path=previous_img_path)
    else:
        logger.info("No previous image found, generating new image.")
        image_bytes = generate_image(prompt)

    # Save the image to the stage directory
    stage_dir.mkdir(parents=True, exist_ok=True)
    output_path = stage_dir / f"scene_{idx}.png"
    
    with open(output_path, "wb") as f:
        f.write(image_bytes)
    
    logger.info(f"Image for scene {idx} saved to {output_path}")
    return output_path


In [22]:
generate_scene(0)
generate_scene(1)

INFO:__main__:Generating image for scene 0 with prompt: 
        Profile:
        +---------------------+----------------------------------------+
| Attribute           | Value                                  |
+=====================+========================================+
| Geographic Location | India, Bansipur                        |
+---------------------+----------------------------------------+
| Time Period         | Era: Modern Day, Time: Night           |
+---------------------+----------------------------------------+
| Weather             | Misty (Faint mist curling along roads) |
+---------------------+----------------------------------------+
| Ethnicity           | South Asian                            |
+---------------------+----------------------------------------+
| Mood                | Eerie and suspenseful                  |
+---------------------+----------------------------------------+

Characters:
+--------+--------+-------------------------------------------------+--------------------------+
| Name   | Role   | Visual F

WindowsPath('C:/Samriddha/opensource/examples/misc/projects/story-to-video/stage/bab22c38-4da3-4af9-8d4a-71bb087f566e/scene_1.png')

In [15]:
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image

client = OpenAI()

def tts_generate(text, filename):
    """Generate narration audio from text using OpenAI TTS."""
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text
    )
    with open(filename, "wb") as f:
        f.write(speech.read())
    return filename

def image_generate(prompt, filename):
    """Generate image using OpenAI image API."""
    img = client.images.generate(
        model="gpt-image-1",
        prompt=prompt,
        size="1024x1536"
    )
    print(f"Generated image for prompt: {prompt}")
    image_base64 = img.data[0].b64_json

    # Decode the base64 payload into raw bytes
    image_bytes = base64.b64decode(image_base64)

    # Open the image with Pillow
    img = Image.open(BytesIO(image_bytes))

    # Resize to the 1080x1920 (9:16) YouTube Shorts frame.
    # Note: the 1024x1536 source is 2:3, so a plain resize stretches it vertically.
    img = img.resize((1080, 1920), Image.LANCZOS)

    # Save to disk
    img.save(filename)
    print(f"Image saved as {filename}")
    
    return filename

In [16]:
import subprocess

def create_video_with_ffmpeg(image_path, audio_path, output_path, duration):
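    """Render a looped still image plus narration audio into an H.264/AAC clip.

    The clip ends when the audio ends (-shortest); `duration` is accepted
    but is not currently passed to ffmpeg.
    """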
    command = [
        "ffmpeg",
        "-y",
        "-loop", "1",
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-c:a", "aac",
        "-b:a", "192k",
        "-pix_fmt", "yuv420p",
        "-shortest",
        output_path
    ]
    subprocess.run(command, check=True)
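If the estimated duration is meant to cap the clip length, one option is ffmpeg's `-t` output option. The sketch below builds the argument list as a pure function so it can be inspected without running ffmpeg; `build_still_video_command` is a hypothetical name, and treating `duration` as an upper bound is an assumption about the intent, not something the original code does.

```python
from typing import List, Optional


def build_still_video_command(image_path, audio_path, output_path,
                              duration: Optional[float] = None) -> List[str]:
    """Build the ffmpeg arguments for a still-image clip with narration audio."""
    command = [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-loop", "1",             # repeat the single input image
        "-i", str(image_path),
        "-i", str(audio_path),
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-c:a", "aac",
        "-b:a", "192k",
        "-pix_fmt", "yuv420p",
        "-shortest",              # stop when the shortest stream (the audio) ends
    ]
    if duration is not None:
        # As an output option, -t limits the length of the written file
        command += ["-t", f"{duration:.1f}"]
    command.append(str(output_path))
    return command


cmd = build_still_video_command("scene0.png", "scene0.mp3", "scene0.mp4", duration=4.0)
print(" ".join(cmd))
```

With both `-shortest` and `-t` present, the clip stops at whichever limit is reached first, so a too-short estimate would cut the narration.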

In [17]:
scene_videos = []

for idx, scene in enumerate(result.scenes):
    audio_file = f"scene{idx}.mp3"
    image_file = f"scene{idx}.png"
    video_file = f"scene{idx}.mp4"

    # Generate assets
    tts_generate(scene.narration_text, audio_file)
    image_generate(scene.visual_prompt, image_file)

    # Estimate duration from narration speed (~3 words/sec)
    words = len(scene.narration_text.split())
    duration = round(words / 3.0, 1)

    # Combine into scene video
    create_video_with_ffmpeg(image_file, audio_file, video_file, duration)
    scene_videos.append(video_file)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A wide-angle shot of a deserted city street at night. The scene is ultra-realistic with a moody, noir style. Streetlamps cast flickering, dim light over cracked roads. A faint mist swirls, creating an eerie atmosphere. The camera captures the scene from a low angle, emphasizing the desolation. Colors are muted with a cold blue tint, enhancing the chilling mood. Textures of cracked asphalt and swirling mist are prominent.
Image saved as scene0.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A medium shot of a car parked on an empty street, with a glowing neon sign of 'Mehta Super Mart' in the background. The scene is depicted in a cinematic, neo-noir style with vibrant neon colors contrasting against the dark surroundings. The camera captures the scene from a slightly elevated angle, focusing on the car and the sign. The neon glow casts colorful reflections on the wet pavement, adding texture and depth.
Image saved as scene1.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: An ultra-realistic, wide-angle shot of the supermarket interior. The scene is brightly lit with a sterile, fluorescent glow. The camera captures the aisles from a low angle, emphasizing the perfectly stocked shelves. Colors are vibrant but slightly surreal, with a focus on the shiny, unnatural appearance of the products. The buzzing of the lights adds an unsettling atmosphere.
Image saved as scene2.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A narrow staircase descending into darkness, lit by a single, flickering bulb. The scene is captured in a film noir style with deep shadows and high contrast. The camera takes a close-up shot of the sign and the staircase, emphasizing the mystery and foreboding atmosphere. The colors are muted, with a focus on the interplay of light and shadow.
Image saved as scene3.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A surreal, Escher-like staircase scene with repeating elements. The camera captures the scene from a wide-angle, slightly tilted perspective, emphasizing the disorienting, infinite nature of the staircases. The lighting is dim, with a single bulb casting long shadows. The colors are dark and gritty, enhancing the sense of confusion and unease.
Image saved as scene4.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A close-up shot of a dark red smear on a wall, with a flickering bulb above. The scene is captured in a horror style with high contrast and deep shadows. The camera focuses on the texture of the smear, with the bulb casting an eerie glow. The colors are dark and ominous, with a focus on the red smear and the shadows it casts.
Image saved as scene5.png


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/images/generations "HTTP/1.1 200 OK"


Generated image for prompt: A dark, atmospheric shot of a staircase with faint, approaching footsteps. The scene is captured in a suspenseful, thriller style with minimal lighting. The camera takes a low-angle shot, focusing on the staircase and the shadows. The colors are muted, with a focus on the interplay of light and shadow. The atmosphere is tense, with a sense of impending danger.
Image saved as scene6.png


In [2]:
scene_videos = [
    "scene0.mp4",
    "scene1.mp4",
    "scene2.mp4",
    "scene3.mp4",
    "scene4.mp4",
    "scene5.mp4",
    "scene6.mp4",
]

In [3]:
with open("scenes.txt", "w") as f:
    for video in scene_videos:
        f.write(f"file '{video}'\n")

import subprocess
subprocess.run([
    "ffmpeg", "-f", "concat", "-safe", "0", "-i", "scenes.txt",
    "-c", "copy", "story_1.mp4"
], check=True)


CompletedProcess(args=['ffmpeg', '-f', 'concat', '-safe', '0', '-i', 'scenes.txt', '-c', 'copy', 'story_1.mp4'], returncode=0)
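The concat list wraps each file name in single quotes, which works for names like `scene0.mp4` but breaks if a name itself contains a single quote. A minimal sketch of writing the list with that character escaped follows; `concat_entry` and `write_concat_list` are hypothetical helpers, and the `'\''` form is the usual way to write a literal quote inside a single-quoted ffmpeg string.

```python
import tempfile
from pathlib import Path


def concat_entry(path) -> str:
    """Format one line of an ffmpeg concat list, escaping single quotes."""
    escaped = str(path).replace("'", "'\\''")
    return f"file '{escaped}'"


def write_concat_list(videos, list_path="scenes.txt") -> Path:
    """Write one 'file ...' line per video and return the list file path."""
    list_path = Path(list_path)
    text = "".join(concat_entry(v) + "\n" for v in videos)
    list_path.write_text(text, encoding="utf8")
    return list_path


# Demonstration in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    listing = write_concat_list(["scene0.mp4", "rain's end.mp4"], Path(tmp) / "scenes.txt")
    demo_text = listing.read_text(encoding="utf8")
print(demo_text)
```

Note that `-c copy` joins the clips without re-encoding, which generally requires all of them to share the same codecs and stream parameters; that holds here as long as every clip comes from the same ffmpeg settings and image size.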