# Hear2See: Fine-tuning Stable Diffusion (LoRA) + Dataset Prep

This notebook helps you prepare image/audio datasets and fine-tune a text-conditioned LoRA for Stable Diffusion using your 150 scene prompts.

Runtime: GPU recommended. Mount Drive and ensure you have space in `/content/drive/MyDrive/hear2see/`.

In [None]:
# Install dependencies
!pip -q install -U --no-deps git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors \
  git+https://github.com/huggingface/peft.git
!pip -q install datasets pillow tqdm safetensors pydub ffmpeg-python
!pip -q install xformers || true

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

BASE = '/content/drive/MyDrive/hear2see'
import os
for d in ['audio/raw','audio/processed','audio/segments','dataset/raw_images','dataset/train_images','dataset/videos','lora_out']:
    os.makedirs(os.path.join(BASE,d), exist_ok=True)
print('Created project folders under', BASE)

Mounted at /content/drive
Created project folders under /content/drive/MyDrive/hear2see


## Prompts (150)
The cell below writes `prompts.txt` into your Drive so you don't need to paste them manually.

In [None]:
# Write prompts.txt (150 scene-based prompts)
BASE = '/content/drive/MyDrive/hear2see'
PROMPT_FILE = os.path.join(BASE, 'prompts.txt')
prompts = [
"a bird flying across the horizon at sunset, golden sky reflecting on water",
"waves crashing gently against cliffs at dawn with mist rolling in",
"a waterfall flowing into a calm lake surrounded by dense forest",
"snow falling softly on a cabin with smoke rising from the chimney",
"a desert dune shifting under strong winds beneath an orange sky",
"a thunderstorm forming over a mountain range with lightning flashes",
"a field of sunflowers swaying in the afternoon breeze",
"a river winding through a valley under twilight clouds",
"a stormy sea with a lighthouse beaming through fog",
"leaves falling from trees in a quiet autumn park",
"a clear blue lake reflecting tall pine trees at sunrise",
"a night sky full of stars over a calm desert",
"rain drops creating ripples in puddles on a forest path",
"a campfire glowing beside a tent under a moonlit sky",
"fog moving across rolling hills at early morning",
"volcanic smoke rising from a distant mountain",
"light filtering through jungle canopy onto mossy rocks",
"a rainbow forming after a storm over open fields",
"mist drifting over a still river during sunrise",
"a frozen lake reflecting northern lights in winter night",
"waves lapping at the shore beside wooden boats",
"a small creek flowing beneath fallen branches",
"golden wheat swaying under a bright afternoon sun",
"a gentle rain shower over blooming spring flowers",
"a lone tree standing in the middle of an open meadow",
"sunlight piercing through clouds after heavy rain",
"fireflies glowing above grass at twilight",
"sand blowing across empty desert dunes at sunset",
"a mountain peak covered in clouds and snow",
"autumn leaves floating on a calm pond surface",
"a forest path lit by scattered beams of sunlight",
"a frozen waterfall glistening in pale morning light",
"waves breaking softly on a rocky coastline at dusk",
"a valley covered in mist as birds fly overhead",
"light rain falling on a city park bench at night",
"sunbeams shining through the fog of a dense forest",
"a peaceful lake with reflections of nearby mountains",
"a river flowing under a wooden bridge surrounded by greenery",
"a coastal village under a pink and orange evening sky",
"snow-covered pine trees glowing under golden sunlight",

"a lone taxi driving through neon-lit streets in the rain",
"a pedestrian crossing under flashing billboards at night",
"steam rising from subway grates on a cold morning",
"a cyclist passing graffiti walls in an urban alleyway",
"a crowded street market bustling under colorful umbrellas",
"a person walking with an umbrella past city lights at dusk",
"cars moving across a wet highway reflecting headlights",
"a train passing over a bridge above a calm river at night",
"streetlights glowing in fog on an empty road",
"people chatting at a rooftop café under the city skyline",
"a tram moving through snow-covered streets in winter",
"a musician playing guitar on a corner under a lamp post",
"raindrops sliding down a window overlooking city traffic",
"a plane taking off as the skyline glows with sunrise",
"a couple walking across a bridge in light drizzle",
"a food vendor preparing noodles in a busy night market",
"a skateboarder gliding through an empty parking lot",
"neon reflections shimmering on wet pavement downtown",
"a street performer juggling in front of an audience",
"a clock tower striking midnight under scattered clouds",
"shop signs flickering while rain pours over the street",
"a traffic light turning green as cars start moving",
"a cat sitting on a windowsill watching the street below",
"pigeons taking flight above a crowded plaza",
"a delivery truck driving through narrow city lanes",
"a group of friends laughing near a food truck",
"a bus arriving at a station filled with commuters",
"a quiet alley with flickering lanterns at night",
"a bridge shimmering with city lights at twilight",
"pedestrians rushing under umbrellas in a sudden downpour",

"a woman spinning in the rain with her arms outstretched",
"a child running through a field of tall grass at sunset",
"a man reading a book beside a campfire under the stars",
"a dancer performing gracefully in an empty theatre",
"a couple holding hands walking through autumn leaves",
"a painter working on a canvas by a sunny window",
"a girl releasing a lantern into the night sky",
"a boy flying a kite on a breezy afternoon",
"a musician playing violin under a streetlight",
"a person meditating on a mountain peak at dawn",
"a photographer capturing waves at golden hour",
"a young woman gazing out of a rainy window",
"friends roasting marshmallows beside a bonfire",
"a runner jogging through foggy morning streets",
"a mother lifting her child against sunset light",
"a man sketching cityscapes in his notebook",
"a dancer leaping across a dimly lit stage",
"a traveler adjusting a camera on a tripod",
"a chef chopping vegetables in warm kitchen light",
"a scientist writing notes near a glowing screen",
"a person listening to music with headphones on a train",
"a surfer carrying a board along the beach at dawn",
"a farmer harvesting crops under bright morning sun",
"a fisherman casting a line into still water",
"a teacher writing equations on a chalkboard",
"a student studying late with a desk lamp on",
"a firefighter spraying water to extinguish flames",
"a gardener watering plants during golden hour",
"a photographer adjusting focus beside a waterfall",
"a child blowing bubbles in a park",

"colored smoke swirling in slow motion under soft light",
"ink dispersing through clear water creating blue trails",
"paper airplanes flying through a sunlit room",
"balloons floating upward into a cloudy sky",
"paint splashing onto canvas in slow motion",
"a candle flame flickering in darkness",
"gears turning slowly inside an antique clock",
"water droplets falling from a leaf in macro view",
"sparkles drifting through a dark stage spotlight",
"shards of glass glimmering mid-air after shatter",
"sand flowing through an hourglass in macro closeup",
"a pendulum swinging in steady rhythm under a lamp",
"a clock hand ticking slowly with dust particles visible",
"colored powders bursting in the air in slow motion",
"a droplet of ink hitting still water forming ripples",
"a candle melting with wax flowing down gently",
"a lightbulb glowing faintly in an empty room",
"a curtain moving gently with wind from a window",
"feathers floating slowly to the ground",
"smoke rising from incense against dark background",

"a lion walking across the savannah under golden sun",
"a deer drinking from a calm forest stream",
"a butterfly landing on a blooming flower",
"a school of fish swimming through coral reef",
"a bird flying over a mountain valley at sunrise",
"a fox running through snowy woods at dawn",
"an eagle soaring above misty cliffs",
"a cat stretching on a windowsill bathed in morning light",
"a horse galloping across an open field",
"a dolphin jumping out of ocean waves at sunset",
"a bee collecting pollen from sunflowers",
"a turtle crawling across a sandy beach toward water",
"a rabbit hopping across a field under moonlight",
"a wolf howling at full moon on a mountain ridge",
"a dog running along the seashore with waves splashing",
"a hummingbird hovering beside red flowers",
"a peacock spreading its feathers in soft sunlight",
"a parrot flying between tropical palm trees",
"a bear catching fish from a rushing river",
"a hawk diving swiftly through the air",

"a lightning bolt striking over a calm ocean at night",
"rain falling over city rooftops under a gray sky",
"fog rolling through the mountains after rainfall",
"snowflakes falling under a streetlight in winter evening",
"wind blowing leaves across an empty playground",
"clouds moving rapidly across a bright blue sky",
"mist swirling over a dimly lit lake at dawn",
"hailstones bouncing on a tin roof during storm",
"the sun breaking through clouds after heavy rain",
"a rainbow forming over waterfalls in gentle sunlight"
]

with open(PROMPT_FILE,'w',encoding='utf-8') as f:
    f.write('\n'.join(prompts))
print('Wrote', len(prompts), 'prompts to', PROMPT_FILE)
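
Because later cells pair images with prompts by position and the file stores one prompt per line, it can help to check the list before writing it. The following is a minimal sketch; `check_prompts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def check_prompts(prompts, expected_count):
    """Return a list of problems found in a prompt list (empty list means OK)."""
    problems = []
    cleaned = [p.strip() for p in prompts]
    if len(cleaned) != expected_count:
        problems.append(f"expected {expected_count} prompts, found {len(cleaned)}")
    for i, p in enumerate(cleaned, start=1):
        if not p:
            problems.append(f"prompt {i} is empty")
        elif "\n" in p:
            # An embedded newline would split one prompt across two lines of prompts.txt
            problems.append(f"prompt {i} contains a newline")
    for p, n in Counter(cleaned).items():
        if p and n > 1:
            problems.append(f"duplicate prompt: {p!r}")
    return problems

sample = [
    "a candle flame flickering in darkness",
    "feathers floating slowly to the ground",
    "a candle flame flickering in darkness",
]
print(check_prompts(sample, expected_count=3))
```

In the notebook this could be called as `check_prompts(prompts, 150)` just before the file is written.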


## Optional: synthesize images from prompts (only if you want auto-generated images)
This cell generates one image per prompt using Stable Diffusion. It is optional: if you will prepare images yourself, skip it.

In [None]:
# Image synthesis cell: generates one image per prompt and saves to dataset/raw_images
# WARNING: This will use GPU time and Colab units. Start with SMALL_NUM for a test.
from diffusers import StableDiffusionPipeline
import torch, os, random, time
from PIL import Image, ImageEnhance

BASE = '/content/drive/MyDrive/hear2see'
PROMPT_FILE = os.path.join(BASE, 'prompts.txt')
OUT_DIR = os.path.join(BASE, 'dataset', 'raw_images')
os.makedirs(OUT_DIR, exist_ok=True)

# Load prompts
with open(PROMPT_FILE,'r',encoding='utf-8') as f:
    prompts = [l.strip() for l in f.readlines() if l.strip()]

# Test small run setting
SMALL_NUM = 10  # change to len(prompts) to generate all
STEPS = 20
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', DEVICE)

# Load the pipeline (downloads the model on first use); fp16 only works on GPU
DTYPE = torch.float16 if DEVICE == 'cuda' else torch.float32
pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5', torch_dtype=DTYPE)
pipe = pipe.to(DEVICE)

seed_base = random.randint(1,2**30)
for i, p in enumerate(prompts[:SMALL_NUM]):
    seed = seed_base + i
    gen = torch.Generator(device=DEVICE).manual_seed(seed)
    try:
        # The pipeline already runs in DTYPE, so no autocast context is needed
        out = pipe(p, num_inference_steps=STEPS, generator=gen)
        img = out.images[0]
    except Exception as e:
        print('Generation failed for prompt', i, e)
        continue
    # slight augmentation
    if random.random() < 0.3:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    fname = os.path.join(OUT_DIR, f'img_{i+1:05d}.png')
    img.save(fname)
    print('Saved', fname)
    time.sleep(0.08)
print('Done. Generated images in', OUT_DIR)
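
The loop above draws `seed_base` at random, so a rerun produces different images for the same prompt. If reproducible images are wanted, one option is to derive the seed from the prompt text. This is a sketch of that alternative, not what the cell does; `seed_for_prompt` and the `salt` value are hypothetical.

```python
import hashlib

def seed_for_prompt(prompt, salt="hear2see"):
    """Derive a stable, non-negative 31-bit seed from the prompt text."""
    digest = hashlib.sha256(f"{salt}:{prompt}".encode("utf-8")).digest()
    # Keep the value below 2**31 so it is a valid seed on any platform
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF

for p in ["a candle flame flickering in darkness",
          "feathers floating slowly to the ground"]:
    print(seed_for_prompt(p), p)
```

The result can be passed to `torch.Generator(device=DEVICE).manual_seed(...)` in place of `seed_base + i`.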

## Prepare train images & captions.txt
If you synthesized images above or uploaded your own into `dataset/raw_images`, this cell will resize them to 512×512 and write `captions.txt` aligned to prompts.

In [None]:
from pathlib import Path
from PIL import Image
import os
BASE = '/content/drive/MyDrive/hear2see'
RAW = Path(os.path.join(BASE,'dataset','raw_images'))
OUT = Path(os.path.join(BASE,'dataset','train_images'))
OUT.mkdir(parents=True, exist_ok=True)
PROMPT_FILE = os.path.join(BASE,'prompts.txt')

prompts = [l.strip() for l in open(PROMPT_FILE,'r',encoding='utf-8').read().splitlines() if l.strip()]
files = sorted([p for p in RAW.glob('*.*') if p.suffix.lower() in ['.png','.jpg','.jpeg']])
print('Raw files found:', len(files), 'Prompts:', len(prompts))

n = min(len(files), len(prompts))
if n==0:
    print('No raw images found. Upload or run the synthesis cell first.')
else:
    captions = []
    for i in range(n):
        img = Image.open(files[i]).convert('RGB').resize((512,512))
        outp = OUT / f'img_{i+1:05d}.png'
        img.save(outp)
        captions.append(prompts[i])
    capf = OUT / 'captions.txt'
    capf.write_text('\n'.join(captions), encoding='utf-8')
    print('Prepared', n, 'train images and wrote captions to', capf)
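
The cell above pairs the i-th sorted file with the i-th prompt. That alignment drifts if the synthesis cell skipped a prompt after a failed generation, because the file numbering then has a gap. A minimal sketch of pairing by the number embedded in the filename instead; `pair_by_index` is a hypothetical helper and assumes the `img_NNNNN` naming used by the synthesis cell.

```python
import re

_NAME = re.compile(r"img_(\d+)\.(?:png|jpg|jpeg)$", re.IGNORECASE)

def pair_by_index(filenames, prompts):
    """Pair each img_NNNNN file with prompt number NNNNN (1-based).

    Files whose name does not match the pattern, or whose number has no
    corresponding prompt, are skipped.
    """
    pairs = []
    for name in sorted(filenames):
        m = _NAME.search(name)
        if not m:
            continue
        idx = int(m.group(1))
        if 1 <= idx <= len(prompts):
            pairs.append((name, prompts[idx - 1]))
    return pairs

prompts = ["first prompt", "second prompt", "third prompt"]
files = ["img_00001.png", "img_00003.png"]  # img_00002 failed to generate
print(pair_by_index(files, prompts))
# [('img_00001.png', 'first prompt'), ('img_00003.png', 'third prompt')]
```

Images uploaded by hand with arbitrary names would not match this pattern, so positional pairing remains the fallback for those.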


## Training script (writes train_lora_text_cond.py). This is the text-conditioned LoRA trainer.

In [None]:
%%bash
cat > train_lora_text_cond.py <<'PY'
# training script placeholder
PY
echo 'Wrote train_lora_text_cond.py (placeholder)'


## Configure accelerate (run once)

In [None]:
# Write a default accelerate config (non-interactive)
!accelerate config default || true


## Launch training (example command)
Start with 500 steps to test. Edit parameters as needed.
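
To judge what a step budget means for this dataset, it helps to convert steps into passes over the data. The sketch below assumes 150 training images, a batch size of 1, and no gradient accumulation; `epochs_for_steps` is a hypothetical helper.

```python
def epochs_for_steps(max_train_steps, num_images, batch_size=1, grad_accum=1):
    """Approximate number of passes over the dataset for a given step budget."""
    samples_per_step = batch_size * grad_accum
    return max_train_steps * samples_per_step / num_images

print(round(epochs_for_steps(500, 150), 2))  # 3.33
```

So a 500-step test run sees each of the 150 images roughly three times.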

In [None]:
!accelerate launch train_lora_text_cond.py \
  --pretrained_model runwayml/stable-diffusion-v1-5 \
  --train_data_dir "/content/drive/MyDrive/hear2see/dataset/train_images" \
  --captions_file "/content/drive/MyDrive/hear2see/dataset/train_images/captions.txt" \
  --output_dir "/content/drive/MyDrive/hear2see/lora_out" \
  --resolution 512 \
  --batch_size 1 \
  --max_train_steps 500 \
  --lr 1e-4 \
  --lora_rank 8

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
/usr/bin/python3: can't open file '/content/train_lora_text_cond.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1235, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 823, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Co

## Inference & quick test: load LoRA and generate sample images/clips

In [None]:
from diffusers import StableDiffusionPipeline
import torch, os
MODEL_BASE = 'runwayml/stable-diffusion-v1-5'
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
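# fp16 is only supported on GPU; fall back to fp32 on CPU
DTYPE = torch.float16 if DEVICE == 'cuda' else torch.float32
pipe = StableDiffusionPipeline.from_pretrained(MODEL_BASE, torch_dtype=DTYPE).to(DEVICE)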
# load adapters if training finished
ATTN_DIR = '/content/drive/MyDrive/hear2see/lora_out'
try:
    pipe.unet.load_attn_procs(ATTN_DIR)
    print('Loaded LoRA adapters from', ATTN_DIR)
except Exception as e:
    print("Adapter load failed (fine if you haven't trained yet):", e)

# generate a small sample (3 frames) and stitch
prompt = 'a bird flying across the horizon at sunset, golden sky reflecting on water, cinematic, high detail'
outdir = '/content/drive/MyDrive/hear2see/demo_frames'
os.makedirs(outdir, exist_ok=True)
for i in range(3):
    gen = torch.Generator(device=DEVICE).manual_seed(100+i)
    img = pipe(prompt, num_inference_steps=20, generator=gen).images[0]
    img.save(os.path.join(outdir, f'frame_{i+1:03d}.png'))
print('Saved frames to', outdir)
# stitch using ffmpeg
os.system(f"ffmpeg -y -framerate 6 -i {outdir}/frame_%03d.png -c:v libx264 -pix_fmt yuv420p /content/drive/MyDrive/hear2see/demo_clip.mp4")
print('Demo clip saved to /content/drive/MyDrive/hear2see/demo_clip.mp4')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Adapter load failed (fine if you haven't trained yet): Error no file named pytorch_lora_weights.bin found in directory /content/drive/MyDrive/hear2see/lora_out.


Saved frames to /content/drive/MyDrive/hear2see/demo_frames
Demo clip saved to /content/drive/MyDrive/hear2see/demo_clip.mp4


## Final notes
- Start small to conserve Colab units: test with `SMALL_NUM=10` in the image synthesis cell and `max_train_steps=500` in training.
- After you confirm results, rerun with full dataset and higher steps.
- The notebook writes `prompts.txt` into your Drive automatically.


In [None]:
from pathlib import Path

# Try MyDrive first; add Shared Drive paths below to extend the search
CANDIDATES = [
    "/content/drive/MyDrive/hear2see",
]
# If you keep project on a Shared Drive, add your drive name here:
# CANDIDATES.append("/content/drive/Shareddrives/<YourTeamDriveName>/hear2see")

BASE = next((p for p in CANDIDATES if Path(p).exists()), None)
assert BASE is not None, "❌ Could not find your 'hear2see' folder. Is Drive mounted and path correct?"

print("✅ BASE:", BASE)

train_dir = Path(f"{BASE}/dataset/train_images")
print("📂 images:", len(list(train_dir.glob("img_*.png"))))
caps = (train_dir/"captions.txt")
print("📝 captions.txt exists?", caps.exists())
print("🧾 lora_out exists?", Path(f"{BASE}/lora_out").exists())


✅ BASE: /content/drive/MyDrive/hear2see
📂 images: 150
📝 captions.txt exists? True
🧾 lora_out exists? True


In [None]:
# Clean up conflicting preinstalls (optional but helps)
!pip uninstall -y timm opencv-python opencv-python-headless opencv-contrib-python jax jaxlib pytensor thinc || true

# Stable, modern combo that works with the official trainer
!pip install -U "numpy==1.26.4" \
  torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

!pip install -U diffusers==0.30.3 transformers==4.44.2 accelerate==0.33.0 \
  safetensors peft==0.11.1 datasets==2.21.0


Found existing installation: timm 1.0.20
Uninstalling timm-1.0.20:
  Successfully uninstalled timm-1.0.20
Found existing installation: opencv-python 4.12.0.88
Uninstalling opencv-python-4.12.0.88:
  Successfully uninstalled opencv-python-4.12.0.88
Found existing installation: opencv-python-headless 4.12.0.88
Uninstalling opencv-python-headless-4.12.0.88:
  Successfully uninstalled opencv-python-headless-4.12.0.88
Found existing installation: opencv-contrib-python 4.12.0.88
Uninstalling opencv-contrib-python-4.12.0.88:
  Successfully uninstalled opencv-contrib-python-4.12.0.88
Found existing installation: jax 0.7.2
Uninstalling jax-0.7.2:
  Successfully uninstalled jax-0.7.2
Found existing installation: jaxlib 0.7.2
Uninstalling jaxlib-0.7.2:
  Successfully uninstalled jaxlib-0.7.2
Found existing installation: pytensor 2.35.1
Uninstalling pytensor-2.35.1:
  Successfully uninstalled pytensor-2.35.1
Found existing installation: thinc 8.3.6
Uninstalling thinc-8.3.6:
  Successfully uninstal

In [None]:
import torch, diffusers, transformers, peft, numpy
print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("numpy:", numpy.__version__)


torch: 2.8.0+cu126
diffusers: 0.30.3
transformers: 4.44.2
peft: 0.11.1
numpy: 1.26.4
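
The printed torch version (2.8.0+cu126) differs from the 2.3.0 pin in the install cell, which may mean the pinned install did not take effect or the runtime was not restarted afterwards. A small sketch for spotting such mismatches; `find_mismatches` is a hypothetical helper, and it ignores local build suffixes such as `+cu126`.

```python
def find_mismatches(pins, installed):
    """Return {package: (wanted, got)} for packages whose version differs from the pin."""
    mismatches = {}
    for name, wanted in pins.items():
        got = installed.get(name)
        # Drop a local build suffix such as "+cu126" before comparing
        base = got.split("+", 1)[0] if got else None
        if base != wanted:
            mismatches[name] = (wanted, got)
    return mismatches

pins = {"torch": "2.3.0", "diffusers": "0.30.3", "transformers": "4.44.2",
        "peft": "0.11.1", "numpy": "1.26.4"}
installed = {"torch": "2.8.0+cu126", "diffusers": "0.30.3", "transformers": "4.44.2",
             "peft": "0.11.1", "numpy": "1.26.4"}
print(find_mismatches(pins, installed))  # {'torch': ('2.3.0', '2.8.0+cu126')}
```

In a live session, `installed` could be filled with `importlib.metadata.version(name)` for each pinned package.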


In [None]:
!git clone -b v0.30.3 https://github.com/huggingface/diffusers.git


Cloning into 'diffusers'...
remote: Enumerating objects: 110148, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 110148 (delta 100), reused 73 (delta 73), pack-reused 110018 (from 2)
Receiving objects: 100% (110148/110148), 82.75 MiB | 21.71 MiB/s, done.
Resolving deltas: 100% (82043/82043), done.
Note: switching to 'c9ff360966327ace3faad3807dc871a4e5447501'.




In [None]:
!head -n 5 /content/drive/MyDrive/hear2see/dataset/train_images/metadata.jsonl


{"image": "img_00001.png", "text": "a bird flying across the horizon at sunset, golden sky reflecting on water"}
{"image": "img_00002.png", "text": "waves crashing gently against cliffs at dawn with mist rolling in"}
{"image": "img_00003.png", "text": "a waterfall flowing into a calm lake surrounded by dense forest"}
{"image": "img_00004.png", "text": "snow falling softly on a cabin with smoke rising from the chimney"}
{"image": "img_00005.png", "text": "a desert dune shifting under strong winds beneath an orange sky"}
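
The records above use an `image` key, whereas the Hugging Face `imagefolder` loader links metadata rows to image files through a `file_name` field. A minimal sketch for checking a `metadata.jsonl` before training; `missing_keys` is a hypothetical helper.

```python
import json

REQUIRED = ("file_name", "text")

def missing_keys(jsonl_text, required=REQUIRED):
    """Return {line_number: [missing keys]} for records lacking required keys."""
    problems = {}
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [k for k in required if k not in record]
        if missing:
            problems[lineno] = missing
    return problems

old = '{"image": "img_00001.png", "text": "a bird flying"}'
new = '{"file_name": "img_00001.png", "text": "a bird flying"}'
print(missing_keys(old))  # {1: ['file_name']}
print(missing_keys(new))  # {}
```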


In [None]:
# ==== ONE-CELL, AUTO-FIX TRAIN LAUNCHER ====

import os, json, subprocess, shlex
from pathlib import Path

# --- Paths ---
BASE = "/content/drive/MyDrive/hear2see"
IMG_DIR = Path(f"{BASE}/dataset/train_images")
META = IMG_DIR / "metadata.jsonl"
CAPS = IMG_DIR / "captions.txt"
OUT = Path(f"{BASE}/lora_out")
OUT.mkdir(parents=True, exist_ok=True)

# --- 0) Sanity checks ---
imgs = sorted(IMG_DIR.glob("img_*.png"))
assert imgs, f"No images found in {IMG_DIR}. Expected files like img_00001.png"
assert CAPS.exists(), f"Missing captions file: {CAPS}"

caps = [l.strip() for l in CAPS.read_text(encoding="utf-8").splitlines() if l.strip()]
assert len(caps) == len(imgs), f"Images ({len(imgs)}) and captions ({len(caps)}) must match 1:1"

# --- 1) Rebuild metadata.jsonl with the EXACT keys the HF loader expects ---
with META.open("w", encoding="utf-8") as f:
    for p, t in zip(imgs, caps):
        # IMPORTANT: 'file_name' + 'text' (trainer demands 'file_name' in the JSONL)
        f.write(json.dumps({"file_name": p.name, "text": t}, ensure_ascii=False) + "\n")
print(f"✅ Rewrote {META} with {len(imgs)} entries (keys: file_name, text)")

# --- 2) Probe dataset columns as HF actually exposes them ---
from datasets import load_dataset
ds = load_dataset("imagefolder", data_dir=str(IMG_DIR), split="train")  # uses metadata.jsonl
print("üìã Dataset features:", ds.features)

# Decide which column names to pass to the trainer:
if "image" in ds.features and "text" in ds.features:
    IMAGE_COL, CAPTION_COL = "image", "text"
elif "file_name" in ds.features and "text" in ds.features:
    IMAGE_COL, CAPTION_COL = "file_name", "text"
else:
    raise ValueError(f"Could not find suitable columns in dataset. Features were: {ds.features}")

print(f"✅ Will use --image_column='{IMAGE_COL}' --caption_column='{CAPTION_COL}'")

# --- 3) Build and run the training command safely ---
cmd = f"""
python diffusers/examples/text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
  --train_data_dir="{IMG_DIR}" \
  --image_column="{IMAGE_COL}" \
  --caption_column="{CAPTION_COL}" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --mixed_precision=fp16 \
  --max_train_steps=300 \
  --seed=42 \
  --output_dir="{OUT}"
"""

print("üöÄ Launching trainer‚Ä¶")
print(cmd)
proc = subprocess.run(shlex.split(cmd), stdout=None, stderr=None)
print("Trainer finished with return code:", proc.returncode)

# --- 4) List outputs so you can confirm weights exist ---
print("\nüì¶ lora_out contents:")
for p in sorted(OUT.glob("*")):
    print(" -", p.name)


✅ Rewrote /content/drive/MyDrive/hear2see/dataset/train_images/metadata.jsonl with 150 entries (keys: file_name, text)


📋 Dataset features: {'image': Image(mode=None, decode=True, id=None), 'text': Value(dtype='string', id=None)}
✅ Will use --image_column='image' --caption_column='text'
🚀 Launching trainer…

python diffusers/examples/text_to_image/train_text_to_image_lora.py   --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5   --train_data_dir="/content/drive/MyDrive/hear2see/dataset/train_images"   --image_column="image"   --caption_column="text"   --resolution=512   --train_batch_size=1   --gradient_accumulation_steps=1   --learning_rate=1e-4   --lr_scheduler=constant   --lr_warmup_steps=0   --mixed_precision=fp16   --max_train_steps=300   --seed=42   --output_dir="/content/drive/MyDrive/hear2see/lora_out"

Trainer finished with return code: 0

📦 lora_out contents:
 - logs
 - pytorch_lora_weights.safetensors
 - write_test_1761629757.txt
 - write_test_1761653969.txt
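
The launcher above builds one command string and splits it with `shlex.split`. An alternative sketch assembles the argument list directly, which sidesteps quoting issues when a path contains spaces. `build_train_argv` is a hypothetical helper; the flags are the ones already used in the command above.

```python
import sys

def build_train_argv(script, data_dir, output_dir,
                     image_col="image", caption_col="text", max_steps=300):
    """Build the trainer command as a list suitable for subprocess.run."""
    return [
        sys.executable, script,
        "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
        f"--train_data_dir={data_dir}",
        f"--image_column={image_col}",
        f"--caption_column={caption_col}",
        "--resolution=512",
        "--train_batch_size=1",
        "--learning_rate=1e-4",
        "--mixed_precision=fp16",
        f"--max_train_steps={max_steps}",
        "--seed=42",
        f"--output_dir={output_dir}",
    ]

argv = build_train_argv(
    "diffusers/examples/text_to_image/train_text_to_image_lora.py",
    "/tmp/my data/train_images",   # a path with a space stays one argument
    "/tmp/lora_out",
)
print(argv[3])
```

The list can be passed to `subprocess.run(argv, check=True)`, so a non-zero exit raises instead of being reported only as a return code.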


In [None]:
!ls -lh /content/drive/MyDrive/hear2see/lora_out/pytorch_lora_weights.safetensors


-rw------- 1 root root 3.1M Oct 28 13:38 /content/drive/MyDrive/hear2see/lora_out/pytorch_lora_weights.safetensors
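
The adapter file is only a few megabytes because LoRA stores two low-rank matrices per adapted layer rather than a full weight update. The arithmetic can be sketched as follows; the layer size and rank below are illustrative assumptions, not values read from this checkpoint.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters in the dense weight matrix that the LoRA pair adapts."""
    return d_in * d_out

d_in = d_out = 768   # illustrative layer width
rank = 8             # illustrative rank

print(lora_param_count(d_in, d_out, rank))  # 12288
print(full_param_count(d_in, d_out))        # 589824
```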


In [None]:
from diffusers import StableDiffusionPipeline
import torch, os

BASE = "/content/drive/MyDrive/hear2see"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if DEVICE=="cuda" else torch.float32
).to(DEVICE)

pipe.load_lora_weights(f"{BASE}/lora_out", weight_name="pytorch_lora_weights.safetensors")
img = pipe("a bird flying across the horizon at sunset, cinematic lighting", num_inference_steps=28).images[0]
img.save(os.path.join(BASE, "test_output.png"))
print("✅ Saved:", os.path.join(BASE, "test_output.png"))


✅ Saved: /content/drive/MyDrive/hear2see/test_output.png


In [None]:
import os, tempfile, shutil, subprocess, torch
from diffusers import StableDiffusionPipeline

BASE      = "/content/drive/MyDrive/hear2see"
OUT_DIR   = os.path.join(BASE, "output")
VIDEO_OUT = os.path.join(OUT_DIR, "demo_clip_lora.mp4")
PROMPT    = "a bird flying across the horizon at sunset, golden sky reflecting on water, cinematic, high detail"
NEG       = "text, watermark, logo, lowres, blurry, artifacts"
FRAMES, FPS, STEPS, SEED0 = 24, 12, 25, 1234

os.makedirs(OUT_DIR, exist_ok=True)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if DEVICE=="cuda" else torch.float32
).to(DEVICE)
pipe.load_lora_weights(f"{BASE}/lora_out", weight_name="pytorch_lora_weights.safetensors")

tmp = tempfile.mkdtemp(prefix="sd_frames_")
try:
    for i in range(FRAMES):
        gen = torch.Generator(device=DEVICE).manual_seed(SEED0 + i)
        img = pipe(PROMPT, negative_prompt=NEG, guidance_scale=7.0,
                   num_inference_steps=STEPS, generator=gen).images[0]
        img.save(os.path.join(tmp, f"frame_{i:03d}.png"))

    subprocess.run([
        "ffmpeg","-y","-framerate",str(FPS),"-i",os.path.join(tmp,"frame_%03d.png"),
        "-c:v","libx264","-pix_fmt","yuv420p","-crf","18", VIDEO_OUT
    ], check=True)
    print("üé¨ Video saved:", VIDEO_OUT)
finally:
    shutil.rmtree(tmp, ignore_errors=True)


🎬 Video saved: /content/drive/MyDrive/hear2see/output/demo_clip_lora.mp4
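
The clip length follows directly from the frame count and frame rate used in the cell above. A short sketch of that arithmetic together with the same ffmpeg argument list; both function names are hypothetical.

```python
import os

def clip_duration_seconds(frames, fps):
    """Length of a clip made from `frames` stills played at `fps`."""
    return frames / fps

def ffmpeg_stitch_cmd(frame_dir, out_mp4, fps, pattern="frame_%03d.png"):
    """Argument list mirroring the ffmpeg call used in the cell above."""
    return [
        "ffmpeg", "-y", "-framerate", str(fps),
        "-i", os.path.join(frame_dir, pattern),
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "18", out_mp4,
    ]

print(clip_duration_seconds(24, 12))  # 2.0
print(ffmpeg_stitch_cmd("/tmp/frames", "demo.mp4", fps=12))
```

Since every frame is sampled independently with its own seed, the 2-second result is a sequence of unrelated stills rather than continuous motion.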


In [None]:
!rm -f /content/drive/MyDrive/hear2see/lora_out/write_test_*.txt


In [None]:
%%writefile lora_loader.py
from diffusers import StableDiffusionPipeline
import torch, os

def load_sd15_with_lora(base="/content/drive/MyDrive/hear2see", device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16 if device == "cuda" else torch.float32
    ).to(device)

    lora_dir = f"{base}/lora_out"
    # Try common filenames
    for name in ["pytorch_lora_weights.safetensors", "adapter_model.safetensors"]:
        if os.path.exists(os.path.join(lora_dir, name)):
            pipe.load_lora_weights(lora_dir, weight_name=name)
            print("✅ Loaded LoRA:", name)
            break
    else:
        pipe.load_lora_weights(lora_dir)  # folder style
        print("✅ Loaded LoRA from folder")

    return pipe


Writing lora_loader.py
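
The filename lookup in `lora_loader.py` can be exercised without loading the model. The sketch below isolates that selection step in a hypothetical `pick_weight_file` helper and tries it on a temporary directory.

```python
import os
import tempfile

CANDIDATES = ("pytorch_lora_weights.safetensors", "adapter_model.safetensors")

def pick_weight_file(lora_dir, candidates=CANDIDATES):
    """Return the first candidate filename present in lora_dir, or None."""
    for name in candidates:
        if os.path.exists(os.path.join(lora_dir, name)):
            return name
    return None

with tempfile.TemporaryDirectory() as d:
    print(pick_weight_file(d))  # None
    # Create an empty stand-in file to simulate a saved adapter
    open(os.path.join(d, "adapter_model.safetensors"), "wb").close()
    print(pick_weight_file(d))  # adapter_model.safetensors
```

A `None` result corresponds to the `else` branch of the loader, where the whole folder is handed to `load_lora_weights`.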


In [None]:
%%writefile video_gen.py
import os, tempfile, shutil, subprocess, torch

def render_clip(pipe, prompt, out_mp4,
                steps=28, frames=24, fps=12, cfg=7.0,
                neg="text, watermark, logo, lowres, blurry, artifacts",
                seed=1234):
    # dirname is '' for a bare filename, so fall back to the current directory
    os.makedirs(os.path.dirname(out_mp4) or ".", exist_ok=True)
    tmp = tempfile.mkdtemp(prefix="sd_frames_")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    try:
        for i in range(frames):
            gen = torch.Generator(device=device).manual_seed(seed + i)
            img = pipe(prompt, negative_prompt=neg,
                       guidance_scale=cfg,
                       num_inference_steps=steps,
                       generator=gen).images[0]
            img.save(os.path.join(tmp, f"frame_{i:03d}.png"))

        subprocess.run([
            "ffmpeg","-y","-framerate",str(fps),
            "-i", os.path.join(tmp,"frame_%03d.png"),
            "-c:v","libx264","-pix_fmt","yuv420p","-crf","18", out_mp4
        ], check=True)
        print("🎬 Saved video:", out_mp4)
    finally:
        shutil.rmtree(tmp, ignore_errors=True)


Writing video_gen.py
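
`render_clip` saves whatever image the pipeline returns. When the Stable Diffusion safety checker flags a frame, the pipeline returns a black image, and that frame ends up in the video. One way to handle it is to retry the frame with a different seed. The sketch below is an illustration only: `generate_with_retry`, the `generate` callable, and `seed_stride` are hypothetical names, and the retry policy is an assumption rather than anything the notebook does.

```python
def generate_with_retry(generate, seed, max_tries=3, seed_stride=100_003):
    """Call generate(seed) until it returns an unflagged image.

    `generate` is a hypothetical callable returning (image, flagged).
    Returns (image, seed_used, flagged); flagged stays True if every try was flagged.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        # A large stride keeps retry seeds away from the seed+i values
        # that render_clip assigns to neighbouring frames.
        current = seed + attempt * seed_stride
        image, flagged = generate(current)
        if not flagged:
            break
    return image, current, flagged

# Possible wiring inside the frame loop (assumes the safety checker is enabled,
# so the pipeline output carries a per-image `nsfw_content_detected` list):
#   def generate(s):
#       gen = torch.Generator(device=device).manual_seed(s)
#       out = pipe(prompt, negative_prompt=neg, guidance_scale=cfg,
#                  num_inference_steps=steps, generator=gen)
#       flags = out.nsfw_content_detected or [False]
#       return out.images[0], bool(flags[0])
#   img, used_seed, flagged = generate_with_retry(generate, seed + i)
```

If every attempt is flagged, the last (black) image is still returned so the frame count stays intact. The caller can check `flagged` and decide whether to skip the frame or change the prompt.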


In [None]:
from lora_loader import load_sd15_with_lora
from video_gen import render_clip

BASE = "/content/drive/MyDrive/hear2see"
pipe = load_sd15_with_lora(BASE)

prompt = "a bird flying across the horizon at sunset, golden sky reflecting on water, cinematic, high detail"
render_clip(pipe, prompt, f"{BASE}/output/demo_clip_lora.mp4")


✅ Loaded LoRA: pytorch_lora_weights.safetensors

🎬 Saved video: /content/drive/MyDrive/hear2see/output/demo_clip_lora.mp4


In [None]:
import importlib, lora_loader, video_gen
importlib.reload(lora_loader)
importlib.reload(video_gen)


<module 'video_gen' from '/content/video_gen.py'>

In [None]:
from lora_loader import load_sd15_with_lora
from video_gen import render_clip

BASE = "/content/drive/MyDrive/hear2see"

# 1) Load SD1.5 + your trained LoRA
pipe = load_sd15_with_lora(BASE)

# 2) Generate a short clip and save to /hear2see/output
prompt = "a bird flying across the horizon at sunset, golden sky reflecting on water, cinematic, high detail"
render_clip(pipe, prompt, f"{BASE}/output/demo_clip_lora.mp4")


✅ Loaded LoRA: pytorch_lora_weights.safetensors

🎬 Saved video: /content/drive/MyDrive/hear2see/output/demo_clip_lora.mp4


In [None]:
prompt = "a close-up of a glowing butterfly emerging from a crystal forest, detailed and ethereal"
render_clip(pipe, prompt, f"{BASE}/output/butterfly_crystal_lora.mp4", steps=30, frames=28, fps=15, cfg=7.5)


Potential NSFW content was detected in one or more images. A black image will be returned instead. Try again with a different prompt and/or seed.

🎬 Saved video: /content/drive/MyDrive/hear2see/output/butterfly_crystal_lora.mp4


In [None]:
prompt = "a fantasy dragon flying over mountains, golden lighting, cinematic tone"
render_clip(pipe, prompt, f"{BASE}/output/dragon_lora.mp4", steps=30, frames=32)


🎬 Saved video: /content/drive/MyDrive/hear2see/output/dragon_lora.mp4


In [None]:
# Fuse LoRA into a single fine-tuned SD1.5 model
pipe.fuse_lora()

# 🔹 Give your fine-tuned model any name you want:
save_path = "/content/drive/MyDrive/hear2see/my_custom_sd_model"

pipe.save_pretrained(save_path)
print(f"✅ Fused model saved to: {save_path}")


✅ Fused model saved to: /content/drive/MyDrive/hear2see/my_custom_sd_model


In [None]:
# ==== Fuse LoRA into a full SD1.5 pipeline and save ====
import os, shutil
from pathlib import Path
import torch
import diffusers
from diffusers import StableDiffusionPipeline

BASE = "/content/drive/MyDrive/hear2see"
LORA_DIR = f"{BASE}/lora_out"
SAVE_DIR = f"{BASE}/my_custom_sd_model"   # <- name this anything you want

print("diffusers version:", diffusers.__version__)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype  = torch.float16 if device == "cuda" else torch.float32

# 1) Load base SD1.5
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=dtype
).to(device)

# 2) Load your LoRA (try common filenames first)
loaded = False
for name in ["pytorch_lora_weights.safetensors", "adapter_model.safetensors"]:
    if os.path.exists(os.path.join(LORA_DIR, name)):
        pipe.load_lora_weights(LORA_DIR, weight_name=name)
        print("✅ Loaded LoRA file:", name)
        loaded = True
        break
if not loaded:
    # folder-style adapters
    pipe.load_lora_weights(LORA_DIR)
    print("✅ Loaded LoRA from folder (no single file)")

# 3) Fuse LoRA into UNet/Text Encoder (depending on adapters)
fused = False
if hasattr(pipe, "fuse_lora"):
    try:
        pipe.fuse_lora()
        fused = True
        print("✅ LoRA fused via pipe.fuse_lora()")
    except Exception as e:
        print("⚠️ fuse_lora() failed:", e)

if not fused:
    # Last-resort fallback: merge_lora_weights() is not guaranteed to exist on the UNet
    # in a given diffusers build; if the call fails we stop with a clear error below.
    try:
        pipe.unet = pipe.unet.merge_lora_weights()
        fused = True
        print("✅ LoRA fused via unet.merge_lora_weights()")
    except Exception as e2:
        raise RuntimeError(
            "LoRA fusion not available in your diffusers build. "
            "Upgrade to diffusers>=0.30.3 and rerun."
        ) from e2

# Optional: unload adapter references after fusion (keeps only merged weights in memory)
if hasattr(pipe, "unload_lora_weights"):
    try:
        pipe.unload_lora_weights()
    except Exception:
        pass  # not critical

# 4) Save the full pipeline (this must write unet/..., vae/..., text_encoder/..., etc.)
if os.path.exists(SAVE_DIR):
    shutil.rmtree(SAVE_DIR)
pipe.save_pretrained(SAVE_DIR, safe_serialization=True)
print("💾 Saved fused model to:", SAVE_DIR)

# 5) Verify files exist and show UNet weights size
def list_weights(root):
    root = Path(root)
    found = []
    for p in root.rglob("*"):
        if p.suffix in [".safetensors", ".bin"]:
            found.append(p)
    return found

weights = list_weights(SAVE_DIR)
print("\n📦 Saved weight files:")
for w in weights:
    print(" -", w.relative_to(SAVE_DIR), f"({w.stat().st_size/1e6:.1f} MB)")

UNET_PATH_1 = Path(SAVE_DIR) / "unet" / "diffusion_pytorch_model.safetensors"
UNET_PATH_2 = Path(SAVE_DIR) / "unet" / "diffusion_pytorch_model.bin"

if UNET_PATH_1.exists() or UNET_PATH_2.exists():
    upath = UNET_PATH_1 if UNET_PATH_1.exists() else UNET_PATH_2
    sz_mb = upath.stat().st_size / 1e6
    # The SD1.5 UNet has roughly 860M parameters: ~1.7 GB in fp16, ~3.4 GB in fp32
    print(f"\n✅ UNet found: {upath.name} ~ {sz_mb:.1f} MB (expected ~1700 MB in fp16, ~3400 MB in fp32 for SD1.5)")
else:
    raise RuntimeError(
        "❌ UNet file not found in the saved model. Fusion/save did not complete. "
        "Make sure fuse worked and that you called pipe.save_pretrained(SAVE_DIR)."
    )


diffusers version: 0.30.3


✅ Loaded LoRA file: pytorch_lora_weights.safetensors
✅ LoRA fused via pipe.fuse_lora()
💾 Saved fused model to: /content/drive/MyDrive/hear2see/my_custom_sd_model

📦 Saved weight files:
 - vae/diffusion_pytorch_model.safetensors (167.3 MB)
 - text_encoder/model.safetensors (246.1 MB)
 - unet/diffusion_pytorch_model.safetensors (1719.1 MB)
 - safety_checker/model.safetensors (608.0 MB)

✅ UNet found: diffusion_pytorch_model.safetensors ~ 1719.1 MB (expected ~1700 MB in fp16, ~3400 MB in fp32 for SD1.5)
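
The 1719.1 MB reported for the UNet is what you would expect from fp16 storage. A weight file is roughly the parameter count times the bytes per parameter, plus a small header. The sketch below does that arithmetic. `expected_weight_mb` is a hypothetical helper, and the parameter count is an approximation (about 860M for the SD1.5 UNet), not a value read from the saved model.

```python
# Bytes used per parameter for common storage dtypes
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "float32": 4}

def expected_weight_mb(n_params, dtype="float16"):
    """Approximate on-disk size in MB (1 MB = 1e6 bytes), ignoring file headers."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6

# Assumption: the SD1.5 UNet has roughly 860 million parameters.
SD15_UNET_PARAMS = 860_000_000

print(expected_weight_mb(SD15_UNET_PARAMS, "float16"))  # 1720.0
print(expected_weight_mb(SD15_UNET_PARAMS, "float32"))  # 3440.0

# To use the exact count from a loaded pipeline instead of the approximation:
#   n = sum(p.numel() for p in pipe.unet.parameters())
```

A file far below this estimate (for example a few MB) would suggest that only adapter weights were written rather than the fused UNet.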


In [None]:
from diffusers import StableDiffusionPipeline
import torch

BASE = "/content/drive/MyDrive/hear2see"
SAVE_DIR = f"{BASE}/my_custom_sd_model"

pipe2 = StableDiffusionPipeline.from_pretrained(
    SAVE_DIR,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    low_cpu_mem_usage=False,  # disable smart loading to avoid masking missing keys
    device_map=None
).to("cuda" if torch.cuda.is_available() else "cpu")

img = pipe2("a cinematic ocean sunset, high detail", num_inference_steps=28, guidance_scale=7.0).images[0]
img.save(f"{BASE}/output/fused_check.png")
print("✅ Works. Saved:", f"{BASE}/output/fused_check.png")


✅ Works. Saved: /content/drive/MyDrive/hear2see/output/fused_check.png


In [None]:
import shutil

old = "/content/drive/MyDrive/hear2see/hear2see_finetuned_v1"
new = "/content/drive/MyDrive/hear2see/hear2see_v1"

shutil.move(old, new)
print("✅ Model folder renamed safely to:", new)


✅ Model folder renamed safely to: /content/drive/MyDrive/hear2see/hear2see_v1
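
One caveat with `shutil.move`: if the destination is an existing directory, the source is moved inside it rather than replacing it, so rerunning a rename cell can nest folders unexpectedly. A guarded version might look like the sketch below. `rename_dir` is a hypothetical helper, and refusing to overwrite is an assumed policy.

```python
import os
import shutil

def rename_dir(old, new):
    """Rename directory `old` to `new`, refusing to touch an existing destination."""
    if not os.path.isdir(old):
        raise FileNotFoundError(f"source folder not found: {old}")
    if os.path.exists(new):
        # shutil.move would otherwise place `old` inside `new`
        raise FileExistsError(f"destination already exists: {new}")
    return shutil.move(old, new)

# Usage with the paths from the cell above:
#   rename_dir("/content/drive/MyDrive/hear2see/hear2see_finetuned_v1",
#              "/content/drive/MyDrive/hear2see/hear2see_v1")
```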


In [None]:
BASE = "/content/drive/MyDrive/hear2see"
pipe = load_sd15_with_lora(BASE)   # already imported earlier in this session


✅ Loaded LoRA: pytorch_lora_weights.safetensors


In [None]:
import json

notebook_code = {
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {"kernelspec": {"display_name": "Python 3", "name": "python3"}},
  "cells": [
    {"cell_type": "markdown", "source": [
      "# Hear2See - Inference + UI (Fused SD1.5 Model)\n",
      "Follow the steps below to load your fine-tuned model, generate images/videos, and launch a Gradio UI."
    ]},
    {"cell_type": "code", "source": [
      "!pip -q install diffusers==0.30.3 transformers==4.44.2 safetensors==0.6.2 gradio==4.44.1"
    ]},
    {"cell_type": "code", "source": [
      "from google.colab import drive\n",
      "drive.mount('/content/drive')\n",
      "BASE = '/content/drive/MyDrive/hear2see'\n",
      "MODEL_DIR = f'{BASE}/hear2see_v1'\n",
      "OUT_DIR = f'{BASE}/output'\n",
      "print('BASE:', BASE, '\\nMODEL_DIR:', MODEL_DIR, '\\nOUT_DIR:', OUT_DIR)"
    ]},
    {"cell_type": "code", "source": [
      "import torch, os\n",
      "from diffusers import StableDiffusionPipeline\n",
      "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
      "dtype = torch.float16 if device=='cuda' else torch.float32\n",
      "pipe = StableDiffusionPipeline.from_pretrained(MODEL_DIR, torch_dtype=dtype).to(device)\n",
      "os.makedirs(OUT_DIR, exist_ok=True)\n",
      "print('✅ Model loaded on', device)"
    ]},
    {"cell_type": "code", "source": [
      "prompt = 'a cinematic close-up of a butterfly landing on a flower, warm rim light, shallow depth of field'\n",
      "img = pipe(prompt, num_inference_steps=28, guidance_scale=7).images[0]\n",
      "img.save(f'{OUT_DIR}/test_img.png')\n",
      "print('✅ Saved test image.')"
    ]},
    {"cell_type": "code", "source": [
      "import tempfile, shutil, subprocess\n",
      "def render_clip(pipe, prompt, out_mp4, steps=28, frames=24, fps=12, cfg=7, seed=1234):\n",
      "  tmp = tempfile.mkdtemp()\n",
      "  try:\n",
      "    for i in range(frames):\n",
      "      gen = torch.Generator(device=pipe.device.type).manual_seed(seed+i)\n",
      "      img = pipe(prompt, guidance_scale=cfg, num_inference_steps=steps, generator=gen).images[0]\n",
      "      img.save(os.path.join(tmp, f'frame_{i:03d}.png'))\n",
      "    subprocess.run(['ffmpeg','-y','-framerate',str(fps),'-i',f'{tmp}/frame_%03d.png','-c:v','libx264','-pix_fmt','yuv420p',out_mp4], check=True)\n",
      "    return out_mp4\n",
      "  finally:\n",
      "    shutil.rmtree(tmp)"
    ]},
    {"cell_type": "code", "source": [
      "vid = f'{OUT_DIR}/demo_clip.mp4'\n",
      "render_clip(pipe, 'a bird flying across the horizon at sunset', vid)\n",
      "print('🎬 Saved video to:', vid)"
    ]},
    {"cell_type": "code", "source": [
      "import gradio as gr\n",
      "def gen_video_ui(prompt):\n",
      "  path = render_clip(pipe, prompt, f'{OUT_DIR}/gradio_clip.mp4')\n",
      "  return path\n",
      "gr.Interface(fn=gen_video_ui, inputs='text', outputs='video').launch(share=True)"
    ]}
  ]
}

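# nbformat 4 expects "metadata" on every cell, "outputs" and "execution_count"
# on code cells, and (from minor version 5) a unique "id" per cell
for i, cell in enumerate(notebook_code["cells"]):
    cell.setdefault("metadata", {})
    cell.setdefault("id", f"cell-{i}")
    if cell["cell_type"] == "code":
        cell.setdefault("outputs", [])
        cell.setdefault("execution_count", None)

with open('/content/Hear2See_Inference_UI.ipynb', 'w') as f:
    json.dump(notebook_code, f, indent=1)

print('✅ Notebook created: /content/Hear2See_Inference_UI.ipynb')


✅ Notebook created: /content/Hear2See_Inference_UI.ipynb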
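
Because the notebook is assembled by hand as a dict, a quick structural check before downloading can catch missing fields that stricter viewers reject. The sketch below covers only the required keys as I understand the nbformat 4 schema (including per-cell `id` from minor version 5). `notebook_problems` is a hypothetical helper and not a substitute for full schema validation.

```python
import json

def notebook_problems(nb):
    """Return a list of human-readable problems found in a notebook dict."""
    problems = []
    for key in ("nbformat", "nbformat_minor", "metadata", "cells"):
        if key not in nb:
            problems.append(f"missing top-level key: {key}")
    for i, cell in enumerate(nb.get("cells", [])):
        required = ["cell_type", "metadata", "source"]
        if cell.get("cell_type") == "code":
            required += ["outputs", "execution_count"]
        # Assumption: cell ids are required starting with nbformat 4.5
        if nb.get("nbformat_minor", 0) >= 5:
            required.append("id")
        for key in required:
            if key not in cell:
                problems.append(f"cell {i}: missing {key}")
    return problems

# Usage against the file written above:
#   with open("/content/Hear2See_Inference_UI.ipynb") as f:
#       print(notebook_problems(json.load(f)) or "no structural problems found")
```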


In [None]:
from google.colab import files
files.download("/content/Hear2See_Inference_UI.ipynb")

