# **MUSIA** 2025: Baseline Story-to-Image Generation & Submission Guide

**Author**: MUSIA 2025 Organizing Committee

**Purpose**: This notebook provides a baseline pipeline for the MUSIA 2025 Shared Task. It generates synthetic images from multilingual stories by summarizing each story segment and passing the summary to a text-to-image diffusion model.

This notebook is designed to:

* Load the mapping file that specifies how many images to generate for each story.
* Read and process the test stories from the MUSIA testing corpus.
* Use a fine-tuned transformer-based summarizer (trained on MUSIA training stories) to create concise textual summaries.
* Pass these summaries, together with a fixed style prompt, to a Stable Diffusion XL pipeline to generate the images.
* Save the generated outputs in the official submission format.
* Present examples of generated images along with human evaluation results for qualitative assessment.


## Environment Setup and Imports

Installs required libraries including `transformers` for summarization and `diffusers` for image generation. Then imports all necessary packages for file handling, modeling, and image output.


In [None]:
!pip install -q transformers
!pip install -U diffusers
import os
import json
import math
import torch
from transformers import pipeline
from diffusers import DiffusionPipeline
from PIL import Image



## Load Summarization Pipeline

Initializes a fine-tuned summarization model specifically trained on MUSIA training stories. This model will be used to convert each full-length story into a concise summary that serves as the input prompt for the image generation model.


In [None]:
device = 0 if torch.cuda.is_available() else -1
sum2 = pipeline("summarization", model="", device=device)  # set `model` to the summarizer trained on MUSIA stories



## Load Stable Diffusion XL Model

Initializes the Stable Diffusion XL base model using the `diffusers` library. This model generates images from the summary prompts. Attention slicing is enabled to reduce peak memory usage, at some cost in speed.


In [None]:
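sd_device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    # Half precision is intended for GPU; fall back to float32 on CPU
    torch_dtype=torch.float16 if sd_device == "cuda" else torch.float32,
    use_safetensors=True
).to(sd_device)
pipe.enable_attention_slicing()  # lower peak memory, slightly slower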


## Load Story-to-Frame Count Mapping

Loads the mapping file that contains the number of images to be generated for each story. This mapping ensures that output image counts match the requirements of the MUSIA shared task.


In [None]:
with open("/content/drive/MyDrive/MUSIA-2025/EN_story_image_counts.json", "r") as f:
    story_frame_map = json.load(f)

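The generation loop later in the notebook iterates over `story_frame_map.items()`, so the file is presumably a flat JSON object that maps story IDs to image counts. A minimal sketch of that shape with a sanity check follows. The counts shown and the helper name `validate_frame_map` are illustrative assumptions and are not taken from the official file.

```python
import json

# Hypothetical excerpt: the real EN_story_image_counts.json may contain
# different IDs and counts.
sample_json = '{"eng_story_0006": 4, "eng_story_0011": 3}'

def validate_frame_map(frame_map):
    """Return the story IDs whose image count is not a positive integer."""
    return [
        story_id
        for story_id, count in frame_map.items()
        if not (isinstance(count, int) and not isinstance(count, bool) and count > 0)
    ]

story_frame_map_example = json.loads(sample_json)
bad_ids = validate_frame_map(story_frame_map_example)
print(bad_ids)  # an empty list means every count is usable
```

Running a check like this right after loading makes a malformed entry visible before any GPU time is spent.
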
## Define Story Directory Path

Specifies the root directory where all extracted testing story `.txt` files are stored. This path is used to access and read individual stories during the generation loop.


In [None]:
story_root = "/content/Testing_Stories/English_Testing/Testing_Stories/"

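A missing story file is only reported once the generation loop reaches it, so it can help to check up front which mapped stories are present under `story_root`. The sketch below assumes, as the generation loop does, that each story is stored as `<story_id>.txt`. `find_missing_stories` is a hypothetical helper name, and the demo uses a temporary directory in place of the real corpus.

```python
import os
import tempfile

def find_missing_stories(story_ids, root):
    """Return the IDs that have no '<story_id>.txt' file under root."""
    return [
        story_id
        for story_id in story_ids
        if not os.path.isfile(os.path.join(root, f"{story_id}.txt"))
    ]

# Demo on a throwaway directory with one story present and one absent.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "eng_story_0006.txt"), "w", encoding="utf-8") as f:
        f.write("Once upon a time.")
    missing = find_missing_stories(["eng_story_0006", "eng_story_0011"], demo_root)

print(missing)
```

In the notebook, the call would presumably be `find_missing_stories(story_frame_map, story_root)`.
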
## Define Fixed Visual Style Prompt

Specifies a fixed visual style prompt that is prepended to each summary before it is passed to the image generation model, so that all generated images share a unified, coherent visual style.


In [None]:
fixed = "Cartoon style, warm color palette, soft shading, hand-drawn texture, same characters, consistent clothing and face, wide frame."

# Format of `fixed`: [Style], [Color], [Shading], [Texture], [Character Consistency], [Framing]

# Style
# Cartoon style, Anime style, Digital painting, 3D render, Watercolor illustration, Oil painting, Comic book style, Fantasy art,
# Pixel art, Sketch art, Realistic style, Flat illustration, Low-poly style, Chibi style, Papercut style, Cel-shaded style, Line art style, Ink wash painting

# Color
# Studio Ghibli colors, Pastel colors, Vibrant colors, Muted tones, Warm color palette, Cool tones, Neon lights, Earth tones, Duotone scheme, Retro color scheme,
# Desaturated tones, High contrast colors, Cinematic color grading, Sepia tone, Monochrome palette, Rainbow gradient

# Shading
# Soft shading, Hard shading, Cel shading, Volumetric lighting, Ambient occlusion, Global illumination, Flat lighting, Soft lighting, Harsh shadows, Backlighting,
# Rim lighting, Subsurface scattering, Bounce lighting, Ray-traced lighting

# Texture
# Hand-drawn texture, Painted texture, Smooth texture, Sketch-like strokes, Grainy texture, Rough brush strokes, Inked outlines, Crayon texture, Chalk texture,
# Marker rendering, Pencil sketch texture, Watercolor wash, Canvas texture, Digital airbrush, Etching lines

# Character Consistency
# Same characters, Consistent clothing and face, Repeating character model, Fixed hairstyle and outfit, Identical facial features across frames, Character continuity,
# Preserve facial structure, Consistent outfit design, No change in appearance, Character template unchanged, Use same character across all frames, Maintain character identity, Uniform costume across scenes

# Framing / Composition
# Wide frame, Close-up, Medium shot, Portrait frame, Landscape frame, Bird’s eye view, Worm’s eye view, Over-the-shoulder view, Centered frame, Rule of thirds composition,
# Dynamic camera angle, Symmetrical framing, Diagonal composition, Isometric view, Cinematic framing, Panoramic shot

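The six-slot format described in the comments can also be assembled programmatically, which makes it easier to swap one option at a time. This is a minimal sketch: `STYLE_SLOTS` and `build_style_prompt` are hypothetical names introduced here, and the slot values reproduce the `fixed` string used above.

```python
# The six slots named in the comment block, in the same order.
STYLE_SLOTS = ("style", "color", "shading", "texture", "character_consistency", "framing")

def build_style_prompt(**choices):
    """Join one choice per slot into a single comma-separated style prompt."""
    missing = [slot for slot in STYLE_SLOTS if slot not in choices]
    if missing:
        raise ValueError(f"Missing style slots: {missing}")
    return ", ".join(choices[slot] for slot in STYLE_SLOTS) + "."

example_fixed = build_style_prompt(
    style="Cartoon style",
    color="warm color palette",
    shading="soft shading",
    texture="hand-drawn texture",
    character_consistency="same characters, consistent clothing and face",
    framing="wide frame",
)
print(example_fixed)
```

Requiring every slot keeps the prompt structure identical across experiments, so differences in output can be attributed to the option that changed.
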
## Create Output Directory for Generated Images

Creates a directory named `generated_images` (if it does not already exist). All images generated by the Stable Diffusion model will be saved to this folder for later review or submission formatting.


In [None]:
os.makedirs("generated_images", exist_ok=True)

## Generate Images from Summarized Story Segments

This block processes the test stories and generates images as follows:

* Limits processing to 2 stories for demonstration.
* Loads each story from the extracted `.txt` files based on the mapping file.
* Splits the story text into at most `num_frames` sentence-based chunks, one chunk per image.
* Summarizes each chunk using the fine-tuned MUSIA summarizer.
* Combines the fixed visual style prompt with the summary to form the final prompt for the diffusion model.
* Generates an image for each summarized chunk using Stable Diffusion XL.
* Saves each image as `<story_id>_<frame_number>.png` and records its prompt and file path in `final_prompts`.

All images are stored in the `generated_images` directory, and successful generations are logged to the console.

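One caveat with ceiling-based chunking, as used in the cell below, is that it can return fewer than `num_frames` chunks: 5 sentences with `num_frames = 4` gives a chunk size of 2 and therefore only 3 chunks. A minimal sketch of an alternative that always yields exactly `num_frames` contiguous chunks is shown here. `split_into_frames` is a hypothetical helper and is not part of the official baseline.

```python
def split_into_frames(sentences, num_frames):
    """Split sentences into num_frames contiguous, near-equal chunks."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    base, extra = divmod(len(sentences), num_frames)
    chunks, start = [], 0
    for k in range(num_frames):
        # The first `extra` chunks take one additional sentence.
        size = base + (1 if k < extra else 0)
        chunks.append(sentences[start:start + size])
        start += size
    return chunks

demo_sentences = ["s1", "s2", "s3", "s4", "s5"]
demo_chunks = split_into_frames(demo_sentences, 4)
print(demo_chunks)  # [['s1', 's2'], ['s3'], ['s4'], ['s5']]
```

If a story has fewer sentences than `num_frames`, the trailing chunks come back empty and would need separate handling, for example by reusing the previous chunk.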

In [None]:
final_prompts = []

# === Limit to 2 stories ===
processed_count = 0
max_stories = 2

for story_id, num_frames in story_frame_map.items():
    if processed_count >= max_stories:
        break

    story_path = os.path.join(story_root, f"{story_id}.txt")

    if not os.path.exists(story_path):
        print(f"[!] Story not found: {story_path}")
        continue

    with open(story_path, 'r', encoding='utf-8') as f:
        full_text = f.read().strip()

    # Break into N chunks
    sentences = full_text.split('. ')
    chunk_size = math.ceil(len(sentences) / num_frames)
    sub_stories = ['. '.join(sentences[i:i + chunk_size]).strip() for i in range(0, len(sentences), chunk_size)]

    for i, story in enumerate(sub_stories[:num_frames]):
        try:
            summary = sum2(story, max_length=50, min_length=35, do_sample=False)[0]['summary_text'].strip()
            prompt = f"{fixed} {summary}"  # `fixed` already ends with a period

            # Generate image
            image = pipe(prompt).images[0]
            filename = f"{story_id}_{i+1}.png"
            image_path = os.path.join("generated_images", filename)
            image.save(image_path)

            final_prompts.append({
                "story_id": story_id,
                "frame_number": i + 1,
                "prompt": prompt,
                "image_file": image_path
            })

            print(f"[✓] Generated: {filename}")

        except Exception as e:
            print(f"[X] Error in {story_id}_{i+1}: {e}")

    processed_count += 1


Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['and a tiger in the darkness awww a picture of a man in his dark suit atop a tree in the middle of the night']

[✓] Generated: eng_story_0006_1.png

[✓] Generated: eng_story_0006_2.png

[✓] Generated: eng_story_0006_3.png

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

[✓] Generated: eng_story_0006_4.png

[✓] Generated: eng_story_0011_1.png

[✓] Generated: eng_story_0011_2.png

[✓] Generated: eng_story_0011_3.png

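The `final_prompts` list built during generation exists only in memory, so the prompt metadata is lost when the Colab session ends. A minimal sketch of persisting it as JSON follows. The helper `save_prompts`, the file name `prompts.json`, and the demo prompt text are assumptions made for illustration; the file is not part of the official submission format.

```python
import json
import os
import tempfile

def save_prompts(entries, path):
    """Write the prompt metadata to a UTF-8 JSON file and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return path

# Demo entry with the same keys as the records appended to final_prompts.
demo_entries = [{
    "story_id": "eng_story_0006",
    "frame_number": 1,
    "prompt": "Cartoon style, wide frame. A tiger in the dark.",
    "image_file": os.path.join("generated_images", "eng_story_0006_1.png"),
}]

# Round-trip through a temporary file to confirm nothing is lost.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = save_prompts(demo_entries, os.path.join(tmp_dir, "prompts.json"))
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)
```

In the notebook, `save_prompts(final_prompts, "prompts.json")` would write the real records next to the image folder.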

## Visualize Generated Images with Prompts

This cell displays a small set of generated images from the test stories for manual inspection. For each story ID:

* The images are shown in frame order.
* The corresponding prompt (combined visual style + summary) is printed above the image.
* `IPython.display` is used for inline visualization in the notebook.

In [None]:
from IPython.display import Image as IPyImage, display

sample_story_ids = sorted({entry['story_id'] for entry in final_prompts})  # sorted for a stable display order

for story_id in sample_story_ids:
    entries = sorted(
        [entry for entry in final_prompts if entry['story_id'] == story_id],
        key=lambda x: x['frame_number']
    )

    for entry in entries:
        print(f"\n {entry['story_id']} — Frame {entry['frame_number']}")
        print(f"Prompt: {entry['prompt']}\n")
        display(IPyImage(filename=entry["image_file"]))


## Human Evaluation Results

This markdown cell reports human evaluation scores for selected generated samples. Each story is manually assessed based on three criteria:

- **Relevance**: How well the image reflects the content or summary of the story.
- **Visual Quality**: The aesthetic and technical quality of the image.
- **Consistency**: Whether generated frames maintain visual and narrative coherence across the story.

These evaluations help benchmark the baseline performance and guide improvements in prompt design or model choice.


**Human Evaluation**


---


**eng_story_0006**\
Relevance: Good\
Visual Quality: Excellent\
Consistency: Average


---

**eng_story_0011**\
Relevance: Terrible\
Visual Quality: Good\
Consistency: Terrible

## Archive and Download Generated Images

This cell creates a ZIP archive of all the generated images and triggers a download for submission or local storage.

- `shutil.make_archive(...)` compresses the `generated_images` directory into a `generated_images.zip` file.
- `google.colab.files.download(...)` enables direct download of the ZIP file from the Colab environment.

This is the final step in the baseline pipeline, preparing the generated outputs for evaluation or submission to MUSIA 2025.


In [None]:
import shutil
from google.colab import files

# Compress the generated_images folder into generated_images.zip
shutil.make_archive("generated_images", "zip", "generated_images")

# Trigger a browser download (Colab only)
files.download("generated_images.zip")

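Before submitting the archive, it may be worth confirming that every story has the number of images the mapping file asks for. The sketch below assumes the `<story_id>_<frame>.png` naming used by the generation loop. `count_images_per_story` and `find_incomplete_stories` are hypothetical helpers, and the demo runs on a temporary folder with empty placeholder files and made-up expected counts.

```python
import os
import re
import tempfile
from collections import Counter

FRAME_NAME = re.compile(r"^(?P<story_id>.+)_(?P<frame>\d+)\.png$")

def count_images_per_story(image_dir):
    """Count files named '<story_id>_<frame>.png' for each story ID."""
    counts = Counter()
    for name in os.listdir(image_dir):
        match = FRAME_NAME.match(name)
        if match:
            counts[match.group("story_id")] += 1
    return counts

def find_incomplete_stories(frame_map, image_dir):
    """Return {story_id: (expected, found)} for stories with a count mismatch."""
    counts = count_images_per_story(image_dir)
    return {
        story_id: (expected, counts.get(story_id, 0))
        for story_id, expected in frame_map.items()
        if counts.get(story_id, 0) != expected
    }

with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["eng_story_0006_1.png", "eng_story_0006_2.png", "eng_story_0011_1.png"]:
        open(os.path.join(demo_dir, name), "wb").close()
    # Hypothetical expected counts, for illustration only.
    incomplete = find_incomplete_stories({"eng_story_0006": 2, "eng_story_0011": 3}, demo_dir)

print(incomplete)
```

In the notebook, `find_incomplete_stories(story_frame_map, "generated_images")` would list every story whose image count differs from the mapping, including stories skipped by the 2-story demonstration limit.
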