In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
!pip install -q transformers accelerate pillow

# Hybrid AI Story Generator
Vision (BLIP) + Language (Mistral-7B)

# Problem Definition & Objective

## Problem Statement

Creative storytelling using AI often lacks contextual grounding when relying only on text or only on images. Traditional text-only models fail to capture visual emotions, while vision-only models cannot generate deep narratives.

## Objective

### To design a Hybrid AI Story Generator that

1. Understands visual input (images)
2. Interprets user intent from text
3. Generates coherent cinematic stories
4. Uses multi-modal intelligence combining Vision + Language models

## Real-World Motivation

### This system can be applied to:
1. AI storytelling & content creation
2. Game narrative generation
3. Creative writing assistants
4. AI-assisted filmmaking
5. Digital art and storytelling tools

## Selected Project Track

Hybrid AI System
(Computer Vision + Large Language Models)

# Data Understanding & Preparation

## Dataset Source
1. **Image Input:** User-provided images (real-world images)
2. **Text Input:** User-provided natural language prompts
3. **No fixed dataset** required (dynamic inference-based system)

## Data Processing

Images processed using BLIP Image Captioning
Text prompts parsed using keyword extraction

### Scene attributes inferred:

1. Mood
2. Weather
3. Time
4. Emotion
5. Narrative tone

## Preprocessing Steps

1. Image normalization using BLIP processor
2. Tokenization using Mistral tokenizer
3. Prompt structuring using instruction-tuned format
4. No missing data (real-time inputs)

# Model / System Design

## AI Techniques Used

**Computer Vision → BLIP (Vision-to-Text)**
**Large Language Model → Mistral-7B (4-bit quantized)**
**Prompt Engineering**
**Hybrid Pipeline Design**


## Architecture Overview

User Input  
↓  
Text Prompt  
↓  
Intent Analyzer  
↓  
Image → BLIP → Scene Understanding  
↓  
Context Builder  
↓  
Mistral-7B  
↓  
Cinematic Story Output



# Why This Design?
BLIP provides visual grounding
Mistral offers strong narrative generation
Quantization allows GPU-efficient execution
Modular design allows easy upgrades

# Core Implementation

## Key Components

1. BLIPProcessor → Image captioning
2. Mistral-7B Instruct → Story generation
3. BitsAndBytes → 4-bit quantization
### Custom pipeline for:
1. Intent detection
2. Context fusion
3. Prompt engineering
## Prompt Engineering Strategy
1. System instruction-based prompting
### Enforced story structure:
1. Conflict
2. Climax
3. Resolution
4. Style control (cinematic, poetic, emotional)
## Execution Flow
1. User enters text + optional image
2. Image captioned via BLIP
3. Context synthesized
4. Structured prompt created
5. Story generated via Mistral
6. Clean output returned

# Evaluation & Analysis
## Evaluation Type
1. Qualitative Evaluation
2. Human-centered evaluation
## Metrics Used
1. Narrative coherence
2. Visual relevance
3. Emotional consistency
4. Prompt adherence
Creativity score (manual)
## Sample Output
1. Emotion-aware storytelling
2. Scene continuity
3. Logical progression
Visual grounding from image
## Observed Limitations
1. No factual verification
2. Output varies with prompt clarity
3. Requires GPU for smooth execution

# Ethical Considerations & Responsible AI

## Bias & Fairness

1. Model may reflect biases from training data
2. Emotional interpretation may vary
3. No harmful intent filtering (can be added)

## Responsible Usage

1. No medical/legal decision-making
2. Intended for creative purposes only
3. User-controlled inputs prevent misuse

## Dataset Limitations

1. BLIP trained on generic image-caption datasets
2. Mistral trained on large-scale internet text

# Conclusion & Future Scope

## Summary

1. Successfully built a Hybrid Vision + Language AI
2. Achieved coherent, cinematic storytelling
3. Modular, scalable, and efficient design

## Future Improvements

1. Add speech-to-text input
2. Integrate diffusion-based image generation
3. Add emotion classifier
4. Support multi-character narratives
5. Deploy as a web app or API

In [3]:
# ============================================================
# HYBRID AI STORY GENERATOR (KAGGLE-READY)
# Vision (BLIP) + Language (Mistral-7B)
# Fixed dependency versions for stability
# ============================================================

import os
os.system("pip install -U transformers accelerate bitsandbytes 'pillow<11.0'")

import torch
from PIL import Image
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    BlipProcessor,
    BlipForConditionalGeneration
)

# Detect device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# -------------------------
# Load BLIP (VISION ONLY)
# -------------------------
print("Loading BLIP model...")
blip_processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base",
    torch_dtype=torch.float16
).to(device)

def blip_caption(image, prompt, max_len=80):
    if image is None:
        return "No image provided."
        
    inputs = blip_processor(
        image,
        text=prompt,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        output = blip_model.generate(
            **inputs,
            max_length=max_len,
            num_beams=3
        )

    return blip_processor.decode(output[0], skip_special_tokens=True)

# -------------------------
# Load Mistral-7B (4-bit)
# -------------------------
print("Loading Mistral-7B model...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    use_fast=True
)

# Load Model
mistral = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.float16
)

# -------------------------
# Interpret crude user input
# -------------------------
def interpret_user_input(text):
    words = text.lower().split()
    return {
        "raw_text": text, # Store original input for the prompt
        "tone": "melancholic" if any(w in words for w in ["sad", "lost", "alone", "empty", "rain"]) else "neutral",
        "time": "night" if "night" in words else "day",
        "weather": "rain" if "rain" in words else "clear",
        "style": "cinematic",
        "keywords": words
    }

# -------------------------
# Multi-perspective understanding
# -------------------------
def build_context(intent, image=None):
    if image is not None:
        return {
            "scene": blip_caption(image, f"a {intent['time']} scene with {intent['weather']}"),
            "emotion": blip_caption(image, f"describe the {intent['tone']} emotional atmosphere"),
            "character": blip_caption(image, "describe the main character and their situation"),
            "cinematic": blip_caption(image, f"a {intent['style']} storytelling description")
        }
    else:
        # Fallback: Use raw text directly if no image is present
        return {
            "scene": f"a setting based on: {intent['raw_text']}",
            "emotion": intent["tone"],
            "character": f"characters involved in: {intent['raw_text']}",
            "cinematic": f"a {intent['style']} interpretation of {intent['raw_text']}"
        }

# -------------------------
# Narrative skeleton
# -------------------------
def build_story_plan(vc, intent):
    return {
        "topic": intent["raw_text"],
        "setting": vc["scene"],
        "character": vc["character"],
        "mood": vc["emotion"],
        "style": intent["style"]
    }

# -------------------------
# Story generation (Mistral)
# -------------------------
def generate_story(plan):
    # Updated prompt to strictly follow the user's Core Concept
    prompt = f"""[INST]
Write a short {plan['style']} story based on the details below.

Core Concept (Most Important): {plan['topic']}
Visual Setting: {plan['setting']}
Atmosphere: {plan['mood']}

Instructions:
1. Your story MUST be about the 'Core Concept'.
2. Incorporate the 'Visual Setting' if it adds to the scene.
3. Develop a narrative arc (Conflict -> Climax -> Resolution) that fits specifically with the Core Concept.

Style:
Poetic, visual, engaging.
[/INST]"""

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = mistral.generate(
            **inputs,
            max_new_tokens=300,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    return tokenizer.decode(output[0], skip_special_tokens=True)

# ============================================================
# USER INPUT (Interactive)
# ============================================================

print("\n--- SETUP COMPLETE ---\n")

user_text = input(
    "Enter crude idea (example: 'sad boy rain lost night'): "
).strip()

image_path = input(
    "Enter image path (press Enter to skip image): "
).strip()

if not user_text:
    user_text = "lonely figure night rain"

image = None
if image_path:
    try:
        image = Image.open(image_path).convert("RGB")
        print("Image loaded successfully.")
    except Exception as e:
        print(f"Could not load image: {e}. Proceeding with text only.")
        image = None

# ============================================================
# PIPELINE EXECUTION
# ============================================================

print("Analyzing inputs...")
intent = interpret_user_input(user_text)
context = build_context(intent, image)
story_plan = build_story_plan(context, intent)

print("Generating story...")
final_story = generate_story(story_plan)

# ============================================================
# OUTPUT
# ============================================================

print("\n--- GENERATED STORY ---\n")
# Clean up the output to remove the prompt if Mistral includes it
if "[/INST]" in final_story:
    final_story = final_story.split("[/INST]")[-1].strip()

print(final_story)

Collecting transformers
  Downloading transformers-4.57.6-py3-none-any.whl.metadata (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.0/44.0 kB 1.5 MB/s eta 0:00:00
Collecting accelerate
  Downloading accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting pillow<11.0
  Downloading pillow-10.4.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Downloading transformers-4.57.6-py3-none-any.whl (12.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 107.0 MB/s eta 0:00:00
Downloading accelerate-1.12.0-py3-none-any.whl (380 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 380.9/380.9 kB 28.3 MB/s eta 0:00:00
Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.1/59.1 MB 33.3 MB/s eta 0:00:00
Downloading pillow-10.4.0-cp312-cp312-manylinux_2_28_x86_64.whl (4.5 MB)
   ━━━━━━━━━━━━━━━

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.26.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
dopamine-rl 4.1.2 requires gymnasium>=1.0.0, but you have gymnasium 0.29.0 which is incompatible.
gradio 5.49.1 requires pydantic<2.12,>=2.0, but you have pydantic 2.12.5 which is incompatible.
bigframes 2.26.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
fastai 2.8.4 requires fastcore<1.9,>=1.8.0, but you have fastcore 1.11.3 which is incompatible.


Successfully installed accelerate-1.12.0 bitsandbytes-0.49.1 pillow-10.4.0 transformers-4.57.6


2026-01-17 12:58:21.231713: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1768654701.457379      55 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1768654701.522314      55 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1768654702.074996      55 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768654702.075036      55 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768654702.075039      55 computation_placer.cc:177] computation placer alr

Using device: cuda
Loading BLIP model...


preprocessor_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/506 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Loading Mistral-7B model...


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]


--- SETUP COMPLETE ---



Enter crude idea (example: 'sad boy rain lost night'):  Future without A.I 
Enter image path (press Enter to skip image):  


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Analyzing inputs...
Generating story...

--- GENERATED STORY ---

In the waning twilight of the twenty-second century, the world stood still, a testament to the human spirit that refused to bend to the inexorable march of time. The sun dipped below the horizon, casting long shadows over a landscape devoid of the once ubiquitous hum of artificial intelligence.

The once towering metropolises now lay in ruins, their skeletal frames reaching for the heavens like the bony fingers of a forgotten god. The streets were empty, save for the occasional forlorn figure, their eyes glazed over with the weight of a thousand untold stories.

Amidst this desolate tableau, a young woman named Aria wandered, her heart heavy with the burden of a thousand unanswered questions. She clutched a tattered tome, its pages filled with the wisdom of the ancients, who had once harnessed the power of A.I to build a better world.

As she traversed the crumbling remnants of civilization, Aria could not help but feel 