MoviePosterAI

Python 3.10+ FastAPI PyTorch SwiftUI iOS 17+ License Docker

Give it a movie poster and it tells you everything. Genre, objects, faces, all the text on it, even the mood. Then it can generate entirely new posters from a text description, or take two posters and remix them into something new. The metadata endpoint feeds the image through CLIP and GPT-4 to produce a title, tagline, genre, and summary -- all from a single image upload.

This is a monorepo: a Python/FastAPI backend with real ML models (not wrappers around a single API), plus a native SwiftUI iOS app that talks to it. Upload a poster from your phone, get results in seconds.


Screenshots

Genre classification & metadata
OCR text extraction
Object & face detection
Poster generation from text

What It Can Do

Computer Vision

Genre Classification -- A Vision Transformer (ViT) fine-tuned on movie poster images. It resizes the poster to 224x224, normalises it, runs it through the transformer, and outputs softmax probabilities across five genres: Action, Comedy, Drama, Horror, and Romance. You get the top prediction plus the full probability distribution, so you can see when a poster sits between two genres.
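The last step of that pipeline -- turning the model head's raw scores into a probability distribution -- is just a softmax. A plain-Python sketch with made-up logits (a real run gets them from the fine-tuned ViT head):

```python
import math

GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits, not real model output
logits = [2.9, 0.1, 0.3, -0.8, -1.1]
probs = dict(zip(GENRES, softmax(logits)))
top = max(probs, key=probs.get)
print(top, round(probs[top], 3))
```

Because the full distribution is returned, a caller can spot a near-tie between two genres rather than trusting the top label blindly.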

Object Detection -- YOLOv8 (nano variant) scans the poster and returns bounding boxes for every object it finds -- people, cars, weapons, animals, text regions, whatever is in frame. Each detection comes with a label, confidence score, and pixel coordinates. Useful for understanding what's visually prominent on the poster.
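Each detection is a label/confidence/box triple, so a caller typically thresholds and ranks them. A small helper over that shape (the dict keys here are illustrative, matching the description above rather than the exact API):

```python
def filter_detections(detections, min_conf=0.5):
    """Keep detections at or above the confidence threshold,
    most-confident first."""
    kept = [d for d in detections if d["confidence"] >= min_conf]
    return sorted(kept, key=lambda d: d["confidence"], reverse=True)

# Illustrative output shape: labels, scores, pixel-coordinate boxes
detections = [
    {"label": "person", "confidence": 0.91, "box": [40, 10, 210, 480]},
    {"label": "car",    "confidence": 0.34, "box": [0, 300, 120, 420]},
    {"label": "person", "confidence": 0.77, "box": [230, 60, 380, 470]},
]
print(filter_detections(detections))
```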

Face Detection -- Uses dlib's HOG-based frontal face detector via the face_recognition library to find every face in the image, returning bounding boxes and 128-dimensional face encodings. OpenCV's Haar cascades act as a fallback, for reliability across different lighting and angles.
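Running two detectors on the same image means overlapping hits on the same face. One common merge strategy -- a sketch, not necessarily this repo's exact logic -- is IoU-based dedupe:

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def merge_boxes(primary, fallback, thresh=0.5):
    """Keep every primary (HOG) box, plus any fallback (Haar) box that
    doesn't substantially overlap one already kept."""
    merged = list(primary)
    for b in fallback:
        if all(iou(b, p) < thresh for p in merged):
            merged.append(b)
    return merged

faces_hog = [(10, 10, 50, 50)]
faces_haar = [(12, 12, 48, 48), (100, 100, 140, 140)]
print(merge_boxes(faces_hog, faces_haar))
```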

Text Understanding

OCR -- EasyOCR extracts every piece of text on the poster: the movie title, tagline, credits, release date, small print at the bottom. Supports multiple languages (English, Chinese simplified/traditional, and more). The reader is cached per language combo so subsequent requests are fast.
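The per-language-combo caching is a natural fit for functools.lru_cache, keyed on a tuple (lists aren't hashable). A sketch with a stand-in factory, since constructing a real easyocr.Reader is heavy:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_reader(languages: tuple):
    """One cached reader per language combination, e.g. ("en",) or
    ("en", "ch_sim"). A real implementation would return
    easyocr.Reader(list(languages)) here instead of a stand-in dict."""
    print(f"loading reader for {languages}")   # happens once per combo
    return {"langs": languages}

r1 = get_reader(("en",))
r2 = get_reader(("en",))    # cache hit -- no second load
assert r1 is r2
```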

Metadata Generation -- A two-step pipeline. First, CLIP processes the image and produces a visual description. Then that description is sent to GPT-4 with a system prompt that asks it to act as a movie metadata writer. Back comes a title, genre, tagline, mood, and a two-sentence summary. All from looking at the poster.
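The second step is an ordinary chat-completion call, so the interesting part is assembling the prompt from CLIP's description. A sketch of that assembly -- the system-prompt wording and field list here are illustrative, not the repo's exact text:

```python
def build_metadata_messages(clip_description: str):
    """Build the chat messages for the metadata step. The first message
    sets the 'movie metadata writer' role; the second carries CLIP's
    visual description of the poster."""
    system = (
        "You are a movie metadata writer. Given a visual description of a "
        "poster, reply with a title, genre, tagline, mood, and a "
        "two-sentence summary."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Poster description: {clip_description}"},
    ]

messages = build_metadata_messages("a lone astronaut on a red desert planet")
```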

Image Generation

Text-to-Poster -- Describe a movie that doesn't exist and Stable Diffusion v1.5 will create a poster for it. The pipeline runs on CUDA, Apple Silicon (MPS), or CPU depending on what's available. Generated images are saved to disk and served via a static URL.
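The device fallback order is CUDA, then MPS, then CPU, with half precision on accelerators and full precision on CPU (matching the models table below). Expressed as a pure function over availability flags so the logic is visible without torch installed:

```python
def pick_device(cuda_available: bool, mps_available: bool):
    """CUDA first, then Apple Silicon (MPS), then CPU.
    float16 on accelerators, float32 on CPU."""
    if cuda_available:
        return "cuda", "float16"
    if mps_available:
        return "mps", "float16"
    return "cpu", "float32"

# With torch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(pick_device(False, True))   # -> ('mps', 'float16')
```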

Poster Remixing -- The weirdest feature. Give it three posters: an anchor, one to add, and one to subtract. It computes CLIP embeddings for all three, does vector arithmetic (anchor + add - subtract), normalises the result, and generates a new image from that fused embedding. It's like "take the vibe of poster A, add the style of poster B, remove the feel of poster C."
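The arithmetic itself is simple: element-wise anchor + add - subtract, then L2-normalisation so the result lives on the unit sphere like a CLIP embedding. A pure-Python sketch with toy 3-dimensional vectors (the real embeddings are 512-dimensional):

```python
import math

def remix(anchor, add, subtract):
    """Fuse three embeddings: anchor + add - subtract, L2-normalised."""
    fused = [a + b - c for a, b, c in zip(anchor, add, subtract)]
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused]

v = remix([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
print(v)
```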

iOS App

A native SwiftUI app (iOS 17+) with:

  • Photo picker -- PHPicker integration for selecting posters from your library
  • Instant analysis -- One-tap upload to the backend with loading skeletons while it processes
  • Results view -- Metadata cards showing title, genre, year, director, cast, and description, plus OCR text with copy/share
  • Tag generator -- Enter a movie title and genre, get AI-generated tags displayed in a flow layout with colour-coded chips
  • Full-screen zoom -- Pinch-to-zoom poster viewer with magnification gesture handling
  • Accessibility -- Every interactive element has accessibilityLabel, accessibilityHint, and accessibilityIdentifier. VoiceOver-ready.
  • Network monitoring -- NWPathMonitor checks connectivity before requests; offline errors are caught early
  • Error handling -- Typed VisionaryError enum with retry buttons for every failure state

Auth & Feedback

JWT Authentication -- Register and login endpoints with bcrypt password hashing and HS256 JWT tokens. Token expiry is configurable. Built with python-jose and passlib.
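Under the hood, an HS256 JWT is just HMAC-SHA256 over the base64url-encoded header and payload. A stdlib-only sketch of what python-jose does when it signs a token (claim names illustrative; real code should use the library, which also verifies expiry):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: str) -> str:
    """Sign a token with HS256 -- the same scheme python-jose uses here."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256)
    return f"{header}.{body}.{b64url(sig.digest())}"

def decode_payload(token: str) -> dict:
    """Decode the payload segment (no verification -- illustration only)."""
    body = token.split(".")[1]
    return json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))

# Expiry is just a claim in the payload, which is why it's configurable
payload = {"sub": "akin", "exp": int(time.time()) + 3600}
token = make_jwt(payload, "dev-secret")
```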

Feedback Loop -- Users can submit corrections when the model gets something wrong. These corrections are saved to the database through the active learning endpoint, so the training data improves over time. The correction model stores the original prediction alongside the corrected value.


The ML Models

Model What It Does How It Works Notes
ViT (google/vit-base-patch16-224) Genre classification Vision Transformer pre-trained on ImageNet-21k, fine-tuned on poster images. Splits image into 16x16 patches and uses self-attention. 86M params, 224x224 input
CLIP (openai/clip-vit-base-patch32) Image embeddings & metadata prompts Contrastive model that maps images and text into the same vector space. Used to generate image descriptions for GPT-4. 150M params, 32-patch variant
Stable Diffusion v1.5 Poster generation & remixing Diffusion model that denoises random noise guided by text prompts. Runs with float16 on GPU/MPS, float32 on CPU. ~1B params, guidance_scale=8.5
YOLOv8n Object detection Single-shot detector that processes the whole image in one pass. Nano variant balances speed and accuracy. 3.2M params, real-time
EasyOCR Text extraction CRAFT text detector + CRNN recognition network. Readers cached by language for repeat requests. Supports 80+ languages
dlib + face_recognition Face detection & encoding HOG-based face detector + ResNet face encoder. Also uses Haar cascades as fallback. 128-dimensional face encodings
GPT-4 Metadata writing Takes CLIP's image description and generates structured movie metadata (title, genre, tagline, mood, summary). External API call

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        iOS App (SwiftUI)                        │
│  PhotoPicker → ViewModels → APIService → URLSession (async)     │
└─────────────────────────┬───────────────────────────────────────┘
                          │ HTTP/JSON
                          ▼
┌─────────────────────────────────────────────────────────────────┐
│                     FastAPI Backend                              │
│                                                                 │
│  ┌──────────┐   ┌───────────────────────────────────────────┐   │
│  │Middleware│   │            API Routes (/api/v1)           │   │
│  │ ─ CORS   │   │  classify, embed, metadata, generation,   │   │
│  │ ─ ReqID  │   │  remix, detect, ocr, faces, feedback,     │   │
│  └──────────┘   │  auth, active_learning                    │   │
│                 └──────────────────┬────────────────────────┘   │
│                                    │                            │
│                 ┌──────────────────▼────────────────────────┐   │
│                 │              Services                      │   │
│                 │  GenreClassifier    CLIPEmbedder           │   │
│                 │  StableDiffusion   YOLODetector            │   │
│                 │  OCRReader         FaceAnalyzer            │   │
│                 │  PosterMetadata    RemixEngine             │   │
│                 └──────────────────┬────────────────────────┘   │
│                                    │                            │
│                 ┌──────────────────▼────────────────────────┐   │
│                 │         ML Models & External APIs          │   │
│                 │  ViT, CLIP, SD 1.5, YOLOv8, EasyOCR,     │   │
│                 │  dlib, face_recognition, OpenAI GPT-4     │   │
│                 └──────────────────┬────────────────────────┘   │
│                                    │                            │
│                 ┌──────────────────▼────────────────────────┐   │
│                 │         Database (SQLite / SQLModel)       │   │
│                 │  Users, Feedback, Active Learning          │   │
│                 └───────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

The iOS app sends images as base64-encoded JPEG over JSON (or multipart for OCR). The backend validates the request through Pydantic models, routes it to the right service, and returns a structured { success, data } envelope. Models are lazy-loaded via lru_cache -- the first request loads them, subsequent requests are instant. A request ID middleware tags every request for tracing.
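The lazy-loading and envelope patterns look roughly like this (stand-in loader; the real dependency constructs the ViT pipeline and is wired into routes via FastAPI's Depends):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_classifier():
    """FastAPI dependency: the heavy model loads on the first call only;
    every later call returns the cached instance."""
    print("loading model...")        # printed exactly once per process
    return object()                  # stand-in for the real pipeline

def envelope(data):
    """The structured { success, data } wrapper every route returns."""
    return {"success": True, "data": data}

assert get_classifier() is get_classifier()   # second call is a cache hit
print(envelope({"label": "Action"}))
```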


Tech Stack

Layer Technology Why
Backend framework FastAPI Async, auto-generated OpenAPI docs, Pydantic validation out of the box
ML framework PyTorch + Transformers + Diffusers Industry standard for vision models; Hugging Face ecosystem makes loading pre-trained models trivial
Object detection Ultralytics YOLOv8 Fast, accurate single-shot detector; the nano variant runs quickly even on CPU
OCR EasyOCR Works without Tesseract, supports 80+ languages, good on stylised poster text
Face detection dlib + face_recognition Well-tested, returns both bounding boxes and 128-dim face encodings
Metadata AI OpenAI GPT-4 Best at structured creative writing from visual descriptions
Database SQLite + SQLModel Zero config, good enough for single-server use, SQLModel gives Pydantic + SQLAlchemy in one
Auth python-jose + passlib Standard JWT + bcrypt stack, nothing exotic
iOS framework SwiftUI Declarative UI, native async/await, modern iOS development
iOS networking URLSession No third-party deps needed; built-in async/await support since iOS 15
Package management uv Faster than pip/poetry, handles lockfiles and virtual envs
Linting Ruff Fast Python linter/formatter, replaces flake8 + isort
Testing pytest + pytest-asyncio Async test support for FastAPI endpoints
Containerisation Docker Reproducible builds with all system deps (cmake, dlib, opencv)

Getting Started

Backend

Prerequisites: Python 3.10+, uv

# 1. Clone the repo
git clone https://github.com/AkinCodes/MoviePosterAI.git
cd MoviePosterAI

# 2. Install dependencies
uv sync

# 3. Set up environment
cp .env.example .env
# Edit .env and add your keys:
#   VISIONARYGPT_SECRET_KEY=your-jwt-secret
#   VISIONARYGPT_OPENAI_API_KEY=sk-...  (only needed for /metadata)

# 4. Start the server
uv run uvicorn backend.app.main:app --reload

The server starts at http://127.0.0.1:8000. Open http://127.0.0.1:8000/docs for the interactive Swagger UI where you can test every endpoint.

What to expect on first run: The first request to any ML endpoint will take a while as the model downloads from Hugging Face (ViT is ~350MB, CLIP is ~600MB, Stable Diffusion is ~4GB). After that, models are cached locally. Subsequent requests load from cache in a few seconds.

iOS App

Prerequisites: Xcode 15+, iOS 17+ device or simulator

  1. Open ios/VisionaryGPTApp.xcodeproj in Xcode
  2. Edit ios/VisionaryGPTApp/Config/APIConfig.swift and point the base URL at your running backend (e.g., http://localhost:8000/api/v1 for simulator, or your machine's local IP for a physical device)
  3. Select your target device and hit Run
  4. Tap "Upload Poster," pick an image from your library, and tap "Analyze Poster"

The app will show a loading skeleton while the backend processes, then display the results in cards. You can share results, copy OCR text, or zoom into the poster image.


API Reference

All endpoints are prefixed with /api/v1. Most accept a JSON body with a base64-encoded image.

Method Path Body What It Returns
POST /classify/ { "image": "<base64>" } Genre label, confidence, full probability distribution
POST /embed/ { "image": "<base64>" } 512-dimensional CLIP embedding vector
POST /metadata/ { "image": "<base64>" } Title, genre, tagline, mood, summary (via CLIP + GPT-4)
POST /generation/ { "prompt": "..." } URL to generated poster image
POST /remix/ { "anchor": "...", "add": "...", "subtract": "..." } URL to remixed poster image
POST /detect/ { "image": "<base64>" } List of detected objects with labels, confidence, bounding boxes
POST /ocr/ Multipart: file (image) + language Extracted text lines
GET /ocr/languages -- List of supported OCR languages
POST /faces/ { "image": "<base64>" } List of detected faces with bounding boxes
POST /feedback/ { "feedback": "..." } Confirmation
POST /active_learning/corrections { "original_prediction": "...", "corrected_value": "..." } Confirmation
POST /auth/register { "username": "...", "password": "..." } JWT token
POST /auth/login { "username": "...", "password": "..." } JWT token
GET /health -- Server status, loaded models

Example -- classify a poster with curl:

# Encode an image to base64 (macOS syntax; on Linux use: base64 -w 0 spiderman.jpeg)
BASE64=$(base64 -i spiderman.jpeg)

# Send it
curl -X POST http://localhost:8000/api/v1/classify/ \
  -H "Content-Type: application/json" \
  -d "{\"image\": \"$BASE64\"}"

# Response:
# {
#   "success": true,
#   "data": {
#     "label": "Action",
#     "confidence": 0.847,
#     "predictions": {
#       "Action": 0.847, "Comedy": 0.052, "Drama": 0.061,
#       "Horror": 0.023, "Romance": 0.017
#     }
#   }
# }

Environment Variables

Set these in your .env file. All are prefixed with VISIONARYGPT_.

Variable Required Default What It's For
VISIONARYGPT_SECRET_KEY Yes (production) dev-secret-change-in-production Signs JWT tokens. Change this.
VISIONARYGPT_OPENAI_API_KEY For /metadata only -- OpenAI API key for GPT-4 metadata generation
VISIONARYGPT_CORS_ORIGINS No http://localhost:3000,http://localhost:8000 Comma-separated allowed origins
VISIONARYGPT_DEBUG No false Enables debug logging
VISIONARYGPT_DATABASE_URL No sqlite:///app/data/scenescope.db Database connection string
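The CORS origins value is a single comma-separated string, so the settings layer has to split it into a list. A sketch of that parsing, assuming a simple split (the actual Pydantic settings class may differ):

```python
import os

def parse_origins(raw: str):
    """Split comma-separated CORS origins, trimming whitespace and
    dropping empty entries."""
    return [o.strip() for o in raw.split(",") if o.strip()]

raw = os.environ.get(
    "VISIONARYGPT_CORS_ORIGINS",
    "http://localhost:3000,http://localhost:8000",
)
print(parse_origins(raw))
```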

Testing, Linting, Docker

# Run the test suite
uv run pytest

# Lint and format
uv run ruff check .
uv run ruff format .

# Build and run with Docker
docker build -t movieposterai .
docker run -p 8000:8000 \
  -e VISIONARYGPT_SECRET_KEY=your-secret \
  -e VISIONARYGPT_OPENAI_API_KEY=sk-... \
  movieposterai

The Docker image is based on python:3.11-slim and installs all system deps for dlib and OpenCV (cmake, libopenblas, etc). First build takes a while due to dlib compilation.


Project Structure

MoviePosterAI/
├── backend/
│   ├── app/
│   │   ├── main.py                    # FastAPI app, middleware, exception handlers
│   │   ├── api/                       # Route handlers
│   │   │   ├── classify.py
│   │   │   ├── detect.py
│   │   │   ├── embedding.py
│   │   │   ├── faces.py
│   │   │   ├── feedback.py
│   │   │   ├── generation.py
│   │   │   ├── metadata.py
│   │   │   ├── ocr.py
│   │   │   ├── remix.py
│   │   │   ├── auth.py
│   │   │   └── active_learning.py
│   │   ├── services/                  # ML model wrappers
│   │   │   ├── classification/        # ViT genre classifier
│   │   │   ├── detection/             # YOLOv8 detector
│   │   │   ├── embeddings/            # CLIP embedder
│   │   │   ├── faces/                 # dlib + face_recognition
│   │   │   ├── generation/            # Stable Diffusion
│   │   │   ├── metadata/              # CLIP + GPT-4 pipeline
│   │   │   ├── ocr/                   # EasyOCR reader
│   │   │   └── remix/                 # Embedding arithmetic + generation
│   │   ├── models/                    # Pydantic schemas & training scripts
│   │   │   ├── genre_classifier/      # ViT training & evaluation
│   │   │   ├── active_learning/       # Correction model
│   │   │   ├── auth/                  # User model
│   │   │   ├── embeddings/
│   │   │   ├── feedback/
│   │   │   ├── generation/
│   │   │   ├── metadata/
│   │   │   └── remix/
│   │   ├── core/                      # Config, auth, database, validation
│   │   │   ├── config.py              # Pydantic settings from .env
│   │   │   ├── auth.py                # JWT creation & verification
│   │   │   ├── database.py            # SQLModel session management
│   │   │   ├── dependencies.py        # Lazy model loading (lru_cache)
│   │   │   ├── exceptions.py          # Custom exception hierarchy
│   │   │   └── validation.py          # Image upload validation
│   │   └── db/                        # Database init
│   └── tests/                         # pytest test suite
│       ├── conftest.py
│       ├── test_api.py
│       └── test_health.py
├── ios/
│   └── VisionaryGPTApp/
│       ├── Config/
│       │   └── APIConfig.swift        # Backend URL configuration
│       ├── Model/
│       │   ├── APIResponses.swift     # Decodable response types
│       │   └── VisionaryError.swift   # Typed error enum
│       ├── Services/
│       │   └── APIService.swift       # Network layer (async/await)
│       ├── ViewModels/
│       │   ├── PosterUploadViewModel.swift
│       │   ├── ResultsViewModel.swift
│       │   └── TagGeneratorViewModel.swift
│       └── Views/
│           ├── ContentView.swift       # Home screen with navigation
│           ├── PhotoPicker.swift       # PHPicker wrapper
│           ├── PosterUploadView.swift  # Upload + OCR analysis
│           ├── ResultsView.swift       # Metadata + OCR display
│           └── TagGeneratorView.swift  # Tag generation with flow layout
├── pyproject.toml                     # Python deps & tool config
├── Dockerfile
├── Procfile
└── .env.example

What I'd Improve Next

  • Batch processing -- Accept multiple posters in one request instead of one at a time
  • Model caching layer -- Pre-load models on startup instead of lazy-loading on first request; add a warm-up endpoint
  • More genres -- Expand beyond the current five (Action, Comedy, Drama, Horror, Romance) with more training data
  • Poster similarity search -- Use CLIP embeddings to build a vector index and find visually similar posters
  • More tests -- Integration tests for each ML service, snapshot tests for the iOS views
  • CI/CD pipeline -- GitHub Actions for linting, tests, and Docker image publishing
  • Model versioning -- Track which model checkpoint is deployed and allow A/B testing
  • Streaming responses -- Stream Stable Diffusion progress back to the client instead of waiting for completion
  • iPad layout -- The iOS app works on iPad but doesn't use the extra screen space well yet
  • User history -- Save past analyses per user so they can come back to them

Author

Akin Olusanya

LinkedIn · GitHub · workwithakin@gmail.com
