Give it a movie poster and it tells you everything. Genre, objects, faces, all the text on it, even the mood. Then it can generate entirely new posters from a text description, or take two posters and remix them into something new. The metadata endpoint feeds the image through CLIP and GPT-4 to produce a title, tagline, genre, and summary -- all from a single image upload.
This is a monorepo: a Python/FastAPI backend with real ML models (not wrappers around a single API), plus a native SwiftUI iOS app that talks to it. Upload a poster from your phone, get results in seconds.
- Genre classification & metadata
- OCR text extraction
- Object & face detection
- Poster generation from text
Genre Classification -- A Vision Transformer (ViT) fine-tuned on movie poster images. It resizes the poster to 224x224, normalises it, runs it through the transformer, and outputs softmax probabilities across five genres: Action, Comedy, Drama, Horror, and Romance. You get the top prediction plus the full probability distribution, so you can see when a poster sits between two genres.
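The classifier's final step -- turning raw logits into the probability distribution the API returns -- can be sketched in plain Python. This is a simplified stand-in for the ViT head, not the project's actual code; the genre names match the five classes above, and the logit values are illustrative:

```python
import math

GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def genre_distribution(logits):
    """Map raw model logits to the {label, confidence, predictions} shape
    the /classify/ endpoint returns."""
    probs = softmax(logits)
    dist = dict(zip(GENRES, probs))
    top = max(dist, key=dist.get)
    return {"label": top, "confidence": dist[top], "predictions": dist}

result = genre_distribution([2.1, -0.7, -0.5, -1.4, -1.8])
```

Because the full distribution is returned, a caller can spot ambiguous posters by checking how close the top two probabilities are.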
Object Detection -- YOLOv8 (nano variant) scans the poster and returns bounding boxes for every object it finds -- people, cars, weapons, animals, text regions, whatever is in frame. Each detection comes with a label, confidence score, and pixel coordinates. Useful for understanding what's visually prominent on the poster.
Face Detection -- Uses dlib's HOG-based frontal face detector via the face_recognition library to find every face in the image, returning bounding boxes and 128-dimensional face encodings. OpenCV's Haar cascades serve as a fallback for reliability across different lighting and angles.
OCR -- EasyOCR extracts every piece of text on the poster: the movie title, tagline, credits, release date, small print at the bottom. Supports multiple languages (English, Chinese simplified/traditional, and more). The reader is cached per language combo so subsequent requests are fast.
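The per-language caching pattern can be sketched as a dictionary keyed by the sorted language tuple. The `Reader` class below is a cheap stand-in for `easyocr.Reader` (which is expensive to construct); everything here is illustrative, not the project's code:

```python
class Reader:
    """Stand-in for easyocr.Reader -- the real one loads model weights."""
    loads = 0

    def __init__(self, langs):
        Reader.loads += 1
        self.langs = langs

_readers = {}

def get_reader(langs):
    """Return a cached reader for this language combo, constructing it once.
    Sorting the key means ["en", "ch_sim"] and ["ch_sim", "en"] share a reader."""
    key = tuple(sorted(langs))
    if key not in _readers:
        _readers[key] = Reader(key)
    return _readers[key]

a = get_reader(["en"])
b = get_reader(["en"])           # cache hit: same instance as a
c = get_reader(["en", "ch_sim"])  # different combo: new reader
```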
Metadata Generation -- A two-step pipeline. First, CLIP processes the image and produces a visual description. Then that description is sent to GPT-4 with a system prompt that asks it to act as a movie metadata writer. Back comes a title, genre, tagline, mood, and a two-sentence summary. All from looking at the poster.
Text-to-Poster -- Describe a movie that doesn't exist and Stable Diffusion v1.5 will create a poster for it. The pipeline runs on CUDA, Apple Silicon (MPS), or CPU depending on what's available. Generated images are saved to disk and served via a static URL.
Poster Remixing -- The weirdest feature. Give it three posters: an anchor, one to add, and one to subtract. It computes CLIP embeddings for all three, does vector arithmetic (anchor + add - subtract), normalises the result, and generates a new image from that fused embedding. It's like "take the vibe of poster A, add the style of poster B, remove the feel of poster C."
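The vector arithmetic behind remixing can be sketched with plain lists. The real service operates on 512-dimensional CLIP embeddings; the 2-D vectors here are purely for illustration:

```python
import math

def fuse(anchor, add, subtract):
    """anchor + add - subtract, then L2-normalise back onto the unit sphere,
    since CLIP embeddings are compared by direction (cosine similarity),
    not magnitude."""
    fused = [a + b - c for a, b, c in zip(anchor, add, subtract)]
    norm = math.sqrt(sum(x * x for x in fused))
    if norm == 0:
        raise ValueError("degenerate fusion: vectors cancelled to zero")
    return [x / norm for x in fused]

v = fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.0])
```

Renormalising matters: without it, the fused vector can drift far from the region of embedding space the generator was conditioned on.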
A native SwiftUI app (iOS 17+) with:
- Photo picker -- PHPicker integration for selecting posters from your library
- Instant analysis -- One-tap upload to the backend with loading skeletons while it processes
- Results view -- Metadata cards showing title, genre, year, director, cast, and description, plus OCR text with copy/share
- Tag generator -- Enter a movie title and genre, get AI-generated tags displayed in a flow layout with colour-coded chips
- Full-screen zoom -- Pinch-to-zoom poster viewer with magnification gesture handling
- Accessibility -- Every interactive element has `accessibilityLabel`, `accessibilityHint`, and `accessibilityIdentifier`. VoiceOver-ready.
- Network monitoring -- NWPathMonitor checks connectivity before requests; offline errors are caught early
- Error handling -- Typed `VisionaryError` enum with retry buttons for every failure state
JWT Authentication -- Register and login endpoints with bcrypt password hashing and HS256 JWT tokens. Token expiry is configurable. Built with python-jose and passlib.
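Structurally, an HS256 JWT is just `base64url(header).base64url(payload).base64url(signature)`. The stdlib sketch below shows that shape; it is an illustration only -- the project itself uses python-jose, not hand-rolled signing:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(username: str, secret: str, expires_in: int = 3600) -> str:
    """Build a header.payload.signature HS256 token by hand.
    Illustrative only -- use python-jose (as this project does) in practice."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": username,
                                 "exp": int(time.time()) + expires_in}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

token = make_token("alice", "dev-secret-change-in-production")
```

The configurable expiry mentioned above corresponds to the `exp` claim in the payload.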
Feedback Loop -- Users can submit corrections when the model gets something wrong. These corrections are saved to the database through the active learning endpoint, so the training data improves over time. The correction model stores the original prediction alongside the corrected value.
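A correction record of this shape can be sketched as a dataclass. The real project persists it via SQLModel; the field names here mirror the description above but are otherwise illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    """Sketch of an active-learning correction record: the model's original
    prediction is stored alongside the user's corrected value, so the pair
    can later be used as a labelled training example."""
    original_prediction: str
    corrected_value: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

fix = Correction(original_prediction="Horror", corrected_value="Drama")
```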
| Model | What It Does | How It Works | Notes |
|---|---|---|---|
| ViT (google/vit-base-patch16-224) | Genre classification | Vision Transformer pre-trained on ImageNet-21k, fine-tuned on poster images. Splits image into 16x16 patches and uses self-attention. | 86M params, 224x224 input |
| CLIP (openai/clip-vit-base-patch32) | Image embeddings & metadata prompts | Contrastive model that maps images and text into the same vector space. Used to generate image descriptions for GPT-4. | 150M params, 32-patch variant |
| Stable Diffusion v1.5 | Poster generation & remixing | Diffusion model that denoises random noise guided by text prompts. Runs with float16 on GPU/MPS, float32 on CPU. | ~1B params, guidance_scale=8.5 |
| YOLOv8n | Object detection | Single-shot detector that processes the whole image in one pass. Nano variant balances speed and accuracy. | 3.2M params, real-time |
| EasyOCR | Text extraction | CRAFT text detector + CRNN recognition network. Readers cached by language for repeat requests. | Supports 80+ languages |
| dlib + face_recognition | Face detection & encoding | HOG-based face detector + ResNet face encoder. Also uses Haar cascades as fallback. | 128-dimensional face encodings |
| GPT-4 | Metadata writing | Takes CLIP's image description and generates structured movie metadata (title, genre, tagline, mood, summary). | External API call |
┌─────────────────────────────────────────────────────────────────┐
│ iOS App (SwiftUI) │
│ PhotoPicker → ViewModels → APIService → URLSession (async) │
└─────────────────────────┬───────────────────────────────────────┘
│ HTTP/JSON
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Backend │
│ │
│ ┌──────────┐ ┌───────────────────────────────────────────┐ │
│ │ Middleware│ │ API Routes (/api/v1) │ │
│ │ ─ CORS │ │ classify, embed, metadata, generation, │ │
│ │ ─ ReqID │ │ remix, detect, ocr, faces, feedback, │ │
│ └──────────┘ │ auth, active_learning │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ Services │ │
│ │ GenreClassifier CLIPEmbedder │ │
│ │ StableDiffusion YOLODetector │ │
│ │ OCRReader FaceAnalyzer │ │
│ │ PosterMetadata RemixEngine │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ ML Models & External APIs │ │
│ │ ViT, CLIP, SD 1.5, YOLOv8, EasyOCR, │ │
│ │ dlib, face_recognition, OpenAI GPT-4 │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ Database (SQLite / SQLModel) │ │
│ │ Users, Feedback, Active Learning │ │
│ └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The iOS app sends images as base64-encoded JPEG over JSON (or multipart for OCR). The backend validates the request through Pydantic models, routes it to the right service, and returns a structured `{ success, data }` envelope. Models are lazy-loaded via `lru_cache` -- the first request loads them; subsequent requests reuse the cached instances. A request ID middleware tags every request for tracing.
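The lazy-loading pattern can be sketched with `functools.lru_cache` acting as a singleton factory. `FakeModel` stands in for a heavyweight Hugging Face checkpoint; the method names and envelope contents are illustrative:

```python
from functools import lru_cache

LOADS = {"count": 0}

class FakeModel:
    """Stand-in for an expensive model load (weights download, GPU transfer)."""
    def __init__(self):
        LOADS["count"] += 1

    def predict(self, image_bytes: bytes) -> dict:
        # Illustrative response in the { success, data } envelope shape
        return {"success": True, "data": {"label": "Action"}}

@lru_cache(maxsize=1)
def get_model() -> FakeModel:
    # First call pays the full load cost; every later call returns the same
    # cached instance -- which is why the first request is slow and the
    # rest are fast.
    return FakeModel()

first = get_model()
second = get_model()
```

In FastAPI, a cached factory like this plugs in cleanly as a `Depends()` provider, which is the route-level shape the backend's `dependencies.py` uses.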
| Layer | Technology | Why |
|---|---|---|
| Backend framework | FastAPI | Async, auto-generated OpenAPI docs, Pydantic validation out of the box |
| ML framework | PyTorch + Transformers + Diffusers | Industry standard for vision models; Hugging Face ecosystem makes loading pre-trained models trivial |
| Object detection | Ultralytics YOLOv8 | Fast, accurate single-shot detector; the nano variant runs quickly even on CPU |
| OCR | EasyOCR | Works without Tesseract, supports 80+ languages, good on stylised poster text |
| Face detection | dlib + face_recognition | Well-tested, returns both bounding boxes and 128-dim face encodings |
| Metadata AI | OpenAI GPT-4 | Best at structured creative writing from visual descriptions |
| Database | SQLite + SQLModel | Zero config, good enough for single-server use, SQLModel gives Pydantic + SQLAlchemy in one |
| Auth | python-jose + passlib | Standard JWT + bcrypt stack, nothing exotic |
| iOS framework | SwiftUI | Declarative UI, native async/await, modern iOS development |
| iOS networking | URLSession | No third-party deps needed; built-in async/await support since iOS 15 |
| Package management | uv | Faster than pip/poetry, handles lockfiles and virtual envs |
| Linting | Ruff | Fast Python linter/formatter, replaces flake8 + isort |
| Testing | pytest + pytest-asyncio | Async test support for FastAPI endpoints |
| Containerisation | Docker | Reproducible builds with all system deps (cmake, dlib, opencv) |
Prerequisites: Python 3.10+, uv
# 1. Clone the repo
git clone https://github.com/AkinCodes/MoviePosterAI.git
cd MoviePosterAI
# 2. Install dependencies
uv sync
# 3. Set up environment
cp .env.example .env
# Edit .env and add your keys:
# VISIONARYGPT_SECRET_KEY=your-jwt-secret
# VISIONARYGPT_OPENAI_API_KEY=sk-... (only needed for /metadata)
# 4. Start the server
uv run uvicorn backend.app.main:app --reload

The server starts at http://127.0.0.1:8000. Open http://127.0.0.1:8000/docs for the interactive Swagger UI where you can test every endpoint.
What to expect on first run: The first request to any ML endpoint will take a while as the model downloads from Hugging Face (ViT is ~350MB, CLIP is ~600MB, Stable Diffusion is ~4GB). After that, models are cached locally. Subsequent requests load from cache in a few seconds.
Prerequisites: Xcode 15+, iOS 17+ device or simulator
- Open `ios/VisionaryGPTApp.xcodeproj` in Xcode
- Edit `ios/VisionaryGPTApp/Config/APIConfig.swift` and point the base URL at your running backend (e.g., `http://localhost:8000/api/v1` for the simulator, or your machine's local IP for a physical device)
- Select your target device and hit Run
- Tap "Upload Poster," pick an image from your library, and tap "Analyze Poster"
The app will show a loading skeleton while the backend processes, then display the results in cards. You can share results, copy OCR text, or zoom into the poster image.
All endpoints are prefixed with /api/v1. Most accept a JSON body with a base64-encoded image.
| Method | Path | Body | What It Returns |
|---|---|---|---|
| POST | `/classify/` | `{ "image": "<base64>" }` | Genre label, confidence, full probability distribution |
| POST | `/embed/` | `{ "image": "<base64>" }` | 512-dimensional CLIP embedding vector |
| POST | `/metadata/` | `{ "image": "<base64>" }` | Title, genre, tagline, mood, summary (via CLIP + GPT-4) |
| POST | `/generation/` | `{ "prompt": "..." }` | URL to generated poster image |
| POST | `/remix/` | `{ "anchor": "...", "add": "...", "subtract": "..." }` | URL to remixed poster image |
| POST | `/detect/` | `{ "image": "<base64>" }` | List of detected objects with labels, confidence, bounding boxes |
| POST | `/ocr/` | Multipart: `file` (image) + `language` | Extracted text lines |
| GET | `/ocr/languages` | -- | List of supported OCR languages |
| POST | `/faces/` | `{ "image": "<base64>" }` | List of detected faces with bounding boxes |
| POST | `/feedback/` | `{ "feedback": "..." }` | Confirmation |
| POST | `/active_learning/corrections` | `{ "original_prediction": "...", "corrected_value": "..." }` | Confirmation |
| POST | `/auth/register` | `{ "username": "...", "password": "..." }` | JWT token |
| POST | `/auth/login` | `{ "username": "...", "password": "..." }` | JWT token |
| GET | `/health` | -- | Server status, loaded models |
Example -- classify a poster with curl:
# Encode an image to base64
BASE64=$(base64 -i spiderman.jpeg)
# Send it
curl -X POST http://localhost:8000/api/v1/classify/ \
-H "Content-Type: application/json" \
-d "{\"image\": \"$BASE64\"}"
# Response:
# {
# "success": true,
# "data": {
# "label": "Action",
# "confidence": 0.847,
# "predictions": {
# "Action": 0.847, "Comedy": 0.052, "Drama": 0.061,
# "Horror": 0.023, "Romance": 0.017
# }
# }
# }

Set these in your `.env` file. All are prefixed with `VISIONARYGPT_`.
| Variable | Required | Default | What It's For |
|---|---|---|---|
| `VISIONARYGPT_SECRET_KEY` | Yes (production) | `dev-secret-change-in-production` | Signs JWT tokens. Change this. |
| `VISIONARYGPT_OPENAI_API_KEY` | For `/metadata` only | -- | OpenAI API key for GPT-4 metadata generation |
| `VISIONARYGPT_CORS_ORIGINS` | No | `http://localhost:3000,http://localhost:8000` | Comma-separated allowed origins |
| `VISIONARYGPT_DEBUG` | No | `false` | Enables debug logging |
| `VISIONARYGPT_DATABASE_URL` | No | `sqlite:///app/data/scenescope.db` | Database connection string |
# Run the test suite
uv run pytest
# Lint and format
uv run ruff check .
uv run ruff format .
# Build and run with Docker
docker build -t movieposterai .
docker run -p 8000:8000 \
-e VISIONARYGPT_SECRET_KEY=your-secret \
-e VISIONARYGPT_OPENAI_API_KEY=sk-... \
  movieposterai

The Docker image is based on `python:3.11-slim` and installs all system deps for dlib and OpenCV (cmake, libopenblas, etc.). First build takes a while due to dlib compilation.
MoviePosterAI/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app, middleware, exception handlers
│ │ ├── api/ # Route handlers
│ │ │ ├── classify.py
│ │ │ ├── detect.py
│ │ │ ├── embedding.py
│ │ │ ├── faces.py
│ │ │ ├── feedback.py
│ │ │ ├── generation.py
│ │ │ ├── metadata.py
│ │ │ ├── ocr.py
│ │ │ ├── remix.py
│ │ │ ├── auth.py
│ │ │ └── active_learning.py
│ │ ├── services/ # ML model wrappers
│ │ │ ├── classification/ # ViT genre classifier
│ │ │ ├── detection/ # YOLOv8 detector
│ │ │ ├── embeddings/ # CLIP embedder
│ │ │ ├── faces/ # dlib + face_recognition
│ │ │ ├── generation/ # Stable Diffusion
│ │ │ ├── metadata/ # CLIP + GPT-4 pipeline
│ │ │ ├── ocr/ # EasyOCR reader
│ │ │ └── remix/ # Embedding arithmetic + generation
│ │ ├── models/ # Pydantic schemas & training scripts
│ │ │ ├── genre_classifier/ # ViT training & evaluation
│ │ │ ├── active_learning/ # Correction model
│ │ │ ├── auth/ # User model
│ │ │ ├── embeddings/
│ │ │ ├── feedback/
│ │ │ ├── generation/
│ │ │ ├── metadata/
│ │ │ └── remix/
│ │ ├── core/ # Config, auth, database, validation
│ │ │ ├── config.py # Pydantic settings from .env
│ │ │ ├── auth.py # JWT creation & verification
│ │ │ ├── database.py # SQLModel session management
│ │ │ ├── dependencies.py # Lazy model loading (lru_cache)
│ │ │ ├── exceptions.py # Custom exception hierarchy
│ │ │ └── validation.py # Image upload validation
│ │ └── db/ # Database init
│ └── tests/ # pytest test suite
│ ├── conftest.py
│ ├── test_api.py
│ └── test_health.py
├── ios/
│ └── VisionaryGPTApp/
│ ├── Config/
│ │ └── APIConfig.swift # Backend URL configuration
│ ├── Model/
│ │ ├── APIResponses.swift # Decodable response types
│ │ └── VisionaryError.swift # Typed error enum
│ ├── Services/
│ │ └── APIService.swift # Network layer (async/await)
│ ├── ViewModels/
│ │ ├── PosterUploadViewModel.swift
│ │ ├── ResultsViewModel.swift
│ │ └── TagGeneratorViewModel.swift
│ └── Views/
│ ├── ContentView.swift # Home screen with navigation
│ ├── PhotoPicker.swift # PHPicker wrapper
│ ├── PosterUploadView.swift # Upload + OCR analysis
│ ├── ResultsView.swift # Metadata + OCR display
│ └── TagGeneratorView.swift # Tag generation with flow layout
├── pyproject.toml # Python deps & tool config
├── Dockerfile
├── Procfile
└── .env.example
- Batch processing -- Accept multiple posters in one request instead of one at a time
- Model caching layer -- Pre-load models on startup instead of lazy-loading on first request; add a warm-up endpoint
- More genres -- Expand beyond the current five (Action, Comedy, Drama, Horror, Romance) with more training data
- Poster similarity search -- Use CLIP embeddings to build a vector index and find visually similar posters
- More tests -- Integration tests for each ML service, snapshot tests for the iOS views
- CI/CD pipeline -- GitHub Actions for linting, tests, and Docker image publishing
- Model versioning -- Track which model checkpoint is deployed and allow A/B testing
- Streaming responses -- Stream Stable Diffusion progress back to the client instead of waiting for completion
- iPad layout -- The iOS app works on iPad but doesn't use the extra screen space well yet
- User history -- Save past analyses per user so they can come back to them
- CinemaScopeAI -- AI-powered cinema discovery platform
- RecommenderSystem -- Movie recommendation engine using collaborative filtering
Akin Olusanya