Give it a movie poster and it tells you everything. Genre, objects, faces, all the text on it, even the mood. Then it can generate entirely new posters from a text description, or take two posters and remix them into something new. The metadata endpoint feeds the image through CLIP and GPT-4 to produce a title, tagline, genre, and summary -- all from a single image upload.
This is a monorepo: a Python/FastAPI backend with real ML models (not wrappers around a single API), plus a native SwiftUI iOS app that talks to it. Upload a poster from your phone, get results in seconds.
- Genre classification & metadata
- OCR text extraction
- Object & face detection
- Poster generation from text
Genre Classification -- A Vision Transformer (ViT) fine-tuned on movie poster images. It resizes the poster to 224x224, normalises it, runs it through the transformer, and outputs softmax probabilities across five genres: Action, Comedy, Drama, Horror, and Romance. You get the top prediction plus the full probability distribution, so you can see when a poster sits between two genres.
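The classifier's final step -- turning raw logits into the probability distribution the API returns -- can be sketched in plain Python. This is a simplified stand-in for the ViT head, not the project's actual code; the genre names match the five classes above, and the logit values are illustrative:

```python
import math

GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def genre_distribution(logits):
    """Map raw model logits to the {label, confidence, predictions} shape
    the /classify/ endpoint returns."""
    probs = softmax(logits)
    dist = dict(zip(GENRES, probs))
    top = max(dist, key=dist.get)
    return {"label": top, "confidence": dist[top], "predictions": dist}

result = genre_distribution([2.1, -0.7, -0.5, -1.4, -1.8])
```

Because the full distribution is returned, a caller can spot ambiguous posters by checking how close the top two probabilities are.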
Object Detection -- YOLOv8 (nano variant) scans the poster and returns bounding boxes for every object it finds -- people, cars, weapons, animals, text regions, whatever is in frame. Each detection comes with a label, confidence score, and pixel coordinates. Useful for understanding what's visually prominent on the poster.
Face Detection -- Uses dlib's HOG-based frontal face detector via the face_recognition library to find every face in the image, returning bounding boxes and 128-dimensional face encodings. OpenCV's Haar cascades serve as a fallback for reliability across different lighting and angles.
OCR -- EasyOCR extracts every piece of text on the poster: the movie title, tagline, credits, release date, small print at the bottom. Supports multiple languages (English, Chinese simplified/traditional, and more). The reader is cached per language combo so subsequent requests are fast.
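The per-language caching pattern can be sketched as a dictionary keyed by the sorted language tuple. The `Reader` class below is a cheap stand-in for `easyocr.Reader` (which is expensive to construct); everything here is illustrative, not the project's code:

```python
class Reader:
    """Stand-in for easyocr.Reader -- the real one loads model weights."""
    loads = 0

    def __init__(self, langs):
        Reader.loads += 1
        self.langs = langs

_readers = {}

def get_reader(langs):
    """Return a cached reader for this language combo, constructing it once.
    Sorting the key means ["en", "ch_sim"] and ["ch_sim", "en"] share a reader."""
    key = tuple(sorted(langs))
    if key not in _readers:
        _readers[key] = Reader(key)
    return _readers[key]

a = get_reader(["en"])
b = get_reader(["en"])           # cache hit: same instance as a
c = get_reader(["en", "ch_sim"])  # different combo: new reader
```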
Metadata Generation -- A two-step pipeline. First, CLIP processes the image and produces a visual description. Then that description is sent to GPT-4 with a system prompt that asks it to act as a movie metadata writer. Back comes a title, genre, tagline, mood, and a two-sentence summary. All from looking at the poster.
Text-to-Poster -- Describe a movie that doesn't exist and Stable Diffusion v1.5 will create a poster for it. The pipeline runs on CUDA, Apple Silicon (MPS), or CPU depending on what's available. Generated images are saved to disk and served via a static URL.
Poster Remixing -- The weirdest feature. Give it three posters: an anchor, one to add, and one to subtract. It computes CLIP embeddings for all three, does vector arithmetic (anchor + add - subtract), normalises the result, and generates a new image from that fused embedding. It's like "take the vibe of poster A, add the style of poster B, remove the feel of poster C."
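The vector arithmetic behind remixing can be sketched with plain lists. The real service operates on 512-dimensional CLIP embeddings; the 2-D vectors here are purely for illustration:

```python
import math

def fuse(anchor, add, subtract):
    """anchor + add - subtract, then L2-normalise back onto the unit sphere,
    since CLIP embeddings are compared by direction (cosine similarity),
    not magnitude."""
    fused = [a + b - c for a, b, c in zip(anchor, add, subtract)]
    norm = math.sqrt(sum(x * x for x in fused))
    if norm == 0:
        raise ValueError("degenerate fusion: vectors cancelled to zero")
    return [x / norm for x in fused]

v = fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.0])
```

Renormalising matters: without it, the fused vector can drift far from the region of embedding space the generator was conditioned on.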
A native SwiftUI app (iOS 17+) with:
- Photo picker -- PHPicker integration for selecting posters from your library
- Instant analysis -- One-tap upload to the backend with loading skeletons while it processes
- Results view -- Metadata cards showing title, genre, year, director, cast, and description, plus OCR text with copy/share
- Tag generator -- Enter a movie title and genre, get AI-generated tags displayed in a flow layout with colour-coded chips
- Full-screen zoom -- Pinch-to-zoom poster viewer with magnification gesture handling
- Accessibility -- Every interactive element has `accessibilityLabel`, `accessibilityHint`, and `accessibilityIdentifier`. VoiceOver-ready.
- Network monitoring -- NWPathMonitor checks connectivity before requests; offline errors are caught early
- Error handling -- Typed `VisionaryError` enum with retry buttons for every failure state
JWT Authentication -- Register and login endpoints with bcrypt password hashing and HS256 JWT tokens. Token expiry is configurable. Built with python-jose and passlib.
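Structurally, an HS256 JWT is just `base64url(header).base64url(payload).base64url(signature)`. The stdlib sketch below shows that shape; it is an illustration only -- the project itself uses python-jose, not hand-rolled signing:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(username: str, secret: str, expires_in: int = 3600) -> str:
    """Build a header.payload.signature HS256 token by hand.
    Illustrative only -- use python-jose (as this project does) in practice."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": username,
                                 "exp": int(time.time()) + expires_in}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

token = make_token("alice", "dev-secret-change-in-production")
```

The configurable expiry mentioned above corresponds to the `exp` claim in the payload.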
Feedback Loop -- Users can submit corrections when the model gets something wrong. These corrections are saved to the database through the active learning endpoint, so the training data improves over time. The correction model stores the original prediction alongside the corrected value.
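A correction record of this shape can be sketched as a dataclass. The real project persists it via SQLModel; the field names here mirror the description above but are otherwise illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    """Sketch of an active-learning correction record: the model's original
    prediction is stored alongside the user's corrected value, so the pair
    can later be used as a labelled training example."""
    original_prediction: str
    corrected_value: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

fix = Correction(original_prediction="Horror", corrected_value="Drama")
```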
| Model | What It Does | How It Works | Notes |
|---|---|---|---|
| ViT (google/vit-base-patch16-224) | Genre classification | Vision Transformer pre-trained on ImageNet-21k, fine-tuned on poster images. Splits image into 16x16 patches and uses self-attention. | 86M params, 224x224 input |
| CLIP (openai/clip-vit-base-patch32) | Image embeddings & metadata prompts | Contrastive model that maps images and text into the same vector space. Used to generate image descriptions for GPT-4. | 150M params, 32-patch variant |
| Stable Diffusion v1.5 | Poster generation & remixing | Diffusion model that denoises random noise guided by text prompts. Runs with float16 on GPU/MPS, float32 on CPU. | ~1B params, guidance_scale=8.5 |
| YOLOv8n | Object detection | Single-shot detector that processes the whole image in one pass. Nano variant balances speed and accuracy. | 3.2M params, real-time |
| EasyOCR | Text extraction | CRAFT text detector + CRNN recognition network. Readers cached by language for repeat requests. | Supports 80+ languages |
| dlib + face_recognition | Face detection & encoding | HOG-based face detector + ResNet face encoder. Also uses Haar cascades as fallback. | 128-dimensional face encodings |
| GPT-4 | Metadata writing | Takes CLIP's image description and generates structured movie metadata (title, genre, tagline, mood, summary). | External API call |
┌─────────────────────────────────────────────────────────────────┐
│ iOS App (SwiftUI) │
│ PhotoPicker → ViewModels → APIService → URLSession (async) │
└─────────────────────────┬───────────────────────────────────────┘
│ HTTP/JSON
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Backend │
│ │
│ ┌──────────┐ ┌───────────────────────────────────────────┐ │
│ │ Middleware│ │ API Routes (/api/v1) │ │
│ │ ─ CORS │ │ classify, embed, metadata, generation, │ │
│ │ ─ ReqID │ │ remix, detect, ocr, faces, feedback, │ │
│ └──────────┘ │ auth, active_learning │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ Services │ │
│ │ GenreClassifier CLIPEmbedder │ │
│ │ StableDiffusion YOLODetector │ │
│ │ OCRReader FaceAnalyzer │ │
│ │ PosterMetadata RemixEngine │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ ML Models & External APIs │ │
│ │ ViT, CLIP, SD 1.5, YOLOv8, EasyOCR, │ │
│ │ dlib, face_recognition, OpenAI GPT-4 │ │
│ └──────────────────┬────────────────────────┘ │
│ │ │
│ ┌──────────────────▼────────────────────────┐ │
│ │ Database (SQLite / SQLModel) │ │
│ │ Users, Feedback, Active Learning │ │
│ └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The iOS app sends images as base64-encoded JPEG over JSON (or multipart for OCR). The backend validates the request through Pydantic models, routes it to the right service, and returns a structured `{ success, data }` envelope. Models are lazy-loaded via `lru_cache` -- the first request loads them; subsequent requests reuse the cached instances. A request ID middleware tags every request for tracing.
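The lazy-loading pattern can be sketched with `functools.lru_cache` acting as a singleton factory. `FakeModel` stands in for a heavyweight Hugging Face checkpoint; the method names and envelope contents are illustrative:

```python
from functools import lru_cache

LOADS = {"count": 0}

class FakeModel:
    """Stand-in for an expensive model load (weights download, GPU transfer)."""
    def __init__(self):
        LOADS["count"] += 1

    def predict(self, image_bytes: bytes) -> dict:
        # Illustrative response in the { success, data } envelope shape
        return {"success": True, "data": {"label": "Action"}}

@lru_cache(maxsize=1)
def get_model() -> FakeModel:
    # First call pays the full load cost; every later call returns the same
    # cached instance -- which is why the first request is slow and the
    # rest are fast.
    return FakeModel()

first = get_model()
second = get_model()
```

In FastAPI, a cached factory like this plugs in cleanly as a `Depends()` provider, which is the route-level shape the backend's `dependencies.py` uses.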
| Layer | Technology | Why |
|---|---|---|
| Backend framework | FastAPI | Async, auto-generated OpenAPI docs, Pydantic validation out of the box |
| ML framework | PyTorch + Transformers + Diffusers | Industry standard for vision models; Hugging Face ecosystem makes loading pre-trained models trivial |
| Object detection | Ultralytics YOLOv8 | Fast, accurate single-shot detector; the nano variant runs quickly even on CPU |
| OCR | EasyOCR | Works without Tesseract, supports 80+ languages, good on stylised poster text |
| Face detection | dlib + face_recognition | Well-tested, returns both bounding boxes and 128-dim face encodings |
| Metadata AI | OpenAI GPT-4 | Best at structured creative writing from visual descriptions |
| Database | SQLite + SQLModel | Zero config, good enough for single-server use, SQLModel gives Pydantic + SQLAlchemy in one |
| Auth | python-jose + passlib | Standard JWT + bcrypt stack, nothing exotic |
| iOS framework | SwiftUI | Declarative UI, native async/await, modern iOS development |
| iOS networking | URLSession | No third-party deps needed; built-in async/await support since iOS 15 |
| Package management | uv | Faster than pip/poetry, handles lockfiles and virtual envs |
| Linting | Ruff | Fast Python linter/formatter, replaces flake8 + isort |
| Testing | pytest + pytest-asyncio | Async test support for FastAPI endpoints |
| Containerisation | Docker | Reproducible builds with all system deps (cmake, dlib, opencv) |
Prerequisites: Python 3.10+, uv
# 1. Clone the repo
git clone https://github.com/AkinCodes/MoviePosterAI.git
cd MoviePosterAI
# 2. Install dependencies
uv sync
# 3. Set up environment
cp .env.example .env
# Edit .env and add your keys:
# VISIONARYGPT_SECRET_KEY=your-jwt-secret
# VISIONARYGPT_OPENAI_API_KEY=sk-... (only needed for /metadata)
# 4. Start the server
uv run uvicorn backend.app.main:app --reload

The server starts at http://127.0.0.1:8000. Open http://127.0.0.1:8000/docs for the interactive Swagger UI where you can test every endpoint.
What to expect on first run: The first request to any ML endpoint will take a while as the model downloads from Hugging Face (ViT is ~350MB, CLIP is ~600MB, Stable Diffusion is ~4GB). After that, models are cached locally. Subsequent requests load from cache in a few seconds.
Prerequisites: Xcode 15+, iOS 17+ device or simulator
- Open `ios/VisionaryGPTApp.xcodeproj` in Xcode
- Edit `ios/VisionaryGPTApp/Config/APIConfig.swift` and point the base URL at your running backend (e.g., `http://localhost:8000/api/v1` for the simulator, or your machine's local IP for a physical device)
- Select your target device and hit Run
- Tap "Upload Poster," pick an image from your library, and tap "Analyze Poster"
The app will show a loading skeleton while the backend processes, then display the results in cards. You can share results, copy OCR text, or zoom into the poster image.
All endpoints are prefixed with /api/v1. Most accept a JSON body with a base64-encoded image.
| Method | Path | Body | What It Returns |
|---|---|---|---|
| POST | `/classify/` | `{ "image": "<base64>" }` | Genre label, confidence, full probability distribution |
| POST | `/embed/` | `{ "image": "<base64>" }` | 512-dimensional CLIP embedding vector |
| POST | `/metadata/` | `{ "image": "<base64>" }` | Title, genre, tagline, mood, summary (via CLIP + GPT-4) |
| POST | `/generation/` | `{ "prompt": "..." }` | URL to generated poster image |
| POST | `/remix/` | `{ "anchor": "...", "add": "...", "subtract": "..." }` | URL to remixed poster image |
| POST | `/detect/` | `{ "image": "<base64>" }` | List of detected objects with labels, confidence, bounding boxes |
| POST | `/ocr/` | Multipart: `file` (image) + `language` | Extracted text lines |
| GET | `/ocr/languages` | -- | List of supported OCR languages |
| POST | `/faces/` | `{ "image": "<base64>" }` | List of detected faces with bounding boxes |
| POST | `/feedback/` | `{ "feedback": "..." }` | Confirmation |
| POST | `/active_learning/corrections` | `{ "original_prediction": "...", "corrected_value": "..." }` | Confirmation |
| POST | `/auth/register` | `{ "username": "...", "password": "..." }` | JWT token |
| POST | `/auth/login` | `{ "username": "...", "password": "..." }` | JWT token |
| GET | `/health` | -- | Server status, loaded models |
Example -- classify a poster with curl:
# Encode an image to base64
BASE64=$(base64 -i spiderman.jpeg)
# Send it
curl -X POST http://localhost:8000/api/v1/classify/ \
-H "Content-Type: application/json" \
-d "{\"image\": \"$BASE64\"}"
# Response:
# {
# "success": true,
# "data": {
# "label": "Action",
# "confidence": 0.847,
# "predictions": {
# "Action": 0.847, "Comedy": 0.052, "Drama": 0.061,
# "Horror": 0.023, "Romance": 0.017
# }
# }
# }

Set these in your `.env` file. All are prefixed with `VISIONARYGPT_`.
| Variable | Required | Default | What It's For |
|---|---|---|---|
| `VISIONARYGPT_SECRET_KEY` | Yes (production) | `dev-secret-change-in-production` | Signs JWT tokens. Change this. |
| `VISIONARYGPT_OPENAI_API_KEY` | For `/metadata` only | -- | OpenAI API key for GPT-4 metadata generation |
| `VISIONARYGPT_CORS_ORIGINS` | No | `http://localhost:3000,http://localhost:8000` | Comma-separated allowed origins |
| `VISIONARYGPT_DEBUG` | No | `false` | Enables debug logging |
| `VISIONARYGPT_DATABASE_URL` | No | `sqlite:///app/data/scenescope.db` | Database connection string |
# Run the test suite
uv run pytest
# Lint and format
uv run ruff check .
uv run ruff format .
# Build and run with Docker
docker build -t movieposterai .
docker run -p 8000:8000 \
-e VISIONARYGPT_SECRET_KEY=your-secret \
-e VISIONARYGPT_OPENAI_API_KEY=sk-... \
  movieposterai

The Docker image is based on `python:3.11-slim` and installs all system deps for dlib and OpenCV (cmake, libopenblas, etc.). First build takes a while due to dlib compilation.
MoviePosterAI/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app, middleware, exception handlers
│ │ ├── api/ # Route handlers
│ │ │ ├── classify.py
│ │ │ ├── detect.py
│ │ │ ├── embedding.py
│ │ │ ├── faces.py
│ │ │ ├── feedback.py
│ │ │ ├── generation.py
│ │ │ ├── metadata.py
│ │ │ ├── ocr.py
│ │ │ ├── remix.py
│ │ │ ├── auth.py
│ │ │ └── active_learning.py
│ │ ├── services/ # ML model wrappers
│ │ │ ├── classification/ # ViT genre classifier
│ │ │ ├── detection/ # YOLOv8 detector
│ │ │ ├── embeddings/ # CLIP embedder
│ │ │ ├── faces/ # dlib + face_recognition
│ │ │ ├── generation/ # Stable Diffusion
│ │ │ ├── metadata/ # CLIP + GPT-4 pipeline
│ │ │ ├── ocr/ # EasyOCR reader
│ │ │ └── remix/ # Embedding arithmetic + generation
│ │ ├── models/ # Pydantic schemas & training scripts
│ │ │ ├── genre_classifier/ # ViT training & evaluation
│ │ │ ├── active_learning/ # Correction model
│ │ │ ├── auth/ # User model
│ │ │ ├── embeddings/
│ │ │ ├── feedback/
│ │ │ ├── generation/
│ │ │ ├── metadata/
│ │ │ └── remix/
│ │ ├── core/ # Config, auth, database, validation
│ │ │ ├── config.py # Pydantic settings from .env
│ │ │ ├── auth.py # JWT creation & verification
│ │ │ ├── database.py # SQLModel session management
│ │ │ ├── dependencies.py # Lazy model loading (lru_cache)
│ │ │ ├── exceptions.py # Custom exception hierarchy
│ │ │ └── validation.py # Image upload validation
│ │ └── db/ # Database init
│ └── tests/ # pytest test suite
│ ├── conftest.py
│ ├── test_api.py
│ └── test_health.py
├── ios/
│ └── VisionaryGPTApp/
│ ├── Config/
│ │ └── APIConfig.swift # Backend URL configuration
│ ├── Model/
│ │ ├── APIResponses.swift # Decodable response types
│ │ └── VisionaryError.swift # Typed error enum
│ ├── Services/
│ │ └── APIService.swift # Network layer (async/await)
│ ├── ViewModels/
│ │ ├── PosterUploadViewModel.swift
│ │ ├── ResultsViewModel.swift
│ │ └── TagGeneratorViewModel.swift
│ └── Views/
│ ├── ContentView.swift # Home screen with navigation
│ ├── PhotoPicker.swift # PHPicker wrapper
│ ├── PosterUploadView.swift # Upload + OCR analysis
│ ├── ResultsView.swift # Metadata + OCR display
│ └── TagGeneratorView.swift # Tag generation with flow layout
├── pyproject.toml # Python deps & tool config
├── Dockerfile
├── Procfile
└── .env.example
- Batch processing -- Accept multiple posters in one request instead of one at a time
- Model caching layer -- Pre-load models on startup instead of lazy-loading on first request; add a warm-up endpoint
- More genres -- Expand beyond the current five (Action, Comedy, Drama, Horror, Romance) with more training data
- Poster similarity search -- Use CLIP embeddings to build a vector index and find visually similar posters
- More tests -- Integration tests for each ML service, snapshot tests for the iOS views
- CI/CD pipeline -- GitHub Actions for linting, tests, and Docker image publishing
- Model versioning -- Track which model checkpoint is deployed and allow A/B testing
- Streaming responses -- Stream Stable Diffusion progress back to the client instead of waiting for completion
- iPad layout -- The iOS app works on iPad but doesn't use the extra screen space well yet
- User history -- Save past analyses per user so they can come back to them
- CinemaScopeAI -- AI-powered cinema discovery platform
- RecommenderSystem -- Movie recommendation engine using collaborative filtering
Akin Olusanya