Part 2 of the CollectorVision series. Part 1 has the overview.
Before you can identify a card you have to find it. Cornelius is the model responsible for that — it takes a camera frame and predicts where the four corners of the card are.
This sounds easy. It's not.
What the detector has to deal with
Cards in a real scene come in a lot of shapes. They're held at angles, rotated, half-off-screen, in colored sleeves, on patterned table surfaces, or overlapping other cards. There's no clean white rectangle to look for. Classical approaches based on edge detection and contour finding work reasonably well in controlled conditions, but in practice they require a lot of tuning and tend to fail on anything unusual.
Using a learned model means the detector can be trained on examples of what "card in hand" actually looks like, rather than the idealized rectangle a textbook pipeline expects.
Architecture
Cornelius is a MobileViT-XXS backbone with SimCC coordinate heads. Input is a 384×384 RGB image, ImageNet-normalized. The output is four normalized (x, y) corner coordinates — one per corner of the card — plus a sharpness score.
MobileViT-XXS is a hybrid CNN/Transformer architecture designed for mobile inference. The full thing runs in around 10–15ms per frame on a laptop CPU, 30–50ms on a phone.
SimCC (Simple Coordinate Classification) is the interesting part. Rather than predicting corner positions directly as floating-point numbers, it turns each coordinate axis into a classification problem over a discretized grid. The model predicts a probability distribution over bins, and the expected value of that distribution is the corner position. This lets the model express uncertainty — a sharp peak in the distribution means high confidence; a flat distribution means the model doesn't know.
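As a concrete sketch of the decode step (the function name and the 384-bin grid here are illustrative, not Cornelius's actual internals): softmax each axis's logits over its bins, take the expected bin index as the coordinate, and keep the peak probability as the confidence signal.

```python
import numpy as np

def simcc_decode(logits, n_bins=384):
    """Decode one SimCC axis: softmax over bins, expected value as coordinate.

    `logits` is a 1-D array of per-bin scores for a single axis of a single
    corner. Bin count and names are illustrative, not the library's own.
    """
    z = logits - logits.max()                # numerical stability
    p = np.exp(z) / np.exp(z).sum()          # softmax over the bin grid
    coord = (p * np.arange(n_bins)).sum() / (n_bins - 1)  # normalized [0, 1]
    peak = p.max()                           # sharp peak = confident prediction
    return coord, peak
```

A sharply peaked distribution decodes to essentially its argmax; a flat one decodes to the grid center (0.5) with a tiny peak value, which is what the sharpness gate exploits.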
The sharpness gate
The sharpness score is the mean peak value across all eight softmax distributions (four corners times two axes). It turned out to be a useful proxy for "is there a well-framed card here?"
The model also outputs a card presence logit, but that one isn't reliable — it fires strongly on blank images and hands without cards. The sharpness signal is better. When the card is clearly in frame with clean corners, all eight distributions are sharply peaked. When there's nothing there, or the image is blurry, the distributions go flat.
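The score itself is simple to compute. A minimal sketch, assuming the raw head output is an (8, n_bins) array of logits (four corners times two axes; the exact tensor layout inside the library may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sharpness_score(corner_logits):
    """Mean peak probability across all eight SimCC distributions."""
    probs = softmax(corner_logits)           # shape (8, n_bins)
    return float(probs.max(axis=-1).mean())
```

Peaked distributions push the score toward 1.0; flat ones pull it toward 1/n_bins, which is why a fixed threshold like 0.10 separates the two regimes cleanly.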
In practice, frames below a sharpness threshold of around 0.08–0.10 are just skipped. Nothing is forced.
detection = detector.detect(image)
if detection.sharpness < 0.10:
    continue  # try the next frame
Corner ordering
Four predicted corners still need to be assigned to the right positions — top-left, top-right, bottom-right, bottom-left. The ordering uses a standard geometric trick:
import numpy as np

s = pts.sum(axis=1)                  # x + y for each corner
d = np.diff(pts, axis=1).ravel()     # y - x for each corner (np.diff is col1 - col0)
tl = pts[np.argmin(s)]               # smallest x + y
tr = pts[np.argmin(d)]               # smallest y - x (large x, small y)
br = pts[np.argmax(s)]               # largest x + y
bl = pts[np.argmax(d)]               # largest y - x (small x, large y)
This works for any convex quadrilateral, however skewed by perspective. It does assume the card is roughly upright: rotate much past 45 degrees and the sum/diff assignments start swapping corners.
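Wrapped in a function and run on a deliberately shuffled, skewed quad (the coordinates are made up for illustration):

```python
import numpy as np

def order_corners(pts):
    """Return corners ordered (tl, tr, br, bl) via the sum/diff trick."""
    s = pts.sum(axis=1)                   # x + y per corner
    d = np.diff(pts, axis=1).ravel()      # y - x per corner
    return np.array([pts[np.argmin(s)],   # tl: smallest x + y
                     pts[np.argmin(d)],   # tr: large x, small y
                     pts[np.argmax(s)],   # br: largest x + y
                     pts[np.argmax(d)]])  # bl: small x, large y

pts = np.array([[310.0, 40.0],    # actually tr
                [20.0, 300.0],    # actually bl
                [50.0, 30.0],     # actually tl
                [290.0, 330.0]])  # actually br
ordered = order_corners(pts)      # tl, tr, br, bl
```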
Dewarp
Once the corners are ordered, a perspective transform maps the card to a fixed output rectangle. The canonical output size is 252 × 352 pixels — proportional to the physical card dimensions at 4 pixels per millimeter.
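The dewarp presumably wraps a standard 4-point perspective transform (OpenCV's getPerspectiveTransform/warpPerspective or equivalent). To make the math visible, here is the homography solved directly with numpy; the function names are mine, not the library's:

```python
import numpy as np

CARD_W, CARD_H = 252, 352  # canonical output, 4 px/mm

def perspective_matrix(src, dst):
    """Solve the 3x3 homography H with H @ [x, y, 1] ~ [u, v, 1] for 4 point pairs.

    Same job as cv2.getPerspectiveTransform, written out as the usual
    8x8 linear system with the bottom-right entry fixed to 1.
    """
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, pt):
    """Apply the homography to one point, including the homogeneous divide."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])
```

Map the ordered corners to (0, 0), (251, 0), (251, 351), (0, 351) and every output pixel can be sampled from the source frame through the inverse transform.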
detection.dewarp(image) returns a PIL Image, always the same shape, regardless of how the card was held. That consistency matters a lot for the embedder — Milo always sees the same-shaped input.
Failure modes
Cornelius has trouble with:
- Cards lying flat on a complex surface, where the card border blends into the background
- Very high viewing angles (more than about 60 degrees off axis)
- Full-art cards with no clear light-colored border
For most practical use — card held roughly flat, camera more or less overhead — it works reliably. The sharpness gate handles bad frames by skipping them rather than producing wrong answers.
The pluggable interface
The library's CornerDetector interface is a protocol, not a base class. Anything with a detect(image) -> DetectionResult method works as a drop-in replacement. There's an example in the repo of a Canny-edge-based detector implemented in about 30 lines, mostly to show how the interface works.
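A minimal sketch of what a structural (Protocol-based) interface like this looks like. The DetectionResult fields are inferred from this post (corners, sharpness), so treat the exact shapes and names as assumptions:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable
import numpy as np

@dataclass
class DetectionResult:
    corners: np.ndarray   # (4, 2), ordered tl, tr, br, bl; fields are assumed
    sharpness: float

@runtime_checkable
class CornerDetector(Protocol):
    def detect(self, image) -> DetectionResult: ...

class FullFrameDetector:
    """Toy replacement: claims the whole frame is the card. No inheritance needed."""
    def detect(self, image) -> DetectionResult:
        h, w = image.shape[:2]
        pts = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)
        return DetectionResult(corners=pts, sharpness=1.0)
```

Because Protocol matching is structural, the swap-in never imports a base class from the library, which is what makes a 30-line Canny detector a valid drop-in.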
Next: Part 3 — Milo, the embedding model