A two-stage visual place recognition pipeline that combines structural edge-based local feature extraction with self-supervised vision transformer representations. SelMA (Selective Edge Location Matching & Assessment) identifies the location of a query photograph by matching it against a geo-tagged database of reference images, addressing challenges such as extreme illumination change (day-to-night), viewpoint variation, and perceptual aliasing.
- Setup and Installation
- Usage
- Method Overview
- Mathematical Foundations
- Project Structure
- Configuration Reference
- Benchmark Evaluation
- Dataset
- Benchmark Results
- References
| Requirement | Version |
|---|---|
| Python | ≥ 3.10 |
| CUDA | ≥ 12.8 (optional for GPU acceleration) |
| GPU VRAM | ≥ 8 GB recommended |
```bash
git clone https://github.com/CasRepoClone/SelMA.git
cd SelMA
python -m venv .venv
```

Activate it:

```bash
# Windows (PowerShell)
.\.venv\Scripts\Activate.ps1

# Linux / macOS
source .venv/bin/activate
```

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```

The DINOv2 ViT-S/14 weights (~85 MB) are downloaded automatically on first run via torch.hub.
Place the Aachen Day-Night images under images_upright/:
images_upright/
├── db/ # Reference database images
└── query/ # Query images
├── day/
└── night/
```bash
cd src
python main.py --help
```

```bash
cd src
python main.py [OPTIONS]
```

| Flag | Short | Description | Default |
|---|---|---|---|
| `--query-dir` | `-q` | Path to query image directory | `images_upright/query` |
| `--db-dir` | `-d` | Path to database image directory | `images_upright/db` |
| `--output-dir` | `-o` | Path to output root directory | `output/` |
| `--ransac-method` | | Geometric model: `fundamental`, `homography`, `affine` | `fundamental` |
| `--ransac-reproj` | | Reprojection error threshold (px) | `5.0` |
| `--ransac-iters` | | Maximum RANSAC iterations | `2000` |
| `--ransac-confidence` | | RANSAC confidence level | `0.999` |
| `--ransac-min-inliers` | | Minimum inlier count to accept a match | `8` |
| `--benchmark` | | Run benchmark evaluation (pose estimation) | off |
| `--benchmark-scene` | | Path to benchmark scene directory | — |
| `--benchmark-max-pairs` | | Limit number of evaluated pairs | all |
| `--benchmark-pose-method` | | Pose method: `essential`, `fundamental` | `essential` |
| `--descriptor` | | Descriptor type: `sift`, `dinov2` | `sift` |
```bash
# Run with defaults
python main.py

# Custom directories
python main.py -q ../data/queries -d ../data/database -o ../results

# Use homography model with stricter thresholds
python main.py --ransac-method homography --ransac-reproj 3.0 --ransac-min-inliers 12

# Run pose estimation benchmark on a phototourism scene
python main.py --benchmark --benchmark-scene ../data/phototourism/reichstag --benchmark-max-pairs 100 --descriptor sift
```

Each run creates a timestamped folder under the output directory:
output/<YYYYMMDD_HHMMSS>/
├── results.csv # Per-query match results
├── query_<name>.jpg # Copy of each query image
├── match_<name>.jpg # Best-matching database image
├── vis_matches_<name>.jpg # Pre-RANSAC keypoint match lines
├── vis_ransac_<name>.jpg # Post-RANSAC inlier/outlier visualization
├── vis_spatial_query_<name>.jpg # Feature similarity heatmap (query)
├── vis_spatial_db_<name>.jpg # Feature similarity heatmap (database)
└── vis_top10_<name>.jpg # Top-10 candidate grid with scores
When running --benchmark, results are saved to output/benchmark/:
output/benchmark/
├── benchmark_<scene>.csv # Per-pair pose errors and metrics
├── summary_<scene>.txt # Aggregated mAA and statistics
├── kpmatches/ # Postprocessed keypoints per image
│ └── <image_stem>.jpg # All detected keypoints overlaid on the image
└── topk/ # Top-5 matching images per query
└── <image_stem>.jpg # Query + top-5 candidates grid with scores
- kpmatches/ — Each scene image with all postprocessed SIFT keypoints drawn as green dots, labelled with the total keypoint count. This shows the edge-selective filtering output.
- topk/ — For each image, a grid showing the query on the left and its 5 best-matching images ranked by match count. Each candidate shows the inlier ratio (H), descriptor similarity (Sim), match count (RANSAC), and pass/fail status.
Visual place recognition is formulated as an image retrieval problem: given a query image $I_q$ and a geo-tagged database of reference images $\{I_1, \dots, I_N\}$, retrieve the reference image depicting the same place; the query then inherits that image's geo-tag.
Conventional approaches rely on hand-crafted local features (SIFT, ORB) or global descriptors (NetVLAD, GeM pooling). SelMA explores an alternative strategy: extracting local patches along structural edges — which are more robust to illumination change than texture-based keypoints — and encoding them with DINOv2, a foundation model whose self-supervised training on 142 million images yields representations with strong cross-domain generalisation.
The system operates in two stages:
Stage 1 — Coarse Retrieval. Each image (query and database) is processed through the same feature extraction pipeline: Canny edge detection → edge-point sampling → patch extraction → Gaussian denoising → DINOv2 encoding. This produces a set of 384-dimensional feature vectors anchored to structurally salient locations. Database images are ranked by the average cosine similarity of the top-$k$ most confident patch correspondences, yielding a shortlist of the highest-scoring candidates (default: 100, `SHORTLIST_SIZE`).
Stage 2 — Geometric Re-ranking. Each shortlisted candidate undergoes full-patch matching followed by RANSAC geometric verification. A heuristic scoring function combines feature similarity and geometric consistency:

$$s = w_{\text{sim}} \cdot \text{sim} + w_{\text{inlier}} \cdot \frac{\#\text{inliers}}{\#\text{matches}} + w_{\text{pass}} \cdot \mathbb{1}[\text{RANSAC passed}]$$

with default weights $w_{\text{sim}} = 0.5$, $w_{\text{inlier}} = 0.35$, and $w_{\text{pass}} = 0.15$ (the `HEURISTIC_W_*` settings). The final ranking is determined by sorting candidates in descending order of $s$; the top-ranked database image supplies the query's location.
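As a minimal sketch of this scoring heuristic (the function name and signature are illustrative, not the project's API), using the documented default weights:

```python
def heuristic_score(cosine_sim: float, inlier_ratio: float, ransac_passed: bool,
                    w_sim: float = 0.5, w_inlier: float = 0.35,
                    w_pass: float = 0.15) -> float:
    """Combine feature similarity and geometric consistency into one score.

    Weights mirror the documented defaults (HEURISTIC_W_SIM,
    HEURISTIC_W_INLIER_RATIO, HEURISTIC_W_RANSAC_PASS).
    """
    return w_sim * cosine_sim + w_inlier * inlier_ratio + w_pass * float(ransac_passed)

# A candidate with strong similarity, half its matches surviving RANSAC,
# and a passing verification:
score = heuristic_score(0.8, 0.5, True)  # 0.5*0.8 + 0.35*0.5 + 0.15 = 0.725
```

Because the weights sum to 1 and each term lies in [0, 1], the score is itself bounded in [0, 1], which makes scores comparable across candidates.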
Canny edge detection identifies structural boundaries in images through a multi-stage process.
Step 1 — Gaussian Smoothing. The image is convolved with a Gaussian kernel $G_\sigma$ to suppress noise before gradient computation: $I_s = G_\sigma * I$.
Step 2 — Gradient Computation. Sobel operators compute the intensity gradient magnitude and direction:

$$G = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

where $G_x$ and $G_y$ are the horizontal and vertical Sobel responses.
Step 3 — Non-Maximum Suppression. Only local maxima along the gradient direction are retained, producing thin edges.
Step 4 — Double Thresholding. Edges with gradient magnitude above the high threshold are kept as strong edges; those below the low threshold are discarded; those in between are retained only if connected to a strong edge (hysteresis).
In this pipeline, the low and high hysteresis thresholds default to 50 and 150 (`CANNY_LOW_THRESH`, `CANNY_HIGH_THRESH`). Edge pixels are then sampled with a minimum spacing of 15 px between points (`PATCH_SPACING`); if more than 2000 points survive (`MAX_PATCHES`), they are subsampled. This ensures even spatial coverage across the image. Each sampled edge point anchors a 64 × 64 px patch (`PATCH_SIZE`) centred on the edge pixel, which is Gaussian-denoised (kernel size 3, `DENOISE_GAUSSIAN_KERNEL`) before encoding.
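The edge-point sampling step can be sketched as follows; this is a simplified NumPy stand-in for the project's implementation, assuming a binary edge map from Canny as input:

```python
import numpy as np

def sample_edge_points(edge_map: np.ndarray, spacing: int = 15,
                       max_points: int = 2000) -> np.ndarray:
    """Greedily pick edge pixels at least `spacing` px apart, capped at
    `max_points` (mirroring the PATCH_SPACING / MAX_PATCHES defaults).

    edge_map: binary (H, W) array, nonzero at edge pixels.
    Returns an (K, 2) array of (x, y) coordinates.
    """
    ys, xs = np.nonzero(edge_map)
    points = np.stack([xs, ys], axis=1)
    kept: list[np.ndarray] = []
    for p in points:
        # keep p only if it is far enough from every already-kept point
        if all(np.hypot(*(p - q)) >= spacing for q in kept):
            kept.append(p)
        if len(kept) >= max_points:
            break
    return np.array(kept)

# Tiny synthetic edge map: a single horizontal edge along row 20
edges = np.zeros((40, 40), dtype=np.uint8)
edges[20, :] = 1
pts = sample_edge_points(edges, spacing=15)
# a 40 px row with 15 px spacing keeps x = 0, 15, 30 (3 points)
```

A production version would use a spatial grid rather than the quadratic greedy loop, but the behaviour, even coverage with a hard cap, is the same.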
DINOv2 (Oquab et al., 2023) is a self-supervised Vision Transformer trained with a self-distillation objective (DINO) on the LVD-142M dataset. We use the ViT-S/14 variant (21M parameters, 384-dimensional embeddings).
Each 64 × 64 patch is resized to 112 × 112 and split into non-overlapping 14 × 14 pixel tokens, giving $8 \times 8 = 64$ patch tokens; a learnable [CLS] token is prepended, yielding a sequence of 65 tokens.

Patch embedding. A linear projection $E$ maps each flattened pixel token $x_i \in \mathbb{R}^{14^2 \cdot 3}$ to a 384-dimensional embedding $z_i = E x_i$, where $z_0$ denotes the prepended [CLS] token.

Positional encoding. Learnable positional embeddings $e_i^{\text{pos}}$ are added to every token: $z_i \leftarrow z_i + e_i^{\text{pos}}$.
Transformer blocks. Each of the 12 transformer blocks applies multi-head self-attention (MHSA) and a feed-forward network (FFN) with residual connections and layer normalization:

$$z' = z + \mathrm{MHSA}(\mathrm{LN}(z)), \qquad z'' = z' + \mathrm{FFN}(\mathrm{LN}(z'))$$
Multi-Head Self-Attention. For queries, keys, and values $Q$, $K$, $V$ obtained by linear projections of the token sequence, each head computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head key dimension; the head outputs are concatenated and linearly projected back to 384 dimensions.
Feature output. The [CLS] token from the final layer, a 384-dimensional vector, serves as the descriptor for the patch.
Images are normalized with ImageNet statistics, per-channel mean $(0.485, 0.456, 0.406)$ and standard deviation $(0.229, 0.224, 0.225)$, before being fed to DINOv2.
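A sketch of the normalization step in plain NumPy (the project presumably uses the equivalent torchvision transform):

```python
import numpy as np

# Standard ImageNet channel statistics used by DINOv2 preprocessing
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_patch(patch: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) float image in [0, 1] with ImageNet statistics."""
    return (patch - IMAGENET_MEAN) / IMAGENET_STD

# A pixel equal to the channel means maps to exactly zero:
x = np.full((112, 112, 3), IMAGENET_MEAN)
assert np.allclose(normalize_patch(x), 0.0)
```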
For a query image with features $\{q_i\}_{i=1}^{M}$ and a database image with features $\{d_j\}_{j=1}^{N}$, similarity is computed patch-wise.

Each query patch $q_i$ is matched to its most similar database patch by cosine similarity: $s_i = \max_j \frac{q_i \cdot d_j}{\|q_i\| \, \|d_j\|}$.

The average similarity across all query patches is $\bar{s} = \frac{1}{M} \sum_{i=1}^{M} s_i$.

The top-$k$ average similarity (with $k = 50$, `TOP_K_PATCHES`) averages only the $k$ highest $s_i$, which suppresses the influence of unmatched patches; this is the coarse ranking score.
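The coarse score can be sketched in a few lines of NumPy (illustrative, not the project's code):

```python
import numpy as np

def coarse_similarity(Q: np.ndarray, D: np.ndarray, k: int = 50) -> float:
    """Top-k average of best-match cosine similarities (coarse ranking score).

    Q: (M, 384) query patch features; D: (N, 384) database patch features.
    k mirrors the documented TOP_K_PATCHES default of 50.
    """
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    best = (Qn @ Dn.T).max(axis=1)            # s_i = max_j cos(q_i, d_j)
    top = np.sort(best)[-min(k, len(best)):]  # keep the k most confident matches
    return float(top.mean())

rng = np.random.default_rng(0)
Q = rng.normal(size=(200, 384))
score_self = coarse_similarity(Q, Q)  # identical feature sets: every s_i is 1
```

Normalizing once and using a single matrix product keeps the all-pairs similarity computation a fast GEMM even for thousands of patches.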
After feature matching, RANSAC (Random Sample Consensus) verifies geometric consistency between matched point pairs. This filters out incorrect matches caused by visual aliasing.
Given a set of putative correspondences $\{(x_i, x_i')\}$, RANSAC repeats the following loop:

- Sample a minimal subset of correspondences (7 for the fundamental matrix, 4 for a homography, 3 for an affine model).
- Fit a geometric model from the minimal sample.
- Score the model by counting inliers — correspondences whose reprojection error is below a threshold $\tau$.
- Repeat for $K$ iterations, keeping the model with the most inliers.
The number of iterations required to achieve confidence $p$ with inlier ratio $w$ and minimal sample size $m$ is

$$K = \frac{\log(1 - p)}{\log(1 - w^m)}$$
In this pipeline, the defaults are $\tau = 5$ px, $p = 0.999$, a cap of $K = 2000$ iterations, and a minimum of 8 inliers to accept a match.
The fundamental matrix $F \in \mathbb{R}^{3 \times 3}$ encodes the epipolar geometry between two uncalibrated views: corresponding homogeneous points satisfy $x'^\top F x = 0$.
This constrains each point to lie on the epipolar line of its correspondence, making it the appropriate model for general 3D scenes with unknown camera motion; $F$ has 7 degrees of freedom.
The homography $H \in \mathbb{R}^{3 \times 3}$ maps points directly between the two images, $x' \sim H x$,
with 8 degrees of freedom. Appropriate when the scene is approximately planar.
The affine model maps $x' = A x + t$, where $A \in \mathbb{R}^{2 \times 2}$ and $t \in \mathbb{R}^2$,
with 6 degrees of freedom. A simpler model useful when perspective effects are minimal.
A match is classified as geometrically verified if the number of RANSAC inliers exceeds the minimum threshold (8). This confirms the query and database images share a consistent spatial relationship.
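The iteration formula above translates directly to code; `ransac_iterations` is an illustrative helper, not part of the project:

```python
import math

def ransac_iterations(confidence: float, inlier_ratio: float,
                      sample_size: int) -> int:
    """Iterations needed so that, with probability `confidence`, at least one
    minimal sample is all-inlier: K = log(1 - p) / log(1 - w^m)."""
    return math.ceil(math.log(1 - confidence)
                     / math.log(1 - inlier_ratio ** sample_size))

# Fundamental matrix (m = 7) with 50% inliers at the pipeline's default
# confidence of 0.999 -- comfortably below the 2000-iteration cap:
k = ransac_iterations(0.999, 0.5, 7)
```

This is why the smaller minimal sample of the affine model (3) pays off at low inlier ratios: $w^m$ shrinks exponentially in $m$, so fewer correspondences per sample means far fewer iterations for the same confidence.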
SelMA/
├── src/
│ ├── main.py # Pipeline entry point
│ ├── config/
│ │ ├── __init__.py
│ │ ├── settings.py # Default parameters
│ │ └── cli.py # Command-line argument parsing
│ ├── dataHandlers/
│ │ ├── __init__.py
│ │ ├── dataset.py # Image loading and directory listing
│ │ ├── output.py # Result persistence (CSV + image copies)
│ │ └── visualization.py # Match, RANSAC, spatial, and top-10 visualizations
│ ├── GeometryFuncs/
│ │ ├── __init__.py
│ │ ├── edges.py # Canny edge detection + patch extraction
│ │ └── denoise.py # Gaussian / Non-Local Means denoising
│ ├── ModelFuncs/
│ │ ├── __init__.py
│ │ ├── feature_extractor.py # DINOv2 ViT-S/14 wrapper + SIFT edge extractor
│ │ ├── matcher.py # Cosine similarity, SIFT matching, ranking
│ │ └── match_filter.py # Pre-RANSAC match filtering
│ ├── ransac/
│ │ ├── __init__.py
│ │ └── geometric_filter.py # RANSAC (fundamental / homography / affine)
│ ├── calibration/
│ │ ├── __init__.py
│ │ ├── calibrate.py # Camera calibration from checkerboard images
│ │ └── colmap_parser.py # COLMAP binary model parser
│ ├── benchmark/
│ │ ├── __init__.py
│ │ ├── dataset.py # BenchmarkScene: COLMAP/HDF5/JSON calibration loader
│ │ ├── evaluate.py # Benchmark evaluation runner
│ │ └── metrics.py # Pose estimation, mAA, rotation/translation error
│ └── evaluation/
│ ├── __init__.py
│ ├── dataset.py # Evaluation scene loader
│ ├── evaluate.py # Benchmark runner (pair evaluation + visualization)
│ └── metrics.py # Pose estimation and mAA metrics
├── scripts/
│ ├── create_test_scene.py # Generate synthetic benchmark scene
│ └── download_benchmark.py # Download phototourism benchmark data
├── demo_results/ # Sample output visualizations
├── images_upright/
│ ├── db/ # Database images (4,479)
│ └── query/ # Query images (947)
│ ├── day/ # Daytime queries
│ └── night/ # Nighttime queries
├── output/ # Timestamped result folders
├── requirements.txt # Pinned Python dependencies
└── readme.md # This document
All default parameters are defined in src/config/settings.py and can be overridden via CLI flags where noted.
| Parameter | Default | Description |
|---|---|---|
| `CANNY_LOW_THRESH` | 50 | Lower hysteresis threshold for Canny |
| `CANNY_HIGH_THRESH` | 150 | Upper hysteresis threshold for Canny |
| `PATCH_SIZE` | 64 | Side length of extracted patches (px) |
| `PATCH_SPACING` | 15 | Minimum spacing between sampled edge points (px) |
| `MAX_PATCHES` | 2000 | Maximum patches retained per image |
| `DENOISE_METHOD` | `"gaussian"` | `"gaussian"` or `"nlmeans"` |
| `DENOISE_GAUSSIAN_KERNEL` | 3 | Gaussian blur kernel size |
| Parameter | Default | Description |
|---|---|---|
| `DINO_MODEL_NAME` | `"dinov2_vits14"` | Model variant (ViT-S/14, 21M params) |
| `DINO_INPUT_SIZE` | 112 | Input resolution (must be divisible by 14) |
| `DINO_BATCH_SIZE` | 128 | GPU batch size for feature extraction |
| `DEVICE` | `None` | Compute device; auto-detects CUDA if available |
| Parameter | Default | Description |
|---|---|---|
| `TOP_K_PATCHES` | 50 | Top-$k$ patches used for coarse ranking (CLI: not exposed) |
| `SHORTLIST_SIZE` | 100 | Candidates forwarded to geometric re-ranking |
| `TOP_K_VISUALIZE` | 10 | Candidates shown in the top-N grid visualization |
| `HEURISTIC_W_SIM` | 0.5 | Heuristic weight: cosine similarity |
| `HEURISTIC_W_INLIER_RATIO` | 0.35 | Heuristic weight: RANSAC inlier ratio |
| `HEURISTIC_W_RANSAC_PASS` | 0.15 | Heuristic weight: RANSAC pass bonus |
| Parameter | Default | CLI flag | Description |
|---|---|---|---|
| `RANSAC_METHOD` | `"fundamental"` | `--ransac-method` | Geometric model type |
| `RANSAC_REPROJ_THRESH` | 5.0 | `--ransac-reproj` | Reprojection error threshold (px) |
| `RANSAC_MAX_ITERS` | 2000 | `--ransac-iters` | Maximum iterations |
| `RANSAC_CONFIDENCE` | 0.999 | `--ransac-confidence` | Confidence level |
| `RANSAC_MIN_INLIERS` | 8 | `--ransac-min-inliers` | Minimum inliers to accept |
SelMA includes a pose estimation benchmark mode that evaluates matching quality against ground-truth camera poses from calibrated multi-view datasets (e.g., Phototourism scenes with COLMAP reconstructions).
```bash
cd src
python main.py --benchmark --benchmark-scene ../data/phototourism/reichstag \
    --benchmark-max-pairs 100 --descriptor sift
```

The benchmark:
- Extracts features for each image using the SelMA pipeline (edge detection → SIFT at edge keypoints with CLAHE preprocessing and RootSIFT normalization).
- Matches pairs using bidirectional ratio-tested FLANN matching.
- Estimates poses via dual USAC_MAGSAC + USAC_PROSAC essential matrix estimation with Sampson error model selection.
- Computes metrics — rotation error, translation error, and mAA (mean Average Accuracy at 1°–10° thresholds).
- Generates visualizations:
- kpmatches/ — Every scene image with all postprocessed keypoints drawn, showing the edge-selective filtering output.
- topk/ — For each image, an all-vs-all matching grid showing its top-5 best-matching images ranked by match count, with inlier ratio, descriptor similarity, and pass/fail annotations.
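The discrete mAA metric can be sketched as follows (illustrative; the project's `metrics.py` may differ in details, e.g. the assumption here that the pose error per pair is the maximum of rotation and translation error):

```python
import numpy as np

def discrete_maa(pose_errors_deg, thresholds=range(1, 11)) -> float:
    """Mean Average Accuracy: fraction of pairs whose pose error (assumed here
    to be max(rotation error, translation error), in degrees) falls under each
    threshold, averaged over thresholds 1..10 degrees."""
    errs = np.asarray(pose_errors_deg, dtype=float)
    accs = [(errs < t).mean() for t in thresholds]
    return float(np.mean(accs))

# Four evaluated pairs with pose errors of 0.5, 3, 8, and 40 degrees:
maa = discrete_maa([0.5, 3.0, 8.0, 40.0])
# 0.5 deg passes all 10 thresholds, 3 deg passes 7, 8 deg passes 2, 40 deg none
# -> (10 + 7 + 2 + 0) / (4 pairs * 10 thresholds) = 0.475
```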
| Parameter | Value | Description |
|---|---|---|
| `SIFT_CONTRAST_THRESH` | 0.02 | SIFT contrast threshold (lower = more keypoints) |
| `SIFT_RATIO_THRESH` | 0.75 | Lowe's ratio test threshold |
| `KEYPOINT_MAX_CORNERS` | 3000 | Maximum keypoints per image |
| `KEYPOINT_EDGE_PROXIMITY` | 15 | Minimum distance from image border (px) |
| `MATCH_USE_MNN` | `True` | Mutual nearest neighbour (bidirectional) matching |
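A minimal sketch of mutual nearest-neighbour matching with Lowe's ratio test, the strategy that `MATCH_USE_MNN` and `SIFT_RATIO_THRESH` configure (a brute-force NumPy stand-in for the FLANN matcher):

```python
import numpy as np

def mutual_ratio_matches(d1: np.ndarray, d2: np.ndarray, ratio: float = 0.75):
    """Bidirectional matching with Lowe's ratio test.

    d1: (M, dim) and d2: (N, dim) descriptor arrays, N >= 2.
    Returns a list of (i, j) index pairs surviving both filters.
    """
    dists = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)  # (M, N)
    matches = []
    for i in range(dists.shape[0]):
        order = np.argsort(dists[i])
        best, second = order[0], order[1]
        # Lowe's ratio test: best match must be clearly better than runner-up
        if dists[i, best] < ratio * dists[i, second]:
            # mutual check: i must also be the nearest neighbour of `best`
            if np.argmin(dists[:, best]) == i:
                matches.append((i, int(best)))
    return matches

# Three well-separated descriptors and one distractor:
d1 = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
d2 = np.array([[10.0, 0.0], [0.0, 0.0], [0.0, 10.0], [100.0, 100.0]])
# -> [(0, 1), (1, 0), (2, 2)]
```

The mutual check discards one-sided matches caused by repetitive structure, which is exactly the failure mode perceptual aliasing produces.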
This pipeline is evaluated on the Aachen Day-Night v1.1 benchmark (Sattler et al., 2018), a standard dataset for visual localisation under extreme illumination variation. It comprises 4,479 reference database images and 947 query images captured across daytime and nighttime conditions using multiple devices (Milestone, Nexus 4, Nexus 5x). The day-to-night setting is particularly challenging as it tests the robustness of image representations to fundamental appearance changes where texture and colour cues become unreliable.
SelMA is additionally evaluated on the Phototourism stereo benchmark (Reichstag scene, 75 calibrated images) from the Image Matching Benchmark (Jin et al., 2021). The metric is mAA — mean Average Accuracy over pose error thresholds from 1° to 10°.
All scores below use qt_auc@10° (area under the pose-accuracy curve, 0°–10°), the standard metric reported by the Image Matching Benchmark. SelMA's discrete mAA (mean of accuracy at 1°, 2°, …, 10°) is 0.728–0.753, but we report qt_auc here for a fair comparison.
| Method | Type | qt_auc@10° | Pairs | Source |
|---|---|---|---|---|
| Kaggle IMC 2022 Winner | Learned ensemble | ~0.86 | — | Kaggle |
| SelMA (ours) | Classical | 0.704 | 100 | This work |
| SP+DISK+SuperGlue 8K | Learned | 0.640 | ~4500 | IMW 2021 #1 |
| RootSIFT + DEGENSAC | Classical | 0.620 | ~4500 | IMB baseline |
| SP+SuperGlue (ss-dpth) | Learned | 0.597 | ~4500 | IMW 2021 #4 |
SelMA per-threshold pose accuracy on Reichstag (fraction of pairs with pose error below each threshold):

| 1° | 2° | 3° | 4° | 5° | 6° | 7° | 8° | 9° | 10° |
|---|---|---|---|---|---|---|---|---|---|
| 0.33 | 0.53 | 0.67 | 0.76 | 0.80 | 0.81 | 0.83 | 0.83 | 0.86 | 0.86 |
Our headline mAA numbers are not directly comparable to published benchmark scores. We report them transparently below so that readers can assess the results fairly.
1. Different metric definitions. Our mAA is the mean of accuracy at 10 discrete thresholds (1°, 2°, …, 10°). The official Image Matching Benchmark reports qt_auc — the area under the continuous accuracy curve via trapezoidal integration. On an apples-to-apples qt_auc@10° basis, SelMA scores 0.704 vs the published RootSIFT+DEGENSAC baseline of 0.620 — a +13.5% improvement, but smaller than the discrete mAA gap suggests.
2. Different pair selection protocol. The official benchmark uses pair lists derived from 3D point covisibility, which deliberately includes challenging pairs with low visual overlap. Our dataset (data/phototourism/reichstag) does not include the official pair files (new-vis-pairs/), so we generate pairs from camera geometry (baseline distance < 3× median, viewing angle < 60°). This filter is mild — only 10.3% of possible pairs are excluded — but the resulting pair distribution may still differ from the official test set.
3. Different pair counts. We evaluate on 100 randomly sampled pairs (seed = 42). The official benchmark evaluates on the full set of covisibility pairs (~4500 for reichstag). With only 100 pairs, a few hard or easy pairs can shift the score by ±0.02.
4. Stochastic variance. MAGSAC/PROSAC geometric estimation is non-deterministic. Across runs on the same 100 pairs, we observe mAA ranging from 0.728 to 0.753 (±0.013).
5. Single scene vs multi-scene averages. Published scores of 0.504–0.640 are typically averaged across all 9 Phototourism scenes, which include harder scenes (e.g., St. Peter's Basilica, Sacre Coeur). Reichstag is a structured building with repetitive texture, making it relatively easier for SIFT-based methods.
The improvement over RootSIFT+DEGENSAC is genuine — our edge-selective keypoint filtering, dual MAGSAC+PROSAC estimation, and Sampson error model selection produce better pose estimates. However, the magnitude of the improvement is best understood through the comparable qt_auc@10° metric: 0.704 (SelMA) vs 0.620 (baseline), an improvement of approximately 13.5% on the Reichstag scene.
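The difference between the two metric definitions can be reproduced from the per-threshold accuracy table above. Note the published qt_auc integrates the continuous accuracy curve, so this 1°-step discretization (anchored at zero accuracy for a 0° threshold) is only illustrative:

```python
# Per-threshold accuracies for SelMA on Reichstag (from the table above)
accs = [0.33, 0.53, 0.67, 0.76, 0.80, 0.81, 0.83, 0.83, 0.86, 0.86]

# Discrete mAA: plain mean of the ten threshold accuracies
discrete = sum(accs) / len(accs)  # 0.728, matching the reported discrete mAA

# qt_auc-style score: trapezoidal area under the accuracy curve on [0, 10],
# normalized by the 10-degree range
ys = [0.0] + accs
auc = sum((ys[i] + ys[i + 1]) / 2 for i in range(10)) / 10
```

The AUC variant is pulled down by the zero anchor and the low-threshold region, which is why the discrete mean always reads higher than the integrated score on the same data.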
- Jin, Y., et al. (2021). "Image Matching across Wide Baselines: From Paper to Practice." IJCV.
- Sattler, T., et al. (2018). "Benchmarking 6DoF Outdoor Visual Localization in Changing Conditions." CVPR.
- Oquab, M., et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision." arXiv:2304.07193.
- Barath, D., et al. (2020). "MAGSAC++, a Fast, Reliable and Accurate Robust Estimator." CVPR.
- Chum, O. & Matas, J. (2005). "Matching with PROSAC — Progressive Sample Consensus." CVPR.