Semantic search for your memes, stickers, GIFs, and reaction images.
Have you ever struggled to find that perfect sticker or GIF for a group chat? Or that selfie you took with your besties on a trip a few years ago? Well, I have. That is why I built GIFZA.
GIFZA (GIF - ZAH) is a local-first semantic search engine for visual assets. Upload an image, sticker, or GIF, annotate it naturally, and retrieve it later using meaning instead of exact filenames or manual tags.
Searches like:
- "sad cat staring into space"
- "strict asian obama"
- "low quality shocked reaction image"
- "anime girl crying aggressively"
...will retrieve semantically similar assets, even if those exact words were never used in the filename or tags.
- Local-first on-device inference
- Semantic image retrieval using vector embeddings
- Approximate nearest neighbour (ANN) search
- Natural language querying
- Shared embedding space between images and text
- Fast retrieval with ObjectBox vector search
- Offline-capable
- Native mobile inference with ExecuTorch
GIFZA uses Apple's MobileCLIP-S1 model, split into separate image and text encoders. Both encoders project their inputs into a shared embedding space, enabling natural language queries to retrieve visually and semantically related assets.
Search Flow:
- User enters a query (e.g., "confused cat at 3am")
- Text encoder converts query into an embedding vector
- ObjectBox runs ANN similarity search against stored image embeddings
- Nearest matching assets are returned instantly
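The search flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for real MobileCLIP embeddings, the `search` and `cosine_similarity` helpers are hypothetical names, and an exact brute-force scan replaces the ANN search that ObjectBox performs in the app.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, index, k=2):
    """Return the k asset paths whose embeddings are closest to the query.

    Brute-force scan over every stored vector; an ANN index would
    approximate this ranking without visiting every entry.
    """
    scored = [(cosine_similarity(query_embedding, emb), path) for path, emb in index]
    scored.sort(reverse=True)
    return [path for _, path in scored[:k]]

# Toy vectors standing in for stored image embeddings (assumed values).
index = [
    ("cats/confused.gif", [0.9, 0.1, 0.0]),
    ("dogs/happy.png", [0.0, 0.2, 0.9]),
    ("cats/sleepy.jpg", [0.7, 0.6, 0.1]),
]

# Stand-in for the text encoder's output for "confused cat at 3am".
query_embedding = [1.0, 0.2, 0.0]

results = search(query_embedding, index, k=2)
print(results)
```

With these toy values the two cat assets rank above the dog asset, because their vectors point in nearly the same direction as the query.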
┌──────────────────┐
│   Image / GIF    │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ MobileCLIP Image │
│     Encoder      │
└────────┬─────────┘
         │
  Image Embedding
         │
         ▼
┌──────────────────┐
│    ObjectBox     │
│  Vector Storage  │
└────────┬─────────┘
         ▲
     ANN Search
         │
  Query Embedding
         │
┌────────┴─────────┐
│ MobileCLIP Text  │
│     Encoder      │
└──────────────────┘
Both image and text modalities exist in the same latent vector space. This enables zero-shot cross-modal retrieval without requiring paired training data at inference time.
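One way to make the shared space concrete is to write down how a text query is scored against a stored image. The notation here is introduced only for illustration: $f_{\text{txt}}$ and $f_{\text{img}}$ stand for the two encoders, $q$ for the query text, and $x_i$ for the $i$-th stored asset. Cosine similarity is an assumption, since the distance metric configured in ObjectBox is not stated here.

$$
s(q, x_i) = \frac{f_{\text{txt}}(q) \cdot f_{\text{img}}(x_i)}{\lVert f_{\text{txt}}(q) \rVert \, \lVert f_{\text{img}}(x_i) \rVert}
$$

Under that assumption, the assets returned are the ones with the largest $s(q, x_i)$, and the comparison only makes sense because both encoders output vectors of the same dimension in the same space.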
Embeddings are stored in ObjectBox as key-value pairs alongside asset metadata. On search, the query text is embedded, and ObjectBox performs native ANN similarity search to return the closest image vectors.
So a typical pair of stored entries would look something like:
- Embedding of an image of a cat => image location on the user's file system (we do not duplicate their assets)
- Embedding of 'My cat' (the user's annotation) => the same image location
This means that even if the query does not match the image embedding, it can still match the annotation's embedding.
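A minimal sketch of that layout, assuming each asset contributes one record per embedding and that results are grouped by file path. The `StoredEmbedding` class, its `source` field, and the `search_assets` helper are hypothetical names for illustration, not GIFZA's actual ObjectBox schema, and the dot product on toy vectors stands in for the real similarity search.

```python
from dataclasses import dataclass

@dataclass
class StoredEmbedding:
    vector: list  # embedding of either the image or its annotation
    path: str     # location of the asset on the user's file system
    source: str   # "image" or "annotation" (hypothetical label)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def search_assets(query, records, k=5):
    """Rank asset paths by their best-scoring record (image or annotation)."""
    best = {}
    for rec in records:
        score = dot(query, rec.vector)
        if score > best.get(rec.path, float("-inf")):
            best[rec.path] = score
    return sorted(best, key=best.get, reverse=True)[:k]

# Two records point at the same cat photo; only one file exists on disk.
records = [
    StoredEmbedding([0.2, 0.1, 0.1], "/photos/cat.jpg", "image"),
    StoredEmbedding([0.1, 0.9, 0.0], "/photos/cat.jpg", "annotation"),
    StoredEmbedding([0.3, 0.2, 0.1], "/photos/dog.jpg", "image"),
]

# Toy query that sits near the annotation embedding but not the image one.
query = [0.0, 1.0, 0.0]
ranked = search_assets(query, records)
print(ranked)
```

Here the cat photo scores poorly on its image embedding but still ranks first through its annotation, and it appears only once in the results because both records share a path.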
Here's a pretty good visualization I made with Manim.
All inference runs locally. Both MobileCLIP encoders were converted from PyTorch to ExecuTorch (.pte) format for mobile execution. This enables:
- Fully offline retrieval
- Low-latency inference
- Private, local asset indexing
- Zero external API dependencies
But converting to ExecuTorch also brought about a problem: tokenization! Before generating text embeddings, we need to first tokenize and pad the query (or annotation), but we could not trace the tokenizer and pack it into the converted .pte model. The solution was to download the model's tokenizer.json and then use the Dart sentencepiece package for BPE tokenization.
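For readers unfamiliar with the tokenize-and-pad step, here is a rough Python sketch of the idea. It is not the app's Dart code: the merge table, vocabulary, token ids, and the `bpe_merge` and `pad_to_context` helpers are all made up for illustration. The real merges and ids come from the model's tokenizer.json, and the real context length and special-token ids should be read from the model's configuration.

```python
def bpe_merge(symbols, merge_ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none apply."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        ranked = [p for p in pairs if p in merge_ranks]
        if not ranked:
            break
        best = min(ranked, key=lambda p: merge_ranks[p])
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def pad_to_context(token_ids, context_length, start_id, end_id, pad_id=0):
    """Wrap ids in start/end markers, then truncate or pad to a fixed length."""
    ids = [start_id] + token_ids[: context_length - 2] + [end_id]
    return ids + [pad_id] * (context_length - len(ids))

# Toy merge table and vocabulary (hypothetical values, not MobileCLIP's).
merge_ranks = {("c", "a"): 0, ("ca", "t</w>"): 1}
vocab = {"cat</w>": 5, "c": 1, "a": 2, "t</w>": 3}

symbols = bpe_merge(["c", "a", "t</w>"], merge_ranks)
token_ids = [vocab[s] for s in symbols]

# The start/end ids and the length of 8 are placeholders chosen for readability.
padded = pad_to_context(token_ids, context_length=8, start_id=100, end_id=101)
print(symbols, padded)
```

The fixed-length output matters because an exported model is traced with a fixed input shape, so every query has to arrive as the same number of token ids.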
- Flutter
- ObjectBox
- ExecuTorch
- Apple MobileCLIP S1 OpenCLIP
First, run the install script at the project root:
chmod +x install.sh
./install.sh
And then run the conversion script:
python3 download_and_convert.py
Run the app!
flutter run --release
PS: release mode gives us a little better inference performance.
Some of the more interesting engineering hurdles included:
- Splitting MobileCLIP into separate, deployable encoders
- Achieving reasonable inference latency for the text encoder on mobile (still a challenge!)
- Converting PyTorch models to ExecuTorch without losing accuracy
- Handling unsupported tokenizer tracing and implementing manual BPE
- Maintaining embedding consistency across image and text modalities
- Integrating ANN search smoothly with ObjectBox
Anything you want me to know? Improvements I should make? You're welcome to file a PR or open an issue!


