GUing is a GUI search engine based on a large vision-language model called GUIClip, which we trained specifically for the app GUI domain. To this end, we first collected app introduction images from Google Play, which usually display the most representative screenshots, selected and often captioned (i.e., labeled) by app vendors. We then developed an automated pipeline to classify, crop, and extract the captions from these images. The result is a large dataset, which we share with this paper, comprising 303k app screenshots, 135k of which have captions. We used this dataset to train a novel vision-language model which is, to the best of our knowledge, the first of its kind for GUI retrieval.
If you find our work useful, please cite our paper:

```bibtex
@misc{wei2024guing,
    title={GUing: A Mobile GUI Search Engine using a Vision-Language Model},
    author={Jialiang Wei and Anne-Lise Courbis and Thomas Lambolais and Binbin Xu and Pierre Louis Bernard and Gérard Dray and Walid Maalej},
    year={2024},
    eprint={2405.00145},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}
```
- Install [Poetry](https://python-poetry.org)
- Install the dependencies:
  ```bash
  poetry install
  ```
- Set the environment variables:
  ```bash
  export GUI_SEARCH_ROOT_PATH=<your project path>
  export MONGODB_URI=<your MongoDB URI>
  ```
You can find our GUIClip model, pretrained on the SCapRepo, Screen2Words, and Clarity datasets, at: https://huggingface.co/Jl-wei/guiclip-vit-base-patch32
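If the checkpoint follows the standard Hugging Face CLIP format (an assumption based on GUIClip being a CLIP ViT-B/32 model), it can be loaded with the `transformers` library. A minimal sketch, with the screenshot path and query text as placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("Jl-wei/guiclip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("Jl-wei/guiclip-vit-base-patch32")

image = Image.open("screenshot.png")  # placeholder path to an app screenshot
inputs = processor(text=["login screen with email field"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # higher score = better text-image match
```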
The code for GUIClip training and testing is in the `retrieval` folder. To train GUIClip, use the `train_clip.py` script. To evaluate the model, use the `test.py` script.
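For illustration, here is a minimal retrieval loop (a sketch, not the repo's `test.py`): embed a set of screenshots once, then rank them against a text query by cosine similarity. The gallery paths and query text are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("Jl-wei/guiclip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("Jl-wei/guiclip-vit-base-patch32")

paths = ["gallery/a.png", "gallery/b.png", "gallery/c.png"]  # placeholder screenshots
with torch.no_grad():
    image_inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    img_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["shopping cart checkout"], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**text_inputs)

# Normalize, then rank screenshots by cosine similarity to the query
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```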
The code for GUI classification with GUIClip is in the `more_applications/classification` folder.
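As an illustration of this use case, GUIClip can classify a screen zero-shot by scoring it against a set of label prompts. The labels below are made up for the example, not the ones used in the paper:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("Jl-wei/guiclip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("Jl-wei/guiclip-vit-base-patch32")

labels = ["login screen", "settings screen", "map screen"]  # example labels
inputs = processor(text=labels, images=Image.open("screenshot.png"),
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
print(dict(zip(labels, probs.tolist())))  # label -> probability
```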
The code for sketch-based GUI retrieval is in the `more_applications/sketch` folder.
The labels for the SCapRepo (Google Play Screenshot Caption) dataset are available at `dataset/google_play/captioning`. Due to their substantial size, we cannot provide the images directly. However, you can download them from Google Play using the IDs provided in the `data.jsonl` file and then crop them according to the specified bounding boxes.
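The reconstruction could look like the following sketch. The field names (`app_id`, `bbox`) and the bounding-box format are assumptions; check `data.jsonl` for the actual schema before use:

```python
import json
from pathlib import Path
from PIL import Image

Path("crops").mkdir(exist_ok=True)
with open("dataset/google_play/captioning/data.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumes the introduction image for record["app_id"] was already
        # downloaded from Google Play to downloads/<app_id>.png
        image = Image.open(f"downloads/{record['app_id']}.png")
        left, top, right, bottom = record["bbox"]  # assumed (left, top, right, bottom)
        image.crop((left, top, right, bottom)).save(f"crops/{record['app_id']}.png")
```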
The datasets used for training the image classifier (`dataset/google_play/classification`) and the screen cropper (`dataset/google_play/detection`) will be released upon the acceptance of our paper.
You can access the other datasets by downloading them from their respective websites.
The code for classification is in the `classification` folder. To train the classifier, use the `train_image_classification.py` script. To run the classifier on the whole dataset, use the `run_image_classification.py` script.
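The script names suggest the classifier is a standard `transformers` image-classification model; assuming that, single-image inference might look like this sketch (the checkpoint path and labels are placeholders):

```python
from transformers import pipeline

# Placeholder checkpoint path; the trained weights are not released yet
classifier = pipeline("image-classification", model="path/to/classifier-checkpoint")
predictions = classifier("app_intro_image.png")
print(predictions)  # e.g. [{"label": "screenshot", "score": 0.97}, ...]
```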
The code for cropping screens is in the `detection` folder. To train the detector, use the `train_object_detection.py` script. To run the detector on the whole dataset, use the `run_object_detection.py` script.
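Assuming the detector is likewise a `transformers` object-detection model (an assumption based on the script names), cropping the detected screens out of an introduction image could look like this sketch:

```python
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection", model="path/to/detector-checkpoint")  # placeholder
image = Image.open("app_intro_image.png")
for i, detection in enumerate(detector(image)):
    box = detection["box"]  # dict with xmin, ymin, xmax, ymax
    crop = image.crop((box["xmin"], box["ymin"], box["xmax"], box["ymax"]))
    crop.save(f"screen_{i}.png")
```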
The code for caption extraction is in the `ocr` folder. To run caption extraction on the whole dataset, use the `run_ocr.py` script.
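As a generic illustration of this step, with `pytesseract` as a stand-in (the OCR engine actually used by `run_ocr.py` may differ):

```python
from PIL import Image
import pytesseract  # stand-in OCR engine; requires a local Tesseract install

caption = pytesseract.image_to_string(Image.open("app_intro_image.png"))
print(caption.strip())
```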
The code for GUing, our GUI search engine, is in the `search_engine` folder. You need to add the data to MongoDB and create a Faiss index before using it.
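A minimal sketch of the indexing step, assuming the GUIClip image embeddings for all screenshots are available as a NumPy array (the file names are placeholders):

```python
import faiss
import numpy as np

embeddings = np.load("gui_embeddings.npy").astype("float32")  # shape (N, 512)
faiss.normalize_L2(embeddings)  # so inner product == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "gui.index")

# At query time: embed the text query with GUIClip, L2-normalize it, call
# scores, ids = index.search(query_vector, k), and look up the screenshot
# metadata for each id in MongoDB.
```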