This repository serves as a collection of educational content (documents, code, blog posts, ...) related to the visual grounding problem. The project aims to develop a visual grounding architecture that detects objects in images from natural language descriptions, using OpenAI's pretrained CLIP model for text-image similarity and the YOLOv5 model for object detection.
At this link, you can find a first draft of the report describing the project and the experiments that have not yet been merged.
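As a rough illustration of how the two pretrained models can be combined, here is a minimal sketch assuming the `openai/clip` pip package and the `ultralytics/yolov5` Torch Hub entry point; the model variants, whole-box cropping, and scoring logic are illustrative choices, not the repository's final implementation:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")     # region proposals
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # text-image similarity

def ground(image_path: str, query: str):
    """Return the detected box whose crop best matches the query text."""
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path).xyxy[0][:, :4].tolist()       # [x1, y1, x2, y2] per box
    if not boxes:
        return None                                            # no candidate regions
    crops = torch.stack(
        [preprocess(image.crop(tuple(map(int, b)))) for b in boxes]
    ).to(device)
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(crops)
        txt_feat = clip_model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)           # cosine similarity per box
    return boxes[scores.argmax().item()]
```

Scoring whole-box crops with CLIP is the simplest baseline; the notes below discuss tighter fusion between the two models.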
To set up the project, follow these steps:
- Clone the repository: `git clone https://github.com/jackb0t/visual-grounding.git`
- Create a virtual or conda environment: `conda create --name DL python=3.10`
- Install the necessary libraries: `pip install -r requirements.txt`
- Download and extract the images: `python download_images.py`
- Layers to Add (see the cross-attention sketch below):
  - Attention Mechanisms: to focus on relevant parts of the image or text.
  - Additional Transformer Layers: to capture more complex relationships.
  - Fully Connected Layers: for additional complexity and non-linearity.
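As referenced in the list above, here is a minimal PyTorch sketch of a cross-attention block in which text tokens attend over image patch features; the embedding dimension and head count are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text tokens attend over image patches."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats: (B, T, dim) token embeddings; image_feats: (B, P, dim) patches
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        x = self.norm(text_feats + attended)  # residual connection around attention
        return x + self.ffn(x)                # residual connection around the FFN
```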
- Layers to Freeze (see the freezing sketch below):
  - Early Layers: these capture basic features like edges and textures, which are generally useful and safe to keep frozen.
  - Late Layers: these are more task-specific; rather than freezing them, they are good candidates for fine-tuning.
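A sketch of this freezing strategy, assuming the `clip` package's ViT-B/32 visual encoder (12 transformer blocks); the split point of six blocks is an illustrative choice:

```python
import clip

model, _ = clip.load("ViT-B/32")

# Freeze everything first...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze the last 6 of the 12 visual transformer blocks for
# fine-tuning (the split point is illustrative, not a recommendation).
for block in model.visual.transformer.resblocks[6:]:
    for p in block.parameters():
        p.requires_grad = True
```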
- Loss Function (see the contrastive-loss sketch below):
  - Contrastive Loss: good for similarity tasks.
  - Cross-Entropy Loss: if you have labeled data for each image-text pair.
  - Mean Squared Error (MSE): for regression-like tasks.
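Since similarity is the focus here, a minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over a batch of matched image-text embedding pairs; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: matched pairs lie on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```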
| Decision | Pros | Cons |
|---|---|---|
| Add Attention (self-attention, cross-attention) | Better focus on relevant features | Increased complexity |
| Add Transformer Layers | More expressive power | Risk of overfitting |
| Add Fully Connected Layers | More complexity and non-linearity | Increased number of parameters |
| Freeze Early Layers | Faster training | May lose generality |
| Freeze Late Layers | Retain high-level features | May not adapt well to new task |
- Layers to Add (a combined sketch of these decisions follows the list):
  - Add an attention mechanism to the vision model to focus on relevant image regions.
  - Add a fully connected layer after the text model for additional complexity.
- Layers to Freeze:
  - Freeze the early layers of both the vision and text models.
  - Fine-tune the late layers and any newly added layers.
- Loss Function:
  - Given our focus on similarity metrics, Contrastive Loss is a suitable choice.
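Putting these decisions together, a hedged sketch of the adapted model; the `GroundingHead` wrapper and its internals are hypothetical names, and only the `clip` calls follow a real API:

```python
import torch
import torch.nn as nn
import clip

class GroundingHead(nn.Module):
    """Hypothetical wrapper combining the decisions above: frozen CLIP backbone,
    cross-attention on the vision side, and an added FC block after the text model."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32")
        for p in self.clip_model.parameters():
            p.requires_grad = False  # freeze; late layers could be selectively re-enabled
        self.vision_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # added attention
        self.text_fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, region_feats: torch.Tensor, text_tokens: torch.Tensor):
        # region_feats: (B, R, dim) precomputed CLIP embeddings of candidate regions
        txt = self.clip_model.encode_text(text_tokens).float()  # (B, dim)
        txt = self.text_fc(txt).unsqueeze(1)                    # added FC layer, (B, 1, dim)
        attended, _ = self.vision_attn(txt, region_feats, region_feats)
        return (attended.squeeze(1) * txt.squeeze(1)).sum(-1)   # one score per pair
```

Trained with the contrastive loss above, the score of a matched image-text pair should rise above the scores of the mismatched pairs in the batch.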
- papers: we store selected publications relevant to the problem that we decided to discuss together. (preferred format: {year-of-publication ; title ; quick description/abstract})
- code: we store repositories and implementations from which to draw inspiration or to incorporate into the project.
- blogs: we store blog posts, videos, and various other sources.
- 2022; YORO -- Lightweight End to End Visual Grounding
  - A multi-modal, encoder-only transformer architecture for the Visual Grounding (VG) task, seeking a better trade-off between speed and accuracy by embracing a single-stage design without a CNN backbone. A patch-text alignment loss is proposed.
- 2023; CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
  - The paper proposes a novel method called CLIP-VG that solves the visual grounding problem with VLP models, realizing unsupervised transfer learning on downstream grounding tasks. A self-paced curriculum adapting of CLIP is conducted by exploiting pseudo-language labels to solve the VG problem.
- OWL-ViT
  - OWL-ViT is a zero-shot, text-conditioned object detection model that uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features, enabling open-vocabulary classification. Detection can be conditioned on one or multiple text queries per image (usage sketch below).
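A short usage sketch, assuming the Hugging Face `transformers` implementation of OWL-ViT; the checkpoint name is a commonly published one, and the file name, queries, and threshold are illustrative:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")        # illustrative file name
queries = ["a red car", "a person on a bicycle"]       # one or more text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to (score, label, box) at the original image size.
target_sizes = torch.tensor([image.size[::-1]])        # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[int(label)]}: {score:.2f} at {box.tolist()}")
```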