This repository serves as a collection of educational content (documents, code, blog posts, ...) related to the visual grounding problem. The project aims to develop a visual grounding architecture that detects objects in images from natural language descriptions, using OpenAI's pretrained CLIP model for text-image similarity and the YOLOv5 model for object detection.
At this link, you can find a first draft of the report describing the project and the experiments that have not yet been merged.
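As a rough illustration of how the two pretrained models can be combined, here is a minimal sketch assuming the `openai/clip` pip package and the `ultralytics/yolov5` Torch Hub entry point; the model variants, whole-box cropping, and scoring logic are illustrative choices, not the repository's final implementation:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")     # region proposals
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # text-image similarity

def ground(image_path: str, query: str):
    """Return the detected box whose crop best matches the query text."""
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path).xyxy[0][:, :4].tolist()       # [x1, y1, x2, y2] per box
    if not boxes:
        return None                                            # no candidate regions
    crops = torch.stack(
        [preprocess(image.crop(tuple(map(int, b)))) for b in boxes]
    ).to(device)
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(crops)
        txt_feat = clip_model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)           # cosine similarity per box
    return boxes[scores.argmax().item()]
```

Scoring whole-box crops with CLIP is the simplest baseline; the notes below discuss tighter fusion between the two models.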
To set up the project, follow these steps:
- Clone the repository: `git clone https://github.com/jackb0t/visual-grounding.git`
- Create a virtual or conda environment: `conda create --name DL python=3.10`
- Install the necessary libraries: `pip install -r requirements.txt`
- Download and extract the images: `python download_images.py`
- Layers to Add (see the cross-attention sketch below):
  - Attention Mechanisms: to focus on relevant parts of the image or text.
  - Additional Transformer Layers: to capture more complex relationships.
  - Fully Connected Layers: for additional complexity and non-linearity.
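As referenced in the list above, here is a minimal PyTorch sketch of a cross-attention block in which text tokens attend over image patch features; the embedding dimension and head count are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text tokens attend over image patches."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats: (B, T, dim) token embeddings; image_feats: (B, P, dim) patches
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        x = self.norm(text_feats + attended)  # residual connection around attention
        return x + self.ffn(x)                # residual connection around the FFN
```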
- Layers to Freeze (see the freezing sketch below):
  - Early Layers: these capture basic features like edges and textures, which are generally useful and safe to keep frozen.
  - Late Layers: these are more task-specific; rather than freezing them, they are good candidates for fine-tuning.
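A sketch of this freezing strategy, assuming the `clip` package's ViT-B/32 visual encoder (12 transformer blocks); the split point of six blocks is an illustrative choice:

```python
import clip

model, _ = clip.load("ViT-B/32")

# Freeze everything first...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze the last 6 of the 12 visual transformer blocks for
# fine-tuning (the split point is illustrative, not a recommendation).
for block in model.visual.transformer.resblocks[6:]:
    for p in block.parameters():
        p.requires_grad = True
```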
- Loss Function (see the contrastive-loss sketch below):
  - Contrastive Loss: good for similarity tasks.
  - Cross-Entropy Loss: if you have labeled data for each image-text pair.
  - Mean Squared Error (MSE): for regression-like tasks.
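Since similarity is the focus here, a minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) loss over a batch of matched image-text embedding pairs; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE: matched pairs lie on the diagonal of the similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```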
| Decision | Pros | Cons |
|---|---|---|
| Add Attention (self-attention, cross-attention) | Better focus on relevant features | Increased complexity |
| Add Transformer Layers | More expressive power | Risk of overfitting |
| Add Fully Connected Layers | More complexity and non-linearity | Increased number of parameters |
| Freeze Early Layers | Faster training | May lose generality |
| Freeze Late Layers | Retain high-level features | May not adapt well to new task |
- Layers to Add (a combined sketch of these decisions follows the list):
  - Add an attention mechanism to the vision model to focus on relevant image regions.
  - Add a fully connected layer after the text model for additional complexity.
- Layers to Freeze:
  - Freeze the early layers of both the vision and text models.
  - Fine-tune the late layers and any newly added layers.
- Loss Function:
  - Given our focus on similarity metrics, Contrastive Loss is a suitable choice.
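Putting these decisions together, a hedged sketch of the adapted model; the `GroundingHead` wrapper and its internals are hypothetical names, and only the `clip` calls follow a real API:

```python
import torch
import torch.nn as nn
import clip

class GroundingHead(nn.Module):
    """Hypothetical wrapper combining the decisions above: frozen CLIP backbone,
    cross-attention on the vision side, and an added FC block after the text model."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32")
        for p in self.clip_model.parameters():
            p.requires_grad = False  # freeze; late layers could be selectively re-enabled
        self.vision_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # added attention
        self.text_fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, region_feats: torch.Tensor, text_tokens: torch.Tensor):
        # region_feats: (B, R, dim) precomputed CLIP embeddings of candidate regions
        txt = self.clip_model.encode_text(text_tokens).float()  # (B, dim)
        txt = self.text_fc(txt).unsqueeze(1)                    # added FC layer, (B, 1, dim)
        attended, _ = self.vision_attn(txt, region_feats, region_feats)
        return (attended.squeeze(1) * txt.squeeze(1)).sum(-1)   # one score per pair
```

Trained with the contrastive loss above, the score of a matched image-text pair should rise above the scores of the mismatched pairs in the batch.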
- papers: we store selected publications relevant to the problem that we decided to discuss together. (preferred format: {year-of-publication ; title ; quick description/abstract})
- code: we store repositories and implementations from which to draw inspiration or to incorporate into the project.
- blogs: we store blog posts, videos, and various other sources.
- 2022; YORO -- Lightweight End to End Visual Grounding
  - A multi-modal, encoder-only transformer architecture for the Visual Grounding (VG) task, seeking a better trade-off between speed and accuracy by embracing a single-stage design without a CNN backbone. A patch-text alignment loss is proposed.
- 2023; CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
  - The paper proposes a novel method called CLIP-VG that solves the visual grounding problem with VLP models, realizing unsupervised transfer learning on downstream grounding tasks. A self-paced curriculum adapting of CLIP is conducted by exploiting pseudo-language labels to solve the VG problem.
- OWL-ViT
  - OWL-ViT is a zero-shot, text-conditioned object detection model that uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features, enabling open-vocabulary classification. Detection can be conditioned on one or multiple text queries per image (usage sketch below).
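A short usage sketch, assuming the Hugging Face `transformers` implementation of OWL-ViT; the checkpoint name is a commonly published one, and the file name, queries, and threshold are illustrative:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")        # illustrative file name
queries = ["a red car", "a person on a bicycle"]       # one or more text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes to (score, label, box) at the original image size.
target_sizes = torch.tensor([image.size[::-1]])        # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[int(label)]}: {score:.2f} at {box.tolist()}")
```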