
Referring Expression Comprehension

This repository is intended as a container for various educational content (documents, code, blogs, ...) related to the visual grounding problem. The project aims to develop a visual grounding architecture for object detection in images based on natural language descriptions. It uses the pretrained CLIP model from OpenAI for text-image similarity and the YOLOv5 model for object detection.
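As a rough illustration of how these two pretrained pieces can fit together, the sketch below has YOLOv5 propose candidate boxes and CLIP score each crop against the referring expression. This is a minimal sketch assuming PyTorch, the openai-clip package, and the ultralytics/yolov5 hub model; it is not the repository's actual pipeline.

```python
# Minimal sketch of a CLIP + YOLOv5 grounding pipeline (illustrative, not this repo's code).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained components: YOLOv5 for region proposals, CLIP for text-image similarity.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def ground(image_path: str, expression: str):
    image = Image.open(image_path).convert("RGB")
    detections = detector(image_path).xyxy[0]  # rows: (x1, y1, x2, y2, conf, class)

    text = clip.tokenize([expression]).to(device)
    best_box, best_score = None, -1.0
    for x1, y1, x2, y2, conf, cls in detections.tolist():
        box = tuple(map(int, (x1, y1, x2, y2)))
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        with torch.no_grad():
            image_feat = clip_model.encode_image(crop)
            text_feat = clip_model.encode_text(text)
        score = torch.cosine_similarity(image_feat, text_feat).item()
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score
```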

Report draft

At this link, you can find a first draft of the report describing the project and the experiments that have yet to be merged.

Project Setup

To set up the project, follow these steps (a quick sanity check is sketched after the list):

  1. Clone the Repository: Clone this repository to your local machine. (git clone https://github.com/jackb0t/visual-grounding.git)

  2. Create a virtual or conda environment (conda create --name DL python=3.10)

  3. Install the necessary libraries (pip install -r requirements.txt)

  4. Download and extract the images (python download_images.py)
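After these steps, a short script can confirm that the main dependencies load. This is a minimal sketch that assumes PyTorch and the openai-clip package are among the requirements, which may not match the actual contents of requirements.txt.

```python
# Quick environment check after setup (assumed dependencies, not an official project script).
import torch
import clip

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("Available CLIP models:", clip.available_models())
```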

Decisions to make

  1. Layers to Add:

    • Attention Mechanisms: To focus on relevant parts of the image or text.
    • Additional Transformer Layers: To capture more complex relationships.
    • Fully Connected Layers: For additional complexity and non-linearity.
  2. Layers to Freeze:

    • Early Layers: These capture basic features like edges and textures, which are generally useful.
    • Late Layers: These are more task-specific and are good candidates for fine-tuning.
  3. Loss Function (the options are sketched in code after this list):

    • Contrastive Loss: Good for similarity tasks.
    • Cross-Entropy Loss: If you have labeled data for each image-text pair.
    • Mean Squared Error (MSE): For regression-like tasks.
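To make the loss options concrete, the sketch below shows what each of them could look like for paired image/text embeddings. This is a minimal sketch assuming PyTorch; the shapes and the CLIP-style contrastive formulation are illustrative, not the project's training code.

```python
# Illustrative loss options for image/text embedding pairs (assumed shapes: [batch, dim]).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # CLIP-style symmetric InfoNCE: matching pairs sit on the diagonal of the logit matrix.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb), device=image_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def cross_entropy_loss(logits, labels):
    # If each image-text pair carries a discrete label (e.g. match / no match).
    return F.cross_entropy(logits, labels)

def mse_loss(predicted_boxes, target_boxes):
    # Regression-style objective, e.g. when directly predicting box coordinates.
    return F.mse_loss(predicted_boxes, target_boxes)
```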

Pros and Cons:

| Decision | Pros | Cons |
| --- | --- | --- |
| Add Attention (self-attention, cross-attention) | Better focus on relevant features | Increased complexity |
| Add Transformer Layers | More expressive power | Risk of overfitting |
| Add Fully Connected Layers | More complexity and non-linearity | Increased number of parameters |
| Freeze Early Layers | Faster training | May lose generality |
| Freeze Late Layers | Retain high-level features | May not adapt well to new task |

Recommendations:

  1. Layers to Add:

    • Add an attention mechanism to the vision model to focus on relevant image regions.
    • Add a fully connected layer after the text model for additional complexity.
  2. Layers to Freeze:

    • Freeze the early layers of both the vision and text models.
    • Fine-tune the late layers and any newly added layers.
  3. Loss Function:

    • Given the project's focus on similarity metrics, Contrastive Loss would be a suitable choice (a combined sketch follows this list).
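Put together, the recommendations could look roughly like the sketch below, assuming PyTorch and the openai-clip ViT-B/32 backbone. The freezing boundary, the cross-attention head, and the region-feature interface are illustrative choices, not the repository's actual implementation.

```python
# Sketch of the recommended modifications: a CLIP backbone with frozen early layers,
# an added cross-attention block over candidate regions, and an added FC layer on the
# text branch. Assumptions: openai-clip ViT-B/32 (512-d embeddings); "early" = first 6
# of 12 transformer blocks; region features come from CLIP-encoded crops.
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# 2. Freeze the early layers of both the vision and text transformers.
for blocks in (clip_model.visual.transformer.resblocks, clip_model.transformer.resblocks):
    for block in list(blocks)[:6]:
        for param in block.parameters():
            param.requires_grad = False

class GroundingHead(nn.Module):
    """Scores candidate image regions against a referring expression."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        # 1a. Added attention: the text query attends over the image regions.
        self.cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # 1b. Added fully connected layers after the text branch for extra capacity.
        self.text_fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, embed_dim))

    def forward(self, region_feats, text_feat):
        # region_feats: [B, R, D] CLIP features of candidate boxes; text_feat: [B, D]
        query = self.text_fc(text_feat).unsqueeze(1)                       # [B, 1, D]
        attended, weights = self.cross_attention(query, region_feats, region_feats)
        # Per-region similarity scores; the weights show which regions the text attends to.
        scores = torch.cosine_similarity(region_feats, attended, dim=-1)   # [B, R]
        return scores, weights
```

Such a head could then be trained with the contrastive loss sketched earlier, treating the ground-truth box as the positive region for each expression.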

Resources

  • papers: we store selected publications relevant to the problem, which we discuss together. (preferred format: {year-of-publication ; title ; quick description/abstract})

  • code: we store repositories and implementations from which to draw inspiration or to incorporate into the project.

  • blogs: we store blog posts, videos, and various other sources.


Papers


Code

  • OWL-ViT
    • OWL-ViT is a zero-shot, text-conditioned object detection model that uses CLIP as its multi-modal backbone, with a ViT-like Transformer for visual features and a causal language model for text features, enabling open-vocabulary classification. It can perform zero-shot text-conditioned object detection with one or multiple text queries per image.
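For a quick trial, OWL-ViT is available through the Hugging Face transformers library; the sketch below follows its documented zero-shot detection usage (the checkpoint name, image path, text queries, and threshold are example values):

```python
# Zero-shot text-conditioned detection with OWL-ViT via Hugging Face transformers.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
queries = [["a red umbrella", "a person on a bicycle"]]  # one list of queries per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and filter by score.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, threshold=0.1,
                                                  target_sizes=target_sizes)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][int(label)], [round(v, 1) for v in box.tolist()], round(score.item(), 3))
```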

Blogs (misc.)

