
COCO Caption Classification

Open In Colab · Hugging Face Spaces

This project focuses on Multimodal Classification using the COCO dataset. It evaluates a CLIP-based model's ability to align visual semantics with complex linguistic context under Zero-shot and Few-shot learning.

Instead of standard categorical classification, the model is presented with an image and a set of $N$ candidate text descriptions ($N \ge 12$). The objective is to identify the one true ground-truth description, overcoming the $N-1$ distractors.
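Concretely, the task reduces to ranking the $N$ candidate captions by image-text similarity. Below is a minimal Zero-shot sketch using the openai/CLIP package; the image path and captions are placeholders for illustration, not the project's actual data:

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One image and N candidate captions (1 ground truth + N-1 distractors).
image = preprocess(Image.open("coco_subset_images/images/example.jpg")).unsqueeze(0).to(device)
captions = ["a dog catching a frisbee", "a bowl of fruit on a table", "a man riding a horse"]  # N >= 12 in practice
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokens)
    # Cosine similarity via normalized dot products.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    logits = img_emb @ txt_emb.T  # shape (1, N)

pred = logits.argmax(dim=-1).item()  # index of the predicted caption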


📂 Project Structure

Coco_caption_classification/
├── streamlit_app.py                 # Streamlit demo entry
├── requirements_streamlit.txt       # Streamlit dependencies
├── README.md                        # Documentation
├── coco_subset_images/              # Extracted image data + JSON
│   ├── coco_multimodal_subset.json  # COCO subset metadata
│   └── images/                      # Image files
├── docs/                            # Results landing page and assets
│   ├── index.html                   # Results dashboard
│   ├── images/                      # Figures and diagrams
│   └── plot/                        # Plotly JSON exports
│       ├── acc.json
│       ├── f1_macro.json
│       ├── f1_micro.json
│       ├── f1_weighted.json
│       └── time.json
├── models/                          # Checkpoints
│   ├── best_rn50_8shot.pth
│   └── best_vit_b32_8shot.pth
├── notebooks/                       # Training and experiment logs
│   ├── train_colab.ipynb            # Colab training notebook
│   └── wandb/                       # W&B run logs
├── settings/                        # Configuration and requirements
│   ├── config.yaml                  # Model and W&B settings
│   └── requirements.txt             # Primary environment targets
└── src/                             # Core code
    ├── __init__.py
    ├── data_loader.py               # Dataset objects, subset loading and splits
    ├── model_arch.py                # Wrapper models based on openai/CLIP
    ├── notebook_utils.py            # Notebook helpers
    ├── utils.py                     # Evaluation functions and counters
    └── Image_ID_Exclusive_Sampling.ipynb  # Data subset generation

🚀 How to Run

Development & Training

  1. Clone the repository to Google Drive or your local machine.
  2. Upload coco_multimodal_subset.json and the corresponding images to coco_subset_images/.
  3. Open notebooks/train_colab.ipynb in Colab.
  4. Mount your Drive and install the dependencies from settings/requirements.txt.
  5. Run the cells to train. The configuration can be modified in settings/config.yaml (a loading sketch follows these steps).
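The notebook reads its hyperparameters from settings/config.yaml. Here is a minimal sketch of how such a file is typically loaded; the key names (backbone, shots, wandb_project) are illustrative assumptions, not the file's confirmed schema:

import yaml  # pip install pyyaml

# Hypothetical keys for illustration; see settings/config.yaml for the real schema.
with open("settings/config.yaml") as f:
    cfg = yaml.safe_load(f)

backbone = cfg.get("backbone", "ViT-B/32")   # e.g. "ViT-B/32" or "RN50"
shots = cfg.get("shots", 8)                  # k in {0, 8}
wandb_project = cfg.get("wandb_project")     # W&B logging target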

Streamlit Demo (Local)

To launch the Streamlit demo:

pip install -r requirements_streamlit.txt
streamlit run streamlit_app.py

The demo supports Zero-shot and 8-shot inference with attention visualizations.
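As a rough illustration of how such attention maps can be produced for the RN50 backbone, here is a sketch using the third-party pytorch-grad-cam package; this is an assumption for illustration, and the demo's own visualization code may differ:

import numpy as np
import clip
from PIL import Image
from pytorch_grad_cam import EigenCAM                    # pip install grad-cam
from pytorch_grad_cam.utils.image import show_cam_on_image

# Load on CPU so the visual backbone runs in fp32 (CLIP uses fp16 on CUDA).
model, preprocess = clip.load("RN50", device="cpu")

img = Image.open("coco_subset_images/images/example.jpg").convert("RGB")
input_tensor = preprocess(img).unsqueeze(0)

# EigenCAM is gradient-free: it takes the principal component of the
# activations at the target layer, here the last ResNet stage.
cam = EigenCAM(model=model.visual, target_layers=[model.visual.layer4])
grayscale_cam = cam(input_tensor=input_tensor)[0]        # (H, W) in [0, 1]

rgb = np.asarray(img.resize((224, 224)), dtype=np.float32) / 255.0
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)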

Streamlit Demo (Hugging Face Spaces)

  1. Create a new Space and select the Streamlit SDK.
  2. Add the following files to the Space repo:
    • streamlit_app.py
    • requirements.txt (copy from requirements_streamlit.txt)
    • src/ and models/ (or download weights inside the app)
  3. Commit and push to trigger the build.

🔬 Core Workflow

  1. Vision Extractor: Encodes images with the preprocessing and normalization standard to the CLIP variants (ViT-B/32 or RN50).
  2. Text Extractor: Tokenizes the $N$ candidate captions and encodes them into embedding tensors.
  3. Linear Classification Probing: Applies cross-entropy over image-text dot-product similarities, evaluated at $k \in \{0, 8\}$ shots (a sketch follows this list).
  4. Analysis: Accuracy and F1 metrics (macro/micro/weighted), inference time, and GradCAM/EigenCAM map generation.
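A minimal sketch of step 3, the cross-entropy over dot-product similarities; the tensor shapes and scale factor are illustrative assumptions, and the project's actual wrapper lives in src/model_arch.py:

import torch
import torch.nn.functional as F

# Hypothetical shapes: each image is scored against its own N candidate captions.
# img_emb: (B, D) image features; txt_emb: (B, N, D) caption features; label: (B,)
def nway_loss(img_emb, txt_emb, label, scale=100.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Dot-product similarities between each image and its N candidates: (B, N)
    logits = scale * torch.einsum("bd,bnd->bn", img_emb, txt_emb)
    return F.cross_entropy(logits, label)

# Zero-shot (k=0): argmax over the logits with frozen CLIP features.
# Few-shot (k=8): the same logits supervise a lightweight trainable head.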

📊 Report and Info

Please refer to the GitHub Pages site or the provided YouTube links for the complete metric visualizations and a breakdown of the architectural diagrams.
