This project focuses on multimodal classification using the COCO dataset. It evaluates the model's ability to align visual semantics with complex linguistic context through Zero-shot and Few-shot learning. Instead of standard categorical classification, the model is presented with an image and a set of candidate captions, and it must select the caption that best describes the image.
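For intuition, here is a minimal zero-shot caption-ranking sketch using the openai/CLIP package (the project's own wrappers live in `src/model_arch.py`). The image path and captions below are placeholders, not project data:

```python
# Minimal zero-shot sketch with openai/CLIP
# (pip install git+https://github.com/openai/CLIP.git).
# "example.jpg" and the captions are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
captions = [
    "a dog catching a frisbee in a park",
    "a plate of food on a wooden table",
    "a group of people riding bicycles",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity via normalized dot products
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

best = probs.argmax(dim=-1).item()
print(f"Best caption: {captions[best]} (p={probs[0, best]:.3f})")
```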
The project is organized as follows:

```
Coco_caption_classification/
├── streamlit_app.py # Streamlit demo entry
├── requirements_streamlit.txt # Streamlit dependencies
├── README.md # Documentation
├── coco_subset_images/ # Extracted image data + JSON
│ ├── coco_multimodal_subset.json # COCO subset metadata
│ └── images/ # Image files
├── docs/ # Results landing page and assets
│ ├── index.html # Results dashboard
│ ├── images/ # Figures and diagrams
│ └── plot/ # Plotly JSON exports
│ ├── acc.json
│ ├── f1_macro.json
│ ├── f1_micro.json
│ ├── f1_weighted.json
│ └── time.json
├── models/ # Checkpoints
│ ├── best_rn50_8shot.pth
│ └── best_vit_b32_8shot.pth
├── notebooks/ # Training and experiment logs
│ ├── train_colab.ipynb # Colab training notebook
│ └── wandb/ # W&B runs
├── settings/ # Configuration and requirements
│ ├── config.yaml # Model and W&B variables
│ └── requirements.txt # Primary environment targets
└── src/ # Core code
├── __init__.py
├── data_loader.py # Dataset objects, subset loading and splits
├── model_arch.py # Wrapper models based on openai/CLIP
├── notebook_utils.py # Notebook helpers
├── utils.py # Evaluation functions and counters
    └── Image_ID_Exclusive_Sampling.ipynb # Data subset generation
```
To train the models:

- Clone the project to your drive/local machine.
- Upload `coco_multimodal_subset.json` and the corresponding images to `coco_subset_images/`.
- Open `notebooks/train_colab.ipynb` in Colab.
- Mount your drive and install dependencies from `settings/requirements.txt`.
- Run the cells to train. The configuration can be modified in `settings/config.yaml` (see the sketch below).
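A minimal sketch of reading and overriding the config with PyYAML; the key names here are hypothetical stand-ins, and the real schema is whatever `settings/config.yaml` defines:

```python
# Hedged sketch: load settings/config.yaml and tweak it before training.
# "model_name" and "k_shot" are hypothetical keys, not the actual schema.
import yaml

with open("settings/config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model_name"] = "RN50"  # hypothetical key: switch the CLIP backbone
cfg["k_shot"] = 8           # hypothetical key: few-shot setting
print(cfg)
```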
To launch the Streamlit demo:
```bash
pip install -r requirements_streamlit.txt
streamlit run streamlit_app.py
```

The demo supports Zero-shot and 8-shot inference with attention visualizations.
To deploy the demo on Hugging Face Spaces:

- Create a new Space and select the Streamlit SDK.
- Add the following files to the Space repo:
  - `streamlit_app.py`
  - `requirements.txt` (copy from `requirements_streamlit.txt`)
  - `src/` and `models/` (or download the weights inside the app; see the sketch below)
- Commit and push to trigger the build.
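If you opt to download the weights inside the app, a hedged sketch using `huggingface_hub`; the `repo_id` is a placeholder for wherever the checkpoints are actually hosted:

```python
# Hedged sketch: fetch a checkpoint at app startup instead of committing
# the .pth files to the Space repo. The repo_id below is a placeholder.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="your-username/coco-caption-clip",  # placeholder repo id
    filename="best_vit_b32_8shot.pth",
)
# e.g. state_dict = torch.load(ckpt_path, map_location="cpu")
```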
The core pipeline consists of:

- Vision Extractor: converts image inputs using the normalizations standard to the CLIP variants (`ViT-B/32` or `RN50`).
- Text Extractor: standardizes an $N$-sized list of texts into embedding tensors.
- Linear Classification Probing: applies Cross-Entropy over dot-product similarities for $k \in \{0, 8\}$ shots (see the sketch below).
- Analysis: metrics (Accuracy, Macro-F1), inference time measurements, and Grad-CAM/EigenCAM map generation.
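To make the probing step concrete, here is a minimal linear-probe sketch: cross-entropy over dot-product similarities on frozen CLIP features. The dimensions, shot handling, and optimizer settings are illustrative assumptions, not the exact setup in `notebooks/train_colab.ipynb`:

```python
# Hedged sketch: 8-shot linear probe over frozen CLIP image features.
# Random tensors stand in for the real embeddings and labels.
import torch
import torch.nn as nn

num_classes, dim = 10, 512          # 512 = ViT-B/32 embedding width
probe = nn.Linear(dim, num_classes, bias=False)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# 8 shots per class -> 80 frozen, L2-normalized image features
features = torch.randn(8 * num_classes, dim)
features = features / features.norm(dim=-1, keepdim=True)
labels = torch.arange(num_classes).repeat_interleave(8)

for epoch in range(50):
    logits = probe(features)        # dot-product similarities per class
    loss = loss_fn(logits, labels)  # cross-entropy across similarities
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```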
Please refer to the GitHub Pages site or the provided YouTube links for the complete metric visualizations and a breakdown of the architecture diagrams.
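For reference, the scores exported to `docs/plot/` (accuracy plus macro/micro/weighted F1) correspond to standard scikit-learn metrics; a minimal sketch, with stand-in label arrays rather than project outputs:

```python
# Hedged sketch: the metric names in docs/plot/ match these scikit-learn
# calls; y_true/y_pred are stand-in arrays, not project outputs.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]

print("acc:        ", accuracy_score(y_true, y_pred))
print("f1_macro:   ", f1_score(y_true, y_pred, average="macro"))
print("f1_micro:   ", f1_score(y_true, y_pred, average="micro"))
print("f1_weighted:", f1_score(y_true, y_pred, average="weighted"))
```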