This project implements a ViT-GPT2-based image captioning model, with experiments analyzing model robustness under different levels of image occlusion (10%, 50%, 80%). Link to dataset: https://github.com/eco-mini/custom_captions_dataset
The repository includes the following files:
1. Notebooks
- `20_PartA-1.ipynb`: Zero-shot smolVLM captioning and the `ImageCaptioningModel`, along with training and evaluation on the dataset given in the PS.
- `20_PartB-1.ipynb`: Part 1 of the robustness analysis, evaluating captions generated by smolVLM on occluded images at varying occlusion levels (10%, 50%, 80%).
- `20_PartB-2.ipynb`: Part 2 of the robustness analysis, evaluating captions generated by the `ImageCaptioningModel` on occluded images at the same occlusion levels (10%, 50%, 80%).
- `20_PartC-1.ipynb`: A custom BERT classifier that classifies generated captions into two classes, smolVLM (0) and ImageCaptioningModel (1). This notebook trains and evaluates the classifier.
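The occlusion levels used in the Part B robustness analysis can be reproduced with a simple masking helper. A minimal sketch, assuming a single random square patch is zeroed out to cover the target fraction of the image (the function name and the square-patch choice are illustrative; the notebooks may occlude images differently):

```python
import numpy as np

def occlude(img: np.ndarray, frac: float, seed: int = 0) -> np.ndarray:
    """Zero out one random square patch covering ~`frac` of the image area.

    `frac` is the occlusion level, e.g. 0.1, 0.5, or 0.8.
    """
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    # Side length of a square whose area is frac * (h * w).
    side = min(int(round((frac * h * w) ** 0.5)), h, w)
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out = img.copy()
    out[top:top + side, left:left + side] = 0
    return out
```

The occluded copies at each level can then be fed to either captioning model and scored against the clean-image captions.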
2. Captions
- `Captions - Custom`: This folder contains captions generated by the custom captioning model at all occlusion levels listed above.
- `Captions - SmolVLM`: This folder contains captions generated by smolVLM at all occlusion levels listed above.
3. Results
Results: This folder contains the scores for the generated captions (both smolVLM and the custom model) at all occlusion levels. The file `predictions.csv` contains the results of the `CaptionClassifier`.
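Given the classifier's predictions file, its accuracy can be recomputed offline. A minimal sketch, assuming `predictions.csv` has `true_label` and `pred_label` columns with 0 = smolVLM and 1 = ImageCaptioningModel (the column names are assumptions, not confirmed by the repo):

```python
import csv

def classifier_accuracy(path: str) -> float:
    """Fraction of rows where pred_label matches true_label.

    Assumes columns `true_label` and `pred_label` (0 = smolVLM,
    1 = ImageCaptioningModel); adjust names to the actual CSV header.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    correct = sum(r["true_label"] == r["pred_label"] for r in rows)
    return correct / len(rows)
```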
There are no specific dependencies beyond standard Python libraries and Hugging Face Transformers. All experiments were run in Kaggle Notebooks.
To reproduce results:
- Open any notebook in Kaggle.
- Ensure GPU is enabled.
- Run all cells top to bottom.
The trained model file (best_captioning_model.pt) is loaded directly from a Kaggle dataset in the captioning and evaluation notebooks.
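Since the checkpoint is a PyTorch `state_dict`, loading it follows the standard pattern. A minimal sketch using a hypothetical tiny module in place of the real `ImageCaptioningModel` (whose definition lives in the notebooks); the save step exists only to make the example self-contained:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in: the real ImageCaptioningModel class is
# defined in the captioning and evaluation notebooks.
class TinyCaptioner(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.head = nn.Linear(8, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x)

# The notebooks load best_captioning_model.pt from a Kaggle dataset;
# here we write a checkpoint first so the example runs standalone.
torch.save(TinyCaptioner().state_dict(), "best_captioning_model.pt")

model = TinyCaptioner()
state = torch.load("best_captioning_model.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # inference mode for caption generation
```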
Gayathri Anant
Email: gayathrianant05@gmail.com Roll No: 22CS30026
Tuhin Mondal
Email: email2tuhin04@gmail.com Roll No: 22CS10087
Diganta Mandal
Email: digantamindia@gmail.com Roll No: 22CS30062
We sincerely thank our Deep Learning course instructor and TAs for their support and feedback throughout the course.
We also acknowledge the open-source community and tools that made this work possible:
- Hugging Face Transformers
- PyTorch
- Kaggle Datasets and Notebooks