Data: https://www.kaggle.com/datasets/adityajn105/flickr8k
Goal: Multi-modal learning to generate text (a caption) from an image.
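
Before any model is trained, the captions have to be paired with their images. The sketch below is one possible way to do that. It assumes the captions come as CSV text with a header row and two columns named `image` and `caption`, with several caption rows per image. That layout is an assumption here and should be checked against the downloaded files. The function name `group_captions` and the sample rows are hypothetical and for illustration only.

```python
import csv
import io
from collections import defaultdict


def group_captions(csv_text):
    """Map each image filename to the list of its captions.

    Assumes a header row with columns named "image" and "caption"
    (an assumed layout; verify it against the actual caption file).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    captions = defaultdict(list)
    for row in reader:
        captions[row["image"]].append(row["caption"].strip())
    return dict(captions)


# Tiny made-up sample in the assumed layout (not real dataset rows).
sample = (
    "image,caption\n"
    "img_001.jpg,A dog runs across a field .\n"
    'img_001.jpg,"A brown dog, mid-stride, on grass ."\n'
    "img_002.jpg,Two children play on a beach .\n"
)

pairs = group_captions(sample)
for image_name, image_captions in pairs.items():
    print(image_name, len(image_captions))
```

For the real data, the contents of the caption file could be read as text and passed to `group_captions` in the same way. Grouping by image keeps all reference captions for a picture together, which is convenient both for building training pairs and for evaluating generated captions against multiple references.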