This is a deep neural network that uses CNNs, RNNs, and MLPs to caption images. It was built for the SUT DL course.
You can install all the dependencies using `pip install -r requirements.txt`.
I used ResNet-152 as the CNN encoder and an LSTM as the RNN decoder. You can see a schematic of the model in the image below.
The model was trained for 500 epochs on the Flickr8k dataset using a Tesla K80 GPU. The embedding layer weights were initialized from the Stanford GloVe vectors glove.42B.300d (random values were used for words that weren't in GloVe).
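The embedding initialization described above can be sketched roughly as follows. This is an illustration using NumPy, not the code from this repository: the function names `load_glove` and `build_embedding_matrix`, the normal distribution with scale 0.1 for missing words, and the fixed seed are all assumptions.

```python
import numpy as np


def load_glove(path):
    """Parse a GloVe text file (one word and its vector per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors


def build_embedding_matrix(vocab, glove, dim=300, seed=0):
    """Return a (len(vocab), dim) matrix: GloVe rows where available, random otherwise."""
    rng = np.random.default_rng(seed)
    # Start from random values so words missing from GloVe keep a random vector.
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for index, word in enumerate(vocab):
        vector = glove.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix
```

The resulting matrix can then be copied into the embedding layer of the decoder before training starts.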