This project implements an image captioning model using a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. The architecture is inspired by the "Show and Tell" model, with modifications to improve performance. The model is trained on the Flickr8K dataset.
The model follows an encoder-decoder structure. A CNN acts as the encoder to extract features from the image, and an LSTM network acts as the decoder to generate a descriptive caption.
- Model: A pre-trained DenseNet201 model, originally trained on the ImageNet dataset, is used for feature extraction.
- Process: Each image is resized to 224x224 pixels and fed into the DenseNet201 model.
- Feature Vector: The output from the final Global Average Pooling layer of the CNN is used as the image feature vector. This results in a 1920-dimensional vector that provides a rich representation of the image's contents. These features are pre-computed and stored before training the decoder.
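The feature-extraction step above can be sketched with Keras as follows. This is a minimal sketch: the helper name `extract_features` is illustrative, and `weights=None` is used here only to keep the example light (the project uses the ImageNet-pretrained weights).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input

# DenseNet201 with the classification head removed; global average pooling
# collapses the final feature maps into a single 1920-dimensional vector.
# (The project uses weights="imagenet"; weights=None here avoids a download.)
encoder = DenseNet201(include_top=False, weights=None, pooling="avg")

def extract_features(image_path: str) -> np.ndarray:
    """Load an image, resize it to 224x224, and return its 1920-dim feature vector."""
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])   # shape (1, 224, 224, 3)
    return encoder.predict(x, verbose=0)[0]    # shape (1920,)
```

In the project these vectors are computed once per image and cached, so the decoder can be trained without running the CNN on every batch.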
The captions from the Flickr8K dataset are preprocessed before being used for training:
- Conversion to lowercase.
- Removal of special characters and numbers.
- Tokenization of sentences into words.
- Addition of `startseq` and `endseq` tokens to each caption to signify the beginning and end of a sequence.
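The preprocessing steps above can be sketched as a single helper. The function name and the exact regular expression are illustrative assumptions; the project's notebook may differ in detail.

```python
import re

def clean_caption(caption: str) -> str:
    """Lowercase, strip special characters and digits, tokenize on
    whitespace, and wrap the result in startseq/endseq tokens."""
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", "", caption)   # drop special characters and numbers
    words = caption.split()                     # simple whitespace tokenization
    return "startseq " + " ".join(words) + " endseq"
```

For example, `clean_caption("A dog runs through the grass!")` yields `"startseq a dog runs through the grass endseq"`.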
The decoder is an LSTM network responsible for generating the caption word-by-word.
- Inputs: The decoder takes two inputs at each time step:
  - The 1920-dimensional image feature vector from the encoder.
  - The sequence of words generated so far, starting with `startseq`.
- Architecture Details:
  - The image feature vector is passed through a `Dense` layer to reduce its dimensionality to 256.
  - The input text sequence is passed through an `Embedding` layer to create a 256-dimensional vector for each word.
  - The condensed image vector is concatenated with the word embedding sequence, and this combined sequence is fed into an `LSTM` layer with 256 units.
  - Key Modification: In a departure from the original "Show and Tell" architecture, the output of the LSTM is added to the condensed image vector. This residual-style connection reinforces the image context throughout the generation process.
  - The combined vector is then passed through `Dense` layers and a final `softmax` activation function to predict the next word in the sequence from the entire vocabulary.
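A minimal Keras sketch of this decoder follows. The vocabulary size, maximum caption length, and intermediate layer widths are illustrative placeholders; the real values come from the tokenizer and the notebook. The image vector is prepended as the first time step of the LSTM input, which is one reasonable reading of the concatenation described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000   # placeholder; the real value comes from the fitted tokenizer
MAX_LEN = 34        # placeholder maximum caption length

# Image branch: condense the 1920-dim DenseNet201 features to 256 dims.
img_in = layers.Input(shape=(1920,), name="image_features")
img_dense = layers.Dense(256, activation="relu")(img_in)
img_vec = layers.Reshape((1, 256))(img_dense)

# Text branch: embed each word of the partial caption into 256 dims.
txt_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(VOCAB_SIZE, 256)(txt_in)

# Concatenate the image vector with the embedding sequence along the
# time axis, then run the combined sequence through the LSTM.
seq = layers.concatenate([img_vec, emb], axis=1)
lstm_out = layers.LSTM(256)(seq)

# Key modification: residual-style addition of the condensed image
# vector to the LSTM output, reinforcing the image context.
merged = layers.add([lstm_out, img_dense])

x = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(x)

decoder = Model(inputs=[img_in, txt_in], outputs=out)
```

At inference time the decoder is called repeatedly, appending its predicted word to the token sequence until `endseq` is produced or `MAX_LEN` is reached.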
- Data Generation: A custom Keras `Sequence` data generator is used to feed data in batches, which is memory-efficient for large datasets.
- Optimizer: The model is compiled with the `Adam` optimizer.
- Loss Function: `categorical_crossentropy` is used as the loss function, suitable for multi-class classification (predicting the next word).
- Callbacks: Several callbacks are used to manage the training process:
  - `ModelCheckpoint`: Saves the model with the best validation loss.
  - `EarlyStopping`: Halts training if the validation loss does not improve for 5 consecutive epochs.
  - `ReduceLROnPlateau`: Reduces the learning rate if the validation loss plateaus.
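The compile-and-callbacks setup can be sketched as below. The checkpoint filename, the `ReduceLROnPlateau` factor and patience, and the stand-in model are illustrative assumptions; only the optimizer, loss, and the `EarlyStopping` patience of 5 come from the description above.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import (ModelCheckpoint, EarlyStopping,
                                        ReduceLROnPlateau)

# Stand-in model for the sketch; in the project this is the CNN-LSTM decoder.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1920,)),
    tf.keras.layers.Dense(100, activation="softmax"),
])

# Adam optimizer with categorical cross-entropy, as described above.
model.compile(optimizer="adam", loss="categorical_crossentropy")

callbacks = [
    # Keep only the weights with the best validation loss.
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
    # Stop if validation loss fails to improve for 5 consecutive epochs.
    EarlyStopping(monitor="val_loss", patience=5),
    # Cut the learning rate when validation loss plateaus (factor/patience assumed).
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=2, min_lr=1e-7),
]

# model.fit(train_generator, validation_data=val_generator,
#           epochs=50, callbacks=callbacks)
```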
1. Clone the repository:

   ```
   git clone https://github.com/ResorcinolWorks/ICG.git
   cd ICG
   ```

2. Install dependencies:

   ```
   pip install numpy pandas tensorflow matplotlib seaborn
   ```

3. Download the Dataset: Download the Flickr8K dataset. You will need the `Images` folder and the `captions.txt` file. Make sure the paths in the `CNN-LSTM project.ipynb` notebook point to the correct locations of these files.

4. Run the Jupyter Notebook: Open and execute the `CNN-LSTM project.ipynb` notebook in a Jupyter environment.
- Train the model on a larger dataset (e.g., Flickr30k or MS COCO) to improve generalization.
- Implement an Attention mechanism to allow the model to focus on more relevant parts of the image while generating captions.
- Evaluate the model using standard metrics like BLEU score.
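As a sketch of what BLEU evaluation could look like (assuming NLTK is available; the reference and candidate captions here are made up for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference captions for one image, and a generated candidate.
references = [
    "a dog runs through the grass".split(),
    "a brown dog is running outside".split(),
]
candidate = "a dog runs through the grass".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU score: {score:.3f}")
```

For corpus-level evaluation over the full test split, `nltk.translate.bleu_score.corpus_bleu` aggregates n-gram counts across all images rather than averaging per-sentence scores.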