This is the implementation of our original solution, described in the paper Stories for Images-in-Sequence by using Visual and Narrative Components. The project is inspired by the solution in Visual Storytelling. The model generates a story sentence by sentence, conditioning each sentence on the sequence of images and on the previously generated sentence. The architecture consists of an image sequence encoder that models the sequential behaviour of the images, a previous-sentence encoder, and a current-sentence decoder. The previous-sentence encoder encodes the sentence associated with the previous image, and the current-sentence decoder generates the sentence for the current image in the sequence. We also introduce a novel way of grouping the images of the sequence during training, in order to capture the effect of the previous images in the sequence. Our goal was a model that generates stories with more narrative and evaluative language, in which every generated sentence is affected not only by the sequence of images but also by what has already been generated in the story.
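The sentence-by-sentence generation described above can be sketched as a simple loop. This is a minimal illustration, not the code in this repository: `generate_story`, `generate_sentence` and the `<start>` placeholder are hypothetical names, and the sentence generator is passed in as a plain callable so that the sketch runs without a trained model.

```python
def generate_story(image_features, generate_sentence, start_sentence="<start>"):
    """Generate one sentence per image, feeding each sentence into the next step.

    image_features    -- list of per-image feature vectors, in story order
    generate_sentence -- hypothetical callable:
                         (image_features, step, previous_sentence) -> str
    start_sentence    -- assumed placeholder used before any sentence exists
    """
    story = []
    previous_sentence = start_sentence
    for step in range(len(image_features)):
        # Each sentence depends on the image sequence and on the sentence
        # that was generated for the previous image.
        sentence = generate_sentence(image_features, step, previous_sentence)
        story.append(sentence)
        previous_sentence = sentence
    return story


def dummy_generator(image_features, step, previous_sentence):
    # Stand-in for a trained model: it only reports what it was conditioned on.
    return "sentence %d (prev: %s)" % (step + 1, previous_sentence)


story = generate_story([[0.1], [0.2], [0.3]], dummy_generator)
for line in story:
    print(line)
```

Passing the generator as an argument keeps the loop independent of any particular model, so the same control flow could wrap whichever trained model is loaded at inference time.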
The project is built using Python 2.7.14, TensorFlow 1.6.0 and Keras 2.1.6. Install these dependencies to get a development environment running:
sudo easy_install --upgrade pip
sudo easy_install --upgrade six
sudo pip install tensorflow
sudo pip install keras
pip install opencv-python
pip install h5py
pip install unidecode
python -m pip install matplotlib
Download the Visual Storytelling Dataset (VIST) from http://visionandlanguage.net/VIST/dataset.html and save it in the dataset/vist_dataset directory. Also download the pre-trained weights for AlexNet and put them in the dataset/models/alexnet directory.
First we need to extract the image features from all the images and save them in a file. This is possible with
The script creates the file /dataset/models/alexnet/alexnet_image_train_features.hdf5, which contains all the image features. Next, we need to associate every image feature vector with its corresponding vectorized sentence. We vectorize the sentences using the functions in sis_datareader. With the function sentences_to_index we align every image feature with every sentence. If all the file paths are set properly, all of the above can be done by running the command
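As an illustration of the vectorization step described above, the following sketch maps sentences to fixed-length index vectors and pairs each one with an image feature vector. It does not reproduce the functions in sis_datareader: `build_vocabulary`, `sentence_to_indices`, the reserved `<pad>` and `<unk>` indices, and the whitespace tokenization are all assumptions made for the example.

```python
PAD, UNK = 0, 1  # assumed reserved indices for padding and unknown words


def build_vocabulary(sentences):
    """Assign an integer index to every distinct lower-cased word."""
    vocabulary = {"<pad>": PAD, "<unk>": UNK}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def sentence_to_indices(sentence, vocabulary, max_length):
    """Turn a sentence into a list of exactly max_length word indices."""
    indices = [vocabulary.get(word, UNK) for word in sentence.lower().split()]
    indices = indices[:max_length]                        # truncate long sentences
    return indices + [PAD] * (max_length - len(indices))  # pad short ones


# Toy data: one (made-up) feature vector and one sentence per image.
sentences = ["we went to the beach", "the beach was sunny"]
image_features = [[0.1, 0.2], [0.3, 0.4]]

vocabulary = build_vocabulary(sentences)
pairs = [
    (features, sentence_to_indices(sentence, vocabulary, 6))
    for features, sentence in zip(image_features, sentences)
]
```

Fixed-length index vectors make it possible to stack all sentences into one array, which is the shape Keras models generally expect for batched sequence input.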
Options and differences from the paper
Besides our proposed solution, the project can also be used to train an encoder-decoder and an encoder-decoder with the Luong attention mechanism.
Our proposed solution
The architecture of the proposed model. The images highlighted in red are the ones that are encoded; together with the previous sentence, they influence the sentence generated at the current time step.
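One way to read this grouping is that the sentence at step t is conditioned on the images up to and including image t. The sketch below builds such groups and left-pads them with constant vectors so that every group has the same length. The function name `group_images`, the zero padding, and the padding side are assumptions made for illustration; the paper and the repository code define the grouping that is actually used.

```python
def group_images(image_features, pad_value=0.0):
    """Build one image group per time step.

    Group t contains the feature vectors of images 0..t, left-padded with
    constant vectors so that all groups have the full sequence length.
    """
    sequence_length = len(image_features)
    feature_dim = len(image_features[0])
    groups = []
    for step in range(sequence_length):
        visible = [list(vector) for vector in image_features[:step + 1]]
        missing = sequence_length - len(visible)
        padding = [[pad_value] * feature_dim for _ in range(missing)]
        groups.append(padding + visible)
    return groups


# Toy sequence of three images with two-dimensional features.
features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
groups = group_images(features)
```

With this layout, each training example pairs one group with one target sentence, so a five-image story would yield five examples.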
Training the model
Training the model and adjusting its parameters are done in training_model.py. If the attention mechanism is used, make sure that image_encoder_latent_dim = sentence_encoder_latent_dim.
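A likely reason for the equal-dimension requirement is that the dot variant of Luong's score is an inner product between a decoder state and each encoder output, which is only defined when both vectors have the same length. The plain-Python sketch below illustrates this with made-up vectors. It assumes the dot score; the function name `dot_attention` is hypothetical, and the attention layer in this repository may be implemented differently.

```python
import math


def dot_attention(decoder_state, encoder_outputs):
    """Dot-product attention over a list of encoder output vectors.

    Returns the attention weights and the weighted-sum context vector.
    """
    for output in encoder_outputs:
        if len(output) != len(decoder_state):
            raise ValueError("dot score needs equal latent dimensions")

    # Score each encoder output by its inner product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, output))
        for output in encoder_outputs
    ]

    # Softmax over the scores (shifted by the maximum for numerical stability).
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]

    # Context vector: attention-weighted sum of the encoder outputs.
    context = [
        sum(w * output[i] for w, output in zip(weights, encoder_outputs))
        for i in range(len(decoder_state))
    ]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. image encoder states
decoder_state = [2.0, 0.0]                               # e.g. sentence decoder state
weights, context = dot_attention(decoder_state, encoder_outputs)
```

Luong's "general" score inserts a learned matrix between the two vectors and would lift this restriction, at the cost of extra parameters.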
To generate stories, set model_name in inference_model.py to the model you want to generate from and run
In the following images we can see the stories generated by the aforementioned models. The first row represents the original story, the second row is the story generated by the model with loss=0.82, the third row is the story from the model with loss=1.01, and the fourth row is the story from the model with loss=1.72.