Implementation of seq2seq model for Visual Storytelling Challenge (VIST)
msmilevski Update sis_datareader
Add image descriptions to the main train/valid/test files and remove all the stories that have atleast one image that doesn't have a description
Latest commit 203751b Aug 18, 2018


This repository implements our original solution, described in the paper Stories for Images-in-Sequence by using Visual and Narrative Components. The project is inspired by the solution in Visual Storytelling. The model generates a story sentence by sentence, conditioning each sentence on the sequence of images and on the previously generated sentence. The architecture consists of an image-sequence encoder that models the sequential behaviour of the images, a previous-sentence encoder, and a current-sentence decoder. The previous-sentence encoder encodes the sentence associated with the previous image, and the current-sentence decoder generates the sentence for the current image in the sequence. We also introduce a novel way of grouping the images of the sequence during training, in order to capture the effect of the previous images in the sequence. Our goal was a model that generates stories containing more narrative and evaluative language, in which every generated sentence is affected not only by the sequence of images but also by what has already been generated in the story.
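The sentence-by-sentence generation described above can be sketched as a simple loop. This is only an illustration of the control flow, not the repository's code: `generate_story`, `generate_sentence` and `dummy_model` are hypothetical names, and the stand-in model simply echoes its inputs.

```python
def generate_story(image_features, generate_sentence, start_sentence=""):
    """Produce one sentence per image, feeding each sentence into the next step.

    `generate_sentence` is a hypothetical callable standing in for the trained
    model. It receives the whole image sequence, the index of the current
    image and the previously generated sentence, and returns a string.
    """
    story = []
    previous = start_sentence
    for idx in range(len(image_features)):
        sentence = generate_sentence(image_features, idx, previous)
        story.append(sentence)
        previous = sentence  # the next step is conditioned on this sentence
    return story


def dummy_model(image_features, idx, previous):
    # Stand-in for the real encoders and decoder (assumption for illustration).
    return "sentence %d after [%s]" % (idx + 1, previous)


story = generate_story(["img1", "img2", "img3"], dummy_model)
```

Each call sees the previous sentence, so an error or a stylistic choice early in the story propagates to the later sentences, which is the dependency the architecture is designed to model.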


The project is built using Python 2.7.14, TensorFlow 1.6.0 and Keras 2.1.6. Install these dependencies to get a development environment running:

sudo easy_install --upgrade pip
sudo easy_install --upgrade six
sudo pip install tensorflow
sudo pip install keras
pip install opencv-python
pip install h5py
pip install unidecode
python -mpip install matplotlib


Download the Visual Storytelling Dataset (VIST) and save it in the dataset/vist_dataset directory. Also download the pre-trained weights for AlexNet and put them in the dataset/models/alexnet directory.

Data pre-processing

First we need to extract the image features from all the images and save them in a file. This is possible with

python dataset/models/alexnet/

The script creates the file /dataset/models/alexnet/alexnet_image_train_features.hdf5, which contains all the image features. Next we need to associate every image feature vector with its corresponding vectorized sentence. We vectorize the sentences using the functions in sis_datareader. With the function sentences_to_index we align every image feature with its sentence. If all the file paths are set properly, all of the above can be done by running the command

python data_reader/
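To make the vectorization step concrete, here is a minimal sketch of turning sentences into fixed-length index vectors. It is not the implementation in sis_datareader: the helper names `build_vocab` and `sentence_to_indices`, the whitespace tokenization, and the `<pad>` and `<unk>` tokens are all assumptions made for illustration.

```python
def build_vocab(sentences):
    """Map each distinct lowercase token to an integer id.

    Ids 0 and 1 are reserved for padding and unknown words (an assumption).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def sentence_to_indices(sentence, vocab, max_len):
    """Convert a sentence to a list of ids, truncated or padded to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


sentences = ["we went to the beach", "the beach was crowded"]
vocab = build_vocab(sentences)
vectors = [sentence_to_indices(s, vocab, max_len=6) for s in sentences]
```

Once every sentence has the same length, the vectors can be stacked into one array and stored next to the image features, so that row i of the sentences corresponds to row i of the features.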

Options and differences from the paper

In addition to our proposed solution, the project can be used to train a plain encoder-decoder and an encoder-decoder with the Luong attention mechanism.

Our proposed solution

The architecture of the proposed model. The images highlighted in red are the ones that are encoded; together with the previous sentence, they influence the sentence generated at the current time step.

Training the model

Training the model and adjusting its parameters are done in the training script. If the attention mechanism is used, make sure that image_encoder_latent_dim = sentence_encoder_latent_dim.
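One plausible reason for the dimension constraint is that a dot-product score can only be computed between vectors of equal size. The NumPy sketch below assumes the dot variant of Luong attention, with a sentence-side state as the query and the image encoder outputs as the keys. The README does not state which score function or which roles the project uses, so treat both as assumptions. `luong_dot_attention` is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def luong_dot_attention(query, keys):
    """Dot-score attention.

    query: shape (d,), e.g. a sentence-side hidden state (assumed role).
    keys:  shape (T, d), e.g. image encoder outputs (assumed role).
    Returns the context vector (d,) and the attention weights (T,).
    """
    if query.shape[-1] != keys.shape[-1]:
        raise ValueError("latent dimensions differ: %d vs %d"
                         % (query.shape[-1], keys.shape[-1]))
    scores = np.dot(keys, query)         # one score per time step, shape (T,)
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = np.dot(weights, keys)      # weighted sum of the keys, shape (d,)
    return context, weights


image_states = np.random.RandomState(0).randn(5, 8)   # 5 images, latent dim 8
sentence_state = np.random.RandomState(1).randn(8)    # same latent dim 8
context, weights = luong_dot_attention(sentence_state, image_states)
```

If the two latent dimensions differed, the score computation would fail unless an extra projection matrix were added, as in the "general" Luong score.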


Generating stories

To generate stories, set model_name to the model you want to generate from and run the generation script.


Some Results

The following images show the stories generated by the aforementioned models. The first row represents the original story, the second row is the story generated by the model with loss=0.82, the third row is the story from the model with loss=1.01, and the fourth row is the story from the model with loss=1.72.
