San-Pytorch

Let us try implementing SAN in PyTorch from scratch. The original implementation of Stacked Attention Networks for Image Question Answering by Zichao Yang was in Theano. During my winter internship at IIT-K on VQA, I will first implement this in PyTorch and then begin my own work. Kindly refer to the paper and the original Theano code before proceeding. You could also refer to the Torch implementation of SAN.

Requirements

The code is written in Python and requires PyTorch. The preprocessing code is in Python and Lua, and you need to install NLTK if you want to use it to tokenize the questions.

Download Dataset

We simply follow the steps provided by HieCoAttenVQA to prepare the VQA data. The first thing you need to do is download the data and do some preprocessing. Head over to the data/ folder and run

$ python vqa_preprocess.py --download 1 --split 1

--download 1 means you choose to download the VQA data from the VQA website, and --split 1 means you use the COCO train set for training and the validation set for evaluation; --split 2 means you use the COCO train+val set for training and the test set for evaluation. After this step, two files are generated under the data/ folder: vqa_raw_train.json and vqa_raw_test.json.
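As a quick sanity check, here is a minimal sketch that loads the generated file and prints one record (the exact keys of each entry are an assumption, so inspect the output to confirm the schema):

import json

# Load the raw training questions produced by vqa_preprocess.py
with open('data/vqa_raw_train.json') as f:
    raw_train = json.load(f)

print('number of training questions:', len(raw_train))
print(raw_train[0])  # NOTE: the field names of each entry are an assumption; check the printed record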

Download Image Model

We are using the VGG_ILSVRC_19_layers model and the Deep Residual Network model implemented by Facebook.

Generate Image/Question Features

Head over to the prepro folder and run

$ python prepro_vqa.py --input_train_json ../data/vqa_raw_train.json --input_test_json ../data/vqa_raw_test.json --num_ans 1000

to get the question features. --num_ans specifies how many top answers you want to use during training. You will also see some question and answer statistics in the terminal output. This will generate two files in the data/ folder: vqa_data_prepro.h5 and vqa_data_prepro.json.
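If you want to verify the preprocessed data before training, a small sketch like the one below lists what was written (the dataset names inside the HDF5 file are not documented here, so it simply enumerates them rather than assuming them):

import json
import h5py

# Enumerate the datasets written by prepro_vqa.py
with h5py.File('data/vqa_data_prepro.h5', 'r') as f:
    for name in f:
        print(name, f[name].shape, f[name].dtype)

# The companion JSON holds the vocabulary and answer mappings
with open('data/vqa_data_prepro.json') as f:
    meta = json.load(f)
print(list(meta.keys()))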

Then we are ready to extract the image features with VGG-19.

$ th prepro_img_vgg.lua -input_json ../data/vqa_data_prepro.json -image_root /home/jiasenlu/data/ -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel

Before running this, make sure you create a new folder called image_model and put the downloaded VGG Caffe model (along with the .prototxt file) and the Deep Residual Network model in that folder. For -image_root, give the path of the COCO dataset on your system. You can change -gpuid, -backend and -batch_size based on your GPU.
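If you prefer to stay in Python, the sketch below extracts comparable VGG-19 features with torchvision instead of the Lua/Caffe pipeline above. It uses torchvision's ImageNet weights (not the Caffe model referenced above) and a placeholder image path, so treat it as an alternative sketch rather than this repository's script:

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# VGG-19 convolutional trunk; for a 448x448 input the output is a 512 x 14 x 14
# feature map, the spatial grid that stacked attention attends over.
vgg = models.vgg19(pretrained=True).features.eval()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('path/to/coco/image.jpg').convert('RGB')  # placeholder path
with torch.no_grad():
    feat = vgg(preprocess(img).unsqueeze(0))  # shape: (1, 512, 14, 14)
print(feat.shape)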

Train the model

We have everything ready to train the VQA model. Go back to the main folder and execute

python train.py --use_gpu <0 or 1> --batch_size <batch size> --epochs <no. of epochs>

You can also change many other options. For a list of all options, see train.py.

Evaluate the model

In the main folder, run

python eval.py --use_gpu <0 or 1>

You can also change many other options. For a list of all options, see eval.py.

Modifications

To capture dependencies between adjacent words within a sentence, we use a bidirectional LSTM to process the question. After we obtain the question features and image features, and before they are passed to the stacked attention mechanism, they are combined through two linear layers and then fed to the SAN layer. We have also modified the way attention is calculated: it is now a combination of A Structured Self-Attentive Sentence Embedding (their aim was to encode a variable-length sentence into a fixed-size embedding, which they achieve by taking a linear combination of the n LSTM hidden vectors in H; computing this linear combination requires a self-attention mechanism that takes the whole matrix of LSTM hidden states H as input and outputs a vector of weights A) and Stacked Attention Networks.
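A minimal sketch of this modified question encoder is given below, assuming an embedding size of 300 and an LSTM hidden size of 512 (the actual layer sizes and module names in this repository may differ). It runs the question through a bidirectional LSTM and applies the structured self-attention A = softmax(W2 tanh(W1 H^T)) to obtain a fixed-size question embedding:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveQuestionEncoder(nn.Module):
    # Bidirectional LSTM over the question followed by the structured
    # self-attention of Lin et al.; hyper-parameters here are illustrative.
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, att_dim=350, att_hops=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.w1 = nn.Linear(2 * hidden_dim, att_dim, bias=False)
        self.w2 = nn.Linear(att_dim, att_hops, bias=False)

    def forward(self, questions):
        # questions: (batch, seq_len) word indices
        H, _ = self.lstm(self.embed(questions))                 # (batch, seq_len, 2*hidden)
        A = F.softmax(self.w2(torch.tanh(self.w1(H))), dim=1)   # (batch, seq_len, hops)
        M = torch.bmm(A.transpose(1, 2), H)                     # (batch, hops, 2*hidden)
        return M.squeeze(1)                                     # fixed-size question embedding

# Example: encode a batch of two padded questions
enc = SelfAttentiveQuestionEncoder(vocab_size=10000)
q = torch.randint(1, 10000, (2, 26))
print(enc(q).shape)   # torch.Size([2, 1024])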
