Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio

What is it?

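It adds an attention mechanism to conventional image captioning, which makes it possible to visualize which part of the image the model is attending to as it generates each word of the caption.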

The novelty is the introduction of an attention mechanism into image captioning.

The basic network structure is similar to Show and Tell.

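There are two variants of the attention mechanism, hard and soft. Hard attention stochastically samples a single image location to attend to at each step. Soft attention deterministically takes a weighted average over all locations, using the attention probabilities as weights.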
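The difference between the two variants can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy feature matrix, and the fixed `scores` vector are assumptions made for illustration. In the paper the scores are produced by a learned network conditioned on the decoder state, which is omitted here.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention(features, scores):
    # features: (L, D) array, one D-dim annotation vector per image location.
    # scores:   (L,) unnormalized relevance scores (assumed given here).
    alpha = softmax(scores)
    # Context vector is the expectation of the features under alpha.
    context = alpha @ features
    return context, alpha

def hard_attention(features, scores, rng):
    # Same attention distribution, but one location is sampled from it.
    alpha = softmax(scores)
    idx = rng.choice(len(alpha), p=alpha)
    return features[idx], alpha, idx

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))          # toy example: 4 locations, 3 dims
scores = np.array([2.0, 0.5, -1.0, 0.1])    # hypothetical relevance scores

soft_ctx, alpha = soft_attention(features, scores)
hard_ctx, _, idx = hard_attention(features, scores, rng)
```

Because the soft context is a smooth function of the scores, it can be trained with ordinary backpropagation. The sampling step in the hard variant is not differentiable, so it needs a sampling-based gradient estimator instead.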

The attention mechanism is evaluated quantitatively on the Flickr8k, Flickr30k, and COCO datasets, using BLEU-1, 2, 3, 4 and METEOR.
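As a rough illustration of what BLEU-n measures, the following sketch computes a simplified sentence-level score against a single reference: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty. This is an assumption-laden simplification. Standard BLEU is computed at corpus level and supports multiple references, and the function names and example sentences here are hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions with uniform weights.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "a dog is running on the grass".split()
reference = "a dog runs on the grass".split()
bleu1 = bleu(candidate, reference, 1)
bleu2 = bleu(candidate, reference, 2)
```

In this example 5 of the 7 candidate unigrams and 3 of the 6 candidate bigrams appear in the reference, so the score drops as n grows.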

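As further checks, it also evaluates what happens when the CNN model is changed, compares single models with ensemble models, and looks at differences caused by how the datasets are split.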

Multiple object recognition with visual attention
Neural machine translation by jointly learning to align and translate
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Deep visual-semantic alignments for generating image descriptions
Sequence to sequence learning with neural networks