This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

Input vectors #168

Merged
merged 21 commits into OpenNMT:master on Apr 18, 2017

Conversation

@jsenellart (Contributor) commented on Mar 19, 2017

Added two new features to support input vectors:

  • option -idx_files indicates that the source/target files are indexed: each line is a key-value pair, and entries need not appear in the same order across files
  • new data type feattext, which uses the Kaldi text ark dump format:
KEY [
VAL1 VAL2 VAL3
VAL4 VAL5 VAL6 ]
KEY2 [
...
]
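
For illustration, here is a minimal Lua sketch of reading such a dump into per-key tensors. This is not the PR's code; the function name and parsing details are assumptions.

-- Minimal sketch, not the PR's implementation: read a Kaldi text ark
-- dump into a table mapping each key to a frames x dim FloatTensor.
require('torch')

local function readArkText(path)
  local entries = {}
  local key, rows
  for line in io.lines(path) do
    if key == nil then
      key = line:match('^(%S+)%s*%[')   -- header line: "KEY ["
      rows = {}
    else
      local row = {}
      for v in line:gmatch('[%-%+%.%deE]+') do
        table.insert(row, tonumber(v))
      end
      if #row > 0 then
        table.insert(rows, torch.FloatTensor(row))
      end
      if line:find('%]') then           -- a "]" closes the entry
        entries[key] = torch.cat(rows, 1):view(#rows, -1)
        key = nil
      end
    end
  end
  return entries
end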

Typical preprocessing command:

th preprocess.lua -data_type 'feattext' -train_src TIMIT/raw_mfcc_train.ark.txt \
                -train_tgt TIMIT/data/train/text -valid_src TIMIT/raw_mfcc_dev.ark.txt \
                -valid_tgt TIMIT/data/dev/text -save_data TIMIT/baseline -idx_files \
                -src_seq_length 1000 -tgt_seq_length 100
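
With -idx_files, each line of the source and target files starts with a shared key, so the two sides do not need to be in the same order. A hypothetical target file (the utterance keys are invented for illustration):

UTT001 she had your dark suit in greasy wash water all year
UTT002 don't ask me to carry an oily rag like that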

For training:

th train.lua -data TIMIT/baseline-train.t7 -save_model TIMIT -pdbrnn -report_every 1 \
           -rnn_size 800 -word_vec_size 20  -layers 4 -max_batch_size 16 -learning_rate 0.7 \
           -learning_rate_decay 0.8 -end_epoch 20

For decoding:

th translate.lua -model data/asr1_epoch15_8.17.t7 -src TIMIT/raw_mfcc_test.ark.txt  -batch_size 1

@jsenellart changed the title from [WIP] Input vectors to Input vectors on Mar 21, 2017
@guillaumekln (Collaborator) left a comment

We should also add a documentation page on this.

onmt/Factory.lua Outdated
        opt.pre_word_vecs_enc, opt.fix_word_vecs_enc == 1,
        verbose)
  else
    inputNetwork = nn.Identity()
@guillaumekln (Collaborator)

I think it should be in another function, like buildInputEncoder.
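
As a sketch of that suggested split (illustrative only: the injected buildWordEmbedding stands in for the existing embedding builder, and its name and signature are assumptions, not the repository's API):

require('nn')

-- Sketch of the suggested refactoring: choose the input layer in one
-- place. The embedding builder is passed in because its real name and
-- signature are assumptions here.
local function buildInputNetwork(opt, dicts, verbose, buildWordEmbedding)
  if dicts then
    -- Word inputs: embedding lookup, optionally initialized from
    -- opt.pre_word_vecs_enc and frozen when opt.fix_word_vecs_enc == 1.
    return buildWordEmbedding(opt, dicts, verbose)
  else
    -- Vector inputs (e.g. speech features): pass tensors through unchanged.
    return nn.Identity()
  end
end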

@@ -110,6 +110,23 @@ function PDBiEncoder:maskPadding()
  self.layers[1]:maskPadding()
end

-- size of context vector
function PDBiEncoder:contextSize(sourceSize, sourceLength)
@guillaumekln (Collaborator)

Why do we need that? Past the first layer, padding no longer applies.

@jsenellart (Contributor, Author)

contextSize is used by the decoder to get the size of the context from outside the encoder; that size varies for PDBiEncoder.
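
For intuition, a plain-Lua sketch of why the context length differs: a pyramidal encoder reduces the time dimension at each layer above the first. This assumes a reduction factor of 2; the actual factor and rounding in the PR may differ.

-- Sketch only: how a pyramidal encoder shrinks the source length.
local function pyramidalContextLength(sourceLength, numLayers)
  for _ = 2, numLayers do
    sourceLength = math.ceil(sourceLength / 2)
  end
  return sourceLength
end

print(pyramidalContextLength(1000, 4))  -- 1000 -> 500 -> 250 -> 125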

preprocess.lua Outdated
  if dataType == 'monotext' then
    src_file = opt.train
  end
  data.dicts.src = Vocabulary.init('train',
@guillaumekln (Collaborator)

Can we keep 'source' instead? See #162.

@guillaumekln merged commit 1b7632a into OpenNMT:master on Apr 18, 2017
@@ -74,14 +74,24 @@ function Batch:__init(src, srcFeatures, tgt, tgtFeatures)

  self.sourceLength, self.sourceSize, self.uneven = getLength(src)

  -- if input vectors (speech for instance)
  self.inputVectors = src[1]:dim() > 1

This fails with the default constructor, like local batch = Batch().

@guillaumekln (Collaborator)

Good catch! It has been fixed.
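
For illustration, a guard along these lines avoids the crash (sketch only; the committed fix may differ):

-- Illustrative guard: only inspect src when the constructor
-- actually received source data.
if src and #src > 0 then
  self.inputVectors = src[1]:dim() > 1
else
  self.inputVectors = false
end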

guillaumekln added a commit that referenced this pull request Apr 24, 2017