
Error while running domain adaptation (fine tuning) with distributed mode #269

Closed
mohammedayub44 opened this issue Nov 20, 2018 · 22 comments

@mohammedayub44

Hi,

I have created new vocabulary files (source and target) from the domain data set and updated the base model checkpoint using the command below:
onmt-update-vocab
--model_dir /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/
--output_dir /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/
--src_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt
--tgt_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt
--new_src_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt
--new_tgt_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_nfpa_50k.txt

This generates a new checkpoint, which I pass to the fine-tuning train_and_eval command:
onmt-main train_and_eval
--model_type Transformer
--checkpoint_path /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/
--config /home/ubuntu/mayub/datasets/in_use/euro/run1/config_run_da_nfpa.yml
--auto_config --num_gpus 8

The only changes I made to the config file were updating the train and eval features and labels files (source and target vocabulary are the same):

data:
  train_features_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_train_tokenized_bpe_applied.en
  train_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_train_tokenized_bpe_applied.es
  eval_features_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_dev_tokenized_bpe_applied.en
  eval_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_dev_tokenized_bpe_applied.es
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt

Below is the error I'm getting:
(screenshot of the error trace)

Not sure where I'm going wrong. Any help appreciated.

Thanks !

Mohammed Ayub

@guillaumekln
Contributor

Hi,

(source and target vocabulary are the same)

You have new vocabularies so you should update them in your configuration.
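For example, if the checkpoint was updated with the new vocabularies in replace mode, the data section would point directly to them (a sketch using the paths from your commands; with merge mode it should point to the merged vocabulary files instead):

data:
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_nfpa_50k.txt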

@mohammedayub44
Author

I presently have four vocabulary files:
Old Src - 50000
Old Trg - 50000
New Src - 11302
New Target - 15285
I'm guessing I cannot directly pass these new vocabulary files since the sizes are different. It looks like I need a merged vocabulary file that has 55900 words?

onmt-update-vocab just updated the checkpoint but did not output a merged vocabulary file. Is there something I'm missing or misunderstanding?

Thanks !

Mohammed Ayub

@mohammedayub44
Author

mohammedayub44 commented Nov 20, 2018

Hi @guillaumekln ,

I passed the new vocabulary files (src_vocab_nfpa_50k.txt and trg_vocab_nfpa_50k.txt), which have 11302 and 15285 terms, to onmt-update-vocab with --mode replace. It gives me the error below when running the following train command:

onmt-main train_and_eval --model_type Transformer --checkpoint_path /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/ --config /home/ubuntu/mayub/datasets/in_use/euro/run1/config_run_da_nfpa.yml --auto_config --num_gpus 8

2018-11-20 16:52:17.028197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10756 MB memory) -> physical GPU (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000
2018-11-20 16:52:19.926786: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
2018-11-20 16:52:19.928493: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Not sure why it's giving this error.

Mohammed Ayub

@mohammedayub44
Author

I have not changed anything apart from updating the vocabularies with onmt-update-vocab. Here is a comparison of the log files of the two models (the base model and the one used for fine tuning):

(screenshot comparing the log files of the two models)

Mohammed Ayub

@guillaumekln
Contributor

onmt-update-vocab just updated the checkpoint but did not output a merged vocabulary file. Is there something I'm missing or misunderstanding?

The PR referenced above will add generation of the merged vocabulary file to make this easier.
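In the meantime, a minimal sketch of building a merged vocabulary by hand (assuming plain one-token-per-line vocabulary files, and that merging means appending new tokens that are not already in the base vocabulary; the function and output file name are illustrative):

def merge_vocab(old_path, new_path, out_path):
    # Keep the base vocabulary order, then append unseen domain tokens.
    with open(old_path, encoding="utf-8") as f:
        merged = [line.rstrip("\n") for line in f]
    seen = set(merged)
    with open(new_path, encoding="utf-8") as f:
        for line in f:
            token = line.rstrip("\n")
            if token and token not in seen:
                seen.add(token)
                merged.append(token)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")

merge_vocab(
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt",
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt",
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_merged.txt")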

For the error, does it work if instead of using --checkpoint_path you change the model_dir in the configuration to point to the updated checkpoint directory?
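For example (a one-line sketch, using the updated checkpoint directory from your commands):

model_dir: /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/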

@mohammedayub44
Author

mohammedayub44 commented Nov 21, 2018

The PR referenced above will add generation of the merged vocabulary file to make this easier.

Great ! Thank you for adding that @guillaumekln

For the error, does it work if instead of using --checkpoint_path you change the model_dir in the configuration to point to the updated checkpoint directory?

No, it did not work. I pointed model_dir to the updated checkpoint directory and tried both with and without the --checkpoint_path argument; it gives me the same error.

Also, if you look at the log file it says "you can change only the non-structural values like dropout etc." I'm assuming that does not apply to the data or train parameters, since those are what I have updated. Not sure if it's a TensorFlow import_meta_graph bug?

Mohammed Ayub

@guillaumekln
Contributor

It's most definitely a bug in our code. Do you have the full logs of the onmt-update-vocab command?

@mohammedayub44
Author

It did not produce much logging output when running the command. Below is what I got on standard output; let me know if this helps:

INFO:tensorflow:Updating vocabulary related variables in checkpoint /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/model.ckpt-25000
2018-11-20 17:08:46.094014: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-20 17:08:46.335043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-20 17:08:46.335079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]
INFO:tensorflow:Saving new checkpoint to /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/

@mohammedayub44
Author

mohammedayub44 commented Nov 27, 2018

@guillaumekln Just checking if there is any update on this, or whether there is any other way to do fine tuning with this repo.

Thanks.

Mohammed Ayub

@guillaumekln
Contributor

guillaumekln commented Nov 28, 2018

I tried to reproduce this error but did not succeed. Will keep looking unless you are able to send me the checkpoint and the vocabularies.

Is there something I should know about your setup/installation? Looks like scalar tensors are silently promoted to float64.

@mohammedayub44
Author

Sure. Here is the Dropbox link containing the files: https://www.dropbox.com/s/1cmynzc5kvbr89t/Issue269.zip?dl=0

  • Base Model Vocabulary - Vocabulary without domain terms
  • Finetune Model Vocabulary - Merged Vocabulary with domain terms.
  • Base Model Checkpoint - checkpoint before updating
  • Finetune Model Checkpoint - Updated checkpoint after running onmt-update-vocab

It is a basic out-of-the-box setup; no custom code or changes were made.

Mohammed Ayub

@guillaumekln
Contributor

Thanks, that's helpful.

In both checkpoints, the scalar variables are float64 (e.g. the learning rate). However, the error indicates a data type mismatch, which means the new graph declares these variables as float32 and can no longer load the checkpoint.
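A minimal sketch of confirming the stored dtypes directly from the checkpoint, assuming TensorFlow 1.x and the checkpoint path from your logs:

import tensorflow as tf  # TF 1.x

# Read the dtypes recorded in the checkpoint for the optimizer scalars.
reader = tf.train.load_checkpoint(
    "/home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000")
dtypes = reader.get_variable_to_dtype_map()
for name in ("optim/learning_rate", "optim/beta1_power", "optim/beta2_power"):
    print(name, dtypes.get(name))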

Did you run the initial training in a different setup (e.g. a different server)?

@mohammedayub44
Author

I had to stop and change the AWS instance type at one point, but I'm using the same p2.8xlarge instance for training and fine tuning. Apart from that, there were no changes on the server side.

I also had to update OpenNMT-tf somewhere in between. Could the issue be that I ran training with one version of OpenNMT-tf (1.10.0) and fine tuning with another (1.13.1)? Were there any changes on the repo side related to this?

To double-check, let me retry this with another model on a different machine today.

Mohammed Ayub

@mohammedayub44
Author

mohammedayub44 commented Nov 28, 2018

@guillaumekln
I ran the same steps on a different model (using the same server specs for training and fine tuning). I don't get the float32 error anymore; however, I get the error below (FYI, I got the same error on the above model too):
tensorflow.python.framework.errors_impl.NotFoundError: Key optim/cond/beta1_power not found in checkpoint

Here is the full log file: https://www.dropbox.com/s/o6a4jamua4el7um/da_error.txt?dl=0
Using OpenNMT-tf version 1.14.0
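For reference, a minimal sketch (assuming TensorFlow 1.x; the checkpoint path is illustrative) of listing the optimizer variable names actually stored in the checkpoint, to see whether optim/beta1_power or optim/cond/beta1_power is present:

import tensorflow as tf  # TF 1.x

# List optimizer-related variables stored in the updated checkpoint.
ckpt = "/home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000"
for name, shape in tf.train.list_variables(ckpt):
    if "beta" in name or "learning_rate" in name:
        print(name, shape)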

@guillaumekln
Contributor

We have been using the vocabulary update feature a lot around here, so there might be something specific to your setup. A couple of questions:

  • What are your TensorFlow and Numpy versions? (A quick way to check is sketched after this list.)
  • Can you even reload the non-fine-tuned checkpoint?
  • Are you using distributed training? Does it work in a non-distributed setup?
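A minimal sketch of printing the versions actually imported by the Python environment used to run onmt-main:

import numpy as np
import tensorflow as tf

# Show which versions this interpreter actually loads.
print("tensorflow:", tf.__version__)
print("numpy:", np.__version__)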

mohammedayub44 changed the title from "Error while running domain adaptation (fine tuning)" to "Error while running domain adaptation (fine tuning) with distributed mode" on Nov 29, 2018
@mohammedayub44
Author

mohammedayub44 commented Nov 29, 2018

Below are my findings:

What are your TensorFlow and Numpy versions?
conda list gives me the following:
tensorflow 1.10.0 <pip>
tensorflow-gpu 1.12.0 <pip>
(I see two numpy versions)
numpy 1.14.5 <pip>
numpy 1.14.3 py36hcd700cb_1

Can you even reload the non-fine-tuned checkpoint?

Distributed mode: apparently not. It gives me the same error: Key optim/cond/beta1_power not found in checkpoint
Replicated mode: yes, I can. It runs perfectly fine.

Are you using distributed training? Does it work in a non-distributed setup?

Yes, I'm using distributed training (because training is twice as fast and more cost effective). Fine tuning seems to work fine in non-distributed mode.

In short, it looks like loading model checkpoints works in replicated mode (for both retraining and fine tuning) but not in distributed mode.

Mohammed Ayub

@guillaumekln
Contributor

Thanks, that's very interesting. I will check what is happening in distributed mode (even though we let TensorFlow do everything).

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

Also, not sure if this is a cascading issue: when I run fine tuning (in replicated mode) on the domain data, my BLEU scores keep dropping and my eval predictions are getting worse.
Here is the log file - https://www.dropbox.com/s/kf7epkjxnpoq9go/en_es_transformer_a_un_da_11292018.log?dl=0

@guillaumekln
Contributor

Key optim/cond/beta1_power not found in checkpoint

You highlighted another issue here, thanks! Models trained with gradient accumulation had different variable names than models trained without it. Fixed in ff38e89.

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

Great! Thanks for the fix, @guillaumekln.
Let me try again and check whether it works now.

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

The good news is that loading the checkpoint now works fine in distributed mode.
The bad news is that the BLEU scores during fine tuning are, for some reason, dropping drastically on the evaluation set:
http://forum.opennmt.net/t/opennmt-tf-fine-tuning-base-model-gives-worse-and-decreases-bleu-scores/2284

@mohammedayub44
Author

Hi @guillaumekln,

I did some more experiments on fine tuning with other base models; it looks like the BLEU scores were decreasing because I had over-fit my base model. Running fine tuning on partially trained models seems to give better fine-tuned BLEU scores.
I will close this, as the original issue was resolved.

Thanks!

Mohammed Ayub
