
Error while running domain adaptation (fine tuning) with distributed mode #269

Closed
mohammedayub44 opened this issue Nov 20, 2018 · 22 comments

@mohammedayub44

Hi,

I have created new vocabulary files (source and target) from the domain data set and updated the base model checkpoint using the command below:
onmt-update-vocab
--model_dir /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/
--output_dir /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/
--src_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt
--tgt_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt
--new_src_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt
--new_tgt_vocab /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_nfpa_50k.txt

This generates a new checkpoint, which I pass to the fine-tuning train_and_eval command:
onmt-main train_and_eval
--model_type Transformer
--checkpoint_path /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/
--config /home/ubuntu/mayub/datasets/in_use/euro/run1/config_run_da_nfpa.yml
--auto_config --num_gpus 8

The only changes I made to the config file were updating the train and eval features and labels files (source and target vocabulary are the same):

data:
  train_features_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_train_tokenized_bpe_applied.en
  train_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_train_tokenized_bpe_applied.es
  eval_features_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_dev_tokenized_bpe_applied.en
  eval_labels_file: /home/ubuntu/mayub/datasets/in_use/euro/run1/nfpa_dev_tokenized_bpe_applied.es
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_50k.txt

Below is the error I'm getting:
(screenshot of the error trace)

Not sure where I'm going wrong. Any help appreciated.

Thanks !

Mohammed Ayub

@guillaumekln
Contributor

Hi,

(source and target vocabulary are the same)

You have new vocabularies so you should update them in your configuration.
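For example, if the checkpoint was updated with the new vocabularies in replace mode, the data section would point directly to them (a sketch using the paths from your commands; with merge mode it should point to the merged vocabulary files instead):

data:
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/euro/train_vocab/trg_vocab_nfpa_50k.txt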

@mohammedayub44
Author

I presently have four vocabulary files:
Old Src - 50000
Old Trg - 50000
New Src - 11302
New Target - 15285
I'm guessing I cannot directly pass these new vocabulary files since the sizes are different. It looks like I need a merged vocabulary file that has 55900 words?

onmt-update-vocab just updated the checkpoint but did not output a merged vocabulary file. Is there something I'm missing or misunderstanding?

Thanks !

Mohammed Ayub

@mohammedayub44
Author

mohammedayub44 commented Nov 20, 2018

Hi @guillaumekln ,

I passed the new vocabulary files (src_vocab_nfpa_50k.txt and trg_vocab_nfpa_50k.txt), which have 11302 and 15285 terms, to onmt-update-vocab with --mode replace. It gives me the error below when running the following train command:

onmt-main train_and_eval --model_type Transformer --checkpoint_path /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/ --config /home/ubuntu/mayub/datasets/in_use/euro/run1/config_run_da_nfpa.yml --auto_config --num_gpus 8

2018-11-20 16:52:17.028197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10756 MB memory) -> physical GPU (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000
2018-11-20 16:52:19.926786: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
2018-11-20 16:52:19.928493: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Not sure why it's giving this error.

Mohammed Ayub

@mohammedayub44
Author

I have not changed anything apart from updating the vocabularies with onmt-update-vocab. Here is a comparison of the log files of the two models (the base model and the one used for fine tuning):

(screenshot comparing the log files of the two models)

Mohammed Ayub

@guillaumekln
Contributor

onmt-update-vocab just updated the checkpoint but did not output a merged vocabulary file. Is there something I'm missing or misunderstanding?

The PR referenced above will add generation of the merged vocabulary file to make this easier.
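In the meantime, a minimal sketch of building a merged vocabulary by hand (assuming plain one-token-per-line vocabulary files, and that merging means appending new tokens that are not already in the base vocabulary; the function and output file name are illustrative):

def merge_vocab(old_path, new_path, out_path):
    # Keep the base vocabulary order, then append unseen domain tokens.
    with open(old_path, encoding="utf-8") as f:
        merged = [line.rstrip("\n") for line in f]
    seen = set(merged)
    with open(new_path, encoding="utf-8") as f:
        for line in f:
            token = line.rstrip("\n")
            if token and token not in seen:
                seen.add(token)
                merged.append(token)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")

merge_vocab(
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_50k.txt",
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_nfpa_50k.txt",
    "/home/ubuntu/mayub/datasets/in_use/euro/train_vocab/src_vocab_merged.txt")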

For the error, does it work if instead of using --checkpoint_path you change the model_dir in the configuration to point to the updated checkpoint directory?
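For example (a one-line sketch, using the updated checkpoint directory from your commands):

model_dir: /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/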

@mohammedayub44
Author

mohammedayub44 commented Nov 21, 2018

The PR referenced above will add generation of the merged vocabulary file to make this easier.

Great ! Thank you for adding that @guillaumekln

For the error, does it work if instead of using --checkpoint_path you change the model_dir in the configuration to point to the updated checkpoint directory?

No, it did not work. I pointed model_dir to the updated checkpoint directory and tried both with and without the --checkpoint_path argument; it gives me the same error.

Also, if you look at the log file it says "you can change only the non-structural values like dropout etc." I'm assuming that does not apply to the data or train parameters, since those are what I have updated. Not sure if it's a TensorFlow import_meta_graph bug?

Mohammed Ayub

@guillaumekln
Contributor

It's most definitely a bug in our code. Do you have the full logs of the onmt-update-vocab command?

@mohammedayub44
Author

It did not produce much logging output when running the command. Below is what I got on standard output; let me know if this helps:

INFO:tensorflow:Updating vocabulary related variables in checkpoint /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/model.ckpt-25000
2018-11-20 17:08:46.094014: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-20 17:08:46.335043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-20 17:08:46.335079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]
INFO:tensorflow:Saving new checkpoint to /home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab/

@mohammedayub44
Author

mohammedayub44 commented Nov 27, 2018

@guillaumekln Just checking if there is any update on this, or whether there is any other way to do fine tuning with this repo.

Thanks.

Mohammed Ayub

@guillaumekln
Contributor

guillaumekln commented Nov 28, 2018

I tried to reproduce this error but did not succeed. Will keep looking unless you are able to send me the checkpoint and the vocabularies.

Is there something I should know about your setup/installation? Looks like scalar tensors are silently promoted to float64.

@mohammedayub44
Author

Sure. Here is the Dropbox link containing the files: https://www.dropbox.com/s/1cmynzc5kvbr89t/Issue269.zip?dl=0

  • Base Model Vocabulary - Vocabulary without domain terms
  • Finetune Model Vocabulary - Merged Vocabulary with domain terms.
  • Base Model Checkpoint - checkpoint before updating
  • Finetune Model Checkpoint - Updated checkpoint after running onmt-update-vocab

It is a basic out-of-the-box setup; no custom code or changes were made.

Mohammed Ayub

@guillaumekln
Contributor

Thanks, that's helpful.

In both checkpoints, the scalar variables are float64 (e.g. the learning rate). However, the error indicates a data type mismatch, which means the new graph declares these variables as float32 and can no longer load the checkpoint.
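A minimal sketch of confirming the stored dtypes directly from the checkpoint, assuming TensorFlow 1.x and the checkpoint path from your logs:

import tensorflow as tf  # TF 1.x

# Read the dtypes recorded in the checkpoint for the optimizer scalars.
reader = tf.train.load_checkpoint(
    "/home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000")
dtypes = reader.get_variable_to_dtype_map()
for name in ("optim/learning_rate", "optim/beta1_power", "optim/beta2_power"):
    print(name, dtypes.get(name))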

Did you run the initial training in a different setup (e.g. a different server)?

@mohammedayub44
Author

I had to stop and change the AWS instance type at one point, but I'm using the same p2.8xlarge instance for training and fine tuning. Apart from that, there were no changes on the server side.

I also had to update OpenNMT-tf somewhere in between. Could the issue be that I ran training with one version of OpenNMT-tf (1.10.0) and fine tuning with another (1.13.1)? Were there any changes on the repo side related to this?

To double-check, let me retry this with another model on a different machine today.

Mohammed Ayub

@mohammedayub44
Author

mohammedayub44 commented Nov 28, 2018

@guillaumekln
I ran the same steps on a different model (using the same server specs for training and fine tuning). I don't get the float32 error anymore; however, I get the error below (FYI, I got the same error on the above model too):
tensorflow.python.framework.errors_impl.NotFoundError: Key optim/cond/beta1_power not found in checkpoint

Here is the full log file: https://www.dropbox.com/s/o6a4jamua4el7um/da_error.txt?dl=0
Using OpenNMT-tf version 1.14.0
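For reference, a minimal sketch (assuming TensorFlow 1.x; the checkpoint path is illustrative) of listing the optimizer variable names actually stored in the checkpoint, to see whether optim/beta1_power or optim/cond/beta1_power is present:

import tensorflow as tf  # TF 1.x

# List optimizer-related variables stored in the updated checkpoint.
ckpt = "/home/ubuntu/mayub/datasets/in_use/euro/run1/en_es_transformer_b/added_vocab2/model.ckpt-25000"
for name, shape in tf.train.list_variables(ckpt):
    if "beta" in name or "learning_rate" in name:
        print(name, shape)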

@guillaumekln
Contributor

We have been using the vocabulary update feature a lot around here, so there might be something specific to your setup. A couple of questions:

  • What are your TensorFlow and Numpy versions? (A quick way to check is sketched after this list.)
  • Can you even reload the non-fine-tuned checkpoint?
  • Are you using distributed training? Does it work in a non-distributed setup?
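A minimal sketch of printing the versions actually imported by the Python environment used to run onmt-main:

import numpy as np
import tensorflow as tf

# Show which versions this interpreter actually loads.
print("tensorflow:", tf.__version__)
print("numpy:", np.__version__)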

mohammedayub44 changed the title from "Error while running domain adaptation (fine tuning)" to "Error while running domain adaptation (fine tuning) with distributed mode" on Nov 29, 2018
@mohammedayub44
Author

mohammedayub44 commented Nov 29, 2018

Below are my findings:

What are your TensorFlow and Numpy versions?
conda list gives me the following:
tensorflow 1.10.0 <pip>
tensorflow-gpu 1.12.0 <pip>
(I see two numpy versions)
numpy 1.14.5 <pip>
numpy 1.14.3 py36hcd700cb_1

Can you even reload the non-fine-tuned checkpoint?

Distributed mode: apparently not. It gives me the same error: Key optim/cond/beta1_power not found in checkpoint
Replicated mode: yes, I can. It runs perfectly fine.

Are you using distributed training? Does it work in a non-distributed setup?

Yes, I'm using distributed training (because training is twice as fast and more cost effective). Fine tuning seems to work fine in non-distributed mode.

In short, it looks like loading model checkpoints works in replicated mode (for both retraining and fine tuning) but not in distributed mode.

Mohammed Ayub

@guillaumekln
Contributor

Thanks, that's very interesting. I will check what is happening in distributed mode (even though we let TensorFlow do everything).

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

Also, not sure if this is a cascading issue: when I run fine tuning (in replicated mode) on the domain data, my BLEU scores keep dropping and my eval predictions are getting worse.
Here is the log file - https://www.dropbox.com/s/kf7epkjxnpoq9go/en_es_transformer_a_un_da_11292018.log?dl=0

@guillaumekln
Contributor

Key optim/cond/beta1_power not found in checkpoint

You highlighted another issue here, thanks! Models trained with gradient accumulation had different variable names than models trained without it. Fixed in ff38e89.

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

Great! Thanks for the fix, @guillaumekln.
Let me try again and check whether it works now.

@mohammedayub44
Author

mohammedayub44 commented Nov 30, 2018

The good news is that loading the checkpoint now works fine in distributed mode.
The bad news is that the BLEU scores during fine tuning are, for some reason, dropping drastically on the evaluation set:
http://forum.opennmt.net/t/opennmt-tf-fine-tuning-base-model-gives-worse-and-decreases-bleu-scores/2284

@mohammedayub44
Author

Hi @guillaumekln,

I did some more experiments on fine tuning with other base models; it looks like the BLEU scores were decreasing because I had over-fit my base model. Running fine tuning on partially trained models seems to give better fine-tuned BLEU scores.
I will close this, as the original issue was resolved.

Thanks!

Mohammed Ayub
