Train ResNet-20 model for the CIFAR-10 on colab not working #72

Closed
besherh opened this issue Nov 16, 2018 · 13 comments

@besherh

besherh commented Nov 16, 2018

Hello,
I am using Colab to train ResNet on CIFAR-10. After mounting Google Drive, I cloned the repository and was able to run the script. TensorFlow loads and the data directory is passed to the network, but the run ends with:

tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory

Any suggestions?

Below you can find the code and the log file:

from google.colab import drive
drive.mount('/content/gdrive')
import os
os.chdir("/content/gdrive/My Drive/apps/PocketFlow")
!chmod 755 ./scripts/run_local.sh
!./scripts/run_local.sh nets/resnet_at_cifar10_run.py

the log:

Python script: nets/resnet_at_cifar10_run.py
# of GPUs: 1
extra arguments:  --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /content/drive/My Drive/apps/datasets/cifar10
'nets/resnet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /content/drive/My
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 30
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_channel_pruned_path: ./models/pruned_model.ckpt
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-16 12:53:20.147847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-16 12:53:20.148287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-16 12:53:20.148358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-16 12:53:20.565167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 12:53:20.565235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-16 12:53:20.565262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-16 12:53:20.565561: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2018-11-16 12:53:20.565637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
WARNING:tensorflow:From /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:85: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:106: shuffle_and_repeat (from tensorflow.contrib.data.python.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.shuffle_and_repeat(...)`.
2018-11-16 12:53:23.066723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-16 12:53:23.066814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 12:53:23.066857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-16 12:53:23.066882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-16 12:53:23.067168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-11-16 12:53:24.963790: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: /content/drive/My; No such file or directory
2018-11-16 12:53:24.964542: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: /content/drive/My; No such file or directory
2018-11-16 12:53:24.964744: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at iterator_ops.cc:947 : Not found: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
     [[{{node data/OneShotIterator}} = OneShotIterator[container="", dataset_factory=_make_dataset_E02JEaYNEAE[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
     [[{{node data/IteratorGetNext/_3}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 55, in main
    learner.train()
  File "/content/gdrive/My Drive/apps/PocketFlow/learners/full_precision/learner.py", line 71, in train
    self.sess_train.run(self.train_op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
     [[node data/OneShotIterator (defined at /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:109)  = OneShotIterator[container="", dataset_factory=_make_dataset_E02JEaYNEAE[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
     [[{{node data/IteratorGetNext/_3}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
@jiaxiang-wu
Contributor

@besherh Could you remove the whitespace from "/content/gdrive/My Drive/apps/PocketFlow"? The path configuration file does not seem to handle file paths that contain whitespace.
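
The log is consistent with the path being split at the space: data_dir_local is reported as /content/drive/My. Below is a minimal sketch of that effect, using Python's shlex to mimic shell-style word splitting. How PocketFlow's scripts actually forward the path is an assumption here, not something confirmed in this thread.

```python
import shlex

# Extra arguments as they appear in the log, with the path left unquoted.
unquoted = "--data_dir_local /content/drive/My Drive/apps/datasets/cifar10"
# The same arguments with the path quoted.
quoted = '--data_dir_local "/content/drive/My Drive/apps/datasets/cifar10"'

# shlex.split applies shell-like word splitting and quote handling.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(unquoted_args)  # the flag only receives '/content/drive/My'
print(quoted_args)    # the flag receives the full path
```

With the unquoted form, the flag value stops at the first space, which matches the NotFoundError for /content/drive/My.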

@besherh
Author

besherh commented Nov 19, 2018

Thanks for the hint. I fixed it with:
!ln -s "/content/gdrive/My Drive" "/content/mydrive"
which removes the space, as suggested.
I have another question, though: how can I measure the effect of a pruning method? For instance, I want to compare the model size of ResNet on CIFAR-10 before and after pruning.
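
For reference, the symlink workaround can also be done from Python instead of a shell command. This is a small sketch; the helper name link_without_spaces is hypothetical, and the temporary directory only stands in for /content so the example can run anywhere.

```python
import os
import tempfile


def link_without_spaces(target_dir, link_path):
    """Create a symlink at link_path pointing to target_dir (if not present)."""
    if not os.path.islink(link_path):
        os.symlink(target_dir, link_path)
    return link_path


# Temporary directory standing in for /content (an assumption for the demo).
root = tempfile.mkdtemp()
drive = os.path.join(root, "My Drive")
os.makedirs(os.path.join(drive, "apps"))

# The link name has no space, so it is safe to pass around unquoted.
link = link_without_spaces(drive, os.path.join(root, "mydrive"))
print(os.path.isdir(os.path.join(link, "apps")))
```

On Colab the equivalent call would point "/content/mydrive" at "/content/gdrive/My Drive", as in the ln -s command.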

@jiaxiang-wu
Contributor

Use tools/conversion/export_pb_tflite_models.py to convert checkpoint files into *.pb and *.tflite models. Two models will be generated:

  • model_original.pb/tflite: the model without graph rewriting, so its size equals that of the unpruned model.
  • model_transformed.pb/tflite: the model with graph rewriting, so its size equals that of the pruned model.

@besherh
Author

besherh commented Nov 20, 2018

Thanks, I now have the files you mentioned:
model_original.pb (1,129 KB)
model_transformed (659 KB)
But honestly, I don't see how to compare the two models in terms of pruning ratio, model size, number of channels pruned, accuracy, and loss. Is there a graphical way to do this (plots)?

Thanks for your support!
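
As a rough first comparison, the two file sizes reported above give a relative size reduction. This is only a sketch: the function name size_reduction is hypothetical, and file size is not the same as the pruning ratio, since a *.pb file also stores the graph structure alongside the weights.

```python
def size_reduction(original_kb, pruned_kb):
    """Fraction by which the file shrank, relative to the original size."""
    return 1.0 - pruned_kb / original_kb


# File sizes reported in this thread (in KB).
reduction = size_reduction(1129, 659)
print("size reduction: {:.1%}".format(reduction))  # size reduction: 41.6%
```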

@jiaxiang-wu
Contributor

You should be able to see how many channels have been pruned in the model conversion script's log. Besides, *.pb files can be visualized in TensorBoard.

@besherh
Author

besherh commented Nov 21, 2018

Thanks, but what about the pruning ratio? How do I work it out?
I want to get something like this (this table is from the PocketFlow docs):

Model         Pruning Ratio
MobileNet-v1  50%
MobileNet-v1  60%
MobileNet-v1  70%
MobileNet-v1  80%

Thanks!

@jiaxiang-wu
Contributor

To compute the overall pruning ratio, you need to calculate the number of parameters in each layer before and after pruning, and then sum them up. This can be inferred from the model conversion script's log.

@besherh
Author

besherh commented Nov 21, 2018

Thanks again. Here are snapshots from the conversion script's log:
Original Model:
[image: original]

Pruned Model:
[image: transformed]

Based on this, the pruning ratio is supposed to be 22%. Is this correct?

@jiaxiang-wu
Contributor

No, this is the number of weight tensors. You need to compute the number of parameters in each tensor (based on each tensor's shape information) and sum them up to obtain the overall number of parameters in the model before and after pruning.
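
A minimal sketch of that calculation, assuming the pruning ratio is taken as 1 - (parameters after) / (parameters before). The tensor names and shapes below are hypothetical and are not taken from the conversion log in this thread.

```python
from functools import reduce
from operator import mul


def num_params(shape):
    """Number of elements in a tensor with the given shape."""
    return reduce(mul, shape, 1)


def total_params(shapes):
    """Sum of element counts over all tensors in a {name: shape} dict."""
    return sum(num_params(shape) for shape in shapes.values())


# Hypothetical conv kernels (height, width, in_channels, out_channels).
before = {"conv1/kernel": (3, 3, 3, 16), "conv2/kernel": (3, 3, 16, 32)}
# Half of conv1's output channels pruned; conv2's input channels shrink too.
after = {"conv1/kernel": (3, 3, 3, 8), "conv2/kernel": (3, 3, 8, 32)}

params_before = total_params(before)
params_after = total_params(after)
pruning_ratio = 1.0 - params_after / params_before

print(len(before), len(after))      # same number of weight tensors
print(params_before, params_after)  # 5040 2520
print(pruning_ratio)                # 0.5
```

The number of tensors is identical before and after, while the parameter count halves, which is why counting tensors alone does not give the pruning ratio.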

@besherh
Author

besherh commented Nov 26, 2018

I was not able to figure out how to do this. Could you give a quick example, if possible?

@jiaxiang-wu
Contributor

jiaxiang-wu commented Nov 29, 2018

@besherh
You can obtain a full list of the variables present in the graph via tf.get_collection(). After restoring the model, you should have access to each variable's shape and actual value.
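
One way this suggestion might look in code is sketched below, under stated assumptions: it uses the TF 1.x API, reads the TRAINABLE_VARIABLES collection (that choice is an assumption, not something specified in this thread), and the helper names are hypothetical. TensorFlow is imported lazily so the counting helper can be used on its own.

```python
from functools import reduce
from operator import mul


def count_parameters(variables):
    """Sum the element counts of objects exposing `.shape.as_list()`."""
    total = 0
    for v in variables:
        total += reduce(mul, v.shape.as_list(), 1)
    return total


def count_graph_parameters():
    """Count parameters of the trainable variables in the default graph.

    Assumes TF 1.x and that the model graph has already been built or
    restored in the default graph before this is called.
    """
    import tensorflow as tf  # lazy import; TF 1.x API assumed
    variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    return count_parameters(variables)
```

Calling count_graph_parameters() once for the original model and once for the pruned model would give the two totals needed for the overall pruning ratio.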

@besherh
Author

besherh commented Dec 6, 2018

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import tensor_util

GRAPH_PB_PATH = './model_original.pb'

# Parse the graph definition; no session is needed just to inspect constants.
print("load graph")
graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_PB_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())

# Assuming a frozen graph, weights are stored as Const nodes.
# Note: non-weight constants (e.g. shape tensors) are Const nodes too,
# so they are included in this count.
wts = [n for n in graph_def.node if n.op == 'Const']

original_params_count = 0
for n in wts:
    # Number of elements in the tensor = product of its dimensions.
    shape = tensor_util.MakeNdarray(n.attr['value'].tensor).shape
    original_params_count += int(np.prod(shape))

I am trying to count the parameters in the saved model as shown above. Is this right?
Thanks in advance!

@jiaxiang-wu
Contributor

@besherh Yes, I think the above code is correct.
