Train ResNet-20 model for the CIFAR-10 on colab not working #72

besherh · 2018-11-16T13:54:44Z

Hello,
I am using Colab to train ResNet on CIFAR-10. After mounting Google Drive, I cloned the repository and was able to run the script. TensorFlow loads and the data files are passed to the network, but the run ends with:

tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory

Any suggestions?

Below you can find the code and the log file:

from google.colab import drive
drive.mount('/content/gdrive')
import os
os.chdir("/content/gdrive/My Drive/apps/PocketFlow")
!chmod 755 ./scripts/run_local.sh
!./scripts/run_local.sh nets/resnet_at_cifar10_run.py

the log:

Python script: nets/resnet_at_cifar10_run.py
# of GPUs: 1
extra arguments:  --model_http_url https://api.ai.tencent.com/pocketflow --data_dir_local /content/drive/My Drive/apps/datasets/cifar10
'nets/resnet_at_cifar10_run.py' -> 'main.py'
multi-GPU training disabled
[WARNING] TF-Plus & Horovod cannot be imported; multi-GPU training is unsupported
INFO:tensorflow:FLAGS:
INFO:tensorflow:data_disk: local
INFO:tensorflow:data_hdfs_host: None
INFO:tensorflow:data_dir_local: /content/drive/My
INFO:tensorflow:data_dir_hdfs: None
INFO:tensorflow:cycle_length: 4
INFO:tensorflow:nb_threads: 8
INFO:tensorflow:buffer_size: 1024
INFO:tensorflow:prefetch_size: 8
INFO:tensorflow:nb_classes: 10
INFO:tensorflow:nb_smpls_train: 50000
INFO:tensorflow:nb_smpls_val: 5000
INFO:tensorflow:nb_smpls_eval: 10000
INFO:tensorflow:batch_size: 128
INFO:tensorflow:batch_size_eval: 100
INFO:tensorflow:resnet_size: 20
INFO:tensorflow:lrn_rate_init: 0.1
INFO:tensorflow:batch_size_norm: 128.0
INFO:tensorflow:momentum: 0.9
INFO:tensorflow:loss_w_dcy: 0.0002
INFO:tensorflow:model_http_url: https://api.ai.tencent.com/pocketflow
INFO:tensorflow:summ_step: 100
INFO:tensorflow:save_step: 10000
INFO:tensorflow:save_path: ./models/model.ckpt
INFO:tensorflow:save_path_eval: ./models_eval/model.ckpt
INFO:tensorflow:enbl_dst: False
INFO:tensorflow:enbl_warm_start: False
INFO:tensorflow:loss_w_dst: 4.0
INFO:tensorflow:tempr_dst: 4.0
INFO:tensorflow:save_path_dst: ./models_dst/model.ckpt
INFO:tensorflow:nb_epochs_rat: 1.0
INFO:tensorflow:ddpg_actor_depth: 2
INFO:tensorflow:ddpg_actor_width: 64
INFO:tensorflow:ddpg_critic_depth: 2
INFO:tensorflow:ddpg_critic_width: 64
INFO:tensorflow:ddpg_noise_type: param
INFO:tensorflow:ddpg_noise_prtl: tdecy
INFO:tensorflow:ddpg_noise_std_init: 1.0
INFO:tensorflow:ddpg_noise_dst_finl: 0.01
INFO:tensorflow:ddpg_noise_adpt_rat: 1.03
INFO:tensorflow:ddpg_noise_std_finl: 1e-05
INFO:tensorflow:ddpg_rms_eps: 0.0001
INFO:tensorflow:ddpg_tau: 0.01
INFO:tensorflow:ddpg_gamma: 0.9
INFO:tensorflow:ddpg_lrn_rate: 0.001
INFO:tensorflow:ddpg_loss_w_dcy: 0.0
INFO:tensorflow:ddpg_record_step: 1
INFO:tensorflow:ddpg_batch_size: 64
INFO:tensorflow:ddpg_enbl_bsln_func: True
INFO:tensorflow:ddpg_bsln_decy_rate: 0.95
INFO:tensorflow:ws_save_path: ./models_ws/model.ckpt
INFO:tensorflow:ws_prune_ratio: 0.75
INFO:tensorflow:ws_prune_ratio_prtl: optimal
INFO:tensorflow:ws_nb_rlouts: 200
INFO:tensorflow:ws_nb_rlouts_min: 50
INFO:tensorflow:ws_reward_type: single-obj
INFO:tensorflow:ws_lrn_rate_rg: 0.03
INFO:tensorflow:ws_nb_iters_rg: 20
INFO:tensorflow:ws_lrn_rate_ft: 0.0003
INFO:tensorflow:ws_nb_iters_ft: 400
INFO:tensorflow:ws_nb_iters_feval: 25
INFO:tensorflow:ws_prune_ratio_exp: 3.0
INFO:tensorflow:ws_iter_ratio_beg: 0.1
INFO:tensorflow:ws_iter_ratio_end: 0.5
INFO:tensorflow:ws_mask_update_step: 500.0
INFO:tensorflow:cp_lasso: True
INFO:tensorflow:cp_quadruple: False
INFO:tensorflow:cp_reward_policy: accuracy
INFO:tensorflow:cp_nb_points_per_layer: 10
INFO:tensorflow:cp_nb_batches: 30
INFO:tensorflow:cp_prune_option: auto
INFO:tensorflow:cp_prune_list_file: ratio.list
INFO:tensorflow:cp_channel_pruned_path: ./models/pruned_model.ckpt
INFO:tensorflow:cp_best_path: ./models/best_model.ckpt
INFO:tensorflow:cp_original_path: ./models/original_model.ckpt
INFO:tensorflow:cp_preserve_ratio: 0.5
INFO:tensorflow:cp_uniform_preserve_ratio: 0.6
INFO:tensorflow:cp_noise_tolerance: 0.15
INFO:tensorflow:cp_lrn_rate_ft: 0.0001
INFO:tensorflow:cp_nb_iters_ft_ratio: 0.2
INFO:tensorflow:cp_finetune: False
INFO:tensorflow:cp_retrain: False
INFO:tensorflow:cp_list_group: 1000
INFO:tensorflow:cp_nb_rlouts: 200
INFO:tensorflow:cp_nb_rlouts_min: 50
INFO:tensorflow:dcp_save_path: ./models_dcp/model.ckpt
INFO:tensorflow:dcp_save_path_eval: ./models_dcp_eval/model.ckpt
INFO:tensorflow:dcp_prune_ratio: 0.5
INFO:tensorflow:dcp_nb_stages: 3
INFO:tensorflow:dcp_lrn_rate_adam: 0.001
INFO:tensorflow:dcp_nb_iters_block: 10000
INFO:tensorflow:dcp_nb_iters_layer: 500
INFO:tensorflow:uql_equivalent_bits: 4
INFO:tensorflow:uql_nb_rlouts: 200
INFO:tensorflow:uql_w_bit_min: 2
INFO:tensorflow:uql_w_bit_max: 8
INFO:tensorflow:uql_tune_layerwise_steps: 100
INFO:tensorflow:uql_tune_global_steps: 2000
INFO:tensorflow:uql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:uql_tune_disp_steps: 300
INFO:tensorflow:uql_enbl_random_layers: True
INFO:tensorflow:uql_enbl_rl_agent: False
INFO:tensorflow:uql_enbl_rl_global_tune: True
INFO:tensorflow:uql_enbl_rl_layerwise_tune: False
INFO:tensorflow:uql_weight_bits: 4
INFO:tensorflow:uql_activation_bits: 32
INFO:tensorflow:uql_use_buckets: False
INFO:tensorflow:uql_bucket_size: 256
INFO:tensorflow:uql_quant_epochs: 60
INFO:tensorflow:uql_save_quant_model_path: ./uql_quant_models/uql_quant_model.ckpt
INFO:tensorflow:uql_quantize_all_layers: False
INFO:tensorflow:uql_bucket_type: channel
INFO:tensorflow:uqtf_save_path: ./models_uqtf/model.ckpt
INFO:tensorflow:uqtf_save_path_eval: ./models_uqtf_eval/model.ckpt
INFO:tensorflow:uqtf_weight_bits: 8
INFO:tensorflow:uqtf_activation_bits: 8
INFO:tensorflow:uqtf_quant_delay: 0
INFO:tensorflow:uqtf_freeze_bn_delay: None
INFO:tensorflow:uqtf_lrn_rate_dcy: 0.01
INFO:tensorflow:nuql_equivalent_bits: 4
INFO:tensorflow:nuql_nb_rlouts: 200
INFO:tensorflow:nuql_w_bit_min: 2
INFO:tensorflow:nuql_w_bit_max: 8
INFO:tensorflow:nuql_tune_layerwise_steps: 100
INFO:tensorflow:nuql_tune_global_steps: 2101
INFO:tensorflow:nuql_tune_save_path: ./rl_tune_models/model.ckpt
INFO:tensorflow:nuql_tune_disp_steps: 300
INFO:tensorflow:nuql_enbl_random_layers: True
INFO:tensorflow:nuql_enbl_rl_agent: False
INFO:tensorflow:nuql_enbl_rl_global_tune: True
INFO:tensorflow:nuql_enbl_rl_layerwise_tune: False
INFO:tensorflow:nuql_init_style: quantile
INFO:tensorflow:nuql_opt_mode: weights
INFO:tensorflow:nuql_weight_bits: 4
INFO:tensorflow:nuql_activation_bits: 32
INFO:tensorflow:nuql_use_buckets: False
INFO:tensorflow:nuql_bucket_size: 256
INFO:tensorflow:nuql_quant_epochs: 60
INFO:tensorflow:nuql_save_quant_model_path: ./nuql_quant_models/model.ckpt
INFO:tensorflow:nuql_quantize_all_layers: False
INFO:tensorflow:nuql_bucket_type: split
INFO:tensorflow:log_dir: ./logs
INFO:tensorflow:enbl_multi_gpu: False
INFO:tensorflow:learner: full-prec
INFO:tensorflow:exec_mode: train
INFO:tensorflow:debug: False
INFO:tensorflow:h: False
INFO:tensorflow:help: False
INFO:tensorflow:helpfull: False
INFO:tensorflow:helpshort: False
2018-11-16 12:53:20.147847: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-16 12:53:20.148287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-16 12:53:20.148358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-16 12:53:20.565167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 12:53:20.565235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-16 12:53:20.565262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-16 12:53:20.565561: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2018-11-16 12:53:20.565637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
WARNING:tensorflow:From /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:85: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:106: shuffle_and_repeat (from tensorflow.contrib.data.python.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.shuffle_and_repeat(...)`.
2018-11-16 12:53:23.066723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-16 12:53:23.066814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 12:53:23.066857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-16 12:53:23.066882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-16 12:53:23.067168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-11-16 12:53:24.963790: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: /content/drive/My; No such file or directory
2018-11-16 12:53:24.964542: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at matching_files_op.cc:49 : Not found: /content/drive/My; No such file or directory
2018-11-16 12:53:24.964744: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at iterator_ops.cc:947 : Not found: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
     [[{{node data/OneShotIterator}} = OneShotIterator[container="", dataset_factory=_make_dataset_E02JEaYNEAE[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
     [[{{node data/IteratorGetNext/_3}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 55, in main
    learner.train()
  File "/content/gdrive/My Drive/apps/PocketFlow/learners/full_precision/learner.py", line 71, in train
    self.sess_train.run(self.train_op)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: /content/drive/My; No such file or directory
     [[{{node ShuffleDataset/data/list_files/MatchingFiles}} = MatchingFiles[](ShuffleDataset/data/list_files/file_pattern)]]
     [[node data/OneShotIterator (defined at /content/gdrive/My Drive/apps/PocketFlow/datasets/abstract_dataset.py:109)  = OneShotIterator[container="", dataset_factory=_make_dataset_E02JEaYNEAE[], output_shapes=[[?,32,32,3], [?,10]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
     [[{{node data/IteratorGetNext/_3}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_data/IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]


jiaxiang-wu · 2018-11-19T01:03:13Z

@besherh Is it possible to remove the whitespace in "/content/gdrive/My Drive/apps/PocketFlow"? The path configuration file does not seem able to parse file paths that contain whitespace.
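
The truncated value in the log (data_dir_local: /content/drive/My) is consistent with the extra arguments being split on whitespace somewhere between the path configuration and the Python script. The sketch below only illustrates that splitting behaviour with Python's standard shlex module. It is not PocketFlow's actual parsing code, and the argument string is copied from the log above.

```python
import shlex

# Extra arguments as printed in the log above (the path contains a space).
raw_args = "--data_dir_local /content/drive/My Drive/apps/datasets/cifar10"

# Shell-style splitting breaks the path at the space, so the flag
# only receives "/content/drive/My".
split_unquoted = shlex.split(raw_args)
print(split_unquoted)

# Quoting the path keeps it in one piece, provided that whatever
# consumes the arguments honours the quotes.
quoted_args = '--data_dir_local "/content/drive/My Drive/apps/datasets/cifar10"'
split_quoted = shlex.split(quoted_args)
print(split_quoted)
```

If the consumer of the arguments does not honour quotes, a path without whitespace is the safer option.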

besherh · 2018-11-19T15:06:19Z

Thanks for the hint. I fixed it by creating a symlink without the space, as suggested:
!ln -s "/content/gdrive/My Drive" "/content/mydrive"
I have another question, though:
how can I measure the performance of a pruning method? For instance, I want to compare the model size of ResNet on CIFAR-10 before and after pruning.

jiaxiang-wu · 2018-11-20T01:00:42Z

Use tools/conversion/export_pb_tflite_models.py to convert checkpoint files into *.pb & *.tflite models. Two models will be generated:

model_original.pb/tflite: the model without graph rewriting, so its size equals that of the unpruned model.
model_transformed.pb/tflite: the model with graph rewriting, so its size equals that of the pruned model.

besherh · 2018-11-20T20:49:53Z

Thanks, now I have the files that you mentioned.
model_original.pb (1,129 KB)
model_transformed (659 KB)
But honestly, I do not understand how to compare the two models in terms of
pruning ratio, model size, number of channels pruned, accuracy, and loss. Is there a graphical way to do this (plots)?

Thanks for your support!
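
For the model-size part of this question, a minimal sketch is to compare the two file sizes directly. The helper name size_reduction is made up for illustration, and the inputs are the sizes in KB reported above; for files on disk, the results of os.path.getsize() could be passed instead.

```python
def size_reduction(original_size, pruned_size):
    """Return (fraction of size removed, compression factor)."""
    if original_size <= 0 or pruned_size <= 0:
        raise ValueError("sizes must be positive")
    removed = 1.0 - pruned_size / original_size
    factor = original_size / pruned_size
    return removed, factor

# Sizes in KB as reported above for the two exported .pb files.
removed, factor = size_reduction(1129, 659)
print("size removed: {:.1%}, compression: {:.2f}x".format(removed, factor))
```

With these numbers the transformed file is roughly 42% smaller. File size also includes the graph structure, so this figure is not the same thing as the parameter pruning ratio.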

jiaxiang-wu · 2018-11-21T00:57:31Z

You should be able to see how many channels have been pruned in the model conversion script's log. Besides, *.pb files can be visualized in TensorBoard.

besherh · 2018-11-21T14:02:09Z

Thanks, but what about the pruning ratio? How do I figure this out?
I want to get something like this (this table is from the PocketFlow docs):

Model	Pruning Ratio
MobileNet-v1	50%
MobileNet-v1	60%
MobileNet-v1	70%
MobileNet-v1	80%

Thanks!

jiaxiang-wu · 2018-11-21T14:49:04Z

To compute the overall pruning ratio, you need to calculate the number of parameters in each layer before and after pruning, and then sum them up. This can be inferred from model conversion script's log.

besherh · 2018-11-21T15:29:03Z

Thanks again. Here is a snapshot from the log of the conversion script:
Original Model:

Pruned Model:

Based on this, the pruning ratio is supposed to be 22%. Is this correct?

jiaxiang-wu · 2018-11-21T16:00:18Z

No, this is the number of weight tensors. You need to compute the number of parameters in each tensor (based on each tensor's shape information) and sum them up to obtain the overall number of parameters in the model before and after pruning.
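
As a rough sketch of that calculation: multiply the dimensions of each tensor's shape to get its parameter count, sum over all tensors, and compare the totals. The shapes below are hypothetical placeholders, not values from this model; the real ones would come from the conversion script's log or from the restored variables. "Pruning ratio" is taken here to mean the fraction of parameters removed, which is an assumption about the definition.

```python
from functools import reduce
from operator import mul

def count_params(shapes):
    """Total number of elements across tensors with the given shapes."""
    return sum(reduce(mul, shape, 1) for shape in shapes)

def pruning_ratio(shapes_before, shapes_after):
    """Fraction of parameters removed by pruning."""
    before = count_params(shapes_before)
    after = count_params(shapes_after)
    return 1.0 - after / before

# Hypothetical conv kernel shapes: (kernel_h, kernel_w, in_channels, out_channels).
shapes_before = [(3, 3, 16, 16), (3, 3, 16, 32)]
shapes_after = [(3, 3, 16, 8), (3, 3, 8, 32)]

ratio = pruning_ratio(shapes_before, shapes_after)
print("params before:", count_params(shapes_before))
print("params after:", count_params(shapes_after))
print("pruning ratio: {:.1%}".format(ratio))
```

Counting tensors instead of their elements hides the effect of pruning, because a pruned layer still contributes one (smaller) tensor.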

besherh · 2018-11-26T14:35:17Z

I was not able to figure out how to do this. Could you give a quick example, if possible?

jiaxiang-wu · 2018-11-29T00:15:09Z

@besherh
You can obtain a full list of the variables present in the graph via tf.get_collection(). After restoring the model, you should have access to each variable's shape and actual value.

besherh · 2018-12-06T15:12:32Z

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import tensor_util

GRAPH_PB_PATH = './model_original.pb'

# Parse the frozen graph; no session is needed just to inspect Const nodes.
print("load graph")
with tf.gfile.GFile(GRAPH_PB_PATH, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Collect every Const node; in a frozen graph the weights are stored as constants.
wts = [n for n in graph_def.node if n.op == 'Const']

# Sum the number of elements over all Const tensors (a scalar counts as 1).
original_params_count = 0
for n in wts:
    shape = tensor_util.MakeNdarray(n.attr['value'].tensor).shape
    original_params_count += int(np.prod(shape))

I am trying to count the parameters in the saved model as shown above. Is this right?
Thanks in advance!

jiaxiang-wu · 2018-12-10T01:58:02Z

@besherh Yes, I think the above code is correct.

GoldenSpark mentioned this issue Nov 21, 2018

DisChnPrunedLearner with resnet18 on ImageNet can't converge in local mode #85

Closed

jiaxiang-wu closed this as completed Dec 16, 2018
