
Model Pruning taking too much time to train #60

Closed
dhingratul opened this issue Nov 15, 2018 · 8 comments
Labels: enhancement (New feature or request)

@dhingratul

Describe the bug
Hi, I have been running the pruning training script for the last 3 days. So far it has only generated a couple of checkpoints in models_dcp, but to generate the .tflite and .pb files I need models_dcp_eval, which I assume will only be generated once training is done. I want to skip to the end and compare the inference times of the pruned vs. non-pruned model; I do not care much about accuracy at this point. If I freeze the graph from these checkpoints, will it give me the pruned model? The documentation says the "conversion script automatically detects which channels can be safely pruned, and then produces a light-weighted compressed model". I just need the pruned .pb file.

@jiaxiang-wu
Contributor

To export a TF-Lite model, you need checkpoint files of the evaluation graph (stored in "./models_dcp_eval"), not the training graph (stored in "./models_dcp"). If you want to use the training graph's checkpoint files, you need to restore variables from them and save them again as the evaluation graph (take a look at DisChnPrunedLearner's implementation).
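For reference, the directory convention above can be sketched as follows. This is a hypothetical helper, not part of PocketFlow; only the "./models_dcp" → "./models_dcp_eval" pairing comes from this thread.

```python
# Minimal sketch (not PocketFlow code): the TF-Lite export step looks for
# evaluation-graph checkpoints, which live in a sibling directory of the
# training-graph checkpoints, suffixed with "_eval".
def eval_ckpt_dir(train_ckpt_dir: str) -> str:
    """Map a training-graph checkpoint directory to its evaluation-graph twin."""
    return train_ckpt_dir.rstrip("/") + "_eval"

print(eval_ckpt_dir("./models_dcp"))  # → ./models_dcp_eval
```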

BTW, if you want to accelerate training and quickly compare run-time speed, and do not care much about accuracy, you can set --nb_epochs_rat to a small value. This argument specifies the fraction of training epochs to use. For instance, if you set --nb_epochs_rat to 0.1, only 10% of the training epochs will be used compared with the standard setting, so training takes roughly one tenth of the time.
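The effect of --nb_epochs_rat is simple arithmetic, sketched below. This is an illustration, not PocketFlow's actual implementation, and the full epoch count of 250 is an assumed example rather than a PocketFlow default.

```python
# Illustrative sketch of how --nb_epochs_rat scales training length.
def scaled_epochs(nb_epochs_full: int, nb_epochs_rat: float) -> int:
    # Keep at least one epoch so training still produces a checkpoint.
    return max(1, int(round(nb_epochs_full * nb_epochs_rat)))

# With --nb_epochs_rat 0.1, only 10% of the epochs are run:
print(scaled_epochs(250, 0.1))  # → 25
```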

@dhingratul
Author

Is there a way to resume the training from where it stopped, and pass that argument this time?

@jiaxiang-wu
Contributor

Sorry, the current DisChnPrunedLearner implementation does not support this. You would need to modify its implementation to be able to recover from a previous run.

@dhingratul
Author

@jiaxiang-wu Does the --nb_epochs_rat parameter also work with other optimizations, such as uniform quantization?

@dhingratul dhingratul reopened this Nov 20, 2018
@jiaxiang-wu
Contributor

jiaxiang-wu commented Nov 21, 2018

@dhingratul The --nb_epochs_rat argument is supported in UniformQuantTFLearner, but not in UniformQuantLearner, due to slightly different implementations of the learning rate schedule. We are considering unifying UniformQuantLearner's implementation to support this.

@jiaxiang-wu jiaxiang-wu self-assigned this Nov 21, 2018
@jiaxiang-wu jiaxiang-wu added the "enhancement" label Nov 21, 2018
@jiaxiang-wu
Contributor

Enhancement required: add support for the --nb_epochs_rat argument in UniformQuantLearner.

@dhingratul
Author

dhingratul commented Nov 21, 2018

@jiaxiang-wu It should be added for all the optimizers. Is there a comparison of how the TF version performs compared with native PocketFlow?

@jiaxiang-wu
Contributor

Basically, their performance (in accuracy) is similar, since the underlying training algorithm is the same, apart from some implementation details. The native version, UniformQuantLearner, provides more features than the TF version, UniformQuantTFLearner, including a variable number of quantization bits for each layer (so that RL can be used to optimize the strategy). However, the latter can be exported to TF-Lite models and deployed on mobile devices, while the former cannot.
