Faster topo #5794

Merged: 35 commits merged into Theano:master from faster_topo on Jun 1, 2017

Conversation

ReyhaneAskari (Member)

Regarding issue #4233, this PR adds a theano flag disbale_cycle_detection.

@@ -1555,6 +1555,12 @@ def filter_vm_lazy(val):
IntParam(5, lambda i: i > 0, allow_override=False),
in_c_key=False)

AddConfigVar('disbale_cycle_detection',
Member

I think I would use a different flag type, in case we have more options in the future: a flag cycle_detection that defaults to "topo", with the new version being "fast". A later version of fast that allows some sequence of inputs could be given another name like sequence (or it could replace fast if fast has no advantage over it; I think that would be the case).

Member Author

Right! So you would want it to be a StrParam?

obilaniu

Small small small comment: disbale -> disable.

Member Author

Thanks @obilaniu.

dm = getattr(app.op, 'destroy_map', None)
if not dm:
    return
inputs = sum(dm.values())  # list of app's destroyed inputs
# inputs = sum(dm.values())  # list of app's destroyed inputs
inputs = dm.values()[0]
Member

This isn't good. An op can have multiple outputs, and in that case we want all the inputs that can be destroyed.

Member Author

Thanks, I fixed it. The items inside dm.values() are themselves lists. In the new version, inputs is a list of all the inputs that are destroyed.

v = getattr(app2, 'view_map', {}).get(inp_idx2, [])
dv = d + v
assert len(dv) <= 1
if len(v) > 0:
Member

I think there is an error here and it should be "len(dv)" not "len(v)". What do you think?

Member Author

Yes, I think it should be len(dv).

Member

Right. For the first version, we only allow that case.

@@ -924,7 +930,8 @@ def on_change_input(self, fgraph, app, i, old_r, new_r, reason):
self.view_o.setdefault(new_r, OrderedSet()).add(output)

# TODO: check here only one level of fast destroy_map.
Member

remove the todo

Member Author

Done. Thanks.

@@ -822,7 +827,8 @@ def on_import(self, fgraph, app, reason):
if getattr(app.op, 'destroy_map', {}):
# TODO: check here only one level of fast destroy_map.
Member

remove todo

Member Author

Done. Thanks.

if config.disbale_cycle_detection and self.fail_validate:
    self.fail_validate = False
    # raise self.fail_validate
    InconsistencyError("error")
Member

I think you want that here:
err = self.fail_validate
self.fail_validate = False
raise err

Member Author

Done. Thanks.

nouiz (Member) commented Mar 31, 2017 via email

ReyhaneAskari (Member Author) commented Apr 6, 2017

@nouiz
There are some Theano tests that are going to fail. I have investigated them; they concern inplace operations. The tests assume that the inplace happens and that the op has a destroy_map, but we have not allowed the inplace to happen, so it does not have the destroy_map.

DLT tests passed. Here is the result of profiling with sb_resnet of 11 layers. I cleared the cache and ran the experiment once before the first experiment of each version.

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 103.5112 | 56.00142 | 9.498645 | 45.28887 | 188.894 |
| run master 3 | 103.3318 | 56.14321 | 9.269634 | 45.65213 | 189.130 |
| avg. master | 103.4215 | 56.072315 | 9.3841395 | 45.4705 | 189.012 |
| run fast destroy 2 | 104.1225 | 55.80595 | 9.338654 | 45.26488 | 190.116 |
| run fast destroy 3 | 103.7676 | 55.43250 | 9.106600 | 45.11833 | 188.687 |
| avg. fast destroy | 103.94505 | 55.619225 | 9.222627 | 45.191605 | 189.4015 |

# inputs = sum(dm.values()) # list of app's destroyed inputs
inputs = dm.values()[0]
inputs = list(set(itertools.
chain.from_iterable(dm.values()))) # list of app's destroyed inputs
Member

Is the problem that in Python 3 dm.values() is an iterable? If so, this would be a simpler fix:

inputs = sum(list(dm.values()))

Member Author

No, the problem is that each item inside dm.values() is itself a list. So for example:
dm = {0: [1, 2], 1: [0, 1]}. In this case the inputs that are destroyed are
list(set([1, 2, 0, 1])) = [0, 1, 2].
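
As an illustration, here is a minimal runnable sketch of that flattening, using the example dm from this comment (this is not the PR code itself):

```python
# Flatten a destroy_map whose values are lists of input indices
# (dm maps output index -> indices of the inputs that output destroys).
import itertools

dm = {0: [1, 2], 1: [0, 1]}
inputs = list(set(itertools.chain.from_iterable(dm.values())))
print(sorted(inputs))  # [0, 1, 2]
```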

Member

Remove the call to list; it should not be needed.

Member Author

I see. You are right.

"""If true it disables the cycle detection in graph.
""",
BoolParam(False),
StrParam('topo'),
Member

You need to add 'fast' to the list of acceptable values here.

Member Author

OK. Thanks.
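
For reference, a hedged sketch of what the flag declaration could look like with both accepted values listed (using EnumStr; the exact form in the merged code may differ):

```python
# Sketch only: a string-valued config flag whose accepted values are listed
# explicitly; 'topo' stays the default and 'fast' enables the new detection.
from theano.configparser import AddConfigVar, EnumStr

AddConfigVar('cycle_detection',
             "Which algorithm to use to detect cycles introduced by "
             "in-place operations: 'topo' (default) or 'fast'.",
             EnumStr('topo', 'fast'),
             in_c_key=False)
```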

ords = self.orderings(fgraph)
if _contains_cycle(fgraph, ords):
    raise InconsistencyError("Dependency graph contains cycles")
for n in fgraph.apply_nodes:
Member

Add a comment that explains why we need this. Something like:

self.fail_validate can only be a hint that maybe/probably there is a cycle.
This is because inside replace() we could record many reasons to not accept a change,
but we don't know which one will fail first inside validate().

Note: if you find in your benchmark that this is still taking time, we could speed it up again by making self.fail_validate a dict where the keys are the nodes that failed and the values are the errors. We would then only need to revalidate the nodes in the dict, and if one of them does not revalidate, we would raise its error.
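
A hypothetical, standalone sketch of that dict-of-errors idea (names and structure are illustrative, not the merged code):

```python
# Record potential failures per node; during validate(), re-check only those
# nodes, drop entries for nodes that left the graph, and raise the first
# error that is still valid.
class FailValidateTracker(object):
    def __init__(self):
        self.fail_validate = {}  # node -> recorded error

    def record(self, node, error):
        self.fail_validate[node] = error

    def validate(self, live_nodes, still_fails):
        # `still_fails(node)` re-runs the cheap per-node check.
        for node, error in list(self.fail_validate.items()):
            if node not in live_nodes:
                del self.fail_validate[node]
            elif still_fails(node):
                raise error
```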

Member Author

Thanks, done.

@@ -891,6 +928,8 @@ def on_change_input(self, fgraph, app, i, old_r, new_r, reason):

self.view_o.setdefault(new_r, OrderedSet()).add(output)

if config.cycle_detection == 'fast':
Member

Instead of using a global config inside the op, I would add a new parameter to the __init__ of this class: cycle_detection, defaulting to None. When it is None, use the config value. This makes the instance use an instance value instead of a global variable, and it would allow users, without changes to Theano, to have different Theano functions with different configurations of this variable.
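
A minimal sketch of that pattern (class and attribute names here are illustrative, not the actual Theano implementation):

```python
# A per-instance setting that falls back to a global flag when not given.
GLOBAL_CYCLE_DETECTION = 'topo'  # stand-in for config.cycle_detection

class DestroyHandlerSketch(object):
    def __init__(self, cycle_detection=None):
        if cycle_detection is None:
            # No per-instance value: use the global configuration.
            cycle_detection = GLOBAL_CYCLE_DETECTION
        self.cycle_detection = cycle_detection

default_handler = DestroyHandlerSketch()      # follows the global flag
fast_handler = DestroyHandlerSketch('fast')   # per-instance override
```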

Member Author

Thanks. Done.

ReyhaneAskari (Member Author)

@nouiz, here are the benchmarking results:
(the first run is with an empty cache and I don't include it here.)

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 621.6067 | 163.8108 | 16.74826 | 145.2421 | 851.222 |
| run master 3 | 664.7983 | 164.5491 | 16.64918 | 146.0820 | 900.029 |
| avg. master | 643.2025 | 164.17995 | 16.69872 | 145.66205 | 875.6255 |
| run fast destroy 2 | 613.7685 | 163.8715 | 16.66225 | 145.3980 | 843.228 |
| run fast destroy 3 | 626.4770 | 162.1992 | 16.67189 | 143.7104 | 863.466 |
| avg. fast destroy | 620.12275 | 163.03535 | 16.66707 | 144.5542 | 853.347 |

and here is the result of memory profiling:

Run master 2:
Max peak memory with current setting
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 5229126KB
GPU: 0KB
CPU + GPU: 5229126KB

Run master 3:
Max peak memory with current setting
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 5229126KB
GPU: 0KB
CPU + GPU: 5229126KB

Run fast destroy 2:
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 21KB
GPU: 5229105KB
CPU + GPU: 5229126KB

Run fast destroy 3:
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 21KB
GPU: 5229105KB
CPU + GPU: 5229126KB

nouiz (Member) commented Apr 13, 2017

The code looks good, just waiting for the profile results.

ReyhaneAskari (Member Author) commented Apr 14, 2017

@nouiz, here are the benchmarking results for sb_resnet with 29 layers:
(the first run is with an empty cache and I don't include it here.)

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 2320.228 | 1241.990 | 524.0085 | 706.8868 | 3751.587 |
| run master 3 | 2322.461 | 1243.081 | 524.2337 | 707.7609 | 3750.364 |
| avg. master | 2321.3445 | 1242.5355 | 524.1211 | 707.32385 | 3750.9755 |
| run fast destroy 2 | 2348.764 | 872.1136 | 433.1899 | 428.4450 | 3398.524 |
| run fast destroy 3 | 2339.346 | 870.1231 | 432.8565 | 426.6568 | 3388.438 |
| avg. fast destroy | 2344.055 | 871.11835 | 433.0232 | 427.5509 | 3393.481 |

Here is the result of memory profiling:
Run master 2:
Max peak memory with current setting
CPU: 17997KB (20525KB)
GPU: 2296853KB (2160045KB)
CPU + GPU: 2314850KB (2180570KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18357KB (20885KB)
GPU: 2299498KB (2160099KB)
CPU + GPU: 2317855KB (2180984KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 20567KB
GPU: 8477270KB
CPU + GPU: 8497837KB
Run master 3:
Max peak memory with current setting
CPU: 17997KB (20525KB)
GPU: 2296853KB (2160045KB)
CPU + GPU: 2314850KB (2180570KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18357KB (20885KB)
GPU: 2299498KB (2160099KB)
CPU + GPU: 2317855KB (2180984KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 20567KB
GPU: 8477270KB
CPU + GPU: 8497837KB
Run faster 2:
Max peak memory with current setting
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 41087KB
GPU: 10262778KB
CPU + GPU: 10303865KB
Run faster 3:
Max peak memory with current setting
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 41087KB
GPU: 10262778KB
CPU + GPU: 10303865KB

nouiz (Member) commented Apr 18, 2017

The results are strange. The faster topo makes the optimizer faster, which is great. But here are some strange facts:

  • It uses less or more memory on the GPU depending on the order of execution. Not a problem, as in the default it uses less memory!
  • The biggest compilation speed-up comes from the linker-phase speed-up. This would be fixed differently in the master of Theano and libgpuarray by the libgpuarray cache, so we should ignore this speed difference here.
  • The opt gpua_inplace_opt didn't get any speed-up! Also, once the speed issue is fixed, we should modify it to try all the optimizations separately instead of bundling them.
  • The opt local_dnna_conv_inplace rejected all modifications. I find that strange, but maybe it is normal. Can you check that? It gets much faster; this is where the optimization speed-up mostly comes from.

@@ -783,6 +783,9 @@ def on_detach(self, fgraph):
delattr(self.fgraph, 'destroy_handler')
self.fgraph = None

def on_revert(self, fgraph):
Member

Remove that method. In the end we didn't use it in this fix.

Member Author

removed, thanks.

self.fail_validate[app] = InconsistencyError(
"Attempting to destroy indestructible variables: %s" %
inp)
else:
if len(inp.clients) > 1:
Member

Fix this indentation to have if, elif, elif from line 806.

Member Author

fixed, thanks.

# self.fast_destroy(app, 'validate')
for app in fgraph.apply_nodes:
    self.fast_destroy(app, 'validate')
self.fail_validate = app_err_pairs
Member

I think we need to document how it works. Here are some notes to myself that I would like to end up in that doc, for later when we are close to merging, as this can still change.

  • We needed to take care of cases where we record a possible failure, but it is another feature that causes the revert.
  • We need to verify only nodes that are still in the graph. Nodes marked as failures are potential failures and need to be rechecked during validate().
  • We need to handle the case where, after a failed validate, there is no revert and the following validate still needs to fail.
  • In the case of a revert, we need to track that a node isn't in the graph anymore, and we need to remove from the potential failures the nodes whose inputs aren't potential failures anymore.

If you think of other cases we should try to cover, reply with them.

ReyhaneAskari (Member Author) commented Apr 24, 2017

Here are the results for the new commits:

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run fast destroy 2 | 399.1832 | 674.6214 | 241.0301 | 423.0907 | 1189.315 |
| run fast destroy 3 | 398.9949 | 673.6743 | 241.6363 | 421.5410 | 1188.851 |
| avg. fast destroy | 399.08905 | 674.14785 | 241.3332 | 422.31585 | 1189.083 |

ReyhaneAskari (Member Author)

Here are the results for the new commits:

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run fast destroy 2 | 412.4167 | 762.1780 | 302.5036 | 448.7382 | 1287.581 |
| run fast destroy 3 | 410.6906 | 765.6680 | 305.1738 | 449.4357 | 1289.136 |
| avg. fast destroy | 411.55365 | 763.923 | 303.8387 | 449.08695 | 1288.3585 |

self.fail_validate[app] = theano.gof.InconsistencyError(
"Destroyed variable has destroy_map or view_map. " + str(reason))
"Destroyed variable has view_map. " + str(reason))
d = getattr(app2.op, 'destroy_map', {})
Member

Put all that part in an else so it is not executed when the first check fails.

self.fast_destroy(app, 'validate')
if self.fail_validate:
    self.fail_validate = app_err_pairs
    err = app_err_pairs.values()[0]
Member

Replace:
app_err_pairs.values()[0]
with:
next(app_err_pairs.itervalues())
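
As a side note (not part of the review itself), the same lookup written so it works on both Python 2 and 3:

```python
# Grab one recorded error from the dict without building an intermediate list.
app_err_pairs = {'some_node': ValueError("recorded failure")}
err = next(iter(app_err_pairs.values()))
print(err)  # recorded failure
```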

ReyhaneAskari (Member Author)

@nouiz, here is the benchmarking of sb_resnet with 29 layers. "faster" is the version that has been pushed to GitHub.
"faster with nodes" is with this diff:

@@ -218,7 +218,7 @@ class InplaceElemwiseOptimizer(Optimizer):
 
         check_each_change = config.tensor.insert_inplace_optimizer_validate_nb
         if check_each_change == -1:
-            if len(fgraph.apply_nodes) > 500:
+            if len(fgraph.apply_nodes) > 500 and fgraph.destroy_handler.algo == "topo":
                 check_each_change = 10
             else:
                 check_each_change = 1

"faster with clients" is with this diff:

@@ -316,6 +316,7 @@ class InplaceElemwiseOptimizer(Optimizer):
 
                     updated_vars = []
                     vars_from_inplace = []
+                    one_clients = []
                     other_vars = []
@@ -331,11 +332,18 @@ class InplaceElemwiseOptimizer(Optimizer):
                             # inplace on the updated input via a sequence of
                             # one or more inplace operations
                             vars_from_inplace.append(inp_idx)
+                        elif len(inp.clients) == 1:
+                            one_clients.append(inp_idx)
                         else:
                             other_vars.append(inp_idx)
                     sorted_candidate_inputs = (updated_vars +
-                                               vars_from_inplace + other_vars)
+                                               vars_from_inplace +
+                                               one_clients +
+                                               other_vars)

and "faster with clients and nodes" is with the combination of both diffs.

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master 2 | 2336.2 | 564.4 | 531.79 | 21.2 | 3053.704 |
| master 3 | 2343.5 | 566.2 | 533.59 | 21.3 | 3065.059 |
| avg. master | 2339.8 | 565.3 | 532.6 | 21.2 | 3059.381 |
| faster 2 | 2348.2 | 319.9 | 288.66 | 19.9 | 2820.300 |
| faster 3 | 2348.1 | 319.7 | 288.52 | 19.9 | 2820.040 |
| avg. faster | 2348.1 | 319.8 | 288.59 | 19.9 | 2820.170 |
| faster with nodes 2 | 2355.2 | 446.8 | 412.93 | 22.8 | 2951.870 |
| faster with nodes 3 | 2350.1 | 447.1 | 413.49 | 22.8 | 2946.918 |
| avg. faster with nodes | 2352.6 | 446.9 | 413.21 | 22.8 | 2949.394 |
| faster with clients 2 | 2345.2 | 564.8 | 532.23 | 21.2 | 3062.979 |
| faster with clients 3 | 2341.5 | 571.4 | 536.11 | 23.7 | 3065.738 |
| avg. faster with clients | 2343.3 | 568.1 | 534.17 | 22.4 | 3064.358 |
| faster with clients and nodes 2 | 2343.0 | 567.5 | 532.7 | 23.7 | 3063.707 |
| faster with clients and nodes 3 | 2345.5 | 565.6 | 530.9 | 23.7 | 3059.645 |
| avg. faster with clients and nodes | 2344.2 | 566.5 | 531.8 | 23.7 | 3061.676 |

ReyhaneAskari (Member Author)

I ran all the benchmarking again to double check. Here are the results:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| node_2 | 2359.018 | 478.5213 | 443.7829 | 23.8545 | 2988.656 |
| node_3 | 2362.502 | 478.4525 | 444.4603 | 22.98099 | 2991.534 |
| avg node_2 node_3 | 2360.76 | 478.4869 | 444.1216 | 23.417745 | 2990.095 |
| client_2 | 2339.796 | 334.8008 | 303.5052 | 20.03821 | 2835.333 |
| client_3 | 2340.467 | 337.5902 | 306.2961 | 20.01576 | 2838.645 |
| avg client_2 client_3 | 2340.1315 | 336.1955 | 304.90065 | 20.026985 | 2836.989 |
| client_node_2 | 2387.533 | 477.736 | 444.0152 | 23.00391 | 3015.129 |
| client_node_3 | 2368.015 | 476.6075 | 442.7222 | 23.02957 | 2995.325 |
| avg client_node_2 client_node_3 | 2377.774 | 477.17175 | 443.3687 | 23.01674 | 3005.227 |

ReyhaneAskari (Member Author)

Full results:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_2 | 2343.89 | 564.02 | 530.9986 | 21.3757 | 3079.552 |
| master_3 | 2336.77 | 564.3109 | 530.4165 | 22.21166 | 3072.559 |
| avg master_2 master_3 | 2340.33 | 564.16545 | 530.70755 | 21.79368 | 3076.0555 |
| faster_2 | 2349.516 | 321.593 | 289.6749 | 20.31129 | 2838.189 |
| faster_3 | 2355.735 | 322.3881 | 290.4812 | 20.2576 | 2844.84 |
| avg faster_2 faster_3 | 2352.6255 | 321.99055 | 290.07805 | 20.284445 | 2841.5145 |
| node_2 | 2364.787 | 473.7134 | 441.7535 | 20.32315 | 3001.74 |
| node_3 | 2357.871 | 477.1179 | 445.2522 | 20.2912 | 2998.278 |
| avg node_2 node_3 | 2361.329 | 475.41565 | 443.50285 | 20.307175 | 3000.009 |
| client_2 | 2356.345 | 339.9928 | 307.9422 | 20.35324 | 2863.726 |
| client_3 | 2353.055 | 336.1462 | 304.3138 | 20.23053 | 2856.64 |
| avg client_2 client_3 | 2354.7 | 338.0695 | 306.128 | 20.291885 | 2860.183 |
| client_node_2 | 2355.181 | 476.5542 | 444.4957 | 20.31519 | 2999.266 |
| client_node_3 | 2357.456 | 474.9499 | 442.9806 | 20.30029 | 2999.692 |
| avg client_node_2 client_node_3 | 2356.3185 | 475.75205 | 443.73815 | 20.30774 | 2999.479 |

ReyhaneAskari (Member Author)

Here is the result of memory profiling:

| Metric (KB) | master | faster | node | client |
| --- | --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 | 2186307.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 | 47.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 | 8693129.0 |

nouiz (Member) commented May 16, 2017

I pushed commits to your branch that should help the profiling. Can you redo the Theano profiling and Python profiling of the "node" version and the master version?

ReyhaneAskari (Member Author) commented May 16, 2017

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_1 | 2358.285 | 581.9009 | 545.5013 | 24.8629 | 3096.893 |
| faster_1 | 2353.743 | 336.3887 | 301.5318 | 22.96834 | 2851.59 |
| node_1 | 2367.863 | 474.6799 | 442.5479 | 20.49995 | 3005.857 |

ReyhaneAskari (Member Author)

| Metric (KB) | master_1 | faster_1 | node_1 |
| --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 |

ReyhaneAskari (Member Author)

@nouiz, here is the result of profiling. I removed the Python profiling, but it still didn't reduce the timings.

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_2 | 2338.086 | 533.6504 | 496.1532 | 24.34856 | 3083.279 |
| master_3 | 2388.766 | 533.0599 | 497.8445 | 21.9859 | 3109.112 |
| avg master_2 master_3 | 2363.426 | 533.35515 | 496.99885 | 23.16723 | 3096.1955 |
| faster_2 | 2362.921 | 321.5113 | 287.7741 | 20.55508 | 2866.6 |
| faster_3 | 2356.994 | 321.2518 | 287.4285 | 20.55622 | 2860.374 |
| avg faster_2 faster_3 | 2359.9575 | 321.38155 | 287.6013 | 20.55565 | 2863.487 |
| node_2 | 2362.66 | 443.7989 | 409.6934 | 20.91191 | 2999.223 |
| node_3 | 2355.702 | 444.244 | 410.305 | 20.77304 | 2992.904 |
| avg node_2 node_3 | 2359.181 | 444.02145 | 409.9992 | 20.842475 | 2996.0635 |
| client_2 | 2349.6 | 318.6416 | 285.2724 | 20.20445 | 2862.887 |
| client_3 | 2353.197 | 319.1659 | 285.7407 | 20.25854 | 2867.173 |
| avg client_2 client_3 | 2351.3985 | 318.90375 | 285.50655 | 20.231495 | 2865.03 |

| Metric (KB) | master | faster | node | client |
| --- | --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 | 2186307.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 | 47.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 | 8693129.0 |

nouiz (Member) commented May 19, 2017

I think we are good to merge. Remove the change to the test suite and fix the PEP8 failure.
Also, can you benchmark the fast cycle detection with this Theano flag:

gpuarray.preallocate=-1
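
One way to set that flag for a benchmark run (a sketch; THEANO_FLAGS has to be set before Theano is imported, and the rest of the benchmark script is assumed):

```python
# Set the Theano flag in the environment before importing theano;
# THEANO_FLAGS is read at import time.
import os
os.environ['THEANO_FLAGS'] = 'gpuarray.preallocate=-1'

import theano  # imported after setting THEANO_FLAGS on purpose
# ... build and benchmark the Theano function as in the profiles above ...
```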

ReyhaneAskari (Member Author)

Alright. Sure.

nouiz (Member) commented May 19, 2017

The reason I ask for new profiles is that now, with the fast topo, the slowest optimization is still gpu_elemwise_inplace. But even with fewer inplace opts applied, we don't see a slowdown in run time and we don't see an increase in memory usage. So basically, why do we still do that optimization at all?

So, not for this PR, but to find out whether we still need that optimization, we would need these profiles:

  • (the one requested above): this PR with fast detection and gpuarray.preallocate=-1
  • this PR without fast detection (named master in your profile) and gpuarray.preallocate=-1
  • this PR with the flag optimizer_excluding=gpua_inplace_opt and gpuarray.preallocate=-1
  • this PR with the flag optimizer_excluding=gpua_inplace_opt

Maybe now that we have preallocation, the inplace opt is much less useful, but it is still useful when we disable the cache of allocated memory on the GPU.

ReyhaneAskari (Member Author)

I didn't rebase, as rebasing would move all the commits and we would not know in the future which profiling corresponds to which commits.

ReyhaneAskari (Member Author)

Here is the result of the faster branch compared to the same branch with the flag gpuarray.preallocate=-1:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| faster | 2356.994 | 321.2518 | 287.4285 | 20.55622 | 2860.374 |
| faster_preallocate | 2611.0 | 321.957 | 285.5965 | 23.13251 | 3139.56 |

| Metric (KB) | faster | faster_preallocate |
| --- | --- | --- |
| peak memory cpu | 2.0 | 2.0 |
| peak memory gpu | 2186307.0 | 2186307.0 |
| peak memory cpu + gpu | 2186309.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 |
| allow_gc=False gpu | 8693082.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8693129.0 | 8693129.0 |

lamblin (Member) commented May 23, 2017

Trigger new buildbot run

ReyhaneAskari force-pushed the faster_topo branch 3 times, most recently from 9eedd8b to 99f568c, May 31, 2017 03:29
nouiz merged commit 6792767 into Theano:master on Jun 1, 2017
lamblin mentioned this pull request on Jun 1, 2017