Faster topo #5794

Merged: 35 commits merged into Theano:master from faster_topo on Jun 1, 2017

Conversation

ReyhaneAskari (Member)

Regarding issue #4233, this PR adds a theano flag disbale_cycle_detection.

@@ -1555,6 +1555,12 @@ def filter_vm_lazy(val):
IntParam(5, lambda i: i > 0, allow_override=False),
in_c_key=False)

AddConfigVar('disbale_cycle_detection',
Member

I think I would use a different flag type, in case we have more options in the future: a flag cycle_detection that defaults to "topo", with the new version being "fast". A later version of fast that allows some sequence of inputs could be given another name like sequence (or it could replace fast if fast has no advantage over it; I think that would be the case).

Member Author

Right! So you would want it to be a StrParam?

obilaniu

Small small small comment: disbale -> disable.

Member Author

Thanks @obilaniu.

dm = getattr(app.op, 'destroy_map', None)
if not dm:
    return
inputs = sum(dm.values())  # list of app's destroyed inputs
# inputs = sum(dm.values())  # list of app's destroyed inputs
inputs = dm.values()[0]
Member

This isn't good. An op can have multiple outputs, and in that case we want all the inputs that can be destroyed.

Member Author

Thanks, I fixed it. The items inside dm.values() are themselves lists. In the new version, inputs is a list of all the inputs that are destroyed.

v = getattr(app2, 'view_map', {}).get(inp_idx2, [])
dv = d + v
assert len(dv) <= 1
if len(v) > 0:
Member

I think there is an error here and it should be "len(dv)" not "len(v)". What do you think?

Member Author

Yes, I think it should be len(dv).

Member

Right. For the first version, we only allow that case.

@@ -924,7 +930,8 @@ def on_change_input(self, fgraph, app, i, old_r, new_r, reason):
self.view_o.setdefault(new_r, OrderedSet()).add(output)

# TODO: check here only one level of fast destroy_map.
Member

remove the todo

Member Author

Done. Thanks.

@@ -822,7 +827,8 @@ def on_import(self, fgraph, app, reason):
if getattr(app.op, 'destroy_map', {}):
# TODO: check here only one level of fast destroy_map.
Member

remove todo

Member Author

Done. Thanks.

if config.disbale_cycle_detection and self.fail_validate:
    self.fail_validate = False
    # raise self.fail_validate
    InconsistencyError("error")
Member

I think you want that here:
err = self.fail_validate
self.fail_validate = False
raise err

Member Author

Done. Thanks.

nouiz (Member) commented Mar 31, 2017 via email

ReyhaneAskari (Member Author) commented Apr 6, 2017

@nouiz
There are some Theano tests that are going to fail. I have investigated them; they concern inplace operations. The tests assume that the inplace happens and that the op has a destroy_map, but we have not allowed the inplace to happen, so it does not have the destroy_map.

DLT tests passed. Here is the result of profiling with sb_resnet of 11 layers. I cleared the cache and ran the experiment once before the first experiment of each version.

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 103.5112 | 56.00142 | 9.498645 | 45.28887 | 188.894 |
| run master 3 | 103.3318 | 56.14321 | 9.269634 | 45.65213 | 189.130 |
| avg. master | 103.4215 | 56.072315 | 9.3841395 | 45.4705 | 189.012 |
| run fast destroy 2 | 104.1225 | 55.80595 | 9.338654 | 45.26488 | 190.116 |
| run fast destroy 3 | 103.7676 | 55.43250 | 9.106600 | 45.11833 | 188.687 |
| avg. fast destroy | 103.94505 | 55.619225 | 9.222627 | 45.191605 | 189.4015 |

# inputs = sum(dm.values()) # list of app's destroyed inputs
inputs = dm.values()[0]
inputs = list(set(itertools.
chain.from_iterable(dm.values()))) # list of app's destroyed inputs
Member

Is the problem that in Python 3 dm.values() is an iterable? If so, this would be a simpler fix:

inputs = sum(list(dm.values()))

Member Author

No, the problem is that each item inside dm.values() is itself a list. So for example:
dm = {0: [1, 2], 1: [0, 1]}. In this case the inputs that are destroyed are
list(set([1, 2, 0, 1])) = [0, 1, 2].
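
As an illustration, here is a minimal runnable sketch of that flattening, using the example dm from this comment (this is not the PR code itself):

```python
# Flatten a destroy_map whose values are lists of input indices
# (dm maps output index -> indices of the inputs that output destroys).
import itertools

dm = {0: [1, 2], 1: [0, 1]}
inputs = list(set(itertools.chain.from_iterable(dm.values())))
print(sorted(inputs))  # [0, 1, 2]
```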

Member

Remove the call to list; it should not be needed.

Member Author

I see. You are right.

"""If true it disables the cycle detection in graph.
""",
BoolParam(False),
StrParam('topo'),
Member

You need to add 'fast' to the list of acceptable values here.

Member Author

OK. Thanks.
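
For reference, a hedged sketch of what the flag declaration could look like with both accepted values listed (using EnumStr; the exact form in the merged code may differ):

```python
# Sketch only: a string-valued config flag whose accepted values are listed
# explicitly; 'topo' stays the default and 'fast' enables the new detection.
from theano.configparser import AddConfigVar, EnumStr

AddConfigVar('cycle_detection',
             "Which algorithm to use to detect cycles introduced by "
             "in-place operations: 'topo' (default) or 'fast'.",
             EnumStr('topo', 'fast'),
             in_c_key=False)
```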

ords = self.orderings(fgraph)
if _contains_cycle(fgraph, ords):
    raise InconsistencyError("Dependency graph contains cycles")
for n in fgraph.apply_nodes:
Member

Add a comment that explains why we need this. Something like:

self.fail_validate can only be a hint that maybe/probably there is a cycle.
This is because inside replace() we could record many reasons to not accept a change,
but we don't know which one will fail first inside validate().

Note: if you find in your benchmark that this is still taking time, we could speed it up again by making self.fail_validate a dict where the keys are the nodes that failed and the values are the errors. We would then only need to revalidate the nodes in the dict, and if one of them does not revalidate, we would raise its error.
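
A hypothetical, standalone sketch of that dict-of-errors idea (names and structure are illustrative, not the merged code):

```python
# Record potential failures per node; during validate(), re-check only those
# nodes, drop entries for nodes that left the graph, and raise the first
# error that is still valid.
class FailValidateTracker(object):
    def __init__(self):
        self.fail_validate = {}  # node -> recorded error

    def record(self, node, error):
        self.fail_validate[node] = error

    def validate(self, live_nodes, still_fails):
        # `still_fails(node)` re-runs the cheap per-node check.
        for node, error in list(self.fail_validate.items()):
            if node not in live_nodes:
                del self.fail_validate[node]
            elif still_fails(node):
                raise error
```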

Member Author

Thanks, done.

@@ -891,6 +928,8 @@ def on_change_input(self, fgraph, app, i, old_r, new_r, reason):

self.view_o.setdefault(new_r, OrderedSet()).add(output)

if config.cycle_detection == 'fast':
Member

Instead of using a global config inside the op, I would add a new parameter to the __init__ of this class: cycle_detection, defaulting to None. When it is None, use the config value. This makes the instance use an instance value instead of a global variable, and it would allow users, without changes to Theano, to have different Theano functions with different configurations of this variable.
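
A minimal sketch of that pattern (class and attribute names here are illustrative, not the actual Theano implementation):

```python
# A per-instance setting that falls back to a global flag when not given.
GLOBAL_CYCLE_DETECTION = 'topo'  # stand-in for config.cycle_detection

class DestroyHandlerSketch(object):
    def __init__(self, cycle_detection=None):
        if cycle_detection is None:
            # No per-instance value: use the global configuration.
            cycle_detection = GLOBAL_CYCLE_DETECTION
        self.cycle_detection = cycle_detection

default_handler = DestroyHandlerSketch()      # follows the global flag
fast_handler = DestroyHandlerSketch('fast')   # per-instance override
```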

Member Author

Thanks. Done.

ReyhaneAskari (Member Author)

@nouiz, here are the benchmarking results:
(the first run is with an empty cache and I don't include it here.)

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 621.6067 | 163.8108 | 16.74826 | 145.2421 | 851.222 |
| run master 3 | 664.7983 | 164.5491 | 16.64918 | 146.0820 | 900.029 |
| avg. master | 643.2025 | 164.17995 | 16.69872 | 145.66205 | 875.6255 |
| run fast destroy 2 | 613.7685 | 163.8715 | 16.66225 | 145.3980 | 843.228 |
| run fast destroy 3 | 626.4770 | 162.1992 | 16.67189 | 143.7104 | 863.466 |
| avg. fast destroy | 620.12275 | 163.03535 | 16.66707 | 144.5542 | 853.347 |

and here is the result of memory profiling:

Run master 2:
Max peak memory with current setting
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 5229126KB
GPU: 0KB
CPU + GPU: 5229126KB

Run master 3:
Max peak memory with current setting
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 654300KB (676946KB)
GPU: 0KB (0KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 5229126KB
GPU: 0KB
CPU + GPU: 5229126KB

Run fast destroy 2:
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 21KB
GPU: 5229105KB
CPU + GPU: 5229126KB

Run fast destroy 3:
Max peak memory with current setting
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 0KB (0KB)
GPU: 654300KB (676946KB)
CPU + GPU: 654300KB (676946KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 21KB
GPU: 5229105KB
CPU + GPU: 5229126KB

nouiz (Member) commented Apr 13, 2017

The code looks good, just waiting for the profile results.

ReyhaneAskari (Member Author) commented Apr 14, 2017

@nouiz, here are the benchmarking results for sb_resnet with 29 layers:
(the first run is with an empty cache and I don't include it here.)

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run master 2 | 2320.228 | 1241.990 | 524.0085 | 706.8868 | 3751.587 |
| run master 3 | 2322.461 | 1243.081 | 524.2337 | 707.7609 | 3750.364 |
| avg. master | 2321.3445 | 1242.5355 | 524.1211 | 707.32385 | 3750.9755 |
| run fast destroy 2 | 2348.764 | 872.1136 | 433.1899 | 428.4450 | 3398.524 |
| run fast destroy 3 | 2339.346 | 870.1231 | 432.8565 | 426.6568 | 3388.438 |
| avg. fast destroy | 2344.055 | 871.11835 | 433.0232 | 427.5509 | 3393.481 |

Here is the result of memory profiling:
Run master 2:
Max peak memory with current setting
CPU: 17997KB (20525KB)
GPU: 2296853KB (2160045KB)
CPU + GPU: 2314850KB (2180570KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18357KB (20885KB)
GPU: 2299498KB (2160099KB)
CPU + GPU: 2317855KB (2180984KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 20567KB
GPU: 8477270KB
CPU + GPU: 8497837KB
Run master 3:
Max peak memory with current setting
CPU: 17997KB (20525KB)
GPU: 2296853KB (2160045KB)
CPU + GPU: 2314850KB (2180570KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 18357KB (20885KB)
GPU: 2299498KB (2160099KB)
CPU + GPU: 2317855KB (2180984KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 20567KB
GPU: 8477270KB
CPU + GPU: 8497837KB
Run faster 2:
Max peak memory with current setting
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 41087KB
GPU: 10262778KB
CPU + GPU: 10303865KB
Run faster 3:
Max peak memory with current setting
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory with current setting and Theano flag optimizer_excluding=inplace
CPU: 17637KB (20882KB)
GPU: 2186604KB (2709783KB)
CPU + GPU: 2204241KB (2730665KB)
Max peak memory if allow_gc=False (linker don't make a difference)
CPU: 41087KB
GPU: 10262778KB
CPU + GPU: 10303865KB

nouiz (Member) commented Apr 18, 2017

The results are strange. The faster topo makes the optimizer faster, which is great. But here are some strange facts:

  • It uses less or more memory on the GPU depending on the order of execution. Not a problem, as in the default it uses less memory!
  • The biggest compilation speed-up comes from the linker-phase speed-up. This would be fixed differently in the master of Theano and libgpuarray by the libgpuarray cache, so we should ignore this speed difference here.
  • The opt gpua_inplace_opt didn't get any speed-up! Also, once the speed issue is fixed, we should modify it to try all the optimizations separately instead of bundling them.
  • The opt local_dnna_conv_inplace rejected all modifications. I find that strange, but maybe it is normal. Can you check that? It gets much faster; this is where the optimization speed-up mostly comes from.

@@ -783,6 +783,9 @@ def on_detach(self, fgraph):
delattr(self.fgraph, 'destroy_handler')
self.fgraph = None

def on_revert(self, fgraph):
Member

Remove that method. In the end we didn't use it in this fix.

Member Author

removed, thanks.

self.fail_validate[app] = InconsistencyError(
"Attempting to destroy indestructible variables: %s" %
inp)
else:
if len(inp.clients) > 1:
Member

Fix this indentation to have if, elif, elif from line 806.

Member Author

fixed, thanks.

# self.fast_destroy(app, 'validate')
for app in fgraph.apply_nodes:
    self.fast_destroy(app, 'validate')
self.fail_validate = app_err_pairs
Member

I think we need to document how it works. Here are some notes to myself that I would like to end up in that doc, for later when we are close to merging, as this can still change.

  • We needed to take care of cases where we record a possible failure, but it is another feature that causes the revert.
  • We need to verify only nodes that are still in the graph. Nodes marked as failures are potential failures and need to be rechecked during validate().
  • We need to handle the case where, after a failed validate, there is no revert and the following validate still needs to fail.
  • In the case of a revert, we need to track that a node isn't in the graph anymore, and we need to remove from the potential failures the nodes whose inputs aren't potential failures anymore.

If you think of other cases we should try to cover, reply with them.

ReyhaneAskari (Member Author) commented Apr 24, 2017

Here are the results for the new commits:

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run fast destroy 2 | 399.1832 | 674.6214 | 241.0301 | 423.0907 | 1189.315 |
| run fast destroy 3 | 398.9949 | 673.6743 | 241.6363 | 421.5410 | 1188.851 |
| avg. fast destroy | 399.08905 | 674.14785 | 241.3332 | 422.31585 | 1189.083 |

ReyhaneAskari (Member Author)

Here are the results for the new commits:

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| run fast destroy 2 | 412.4167 | 762.1780 | 302.5036 | 448.7382 | 1287.581 |
| run fast destroy 3 | 410.6906 | 765.6680 | 305.1738 | 449.4357 | 1289.136 |
| avg. fast destroy | 411.55365 | 763.923 | 303.8387 | 449.08695 | 1288.3585 |

self.fail_validate[app] = theano.gof.InconsistencyError(
"Destroyed variable has destroy_map or view_map. " + str(reason))
"Destroyed variable has view_map. " + str(reason))
d = getattr(app2.op, 'destroy_map', {})
Member

Put all that part in an else so it is not executed when the first check fails.

self.fast_destroy(app, 'validate')
if self.fail_validate:
    self.fail_validate = app_err_pairs
    err = app_err_pairs.values()[0]
Member

Replace:
app_err_pairs.values()[0]
with:
next(app_err_pairs.itervalues())
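
As a side note (not part of the review itself), the same lookup written so it works on both Python 2 and 3:

```python
# Grab one recorded error from the dict without building an intermediate list.
app_err_pairs = {'some_node': ValueError("recorded failure")}
err = next(iter(app_err_pairs.values()))
print(err)  # recorded failure
```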

ReyhaneAskari (Member Author)

@nouiz, here is the benchmarking of sb_resnet with 29 layers. "faster" is the version that has been pushed to GitHub.
"faster with nodes" is with this diff:

@@ -218,7 +218,7 @@ class InplaceElemwiseOptimizer(Optimizer):
 
         check_each_change = config.tensor.insert_inplace_optimizer_validate_nb
         if check_each_change == -1:
-            if len(fgraph.apply_nodes) > 500:
+            if len(fgraph.apply_nodes) > 500 and fgraph.destroy_handler.algo == "topo":
                 check_each_change = 10
             else:
                 check_each_change = 1

"faster with clients" is with this diff:

@@ -316,6 +316,7 @@ class InplaceElemwiseOptimizer(Optimizer):
 
                     updated_vars = []
                     vars_from_inplace = []
+                    one_clients = []
                     other_vars = []
@@ -331,11 +332,18 @@ class InplaceElemwiseOptimizer(Optimizer):
                             # inplace on the updated input via a sequence of
                             # one or more inplace operations
                             vars_from_inplace.append(inp_idx)
+                        elif len(inp.clients) == 1:
+                            one_clients.append(inp_idx)
                         else:
                             other_vars.append(inp_idx)
                     sorted_candidate_inputs = (updated_vars +
-                                               vars_from_inplace + other_vars)
+                                               vars_from_inplace +
+                                               one_clients +
+                                               other_vars)

and "faster with clients and nodes" is with the combination of both diffs.

| Run | Function.call | total compile time | Optimizer time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master 2 | 2336.2 | 564.4 | 531.79 | 21.2 | 3053.704 |
| master 3 | 2343.5 | 566.2 | 533.59 | 21.3 | 3065.059 |
| avg. master | 2339.8 | 565.3 | 532.6 | 21.2 | 3059.381 |
| faster 2 | 2348.2 | 319.9 | 288.66 | 19.9 | 2820.300 |
| faster 3 | 2348.1 | 319.7 | 288.52 | 19.9 | 2820.040 |
| avg. faster | 2348.1 | 319.8 | 288.59 | 19.9 | 2820.170 |
| faster with nodes 2 | 2355.2 | 446.8 | 412.93 | 22.8 | 2951.870 |
| faster with nodes 3 | 2350.1 | 447.1 | 413.49 | 22.8 | 2946.918 |
| avg. faster with nodes | 2352.6 | 446.9 | 413.21 | 22.8 | 2949.394 |
| faster with clients 2 | 2345.2 | 564.8 | 532.23 | 21.2 | 3062.979 |
| faster with clients 3 | 2341.5 | 571.4 | 536.11 | 23.7 | 3065.738 |
| avg. faster with clients | 2343.3 | 568.1 | 534.17 | 22.4 | 3064.358 |
| faster with clients and nodes 2 | 2343.0 | 567.5 | 532.7 | 23.7 | 3063.707 |
| faster with clients and nodes 3 | 2345.5 | 565.6 | 530.9 | 23.7 | 3059.645 |
| avg. faster with clients and nodes | 2344.2 | 566.5 | 531.8 | 23.7 | 3061.676 |

ReyhaneAskari (Member Author)

I ran all the benchmarking again to double check. Here are the results:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| node_2 | 2359.018 | 478.5213 | 443.7829 | 23.8545 | 2988.656 |
| node_3 | 2362.502 | 478.4525 | 444.4603 | 22.98099 | 2991.534 |
| avg node_2 node_3 | 2360.76 | 478.4869 | 444.1216 | 23.417745 | 2990.095 |
| client_2 | 2339.796 | 334.8008 | 303.5052 | 20.03821 | 2835.333 |
| client_3 | 2340.467 | 337.5902 | 306.2961 | 20.01576 | 2838.645 |
| avg client_2 client_3 | 2340.1315 | 336.1955 | 304.90065 | 20.026985 | 2836.989 |
| client_node_2 | 2387.533 | 477.736 | 444.0152 | 23.00391 | 3015.129 |
| client_node_3 | 2368.015 | 476.6075 | 442.7222 | 23.02957 | 2995.325 |
| avg client_node_2 client_node_3 | 2377.774 | 477.17175 | 443.3687 | 23.01674 | 3005.227 |

ReyhaneAskari (Member Author)

Full results:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_2 | 2343.89 | 564.02 | 530.9986 | 21.3757 | 3079.552 |
| master_3 | 2336.77 | 564.3109 | 530.4165 | 22.21166 | 3072.559 |
| avg master_2 master_3 | 2340.33 | 564.16545 | 530.70755 | 21.79368 | 3076.0555 |
| faster_2 | 2349.516 | 321.593 | 289.6749 | 20.31129 | 2838.189 |
| faster_3 | 2355.735 | 322.3881 | 290.4812 | 20.2576 | 2844.84 |
| avg faster_2 faster_3 | 2352.6255 | 321.99055 | 290.07805 | 20.284445 | 2841.5145 |
| node_2 | 2364.787 | 473.7134 | 441.7535 | 20.32315 | 3001.74 |
| node_3 | 2357.871 | 477.1179 | 445.2522 | 20.2912 | 2998.278 |
| avg node_2 node_3 | 2361.329 | 475.41565 | 443.50285 | 20.307175 | 3000.009 |
| client_2 | 2356.345 | 339.9928 | 307.9422 | 20.35324 | 2863.726 |
| client_3 | 2353.055 | 336.1462 | 304.3138 | 20.23053 | 2856.64 |
| avg client_2 client_3 | 2354.7 | 338.0695 | 306.128 | 20.291885 | 2860.183 |
| client_node_2 | 2355.181 | 476.5542 | 444.4957 | 20.31519 | 2999.266 |
| client_node_3 | 2357.456 | 474.9499 | 442.9806 | 20.30029 | 2999.692 |
| avg client_node_2 client_node_3 | 2356.3185 | 475.75205 | 443.73815 | 20.30774 | 2999.479 |

ReyhaneAskari (Member Author)

Here is the result of memory profiling:

| Metric (KB) | master | faster | node | client |
| --- | --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 | 2186307.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 | 47.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 | 8693129.0 |

nouiz (Member) commented May 16, 2017

I pushed commits to your branch that should help the profiling. Can you redo the Theano profiling and Python profiling of the "node" version and the master version?

ReyhaneAskari (Member Author) commented May 16, 2017

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_1 | 2358.285 | 581.9009 | 545.5013 | 24.8629 | 3096.893 |
| faster_1 | 2353.743 | 336.3887 | 301.5318 | 22.96834 | 2851.59 |
| node_1 | 2367.863 | 474.6799 | 442.5479 | 20.49995 | 3005.857 |

ReyhaneAskari (Member Author)

| Metric (KB) | master_1 | faster_1 | node_1 |
| --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 |

ReyhaneAskari (Member Author)

@nouiz, here is the result of profiling. I removed the Python profiling, but it still didn't reduce the timings.

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| master_2 | 2338.086 | 533.6504 | 496.1532 | 24.34856 | 3083.279 |
| master_3 | 2388.766 | 533.0599 | 497.8445 | 21.9859 | 3109.112 |
| avg master_2 master_3 | 2363.426 | 533.35515 | 496.99885 | 23.16723 | 3096.1955 |
| faster_2 | 2362.921 | 321.5113 | 287.7741 | 20.55508 | 2866.6 |
| faster_3 | 2356.994 | 321.2518 | 287.4285 | 20.55622 | 2860.374 |
| avg faster_2 faster_3 | 2359.9575 | 321.38155 | 287.6013 | 20.55565 | 2863.487 |
| node_2 | 2362.66 | 443.7989 | 409.6934 | 20.91191 | 2999.223 |
| node_3 | 2355.702 | 444.244 | 410.305 | 20.77304 | 2992.904 |
| avg node_2 node_3 | 2359.181 | 444.02145 | 409.9992 | 20.842475 | 2996.0635 |
| client_2 | 2349.6 | 318.6416 | 285.2724 | 20.20445 | 2862.887 |
| client_3 | 2353.197 | 319.1659 | 285.7407 | 20.25854 | 2867.173 |
| avg client_2 client_3 | 2351.3985 | 318.90375 | 285.50655 | 20.231495 | 2865.03 |

| Metric (KB) | master | faster | node | client |
| --- | --- | --- | --- | --- |
| peak memory cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| peak memory gpu | 2296853.0 | 2186307.0 | 2183947.0 | 2186307.0 |
| peak memory cpu + gpu | 2296855.0 | 2186309.0 | 2183949.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 3.0 | 2.0 | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2313898.0 | 2199204.0 | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2313901.0 | 2199206.0 | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 | 44.0 | 47.0 |
| allow_gc=False gpu | 8456750.0 | 8693082.0 | 8014349.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8456797.0 | 8693129.0 | 8014394.0 | 8693129.0 |

nouiz (Member) commented May 19, 2017

I think we are good to merge. Remove the change to the test suite and fix the PEP8 failure.
Also, can you benchmark the fast cycle detection with this Theano flag:

gpuarray.preallocate=-1
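
One way to set that flag for a benchmark run (a sketch; THEANO_FLAGS has to be set before Theano is imported, and the rest of the benchmark script is assumed):

```python
# Set the Theano flag in the environment before importing theano;
# THEANO_FLAGS is read at import time.
import os
os.environ['THEANO_FLAGS'] = 'gpuarray.preallocate=-1'

import theano  # imported after setting THEANO_FLAGS on purpose
# ... build and benchmark the Theano function as in the profiles above ...
```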

ReyhaneAskari (Member Author)

Alright. Sure.

nouiz (Member) commented May 19, 2017

The reason I ask for new profiles is that now, with the fast topo, the slowest optimization is still gpu_elemwise_inplace. But even with fewer inplace opts applied, we don't see a slowdown in run time and we don't see an increase in memory usage. So basically, why do we still do that optimization at all?

So, not for this PR, but to find out whether we still need that optimization, we would need these profiles:

  • (the one requested above): this PR with fast detection and gpuarray.preallocate=-1
  • this PR without fast detection (named master in your profile) and gpuarray.preallocate=-1
  • this PR with the flag optimizer_excluding=gpua_inplace_opt and gpuarray.preallocate=-1
  • this PR with the flag optimizer_excluding=gpua_inplace_opt

Maybe now that we have preallocation, the inplace opt is much less useful, but it is still useful when we disable the cache of allocated memory on the GPU.

ReyhaneAskari (Member Author)

I didn't rebase, as rebasing would move all the commits and we would not know in the future which profiling corresponds to which commits.

ReyhaneAskari (Member Author)

Here is the result of the faster branch compared to the same branch with the flag gpuarray.preallocate=-1:

| Run | Function.call | total compile time | Opt time | Linker time | Time since theano import |
| --- | --- | --- | --- | --- | --- |
| faster | 2356.994 | 321.2518 | 287.4285 | 20.55622 | 2860.374 |
| faster_preallocate | 2611.0 | 321.957 | 285.5965 | 23.13251 | 3139.56 |

| Metric (KB) | faster | faster_preallocate |
| --- | --- | --- |
| peak memory cpu | 2.0 | 2.0 |
| peak memory gpu | 2186307.0 | 2186307.0 |
| peak memory cpu + gpu | 2186309.0 | 2186309.0 |
| flag = optimizer_excluding cpu | 2.0 | 2.0 |
| flag = optimizer_excluding gpu | 2199204.0 | 2199204.0 |
| flag = optimizer_excluding cpu + gpu | 2199206.0 | 2199206.0 |
| allow_gc=False cpu | 47.0 | 47.0 |
| allow_gc=False gpu | 8693082.0 | 8693082.0 |
| allow_gc=False cpu + gpu | 8693129.0 | 8693129.0 |

lamblin (Member) commented May 23, 2017

Trigger new buildbot run

ReyhaneAskari force-pushed the faster_topo branch 3 times, most recently from 9eedd8b to 99f568c, May 31, 2017 03:29
nouiz merged commit 6792767 into Theano:master on Jun 1, 2017
lamblin mentioned this pull request on Jun 1, 2017