
"Less time than before, more speed than before
You're rich not poor, what are you doing it for?
Want more want more..."
— Wire, “Champs”

Current notes

hey, maybe agentsets should be bitsets? we don't want to waste memory though if the turtle numbers keep climbing and the old turtles are dead. oh but hey, the bitset could be tagged with the range of turtle ids that the agentset covers. yeah. and we could update the range when we wipe old turtles at iteration time?
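
a very rough sketch of that range-tagged bitset idea (class and method names here are hypothetical, not real engine types): bit i stands for turtle id base + i, and the base slides forward at sweep time.

    import java.util.BitSet;

    // Hypothetical sketch: an agentset as a bitset over a window of turtle ids.
    // Bit i stands for turtle id (base + i); ids below base can't be members,
    // and the window gets re-based when old dead turtles are swept away.
    public class BitSetAgentSet {
        private long base;                    // lowest turtle id this set can contain
        private BitSet bits = new BitSet();

        public BitSetAgentSet(long base) { this.base = base; }

        public void add(long id)    { bits.set((int) (id - base)); }     // assumes id >= base
        public void remove(long id) { bits.clear((int) (id - base)); }
        public boolean contains(long id) {
            long offset = id - base;
            return offset >= 0 && offset < bits.length() && bits.get((int) offset);
        }
        public int count() { return bits.cardinality(); }

        // Sweep time: every id below newBase is known to be dead, so slide the
        // window forward and drop the low bits we no longer need.
        public void rebase(long newBase) {
            if (newBase <= base) return;
            int shift = (int) (newBase - base);
            bits = bits.get(shift, Math.max(bits.length(), shift));
            base = newBase;
        }
    }

membership and set intersection/union become word-at-a-time, but iteration would be bits.nextSetBit() in id order, so the random iteration order that ask needs would still take extra machinery.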

make nobody an Agent, so values that return agents can return Agent instead of Object. perhaps I should even have a nobody-turtle, a nobody-patch, and a nobody-link (and maybe a generic nobody or a nobody-observer) so that the methods can return Turtle, Patch, etc. it's too big a change for 4.1 though

the whole AgentException thing is no good, or at least should be used in many fewer cases. it's preventing us from rejiggering _patch, for example, and the code involved could just as easily use a null return instead of an exception.

"set heading random 360" is calling setturtlevariable(Object), should be (double). I think we need _setturtlevariabledouble and friends.

_oneofneighbors(4) would be an easy optimization to add that would help GridWalk and maybe Heatbugs. the pattern appears in a fair number of models.

MutableLong seems like overkill for _repeat. why not mutableint?

we can speed up exclusive-mode activations of procedures by using the same Activation.args array every time if we know that the procedure is not going to call itself... which we can know by searching the procedure for occurrences of _call, _callreport, _run, and _runresult. is that really it? this probably wouldn't speed up enough models to be worth it, would it?

can we do some super crazy thing where AgentSets are lazy -- are essentially iterators rather than collections -- but only within a single reporter? that might help with the any-other-count-with confusion and maybe the mean-max-sum-min thing too

Engine

Change data structures for Patch.turtlesHere

turtles-here (and other-turtles-here) has to convert an ArrayList to an agentset -- seems silly -- can we eliminate the copying overhead? they're both arrays underneath, after all.

maybe just use an agentset in the first place? instead of an ArrayList

or maybe turtlesHere should be a linked list! we never use the random access, do we? a vector means copying overhead when removing items -- can we reduce or eliminate that? they could be linked lists, but would that really be faster given how short they usually are? object creation/deletion for the node objects might kill us. should test LinkedList performance on the full array of benchmarks. on Termites, LinkedList seems to help by about 3%.
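
a throwaway micro-benchmark along these lines (not engine code, and only a rough proxy for the real access pattern) could help settle the LinkedList question for the very short lists turtlesHere actually sees:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    // Throwaway micro-benchmark: churn a short list the way a per-patch turtle
    // list gets churned (a few adds, then removals by identity scan).
    public class TurtlesHereListBench {
        static long churn(List<Object> list, int iterations) {
            Object[] turtles = { new Object(), new Object(), new Object(), new Object() };
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                for (Object t : turtles) list.add(t);
                for (Object t : turtles) list.remove(t);   // scan + remove, like a turtle leaving
            }
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            int n = 1_000_000;
            churn(new ArrayList<>(), n);   // warm up before timing
            churn(new LinkedList<>(), n);
            System.out.println("ArrayList:  " + churn(new ArrayList<>(), n) / 1e6 + " ms");
            System.out.println("LinkedList: " + churn(new LinkedList<>(), n) / 1e6 + " ms");
        }
    }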

Reduce overhead from checkAgentClass()

This should be happening in the compiler, not at runtime (except maybe in places like procedure entry and ask entry).

if I just eliminate the calls to checkAgentClass(), I save about 10% on most benchmarks!

maybe I should add some tests that fail when the calls to checkAgentClass() are eliminated, to help me understand better the situations in which it is needed

agent class checking can be per-command not per-instruction (but watch out for _with!); currently we don't agent-type-check reporters at all! is that a bug? wait, actually, maybe the compiler is doing this, maybe there's no bug

perform/performObserver/performTurtle/performPatch? maybe use __observercode and friends to explicitly indicate where checks should go? or cache the checks? maybe have the model contain three different programs!

can we make TypeParser mark primitives that have already been agent-type-checked, so we don't need to do runtime checks? also, can we mark entire procedures by type and then have _call do the checking, so it can be avoided within the procedure? except there's also _ask, _hatch, etc. maybe a new primitive that we would insert any time a check is needed, then not check otherwise? (but every prim we add increases the size of the compiled code, so that has a cost...) or compile separate observer, turtle, and/or patch versions of each chunk of code. maybe _ask and friends should actually cause a procedure to be defined; then we'd have a one-to-one correspondence between procedures and chunks of code

when do we really need to call checkAgentClass? only on entry to procedures and entry to ask, right? and things like cct/hatch/etc. but which commands do we check? we need to skip commands inside ask/hatch/cct etc. -- this is what motivated the idea of making all command blocks that involve switching agents into separate procedures (that end with _done instead of _return, I guess). what about "run"?

maybe doing the separate-procedures-for-command-blocks thing wouldn't be that hard! careful though, then not all procedure calls would be going through _call. would need to look at _call code carefully. also need to handle error reporting carefully. how will this work -- by having _ask.assemble() call back to the compiler? that could be tricky... maybe it's better to have the compiler deal with it directly, so that the procedures created get handled as part of the normal "flow" of the compiler.

Continue improving bytecode generator

Improve bytecode for simple arithmetic expressions and comparisons

Generate whole basic blocks at once

Parallelize some things?

the diffuse primitive seems like a good candidate. also clear-all. anything else?

also we could analyze simple "asks" and determine whether they are safe to parallelize. stuff like ask patches [ set pcolor red ].

maybe let the user choose which asks can be done in parallel, and leave it up to them to promise it's safe?

Don't make agentsets when creating turtles

e.g. in hatch, there's no need to make an agentset, is there? can't we just have each agent run the initialization block one at a time with a separate call to runExclusive each time? would that really be faster?

maybe just have special one-turtle forms of hatch and sprout that avoid the agentset when the first input is the constant 1? (using the optimizer)

Reduce dead checks?

Instead of having to compare the agent's id to -1 all the time, can we, at the time we kill a turtle, do something to it that will keep it from stepping any contexts without explicit checks? Even if we did it by exceptions that would (perhaps!) be OK since in most models this doesn't even arise.

Machine generate the code in org.nlogo.prim

start thinking about machine generation -- instead of writing commands completely in Java, write them in an annotated form of Java that can be expanded out into much more Java code at compile time, basically turning runtime checks into compile-time checks...

Improve data structures in JobManager/JobThread

better data structures in JobManager.run()? (linked list, not vector?)

Avoid creating Double and Integer objects

  • could optimize -of forms on agent variables that are always doubles
  • add something to "how to write a prim" that explains this stuff
  • when we see code like "set pcolor pcolor - 1" we should convert it to "set pcolor pcolor - 1.0"; otherwise we're repeatedly converting 1 to 1.0 at runtime, aren't we?
  • don't make Doubles for Turtle.VAR_COLOR either including changing the setTurtleVariable(double) VAR_COLOR case
  • to do: _multdouble, etc.
  • optimize to the level of _pcolor(), _heading(), etc? why not?
  • need to add optimizing to _minusconstdouble(DoubleReporter)
  • maybe this is all misguided -- I could be unboxing something and then reboxing it later, couldn't I, when the box could have just been passed through? try to construct an example where that happens. what about "set pcolor xcor"? sometimes it's better to report a box, other times it's better to report a prim. ugh! what about "if pcolor = yellow" ? [0]_ifelse{-T-}(_equal1{-T-}(_patchvariabledouble{-T-}:PCOLOR):0.0):+5 then it's having to box the color after all!
  • need to optimize both _equal and _equal1 to catch both "pcolor = yellow" and "pcolor = size"; should I go all the way to _equalconstdouble?

further idea: within each command, we don't need lots of wrapper objects... we need one wrapper object for storing temporary results! I think I tried this once as "netlogo.sharedbox" or something like that and it didn't help, but I'm not sure if that experiment was actually conclusive
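
a sketch of the "_multdouble and friends" / _setturtlevariabledouble idea above (names and interfaces hypothetical, not the real prim classes): reporters known to produce doubles get a primitive-returning path, so chained arithmetic never boxes intermediate results.

    // Hypothetical sketch: a primitive-returning fast path for double-valued reporters.
    abstract class DoubleReporter {
        abstract double reportDouble();               // unboxed fast path
        Object report() {                             // box only when a caller needs an Object
            return Double.valueOf(reportDouble());
        }
    }

    class ConstDouble extends DoubleReporter {
        private final double value;
        ConstDouble(double value) { this.value = value; }
        double reportDouble() { return value; }
    }

    class MultDouble extends DoubleReporter {
        private final DoubleReporter left, right;
        MultDouble(DoubleReporter left, DoubleReporter right) { this.left = left; this.right = right; }
        double reportDouble() { return left.reportDouble() * right.reportDouble(); }
    }

only a consumer that genuinely needs an Object (say, setting a variable that can hold any value) calls report() and pays for one box at the boundary -- which is exactly the "where does the box end up" question raised above.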

Inline procedures

can we inline procedures? thoughts:

  • reporter procedures that consist of a single "report" command should be easy to inline
  • reporter procedures are probably easier to inline than command procedures
  • NetLogo uses a lot of no-arg procedures... those should be especially easy to inline
  • once we have LET then inlining will probably get easier
  • we need to worry about error reporting

I made a start at this and checked it into org.nlogo.compiler.Inliner, but with the meat commented out so it does nothing

Don't store neighbor sets in patches themselves

store neighbors/neighbors4/neighbors6 info in separate arrays...? that way we wouldn't always pay the memory cost for all three in models where not all three are used
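
a hypothetical sketch of that layout (not real engine classes): one lazily-built array per primitive, indexed by patch id, instead of three agentset fields on every Patch.

    // Hypothetical sketch: neighbor ids kept in a per-world array, built only if
    // the model ever uses the corresponding primitive.  Wrapping topology assumed.
    public class NeighborCache {
        private final int width, height;   // world dimensions
        private int[][] neighbors4;        // allocated on first use

        public NeighborCache(int width, int height) {
            this.width = width;
            this.height = height;
        }

        public int[] neighbors4(int patchId) {
            if (neighbors4 == null) buildNeighbors4();
            return neighbors4[patchId];
        }

        private void buildNeighbors4() {
            neighbors4 = new int[width * height][];
            for (int id = 0; id < neighbors4.length; id++) {
                int x = id % width, y = id / width;
                neighbors4[id] = new int[] {
                    y * width + (x + 1) % width,              // east (wrapped)
                    y * width + (x + width - 1) % width,      // west
                    ((y + 1) % height) * width + x,           // north
                    ((y + height - 1) % height) * width + x   // south
                };
            }
        }
    }

the same pattern would work for neighbors and neighbors6, each paying its memory cost only in models that actually use it.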

Break up code between switches into blocks

idea: break the code up into "blocks" between switch statements and ensure the blocks execute more quickly somehow by avoiding extra checks? (would it be beneficial to have an explicit _switch command? probably not, we want to avoid instruction-fetching overhead)

Eliminate Job.owner slot?

perhaps the owner slot could be eliminated from Job by having JobManager track owners instead -- just keep a Map from top level jobs to owners

Make makeEvaluationContext smarter

I had to back out the Context.java change where makeEvaluationContext always returned the same Context, but it could be reapplied with some care: we need to return the same Context only the first time; nested calls to makeEvaluationContext are the only ones that must make a new Context

(older idea: can makeEvaluationContext() just stash the context and use it again later? this might speed up some models...)

Avoid Math.* calls

Math.rint is slow; don't use it in TurtleDrawer.java. Math.floor is slow; don't use it in Patch.java. Check for others like that.

Make optimizing mechanism more general

shouldn't be that hard to make the optimizer smart enough to know that _plus1double is the such-and-such form of _plus, and make that extendible to other prims without having to write those difficult optimize methods individually. this would make it much easier to add a bunch more optimizations.

(note: if this could be combined with the automatic-generation-of- org.nlogo.prim idea, then we'd really be in business, because we wouldn't have to handcode all the optimized forms either)
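
a hypothetical sketch of what "table-driven" could look like (the wiring here is invented, not the real optimizer API): register "general prim class + constant double argument => specialized prim" rules in one map the optimizer consults, instead of hand-writing an optimize() method per specialization.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of a table-driven peephole optimizer.
    public class SpecializationTable {
        public interface Rewrite {
            Object specialize(Object generalPrim, double constantArg);
        }

        private final Map<Class<?>, Rewrite> rules = new HashMap<Class<?>, Rewrite>();

        public void register(Class<?> generalPrimClass, Rewrite rule) {
            rules.put(generalPrimClass, rule);
        }

        // Called on each node whose argument has been folded to a double constant;
        // returns the node unchanged if no rule applies.
        public Object trySpecialize(Object prim, double constantArg) {
            Rewrite rule = rules.get(prim.getClass());
            return rule == null ? prim : rule.specialize(prim, constantArg);
        }
    }

registering, say, the _plus -> _plus1double rewrite would then be one line, and the same table could drive machine-generated specialized prims if that idea pans out.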

Add more compiler optimizations

  • optimize reverse+sort -> _sortreverse
  • optimize if+not -> ifnot
  • optimize "T> PVAR-of patch-here" => "T> PVAR"
  • optimize turtlesat like we did patchat
  • do iConstant + _plus + iConstant => iConstant, not just for ints
  • constpatch, constturtle?
  • "random-one-of turtles-here" and "random-one-of breed-here" could be optimized to not have to construct an agentset
  • optimize _ask+_woi and _woi+_ask to a single _askwoi command
  • optimize _distance(xy) != 0 or > 0 to not compute the distance, just compare the coordinates for equality
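
expanding on the "random-one-of turtles-here" bullet, a hypothetical sketch (names are stand-ins, not real engine code): pick straight out of the patch's turtle list instead of copying it into a throwaway agentset first.

    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch for "random-one-of turtles-here".
    public class RandomOneOfTurtlesHere {
        static Object randomOneOf(List<?> turtlesHere, Random random) {
            if (turtlesHere.isEmpty()) {
                return null;   // i.e. nobody
            }
            return turtlesHere.get(random.nextInt(turtlesHere.size()));
        }
    }

the breed-here form could do the same thing with one filtered pass: count the matches, pick a random index among them.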

Avoid recomputing agent-invariant subexpressions

in general, knowing whether an expression varies from agent to agent would help us speed up, because agent-invariant subexpressions could be evaluated once ahead of time. for example in Cell Aut: in "patches with [p{x,y}cor = line]", line is an observer variable, so we can treat it as a constant (and then we can go on and optimize it to grab just that row without having to look at every patch); same for screen-edge-x/y and expressions involving arithmetic on constants, observer variables, sex/y, etc. but how often does this actually come up in practice? perhaps not very often -- judging from the benchmark models, at least, it's rare that a whole subexpression (not just a single prim) is invariant across agents

Change boolean operator associativity

now that they short circuit, and/or should right-associate not left-associate, for efficiency!

(a and (b and c)) is more efficient than ((a and b) and c) because if a is false the second and gets skipped.

should anything else right-associate? go down list
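
a tiny stand-alone illustration (not engine code) of why the right-associated tree saves work: count how many `and` reporters actually get entered when the first operand is false.

    // Tiny illustration: dispatch count for left- vs right-associated `and`.
    abstract class BoolExpr { abstract boolean eval(); }

    class Leaf extends BoolExpr {
        final boolean value;
        Leaf(boolean value) { this.value = value; }
        boolean eval() { return value; }
    }

    class And extends BoolExpr {
        static int dispatches = 0;
        final BoolExpr left, right;
        And(BoolExpr left, BoolExpr right) { this.left = left; this.right = right; }
        boolean eval() {
            dispatches++;                       // one instruction fetch per `and` node entered
            return left.eval() && right.eval(); // short-circuits: right never runs if left is false
        }
    }

    public class AssocDemo {
        public static void main(String[] args) {
            BoolExpr a = new Leaf(false), b = new Leaf(true), c = new Leaf(true);

            new And(new And(a, b), c).eval();          // left-associated: enters 2 `and` nodes
            System.out.println("left-assoc dispatches:  " + And.dispatches);

            And.dispatches = 0;
            new And(a, new And(b, c)).eval();          // right-associated: enters only 1
            System.out.println("right-assoc dispatches: " + And.dispatches);
        }
    }

with short-circuiting, neither form evaluates b or c, so the only difference is the extra reporter dispatch for the nested `and` in the left-associated tree.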

Reuse Context objects?

agents could reuse Context objects... once they are created, don't destroy them, just reuse them? might speed up Fire
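
a minimal free-list sketch of that reuse idea (hypothetical names, not the real Context/Job types): finished contexts go back on a pool instead of being left for the garbage collector.

    import java.util.ArrayDeque;

    // Hypothetical sketch: a per-agent (or per-job) pool of Context objects.
    public class ContextPool<C> {
        interface Factory<C> { C create(); }

        private final ArrayDeque<C> free = new ArrayDeque<>();
        private final Factory<C> factory;

        public ContextPool(Factory<C> factory) { this.factory = factory; }

        public C acquire() {
            C context = free.poll();
            return context != null ? context : factory.create();
        }

        public void release(C context) {
            // caller is responsible for clearing agent/ip/activation fields first
            free.push(context);
        }
    }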

Merge repeated asks (add a "comma")

when two asks of the same agentset follow each other, we can optimize them into a single one with an implicit comma (a StarLogoT-style comma). how would we implement comma? I guess by keeping a count of how many of the job's contexts aren't blocked.

Eliminate Context.stopping

it seems a shame to waste the memory in every Context object for the "stopping" flag, which has no other purpose than to support the hack whereby using "stop" in a procedure can stop the enclosing forever button. couldn't we implement that hack some other way?

Eliminate Context.myself?

do we really need Context.myself? seems like a waste of memory... maybe we could figure it out on the fly when needed...? shouldn't it be parentContext.agent or something like that?

In _ask.perform(), don't construct one-agent agent sets

when asking a single agent, currently we have to construct an agentset to contain it. this seems very fixable.
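
a minimal sketch of the special case (the interfaces here are stand-ins, not the real engine types): if the target of the ask is a single agent, run it directly instead of wrapping it in a one-element agentset.

    // Hypothetical sketch of the dispatch a fixed _ask.perform() might do.
    public class AskDispatch {
        interface Agent { }
        interface AgentSet extends Iterable<Agent> { }
        interface Block { void runFor(Agent agent); }

        static void ask(Object target, Block block) {
            if (target instanceof Agent) {
                block.runFor((Agent) target);        // fast path: no agentset built
            } else {
                for (Agent a : (AgentSet) target) {  // general path
                    block.runFor(a);
                }
            }
        }
    }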

Avoid making Job objects

idea: instead of "ask" making a new Job, can we temporarily swap new info into the old Job, and then swap it back out again when we're done...?

AgentException/LogoException unification?

We can avoid catching and throwing an exception just to get its message. The AgentException is NOT a LogoException because we don't want a dependency in the agent package.

Graphics

Use Java2d to rotate rectangles

right now we draw rotated rectangles using drawPolygon. in Java 1.1 we had no choice, but now that we're Java 1.4, presumably Java2D would draw a rotated rectangle faster if we directly asked for that, instead of constructing a polygon?

(misc notes)

update screen from jobmanager outer loop, not job loop?

can the performance of the shapes cache on Macs be improved? maybe by decreasing cost of alpha? (I don't remember what I meant by this...)

maybe we should optimize the shapes cache (on Java 1.3+, at least, so we have BufferedImage) so that it's independent of the turtle color too, not just the patch color?

graphics notes:

 - idea: can we store patch colors directly in a MemoryImageSource
   or BufferedImage?  (see the sketch after these notes)
 - can we take better advantage of multiple processors by using
   the same technique HubNet uses, of having an object (ClientPatch)
   for recording a graphical change, so that computation can
   continue forward even while a graphics update is happening
   in parallel?  (not synchronize on Engine)
 - the drawLine stuff in CachedShape may slow us down; maybe we should
   use a different strategy such as making the image big, but then
   drawing the shape small within it (using clipping), then
   leaving out parts of the image when we draw it; need to benchmark this
   versus current approach, and also versus on-screen clipping, and by the way
   are we now sometimes clipping when we needn't?
 - when the patch size is a non integer, should we cache
   four different bitmaps?  (e.g. 3x3, 3x4, 4x3, 4x4)
 - should we increase the patch size ceiling when we're on a VM
   that has BufferedImage...?  and decrease anglestep?
 - is the queueing strategy good enough?
 - make the patch size always an integer, even when zooming?
   would we speed up...?
 - org.nlogo.shapes.CacheKey.hashCode() is kinda dumb -- do we care
   enough to make it better?
 - can we do something with JComponent.isOptimizedDrawingEnabled?
 - can we optimize the shapes cache for single-colored turtles
   by only caching the bitmap in one color and then changing
   the color on the fly at draw time...?
 - should the patchcustodian try harder to determine whether a patch
   is *really* dirty or not?  we could cache the old color & turtle
   info and compare the current to the old
 - on OS X native, we might take better advantage of the GPU
   if we use the repaint() mechanism whenever we need a full redraw of
   the graphics window...  in order to ensure that every frame was
   drawn we'd need a semaphore to know when paintComponent() actually
   was done being called
 - put screen updates on separate thread from JobManager?
   then we'll take advantage of dual CPU's
 - does java 1.4 have low level graphics stuff we can use to make NetLogo
   run fast?
 - setting a turtle's color to the same color it already has shouldn't
    mark the patch dirty (hence causing flicker)... but do we want to pay
   the performance price of checking this every time?  maybe not, or at
   least maybe not until we're not using wrappers for everything
 - what if the patchcustodian didn't need to use "synchronized"?
   (or at least, let's reduce use of synchronization...)
 - maybe agent subclasses shouldn't call markPatchDirty... put responsibility
   on caller?  hmmmm...
 - are we reallocating the dirtypatch ringbuffer at newmodel time?
   we should be!
 - sometimes paint patches without bothering to go through dirty mechanism?
 - we could override repaint() to immediately turn off all
   update()-ing until paint() finishes?
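
a sketch of the first idea in the notes above (patch colors stored directly in an image), assuming patch colors are available as packed RGB ints; the class name and wiring are hypothetical, not the real rendering code.

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.awt.image.DataBufferInt;

    // Hypothetical sketch: one INT_RGB image whose backing int[] is the patch
    // color grid, blitted scaled, instead of filling one rectangle per patch.
    public class PatchColorImage {
        private final int worldWidth, worldHeight;
        private final BufferedImage image;
        private final int[] pixels;   // one packed 0xRRGGBB value per patch

        public PatchColorImage(int worldWidth, int worldHeight) {
            this.worldWidth = worldWidth;
            this.worldHeight = worldHeight;
            image = new BufferedImage(worldWidth, worldHeight, BufferedImage.TYPE_INT_RGB);
            pixels = ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
        }

        // Called whenever a patch color changes; no per-patch object allocation.
        public void setPatchColor(int pxcor, int pycor, int rgb) {
            pixels[pycor * worldWidth + pxcor] = rgb;
        }

        // Blit the whole grid, letting Java2D scale up to the on-screen patch size.
        public void paint(Graphics2D g, int patchSize) {
            g.drawImage(image, 0, 0, worldWidth * patchSize, worldHeight * patchSize, null);
        }
    }

caveat: grabbing the raster's backing array may forfeit some of Java2D's image-acceleration tracking, so this would need benchmarking against the current per-patch fills.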

Sort through these

why did Flocking slow down so much from 1.3 to 2.0beta3 on Windows?
most of the slowdown was the IBM11->Sun14 VM switch; can we pinpoint?
is it because of strictfp/StrictMath?

 Eamon McKenzie wrote:
 >I've been thinking about a potential optimization to netlogo that you may
 >want to try.  the code may be set up like this already, but if my memory
 >serves me correctly it's not.  although netlogo is not a simd execution
 >engine it still can benefit from a simd style memory layout.  the
 >attached program "ca test.nlogo" made me think of this.  patch code
 >switches on each command and in the absence of conditionals is in
 >lockstep.  because of this the code is working on the same patch
 >variables across all patches on each tick.  if you arrange the data like:
 >struct
 >{
 >     pc[num-patches];
 >     old-pc[num-patches];
 >};
 >instead of:
 >struct
 >{
 >     pc;
 >     old-pc;
 >} [num-patches];
 >it should be able to more efficiently use the L2 cache and memory bus.

why are Fire and FireBig now slower than NetLogo 1.3 even 
without graphics off?  was my "optimization" of "set pcolor"
in Patch.java misguided?
(or is it just the VM switch?  check the benchmark data)

can we avoid the possible slowness of using
Collections.synchronizedList in JobThread?

I think the topLevelActivation slot could maybe be eliminated
from Job somehow, to save memory

can we make it so the same string isn't compiled over and over
again, if you run/runresult it over and over again?
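
a minimal memoization sketch for that (all names hypothetical, not the real compiler API): key the compiled code on the source string, with a small LRU bound so user-generated strings can't grow the cache forever.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: compile each distinct run/runresult string only once.
    public class RunStringCache<Code> {
        interface Compiler<Code> { Code compile(String source); }

        private static final int MAX_ENTRIES = 256;
        private final Compiler<Code> compiler;
        private final Map<String, Code> cache =
            new LinkedHashMap<String, Code>(16, 0.75f, true) {   // access-order = LRU
                protected boolean removeEldestEntry(Map.Entry<String, Code> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

        public RunStringCache(Compiler<Code> compiler) { this.compiler = compiler; }

        public Code compiledFor(String source) {
            Code code = cache.get(source);
            if (code == null) {
                code = compiler.compile(source);   // compile once per distinct string
                cache.put(source, code);
            }
            return code;
        }
    }

the cache would also have to be flushed whenever the model recompiles, since the same string can compile differently against different procedures and variables.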

don't bother with Turtle.x/ycor and Patch.px/ycor() methods --
just access the slots directly

hmm, maybe asks of a single agent could be run in exclusive mode?
(but what if that agent asks an agentset... maybe not...)

 global:
 - try this: http://research.sun.com/projects/jfluid/
   for perhaps more accurate profiling of engine?
 - study StarLogoT
 - native compilation: JOVE/JET/gcj?
 - fool around with GC options for smoother operation?
   e.g. eventually enable -Xincgc?
   see http://java.sun.com/docs/hotspot/gc/

 benchmarks:
 - note: Fire benchmark can be sped up by factor of two by optimizing
   the Logo code -- do we prefer the easier-to-understand code?
 - "splotch" in Myself Example is surprisingly slow -- fixable?
 - rebenchmark/reprofile __fire
 - Fire benchmark -- restore it to match the StarLogo/StarLogoT
   benchmark exactly; make the to-report version separate

 engine (misc):
 - are there any places where we can avoid creating new objects
    by using singletons?
 - optimization: _with etc. don't need to even make new agentsets
   until a counterexample is found...!  but maybe it's not worth
   the extra checks in the inner loop
 - optimization: turn divides by floating point constants into multiply
   by inverse.  (this makes me nervous though: what if the result is
   slightly different?  that breaks exact reproducibility between
   versions, or between two versions of the same code that would seem
   like they ought to be exactly equivalent.  see the worked example
   after this list)
 - we should do real tail recursion!
 - also optimize ask + with => askwith?
   need to revise Job so the agents can be added one at a time
   instead of being collected first in an agentset
 - after Geoff finishes his changes, check firstOne(), randomOne(),
   contains() for potential speedups (they can be O(1) in many cases,
   and/or don't need to bother making an Enumeration object)
 - Geoff's code should probably use AgentEnumeration instead of
   Enumeration, or maybe TurtleEnumeration and PatchEnumeration,
   to avoid typecasting
 - turtlesHere could be specialized on org.nlogo.agent.Turtle to
   avoid typecasts
 - can we special case "ask patches" with the knowledge that
   the number of patches never changes?  e.g. reuse Contexts?
 - in Fire, we really want to reuse the exact same Context
   objects over and over again -- not just reuse any old
   dead Context object, but the exact one we used last
   time for that patch and that Job.addr -- maybe the same
   Job too!  we'd need to not null the Contexts out as
   they finish..
 - replace get*Variable calls with direct array access whenever
   possible... e.g. in _nsum4.report() -- is this worth the
   potential extra maintenance effort...?
 - can we speed up Ants by making diffuse not have to make
   all those Doubles?  leave them as doubles somehow...
 - optimize "breed in-radius ..."?
 - funny cases: _fd, _bk, _repeat, _waitforjob, _report,
    _while, _withoutinterruption, _call,
    plus _wait/_waitinternal
 - halt code in userchoice/input OK?
 - can we statically analyze how many Double objects, how many
   Integer objects, etc., will be created per agent during the
   run of a particular model, and statically allocate them...?
   (at least for some of the code)  perhaps by analyzing the
   call graph...
 - perhaps we should treat observer code differently from turtle
   and patch code, like in StarLogoT
 - speed up diffuse by doing our own wrapping?
 - Math.floor is slow (probably other Math. stuff is slow);
   roll our own instead
 - for logo values, maybe use our own mutable classes instead of the
   built in nonmutable Double, Integer, etc. -- would have to
   be done very carefully though so we don't wind up with
   "set" changing things unexpectedly
 - here's another (perhaps too weird!) hatch speedup idea: it's
   expensive to make a new job every time a turtle hatches, so why not
   create the job in advance, initially with no agents in it, with
   remove=false, running=false, exclusive=true.  then when a turtle
   hatches, we just add a context to that existing job and start it
   running. we'd need to change Job.java to stop remove from being
   set to true when the context finished.  and _waitforjob wouldn't
   work anymore, but maybe we wouldn't need _waitforjob because the
   initialization job would be exclusive, so we wouldn't need to 
   wait for it?  hmm
 - have more methods use fastGetPatchAt?
 - how to optimize diffuse:
   - call fastGetPatchAt in the second call, too
   - special case the boundary patches so most patches can be handled
     with less arithmetic?
   - crazy idea: store all patch variables in arrays not in the Patch
     objects themselves!  maybe faster on the balance (maybe)
 - don't store nulls in turtles(), instead reuse the turtles?
 - (same as prev?) pre-allocate turtles, use a marker of id -1 not a
   marker of null in turtles()?  this might speed up e.g. Wolf/Sheep
 - can we take more advantage of the speediness of arraycopy somehow?
 - Practical Java praxis 35 (use stack objects)
 - in Context.step, do we need the job.running check?  what
   is it needed for?  I guess without the check the Halt
   button wouldn't be able to stop a job which was in a loop
   of non-switching commands (or is that right?)
 - AgentSet.randomOne() should test the contiguous flag, not just do
   number == max
 - what happens in these two lines of code at the end of assemble?
      absoluteAddressing( model.program ) ;
      relink( model ) ;
   will something stored away inside ins get missed?
   yes #1:
      absoluteAddressing looks for _address commands and makes
      them absolute not relative -- this could be accomplished
      by sticking all _address objects in a container as we
      went, though
   yes #2:
      relink does the same with _call objects -- it replaces
      names with addresses -- same solution
   but we can still code this up as a test, it just won't work
     when _address and _call objects are nested
   do reporters ever have blocks in them?  yes... values-from;
     but do they ever have command blocks in them?  only indirectly,
     if we call a reporter procedure
 - currently we always switch when a command switches -- why not wait
     until ip decreases?  I think this is what MIT does...  what about
     our screen update policy though?  is it going to be OK that "ask
     patches [ set pcolor yellow set pcolor blue ]" never displays any
     yellow?  (test in MIT, StarLogoT) and what about our _fd policy?
     watch out for cases where the ip stays the same, for example,
     _waitforjob; maybe perform should return a boolean indicating
     whether to switch, so we could be more nuanced about this
 - should the compiler know which statements and procedures can alter
   data and which can't, and enforce that with and values-from can
   only use read-only code?  that would enable some compiler optimizations
 - instruction pointer could be short not int (negligible? possibly
   even a performance hit?)
 - the agentset code could be made even faster, probably -- Fernando
   thought it was horribly inefficient but I could never get him to
   explain why
 - what's up with the threading stuff in JobManager.java?
   (it sleeps every half second, maybe should wait on a condition...)
 - we should never call new from useTimeSlice, ever!!
 - is there ask overhead we can reduce?  are we creating a new job
    for every ask?  if so, ugh!  look at the threading in ask.
   there's probably a lot of overhead associated with doing
   an ask; can we get rid of some of it somehow?
   if we must make a Job, maybe instead of adding it to the end of
   the job vector, we could *replace* the parent job in the vector,
   then once we finish, reinstate the parent.... sweet!  basically
   the vector of jobs would become a vector of stacks of jobs..
   hey wait a minute, that doesn't make sense, because individual
   contexts are the parents of jobs -- jobs aren't direct parents
 - special-case the built-ins variables such as coordinates!
 - mostly it's just the slowness of calling all the _perform methods,
   it seems
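
following up on the divide-by-constant item in the list above, here's a concrete worked example of the reproducibility worry: dividing by 10.0 and multiplying by the precomputed inverse do not always round to the same double.

    // Worked example: the multiply-by-inverse rewrite can change the low bit.
    public class InverseMultiplyDemo {
        public static void main(String[] args) {
            double inverse = 1.0 / 10.0;           // 0.1 is not exactly representable
            double x = 3.0;
            System.out.println(x / 10.0);          // prints 0.3
            System.out.println(x * inverse);       // prints 0.30000000000000004
            System.out.println(x / 10.0 == x * inverse);   // false
        }
    }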

do this sooner rather than later:
https://ccl.northwestern.edu/internal/wiki/index.php/Speedup_ideas#In__ask.perform.28.29.2C_don.27t_construct_one-agent_agentsets

don't forget checkAgentType() -- big speedup there. well,
maybe only 5%? 5% is pretty good though!

could we optimize our own "byte code"?  e.g. replace gotos to dones with
dones...  I bet I could think of more such optimizations. these would
synergize with Forrest's work.

we should optimize "patches with [distancexy < " like in-radius is
optimized
also need to optimize "patches with [distance myself < ..."
do both < and <=, and the reverse forms... yuk
what if the ".." depends on the other patch?  huge yukkkkk

TreeAgentSet uses a TreeMap instead of a TreeSet even though the only
reason we ever need a Map is to do lookups on who number in the set of
all turtles -- never any other set.  do we care? is it important?
should we change TreeMap to TreeSet and then make other provisions for
doing lookup by who number?  (in theory we could write our own thing
that walked the tree and found the right turtle in O(log n) time without
having an actual key object in hand, but can we do that without having
access to TreeSet internals?)

the comment in TreeAgentSet says
"we use a tree map here so that regardless of what order the turtles
are put in they come out in the same order (by who number) otherwise
we get different results after an import and export."  if the only
time we need the turtles in order is at export time, then surely
we could just sort them at export time if we needed to..?  then maybe
we wouldn't need a tree at all, a hash table would do fine
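
a minimal sketch of the sort-at-export-time alternative (Turtle here is a stand-in interface, not the real class): keep the live turtles in a hash-based set and impose who-number order only when a stable order is actually needed.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch: who-number order imposed only at export time.
    public class ExportOrdering {
        interface Turtle { long id(); }   // stand-in for the real Turtle's who number

        static List<Turtle> inWhoOrder(Set<Turtle> turtles) {
            List<Turtle> ordered = new ArrayList<>(turtles);
            ordered.sort(Comparator.comparingLong(Turtle::id));   // O(n log n) once, at export
            return ordered;
        }
    }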

generator stuff:
- the generator has to generate the polymorphic
  sort.report() instead of the AgentSet one, even
  though the return type of OF is known in this case.
  how hard would it be to do that much type inference,
  at least?
    sort [my-in-links] of turtle 0 => []
- are we slowing it down by doing too much *inlining*?  e.g. of really
  long perform/report methods

use a quadtree to speed up arbitrary agent/agent distance calculations
in NetLogo?

do we have an inefficiency where fdinternal calls jump(0) after counting
down from the input to fd?  no, but it does call _fdinternal.perform()
one extra unneeded time

add optimized forms of OF? (e.g. _ofdouble? but _ofdouble is hard because of the polymorphism)

the way repeat is implemented assumes we're running parallel -- if we
weren't we wouldn't need a stack -- maybe should check and implement
differently if running w/o-i.  and maybe there are other similar
commands that could also be sped up.