Feature idea: 'sealed' and 'specialized' keywords #248

Closed
lukego opened this Issue Dec 4, 2016 · 25 comments

@lukego

lukego commented Dec 4, 2016

Here is a feature idea that I would like to share for feedback. Please take it as a humble thought experiment for thinking through some things that LuaJIT does and could do.

Idea

Add two new keywords: a specialized modifier for functions and a sealed modifier for locals.

Declaring a function specialized means that each closure instance will have a separate prototype, i.e. traces starting lexically within the function will be unique to one closure instance.

Declaring a local sealed means that the JIT can treat its value as a constant at recording time, i.e. inline the value into the recorded trace instead of generating code to perform a lookup. Changing the value of a sealed local is an NYI and forces a JIT flush to invalidate existing machine code, i.e. you are allowed to change the value, but doing so crosses an expensive barrier.

Example

Trivial example:

function copier (size)
   sealed local n = size
   return specialized function (src, dst)
      for i = 0, n-1 do dst[i] = src[i] end
   end
end

This would generate aggressively optimized code:

  • Each closure instance would have separate machine code: a trace starting on the for loop bytecode.
  • Each closure instance's machine code would be specialized for one exact array size.
  • Each closure would be specialized for one type of source/destination e.g. table or specific FFI type. (LuaJIT already specializes on type so this would happen automatically after the specialized modifier ensures separate prototypes for each closure instance.)
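
For comparison, here is a rough way to approximate the specialized half of this in stock LuaJIT today, without any new keywords: generate the closure's source text per instance so that each instance gets its own prototype. This bakes the size in as a literal rather than a guarded sealed value, and it is only a sketch to illustrate the intent, not part of the proposal:

local function copier (size)
   -- Each call builds a fresh chunk, so each closure has its own prototype
   -- (and therefore its own traces), with the size folded in as a constant.
   local src = string.format([[
      return function (src, dst)
         for i = 0, %d - 1 do dst[i] = src[i] end
      end]], size)
   return assert(loadstring(src))()
end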

Generalization to tables

The sealed mechanism could be extended to Lua tables and used for often-referenced and seldom-modified tables, e.g. important modules. This would make lookups in those tables more efficient, which would potentially tighten up the code generated for non-looping traces / side traces and reduce the need to keep the CPU in a trace loop.

Example of a sealed module that is efficient to access:

M = {}
function M.f1 () end
function M.f2 () end
setmetatable(M, {__sealed = true})
return M

and the global environment could also be sealed:

setmetatable(_G, {__sealed = true})

Lookups into sealed modules would be free - performed at recording time - with the cost of a JIT flush if a sealed module is modified.

@lukego

lukego commented Dec 4, 2016

See also #246 and #208.

@fsfod

fsfod commented Dec 4, 2016

I experimented with a read-only tables system a while ago, based on an API approach for simplicity. I abused the colocated size field with special values to flag tables as read-only, because all the table flag bits were already used, I think. I had the same restriction you suggested: all traces are flushed if a read-only table has to be modified. I was also toying with the idea of read-only tables where missing keys are write-once.

@MageSlayer

MageSlayer commented Dec 5, 2016

I am not sure I understand why the "specialized" keyword is needed.
Shouldn't the VM specialize a function on types automatically when it detects that it's hot?
I mean, why not fix the VM instead of introducing manual workarounds?

@lukego

lukego commented Jan 2, 2017

@MageSlayer The VM creates one root-trace per loop per prototype. The amount of specialization is heuristic and depends on the instance count at recording time. The specialized keyword would create a separate prototype for each closure instance, meaning (a) a separate root trace for each closure (rather than one shared) and (b) consistently aggressive specialization of the closure (because only one instance ever exists for the prototype).

Does that explanation make sense?
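
To illustrate with the copier example from the issue description (hypothetical usage): both closures below are created from the same prototype - the anonymous function inside copier - so in current LuaJIT the for loop in that prototype gets one shared root trace rather than one per closure instance.

local copy4   = copier(4)
local copy256 = copier(256)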

@lukego

lukego commented Jan 2, 2017

@fsfod I took your readonlytables branch for a quick spin. This is exciting work!

Here is the initial test case that I cooked up: a loop with an unbiased branch and a couple of loop-invariant table lookups:

m = {}
function m.add (a, b) return a+b end
function m.mul (a, b) return a*b end

table.setreadonly(_G)
table.setreadonly(m)

local acc = 0
for i = 1, 1e9 do
   acc = acc + (i%2==0 and m.add or m.mul)(acc, 2)
end

This runs in 4.0s with readonly tables vs 5.4s with the setreadonly calls commented out.

Have to take a closer look!

@MageSlayer

MageSlayer commented Jan 2, 2017

@lukego

The VM creates one root-trace per loop per prototype. The amount of specialization is heuristic and depends on the instance count at recording time.

Do you mean that trying to use your "copier" function within one loop with several different parameter type combinations results in only one function instance being specialized/optimized?
It's not entirely clear which use case you are trying to improve.

@lukego

lukego commented Jan 2, 2017

@MageSlayer A concrete example may make more sense: luafun/luafun#33.

@lukego

lukego commented Jan 2, 2017

@fsfod Dude! I am a fan. This readonly tables branch is awesome.

I have done a little more micro-benchmarking. I am mostly interested in how well readonly tables can optimize the loop-invariant parts of traces. The simplest way to estimate this seems to be to compile with -O-loop to disable the loop optimization that usually "hoists" this code out of the inner loop. This way we should always see the "bad case" that side-traces sometimes approach (loosely speaking.)

Here is the test case that I cooked up. I don't bother with branches and side-traces since those were mostly only there to screw up loop optimization.

-- Create a module with some constants and operations
_G.m = {}
m.x = 2
m.y = 3
function m.add (a, b) return a+b end
function m.sub (a, b) return a-b end

-- comment/uncomment this line to toggle read-only:
table.setreadonly(_G) table.setreadonly(m) table.setreadonly(getfenv())

local acc = 0
for i = 1, 1e9 do
   acc = acc + m.add(acc, m.x) + m.sub(acc, m.y)
end

The loop at the end is mostly the boring "glue" of looking up fields in the module. The question is how much this costs in various scenarios. Here are the instructions/iteration and cycles/iteration that I observe with perf:

-Oloop=yes readonly=no:   9 instructions /  6 cycles
-Oloop=yes readonly=yes:  9 instructions /  6 cycles
-Oloop=no  readonly=no:  57 instructions / 30 cycles
-Oloop=no  readonly=yes: 28 instructions / 16 cycles

So:

Loop optimization enabled costs 6 cycles per iteration whether or not read-only tables are used. This is no surprise: LuaJIT loves branchless loops and will hoist all of the lookups out of the loop body.

Disabling loop optimization takes 5x the time with 30 cycles per iteration. Ouch! This seems to be basically the same overhead that we incur with an unbiased branch leading to a side-trace i.e. having to continually re-lookup the fields of the m module on each iteration.

Enabling read-only tables nearly doubles the performance to 16 cycles per iteration. This is a big improvement! Now we are taking 2.66x the time of the loop optimized version which is certainly much closer to being respectable.

Taking a look at the IR code for each variant...

Loop optimization - showing only the looping part:

0029 ------ LOOP ------------
0030    num ADD    0026  0016
0031    num ADD    0030  0026
0032    num SUB    0026  0023
0033  + num ADD    0032  0031
0034  + int ADD    0027  +1  
0035 >  int LE     0034  +1000000000
0036    int PHI    0027  0034
0037    num PHI    0026  0033

Looks fine: Just arithmetic and the loop termination check. I am slightly surprised that it did not transform acc + 3 - 2 into simply acc + 1 but whatevs... maybe an optimizer limitation or point of floating-point semantics.

Now with read-only tables:

---- TRACE 1 IR
0001    int SLOAD  #2    CI
0002    fun SLOAD  #0    R
0003    tab FLOAD  0002  func.env
0004 >  tab EQ     0003  {0x404b39d8}
0005 >  num SLOAD  #1    T
0006    num ADD    0005  +2  
0007    num ADD    0006  0005
0008    num SUB    0005  +3  
0009    num ADD    0008  0007
0010    int ADD    0001  +1  
0011 >  int LE     0010  +1000000000
0012    num CONV   0010  num.int

Some extra work: it has to load values from the stack instead of reusing them from registers, and has to check that the fenv points to the expected table, but otherwise it is the same. Notably, it did manage to treat the m.x and m.y values as constants +2 and +3.

Now looking at what we get with loop optimization disabled and without this lovely read-only table feature:

---- TRACE 1 IR
0001    int SLOAD  #2    CI
0002    fun SLOAD  #0    R
0003    tab FLOAD  0002  func.env
0004    int FLOAD  0003  tab.hmask
0005 >  int EQ     0004  +63 
0006    p32 FLOAD  0003  tab.node
0007 >  p32 HREFK  0006  "m"  @49
0008 >  tab HLOAD  0007
0009    int FLOAD  0008  tab.hmask
0010 >  int EQ     0009  +3  
0011    p32 FLOAD  0008  tab.node
0012 >  p32 HREFK  0011  "add" @2
0013 >  fun HLOAD  0012
0014 >  num SLOAD  #1    T
0015 >  p32 HREFK  0011  "x"  @1
0016 >  num HLOAD  0015
0017 >  fun EQ     0013  foo.lua:5
0018    num ADD    0016  0014
0019    num ADD    0018  0014
0020 >  p32 HREFK  0011  "sub" @3
0021 >  fun HLOAD  0020
0022 >  p32 HREFK  0011  "y"  @0
0023 >  num HLOAD  0022
0024 >  fun EQ     0021  foo.lua:6
0025    num SUB    0014  0023
0026    num ADD    0025  0019
0027    int ADD    0001  +1  
0028 >  int LE     0027  +1000000000
0029    num CONV   0027  num.int

This is MUCH more code. The main difference seems to be a lot of work done to look up the various fields of the m module (x, y, add, sub), which is a hashtable. LuaJIT actually optimizes this extremely well, knowing ahead of time which hashtable slot to look in and only sanity-checking that the table has not been resized, but it is still much more work than using a constant when we are counting cycles here. Notable also is that the m.x and m.y values were not treated as constants and so they would not be able to feed into other optimizations.

So! This is extremely promising and I would like to experiment with this in Snabb. I would imagine that we could make the _G environment read-only and likewise basically every module. Since we are often using module.function(foo) syntax, and often with modules referenced via _G, this may provide a substantial benefit for code paths that cannot take advantage of loop optimization (for whatever reason - non looping, side trace, etc.)

I owe you a beer at the next LuaJIT Users Group meetup @fsfod :-)

@Laaas

Laaas commented Jan 2, 2017

@lukego

lukego commented Jan 5, 2017

@Laaas Possibly it can. But how?

@Laaas

Laaas commented Jan 5, 2017

@lukego

lukego commented Jan 19, 2017

Woo! @fsfod mentioned in private correspondence that I may be able to achieve the effect that I want using the already built-in FFI metatypes rather than read-only tables. This would not require any extension to LuaJIT. This seems to work on the example above at least!

Here is the new variant using a sealed(table) library function:

-- Create a module with some constants and operations
_G.m = {}
m.x = 2
m.y = 3
function m.add (a, b) return a+b end
function m.sub (a, b) return a-b end

-- Returned a "sealed" copy of the input table in which lookups can be                                                                                                                
-- resolved at recording time. The keys and values are fixed. (The                                                                                                                    
-- return value is actually an FFI object with the key-value                                                                                                                          
-- associations stored in a metatype table which LuaJIT treats as                                                                                                                     
-- read-only and inlinable.)                                                                                                                                                          
function sealed (table)
   local ffi = require("ffi")
   local new = {}
   for k, v in pairs(table) do
      new[k] = v
   end
   local t = ffi.typeof("struct {}")
   ffi.metatype(t, {__index = new})
   return ffi.new(t)
end

local m = sealed(m)

local acc = 0
for i = 1, 1e9 do
   acc = acc + m.add(acc, m.x) + m.sub(acc, m.y)
end

Running this with loop optimization disabled (-O-loop) shows almost exactly the same performance as the readonly table version:

-Oloop=no  readonly=no:  57 instructions / 30 cycles
-Oloop=no  readonly=yes: 28 instructions / 16 cycles
-Oloop=no  metatype=yes: 26 instructions / 16 cycles

and the IR code looks tight:

---- TRACE 1 IR
0001    int SLOAD  #3    CI
0002 >  cdt SLOAD  #1    T
0003    u16 FLOAD  0002  cdata.ctypeid
0004 >  int EQ     0003  +96 
0005 >  num SLOAD  #2    T
0006    num ADD    0005  +2  
0007    num ADD    0006  0005
0008    num SUB    0005  +3  
0009    num ADD    0008  0007
0010    int ADD    0001  +1  
0011 >  int LE     0010  +1000000000
0012    num CONV   0010  num.int

Gotta look into this more closely and try it on a larger example.

(Has anybody already experimented with defining whole modules as FFI objects with their functions stored in a metatable? Is this a "design pattern" already?)
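
One caveat worth flagging (assuming the sealed() helper above; it also comes up further down the thread): unlike a plain table, indexing a key that is missing from the sealed FFI object raises an error instead of returning nil.

local sm = sealed({x = 2})
print(sm.x)        -- 2
print(sm.missing)  -- raises an error rather than returning nil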

@lukego

lukego commented Jan 19, 2017

Just a little more happy testing...

If I change local m = sealed(m) to m = sealed(m), i.e. continue to use a global name instead of a local one, then the performance switches to 34 instructions (+30%) and 15.5 cycles (-3%). So the JIT generates significantly more code but performance actually improves due to better instructions-per-cycle execution on this (Haswell) CPU. This phenomenon may well be peculiar to this example though.

Generally the problem I would like to solve now is to be able to use global names and have them resolved at recording time. I imagine "sealing" both _G and all of the most frequently accessed modules and hopefully the effect is that every function call to those modules skips all the hashtable lookups. I am not sure how to formulate this yet...?

@lukego

lukego commented Jan 19, 2017

Just as trivia I wonder if anybody can explain why the trace on the left takes 16 cycles per iteration while the trace on the right takes 15.5 cycles per iteration even though the latter executes 8 more instructions? I hope I have done a reasonable job of hand-aligning the traces for comparison. This is on a Xeon E5-1650 v3 CPU.

---- TRACE 1 IR				  ---- TRACE 1 IR                     
0001    int SLOAD  #3    CI		  0001    int SLOAD  #2    CI
0002 >  cdt SLOAD  #1    T		  0002    fun SLOAD  #0    R
0003    u16 FLOAD  0002  cdata.ctypeid	  0003    tab FLOAD  0002  func.env
0004 >  int EQ     0003  +96 		  0004    int FLOAD  0003  tab.hmask
					  0005 >  int EQ     0004  +63 
					  0006    p32 FLOAD  0003  tab.node
					  0007 >  p32 HREFK  0006  "m"  @49
					  0008 >  cdt HLOAD  0007
					  0009    u16 FLOAD  0008  cdata.ctypeid
					  0010 >  int EQ     0009  +96 
0005 >  num SLOAD  #2    T		  0011 >  num SLOAD  #1    T
0006    num ADD    0005  +2  		  0012    num ADD    0011  +2  
0007    num ADD    0006  0005		  0013    num ADD    0012  0011
0008    num SUB    0005  +3  		  0014    num SUB    0011  +3  
0009    num ADD    0008  0007		  0015    num ADD    0014  0013
0010    int ADD    0001  +1  		  0016    int ADD    0001  +1  
0011 >  int LE     0010  +1000000000	  0017 >  int LE     0016  +1000000000
0012    num CONV   0010  num.int	  0018    num CONV   0016  num.int
---- TRACE 1 mcode 136			  ---- TRACE 1 mcode 183
0bcbff71  mov dword [0x41ce5410], 0x1	  0bcbff42  mov dword [0x4011f410], 0x1
0bcbff7c  movsd xmm5, [0x41cf2fe8]	  0bcbff4d  movsd xmm5, [0x4012d120]
0bcbff85  movsd xmm4, [0x41cf3018]	  0bcbff56  movsd xmm4, [0x4012d150]
0bcbff8e  cvttsd2si ebp, [rdx+0x10]	  0bcbff5f  cvttsd2si ebp, [rdx+0x8]
					  0bcbff64  mov ebx, [rdx-0x8]
					  0bcbff67  mov ebx, [rbx+0x8]
					  0bcbff6a  cmp dword [rbx+0x1c], +0x3f
					  0bcbff6e  jnz 0x0bcb0010        ->0
					  0bcbff74  mov ebx, [rbx+0x14]
					  0bcbff77  mov rdi, 0xfffffffb4012e6d8
					  0bcbff81  cmp rdi, [rbx+0x4a0]
					  0bcbff88  jnz 0x0bcb0010        ->0
0bcbff93  cmp dword [rdx+0x4], -0x0b	  0bcbff8e  cmp dword [rbx+0x49c], -0x0b
0bcbff97  jnz 0x0bcb0010        ->0	  0bcbff95  jnz 0x0bcb0010        ->0
0bcbff9d  mov ebx, [rdx]		  0bcbff9b  mov ebx, [rbx+0x498]
0bcbff9f  cmp word [rbx+0x6], +0x60	  0bcbffa1  cmp word [rbx+0x6], +0x60
0bcbffa4  jnz 0x0bcb0010        ->0	  0bcbffa6  jnz 0x0bcb0010        ->0
0bcbffaa  cmp dword [rdx+0xc], 0xfffeffff 0bcbffac  cmp dword [rdx+0x4], 0xfffeffff
0bcbffb1  jnb 0x0bcb0010        ->0	  0bcbffb3  jnb 0x0bcb0010        ->0
0bcbffb7  movsd xmm7, [rdx+0x8]		  0bcbffb9  movsd xmm7, [rdx]
0bcbffbc  movaps xmm6, xmm7		  0bcbffbd  movaps xmm6, xmm7
0bcbffbf  addsd xmm6, xmm4		  0bcbffc0  addsd xmm6, xmm4
0bcbffc3  addsd xmm6, xmm7		  0bcbffc4  addsd xmm6, xmm7
0bcbffc7  subsd xmm7, xmm5		  0bcbffc8  subsd xmm7, xmm5
0bcbffcb  addsd xmm7, xmm6		  0bcbffcc  addsd xmm7, xmm6
0bcbffcf  add ebp, +0x01		  0bcbffd0  add ebp, +0x01
0bcbffd2  cmp ebp, 0x3b9aca00		  0bcbffd3  cmp ebp, 0x3b9aca00
0bcbffd8  jg 0x0bcb0014 ->1		  0bcbffd9  jg 0x0bcb0014 ->1
0bcbffde  xorps xmm6, xmm6		  0bcbffdf  xorps xmm6, xmm6
0bcbffe1  cvtsi2sd xmm6, ebp		  0bcbffe2  cvtsi2sd xmm6, ebp
0bcbffe5  movsd [rdx+0x28], xmm6	  0bcbffe6  movsd [rdx+0x20], xmm6
0bcbffea  movsd [rdx+0x10], xmm6	  0bcbffeb  movsd [rdx+0x8], xmm6
0bcbffef  movsd [rdx+0x8], xmm7		  0bcbfff0  movsd [rdx], xmm7
0bcbfff4  jmp 0x0bcbff71		  0bcbfff4  jmp 0x0bcbff42
---- TRACE 1 stop -> loop                 ---- TRACE 1 stop -> loop
@Laaas

Laaas commented Jan 19, 2017

@lukego

lukego commented Jan 20, 2017

Sorry that this thread is becoming so long but I am fascinated by this topic... I continue to think aloud to elaborate mental models that will need more data to verify...

We have three versions of the non-loop-optimized code: simple (no optimizations), readonly (optimized with @fsfod's branch), and metatype (the second variant with a global variable that references the table as an immutable FFI metatype.) How do these approaches compare?

simple takes very many instructions (57) and is slow (30 cycles). It is loading values from hashtables and using those values to compute its results. The slowness comes from these lookups somehow: either waiting on the computations to be executed or waiting for a chain of dependent L1 cache loads (I'm not sure which.)

readonly takes few instructions (28) and is fast (16 cycles). It skips the hashtable lookups entirely because they are resolved at recording time.

metatype takes many instructions (36) and is fast (15.5 cycles). It performs lookups on _G and fenv but miraculously it does not pay for them. How? Because the lookups are only used for checking guards, and do not contribute to the main computation, and so the CPU is able to predict their results and execute them in parallel. The lookups do not contribute to the dependency chain that determines the throughput of the loop. This is a fundamental advantage of the metatype approach: the exact values needed for the computation are known at recording time and so the table lookups are not on the critical path.

Here is a simple model of how the metatype code executes. I have annotated the machine code so that instructions that directly contribute to the computation are marked with an asterisk (*) to separate them from instructions that are only used for checking guards. These two sets of instructions should execute in parallel without slowing each other down. (This is a simplified model that does not account for contention for execution units or parallelism within each set of instructions.)

EDIT: Initially pasted the trace for the local version of metatype

 ---- TRACE 1 IR
 0001    int SLOAD  #2    CI
 0002    fun SLOAD  #0    R
 0003    tab FLOAD  0002  func.env
 0004    int FLOAD  0003  tab.hmask
 0005 >  int EQ     0004  +63
 0006    p32 FLOAD  0003  tab.node
 0007 >  p32 HREFK  0006  "m"  @49
 0008 >  cdt HLOAD  0007
 0009    u16 FLOAD  0008  cdata.ctypeid
 0010 >  int EQ     0009  +96
 0011 >  num SLOAD  #1    T
 0012    num ADD    0011  +2
 0013    num ADD    0012  0011
 0014    num SUB    0011  +3
 0015    num ADD    0014  0013
 0016    int ADD    0001  +1
 0017 >  int LE     0016  +1000000000
 0018    num CONV   0016  num.int
 ---- TRACE 1 mcode 183
 0bcbff42  mov dword [0x4011f410], 0x1
*0bcbff4d  movsd xmm5, [0x4012d120]
*0bcbff56  movsd xmm4, [0x4012d150]
*0bcbff5f  cvttsd2si ebp, [rdx+0x8]
 0bcbff64  mov ebx, [rdx-0x8]
 0bcbff67  mov ebx, [rbx+0x8]
 0bcbff6a  cmp dword [rbx+0x1c], +0x3f
 0bcbff6e  jnz 0x0bcb0010        ->0
 0bcbff74  mov ebx, [rbx+0x14]
 0bcbff77  mov rdi, 0xfffffffb4012e6d8
 0bcbff81  cmp rdi, [rbx+0x4a0]
 0bcbff88  jnz 0x0bcb0010        ->0
 0bcbff8e  cmp dword [rbx+0x49c], -0x0b
 0bcbff95  jnz 0x0bcb0010        ->0
 0bcbff9b  mov ebx, [rbx+0x498]
 0bcbffa1  cmp word [rbx+0x6], +0x60
 0bcbffa6  jnz 0x0bcb0010        ->0
 0bcbffac  cmp dword [rdx+0x4], 0xfffeffff
 0bcbffb3  jnb 0x0bcb0010        ->0
*0bcbffb9  movsd xmm7, [rdx]
*0bcbffbd  movaps xmm6, xmm7
*0bcbffc0  addsd xmm6, xmm4
*0bcbffc4  addsd xmm6, xmm7
*0bcbffc8  subsd xmm7, xmm5
*0bcbffcc  addsd xmm7, xmm6
*0bcbffd0  add ebp, +0x01
 0bcbffd3  cmp ebp, 0x3b9aca00
 0bcbffd9  jg 0x0bcb0014 ->1
*0bcbffdf  xorps xmm6, xmm6
*0bcbffe2  cvtsi2sd xmm6, ebp
*0bcbffe6  movsd [rdx+0x20], xmm6
*0bcbffeb  movsd [rdx+0x8], xmm6
*0bcbfff0  movsd [rdx], xmm7
 0bcbfff4  jmp 0x0bcbff42
 ---- TRACE 1 stop -> loop

There are a lot of instructions but the result only depends directly on 15 of them.

Does this make sense to anybody else?

Have to expand to larger examples...

@lukego

lukego commented Jan 20, 2017

... Just looking at the number of micro-operations (uops) each of Haswell's eight per-core execution units is executing to get a feel for whether these extra guards are causing contention for CPU execution resources.

Raw data from pmu-tools ocperf.py:

 Performance counter stats for './luajit -jdump=+r -O-loop test.lua':

    15,539,874,773      cycles                                                      
    34,026,705,893      instructions              #    2.19  insns per cycle        
     4,645,292,475      uops_executed_port_port_0                                   
     7,379,637,664      uops_executed_port_port_1                                   
     6,506,009,201      uops_executed_port_port_2                                   
     6,507,061,292      uops_executed_port_port_3                                   
     4,005,758,910      uops_executed_port_port_4                                   
     4,362,389,943      uops_executed_port_port_5                                   
     7,881,736,603      uops_executed_port_port_6                                   
     4,002,456,493      uops_executed_port_port_7                                   

       4.440512500 seconds time elapsed

Translated into percentage utilization of each execution unit (uops executed / cycles):

30%     uops_executed_port_port_0                                   
47%     uops_executed_port_port_1                                   
42%     uops_executed_port_port_2                                   
42%     uops_executed_port_port_3                                   
26%     uops_executed_port_port_4                                   
28%     uops_executed_port_port_5                                   
51%     uops_executed_port_port_6                                   
25%     uops_executed_port_port_7                                   
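
(For example, port 0: 4,645,292,475 uops / 15,539,874,773 cycles ≈ 0.30, i.e. a uop issues on that port in roughly 30% of cycles.)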

I would like to interpret this as meaning that we have plenty of spare execution unit capacity, and that we could have twice the ratio of guards to "real work" before we would impact performance by saturating an execution unit, but I dare say that is either a little or a lot too simplistic. Performance is presumably determined by a fairly intricate combination of dependencies between instructions, latencies of individual instructions, and availability of execution resources during the cycles when they are needed most.

Perhaps somebody else knows how to make a more specific analysis :).

@lukego

lukego commented Jan 20, 2017

Also, regarding ffi readonly tables, doesn't accessing a non-existent member of a struct give an error, regardless of whether or not it has an __index table?

@Laaas Yes it is true that using FFI objects instead of tables is far from being transparent plug-and-play. I also notice that you cannot setfenv() to an FFI object. However, I could still see the approach as being worth it for eliminating the overhead of accessing common modules via global names. I really dislike the alternative of caching every module and function with local in every module - bleh.

@Laaas

Laaas commented Jan 20, 2017

@lukego

lukego commented Jan 20, 2017

@Laaas I run e.g. perf stat luajit -O-loop test.lua and it prints output like this:

 Performance counter stats for './luajit -jdump=+r -O-loop test.lua':

       4916.581781      task-clock (msec)         #    1.000 CPUs utilized          
                 9      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               152      page-faults               #    0.031 K/sec                  
    17,206,902,854      cycles                    #    3.500 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    48,030,786,038      instructions              #    2.79  insns per cycle        
    12,005,943,517      branches                  # 2441.929 M/sec                  
           127,149      branch-misses             #    0.00% of all branches        

       4.916915199 seconds time elapsed

The loop in this example executes 1e9 times so divide each number by one billion to estimate the per-iteration value. (17,206,902,854 cycles means 17 cycles per iteration, etc.)

@javierguerragiraldez

javierguerragiraldez commented Jan 20, 2017

I really dislike the alternative of caching every module and function with local in every module - bleh.

This is huge!

local ffi = require("ffi")
local function asmod(t) return ffi.metatype(ffi.typeof('struct {}'), {__index=t}) end
local s,t = asmod(string), asmod(table)

and then use s.format(), t.concat(), etc instead of the ugly s_format(), t_concat()...

@lukego

lukego commented Jan 20, 2017

@javierguerragiraldez Yep! Or even better from my perspective would be along the lines of running _G.string = asmod(string) once on startup and then simply using the global name without any local or require in individual source files.
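
A rough sketch of what that startup step could look like, reusing the sealed() helper from the earlier comment (the module choices here are only illustrative):

-- Run once at startup. Field lookups such as string.format or
-- table.concat are then resolved at recording time because the
-- metatype __index table behind each sealed object is read-only to
-- the JIT; looking up the global name itself still emits guards, but
-- as discussed above those stay off the critical dependency chain.
_G.string = sealed(string)
_G.table  = sealed(table)
_G.math   = sealed(math)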

@Laaas

Laaas commented Jan 20, 2017

@Laaas

Laaas commented Jan 21, 2017

@lukego

lukego commented Mar 9, 2017

(Closed to avoid clutter.)
