IL Virtual Machine #3888

Draft

@Scooletz Scooletz commented Mar 18, 2022

Implements #4672

This PR proposes another IVirtualMachine implementation, based on transpiling EVM bytecode to IL. IL, or MSIL, is the intermediate language that .NET languages compile to (.NET being the runtime that Nethermind uses); the runtime later JITs it to machine code. Instead of running a loop that executes instructions one by one based on the program counter (pc), this PR emits the whole contract as a single method.
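
To make the idea concrete, here is a minimal sketch of emitting a loop as one IL method with System.Reflection.Emit. It is not the PR's emitter: it assumes a plain 64-bit counter instead of 256-bit EVM words, and the names IlLoopSketch and Countdown are hypothetical. The emitted IL mirrors the shape of a JUMPDEST / PUSH 1 / SUB / DUP1 / JUMPI countdown.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical illustration only: a countdown loop emitted as a single IL method.
public static class IlLoopSketch
{
    // Builds the equivalent of: do { counter -= 1; } while (counter != 0); return counter;
    public static Func<long, long> Compile()
    {
        var method = new DynamicMethod("Countdown", typeof(long), new[] { typeof(long) });
        ILGenerator il = method.GetILGenerator();

        Label loopStart = il.DefineLabel();   // plays the role of a JUMPDEST

        il.MarkLabel(loopStart);
        il.Emit(OpCodes.Ldarg_0);             // load the counter
        il.Emit(OpCodes.Ldc_I8, 1L);          // like PUSH 1
        il.Emit(OpCodes.Sub);                 // like SUB
        il.Emit(OpCodes.Dup);                 // like DUP1: keep a copy for the branch
        il.Emit(OpCodes.Starg_S, (byte)0);    // store the decremented counter
        il.Emit(OpCodes.Brtrue, loopStart);   // like JUMPI: jump back while non-zero
        il.Emit(OpCodes.Ldarg_0);             // return the final counter value
        il.Emit(OpCodes.Ret);

        return (Func<long, long>)method.CreateDelegate(typeof(Func<long, long>));
    }
}
```

Calling IlLoopSketch.Compile() once and then invoking the delegate many times is what lets the JIT treat the whole loop as ordinary compiled code, with no per-instruction dispatch.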

Tentative Plan of Future Actions

The plan of action, which is frequently updated and reorganized:

  • initial implementation of a few opcodes
  • gas calculation of existing ones
  • ClrMD and ASM print for the method
  • stack checks, with a potential change from Word* to an offset-based approach (start + int as the current)
  • endianness of the executor

Currently supported

  • opcodes:
    • POP
    • PC
    • PUSH1, PUSH2, PUSH4
    • DUP1
    • SWAP1
    • SUB
    • JUMPDEST
    • JUMP - this is a full-blown two-layer jump table: the first layer is a switch with a fanout of 128, the second layer uses simple ifs
    • JUMPI - uses the same jump table as above, plus a branch-free pop of the condition from the stack
  • gas management, calculations and returning OutOfGas when it happens
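
The two-layer jump table can be pictured with the following sketch. It models the lookup as a runtime data structure, whereas the PR emits the equivalent switch and ifs directly as IL. The class name, and the choice of the low 7 bits of the destination as the switch key, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a two-layer jump table with a fanout of 128.
public sealed class TwoLayerJumpTableSketch
{
    private const int Fanout = 128;
    private readonly List<int>[] _buckets = new List<int>[Fanout];

    public TwoLayerJumpTableSketch(IEnumerable<int> jumpDestinations)
    {
        foreach (int destination in jumpDestinations)
        {
            // First layer: pick a bucket from the low 7 bits (the "switch").
            (_buckets[destination & (Fanout - 1)] ??= new List<int>()).Add(destination);
        }
    }

    public bool IsValid(int destination)
    {
        List<int> bucket = _buckets[destination & (Fanout - 1)];
        if (bucket == null) return false;

        // Second layer: compare against each candidate (the "simple ifs").
        foreach (int candidate in bucket)
        {
            if (candidate == destination) return true;
        }
        return false;
    }
}
```

Destinations that share the same low 7 bits, such as 5 and 133, land in the same bucket and are told apart only by the second layer.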

Ahead of ILVM

  • more gas cost calculation - for any sequence of instructions between flow control statements and instructions of variable cost (for example SHA3)
  • no stack head checks - similar to the above: every operation has a known push and pop behavior, so it can be calculated upfront whether the stack won't be breached. The majority of the checks can be optimized away or moved to the beginning of the jump
  • jumps - an IL label for each JUMPDEST and a global jump table at the end of the function. If the destination of the jump is known statically (PUSHN followed by JUMP), the jump can go directly to the label without any check
  • endianness - compiling a method directly for the specific endianness
  • tracing - compiling methods in various flavors, like no tracing at all, and selecting the right one for the specific set of tracer flags
  • handling all the CALLs and discussing how to interop with other contracts
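
The per-sequence gas calculation can be sketched as follows. This is an illustration rather than the PR's code: it assumes the standard static gas costs for the opcodes listed as supported, treats a JUMPDEST as the start of a block and a JUMP or JUMPI as the end of one, and uses the hypothetical name GasBlockSketch. For the benchmark bytecode with a 4-byte counter it yields block costs of 3, 26 and 2, which appear to match the constants compared against rsi in the ASM listing (3, 1Ah, 2).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: sum static gas per basic block so one check covers the whole block.
public static class GasBlockSketch
{
    // Static gas costs for the opcodes this PR lists as supported.
    private static int Cost(byte op) => op switch
    {
        0x03 => 3,                  // SUB
        0x50 => 2,                  // POP
        0x56 => 8,                  // JUMP
        0x57 => 10,                 // JUMPI
        0x58 => 2,                  // PC
        0x5B => 1,                  // JUMPDEST
        0x80 => 3,                  // DUP1
        0x90 => 3,                  // SWAP1
        >= 0x60 and <= 0x7F => 3,   // PUSH1..PUSH32
        _ => throw new NotSupportedException($"opcode 0x{op:X2}")
    };

    public static List<int> BlockCosts(byte[] code)
    {
        var costs = new List<int>();
        int current = 0;
        bool open = false;

        for (int pc = 0; pc < code.Length; pc++)
        {
            byte op = code[pc];

            // A JUMPDEST can be jumped to, so it starts a new block.
            if (op == 0x5B && open) { costs.Add(current); current = 0; }

            current += Cost(op);
            open = true;

            // Skip the immediate bytes of PUSHn.
            if (op >= 0x60 && op <= 0x7F) pc += op - 0x5F;

            // JUMP and JUMPI end the block.
            if (op == 0x56 || op == 0x57) { costs.Add(current); current = 0; open = false; }
        }

        if (open) costs.Add(current);
        return costs;
    }
}
```

With the sums known upfront, the emitted method only needs one gas comparison and one subtraction at the start of each block instead of one per opcode.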

Potential

Potential usages:

  • precompile hot contracts like StarkNet, Uniswap, Sushi when Nethermind is built, so that the client includes a much faster VM for them
  • provide tiered execution, where hot contracts are IL-emitted in the client whenever usage statistics show that it should be done
  • be the fastest EVM implementation (non-business case, pure ego-driven)

Benchmarks

The following code was used for a terribly simple benchmark. It represents a simple loop that spins a given number of times:

byte[] code = Prepare.EvmCode
    .PushData(repeat)
    .Op(Instruction.JUMPDEST)
    .PushData(1)
    .Op(Instruction.SWAP1)
    .Op(Instruction.SUB)
    .Op(Instruction.DUP1)
    .PushData(1 + repeat.Length) // jump address
    .Op(Instruction.JUMPI)
    .Op(Instruction.POP)
    .Done;

The comparison, using runs long enough to amortize all the constant costs, is as follows:

| VM | probe size (spins) | time per one million spins (lower is better) |
|---|---|---|
| existing | 10_000_000 | 5.228 s |
| IL VM | 1_000_000_000 | 0.048 s |

🔥 This means that in this terrible benchmark ILVM is 100x faster than the existing one!

Benchmarks ASM output

To the best of my knowledge of the JIT, ASM and addressing, the benchmark bytecode is JITted to the following code:

0000: push rbp
0001: push rdi
0002: push rsi
0003: sub rsp,90h
000a: vzeroupper
000d: lea rbp,[rsp+20h]
0012: mov rax,59E479AB6165h
001c: mov [rbp+8],rax
0020: add rsp,20h
0024: mov eax,8040h
0029: neg rax
002c: add rax,rsp
002f: jb short 00007FFCD40918F3h
0031: xor eax,eax
0033: test [rsp],esp
0036: mov rdx,rsp
0039: sub rdx,1000h
0040: mov rsp,rdx
0043: cmp rsp,rax
0046: jae short 00007FFCD40918F3h
0048: mov rsp,rax
004b: test [rsp],esp
004e: sub rsp,20h
0052: lea rdx,[rsp+20h]
0057: add rdx,20h
005b: and rdx,0FFFFFFFFFFFFFFE0h
005f: mov rsi,rcx
0062: cmp rsi,3
0066: jl 00007FFCD4091A93h
006c: add rsi,0FFFFFFFFFFFFFFFDh
0070: vxorps xmm0,xmm0,xmm0
0074: vmovdqu [rdx],xmm0
0078: vmovdqu [rdx+10h],xmm0
007d: mov dword ptr [rdx+1Ch],0E1F505h
0084: add rdx,20h
0088: cmp rsi,1Ah
008c: jl 00007FFCD4091A93h
0092: add rsi,0FFFFFFFFFFFFFFE6h
0096: vxorps xmm0,xmm0,xmm0
009a: vmovdqu [rdx],xmm0
009e: vmovdqu [rdx+10h],xmm0
00a3: mov byte ptr [rdx+1Fh],1
00a7: add rdx,20h
00ab: lea rcx,[rdx-20h]
00af: vmovdqu xmm0,[rcx]
00b3: vmovdqu [rbp+10h],xmm0
00b8: vmovdqu xmm0,[rcx+10h]
00bd: vmovdqu [rbp+20h],xmm0
00c2: lea rdi,[rdx-40h]
00c6: vmovdqu xmm0,[rdi]
00ca: vmovdqu [rcx],xmm0
00ce: vmovdqu xmm0,[rdi+10h]
00d3: vmovdqu [rcx+10h],xmm0
00d8: vmovdqu xmm0,[rbp+10h]
00dd: vmovdqu [rdi],xmm0
00e1: vmovdqu xmm0,[rbp+20h]
00e6: vmovdqu [rdi+10h],xmm0
00eb: lea rdx,[rbp+50h]
00ef: call 00007FFCD46D35B8h
00f4: mov rcx,rdi
00f7: lea rdx,[rbp+30h]
00fb: call 00007FFCD46D35B8h
0100: lea rcx,[rbp+50h]
0104: lea rdx,[rbp+30h]
0108: lea r8,[rbp+10h]
010c: call 00007FFCD42A9DC0h
0111: mov rdx,rdi
0114: mov rax,[rbp+10h]
0118: mov rcx,[rbp+18h]
011c: mov r8,[rbp+20h]
0120: mov r9,[rbp+28h]
0124: bswap r9
0127: mov [rdx],r9
012a: bswap r8
012d: mov [rdx+8],r8
0131: bswap rcx
0134: mov [rdx+10h],rcx
0138: bswap rax
013b: mov [rdx+18h],rax
013f: add rdx,20h
0143: vmovdqu xmm0,[rdx-20h]
0148: vmovdqu [rdx],xmm0
014c: vmovdqu xmm0,[rdx-10h]
0151: vmovdqu [rdx+10h],xmm0
0156: add rdx,20h
015a: vxorps xmm0,xmm0,xmm0
015e: vmovdqu [rdx],xmm0
0162: vmovdqu [rdx+10h],xmm0
0167: mov byte ptr [rdx+1Fh],5
016b: add rdx,20h
016f: lea rax,[rdx-40h]
0173: mov rcx,[rax+18h]
0177: or rcx,[rax+10h]
017b: or rcx,[rax+8]
017f: or rcx,[rax]
0182: je short 00007FFCD4091A89h
0184: add rdx,0FFFFFFFFFFFFFFE0h
0188: mov rax,[rdx]
018b: or rax,[rdx+8]
018f: or rax,[rdx+10h]
0193: mov ecx,[rdx+18h]
0196: or rax,rcx
0199: jne short 00007FFCD4091A9Ah
019b: mov eax,[rdx+1Ch]
019e: mov ecx,eax
01a0: bswap ecx
01a2: sub rdx,20h
01a6: mov r8d,ecx
01a9: and r8d,7Fh
01ad: cmp r8d,6
01b1: ja short 00007FFCD4091A9Ah
01b3: mov eax,5Fh
01b8: bt eax,r8d
01bc: jb short 00007FFCD4091A9Ah
01be: cmp ecx,5
01c1: je 00007FFCD4091956h
01c7: jmp short 00007FFCD4091A9Ah
01c9: cmp rsi,2
01cd: jl short 00007FFCD4091A93h
01cf: xor eax,eax
01d1: jmp short 00007FFCD4091A9Fh
01d3: mov eax,4
01d8: jmp short 00007FFCD4091A9Fh
01da: mov eax,8
01df: mov rcx,59E479AB6165h
01e9: cmp [rbp+8],rcx
01ed: je short 00007FFCD4091AB4h
01ef: call 00007FFD336F0280h
01f4: nop
01f5: lea rsp,[rbp+70h]
01f9: pop rsi
01fa: pop rdi
01fb: pop rbp
01fc: ret

@Scooletz Scooletz added difficult It requires detailed knowledge of the codebase and changes can easily lead to severe issues. a evm wip Work in Progress labels Mar 18, 2022
Ruteri commented Mar 24, 2022

Great benchmark results!
I'd consider checking whether the JIT is optimising away the benchmarked logic. It'd be really good to see the IL generated for the benchmarked contract.

Scooletz commented Mar 25, 2022

> Great benchmark results! I'd consider checking if the JIT is not optimising away the benchmarked logic,

It should not, as there's a jump, so I assume it's safe.

> it'd be really good to see the IL generated for the benchmarked contract

Definitely! I was thinking about the same thing: extracting the ASM and printing it in the output. This will probably require using https://github.com/microsoft/clrmd, which requires the author to load it in their head again 😅 I'll provide this print soon.

Scooletz commented Mar 27, 2022

@Ruteri Please take a look at the description. I added a bottom section that shows the ASM of the bytecode used in the benchmark. To me it looks more or less valid, as I see:

  • vmovdqu for Word operations (which is a vectorized copy of the word)
  • calls that are probably for UInt256 getters, but I did not check it
  • add rdx, 20h to bump up the stack pointer by 32 bytes

It was not easy, as I self-attach and put Iced on top of it, but it should be more or less valid. Let me know what you see in there. I could push forward and even map bytecode -> IL -> addresses, but this exercise would probably not bring a lot, as around 95% of the opcodes are still missing in this VM.

Scooletz commented Apr 2, 2022

History rewritten, to allow interacting with VM.

Scooletz commented Apr 4, 2022

An update before taking a break from this PR. After amending the way the tests are run, the gains are much less bold than claimed before. ILVM is currently integrated as a call within the current implementation of VirtualMachine, which ensures that the same checks are included in both cases. The scenario of executing 200000 spins in a loop now looks as follows:

  • regular VM execution took 00:00:01.3301542, which is 6.65 ms per 1000 spins
  • IL VM execution took 00:00:00.0881434, which is 0.44 ms per 1000 spins

The multiplier then fell from the initial 100x to 15x, but it is now embedded in the existing VM, as it would be if this were fully implemented.

@benaadams
👀

@zsluedem zsluedem left a comment

Hello. My name is WillQ, and I am one of the participants in the Ethereum protocol fellow program.
I chose the IL-VM project to work on, and @LukaszRozmej guided me to this PR. FYI, I am a noob in C# and dotnet. I am trying to comprehend what this PR is doing and see what I can do for it.

I have some questions on the PR, especially about the IL-VM part. I hope I can get some help here.

I also wrote a benchmark for this PR in my own branch: https://github.com/zsluedem/nethermind/blob/il-vm/src/Nethermind/Nethermind.Evm.ILBenchmark/Program.cs .
Here is my benchmark result.

| Method | Bytecode | Mean | Error | StdDev |
|---|---|---|---|---|
| ILEvm | 5850 | 196.2 us | 3.74 us | 3.49 us |
| Evm | 5850 | 192.2 us | 1.57 us | 1.39 us |
| ILEvm | 6000600157 | 194.0 us | 1.85 us | 1.73 us |
| Evm | 6000600157 | 194.1 us | 2.05 us | 1.81 us |
| ILEvm | 600156 | 178.3 us | 1.68 us | 1.49 us |
| Evm | 600156 | 215.5 us | 1.59 us | 1.33 us |
| ILEvm | 6001600157 | 176.2 us | 2.69 us | 2.39 us |
| Evm | 6001600157 | 218.8 us | 2.73 us | 2.55 us |
| ILEvm | 60016(...)30303 [22] | 193.0 us | 1.81 us | 1.51 us |
| Evm | 60016(...)30303 [22] | 199.6 us | 0.74 us | 0.66 us |
| ILEvm | 60016005575B | 190.9 us | 2.33 us | 2.06 us |
| Evm | 60016005575B | 195.3 us | 2.36 us | 1.97 us |
| ILEvm | 6001800350 | 193.6 us | 2.20 us | 1.84 us |
| Evm | 6001800350 | 196.6 us | 1.93 us | 1.71 us |
| ILEvm | 600260019003 | 190.8 us | 1.95 us | 1.73 us |
| Evm | 600260019003 | 196.7 us | 1.16 us | 1.03 us |
| ILEvm | 6003565B | 191.4 us | 2.53 us | 2.24 us |
| Evm | 6003565B | 196.6 us | 1.50 us | 1.33 us |
| ILEvm | 63000(...)55750 [30] | 6,532.3 us | 39.32 us | 34.85 us |
| Evm | 63000(...)55750 [30] | 11,069.8 us | 112.30 us | 99.56 us |

My results differ a bit from your test case, which had not been warmed up. The 63000(...)55750 [30] case is the same as your loop test case. I hope this data helps.

if (isIL)
{
// differentiate by adding one point
code = code.Op(Instruction.POP);

Why is this difference needed?

Scooletz (Contributor Author) replied:

I don't remember.

const int wordToAlignTo = 32;

il.Emit(OpCodes.Ldc_I4, EvmStack.MaxStackSize * Word.Size + wordToAlignTo);
il.Emit(OpCodes.Localloc);

The doc of Localloc says

Allocates a certain number of bytes from the local dynamic memory pool and pushes
the address (a transient pointer, type *) of the first allocated byte onto the
evaluation stack.

Would this Localloc opcode load the locals defined above, like uint256A, into the memory pool?
And what is the first allocated byte that the docs mention? Is it current?

Scooletz (Contributor Author) replied:

Localloc is used here to allocate the whole EVM stack on the actual stack.
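
The extra wordToAlignTo bytes in the allocation leave room to align the start of the EVM stack to a 32-byte boundary. Below is a minimal sketch of that arithmetic, modelled on the add rdx,20h / and rdx,0FFFFFFFFFFFFFFE0h pair visible in the ASM listing. AlignSketch and AlignUp are hypothetical names, and the address is treated as a plain number so that no unsafe code is needed.

```csharp
using System;

// Hypothetical sketch of aligning a freshly allocated buffer to the 32-byte word size.
public static class AlignSketch
{
    public const int WordSize = 32;

    // Moves the address forward to the next 32-byte boundary.
    // An already aligned address moves by a full word, which the extra
    // WordSize bytes in the allocation account for.
    public static long AlignUp(long address) =>
        (address + WordSize) & ~(long)(WordSize - 1);
}
```

Because at most 32 bytes are skipped, an allocation of N words plus one extra word always leaves N aligned words available. In C# itself, stackalloc is the construct that compiles to the localloc instruction.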

// 4. set the field
// 5. advance pointer
case Instruction.PUSH1:
il.Load(current);

I could not really understand how the stack works in MSIL. Could you expand on this?
I feel like everything is operating on this current local.

Scooletz (Contributor Author) replied:

Yes, current here is the top of the EVM stack. When PUSH1 is executed, the plan for it is spelled out above, between lines 122-127: load the value, zero it, set 1, advance.
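
As an illustration of that plan, the sketch below models the EVM stack as a flat byte buffer of 32-byte big-endian words, with current being the offset of the next free word. It is a plain C# model rather than the emitted IL, and WordStackSketch and its members are hypothetical names.

```csharp
using System;

// Hypothetical model of PUSH1 on a stack of 32-byte big-endian words.
public sealed class WordStackSketch
{
    public const int WordSize = 32;

    private readonly byte[] _memory;
    private int _current;   // byte offset of the next free word, the "current" pointer

    public WordStackSketch(int maxWords) => _memory = new byte[maxWords * WordSize];

    public int Depth => _current / WordSize;

    public void Push1(byte value)
    {
        Span<byte> word = _memory.AsSpan(_current, WordSize); // load current
        word.Clear();                                          // zero the whole word
        word[WordSize - 1] = value;                            // set the lowest byte (big-endian)
        _current += WordSize;                                  // advance current by one word
    }

    // Reads one byte of the word on top of the stack.
    public byte PeekByte(int index) => _memory[_current - WordSize + index];
}
```

This layout is consistent with what the ASM listing shows for a one-byte push: two 16-byte zeroing stores, a byte store at offset 1Fh, and add rdx,20h.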

@Scooletz

> Hello. My name is WillQ, and I am one of the participants in the Ethereum protocol fellow program. I chose the IL-VM project to work on, and @LukaszRozmej guided me to this PR. FYI, I am a noob in C# and dotnet. I am trying to comprehend what this PR is doing and see what I can do for it.

Hey, nice to meet you 😃 This PR requires a fairly deep understanding of IL, the .NET runtime and C#. Not sure if this is the best way to start with .NET 😅

One remark that I need to start with: I put this out 6 months ago and have not revisited it since. My context on it atm is high-level, and I might need to spend more time recalling specifics. Also, currently I cannot support implementing it fully or dive deep into specifics. Still, I will do my best to provide you with some answers.

> I have some questions on the PR, especially about the IL-VM part. I hope I can get some help here.
>
> I also wrote a benchmark for this PR in my own branch: https://github.com/zsluedem/nethermind/blob/il-vm/src/Nethermind/Nethermind.Evm.ILBenchmark/Program.cs . Here is my benchmark result.
> My results differ a bit from your test case, which had not been warmed up. The 63000(...)55750 [30] case is the same as your loop test case. I hope this data helps.

In regards to the benchmarks, I can see that you call the following one

public void BuildILForNext() => _buildILForNext = true;

in GlobalSetup, but I don't remember the semantics of code execution in the EVM. From this PR's point of view, BuildILForNext was added just for the initial performance check; it makes the VM IL-emit the next contract that goes into it. Yes, there's memoization underneath, but maybe it's broken?

_codeCache.Set(codeHash, cachedCodeInfo);

The initial tests were focused on longer executions, so that the rest of the infrastructure should just work. It was a mad idea and I was pushing it further looking if it breaks. Maybe you found a breaking point, but before comparing numbers, I'd check the rest. For short ones, the numbers should be comparable I believe.

This is the best that I can share atm, @zsluedem. I hope it helps a bit.
