Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port lib0 encoding/decoding #383

Open
3 of 4 tasks
Brooooooklyn opened this issue Apr 14, 2023 · 3 comments
Open
3 of 4 tasks

Port lib0 encoding/decoding #383

Brooooooklyn opened this issue Apr 14, 2023 · 3 comments
Assignees
Labels
feat New feature or request roadmap
Milestone

Comments

@Brooooooklyn
Copy link
Member

Brooooooklyn commented Apr 14, 2023

Background

We want to port encoding/decoding features of lib0 into OctoBase. The existing lib0 implementation is buggy and unsafe to use it in production. And we don't see any optimization for that in their roadmap.

So let's just start porting our own high-performance and production-ready lib0 encoding/decoding features in OctoBase.

Steps

  1. darkskygit thorseraq
  2. darkskygit thorseraq
@darkskygit
Copy link
Member

darkskygit commented Apr 14, 2023

Overview

lib0/encode provides two types of encoding methods:

  • write an integer of explicit byte length、write integers to variable-length bytes、write float to bytes as big endian
  • composite type
    • write a variable-length int for the tag type
    • write length if need(string、stringified json etc.)
    • write the corresponding type of bytes

So in fact, what we encode and decode is a sequence containing multiple types

Design

I plan to implement a port of the lib0 part using nom for decoding and byteorder for encoding.
Since lib0 only provides codecs for multi-type sequences and does not provide specific structural abstractions, I do not plan to introduce libraries such as serde at this level, but introduce them in the port of yjs.

Testing

At present, y-crdt has implemented a lib0 version. I plan to use it as a result reference to fuzzing, and write a test case comparing the results of js version lib0 for each type of encoding.

cc @Brooooooklyn @zuoxiaodong0815

@darkskygit
Copy link
Member

darkskygit commented Apr 17, 2023

The following pseudo-code describes how to read a ydoc (read method without ytype)
VarXxx means a read_var_xxx function, accept &[u8] and consume part of it, return new &[u8] and parsing results

1. readUpdateV2
	a. readClientStructRefs
		i. numOfStateUpdates = VarUint
		ii. Loop numOfStateUpdates
			1) numOfStructs = VarUint
			2) client = VarUint
			3) clock = VarUint
			4) Loop numOfStructs
				a) info = Uint8
				b) First5Bit = 0b00011111 & info
				c) Switch First5Bit 
					i) Case0 // GC
						1. len = VarUint
						2. Break
					ii) Case10 // Skip
						1. len = VarUint
						2. Break
					iii) Default
						1. cantCopyParentInfo = 0b1100000 & info
						2. If info & 0b10000000 == 0b10000000 then 
							– leftId = Id {VarUint, VarUint}
						3. End
						4. If info & 0b01000000 == 0b01000000 then 
							– rightId = Id {VarUint, VarUint}
						5. End
						6. If VarUint = 1 then
							– doc.get(VarString)
						7. Else
							– leftId = Id {VarUint, VarUint}
						8. End
						9. If info & 0b00100000 == 0b00100000 then 
							– VarString
						10. End
						11. readContentByType(First5Bit)
	b. readAndApplyDeleteSet
		i. numClients = VarUint
		ii. Loop numClients
			1) client = VarUint
			2) numberOfDeletes = VarUint
			3) Loop numberOfDeletes
				a) clock = VarUint
				b) clockLen = VarUint

@darkskygit
Copy link
Member

darkskygit commented Apr 18, 2023

Ybinary structure analysis

Currently, I have implemented the parser of ybinary(update) in #389.

There are several known problems, which I will write here for future improvements to the format:

  1. format without analytic expressions

    In mathematics, a solution that can be expressed by a analytic expression is called a analytic solution, and I use it to describe a file format: if a file format can read any data in a file container with a fixed logic, I call it has analytic expression. MKV is a typical format that has a analytic expression, you can read any data stored in MKV with a fixed read logic.

    ybinary, on the other hand, has not analytic expression. If you want to read ybinary, you first need to read multiple variable-length integers, which describe how many times you need to loop through the item, and then there is a lot of variable-length integer read/write logic in the process of reading the item, which means that if any one byte is corrupted in the ybinary, the whole ybinary will no longer be readable, which greatly increases the risk of store ybinary for a long time.

  2. can only be read in full

    ybinary is stateful: ybinary does not store the complete crdt item state, some state depends on the state in the previously loaded crdt item, which means if you want to read or write a crdt item in ybinary, you have to read the whole ybinary into memory.

  3. No file or value level self-checking

    As mentioned earlier, ybinary has no analytic expression, which means that any byte corruption in binary will result in the corruption of the whole binary. Since binary does not contain checksum on the value, we cannot know whether ybinary is corrupted and from where until we read it.

  4. Complexity of decoding

    ybinary uses a lot of variable-length integer encoding to save the values, which consumes a lot of cpu resources when coding and decoding.

    in contrast, fixed-length binary numbers can be decoded at high speed on any platform (including js runtime, using dataview)

    the variable-length integer encoding used by ybinary contains conditional judgments in the encoding and decoding logic, which makes it several times slower than fixed-length binary numbers even if it is optimized on the native platform.

    image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat New feature or request roadmap
Projects
None yet
Development

No branches or pull requests

4 participants