Serialization format #5

Closed
MichaelMure opened this Issue Aug 3, 2018 · 26 comments

@MichaelMure
Owner

MichaelMure commented Aug 3, 2018

A bug's data is stored using git Commits, Trees and Blobs. Inside a Blob is a serialized OperationPack, which is an array of edit operations on the bug's state.

This OperationPack is currently serialized using Go's gob, which is neat because it just works. However, it might not be the best option for interoperability with other tools in the future.

How should it be serialized? JSON? In any case, git will compress the data using zlib, so a text format might not be that terrible.

Feel free to argue a case here.

@MichaelMure MichaelMure added the RFC label Aug 6, 2018

@daurnimator

daurnimator commented Aug 17, 2018

I just found the project via Hacker News. I'd love to give this sort of thing a try and integrate it into other tools. However, using Go's serialization rules out all my languages of choice.

I'd say use something JSON based, or if that's not enough, CBOR.

@MichaelMure

Owner

MichaelMure commented Aug 17, 2018

For the record, switching to something else would be an easy change; it's only a few lines of code:

func ParseOperationPack(data []byte) (*OperationPack, error) {
    reader := bytes.NewReader(data)
    decoder := gob.NewDecoder(reader)

    var opp OperationPack
    err := decoder.Decode(&opp)
    if err != nil {
        return nil, err
    }

    return &opp, nil
}

// Serialize will serialise an OperationPack into raw bytes
func (opp *OperationPack) Serialize() ([]byte, error) {
    var data bytes.Buffer
    encoder := gob.NewEncoder(&data)

    err := encoder.Encode(*opp)
    if err != nil {
        return nil, err
    }

    return data.Bytes(), nil
}
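For comparison, here is a minimal sketch of what a JSON-based version of those two functions could look like, using encoding/json and assuming OperationPack's fields are exported (the mixed-operation decoding question is discussed further down):

// ParseOperationPack decodes an OperationPack from JSON bytes.
func ParseOperationPack(data []byte) (*OperationPack, error) {
    var opp OperationPack
    if err := json.Unmarshal(data, &opp); err != nil {
        return nil, err
    }
    return &opp, nil
}

// Serialize encodes an OperationPack into JSON bytes.
func (opp *OperationPack) Serialize() ([]byte, error) {
    return json.Marshal(opp)
}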

@lukechampine

lukechampine commented Aug 17, 2018

My two cents: gob is convenient and efficient, but not a great choice if you want interop with other languages. Unfortunately there just aren't many binary formats that are widely supported, except perhaps protobufs.

JSON is probably your best bet. As you noted, it will be compressed anyway, and if performance is an issue you can always switch to a faster JSON encoder. The only big downside to JSON that I'm aware of is poor support for encoding binary blobs (encoding/json encodes []byte as a base-64 string). But if OperationPack is almost entirely textual data anyway, there's little reason to worry about that.
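To illustrate that last point, a quick sketch (the Attachment type is just a made-up example): encoding/json marshals a []byte field as a base64 string:

package main

import (
    "encoding/json"
    "fmt"
)

// Attachment is a hypothetical struct holding binary data.
type Attachment struct {
    Data []byte `json:"data"`
}

func main() {
    out, _ := json.Marshal(Attachment{Data: []byte("hello")})
    fmt.Println(string(out)) // {"data":"aGVsbG8="}
}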

@avar

avar commented Aug 17, 2018

Git does its own delta-compression on top of zlib. You should decide this based on a combination of whatever format requirements you have (can you add more fields, is it extensible, etc.) and how well git manages to compress it using both delta compression and zlib, which you can figure out with a large enough set of realistic test data.

@MichaelMure

Owner

MichaelMure commented Aug 17, 2018

To give more details about the requirements: OperationPack currently holds very simple data (strings, ints, arrays...), and that's likely to stay the same even when new operations are added. For instance, embedded files are stored in git blobs and then linked in the git tree.

The only tricky part is that an OperationPack is a mixed array of Operations, so the decoder needs to support that and match the correct Go struct for each operation.
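A common way to handle that mixed array with JSON would be to tag each operation with its type and dispatch on it while decoding. A rough sketch (operationJSON, CreateOperation and AddCommentOperation are placeholder names, not the actual git-bug types):

// operationJSON is an envelope carrying the operation type and its raw payload.
type operationJSON struct {
    Type string          `json:"type"`
    Data json.RawMessage `json:"data"`
}

func decodeOperation(raw json.RawMessage) (Operation, error) {
    var env operationJSON
    if err := json.Unmarshal(raw, &env); err != nil {
        return nil, err
    }
    switch env.Type {
    case "CREATE":
        var op CreateOperation
        if err := json.Unmarshal(env.Data, &op); err != nil {
            return nil, err
        }
        return &op, nil
    case "ADD_COMMENT":
        var op AddCommentOperation
        if err := json.Unmarshal(env.Data, &op); err != nil {
            return nil, err
        }
        return &op, nil
    default:
        return nil, fmt.Errorf("unknown operation type %q", env.Type)
    }
}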

@MichaelMure MichaelMure added this to Ready in git-bug Aug 18, 2018

@MichaelMure MichaelMure moved this from Ready to Backlog in git-bug Aug 18, 2018

@MichaelMure

Owner

MichaelMure commented Aug 19, 2018

With all these formats that could fit the bill, the best way to choose would be a benchmark of both performance and blob size for several formats (at least JSON and CBOR). Who knows how git's compression behaves on something that is already binary.

Maybe the git people could make an educated guess.

@j-f1

Contributor

j-f1 commented Aug 22, 2018

MessagePack is another option, but I feel like MessagePack and CBOR are both designed for getting the smallest possible representation of data, whereas JSON is designed to be human-readable, ASCII-compatible, and simple to parse. Compare JSON’s spec (the sidebar) with the CBOR and MessagePack specs.

@MichaelMure

Owner

MichaelMure commented Sep 6, 2018

I wrote some throwaway code to test the resulting blob size for various formats. Here is one run:

Creating repo: /tmp/512275589

GOB
raw: 5210, git: 2216, ratio: 42.53359%
raw: 5536, git: 2320, ratio: 41.907513%
raw: 3987, git: 1768, ratio: 44.34412%
raw: 4407, git: 1893, ratio: 42.95439%
raw: 6368, git: 2593, ratio: 40.71922%
raw: 4905, git: 2143, ratio: 43.690113%
raw: 6524, git: 2660, ratio: 40.772533%
raw: 3315, git: 1549, ratio: 46.726997%
raw: 4116, git: 1780, ratio: 43.24587%
raw: 3928, git: 1751, ratio: 44.577393%
total: 20673

JSON
raw: 4862, git: 1966, ratio: 40.436035%
raw: 5188, git: 2072, ratio: 39.93832%
raw: 3633, git: 1528, ratio: 42.058907%
raw: 4055, git: 1657, ratio: 40.863132%
raw: 6026, git: 2339, ratio: 38.815136%
raw: 4555, git: 1903, ratio: 41.778267%
raw: 6184, git: 2411, ratio: 38.987713%
raw: 2965, git: 1314, ratio: 44.31703%
raw: 3764, git: 1551, ratio: 41.20616%
raw: 3568, git: 1515, ratio: 42.460762%
total: 18256

CBOR
raw: 4746, git: 1961, ratio: 41.319008%
raw: 5071, git: 2065, ratio: 40.72175%
raw: 3524, git: 1527, ratio: 43.33144%
raw: 3944, git: 1656, ratio: 41.98783%
raw: 5902, git: 2337, ratio: 39.59675%
raw: 4440, git: 1899, ratio: 42.77027%
raw: 6062, git: 2410, ratio: 39.755856%
raw: 2852, git: 1308, ratio: 45.862553%
raw: 3652, git: 1543, ratio: 42.25082%
raw: 3463, git: 1507, ratio: 43.51718%
total: 18213

MsgPack
raw: 4746, git: 1980, ratio: 41.71934%
raw: 5072, git: 2087, ratio: 41.147476%
raw: 3521, git: 1541, ratio: 43.765976%
raw: 3941, git: 1665, ratio: 42.24816%
raw: 5902, git: 2357, ratio: 39.935616%
raw: 4439, git: 1914, ratio: 43.117817%
raw: 6060, git: 2425, ratio: 40.016502%
raw: 2853, git: 1323, ratio: 46.372242%
raw: 3654, git: 1558, ratio: 42.638203%
raw: 3464, git: 1526, ratio: 44.053116%
total: 18376

As expected, there isn't that much difference after encoding + compression. CBOR consistently wins the size contest though.

@MichaelMure

Owner

MichaelMure commented Sep 6, 2018

Note: each serialization format is tested on the same set of randomly generated OperationPacks, each with one Create and 4 AddComment ops.

@avar

avar commented Sep 6, 2018

@MichaelMure This test case really isn't meaningful. You're just testing how a given payload compresses with zlib when creating loose objects, since when you add a new object it's compressed, a header is added to it, and it's added to the object store.

Instead, after every addition you should do git add && git commit && git gc. Then measure the total size of the now-packed .git/objects directory, not individual objects.

At that point, these objects will be delta-compressed, so you can see how the size of the repo grows as they're added.

The size of individual objects is pretty much irrelevant. You can have 10 objects that are all 1GB, but delta-compress down to 1GB + 1MB or whatever, or 10GB if they don't delta-compress at all.
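For reference, a rough Go sketch of that measurement loop (a hypothetical repoObjectsSize helper, not the project's actual test code):

import (
    "os"
    "os/exec"
    "path/filepath"
)

// repoObjectsSize commits whatever is in the work tree, repacks, and returns
// the total on-disk size of .git/objects, following the methodology above.
func repoObjectsSize(dir string) (int64, error) {
    for _, args := range [][]string{
        {"add", "-A"},
        {"commit", "--allow-empty", "-m", "add operation pack"},
        {"gc"},
    } {
        cmd := exec.Command("git", args...)
        cmd.Dir = dir
        if err := cmd.Run(); err != nil {
            return 0, err
        }
    }

    var total int64
    err := filepath.Walk(filepath.Join(dir, ".git", "objects"),
        func(_ string, info os.FileInfo, err error) error {
            if err != nil {
                return err
            }
            if !info.IsDir() {
                total += info.Size()
            }
            return nil
        })
    return total, err
}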

@avar

avar commented Sep 6, 2018

@MichaelMure Also in reply to:

Who knows how the git compression behave on something that is already binary.

I'm sure there's some obscure edge case where the compression is tweaked for textual content in some way that'll prove me wrong, but in general this doesn't matter at all.

Git is just as good at delta-compressing binary data as non-binary data. What it's not good at compressing (and this goes for any compression) is data that's wildly different from one object to the next.

It just so happens that binary data is generally less delta-compressible; think two *.mp3s with different songs vs. a *.txt change to its lyrics.

But for these sorts of pack formats I wouldn't expect them to delta-compress any worse than, say, JSON. It's going to be other things that matter: for example, if you use a JSON encoder where the keys of the payload aren't sorted, and are thus different every time, that will compress worse than if they're sorted; the same goes for a binary key-value format.

I do think that for UI and introspection purposes it makes sense to pick a widely implemented and used text format like JSON, for the availability of tooling (e.g. jq), provided the compression numbers for it aren't much worse.
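On the key-ordering point above: Go's encoding/json already sorts map keys and emits struct fields in declaration order, so its output is deterministic across runs. A quick check (not from the thread):

m := map[string]int{"b": 2, "a": 1, "c": 3}
out, _ := json.Marshal(m)
fmt.Println(string(out)) // {"a":1,"b":2,"c":3} -- same order every run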

@MichaelMure

Owner

MichaelMure commented Sep 6, 2018

That's a good point, I'll check the repo size as well, before and after a git gc.

@MichaelMure

Owner

MichaelMure commented Sep 6, 2018

Alright, another run with the size of the repo before and after a git gc (initial empty size subtracted), with 1000 OperationPacks serialized:

GOB
Creating repo: /tmp/272689118
raw: 4446, git: 1944, ratio: 43.724697%
raw: 4774, git: 2075, ratio: 43.4646%
raw: 5075, git: 2203, ratio: 43.408867%
raw: 4135, git: 1795, ratio: 43.409916%
raw: 5901, git: 2437, ratio: 41.298084%
raw: 2919, git: 1372, ratio: 47.0024%
raw: 4974, git: 2098, ratio: 42.179333%
raw: 5074, git: 2153, ratio: 42.432007%
raw: 3600, git: 1613, ratio: 44.805557%
raw: 4663, git: 2016, ratio: 43.23397%
...
Unpacked: 1926463
GC packed: 1926510
Packing diff: 47
GC packed aggressive: 1926510
Packing diff: 0

JSON
Creating repo: /tmp/263735205
raw: 4094, git: 1706, ratio: 41.67074%
raw: 4428, git: 1837, ratio: 41.485996%
raw: 4731, git: 1968, ratio: 41.597973%
raw: 3776, git: 1547, ratio: 40.96928%
raw: 5554, git: 2192, ratio: 39.467052%
raw: 2566, git: 1136, ratio: 44.27124%
raw: 4628, git: 1863, ratio: 40.254967%
raw: 4732, git: 1921, ratio: 40.595943%
raw: 3242, git: 1377, ratio: 42.47378%
raw: 4320, git: 1773, ratio: 41.041668%
...
Unpacked: 1687200
GC packed: 1687247
Packing diff: 47
GC packed aggressive: 1687247
Packing diff: 0

CBOR
Creating repo: /tmp/701783232
raw: 3984, git: 1705, ratio: 42.796185%
raw: 4311, git: 1838, ratio: 42.63512%
raw: 4613, git: 1965, ratio: 42.597008%
raw: 3674, git: 1550, ratio: 42.18835%
raw: 5438, git: 2192, ratio: 40.308937%
raw: 2462, git: 1134, ratio: 46.060116%
raw: 4514, git: 1863, ratio: 41.2716%
raw: 4613, git: 1916, ratio: 41.534794%
raw: 3137, git: 1376, ratio: 43.863564%
raw: 4202, git: 1766, ratio: 42.027603%
...
Unpacked: 1685158
GC packed: 1685205
Packing diff: 47
GC packed aggressive: 1685205
Packing diff: 0

MsgPack
Creating repo: /tmp/132917535
raw: 3984, git: 1723, ratio: 43.247993%
raw: 4310, git: 1854, ratio: 43.01624%
raw: 4611, git: 1985, ratio: 43.049232%
raw: 3672, git: 1562, ratio: 42.538128%
raw: 5436, git: 2204, ratio: 40.544518%
raw: 2460, git: 1152, ratio: 46.82927%
raw: 4512, git: 1875, ratio: 41.55585%
raw: 4614, git: 1932, ratio: 41.872562%
raw: 3138, git: 1395, ratio: 44.455067%
raw: 4202, git: 1783, ratio: 42.432175%
...
Unpacked: 1700178
GC packed: 1700225
Packing diff: 47
GC packed aggressive: 1700225
Packing diff: 0

Whatever the format, there is no compression taking advantage of the similarity between OperationPacks. The packed repo is actually bigger by 47 bytes, and a git gc --aggressive does nothing.

@MichaelMure

Owner

MichaelMure commented Sep 6, 2018

Another run with 100k OperationPacks (so 500k operations), just for the sake of it:

GOB
Creating repo: /tmp/235087672
raw: 3231, git: 1492, ratio: 46.177654%
raw: 4688, git: 2097, ratio: 44.731228%
raw: 3611, git: 1625, ratio: 45.001385%
raw: 3566, git: 1620, ratio: 45.42905%
raw: 3911, git: 1718, ratio: 43.927383%
raw: 6047, git: 2526, ratio: 41.77278%
raw: 3487, git: 1595, ratio: 45.741325%
raw: 5425, git: 2267, ratio: 41.788017%
raw: 3013, git: 1341, ratio: 44.507137%
raw: 6101, git: 2549, ratio: 41.780037%
...
Unpacked: 194 MB
GC packed: 194 MB
Packing diff: 47
GC packed aggressive: 194 MB
Packing diff: 0

JSON
Creating repo: /tmp/145768759
raw: 2870, git: 1261, ratio: 43.937283%
raw: 4332, git: 1842, ratio: 42.520775%
raw: 3248, git: 1398, ratio: 43.041874%
raw: 3215, git: 1392, ratio: 43.297047%
raw: 3553, git: 1485, ratio: 41.795666%
raw: 5699, git: 2280, ratio: 40.00702%
raw: 3130, git: 1356, ratio: 43.32268%
raw: 5083, git: 2032, ratio: 39.97639%
raw: 2660, git: 1119, ratio: 42.06767%
raw: 5753, git: 2301, ratio: 39.99652%
...
Unpacked: 170 MB
GC packed: 170 MB
Packing diff: 47
GC packed aggressive: 170 MB
Packing diff: 0

CBOR
Creating repo: /tmp/170025770
raw: 2773, git: 1255, ratio: 45.257843%
raw: 4227, git: 1851, ratio: 43.789925%
raw: 3149, git: 1395, ratio: 44.299778%
raw: 3107, git: 1395, ratio: 44.898617%
raw: 3448, git: 1480, ratio: 42.92343%
raw: 5587, git: 2284, ratio: 40.880615%
raw: 3027, git: 1363, ratio: 45.02808%
raw: 4964, git: 2030, ratio: 40.89444%
raw: 2558, git: 1113, ratio: 43.510555%
raw: 5641, git: 2300, ratio: 40.77291%
...
Unpacked: 170 MB
GC packed: 170 MB
Packing diff: 47
GC packed aggressive: 170 MB
Packing diff: 0

MsgPack
Creating repo: /tmp/418211457
raw: 2778, git: 1272, ratio: 45.788338%
raw: 4228, git: 1868, ratio: 44.181644%
raw: 3150, git: 1409, ratio: 44.73016%
raw: 3109, git: 1408, ratio: 45.287876%
raw: 3447, git: 1495, ratio: 43.371048%
raw: 5587, git: 2302, ratio: 41.202793%
raw: 3026, git: 1379, ratio: 45.571712%
raw: 4965, git: 2041, ratio: 41.107754%
raw: 2562, git: 1125, ratio: 43.911007%
raw: 5641, git: 2316, ratio: 41.05655%
...
Unpacked: 171 MB
GC packed: 171 MB
Packing diff: 47
GC packed aggressive: 171 MB
Packing diff: 0

@j-f1

Contributor

j-f1 commented Sep 6, 2018

Interesting that JSON and CBOR end up almost the same size.

@avar

avar commented Sep 7, 2018

In a lot of cases --aggressive does nothing: e.g. if you have files that keep growing, they'll already be within the --window and --depth described in the git-repack manpage; --aggressive just tweaks those values from the default of 10/50 to 250/50. I wouldn't be surprised if, for such an artificial test case, you got similar or the same results with --window=1 --depth=1 or whatever.

Is the history this Go tool produces accessible somewhere?

@MichaelMure

Owner

MichaelMure commented Sep 7, 2018

Each OperationPack is independent; the only similarity between them would be the serialization format's structure. There is no file growing.

It's not that surprising that git doesn't compress that.

Is the history this Go tool produces accessible somewhere?

I'm not sure it answers your question, but have a look at https://github.com/MichaelMure/git-bug/blob/master/doc/model.md.

@avar

avar commented Sep 7, 2018

@MichaelMure

Owner

MichaelMure commented Sep 7, 2018

@avar These blobs are not tied to a branch, so it's rather impractical to push them somewhere.
Please install Go (probably just a package), check out the branch and run go run misc/serial_format_research/main.go.

MichaelMure added a commit that referenced this issue Sep 12, 2018

bug: change the OperationPack serialization format for Json
See #5 for the details of this choice
@MichaelMure

Owner

MichaelMure commented Sep 12, 2018

With 60fcfcd, I changed the serialization format to JSON.

Here are a few measurements with 10k random bugs and 10 ops/bug (100k ops total, same as the previous test):

generation & writing: 61s
repo size (before gc): 161M
git gc: 4s
repo size (after gc): 21M
cache building: 40s
cache size: 1.5M
bug query: 0.04s

Quite happy with these results! Note that the cache building currently runs on a single processor, so there is still performance to gain.

Also, now that the blobs are connected in a chain of commits, git gc starts to actually compress them. 21MB for 10k bugs is nice.

@MichaelMure MichaelMure moved this from Todo to In progress in git-bug Sep 12, 2018

@MichaelMure

Owner

MichaelMure commented Sep 13, 2018

With no sign of trouble after various tests, let's consider the matter resolved :-)

git-bug automation moved this from In progress to Done Sep 13, 2018

@andyl

andyl commented Sep 13, 2018

Is there a CLI command to generate a JSON dump from the issues?

@MichaelMure

Owner

MichaelMure commented Sep 13, 2018

@andyl there is not. What's your use case?

@andyl

andyl commented Sep 13, 2018

@MichaelMure I'm working on a project that allows people to post auction-style bids for issues (see bugmark.net). We'd very much like to integrate with git-bug. To do this, we need to be able to poll the issue repository and grab a JSON-like representation. JSON would be simple for us, but if there is another way to integrate, we're open to that too.

@MichaelMure

Owner

MichaelMure commented Sep 13, 2018

@andyl that's certainly doable and should be supported by the CLI tools.

Could you open a new issue where we can discuss that?

@andyl

andyl commented Sep 13, 2018

Could you open a new issue where we can discuss that?

@MichaelMure see #45
