
support arbitrary encoding #51

Closed · Raynos opened this issue Jan 4, 2013 · 24 comments · Fixed by #149
Labels: enhancement (New feature or request)

@Raynos (Member) commented Jan 4, 2013

Let's say I want an encoding like this:

var db = levelup("uri", {
    toEncoding: function (raw) {
        return new Type(JSON.parse(raw))
    }
    , toBuffer: function (type) {
        delete type._id
        return JSON.stringify(type)
    }
})

function Type(data) { ... }

I basically want to wrap all objects coming out of the db in a certain encoding and clean up all objects I put into the db with a certain cleanup logic.
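
For concreteness, a minimal sketch of what such a wrapper type might look like (the fields on `Type` here are hypothetical, just to show the round trip):

// Hypothetical Type: wraps the decoded JSON and carries a transient
// _id that toBuffer strips again before the value is written back.
function Type(data) {
    this.name = data.name          // persisted field
    this._id = String(Date.now())  // transient, removed on write
}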

@dominictarr (Contributor)

I think there are better names than "toEncoding" and "toBuffer"

@Raynos (Member, Author) commented Jan 4, 2013

I would agree with that.

@ralphtheninja (Member)

How about

var db = levelup("uri", {
    transform: function (raw) {
        return new Type(JSON.parse(raw))
    }
})

@juliangruber (Member)

var db = levelup('uri', {
  encode : function (raw) { return JSON.stringify(raw) },
  decode : function (json) { return JSON.parse(json) }
})

@dominictarr (Contributor)

what about

var db = levelup('uri', {
  encoding : {
    encode : function () { /* ... */ },  // or name it `stringify`
    decode : function () { /* ... */ }   // or name it `parse`
  }
})

Pretty much every node module that does this will have either encode & decode, or stringify & parse, so you can also just do

var db = levelup('uri', {
  encoding : require('yamlish')
})

or JSON, or msg-pack or whatever.

That is how I think you should do it, if you were going to do this.
I think we are better off not putting little convenience functions into the core bindings.
Only put in the things that are strictly necessary; put the sugar in a wrapper, like connect vs. express.
... though if we were to apply this principle to the end, I think it would mean removing some current features.
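
For illustration, one way core could accept either naming convention (normalizeEncoding is a hypothetical helper, not actual levelup code):

// Hypothetical helper: accept either { encode, decode } or
// { stringify, parse } shaped codecs.
function normalizeEncoding(enc) {
  return {
    encode: enc.encode || enc.stringify,
    decode: enc.decode || enc.parse
  }
}

var json = normalizeEncoding(JSON)
json.encode({ a: 1 })   // '{"a":1}'
json.decode('{"a":1}')  // { a: 1 }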

@ralphtheninja (Member)

@dominictarr With that approach we could check whether we've been given a callback or not, and also support streams: just pipe the data through a stream provided by the user, and pipe it back out as a duplex if the data should continue somewhere else. The user could then build any kind of chain of streams.

@Raynos (Member, Author) commented Jan 4, 2013

@dominictarr the reason this needs to go into core is because there's a bunch of json / buffer encoding logic all over the place already.

I just want that logic to allow me to use custom encoding / decoding functions.

If we don't want this in core then we can remove all the encoding / decoding logic.

I do agree that encoding: "json" should go in favor of encoding: JSON
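
That is (assuming core accepts any object exposing stringify/parse):

var db = levelup('uri', { encoding: JSON })  // uses JSON.stringify / JSON.parse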

@juliangruber (Member)

Stating that you need to provide both methods, by wrapping them in a config option, is a good idea.

What I don't like is big configuration objects instead of methods and chaining, which are more beautiful and also semantically more correct.

@rvagg (Member) commented Jan 4, 2013

Something to keep in mind: the binding can handle either Strings or Buffers so that's fine, but coming back out it needs to know whether to put the contents into a String or a Buffer; putting everything into a Buffer is going to have a performance hit where Strings are all that's needed (the majority case) due to an extra copy. So there would need to be some way to signal to the binding what to return. See the asBuffer boolean option that's passed to the binding from get() and also keyAsBuffer and valueAsBuffer for readStream().

Also, my vote would be for adding this to the options object, and 'encode' and 'decode' make sense I guess.
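
For instance, the signalling might look like this at the binding level (a sketch of the asBuffer option rvagg mentions; the `binding` handle and exact plumbing are assumed):

// Ask the binding to materialize the value as a utf8 String rather
// than a Buffer, avoiding the extra copy described above.
binding.get(key, { asBuffer: false }, function (err, value) {
  // value arrives as a String here, not a Buffer
})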

@dominictarr (Contributor)

@ralphtheninja streams are the wrong idea here, because the leveldb api needs a single buffer/string.
If the api supported streams, then it would need to buffer them, which would just give users the wrong idea.

I think we should remove the json encoding. just support the same base64, utf-8, ascii encodings that node core has. Small core, big Userland. Note, it's beginning to become evident that node core is still too large.

Easier to nip it in the bud, than to prune it back when fully grown.

@juliangruber (Member)

The correct thing would be to use buffers as the only data type, but that's not an option since it slows things down, and I think that 95% of users only need strings.

So what about making strings the default type and having buffers as an option like

db.put('foo', new Buffer('bar'))
db.get('foo', { buffer : true })

@juliangruber (Member)

Also, do the encode/decode functions need to handle buffers, or should they just be skipped for them?

@rvagg (Member) commented Jan 5, 2013

Maybe if you supply an encoder and/or decoder then you can't supply the encoding option for any operation (you can currently supply it for any operation). Then, at the same time as you give the encoder and decoder, you can supply the 'asBuffer' (or just 'buffer') option; this then gets passed down to the binding so it knows when to provide a Buffer for all of its operations.

There's also the issue that you can adjust the encoding for keys too; probably a < 1% use-case, but it's still there as an option.

@dominictarr (Contributor)

The binding only needs the encoding when creating a string, or accepting a string, correct?

@rvagg (Member) commented Jan 5, 2013

Only when passing back into V8; otherwise it can detect whether it's a Buffer or not. When it comes out of LevelDB it's just an array of bytes, and the binding needs to decide whether it should be utf8-ified or stuck into a raw Buffer, so it needs that information from outside of the database.

@dominictarr (Contributor)

Right, but if the encoding is base64 or something, you turn it into a different string.

Also, can't you just make a Buffer that points to that particular memory range? Or are there memory management problems there?

@rvagg (Member) commented Jan 5, 2013

Making a new Buffer performs a memcpy() on the byte array. Then making a new String does the same thing (I believe). If it's not utf8 then it's even more involved than that.
Hence, we should stick to Buffer or String where possible and not convert between them if we can help it. The performance improvements when I switched to this were pretty impressive and it'd be best not to go backwards.

@dominictarr (Contributor)

hmm... so setting any value for encoding implies a String.
Else it's a Buffer.

... this is complicated by arbitrary encodings,
if you want to support an encoding function that may require a string, or may require a buffer...

@dominictarr (Contributor)

So, the core issue here is how to handle whether an encoding goes to a buffer or a string.
We last discussed this 3 months ago, but this stuff has changed now, right? leveldown handles strings too?

@eugeneware

The other issue I've found with custom encodings (see: http://github.com/eugeneware/byteup) is that the 'batch' event that gets emitted ignores the decoding, so you just get the raw buffer, which may or may not be what you want. While there is a performance cost to decoding the buffer, it was a little disconcerting to see the abstraction leak.
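
To illustrate the leak (a hypothetical handler; the 'batch' event receives the operations array):

// The ops carry raw encoded values here, not the decoded objects
// the valueEncoding would normally hand back.
db.on('batch', function (ops) {
  ops.forEach(function (op) {
    console.log(Buffer.isBuffer(op.value))  // true, even with a JSON encoding
  })
})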

@eugeneware

Another option for specifying custom encodings would be simply to overload the existing keyEncoding and valueEncoding options. If it's a string, use the built-in codecs; if it's an object of functions, use that:

var db = levelup('uri', {
  keyEncoding: {
    encode : function (raw) { return JSON.stringify(raw) },
    decode : function (json) { return JSON.parse(json) }
  },
  valueEncoding: {
    encode : function (raw) { return JSON.stringify(raw) },
    decode : function (json) { return JSON.parse(json) }
  }
})

Then you could have custom libraries that just export { encode: fn, decode: fn } pairs:

var mycodec = require('mycodec');
var db = levelup('uri', {
  keyEncoding: mycodec,
  valueEncoding: mycodec
})

@rvagg (Member) commented Jun 18, 2013

The 'batch' event issue probably should be fixed! The test suite is very light on tests for events (in fact... I think they are almost entirely untested).

Re your suggestion: yes, I think that's the way to go, except for the issue of the codec needing to tell LevelDOWN whether it wants to receive Buffers or Strings; it matters for performance that we get the right type from the db. So there needs to be a third param.

@dominictarr (Contributor)

Okay, what about:

var encoding = {
  stringify: function (obj) { /* ... */ },
  parse: function (str) { /* ... */ },
  buffer: true // or false
}

I like stringify/parse because then you can just do

keyEncoding: JSON

Of course, some things use encode/decode... and even pack/unpack... because people are like that.
I'd go for stringify/parse because it's the most familiar.

@dominictarr (Contributor)

hmm, except that it's already called a value encoding...
var encoding = {stringify: ...} is silly; var encoding = {encode, decode} is more consistent.
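
Putting the thread together, a custom codec in the shape the discussion converges on might look like this (a sketch only, with the buffer flag standing in for the third param rvagg asked for):

// A full custom codec: encode/decode plus a buffer flag telling the
// binding whether decode wants a raw Buffer or a utf8 String.
var codec = {
  encode: function (obj) { return JSON.stringify(obj) },
  decode: function (str) { return JSON.parse(str) },
  buffer: false // decode expects a String
}

var db = levelup('uri', {
  keyEncoding: codec,
  valueEncoding: codec
})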
