
Rewrite core parser #26

Closed
boomskats opened this issue Dec 26, 2015 · 10 comments
@boomskats (Contributor)

Currently we're loading all macro params into a single-variable work dataset and using PRX functions to work through it one 32k chunk at a time. DS2 might give us a way of parallelising the processing of multiple frames when datasets exceed the 32k limit.

Also look at the JSON package in 9.4M3.
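
For context, a rough sketch of the kind of PRX scan described above (not the actual h54s code; the dataset and variable names, work.payload and chunk, are hypothetical):

data work.tokens;
   set work.payload;                        /* one 32k chunk per row */
   length token $200;
   retain rx;
   if _n_ = 1 then rx = prxparse('/"([^"]*)"/');   /* quoted values */
   start = 1;
   stop  = lengthn(chunk);
   call prxnext(rx, start, stop, chunk, pos, len);
   do while (pos > 0);
      token = substrn(chunk, pos, len);     /* pull out the next match */
      output;
      call prxnext(rx, start, stop, chunk, pos, len);
   end;
   keep token;
run;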

@boomskats (Contributor, Author)

We have decided that slightly better prep/preprocessing client-side is a better idea; see the development branch.

@FriedEgg

I agree not to pursue this. The parser that is part of the DS2 JSON package is terrible! You may consider using the JSON LIBNAME engine instead; assuming the data structures being transmitted do not require custom mapping (which I don't think they would for the intended purposes here), it would provide the simplest method for digesting them. When sending data rows to SAS you could even emulate the schema that SAS itself uses when outputting data to JSON via PROC JSON (for example).
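
For reference, a minimal sketch of the JSON LIBNAME engine approach (SAS 9.4M4+; the path here is hypothetical, and the engine maps the document into datasets such as ROOT and ALLDATA):

libname payload json '/tmp/data.json';

proc contents data=payload._all_;           /* see what the engine mapped */
run;

data work.parsed;
   set payload.root;                        /* top-level values land in ROOT */
run;

libname payload clear;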

@vnevlev

vnevlev commented Mar 14, 2017 via email

@FriedEgg

@vnevlev that is a valid point of concern. The DS2 JSON package was not released until 9.4M3 (July 2015) and the LIBNAME engine was not added until 9.4M4 (November 2016), which makes it somewhat 'cutting-edge' as far as Base SAS is concerned.

@boomskats (Contributor, Author)

boomskats commented Mar 18, 2017

@FriedEgg @vnevlev Agreeing with both of you here. I think spinning up DS2 each time we want to deserialise a dataset is overkill, and we don't need the full support of the LIBNAME engine, as we don't have to adhere to the strict JSON spec here since we're able to prep our dataset at the frontend.

For parsing we're testing a much simpler data step-based solution, suggested by Bruno Müller in this SAS Communities thread. So we're changing a couple of fundamental parser bits here:

  • different client-side preparation to mask SAS-sensitive characters (i.e. replace \" with "", plus some other bits), instead of relying on escape()/unescape() and the SAS urlencode/urldecode functions for masking and unmasking special characters (the server-side ingest is sketched further below)
  • for the most part, getting rid of parameter/macro-based processing and sending data by constructing file upload bodies as part of a multipart POST payload. Parameters are only used for body metadata, not the data itself. Amongst other things, like the potential for full UTF-8 support, this change results in a very decent performance boost:

2 MB of data on the old parser used to take a single thread up to 15 seconds at full blast. Now,
using the ('new') data step method, it takes ~150 ms, very little CPU, and less memory.

before: [image: CPU/memory usage with the old parser]
after: [image: CPU/memory usage with the data step method]

(CPU is red, resident memory is orange)

This looks like a parser performance boost of epic proportions, but I want to do some more real-world load testing with JMeter to see what the implications are for JVM & spawner stability. It's looking very good at the moment though; if nothing else, it massively reduces the startup cost of the deserialiser, which tends to be spun up with each request to the server.
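
For illustration, a rough sketch of the data step ingest described in the list above (not the actual h54s code; the fileref upload and the exact unmasking rule are hypothetical):

data work.ingested;
   infile upload lrecl=32767 truncover;     /* the uploaded multipart file part */
   input raw $char32767.;
   length clean $32767;
   clean = tranwrd(raw, '""', '"');         /* undo the client-side quote masking */
run;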

Anecdotally, to parse the same JSON in Python:

import json

with open('data.json') as data_file:
    data = json.load(data_file)

...takes 3.5x longer than the new DS method.

On the downside, we're replacing our 'masked json' output macro with PROC JSON output, which only came about in 9.4. If anyone requires pre-9.4 support we can maintain a legacy branch that uses the old output method; it's not a huge deal.
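
For context, a minimal sketch of the PROC JSON output path being adopted (the output path and dataset are hypothetical):

proc json out='/tmp/result.json' pretty nosastags;
   export work.result;                      /* serialise a dataset straight to JSON */
run;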

PR to the development branch coming in the next few days. Been playing with this since October, but we've just kicked off a project requiring proper UTF-8 support, so...

@boomskats changed the title from "Rewrite core parser using Proc DS2" to "Rewrite core parser" on Mar 18, 2017
@boomskats reopened this on Mar 18, 2017
@vnevlev

vnevlev commented Mar 21, 2017 via email

@boomskats (Contributor, Author)

Interesting. We're running UTF-8 integration tests, and while the majority of it works great, certain characters still trip PROC JSON up.

[image: output showing the characters returned before the error]

These were the last couple of characters it returned before it errored:

۝۞ۣ۟۠ۡۢۤ"

Seems like it's doing a better job, but it might just be that we need better tests...

@boomskats (Contributor, Author)

False alarm. UTF-8 is complicated 🎱

@boomskats (Contributor, Author)

False false alarm, it's not that complicated. You just have to specify lengths to SAS in bytes rather than characters.
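
For anyone hitting the same issue, a minimal sketch of the byte-vs-character distinction (nothing here is from the h54s source):

data _null_;
   length str $128;              /* room for 32 chars at up to 4 bytes each */
   str   = '۝۞';                 /* two 2-byte UTF-8 characters */
   bytes = lengthn(str);         /* LENGTHN counts bytes */
   chars = klength(str);         /* KLENGTH counts characters */
   put bytes= chars=;            /* bytes=4 chars=2 in a UTF-8 session */
run;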

@cj13579 (Contributor)

cj13579 commented Jun 12, 2020

In the 2.0 release of the adapter (which we just released) we have moved away from parameters and have introduced a dependency on PROC JSON for serialising output. As per this discussion, we are aware that this introduces a dependency on SAS 9.4.

Developers wishing to use h54s to build applications on older versions of SAS will still be able to use v1.0 or v0.11.3 of the adapter.

Closing this issue.

@cj13579 closed this as completed on Jun 12, 2020