
Rewrite core parser #26

Closed
boomskats opened this issue Dec 26, 2015 · 10 comments
@boomskats (Contributor)

Currently we're loading all macro params into a single-variable work dataset and using PRX functions to work through it one 32k chunk at a time. DS2 might give us a way of parallelising the processing of multiple frames when datasets exceed the 32k limit.

Also look at the JSON package in 9.4M3.
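
For context, a rough sketch of the kind of PRX scan described above (not the actual h54s code; the dataset and variable names, work.payload and chunk, are hypothetical):

data work.tokens;
   set work.payload;                        /* one 32k chunk per row */
   length token $200;
   retain rx;
   if _n_ = 1 then rx = prxparse('/"([^"]*)"/');   /* quoted values */
   start = 1;
   stop  = lengthn(chunk);
   call prxnext(rx, start, stop, chunk, pos, len);
   do while (pos > 0);
      token = substrn(chunk, pos, len);     /* pull out the next match */
      output;
      call prxnext(rx, start, stop, chunk, pos, len);
   end;
   keep token;
run;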

@boomskats (Contributor, Author)

We have decided that slightly better prep/preprocessing client-side is a better idea; see the development branch.

@FriedEgg

I agree not to pursue this. The parser that is part of the DS2 JSON package is terrible! You may consider using the JSON LIBNAME engine instead; assuming the data structures being transmitted do not require custom mapping (which I don't think they would for the intended purposes here), it would provide the simplest method for digesting them. When sending data rows to SAS you could even emulate the schema that SAS itself uses when outputting data to JSON via PROC JSON (for example).
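
For reference, a minimal sketch of the JSON LIBNAME engine approach (SAS 9.4M4+; the path here is hypothetical, and the engine maps the document into datasets such as ROOT and ALLDATA):

libname payload json '/tmp/data.json';

proc contents data=payload._all_;           /* see what the engine mapped */
run;

data work.parsed;
   set payload.root;                        /* top-level values land in ROOT */
run;

libname payload clear;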

@vnevlev

vnevlev commented Mar 14, 2017 via email

@FriedEgg

@vnevlev that is a valid point of concern. The DS2 JSON package was not released until 9.4M3 (July 2015) and the LIBNAME engine was not added until 9.4M4 (November 2016), which makes it somewhat 'cutting-edge' as far as Base SAS is concerned.

@boomskats (Contributor, Author)

boomskats commented Mar 18, 2017

@FriedEgg @vnevlev Agreeing with both of you here. I think spinning up DS2 each time we want to deserialise a dataset is overkill, and we don't need the full support of the LIBNAME engine, as we don't have to adhere to the strict JSON spec here since we're able to prep our dataset at the frontend.

For parsing we're testing a much simpler data step-based solution, suggested by Bruno Müller in this SAS Communities thread. So we're changing a couple of fundamental parser bits here:

  • different client-side preparation to mask SAS-sensitive characters (i.e. replace \" with "", plus some other bits), instead of relying on escape()/unescape() and the SAS urlencode/urldecode functions for masking and unmasking special characters (the server-side ingest is sketched further below)
  • for the most part, getting rid of parameter/macro-based processing and sending data by constructing file upload bodies as part of a multipart POST payload. Parameters are only used for body metadata, not the data itself. Amongst other things, like the potential for full UTF-8 support, this change results in a very decent performance boost:

2 MB of data on the old parser used to take a single thread up to 15 seconds at full blast. Now,
using the ('new') data step method, it takes ~150 ms, very little CPU, and less memory.

before: [image: CPU/memory usage with the old parser]
after: [image: CPU/memory usage with the data step method]

(CPU is red, resident memory is orange)

This looks like a parser performance boost of epic proportions, but I want to do some more real-world load testing with JMeter to see what the implications are for JVM & spawner stability. It's looking very good at the moment though; if nothing else, it massively reduces the startup cost of the deserialiser, which tends to be spun up with each request to the server.
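
For illustration, a rough sketch of the data step ingest described in the list above (not the actual h54s code; the fileref upload and the exact unmasking rule are hypothetical):

data work.ingested;
   infile upload lrecl=32767 truncover;     /* the uploaded multipart file part */
   input raw $char32767.;
   length clean $32767;
   clean = tranwrd(raw, '""', '"');         /* undo the client-side quote masking */
run;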

Anecdotally, to parse the same JSON in Python:

import json

with open('data.json') as data_file:
    data = json.load(data_file)

...takes 3.5x longer than the new DS method.

On the downside, we're replacing our 'masked json' output macro with PROC JSON output, which only came about in 9.4. If anyone requires pre-9.4 support we can maintain a legacy branch that uses the old output method; it's not a huge deal.
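
For context, a minimal sketch of the PROC JSON output path being adopted (the output path and dataset are hypothetical):

proc json out='/tmp/result.json' pretty nosastags;
   export work.result;                      /* serialise a dataset straight to JSON */
run;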

PR to the development branch coming in the next few days. Been playing with this since October, but we've just kicked off a project requiring proper UTF-8 support, so...

@boomskats changed the title from "Rewrite core parser using Proc DS2" to "Rewrite core parser" on Mar 18, 2017
@boomskats reopened this on Mar 18, 2017
@vnevlev

vnevlev commented Mar 21, 2017 via email

@boomskats (Contributor, Author)

Interesting. We're running UTF-8 integration tests, and while the majority of it works great, certain characters still trip PROC JSON up.

[image: output showing the characters returned before the error]

These were the last couple of characters it returned before it errored:

۝۞ۣ۟۠ۡۢۤ"

Seems like it's doing a better job, but it might just be that we need better tests...

@boomskats (Contributor, Author)

False alarm. UTF-8 is complicated 🎱

@boomskats (Contributor, Author)

False false alarm, it's not that complicated. You just have to specify lengths to SAS in bytes rather than characters.
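
For anyone hitting the same issue, a minimal sketch of the byte-vs-character distinction (nothing here is from the h54s source):

data _null_;
   length str $128;              /* room for 32 chars at up to 4 bytes each */
   str   = '۝۞';                 /* two 2-byte UTF-8 characters */
   bytes = lengthn(str);         /* LENGTHN counts bytes */
   chars = klength(str);         /* KLENGTH counts characters */
   put bytes= chars=;            /* bytes=4 chars=2 in a UTF-8 session */
run;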

@cj13579 (Contributor)

cj13579 commented Jun 12, 2020

In the 2.0 release of the adapter (which we just released) we have moved away from parameters and have introduced a dependency on PROC JSON for serialising output. As per this discussion, we are aware that this introduces a dependency on SAS 9.4.

Developers wishing to use h54s to build applications on older versions of SAS will still be able to use v1.0 or v0.11.3 of the adapter.

Closing this issue.

@cj13579 closed this as completed on Jun 12, 2020