[PERF]: swap out json_deserializer for simd_json #2228

Merged: 7 commits, May 5, 2024

Collaborator

@universalmind303 universalmind303 commented May 4, 2024

This swaps out json_deserializer for simd_json, which shows noticeable performance gains across the board (~10-20%). Both the local reader and the object store readers should benefit from this.

Some perf tests I ran locally:

%%timeit
daft.read_json('./stackexchange_sample.jsonl').collect()

# branch: main
# 64.4 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# branch: simdjson
# 52.5 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
# tpch sf 5 'customer' table
daft.read_json('./customer.json').collect()

# branch: main
# 289 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# branch: simdjson 
# 244 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit:

Added support for preserving order.
It isn't quite as performant as the unordered version, but it is still noticeably faster than using json_deserializer.

# stack_exchange_sample.jsonl
# 56.9 ms ± 833 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# tpch sf 5 'customer' table
# 256 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Contributor

@clarkzinzow clarkzinzow left a comment

Left a preliminary review, I'm really liking the cleanup that the simd_json port enables!

Biggest remaining thing to figure out is how to preserve key order in Object values, which is supported in json_deserializer but is not supported in simd_json: simd-lite/simd-json#270

BorrowedValue::Static(StaticNode::I64(v)) => T::from(*v),
BorrowedValue::Static(StaticNode::U64(v)) => T::from(*v),
BorrowedValue::Static(StaticNode::F64(v)) => T::from(*v),
BorrowedValue::Static(StaticNode::Bool(v)) => T::from(*v as u8),
Contributor

Nice, I like this a good bit more than the int/float-specific closures!

write!(scratch, "{node}").unwrap();
target.push(Some(scratch.as_str()));
scratch.clear();
}
Contributor

For integers and floats that contain exponents, I'm assuming that the string representation of StaticNode::I64 and StaticNode::F64 values won't preserve the exponent format, right? E.g. 1.5e9, when decoded into a string field, will be decoded as "1500000000". Is that correct?

Collaborator Author

@universalmind303 universalmind303 May 4, 2024

Yeah, without manually parsing from simd_json's Tape, there isn't a way to preserve the exponent format.

Contributor

I think that change in behavior is fine, we can always special-case that if it ends up being an issue.
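
The exponent loss discussed in this thread can be reproduced with the standard library alone. The sketch below is not the PR's code: `number_token_to_string` is a hypothetical helper, and it assumes the node's `Display` output matches that of the underlying parsed `f64`, which is what the `write!(scratch, "{node}")` call would then produce.

```rust
use std::fmt::Write;

/// Hypothetical helper: parse a JSON number token, then render it the way a
/// `write!(scratch, "{node}")` call would, assuming the node's Display impl
/// defers to the parsed f64's Display.
fn number_token_to_string(token: &str) -> Option<String> {
    // Parsing keeps only the numeric value; the original spelling is lost.
    let value: f64 = token.parse().ok()?;
    let mut scratch = String::new();
    // f64's Display never emits exponent notation.
    write!(scratch, "{value}").ok()?;
    Some(scratch)
}
```

Under that assumption, `number_token_to_string("1.5e9")` and `number_token_to_string("1500000000")` yield the same string, so the two spellings become indistinguishable once decoded into a string field.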

Contributor

@clarkzinzow clarkzinzow left a comment

LGTM overall, just a few questions around the unsafe code!

Object(Object<'value>),
}

struct OrderPreservingDeserializer<'de>(Deserializer<'de>);
Contributor

Nice! I'm happy that this order-preserving extension ended up being pretty thin.
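
For readers unfamiliar with what "order-preserving" means for object values, here is a minimal standard-library sketch. It is an illustration only: the `Value` and `Object` types below are hypothetical stand-ins, not the types from this PR or from simd_json, and storing entries in a `Vec` is just one possible way to keep keys in the order they were seen.

```rust
/// Hypothetical stand-in for a parsed JSON value; not simd_json's actual type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    Object(Object),
}

/// Object that remembers insertion order by keeping its entries in a Vec
/// rather than in a hash map.
#[derive(Debug, Clone, PartialEq, Default)]
struct Object {
    entries: Vec<(String, Value)>,
}

impl Object {
    fn insert(&mut self, key: &str, value: Value) {
        // On a duplicate key, overwrite in place so the first position is kept.
        if let Some(slot) = self.entries.iter_mut().find(|(k, _)| k.as_str() == key) {
            slot.1 = value;
        } else {
            self.entries.push((key.to_string(), value));
        }
    }

    fn get(&self, key: &str) -> Option<&Value> {
        // Linear scan; acceptable for the small objects typical of JSON records.
        self.entries
            .iter()
            .find(|(k, _)| k.as_str() == key)
            .map(|(_, v)| v)
    }

    fn keys(&self) -> Vec<&str> {
        self.entries.iter().map(|(k, _)| k.as_str()).collect()
    }
}
```

The trade-off in this sketch is lookup cost: `get` is linear in the number of keys, whereas a hash map gives constant-time lookup but no ordering guarantee.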

.map(|unparsed_record| {
- json_deserializer::parse(unparsed_record.as_bytes()).map_err(|e| {
- super::Error::JsonDeserializationError {
+ crate::deserializer::to_value(unsafe { unparsed_record.as_bytes_mut() })
Contributor

@clarkzinzow clarkzinzow May 4, 2024

I'm assuming that the safety contract here for as_bytes_mut(), i.e. that the content of the slice is valid UTF-8 again before the record is next used as a str, is trivially satisfied because the records are never used after parsing?

Collaborator Author

Yeah, exactly. The simd_json parser is "destructive" in nature, so as long as you don't attempt to reuse the original data after parsing, this is generally safe.
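
The contract discussed above can be sketched without simd_json. In the example below, `destructive_parse` is a hypothetical stand-in for an in-place parser (it simply overwrites the buffer), and `parse_record` / `parse_record_safe` are illustrative names, not functions from this PR. The point is that taking the record by value makes "never used after parsing" something the compiler enforces.

```rust
/// Hypothetical stand-in for a destructive, in-place parser: it overwrites
/// the input buffer and returns how many bytes it consumed.
fn destructive_parse(buf: &mut [u8]) -> usize {
    let len = buf.len();
    // Simulate in-place rewriting that can leave non-UTF-8 bytes behind.
    for b in buf.iter_mut() {
        *b = 0xFF;
    }
    len
}

/// Takes the record by value so the caller cannot observe it after parsing.
fn parse_record(mut record: String) -> usize {
    // SAFETY: `record` is owned by this function and is dropped at the end
    // without being read as a `str` again, so any invalid UTF-8 left in the
    // buffer is never observed.
    let bytes = unsafe { record.as_bytes_mut() };
    destructive_parse(bytes)
}

/// Alternative without `unsafe`, assuming an owned String is available:
/// convert to a plain byte buffer first, which has no UTF-8 invariant.
fn parse_record_safe(record: String) -> usize {
    let mut bytes = record.into_bytes();
    destructive_parse(&mut bytes)
}
```

Whether the `into_bytes()` route is applicable here depends on how the records are owned in the reader, which this sketch does not model.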

}

pub fn parse(&mut self) -> Value<'de> {
match unsafe { self.0.next_() } {
Contributor

Safety condition here is that a top-level .parse() will never be called on an empty byte slice - do we know that this can't happen since Deserializer::from_slice() errors on an empty byte slice? The only condition check I found when skimming the Deserializer::from_slice() implementation to suggest that it errors when given an empty slice is this: https://docs.rs/simd-json/latest/src/simd_json/lib.rs.html#1081-1083

Collaborator Author

> do we know that this can't happen since Deserializer::from_slice() errors on an empty byte slice?

I would assume the Deserializer catches any errors before it makes it into this wrapper, but I can't say with certainty. They do have a pretty extensive test suite that I would hope catches this. So I think unfortunately we just assume that simd_json does what it's supposed to do.

Anecdotally, polars has been using simd_json::Deserializer for a couple of years and has never encountered an out-of-bounds access here.
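
The general pattern at issue, an unchecked accessor whose precondition is established at construction time, can be sketched as follows. Everything here is hypothetical: `Tape`, `from_nodes`, `next_unchecked`, and `next_node` are stand-ins for illustration and do not describe simd_json's internals or how its `Deserializer` validates input.

```rust
/// Hypothetical tape of parsed nodes; a stand-in, not simd_json's internals.
struct Tape {
    nodes: Vec<u32>,
    idx: usize,
}

impl Tape {
    /// Rejects empty input up front, so a constructed Tape always holds at
    /// least one node.
    fn from_nodes(nodes: Vec<u32>) -> Option<Self> {
        if nodes.is_empty() {
            None
        } else {
            Some(Self { nodes, idx: 0 })
        }
    }

    /// # Safety
    /// The caller must guarantee that `self.idx < self.nodes.len()`.
    unsafe fn next_unchecked(&mut self) -> u32 {
        // SAFETY: guaranteed in bounds by this function's contract.
        let node = unsafe { *self.nodes.get_unchecked(self.idx) };
        self.idx += 1;
        node
    }

    /// Safe wrapper that re-checks the bound instead of trusting the caller.
    fn next_node(&mut self) -> Option<u32> {
        if self.idx < self.nodes.len() {
            // SAFETY: the bound was checked on the line above.
            Some(unsafe { self.next_unchecked() })
        } else {
            None
        }
    }
}
```

In this sketch the first call to `next_unchecked` on a freshly built `Tape` is sound because `from_nodes` refuses empty input, which mirrors the assumption in this thread that the deserializer errors out before an empty slice can reach the wrapper.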

@clarkzinzow clarkzinzow merged commit 38ab44a into Eventual-Inc:main May 5, 2024
27 checks passed