Adds separate runnable examples in Arrow-in-Spark documentation #30
Merged
BryanCutler merged 1 commit into BryanCutler:arrow-user-docs-SPARK-2221 from Jan 26, 2018
Conversation
Owner
|
Thanks @HyukjinKwon! This is cool, I didn't know you could put multiple examples in a single file like that. Ugh, I just updated with some of the comments from the PR.. I'll resolve them |
Collaborator
|
wow this is nice |
Collaborator
Author
|
I wish I could talk to you directly :( .. Did you start this one? I finished resolving the conflicts. |
Owner
|
oh yeah, sorry.. it was just some of the wording on the first example |
Collaborator
Author
|
Yup .. let me just push it :-). |
95624e7 to
eb1b347
Compare
Collaborator
Author
|
Checked the built doc roughly and then ran the lintr / submit too. |
Owner
|
k, think I got it - thanks! |
Collaborator
Author
|
YAY! my very first merged PR to your branch! |
BryanCutler
pushed a commit
that referenced
this pull request
May 28, 2020
…enStageId
### What changes were proposed in this pull request?
Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or the SQL `EXPLAIN CODEGEN` statement.
The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending.
After this change, the following query:
```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```
will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)
*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```
The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.
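As a rough illustration of the ordering described above, here is a minimal, Spark-free sketch. `CodegenEntry`, `CodegenDumpOrder`, and `sortByStageId` are hypothetical names made up for this example; they are not Spark's internal classes and this is not the patch itself. The sketch only shows the idea of sorting a small collection of dumped subtrees by stage ID in ascending order.

```scala
// Hypothetical stand-in for one whole-stage codegen subtree and its generated source.
// Spark's real internal representation differs; this is an assumption for illustration.
final case class CodegenEntry(codegenStageId: Int, planText: String, source: String)

object CodegenDumpOrder {
  // Sort entries by stage ID, ascending. Scala's sortBy is a stable sort,
  // so entries with equal IDs keep their original relative order.
  def sortByStageId(entries: Seq[CodegenEntry]): Seq[CodegenEntry] =
    entries.sortBy(_.codegenStageId)

  def main(args: Array[String]): Unit = {
    // Entries collected in arbitrary order, as they might be before sorting.
    val collected = Seq(
      CodegenEntry(2, "*(2) HashAggregate", "/* stage 2 code */"),
      CodegenEntry(1, "*(1) HashAggregate", "/* stage 1 code */")
    )
    // Print the entries in ascending stage-ID order.
    sortByStageId(collected).foreach { e =>
      println(s"${e.codegenStageId}: ${e.planText}")
    }
  }
}
```

With only a few dozen entries at most, a comparison sort like this costs very little, which matches the overhead argument above.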
### Why are the changes needed?
Minor improvement to aid WSCG debugging.
### Does this PR introduce any user-facing change?
No user-facing change for end-users; minor change for developers who debug WSCG generated code.
### How was this patch tested?
Manually tested the output; all other tests still pass.
Closes apache#27955 from rednaxelafx/codegen.
Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
BryanCutler
pushed a commit
that referenced
this pull request
Oct 7, 2020
…enStageId
(cherry picked from commit a177628)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
This PR adds separate runnable examples in Arrow-in-Spark documentation:
[Three pairs of before/after screenshots of the rendered documentation]
How was this patch tested?
I manually checked the documentation.