Adds separate runnable examples in Arrow-in-Spark documentation by HyukjinKwon · Pull Request #30 · BryanCutler/spark

HyukjinKwon · 2018-01-25T23:46:59Z

What changes were proposed in this pull request?

This PR adds separate runnable examples in Arrow-in-Spark documentation.

[Three Before/After screenshot pairs of the rendered documentation; images not preserved.]

How was this patch tested?

PYSPARK_PYTHON=python2.7 ./bin/spark-submit examples/src/main/python/sql/arrow.py
PYSPARK_PYTHON=python3.6 ./bin/spark-submit examples/src/main/python/sql/arrow.py
./dev/lint-python

I manually checked the documentation.
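
For readers unfamiliar with how one file such as `examples/src/main/python/sql/arrow.py` can hold several separately rendered examples, here is a minimal sketch. It assumes the labeled `$example on:<label>$` / `$example off:<label>$` comment convention used in Spark's example sources. The `extract_examples` helper, the `SOURCE` string, and the function names inside it are hypothetical and only mimic what a documentation build step might do; this is not the actual docs plugin.

```python
import re
import textwrap

# Hypothetical stand-in for a small examples file. The function names are
# illustrative; only the labeled on/off marker comments matter here.
SOURCE = '''\
def first_example(spark):
    # $example on:first_example$
    df = spark.range(3)
    df.show()
    # $example off:first_example$


def second_example(spark):
    # $example on:second_example$
    spark.range(5).count()
    # $example off:second_example$
'''

# Matches a comment line such as "# $example on:some_label$".
MARKER = re.compile(r"^\s*#\s*\$example (on|off):(\w+)\$\s*$")


def extract_examples(source):
    """Return {label: snippet} for every labeled on/off region in source."""
    collected = {}
    active = None  # label of the region currently being collected, if any
    for line in source.splitlines():
        match = MARKER.match(line)
        if match:
            state, label = match.groups()
            if state == "on":
                active = label
                collected[label] = []
            elif label == active:
                active = None
            continue  # marker lines never appear in the snippet
        if active is not None:
            collected[active].append(line)
    # Strip the common indentation so each snippet reads as top-level code.
    return {
        label: textwrap.dedent("\n".join(lines))
        for label, lines in collected.items()
    }


if __name__ == "__main__":
    for label, snippet in extract_examples(SOURCE).items():
        print("[" + label + "]")
        print(snippet)
        print()
```

Each label yields its own snippet, so a documentation page can include the examples one at a time while the whole file stays runnable with a single `spark-submit`.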

BryanCutler · 2018-01-26T00:07:36Z

Thanks @HyukjinKwon ! This is cool, I didn't know you could put multiple examples in a single file like that

Uggh, I just updated with some of the comments from the PR.. I'll resolve them

icexelloss · 2018-01-26T00:13:53Z

wow this is nice

HyukjinKwon · 2018-01-26T00:14:12Z

I wish I could talk to you directly :( .. Did you start this one? I finished resolving the conflicts.

BryanCutler · 2018-01-26T00:15:00Z

oh yeah, sorry.. it was just some of the wording on the first example

HyukjinKwon · 2018-01-26T00:15:29Z

Yup .. let me just push it :-).

HyukjinKwon · 2018-01-26T00:19:01Z

Checked the built doc roughly and then ran the linter / submit too.

BryanCutler · 2018-01-26T00:20:50Z

k, think I got it - thanks!

HyukjinKwon · 2018-01-26T00:21:32Z

YAY! my very first merged PR to your branch!

…enStageId

### What changes were proposed in this pull request?

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or SQL `EXPLAIN CODEGEN` statement.

The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending.

After this change, the following query:
```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```
will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)

*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```

The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.

### Why are the changes needed?

Minor improvement to aid WSCG debugging.

### Does this PR introduce any user-facing change?

No user-facing change for end-users; minor change for developers who debug WSCG generated code.

### How was this patch tested?

Manually tested the output; all other tests still pass.

Closes apache#27955 from rednaxelafx/codegen.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

…enStageId (same commit message as the preceding entry, cherry picked from commit a177628) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

HyukjinKwon mentioned this pull request Jan 25, 2018

[SPARK-22221][DOCS] Adding User Documentation for Arrow apache/spark#19575

Closed

Adds separate runnable examples in Arrow-in-Spark documentation

eb1b347

HyukjinKwon force-pushed the arrow-user-docs-SPARK-2221-examples branch from 95624e7 to eb1b347 Compare January 26, 2018 00:15

BryanCutler merged commit eb1b347 into BryanCutler:arrow-user-docs-SPARK-2221 Jan 26, 2018

HyukjinKwon deleted the arrow-user-docs-SPARK-2221-examples branch October 16, 2018 12:44
