Adds separate runnable examples in Arrow-in-Spark documentation #30

Merged
BryanCutler merged 1 commit into BryanCutler:arrow-user-docs-SPARK-2221 from HyukjinKwon:arrow-user-docs-SPARK-2221-examples on Jan 26, 2018

Conversation

@HyukjinKwon
Collaborator

What changes were proposed in this pull request?

This PR adds separate runnable examples in Arrow-in-Spark documentation:

Before and after screenshots of the rendered documentation (three pairs, captured 2018-01-26; images not preserved).
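
For context, several examples can share one file when each is wrapped in a pair of labeled comment markers and the documentation build pulls out one labeled region at a time. The following is a minimal Python sketch of that idea, not the actual documentation tooling: the marker format (`# $example on:<label>$` / `# $example off:<label>$`), the `extract_example` helper, and the sample file contents are all assumptions made for illustration.

```python
import textwrap

# Assumed marker convention (illustrative only): a labeled region starts at
# "# $example on:<label>$" and ends at "# $example off:<label>$".


def extract_example(source: str, label: str) -> str:
    """Return the dedented lines between the on/off markers for `label`."""
    on_marker = f"# $example on:{label}$"
    off_marker = f"# $example off:{label}$"
    collecting = False
    collected = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == on_marker:
            collecting = True
        elif stripped == off_marker:
            collecting = False
        elif collecting:
            collected.append(line)
    # Remove the common indentation so the snippet reads as top-level code.
    return textwrap.dedent("\n".join(collected))


# A made-up file holding two independently extractable examples.
SAMPLE_FILE = '''\
def first_example():
    # $example on:first$
    x = 1 + 1
    print(x)
    # $example off:first$


def second_example():
    # $example on:second$
    y = "arrow"
    print(y.upper())
    # $example off:second$
'''

if __name__ == "__main__":
    print(extract_example(SAMPLE_FILE, "first"))
    print(extract_example(SAMPLE_FILE, "second"))
```

Because each region is selected by label, the file stays runnable as a whole while each documentation section shows only its own snippet.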

How was this patch tested?

PYSPARK_PYTHON=python2.7 ./bin/spark-submit examples/src/main/python/sql/arrow.py
PYSPARK_PYTHON=python3.6 ./bin/spark-submit examples/src/main/python/sql/arrow.py
./dev/lint-python

I manually checked the documentation.

@BryanCutler
Owner

Thanks @HyukjinKwon ! This is cool, I didn't know you could put multiple examples in a single file like that

Uggh, I just updated with some of the comments from the PR.. I'll resolve them

@icexelloss
Collaborator

wow this is nice

@HyukjinKwon
Collaborator Author

I wish I could talk to you directly :( .. Did you start on this one? I have finished resolving the conflicts.

@BryanCutler
Owner

oh yeah, sorry.. it was just some of the wording on the first example

@HyukjinKwon
Collaborator Author

Yup .. let me just push it :-).

HyukjinKwon force-pushed the arrow-user-docs-SPARK-2221-examples branch from 95624e7 to eb1b347 on January 26, 2018 00:15
@HyukjinKwon
Collaborator Author

Checked the built doc roughly and then ran the linter and spark-submit too.

BryanCutler merged commit eb1b347 into BryanCutler:arrow-user-docs-SPARK-2221 on Jan 26, 2018
@BryanCutler
Owner

k, think I got it - thanks!

@HyukjinKwon
Collaborator Author

YAY! my very first merged PR to your branch!

HyukjinKwon deleted the arrow-user-docs-SPARK-2221-examples branch on October 16, 2018 12:44
BryanCutler pushed a commit that referenced this pull request May 28, 2020
…enStageId

### What changes were proposed in this pull request?

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or the SQL `EXPLAIN CODEGEN` statement.

The generated code is currently printed in no particular order, which can make debugging a bit annoying. This PR makes a minor improvement: it sorts the codegen dump by `codegenStageId`, ascending.

After this change, the following query:
```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```
will always dump the generated code in a natural, stable order. A version of this example with shorter output is:
```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)

*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```

The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant.
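
The ordering described above can be sketched in a few lines. This is a Python illustration under assumed data shapes, not Spark's Scala implementation: the `CodegenEntry` tuple layout, the `sort_codegen_dump` helper, and the placeholder plan and code strings are hypothetical.

```python
from typing import List, Tuple

# Hypothetical stand-in for one entry of the codegen dump:
# (codegen stage id, plan subtree text, generated source text).
CodegenEntry = Tuple[int, str, str]


def sort_codegen_dump(entries: List[CodegenEntry]) -> List[CodegenEntry]:
    """Return a new list ordered by codegen stage id, ascending."""
    # sorted() is stable, so entries with equal ids keep their input order.
    return sorted(entries, key=lambda entry: entry[0])


# Placeholder entries that arrive in no particular order.
unordered = [
    (2, "*(2) HashAggregate(...)", "/* generated code for stage 2 */"),
    (1, "*(1) HashAggregate(...)", "/* generated code for stage 1 */"),
]
ordered = sort_codegen_dump(unordered)

for stage_id, plan, _code in ordered:
    print(stage_id, plan)
```

With only a few dozen entries at most, a comparison sort like this costs very little, which is consistent with the overhead estimate above.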

### Why are the changes needed?

Minor improvement to aid WSCG debugging.

### Does this PR introduce any user-facing change?

No user-facing change for end-users; minor change for developers who debug WSCG generated code.

### How was this patch tested?

Manually tested the output; all other tests still pass.

Closes apache#27955 from rednaxelafx/codegen.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
BryanCutler pushed a commit that referenced this pull request Oct 7, 2020
…enStageId

(cherry picked from commit a177628)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>