
[SPARK-13534][PySpark] Using Apache Arrow to increase performance of DataFrame.toPandas #15821

Conversation

BryanCutler
Member

BryanCutler commented Nov 9, 2016

What changes were proposed in this pull request?

Integrate Apache Arrow with Spark to increase performance of DataFrame.toPandas. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported, otherwise an UnsupportedOperation exception is thrown.

Additions to Spark include a Scala package private method Dataset.toArrowPayloadBytes that will convert data partitions in the executor JVM to ArrowPayloads as byte arrays so they can be easily served, and a package private class/object ArrowConverters that provides data type mappings and conversion routines. In Python, a public method DataFrame.collectAsArrow is added to collect Arrow payloads, and an optional flag in toPandas(useArrow=False) enables using Arrow (the old conversion is used by default).
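For illustration, a minimal PySpark sketch of how the proposed API could be exercised (the useArrow flag and collectAsArrow method are the additions described above; this is not code from the patch, and the final signatures may differ):

```python
# Minimal sketch only; assumes the useArrow flag and collectAsArrow method
# proposed in this PR. Not a released PySpark API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-toPandas-sketch").getOrCreate()
df = spark.range(1 << 20).selectExpr("id AS longs", "id * 2.0 AS doubles")

pdf_old = df.toPandas()                 # existing row-by-row conversion
pdf_arrow = df.toPandas(useArrow=True)  # proposed Arrow-based path
payloads = df.collectAsArrow()          # proposed: raw Arrow payloads
```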

How was this patch tested?

Added a new test suite ArrowConvertersSuite that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data; the Dataset is then converted to an Arrow payload and finally validated against the JSON data. This ensures that the schema and data have been converted correctly.

Added PySpark tests to verify the toPandas method produces equal DataFrames with and without pyarrow, and a roundtrip test to ensure the pandas DataFrame produced by PySpark is equal to one made directly with pandas.
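A rough sketch of the kind of PySpark equality test described above (assuming the useArrow flag from this patch; dtype details may differ between the two paths):

```python
# Sketch of the equality/roundtrip test idea; assumes the useArrow flag
# from this PR and a local SparkSession.
import pandas as pd
from pandas.testing import assert_frame_equal
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i, float(i) * 2.0, "s%d" % i) for i in range(100)]
df = spark.createDataFrame(data, ["a", "b", "c"])

expected = pd.DataFrame(data, columns=["a", "b", "c"])
assert_frame_equal(df.toPandas(), expected)               # old path
assert_frame_equal(df.toPandas(useArrow=True), expected)  # Arrow path
```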

@SparkQA

SparkQA commented Nov 9, 2016

Test build #68381 has finished for PR 15821 at commit 4227ec6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 9, 2016

Test build #68425 has finished for PR 15821 at commit 3f855ec.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 9, 2016

Test build #68427 has finished for PR 15821 at commit b06e11f.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2016

Test build #68806 has finished for PR 15821 at commit 053e3a6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2016

Test build #68812 has finished for PR 15821 at commit 9191b96.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2016

Test build #68954 has finished for PR 15821 at commit 9191b96.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1508,7 +1518,7 @@ def toDF(self, *cols):
         return DataFrame(jdf, self.sql_ctx)

     @since(1.3)
-    def toPandas(self):
+    def toPandas(self, useArrow=True):
@holdenk
Contributor

holdenk commented on this diff (python/pyspark/sql/dataframe.py) on Nov 22, 2016

Would it maybe make more sense to default this to false or have more thorough checking that the dataframe being written with arrow is supported? At least initially the set of supported dataframes might be rather small.
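Purely as an illustration of the "more thorough checking" idea, a hypothetical guard that falls back to the old conversion when the schema contains complex types (none of these helpers exist in the patch):

```python
# Hypothetical guard, not part of this patch: fall back to the old
# conversion if the schema contains types the Arrow path does not support.
from pyspark.sql.types import ArrayType, MapType, StructType

def _arrow_supports_schema(schema):
    """True if no field is a complex (Array/Map/Struct) type."""
    return not any(isinstance(f.dataType, (ArrayType, MapType, StructType))
                   for f in schema.fields)

def to_pandas_safely(df, use_arrow=False):
    if use_arrow and not _arrow_supports_schema(df.schema):
        use_arrow = False  # avoid the UnsupportedOperation path
    return df.toPandas(useArrow=True) if use_arrow else df.toPandas()
```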

@BryanCutler
Member Author

Hey @holdenk, I just had this in to do my own testing and hadn't thought about keeping the option, but if we do keep it then yeah you're right, it would be better to default to the original way.


@mariusvniekerk
Member

So this is very cool stuff.

Would it be reasonable to add some API pieces so that, on the Python side, things like DataFrame.mapPartitions make use of Apache Arrow to lower the serialization costs? Or is that more of a follow-on piece of work?

@holdenk
Contributor

holdenk commented Nov 30, 2016

@mariusvniekerk I think just getting this working for the local connection is going to be hard, so breaking out using Arrow on the driver side into a separate follow-up piece of work would make sense.

@BryanCutler
Member Author

Thanks @mariusvniekerk, as @holdenk said we are going to try to get something basic working first, and after we show some performance improvement we can follow up with more things.

@wesm
Member

wesm commented Nov 30, 2016

Luckily we are on the home stretch for making the Java and C++ libraries binary compatible -- e.g. I'm working on automated testing today: apache/arrow#219

@wesm
Member

wesm commented Dec 1, 2016

@BryanCutler I'm working with @icexelloss on my end to get involved in this. We were going to start working on unit tests to validate converting each of the Spark SQL data types to Arrow format while the Arrow Java-C++ compatibility work progresses, but we don't want to duplicate any efforts if you've started on this. Perhaps we can create an integration branch someplace to make pull requests into, since it will probably take a while until this patch gets accepted into Spark?

@wesm
Member

wesm commented Dec 1, 2016

Related to this, we'll also want to be able to precisely instrument and benchmark the Dataset <-> Arrow conversion -- @icexelloss suggested we might be able to push the conversion down into the executors instead of doing all the work in the driver, but I'm not sure how feasible that is.

@BryanCutler
Member Author

Hi @wesm and @icexelloss, that sounds good on our end. @yinxusen has been working on validating some basic conversion so far, but everything is still very preliminary, so it would be great to work with you guys. I'll set up a new integration branch and ping you all when ready.

Related to this, we'll also want to be able to precisely instrument and benchmark the Dataset <-> Arrow conversion -- @icexelloss suggested we might be able to push the conversion down into the executors instead of doing all the work in the driver, but I'm not sure how feasible that is.

We were thinking about that too, as it would be more ideal. For simplicity we decided to first do the conversion on the driver side, which should hopefully still show a performance increase, then follow up with some work to better optimize it.

@icexelloss
Contributor

@BryanCutler , I have been working based on your branch here:
https://github.com/BryanCutler/spark/tree/wip-toPandas_with_arrow-SPARK-13534

Is this the right one?

@BryanCutler
Member Author

@icexelloss, @wesm I branched off here for us to integrate our changes https://github.com/BryanCutler/spark/tree/arrow-integration
cc @yinxusen

@wesm
Member

wesm commented Dec 2, 2016

OK, let's open pull requests into that branch to help with not stepping on each other's toes. thank you

@wesm
Member

wesm commented Jan 18, 2017

Shall we update this PR to the latest and solicit involvement from Spark committers?

@BryanCutler
Member Author

Shall we update this PR to the latest and solicit involvement from Spark committers?

Yeah, I think it's about ready for that. After we integrate the latest changes, I'll go over once more for some minor cleanup and update this. Probably in the next day or so.

@icexelloss
Contributor

icexelloss commented Jan 23, 2017 via email

@BryanCutler
Member Author

BryanCutler commented Jan 23, 2017 via email

@BryanCutler force-pushed the wip-toPandas_with_arrow-SPARK-13534 branch from 9191b96 to 9bb75de on January 24, 2017 22:40
@SparkQA

SparkQA commented Jan 24, 2017

Test build #71950 has finished for PR 15821 at commit 9bb75de.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

BryanCutler commented Jan 24, 2017

This has been updated after integrating changes made with @icexelloss and @wesm. There has been good progress made and it would be great if others could take a look and review/test this out.

The current state of toPandas() with Arrow has support for Datasets with primitive, string, and timestamp data types. Complex types such as Struct, Array, and Map are not yet supported but are a work in progress. There is a suite of tests in Scala to test Dataset -> ArrowRecordBatch conversion and a collection of JSON files that serve to validate that the converted data is correct. Also added PySpark tests to verify the Pandas frame is correct. It is compiled with the current Arrow master 0.1.1-SNAPSHOT at commit apache/arrow@7d3e2a3
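As a standalone illustration of the supported column types (primitives, strings, timestamps), a small pyarrow example that builds an Arrow record batch and converts it to pandas; this uses the public pyarrow API of a recent release, not the 0.1.1-SNAPSHOT internals wired into the patch:

```python
# Standalone pyarrow illustration of the supported types; not code from
# this patch, and independent of Spark.
import datetime
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [
        pa.array([1, 2, 3], type=pa.int64()),
        pa.array([1.5, 2.5, 3.5], type=pa.float64()),
        pa.array(["a", "b", "c"], type=pa.string()),
        pa.array([datetime.datetime(2017, 1, 24, 12, 0, s) for s in range(3)],
                 type=pa.timestamp("us")),
    ],
    names=["longs", "doubles", "strings", "timestamps"],
)
print(batch.to_pandas().dtypes)  # int64, float64, object, datetime64[ns]
```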

The performance so far shows a significant increase, and I will follow up with a script to run and details of the results seen. Please ping me with any questions on setting up the build of Arrow or running the benchmarks. It would be great if this could be considered for Spark 2.2, as Arrow 0.2 will be released soon and will be able to support the functionality used here.

@holdenk @davies @rxin, I would love to hear your thoughts on this so far. Thanks!
Also cc'ing some on the watch list, @mariusvniekerk @zjffdu @nchammas @zero323 @rdblue

@BryanCutler
Member Author

BryanCutler commented Jan 25, 2017

Old Benchmarks with Conversion on Driver

Here are some rough benchmarks done locally on a machine with 16 GB of memory and 8 cores, using Spark config defaults, taken from 50 trials of calling toPandas() and measuring wall time in seconds with and without Arrow enabled:

1mm Longs

13.52x speedup on average

|       | With Arrow | Without Arrow |
|-------|-----------:|--------------:|
| count |  50.000000 |     50.000000 |
| mean  |   0.190573 |      2.576587 |
| std   |   0.078450 |      0.114455 |
| min   |   0.139911 |      2.259916 |
| 25%   |   0.148212 |      2.516289 |
| 50%   |   0.163769 |      2.555433 |
| 75%   |   0.184402 |      2.631316 |
| max   |   0.518090 |      2.946415 |

1mm Doubles

8.07x speedup on average

|       | With Arrow | Without Arrow |
|-------|-----------:|--------------:|
| count |  50.000000 |     50.000000 |
| mean  |   0.259145 |      2.090295 |
| std   |   0.069620 |      0.123091 |
| min   |   0.196666 |      1.998588 |
| 25%   |   0.209051 |      2.015083 |
| 50%   |   0.230751 |      2.032701 |
| 75%   |   0.270519 |      2.122219 |
| max   |   0.439556 |      2.485232 |

Script to generate these can be found here
Happy to run more if there is interest.
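The linked script is not reproduced here, but a minimal timing harness in the same spirit (50 trials of toPandas() on 1mm longs, with and without the flag from this patch) could look like:

```python
# Minimal benchmark sketch in the spirit of the numbers above; the actual
# script is the one linked in the comment. Assumes the useArrow flag.
import time
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000 * 1000).toDF("longs")

def trials(use_arrow, n=50):
    samples = []
    for _ in range(n):
        start = time.time()
        df.toPandas(useArrow=True) if use_arrow else df.toPandas()
        samples.append(time.time() - start)
    return pd.Series(samples)

report = pd.DataFrame({"With Arrow": trials(True),
                       "Without Arrow": trials(False)})
print(report.describe())  # count/mean/std/min/quartiles/max, as above
```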

@holdenk
Contributor

holdenk commented Jan 25, 2017

On a personal note, those benchmarks certainly look very exciting (<3 max of with arrow less than min of without arrow) :)

It certainly seems it would probably be worth the review bandwidth to start looking this over, but since this is pretty big and adds a new dependency, this could take a while to move forwards.

It would be great to hear what the other Python focused committers (maybe @davies ?) think of this approach :)

@leifwalsh

The next iteration of this for perf would likely involve generating the Arrow batches on executors and having the driver use the new streaming Arrow format to just forward this to Python. In our experiments, assembling arrays of internal rows dominates the time; transposing them and forming an Arrow record batch is pretty quick. If we can do that work in parallel on the executors, we're likely to get another big win.

@wesm
Member

wesm commented Jan 25, 2017

Very nice to see the improved wall clock times. I have been busy engineering the pipeline between the byte stream from Spark and the resulting DataFrame -- the only major thing still left on the table that might help is converting strings in C++ to pandas.Categorical rather than returning a dense array of strings.
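A small pyarrow illustration of that idea (dictionary-encoding a string column so that to_pandas() yields a pandas.Categorical instead of a dense object column); this is just the library-level concept, not something the current patch does:

```python
# Dictionary-encode strings so the pandas result is Categorical rather
# than a dense object column; concept illustration only.
import pyarrow as pa

strings = pa.array(["red", "green", "red", "blue", "green"])
table = pa.Table.from_arrays([strings, strings.dictionary_encode()],
                             names=["dense", "dict_encoded"])
print(table.to_pandas().dtypes)  # dense -> object, dict_encoded -> category
```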

I'll review this patch in more detail when I can

I'll do a bit of performance analysis (esp. on the Python side) and flesh out some of the architectural next steps (e.g. what @leifwalsh has described) in advance of Spark Summit in a couple of weeks. Parallelizing the record batch conversion and streaming it to Python would be another significant perf win. Having these tools should also be helpful for speeding up UDF evaluation.

@BryanCutler
Member Author

Parallelizing the record batch conversion and streaming it to Python would be another significant perf win.

Right, I should have also mentioned that this PR takes a simplistic approach and collects rows to the driver, where all the conversion is done. Offloading this to the executors should boost the performance more.

@shaneknapp
Contributor

@holdenk yeah, another set of eyes would be great! i haven't actually touched the test infra code in a long time and i'm currently wrapping my brain around the order of operations that run-pip-tests goes through in conjunction w/everything else.

i have a feeling that the chain of scripts (run-tests-jenkins -> run-tests-jenkins.py -> run-tests -> run-pip-tests) besides being confusing for humans (ie: me), is also fragile WRT conda envs (aka munging PATH) in our environment.

would installing pyarrow 0.4.0 in the py3k conda env fix things? if so, i can bang that out in moments.

@holdenk
Contributor

holdenk commented Jun 26, 2017

@shaneknapp it might; assuming the conda cache is shared, it should avoid needing to fetch the package. I'm not super sure, but I think we might have better luck updating conda on the Jenkins machines (if people are ok with that), since it seems like this is probably from an out-of-date conda.

@MaheshIBM

MaheshIBM commented Jun 27, 2017

This does not seem like a timeout issue; the certificate CN and what is used as the hostname do not match. So clearly the client downloads the certificate but is not able to verify it (no timeout). If anything, it may be possible to configure the code/command to ignore SSL cert errors.

At this point I checked that there is no host with the hostname below, so clearly some cert on the Anaconda side is not set up properly. If I get the hostname from which the package download is attempted, I can investigate more.

ping conda.binstart.org
ping: unknown host conda.binstart.org

For troubleshooting from the problematic host, you can try using openssl to verify the certs; below is a sample of a successful negotiation.

openssl s_client -showcerts -connect anaconda.org:443

Lot of output here.
Server certificate
subject=/C=US/postalCode=78701/ST=TX/L=Austin/street=221 W 6th St, Ste 1550/O=Continuum Analytics Inc/OU=Information Technology/OU=PremiumSSL Wildcard/CN=*.anaconda.org
issuer=/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Organization Validation Secure Server CA

No client certificate CA names sent
Server Temp Key: ECDH, prime256v1, 256 bits

SSL handshake has read 5488 bytes and written 373 bytes

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES128-GCM-SHA256
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES128-GCM-SHA256
Session-ID: 32C19785E3801B08BB2C5997BC437A54C13C6D3F4D678F5927B028B5AAE7E2C1
Session-ID-ctx:
Master-Key: 5B2AD811A5D131CD9565311AD0A4749DC0D03657E91B32B22B77813905B9CD1865FF7DB0E67395EB1DE194848DD0037A
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1498537784
Timeout : 300 (sec)
Verify return code: 0 (ok)

@BryanCutler
Member Author

BryanCutler commented Jun 27, 2017 via email

@MaheshIBM

MaheshIBM commented Jun 27, 2017

That leads me to believe that the download request could be resolving to different hosts every time; could that happen if there is a CDN working in the background? Not all hosts are configured with the bad certificate, while one (or possibly more) is using a certificate with a DN of conda.binstar.org and responding to the domain name in the hostname of the URL from which the package download is attempted.

If there is a way to configure pip to ignore SSL errors (only for the purpose of troubleshooting and finding the root cause of the problem here), then that is one possible direction to take. I am looking for ways to ignore SSL errors when using pip and will update the comment if I find something.

--Update
There is a --trusted-host param that can be passed to pip

--Update 2
To double-check, I downloaded the certificate from binstar.org and saw the values in the field below, which exactly match what pip is complaining about.

 X509v3 Subject Alternative Name: 
                DNS:anaconda.com, DNS:anacondacloud.com, DNS:anacondacloud.org, DNS:binstar.org, DNS:wakari.io
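For anyone who wants to repeat that check from Python instead of openssl, a small stdlib-only sketch that prints a server certificate's subject and subjectAltName entries (a hostname mismatch like the one above would surface here as an SSL verification error):

```python
# Stdlib-only sketch mirroring the openssl check above; anaconda.org is
# just the example host from this thread.
import socket
import ssl

def peer_cert(host, port=443):
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        # Raises an SSL error if the cert does not match `host`,
        # which is the kind of failure pip is hitting here.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()

cert = peer_cert("anaconda.org")
print(cert["subject"])
print(cert["subjectAltName"])  # e.g. (('DNS', '*.anaconda.org'), ...)
```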

@shaneknapp
Contributor

i agree w/@MaheshIBM that we're looking at a bad CA cert. i think we're looking at a problem on continuum.io's side, not our side.

however, i do not like the thought of ignoring certs (on principle). :)

and finally, if i'm reading the run-pip-tests code correctly (and please correct me if i'm wrong @holdenk ), we're just creating a temp python environment in /tmp, installing some packages, running the tests, and then moving on.

some thoughts/suggestions:

  • our conda environment is pretty stagnant and hasn't been explicitly upgraded since we deployed anaconda python over a year ago.
  • the py3k environment that exists in the workers' conda installation is solely used by spark builds, so updating said environment w/the packages in the run-pip-tests will remove the need to download them, but at the same time, make the tests a NOOP.
  • we can hope that continuum fixes their cert issue asap. :\

@holdenk
Contributor

holdenk commented Jun 27, 2017

@shaneknapp your understanding of what run-pip-tests does is pretty correct. It's important to note that part of the test is installing the pyspark package itself to make sure we didn't break the packaging, and pyarrow is only installed because we want to be able to run some pyarrow tests with it -- we don't need that to be part of the packaging tests; in fact, it would be simpler to have it be part of the normal tests.

So one possible approach to fix this, I think, would be updating conda on the machines because it's old, installing pyarrow into the py3k worker env, and then taking the pyarrow tests out of the packaging test and instead having them run in the normal flow.

I'm not super sure this is a cert issue per se; it seems that newer versions of conda are working fine (it's possible the SSL lib is slightly out of date and doesn't understand wildcards or something else in the cert).

@shaneknapp
Contributor

shaneknapp commented Jun 27, 2017

okie dokie. how about i install pyarrow in the py3k conda environment right now... and once that's done, we can remove the pyarrow test from run-pip-tests and add it to the regular tests.

so, who wants to take care of the test updating? :)

@holdenk
Contributor

holdenk commented Jun 27, 2017

I can do the test updating assuming that @BryanCutler is traveling. I've got a webinar this afternoon but I can do it after I'm done with that.

Also, I don't think it's the wildcard issue now that I think about it some more; it's that new conda deprecates binstar, and our old conda is going to binstar, which just points to the conda host, but the conda host now has an SSL cert just for conda, not conda and binstar. I don't think Continuum is going to fix that; rather, I suspect the answer is going to be just to upgrade to a newer version of conda.

@shaneknapp
Contributor

yeah, i think you're right. however, upgrading to a new version of conda on a live environment does indeed scare me a little bit. :)

w/the new jenkins, i'll have a staging server dedicated to testing crap like this. ah, the future: so shiny and bright!

@shaneknapp
Contributor

anyways: installing pyarrow right now.

@shaneknapp
Contributor

done

@shaneknapp
Contributor

(py3k)-bash-4.1$ pip install pyarrow
Requirement already satisfied: pyarrow in /home/anaconda/envs/py3k/lib/python3.4/site-packages
Requirement already satisfied: six>=1.0.0 in /home/anaconda/envs/py3k/lib/python3.4/site-packages (from pyarrow)
Requirement already satisfied: numpy>=1.9 in /home/anaconda/envs/py3k/lib/python3.4/site-packages (from pyarrow)

@BryanCutler
Member Author

BryanCutler commented Jun 27, 2017 via email

@shaneknapp
Contributor

btw, do we want pyarrow-0.4.0 or -0.4.1? i'm assuming the latter based on #15821 (comment)

@wesm
Member

wesm commented Jun 27, 2017

I recommend using the latest. The data format is forward/backward compatible so the JAR doesn't necessarily need to be 0.4.1 if you're using pyarrow 0.4.1 (0.4.1 fixed a Decimal regression in the Java library, but that isn't relevant here quite yet)

@shaneknapp
Contributor

roger copy... "latest" is 0.4.1, which is what's currently on the jenkins workers.

@holdenk
Contributor

holdenk commented Jun 27, 2017

Great, thanks @shaneknapp . @BryanCutler I've got a webinar and if you don't have a chance to change the tests around until after I'm done teaching I'll do it, but if your flight lands first then go for it :)

@cloud-fan
Contributor

The last build still failed; shall we update dev/run-pip-tests to use pip?

@cloud-fan
Contributor

Some PRs have been blocked because of this failure for days. I'm reverting it; @BryanCutler please reopen this PR after fixing the pip stuff, thanks!

@HyukjinKwon
Member

Wait @cloud-fan! I just want to ask a question.

@HyukjinKwon
Member

Should we maybe wait for #18443? Actually, I think there is an alternative for this - #18439 - rather than reverting the whole PR.

Reverting is also an option. I hope these were considered (or I assume they already were).

@HyukjinKwon
Member

HyukjinKwon commented Jun 28, 2017

FWIW, I am not against reverting. I just wanted to provide some context in case it was missed.

@holdenk
Contributor

holdenk commented Jun 28, 2017

I'm against #18439; I'd rather revert this and fix it later than install packages without SSL.

@BryanCutler
Member Author

BryanCutler commented Jun 28, 2017 via email

@cloud-fan
Contributor

Sorry, I didn't know there was a PR fixing the issue, and I have already reverted it. Please cherry-pick this commit into the new PR and apply the pip fix. Sorry for the trouble.

@rxin
Contributor

rxin commented Jun 28, 2017

In the future we should revert PRs that fail builds IMMEDIATELY. There is no way we should've let the build be broken for days.

robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…DataFrame.toPandas

## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.