
Loading query results to a Spark DataFrame #398

Open

Description

@ksco92

I am building a generic query executor, and my current dilemma is getting the query results from Athena into a Spark DataFrame. Currently I am using a cursor, fetching rows into a small DataFrame, and appending that to a final one. I do it this way because I run on Glue: something like the pandas implementation wouldn't work, since that would load the results into the memory of a single executor rather than into distributed memory, and it would fail if the results are bigger than memory. I have something like this:

from functools import reduce

from pyspark.sql import DataFrame, Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# column_names, cursor, and fetch_size are defined earlier; .execute()
# has already been called on the cursor.
headers = Row(*column_names)
data_frames = []

while True:
    new_chunk = cursor.fetchmany(fetch_size)
    if len(new_chunk) == 0:
        break

    # Turn each fetched tuple into a Row and build a small DataFrame.
    new_chunk_df = spark.createDataFrame(
        [headers(*row) for row in new_chunk]
    )
    data_frames.append(new_chunk_df)

# Union all chunk DataFrames into the final result.
final_df = reduce(DataFrame.unionAll, data_frames)

At this point, .execute() has already happened. This takes about 45 seconds for 50k rows. So my question is: is there an implementation similar to the pandas one that I can use for this? Alternatively, do you have any suggestions to improve it?
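One option that avoids pulling rows through the driver at all: Athena already writes the full result set to the query's S3 output location as a CSV file, so Spark can read that path directly and distribute the read across executors. Below is a minimal sketch, assuming the cursor exposes the result location after execution (PyAthena's cursor has an output_location attribute) and that the Glue job's role can read that S3 path; query stands in for the SQL string.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cursor.execute(query)  # Athena writes the result set to S3 as CSV

# Read the result CSV straight from S3. Spark splits the read across
# executors instead of funneling every row through the driver.
final_df = (
    spark.read
    .option("header", "true")  # Athena writes a header row
    .option("quote", '"')      # Athena quotes fields with double quotes
    .option("escape", '"')     # embedded quotes are escaped by doubling
    .csv(cursor.output_location)
)

One caveat: every column comes back as a string this way, so pass an explicit schema to spark.read or cast the columns afterwards. Alternatively, a CTAS/UNLOAD query can have Athena write Parquet, which spark.read.parquet picks up with types intact.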
