
Loading query results to a Spark DataFrame #398

Open

Description

@ksco92

I am building a generic query executor, and my current dilemma is getting the query results from Athena into a Spark DataFrame. Currently I am using a cursor, fetching rows into a small DataFrame, and appending that to a final one. I do it this way because I run on Glue: something like the pandas implementation wouldn't work, since that would load the results into the memory of a single executor rather than into distributed memory, and it would fail if the results are bigger than memory. I have something like this:

from functools import reduce

from pyspark.sql import DataFrame, Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# column_names, cursor, and fetch_size are defined earlier; .execute()
# has already been called on the cursor.
headers = Row(*column_names)
data_frames = []

while True:
    new_chunk = cursor.fetchmany(fetch_size)
    if len(new_chunk) == 0:
        break

    # Turn each fetched tuple into a Row and build a small DataFrame.
    new_chunk_df = spark.createDataFrame(
        [headers(*row) for row in new_chunk]
    )
    data_frames.append(new_chunk_df)

# Union all chunk DataFrames into the final result.
final_df = reduce(DataFrame.unionAll, data_frames)

At this point, .execute() has already happened. This takes about 45 seconds for 50k rows. So my question is: is there an implementation similar to the pandas one that I can use for this? Alternatively, do you have any suggestions to improve it?
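One option that avoids pulling rows through the driver at all: Athena already writes the full result set to the query's S3 output location as a CSV file, so Spark can read that path directly and distribute the read across executors. Below is a minimal sketch, assuming the cursor exposes the result location after execution (PyAthena's cursor has an output_location attribute) and that the Glue job's role can read that S3 path; query stands in for the SQL string.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cursor.execute(query)  # Athena writes the result set to S3 as CSV

# Read the result CSV straight from S3. Spark splits the read across
# executors instead of funneling every row through the driver.
final_df = (
    spark.read
    .option("header", "true")  # Athena writes a header row
    .option("quote", '"')      # Athena quotes fields with double quotes
    .option("escape", '"')     # embedded quotes are escaped by doubling
    .csv(cursor.output_location)
)

One caveat: every column comes back as a string this way, so pass an explicit schema to spark.read or cast the columns afterwards. Alternatively, a CTAS/UNLOAD query can have Athena write Parquet, which spark.read.parquet picks up with types intact.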
