Description
I am building a generic query executor, and my current dilemma is getting query results from Athena into a Spark DataFrame. Currently I am using a cursor, fetching rows into a small DataFrame, and appending that to a final one. I am doing it this way because I run on Glue, so a pandas-based implementation wouldn't work: it would load the results into the memory of a single node rather than into distributed memory, so if the results are bigger than memory it fails. I have something like this:
```python
from functools import reduce

from pyspark.sql import DataFrame, Row, SparkSession

spark = SparkSession.builder.getOrCreate()
headers = Row(*column_names)
data_frames = []

while True:
    new_chunk = cursor.fetchmany(fetch_size)
    if len(new_chunk) == 0:
        break
    new_chunk_df = spark.createDataFrame(
        [headers(*i) for i in new_chunk]
    )
    data_frames.append(new_chunk_df)

final_df = reduce(DataFrame.unionAll, data_frames)
```
At this point, `.execute()` has already happened. This takes about 45 seconds for 50k rows. So my question is: is there an implementation similar to the pandas one that I can use for this? Or, alternatively, do you have any suggestions to improve it?
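One possible improvement, sketched under the assumption that the rows pass through the driver in either version (they are all pulled through the cursor): each `unionAll` grows the logical plan, so collecting all the fetched rows first and calling `createDataFrame` once may be cheaper than unioning many small DataFrames. The chunk-draining part is plain Python; the `fetch_all_chunks` helper below is hypothetical, not part of any Athena client:

```python
from itertools import chain

def fetch_all_chunks(cursor, fetch_size):
    # Hypothetical helper: drain a DB-API style cursor in fixed-size
    # chunks, yielding each non-empty chunk until the cursor is exhausted.
    while True:
        chunk = cursor.fetchmany(fetch_size)
        if not chunk:
            break
        yield chunk

# Flatten all chunks into one row list and build a single DataFrame:
# rows = list(chain.from_iterable(fetch_all_chunks(cursor, fetch_size)))
# final_df = spark.createDataFrame(rows, schema=list(column_names))
```

This avoids the repeated `unionAll` plan growth, though the rows still funnel through one process; a truly distributed load would mean having Spark read Athena's S3 result output directly rather than pulling rows through the cursor at all.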