Returning large table is slow and consumes lots of memory #60
Did you try making several RFC calls returning less data, like 50 calls returning 10,000 rows each? Which RFC/BAPI is called, and did you check how much of the 12 min for all rows (or 8 sec for 10k rows) is consumed on the ABAP server and how much in Python? Are all fields character, and is the table passed as an "Export" or "Tables" parameter type of the remote function module? "Tables" is faster.
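For illustration, a minimal sketch of the chunked-call idea: a loop of smaller calls, each returning one slice of rows. The function module name ZMY_READ_TABLE and its ROW_FROM/ROW_TO/ET_DATA parameters are hypothetical; a real RFM would need to expose such range-selection parameters on the ABAP side.

```python
import csv
from pyrfc import Connection

# Connection parameters are placeholders
conn = Connection(ashost='sap-host', sysnr='00', client='100',
                  user='user', passwd='secret')

CHUNK = 10000   # rows per RFC call
offset = 0

with open('export.csv', 'w', newline='') as f:
    writer = None
    while True:
        # ZMY_READ_TABLE, ROW_FROM, ROW_TO and ET_DATA are hypothetical names
        result = conn.call('ZMY_READ_TABLE',
                           ROW_FROM=offset + 1,
                           ROW_TO=offset + CHUNK)
        rows = result['ET_DATA']
        if not rows:
            break
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
        writer.writerows(rows)   # write this chunk before fetching the next
        offset += CHUNK
```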
I will see if I can set this test up tomorrow.
It's a custom-built RFC. It seems to spend about 2 mins on the ABAP server before it starts working through the dataset on the Python side, so 10 mins in Python - that's using my posted method above. With the original code, the ABAP time remains the same, but then Python just consumes all the resources on the machine very quickly trying to fill the results.
There's a mixture of field types in the dataset. The table is passed as a Tables parameter type.
I ran the RFC daily to retrieve a full period (500,000 rows). The timing remained just short of 12 mins. Is it possible for pyRFC to retrieve the field list at the start of a table and use that info to process each row, rather than retrieving the info for every row in the table? That might speed up processing a little. As for the memory consumption, I'm not sure why pyRFC uses so much to build the results. Consuming >8 GB for 500,000 records seems way too much; it's 305 MB in a CSV file.
@bsrdjan, you said "Did you try making several RFC calls" - could you please elaborate on what you mean? Imagine I have 10 methods which could all return a lot of rows. Do I need to implement the chunking 10 times?
Yes, I meant chunking 10 times, which would be inconvenient in this scenario. I am still thinking about how to replicate and investigate the issue. Which RFC is called? Is it perhaps a standard RFC, which I could test in my test system as well? @patbuxton, regarding caching of field type descriptions, it would not bring any visible improvement. Descriptions are already internally cached in the SAP NW RFC library, like function/parameter descriptions.
I'll have to find out about the RFC for you; I believe the RFC was custom-built, but I'll have to check. Maybe you can replicate the memory consumption issue with a quick hack that loops 500,000 times retrieving the same table row?
hm, good idea, will try that 👍 If custom, could you eventually have a quick look inside to see which BAPI or function modules do most of the work? Then I could wrap that one or those and try to replicate the rows. As a side note, I published a new release yesterday, offering a new feature to skip returning parameters, like large tables, which are not needed on the client side. Names of these parameters can be provided to the call() method and they will not be "requested" (supplied) on the ABAP side. It can also help further reduce memory consumption.
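A minimal sketch of how that skip feature might be used, assuming it is exposed as a 'not_requested' list in the options argument of call() (the option name and the skipped table name should be checked against the release notes; 'LARGE_RESULT_TABLE' is a placeholder):

```python
from pyrfc import Connection

# Connection parameters are placeholders
conn = Connection(ashost='sap-host', sysnr='00', client='100',
                  user='user', passwd='secret')

# Skip an output table that is not needed on the client side,
# so it is neither transferred nor wrapped into Python structures.
result = conn.call('BAPI_COPAQUERY_GETCOST_ACTDATA',
                   options={'not_requested': ['LARGE_RESULT_TABLE']})
```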
@bsrdjan, sorry for the delay getting back to you on this. The RFC we called was custom; I believe that BAPI_COPAQUERY_GETCOST_ACTDATA should do something similar. Did you ever try a loop to check out the memory consumption issue?
Hi @patbuxton and sorry it took so long for my reply. I reviewed the memory consumption with the STFC_PERFORMANCE test module:

```python
COUNT = 99999
result = conn.call('STFC_PERFORMANCE',
                   **{'CHECKTAB': 'X', 'LGET0332': str(COUNT), 'LGET1000': str(COUNT)})
```

I could not find anything to improve in PyRFC to reduce memory consumption. After (big) ABAP tables are retrieved, they are stored in the SAP NW RFC lib cache before being converted to Python structures. The design is not optimal for large tables, and processing in batches/chunks, as you already implemented, would be a nice feature. Adding it to the standard requires more analysis, work and testing. As a workaround for the time being, could BAPI_COPAQUERY_GETCOST_ACTDATA, or another RFM called on the ABAP side, be extended with some sort of range selection parameters, so that it can be called multiple times, returning chunks of ABAP data?
I have been thinking about this again. I do not have large input/output tables at the moment, but maybe in the future. It would be nice to have a documented work-around. Would it make sense to store the large result in a content-server and return the content-ID to the client?
Yes, the content-server could be an option, although also not a simple one. The solution/workaround also depends on the use case, I think. BAPIs and remote Function Modules (RFM) usually offer input parameters to narrow/select the amount of data records to be returned. The Python client could simply execute multiple calls with different selection parameters, processing the results as usual. Chunking would in this case already be done on the backend/ABAP side, minimising the network traffic and memory required in Python. If the standard BAPI/RFM does not provide selection criteria, it should not be a big effort to add selection capabilities in a custom RFM wrapper; it depends on the specific case, of course. This looks like the simplest approach to me. PyHDB could eventually also be an option, for reading raw HANA tables' data, bypassing RFMs and ABAP post-processing. Selection should not be an issue here, but knowledge of the data model, down to the field level, is required, which might be an issue. Also, no existing ABAP business logic would be applied to filter/consolidate the data. I will share the question with more experts; perhaps more ideas come up.
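For completeness, a rough sketch of the PyHDB route mentioned above: reading a raw HANA table in batches over SQL. Host, schema and table names are placeholders, and, as noted, none of the ABAP business logic is applied to the rows.

```python
import csv
import pyhdb

# Connection and object names are placeholders
connection = pyhdb.connect(host='hana-host', port=30015,
                           user='user', password='secret')
cursor = connection.cursor()
cursor.execute('SELECT * FROM "SAPSR3"."CE1XXXX"')   # CO-PA line items table (placeholder)

with open('copa_export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    while True:
        batch = cursor.fetchmany(10000)   # fetch in 10,000-row batches
        if not batch:
            break
        writer.writerows(batch)

connection.close()
```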
I changed the title to "Chunking big data", to sync with the related stackoverflow question: https://stackoverflow.com/questions/49691298/sap-rfc-chunking-of-big-data/51862805#51862805 I will also check a possible optimisation: deleting each table row from the C heap (SAP NW RFC SDK) after "wrapping" it in Python. If it works, it would reduce memory consumption for tables by not keeping full C and Python copies of the table in parallel.
Just for reference, the memory issue does not occur when I use the change detailed in my first post to stream the values directly to a file. The memory issue would then seem to be related to building the result dicts in Python, which my method avoids. Thanks for the continued investigation!
Just published the 1.9.93 release, with tables' processing optimised as follows.

The wrapTable method (reading ABAP tables) had two copies of the data, exactly as described in the stackoverflow comment. The new logic now "pipes" table rows from the C/C++ NW RFC SDK to Python, deleting each row in C/C++ after it is re-created in Python. Two full copies of the data, Python and C++, no longer exist at the same time, and memory consumption should be reduced by half when reading ABAP tables.

The fillTable method is refactored and can easily be modified if the same behaviour is required when sending big tables from Python to ABAP. To cut the consumption here by half, lines 1646 and 1654:

```python
line = lines[i]
i += 1
```

need to be replaced respectively by:

```python
line = lines[0]
del lines[:1]
```

After this change is implemented, each Python table row is deleted after being passed to the C/C++ NW RFC SDK, which saves memory but also makes the Python input tables "gone" after the RFC call is executed. That would be inconvenient for other use-cases, so I left fillTable as it is. Another option would be adding a list of table names to be processed this way as a new RFC call option, which further complicates the interface and testing.

It would be great if you could share some figures about the consumption before/after this change. This should keep the amount of used memory to a minimum (with the current pyrfc design). If that is not enough, the next step could be sending rows to an external server for processing, similar to @patbuxton's solution. Regarding pagination/batches/chunks, for the time being the only option is at the Python application level, because a PyRFC rework would require considerably more effort.
Was the answer of any help? Shall we close the issue?
We're calling an RFC that returns a table with 500,000 rows (120 fields) (the KE24 report), and this is consuming >8 GB RAM on the server and fails because of this.
In an attempt to get around this I've implemented some callbacks so that rows can be processed directly to file. Whilst this does seem to work, it is still taking ~12 minutes to extract the 500,000 rows, which isn't particularly fast. The batch_size in the code below doesn't really seem to have any effect; set to batches of 10,000, it takes ~8 secs to build and ~3 secs to process to a CSV file.
My new code is below to potentially spark some ideas of how to get a large table of data out quickly. The bottleneck seems to be the processing of the data out of the result set. Perhaps processing the fields for every row of the table using wrapStructure is not the most efficient approach?
Any ideas how to speed this up and keep the amount of used memory to a minimum would be greatly appreciated.
Thanks,
Pat