
Changed parsing from beautifulsoup to lxml #232

Draft
jvanelteren wants to merge 11 commits into base: master
Conversation

@jvanelteren commented Jan 6, 2023

BeautifulSoup can be quite slow when parsing large XML files. When a user wants to pull a large amount of data off the API, this can be annoying. For example, getting one year of generation data takes ~75 seconds to parse (~35,000 rows, 10 production types, and actual & aggregated = 700k values).

Therefore we can change to the lxml package, which is quite a bit faster (3x up to 20x, tested on three types of queries over a full year; the generation query above now parses in 4 seconds, which is the 20x case). I didn't test all functions because it's a PITA.
It almost seems too good to be true, and I'm not very familiar with lxml. Some parse functions do additional checks, which I kept in; those show a smaller speedup.
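To illustrate the kind of parsing the change speeds up, here is a minimal sketch of reading point values out of an ENTSO-E-style XML response with lxml. The element names are illustrative only, not the real API schema:

```python
from io import BytesIO
from lxml import etree

# Hypothetical snippet of an ENTSO-E-style XML response
# (element names are made up for illustration).
xml_bytes = b"""<GL_MarketDocument>
  <TimeSeries>
    <Period>
      <Point><position>1</position><quantity>42.0</quantity></Point>
      <Point><position>2</position><quantity>43.5</quantity></Point>
    </Period>
  </TimeSeries>
</GL_MarketDocument>"""

# lxml parses bytes directly and walks the tree with C-level speed,
# which is where most of the gain over BeautifulSoup comes from.
root = etree.parse(BytesIO(xml_bytes)).getroot()
quantities = [float(q.text) for q in root.iter("quantity")]
print(quantities)  # [42.0, 43.5]
```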

Timing the complete function call (including the API call) is easier, so I did that for all functions with date ranges of a month; see this notebook. The speedups there are much lower, partly because certain calls trigger a large number of API requests via the @documents decorator, and the API is sometimes shaky.

Main changes

  • parsers.py: completely reworked from BeautifulSoup to lxml. Added two helper functions, 'find' and 'findall', which work similarly to their bs4 counterparts but are faster. Since lxml is case sensitive, tag lookups now have to be case sensitive as well.
  • entsoe.py: now sends response.content instead of response.text to the parsing module, since lxml works best with bytes. This means that where EntsoeRawClient previously returned response.text, it now returns response to EntsoePandasClient, which in turn calls the parsers module with response.content. This is a breaking change for users of EntsoeRawClient. Note: nothing changes for API calls that return zip files.
  • mappings.py: fixed a small bug: 'A13': 'Withdrawn' is a valid docstatus and has been added.
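A rough sketch of what bs4-style 'find'/'findall' helpers on top of lxml could look like (this is an assumed reconstruction for illustration, not the PR's actual implementation), including the case-sensitivity gotcha:

```python
from lxml import etree

def find(element, tag):
    """Return the first descendant with the given tag, like bs4's find().
    Unlike BeautifulSoup, lxml tag matching is case sensitive."""
    for child in element.iter(tag):
        return child
    return None

def findall(element, tag):
    """Return all descendants with the given tag, like bs4's find_all()."""
    return list(element.iter(tag))

root = etree.fromstring(b"<doc><Point>1</Point><Point>2</Point></doc>")
print(find(root, "Point").text)     # "1"
print(len(findall(root, "Point")))  # 2
print(find(root, "point"))          # None -- case matters with lxml
```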

Tests
I've tested all functions on the EntsoePandasClient, and all except 4 returned an identical df (tested with df_1.equals(df_2)). For the other 4 functions, the current package fails:

  • query_unavailability_transmission: HTTPError: 400 Client Error: Bad Request for url
  • query_offered_capacity: no matching data error (could be bad params on my side)
  • query_procured_balancing_capacity: takes very long, as the API request exceeds 2 minutes even for a 1-day time period
  • query_withdrawn_unavailability_of_generation_units: 'A13' error; fixed by this pull request, but for that reason I couldn't test for equality

I've tested with this notebook; scroll down to the bottom for the equality comparison. The file also includes the timings, including the API calls. The tests without the API call are not documented because it's a PITA: the functions in the package itself would need to be modified to detect when the API call has finished.
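For reference, the equality check used above is strict: df_1.equals(df_2) requires identical values, dtypes, index, and columns. A minimal sketch (data is made up):

```python
import pandas as pd

# Two dataframes with identical values, dtypes, index and columns
# compare equal under DataFrame.equals.
df_1 = pd.DataFrame({"quantity": [42.0, 43.5]},
                    index=pd.date_range("2023-01-01", periods=2, freq="h"))
df_2 = df_1.copy()
print(df_1.equals(df_2))  # True

# A dtype change alone is enough to make the comparison fail,
# even though the values are numerically the same.
df_3 = df_1.astype("float32")
print(df_1.equals(df_3))  # False
```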

@jvanelteren
Author

Updated with tests to make sure the dataframes return the same result.

@jvanelteren
Author

Below are the benchmark results, including the API call, for a full-year query (in seconds, BeautifulSoup first, lxml second). I would guess about a 40% reduction on average.

| function | bs4 (s) | lxml (s) |
| --- | --- | --- |
| query_day_ahead_prices | 32.01 | 7.80 |
| query_net_position | 15.67 | 3.75 |
| query_crossborder_flows | 4.89 | 3.44 |
| query_scheduled_exchanges | 13.35 | 7.28 |
| query_net_transfer_capacity_dayahead | 5.07 | 4.32 |
| query_net_transfer_capacity_yearahead | 0.45 | 0.25 |
| query_aggregate_water_reservoirs_and_hydro_storage | 0.86 | 0.41 |
| query_load | 7.13 | 5.88 |
| query_load_forecast | 7.27 | 10.43 |
| query_load_and_forecast | 23.74 | 14.53 |
| query_generation_forecast | 13.86 | 6.98 |
| query_wind_and_solar_forecast | 34.97 | 23.62 |
| query_generation | 155.79 | 103.81 |
| query_installed_generation_capacity | 1.07 | 0.46 |
| query_installed_generation_capacity_per_unit | 0.18 | 0.16 |
| query_imbalance_prices | 30.78 | 25.21 |
| query_contracted_reserve_prices | 67.48 | 72/21 |
| query_contracted_reserve_amount | 124.46 | 63.26 |
| query_unavailability_of_generation_units | 55.25 | 30.42 |
| query_unavailability_of_production_units | 3.37 | 4.33 |
| query_import | 31.09 | 21.03 |
| query_generation_import | 207.24 | 115.14 |

@carl-lange2

@fboerman are you planning on merging this? Sounds like a great improvement to me.

@nhlong2701

nhlong2701 commented Feb 23, 2024

@fboerman are there any updates on this PR? I'm also looking forward to seeing it merged.

@ivandjuraki

Is there any help needed for this one? It looks like a huge improvement. We are caching years of data, and this would save us a lot of trouble.

@fboerman
Collaborator

Hi @carl-lange2 @nhlong2701 @ivandjuraki @jvanelteren, I looked at this before and talked to Jesse about it (we know each other in real life). My problem with this is that it requires a fully fledged test suite to be sure that nothing breaks. This change touches every single part of the library, which is too large a change to make without guaranteeing that nothing breaks. I also don't have enough time right now to do it myself.

If any of you want to work on this, starting with the test suite, we can break it down into parts and slowly merge in the changes. I can assist with that. As it stands, I am not confident enough to merge this whole thing in one go (the PR is also outdated at the moment).
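One possible shape for such a test suite (an assumed sketch, not an existing design: the parser names and fixture are placeholders) is to record raw API responses once and assert that the old and new parsers produce identical dataframes:

```python
import pandas as pd

# Placeholder for the current BeautifulSoup-based parser (hypothetical name).
def parse_old(xml_bytes):
    return pd.DataFrame({"quantity": [42.0]})

# Placeholder for the lxml-based parser under review (hypothetical name).
def parse_new(xml_bytes):
    return pd.DataFrame({"quantity": [42.0]})

def test_parsers_agree():
    # In a real suite this would be a recorded API response on disk.
    fixture = b"<GL_MarketDocument/>"
    pd.testing.assert_frame_equal(parse_old(fixture), parse_new(fixture))
```

Running such tests against recorded fixtures avoids hitting the (sometimes shaky) API in CI.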

@ivandjuraki I don't fully understand your issue. Are you running this library on the same data over and over again? I would advise you to fetch the data once and then save it into a database for further processing.
If you mean that large queries take a long time, I would advise you to check out the SFTP server that ENTSO-E makes available for exactly this bulk-download use case: https://transparency.entsoe.eu/content/static_content/Static%20content/knowledge%20base/SFTP-Transparency_Docs.html
Hope this helps.

@fboerman fboerman marked this pull request as draft February 29, 2024 21:18
@ivandjuraki

ivandjuraki commented Apr 16, 2024

Hey @fboerman, sorry for ghosting; I have a lot of GitHub accounts and I'm not even sure which one I use. If this PR gets attention, I can make some time for writing test suites if needed.
I am not re-caching data, but I'm working on a project with a lot of parsers for CO2 data, so there is caching happening all the time that is "generic" for the parsers. Not to go into the problem further, but since the project is live, it is not as big a problem as I thought it would be, so it's all OK from my side even if this doesn't go anywhere.

@awarsewa
Contributor

@fboerman I would also be interested in seeing this merged and can provide assistance if needed

@fleimgruber
Contributor

fleimgruber commented Nov 11, 2024

Once #131 lands we could follow up on #232 (comment).
