
Changed parsing from beautifulsoup to lxml #232

Draft
jvanelteren wants to merge 11 commits into base: master
Conversation

@jvanelteren commented Jan 6, 2023

BeautifulSoup can be quite slow when parsing large XML files. When a user wants to pull a large amount of data off the API, this can be annoying. For example, getting one year of generation data takes ~75 seconds to parse (~35,000 rows, 10 production types, and actual & aggregated = 700k values).

Therefore we can change to the lxml package, which is quite a bit faster (3x up to 20x, tested on three types of queries over a full year; the generation query above now parses in 4 seconds, which is the 20x case). I didn't test all functions because it's a PITA.
It almost seems too good to be true, and I'm not very familiar with lxml. Some parse functions do additional checks, which I kept in; those show a smaller speedup.
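To illustrate the kind of parsing the change speeds up, here is a minimal sketch of reading point values out of an ENTSO-E-style XML response with lxml. The element names are illustrative only, not the real API schema:

```python
from io import BytesIO
from lxml import etree

# Hypothetical snippet of an ENTSO-E-style XML response
# (element names are made up for illustration).
xml_bytes = b"""<GL_MarketDocument>
  <TimeSeries>
    <Period>
      <Point><position>1</position><quantity>42.0</quantity></Point>
      <Point><position>2</position><quantity>43.5</quantity></Point>
    </Period>
  </TimeSeries>
</GL_MarketDocument>"""

# lxml parses bytes directly and walks the tree with C-level speed,
# which is where most of the gain over BeautifulSoup comes from.
root = etree.parse(BytesIO(xml_bytes)).getroot()
quantities = [float(q.text) for q in root.iter("quantity")]
print(quantities)  # [42.0, 43.5]
```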

Timing the complete function call (including the API call) is easier, so I did that for all functions with date ranges of a month; see this notebook. The speedups there are much lower, partly because certain calls trigger a large number of API requests via the @documents decorator, and the API is sometimes shaky.

Main changes

  • parsers.py: completely reworked from BeautifulSoup to lxml. Added two helper functions, 'find' and 'findall', which work similarly to their bs4 counterparts but are faster. Since lxml is case sensitive, tag lookups now have to be case sensitive as well.
  • entsoe.py: now sends response.content instead of response.text to the parsing module, since lxml works best with bytes. This means that where EntsoeRawClient previously returned response.text, it now returns response to EntsoePandasClient, which in turn calls the parsers module with response.content. This is a breaking change for users of EntsoeRawClient. Note: nothing changes for API calls that return zip files.
  • mappings.py: fixed a small bug: 'A13': 'Withdrawn' is a valid docstatus and has been added.
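A rough sketch of what bs4-style 'find'/'findall' helpers on top of lxml could look like (this is an assumed reconstruction for illustration, not the PR's actual implementation), including the case-sensitivity gotcha:

```python
from lxml import etree

def find(element, tag):
    """Return the first descendant with the given tag, like bs4's find().
    Unlike BeautifulSoup, lxml tag matching is case sensitive."""
    for child in element.iter(tag):
        return child
    return None

def findall(element, tag):
    """Return all descendants with the given tag, like bs4's find_all()."""
    return list(element.iter(tag))

root = etree.fromstring(b"<doc><Point>1</Point><Point>2</Point></doc>")
print(find(root, "Point").text)     # "1"
print(len(findall(root, "Point")))  # 2
print(find(root, "point"))          # None -- case matters with lxml
```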

Tests
I've tested all functions on the EntsoePandasClient, and all except 4 returned an identical df (tested with df_1.equals(df_2)). For the other 4 functions, the current package fails:

  • query_unavailability_transmission: HTTPError: 400 Client Error: Bad Request for url
  • query_offered_capacity: no matching data error (could be bad params on my side)
  • query_procured_balancing_capacity: takes very long, as the API request exceeds 2 minutes even for a 1-day time period
  • query_withdrawn_unavailability_of_generation_units: 'A13' error; fixed by this pull request, but for that reason I couldn't test for equality

I've tested with this notebook; scroll down to the bottom for the equality comparison. The file also includes the timings, including the API calls. The tests without the API call are not documented because it's a PITA: the functions in the package itself would need to be modified to detect when the API call has finished.
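For reference, the equality check used above is strict: df_1.equals(df_2) requires identical values, dtypes, index, and columns. A minimal sketch (data is made up):

```python
import pandas as pd

# Two dataframes with identical values, dtypes, index and columns
# compare equal under DataFrame.equals.
df_1 = pd.DataFrame({"quantity": [42.0, 43.5]},
                    index=pd.date_range("2023-01-01", periods=2, freq="h"))
df_2 = df_1.copy()
print(df_1.equals(df_2))  # True

# A dtype change alone is enough to make the comparison fail,
# even though the values are numerically the same.
df_3 = df_1.astype("float32")
print(df_1.equals(df_3))  # False
```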

@jvanelteren
Author

Updated with tests to make sure the dataframes return the same result.

@jvanelteren
Author

Below are the benchmark results, including the API call, for a full-year query (in seconds, BeautifulSoup first, lxml second). I would guess about a 40% reduction on average.

| function | bs4 (s) | lxml (s) |
| --- | --- | --- |
| query_day_ahead_prices | 32.01 | 7.80 |
| query_net_position | 15.67 | 3.75 |
| query_crossborder_flows | 4.89 | 3.44 |
| query_scheduled_exchanges | 13.35 | 7.28 |
| query_net_transfer_capacity_dayahead | 5.07 | 4.32 |
| query_net_transfer_capacity_yearahead | 0.45 | 0.25 |
| query_aggregate_water_reservoirs_and_hydro_storage | 0.86 | 0.41 |
| query_load | 7.13 | 5.88 |
| query_load_forecast | 7.27 | 10.43 |
| query_load_and_forecast | 23.74 | 14.53 |
| query_generation_forecast | 13.86 | 6.98 |
| query_wind_and_solar_forecast | 34.97 | 23.62 |
| query_generation | 155.79 | 103.81 |
| query_installed_generation_capacity | 1.07 | 0.46 |
| query_installed_generation_capacity_per_unit | 0.18 | 0.16 |
| query_imbalance_prices | 30.78 | 25.21 |
| query_contracted_reserve_prices | 67.48 | 72/21 |
| query_contracted_reserve_amount | 124.46 | 63.26 |
| query_unavailability_of_generation_units | 55.25 | 30.42 |
| query_unavailability_of_production_units | 3.37 | 4.33 |
| query_import | 31.09 | 21.03 |
| query_generation_import | 207.24 | 115.14 |

@carl-lange2

@fboerman are you planning on merging this? Sounds like a great improvement to me.

@nhlong2701

nhlong2701 commented Feb 23, 2024

@fboerman are there any updates on this PR? I'm also looking forward to seeing it merged.

@ivandjuraki

Is there any help needed for this one? It looks like a huge improvement. We are caching years of data, and this would save us a lot of trouble.

@fboerman
Collaborator

Hi @carl-lange2 @nhlong2701 @ivandjuraki @jvanelteren, I looked at this before and talked to Jesse about it (we know each other in real life). My problem with this is that it requires a fully fledged test suite to be sure that nothing breaks. This change touches every single part of the library, which is too large a change to make without guaranteeing that nothing breaks. I also don't have enough time right now to do it myself.

If any of you want to work on this, starting with the test suite, we can break it down into parts and slowly merge in the changes. I can assist with that. As it stands, I am not confident enough to merge this whole thing in one go (the PR is also outdated at the moment).
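One possible shape for such a test suite (an assumed sketch, not an existing design: the parser names and fixture are placeholders) is to record raw API responses once and assert that the old and new parsers produce identical dataframes:

```python
import pandas as pd

# Placeholder for the current BeautifulSoup-based parser (hypothetical name).
def parse_old(xml_bytes):
    return pd.DataFrame({"quantity": [42.0]})

# Placeholder for the lxml-based parser under review (hypothetical name).
def parse_new(xml_bytes):
    return pd.DataFrame({"quantity": [42.0]})

def test_parsers_agree():
    # In a real suite this would be a recorded API response on disk.
    fixture = b"<GL_MarketDocument/>"
    pd.testing.assert_frame_equal(parse_old(fixture), parse_new(fixture))
```

Running such tests against recorded fixtures avoids hitting the (sometimes shaky) API in CI.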

@ivandjuraki I don't fully understand your issue. Are you running this library on the same data over and over again? I would advise you to fetch the data once and then save it into a database for further processing.
If you mean that large queries take a long time, I would advise you to check out the SFTP server that ENTSO-E makes available for exactly this bulk-download use case: https://transparency.entsoe.eu/content/static_content/Static%20content/knowledge%20base/SFTP-Transparency_Docs.html
Hope this helps.

@fboerman fboerman marked this pull request as draft February 29, 2024 21:18
@ivandjuraki

ivandjuraki commented Apr 16, 2024

Hey @fboerman, sorry for ghosting; I have a lot of GitHub accounts and I'm not even sure which one I use. If this PR gets attention, I can make some time for writing test suites if needed.
I am not re-caching data, but I'm working on a project with a lot of parsers for CO2 data, so there is caching happening all the time that is "generic" for the parsers. Not to go into the problem further, but since the project is live, it is not as big a problem as I thought it would be, so it's all OK from my side even if this doesn't go anywhere.

@awarsewa
Contributor

@fboerman I would also be interested in seeing this merged and can provide assistance if needed

@fleimgruber
Contributor

fleimgruber commented Nov 11, 2024

Once #131 lands we could follow up on #232 (comment).
