Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor code (and some other changes) #43

Closed
wants to merge 46 commits into from
Closed

Conversation

Llannelongue
Copy link
Contributor

This is following discussions on #26 and during the meeting that now that we are going toward a stable version, it would be worth refactoring the code to make it a bit more future-proof.

This is most likely not final at all, and it'd be great to hear everyone's thoughts to refine it. And also to check that it still works as expected (in particular if I have inadvertently deleted something important!). And we can open dedicated issues for anything arising for discussions here.

Here is a schema of the draft new structure:

cats_refactored drawio

The guiding principles in refactoring were to try to:

  • keep each step separate, with a different class/script
  • minimised chained imports (file A imports B which imports C, file D also imports B etc.)
  • ensure that the main steps can be easily followed from cats.run
  • make it easy (when possible) to add more API/different optimisation tools in the future

As a reminder, the main steps of CATS are:

  1. Read data/information from the command line and the config file
  2. Pull forecast data from a carbon intensity (CI) API
  3. Find the window of time (start time + end time) that would minimise average carbon intensity within the known forecast
  4. (optional) Estimate the total energy used and carbon footprint by starting or not at the best time

An overview of some changes beyond structure

I aimed at keeping the functionalities as they were as much as possible, just trying to enhance when I could. Here are some changed:

  • Output as been changed a bit, here is the new one for now:
Using config.yml found in current directory
Using carbonintensity.org.uk for carbon intensity forecasts.
Using location provided: PE29

Carbon intensity forecast loaded for between 29/05/2023-20:30 UTC and 31/05/2023-20:30 UTC.

Best start time: 31/05/2023-12:00 UTC (in 39.1 hours)
	 Expected end time: 31/05/2023-17:50 UTC
	 Expected average carbon intensity: 70.03 gCO2e/kWh

Estimated carbon footprint of running job at best time: 10.68 gCO2e
	vs running it now: 13.26 gCO2e (2.58 gCO2e saved)
	vs running at worst time (31/05/2023-01:00 UTC): 17.54 gCO2e (6.87 gCO2e saved)
	- Estimated energy usage: 0.15 kWh
  • Regarding how to find the best start time:

  • Now all carbon intensity values with timestamps (from the API or computed internally) are stored as a CarbonIntensityEstimate (defined in CI_api_interface.py but could be defined more centrally). It is a simple adaptation of the previously existing CarbonIntensityAverageEstimate and has properties such as value, start, end and timedelta and can be sorted by CI value.

  • Added some progress update/error messages throughout (more needed!)

How this release addresses existing open issues

Open questions:

  1. Why some things that are not errors are being printed on stderr and others on stdout? As a temporary measure, all updates/messages are sent to stdout and only errors/warnings to stderr, but would be great to hear some thoughts on that.
  2. I suspect I may have broken the pipeline with the at linux scheduler with the different output, any thoughts on the best way to pass information on start time?

What is still needed

Not necessarily before merging, but here is a list non-exhaustive of what still needs to be done (the idea of this PR is to discuss this new structure while polishing the details).

  • Write more tests
  • Comment the code/functions more thoroughly
  • Check how it deals with edge cases, absurd values, etc. (and add error messages)
  • Rewrite README
  • Check how the new structure works with the at linux command

…be nore versatile and be the only new object used
Copy link
Collaborator

@tlestang tlestang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Llannelongue for your work and the detailed PR description. As this is a large set of changes I won't be able to look at everything in one go.

About the API interface:

I don't see the benefit of introducing a new plain class for the API interface, compared to the currently implemented APIInterface namedtuple. It provides the same access to the request url and response parsing functions through dot notation and a lot less code.

More problematic to me is that the API-specific logic of forming the query URL and parsing the response is coupled back to the API interface code. Somebody wishing to interface cats with a new web API would have to modify the implementation of CI_API_interface itself, adding a new statememt if self.choice_CI_API == 'myapi.org' and hardcoding API-specific code in there. I feel like this is a step back from the current implementation (introduced in #37) in which interfacing against a different API is only a matter of providing two functions (possibly grouping them into a new APIInterface object) without having to touch the code of cats itself at any point.

Lastly, do we really want to rename api_query.py and api_interface.py ?

@Llannelongue
Copy link
Contributor Author

Thanks for these @tlestang

I don't see the benefit of introducing a new plain class for the API interface, compared to the currently implemented APIInterface namedtuple. It provides the same access to the request url and response parsing functions through dot notation and a lot less code.

More on this below after the second comment, but regarding class vs namedtuple, I don't mind as long as it's easy to switch between APIs (so we would need a master function to pick the right namedtuple in case it's not provided directly by the user). I can draft a new commit reversing to namedtuple.

More problematic to me is that the API-specific logic of forming the query URL and parsing the response is coupled back to the API interface code. Somebody wishing to interface cats with a new web API would have to modify the implementation of CI_API_interface itself, adding a new statememt if self.choice_CI_API == 'myapi.org' and hardcoding API-specific code in there. I feel like this is a step back from the current implementation (introduced in #37) in which interfacing against a different API is only a matter of providing two functions (possibly grouping them into a new APIInterface object) without having to touch the code of cats itself at any point.

This is an interesting point, and probably needs more thinking. However not sure about not having to touch the code, I see two different use cases here:

  1. We or other contributors will want to add other CI APIs (e.g. for other countries), and we ideally want to make them part of CATS so that these new APIs are available to the whole community. In this case, it would be good to have all the URL/parsing codes in the same place and api_inferface is a good place for that (it also makes it easier to add things by copy-pasting). And in terms of how much hassle it is to add it, it's equivalent now and with the new code (api_interface needs to be modified either way, and current code requires messing with __init__.py as well), but the existing code doesn't allow user to easily pick an API, this is what the new argument --api-carbonintensity introduces.

  2. Second use case is if users want to pass their own API wrapper directly to CATS without having to modify the code. And in this case I agree, it would be good to make it possible in an easier way. But how would that work in practice? It would be good to have an idea of how the user would do it if we want to implement it.

I opened issue #44 to have a space to discuss (2) further.

Lastly, do we really want to rename api_query.py and api_interface.py ?

The motivation behind that is to clarify that the code in there is related to pulling carbon intensities, in case more APIs for other parts of the code are included later. But at the moment, this is only future-proofing so I don't mind removing the CI_ in the names if it's an issue for now.

@Llannelongue
Copy link
Contributor Author

Following up on that, commit 8c927c7 reintroduces namedtuple for the api interfaces.

@tlestang
Copy link
Collaborator

We or other contributors will want to add other CI APIs (e.g. for other countries), and we ideally want to make them part of CATS so that these new APIs are available to the whole community. In this case, it would be good to have all the URL/parsing codes in the same place and api_inferface is a good place for that (it also makes it easier to add things by copy-pasting).

Agreed 100%, and that's the motivation behind #37. But the separation between the API query code and the API specific bits wasn't done with a particular usage of cats in mind - its generally useful for developers adding new APIs because they don't have to touch the api query module. Turns out its also useful for users playing with cats as a library as they can actually bring they own functions. Sure, the right default API interface must be selected in the init script for now but that's about it. Could as well be specified at the command line.

@@ -0,0 +1,91 @@
from datetime import datetime, timezone
Copy link
Collaborator

@tlestang tlestang May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add my general comment on this module here so we can have a thread.

I think this PR adds useful missing pieces but I'm a bit uncomfortable with the fast it completely removes the timeseries_conversion and forecast modules. Makes more sense to me to build upon them instead of discarding them alltogether: they already contain most of the functionality (re)implemented in this module. Particularly, I'm not sure why WindowedForecast disappeared -- it was merged 4 days ago -- see #36 . I think it would be straightforward to make it work with a simple summation of intensity values.

That being a few welcome things in this module that are missing in the current implementation:

  • Detecting cases where the jon duration exceeds the maximum forecast duration.
  • Allowing to compute the windowed/summed/integrated intensity over the window by just summing intervals.
  • Handle the fact that the first and last interval in the time window are fraction of the default interval length (30min).

Again -- I think we'd be better off integrating the above in the current code, rather than reimplementing a whole new module from scratch. Unless of course we think that there are strong reasons to do so, but not obvious to me at this stage.

Copy link
Contributor Author

@Llannelongue Llannelongue May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding WindowedForecast: should definitely be kept, as highlighted in the PR release note and in the code here, the goal is to keep the trapezoidal integration as an option, especially while we figure out which one is right for the UK (issue #42). But it's a rather delicate function to get the bounds right (especially first and last segments), so rather than me just shoehorn in into this updated structure on my own, it'd be good to discuss about the best way to include it. Very happy to see suggestions.

Regarding forecast it's not removed, just renamed as I'm not sure forecast still describes what this part of the code is doing. But it still serves the same purpose.

As of timeseries_conversion (and most of the code in this PR), it's of course reusing existing code, sometimes grouped differently based on how usage has evolved. Which functions in particular from timeseries_conversion have been removed completely but still needed? My understanding is that csv_loader and cat_converter aren't needed anymore (as per PR #41), check_duration has just been moved to check_clean_arguments as it's what it does and get_lowest_carbon_intensity is now in optimise_starttime for the same reason.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the goal is to keep the trapezoidal integration as an option [..]. But it's a rather delicate function to get the bounds right (especially first and last segments), so rather than me just shoehorn in into this updated structure on my own, it'd be good to discuss about the best way to include it.

Don't you think it will more effort re-integrating something that is already there (and covered by a couple of tests) into new code? As opposed to start from what is there. Sorry -- I understand the appeal of writing that module from scratch, but I'm questioning whether it will make the next few steps actually harder. Obviously that's just my own humble opinion, let's see what other people have to say about this!

Regarding forecast it's not removed, just renamed as I'm not sure forecast still describes what this part of the code is doing. But it still serves the same purpose.

Did forecast become optimise_starttime? It looks like a new module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll submit a commit later today that reintroduces WindowedForecast closer to its original form so that we can hopefully get the best of both worlds

Copy link
Contributor Author

@Llannelongue Llannelongue Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that properly integrating the first and last time windows into the trapezoidal model was a bit harder than planned (didn't manage with the original implementation but this one is very close) but it's now done alongside a test scenario: commit
1aca7ab

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also @tlestang if you think testing the integration method on the sinus function is still needed, it can be put back in as it's only commented out for now, for now there is a test on a dummy carbon intensity forecast

@andreww
Copy link
Collaborator

andreww commented Jun 23, 2023

I'm not sure exactly where we are with this one - does it make sense to discuss at the meeting on Wednesday. One possibility would be to try and chop the changes into smaller sets but I think some wider discussion around the architecture could be useful.

My two cents is that a useful approach to thinking about this is to consider other future uses of (maybe bits of) the code base and think about how they would work. So, for cats, I imagine a user who has some big python package providing a service that runs some regular processing, and it's the processing that they want to schedule (no need to integrate with a scheduler). Or, somebody wants to post-process job scheduler logs.

Some (partial) answers to the questions above:

Why some things that are not errors are being printed on stderr and others on stdout? As a temporary measure, all updates/messages are sent to stdout and only errors/warnings to stderr, but would be great to hear some thoughts on that.

The way the python currently interfaces with at is via a shell backtick expansion - stuff that comes out of stdout from cats goes into an argument to at and (from the point of view of at) everything else is ignored and goes to the terminal. We could sensibly use python's subprocess module to do this instead. This would give us more control but could limit what the user could do.

@tlestang
Copy link
Collaborator

I'm closing this for now as we discussed it won't be merged as-in. But let's keep the branch around for reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants