[WIP] Structuring 3 #760

Open · wants to merge 46 commits into master
@ardunn ardunn commented Mar 2, 2023

Structuring 3

A rewriting of the structuring class to make BEEP structuring more flexible, efficient, and idiomatic.

Main contributions

Pros and cons vs. the previous BEEPDatapath structuring and the older *CycleRun framework:

Pros

  • Modular, idiomatic objects representing cycler runs, cycles, and steps.
  • Code is organized, modular, and separated by function.
  • Much easier to work with subsets of data.
  • Hugely improved configuration for interpolation: each step, cycle, or set of cycles can be configured to be interpolated individually, in groups, or all at once, and the configuration is easy to specify and readable.
  • Capacity for parallelization and dask-based memory management, since each cycle can be interpolated independently. Testing on a single 4-core laptop shows a ~10-50% reduction in compute time.
  • No more NaN issue: everything is actually interpolated.
  • Retains from_file functionality to load many kinds of cycler data.
  • Improved clarity of core column headers (e.g., no more step_type and wondering what that actually means).
  • Can use as much memory as specified, because all the "big" objects (both structured and unstructured data) are stored in dask Bags. To cap memory usage, declare a local client cluster with dask.distributed beforehand and use the code as is, or configure it to run on a cluster. I've been able to get memory usage down to about 1/3 of what it was with BEEPDatapath.
  • Tests are organized sanely, rather than insanely.
  • Backwards compatible with both legacy ProcessedCyclerRun and BEEPDatapath objects previously structured and saved to disk.
  • Paves the way for programmatically representing protocols with run, cycle, and step granularity.
  • Gzipped files saved to disk, including structured data and summaries, are actually (usually) smaller than the original uncompressed raw cycler data, typically on the order of a 20% reduction in size.
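To illustrate why per-cycle independence enables the parallelization and memory management described above, here is a minimal sketch of the principle with synthetic data. It uses numpy and a stdlib thread pool rather than dask, and none of the names below are BEEP's; it only shows that each cycle's interpolation is an isolated task that can be mapped across workers (or dask Bag partitions).

```python
# Sketch of per-cycle interpolation as independent tasks
# (synthetic data; not BEEP's actual implementation).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def interpolate_cycle(cycle, n_points=5):
    """Resample one cycle's voltage onto a uniform capacity grid."""
    capacity, voltage = cycle
    grid = np.linspace(capacity.min(), capacity.max(), n_points)
    return grid, np.interp(grid, capacity, voltage)

# Three synthetic cycles with different numbers of raw points.
cycles = [
    (np.linspace(0, 1.0, 40), np.linspace(3.0, 4.2, 40)),
    (np.linspace(0, 0.9, 25), np.linspace(3.0, 4.1, 25)),
    (np.linspace(0, 0.8, 10), np.linspace(3.0, 4.0, 10)),
]

# Each cycle is independent, so the map can be handed to any pool
# (or, as in this PR, to a dask Bag under a memory-limited Client).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(interpolate_cycle, cycles))
```

Because no task reads another cycle's data, the scheduler is free to interleave or distribute them, which is what makes the dask Bag approach work.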

Cons

  • Comparatively slow when the number of points in a single cycle or step is low, e.g., if there are 10,000 steps in a cycle and each step has 2 points.
  • Takes longer to obtain views of data because dataframes from steps/cycles are collated together. Typically ~1-2 s maximum on a laptop.
  • Takes longer to instantiate run objects because all Step/Cycle objects have to be instantiated. Typically ~30 s on a laptop.
  • The implementation of Step can cause strange behavior when grouping multiple steps together (i.e., when steps must be grouped to be interpolated based on chg/dchg step labels).

Problems

  • When interpolating on step label (e.g., charge and discharge) along axes on which multiple step labels are discontinuous, interpolation can misbehave. Examples:
    • One charge step, one discharge step: no problem.
    • $k$ charge steps, $k$ discharge steps: generally no problem.
    • A charge step, a discharge step, then another charge step: interpolation is wonky.
  • Indeterminate steps are interpolated by default
  • Validation currently requires the entire dataframe to be loaded into memory at once, but this can in principle be fixed.
  • Cannot yet index steps or collections by raw index, e.g. run.raw.cycles[-1] does not work, but run.raw.cycles[1] will give you the cycle with cycle_index=1.
  • Indexing chained selections can be wonky, i.e., run.raw.cycles["regular"][3] won't give you the 3rd regular cycle; it will give you the regular cycle with cycle index == 3. This can be fixed, but we need to decide what behavior we actually want. For the time being, a by_raw_index method retrieves various single objects by their raw index.
  • Indexing can sometimes result in a single object (cycle or step) in unexpected ways when you might be expecting an iterable.
  • Validation will still fail if cycle_index is not monotonically increasing, but this no longer matters for interpolation.
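The charge/discharge/charge case above has a concrete cause that can be shown with a few lines of synthetic data: grouping discontinuous steps under one label produces a non-monotonic interpolation axis, which linear interpolation silently mishandles. The arrays and names below are illustrative, not BEEP code.

```python
# Why grouping discontinuous step labels makes interpolation wonky
# (synthetic illustration, not BEEP code).
import numpy as np

# Charge step 1 sweeps capacity 0 -> 0.5; after an intervening
# discharge, charge step 2 sweeps capacity 0.2 -> 1.0.
chg1_cap = np.array([0.0, 0.25, 0.5]); chg1_v = np.array([3.0, 3.5, 4.0])
chg2_cap = np.array([0.2, 0.6, 1.0]);  chg2_v = np.array([3.2, 3.7, 4.2])

# Grouping both "charge" steps on a single capacity axis makes the
# axis non-monotonic: 0.5 is followed by 0.2.
grouped_cap = np.concatenate([chg1_cap, chg2_cap])
grouped_v = np.concatenate([chg1_v, chg2_v])
axis_is_monotonic = bool(np.all(np.diff(grouped_cap) > 0))  # False

# np.interp assumes increasing x and raises no error, so the grouped
# result is silently wrong ("wonky").
grid = np.linspace(0, 1, 5)
wonky = np.interp(grid, grouped_cap, grouped_v)

# Interpolating each charge step independently is well-defined:
ok1 = np.interp(grid, chg1_cap, chg1_v)
ok2 = np.interp(grid, chg2_cap, chg2_v)
```

This is also why the single-charge/single-discharge and alternating $k$/$k$ cases behave: within each grouped label the axis stays monotonic.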

Other notes

  • The step_index standard column has been renamed to step_code.
  • All data includes a step_counter column calculated from step_code.
  • EIS functionality is moved to a different directory for now, since it is incomplete.
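For readers unfamiliar with the step_code/step_counter distinction: a common way to derive a monotonically increasing counter from a repeating step code is to increment on every change of value. This pandas one-liner is a sketch of that pattern under assumed column names from the PR; it is not necessarily how BEEP computes it.

```python
# Derive step_counter from step_code by incrementing on each change
# (a sketch; column names follow the PR, the computation is assumed).
import pandas as pd

df = pd.DataFrame({"step_code": [1, 1, 2, 2, 1, 1, 3]})
df["step_counter"] = (df["step_code"] != df["step_code"].shift()).cumsum() - 1
# step_code:    1 1 2 2 1 1 3
# step_counter: 0 0 1 1 2 2 3
```

The counter distinguishes the two visits to step_code 1, which a raw code cannot.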

Examples: TBA
