ENH: Performance improvements for generating candidate models #254

Conversation

bocklund (Member) commented:

This set of changes improves the performance of parameter selection in two primary ways:

  1. When we build candidate models (`build_feature_sets` has been renamed to `build_candidate_models`), we take all combinations of the product of composition-independent features with interaction features. The implication is that models with many features, for example heat capacity temperature features combined with four binary interaction features, become very expensive to generate candidates for, because the current implementation has geometric complexity in the number of temperature and interaction features (as documented). Here we add an optimization for cases where the general implementation would generate more than `complex_algorithm_candidate_limit` (default: 1000) candidate models: the simplified version uses the same number of composition-independent features for every interaction feature. Instead of the geometric complexity $N(1-N^M)/(1-N)$, the simplified version has complexity $NM$, where $N$ and $M$ are the number of composition-independent features and interaction features, respectively. A sketch contrasting the two schemes follows this list.

  2. A profiling-guided optimization in `espei.paramselect._build_feature_matrix`. The feature matrix is a concrete matrix of reals (rows: observations, columns: feature coefficients). We use a symengine vector (`ImmutableDenseMatrix`) to fill the feature matrix row-wise, moving an inner loop from slow Python into fast SymEngine. This change gives roughly a 3x speedup of this function. A sketch of this row-wise fill also follows the list.
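To make the complexity difference in (1) concrete: with $N = 4$ composition-independent features and $M = 4$ interaction features, the general scheme generates $4 + 4^2 + 4^3 + 4^4 = 340$ candidate models, while the simplified scheme generates $4 \times 4 = 16$. The sketch below contrasts the two schemes; the feature strings, the `make_successive` stand-in, and the function names are illustrative only and are not copied from ESPEI.

```python
from itertools import product

# Illustrative stand-ins for feature lists (not ESPEI's real feature objects):
temperature_features = ["1", "T", "T*ln(T)", "T**2"]        # composition-independent, N = 4
interaction_features = ["YS", "YS*Z", "YS*Z**2", "YS*Z**3"]  # interaction features, M = 4

def make_successive(xs):
    """Successive prefixes of a list: [x0], [x0, x1], [x0, x1, x2], ..."""
    return [xs[:i] for i in range(1, len(xs) + 1)]

def candidates_general(temp_feats, interaction_feats):
    """Every interaction feature independently picks a temperature-feature prefix.

    Candidate count grows like N + N**2 + ... + N**M = N*(1 - N**M)/(1 - N).
    """
    temp_prefixes = make_successive(temp_feats)
    models = []
    for order in range(1, len(interaction_feats) + 1):
        for combo in product(temp_prefixes, repeat=order):
            models.append([f"({t})*({ix})"
                           for ix, temps in zip(interaction_feats, combo)
                           for t in temps])
    return models

def candidates_simplified(temp_feats, interaction_feats):
    """Every interaction feature shares the same temperature-feature prefix.

    Candidate count grows like N*M.
    """
    models = []
    for order in range(1, len(interaction_feats) + 1):
        for temps in make_successive(temp_feats):
            models.append([f"({t})*({ix})"
                           for ix in interaction_feats[:order]
                           for t in temps])
    return models

print(len(candidates_general(temperature_features, interaction_features)))     # 340
print(len(candidates_simplified(temperature_features, interaction_features)))  # 16
```

For change (2), the sketch below fills a feature matrix row-wise through a symengine `ImmutableDenseMatrix`. It is not ESPEI's implementation; in particular, it assumes that symengine matrices support `xreplace` with a `{Symbol: value}` dict and element access by `(row, column)` index.

```python
import numpy as np
import symengine
from symengine import Symbol

T, YS = Symbol("T"), Symbol("YS")

# Illustrative feature columns and sampled conditions (rows):
features = [YS, YS * T, YS * T * symengine.log(T)]
condition_dicts = [{T: 300.0, YS: 0.25}, {T: 600.0, YS: 0.25}, {T: 900.0, YS: 0.10}]

def build_feature_matrix(sample_condition_dicts, feature_exprs):
    """Fill an (observations x features) matrix of reals.

    The symbolic substitution for an entire row happens in a single SymEngine
    call on the row vector, instead of one Python-level call per feature.
    """
    row_vector = symengine.ImmutableDenseMatrix([feature_exprs])  # 1 x N
    A = np.empty((len(sample_condition_dicts), len(feature_exprs)))
    for i, conds in enumerate(sample_condition_dicts):
        # Assumption: xreplace takes a {Symbol: value} dict and returns a
        # matrix of numerically evaluated entries.
        substituted = row_vector.xreplace(conds)
        A[i, :] = [float(substituted[0, j]) for j in range(len(feature_exprs))]
    return A

print(build_feature_matrix(condition_dicts, features))
```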

Commits

…s successive

The goal here is to allow for models that don't want to use
`make_successive` for their feature sets.
Modifies ESPEI code and definitely breaks Gibbs energies; using this commit as a backup.
Factor out TDB analysis from notebook 2
Start moving get_data_quantities to a FittingStep staticmethod
Binary VM (VA) fitting looks like it works now!
The changes are mostly in adding parameters and not really the fitting itself - that was working.

This may be throwaway code (see the comment added), so it's not too complex.
The idea of the modified version is that we also compute the actual site
fractions, because individual site fractions are not currently handled by
ESPEI but can slip in from existing models when not using a reference
state where those contributions cancel (e.g. reference states other than
_MIX or _FORM keep the unary extrapolation). To do this, we'll use the
config tuple and create site fractions from the points dict (see the
hypothetical sketch below).

tests currently pass locally
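
As a purely hypothetical illustration of the "config tuple plus points dict" construction described in the commit above (the helper name, key format, and data layout are invented for this sketch and are not ESPEI's API):

```python
def site_fractions_from_config(phase_name, config, occupancies):
    """Turn a sublattice configuration and its occupancies into a flat
    site-fraction mapping, e.g. (("AL", "CR"), "VA") with ((0.5, 0.5), 1.0).

    Hypothetical helper: the key format and shorthand handling are illustrative only.
    """
    site_fracs = {}
    for subl_idx, (constituents, fracs) in enumerate(zip(config, occupancies)):
        # Datasets often use a bare string / scalar as shorthand for a pure sublattice.
        if isinstance(constituents, str):
            constituents = (constituents,)
            fracs = (fracs if isinstance(fracs, (int, float)) else 1.0,)
        for species, frac in zip(constituents, fracs):
            site_fracs[(phase_name, subl_idx, species)] = frac
    return site_fracs

print(site_fractions_from_config("BCC_A2", (("AL", "CR"), "VA"), ((0.5, 0.5), 1.0)))
# {('BCC_A2', 0, 'AL'): 0.5, ('BCC_A2', 0, 'CR'): 0.5, ('BCC_A2', 1, 'VA'): 1.0}
```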
fit_formation_energy -> fit_parameters
Pass through all the function indirection.
Add it to AbstractRKMPropertyStep
…tion.utils

from espei.parameter_selection.utils import _get_sample_condition_dicts

to

from espei.error_functions.non_equilibrium_thermochemical_error import get_sample_condition_dicts
VA is normalized per atom
This is useful for organizing datasets for different runs while having one single source of truth for the data
This algorithm has N*M complexity, which is an enormous simplification of
the more complex algorithm that converges to N^M complexity as N -> inf.
Before this change, even moderate N would cause _build_feature_matrix to
become the dominant time-limiting function in profiling.
@bocklund bocklund changed the title ENH: Performance improvements for generating candidates ENH: Performance improvements for generating candidate models Jan 17, 2024
@bocklund (Member Author) commented:

Note that the performance issues resolved here are mostly due to a combination of the number of candidate models and the amount of data.

Generating the candidate models has an up-front cost, but the result is cached, so it's not overly expensive. The main contributor is that, with many candidate models and a lot of data, most of the time ends up being spent in _build_feature_matrix. It's also not entirely clear that the existing, more general approach actually generates better models, i.e. ones that are actually selected by the mAICc: in my experience, I haven't often seen selected models with different features chosen across different interaction parameter orders for the same interaction.
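
For reference, the corrected AIC that the mAICc mentioned above is based on, for a model with $k$ parameters, $n$ data points, and maximized likelihood $\hat{L}$ (ESPEI's modification of the penalty is documented in ESPEI itself and not reproduced here):

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$

The penalty grows with the number of features, which is consistent with the observation above that the larger, more general candidate pool does not obviously lead to different selected models.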

@bocklund bocklund merged commit 5f6ff36 into PhasesResearchLab:master Jan 17, 2024
7 of 11 checks passed
@bocklund bocklund deleted the performance-improvements-generating-candidates branch January 17, 2024 22:00