
[ENH] Neat and automated transfer learning with OPTIMADE API for auto-adjusted problem-specific ML model generation on the fly #16

Merged: 50 commits into main on Apr 4, 2024

Conversation

amkrajewski (Contributor)

As the title says, this addition to the core pySIPFENN functionality connects it to the OPTIMADE API, enabling rapid adjustment of the models to any specific dataset described by one or more OPTIMADE queries. Most of the functions are neatly hidden behind a high-level API, and the default values should work well for datasets of between 100 and 10,000 datapoints.

You can now simply:

from pysipfenn import Calculator, OPTIMADEAdjuster
c = Calculator(autoLoad=False)              # do not auto-load all available models
c.loadModels("SIPFENN_Krajewski2022_NN30")  # load only the model to be adjusted
ma = OPTIMADEAdjuster(c, "SIPFENN_Krajewski2022_NN30", useClearML=True, device='mps')  # MPS is the Apple M1 GPU

ma.fetchAndFeturize(
    'elements HAS "Hf" AND elements HAS "Mo" AND NOT elements HAS ANY "O","C","F","Cl","S"',
    parallelWorkers=4)  # fetch matching structures from OPTIMADE and featurize them in parallel
ma.adjust()             # fine-tune the model on the fetched data

ma.plotStarting() # See the starting performance
ma.plotAdjusted() # See the adjusted performance

or, to perform a hyperparameter search first, replace the ma.adjust() call with:

ma.matrixHyperParameterSearch()  # search over a matrix of hyperparameter combinations
ma.adjust(learningRate=0.0001, optimizer='AdamW', weightDecay=1e-05, epochs=37)  # then adjust with the chosen hyperparameters

All model usage works as before through the Calculator class. Modifying the model or exporting it for later use is handled by the dedicated classes in the modelExporters submodule.
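For instance, a minimal sketch of exporting the adjusted model to ONNX, assuming an ONNXExporter class in modelExporters that wraps the Calculator and exports models by name (the exact class and method names here are assumptions, not confirmed by this PR):

from pysipfenn.core.modelExporters import ONNXExporter
exporter = ONNXExporter(c)                     # wraps the Calculator holding the adjusted model
exporter.export("SIPFENN_Krajewski2022_NN30")  # writes the model out to an ONNX file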

Commit messages (excerpts):

  • …; meant mostly for tuning to smaller datasets
  • … a parameter with documentation in `OPTIMADEAdjuster`
  • … now collecting names, target data, and featurizing the obtained structures
  • …adjustment, to display if a datapoint has been shown to the model at adjustment. Displayed when plotting unadjusted and adjusted models.
  • …sed right now but important for future methods including crystALL
amkrajewski (Contributor, Author)

Notes:

  1. It is feature-complete.
  2. I'm still working on the testing suite.
  3. @rdamaral will add a neat tutorial at a future date.

codecov bot commented Mar 30, 2024

Codecov Report

Attention: Patch coverage is 89.20188%, with 46 lines in your changes missing coverage. Please review.

Project coverage is 93.58%. Comparing base (74a31dd) to head (4c12c21).
Report is 4 commits behind head on main.

Files                              Patch %   Lines
pysipfenn/core/modelAdjusters.py   87.15%    46 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #16      +/-   ##
==========================================
- Coverage   94.84%   93.58%   -1.27%     
==========================================
  Files          17       19       +2     
  Lines        1999     2432     +433     
==========================================
+ Hits         1896     2276     +380     
- Misses        103      156      +53     


amkrajewski (Contributor, Author)

Hi @jwsiegel2510 and @rdamaral, everything is complete and the tests are passing. It's ready to be reviewed!

amkrajewski (Contributor, Author)

Hi @jwsiegel2510 and @rdamaral, I was hoping to merge it later today to align with the manuscript posting on arXiv.

rdamaral (Contributor) commented Apr 3, 2024

Hi Adam,

I've reviewed the documentation and tested the main functions, and they are working well. I also did not encounter any issues when installing this branch version in a new conda environment (Python 3.10).

Just a couple of comments:

  • Running the hyperparameter search step takes a long time on a CPU, so I couldn't finish testing it. As discussed, please consider lowering the default number of epochs.
  • When testing OPTIMADE providers:

    OQMD returned TypeError: can only concatenate str (not "int") to str
    JARVIS returned ValidationError: 1 validation error for StructureResource
    AFLOW returned Error: Provider ...: ('Connection aborted.', ...)

    PS: Both OQMD and JARVIS were run using the targetPath values mentioned in the documentation. Other than these, MP and Alexandria were also tested and did not raise any errors.

  • I also tried providing the wrong targetPath for a provider, and it raises ValueError: not enough values to unpack (expected 4, got 0). Do you think targetPath could be fetched automatically from each provider's endpoint (as sketched below) when defining the OPTIMADEAdjuster? I'm considering this because even the current default values may eventually break, for instance, if MP decides to change their endpoint (again). Another alternative would be to have targetPath default to () rather than MP's formation energy path.
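A rough sketch of what such automatic discovery could look like, querying the standard OPTIMADE /info/structures endpoint; the URL and the keyword matching are illustrative assumptions, not pySIPFENN code:

import requests

# OPTIMADE providers describe their entry fields at /v1/info/structures;
# provider-specific properties carry a leading underscore, so a target such
# as a formation energy could be located by keyword.
resp = requests.get("https://alexandria.icams.rub.de/pbesol/v1/info/structures")
properties = resp.json()["data"]["properties"]
candidates = [name for name in properties if "formation_energy" in name]
print(candidates)  # e.g., ['_alexandria_formation_energy_per_atom']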

amkrajewski (Contributor, Author) commented Apr 3, 2024

Hi @rdamaral! Thanks for the insightful comments :)

  1. The default number of epochs for fine-tuning was reduced to 20, with documentation discussing this choice and noting that on a GPU (even a laptop one) 100 may be preferred.
  2. The OQMD server is down, and JARVIS seems to have issues with filtering. I will ask about both at the developer meeting tomorrow.
  3. I've added a number of assertions that should catch unexpected user inputs and display useful messages about what went wrong. The property data paths are provider-specific and cannot be inferred a priori (a sketch of this kind of check follows below).
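For illustration only, a minimal sketch of the kind of assertion meant here; the variable handling is hypothetical and this is not the actual code in modelAdjusters.py:

# Hypothetical input-validation sketch (not the pySIPFENN source):
assert isinstance(targetPath, (list, tuple)) and len(targetPath) > 0, (
    "targetPath must be a non-empty list/tuple of keys pointing to the target "
    "property in the OPTIMADE response, e.g. "
    "['attributes', '_alexandria_formation_energy_per_atom']")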

amkrajewski (Contributor, Author)

I also added new functionality that allows you to override the provider and use a custom endpoint, e.g.:

ma = pysipfenn.OPTIMADEAdjuster(
    c,
    model="SIPFENN_Krajewski2022_NN30",
    endpointOverride=["https://alexandria.icams.rub.de/pbesol"],        # query this endpoint directly
    targetPath=['attributes', '_alexandria_formation_energy_per_atom']  # provider-specific path to the target property
)

ma.fetchAndFeturize(
    'elements HAS "Hf" AND elements HAS "Mo" AND elements HAS "Zr"',
    parallelWorkers=2
)

amkrajewski merged commit 0d60ef6 into main on Apr 4, 2024; 13 of 15 checks passed.
rdamaral (Contributor) commented Apr 4, 2024

Nice. The endpointOverride input is very interesting from the user’s perspective, especially in the event of changes or new additions to OPTIMADE. 👍

jwsiegel2510 (Collaborator) left a comment:

Looks good. It seems reasonable to me to assume that each task will take the same amount of time, since we are simply training the same network with different hyperparameters.

jwsiegel2510 (Collaborator) left a comment:

Looks good to me

jwsiegel2510 (Collaborator) left a comment:

Looks good to me, I guess the previous two commits were meant to be one.
