
Transformers, new features, transfer learning #424

Merged
mauvais2 merged 48 commits into 1.8.0 from feat_scaled_rdkit_mordred on Apr 22, 2026

@stewarthe6 (Collaborator) commented Feb 25, 2026

This is a large pull request with three new features and one removal:

  • Added functionality for transfer learning and for using a previously trained AMPL model as a feature encoder.
  • Two new feature sets that scale RDKit and Mordred features.
  • A new feature that lets you fit transforms on larger or unlabeled datasets and then use them.
  • Removed the deprecated UMAP feature transformer.
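
To illustrate the idea of fitting transforms on a larger or unlabeled dataset, here is a minimal sketch that uses scikit-learn directly rather than AMPL's own API. The arrays are synthetic stand-ins for descriptor matrices, and the variable names are assumptions made for illustration only; they do not come from this pull request.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)

# Hypothetical descriptor matrices: a large unlabeled pool and a small labeled set.
unlabeled_X = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 4))
labeled_X = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 4))

# Fit the scaling steps on the large pool only. Yeo-Johnson is used because
# RobustScaler output can be negative, which Box-Cox does not accept.
scaler = make_pipeline(RobustScaler(), PowerTransformer(method="yeo-johnson"))
scaler.fit(unlabeled_X)

# Reuse the fitted parameters on the labeled set without refitting.
labeled_scaled = scaler.transform(labeled_X)
print(labeled_scaled.shape)
```

Fitting on the larger pool gives the scaler a broader view of each feature's distribution than the labeled set alone would. How AMPL stores and reloads the fitted transformers is not shown here.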

stewarthe6 and others added 30 commits January 21, 2025 10:10
Ipc should not be changed to AvgIpc like this because it would break all rdkit_raw models.
…th RobustScaler and PowerTransformer. Updated documentation in related sections. Added functions to ModelFileReader to read out transformer specific parameters. Changed models that test RobustScaler and PowerTransformer to use RF to speed up the training
… it more generalizable. Fixed tests. Fixed bug where the imputer_strategy parameter was not used
…ndicator' flag because that changed the number of features and crashed.
…model, if transformers are saved and loaded correctly, and if transform_dataset_key_config is saved correctly
…r want to set that manually. Instead added a check when saving metadata to see if the parameters object has that attribute
…well as infill nan or extremely large values
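
One of the commits above mentions infilling NaN or extremely large values. A rough sketch of that kind of cleanup, assuming scikit-learn's SimpleImputer and a cutoff chosen purely for illustration, might look like the following. The function name `clean_features` and the `max_abs` threshold are hypothetical and are not taken from the AMPL code.

```python
import numpy as np
from sklearn.impute import SimpleImputer


def clean_features(X, max_abs=1e10, strategy="median"):
    """Replace non-finite or very large entries with NaN, then impute per column.

    max_abs is an assumed cutoff for illustration, not a value from AMPL.
    """
    X = np.array(X, dtype=float)  # np.array copies, so the caller's data is untouched
    X[~np.isfinite(X)] = np.nan   # inf and -inf become NaN
    X[np.abs(X) > max_abs] = np.nan  # NaN compares False, so existing NaNs are unaffected
    return SimpleImputer(strategy=strategy).fit_transform(X)


raw = [[1.0, 2.0], [np.nan, 1e300], [3.0, 4.0], [np.inf, 6.0]]
cleaned = clean_features(raw)
print(cleaned)
```

Note that SimpleImputer drops columns that are entirely NaN by default, so a real pipeline would need to guard against a change in the number of features.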
codecov Bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 98.30508% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| atomsci/ddm/pipeline/perf_plots.py | 93.61% | 6 Missing ⚠️ |

@@            Coverage Diff             @@
##            1.8.0     #424      +/-   ##
==========================================
+ Coverage   49.69%   51.61%   +1.91%     
==========================================
  Files          37       38       +1     
  Lines       11717    11982     +265     
==========================================
+ Hits         5823     6184     +361     
+ Misses       5894     5798      -96     

| Flag | Coverage Δ |
|---|---|
| unittests | 51.61% <98.30%> (+1.91%) ⬆️ |


| Files with missing lines | Coverage Δ |
|---|---|
| atomsci/ddm/pipeline/compare_models.py | 41.08% <ø> (-0.06%) ⬇️ |
| atomsci/ddm/pipeline/featurization.py | 67.59% <100.00%> (+3.15%) ⬆️ |
| atomsci/ddm/pipeline/model_datasets.py | 68.35% <100.00%> (+0.26%) ⬆️ |
| atomsci/ddm/pipeline/model_tracker.py | 17.45% <ø> (ø) |
| atomsci/ddm/pipeline/model_wrapper.py | 68.85% <100.00%> (+0.39%) ⬆️ |
| atomsci/ddm/pipeline/parameter_parser.py | 92.60% <100.00%> (+0.07%) ⬆️ |
| atomsci/ddm/pipeline/transformations.py | 70.69% <100.00%> (+12.82%) ⬆️ |
| atomsci/ddm/utils/generate_transformers.py | 100.00% <100.00%> (ø) |
| atomsci/ddm/utils/hyperparam_search_wrapper.py | 29.30% <ø> (ø) |
| atomsci/ddm/utils/model_file_reader.py | 70.27% <100.00%> (+4.11%) ⬆️ |

... and 1 more

... and 2 files with indirect coverage changes


…re_transformers when transformers is None. This does not test the pipeline with no transformers, just that the function returns correctly
- Tests that the heavyatom_col parameter is used correctly, including cases where there is no heavyatom_col.
- Tests that the NotImplementedError is raised correctly when there is no feature count or if there is no way to featurize data.
- Tests that the Identity feature transforms are returned correctly and that an error is raised if an unrecognized feature transform is used.
…k and multitask models are trained using the same dataset, the scaled_descriptors copy of the featurized file will only contain response_columns for the single task models, and not all columns for the multitask models. This does not cause an issue during training, but when making predictions, the '_actual' columns won't exist. This causes the function to crash. This patch looks in the original dataset_key csv and finds the response columns and merges them into the scaled_descriptors file.
…t. Without this step, PowerTransformer fails.
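
The commit above about response columns missing from the scaled_descriptors file describes looking up the response columns in the original dataset CSV and merging them in. A minimal pandas sketch of that idea follows. The function name, the `compound_id` join column, and the example column names are all assumptions made for illustration, not the names used in AMPL.

```python
import pandas as pd


def add_missing_response_cols(scaled_df, dataset_df, response_cols, id_col="compound_id"):
    """Merge response columns absent from scaled_df in from dataset_df.

    id_col is a hypothetical join key; use whatever ID column the dataset has.
    """
    missing = [c for c in response_cols if c not in scaled_df.columns]
    if not missing:
        return scaled_df
    # A left join keeps every featurized row even if the original file lacks it.
    return scaled_df.merge(dataset_df[[id_col] + missing], on=id_col, how="left")


# Original dataset with both tasks (hypothetical values).
dataset_df = pd.DataFrame(
    {"compound_id": ["a", "b", "c"], "task1": [1.0, 2.0, 3.0], "task2": [0.1, 0.2, 0.3]}
)
# Featurized copy written by a single-task run: only task1 is present.
scaled_df = pd.DataFrame(
    {"compound_id": ["a", "b", "c"], "desc_0": [0.5, -0.2, 1.1], "task1": [1.0, 2.0, 3.0]}
)

merged = add_missing_response_cols(scaled_df, dataset_df, ["task1", "task2"])
print(list(merged.columns))
```

If the original dataset contains duplicate IDs, a merge like this would duplicate rows, so a real implementation would need to handle that case.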
@mauvais2 mauvais2 merged commit 5eef520 into 1.8.0 Apr 22, 2026
8 checks passed
2 participants