Bayer-Group/MotherML


Mother-ML

An ML framework that takes care.

Mother is a machine-learning framework for predicting properties of chemical molecules. Its major features are:

  • 🔬 SMILES preprocessing
  • 💾 Generation of feature vectors from molecules
  • 📈 Grouping and cross-validation based on chemical similarity
  • 💻 Model training: standard CatBoost models and feature selection methods
  • 🚴 Training, cross-validation, and hyperparameter optimization of machine-learning models
  • 🌀 Handling gene expression data from transcriptomics experiments, including different normalisation techniques
  • ✨ Explainability analysis with SHAP (not yet supported; will be added in a later release)
  • 🚙 Generative chemistry (not yet supported)

Mother provides methods for each of these steps as sklearn transformer objects, so every method is easily accessible and usable in a modular way. The methods can be combined into ML workflows with sklearn pipelines, column transformers, and feature unions.

Every method can be used as an sklearn transformer or estimator, so combining it with other methods or with your own models (e.g. using Mother's preprocessing with a different model) is straightforward. For maximum compatibility, every transformer can be constructed from a dictionary containing the required parameters. For convenience, Mother also provides a settings class, MotherSettings, which can store all the relevant settings for your ML project.
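As a generic illustration of this dict-based construction pattern (using a stock sklearn transformer as a stand-in, since mother's own parameter names are not listed in this README):

```python
from sklearn.preprocessing import StandardScaler

# Parameters kept in a plain dict, e.g. assembled from a project settings object.
params = {"with_mean": True, "with_std": False}
scaler = StandardScaler(**params)

# The estimator reports the parameters it was constructed with.
assert scaler.get_params()["with_std"] is False
```

The same `SomeTransformer(**params_dict)` call works for any sklearn-compatible transformer.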

Usage

A basic example can be found in the example regression notebook. Other examples are in the examples folder.

🔬 SMILES preprocessing and mol-object generation

SMILES preprocessing is done with the StandardizerTransformer class. Combined with SmilesToMolTransformer, it forms a pipeline from SMILES strings to rdkit mol objects:

import pandas as pd
from sklearn import pipeline as sklearn_pipeline
# StandardizerTransformer and SmilesToMolTransformer are imported from mother
# (module path omitted in this README)

preprocessor: sklearn_pipeline.Pipeline = sklearn_pipeline.Pipeline(
    [
        (
            "smiles_standardizer",
            StandardizerTransformer(flags=["STANDARDIZE", "DESALT", "NEUTRALIZE"]),
        ),
        ("smiles_to_mol", SmilesToMolTransformer()),
        # Add other column transformations here if needed
    ],
    memory=None,
).set_output(transform="pandas")

mol_data: pd.DataFrame = preprocessor.fit_transform(structure_data)

Customize by changing the flags attribute.

💾 Feature Generation

Mother provides three types of feature generators: MaccsFingerprints, MorganFingerprints, and ChemicalDescriptors:

import pandas as pd
from sklearn import pipeline as sklearn_pipeline

feature_generator = sklearn_pipeline.FeatureUnion(
    transformer_list=[
        ("maccs", MaccsFingerprints()),
        ("morgan", MorganFingerprints()),
        ("desc", ChemicalDescriptors()),
    ],
).set_output(transform="pandas")

features: pd.DataFrame = feature_generator.fit_transform(mol_data["Molecule"])

The FeatureUnion class combines the feature generators; each generator can be configured individually.

📈 Grouping and Cross-Validation

For cross-validation or test-set selection based on chemical similarity, mother provides a transformer class for generating groups (TanimotoGroupingFromMols):

groups_engine = cv_module.TanimotoGroupingFromMols(similarity_threshold=0.3)

groups: pd.DataFrame = groups_engine.set_output(transform="pandas").fit_transform(mol_data)

These groups can be used, e.g., with the GroupKFold class from the sklearn.model_selection module:

from sklearn.model_selection import GroupKFold

cv = GroupKFold(n_splits=5)
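A minimal, self-contained illustration of how GroupKFold keeps whole similarity groups together (toy data, not mother's API):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)  # six toy compounds, two features each
y = np.arange(6)
groups = np.array([0, 0, 1, 1, 2, 2])  # three similarity groups

cv = GroupKFold(n_splits=3)
splits = list(cv.split(X, y, groups=groups))
for train_idx, test_idx in splits:
    # A similarity group never appears in both train and test,
    # which prevents leakage between near-identical compounds.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```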

💻 Model Training

The standard model setup of Mother consists of a feature selection step and a classification or regression model, both based on CatBoost. The standard setup for a regression task would be:

import mother.pipeline_utils as mother_takes_care
# `ml` below refers to mother's model module; `categorical_features` is the list
# of categorical feature names in your data

model_settings = {
    "feature_selection_flags": ["DROP_CORRELATED", "DROP_CONSTANT", "DROP_DUPLICATES", "DROP_UNIMPORTANT"],
    "feature_selection_threshold": 1e-5,
    "correlation_threshold": 0.9,
    "algorithm": "catboost",
    "feature_selection_type": "catboost",
    "type": "regression",
    "target_type": "single_target",
}
pipeline_settings = {
    "remainder": "drop" if len(categorical_features) == 0 else "passthrough",
    "verbose_feature_names_out": False,
}
model = ml.PipelineWithHyperparameterRooting(
    [
        (
            "feature_selector",
            mother_takes_care.get_feature_selection_pipeline(
                settings=model_settings, pipeline_settings=pipeline_settings,
                cv=cv
            ).set_output(transform="pandas"),
        ),
        ("ml_model", ml.CatboostRegressorMother(target_type="single_target", logging_level="Silent")),
    ]
)

Here, we use the extended sklearn pipeline PipelineWithHyperparameterRooting, which adds methods for hyperparameter tuning.

Without feature selection, this simplifies to:

model = ml.CatboostRegressorMother(target_type="single_target", logging_level="Silent")

Any other sklearn model, or your own model, can be used instead of CatboostRegressorMother. An example of adding a custom preprocessing step to the model can be found in the example notebook on custom preprocessing.
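For instance, a stock sklearn regressor (RandomForestRegressor here, purely as an illustration) can take the "ml_model" slot in an ordinary sklearn pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Any sklearn-compatible estimator can stand in for CatboostRegressorMother.
model = Pipeline([("ml_model", RandomForestRegressor(n_estimators=50, random_state=0))])

# Toy data in place of the mother-generated feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)

model.fit(X, y)
preds = model.predict(X)
```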

Cross-validation

Since all components are sklearn pipelines, estimators, or transformers, the standard sklearn methods, e.g. cross_validate, work out of the box:

cross_validate(model, features, targets, groups=groups, cv=cv, n_jobs=10)

A more convenient method is provided by mother; it gives you additional output regarding CV and groups:

import mother.pipeline_utils as mother_takes_care
mother_takes_care.mother_cv(estimator=model, X=features, y=data["target"], cv=cv)
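mother_cv's extra output is not documented in this README, but the plain-sklearn path above is fully reproducible; a self-contained sketch with toy data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)
groups = np.repeat(np.arange(6), 5)  # six similarity groups of five compounds

scores = cross_validate(Ridge(), X, y, groups=groups, cv=GroupKFold(n_splits=3))
# scores["test_score"] holds one score per fold
```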

🚴 Hyperparameter Optimization

The Mother object MotherTuner uses optuna to optimize hyperparameters:

# `opt` refers to mother's hyperparameter-optimization module
tuner = opt.MotherTuner(
    scorer="r2",
    n_threads_optuna=10,  # parallel threads for cross-validation evaluation
)

model_tuned = tuner.optimize(
    model,
    features,
    targets,
    cv,
    groups=groups.values,
)

The method model.get_hyperparameter_space returns the hyperparameter space for the model. For the default CatBoost model and the PipelineWithHyperparameterRooting class, it is already implemented.

For examples of how to customize the hyperparameter optimization or define hyperparameters for your own models, see the example notebook.

🌀 Handling Gene Expression Data from Transcriptomics Experiments

The RNA processing pipeline is implemented in the RNA class, which incorporates various preprocessing steps tailored for RNA sequencing data. All RNA code can be found in the rna.py file.

The pipeline includes normalization, feature selection, and discretization, leveraging the scikit-learn framework. The available normalization methods are "Scanpy", "UQ", "CUF", and "CPM". You can customise the pipeline to your needs, or try different normalisation methods and bin sizes during hyperparameter tuning. The pipeline can be fitted and re-applied to avoid data leakage during normalisation.

Here's how to set up and use the RNA processing pipeline:

import pandas as pd
from mother.ml.rna import RNA
from sklearn.pipeline import Pipeline

rna_pipeline: Pipeline = RNA(
    n_features=None,  # number of features (genes) to keep; None keeps all genes with non-zero importance
    n_bins=20,  # number of bins for discretising the target variable
    normalisation_method="Scanpy",  # which normalisation to use
)._build_pipeline()

# Fit the pipeline to your RNA sequencing data
transformed_train_data: pd.DataFrame = rna_pipeline.fit_transform(rna_data_train)
transformed_test_data: pd.DataFrame = rna_pipeline.transform(rna_data_test)

A complete walkthrough of the RNA functionality can be found in the example notebook.
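This README does not define the normalisation schemes; as background, "CPM" (counts per million), the simplest of the listed methods, can be sketched as:

```python
import numpy as np
import pandas as pd

def cpm(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (row) so its total count is one million."""
    return counts.div(counts.sum(axis=1), axis=0) * 1e6

counts = pd.DataFrame({"geneA": [10, 100], "geneB": [90, 900]})
normalised = cpm(counts)  # every row now sums to 1e6
```

This removes per-sample sequencing-depth differences so expression levels are comparable across samples.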

Install

uv add mother-ml

Optional Features and Extras

To keep the package size small, some dependencies are added as optional extras. These extras provide additional functionality for specific use cases:

| Extra      | Description                       | Key Packages                  | Notes                                  |
|------------|-----------------------------------|-------------------------------|----------------------------------------|
| all        | All optional features             | All packages below            | Installs everything                    |
| report     | Visualization and reporting tools | plotly, kaleido               | For generating plots and reports       |
| rna        | RNA sequence analysis             | rnalib                        | RNA-specific preprocessing             |
| torch      | PyTorch neural network support    | torch, pytorch-tabular        | Adds ~3GB to environment size!         |
| tabpfn     | TabPFN model support              | tabpfn                        | Prior-fitted networks for tabular data |
| clustering | Chemical compound clustering      | mol2vec, cluster-my-molecules | For molecular clustering analysis      |

Installation Examples

Using pip:

# Install with report generation support
pip install 'mother[report]'

# Install with PyTorch support (adds ~3GB!)
pip install 'mother[torch]'

# Install multiple extras
pip install 'mother[report,torch,tabpfn]'

Using uv:

# Install with specific extras
uv add mother --extra report --extra torch

Note: There is also a different mother package on PyPI. Be sure to install mother-ml.

Acknowledgements

Thank you to the following contributors:

  • Thomas Wolf
  • Lukas Hebing
  • Kai Sommer

and all the others.
