GitHub - cparrarojas/geographical-chemotypes: Classification of bacterial metabolite data obtained from soil samples as belonging to Photorhabdus or Xenorhabdus

Geographical chemotypes

This repository contains the code accompanying the paper "Focused natural product elucidation by prioritizing high-throughput metabolomic studies with machine learning" by Tobias et al.

We train a gradient boosting model to classify bacterial metabolite data obtained from soil samples as belonging to either Photorhabdus or Xenorhabdus. Combining the model with a recently developed feature attribution approach, we determine the most relevant features that set these two genera apart.
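The idea behind feature attribution can be illustrated with exact Shapley values on a tiny model. This is only a minimal sketch: it is not the algorithm used by the shap library or by the notebooks, the scoring function `toy_model` and its inputs are hypothetical, and replacing "missing" features with a baseline value is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features absent from a coalition take their baseline value
    (a simplifying assumption made for this sketch).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(features):
    # Hypothetical score built from three feature values,
    # with an interaction between the first and the third.
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * c


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions add up to the difference between the model output for `x` and for `baseline`, and the interaction term is split evenly between the two features involved. The exact computation grows exponentially with the number of features, so it is only practical for toy cases like this one.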

We find that the soil environment, as captured by the geographical metadata, does not affect the classification of the two genera, highlighting the essential role played by their respective host nematodes in providing an environment for them. We further identify a family of compounds that can, by themselves, predict the bacterial genus in unseen data with over 90% accuracy. This is important because new data will not necessarily contain values for all the features used in a full model, so a peak-matching procedure must be carried out to find the compounds that are present in both the original and new datasets. One of the compounds in this family has been isolated and purified; its chemical structure is shown in a figure in the original repository.
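The peak-matching step followed by a single-feature prediction could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the function names `closest_peak` and `predict_genus`, the peak identifiers, the tolerances, the intensity threshold, and which genus sits above the threshold. None of these come from the paper or the notebooks.

```python
def closest_peak(target, candidates, rt_tol=0.2, mz_tol=0.01):
    """Return the candidate peak nearest to target, or None.

    Peaks are dicts with 'rt' (retention time) and 'mz' (m/z ratio).
    A candidate is only eligible if it lies within both tolerances;
    the tolerance values are illustrative assumptions.
    """
    best, best_dist = None, float("inf")
    for peak in candidates:
        d_rt = abs(peak["rt"] - target["rt"]) / rt_tol
        d_mz = abs(peak["mz"] - target["mz"]) / mz_tol
        if d_rt > 1 or d_mz > 1:
            continue  # outside the matching window
        dist = max(d_rt, d_mz)
        if dist < best_dist:
            best, best_dist = peak, dist
    return best


def predict_genus(sample, peak, threshold, above, below):
    """Single-feature rule: compare one matched peak's value to a threshold."""
    value = sample.get(peak["id"], 0.0)
    return above if value > threshold else below


reference_peak = {"id": "ref_1", "rt": 5.30, "mz": 512.340}
new_peaks = [
    {"id": "new_1", "rt": 5.90, "mz": 512.341},  # m/z fits, retention time too far
    {"id": "new_2", "rt": 5.35, "mz": 512.343},  # within both tolerances
    {"id": "new_3", "rt": 5.31, "mz": 498.100},  # retention time fits, m/z too far
]
match = closest_peak(reference_peak, new_peaks)

new_sample = {"new_2": 1500.0}  # hypothetical value for the matched peak
prediction = None
if match is not None:
    # The label order is an arbitrary choice here, not a result from the paper.
    prediction = predict_genus(
        new_sample, match, threshold=1000.0,
        above="Photorhabdus", below="Xenorhabdus",
    )
```

Returning `None` when no candidate falls inside the window makes the "compound absent from the new dataset" case explicit, so no prediction is made from an unrelated peak.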

Contents

code contains Jupyter notebooks that take you through the steps of model construction and the results obtained. In sequential order:
1. data-cleaning.ipynb shows the removal of blanks from the raw data and the separation into intensity and AUC (and zeroed AUC) datasets, plus the tidying up of the metadata.
2. model-features.ipynb shows additional data preprocessing steps, and the construction of the different datasets used for training by removing zero- and near-zero-variance features from the data, as well as clustering highly-correlated features.
3. cross-val.ipynb shows the performance of the classifier on the different datasets, and the determination of the most relevant predictors.
4. test-pred.ipynb contains the predictions of the single-feature models for the test data, after determining the closest match in metabolite properties.
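As an illustration of the filtering mentioned in step 2, the sketch below drops zero- and near-zero-variance features from a small table. It assumes a plain variance cut-off; the notebooks may use a different criterion, and the function name `drop_low_variance`, the feature names, and the threshold are hypothetical.

```python
from statistics import pvariance


def drop_low_variance(table, min_variance=1e-8):
    """Keep only features whose population variance exceeds min_variance.

    table maps a feature name to a list of values, one per sample.
    """
    return {
        name: values
        for name, values in table.items()
        if pvariance(values) > min_variance
    }


features = {
    "peak_a": [0.0, 0.0, 0.0, 0.0],      # zero variance
    "peak_b": [1.0, 1.0, 1.0, 1.0001],   # near-zero variance
    "peak_c": [0.2, 3.1, 0.0, 5.4],      # varies across samples
}
kept = drop_low_variance(features, min_variance=1e-6)
```

Only `peak_c` survives here. Features that barely change across samples cannot help separate the two genera, so removing them early reduces the number of candidates that later steps have to consider.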

data contains both the raw data and the files generated by the notebooks. Due to GitHub restrictions on file size, only some of these files can be displayed directly in the repository tree. The full data directory can be downloaded from here.

Clustered metabolite data

An interactive visualisation of the clustered metabolites can be found here. Hovering over a node shows the retention time and m/z ratio of the corresponding metabolite.

Change the correlation threshold to visualise the different sets of clusters.

Note: Only clusters with 250 or fewer members are displayed.
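One way to picture how the clusters depend on the correlation threshold is to link every pair of features whose correlation reaches the threshold and take the connected groups. The sketch below does this with a small union-find structure. It is an assumption about the general approach, not the method used for the visualisation; the use of Pearson correlation, the function names, and the example data are hypothetical, and `max_size` mirrors the display limit of 250 members.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_clusters(table, threshold, max_size=250):
    """Group features linked by correlations at or above threshold.

    Clusters with more than max_size members are left out of the result.
    """
    parent = {name: name for name in table}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(table, 2):
        if pearson(table[a], table[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in table:
        clusters.setdefault(find(name), []).append(name)
    return [sorted(c) for c in clusters.values() if len(c) <= max_size]


table = {
    "m1": [1.0, 2.0, 3.0, 4.0],
    "m2": [2.1, 3.9, 6.2, 8.0],  # almost proportional to m1
    "m3": [5.0, 1.0, 4.0, 2.0],  # unrelated pattern
}
loose = correlation_clusters(table, threshold=0.95)
strict = correlation_clusters(table, threshold=0.9999)
```

With the lower threshold, `m1` and `m2` end up in the same cluster; with the stricter one, every feature forms its own cluster. Raising the threshold can only split clusters, never merge them.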

Resources

LightGBM: Light Gradient Boosting Machine.
shap: A unified approach to explain the output of any machine learning model.
