## Use NPClassScore on a local dataset
In this notebook we give a demonstration of how to run NPClassScore on a local dataset with the python version of NPLinker. To use this notebook on your own data, download the NPLinker repo to your machine and run this notebook.

In [27]:
# import required packages
import os
import sys
import glob
# if running from clone of the git repo - otherwise let it point to the src directory within the nplinker repo
sys.path.append('../../src')

# import the main NPLinker class. normally this all that's required to work
# with NPLinker in a notebook environment
from nplinker.nplinker import NPLinker
from nplinker.nplinker import Spectrum  # to be able to separate molfams and spectrums from each other in results

Here, we are using the Streptomyces/Salinispora dataset as described in the NPClassScore manuscript. Replace the entry for 'root' by the path to your own dataset. See the NPLinker wiki for instructions on how to prepare your own dataset for analysis with NPLinker.

It is also possible to use an accession from the PoDP as input for 'root' ('root': 'MSV000084950'), which will automatically download data for that accession.


Note that this python version of NPLinker cannot run BiG-SCAPE (yet) so either run BiG-SCAPE seperately or first run the docker version of NPLinker that can run all steps automatically, and then return to this notebook. The python version will look for SIRIUS (CANOPUS) on your system to see if it is installed and able to run it. MolNetEnhancer still has to be run on the GNPS platform en results have to be downloaded into the local dataset's directory and stored in a directory called molnetenhancer.

Also note that loading the Streptomyces/Salinispora dataset results in some unknown strains: these are strains present in the molecular network that we could not tie to one of the strains in our version of the data.

In [2]:
# load your local dataset
npl = NPLinker({'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/'},
               'docker': {'run_canopus': True, 'extra_canopus_parameters': '--maxmz 850 formula zodiac structure canopus'}})
npl.load_data()

15:17:11 [INFO] config.py:121, Loading from local data in directory /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/
15:17:11 [INFO] loader.py:84, Trying to discover correct bigscape directory under /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/bigscape
15:17:11 [INFO] loader.py:87, Found network files directory: /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/bigscape/network_files/2021-12-02_16-48-06_hybrids_glocal
15:17:11 [INFO] loader.py:226, Updating bigscape_dir to discovered location /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/bigscape/network_files/2021-12-02_16-48-06_hybrids_glocal
15:17:11 [INFO] loader.py:647, Loaded global strain IDs (0 total)
15:17:11 [INFO] loader.py:658, Loaded dataset strain IDs (159 total)
15:17:16 [INFO] metabolomics.py:699, 13667 molecules parsed from MGF file
15:17:17 [INFO] metabolomics.py:716, Found older-style GNP

True

In [3]:
# Basic functionality
# ===================
#
# Once you have an NPLinker object with all data loaded, there are a collection of simple
# methods and properties you can use to access objects and metadata. Some examples are 
# given below, see https://nplinker.readthedocs.io/en/latest/ for a complete API description.

# configuration/dataset metadata
# - a copy of the configuration as parsed from the .toml file (dict)
print(npl.config) 
# - the path to the directory where various nplinker data files are located (e.g. the 
#   default configuration file template) (str)
print(npl.data_dir)
# - a dataset ID, derived from the path for local datasets or the paired platform ID
#   for datasets loaded from that source (str)
print(npl.dataset_id)
# - the root directory for the current dataset (str)
print(npl.root_dir)

# objects
# - you can directly access lists of each of the 4 object types:
print('BGCs:', len(npl.bgcs))
print('GCFs:', len(npl.gcfs)) # contains GCF objects
print('Spectra:', len(npl.spectra)) # contains Spectrum objects
print('Molecular Families:', len(npl.molfams)) # contains MolecularFamily objects

{'loglevel': 'INFO', 'logfile': '', 'log_to_stdout': True, 'repro_file': '', 'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/', 'overrides': {}, 'platform_id': ''}, 'antismash': {'antismash_format': 'default', 'ignore_spaces': False}, 'docker': {'run_bigscape': True, 'extra_bigscape_parameters': '', 'run_canopus': True, 'extra_canopus_parameters': '--maxmz 850 formula zodiac structure canopus'}, 'webapp': {'tables_metcalf_threshold': 2.0}, 'scoring': {'rosetta': {}}}
../../src/nplinker/data

/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/
BGCs: 5869
GCFs: 1581
Spectra: 13667
Molecular Families: 8346


### Run NPClassScore on data
Here, we run NPClassScore scoring on the data. This is probably not a very real example as NPClassScore will e.a. link all polyketide GCFs and MS/MS spectra present in the same strains. Instead it is more useful to use it together with a co-occurrence-based score (Metcalf) or other feature-based scores like Rosetta (see further down).

We define the npcl variable as an instance of the class that implements NPClassScore scoring. Once
you have such an instance, you may change any of the parameters it exposes.
In the case of NPClassScore scoring, the following parameters are currently exposed:
- cutoff (float): the scoring threshold, default 0.25. Links with scores less than this are excluded
- method (str): the chemical class prediction tool that is used, default is mix. Choose from .method_options:
  - mix - use all tools (first canopus then molnetenhancer)
  - main - use main method (canopus),
  - canopus - use canopus, molnetenhancer
  - use molnetenhancer
- filter_missing_scores (bool): filter out spectra without a score due to missing spectrum classes, default is False.

Less important parameters:
- equal_targets (bool): targets are on equal level, default is False. I.e. if input object is GCF, target is spectra and not MFs.
- both_targets (bool): take both targets from the other side, default is False. I.e. if input object is GCF, target both spectra and MF as targets.
- num_results (int): how many scores do you want to show for each link. Default is 1 showing only NPClassScore (the best) score.

In [4]:
# Use NPClassScore alone
npcl = npl.scoring_method('npclassscore')  # provide the name of the scoring method to get an instance of that method.

npcl.cutoff = 0.25
npcl.filter_missing_scores = True

results = npl.get_links(npl.gcfs, npcl, and_mode=True)

# get_links returns an instance of a class called LinkCollection. This provides a wrapper
# around the results of the scoring operation and has various useful properties/methods:
#
# - len(results) or .source_count will tell you how many of the input_objects were found to have links
print('Number of results: {}'.format(len(results)))
# - .sources is a list of those objects
objects_with_links = results.sources
# - .links is a dict with structure {input_object: {linked_object: ObjectLink}} 
objects_and_link_info = results.links
# - .get_all_targets() will return a flat list of *all* the linked objects (for all sources)
all_targets = results.get_all_targets() 
# - .methods is a list of the scoring methods passed to get_links
print(results.methods)

15:22:15 [INFO] methods.py:970, Set up NPClassScore scoring
15:22:15 [INFO] methods.py:972, Please choose one of the methods from ['mix', 'main', 'canopus', 'molnetenhancer']
15:22:15 [INFO] methods.py:978, Currently the method 'mix' is selected
15:22:15 [INFO] methods.py:984, Running NPClassScore...
15:22:15 [INFO] methods.py:998, Using Metcalf scoring to get shared strains
15:22:15 [INFO] methods.py:459, MetcalfScoring.setup (bgcs=5869, gcfs=1581, spectra=13667, molfams=8346, strains=154)
15:22:16 [INFO] methods.py:499, MetcalfScoring.setup completed
15:23:30 [INFO] methods.py:1005, Calculating NPClassScore for 1581 objects to 13667 targets (1784369 pairwise interactions that share at least 1 strain). This might take a while.
15:27:11 [INFO] methods.py:1054, NPClassScore completed in 295.9s
Number of results: 1581
{<nplinker.scoring.methods.NPClassScoring object at 0x7f8fbedba7f0>}


In [5]:
# show the result for one of the objects - in this case a GCF encoding staurosporine
obj = npl.gcfs[534]

result = results.links[obj]
print('Results for object: {}, {} total links, {} methods used\n'.format(obj, len(result), results.method_count))
sorted_links = results.get_sorted_links(npcl, obj)
link_data = sorted_links[0]
print("ObjectLink: ", link_data)
print('--> [{}] {} | {} | shared strains = {}'.format(','.join(method.name for method in link_data.methods),
                                                                 link_data.target,
                                                                 npcl.format_data(link_data[npcl]),
                                                                 len(link_data.shared_strains)))
print("   unfiltered direct result from NPClassScore:", link_data[npcl])

Results for object: GCF(id=534, class=Others, gcf_id=511, strains=54), 644 total links, 1 methods used

ObjectLink:  ObjectLink(source=GCF(id=534, class=Others, gcf_id=511, strains=54), target=Spectrum(id=88, spectrum_id=424, strains=2), #methods=1)
--> [npclassscore] Spectrum(id=88, spectrum_id=424, strains=2) | 0.781 | shared strains = 1
   unfiltered direct result from NPClassScore: [(0.780952380952381, 'as_classes', 'cf_superclass', 'indole', 'Organoheterocyclic compounds')]


### Run NPClassScore and Metcalf scoring
Here, we use NPClassScore in combination with standardised Metcalf scoring. This is the real scenario which we also describe in the manuscript; co-occurrence based scoring (Metcalf) to find candidate links and NPClassScore to remove unlikely candidates from this list.

The and_mode is important here; and_mode=True means that links are only kept when it passes the threshold for both methods.

In [6]:
# Initialise metcalf scoring the same way
mc = npl.scoring_method('metcalf')
mc.cutoff = 2.5
mc.standardised = True

# Now only links are kept that pass the cutoff for both methods
results_both = npl.get_links(npl.gcfs, [mc, npcl], and_mode=True)

print('Number of results for Metcalf and NPClassScore scoring: {}'.format(len(results_both)))
print(results_both.methods)

15:29:01 [INFO] methods.py:984, Running NPClassScore...
15:29:01 [INFO] methods.py:998, Using Metcalf scoring to get shared strains
15:30:22 [INFO] methods.py:1005, Calculating NPClassScore for 1581 objects to 13667 targets (1784369 pairwise interactions that share at least 1 strain). This might take a while.
15:36:38 [INFO] methods.py:1054, NPClassScore completed in 457.5s
Number of results for Metcalf and NPClassScore scoring: 1574
{<nplinker.scoring.methods.MetcalfScoring object at 0x7f8f42c556a0>, <nplinker.scoring.methods.NPClassScoring object at 0x7f8fbedba7f0>}


In [12]:
# use same obj as before to show results
print('Results for object: {}, {} total links, {} methods used'.format(
    obj, len(results_both.links.get(obj)), results_both.method_count))

# sort results based on metcalf scoring
sorted_links_both = results_both.get_sorted_links(mc, obj)
i = 0  # keep track of (spectrum) results
for both_link_data in sorted_links_both:
        if isinstance(both_link_data.target, Spectrum):
            print('{}.  --> [{}] {} | mc:{} npcl:{} | shared strains = {}'.format(
                i,
                ','.join(method.name for method in both_link_data.methods),
                both_link_data.target,
                mc.format_data(both_link_data[mc]),
                npcl.format_data(both_link_data[npcl]),
                len(both_link_data.shared_strains)))
            if both_link_data.target.gnps_annotations:
                comp_name = both_link_data.target.gnps_annotations.get("Compound_Name")
                print('Library match:', comp_name)
            print('Precursor_mz:', link_data.target.precursor_mz)
            print("   unfiltered results:", both_link_data[mc], both_link_data[npcl])
            i+=1

Results for object: GCF(id=534, class=Others, gcf_id=511, strains=54), 21 total links, 2 methods used
0.  --> [metcalf,npclassscore] Spectrum(id=3632, spectrum_id=89513, strains=67) | mc:8.9996 npcl:0.781 | shared strains = 50
Library match: 7-OH-staurosporine
Precursor_mz: 400.39001
   unfiltered results: 8.99963318035332 [(0.780952380952381, 'as_classes', 'cf_superclass', 'indole', 'Organoheterocyclic compounds')]
1.  --> [metcalf,npclassscore] Spectrum(id=4070, spectrum_id=95003, strains=21) | mc:4.7266 npcl:0.702 | shared strains = 17
Precursor_mz: 400.39001
   unfiltered results: 4.726582782023565 [(0.7021276595744681, 'as_classes', 'npc_pathway', 'indole', 'Alkaloids')]
2.  --> [metcalf,npclassscore] Spectrum(id=3544, spectrum_id=87806, strains=27) | mc:4.6625 npcl:0.702 | shared strains = 20
Library match: 4-[5-[[4-[5-[acetyl(hydroxy)amino]pentylamino]-4-oxobutanoyl]-hydroxyamino]pentylamino]-4-oxobutanoic acid
Precursor_mz: 400.39001
   unfiltered results: 4.6624688447848435 [(

### Use only the feature-based scores - Rosetta and NPClassScore
This is a scenario that will likely get more popular once different feature-based scores are added NPLinker, such as substructure-based scoring methods, as they do not depend on the dataset size (more strains means better Metcalf scoring).

We see in this scenario that Rosetta scoring does not find many candidate links.

In [8]:
# Initialise rosetta scoring the same way
ros = npl.scoring_method('rosetta')

# Now only links are kept that pass the cutoff for both methods
results_feat = npl.get_links(npl.gcfs, [ros, npcl], and_mode=True)

print('Number of results for Rosetta and NPClassScore scoring: {}'.format(len(results_feat)))
print(results_feat.methods)

15:36:49 [INFO] methods.py:329, RosettaScoring setup
15:36:49 [INFO] rosetta.py:376, Trying to load cached Rosetta hits data
15:36:49 [INFO] rosetta.py:379, Loaded cached Rosetta hits for dataset  at /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crusemann_3ids_AS6-AS3_30-11/rosetta/RosettaHits.pckl
15:36:49 [INFO] methods.py:346, RosettaScoring setup completed
15:36:49 [INFO] methods.py:393, RosettaScoring got 1581 GCFs input, converted to 5869 BGCs
15:36:55 [INFO] methods.py:984, Running NPClassScore...
15:36:55 [INFO] methods.py:998, Using Metcalf scoring to get shared strains
15:39:30 [INFO] methods.py:1005, Calculating NPClassScore for 1581 objects to 13667 targets (1784369 pairwise interactions that share at least 1 strain). This might take a while.
15:46:41 [INFO] methods.py:1054, NPClassScore completed in 586.2s
Number of results for Rosetta and NPClassScore scoring: 31
{<nplinker.scoring.methods.RosettaScoring object at 0x7f8f0b74d908>, <nplinker.scoring.methods.NPClassSco

In [9]:
# use same obj as before to show results - apparently no results for staurosporine
result_feat = results_feat.links.get(obj)
print('Results for object: {}, {} total links, {} methods used'.format(
    obj, result_feat if not result_feat else len(result_feat), results_feat.method_count))
if result_feat:
    # sort results based on rosetta scoring
    sorted_links_feat = results_feat.get_sorted_links(ros, obj)

    i = 0  # keep track of (spectrum) results
    for feat_link_data in sorted_links_feat:
            if isinstance(feat_link_data.target, Spectrum):
                print('{}  --> [{}] {} | ros:{} npcl:{} | shared strains = {}'.format(
                    i,
                    ','.join(method.name for method in feat_link_data.methods),
                    feat_link_data.target,
                    ros.format_data(feat_link_data[ros]),
                    npcl.format_data(feat_link_data[npcl]),
                    len(feat_link_data.shared_strains)))
                if feat_link_data.target.gnps_annotations:
                    comp_name = feat_link_data.target.gnps_annotations.get("Compound_Name")
                    print('Library match:', comp_name)
                print("   unfiltered results:", feat_link_data[ros], feat_link_data[npcl])
                i+=1
else:
    print("\nNo result for obj", obj)

Results for object: GCF(id=534, class=Others, gcf_id=511, strains=54), None total links, 2 methods used

No result for obj GCF(id=534, class=Others, gcf_id=511, strains=54)


In [25]:
# get results for an obj that does have links
obj_feat = list(results_feat.links)[0]
print(obj_feat)

GCF(id=1459, class=Others, gcf_id=1754, strains=39)


In [26]:
result_feat = results_feat.links.get(obj_feat)
print('Results for object: {}, {} total links, {} methods used'.format(
    obj_feat, result_feat if not result_feat else len(result_feat), results_feat.method_count))
if result_feat:
    # sort results based on rosetta scoring
    sorted_links_feat = results_feat.get_sorted_links(ros, obj_feat)

    i = 0  # keep track of (spectrum) results
    for feat_link_data in sorted_links_feat:
            if isinstance(feat_link_data.target, Spectrum):
                print('{}  --> [{}] {} | ros:{} npcl:{} | shared strains = {}'.format(
                    i,
                    ','.join(method.name for method in feat_link_data.methods),
                    feat_link_data.target,
                    ros.format_data(feat_link_data[ros]),
                    npcl.format_data(feat_link_data[npcl]),
                    len(feat_link_data.shared_strains)))
                if feat_link_data.target.gnps_annotations:
                    comp_name = feat_link_data.target.gnps_annotations.get("Compound_Name")
                    print('Library match:', comp_name)
                print('Precursor_mz:', link_data.target.precursor_mz)
                print("   unformatted results:", feat_link_data[ros], feat_link_data[npcl])
                i+=1
else:
    print("\nNo result for obj", obj_feat)

Results for object: GCF(id=1459, class=Others, gcf_id=1754, strains=39), 1 total links, 2 methods used
0  --> [rosetta,npclassscore] Spectrum(id=8987, spectrum_id=166494, strains=1) | ros:3 hits npcl:0.412 | shared strains = 1
Precursor_mz: 400.39001
   unformatted results: [RosettaHit: 166494<-->NZ_KB900331.1.region001 via (CCMSLIB00000222303 (0.507), BGC0000054 (4.155)), RosettaHit: 166494<-->NZ_KB896267.1.region001 via (CCMSLIB00000222303 (0.507), BGC0000054 (4.159)), RosettaHit: 166494<-->NZ_KB900270.1.region001 via (CCMSLIB00000222303 (0.507), BGC0000054 (4.088))] [(0.4117647058823529, 'as_classes', 'cf_subclass', 'oligosaccharide', 'Carbohydrates and carbohydrate conjugates')]
