# Summary
Explore some ways to load CANOPUS output, load known MIBiG class links, and establish class-linking scores in the NPLinker object. To try this, we use a version of the Crüsemann dataset (see Crüsemann et al. (2016) or the MolNetEnhancer paper). Many parts of this notebook originate from the demo notebook.

In [1]:
import sys, csv, os
# if running from clone of the git repo
sys.path.append('../prototype')

# import the main NPLinker class. normally this is all that's required to work
# with NPLinker in a notebook environment
from nplinker.nplinker import NPLinker

In [3]:
# load the local Crüsemann data (~8000 spectra)
npl = NPLinker({'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/'}})
npl.load_data()

14:28:50 [INFO] config.py:121, Loading from local data in directory /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/
14:28:57 [INFO] loader.py:80, Trying to discover correct bigscape directory under /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/bigscape
14:28:57 [INFO] loader.py:83, Found network files directory: /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/bigscape/network_files/2021-07-16_08-32-34_hybrids_glocal
14:28:57 [INFO] loader.py:210, Updating bigscape_dir to discovered location /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/bigscape/network_files/2021-07-16_08-32-34_hybrids_glocal
14:28:57 [INFO] loader.py:569, Loaded global strain IDs (0 total)
14:28:57 [INFO] loader.py:580, Loaded dataset strain IDs (145 total)
14:28:59 [INFO] metabolomics.py:642, 8099 molecules parsed from MGF file
14:29:00 [INFO] metabolomics.py:659, Found older-style GNPS dataset, no quantification table


14:29:01 [INFO] loader.py:555, Loading provided annotation files (/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/result_specnets_DB)


14:29:05 [INFO] genomics.py:445, Found 1816 MiBIG json files
14:29:56 [INFO] genomics.py:236, Using antiSMASH filename delimiters ['.', '_', '-']
14:35:22 [INFO] genomics.py:352, # MiBIG BGCs = 0, non-MiBIG BGCS = 7721, total bgcs = 7721, GCFs = 1263, strains=1961
14:35:22 [INFO] genomics.py:409, Filtering MiBIG BGCs: removing 0 GCFs and 0 BGCs
14:35:22 [INFO] genomics.py:359, # after filtering, total bgcs = 5905, GCFs = 1263, strains=145, unknown_strains=0
14:35:25 [INFO] loader.py:332, Writing common strain labels to /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/common_strains.csv
14:35:28 [INFO] loader.py:345, Strains filtered down to total of 48
14:35:28 [INFO] loader.py:271, No further strain filtering to apply


True
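
The dataset root passed to `NPLinker` above is a hard-coded absolute path, so the cell is tied to one machine. A minimal sketch of a helper that checks the directory exists before building the same `{'dataset': {'root': ...}}` dict follows. `make_local_config` is a hypothetical name and not part of NPLinker; it only assumes the `dataset`/`root` keys seen in the cell above.

```python
import os


def make_local_config(root):
    """Return a config dict for a local dataset (hypothetical helper, not NPLinker API)."""
    # Expand "~" and normalise the path so later log messages are unambiguous.
    root = os.path.abspath(os.path.expanduser(root))
    if not os.path.isdir(root):
        raise FileNotFoundError('dataset root not found: {}'.format(root))
    # Same structure as the dict passed to NPLinker(...) in the cell above.
    return {'dataset': {'root': root}}
```

The result could then be passed along as `NPLinker(make_local_config(my_root))`, where `my_root` is whatever directory holds your copy of the dataset.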

In [4]:
# Basic functionality
# ===================
#
# Once you have an NPLinker object with all data loaded, there is a collection of simple
# methods and properties you can use to access objects and metadata. Some examples are 
# given below, see https://nplinker.readthedocs.io/en/latest/ for a complete API description.

# configuration/dataset metadata
# - a copy of the configuration as parsed from the .toml file (dict)
print(npl.config) 
# - the path to the directory where various nplinker data files are located (e.g. the 
#   default configuration file template) (str)
print(npl.data_dir)
# - a dataset ID, derived from the path for local datasets or the paired platform ID
#   for datasets loaded from that source (str)
print(npl.dataset_id)
# - the root directory for the current dataset (str)
print(npl.root_dir)

# objects
# - you can directly access lists of each of the 4 object types:
print('BGCs:', len(npl.bgcs))
print('GCFs:', len(npl.gcfs)) # contains GCF objects
print('Spectra:', len(npl.spectra)) # contains Spectrum objects
print('Molecular Families:', len(npl.molfams)) # contains MolecularFamily objects

{'loglevel': 'INFO', 'logfile': '', 'log_to_stdout': True, 'repro_file': '', 'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/', 'overrides': {}, 'platform_id': ''}, 'antismash': {'antismash_format': 'default', 'ignore_spaces': False}, 'docker': {'run_bigscape': True, 'extra_bigscape_parameters': ''}, 'webapp': {'tables_metcalf_threshold': 2.0}, 'scoring': {'rosetta': {}}}
../prototype/nplinker/data

/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/
BGCs: 5905
GCFs: 1263
Spectra: 8099
Molecular Families: 4611


In [5]:
mc = npl.scoring_method('metcalf')

# Now mc is an instance of the class that implements Metcalf scoring. Once
# you have such an instance, you may change any of the parameters it exposes.
# In the case of Metcalf scoring, the following parameters are currently exposed:
# - cutoff (float): the scoring threshold. Links with scores less than this are excluded
# - standardised (bool): set to True to use standardised scores (default), False for regular
mc.cutoff = 2.5
mc.standardised = True

results = npl.get_links(npl.gcfs, mc, and_mode=True) 

# get_links returns an instance of a class called LinkCollection. This provides a wrapper
# around the results of the scoring operation and has various useful properties/methods:
#
# - len(results) or .source_count will tell you how many of the input_objects were found to have links
print('Number of results: {}'.format(len(results)))
# - .sources is a list of those objects
objects_with_links = results.sources
# - .links is a dict with structure {input_object: {linked_object: ObjectLink}} 
objects_and_link_info = results.links
# - .get_all_targets() will return a flat list of *all* the linked objects (for all sources)
all_targets = results.get_all_targets() 
# - .methods is a list of the scoring methods passed to get_links
methods = results.methods

14:35:28 [INFO] methods.py:436, MetcalfScoring.setup (bgcs=5905, gcfs=1263, spectra=8099, molfams=4611, strains=48)
14:35:39 [INFO] methods.py:475, MetcalfScoring.setup completed
Number of results: 959
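
The nested `{input_object: {linked_object: ObjectLink}}` layout described in the comments above can be awkward to scan. As a rough sketch, the function below flattens a dict of that shape into `(source, target, score)` tuples sorted by descending score. It runs on plain dicts with float scores as stand-ins. How to read a score off a real `ObjectLink` is not shown in this notebook, so the `get_score` callable is an assumption you would need to adapt.

```python
def flatten_links(links, get_score=lambda link: link, cutoff=None):
    """Flatten {source: {target: link}} into sorted (source, target, score) rows.

    get_score extracts a numeric score from each link value; the default
    assumes the value already is a number (a stand-in, not an ObjectLink).
    """
    rows = []
    for source, targets in links.items():
        for target, link in targets.items():
            score = get_score(link)
            if cutoff is not None and score < cutoff:
                continue  # mirrors the rule that scores below the cutoff are excluded
            rows.append((source, target, score))
    # Highest-scoring links first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy data with made-up names and scores, for illustration only.
toy_links = {
    'gcf_A': {'spec_1': 3.1, 'spec_2': 2.6},
    'gcf_B': {'spec_3': 4.0},
}
top = flatten_links(toy_links, cutoff=3.0)
```

A flat, sorted list like `top` is easier to slice (for example `top[:10]`) or write to CSV than the nested dict.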
