# Summary
This notebook explores ways to load CANOPUS output, load known class links from MIBiG, and establish class-based linking (scores) in the NPLinker object. To try this out, we use a version of the Crüsemann dataset (see Crüsemann et al. (2016) or the MolNetEnhancer paper). Many parts of this notebook originate from the demo notebook.

In [1]:
import sys, csv, os
# if running from clone of the git repo
sys.path.append('../prototype')

# import the main NPLinker class. normally this is all that's required to work
# with NPLinker in a notebook environment
from nplinker.nplinker import NPLinker

In [20]:
# load local crusemann data
npl = NPLinker({'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/'}})
npl.load_data()

18:15:24 [INFO] config.py:121, Loading from local data in directory /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/
18:15:24 [INFO] loader.py:80, Trying to discover correct bigscape directory under /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/bigscape
18:15:24 [INFO] loader.py:83, Found network files directory: /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/bigscape/network_files/2021-07-16_08-32-34_hybrids_glocal
18:15:24 [INFO] loader.py:212, Updating bigscape_dir to discovered location /mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/bigscape/network_files/2021-07-16_08-32-34_hybrids_glocal
18:15:24 [INFO] loader.py:571, Loaded global strain IDs (0 total)
18:15:24 [INFO] loader.py:582, Loaded dataset strain IDs (159 total)
18:15:31 [INFO] metabolomics.py:699, 13667 molecules parsed from MGF file
18:15:32 [INFO] metabolomics.py:716, Found older-style GNPS dataset, no quanti

True

In [17]:
# Basic functionality
# ===================
#
# Once you have an NPLinker object with all data loaded, there is a collection of simple
# methods and properties you can use to access objects and metadata. Some examples are 
# given below, see https://nplinker.readthedocs.io/en/latest/ for a complete API description.

# configuration/dataset metadata
# - a copy of the configuration as parsed from the .toml file (dict)
print(npl.config) 
# - the path to the directory where various nplinker data files are located (e.g. the 
#   default configuration file template) (str)
print(npl.data_dir)
# - a dataset ID, derived from the path for local datasets or the paired platform ID
#   for datasets loaded from that source (str)
print(npl.dataset_id)
# - the root directory for the current dataset (str)
print(npl.root_dir)

# objects
# - you can directly access lists of each of the 4 object types:
print('BGCs:', len(npl.bgcs))
print('GCFs:', len(npl.gcfs)) # contains GCF objects
print('Spectra:', len(npl.spectra)) # contains Spectrum objects
print('Molecular Families:', len(npl.molfams)) # contains MolecularFamily objects

{'loglevel': 'INFO', 'logfile': '', 'log_to_stdout': True, 'repro_file': '', 'dataset': {'root': '/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/', 'overrides': {}, 'platform_id': ''}, 'antismash': {'antismash_format': 'default', 'ignore_spaces': False}, 'docker': {'run_bigscape': True, 'extra_bigscape_parameters': ''}, 'webapp': {'tables_metcalf_threshold': 2.0}, 'scoring': {'rosetta': {}}}
../prototype/nplinker/data

/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_full_new_AS3_03-09/
BGCs: 5905
GCFs: 1263
Spectra: 13667
Molecular Families: 8346


In [4]:
mc = npl.scoring_method('metcalf')

# Now mc is an instance of the class that implements Metcalf scoring. Once
# you have such an instance, you may change any of the parameters it exposes.
# In the case of Metcalf scoring, the following parameters are currently exposed:
# - cutoff (float): the scoring threshold. Links with scores less than this are excluded
# - standardised (bool): set to True to use standardised scores (default), False for regular
mc.cutoff = 2.5
mc.standardised = True

results = npl.get_links(npl.gcfs, mc, and_mode=True) 

# get_links returns an instance of a class called LinkCollection. This provides a wrapper
# around the results of the scoring operation and has various useful properties/methods:
#
# - len(results) or .source_count will tell you how many of the input_objects were found to have links
print('Number of results: {}'.format(len(results)))
# - .sources is a list of those objects
objects_with_links = results.sources
# - .links is a dict with structure {input_object: {linked_object: ObjectLink}} 
objects_and_link_info = results.links
# - .get_all_targets() will return a flat list of *all* the linked objects (for all sources)
all_targets = results.get_all_targets() 
# - .methods is a list of the scoring methods passed to get_links
methods = results.methods

14:25:27 [INFO] methods.py:436, MetcalfScoring.setup (bgcs=5905, gcfs=1263, spectra=8099, molfams=4611, strains=48)
14:25:35 [INFO] methods.py:475, MetcalfScoring.setup completed
Number of results: 959


In [5]:
# MetcalfScoring.setup reports only 48 strains. This is expected: spectra exist for only 48 strains
b_strains = [bgc.strain.id for bgc in npl.bgcs]
bs_set = set(b_strains)
# flatten the strain IDs of all spectra into one list
s_strains = [strain.id for spec in npl.spectra for strain in spec.strains]
ss_set = set(s_strains)
len(bs_set), len(ss_set)

(143, 48)
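
The comparison above only counts the strains on each side. A minimal sketch of how the overlap itself could be inspected with set operations is given below. The helper `strain_overlap` and the strain IDs are hypothetical placeholders for illustration; they are not part of NPLinker and are not taken from the dataset.

```python
def strain_overlap(bgc_strain_ids, spectrum_strain_ids):
    """Return which strain IDs are shared and which occur on one side only."""
    bgc_set = set(bgc_strain_ids)
    spec_set = set(spectrum_strain_ids)
    return {
        "shared": bgc_set & spec_set,         # strains with both BGCs and spectra
        "bgc_only": bgc_set - spec_set,       # strains with BGCs but no spectra
        "spectrum_only": spec_set - bgc_set,  # strains with spectra but no BGCs
    }


# hypothetical IDs, standing in for the b_strains and s_strains lists
overlap = strain_overlap(
    ["strain_A", "strain_B", "strain_C", "strain_A"],
    ["strain_B", "strain_D"],
)
for key, ids in overlap.items():
    print(key, sorted(ids))
```

In the notebook, the two arguments would be the `b_strains` and `s_strains` lists built above.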

## Reading CANOPUS output
The following files are expected in the dataset root directory (`npl.root_dir`):
- cluster_index_classifications.txt -> for the spectra (cluster indices in GNPS)
- component_index_classifications.txt -> for the molfams (component indices in GNPS)

For now, just read the files into dicts.
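
As a rough sketch of the parsing step, each class column holds entries of the form `name:score` separated by `"; "`, and an empty column means there is no class at that level. The helper names `parse_class_field` and `parse_classification_line` are assumptions made for this illustration, and the example row is a shortened version of a row from the spectra file. The sketch strips only the trailing newline, on the assumption that trailing empty columns should be kept; `strip()` would drop them.

```python
def parse_class_field(field, sep="; "):
    """Parse 'name:score; name:score' into [(name, score), ...].

    An empty field yields [None], mirroring the placeholder used in this notebook.
    """
    if not field:
        return [None]
    parsed = []
    for item in field.split(sep):
        # split on the last colon so that a colon inside a class name is tolerated
        name, _, score = item.rpartition(":")
        parsed.append((name, float(score)))
    return parsed


def parse_classification_line(line, first_class_col):
    """Split a tab-separated row into its ID columns and its parsed class columns."""
    # rstrip("\n") instead of strip(), so trailing empty columns are preserved
    columns = line.rstrip("\n").split("\t")
    id_columns = columns[:first_class_col]
    class_columns = [parse_class_field(c) for c in columns[first_class_col:]]
    return id_columns, class_columns


# shortened example row: componentindex, cluster index, formula, then class columns
row = ("3773\t31191\tC19H41NO13\tOrganic compounds:1.000\t"
       "Organooxygen compounds:0.953; Oxanes:0.579\t\n")
ids, classes = parse_classification_line(row, first_class_col=3)
print(ids)
print(classes)
```

For the spectra file the class columns start at index 3, and for the molfam file at index 2, matching the headers printed below.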

In [6]:
[bgc.product_prediction for bgc in npl.bgcs][:10]

['nrps',
 'nrps.t1pks.otherks',
 'cf_fatty_acid.nrps.t1pks',
 'nrps',
 'nrps',
 'nrps',
 'nrps',
 'nrps',
 'nrps',
 'nrps.t1pks']

In [7]:
npl.root_dir

'/mnt/scratch/louwe015/NPLinker/own/nplinker_shared/crus_new_gnps_AS3/'

In [8]:
c_spec = None
for spec in npl.spectra:
    if spec.spectrum_id == 46386:
        c_spec = spec
c_spec.metadata

{'precursormass': 577.31299,
 'parentintensity': None,
 'charge': 0,
 'mslevel': '2',
 'precursorintensity': '122530.000000',
 'filename': 'specs_ms.pklbin',
 'parentrt': 1022.690002,
 'activation': 'CID',
 'instrument': 'ion trap',
 'title': 'Scan Number: 46386',
 'scans': '46386',
 'parentmass': 577.31299,
 'singlechargeprecursormass': 577.31299,
 'cluster_index': 46386,
 'files': {'CNS654_R5_E.mzXML': 'CNS654_R5_E.mzXML'}}

In [9]:
ci_classes_file = os.path.join(npl.root_dir, 'cluster_index_classifications.txt')
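assert os.path.exists(ci_classes_file), ci_classes_file  # fail early if the file is missing
ci_classes = {}  # for now a dict {cluster_index: [[(class, score), ...], ...]}, one inner list per level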
with open(ci_classes_file) as inf:
    ci_classes_header = inf.readline()
    print(ci_classes_header)
    for line in inf:
        line = line.strip().split("\t")
        classes_list = []
        for lvl in line[3:]:
            lvl_list = []
            for l_class in lvl.split("; "):
                if l_class:
                    l_class = l_class.split(":")
                    c_tup = tuple([l_class[0], float(l_class[1])])
                else:
                    c_tup = None  # default value for class value
                lvl_list.append(c_tup)
            classes_list.append(lvl_list)
        ci_classes[line[1]] = classes_list

print(line)  # example: the last line that was parsed

componentindex	cluster index	formula	kingdom	superclass	class	subclass	level 5	level 6	level 7	level 8	level 9	level 10	level 11	pathway	superclass	class

['3773', '31191', 'C19H41NO13', 'Organic compounds:1.000', 'Organic oxygen compounds:0.999; Organoheterocyclic compounds:0.536; Organic nitrogen compounds:1.000; Hydrocarbon derivatives:1.000; Organopnictogen compounds:0.959', 'Organooxygen compounds:0.953; Oxanes:0.579; Organonitrogen compounds:1.000', 'Carbohydrates and carbohydrate conjugates:0.936; Amines:0.977; Alcohols and polyols:0.955; Ethers:0.782', 'Glycosyl compounds:0.739; Alkanolamines:0.824; Secondary alcohols:0.870; Acetals:0.738; Secondary amines:0.741; Polyols:0.770; Primary alcohols:0.835', 'O-glycosyl compounds:0.657; 1,2-aminoalcohols:0.789; Dialkylamines:0.767', '', '', '', '', '', 'Amino acids and Peptides:0.010', 'Amino acid glycosides:0.302', 'Cyanogenic glycosides:0.438']


In [10]:
ci_classes['46386']

[[('Organic compounds', 1.0)],
 [('Lipids and lipid-like molecules', 0.359),
  ('Organic acids and derivatives', 0.601),
  ('Organic oxygen compounds', 1.0),
  ('Organic nitrogen compounds', 1.0),
  ('Hydrocarbon derivatives', 1.0),
  ('Organopnictogen compounds', 0.976)],
 [('Fatty Acyls', 0.49),
  ('Carboxylic acids and derivatives', 0.358),
  ('Organooxygen compounds', 0.994),
  ('Organonitrogen compounds', 1.0),
  ('Organic oxides', 0.639)],
 [('Fatty amides', 0.597),
  ('Carboxylic acid derivatives', 0.334),
  ('Alcohols and polyols', 0.941),
  ('Amines', 0.829),
  ('Carbonyl compounds', 0.572)],
 [('N-acyl amines', 0.812),
  ('Carboxylic acid amides', 0.521),
  ('Secondary alcohols', 0.73),
  ('Polyols', 0.551),
  ('Primary alcohols', 0.599),
  ('Primary amines', 0.839)],
 [('Secondary carboxylic acid amides', 0.512), ('Monoalkylamines', 0.849)],
 [None],
 [None],
 [None],
 [None],
 [None],
 [('Carbohydrates', 0.419)],
 [('Polyols', 0.856)],
 [('Amino cyclitols', 0.8)]]

In [11]:
# for component indices
compi_classes_file = os.path.join(npl.root_dir, 'component_index_classifications.txt')
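assert os.path.exists(compi_classes_file), compi_classes_file  # fail early if the file is missing
compi_classes = {}  # for now a dict {component_index: [[(class, score), ...], ...]}, one inner list per level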
with open(compi_classes_file) as inf:
    compi_classes_header = inf.readline()
    print(compi_classes_header)
    for line in inf:
        line = line.strip().split("\t")
        classes_list = []
        for lvl in line[2:]:
            lvl_list = []
            for l_class in lvl.split("; "):
                if l_class:
                    l_class = l_class.split(":")
                    c_tup = tuple([l_class[0], float(l_class[1])])
                else:
                    c_tup = None  # default value for class value
                lvl_list.append(c_tup)
            classes_list.append(lvl_list)
        compi_classes[line[0]] = classes_list

print(line)  # example: the last line that was parsed

componentindex	size	kingdom	superclass	class	subclass	level 5	level 6	level 7	level 8	level 9	level 10	level 11	pathway	superclass	class

['3773', '2', 'Organic compounds:1.000', 'Organic acids and derivatives:0.500; Organic oxygen compounds:1.000; Alkaloids and derivatives:0.500; Organic nitrogen compounds:1.000; Organoheterocyclic compounds:1.000; Hydrocarbon derivatives:1.000; Organopnictogen compounds:1.000', 'Carboxylic acids and derivatives:0.500; Organooxygen compounds:1.000; Organonitrogen compounds:1.000; Oxanes:0.500; Azacyclic compounds:0.500; Oxacyclic compounds:0.500; Organic oxides:0.500', 'Amino acids, peptides, and analogues:0.500; Carbohydrates and carbohydrate conjugates:0.500; Amines:1.000; Carboxylic acid derivatives:0.500; Alcohols and polyols:1.000; Ethers:0.500; Monocarboxylic acids and derivatives:0.500; Carbonyl compounds:0.500', 'Glycosyl compounds:0.500; Aralkylamines:0.500; Alkanolamines:0.500; Amino acids and derivatives:0.500; Carboxylic acid esters:0.500;

In [12]:
compi_classes['3773']

[[('Organic compounds', 1.0)],
 [('Organic acids and derivatives', 0.5),
  ('Organic oxygen compounds', 1.0),
  ('Alkaloids and derivatives', 0.5),
  ('Organic nitrogen compounds', 1.0),
  ('Organoheterocyclic compounds', 1.0),
  ('Hydrocarbon derivatives', 1.0),
  ('Organopnictogen compounds', 1.0)],
 [('Carboxylic acids and derivatives', 0.5),
  ('Organooxygen compounds', 1.0),
  ('Organonitrogen compounds', 1.0),
  ('Oxanes', 0.5),
  ('Azacyclic compounds', 0.5),
  ('Oxacyclic compounds', 0.5),
  ('Organic oxides', 0.5)],
 [('Amino acids, peptides, and analogues', 0.5),
  ('Carbohydrates and carbohydrate conjugates', 0.5),
  ('Amines', 1.0),
  ('Carboxylic acid derivatives', 0.5),
  ('Alcohols and polyols', 1.0),
  ('Ethers', 0.5),
  ('Monocarboxylic acids and derivatives', 0.5),
  ('Carbonyl compounds', 0.5)],
 [('Glycosyl compounds', 0.5),
  ('Aralkylamines', 0.5),
  ('Alkanolamines', 0.5),
  ('Amino acids and derivatives', 0.5),
  ('Carboxylic acid esters', 0.5),
  ('Secondary al

## Find some known links (staurosporine...) and compare classes
Not completely sure, but judging from another notebook this seems to be one of the staurosporine GCFs:

Results for object: GCF(id=504, class=Others, gcf_id=3327, strains=3), 34 total links, 1 methods used
  --> [metcalf] Spectrum(id=1707, spectrum_id=27268, strains=1) | 3.8730 | shared strains = 1

In [13]:
cur_gcf = npl.gcfs[504]
cur_bgcs = [bgc for bgc in cur_gcf.bgcs if bgc.strain in cur_gcf.strains]
cur_gcf.bigscape_class, cur_bgcs

('Others',
 [BGC(id=2716, name=AREQ01000000_KB892473.1.cluster040, strain=Strain(Salinispora arenicola CNT849) [28 aliases], asid=KB892480.1, region=-1),
  BGC(id=2719, name=ARGY01000000_KB896072.1.cluster017, strain=Strain(Salinispora arenicola CNP193) [7 aliases], asid=KB896077.1, region=-1),
  BGC(id=2738, name=AZWU01000000_KI911490.1.cluster011, strain=Strain(Salinispora arenicola CNT005) [32 aliases], asid=KI911493.1, region=-1)])

In [14]:
cur_spec = npl.spectra[1707]
cur_spec_id = cur_spec.spectrum_id

cur_spec, cur_spec_id, ci_classes[str(cur_spec_id)]

(Spectrum(id=1707, spectrum_id=27268, strains=1),
 27268,
 [[('Organic compounds', 1.0)],
  [('Lipids and lipid-like molecules', 0.411),
   ('Benzenoids', 0.878),
   ('Hydrocarbon derivatives', 1.0),
   ('Organic oxygen compounds', 1.0)],
  [('Prenol lipids', 0.613), ('Organooxygen compounds', 0.998)],
  [None],
  [None],
  [None],
  [None],
  [None],
  [None],
  [None],
  [None],
  [('Terpenoids', 0.884)],
  [('Meroterpenoids', 0.625)]])

In [15]:
cur_bgcs[0].product_prediction

'indole'
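
To compare the spectrum classes with a BGC product prediction such as 'indole', it may help to reduce each level to its best-scoring class first. The sketch below does this for the nested list structure used in `ci_classes`. The helper `top_class_per_level` and its `min_score` parameter are assumptions made for this illustration, and the input is the first four levels of the classes shown above for spectrum 27268.

```python
def top_class_per_level(classes_list, min_score=0.0):
    """Keep the highest-scoring (class, score) tuple per level, or None if nothing qualifies."""
    top = []
    for level in classes_list:
        # drop the None placeholders and anything below the score threshold
        scored = [c for c in level if c is not None and c[1] >= min_score]
        top.append(max(scored, key=lambda c: c[1]) if scored else None)
    return top


# first four levels of the classes for spectrum 27268, copied from the output above
spec_27268_classes = [
    [("Organic compounds", 1.0)],
    [("Lipids and lipid-like molecules", 0.411), ("Benzenoids", 0.878),
     ("Hydrocarbon derivatives", 1.0), ("Organic oxygen compounds", 1.0)],
    [("Prenol lipids", 0.613), ("Organooxygen compounds", 0.998)],
    [None],
]

top = top_class_per_level(spec_27268_classes, min_score=0.5)
for entry in top:
    print(entry)
```

Ties are resolved in favour of the class listed first, because `max` returns the first maximal element. Taking only the top class is one possible choice; keeping every class above a threshold would retain more information for class linking.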