# Usage of the "brenda" module for obtaining molecules from the BRENDA database

First create a molecule object that holds all the data. Here we use only molecules that are listed in the **"Natural substrates"** table of BRENDA.

In [1]:
from cheminformatics import brenda

# other valid values for typeof are 'product', 'substrates', 'products'
mol_obj = brenda.BrendaNaturalMols(typeof='substrate') 

type(mol_obj)

FileNotFoundError: [Errno 2] No such file or directory: './data/natural_substrates_filtered.json'

You can equally well use molecules from the **"Substrates"** table in BRENDA.

In [None]:
mol_obj = brenda.BrendaMols(typeof='substrate') 

type(mol_obj)

Get a list of all unique molecules from all EC numbers

In [None]:
molecules = mol_obj.names()

molecules[:10]

Get a list of all EC numbers in the dataset

In [None]:
ec_nums = mol_obj.ec()

ec_nums[:10]

Get a dictionary with EC number keys holding lists of molecule names

In [None]:
mol_data = mol_obj.data_dict()

print(list(mol_data.keys())[:10])

print('\n')

print(mol_data['1.1.3.10'])
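
The relationship between the dictionary, the list of unique names, and the tabular form can be sketched in plain Python. This is only an illustration of the data shapes described above, not the module's code: the helper names `unique_names` and `as_rows` and the example entries are hypothetical.

```python
# Hypothetical data in the shape described above:
# EC number -> list of molecule names. The entries are placeholders.
example = {
    '1.1.3.10': ['D-glucose', 'O2'],
    '1.1.3.15': ['glycolate', 'O2', '(S)-lactate'],
}


def unique_names(ec_to_names):
    """Return the sorted set of molecule names across all EC numbers."""
    return sorted({name for names in ec_to_names.values() for name in names})


def as_rows(ec_to_names):
    """Return (ec_number, name) pairs, one row per EC/molecule association."""
    return [(ec, name) for ec, names in ec_to_names.items() for name in names]


all_names = unique_names(example)
rows = as_rows(example)
print(all_names)
print(rows)
```

The row form is the kind of input one could pass to a data frame constructor with two columns, one for the EC number and one for the molecule name.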


Get a data frame holding the EC number and substrate name data

In [None]:
mol_data = mol_obj.data_frame()

mol_data.head()

# Usage of the "cheminfo" module for working with the molecules

First create a molecule object that holds all the data. Conversion from names to SMILES occurs automatically on creation of the object.

The cheminfo module is dependent on the following cheminformatics packages:

cirpy

pubchempy

rdkit

### First let's look at name-to-SMILES conversion

The SMILES are fetched from saved data where possible; otherwise the script tries to get them from a server. The retest_none variable can be used to force the script to retry fetching SMILES for the molecule names that did not work in previous runs.

In [None]:
from cheminformatics import cheminfo

mols = ['(S)-Lactate', 'Glycolate', 'Tryptophane', 'ATP', 'ADP', 'AMP']

chem_obj = cheminfo.NameToSmile(names=mols, retest_none=False)

type(chem_obj)
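
The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `resolve_smiles`, the `cache` dictionary, and the `lookup` callable (standing in for the server request) are all hypothetical names.

```python
def resolve_smiles(names, cache, lookup, retest_none=False):
    """Fill `cache` (name -> SMILES or None) for every name in `names`.

    `lookup` is a hypothetical callable standing in for the server request;
    it returns a SMILES string, or None when the name cannot be resolved.
    """
    for name in names:
        known = name in cache
        failed_before = known and cache[name] is None
        # Query the server for unseen names, and for previously failed
        # names only when retest_none is set.
        if not known or (failed_before and retest_none):
            cache[name] = lookup(name)
    return {name: cache[name] for name in names}


# Toy stand-in for the remote service that records which names were requested.
requested = []


def fake_lookup(name):
    requested.append(name)
    return {'glycolate': 'OCC(=O)O'}.get(name)


cache = {'ATP': None}  # a name that failed in an earlier run
first = resolve_smiles(['glycolate', 'ATP'], cache, fake_lookup)
requested_after_first = list(requested)
second = resolve_smiles(['glycolate', 'ATP'], cache, fake_lookup,
                        retest_none=True)
print(requested_after_first, requested)
```

In the first call only the unseen name triggers a lookup; the second call retries the name that failed earlier because `retest_none=True`.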

Get a list of molecule names

In [None]:
molecules = chem_obj.names()

molecules[:10]

Get a list of all the smiles

In [None]:
smiles = chem_obj.smiles()

smiles[:10]

Get a dictionary of all data with metabolite names as keys and smiles as values

In [None]:
smile_data = chem_obj.data_dict()

print(list(smile_data.keys())[:10])

print('\n')

print(smile_data['lactate'])

Get a data frame with the molecule names and smiles data

In [None]:
smile_data = chem_obj.data_frame()

smile_data.head()

**Note**

Each of the four methods names(), smiles(), data_dict(), and data_frame() takes the optional argument "exclude_none". By default this is set to False and all molecule names for which no SMILES could be obtained are excluded. If set to True, these will be included in the output.

### Now let's look at getting interesting data out of the SMILES

First create a data object using molecule names and smiles as input.

In [None]:
data_obj = cheminfo.SmileToData(names=chem_obj.names(),
                                smiles=chem_obj.smiles(), 
                                descriptor='morgan3', 
                                metric='tanimoto')

type(data_obj)

Several different algorithms for calculating descriptors are available. A list of the available ones can easily be obtained.

In [None]:
data_obj.valid_descriptors()

Additionally, several different metrics for comparing the molecules are available. A list of the available ones can easily be obtained.

In [None]:
data_obj.valid_metrics()

#### Obtaining basic properties for the molecules

Get a list of the molecule names 

In [None]:
names = data_obj.names()

names[:10]

Get a list of the molecule smiles

In [None]:
smiles = data_obj.smiles()

smiles[:10]

Get a list of the molecule objects obtained from the smiles

In [None]:
mols = data_obj.molecules()

print(mols[:10])

for m in mols[:10]:
    display(m)

It is also possible to specify a molecule's name to obtain only that molecule object

In [None]:
m = data_obj.molecules('lactate')
m

Get a list of fingerprint objects for each of the molecules

In [None]:
fingerp = data_obj.fingerprints()

fingerp[0]

It is also possible to get the fingerprints as bitstrings

In [None]:
fingerp = data_obj.fingerprints_str()

fingerp[0]

Or as a bit list if one prefers...

In [None]:
fingerp = data_obj.fingerprints_list()

fingerp[0]
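
How the bitstring and bit-list forms relate can be sketched in plain Python. The helper names and the ten-bit fingerprint are hypothetical; real fingerprints are much longer, and the module's own return types may differ.

```python
def bitstring_to_list(bits):
    """Convert a string such as '0101' into a list of integers [0, 1, 0, 1]."""
    return [int(ch) for ch in bits]


def on_bits(bit_list):
    """Return the set of positions whose bit is set."""
    return {i for i, bit in enumerate(bit_list) if bit}


fp_str = '0100100001'          # hypothetical ten-bit fingerprint
fp_list = bitstring_to_list(fp_str)
fp_on = on_bits(fp_list)
print(fp_list)
print(sorted(fp_on))
```

The set of on-bit positions is a compact form that makes set-based similarity measures easy to compute.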

Get a list of molecule properties for the molecules, in this case molecular weight.

In [None]:
props = data_obj.property(property_type='molwt')

props[:10]

You can easily obtain a list of the available properties

In [None]:
data_obj.valid_properties()

These are a bit cryptic, so one can also get the explanation.

In [None]:
data_obj.explain_properties()

Get a dictionary containing a selection of chemical properties for each molecule

In [None]:
prop_dict = data_obj.data_dict()

print(prop_dict.keys())

prop_dict['lactate']

Get a data frame containing a selection of chemical properties for each molecule

In [None]:
prop_data = data_obj.data_frame()

prop_data.head()

Get a matrix of all pairwise similarities between molecules

In [None]:
data_obj.similarity()

Get a matrix of all pairwise distances between molecules

In [None]:
data_obj.distance()
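
For the Tanimoto metric, the two matrices can be sketched in plain Python on fingerprints represented as sets of on-bit positions. This is an illustration only: the function names and toy fingerprints are hypothetical, and taking the distance as one minus the similarity is an assumption about the module, not something stated above.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit positions."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints are identical.
    return len(a & b) / union if union else 1.0


def similarity_matrix(fps):
    """All pairwise Tanimoto similarities as a list of rows."""
    return [[tanimoto(a, b) for b in fps] for a in fps]


def distance_matrix(sim):
    """Assumed conversion for this sketch: distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]


fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]   # toy fingerprints
sim = similarity_matrix(fps)
dist = distance_matrix(sim)
for row in sim:
    print(row)
```

Both matrices are symmetric, with ones on the diagonal of the similarity matrix and zeros on the diagonal of the distance matrix.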

Get similarity statistics (minimum similarity, maximum similarity, etc.) for each molecule in the whole set. The diagonal (self-similarity) is ignored in these calculations. Molecules with a low max and a low sum might be considered outliers.

In [None]:
sim_data = data_obj.molecule_similarity_stats()

sim_data
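
Per-molecule statistics that ignore the diagonal can be sketched as follows. The function name, the toy matrix, and the exact set of statistics (min, max, mean, sum) are assumptions for illustration; the module may report others.

```python
def per_molecule_stats(names, sim):
    """Summarise each row of a similarity matrix, skipping self-similarity."""
    stats = {}
    for i, name in enumerate(names):
        others = [s for j, s in enumerate(sim[i]) if j != i]
        stats[name] = {
            'min': min(others),
            'max': max(others),
            'mean': sum(others) / len(others),
            'sum': sum(others),
        }
    return stats


names = ['a', 'b', 'c']                 # placeholder molecule names
sim = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.0],
       [0.25, 0.0, 1.0]]
stats = per_molecule_stats(names, sim)

# The molecule with the lowest maximum similarity is the likeliest outlier.
outlier = min(stats, key=lambda name: stats[name]['max'])
print(outlier, stats[outlier])
```

Here molecule `c` has both the lowest maximum and the lowest sum, so it would be the first candidate to inspect as an outlier.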


Get similarity statistics for the entire set of molecules.

In [None]:
sim_data = data_obj.global_similarity_stats()

sim_data
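
Set-wide statistics are naturally computed over each unordered pair once, that is, over the upper triangle of the similarity matrix. A minimal sketch, assuming a hypothetical `global_stats` helper and a toy matrix; the statistics the module actually reports may differ.

```python
from itertools import combinations
from statistics import mean, median


def global_stats(sim):
    """Summarise the similarities of all unique molecule pairs."""
    pairs = [sim[i][j] for i, j in combinations(range(len(sim)), 2)]
    return {
        'min': min(pairs),
        'max': max(pairs),
        'mean': mean(pairs),
        'median': median(pairs),
    }


sim = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.0],
       [0.25, 0.0, 1.0]]
overall = global_stats(sim)
print(overall)
```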

Get a subset of diverse molecules, chosen from the total set of molecules. It is possible to specify already selected molecules in the "firstpicks" argument.

In [None]:
data_obj.diversity_pick(n=3, firstpicks=['glycolate'])
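
One common way to pick a diverse subset is the MaxMin strategy: repeatedly add the molecule whose smallest distance to the already picked ones is largest. The sketch below illustrates that idea with seeded picks. Whether the module uses MaxMin, and whether `n` counts the seeds, is not stated above, so both are assumptions here, as are the function name and the toy distances.

```python
def maxmin_pick(names, dist, n, firstpicks=()):
    """Pick `n` names in total (seeds included) using the MaxMin strategy."""
    index = {name: i for i, name in enumerate(names)}
    picked = [index[name] for name in firstpicks]
    if not picked:
        picked = [0]  # arbitrary seed chosen for this sketch
    while len(picked) < min(n, len(names)):
        candidates = [i for i in range(len(names)) if i not in picked]
        # Choose the candidate that is farthest from its nearest picked molecule.
        best = max(candidates, key=lambda i: min(dist[i][j] for j in picked))
        picked.append(best)
    return [names[i] for i in picked]


names = ['glycolate', 'lactate', 'ATP', 'ADP']
dist = [[0.0, 0.4, 0.9, 0.85],    # hypothetical distances
        [0.4, 0.0, 0.95, 0.9],
        [0.9, 0.95, 0.0, 0.1],
        [0.85, 0.9, 0.1, 0.0]]
picks = maxmin_pick(names, dist, n=3, firstpicks=['glycolate'])
print(picks)
```

With these toy distances, ADP is skipped because it sits very close to the already picked ATP.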

Draw all molecules. These are aligned based on their maximum common substructure, which, in the case shown here, is not highlighted.

In [None]:
data_obj.draw_structures()

Optionally, one can also highlight the maximum common substructure by setting "highlight_substructure" to True

In [None]:
data_obj.draw_structures(highlight_substructure=True)

One can specifically compare the structure of two molecules.

In [None]:
data_obj.draw_mol_comparison(refmol='lactate', mol='glycolate')

Cluster the molecules and obtain a vector indicating cluster identity for each molecule. In this case Butina clustering is used on already pre-computed distances.

In [None]:
data_obj.cluster_butina()
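
The idea behind Butina clustering on a pre-computed distance matrix can be sketched in plain Python: molecules with the most neighbours within a cutoff become cluster centroids first, and each centroid claims its still unassigned neighbours. The function below is a simplified illustration, not the module's code; the name `butina`, the `cutoff` parameter and its default, and the toy distances are assumptions.

```python
def butina(dist, cutoff=0.5):
    """Return one cluster label per molecule from a square distance matrix."""
    n = len(dist)
    # Neighbours of each molecule: all others within the distance cutoff.
    neighbours = [{j for j in range(n) if j != i and dist[i][j] <= cutoff}
                  for i in range(n)]
    # Visit molecules with the most neighbours first.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    labels = [None] * n
    next_label = 0
    for i in order:
        if labels[i] is not None:
            continue  # already claimed by an earlier centroid
        labels[i] = next_label
        for j in neighbours[i]:
            if labels[j] is None:
                labels[j] = next_label
        next_label += 1
    return labels


dist = [[0.0, 0.4, 0.9, 0.85],    # hypothetical distances
        [0.4, 0.0, 0.95, 0.9],
        [0.9, 0.95, 0.0, 0.1],
        [0.85, 0.9, 0.1, 0.0]]
labels = butina(dist, cutoff=0.5)
print(labels)
```

A label vector of this shape, with one integer per molecule, is the kind of input the plotting functions take as color categories.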

Other cluster methods...

In [None]:
# here

### Plotting


Use PCA to visualize the relationship between all molecules __based on molecule fingerprints__. 

Optional arguments include:

include_labels: True or False

color_categories: None or a list of numbers, the same length as the data, indicating which group each point belongs to.

No color but with labels

In [None]:
data_obj.pca(include_labels=True)

No labels, but with coloring

In [None]:
data_obj.pca(color_categories=[3, 2, 1, 0, 0, 0])

Use MDS to visualize the relationship between all molecules __based on pre-computed distances__. The options are the same as for PCA. First with labels.

In [None]:
data_obj.mds(include_labels=True)

Now with colors instead of labels.

In [None]:
data_obj.mds(color_categories=[3, 2, 1, 0, 0, 0])

Use t-SNE to visualize the relationship between all molecules __based on molecule fingerprints__. The options are the same as for PCA. First with labels.

In [None]:
data_obj.tsne(include_labels=True)

Now with colors instead of labels.

In [None]:
data_obj.tsne(color_categories=[3, 2, 1, 0, 0, 0])

## Example usage

So let's say I'm interested in looking at EC 1.1.3.15. I might do something along these lines:

In [None]:
# first get all the substrates
mol_obj = brenda.BrendaMols(typeof='substrate') 
my_ec_data = mol_obj.data_dict()['1.1.3.15']

my_ec_data

In [None]:
# need to get smiles for these substrates
chem_obj = cheminfo.NameToSmile(names=my_ec_data)

# as you can see any "strange" substrates that did not give smiles have been excluded
chem_obj.data_dict()

In [None]:
# now I make the data object to do the fun stuff
data_obj = cheminfo.SmileToData(names=chem_obj.names(),
                                smiles=chem_obj.smiles(), 
                                descriptor='morgan3', 
                                metric='tanimoto')

# cluster them
clu = data_obj.cluster_butina()

# use the cluster categories to color points in an MDS plot
data_obj.mds(include_labels=True, color_categories=clu)

In [None]:
# that's a lot of different types of molecules in there, I wonder how similar they are on average
data_obj.global_similarity_stats()

In [None]:
# I happen to know that EC 1.4.3.2 is also active on a range of substrates. 
# I wonder if that EC class takes substrates that are on average more or less similar

my_ec_data = mol_obj.data_dict()['1.4.3.2']
chem_obj = cheminfo.NameToSmile(names=my_ec_data)
data_obj = cheminfo.SmileToData(names=chem_obj.names(),
                                smiles=chem_obj.smiles(), 
                                descriptor='morgan3', 
                                metric='tanimoto')

# cluster them
clu = data_obj.cluster_butina()

# use the cluster categories to color points in an MDS plot
data_obj.mds(include_labels=True, color_categories=clu)

In [None]:
# seems like the mean similarity is a bit higher here, but on the whole I would say they are comparable
data_obj.global_similarity_stats()

In [None]:
# what is orn? Don't know what that is really..

data_obj.molecules('orn')

# Aha, looks like ornithine! Who would have guessed...