# Simple examples: Using Papyrus scripts

Herein it is assumed that the Papyrus <a href="https://doi.org/10.4121/16896406">bioactivity data</a> hosted on 4TU was downloaded and placed in a folder called ***PAPYRUS_DATA_FOLDER***.

In [1]:
%%html
<style>
table {align:left;display:block}
</style>

## Reading Papyrus files

Functions for reading the dataset from disk can be found under *papyrus_scripts.reader*.

In [2]:
from papyrus_scripts.reader import read_papyrus, read_protein_set

### Bioactivity data

Let's first read the bioactivity data.
Make sure **not to decompress** the data, as it is **more than 45 GB!**

We will use the *read_papyrus* function to read the bioactivity data as a pandas dataframe. <br/>
Let us first demonstrate the use of the function on systems with limited RAM (less than 50GB).

We read the standardized data without stereochemistry in chunks of ten thousand rows.
We also set *source_path* to the ***PAPYRUS_DATA_FOLDER***.

In [3]:
PAPYRUS_DATA_FOLDER = 'C:/Users/Olivier/Downloads/Papyrus/Fixed/gmail'

In [4]:
sample_data = read_papyrus(is3d=False, chunksize=10000, source_path=PAPYRUS_DATA_FOLDER)

The return value is an iterator of pandas dataframes of at most ten thousand rows each.<br/>
Let's extract the first chunk as a pandas dataframe and have a look at the first few rows.

In [5]:
chunk1 = next(sample_data)
chunk1.head()

Unnamed: 0,Activity_ID,Quality,source,CID,SMILES,connectivity,InChIKey,InChI,InChI_AuxInfo,target_id,...,type_other,Activity_class,relation,pchembl_value,pchembl_value_Mean,pchembl_value_StdDev,pchembl_value_SEM,pchembl_value_N,pchembl_value_Median,pchembl_value_MAD
0,AAAAEENPAALFRN_on_P49654_WT,High,ChEMBL29,CHEMBL492934,COc1cc(C(C)C)c(Oc2cnc(NCCS(C)(=O)=O)nc2N)cc1I,AAAAEENPAALFRN,AAAAEENPAALFRN-UHFFFAOYSA-N,InChI=1S/C17H23IN4O4S/c1-10(2)11-7-14(25-3)12(...,"""AuxInfo=1/1/N:7,8,1,19,16,17,4,25,12,6,5,26,9...",P49654_WT,...,0,,=,7.800,7.8,0.0,0.0,1.0,7.8,0.0
1,AAAAEENPAALFRN_on_P56373_WT,Low,ChEMBL29,CHEMBL492934,COc1cc(C(C)C)c(Oc2cnc(NCCS(C)(=O)=O)nc2N)cc1I,AAAAEENPAALFRN,AAAAEENPAALFRN-UHFFFAOYSA-N,InChI=1S/C17H23IN4O4S/c1-10(2)11-7-14(25-3)12(...,"""AuxInfo=1/1/N:7,8,1,19,16,17,4,25,12,6,5,26,9...",P56373_WT,...,0,,=,7.400,7.4,0.0,0.0,1.0,7.4,0.0
2,AAAAEENPAALFRN_on_Q9UBL9_WT,Low,ChEMBL29,CHEMBL492934,COc1cc(C(C)C)c(Oc2cnc(NCCS(C)(=O)=O)nc2N)cc1I,AAAAEENPAALFRN,AAAAEENPAALFRN-UHFFFAOYSA-N,InChI=1S/C17H23IN4O4S/c1-10(2)11-7-14(25-3)12(...,"""AuxInfo=1/1/N:7,8,1,19,16,17,4,25,12,6,5,26,9...",Q9UBL9_WT,...,0,,=,7.400,7.4,0.0,0.0,1.0,7.4,0.0
3,AAAAKTROWFNLEP_on_P49137_WT,Low,ChEMBL29,CHEMBL246893,CC1CNC(=O)c2c1c1c(ccc(C(=O)N(C)C)c1)[nH]2,AAAAKTROWFNLEP,AAAAKTROWFNLEP-UHFFFAOYSA-N,InChI=1S/C15H17N3O2/c1-8-7-16-14(19)13-12(8)10...,"""AuxInfo=1/1/N:1,17,18,12,11,19,3,2,13,9,10,8,...",P49137_WT,...,0,,<,4.699,4.699,0.0,0.0,1.0,4.699,0.0
4,AAAAZQPHATYWOK_on_P00533_WT,High,Sharma2016;Sharma2016;ChEMBL29,4277046;4277046;CHEMBL175513,CCOc1c(NC(=O)C=CCN(C)C)cc2c(Nc3cc(Cl)c(OCc4nc5...,AAAAZQPHATYWOK,AAAAZQPHATYWOK-UHFFFAOYSA-N,InChI=1S/C32H29ClN6O3S/c1-4-41-28-16-25-22(15-...,"""AuxInfo=1/1/N:1,13,14,2,32,31,10,33,30,9,36,3...",P00533_WT,...,0,,=,6.726; 6.063; 6.730,6.506,0.313,0.222,3.0,6.726,0.004
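
The chunked reading pattern can be illustrated without Papyrus itself. The sketch below is not how *read_papyrus* is implemented; it only shows, with the standard library and a tiny made-up table, how a compressed delimited file can be streamed in fixed-size chunks without being decompressed on disk. The helper `iter_chunks` and the toy rows are hypothetical.

```python
import csv
import gzip
import io
from itertools import islice


def iter_chunks(compressed_bytes, chunksize):
    """Yield lists of at most `chunksize` rows from gzip-compressed TSV bytes."""
    with gzip.open(io.BytesIO(compressed_bytes), mode="rt",
                   encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        while True:
            chunk = list(islice(reader, chunksize))
            if not chunk:
                return
            yield chunk


# Toy stand-in for a compressed data file (five rows, made-up values).
toy_table = "Activity_ID\tQuality\n" + "".join(
    f"compound{i}_on_target\tHigh\n" for i in range(5)
)
toy_bytes = gzip.compress(toy_table.encode("utf-8"))

chunks = iter_chunks(toy_bytes, chunksize=2)
first_chunk = next(chunks)   # only the first two rows are parsed here
remaining = list(chunks)     # the remaining rows are parsed on demand
chunk_sizes = [len(first_chunk)] + [len(chunk) for chunk in remaining]
print(chunk_sizes)
```

As with *sample_data* above, calling `next` on the iterator yields one chunk at a time, so memory use is bounded by the chunk size rather than by the size of the file.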


If you are sure your hardware can handle loading all the data, then you can drop *chunksize*.<br/>
Then the return value is a pandas dataframe.

Below, we will show how to use:
  - a pandas dataframe (by calling our methods on *chunk1*)
  - an iterator of pandas dataframes (by calling our methods on *sample_data*)

### Protein target data

But for now let's focus on protein data:<br/>
Information about the protein targets is available from a different file and can be loaded as easily as was demonstrated above.<br/>
This file being very limited in size, chunking is not needed. 

In [6]:
protein_data = read_protein_set(source_path=PAPYRUS_DATA_FOLDER)
protein_data.head()

Unnamed: 0,target_id,HGNC_symbol,UniProtID,Status,Organism,Classification,Length,Sequence
0,P47747_WT,,HRH2_CAVPO,reviewed,Cavia porcellus (Guinea pig),Membrane receptor->Family A G protein-coupled ...,359,MAFNGTVPSFCMDFTVYKVTISVILIILILVTVAGNVVVCLAVGLN...
1,B0FL73_WT,,B0FL73_CAVPO,unreviewed,Cavia porcellus (Guinea pig),Membrane receptor->Family A G protein-coupled ...,467,MGAGVLALGASEPCNLSSTAPLPDGAATAARLLVPASPPASLLPPT...
2,Q8K4Z4_WT,,ADRB2_CAVPO,reviewed,Cavia porcellus (Guinea pig),Membrane receptor->Family A G protein-coupled ...,418,MGHLGNGSDFLLAPNASHAPDHNVTRERDEAWVVGMAIVMSLIVLA...
3,P97266_WT,,OPRM_CAVPO,reviewed,Cavia porcellus (Guinea pig),Membrane receptor->Family A G protein-coupled ...,98,YTKMKTATNIYIFNLALADALATSTLPFQSVNYLMGTWPFGTILCK...
4,P41144_WT,,OPRK_CAVPO,reviewed,Cavia porcellus (Guinea pig),Membrane receptor->Family A G protein-coupled ...,380,MGRRRQGPAQPASELPARNACLLPNGSAWLPGWAEPDGNGSAGPQD...


## Filtering Papyrus

The data contained in the dataset can be filtered very easily using functions under *papyrus_scripts.preprocess*.<br/>
All filtering functions start with ***keep_***.

In [7]:
from papyrus_scripts.preprocess import (keep_quality, keep_source, keep_type,
                                        keep_organism, keep_accession, keep_protein_class
                                       )

**The strength of the Papyrus scripts is that the data can be filtered whether chunked or not.** The only difference:
  - when using chunked data, call *consume_chunks* once all filters are applied to reconstitute a pandas dataframe

### Filtering pandas dataframes

Let's first keep the data with quality 'medium' and above (namely 'high' and 'medium').

In [8]:
filter1 = keep_quality(data=chunk1, min_quality='medium')
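
A minimum-quality filter amounts to an ordinal comparison. The sketch below assumes the three levels seen in the *Quality* column are ordered Low < Medium < High; `keep_min_quality` is a hypothetical helper written for illustration, not the code behind *keep_quality*.

```python
# Assumed ordering of the quality labels (lower rank = lower quality).
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def keep_min_quality(rows, min_quality):
    """Return the rows whose 'Quality' is at least `min_quality` (case-insensitive)."""
    threshold = QUALITY_RANK[min_quality.lower()]
    return [row for row in rows
            if QUALITY_RANK[row["Quality"].lower()] >= threshold]


# Activity identifiers and qualities copied from the outputs shown in this notebook.
rows = [
    {"Activity_ID": "AAAAEENPAALFRN_on_P49654_WT", "Quality": "High"},
    {"Activity_ID": "AAAAEENPAALFRN_on_P56373_WT", "Quality": "Low"},
    {"Activity_ID": "AAEKULYONKUBOZ_on_P23975_WT", "Quality": "Medium"},
]
kept = keep_min_quality(rows, min_quality="medium")
kept_qualities = [row["Quality"] for row in kept]
print(kept_qualities)
```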

<u>Using <a href="https://www.ebi.ac.uk/chembl/visualise/">ChEMBL's protein target tree</a> is encouraged for this part.</u><br/>
<br/>
We will then keep only the proteins belonging to either of these two classes:
* Ligand-gated ion channels
* SLC superfamily of solute carriers

For this filter, passing protein information is required (the same applies for *keep_organism* and *keep_accession*).

In [9]:
filter2 = keep_protein_class(data=filter1, protein_data=protein_data, classes=[{'l2': 'Ligand-gated ion channels'}, {'l3': 'SLC superfamily of solute carriers'}])
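
To make the `classes` argument more concrete, the sketch below shows one plausible reading of it. It assumes that the *Classification* strings of the protein table are split on `->` into levels `l1`, `l2`, `l3`, and that a protein is kept when it agrees with at least one of the dictionaries. This is an assumption about the matching logic, not the library's implementation, and the classification strings used here are illustrative.

```python
def classification_levels(classification):
    """Split 'A->B->C' into {'l1': 'A', 'l2': 'B', 'l3': 'C'}."""
    parts = classification.split("->")
    return {f"l{i}": part for i, part in enumerate(parts, start=1)}


def matches_any_class(classification, classes):
    """True if at least one dict in `classes` agrees with the levels on all its keys."""
    levels = classification_levels(classification)
    return any(
        all(levels.get(key) == value for key, value in wanted.items())
        for wanted in classes
    )


classes = [{"l2": "Ligand-gated ion channels"},
           {"l3": "SLC superfamily of solute carriers"}]

# Illustrative classification strings (not copied from the protein file).
transporter = "Transporter->Electrochemical transporter->SLC superfamily of solute carriers"
channel = "Ion channel->Ligand-gated ion channels->Example family"
receptor = "Membrane receptor->Example receptor family"

print(matches_any_class(transporter, classes),
      matches_any_class(channel, classes),
      matches_any_class(receptor, classes))
```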

We now keep only K<sub>i</sub> and K<sub>D</sub> data.<br/>
Here we pass *filter2*, the output of the previous step, to the next *keep_* function.

In [10]:
filter3 = keep_type(data=filter2, activity_types=['Ki', 'KD'])

We finally keep only human and rat data (protein information is also required here).

In [11]:
filter4 = keep_organism(data=filter3, protein_data=protein_data, organism=['Human', 'Rat'], generic_regex=True)
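
The need for protein information can be sketched as a lookup followed by a pattern match: the activity rows only carry a *target_id*, while the organism is stored in the protein table. The example below is a hypothetical illustration of that idea using a case-insensitive regular expression; `keep_by_organism` is not part of the Papyrus scripts, and the human and rat organism strings are assumed to follow the same "Latin name (common name)" pattern as the guinea pig entries shown earlier.

```python
import re


def keep_by_organism(activities, proteins, organisms):
    """Keep activities whose target's organism matches any of the given names."""
    pattern = re.compile(
        "|".join(rf"\b{re.escape(name)}\b" for name in organisms),
        re.IGNORECASE,
    )
    organism_of = {protein["target_id"]: protein["Organism"] for protein in proteins}
    return [activity for activity in activities
            if pattern.search(organism_of.get(activity["target_id"], ""))]


proteins = [
    {"target_id": "P31645_WT", "Organism": "Homo sapiens (Human)"},       # assumed string
    {"target_id": "P31652_WT", "Organism": "Rattus norvegicus (Rat)"},    # assumed string
    {"target_id": "P47747_WT", "Organism": "Cavia porcellus (Guinea pig)"},
]
activities = [{"target_id": "P31645_WT"},
              {"target_id": "P31652_WT"},
              {"target_id": "P47747_WT"}]

kept = keep_by_organism(activities, proteins, organisms=["Human", "Rat"])
kept_targets = [activity["target_id"] for activity in kept]
print(kept_targets)
```

Word boundaries are used in this sketch so that a short name such as 'Rat' does not match inside an unrelated longer word.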

Let us have a look at the filtered data.

In [12]:
filter4.head()

Unnamed: 0,Activity_ID,Quality,source,CID,SMILES,connectivity,InChIKey,InChI,InChI_AuxInfo,target_id,...,type_other,Activity_class,relation,pchembl_value,pchembl_value_Mean,pchembl_value_StdDev,pchembl_value_SEM,pchembl_value_N,pchembl_value_Median,pchembl_value_MAD
804,AAEKULYONKUBOZ_on_P23975_WT,Medium,ChEMBL29,CHEMBL14144,CN1C2CCC1C(C(=O)Oc1ccccc1)C(c1ccc(Cl)cc1)C2,AAEKULYONKUBOZ,AAEKULYONKUBOZ-UHFFFAOYSA-N,InChI=1S/C21H22ClNO2/c1-23-16-11-12-19(23)20(2...,"""AuxInfo=1/0/N:1,14,13,15,12,16,19,24,20,23,4,...",P23975_WT,...,0,,=,6.62,6.62,0.0,0.0,1.0,6.62,0.0
805,AAEKULYONKUBOZ_on_P23977_WT,High,ChEMBL29,CHEMBL14144,CN1C2CCC1C(C(=O)Oc1ccccc1)C(c1ccc(Cl)cc1)C2,AAEKULYONKUBOZ,AAEKULYONKUBOZ-UHFFFAOYSA-N,InChI=1S/C21H22ClNO2/c1-23-16-11-12-19(23)20(2...,"""AuxInfo=1/0/N:1,14,13,15,12,16,19,24,20,23,4,...",P23977_WT,...,0,,=,8.28,8.28,0.0,0.0,1.0,8.28,0.0
806,AAEKULYONKUBOZ_on_P31645_WT,High,ChEMBL29,CHEMBL14144,CN1C2CCC1C(C(=O)Oc1ccccc1)C(c1ccc(Cl)cc1)C2,AAEKULYONKUBOZ,AAEKULYONKUBOZ-UHFFFAOYSA-N,InChI=1S/C21H22ClNO2/c1-23-16-11-12-19(23)20(2...,"""AuxInfo=1/0/N:1,14,13,15,12,16,19,24,20,23,4,...",P31645_WT,...,0,,=,6.41,6.41,0.0,0.0,1.0,6.41,0.0
808,AAEKULYONKUBOZ_on_Q9WTR4_WT,High,ChEMBL29,CHEMBL14144,CN1C2CCC1C(C(=O)Oc1ccccc1)C(c1ccc(Cl)cc1)C2,AAEKULYONKUBOZ,AAEKULYONKUBOZ-UHFFFAOYSA-N,InChI=1S/C21H22ClNO2/c1-23-16-11-12-19(23)20(2...,"""AuxInfo=1/0/N:1,14,13,15,12,16,19,24,20,23,4,...",Q9WTR4_WT,...,0,,=,6.62,6.62,0.0,0.0,1.0,6.62,0.0
809,AAEKULYONKUBOZ_on_Q01959_WT,Medium,ChEMBL29,CHEMBL14144,CN1C2CCC1C(C(=O)Oc1ccccc1)C(c1ccc(Cl)cc1)C2,AAEKULYONKUBOZ,AAEKULYONKUBOZ-UHFFFAOYSA-N,InChI=1S/C21H22ClNO2/c1-23-16-11-12-19(23)20(2...,"""AuxInfo=1/0/N:1,14,13,15,12,16,19,24,20,23,4,...",Q01959_WT,...,0,,=,8.28,8.28,0.0,0.0,1.0,8.28,0.0


In [13]:
f'Number of activity points: {filter4.shape[0]}'

'Number of activity points: 53'

Remember that this result comes from only the first chunk of the entire dataset.

One can now save this dataframe like any other pandas object.

### Filtering iterators of dataframes

Now that the filtering capacity of the Papyrus scripts has been demonstrated for entire dataframes, we can try with chunked iterators.

Let's first re-instantiate the sample data. This time we will use a chunk size of 1,000,000.

In [14]:
sample_data = read_papyrus(is3d=False, chunksize=1000000, source_path=PAPYRUS_DATA_FOLDER)

For this we will go through the same filters as above but iterate over the entire dataset.

In [15]:
filter1_it = keep_quality(data=sample_data, min_quality='medium')
filter2_it = keep_protein_class(data=filter1_it, protein_data=protein_data, classes=[{'l2': 'Ligand-gated ion channels'}, {'l3': 'SLC superfamily of solute carriers'}])
filter3_it = keep_type(data=filter2_it, activity_types=['Ki', 'KD'])
filter4_it = keep_organism(data=filter3_it, protein_data=protein_data, organism=['Human', 'Rat'], generic_regex=True)

The filters are not applied immediately to chunked iterators: they are evaluated lazily, and one can easily check that *filter4_it* is a generator rather than a pandas dataframe.

In [16]:
filter4_it

<generator object _chunked_keep_organism at 0x000002BCF5A70780>
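
The lazy behaviour of chained generators can be reproduced in a few lines of plain Python. The functions below are toy stand-ins (the names `keep_even`, `keep_large` and `consume` are hypothetical) and only illustrate the general mechanism: building the pipeline runs no filter, and work happens only when the chunks are consumed.

```python
calls = []  # records every time a filter actually processes a chunk


def keep_even(chunks):
    """Lazily keep the even numbers of each chunk."""
    for chunk in chunks:
        calls.append("keep_even")
        yield [x for x in chunk if x % 2 == 0]


def keep_large(chunks, minimum):
    """Lazily keep the numbers of each chunk that are >= minimum."""
    for chunk in chunks:
        calls.append("keep_large")
        yield [x for x in chunk if x >= minimum]


def consume(chunks):
    """Toy analogue of consuming chunks: run the pipeline and flatten the result."""
    return [x for chunk in chunks for x in chunk]


source = iter([[1, 2, 3, 4], [5, 6, 7, 8]])
pipeline = keep_large(keep_even(source), minimum=4)

calls_before = len(calls)   # nothing has been executed yet
result = consume(pipeline)  # the filters run here, chunk by chunk
print(calls_before, result, len(calls))
```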

To apply the filters on the entire iterator, one needs to call *consume_chunks*.<br/>
This function can be found under *papyrus_scripts.preprocess* just like the *keep_* functions used for filtering.

In [17]:
from papyrus_scripts.preprocess import consume_chunks

In order to follow progress of the filtering process, one needs to pass the total number of chunks the filters will go through.<br/>
$Total = \displaystyle \Bigl\lceil \frac{Size_{dataset}}{chunksize} \Bigr\rceil $<br/>

In version 5.4 of the Papyrus dataset the number of compound-protein activity points depends on whether stereochemistry is used or not **(remember we discourage its usage)**.<br/>


| Stereochemistry | Total size |
| :--- | :---: |
| Without | 59,763,781 |
| With (strongly discouraged) | 61,085,152 |

In this example $Total = \displaystyle \Bigl\lceil \frac{59{,}763{,}781}{1{,}000{,}000} \Bigr\rceil = 60 $<br/>
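
The same arithmetic can be checked quickly in Python for other chunk sizes. This is a minimal sketch using the dataset sizes from the table above; `total_chunks` is a hypothetical helper, and integer arithmetic is used to avoid any floating-point rounding.

```python
# Dataset sizes for version 5.4, taken from the table above.
DATASET_SIZES = {"without_stereo": 59_763_781, "with_stereo": 61_085_152}


def total_chunks(n_rows, chunksize):
    """Ceiling division: number of chunks needed to cover n_rows."""
    return -(-n_rows // chunksize)


total_without = total_chunks(DATASET_SIZES["without_stereo"], 1_000_000)
total_with = total_chunks(DATASET_SIZES["with_stereo"], 1_000_000)
total_small_chunks = total_chunks(DATASET_SIZES["without_stereo"], 10_000)
print(total_without, total_with, total_small_chunks)
```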

In [18]:
filtered_data = consume_chunks(filter4_it, progress=True, total=60)

  0%|          | 0/60 [00:00<?, ?it/s]

Although filtering the entire dataset this way may take around 30 minutes, it is the ideal way to work with the dataset on a laptop.

In [19]:
f'Number of activity points: {filtered_data.shape[0]}'

We hope these simple examples demonstrate how easily the Papyrus data can be filtered.
Let's now focus on modelling.

## Modelling the bioactivity data

The Papyrus scripts allow for both quantitative structure-activity relationship (QSAR) and proteochemometrics (PCM) modelling.<br/>
All functions related to modelling can be found under *papyrus_scripts.modelling*.
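
In general terms, the difference between the two approaches lies in the feature vector given to the model: a QSAR model sees only compound descriptors for one target, whereas a PCM model sees compound and protein descriptors together and can therefore cover several targets at once. The sketch below illustrates this with made-up descriptor values and hypothetical helper names; it does not reflect the internals of *papyrus_scripts.modelling*.

```python
# Made-up descriptor values, for illustration only.
compound_descriptors = {"compound_A": [0.1, 0.7], "compound_B": [0.4, 0.2]}
protein_descriptors = {"P31645_WT": [1.0, 0.0, 0.5], "P31652_WT": [0.9, 0.1, 0.5]}


def qsar_features(compound):
    """QSAR: one model per target, features are the compound descriptors only."""
    return list(compound_descriptors[compound])


def pcm_features(compound, target):
    """PCM: one model for several targets, compound and protein descriptors concatenated."""
    return compound_descriptors[compound] + protein_descriptors[target]


qsar_row = qsar_features("compound_A")
pcm_row_human = pcm_features("compound_A", "P31645_WT")
pcm_row_rat = pcm_features("compound_A", "P31652_WT")
print(len(qsar_row), len(pcm_row_human), pcm_row_human != pcm_row_rat)
```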

**Disclaimer:**<br/>
For now, only precomputed molecular descriptors can be used, which prevents the models from being applied to compounds outside of Papyrus.<br/>
This major flaw will soon be fixed.

### QSAR models

In [20]:
from papyrus_scripts.modelling import qsar
import xgboost

Let us first restrict the data to the human serotonin transporter (SLC6A4, accession P31645).

In [21]:
sample_data = read_papyrus(is3d=False, chunksize=1000000, source_path=PAPYRUS_DATA_FOLDER)
filter1_it = keep_accession(sample_data, 'P31645')
filter2_it = keep_quality(data=filter1_it, min_quality='medium')
filter3_it = keep_type(data=filter2_it, activity_types=['Ki', 'KD'])

In [22]:
SLC6A4_data = consume_chunks(filter3_it, total=60)

  0%|          | 0/60 [00:00<?, ?it/s]

Herein it is assumed that the Papyrus <a href="https://drive.google.com/drive/folders/1Lhw5G6gu_nLzHQoGmnl02uhFsmOgEZ5a?usp=sharing">molecular descriptors</a> were downloaded and placed in the ***PAPYRUS_DATA_FOLDER*** folder.

We will first create a regression model predicting the average pActivity values of a compound-target pair (i.e. *pchembl_value_Mean*).
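
The aggregated *pchembl_value_* columns can be reproduced by hand for the one multi-measurement row shown in the first chunk (`6.726; 6.063; 6.730`). The sketch below uses only the standard library. That the reported values are matched by a population standard deviation, a standard error based on the sample standard deviation, and a median absolute deviation is an observation made on this single row, not a statement about how the dataset was built.

```python
import math
import statistics

# Individual measurements of the P00533_WT row shown in the first chunk.
values = [float(v) for v in "6.726; 6.063; 6.730".split(";")]

n = len(values)
mean = statistics.mean(values)
std_dev = statistics.pstdev(values)                # population standard deviation
sem = statistics.stdev(values) / math.sqrt(n)      # standard error of the mean
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)  # median absolute deviation

summary = {
    "Mean": round(mean, 3), "StdDev": round(std_dev, 3), "SEM": round(sem, 3),
    "N": n, "Median": round(median, 3), "MAD": round(mad, 3),
}
print(summary)
```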

In [23]:
reg_model = xgboost.XGBRegressor(verbosity=0)

In [24]:
reg_results, trained_reg_model = qsar(data=SLC6A4_data,
                                  endpoint='pchembl_value_Mean',
                                  quality='low',
                                  source='any',
                                  activity_types='any',
                                  num_points=30,
                                  delta_activity=2,
                                  descriptors='mold2',
                                  descriptor_path=PAPYRUS_DATA_FOLDER,
                                  descriptor_chunksize=50000,
                                  activity_threshold=6.5,
                                  model=reg_model,
                                  folds=5,
                                  stratify=False,
                                  split_by='Year',
                                  split_year=2013,
                                  test_set_size=0.30,
                                  validation_set_size=0.30,
                                  cluster_method=None,
                                  custom_groups=None,
                                  random_state=1234,
                                  verbose=True)

Loading molecular descriptors: 0it [00:00, ?it/s]

  0%|                                                                                             | 0/1 [00:00…

In [25]:
reg_results

Unnamed: 0,Unnamed: 1,number,R2,MSE,RMSE,MSLE,RMSLE,MAE,Explained Variance,Max Error,Mean Poisson Distrib,Mean Gamma Distrib,Pearson r,Spearman r,Kendall tau
P31645_WT,Fold 1,341.0,0.613751,0.611546,0.782014,0.009559,0.097771,0.599266,0.613907,3.21758,0.086353,0.012412,0.783526,0.77833,0.590912
P31645_WT,Fold 2,341.0,0.522181,0.72568,0.851868,0.011601,0.107708,0.643654,0.522279,4.162226,0.104763,0.015442,0.727999,0.732167,0.546937
P31645_WT,Fold 3,341.0,0.555722,0.675527,0.821904,0.010439,0.102172,0.599939,0.557242,2.975757,0.095199,0.013642,0.748005,0.733045,0.555637
P31645_WT,Fold 4,341.0,0.557549,0.668057,0.817347,0.010817,0.104003,0.604281,0.563145,2.99242,0.095203,0.013921,0.751657,0.771598,0.577344
P31645_WT,Fold 5,340.0,0.580592,0.632147,0.795077,0.009619,0.098078,0.594852,0.580627,3.077268,0.088472,0.012591,0.762783,0.759198,0.573053
P31645_WT,Mean,340.8,0.565959,0.662591,0.813642,0.010407,0.101946,0.608398,0.56744,3.28505,0.093998,0.013602,0.754794,0.754868,0.568777
P31645_WT,SD,0.365148,0.027659,0.035847,0.02194,0.000699,0.003413,0.016322,0.027371,0.407933,0.005884,0.000994,0.01665,0.017515,0.014325
P31645_WT,Test set,309.0,-0.137833,1.371653,1.171176,0.020634,0.143644,0.89242,-0.127932,3.542618,0.188637,0.026236,0.247963,0.254628,0.190017


Judging by the average R<sup>2</sup> (0.57), performance over cross-validation is reasonable, but the model shows very little capacity to predict the temporally split test set (R<sup>2</sup> = -0.14).
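
A negative R<sup>2</sup> has a concrete meaning: the predictions are worse than always predicting the mean of the observed values. The sketch below uses the usual definition, one minus the ratio of the residual to the total sum of squares, on made-up numbers, and also checks that the reported test-set RMSE is the square root of the reported MSE. The function `r2` is written here for illustration.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up pActivity values, for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
r2_good = r2(y_true, [5.5, 6.0, 7.0, 7.5])   # close predictions
r2_mean = r2(y_true, [6.5, 6.5, 6.5, 6.5])   # always predicting the mean
r2_bad = r2(y_true, [8.0, 7.0, 6.0, 5.0])    # reversed ranking

# Test-set values from the table above: RMSE should equal sqrt(MSE).
rmse_from_mse = math.sqrt(1.371653)
print(r2_good, r2_mean, r2_bad, round(rmse_from_mse, 6))
```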

To train a classifier, all that is needed is to change the type of model.
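
A classifier needs class labels rather than continuous values. The calls in this notebook pass `activity_threshold=6.5`, which presumably serves to binarize the endpoint. The sketch below assumes that values greater than or equal to the threshold are labelled active ('A') and the others inactive ('N'); the exact boundary convention used by the library is not stated here, and `binarize` is a hypothetical helper.

```python
def binarize(values, threshold):
    """Label each value 'A' (active) if >= threshold, else 'N' (inactive)."""
    return ["A" if value >= threshold else "N" for value in values]


# pchembl_value_Mean values copied from the outputs shown earlier in this notebook.
pchembl_means = [7.8, 7.4, 4.699, 6.506, 6.62, 8.28, 6.41]
labels = binarize(pchembl_means, threshold=6.5)
ratio = f"{labels.count('A')}:{labels.count('N')}"
print(labels, ratio)
```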

In [26]:
cls_model = xgboost.XGBClassifier(verbosity=0)

In [27]:
cls_results, trained_cls_model = qsar(data=SLC6A4_data,
                                      endpoint='pchembl_value_Mean',
                                      quality='low',
                                      source='any',
                                      activity_types='any',
                                      num_points=30,
                                      delta_activity=2,
                                      descriptors='mold2',
                                      descriptor_path=PAPYRUS_DATA_FOLDER,
                                      descriptor_chunksize=50000,
                                      activity_threshold=6.5,
                                      model=cls_model,
                                      folds=5,
                                      stratify=False,
                                      split_by='Year',
                                      split_year=2013,
                                      test_set_size=0.30,
                                      validation_set_size=0.30,
                                      cluster_method=None,
                                      custom_groups=None,
                                      random_state=1234,
                                      verbose=True)

Loading molecular descriptors: 0it [00:00, ?it/s]

  0%|                                                                                             | 0/1 [00:00…

In [28]:
cls_results

Unnamed: 0,Unnamed: 1,MCC,A:N,ACC,BACC,Sensitivity,Specificity,PPV,NPV,F1,AUC A,AUC N
P31645_WT,Fold 1,0.557173,245:96,0.824047,0.773023,0.65625,0.889796,0.7,0.868526,0.677419,0.144813,0.855187
P31645_WT,Fold 2,0.622859,245:96,0.853372,0.793431,0.65625,0.930612,0.7875,0.873563,0.715909,0.101828,0.898172
P31645_WT,Fold 3,0.600187,244:97,0.841642,0.78997,0.670103,0.909836,0.747126,0.874016,0.706522,0.112261,0.887739
P31645_WT,Fold 4,0.596339,244:97,0.844575,0.76407,0.57732,0.95082,0.823529,0.849817,0.678788,0.100642,0.899358
P31645_WT,Fold 5,0.524965,241:99,0.811765,0.748187,0.59596,0.900415,0.710843,0.844358,0.648352,0.129259,0.870741
P31645_WT,Mean,0.580304,-,0.83508,0.773736,0.631176,0.916296,0.7538,0.862056,0.685398,0.117761,0.882239
P31645_WT,SD,0.031783,-,0.013743,0.015265,0.033945,0.019971,0.042391,0.011404,0.021821,0.015498,0.015498
P31645_WT,Test set,0.235181,235:74,0.7411,0.607591,0.351351,0.86383,0.448276,0.808765,0.393939,0.301035,0.698965


Looking at the active to inactive ratio (i.e. A:N), one can clearly identify the reason for this low prediction performance over the test set.<br/>
Oversampling and/or undersampling techniques could help the model better identify the boundary between actives and inactives in the molecular descriptor space.<br/>
However, the use of such techniques is not the focus here.
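
The classification metrics reported above are standard functions of a confusion matrix. As a consistency check, the counts used below (TP=26, FN=48, TN=203, FP=32) were back-calculated here from the test-set row of the QSAR classifier; they are not part of the library output, but they reproduce that row to the printed precision. They imply that the positive class of these metrics has 74 members in the test set and the negative class 235.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Common binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "BACC": (sensitivity + specificity) / 2,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Back-calculated counts for the QSAR classifier test set (an inference, see above).
metrics = binary_metrics(tp=26, fn=48, tn=203, fp=32)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The low sensitivity (26 of 74 positives recovered) next to a high specificity is the typical signature of a model that favours the majority class.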

### PCM models

In [29]:
from papyrus_scripts.modelling import pcm

Let us see if including rat data improves the quality of the model.

In [30]:
sample_data = read_papyrus(is3d=False, chunksize=1000000, source_path=PAPYRUS_DATA_FOLDER)
filter1_it = keep_accession(sample_data, ['P31645', 'P31652'])
filter2_it = keep_quality(data=filter1_it, min_quality='medium')
filter3_it = keep_type(data=filter2_it, activity_types=['Ki', 'KD'])

In [31]:
SLC6A4_human_rat = consume_chunks(filter3_it, total=60)

  0%|          | 0/60 [00:00<?, ?it/s]

Herein it is assumed that the Papyrus <a href="https://drive.google.com/drive/folders/1Lhw5G6gu_nLzHQoGmnl02uhFsmOgEZ5a?usp=sharing">protein descriptors</a> were downloaded and placed in the ***PAPYRUS_DATA_FOLDER*** folder.

In [32]:
pcm_reg_model = xgboost.XGBRegressor(verbosity=0)

In [33]:
pcm_reg_results, pcm_reg_trained_model = pcm(data=SLC6A4_human_rat,
                                             endpoint='pchembl_value_Mean',
                                             quality='low',
                                             source='any',
                                             activity_types='any',
                                             num_points=30,
                                             delta_activity=2,
                                             mol_descriptors='mold2',
                                             mol_descriptor_path=PAPYRUS_DATA_FOLDER,
                                             mol_descriptor_chunksize=50000,
                                             prot_sequences_path=None,
                                             prot_descriptors='unirep',
                                             prot_descriptor_path=PAPYRUS_DATA_FOLDER,
                                             prot_descriptor_chunksize=50000,
                                             activity_threshold=6.5,
                                             model=pcm_reg_model,
                                             folds=5,
                                             stratify=False,
                                             split_by='Year',
                                             split_year=2013,
                                             test_set_size=0.30,
                                             cluster_method=None,
                                             custom_groups=None,
                                             random_state=1234,
                                             verbose=True)

Loading molecular descriptors: 0it [00:00, ?it/s]

Loading protein descriptors: 0it [00:00, ?it/s]

Fitting model:   0%|          | 0/6 [00:00<?, ?it/s]

In [34]:
pcm_reg_results

Unnamed: 0,number,R2,MSE,RMSE,MSLE,RMSLE,MAE,Explained Variance,Max Error,Mean Poisson Distrib,Mean Gamma Distrib,Pearson r,Spearman r,Kendall tau
Fold 1,759.0,0.650572,0.539854,0.734747,0.008344,0.091346,0.555272,0.651308,2.89146,0.075833,0.010901,0.807642,0.817277,0.624606
Fold 2,759.0,0.668604,0.507427,0.712339,0.007716,0.087843,0.528931,0.668621,3.406365,0.07057,0.01005,0.81794,0.824713,0.635008
Fold 3,758.0,0.648581,0.524182,0.724004,0.008413,0.091722,0.532656,0.651415,2.667242,0.074473,0.010802,0.808409,0.809608,0.618169
Fold 4,758.0,0.639182,0.521883,0.722415,0.007907,0.088919,0.545989,0.639193,2.960461,0.072609,0.010308,0.799814,0.802759,0.610677
Fold 5,758.0,0.616353,0.578168,0.760374,0.008796,0.09379,0.571366,0.616505,4.112196,0.080409,0.011443,0.787766,0.796161,0.60675
Mean,758.4,0.644658,0.534303,0.730776,0.008235,0.090724,0.546843,0.645408,3.207545,0.074779,0.010701,0.804314,0.810103,0.619042
SD,0.447214,0.015572,0.022113,0.014986,0.00035,0.001929,0.014114,0.015725,0.467406,0.003037,0.000443,0.009198,0.009251,0.009199
Test set,628.0,0.288017,0.963438,0.981549,0.016309,0.127705,0.774795,0.293285,3.042561,0.141789,0.021233,0.559699,0.558549,0.391705


As with QSAR models, training a classifier is a matter of changing the underlying model to be used.

In [35]:
pcm_cls_model = xgboost.XGBClassifier(verbosity=0)

In [36]:
pcm_cls_results, pcm_cls_trained_model = pcm(data=SLC6A4_human_rat,
                                             endpoint='pchembl_value_Mean',
                                             quality='low',
                                             source='any',
                                             activity_types='any',
                                             num_points=30,
                                             delta_activity=2,
                                             mol_descriptors='mold2',
                                             mol_descriptor_path=PAPYRUS_DATA_FOLDER,
                                             mol_descriptor_chunksize=50000,
                                             prot_sequences_path=None,
                                             prot_descriptors='unirep',
                                             prot_descriptor_path=PAPYRUS_DATA_FOLDER,
                                             prot_descriptor_chunksize=50000,
                                             activity_threshold=6.5,
                                             model=pcm_cls_model,
                                             folds=5,
                                             stratify=False,
                                             split_by='Year',
                                             split_year=2013,
                                             test_set_size=0.30,
                                             cluster_method=None,
                                             custom_groups=None,
                                             random_state=1234,
                                             verbose=True)

Loading molecular descriptors: 0it [00:00, ?it/s]

Loading protein descriptors: 0it [00:00, ?it/s]

Fitting model:   0%|          | 0/6 [00:00<?, ?it/s]

In [37]:
pcm_cls_results

Unnamed: 0,MCC,A:N,ACC,BACC,Sensitivity,Specificity,PPV,NPV,F1,AUC A,AUC N
Fold 1,0.645298,523:236,0.852437,0.808054,0.690678,0.92543,0.806931,0.868941,0.744292,0.084867,0.915133
Fold 2,0.690676,556:203,0.881423,0.836185,0.738916,0.933453,0.802139,0.907343,0.769231,0.069032,0.930964
Fold 3,0.616368,531:227,0.843008,0.798415,0.687225,0.909605,0.764706,0.871841,0.723898,0.09193,0.90807
Fold 4,0.670853,546:212,0.872032,0.817394,0.693396,0.941392,0.821229,0.887737,0.751918,0.082063,0.917937
Fold 5,0.673868,539:219,0.869393,0.82548,0.721461,0.929499,0.806122,0.891459,0.761446,0.078244,0.921756
Mean,0.659413,-,0.863659,0.817106,0.706335,0.927876,0.800225,0.885464,0.750157,0.081227,0.918772
SD,0.023697,-,0.012721,0.012011,0.018566,0.009629,0.017254,0.01277,0.014249,0.006906,0.006904
Test set,0.439667,390:238,0.745223,0.698255,0.504202,0.892308,0.740741,0.746781,0.6,0.276287,0.723713
