# Predict and evaluate DTIs and BRs with a trained model

In [1]:
from HoTS.model.hots import *
from HoTS.utils.build_features import *

Using TensorFlow backend.


First, define encoders for protein sequences and compounds with the chosen encoding types.

In [2]:
# define input feature
prot_vec = "Sequence"
drug_vec = "Morgan"
drug_len = 2048
radius = 2
protein_encoder = ProteinEncoder(prot_vec)
compound_encoder = CompoundEncoder(drug_vec, radius=radius, n_bits=drug_len)

## Load trained model

You need to initialize the model.

In [3]:
# initialize model
dti_model = HoTS()

Hots model initialization done!


Then you can load the trained model from its configuration file, for example `dti_model.load_model("output.config.json")`.

In [4]:
# load model
dti_model.load_model("./Model/HoTS_config.json")

Given hyperparamters in ./Model/HoTS_config.json are loaded

protein_grid_size   :  10
compound_grid_size  :  None
anchors             :  [9]
hots_dimension      :  128
hots_n_heads        :  4
dropout             :  0.1
drug_layers         :  [512, 128]
protein_strides     :  [5, 10, 15, 20, 25, 30]
filters             :  128
fc_layers           :  [256, 64]
hots_fc_layers      :  [256, 64]
learning_rate       :  0.0001
prot_vec            :  Sequence
drug_vec            :  Morgan
drug_len            :  2048
activation          :  gelu
protein_layers      :  [128, 128, 128, 128]
reg_loss_weight     :  0.1
conf_loss_weight    :  1
negative_loss_weight:  0.1
retina_loss_weight  :  2
decay               :  0.0001
hots_file           :  ./Model/HoTS.h5
dti_file            :  ./Model/DTI.h5
hots_validation_results:  {}
dti_validation_results:  {'MATADOR_DTI': [{'AUC': 0.6055226824457594, 'AUPR': 0.5292486343535088}, {'AUC': 0.5882616344154805, 'AUPR': 0.5303096078019971}, {'AUC': 0.5835119

## Load BR dataset, predict and evaluate Performance

You can use the `parse_HoTS_data` function to load HoTS data as input for the HoTS model.

`parse_HoTS_data` is located in `HoTS.utils.build_features`.

If the file contains a `binding_region` column, set `binding_region=True`.

In [5]:
hots_data = parse_HoTS_data("./SampleData/HoTS/Validation_HoTS.tsv",
                           compound_encoder=compound_encoder, protein_encoder=protein_encoder, 
                          binding_region=True)

100%|██████████| 232/232 [00:00<00:00, 29246.17it/s]

Parsing HoTS data: ./SampleData/HoTS/Validation_HoTS.tsv
Number of 3D-complexes : 232
Number of proteins : 207





You can predict BRs with `HoTS_prediction`.

In [6]:
hots_prediction_result = dti_model.HoTS_prediction(**hots_data)

It returns a tuple of DTI predictions and BR predictions.

In [7]:
hots_prediction_result[1][0][0:10]

[(307, 321, 0.7835828065872192),
 (318, 333, 0.7639481425285339),
 (299, 312, 0.7182453870773315),
 (327, 342, 0.6910973191261292),
 (336, 349, 0.6828732490539551),
 (231, 243, 0.676645040512085),
 (366, 380, 0.6762275099754333),
 (387, 400, 0.6689488887786865),
 (291, 303, 0.6437587738037109),
 (249, 261, 0.6274698376655579)]

See the sections below for how to interpret this result.
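Each entry in the list above is a `(start, end, score)` tuple. As a minimal sketch of post-processing, the regions can be filtered by confidence. The helper name `filter_regions` and the 0.7 cutoff are illustrative assumptions and not part of the HoTS API.

```python
# Hypothetical helper, not part of HoTS: keep only confident regions.
def filter_regions(regions, min_score):
    """Return the (start, end, score) tuples whose score is at least min_score."""
    return [region for region in regions if region[2] >= min_score]


# First rows of the BR output shown above, with scores rounded for brevity.
regions = [
    (307, 321, 0.7836),
    (318, 333, 0.7639),
    (299, 312, 0.7182),
    (327, 342, 0.6911),
    (336, 349, 0.6829),
]

confident = filter_regions(regions, min_score=0.7)
for start, end, score in confident:
    print(f"{start}-{end}: {score:.2f}")
```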

You can evaluate the BR predictions with the `HoTS_validation` function, which calculates the average precision (AP).

__(AP calculation takes a long time, so you can skip this step.)__

In [8]:
hots_evalution_result = dti_model.HoTS_validation(**hots_data)

	AP :  0.7196415938563052


Evaluation results are summarized as a dictionary object.

In [9]:
hots_evalution_result

{'AP': 0.7196415938563052}

## Load DTI dataset, predict and evaluate Performance

You can use the `parse_DTI_data` function to load DTI data as input for the HoTS model.

`parse_DTI_data` is located in `HoTS.utils.build_features`.

In [10]:
dti_data = parse_DTI_data("./SampleData/DTI/Validation/Validation_DTI.csv", 
                   "./SampleData/DTI/Validation/Validation_Compound.csv", 
                   "./SampleData/DTI/Validation/Validation_Protein.csv", 
                   compound_encoder=compound_encoder, protein_encoder=protein_encoder)

Parsing ./SampleData/DTI/Validation/Validation_DTI.csv , ./SampleData/DTI/Validation/Validation_Compound.csv, ./SampleData/DTI/Validation/Validation_Protein.csv with length 2500, type None


  5%|▍         | 24/499 [00:00<00:01, 238.20it/s]

Encoding compound with Morgan type


100%|██████████| 499/499 [00:02<00:00, 237.80it/s]

Encoding compound ends!
	Positive data : 370
	Negative data : 507





You can predict DTI with the `DTI_prediction` function

In [11]:
dti_prediction = dti_model.DTI_prediction(**dti_data)

You can also evaluate performance with the `DTI_validation` function.

By default, HoTS calculates AUC and AUPR.

If you pass the `threshold` argument, sensitivity, specificity, precision, accuracy, and F1 score are calculated as well.

In [12]:
dti_evalution_result = dti_model.DTI_validation(threshold=0.4, **dti_data)

	Sen :  0.5648648648648649
	Spe :  0.8777120315581854
	Precision :  0.7712177121771218
	Acc :  0.7457240592930444
	F1 :  0.6521060842433697
	Area Under ROC Curve(AUC): 0.826
	Area Under PR Curve(AUPR): 0.790
	Optimal threshold(AUC)   : 0.000 
	Optimal threshold(AUPR)  : 0.359




Evaluation results are summarized as a dictionary object.

In [13]:
dti_evalution_result

{'Sen': 0.5648648648648649,
 'Spe': 0.8777120315581854,
 'Acc': 0.7457240592930444,
 'Pre': 0.7712177121771218,
 'F1': 0.6521060842433697,
 'AUC': 0.8260408337331414,
 'AUPR': 0.7895424822083337}
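The threshold-dependent metrics above follow the standard confusion-matrix definitions. The sketch below is not the HoTS implementation: the function names are hypothetical, and treating a score greater than or equal to the threshold as a positive prediction is an assumption. The counts TP=209, FP=62, TN=445, FN=161 are inferred from the reported sensitivity and specificity together with the 370 positive and 507 negative samples; they are not printed by HoTS.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "Sen": ratio(tp, tp + fn),                 # sensitivity (recall)
        "Spe": ratio(tn, tn + fp),                 # specificity
        "Pre": ratio(tp, tp + fp),                 # precision
        "Acc": ratio(tp + tn, tp + fp + tn + fn),  # accuracy
        "F1": ratio(2 * tp, 2 * tp + fp + fn),     # F1 score
    }


def threshold_metrics(labels, scores, threshold):
    """Binarize scores at `threshold` and compute the metrics."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold  # assumption: >= counts as positive
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return metrics_from_counts(tp, fp, tn, fn)


# Counts inferred from the reported results (see the paragraph above).
reported = metrics_from_counts(tp=209, fp=62, tn=445, fn=161)
print(reported)

# Toy labels and scores, made up for illustration.
toy = threshold_metrics([1, 1, 1, 0, 0, 0, 0],
                        [0.9, 0.5, 0.3, 0.6, 0.2, 0.1, 0.35],
                        threshold=0.4)
print(toy)
```

Raising the threshold generally trades sensitivity for specificity, so it is worth checking several values on validation data before fixing one.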

## Predict BRs and DTIs without loading data

In [14]:
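# `import urllib` alone does not guarantee that the `urllib.request` submodule is loaded
import urllib.request

uniprot_url = "https://www.uniprot.org/uniprot/{0}.fasta"

def get_smiles_from_cid(cid):
    # Fetch the canonical SMILES string of a compound from PubChem by its CID
    url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/CanonicalSMILES/txt" % cid
    return urllib.request.urlopen(url).read().decode("utf-8").strip()

def get_seq_from_uniprot_acc(uniprot_acc):
    # Fetch the FASTA record from UniProt, drop the header line, and join the sequence lines
    opened = urllib.request.urlopen(uniprot_url.format(uniprot_acc))
    lines = opened.readlines()
    return "".join([line.decode("utf-8").rstrip() for line in lines[1:]])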
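The sequence helper above reduces a FASTA record to a plain sequence string. A minimal offline sketch of that step follows. The function name `sequence_from_fasta` and the header text are made up for illustration, and unlike the helper above, which drops the first line, this version skips every line starting with `>`.

```python
def sequence_from_fasta(fasta_text):
    """Join all non-header lines of a FASTA record into one sequence string."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Made-up single-record FASTA text; the header is not a real UniProt entry.
example = ">toy|EXAMPLE|made-up header\nMLEICLKLVG\nCKSKKGLSSS\n"
sequence = sequence_from_fasta(example)
print(sequence)
```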

You can download SMILES strings by PubChem CID.

In [15]:
# SMILES of drugs (must be given as a list)
drugs = [get_smiles_from_cid(5291), get_smiles_from_cid(219024)]

In [16]:
print(drugs)

['CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5', 'CNC(=O)C1=CN(N=C1)C2=NC(=C3C(=N2)N(C=N3)C4C(C(C(O4)CO)O)O)N']


Also, you can download protein sequences with UniProt accession.

In [17]:
# Protein sequences (must be given as a list)

targets = [get_seq_from_uniprot_acc("P00519"), get_seq_from_uniprot_acc("P29274")]

In [18]:
print(targets)

['MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAEHRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLFSALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSPKPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSASCVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTVTPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGSALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPPPPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVLPATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPS

You need to encode the SMILES strings and sequences with the encoders defined above.

In [19]:
drugs_fp = [compound_encoder.encode(drug) for drug in drugs]
targets_encoded = [protein_encoder.encode(target) for target in targets]

Protein sequences of different lengths need to be padded to a common length.

In [20]:
target_encoded = protein_encoder.pad(targets_encoded)
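How `protein_encoder.pad` pads internally (pad value, side, target length) is not shown here. As a rough illustration of the idea only, the sketch below right-pads integer-encoded sequences with zeros up to the longest one. The function name `pad_sequences`, the zero pad value, and the toy encodings are all assumptions.

```python
def pad_sequences(encoded, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in encoded)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in encoded]


# Toy integer-encoded sequences of different lengths (made-up values).
toy_encoded = [[3, 1, 4], [1, 5, 9, 2, 6]]
padded = pad_sequences(toy_encoded)
print(padded)
```

After padding, all rows have the same length and can be stacked into a single batch array.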

### Prediction of DTIs

You can predict DTIs with the `DTI_prediction` function of the `dti_model` object.

In [21]:
dti_prediction_results = dti_model.DTI_prediction(drugs_fp, targets_encoded)

The model returns predictions of shape [N, 1].

In [22]:
print(dti_prediction_results)

[[0.92619896]
 [0.8873392 ]]


### Prediction of BRs

You can predict BRs with the `HoTS_prediction` function.

In [23]:
dti_predictions,  br_predictions = dti_model.HoTS_prediction(drugs_fp, targets_encoded)

BR prediction results contain a list of `(BR_start, BR_end, confidence score)` tuples for each protein grid.

They are sorted by confidence score in descending order.

In [24]:
print(br_predictions[0][0:10])

[(367, 385, 0.7754790186882019), (359, 373, 0.7527496218681335), (308, 324, 0.7307592630386353), (249, 264, 0.7113943099975586), (288, 303, 0.7104097604751587), (315, 330, 0.6692482233047485), (281, 294, 0.6628872156143188), (299, 312, 0.6184608936309814), (378, 390, 0.6164246201515198), (350, 364, 0.6097195148468018)]
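One way to turn such a list into a per-residue annotation is to mark every residue covered by at least one sufficiently confident region. This is a sketch under stated assumptions, not the HoTS implementation: `mark_regions` is a hypothetical helper, the regions are made-up toy values, and indices are assumed to be 0-based with an exclusive end.

```python
def mark_regions(sequence, regions, th):
    """Show residues inside any region scoring >= th and blanks elsewhere."""
    marked = [" "] * len(sequence)
    for start, end, score in regions:
        if score < th:
            continue
        # Assumption: 0-based start, exclusive end, clipped to the sequence.
        for pos in range(max(start, 0), min(end, len(sequence))):
            marked[pos] = sequence[pos]
    return "".join(marked)


sequence = "MLEICLKLVGCKSKKGLSSS"  # first 20 residues of the ABL1 sequence above
toy_regions = [(2, 6, 0.8), (5, 9, 0.65), (12, 16, 0.4)]  # made-up regions
annotation = mark_regions(sequence, toy_regions, th=0.6)
print(sequence)
print(annotation)
```

Overlapping regions merge naturally because each residue is marked at most once, and regions below the threshold leave no trace.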


### Visualization of BR predictions

You can visualize results with the `HoTS_visualization` function. It requires a list of names, one per DTI pair, to label each visualization.

In [25]:
names = ["ABL1_imatinib", "AA2AR_Regadenoson"]

In [26]:
dti_model.HoTS_visualization(drugs_fp, targets_encoded, targets, protein_names=names, th=0.6)

Prediction with 0.600000
ABL1_imatinib
DTI score :  [0.92619896]
  Sequence : MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWC
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : EAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVH
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : HHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
Prediction :                                                  GGQYGEVYEGVWKKYSLTVAVKTL        EFLKEAAVMKEIKHPNLVQ
     Score :           