# Predict and evaluate DTIs and BRs with a trained model

In [1]:
from HoTS.model.hots import *
from HoTS.utils.build_features import *


First, define encoders for protein sequences and compounds with the given encoding types.

In [2]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [3]:
# define input feature
prot_vec = "Sequence"
drug_vec = "Morgan"
drug_len = 2048
radius = 2
protein_encoder = ProteinEncoder(prot_vec)
compound_encoder = CompoundEncoder(drug_vec, radius=radius, n_bits=drug_len)

## Load trained model

You need to initialize the model.

In [4]:
# initialize model
dti_model = HoTS()

Hots model initialization done!


Then you can load the trained model by passing the path of its configuration file, for example `dti_model.load_model("output.config.json")`.

In [5]:
# load model
dti_model.load_model("./Model/Model_config.json")

Given hyperparamters in ./Model/Model_config.json are loaded

protein_grid_size   :  10
compound_grid_size  :  None
anchors             :  [9]
hots_dimension      :  128
hots_n_heads        :  4
dropout             :  0.1
drug_layers         :  [512, 128]
protein_strides     :  [5, 10, 15, 20, 25, 30]
filters             :  128
fc_layers           :  [256, 64]
hots_fc_layers      :  [256, 64]
learning_rate       :  0.0001
prot_vec            :  Sequence
drug_vec            :  Morgan
drug_len            :  2048
activation          :  gelu
protein_layers      :  [128, 128, 128, 128]
reg_loss_weight     :  0.1
conf_loss_weight    :  1
negative_loss_weight:  0.1
retina_loss_weight  :  2
decay               :  0.0001
hots_file           :  ./Model/HoTS.h5
dti_file            :  ./Model/DTI.h5
hots_validation_results:  {}
dti_validation_results:  {}
n_stack_hots_prediction:  2
protein_encoder_config:  {'feature': 'Sequence'}
compound_encoder_config:  {'radius': 2, 'feature': 'Morgan', 'n_bit

	HoTS Model is loaded from ./Model/HoTS.h5
	DTI Model is loaded from ./Model/DTI.h5


## Load BR dataset, predict and evaluate Performance

You can use the `parse_HoTS_data` function to load HoTS data as input of the HoTS model.

`parse_HoTS_data` is located in `HoTS.utils.build_features`.

If the file you parse contains a `binding_region` column, set `binding_region=True`.

In [6]:
hots_data = parse_HoTS_data("./SampleData/HoTS/Validation_HoTS.tsv",
                           compound_encoder=compound_encoder, protein_encoder=protein_encoder, 
                          binding_region=True)

Parsing HoTS data: ./SampleData/HoTS/Validation_HoTS.tsv
Number of 3D-complexes : 232
Number of proteins : 207


100%|██████████| 232/232 [00:00<00:00, 31168.43it/s]


You can predict BRs with `HoTS_prediction`.

In [7]:
hots_prediction_result = dti_model.HoTS_prediction(**hots_data)


It returns a tuple of DTI predictions and BR predictions.

In [8]:
hots_prediction_result[1][0][0:10]

[(316, 331, 0.7528640627861023),
 (307, 320, 0.718887209892273),
 (378, 393, 0.7044767737388611),
 (358, 372, 0.6984079480171204),
 (297, 311, 0.6941364407539368),
 (267, 281, 0.6887148022651672),
 (334, 349, 0.6546329259872437),
 (207, 221, 0.6542425751686096),
 (189, 201, 0.6519545316696167),
 (366, 378, 0.6451152563095093)]

Please see the sections below for how to interpret this result.

You can evaluate BR prediction results with the `HoTS_validation` function, which calculates AP.

__(AP calculation takes a long time, so you can skip this step.)__

In [9]:
hots_evalution_result = dti_model.HoTS_validation(**hots_data)

	AP :  0.6485652597577035


Evaluation results are summarized as a dictionary object.

In [10]:
hots_evalution_result

{'AP': 0.6485652597577035}
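
This notebook does not show how `HoTS_validation` computes AP. Assuming AP stands for average precision, the following is a generic sketch of the non-interpolated metric over ranked predictions, not HoTS's implementation. It assumes each prediction has already been marked as a hit or a miss against the ground-truth regions; the function name `average_precision` and its inputs are hypothetical.

```python
def average_precision(ranked_hits, n_ground_truth):
    """Non-interpolated AP for predictions sorted by descending confidence.

    ranked_hits: list of booleans, True where a prediction matches a
                 ground-truth item (the matching rule is up to the caller).
    n_ground_truth: total number of ground-truth items.
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    true_positives = 0
    precision_sum = 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            true_positives += 1
            # Add the precision observed at the rank of each correct prediction.
            precision_sum += true_positives / rank
    return precision_sum / n_ground_truth


# Toy example: hits at ranks 1 and 3, two ground-truth items in total.
toy_ap = average_precision([True, False, True], n_ground_truth=2)
print(round(toy_ap, 4))
```

In this toy case the precision is 1/1 at the first hit and 2/3 at the second, so the result is their sum divided by the two ground-truth items.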

## Load DTI dataset, predict and evaluate Performance

You can use the `parse_DTI_data` function to load DTI data as input of the HoTS model.

`parse_DTI_data` is located in `HoTS.utils.build_features`.

In [11]:
dti_data = parse_DTI_data("./SampleData/DTI/Validation/Validation_DTI.csv", 
                   "./SampleData/DTI/Validation/Validation_Compound.csv", 
                   "./SampleData/DTI/Validation/Validation_Protein.csv", 
                   compound_encoder=compound_encoder, protein_encoder=protein_encoder)

Parsing ./SampleData/DTI/Validation/Validation_DTI.csv , ./SampleData/DTI/Validation/Validation_Compound.csv, ./SampleData/DTI/Validation/Validation_Protein.csv with length 2500, type None
Encoding compound with Morgan type


100%|██████████| 499/499 [00:00<00:00, 648.56it/s]

Encoding compound ends!
	Positive data : 370
	Negative data : 507





You can predict DTIs with the `DTI_prediction` function.

In [12]:
dti_prediction = dti_model.DTI_prediction(**dti_data)

You can also evaluate performance with the `DTI_validation` function.

By default, HoTS calculates AUC and AUPR.

If you pass the `threshold` argument, sensitivity, specificity, precision, accuracy, and F1 score are also calculated.

In [13]:
dti_evalution_result = dti_model.DTI_validation(threshold=0.2, **dti_data)

	Sen :  0.6351351351351351
	Spe :  0.8224852071005917
	Precision :  0.7230769230769231
	Acc :  0.7434435575826682
	F1 :  0.6762589928057553
	Area Under ROC Curve(AUC): 0.838
	Area Under PR Curve(AUPR): 0.811
	Optimal threshold(AUC)   : 0.000 
	Optimal threshold(AUPR)  : 0.218



Evaluation results are summarized as a dictionary object.

In [14]:
dti_evalution_result

{'Sen': 0.6351351351351351,
 'Spe': 0.8224852071005917,
 'Acc': 0.7434435575826682,
 'Pre': 0.7230769230769231,
 'F1': 0.6762589928057553,
 'AUC': 0.8381043765659151,
 'AUPR': 0.8106569321431619}
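
The threshold-based values above can be cross-checked by hand. The sketch below computes the same five metrics from confusion-matrix counts using their standard definitions. The counts are not printed by HoTS; they are inferred here from the reported 370 positives, 507 negatives, and the reported sensitivity and specificity, and `threshold_metrics` is a hypothetical helper name.

```python
def threshold_metrics(tp, fp, tn, fn):
    """Return threshold-based classification metrics from confusion counts.

    No guard against empty classes is included; a zero denominator raises.
    """
    sensitivity = tp / (tp + fn)   # fraction of positive pairs recovered
    specificity = tn / (tn + fp)   # fraction of negative pairs rejected
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"Sen": sensitivity, "Spe": specificity, "Acc": accuracy,
            "Pre": precision, "F1": f1}


# Counts inferred from the reported values; a reconstruction, not HoTS output.
metrics = threshold_metrics(tp=235, fp=90, tn=417, fn=135)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these counts the helper reproduces the `Sen`, `Spe`, `Acc`, `Pre`, and `F1` entries of the dictionary above.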

## Predict BRs and DTIs without loading data

In [15]:
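import urllib.request  # a bare `import urllib` does not guarantee that urllib.request is loaded
from Bio import Entrez

uniprot_url = "https://www.uniprot.org/uniprot/{0}.fasta"

def get_smiles_from_cid(cid):
    # Fetch the canonical SMILES string for a PubChem compound ID (CID).
    url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/CanonicalSMILES/txt" % cid
    return urllib.request.urlopen(url).read().decode("utf-8").strip()

def get_seq_from_uniprot_acc(uniprot_acc):
    # Download the FASTA record and join its sequence lines, skipping the header line.
    opened = urllib.request.urlopen(uniprot_url.format(uniprot_acc))
    lines = opened.readlines()
    return "".join([line.decode("utf-8").rstrip() for line in lines[1:]])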
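
`get_seq_from_uniprot_acc` relies on the first line of the response being the FASTA header and on the response holding a single record. The parsing step can be tried offline with a slightly more defensive sketch such as the one below; `fasta_to_sequence` is a hypothetical helper, and the example record is made up for illustration.

```python
def fasta_to_sequence(fasta_text):
    """Return the residues of the first record in a FASTA-formatted string."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts the next record; stop here
            seen_header = True
            continue
        residues.append(line)
    return "".join(residues)


# Made-up record for illustration only (not a real UniProt entry).
example_fasta = ">sp|X00000|EXAMPLE made-up protein\nMLEICLKLVG\nCKSKKGLSSS\n"
example_sequence = fasta_to_sequence(example_fasta)
print(example_sequence)  # MLEICLKLVGCKSKKGLSSS
```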

You can download SMILES strings by PubChem CID.

In [16]:
# SMILES of drugs (must be given as a list)
drugs = [get_smiles_from_cid(5291), get_smiles_from_cid(219024)]

In [17]:
print(drugs)

['CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5', 'CNC(=O)C1=CN(N=C1)C2=NC(=C3C(=N2)N(C=N3)C4C(C(C(O4)CO)O)O)N']


You can also download protein sequences by UniProt accession.

In [18]:
# Sequences (must be given as a list)

targets = [get_seq_from_uniprot_acc("P00519"), get_seq_from_uniprot_acc("P29274")]

In [19]:
print(targets)

['MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAEHRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLFSALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSPKPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSASCVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTVTPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGSALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPPPPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVLPATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPS

You need to encode the SMILES strings and sequences with the encoders defined earlier.

In [20]:
drugs_fp = [compound_encoder.encode(drug) for drug in drugs]
targets_encoded = [protein_encoder.encode(target) for target in targets]

Because the protein sequences have different lengths, you need to pad them.

In [21]:
# Assign back to `targets_encoded`, the name used by the following cells.
targets_encoded = protein_encoder.pad(targets_encoded)
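
To illustrate what encoding and padding achieve, here is a minimal pure-Python sketch. It is not the `ProteinEncoder` implementation: the 20-letter vocabulary, the index reserved for unknown residues, and right-padding with zeros are all assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding; unknown residues share one extra index.
RESIDUE_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1


def encode_sequence(sequence):
    """Map each residue letter to an integer index."""
    return [RESIDUE_TO_INDEX.get(aa, UNKNOWN_INDEX) for aa in sequence.upper()]


def pad_sequences(encoded, pad_value=0):
    """Right-pad every encoded sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in encoded)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in encoded]


toy_encoded = [encode_sequence(s) for s in ["MLEI", "MPIMGS", "AC"]]
toy_padded = pad_sequences(toy_encoded)
print([len(row) for row in toy_padded])  # [6, 6, 6]
```

After padding, all rows have equal length, so they can be stacked into a single rectangular batch.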

### Prediction of DTIs

You can predict DTIs with the `DTI_prediction` function of the `dti_model` object.

In [22]:
dti_prediction_results = dti_model.DTI_prediction(drugs_fp, targets_encoded)

The model returns predictions with shape [N, 1], one score per drug-target pair.

In [23]:
print(dti_prediction_results)

[[0.9627259 ]
 [0.97712976]]
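
Since each row holds a single score, it is often convenient to flatten the result and attach a label to each pair. The sketch below does this for nested lists or arrays of shape [N, 1]; the helper `label_predictions`, the pair names, and the 0.5 cutoff are illustrative assumptions rather than part of HoTS.

```python
def label_predictions(scores, pair_names, threshold=0.5):
    """Pair each [score] row with a name and a thresholded interaction call."""
    if len(scores) != len(pair_names):
        raise ValueError("scores and pair_names must have the same length")
    results = []
    for row, name in zip(scores, pair_names):
        score = float(row[0])  # each row holds exactly one score
        results.append((name, score, score >= threshold))
    return results


# Scores copied from the output above; names and threshold are illustrative.
toy_scores = [[0.9627259], [0.97712976]]
toy_names = ["pair_1", "pair_2"]
labelled = label_predictions(toy_scores, toy_names)
for name, score, interacts in labelled:
    print(name, score, interacts)
```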


### Prediction of BRs

You can predict BRs with the `HoTS_prediction` function.

In [24]:
dti_predictions,  br_predictions = dti_model.HoTS_prediction(drugs_fp, targets_encoded)

BR prediction results contain a list of (BR_start, BR_end, confidence score) for each protein grid.

Also, they are sorted by confidence score.

In [25]:
print(br_predictions[0][0:10])

[(307, 330, 0.7390334010124207), (367, 382, 0.721824049949646), (260, 274, 0.7199005484580994), (358, 373, 0.7056204080581665), (288, 303, 0.7008621692657471), (246, 262, 0.6984542012214661), (298, 312, 0.697698175907135), (376, 389, 0.6810927391052246), (240, 254, 0.6167941093444824), (277, 291, 0.6081487536430359)]
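
Many of the predicted regions overlap, so a common post-processing step is to keep only confident predictions and merge overlapping ones into contiguous stretches. The sketch below shows one way to do that with the ten tuples printed above. `merge_confident_regions` is a hypothetical helper, not a HoTS function, and it assumes that start and end share one coordinate system and that regions touching at a boundary should be merged.

```python
def merge_confident_regions(regions, threshold):
    """Keep regions scoring >= threshold and merge those that overlap."""
    kept = sorted((start, end) for start, end, score in regions
                  if score >= threshold)
    merged = []
    for start, end in kept:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The ten predictions printed above.
example_regions = [
    (307, 330, 0.7390334010124207), (367, 382, 0.721824049949646),
    (260, 274, 0.7199005484580994), (358, 373, 0.7056204080581665),
    (288, 303, 0.7008621692657471), (246, 262, 0.6984542012214661),
    (298, 312, 0.697698175907135), (376, 389, 0.6810927391052246),
    (240, 254, 0.6167941093444824), (277, 291, 0.6081487536430359),
]
merged_at_07 = merge_confident_regions(example_regions, threshold=0.7)
print(merged_at_07)  # [(260, 274), (288, 303), (307, 330), (358, 382)]
```

Lowering the threshold to 0.6 keeps all ten predictions, which merge into three longer stretches.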


### Visualization of BR predictions

You can visualize the results with the `HoTS_visualization` function, which requires a list of names, one for each visualization.

Each name should identify a DTI pair.

In [26]:
names = ["ABL1_imatinib", "AA2AR_Regadenoson"]

In [27]:
dti_model.HoTS_visualization(drugs_fp, targets_encoded, targets, protein_names=names, th=0.6)

Prediction with 0.600000
ABL1_imatinib
DTI score :  [0.9627259]
  Sequence : MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWC
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : EAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVH
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : HHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
Prediction :                                         DITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLK   MEVEEFLKEAAVMKEIKHPNLVQ
     Score :            