# Predict and evaluate DTIs and BRs with a trained model

In [1]:
from HoTS.model.hots import *
from HoTS.utils.build_features import *

Using TensorFlow backend.


In [2]:
# define input feature
prot_vec = "Sequence"
drug_vec = "Morgan"
drug_len = 2048
radius = 2
protein_encoder = ProteinEncoder(prot_vec)
compound_encoder = CompoundEncoder(drug_vec, radius=radius, n_bits=drug_len)

In [3]:
# initialize model
dti_model = HoTS()

## Load trained model

To load a trained model, pass the path of its JSON config file to `load_model`, e.g. `dti_model.load_model("output.config.json")`.

In [4]:
# load model
dti_model.load_model("./Model/HoTS_config.json")

{'protein_grid_size': 10, 'compound_grid_size': None, 'anchors': [9], 'hots_dimension': 128, 'hots_n_heads': 4, 'dropout': 0.1, 'drug_layers': [512, 128], 'protein_strides': [5, 10, 15, 20, 25, 30], 'filters': 128, 'fc_layers': [256, 64], 'hots_fc_layers': [256, 64], 'learning_rate': 0.0001, 'prot_vec': 'Sequence', 'drug_vec': 'Morgan', 'drug_len': 2048, 'activation': 'gelu', 'protein_layers': [128, 128, 128, 128], 'reg_loss_weight': 0.1, 'conf_loss_weight': 1, 'negative_loss_weight': 0.1, 'retina_loss_weight': 2, 'decay': 0.0001, 'hots_file': './Model/HoTS.h5', 'dti_file': './Model/DTI.h5', 'hots_validation_results': {}, 'dti_validation_results': {'MATADOR_DTI': [{'AUC': 0.6055226824457594, 'AUPR': 0.5292486343535088}, {'AUC': 0.5882616344154805, 'AUPR': 0.5303096078019971}, {'AUC': 0.583511914281145, 'AUPR': 0.519665186632455}, {'AUC': 0.5810011194626579, 'AUPR': 0.5175772095988646}, {'AUC': 0.7315688469534624, 'AUPR': 0.684015600969329}, {'AUC': 0.7398688629457861, 'AUPR': 0.6686561
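The configuration can also be inspected without building the model. The following is a minimal sketch, assuming the config file is plain JSON with the same keys as the dictionary printed above; `config_text` is a hand-trimmed stand-in for the real file, not its full contents.

```python
import json

# Hand-trimmed stand-in for ./Model/HoTS_config.json (assumption: the real
# file is JSON with the same keys as the dictionary printed by load_model).
config_text = """
{
  "prot_vec": "Sequence",
  "drug_vec": "Morgan",
  "drug_len": 2048,
  "hots_file": "./Model/HoTS.h5",
  "dti_file": "./Model/DTI.h5"
}
"""

config = json.loads(config_text)

# Encoder settings stored with the model; compare them with the
# ProteinEncoder / CompoundEncoder arguments used in this notebook.
encoder_settings = {key: config[key] for key in ("prot_vec", "drug_vec", "drug_len")}

# Paths of the weight files referenced by the config.
weight_files = [config["hots_file"], config["dti_file"]]

print(encoder_settings)
print(weight_files)
```

Checking these values before encoding new inputs is a cheap way to notice a mismatch between the encoders and the trained model.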

## Load BR dataset and evaluate Performance

Use the `parse_HoTS_data` function, located in `HoTS.utils.build_features`, to load HoTS data as input for the HoTS model.

If the data file contains a `binding_region` column, set `binding_region=True`.

In [5]:
hots_data = parse_HoTS_data("./SampleData/HoTS/Validation_HoTS.tsv",
                            compound_encoder=compound_encoder,
                            protein_encoder=protein_encoder,
                            binding_region=True)

100%|██████████| 232/232 [00:00<00:00, 27539.44it/s]

Parsing HoTS data: ./SampleData/HoTS/Validation_HoTS.tsv
Number of 3D-complexes : 232
Number of proteins : 207





`HoTS_validation` then calculates the AP (average precision) of the binding-region predictions.

In [6]:
hots_evalution_result = dti_model.HoTS_validation(**hots_data)

	AP :  0.7196415938563052


The evaluation results are returned as a dictionary object.

In [7]:
hots_evalution_result

{'AP': 0.7196415938563052}

## Load DTI dataset and evaluate Performance

Use the `parse_DTI_data` function, located in `HoTS.utils.build_features`, to load DTI data as input for the HoTS model.

In [8]:
dti_data = parse_DTI_data("./SampleData/DTI/Validation/Validation_DTI.csv", 
                   "./SampleData/DTI/Validation/Validation_Compound.csv", 
                   "./SampleData/DTI/Validation/Validation_Protein.csv", 
                   compound_encoder=compound_encoder, protein_encoder=protein_encoder)

Parsing ./SampleData/DTI/Validation/Validation_DTI.csv , ./SampleData/DTI/Validation/Validation_Compound.csv, ./SampleData/DTI/Validation/Validation_Protein.csv with length 2500, type None


  4%|▍         | 22/499 [00:00<00:02, 219.79it/s]

Encoding compound with Morgan type


100%|██████████| 499/499 [00:02<00:00, 217.39it/s]

Encoding compound ends!
	Positive data : 370
	Negative data : 507





By default, AUC and AUPR are calculated.

If you pass a `threshold` argument, sensitivity, specificity, precision, accuracy, and F1 score are also calculated at that threshold.

In [9]:
dti_evalution_result = dti_model.DTI_validation(threshold=0.4, **dti_data)

	Sen :  0.5648648648648649
	Spe :  0.8777120315581854
	Precision :  0.7712177121771218
	Acc :  0.7457240592930444
	F1 :  0.6521060842433697
	Area Under ROC Curve(AUC): 0.826
	Area Under PR Curve(AUPR): 0.790
	Optimal threshold(AUC)   : 0.000 
	Optimal threshold(AUPR)  : 0.359




The evaluation results are returned as a dictionary object.

In [10]:
dti_evalution_result

{'Sen': 0.5648648648648649,
 'Spe': 0.8777120315581854,
 'Acc': 0.7457240592930444,
 'Pre': 0.7712177121771218,
 'F1': 0.6521060842433697,
 'AUC': 0.8260408337331414,
 'AUPR': 0.7895424822083337}
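The threshold-based metrics can be cross-checked by hand. The sketch below uses the standard definitions of these metrics; it is not the HoTS implementation. The confusion-matrix counts are an assumption: they are inferred from the reported 370 positives, 507 negatives, and metric values, and are not printed by HoTS itself.

```python
def threshold_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)                    # sensitivity (recall)
    spe = tn / (tn + fp)                    # specificity
    pre = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + fn + tn + fp)   # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of precision and recall
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Acc": acc, "F1": f1}

# Inferred counts (assumption): tp + fn = 370 positives, tn + fp = 507 negatives.
metrics = threshold_metrics(tp=209, fn=161, tn=445, fp=62)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the five values agree with the dictionary above, which suggests that the reported metrics follow the usual definitions at `threshold=0.4`.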

## Predict BRs and DTIs without loading data

In [11]:
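# `import urllib` alone does not guarantee that the urllib.request submodule is loaded
import urllib.request
from Bio import Entrez

uniprot_url = "https://www.uniprot.org/uniprot/{0}.fasta"
pubchem_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/CanonicalSMILES/txt"

def get_smiles_from_cid(cid):
    # Fetch the canonical SMILES string for a PubChem compound ID (CID)
    return urllib.request.urlopen(pubchem_url % cid).read().decode("utf-8").strip()

def get_seq_from_uniprot_acc(uniprot_acc):
    # Fetch the FASTA record and join the sequence lines, skipping the header line
    opened = urllib.request.urlopen(uniprot_url.format(uniprot_acc))
    lines = opened.readlines()
    return "".join([line.decode("utf-8").rstrip() for line in lines[1:]])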
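The FASTA handling in `get_seq_from_uniprot_acc` can be tried offline. This is a minimal sketch of the same header-skipping logic applied to an in-memory string; the function name `sequence_from_fasta` and the example record (header and residues) are hypothetical and only serve as illustration.

```python
def sequence_from_fasta(fasta_text):
    """Return the sequence of a single-record FASTA string, without the header."""
    lines = fasta_text.splitlines()
    # The first line is the ">" header; the remaining lines hold the residues.
    return "".join(line.strip() for line in lines[1:])

# Hypothetical record: the header is made up and the residues are arbitrary.
example_fasta = ">sp|X00000|EXAMPLE_HUMAN Example protein\nMPIMGSSVYI\nTVELAIAVLA\n"
sequence = sequence_from_fasta(example_fasta)
print(sequence)
```

Like the notebook function, this assumes exactly one record per response; a multi-record FASTA file would need a real parser.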

In [12]:
# SMILES strings of the drugs (must be given as a list)

drugs = [get_smiles_from_cid(219024)]

# Protein sequences of the targets (must be given as a list)

targets = [get_seq_from_uniprot_acc("P29274")]

Encode the SMILES strings and sequences with the encoders defined earlier.

In [13]:
drugs_fp = [compound_encoder.encode(drug) for drug in drugs]
targets_encoded = [protein_encoder.encode(target) for target in targets]

### Prediction of DTIs

You can predict DTIs with the `DTI_prediction` method of the `dti_model` object.

In [14]:
dti_model.DTI_prediction(drugs_fp, targets_encoded)

array([[0.78115684]], dtype=float32)

### Prediction of BRs

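You can predict BRs with `HoTS_prediction`, but the raw output is hard to interpret: it returns the DTI scores together with, for each drug-target pair, a list of tuples that appear to be (start, end, score) for each predicted region, sorted by descending score.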

In [15]:
dti_model.HoTS_prediction(drugs_fp, targets_encoded)

(array([[0.78115684]], dtype=float32),
 [[(178, 193, 0.7201496958732605),
   (256, 270, 0.7143986225128174),
   (167, 182, 0.6817784905433655),
   (116, 130, 0.6758933067321777),
   (60, 74, 0.6708559989929199),
   (276, 290, 0.649710476398468),
   (66, 80, 0.6246914267539978),
   (79, 93, 0.6140745878219604),
   (230, 243, 0.6140472292900085),
   (360, 373, 0.6114603877067566),
   (236, 249, 0.6080814003944397),
   (328, 341, 0.6058177351951599),
   (320, 332, 0.593420684337616),
   (86, 99, 0.5894615054130554),
   (288, 302, 0.5893487334251404),
   (137, 150, 0.5890995264053345),
   (248, 261, 0.5860364437103271),
   (109, 121, 0.5758647918701172),
   (268, 281, 0.5649896264076233),
   (349, 360, 0.5600472092628479),
   (27, 41, 0.5552756190299988),
   (8, 21, 0.551354169845581),
   (98, 112, 0.5365844964981079),
   (378, 390, 0.5329036116600037),
   (338, 351, 0.5312085151672363),
   (38, 52, 0.5262905359268188),
   (130, 143, 0.5258466601371765),
   (188, 201, 0.5243402123451233),
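One way to make the raw tuples easier to read is to keep only the regions above a score threshold and merge those that overlap. The sketch below is an illustration, not part of HoTS: `merge_regions` is a hypothetical helper, and it assumes each tuple is (start, end, score) with 0-based, end-exclusive positions. The tuples are a subset copied from the output above, and `first_100` holds the first 100 residues of the target as printed by the visualization cell in this notebook.

```python
def merge_regions(regions, th=0.6):
    """Keep regions scoring at least `th` and merge those that overlap or touch.

    `regions` holds (start, end, score) tuples; a merged region keeps the
    highest score among its members.
    """
    kept = sorted(r for r in regions if r[2] >= th)
    merged = []
    for start, end, score in kept:
        if merged and start <= merged[-1][1]:
            last_start, last_end, last_score = merged[-1]
            merged[-1] = (last_start, max(last_end, end), max(last_score, score))
        else:
            merged.append((start, end, score))
    return merged

# A subset of the predicted tuples shown above.
regions = [
    (178, 193, 0.7201496958732605),
    (167, 182, 0.6817784905433655),
    (60, 74, 0.6708559989929199),
    (66, 80, 0.6246914267539978),
    (79, 93, 0.6140745878219604),
    (86, 99, 0.5894615054130554),
    (27, 41, 0.5552756190299988),
]
merged = merge_regions(regions, th=0.6)

# First 100 residues of the target sequence, as printed by the visualization.
first_100 = ("MPIMGSSVYITVELAIAVLAILGNVLVCWAVWLNSNLQNVTNYFVVSLAA"
             "ADIAVGVLAIPFAITISTGFCAACHGCLFIACFVLVLTQSSIFSLLAIAI")

for start, end, score in merged:
    print(start, end, round(score, 2))

# Slice the first merged region out of the sequence (end-exclusive assumption).
first_start, first_end, _ = merged[0]
print(first_100[first_start:first_end])
```

Under these assumptions the three overlapping predictions between positions 60 and 93 collapse into a single span, which matches the first highlighted stretch in the visualization output.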


### Visualization of BR predictions

You can visualize the results with `HoTS_visualization`, but you need to provide a list of names, one for each visualization.

In [16]:
names = ["AA2AR_Regadenoson"]

In [17]:
dti_model.HoTS_visualization(drugs_fp, targets_encoded, targets, protein_names=names, th=0.6)

Prediction with 0.600000
AA2AR_Regadenoson
DTI score :  [0.78115684]
  Sequence : MPIMGSSVYITVELAIAVLAILGNVLVCWAVWLNSNLQNVTNYFVVSLAAADIAVGVLAIPFAITISTGFCAACHGCLFIACFVLVLTQSSIFSLLAIAI
Prediction :                                                             PFAITISTGFCAACHGCLFIACFVLVLTQSSIF       
     Score :                                                             67%   62%          61%                  
  Sequence : DRYIAIRIPLRYNGLVTGTRAKGIIAICWVLSFAIGLTPMLGWNNCGQPKEGKNHSQGCGEGQVACLFEDVVPMNYMVYFNFFACVLVPLLLMLGVYLRI
Prediction :                 TGTRAKGIIAICWV                                     FEDVVPMNYMVYFNFFACVLVPLLLM       
     Score :                 67%                                                68%        72%                   
  Sequence : FLAARRQLKQMESQPLPGERARSTLQKEVHAAKSLAIIVGLFALCWLPLHIINCFTFFCPDCSHAPLWLMYLAIVLSHTNSVVNPFIYAYRIREFRQTFR
Prediction :                               AAKSLAIIVGLFALCWLPL       FFCPDCSHAPLWLM      SHTNSVVNPFIYAY          
     Score :       