# Predict and evaluate DTIs and BRs with a trained model

In [1]:
from HoTS.model.hots import *
from HoTS.utils.build_features import *

Using TensorFlow backend.


In [2]:
# define input features
prot_vec = "Sequence"
drug_vec = "Morgan"
drug_len = 2048
radius = 2
protein_encoder = ProteinEncoder(prot_vec)
compound_encoder = CompoundEncoder(drug_vec, radius=radius, n_bits=drug_len)

In [3]:
# initialize model
dti_model = HoTS()

## Load trained model

To load a trained model, call `dti_model.load_model("output.config.json")` with the path to the model's configuration file.

In [4]:
# load model
dti_model.load_model("./Model/HoTS_config.json")

{'protein_grid_size': 10, 'compound_grid_size': None, 'anchors': [9], 'hots_dimension': 128, 'hots_n_heads': 4, 'dropout': 0.1, 'drug_layers': [512, 128], 'protein_strides': [5, 10, 15, 20, 25, 30], 'filters': 128, 'fc_layers': [256, 64], 'hots_fc_layers': [256, 64], 'learning_rate': 0.0001, 'prot_vec': 'Sequence', 'drug_vec': 'Morgan', 'drug_len': 2048, 'activation': 'gelu', 'protein_layers': [128, 128, 128, 128], 'reg_loss_weight': 0.1, 'conf_loss_weight': 1, 'negative_loss_weight': 0.1, 'retina_loss_weight': 2, 'decay': 0.0001, 'hots_file': './Model/HoTS.h5', 'dti_file': './Model/DTI.h5', 'hots_validation_results': {}, 'dti_validation_results': {'MATADOR_DTI': [{'AUC': 0.6055226824457594, 'AUPR': 0.5292486343535088}, {'AUC': 0.5882616344154805, 'AUPR': 0.5303096078019971}, {'AUC': 0.583511914281145, 'AUPR': 0.519665186632455}, {'AUC': 0.5810011194626579, 'AUPR': 0.5175772095988646}, {'AUC': 0.7315688469534624, 'AUPR': 0.684015600969329}, {'AUC': 0.7398688629457861, 'AUPR': 0.6686561
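The configuration printed above is a plain dictionary, and the file name suggests it is stored as JSON. Under that assumption, the standard `json` module can be used to inspect a configuration before loading it. The following is only a sketch: the file written here is a stand-in holding a few keys copied from the output above, and the helper name `summarize_config` is hypothetical, not part of HoTS.

```python
import json
import os
import tempfile

# Stand-in config with a few keys copied from the printed output above.
sample_config = {
    "prot_vec": "Sequence",
    "drug_vec": "Morgan",
    "drug_len": 2048,
    "hots_file": "./Model/HoTS.h5",
    "dti_file": "./Model/DTI.h5",
}


def summarize_config(path):
    """Return the encoder settings and weight-file paths stored in a JSON config."""
    with open(path) as handle:
        config = json.load(handle)
    keys = ("prot_vec", "drug_vec", "drug_len", "hots_file", "dti_file")
    return {key: config.get(key) for key in keys}


# Write the stand-in to a temporary file so the sketch runs without the real model.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "HoTS_config.json")
    with open(config_path, "w") as handle:
        json.dump(sample_config, handle)
    summary = summarize_config(config_path)

print(summary)
```

Comparing `prot_vec`, `drug_vec` and `drug_len` against the encoders defined earlier is a cheap way to catch a mismatch between the input features and the trained model.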

## Load BR dataset and evaluate performance

You can use the `parse_HoTS_data` function to load HoTS data as input for the HoTS model.

`parse_HoTS_data` is located in `HoTS.utils.build_features`.

If the file contains a `binding_region` column, set `binding_region=True` so that the binding regions are parsed as well.

In [5]:
hots_data = parse_HoTS_data("./SampleData/HoTS/Validation_HoTS.tsv",
                           compound_encoder=compound_encoder, protein_encoder=protein_encoder, 
                          binding_region=True)

100%|██████████| 232/232 [00:00<00:00, 29284.90it/s]

Parsing HoTS data: ./SampleData/HoTS/Validation_HoTS.tsv
Number of 3D-complexes : 232
Number of proteins : 207





With the binding regions loaded, `HoTS_validation` calculates the average precision (AP) of the BR predictions.

In [6]:
hots_evalution_result = dti_model.HoTS_validation(**hots_data)

	AP :  0.7196415938563052


The evaluation results are returned as a dictionary.

In [7]:
hots_evalution_result

{'AP': 0.7196415938563052}

## Load DTI dataset and evaluate performance

You can use the `parse_DTI_data` function to load DTI data as input for the HoTS model. It takes the DTI, compound, and protein files, in that order.

`parse_DTI_data` is located in `HoTS.utils.build_features`.

In [8]:
dti_data = parse_DTI_data("./SampleData/DTI/Validation/Validation_DTI.csv", 
                   "./SampleData/DTI/Validation/Validation_Compound.csv", 
                   "./SampleData/DTI/Validation/Validation_Protein.csv", 
                   compound_encoder=compound_encoder, protein_encoder=protein_encoder)

Parsing ./SampleData/DTI/Validation/Validation_DTI.csv , ./SampleData/DTI/Validation/Validation_Compound.csv, ./SampleData/DTI/Validation/Validation_Protein.csv with length 2500, type None


  4%|▍         | 21/499 [00:00<00:02, 201.63it/s]

Encoding compound with Morgan type


100%|██████████| 499/499 [00:02<00:00, 200.52it/s]

Encoding compound ends!
	Positive data : 370
	Negative data : 507





By default, AUC and AUPR are calculated.

If you pass a `threshold` argument, sensitivity, specificity, precision, accuracy, and F1 score are also calculated at that threshold.

In [9]:
dti_evalution_result = dti_model.DTI_validation(threshold=0.4, **dti_data)

	Sen :  0.5648648648648649
	Spe :  0.8777120315581854
	Precision :  0.7712177121771218
	Acc :  0.7457240592930444
	F1 :  0.6521060842433697
	Area Under ROC Curve(AUC): 0.826
	Area Under PR Curve(AUPR): 0.790
	Optimal threshold(AUC)   : 0.000 
	Optimal threshold(AUPR)  : 0.359




The evaluation results are returned as a dictionary.

In [10]:
dti_evalution_result

{'Sen': 0.5648648648648649,
 'Spe': 0.8777120315581854,
 'Acc': 0.7457240592930444,
 'Pre': 0.7712177121771218,
 'F1': 0.6521060842433697,
 'AUC': 0.8260408337331414,
 'AUPR': 0.7895424822083337}
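The threshold-based metrics reported above follow from a confusion matrix. The sketch below is not the HoTS implementation: `threshold_metrics` is a hypothetical helper, and it assumes a pair is called positive when its score is at least the threshold. The counts TP=209, FN=161, FP=62, TN=445 are not printed by HoTS. They are inferred here as the integer counts consistent with the 370 positive and 507 negative pairs and the reported values.

```python
def threshold_metrics(labels, scores, threshold):
    """Confusion-matrix metrics, calling a pair positive when score >= threshold.

    Assumes every denominator is non-zero (both classes present, at least
    one positive prediction).
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    sen = tp / (tp + fn)          # sensitivity (recall)
    spe = tn / (tn + fp)          # specificity
    pre = tp / (tp + fp)          # precision
    acc = (tp + tn) / len(labels) # accuracy
    f1 = 2 * pre * sen / (pre + sen)
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Acc": acc, "F1": f1}


# Synthetic scores chosen only to reproduce the inferred counts:
# 209 true positives, 161 false negatives, 62 false positives, 445 true negatives.
labels = [1] * 370 + [0] * 507
scores = [0.9] * 209 + [0.1] * 161 + [0.9] * 62 + [0.1] * 445

metrics = threshold_metrics(labels, scores, threshold=0.4)
print({name: round(value, 4) for name, value in metrics.items()})
# {'Sen': 0.5649, 'Spe': 0.8777, 'Pre': 0.7712, 'Acc': 0.7457, 'F1': 0.6521}
```

Lowering the threshold trades specificity for sensitivity, which is why the threshold-free AUC and AUPR are reported alongside these values.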

## Predict BRs and DTIs without loading data

In [11]:
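import urllib.request  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded
from Bio import Entrez

uniprot_url = "https://www.uniprot.org/uniprot/{0}.fasta"
pubchem_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/CanonicalSMILES/txt"

def get_smiles_from_cid(cid):
    # fetch the canonical SMILES string for a PubChem compound ID
    with urllib.request.urlopen(pubchem_url % cid) as response:
        return response.read().decode("utf-8").strip()

def get_seq_from_uniprot_acc(uniprot_acc):
    # fetch the FASTA record for a UniProt accession; the first line is the header, the rest is the sequence
    with urllib.request.urlopen(uniprot_url.format(uniprot_acc)) as response:
        lines = response.readlines()
    return "".join([line.decode("utf-8").rstrip() for line in lines[1:]])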
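`get_seq_from_uniprot_acc` drops the first line of the response (the FASTA header) and concatenates the remaining lines. The same parsing step can be tried offline, without a network request. In this sketch the helper `fasta_to_sequence` and the example record are made up for illustration.

```python
def fasta_to_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    return "".join(line.strip() for line in lines[1:])


# Made-up record: a header line followed by two wrapped sequence lines.
example_record = ">sp|X00000|EXAMPLE_HUMAN Made-up record\nMLEICLKLVG\nCKSKKGLSSS\n"
sequence = fasta_to_sequence(example_record)
print(sequence, len(sequence))
# MLEICLKLVGCKSKKGLSSS 20
```

Checking for the `>` header guards against passing an error page or an empty response on to the protein encoder.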

In [12]:
# SMILES strings of the drugs (must be given as a list)

drugs = [get_smiles_from_cid(5291), get_smiles_from_cid(219024)]

# Protein sequences (must be given as a list)

targets = [get_seq_from_uniprot_acc("P00519"), get_seq_from_uniprot_acc("P29274")]

Encode the SMILES strings and the sequences with the encoders defined above.

In [13]:
drugs_fp = [compound_encoder.encode(drug) for drug in drugs]
targets_encoded = [protein_encoder.encode(target) for target in targets]

Protein sequences of different lengths must be padded to a common length before they can be stacked into one array.

In [14]:
target_encoded = protein_encoder.pad(targets_encoded)

In [15]:
target_encoded.shape

(2, 1130)
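The shape above shows that both encoded proteins end up in a single array whose second dimension is presumably the length of the longer sequence. The sketch below illustrates the idea on toy data. It assumes zero-padding on the right, which the source does not confirm for `protein_encoder.pad`, and `pad_sequences_sketch` is a hypothetical helper.

```python
def pad_sequences_sketch(encoded, pad_value=0):
    """Right-pad each integer-encoded sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in encoded)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in encoded]


# Toy integer-encoded sequences of unequal length (the values are arbitrary).
toy_encoded = [[3, 7, 1, 4, 9], [5, 2]]
padded = pad_sequences_sketch(toy_encoded)

print(padded)
# [[3, 7, 1, 4, 9], [5, 2, 0, 0, 0]]
print((len(padded), len(padded[0])))
# (2, 5)
```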

### Prediction of DTIs

You can predict DTIs with the `DTI_prediction` method of the `dti_model` object. It returns one score per drug-target pair.

In [16]:
dti_model.DTI_prediction(drugs_fp, targets_encoded)

array([[0.92619896],
       [0.8873392 ]], dtype=float32)

### Prediction of BRs

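You can predict BRs with `HoTS_prediction`, but its raw output is hard to interpret: it returns the DTI scores together with a long list of scored regions for each input.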

In [17]:
dti_model.HoTS_prediction(drugs_fp, targets_encoded)

(array([[0.92619896],
        [0.8873392 ]], dtype=float32),
 [[(367, 385, 0.7754790186882019),
   (359, 373, 0.7527496218681335),
   (308, 324, 0.7307592630386353),
   (249, 264, 0.7113943099975586),
   (288, 303, 0.7104097604751587),
   (315, 330, 0.6692482233047485),
   (281, 294, 0.6628872156143188),
   (299, 312, 0.6184608936309814),
   (378, 390, 0.6164246201515198),
   (350, 364, 0.6097195148468018),
   (258, 273, 0.6040068864822388),
   (1059, 1071, 0.5806174874305725),
   (241, 253, 0.5787888169288635),
   (328, 341, 0.5430454015731812),
   (388, 401, 0.5384601950645447),
   (339, 351, 0.5373850464820862),
   (1101, 1113, 0.5239614844322205),
   (168, 180, 0.5186799764633179),
   (467, 481, 0.517566978931427),
   (409, 421, 0.507867157459259),
   (398, 410, 0.5074823498725891),
   (741, 753, 0.4995215833187103),
   (510, 522, 0.49692490696907043),
   (159, 173, 0.49686112999916077),
   (1037, 1048, 0.4954761862754822),
   (676, 688, 0.492410808801651),
   (428, 441, 0.49102842
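One way to make this output easier to read is to keep only the high-scoring regions and order them along the sequence. The sketch below assumes each tuple is `(start, end, score)`, which matches the shape of the output but is not stated in the source. `filter_regions` is a hypothetical helper, and the list holds a few entries copied from the output above.

```python
# A few entries copied from the output above, read as (start, end, score).
regions = [
    (367, 385, 0.7754790186882019),
    (359, 373, 0.7527496218681335),
    (249, 264, 0.7113943099975586),
    (1059, 1071, 0.5806174874305725),
    (168, 180, 0.5186799764633179),
]


def filter_regions(regions, threshold):
    """Keep regions scoring at least `threshold`, ordered by start position."""
    kept = [region for region in regions if region[2] >= threshold]
    return sorted(kept, key=lambda region: region[0])


for start, end, score in filter_regions(regions, threshold=0.6):
    print(f"{start:>5}-{end:<5} score={score:.3f}")
```

With a threshold of 0.6, only the regions starting at 249, 359 and 367 remain from this excerpt.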

### Visualization of BR predictions

You can visualize the results with `HoTS_visualization`, but you need to pass a list of names, one for each visualization.

In [18]:
names = ["ABL1_imatinib", "AA2AR_Regadenoson"]

In [19]:
dti_model.HoTS_visualization(drugs_fp, targets_encoded, targets, protein_names=names, th=0.6)

Prediction with 0.600000
ABL1_imatinib
DTI score :  [0.92619896]
  Sequence : MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWC
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : EAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVH
Prediction :                                                                                                     
     Score :                                                                                                     
  Sequence : HHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
Prediction :                                                  GGQYGEVYEGVWKKYSLTVAVKTL        EFLKEAAVMKEIKHPNLVQ
     Score :           