# Drug repurposing with DeepPurpose

- The input to the model is a drug-target pair: the drug is given as a simplified molecular-input line-entry system (SMILES) string and the target as an amino acid sequence.

- The output is a score indicating the binding activity of the drug-target pair.

Tutorial: https://github.com/kexinhuang12345/DeepPurpose/blob/master/Tutorial_1_DTI_Prediction.ipynb

Use the `train.py` script to pre-train the model (about 1 hour on a GPU).

## Objective

1. Find the amino acid sequence of a target known to be involved in a disease (e.g. Alzheimer's disease, `MONDO:0004975`)
2. Run the model to find drugs that could potentially bind to that Alzheimer's disease target
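Step 1 can be sketched as follows, assuming the target is already known by a UniProt accession. This is a minimal illustration, not part of the original notebook: the helper names (`parse_fasta`, `is_standard_sequence`, `fetch_sequence`) are hypothetical, and the UniProt REST URL pattern is an assumption about where the sequence would come from.

```python
from urllib.request import urlopen

# Assumption: UniProt REST endpoint that returns a single entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> str:
    """Return the sequence of the first record in a FASTA-formatted string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):  # a second record starts here; stop
            break
        sequence_lines.append(line)
    return "".join(sequence_lines).upper()


def is_standard_sequence(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence) <= VALID_RESIDUES


def fetch_sequence(accession: str, timeout: float = 30.0) -> str:
    """Download one UniProt entry as FASTA and return its amino acid sequence."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

For example, `fetch_sequence("P05067")` would request the human amyloid-beta precursor protein (APP), one target associated with Alzheimer's disease. The returned string can be used directly as an entry of `X_target`.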

# Model Prediction and Repurposing/Screening

import pickle
from DeepPurpose import utils
# from trapi_predict_kit import load

# Load pre-trained model
with open("../models/deeppurpose.pkl", "rb") as f:
    model = pickle.load(f)
# model = load(path="../models/deeppurpose")

drug_encoding, target_encoding = 'MPNN', 'CNN'

X_drug = ['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N']
X_target = ['MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL']
y = [1]  # Placeholder label: required by data_process, but it does not affect the predicted score.
X_pred = utils.data_process(
    X_drug, X_target, y,
    drug_encoding, target_encoding, 
    split_method='no_split'
)
y_pred = model.predict(X_pred)
print('The predicted score is ' + str(y_pred))

Drug Target Interaction Prediction Mode...
in total: 1 drug-target pairs
encoding drug...
unique drugs: 1
encoding protein...
unique target sequence: 1
splitting dataset...
do not do train/test split on the data for already splitted data
predicting...
The predicted score is [5.3395676612854]


# Next steps

## 1. Which dataset should we use?

Of all the pre-processed datasets available, which one would produce the best predictions? How can we evaluate whether one model produces better predictions than another?
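One possible way to compare two models is to score both on the same held-out drug-target pairs with known affinities, using mean squared error (lower is better) and the concordance index (higher is better, 0.5 is random ranking). The sketch below is an assumption about how such a comparison could be set up, not a method taken from this notebook; the affinity values are made-up toy numbers and the function names are illustrative.

```python
from itertools import combinations
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between measured and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs that the predictions rank in the true order.

    Pairs with identical true values are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 < p2) == (t1 < t2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy, made-up values for illustration only.
y_true = [5.0, 6.2, 7.1, 4.8]
pred_model_a = [5.1, 6.0, 7.0, 5.0]
pred_model_b = [6.0, 5.0, 5.5, 6.5]

for name, pred in [("A", pred_model_a), ("B", pred_model_b)]:
    print(name, mean_squared_error(y_true, pred), concordance_index(y_true, pred))
```

The comparison is only meaningful if both models are evaluated on the same test pairs and those pairs were not seen during training by either model.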

We could let users choose between models themselves, but that will not work in the context of the Translator project. **We need to provide one main model to query**.

TODO: try to load all datasets before training the model

## 2. Link targets to diseases?

Optional: users can send us amino acid sequences directly.

For some relevant diseases, like Alzheimer's disease, we need to **get a list of potential targets for a disease** (with their amino acid sequences).

Then we will be able to run predictions between a list of drug SMILES strings and those potential targets.
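Screening a list of drugs against a list of targets amounts to scoring every drug-target combination and sorting the results. The sketch below shows one way to build the two aligned input lists and rank the scores. `build_pairs` and `rank_pairs` are hypothetical helpers, and sorting in descending order assumes that a higher score means stronger predicted binding, which depends on how the training labels were defined.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def build_pairs(
    drugs: Sequence[str], targets: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Expand drugs x targets into two aligned lists, one entry per pair."""
    pairs = list(product(drugs, targets))
    x_drug = [drug for drug, _ in pairs]
    x_target = [target for _, target in pairs]
    return x_drug, x_target


def rank_pairs(
    x_drug: Sequence[str],
    x_target: Sequence[str],
    scores: Sequence[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, str, float]]:
    """Sort (drug, target, score) triples by score, highest first.

    Assumption: a higher score means stronger predicted binding.
    """
    if not (len(x_drug) == len(x_target) == len(scores)):
        raise ValueError("inputs must have the same length")
    ranked = sorted(zip(x_drug, x_target, scores), key=lambda row: row[2], reverse=True)
    return ranked[:top_k]  # top_k=None keeps every pair
```

The two lists returned by `build_pairs` have the same shape as `X_drug` and `X_target` in the prediction cell above, so they could presumably be passed to `utils.data_process` with `split_method='no_split'` and a placeholder `y` of matching length, and the output of `model.predict` fed to `rank_pairs`.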

Try to retrieve disease-to-target associations from a database or API, ideally, so that we can automate the system for all diseases.