
drug_food_interactions

1. Introduction

Drug-food interactions (DFIs) can alter the intended efficacy of therapeutics in medical practice. The increasing number of multiple-drug prescriptions has led to a rise in drug-drug interactions (DDIs) and DFIs. These adverse interactions have further consequences, e.g., reduced therapeutic effect, withdrawal of medications, and harmful impacts on patients' health. However, the importance of DFIs remains underestimated, as the number of studies on the topic is limited. Recently, researchers have applied artificial intelligence-based models to study DFIs, but limitations remain in data mining, data input, and detailed annotation. We propose a novel predictive model based on machine learning algorithms to predict DFIs.

2. Code presentation

2.1. Data extraction and labelling the DFI ground truths

We extracted 70,477 food compounds from FooDB and 13,580 drugs from DrugBank. For the food database, after removing duplicates and filtering out compounds with pairwise Tanimoto coefficients above 0.75, we kept 4,341 food compounds. From DrugBank, we selected drugs that have both SMILES formulas and DFI annotations, leaving 1,133 drugs for further investigation. Based on the DrugBank annotations, we labelled the DFI ground truths as negative ("0"), positive ("1"), or non-significant ("2").
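A minimal sketch of the similarity filter, assuming the food compounds are available as SMILES strings; the greedy keep-or-discard loop, Morgan fingerprint parameters, and function name are illustrative assumptions rather than the repository's exact code:

```python
# Sketch of Tanimoto-based filtering of food compounds (assumed input:
# a list of SMILES strings; the 0.75 cutoff comes from the text above).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def filter_by_tanimoto(smiles_list, cutoff=0.75):
    """Greedily keep a compound only if its Tanimoto similarity to every
    already-kept compound does not exceed the cutoff."""
    kept_fps, kept_smiles = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip unparsable SMILES
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        if all(DataStructs.TanimotoSimilarity(fp, kept) <= cutoff
               for kept in kept_fps):
            kept_fps.append(fp)
            kept_smiles.append(smi)
    return kept_smiles
```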

2.2. Feature extraction

We ran our experiments on Windows 10 (version 20H2) with a 4.60 GHz Intel i7-11800H CPU and 64 GB RAM. We used the PyBioMed package (PyInteraction module) and RDKit (version 1.0.3) to prepare the input representations of the chemical compounds. From each DFI pair, we extracted 3,780 Molecular Operating Environment (MOE) descriptor features.
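The exact PyBioMed calls are not shown in this README; as a stand-in, the sketch below uses RDKit's built-in 2D descriptor list to illustrate the overall layout, in which each DFI pair is represented by the concatenated descriptor vectors of the drug and the food compound (the example SMILES are arbitrary):

```python
# Stand-in sketch: per-molecule descriptors via RDKit's descriptor list,
# concatenated per drug-food pair (the study itself used PyBioMed's
# MOE-type descriptors; this only illustrates the pairing layout).
from rdkit import Chem
from rdkit.Chem import Descriptors

def molecule_features(smiles):
    """Compute all RDKit 2D descriptors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [func(mol) for _, func in Descriptors.descList]

def pair_features(drug_smiles, food_smiles):
    """One feature row per DFI pair: drug descriptors + food descriptors."""
    return molecule_features(drug_smiles) + molecule_features(food_smiles)

# Arbitrary example pair (aspirin and caffeine), purely for illustration.
row = pair_features("CC(=O)Oc1ccccc1C(=O)O",
                    "Cn1cnc2c1c(=O)n(C)c(=O)n2C")
```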

2.3. Selecting the baseline machine learning algorithms

We applied the Lazy Predict package (https://github.com/shankarpandala/lazypredict) in Python 3.8, which has 30 built-in classification algorithms, to obtain an overview of candidate models and choose the best algorithms for model building. We created five trial sets by randomly drawing 10,000 samples per class (random_state = 100000); each trial set therefore contained 30,000 instances across the three classes (30,000 rows and 3,781 columns: 3,780 features plus the class label). We then ran LazyPredict on the five sets one by one. After the five runs, we selected the four best-performing algorithms by accuracy, regardless of computational time: Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Extra Trees Classifier (ETC). Details are given in the Results section. The LazyPredict results were used only to identify promising algorithms quickly; they were not used as final results and did not affect the final output of the study.
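A minimal sketch of the screening step, with shrunken random placeholder data standing in for one trial set (the split parameters are assumptions; the README does not specify them):

```python
# Sketch of the LazyPredict screening on one trial set; the data here are
# small random placeholders for a 30,000 x 3,780 trial set.
import numpy as np
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

X = np.random.rand(3000, 100)           # placeholder features
y = np.random.randint(0, 3, size=3000)  # placeholder labels: 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100000)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, _ = clf.fit(X_train, X_test, y_train, y_test)
print(models.sort_values("Accuracy", ascending=False).head(4))
```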

2.4. Comparisons between baseline algorithms

We then ran the four algorithms one by one on the full dataset (557,483 instances × 3,780 features) with default parameters to test the performance of each. We focused on the accuracy score. We used sklearn.model_selection.train_test_split with 75% of the data (418,112 instances) for training and 25% (139,371 instances) for testing, with a random state of 1000. Of the four algorithms, XGBoost recorded the highest accuracy, 98.36%, nearly 2 percentage points higher than the second-ranked algorithm, LightGBM (96.45%). Detailed results are listed in the Results section. We selected XGBoost as the baseline algorithm for the model ensemble.
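A sketch of the head-to-head comparison under those settings, again with placeholder data (in practice, X and y would hold the 557,483 × 3,780 dataset):

```python
# Sketch of the baseline comparison: default parameters, 75/25 split with
# random_state=1000 as stated above; small placeholder data for brevity.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X = np.random.rand(4000, 50)            # placeholder for 557,483 x 3,780
y = np.random.randint(0, 3, size=4000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1000)

models = {
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.4f}")
```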

2.5. Feature selection

After selecting the baseline algorithm, we performed feature selection using sklearn.feature_selection.VarianceThreshold and sklearn.feature_selection.SelectKBest. VarianceThreshold (VT) removes features whose variance falls below a threshold; the default threshold is zero, so VT retains all features with non-zero variance. SelectKBest selects the K most informative features and can act as a data preprocessing step, thereby shortening the computational time. We set the VT threshold to 0.8, removing all features with variance below 0.8; of the 3,780 features, 2,844 remained for the SelectKBest step. To increase accuracy and reliability, we varied the parameter K from 1 to 2,844 (i.e., selecting from 1 up to all 2,844 remaining features) and evaluated the performance of the four algorithms, recording the results to find the highest accuracy together with the algorithm and value of K that produced it. With K = 2,844, XGBoost achieved the highest accuracy score, 98.36%. With this number of features, the runtime dropped from 48 minutes 48 seconds to 39 minutes 53 seconds while the accuracy remained at 98.36%. Additionally, we examined which features were most important for prediction and classification by applying the SHapley Additive exPlanations (SHAP) method to the XGBoost-KBest model. The detailed SHAP importance plots are presented in the Results section.
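A sketch of the selection pipeline plus the SHAP step; the SelectKBest score function (f_classif) is our assumption, as the README does not name one, and the data are small placeholders:

```python
# Sketch: VarianceThreshold at 0.8, SelectKBest, then SHAP on the fitted
# XGBoost model. f_classif is an assumed score function; in the study,
# k = 2844 and X holds the full MOE descriptor matrix.
import numpy as np
import shap
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from xgboost import XGBClassifier

X = np.random.rand(2000, 100) * 10      # placeholder, scaled so variance > 0.8
y = np.random.randint(0, 3, size=2000)

vt = VarianceThreshold(threshold=0.8)   # drop features with variance <= 0.8
X_vt = vt.fit_transform(X)

kbest = SelectKBest(score_func=f_classif, k=min(80, X_vt.shape[1]))
X_sel = kbest.fit_transform(X_vt, y)

model = XGBClassifier().fit(X_sel, y)

explainer = shap.TreeExplainer(model)   # SHAP importances for kept features
shap.summary_plot(explainer.shap_values(X_sel), X_sel)
```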

2.6. Drug-food recommendations

For the final stage, we aimed to create a model that yields recommendations or warnings to help patients and physicians combine drugs with food safely in daily use. The user inputs the names of a drug and a food compound, and the model indicates whether they can be taken together; based on that result, the patient can consult a physician for the best advice. If the DFI is positive, i.e., labelled "1", the output message is "A could be taken with food containing B." If the DFI is non-significant, the output is "A may be taken with food containing B." Otherwise, the output is "A should not be taken with food containing B."
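A sketch of how the messaging could be wired up, assuming a trained classifier (`model`) and the `pair_features` helper from the section 2.2 sketch; the function and variable names are hypothetical:

```python
# Hypothetical wrapper around a trained classifier; `model` and
# `pair_features` are assumed from the earlier sketches.
MESSAGES = {
    1: "{drug} could be taken with food containing {food}.",      # positive
    2: "{drug} may be taken with food containing {food}.",        # non-significant
    0: "{drug} should not be taken with food containing {food}.", # negative
}

def recommend(drug_name, drug_smiles, food_name, food_smiles):
    """Predict the DFI class for one drug-food pair and format the message."""
    features = [pair_features(drug_smiles, food_smiles)]
    label = int(model.predict(features)[0])
    return MESSAGES[label].format(drug=drug_name, food=food_name)
```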

Conclusions

To reduce the number of adverse drug events (ADEs) due to DFIs and drug-nutrient interactions (DNIs), we proposed a new classification model based on the XGBoost algorithm with VT and the SelectKBest feature selection method. Our model's ability to predict adverse interactions between drug and food compounds can contribute to drug discovery and to the conformity between prescribed drugs and dietary plans in clinical medicine.

References

  1. Dong, J., et al. PyBioMed: a Python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform, 2018, 10(1): 16.
  2. Computational Biology & Drug Design Group. PyBioMed molecular features. 2015.
  3. Pedregosa, F., et al. Scikit-learn: machine learning in Python. J Mach Learn Res, 2011, 12: 2825-2830.
  4. Lundberg, S. and Lee, S.-I. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.
