Drug Discovery: A Machine Learning and Deep Learning Approach using SMILES Strings

This Jupyter Notebook applies machine learning and deep learning to drug discovery against a specific protein target. The pipeline has three stages: a Random Forest (RF) regression model is trained to predict pIC50 values for compounds tested against the target, LSTMs are trained to generate novel Simplified Molecular-Input Line Entry System (SMILES) strings, and the trained RF model is then used to predict pIC50 values for the generated strings.

Inspired by Chanin Nantasenamat's Computational Drug Discovery

Methods and Dataset

  • Dataset
    • Bioactivity data from the ChEMBL database for Human Acetylcholinesterase (hAChE)
    • 9,091 data points and 46 associated properties
    • Preprocessing:
      • remove duplicates and missing values
      • convert IC50 values to pIC50 (negative log base 10)
      • calculate fingerprint descriptors from the molecule IDs and SMILES strings
  • Methods
    • Training Random Forest model
      • Apply nested cross-validation to find optimal parameters (5-fold outer loop and 3-fold inner loop)

      • Summary of nested cross-validation

        | k-fold | Best parameters                              | Best score |
        |--------|----------------------------------------------|------------|
        | k=1    | {'max_depth': 100, 'n_estimators': 1000}     | 0.58       |
        | k=2    | {'max_depth': 90, 'n_estimators': 1000}      | 0.55       |
        | k=3    | {'max_depth': 100, 'n_estimators': 1000}     | 0.46       |
        | k=4    | {'max_depth': 100, 'n_estimators': 1000}     | 0.52       |
        | k=5    | {'max_depth': 100, 'n_estimators': 1200}     | 0.52       |
    • Training SMILES generator with 1,725 SMILES strings representing active molecules
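
The nested cross-validation scheme listed above (5-fold outer loop, 3-fold inner loop) can be sketched in plain Python as follows. This is a minimal illustration of the control flow, not the notebook's actual code: the helper names `k_fold_indices` and `nested_cv`, and the caller-supplied `fit_and_score` callback, are hypothetical. It also assumes that a higher score is better and that the reported score is the one on the held-out outer fold, neither of which the summary table states.

```python
import random
from itertools import product


def k_fold_indices(indices, k, seed=0):
    """Shuffle the given indices and split them into k near-equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, param_grid, fit_and_score, outer_k=5, inner_k=3):
    """Return one (best_params, outer_score) pair per outer fold.

    fit_and_score(params, train_idx, test_idx) is a hypothetical callback:
    it should train a model with `params` on `train_idx` and return a
    score on `test_idx`, where higher is assumed to be better.
    """
    # Expand the grid into a list of candidate parameter dictionaries.
    names = sorted(param_grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(param_grid[name] for name in names))]

    results = []
    outer_folds = k_fold_indices(range(n_samples), outer_k)
    for i, outer_test in enumerate(outer_folds):
        # Everything outside the current outer fold is available for tuning.
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k)

        def inner_score(params):
            # Mean validation score of one candidate over the inner folds.
            scores = []
            for m, inner_val in enumerate(inner_folds):
                inner_train = [idx for j, fold in enumerate(inner_folds)
                               if j != m for idx in fold]
                scores.append(fit_and_score(params, inner_train, inner_val))
            return sum(scores) / len(scores)

        # Tune on the inner loop only, then score once on the outer fold.
        best = max(candidates, key=inner_score)
        results.append((best, fit_and_score(best, outer_train, outer_test)))
    return results
```

Calling `nested_cv(n, {"max_depth": [...], "n_estimators": [...]}, fit_and_score)` returns five pairs, one per outer fold, which is the same shape as the summary table. Because the outer test fold is never seen during tuning, the outer scores give a less optimistic estimate than the inner-loop scores would.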

Results

  • Mean Absolute Error (MAE) of the Random Forest model: 0.75

    [Figure: test_pred_v1]

  • Output of the SMILES_generator: 1 valid SMILES string, C=O (formaldehyde)
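
For reference, the Mean Absolute Error reported above is the average absolute difference between measured and predicted values. A minimal sketch follows; the function is written out by hand for illustration, and the pIC50 values are made up rather than taken from the notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |measured - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical pIC50 values, for illustration only.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.75, 7.25, 7.0]
example_mae = mean_absolute_error(measured, predicted)  # (0.5 + 0.25 + 0.25 + 1.0) / 4
```

Since the model predicts pIC50, which is a log10 scale, an error of 0.75 pIC50 units corresponds to roughly a 5.6-fold (10^0.75) difference in IC50.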