Drug Discovery: A Machine Learning and Deep Learning Approach using SMILES Strings

This Jupyter Notebook demonstrates the use of machine learning and deep learning for drug discovery against a specific protein target. The pipeline has three stages: a Random Forest (RF) regression model is trained to predict pIC50 values for compounds tested against the target, LSTMs are trained to generate novel Simplified Molecular-Input Line Entry System (SMILES) strings, and the trained RF model is then used to predict pIC50 values for the generated strings.

Inspired by Chanin Nantasenamat's Computational Drug Discovery

Methods and Dataset

Dataset
- Bioactivity data from the ChEMBL database for Human Acetylcholinesterase (hAChE)
- 9,091 data points and 46 associated properties
- Preprocessing:
  - remove duplicates and missing values
  - convert IC50 values to pIC50, the negative base-10 logarithm of the molar IC50
  - calculate fingerprint descriptors from the molecule ID and SMILES strings
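
The first two preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the record layout, the field names `smiles` and `ic50_nm`, and the choice to keep the first occurrence of each SMILES string are all assumptions, as is the premise that IC50 values are reported in nanomolar.

```python
import math

def to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar (assumed unit) to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nm * 1e-9)

def preprocess(records):
    """Drop rows with missing values and duplicate SMILES, then add a pIC50 field.

    `records` is a hypothetical list of dicts with keys "smiles" and "ic50_nm".
    """
    seen = set()
    cleaned = []
    for rec in records:
        smiles, ic50 = rec.get("smiles"), rec.get("ic50_nm")
        # Treat absent or non-positive values as missing (the log is undefined there).
        if not smiles or ic50 is None or ic50 <= 0:
            continue
        # Keep only the first occurrence of each SMILES string (an assumption).
        if smiles in seen:
            continue
        seen.add(smiles)
        cleaned.append({"smiles": smiles, "pIC50": to_pic50(ic50)})
    return cleaned

records = [
    {"smiles": "CCO", "ic50_nm": 1000.0},
    {"smiles": "CCO", "ic50_nm": 1000.0},      # duplicate, dropped
    {"smiles": "c1ccccc1", "ic50_nm": None},   # missing value, dropped
    {"smiles": "C=O", "ic50_nm": 10.0},
]
cleaned = preprocess(records)
for row in cleaned:
    print(row["smiles"], round(row["pIC50"], 2))
```

Working on the log scale compresses IC50 values that span several orders of magnitude into a narrow range, and a higher pIC50 means a more potent compound.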

Methods

Training Random Forest model

Nested cross-validation is used to tune the hyperparameters: a 3-fold inner loop selects the best parameters, and a 5-fold outer loop scores the selected model on data held out from that search.

Summary of nested cross-validation

| Outer fold | Best parameters | Best score |
| --- | --- | --- |
| k=1 | {'max_depth': 100, 'n_estimators': 1000} | 0.58 |
| k=2 | {'max_depth': 90, 'n_estimators': 1000} | 0.55 |
| k=3 | {'max_depth': 100, 'n_estimators': 1000} | 0.46 |
| k=4 | {'max_depth': 100, 'n_estimators': 1000} | 0.52 |
| k=5 | {'max_depth': 100, 'n_estimators': 1200} | 0.52 |
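
A nested cross-validation of this shape can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop) in `cross_validate` (outer loop). The sketch below is an assumption about how such a search might be set up, not the notebook's code: it uses synthetic data from `make_regression`, a deliberately tiny parameter grid so it runs quickly, and R² as the score, since the metric behind the table above is not stated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Synthetic stand-in for the fingerprint matrix and pIC50 targets (assumption).
X, y = make_regression(n_samples=120, n_features=20, noise=0.5, random_state=0)

# Tiny illustrative grid; the table above reports much larger values.
param_grid = {"max_depth": [5, 10], "n_estimators": [20, 40]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # parameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimate

# Inner loop: grid search refits the best parameter set on each outer training split.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="r2",
)

# Outer loop: each fold runs its own grid search and scores on held-out data.
results = cross_validate(
    search, X, y, cv=outer_cv, scoring="r2", return_estimator=True
)

for k, (est, score) in enumerate(
    zip(results["estimator"], results["test_score"]), start=1
):
    print(f"k={k}", est.best_params_, round(score, 2))
```

Because each outer fold runs an independent search, the selected parameters can differ between folds, as they do in the table above.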

Training SMILES generator with 1,725 SMILES strings representing active molecules
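
Before an LSTM can be trained on SMILES strings, each string has to be turned into integer sequences. The sketch below shows one common character-level scheme, in which the target sequence is the input shifted by one position so the network learns to predict the next character. It is an assumption about the data preparation, not the notebook's code: the start and end markers (`!` and a newline) and the helper names are hypothetical, and character-level splitting treats two-letter atoms such as `Cl` as two tokens.

```python
START, END = "!", "\n"  # hypothetical markers; neither is a SMILES character

def build_vocab(smiles_list):
    """Map every character seen in the data, plus the markers, to an integer id."""
    chars = sorted(set("".join(smiles_list)) | {START, END})
    return {ch: i for i, ch in enumerate(chars)}

def encode_pair(smiles, char_to_idx):
    """Return (input_ids, target_ids) for next-character prediction."""
    seq = START + smiles + END
    ids = [char_to_idx[ch] for ch in seq]
    # The input drops the last token; the target drops the first.
    return ids[:-1], ids[1:]

smiles_list = ["C=O", "CCO", "c1ccccc1"]
vocab = build_vocab(smiles_list)
x, y = encode_pair("C=O", vocab)
print(len(vocab), x, y)
```

At generation time, a model trained this way is seeded with the start marker and sampled one character at a time until it emits the end marker.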

Results

Mean Absolute Error for Random Forest: 0.75
Output of the SMILES_generator: 1 valid SMILES string, C=O (formaldehyde)
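
To put the Mean Absolute Error in context, the sketch below computes MAE on a few made-up pIC50 values and converts a pIC50 error into the corresponding multiplicative error in IC50. The numbers in `y_true` and `y_pred` are illustrative only and do not come from the notebook; `fold_error` is a hypothetical helper name.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def fold_error(pic50_error):
    """Factor by which IC50 is off for a given absolute error in pIC50."""
    return 10 ** pic50_error

# Made-up pIC50 values for illustration only.
y_true = [6.0, 7.5, 5.0]
y_pred = [6.5, 7.0, 6.0]
mae = mean_absolute_error(y_true, y_pred)

print(round(mae, 3))
print(round(fold_error(0.75), 1))
```

Because pIC50 is a base-10 logarithm, an error of 0.75 pIC50 units corresponds to a predicted IC50 that is off by a factor of roughly 5.6.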
