<a href="https://colab.research.google.com/github/schwallergroup/ai4chem_course/blob/scikit_learn/notebooks/02%20-%20Supervised%20Learning/training_and_evaluating_ml_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Software
### Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. It is widely used in industry and academia, and a wealth of tutorials and code snippets are available online.
We will learn to use scikit-learn to do machine learning work. You can also browse the scikit-learn [user guide](https://scikit-learn.org/stable/user_guide.html) and [tutorials](https://scikit-learn.org/stable/tutorial/index.html) for additional details.
### Essential Libraries and Tools 
Scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and interactive development, you should also install matplotlib, IPython, and the Jupyter Notebook.
- **NumPy** is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators. In scikit-learn, the NumPy array is the fundamental data structure: scikit-learn takes in data in the form of NumPy arrays, so any data you use ends up being converted to a NumPy array (many estimators also accept array-like inputs such as pandas DataFrames and convert them internally).
- **SciPy** is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy’s collection of functions for implementing its algorithms.
- **Matplotlib** is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
- **Pandas** is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame, which is similar to an Excel spreadsheet. It can ingest data from a great variety of file formats and databases, such as SQL, Excel files, and comma-separated values (CSV) files.
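
As a small illustration of how these libraries fit together, the sketch below turns a pandas DataFrame into the NumPy arrays that scikit-learn works with. The column names and values are invented for this example and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (names and values made up for illustration):
# two numeric feature columns and one target column.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [0, 1, 0],
    "target": [0.5, 1.5, 2.5],
})

# Select the feature columns and convert them to a 2-D float array.
X_arr = df[["feature_a", "feature_b"]].to_numpy(dtype=float)
# Convert the target column to a 1-D array.
y_arr = df["target"].to_numpy()

print(type(X_arr).__name__, X_arr.shape)
print(type(y_arr).__name__, y_arr.shape)
```

By convention, the feature matrix has one row per sample and one column per feature, while the target is a 1-D array with one entry per sample.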

### XGBoost
XGBoost (eXtreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. You can also browse the [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/) for additional details.

### DeepChem
DeepChem is a high-quality open-source toolchain that democratizes the use of deep learning in chemistry, biology, and materials science. It provides tools for dataset loading, data splitting, molecular featurization, model construction, and hyperparameter tuning. You can also browse the [DeepChem Documentation](https://deepchem.readthedocs.io/en/latest/) for additional details.

We will first install the required libraries. We also need the `RDKit` library to process and analyze molecules, for example to calculate molecular descriptors.

In [None]:
!pip install numpy scipy matplotlib scikit-learn pandas rdkit xgboost deepchem

# 1. Introduction to Machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn" – that is, methods that leverage data to improve performance on some set of tasks. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.

<div align="center">
<img src="https://s3.ap-southeast-1.amazonaws.com/files-scs-prod/public%2Fimages%2F1605842918803-AI+vs+ML+vs+DL.png" width="500"/>
</div>

Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:
- **Supervised learning**: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
- **Unsupervised learning**: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- **Reinforcement learning**: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximize.

<div align="center">
<img src="https://starship-knowledge.com/wp-content/uploads/2021/01/unsupervised_supervised_reinforcement.jpeg" width="500"/>
</div>

# 2. Supervised learning
There are two major types of supervised machine learning problems:
- **Classification**: the task is to predict a class label, which is a choice from a predefined list of possibilities. For example, deciding whether a photo shows a dog, a cat, or a rabbit.
- **Regression**: the task is to predict a continuous number (a floating-point number in programming terms, or a real number in mathematical terms). For example, predicting a person's annual income from their education, their age, and where they live.

<div align="center">
<img src="https://cdn-images-1.medium.com/max/1600/1*xs6Jr4iAPvoqszF9JgDWOA.png" width="500"/>
</div>

## Common algorithms
- k-Nearest Neighbors (k-NN)
- Linear Models
- Support Vector Machines
- Decision Trees
- Ensembles of Decision Trees
  - Random forests
  - Gradient boosting machines

We can use `scikit-learn` to create ML models of different algorithms.

In [None]:
# k-NN classifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3) # instantiate the model and set the number of neighbors to consider to 3

# k-NN regressor
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=3) # instantiate the model and set the number of neighbors to consider to 3

# linear regressor
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

# decision tree classifier & regressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
tree_clf = DecisionTreeClassifier()
tree_reg = DecisionTreeRegressor()

# random forest classifier & regressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
ranf_clf = RandomForestClassifier(n_estimators=10)  # using 10 trees
ranf_reg = RandomForestRegressor(n_estimators=10)  # using 10 trees

# XGBoost classifier & regressor
from xgboost import XGBClassifier, XGBRegressor
bst_clf = XGBClassifier(n_estimators=10)  # using 10 trees
bst_reg = XGBRegressor(n_estimators=10)  # using 10 trees
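
All of the estimators above share the same `fit` / `predict` interface. The following minimal sketch shows this on a toy 1-D dataset invented for illustration (the variable names are not part of the course material), using one classifier and one regressor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy inputs (made up for this example): six points on a line.
X_toy = np.arange(6, dtype=float).reshape(-1, 1)   # [[0], [1], ..., [5]]
y_class = np.array([0, 0, 0, 1, 1, 1])             # class labels
y_value = 2.0 * X_toy.ravel()                      # continuous targets

# Both model types are trained with fit(X, y) ...
clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_value)

# ... and queried with predict(X_new).
X_new = np.array([[1.0], [4.0]])
pred_class = clf.predict(X_new)   # majority label among the 3 nearest points
pred_value = reg.predict(X_new)   # mean target of the 3 nearest points

print(pred_class)
print(pred_value)
```

The classifier returns discrete labels while the regressor returns continuous values, which is the practical difference between the two problem types.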

## Model evaluation and data splitting
### Why do we need to split the dataset?
We want models to learn from data so that they can make predictions on new data (such as data without labels). But should we trust their predictions? To answer that, we need a way to evaluate models before using them. Unfortunately, we cannot use the data we used to build the model to evaluate it. A model can simply memorize the whole training set, and would then predict the correct label for every point in the training set. This memorization tells us nothing about whether the model will **generalize** well (in other words, whether it will also perform well on new data).

To assess the model's performance, we show it new data (data that it has not seen before) for which we have labels. This is usually done by splitting the labeled data we have collected into two parts. One part of the data is used to build our machine learning model, and is called the training data or **training set**. The rest of the data is used to assess how well the model works; this is called the test data, **test set**, or hold-out set. In addition, when tuning model hyperparameters we need a **validation set**, which provides an unbiased evaluation of a model fitted on the training set while leaving the test set untouched until the final evaluation. If you have more time, you can read this [article](https://towardsdatascience.com/how-to-split-data-into-three-sets-train-validation-and-test-and-why-e50d22d3e54c) for more details.
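
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split on placeholder data; the ratios and the `_demo` variable names are choices made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 2 features each.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

# Step 1: hold out 20% of all samples as the test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: split the remaining 80 samples into train and validation.
# 25% of the remaining 80 samples is 20 samples.
X_train_demo, X_valid_demo, y_train_demo, y_valid_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_valid_demo), len(X_test_demo))
```

The validation set is used to compare models and hyperparameters, and the test set is used only once at the very end.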
### Evaluation metrics
The metrics used to evaluate the ML models are very important. The choice of metrics to use influences how model performance is measured and compared. The metrics influence both how you weight the importance of different characteristics in the results and your ultimate choice of algorithm. The main evaluation metrics for regression and classification tasks are illustrated below. If you have more time, you can read this [article](https://blog.knoldus.com/model-evaluation-metrics-for-machine-learning-algorithms/) for more details.

<div align="center">
<img src="https://www.oreilly.com/api/v2/epubs/9781492073048/files/assets/mlbf_0407.png" width="500"/>
</div>
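
To make a few of these metrics concrete, here is a small sketch that computes common regression and classification metrics with `sklearn.metrics`. The true and predicted values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, precision_score,
                             r2_score, recall_score)

# --- Regression metrics (toy values) ---
y_true_reg = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_reg = np.array([1.5, 2.0, 2.5, 5.0])

mse = mean_squared_error(y_true_reg, y_pred_reg)    # 0.375
rmse = np.sqrt(mse)                                 # square root of the MSE
mae = mean_absolute_error(y_true_reg, y_pred_reg)   # 0.5
r2 = r2_score(y_true_reg, y_pred_reg)               # 0.7

# --- Classification metrics (toy binary labels) ---
y_true_cls = np.array([1, 0, 1, 1, 0, 0])
y_pred_cls = np.array([1, 0, 0, 1, 1, 0])

accuracy = accuracy_score(y_true_cls, y_pred_cls)     # 4 of 6 labels correct
precision = precision_score(y_true_cls, y_pred_cls)   # 2 of 3 predicted positives are correct
recall = recall_score(y_true_cls, y_pred_cls)         # 2 of 3 actual positives are found

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Note that the `sklearn.metrics` functions take the true values first and the predictions second.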


## Common steps
1. Prepare data & split data
2. Choose the model
3. Train the model
4. Evaluate the model
5. Use the model
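
The five steps above can be sketched end to end on synthetic data. This is only an outline under assumed settings (a random forest with 20 trees, an 80/20 split, data from `make_regression`); the regression example later in this notebook applies the same steps to real molecules.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Prepare data & split data (synthetic data stands in for a real dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

# 2. Choose the model
model = RandomForestRegressor(n_estimators=20, random_state=0)

# 3. Train the model
model.fit(X_tr, y_tr)

# 4. Evaluate the model on data it has not seen
rmse_te = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"Test RMSE: {rmse_te:.3f}")

# 5. Use the model to predict for a new sample
new_prediction = model.predict(X_te[:1])
print(new_prediction)
```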

## Regression example
Below is a simple example that shows the basic steps of a regression task. **Our goal** is to build an ML model that can learn from chemical structures (encoded as SMILES strings) to predict **water solubility**. We will use the ESOL dataset from [MoleculeNet](https://doi.org/10.1039/C7SC02664A) to train the models. This dataset contains structures and water solubility data for 1128 compounds.

Load dataset & show data

In [None]:
import pandas as pd

# load dataset from a CSV file
esol_df = pd.read_csv('../data/esol.csv')
esol_df


The original dataset contains 2 columns. The `smiles` column holds the SMILES strings of the solute molecules. The `log solubility (mol/L)` column holds the solubility of the molecules in water, which is the target we want to predict.

In [None]:
smiles = esol_df['smiles'].values
y = esol_df['log solubility (mol/L)'].values

We need to convert the SMILES strings of the molecules into numerical values that can be used as input to the ML models. We can calculate molecular descriptors from SMILES strings with software such as `RDKit`, `DeepChem`, and [Mordred](https://github.com/mordred-descriptor/mordred). Here we use DeepChem [Featurizers](https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html) to compute molecular descriptors.

In [None]:
# Here, we use molecular descriptors from RDKit, like molecular weight, number of valence electrons, maximum and minimum partial charge, etc.
from deepchem.feat import RDKitDescriptors
featurizer = RDKitDescriptors()
features = featurizer.featurize(smiles)
print("Number of molecular descriptors:", features.shape[1])

Data preprocessing

In [None]:
import numpy as np

# Min-Max Normalization of features
fea_max = features.max(axis=0)
fea_min = features.min(axis=0)
fea_norm = (features - fea_min) / (fea_max - fea_min)

# Check if normalized features contain invalid values (NaN),
# e.g. from constant columns where fea_max == fea_min gives 0/0
contain_nan = np.isnan(fea_norm).any()
if contain_nan:
    print('Our normalized features contain invalid values; dropping the affected columns before model training.')
    fea_norm = fea_norm[:, ~np.isnan(fea_norm).any(axis=0)]
    print('Dropping of columns containing invalid values has been completed.')
else:
    print('Our normalized features do not contain invalid values.')
print("Number of molecular descriptors after data preprocessing:", fea_norm.shape[1])

Dataset split

In [None]:
from sklearn.model_selection import train_test_split
X = fea_norm
# training data size : test data size = 0.8 : 0.2
# fixed seed using the random_state parameter, so it always has the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)
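
In the preprocessing cell above, the minimum and maximum are computed over the whole dataset before splitting. A common alternative, sketched here on random placeholder data, is to fit the scaling on the training portion only and then apply it to the test portion, so that no information from the test set enters preprocessing. The use of `MinMaxScaler` and all variable names below are assumptions for this illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for unscaled molecular descriptors.
rng = np.random.default_rng(0)
raw_features = rng.normal(size=(50, 3))
raw_target = rng.normal(size=50)

# Split first ...
raw_train, raw_test, t_train, t_test = train_test_split(
    raw_features, raw_target, train_size=0.8, random_state=0)

# ... then learn min and max from the training set only.
scaler = MinMaxScaler().fit(raw_train)
scaled_train = scaler.transform(raw_train)
scaled_test = scaler.transform(raw_test)   # may fall slightly outside [0, 1]

print(scaled_train.min(axis=0), scaled_train.max(axis=0))
print(scaled_test.shape)
```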

Create models

In [None]:
# random forest regressor, and the default criterion is mean squared error (MSE)
from sklearn.ensemble import RandomForestRegressor
ranf_reg = RandomForestRegressor(n_estimators=50, random_state=0)  # using 50 trees and seed=0

# XGBoost regressor
from xgboost import XGBRegressor
bst_reg = XGBRegressor(n_estimators=50, random_state=0)  # using 50 trees and seed=0

Train and evaluate the models
- Mean Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- Root Mean Squared Error: $RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

We choose `RMSE` as the evaluation metric for this task.
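
The formulas above can be checked by hand on a tiny example. The sketch below (toy data invented for illustration) computes MSE and RMSE directly from the definition and compares them with `mean_squared_error`. It also shows that the `score` method of a scikit-learn regressor returns the coefficient of determination $R^2$, not the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Toy data (made up): four points that lie roughly on a line.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0.1, 0.9, 2.2, 2.8])

lin = LinearRegression().fit(X_small, y_small)
y_hat = lin.predict(X_small)

# MSE and RMSE straight from the definitions.
mse_manual = np.mean((y_small - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# The same MSE from scikit-learn (true values first, predictions second).
mse_sklearn = mean_squared_error(y_small, y_hat)

# For regressors, .score() is R^2, which is a different quantity.
r2_from_score = lin.score(X_small, y_small)

print(f"MSE (manual)  = {mse_manual:.4f}")
print(f"MSE (sklearn) = {mse_sklearn:.4f}")
print(f"RMSE          = {rmse_manual:.4f}")
print(f"R^2 (.score)  = {r2_from_score:.4f}")
```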

In [None]:
from sklearn.metrics import mean_squared_error

# for random forests
ranf_reg.fit(X_train, y_train)  # train the model
# note: ranf_reg.score() returns R^2, not MSE, so we compute the MSE from predictions
ranf_train_mse = mean_squared_error(y_train, ranf_reg.predict(X_train))
ranf_test_mse = mean_squared_error(y_test, ranf_reg.predict(X_test))
ranf_train_rmse = ranf_train_mse ** 0.5
ranf_test_rmse = ranf_test_mse ** 0.5
print('Random forests performance:')
print('RMSE on train set: {:.3f}, and test set: {:.3f}.\n'.format(ranf_train_rmse, ranf_test_rmse))

# for XGBoost
bst_reg.fit(X_train, y_train)  # train the model
y_pred_train = bst_reg.predict(X_train)
y_pred_test = bst_reg.predict(X_test)
# mean_squared_error expects (y_true, y_pred)
bst_train_mse = mean_squared_error(y_train, y_pred_train)
bst_test_mse = mean_squared_error(y_test, y_pred_test)
bst_train_rmse = bst_train_mse ** 0.5
bst_test_rmse = bst_test_mse ** 0.5
print('XGBoost performance:')
print('RMSE on train set: {:.3f}, and test set: {:.3f}.'.format(bst_train_rmse, bst_test_rmse))

A smaller RMSE on the test set indicates a more accurate model on this task, so compare the two test RMSE values printed above to see whether XGBoost or random forests performs better here.

# Introduction to traditional ML.

## Supervised learning

Training a model to take inputs X and return output y.

As you have seen in class, for this type of learning, we have two variants:

- Classification
- Regression

Linear regression is one example of supervised learning for regression.

# Week 2 tutorial - AI 4 Chemistry

## Index:

- Classification
.
.
.

### TODO

# Regression

### TODO: Improve this introduction based on the ESOL paper, why is solub. prediction important?

One problem in both academic and industrial chemistry is predicting solubility. For instance, we might know that a molecule has good potential as a ligand for a relevant reaction, but after synthesizing it we find that it is not soluble under our already optimized reaction conditions! 😥

It would be extremely useful to know the solubility of my molecule, **before I even try to synthesize it**!

---

In this task we will tackle this problem using supervised learning. In particular, we will train a regression model using the very convenient [scikit-learn](https://scikit-learn.org/stable/) Python library to predict solubility from molecular descriptors.

In [None]:
# TODO: Let's start with loading the data, visualizing some molecules and their solubility
# Let's also see some stats. e.g. size of dataset, distribution of solubility, etc.

# TODO: Generate features
# TODO do a train/test split

In [None]:
# TODO: Let's give an introduction to scikit-learn by doing a simple linear regression and see the results.
# TODO: Introduce scikit-learn, and use a RF model for this.
import tempfile
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
cache_path = tempfile.gettempdir()  # we use a temporary folder here
X, _ = make_regression(n_samples=50, n_features=25, random_state=0)
estimator = make_pipeline(
    KNeighborsTransformer(mode='distance'),
    Isomap(n_components=3, metric='precomputed'),
    memory=cache_path)
X_embedded = estimator.fit_transform(X)
X_embedded.shape

In [None]:
# TODO: Let's train another model (this can be an exercise)

# EXERCISE: Implement random forest and XGBoost models using scikit-learn

# As we see, there are many possible models we can use for this task. But which one is better?

In addition, each model has a set of hyperparameters that we need to tune ourselves. How do we select them?

This is an important part of machine learning! What we want to know is: What is the best combination of model + model hyperparameters for our task? 
As you've seen in the course, common strategies to evaluate and compare model's performance include:

- Splitting dataset in train/validation/test.
- Doing cross-validation for hyperparameter tuning.
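
As a hedged sketch of the second strategy, the example below runs a 5-fold cross-validated grid search with `GridSearchCV` on synthetic data. The model, the parameter grid, and the variable names are arbitrary choices for illustration, not recommended settings for the solubility task.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for molecular descriptors and solubility.
X_cv, y_cv = make_regression(n_samples=150, n_features=5,
                             noise=5.0, random_state=0)

# Keep a test set aside; cross-validation only sees the training part.
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=0)

# Candidate hyperparameter values (illustrative only).
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_cv_train, y_cv_train)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)

# Final check of the selected model on the held-out test set.
test_rmse = -search.score(X_cv_test, y_cv_test)
print("Test RMSE:", test_rmse)
```

After the search, `GridSearchCV` refits the best combination on the whole training part by default, which is why `search.score` can be called directly on the test set.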

In [None]:
# TODO: Split data in train/valid/test

# Retrain the models on the train set, and compare them using the validation set.

# What model is the best?

In [None]:
# TODO: Let's do cross-validation

# Optimize the hyperparameters for XGBoost, and again compare performance

In [None]:
# TODO: Finally, compare all models on the test set.
# Explain that test set should never be seen by models.
# This is all completely new data so we know how it would work in real life.

---

# Classification

We now turn our attention towards the other type of supervised learning: classification.

Many questions in chemistry can be framed as a classification task: 

- Will this molecule act as a nucleophile or electrophile in my reaction?
- What is the smell of this substance? (fruity, citrus, sweet, ...)

<div>
<img src="img/is_this_a_meme.png" width="500"/>
</div>


--- 

### TODO: Let's get a dataset for classification in molecules. Let's say it's prediction of toxicity.

What we want to know is, is the molecule shown toxic or not?

##### TODO insert image of molec

Let's see if a model can tell what molecules are toxic!

This would be very useful for instance in drug discovery, where we want to know if a molecule has potential as a drug, **even before we synthesize it**.

To do this, we will use [mordred](http://mordred-descriptor.github.io/documentation/master/descriptors.html) to generate some molecular descriptors, and will again train some models with scikit-learn on these features.


In [None]:
# TODO load dataset
# TODO small visualization of dataset, let's see some molecules and the property we want to predict.

In [None]:
# TODO train a Random Forest classification model

In [None]:
# TODO evaluate model using metrics for this (AUC-ROC, accuracy, etc)
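
Until the toxicity dataset is in place, the evaluation step can be sketched on synthetic data. The example below assumes a binary classification problem generated with `make_classification` and reports accuracy and ROC-AUC; the dataset and all variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (placeholder for a toxicity dataset).
X_c, y_c = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratify so that train and test keep the same class proportions.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_c, y_c, test_size=0.25, stratify=y_c, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
clf_demo.fit(Xc_tr, yc_tr)

# Accuracy uses the predicted labels.
acc = accuracy_score(yc_te, clf_demo.predict(Xc_te))

# ROC-AUC uses the predicted probability of the positive class.
proba_pos = clf_demo.predict_proba(Xc_te)[:, 1]
auc = roc_auc_score(yc_te, proba_pos)

print(f"Accuracy: {acc:.3f}")
print(f"ROC-AUC:  {auc:.3f}")
```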

In [None]:
# TODO explore feature importance
# Do these features make sense?
# Find out in the mordred documentation what the features are and think about why they are important for the model.
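
As a starting point for this step, the sketch below reads the impurity-based `feature_importances_` attribute of a fitted random forest and ranks the features. It uses synthetic data with placeholder feature names, so the ranking carries no chemical meaning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features come first.
X_fi, y_fi = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_fi.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_fi, y_fi)

# One importance value per feature; the values sum to 1.
importances = forest.feature_importances_

# Rank features from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With real descriptors, the feature names would come from the featurizer, and the ranking can then be compared against chemical intuition.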

In [None]:
# TODO train another model (maybe XGBoost) and explain using SHAP

# Tasks for today: 

- scikit learn
- classification
- regression: [ESOL dataset](https://doi.org/10.1039/C7SC02664A)
- XGBoost: [Merck kaggle challenge](https://www.kaggle.com/competitions/MerckActivity/data)
- SHAP values + feature importance
- Evaluation of ML models
- Cross-validation: hyperparameter tuning
- Train/valid/test split