<a href="https://colab.research.google.com/github/schwallergroup/ai4chem_course/blob/scikit_learn/notebooks/02%20-%20Supervised%20Learning/training_and_evaluating_ml_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. It is widely used in industry and academia, and a wealth of tutorials and code snippets are available online.
In this tutorial, we will use scikit-learn to train and evaluate machine learning models. You can also browse the scikit-learn [user guide](https://scikit-learn.org/stable/user_guide.html) and [tutorials](https://scikit-learn.org/stable/tutorial/index.html) for additional details.
### Essential Libraries and Tools 
Scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and interactive development, you should also install matplotlib, IPython, and the Jupyter Notebook.
- **NumPy** is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators. In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array.
- **SciPy** is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws from SciPy’s collection of functions for implementing its algorithms.
- **Matplotlib** is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
- **Pandas** is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame, which is similar to an Excel spreadsheet. It can ingest data from a great variety of file formats and databases, like SQL, Excel files, and comma-separated values (CSV) files.
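
As a minimal sketch of how these libraries fit together, the snippet below builds a small pandas DataFrame, converts it to a NumPy array, and passes it to a scikit-learn transformer. The column names and values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table; column names and values are illustrative only.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [0.5, 0.1, 0.9, 0.3],
})

# Convert the DataFrame to a 2D NumPy array of shape (n_samples, n_features).
X = df.to_numpy()

# StandardScaler rescales each column to zero mean and unit variance
# and returns a NumPy array.
X_scaled = StandardScaler().fit_transform(X)

print(type(X_scaled).__name__, X_scaled.shape)
```

In practice, scikit-learn estimators also accept DataFrames directly and convert them internally, so the explicit `to_numpy()` call is mainly there to make the data structure visible.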

We will first install the required libraries. We also need the `RDKit` library to process and analyze molecules, for example to calculate molecular descriptors.

In [None]:
!pip install numpy scipy matplotlib scikit-learn pandas rdkit

# Introduction to traditional ML

## Supervised learning

In supervised learning, we train a model to take inputs X and return an output y.

As you have seen in class, for this type of learning, we have two variants:

- Classification
- Regression

Linear regression is one example of supervised learning for regression.
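
To make this concrete, here is a minimal sketch of fitting a linear regression with scikit-learn. It uses synthetic data from `make_regression` as a stand-in for a real dataset; the sample counts, noise level, and split ratio are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: stands in for real inputs X and outputs y).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the model on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out data.
y_pred = model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = model.score(X_test, y_test)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

The same `fit` / `predict` / `score` pattern applies to most scikit-learn estimators, which makes it easy to swap in a different model later.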

# Week 2 tutorial - AI 4 Chemistry

## Index:

- Classification
.
.
.

### TODO

# Regression

### TODO: Improve this introduction based on the ESOL paper, why is solub. prediction important?

One recurring problem in both academic and industrial chemistry is predicting solubility. For instance, we might know that a molecule has good potential as a ligand for a relevant reaction, only to realize after synthesizing it that it is not soluble under our already optimized reaction conditions! 😥

It would be extremely useful to know the solubility of a molecule **before we even try to synthesize it**!

---

In this task, we will try to solve this problem using supervised learning. In particular, we will train a regression model using the very convenient [scikit-learn](https://scikit-learn.org/stable/) Python library to predict solubility from molecular descriptors.
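
As a rough sketch of what "molecular descriptors" can look like in practice, the snippet below uses RDKit to turn a few SMILES strings into a numeric feature matrix. The `featurize` helper, the choice of four descriptors, and the example molecules are assumptions made for illustration; they are not the feature set used later in this tutorial.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors


def featurize(smiles):
    """Hypothetical helper: compute a few RDKit descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return [
        Descriptors.MolWt(mol),              # molecular weight
        Descriptors.MolLogP(mol),            # estimated octanol-water logP
        Descriptors.NumHDonors(mol),         # number of hydrogen-bond donors
        Descriptors.NumRotatableBonds(mol),  # number of rotatable bonds
    ]


# Illustrative molecules: ethanol, benzene, acetic acid.
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

# Feature matrix of shape (n_molecules, n_descriptors), ready for scikit-learn.
X = np.array([featurize(s) for s in smiles_list])
print(X.shape)
```

Each row of `X` describes one molecule, so the matrix can be passed to a scikit-learn regressor together with the measured solubilities as targets.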

In [13]:
# TODO: Let's start with loading the data, visualizing some molecules and their solubility
# Let's also see some stats. e.g. size of dataset, distribution of solubility, etc.

# TODO: Generate features
# TODO do a train/test split

In [3]:
# TODO: Let's give an introduction to scikit-learn by doing a simple linear regression and see the results.
# TODO: Introduce scikit-learn, and use an RF model for this.
# Example pipeline: precompute the neighbor graph, then embed the data with Isomap.
import tempfile

from sklearn.datasets import make_regression
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

cache_path = tempfile.gettempdir()  # we use a temporary folder here
# Synthetic data: 50 samples with 25 features (the targets are not needed here).
X, _ = make_regression(n_samples=50, n_features=25, random_state=0)
# `memory` caches the fitted transformers on disk so they are not recomputed.
estimator = make_pipeline(
    KNeighborsTransformer(mode='distance'),
    Isomap(n_components=3, metric='precomputed'),
    memory=cache_path)
X_embedded = estimator.fit_transform(X)
X_embedded.shape  # (n_samples, n_components)

In [16]:
# TODO: Let's train another model (this can be an exercise)

# EXERCISE: Implement random forest and XGBoost models (XGBoost provides a scikit-learn-compatible API)

# As we see, there are many possible models we can use for this task. But which one is better?

In addition, each model has a set of hyperparameters that we need to tune ourselves. How do we select them?

This is an important part of machine learning! What we want to know is: What is the best combination of model + model hyperparameters for our task? 
As you've seen in the course, common strategies to evaluate and compare model performance include:

- Splitting the dataset into train/validation/test sets.
- Using cross-validation for hyperparameter tuning.
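
The two strategies above can be sketched as follows. This is a minimal example on synthetic data, assuming a 60/20/20 split and a small, arbitrary hyperparameter grid for a random forest; the actual models and grids used in this tutorial may differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption: stands in for the real features and targets).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Strategy 1: train/validation/test split.
# First hold out 20% as the test set, then split the rest 75/25,
# which gives 60% train, 20% validation, 20% test overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Strategy 2: cross-validation for hyperparameter tuning.
# Each hyperparameter combination is scored with 5-fold cross-validation
# on the train+validation data; the test set is never touched here.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_trainval, y_trainval)
print("Best hyperparameters:", search.best_params_)

# Only at the very end: evaluate the selected model once on the test set.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")
```

Note that `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `best_estimator_` can be used directly for the final test-set evaluation.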

In [7]:
# TODO: Split data in train/valid/test

# Retrain the models on the train set, and compare them using the validation set.

# What model is the best?

In [14]:
# TODO: Let's do cross-validation

# Optimize the hyperparameters for XGBoost, and again compare performance

In [15]:
# TODO: Finally, compare all models on the test set.
# Explain that test set should never be seen by models.
# This is all completely new data so we know how it would work in real life.

---

# Classification

We now turn our attention towards the other type of supervised learning: classification.

Many questions in chemistry can be framed as a classification task: 

- Will this molecule act as a nucleophile or electrophile in my reaction?
- What is the smell of this substance? (fruity, citrus, sweet, ...)

<div>
<img src="img/is_this_a_meme.png" width="500"/>
</div>


--- 

### TODO: Let's get a dataset for classification in molecules. Let's say it's prediction of toxicity.

What we want to know is whether the molecule shown is toxic or not.

##### TODO insert image of molec

Let's see if a model can tell what molecules are toxic!

This would be very useful, for instance, in drug discovery, where we want to know if a molecule has potential as a drug **even before we synthesize it**.

To do this, we will use [mordred](http://mordred-descriptor.github.io/documentation/master/descriptors.html) to generate some molecular descriptors, and will again train some models with scikit-learn on these features.
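
As a rough preview of the workflow, the sketch below trains a random forest classifier and evaluates it with accuracy and ROC AUC. It uses synthetic data from `make_classification` in place of real descriptors and toxicity labels, so the numbers it prints say nothing about actual molecules.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: each column plays the role
# of a molecular descriptor, and y plays the role of a toxic / non-toxic label).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Accuracy uses hard class predictions; ROC AUC uses predicted probabilities
# for the positive class.
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
print(f"Accuracy: {acc:.3f}, ROC AUC: {auc:.3f}")

# Impurity-based feature importances: one value per feature, summing to 1.
importances = clf.feature_importances_
print(importances.round(3))
```

With real descriptors, the entries of `feature_importances_` can be matched back to descriptor names to check whether the features the model relies on make chemical sense.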


In [2]:
# TODO load dataset
# TODO small visualization of dataset, let's see some molecules and the property we want to predict.

In [None]:
# TODO train a Random Forest classification model

In [None]:
# TODO evaluate model using metrics for this (AUC-ROC, accuracy, etc)

In [None]:
# TODO explore feature importance
# Do these features make sense?
# Find out in the mordred documentation what the features are and think about why they are important for the model.

In [11]:
# TODO train another model (maybe XGBoost) and explain using SHAP

# Tasks for today: 

- scikit learn
- classification
- regression: ESOL dataset
- XGBoost: [Merck kaggle challenge](https://www.kaggle.com/competitions/MerckActivity/data)
- SHAP values + feature importance
- Evaluation of ML models
- Cross-validation: hyperparameter tuning
- Train/valid/test split