In [1]:
# apply Jupyter notebook style
from IPython.core.display import HTML

from custom.styles import style_string

HTML(style_string)

# Fitting the ESOL Dataset

<div class="overview admonition"> 
<p class="admonition-title">Overview</p>

Questions:

* How can I use scikit-learn to fit the ESOL dataset?

Objectives:

* Fit solubility data with the ESOL model using scikit-learn.

</div>

In this notebook, we will combine many of the skills we have already learned to recreate the model originally fit in the 2004 ESOL paper.

In the original ESOL paper, solubility is calculated as a linear relationship with multiple molecular descriptors:

1. **cLogP** - "calculated LogP". [LogP is the log of the partition coefficient of a solute between octanol and water, at near infinite dilution. LogP is widely used in drug discovery and development as an indicator of potential utility of a solute as a drug.](https://www.sciencedirect.com/topics/chemistry/logp#:~:text=LogP%20is%20the%20log%20of,a%20solute%20as)
2. **Molecular Weight** - The sum of the atomic weights of all the atoms in a molecule.
3. **Rotatable Bonds** - The number of bonds that are not part of a ring and can rotate freely, an indicator of molecular flexibility.
4. **Aromatic Proportion** - The proportion of aromatic atoms to the total number of heavy (non-hydrogen) atoms in the molecule.
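
Written out, a linear model over these four descriptors has the general form below. The symbols $w_0, \dots, w_4$ are placeholder names for the coefficients that the fit will determine, not values taken from the paper. $\log S$ stands for the log solubility, MW for molecular weight, RB for the number of rotatable bonds, and AP for the aromatic proportion.

$$
\log S = w_0 + w_1\,\mathrm{cLogP} + w_2\,\mathrm{MW} + w_3\,\mathrm{RB} + w_4\,\mathrm{AP}
$$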

In this notebook, we will go through the data science pipeline in order to create a model of our data using scikit-learn.

## Step 1 - Data Preparation

The first step is preparing our data for fitting.
Sometimes this involves data cleaning,
the process of handling problems such as missing values.
For this data set, we don't need to do any data cleaning. However, we do need to add some additional data in order
to replicate the original ESOL model fit.
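
As a rough illustration of what a missing-data check can look like, the sketch below counts missing values per column and drops incomplete rows. The small table and its column names are made up for the example; they are not the ESOL data.

```python
import pandas as pd

# Hypothetical toy table; the real notebook reads data/delaney-processed.csv.
toy = pd.DataFrame(
    {
        "smiles": ["CCO", "c1ccccc1", None],
        "solubility": [1.1, -1.6, 0.5],
    }
)

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple cleaning choice: drop any row with a missing value.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Running the same `isna().sum()` check on the ESOL dataframe is a quick way to confirm that no cleaning is needed.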

In [None]:
# First we import the libraries we want

import pandas as pd

from rdkit import Chem
from rdkit.Chem import Descriptors

In [None]:
# next we use pandas to read the file
df = pd.read_csv("data/delaney-processed.csv")

We would like to add some molecular descriptors to our data frame, so we will use RDKit to load the SMILES 
as RDKit molecule objects.

In [None]:
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
df.head(2) # view the first two rows

<div class="exercise admonition">
<p class="admonition-title">Adding Descriptors</p>

Use RDKit to add the descriptor cLogP to the dataframe.

You can find this descriptor as `Descriptors.MolLogP`.

</div>

In [None]:
df["cLogP"] = df["mol"].apply(Descriptors.MolLogP)
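
The ESOL model also needs the molecular weight and the number of rotatable bonds. If your copy of the dataset does not already provide these as columns, RDKit can compute them in the same way as cLogP. The sketch below uses a made-up two-molecule table, and the column names `mol_wt` and `rotatable_bonds` are illustrative choices rather than names from the dataset.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical two-molecule table standing in for the ESOL dataframe.
toy = pd.DataFrame({"smiles": ["CCO", "CCCC"]})
toy["mol"] = toy["smiles"].apply(Chem.MolFromSmiles)

# Column names here are illustrative; match them to your own dataframe.
toy["mol_wt"] = toy["mol"].apply(Descriptors.MolWt)
toy["rotatable_bonds"] = toy["mol"].apply(Descriptors.NumRotatableBonds)

print(toy[["smiles", "mol_wt", "rotatable_bonds"]])
```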

Getting the aromatic proportion is more complicated.
We can define a SMARTS string that corresponds to an aromatic atom as `[a]`.

In [None]:
# use Chem.MolFromSmarts to make an aromatic pattern.
# "[a]" is the SMARTS string for any aromatic atom
aromatic_pattern = Chem.MolFromSmarts("[a]")

The following cell adds a column with aromatic proportion data for each molecule to the dataframe.
This line of code uses a lambda function, a syntax for short unnamed functions that has not been presented before.

You can take some time to try to understand this cell, or you can move on to the model fitting.

In [None]:
df["aromatic proportion"] = df["mol"].apply(lambda x: len(x.GetSubstructMatches(aromatic_pattern)) / Descriptors.HeavyAtomCount(x))
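
If the lambda is hard to read, it may help to see the same calculation written as an ordinary named function. The sketch below is one way to do that. The function name `aromatic_proportion` and the two test molecules are chosen for illustration only.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

aromatic_pattern = Chem.MolFromSmarts("[a]")


def aromatic_proportion(mol):
    """Return the fraction of heavy atoms in mol that are aromatic."""
    # Each match is a tuple holding the index of one aromatic atom.
    n_aromatic = len(mol.GetSubstructMatches(aromatic_pattern))
    return n_aromatic / Descriptors.HeavyAtomCount(mol)


toluene = Chem.MolFromSmiles("Cc1ccccc1")  # 6 aromatic atoms out of 7 heavy atoms
ethanol = Chem.MolFromSmiles("CCO")        # no aromatic atoms

toluene_ap = aromatic_proportion(toluene)
ethanol_ap = aromatic_proportion(ethanol)
print(toluene_ap, ethanol_ap)
```

With this function defined, `df["mol"].apply(aromatic_proportion)` would do the same job as the lambda version.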

## Step 2 - Data Inspection

Use the following cells and the pandas/seaborn libraries to investigate which descriptors are correlated with each other and with solubility.
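
One possible starting point is a correlation matrix. The sketch below uses made-up numbers and illustrative column names, so the values it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical descriptor values; column names are illustrative only.
toy = pd.DataFrame(
    {
        "cLogP": [0.5, 1.2, 2.3, 3.1, 4.0],
        "mol_wt": [46.0, 78.0, 120.0, 150.0, 210.0],
        "solubility": [1.0, -0.2, -1.9, -2.8, -4.1],
    }
)

# Pairwise Pearson correlation between the numeric columns.
# numeric_only=True skips text or object columns such as "smiles" or "mol".
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

To view the matrix as a plot, you could pass it to seaborn, for example `sns.heatmap(corr, annot=True)` after `import seaborn as sns`.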

## Step 3 - Data Fitting

### Step 3.1 - Train Test Split

Use the cells below to perform a train test split using scikit-learn.
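
As a reminder of the mechanics, here is a minimal sketch of a train test split on stand-in arrays. The 80/20 split and the `random_state` value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

For the ESOL data, `X` would be the dataframe columns holding your chosen descriptors and `y` the column of measured solubility.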

### Step 3.2 - Create and Fit a Linear Model to the Training Data
In a data science pipeline, this step would also involve model selection. However, for our purposes, we will fit a multiple linear regression model.
You can use the four descriptors listed at the start of this notebook, or you may decide on others you think are important from your inspection.
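
The sketch below shows the fitting step on synthetic data generated from a known linear rule, so the recovered coefficients can be checked by eye. The data and the coefficients 1, 2, and -3 are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from a known linear rule, so we can check the fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

# The intercept and one coefficient per feature column.
print(model.intercept_, model.coef_)
```

In the notebook itself you would call `fit` on the training split only, keeping the testing data aside for evaluation.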

## Step 4 - Model Evaluation

Use your testing data to evaluate your model. Calculate the $R^2$ score for the testing data, compare it with the score for the training data, and visualize the predicted vs. actual values.
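
A minimal sketch of the scoring step is shown below, again on synthetic data rather than the ESOL descriptors. It fits on the training split and reports $R^2$ for both splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic noisy linear data standing in for descriptors and solubility.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Compare how well the model explains data it has and has not seen.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

For the visualization, a scatter plot of `y_test` against `model.predict(X_test)` with a diagonal reference line is a common choice. Points close to the diagonal indicate good predictions.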

## Step 5 - Iteration
Based on the evaluation results, iterate as needed. You might need to adjust features, consider model complexity, or revisit data preparation.
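
One simple way to iterate is to refit the model with different feature subsets and compare the scores. The sketch below does this on synthetic data in which only the first two columns matter. The feature-set names are hypothetical labels for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first two columns influence y; the third is irrelevant.
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical candidate feature sets, given as column indices.
feature_sets = {"first only": [0], "first two": [0, 1], "all three": [0, 1, 2]}

scores = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(X_train[:, cols], y_train)
    scores[name] = model.score(X_test[:, cols], y_test)  # R^2 on the test set
    print(f"{name}: test R^2 = {scores[name]:.3f}")
```

Keep in mind that choosing features by repeatedly checking the test score can tune the model to that particular test set. Cross-validation on the training data is a common alternative.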