## Building a Machine Learning Model to Predict cLogD
In this notebook we will use the descriptors calculated by the **calc_descriptors.py** script to build a model that predicts cLogD. First, we'll import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor, Booster
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

Enable tqdm's progress_apply for Pandas.

In [2]:
tqdm.pandas()

### 1. Read the Descriptors
Read the data generated by **calc_descriptors.py**. 

In [3]:
%%time
df = pd.read_pickle("logd_descriptors.pkl")

CPU times: user 4.86 s, sys: 3.44 s, total: 8.3 s
Wall time: 9.08 s


Let's see how much data we read. 

In [4]:
df.shape

(2084724, 4)

As a sanity check, let's look at the first few rows.

In [5]:
df.head()

| | smiles | name | logd | desc |
|---|---|---|---|---|
| 0 | Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccccc1Cl | 1 | 2.69 | [2.151684657491086, -2.0923071597340894, 2.216... |
| 1 | Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(C#N)cc1 | 2 | 1.82 | [2.1309007703070395, -2.0857480375944135, 2.17... |
| 2 | Cc1cc(-n2ncc(=O)[nH]c2=O)cc(C)c1C(O)c1ccc(Cl)cc1 | 3 | 2.64 | [2.1708426473880804, -2.1841553872256485, 2.29... |
| 3 | Cc1ccc(C(=O)c2ccc(-n3ncc(=O)[nH]c3=O)cc2)cc1 | 4 | 1.97 | [2.093212484023919, -2.0505383418976266, 2.126... |
| 4 | Cc1cc(-n2ncc(=O)[nH]c2=O)ccc1C(=O)c1ccc(Cl)cc1 | 5 | 2.57 | [2.129504518042437, -2.0858337096441177, 2.182... |


Drop any data with nulls. 

In [6]:
df.dropna(inplace=True)

Let's see how much data remains. In practice, we'd try to figure out why some molecules didn't generate descriptors. In this case, we still have more than 2 million records, so we're fine.

In [7]:
df.shape

(2031926, 4)
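To follow up on why some molecules failed, the null rows could be inspected before they are dropped. The sketch below uses a small made-up table (the `toy` frame and its values are hypothetical; only the column names mirror the notebook) to count nulls per column and pull out the affected rows.

```python
import pandas as pd

# Hypothetical stand-in for the real table; column names mirror the notebook.
toy = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "not_a_smiles"],
    "name": [1, 2, 3],
    "logd": [-0.3, 2.1, 1.0],
    "desc": [[0.1, 0.2], [0.3, 0.4], None],  # None marks a failed descriptor calculation
})

# Count nulls per column to see which field is responsible.
null_counts = toy.isna().sum()

# Keep the rows that contain at least one null so they can be examined.
failed = toy[toy.isna().any(axis=1)]

print(null_counts)
print(failed[["smiles", "name"]])
```

Run against the real DataFrame before `dropna`, the same two lines would show whether the nulls come from the descriptors or from another column.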

The descriptors are stored as lists and need to be converted to numpy arrays before model building; this happens after the train/test split below. In retrospect, I may have been able to do this when I generated the descriptors. Oh well, not a big deal.

### 2. Build the ML Model
Split the data into training and test sets. 

In [8]:
%%time
train, test = train_test_split(df)

CPU times: user 969 ms, sys: 13.9 ms, total: 983 ms
Wall time: 983 ms
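Called without arguments, `train_test_split` holds out 25% of the rows and shuffles with a different seed on each run, so metrics can shift slightly between runs. A minimal sketch of a reproducible split, assuming a seed of 42 (an arbitrary choice, not taken from this notebook) and a made-up 100-row frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for the descriptor table.
toy = pd.DataFrame({"logd": range(100)})

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))           # 75 25
print(test_a.index.equals(test_b.index))   # True
```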


Use [np.stack](https://numpy.org/doc/stable/reference/generated/numpy.stack.html) to convert the descriptors to an appropriate format for ML model building. 

In [9]:
%%time
X_train = np.stack(train.desc.values)
y_train = train.logd.values
X_test = np.stack(test.desc.values)
y_test = test.logd.values

CPU times: user 5.76 s, sys: 6.53 s, total: 12.3 s
Wall time: 15.2 s
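To see what `np.stack` does here, the sketch below applies it to a made-up two-row column of descriptor lists (the values are hypothetical). The column starts as a 1-D object array of Python lists and ends up as a 2-D float array with one row per molecule.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptor column: each cell holds a list of floats.
toy = pd.DataFrame({"desc": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})

values = toy.desc.values   # 1-D object array; each element is a Python list
X = np.stack(values)       # 2-D float array of shape (n_rows, n_descriptors)

print(values.dtype, values.shape)
print(X.dtype, X.shape)
```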


Build an ML model.  Wow, LightGBM is fast! 

In [10]:
%%time
lgbm = LGBMRegressor()
lgbm.fit(X_train,y_train)

CPU times: user 5min 3s, sys: 7.02 s, total: 5min 10s
Wall time: 23 s


LGBMRegressor()

Predict on the test set.

In [76]:
%%time
pred = lgbm.predict(X_test)

CPU times: user 7.96 s, sys: 1.24 s, total: 9.19 s
Wall time: 686 ms


### 3. Test the ML Model
Calculate $R^2$

In [77]:
r2_score(y_test,pred)

0.8955569033682811
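For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on made-up values (`y_true` and `y_hat` are hypothetical) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```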

Calculate RMSE

In [78]:
# RMSE as sqrt(MSE); the squared=False argument is not accepted by newer scikit-learn releases
np.sqrt(mean_squared_error(y_test, pred))

0.8168554120634113
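RMSE is the square root of the mean squared error, so it is in the same units as logD. A minimal sketch on made-up values (`y_true` and `y_hat` are hypothetical) that computes it both by hand and by taking `np.sqrt` of `mean_squared_error`, which does not depend on any optional keyword argument:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# By hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```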

### 4. Save the ML Model
Save the model to disk.

In [79]:
lgbm.booster_.save_model("model.txt")

<lightgbm.basic.Booster at 0x44bc7d940>

Read the model from disk.

In [80]:
mdl = Booster(model_file='model.txt')

Predict with the saved model and calculate RMSE.

In [90]:
pred = mdl.predict(X_test)

In [91]:
# RMSE as sqrt(MSE); the squared=False argument is not accepted by newer scikit-learn releases
np.sqrt(mean_squared_error(y_test, pred))

0.8090787677074638

Save the model again, this time under the filename **logd.mdl**.

In [83]:
lgbm.booster_.save_model("logd.mdl")

<lightgbm.basic.Booster at 0x44bc7d940>