In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, make_scorer

# Overview: Building Models from Data

## Download dataset

In [None]:
os.system("wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv")

ESOL (Delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models for estimating solubility directly from molecular structures (as encoded in SMILES strings).

Reference:

Delaney, John S. "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure." Journal of Chemical Information and Computer Sciences 44.3 (2004): 1000-1005.

In [None]:
with open('delaney-processed.csv') as f:
  data = f.readlines()

print("".join(data[:10]))

## Working with DataFrames

Rather than parsing the data manually, we can use a dedicated function to load the .csv file into a pandas DataFrame.

In [None]:
df = pd.read_csv("delaney-processed.csv")
df

## Constructing the feature/label arrays

In order to build a data-driven model, we must select the quantity we'd like to predict (labels, `y`) and the information about each example that we give to the model (features, `X`).

In [None]:
features = df[df.keys()[2:8]]
features

In [None]:
label = df[df.keys()[8]]
label

In [None]:
X = features.values
y = label.values
print(X.shape, y.shape)
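
Selecting columns by position works, but it silently breaks if the column order of the file changes. A minimal sketch of selecting features and labels by name instead is shown below. The table and its column names (`feat_1`, `feat_2`, `target`) are hypothetical placeholders, not the columns of the ESOL file.

```python
import pandas as pd

# Hypothetical table; the column names are placeholders, not those of the ESOL file
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "feat_1": [1.0, 2.0, 3.0],
    "feat_2": [0.5, 0.1, 0.9],
    "target": [-1.2, -0.3, -2.5],
})

feature_cols = ["feat_1", "feat_2"]
X_toy = toy[feature_cols].to_numpy()  # 2D array of shape (n_samples, n_features)
y_toy = toy["target"].to_numpy()      # 1D array of shape (n_samples,)
print(X_toy.shape, y_toy.shape)
```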

## Data visualization

Let's take a look at the distribution of solubility values we'd like to predict. This is a good first step to take when building a model because it allows you to visually identify data anomalies or outliers.

In [None]:
plt.hist(y, 50)
plt.xlabel("Log Solubility")
plt.ylabel("Counts")
plt.show()
plt.close()

## Data splitting

In order to test the performance of our model on unseen data, we split the dataset into two parts: training set (80%) and test set (20%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
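
Because `train_test_split` shuffles the data randomly, the split above changes every time the cell is run. The sketch below, on small synthetic arrays (`X_demo` and `y_demo` are illustrative names), shows how passing a fixed `random_state` makes the partition reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic arrays used only for illustration
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# The same seed yields identical partitions
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```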

## Model: Linear regression

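For this example, we choose the simplest data-driven model: linear regression. In this model, we determine a set of weights $w$ such that the prediction $Xw$ is as close as possible to the labels $y$, i.e. such that the residual $y - Xw$ is minimized.

$$ \min_w \| y - Xw \|_2^2 $$

We minimize the squared norm of the residual using the training set. This amounts to performing a linear least-squares minimization to determine $w$. Note that `LinearRegression` also fits a constant intercept term by default, in addition to $w$.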

In [None]:
reg = LinearRegression().fit(X_train, y_train)
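
For intuition, the least-squares problem without an intercept can also be solved through the normal equations $X^T X w = X^T y$. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and `w_true` are made up for illustration). It is not how `LinearRegression` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known weights plus a little noise
X_demo = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=50)

# Normal equations: (X^T X) w = X^T y
w_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Solving the normal equations is fine for small, well-conditioned problems. For ill-conditioned feature matrices, a dedicated least-squares solver is generally the safer choice.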

## Model evaluation: Coefficient of determination ($R^2$ score)

Now that we've trained our model, we'd like to evaluate its performance on the training set as well as the test set. There are many metrics available to evaluate the performance of your model. In this example, we choose the coefficient of determination, or $R^2$ score. The $R^2$ score is calculated as:

$$ R^2 = 1 - \frac{\sum_i (y_{true,i} - y_{pred,i})^2}{\sum_i (y_{true,i} - \bar{y})^2} $$

where $\bar{y}$ is the mean of the true values. A perfect model has $R^2 = 1$.

In [None]:
train_pred = reg.predict(X_train)
test_pred = reg.predict(X_test)

train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)
print("Train R2: {:.4f}, Test R2: {:.4f}".format(train_r2, test_r2))
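
To connect the formula to the library call, the sketch below computes $R^2$ by hand for a small made-up set of values (`y_true_demo` and `y_pred_demo` are illustrative) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values for illustration
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares (numerator) and total sum of squares (denominator)
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))
```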

## Model evaluation: Plotting

Let's construct a parity plot to display our model predictions and their true values. A parity plot is constructed by scattering the experimental (true) values along the x-axis and the predicted values on the y-axis. We can construct a parity plot for the train and test sets.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 7.5))
ax[0].scatter(y_train, train_pred)
ax[0].plot(np.linspace(-10, 2), np.linspace(-10, 2), 'r')
ax[0].set_xlim((-10, 2))
ax[0].set_ylim((-10, 2))
ax[0].set_aspect('equal', adjustable='box')
ax[0].set_xlabel("Log Solubility")
ax[0].set_ylabel("Predicted Log Solubility")
ax[0].set_title("Train R2: {}".format(r2_score(y_train, train_pred)))
ax[1].scatter(y_test, test_pred)
ax[1].plot(np.linspace(-10, 2), np.linspace(-10, 2), 'r')
ax[1].set_xlim((-10, 2))
ax[1].set_ylim((-10, 2))
ax[1].set_aspect('equal', adjustable='box')
ax[1].set_xlabel("Log Solubility")
ax[1].set_ylabel("Predicted Log Solubility")
ax[1].set_title("Test R2: {}".format(r2_score(y_test, test_pred)))
plt.show()
plt.close()

## Interpreting our model

Linear regression models have a much simpler functional form than the machine learning models we will use later on in this course. This simple functional form, however, allows us to easily interpret the model by looking at the coefficients $w$. In this case, the sign of each parameter $w_i$ tells us how the model uses feature $i$, and its magnitude tells us how strongly the prediction depends on that feature. Keep in mind that magnitudes are only directly comparable between features measured on similar scales.

In [None]:
reg.coef_

In [None]:
list(zip(df.keys()[2:8], reg.coef_))
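
Coefficient magnitudes depend on the units of each feature, so raw coefficients can be misleading when features span very different ranges. One common remedy, sketched below on synthetic data, is to standardize the features before fitting. The use of `StandardScaler` here is an illustrative choice and is not part of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
x_small = rng.normal(scale=1.0, size=200)
x_large = rng.normal(scale=100.0, size=200)
X_demo = np.column_stack([x_small, x_large])

# Both features contribute comparably once their scale is accounted for
y_demo = 1.0 * x_small + 0.01 * x_large

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print("raw coefficients:   ", raw.coef_)
print("scaled coefficients:", scaled.coef_)
```

The raw coefficients differ by two orders of magnitude, while the coefficients on standardized features are of similar size, which better reflects the comparable influence of the two features.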

## Linear regression the hard (?) way

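We've already learned how to solve an over-determined linear system of equations using linear least-squares via the `np.linalg.lstsq` function. Let's compare our results with the `sklearn` model we just constructed. Note that `LinearRegression` fits an intercept term by default, whereas `np.linalg.lstsq` applied directly to `X_train` does not, so the two sets of coefficients are not expected to match exactly.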

In [None]:
x, _, _, _ = np.linalg.lstsq(X_train, y_train, rcond=-1)
list(zip(df.keys()[2:8], x))
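
A plain least-squares fit has no intercept, while `LinearRegression` fits one by default. A minimal sketch of reconciling the two is to append a column of ones to the feature matrix, so that the last weight plays the role of the intercept. The data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data with a non-zero offset of 3.0
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)

# Append a column of ones so the last weight acts as the intercept
X_aug = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
w_aug, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

reg_demo = LinearRegression().fit(X_demo, y_demo)

print(np.allclose(w_aug[:-1], reg_demo.coef_))
print(np.isclose(w_aug[-1], reg_demo.intercept_))
```

Alternatively, constructing the model as `LinearRegression(fit_intercept=False)` should reproduce the intercept-free `lstsq` solution.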

In [None]:
train_pred = X_train @ x
test_pred = X_test @ x
fig, ax = plt.subplots(1, 2, figsize=(15, 7.5))
ax[0].scatter(y_train, train_pred)
ax[0].plot(np.linspace(-10, 2), np.linspace(-10, 2), 'r')
ax[0].set_xlim((-10, 2))
ax[0].set_ylim((-10, 2))
ax[0].set_aspect('equal', adjustable='box')
ax[0].set_xlabel("Log Solubility")
ax[0].set_ylabel("Predicted Log Solubility")
ax[0].set_title("Train R2: {}".format(r2_score(y_train, train_pred)))
ax[1].scatter(y_test, test_pred)
ax[1].plot(np.linspace(-10, 2), np.linspace(-10, 2), 'r')
ax[1].set_xlim((-10, 2))
ax[1].set_ylim((-10, 2))
ax[1].set_aspect('equal', adjustable='box')
ax[1].set_xlabel("Log Solubility")
ax[1].set_ylabel("Predicted Log Solubility")
ax[1].set_title("Test R2: {}".format(r2_score(y_test, test_pred)))
plt.show()
plt.close()

## Model evaluation: Cross-validation

Cross-validation is another technique for evaluating the performance of a model. The dataset is split into $k$ partitions, or folds, of approximately equal size. The model is then trained on $k-1$ folds and tested on the remaining fold. This procedure is repeated $k$ times so that each fold serves as the test set exactly once. Model performance can vary from run to run depending on how the original dataset is randomly partitioned into training and test sets. By evaluating the model multiple times with different training and test sets, we can estimate the error in model evaluation that arises from differences in dataset splitting.

In [None]:
scoring = {'r2_score': make_scorer(r2_score)}
cv = cross_validate(LinearRegression(), X, y, scoring=scoring, return_train_score=True)
cv

In [None]:
for score in ['train_r2_score', 'test_r2_score']:
  scores = cv[score]
  mean_scores = np.round(np.mean(scores), 4)
  stderr_scores = np.round(stats.sem(scores), 4)
  print("{}: {} +/- {}".format(score, mean_scores, stderr_scores))
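
To make the procedure concrete, the sketch below writes out the $k$-fold loop by hand on synthetic data and checks that it gives the same test scores as `cross_validate`. The choice of `KFold(n_splits=5)` without shuffling is an assumption made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic regression data for illustration
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

kfold = KFold(n_splits=5)  # five contiguous folds, no shuffling
manual_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on k-1 folds, then score on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(r2_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

# The same splitter passed to cross_validate should reproduce the scores
cv_demo = cross_validate(LinearRegression(), X_demo, y_demo, cv=kfold, scoring="r2")
print(np.allclose(manual_scores, cv_demo["test_score"]))
```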