In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Linear Models

## Dataset: BACE-1

Beta-secretase 1 (BACE-1) is a human transmembrane aspartic protease encoded by the BACE1 gene. BACE-1 is essential for generating beta-amyloid peptide in neural tissue. Beta-amyloid is a component of the amyloid plaques widely believed to be critical in the development of Alzheimer's disease, which makes BACE-1 an attractive therapeutic target.

In [None]:
os.system("wget https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/desc_canvas_aug30.csv")

This dataset contains a set of molecular structures `mol`, the activity label `pIC50` (the negative base-10 logarithm of the half-maximal inhibitory concentration, IC50), and 590 molecular topological features. These features can be calculated using common chemistry Python packages like `openbabel` or `rdkit`.

In [None]:
df = pd.read_csv("desc_canvas_aug30.csv")
df

This dataset was previously used in a drug design competition sponsored by Novartis. Here, we use the original train/test split from the contest.

In [None]:
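# Split rows using the contest's original train/test assignment.
train_df = df[df['Model'] == "Train"]
test_df = df[df['Model'] == "Test"]

# Regression target.
label = 'pIC50'
y_train = train_df[label].values
y_test = test_df[label].values

# Candidate feature columns: skip the first five columns and the last one.
features = list(train_df.keys()[5:-1])
# Keep only features with no NaN values in the training set.
features = [f for f in features if not np.isnan(np.sum(train_df[f].values))]
X_train = train_df[features].values
X_test = test_df[features].values

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)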

## Linear Regression

In linear regression, we fit a model of the form:

$$ y(\textbf{x}, \textbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\textbf{x}) $$

where $\phi_j(\textbf{x})$ are known as basis functions and M is the total number of parameters in this model.

The parameter $w_0$ allows for any fixed offset in the data and is sometimes called the bias parameter. We can introduce a dummy basis function $\phi_0(\textbf{x}) = 1$ so that

$$ y(\textbf{x}, \textbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\textbf{x}) = \textbf{w}^T \boldsymbol{\phi}(\textbf{x}) $$
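To make the vector form concrete, here is a minimal sketch that stacks the basis functions into a design matrix $\Phi$ with $\Phi_{nj} = \phi_j(x_n)$, so that all predictions are given by $\Phi \textbf{w}$. The polynomial basis $\phi_j(x) = x^j$, the helper name `design_matrix`, and the toy values are assumptions chosen for illustration. In this notebook the descriptor columns are passed to the model directly, which corresponds to using each feature as its own basis function.

```python
import numpy as np

def design_matrix(x, degree):
    """Return Phi with Phi[n, j] = x_n ** j (hypothetical helper).

    Column 0 is all ones, i.e. the dummy basis function phi_0(x) = 1.
    """
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])     # three scalar inputs
w = np.array([1.0, 2.0, 3.0])     # w_0 (bias), w_1, w_2

Phi = design_matrix(x, degree=2)  # shape (N, M) = (3, 3)
y = Phi @ w                       # y_n = w^T phi(x_n) for every n

print(Phi)
print(y)
```

Because the bias is absorbed into the first column of `Phi`, no separate intercept term is needed in the matrix product.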

The objective function

$$ E_D(\textbf{w}) = \frac{1}{2} \sum_{n=1}^N [t_n - \textbf{w}^T \boldsymbol{\phi}(\textbf{x}_n)]^2$$

is minimized over the N labeled examples $(\textbf{x}_n, t_n)$, where $t_n$ is the target value for input $\textbf{x}_n$. This is the Ordinary Least Squares (OLS) problem.
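As a rough illustration of what the least-squares fit computes, the sketch below minimizes $E_D(\textbf{w})$ on synthetic data with `np.linalg.lstsq`. The data-generating line, the noise level, and the helper `half_sse` are assumptions made up for this example; they are not part of the BACE dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: t = 0.5 + 2.0 * x + noise (assumed for illustration).
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
t = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=N)

# Design matrix with the dummy basis phi_0 = 1 in the first column.
Phi = np.column_stack([np.ones(N), x])

# Least-squares solution of Phi w ~= t.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def half_sse(weights):
    """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2 (hypothetical helper)."""
    residual = t - Phi @ weights
    return 0.5 * np.sum(residual ** 2)

# At the minimum the gradient -Phi^T (t - Phi w) is numerically zero.
grad = -Phi.T @ (t - Phi @ w)

print(w)
print(half_sse(w))
print(grad)
```

The recovered weights should land close to the intercept and slope used to generate the data, and any perturbation of `w` should increase `half_sse`.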

In [None]:
reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))
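For scikit-learn regressors, `score` returns the coefficient of determination $R^2$ rather than an error, so higher is better and 1.0 is a perfect fit. The sketch below computes $R^2$ by hand on toy values to show how to read the two numbers printed above; the function name `r2_score_manual` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                        # exact predictions
mean_only = r2_score_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r2_score_manual(y_true, y_true[::-1])                    # worse than the mean

print(perfect, mean_only, worse)
```

A negative value is possible: it means the model predicts worse than a constant equal to the mean of the targets. A large gap between the training score and the test score is a common sign of overfitting.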

## Lasso (L1-regularized) Regression

In Lasso regression, we modify the objective function to include an L1-norm penalty on the weights of the model. Adding information or constraints of this kind to prevent overfitting is known as regularization. The penalty discourages weights from taking extreme values and, in the L1 case, drives many of them to exactly zero, so model complexity is limited by enforcing sparsity in the parameter set. Regularization is an important concept throughout machine learning.

The new objective function becomes

$$ E_D(\textbf{w}) = \frac{1}{2} \sum_{n=1}^N [t_n - \textbf{w}^T \boldsymbol{\phi}(\textbf{x}_n)]^2 + \frac{\alpha}{2} \sum_{j=0}^{M-1} |w_j|$$

In [None]:
reg = Lasso(alpha=0.1)
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))
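The sparsity effect of the L1 penalty can be seen on a small synthetic problem. This is a sketch under assumed settings: the data, the three "true" nonzero weights, and the grid of `alpha` values are all made up for illustration. Note also that scikit-learn's `Lasso` scales the squared-error term by $1/(2N)$ and does not penalize the intercept, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic regression problem: only the first 3 of 20 features matter.
n_samples, n_features = 100, 20
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Count the weights that remain nonzero as the penalty grows.
nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 5.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(alpha, nonzero_counts[alpha])
```

With a very small `alpha` most weights stay nonzero, while a sufficiently large `alpha` sets every weight to zero and the model predicts only the intercept.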

## Exercise: Tune alpha

Try adjusting alpha to maximize the score on the test set.

## Ridge (L2-regularized) Regression

In Ridge regression, we modify the objective function to include a squared L2-norm penalty on the weights of the model. L2-regularization typically results in a less sparse set of weights than L1-regularization: weights are shrunk toward zero but rarely become exactly zero.

The new objective function becomes

$$ E_D(\textbf{w}) = \frac{1}{2} \sum_{n=1}^N [t_n - \textbf{w}^T \boldsymbol{\phi}(\textbf{x}_n)]^2 + \frac{\alpha}{2} \sum_{j=0}^{M-1} w_j^2$$

In [None]:
reg = Ridge(alpha=0.1)
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))
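Unlike the Lasso objective, the Ridge objective above has a closed-form minimizer: setting its gradient to zero gives $(\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \alpha \textbf{I})\,\textbf{w} = \boldsymbol{\Phi}^T \textbf{t}$. The following sketch solves that system with NumPy on synthetic data. The helper name `ridge_closed_form`, the data, and the `alpha` values are assumptions for illustration. It penalizes every weight, as in the formula above, whereas scikit-learn's `Ridge` by default fits an unpenalized intercept.

```python
import numpy as np

def ridge_closed_form(Phi, t, alpha):
    """Solve (Phi^T Phi + alpha * I) w = Phi^T t (hypothetical helper)."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(M), Phi.T @ t)

rng = np.random.default_rng(0)

# Synthetic data: 5 features, the last two have zero true weight.
Phi = rng.normal(size=(100, 5))
t = Phi @ np.array([3.0, -2.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

norms = []
for alpha in [0.0, 10.0, 1000.0]:
    w = ridge_closed_form(Phi, t, alpha)
    norms.append(np.linalg.norm(w))
    # Weights shrink as alpha grows, but none of them becomes exactly zero.
    print(alpha, np.round(w, 3), np.count_nonzero(w))

# Check the stationarity condition of the Ridge objective for alpha = 10.
w10 = ridge_closed_form(Phi, t, 10.0)
grad10 = -Phi.T @ (t - Phi @ w10) + 10.0 * w10
```

The norm of the weight vector decreases as `alpha` increases, which is the shrinkage behavior that distinguishes Ridge from the exact zeros produced by Lasso.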

## Exercise: Tune alpha

Try adjusting alpha to maximize the score on the test set.