In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

# Linear Models for Classification

## Dataset: BACE-1

Beta-secretase 1 (BACE-1) is a human transmembrane aspartic protease encoded by the BACE1 gene. BACE-1 is essential for generating beta-amyloid peptide in neural tissue. Beta-amyloid is a component of the amyloid plaques widely believed to be critical in the development of Alzheimer's disease, which makes BACE-1 an attractive therapeutic target.

In [None]:
os.system("wget https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/desc_canvas_aug30.csv")

This dataset contains a set of molecular structures (`mol`), the negative base-10 logarithm of the half-maximal inhibitory concentration (`pIC50`), and 590 molecular topological features. These features can be calculated using common chemistry Python packages like `openbabel` or `rdkit`.

In [None]:
df = pd.read_csv("desc_canvas_aug30.csv")
df

This dataset was previously used in a drug design competition sponsored by Novartis. Here, we use the original train/test split from the contest.

In [None]:
train_df = df[df['Model'] == "Train"]
test_df = df[df['Model'] == "Test"]

label = 'Class'
y_train = train_df[label].values
y_test = test_df[label].values

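# Candidate feature columns: everything except the first five columns and the last one
features = list(train_df.keys()[5:-1])
# Drop any feature column that contains NaN in the training split
features = [f for f in features if not np.isnan(np.sum(train_df[f].values))]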
X_train = train_df[features].values
X_test = test_df[features].values

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## Linear Models for Classification

The goal in classification is to take an input vector $\textbf{x}$ and assign it to one of $K$ discrete classes $\mathcal{C}_k$, where $k = 1, \dots, K$. In the most common scenario the classes are disjoint, so each example is assigned to exactly one class. The input space is thereby divided into decision regions, whose boundaries are called decision boundaries or decision surfaces.

Today, we will consider linear models for classification, by which we mean that the decision boundaries are linear functions of the input vector $\textbf{x}$. The resulting decision surfaces are $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space.

<center><img src="https://i.imgur.com/hNAusMw.png" height=400>

Example of a linearly separable decision problem (Rep. Bishop Fig 4.4a) </center>
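
As a minimal sketch of what a linear decision boundary means in practice, the snippet below classifies a few 2-D points by the sign of $\textbf{w}^T \textbf{x} + b$. The weights `w`, the offset `b`, and the points in `X_demo` are hypothetical values chosen for illustration; they are not derived from the BACE data.

```python
import numpy as np

# Hypothetical weights and offset defining the line x1 - x2 = 0 in 2-D
w = np.array([1.0, -1.0])
b = 0.0

# Three illustrative points (not from the BACE dataset)
X_demo = np.array([[2.0, 1.0],
                   [0.5, 3.0],
                   [-1.0, -2.0]])

# The decision boundary is the set of points where the score is exactly zero;
# each side of that hyperplane is one decision region.
scores = X_demo @ w + b
labels = np.where(scores >= 0, 1, 0)
print(scores, labels)
```

In $D = 2$ dimensions the boundary is a line, which is a $(D-1)$-dimensional hyperplane.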

## Logistic Regression

In logistic regression, we fit a model of the form:

$$ y(\textbf{x}, \textbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\textbf{x}) = \textbf{w}^T \boldsymbol{\phi}(\textbf{x}) $$

We assume a linear relationship between the predictor variables $\textbf{x}$ and the log-odds that the example belongs to some class $\mathcal{C}_k$, where $p$ denotes the probability of that class:

$$ \log \frac{p}{1-p} = \textbf{w}^T \boldsymbol{\phi}(\textbf{x}) $$

<center><img src="https://i.imgur.com/t7rhx63.png" height=400>

Example logistic regression (Rep. Bishop Fig 4.1)
</center>
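
Solving the log-odds relation for $p$ gives the logistic sigmoid, $p = 1 / (1 + e^{-a})$ with $a = \textbf{w}^T \boldsymbol{\phi}(\textbf{x})$. The sketch below checks numerically that the sigmoid and the log-odds are inverses of each other. The helper names `sigmoid` and `log_odds` and the sample activations are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(a):
    """Map a linear activation a = w^T phi(x) to a probability."""
    return 1.0 / (1.0 + np.exp(-a))

def log_odds(p):
    """Map a probability back to the log-odds log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Illustrative activations: negative, zero, and positive
a = np.array([-2.0, 0.0, 2.0])
p = sigmoid(a)
recovered = log_odds(p)
print(p)
print(recovered)
```

An activation of zero corresponds to a probability of 0.5, which is where the decision boundary lies.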

The weights $\textbf{w}$ are obtained by maximizing the log-likelihood of the model given the data.
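
To make the log-likelihood concrete, here is a small sketch that evaluates the Bernoulli log-likelihood for two candidate weight vectors on a toy problem. It assumes targets coded as 0/1 and a design matrix whose first column is the constant basis function $\phi_0 = 1$. The toy data and both weight vectors are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, Phi, t):
    """Bernoulli log-likelihood of 0/1 targets t under weights w."""
    p = sigmoid(Phi @ w)
    return np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

# Toy design matrix: column 0 is the bias feature, column 1 is a single input
Phi = np.array([[1.0, -1.5],
                [1.0, -0.5],
                [1.0, 0.5],
                [1.0, 1.5]])
t = np.array([0, 0, 1, 1])

w_flat = np.array([0.0, 0.0])    # predicts p = 0.5 for every example
w_sloped = np.array([0.0, 2.0])  # assigns higher p to larger inputs

ll_flat = log_likelihood(w_flat, Phi, t)
ll_sloped = log_likelihood(w_sloped, Phi, t)
print(ll_flat, ll_sloped)
```

The weight vector that ranks the positive examples higher attains the larger log-likelihood, which is the quantity the fitting procedure increases.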

In [None]:
# C is the inverse regularization strength i.e. 1/alpha
reg = LogisticRegression(penalty='l1', C=10.0, solver='liblinear')
reg.fit(X_train, y_train)
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))
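
The role of `C` can be illustrated on synthetic data. The following is a rough sketch, assuming a toy problem generated with `make_classification` (not the BACE data) and an arbitrary grid of `C` values; it counts how many coefficients the L1 penalty leaves non-zero at each setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in dataset; sizes and seed are arbitrary illustrative choices
X_toy, y_toy = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, random_state=0)

nonzero_counts = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger L1 penalty
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_toy, y_toy)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(C, nonzero_counts[C], model.score(X_toy, y_toy))
```

With an L1 penalty, a stronger regularization (smaller `C`) tends to drive more coefficients to exactly zero.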

## Exercise: Fit a logistic regression model using the larger "test" dataset and evaluate model performance on the smaller "train" dataset. What is the effect of tuning "C"? 

## Perceptron

We now consider the perceptron, a linear model whose output is passed through a non-linear step activation function. The functional form of the perceptron model is:

$$ y(\textbf{x}, \textbf{w}) = f(\textbf{w}^T \boldsymbol{\phi}(\textbf{x})) $$

where $f$ is a step function of the form

$$ f(a) = \begin{cases} 
      +1 & a\geq 0 \\
      -1 & a \lt 0 
   \end{cases}
$$

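The perceptron error function is given by:

$$ E(\textbf{w}) = - \sum_{n \in \mathcal{M}} \textbf{w}^T \boldsymbol{\phi}(\textbf{x}_n) t_n $$

where $\mathcal{M}$ is the set of misclassified examples and $t_n \in \{-1, +1\}$ is the target for example $n$.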

This error function is typically optimized through stochastic gradient descent.
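
The stochastic gradient descent update for this error function can be sketched in a few lines of NumPy. For a misclassified example the gradient of its error term is $-\boldsymbol{\phi}(\textbf{x}_n) t_n$, so each update adds $\eta \, \boldsymbol{\phi}(\textbf{x}_n) t_n$ to the weights. The function `train_perceptron`, the learning rate, the epoch limit, and the toy dataset (a logical AND with targets in $\{-1, +1\}$) are illustrative assumptions, not the scikit-learn implementation.

```python
import numpy as np

def step(a):
    """Step activation: +1 for a >= 0, -1 otherwise."""
    return np.where(a >= 0, 1, -1)

def train_perceptron(Phi, t, lr=1.0, max_epochs=100):
    """Perceptron learning rule: update only on misclassified examples."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        n_errors = 0
        for phi_n, t_n in zip(Phi, t):
            # An example counts as misclassified when w^T phi_n t_n <= 0
            if (w @ phi_n) * t_n <= 0:
                w += lr * phi_n * t_n
                n_errors += 1
        if n_errors == 0:  # every example is on the correct side
            break
    return w

# Linearly separable toy data (logical AND); column 0 is the bias feature
Phi = np.array([[1, 0, 0],
                [1, 0, 1],
                [1, 1, 0],
                [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])

w = train_perceptron(Phi, t)
accuracy = np.mean(step(Phi @ w) == t)
print(w, accuracy)
```

Because this toy problem is linearly separable, the loop stops once an epoch passes with no updates.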

The perceptron is famously unable to solve the XOR problem, a simple two-variable binary classification problem, because the two XOR classes are not linearly separable. However, it was shown that stacking/nesting perceptrons into what is referred to as a multi-layer perceptron (or neural network) can alleviate this issue.

<center><img src="https://i.imgur.com/LZNuV4D.png" height=400>

Rep. Bishop Fig. 4.8 </center>
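
To illustrate how stacking step units gets around the XOR limitation, the sketch below wires up a two-layer network by hand: one hidden unit computes OR, another computes NAND, and the output unit computes their AND. The weights are hand-picked rather than learned, and the step function here returns 0/1 (instead of $\pm 1$) so that it matches 0/1 XOR labels; both are assumptions made for this example.

```python
import numpy as np

def step01(a):
    """Step activation returning 1 for a >= 0 and 0 otherwise."""
    return (a >= 0).astype(int)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Hidden layer: column 0 is an OR unit, column 1 is a NAND unit
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output unit: AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step01(X_xor @ W_hidden + b_hidden)
pred = step01(hidden @ w_out + b_out)
print(hidden)
print(pred)
```

Each unit on its own is still a linear classifier; it is the composition of the two layers that produces a non-linear decision boundary.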

In [None]:
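# alpha is the constant that multiplies the L2 regularization term
reg = Perceptron(penalty='l2', alpha=0.11)
# Note: this cell fits on the "Test" split and then scores both splits
reg.fit(X_test, y_test)
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))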

## Exercise: The XOR problem

Attempt to train a perceptron using the XOR dataset below. What accuracy score does the model achieve? Which examples are classified incorrectly?

In [None]:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
reg = Perceptron()
reg.fit(X, y)
print(reg.score(X, y))

## Classification Metrics



In [None]:
reg = LogisticRegression(penalty='l1', C=10.0, solver='liblinear')
reg.fit(X_train, y_train)

Accuracy Score: ratio of (true positives + true negatives) to (true positives + true negatives + false positives + false negatives)

In [None]:
# accuracy_score expects (y_true, y_pred) in that order
print(accuracy_score(y_train, reg.predict(X_train)))
print(accuracy_score(y_test, reg.predict(X_test)))
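
The accuracy definition can be checked by hand against `accuracy_score`. This is a small sketch on made-up labels (the arrays `y_true_demo` and `y_pred_demo` are illustrative, not model output), using `confusion_matrix` to obtain the four counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for illustration
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)
print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))
```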

Receiver Operating Characteristic (ROC) Curve: a plot of the true positive rate versus the false positive rate as the discrimination threshold is varied.

In [None]:
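import matplotlib.pyplot as plt
# Use the predicted probability of the positive class as the ranking score
fpr, tpr, thresholds = roc_curve(y_test, reg.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")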
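
Each point on a ROC curve comes from one choice of threshold. As a rough sketch, the helper below (a hypothetical function named `rates_at`, applied to made-up scores) computes the false positive rate and true positive rate at a few thresholds by direct counting.

```python
import numpy as np

# Made-up labels and classifier scores for illustration
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold, y_true, scores):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    pred_pos = scores >= threshold
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)

# Lowering the threshold moves from the bottom-left to the top-right corner
for thr in [0.9, 0.8, 0.4, 0.35, 0.1]:
    print(thr, rates_at(thr, y_true_demo, scores_demo))
```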

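Area Under the Receiver Operating Characteristic Curve (ROC-AUC): a single-number summary of the ROC curve, where 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.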

In [None]:
print(roc_auc_score(y_train, reg.predict_proba(X_train)[:, 1]))
print(roc_auc_score(y_test, reg.predict_proba(X_test)[:, 1]))
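
ROC-AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below checks this on made-up scores by comparing every positive/negative pair; the arrays and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and classifier scores for illustration
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Score differences for all (positive, negative) pairs via broadcasting
diff = pos[:, None] - neg[None, :]
pairwise_auc = np.mean((diff > 0) + 0.5 * (diff == 0))

print(pairwise_auc, roc_auc_score(y_true_demo, scores_demo))
```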