# Binomial Classification
## Logistic Regression From Scratch
This blog uses logistic regression on breast cancer data to predict whether a patient has cancer based on 9 features. I have added from-scratch implementations of a train_test_split function, training and prediction functions, and a k-fold cross validation function to this blog's code.

### Get Data

In [1]:
%pylab inline
# Hand the wget command to the shell to download the data and save it as dataset.csv
!wget -O dataset.csv https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
!head -3 dataset.csv

Populating the interactive namespace from numpy and matplotlib


--2020-09-18 19:52:06--  https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19889 (19K) [application/x-httpd-php]
Saving to: 'dataset.csv'

     0K .......... .........                                  100%  326K=0.06s



1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2


2020-09-18 19:52:06 (326 KB/s) - 'dataset.csv' saved [19889/19889]



In [2]:
import pandas as pd
df = pd.read_csv('dataset.csv', names=[
  "id number",
  "Clump Thickness",
  "Uniformity of Cell Size",
  "Uniformity of Cell Shape",
  "Marginal Adhesion",
  "Single Epithelial Cell Size",
  "Bare Nuclei",
  "Bland Chromatin",
  "Normal Nucleoli",
  "Mitoses",
  "Class"
])

df.head()

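| | id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |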


### Clean X

In [3]:
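# Missing values are encoded as '?' in the file; convert them to NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df = df.replace('?', np.nan)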
df.isna().sum()

id number                       0
Clump Thickness                 0
Uniformity of Cell Size         0
Uniformity of Cell Shape        0
Marginal Adhesion               0
Single Epithelial Cell Size     0
Bare Nuclei                    16
Bland Chromatin                 0
Normal Nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

In [4]:
# Select features
X = df[["Clump Thickness",
  "Uniformity of Cell Size",
  "Uniformity of Cell Shape",
  "Marginal Adhesion",
  "Single Epithelial Cell Size",
  "Bare Nuclei",
  "Bland Chromatin",
  "Normal Nucleoli",
  "Mitoses"
]].values.astype(np.float32)
X.shape

(699, 9)

In [5]:
idx = np.where(np.isnan(X))
X[idx] = np.take(np.nanmedian(X, axis = 0), idx[1])

y = df['Class'].values
y.shape

(699,)
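
The imputation step above is compact, so here is a minimal sketch of the same idea on a small made-up array. The values in `demo` are illustrative assumptions, not rows from the dataset. `np.where` returns the row and column indices of the missing entries, and `np.take` looks up the median of the column each missing entry sits in.

```python
import numpy as np

# Toy matrix with two missing entries (illustrative values only)
demo = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan],
                 [5.0, 40.0]], dtype=np.float32)

# Per-column medians, ignoring NaNs
col_medians = np.nanmedian(demo, axis=0)

# idx[0] holds the row indices of the NaNs, idx[1] their column indices
idx = np.where(np.isnan(demo))

# Replace each NaN with the median of its own column
demo[idx] = np.take(col_medians, idx[1])
print(demo)
```

In the notebook, only the "Bare Nuclei" column has missing values, so only that column's median is actually used.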

### Clean Y
The original labels are 2 (negative, benign) and 4 (positive, malignant). We remap them to 0 and 1.

In [6]:
if y[0] == 2:
  y = np.array(y == 4, dtype=np.float32)
y.shape, y[:10]

((699,), array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32))

### Bias Factor
To make this a proper linear model of the form $y = mx + b$, where $b$ is the bias factor, we just add a column of 1's to the front of the data so that the first weight plays the role of $b$.

In [7]:
X = np.hstack((np.ones((len(X), 1)), X))
X[:10]

array([[ 1.,  5.,  1.,  1.,  1.,  2.,  1.,  3.,  1.,  1.],
       [ 1.,  5.,  4.,  4.,  5.,  7., 10.,  3.,  2.,  1.],
       [ 1.,  3.,  1.,  1.,  1.,  2.,  2.,  3.,  1.,  1.],
       [ 1.,  6.,  8.,  8.,  1.,  3.,  4.,  3.,  7.,  1.],
       [ 1.,  4.,  1.,  1.,  3.,  2.,  1.,  3.,  1.,  1.],
       [ 1.,  8., 10., 10.,  8.,  7., 10.,  9.,  7.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  2., 10.,  3.,  1.,  1.],
       [ 1.,  2.,  1.,  2.,  1.,  2.,  1.,  3.,  1.,  1.],
       [ 1.,  2.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  5.],
       [ 1.,  4.,  2.,  1.,  1.,  2.,  1.,  2.,  1.,  1.]])
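
As a quick sanity check of why the column of 1's works, the sketch below compares a model with an explicit bias against one that folds the bias into the weight vector. The names `X_small`, `w`, `b`, and `theta_demo`, and their values, are hypothetical and chosen only for illustration.

```python
import numpy as np

X_small = np.array([[5.0, 1.0],
                    [3.0, 4.0]])
w = np.array([0.5, -0.25])  # hypothetical feature weights
b = 2.0                     # hypothetical bias

# Prepend a column of ones and put the bias first in the weight vector
X_aug = np.hstack((np.ones((len(X_small), 1)), X_small))
theta_demo = np.concatenate(([b], w))

with_bias_column = X_aug @ theta_demo
explicit_bias = X_small @ w + b
print(with_bias_column, explicit_bias)
```

Both expressions give the same values, so a single matrix product covers the weights and the bias together.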

In [8]:
m, n = X.shape
K = 2 # 2 classes
K, m, n

(2, 699, 10)

### Model
Following from maximum likelihood estimation, we use the negative log likelihood, scaled by $1/m$, as the objective function and minimize it with gradient descent.
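
Written out, the objective and its gradient look like this. This is the standard form for logistic regression, using the same symbols as the code in this section ($g$ for the sigmoid, $h$ for the model, $\theta$ for the weights, $m$ for the number of rows), and it is meant as a reading aid rather than a derivation.

$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^\top x)
$$

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \Big[ -y_i \log h_\theta(x_i) - (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \Big]
$$

$$
\nabla_\theta J(\theta) = \frac{1}{m} X^\top \big( h_\theta(X) - y \big)
$$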

In [9]:
# Weights
theta = np.zeros(n)

# Sigmoid (Logistic)
def g(z):
  """ sigmoid """
  return 1 / (1 + np.exp(-z))

# Model
def h(X, theta):
  return g(X @ theta)

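# Objective or cost function (negative log likelihood)
# Note: m is the global row count of the full X set in the previous cell, not len(y),
# so a loss computed on a subset of the rows is scaled by len(y) / m.
def J(preds, y):
  return 1/m * (-y @ np.log(preds) - (1 - y) @ np.log(1 - preds))

# Gradient of J with respect to theta (also scaled by the global m)
def compute_gradient(theta, X, y):
  preds = h(X, theta)
  gradient = 1/m * X.T @ (preds - y)
  return gradient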
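
`J` and `compute_gradient` divide by the global `m`, the row count of the full `X`. When they are called on a train or test subset, the loss is therefore not a per-sample average. Below is a minimal sketch of a variant that divides by the number of rows actually passed in and clips predictions so that `log` never sees 0. The names `sigmoid`, `mean_nll`, `mean_nll_gradient`, and the `eps` parameter are my own assumptions, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_nll(preds, y, eps=1e-12):
    """Negative log likelihood averaged over the rows passed in."""
    preds = np.clip(preds, eps, 1 - eps)  # keep log() finite
    return (-y @ np.log(preds) - (1 - y) @ np.log(1 - preds)) / len(y)

def mean_nll_gradient(theta, X, y):
    """Gradient of mean_nll with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Tiny check: with theta = 0 every prediction is 0.5, so the loss is log(2)
X_demo = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y_demo = np.array([1.0, 0.0, 1.0])
theta0 = np.zeros(2)
loss0 = mean_nll(sigmoid(X_demo @ theta0), y_demo)
grad0 = mean_nll_gradient(theta0, X_demo, y_demo)
print(loss0, grad0)
```

Switching to this normalization would change the effective learning rate and the reported losses, so the numbers printed in this post would not carry over unchanged.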

### Split Data Up
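To test LR properly on data it has not seen, I implemented a train_test_split function from scratch instead of using the sklearn function. It takes the first `train_perc` fraction of the rows for training and the remaining rows for testing, without shuffling.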

In [10]:
def my_train_test_split(X, y, train_perc):
    train_end_ind = int(X.shape[0] * train_perc)  # index of the first test row
    train_x = X[0:train_end_ind]
    train_y = y[0:train_end_ind]
    # Note: the -1 end index excludes the final row from the test set
    test_x = X[train_end_ind:-1]
    test_y = y[train_end_ind:-1]

    return (train_x, train_y, test_x, test_y)


train_x, train_y, test_x, test_y = my_train_test_split(X, y, 0.8)
print("Train (x,y) shape:", train_x.shape, train_y.shape, "\nTest (x,y) shape:", test_x.shape, test_y.shape)

Train (x,y) shape: (559, 10) (559,) 
Test (x,y) shape: (139, 10) (139,)
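
The split above keeps rows in file order, and its `-1` end index leaves the final row out of the test set (139 test rows instead of 140). As a hedged alternative, here is a sketch that shuffles the row indices first and assigns every row to exactly one set. The function name `shuffled_train_test_split` and the `seed` parameter are assumptions of mine, and the demo arrays are made up.

```python
import numpy as np

def shuffled_train_test_split(X, y, train_perc, seed=0):
    """Shuffle row indices, then split so every row lands in exactly one set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train_end_ind = int(len(X) * train_perc)
    train_idx, test_idx = order[:train_end_ind], order[train_end_ind:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10) % 2
tr_x, tr_y, te_x, te_y = shuffled_train_test_split(X_demo, y_demo, 0.8)
print(tr_x.shape, te_x.shape)
```

Shuffling matters when the file is ordered in some way, for example by collection date, because a contiguous split can then give train and test sets with different class balances.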


### Training Loop
I wrap training and testing in functions so that they are easy to call directly and can be reused by the cross validation function later.

In [11]:
def train_LR_from_scratch(X, y, iters, alpha):
    theta = np.zeros(X.shape[1])  # one weight per column, including the bias column
    
    hist = {'loss': [], 'acc': []} # Performance history
    
    for i in range(iters):
        
        gradient = compute_gradient(theta, X, y)
        theta -= alpha * gradient # Update weights based on gradient of cost function

        # loss
        preds = h(X, theta)
        loss = J(preds, y) 
        hist['loss'].append(loss) # Measure and store predicted loss

        # acc
        c = 0
        for j in range(len(y)):
            if (h(X[j], theta) > .5) == y[j]:
              c += 1
        acc = c / len(y)
        hist['acc'].append(acc) # Compute and store accuracy

    return (loss,acc, theta, hist)

def test_LR_from_scratch(X, y, theta):
    # loss
    preds = h(X, theta)
    loss = J(preds, y) 
 
    # acc
    c = 0
    for j in range(len(y)):
        if (h(X[j], theta) > .5) == y[j]:
          c += 1
    acc = c / len(y)

    return (loss,acc)

iters = 1000
alpha = 0.1

train_loss_from_scratch, train_acc_from_scratch, theta, hist = train_LR_from_scratch(train_x, train_y, iters, alpha)
test_loss_from_scratch, test_acc_from_scratch = test_LR_from_scratch(test_x, test_y, theta)

print(f'Training (loss, acc): {(train_loss_from_scratch,train_acc_from_scratch)}')
print(f'Testing (loss, acc): {(test_loss_from_scratch, test_acc_from_scratch)}')

Training (loss, acc): (0.11071898232627445, 0.9588550983899821)
Testing (loss, acc): (0.015275531393884854, 1.0)
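
The accuracy loops in `train_LR_from_scratch` and `test_LR_from_scratch` can also be written as one vectorized expression. The sketch below shows the idea on made-up predictions. The helper name `accuracy` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def accuracy(preds, y, threshold=0.5):
    """Fraction of rows where the thresholded prediction matches the label."""
    return float(np.mean((preds > threshold) == y))

# Made-up predicted probabilities and labels
preds_demo = np.array([0.9, 0.2, 0.6, 0.4])
y_demo = np.array([1.0, 0.0, 0.0, 0.0])
acc_demo = accuracy(preds_demo, y_demo)  # 3 of the 4 rows match
print(acc_demo)
```

Since `preds` is already computed for the loss, reusing it this way avoids calling `h` once per row.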


## Logistic Regression sklearn

In [12]:
# No fancy features were used in the sklearn version so that we are comparing apples to apples
# (note: sklearn's LogisticRegression applies L2 regularization by default)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

# Init model
LR = LogisticRegression(random_state=0)
# 10-fold cv using sklearn
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=1)
scores = cross_val_score(LR, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
kfold_sklearn_acc = mean(scores)
# Performance
print('LogisticRegression Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

LogisticRegression Mean Accuracy: 0.964 (0.018)


# Cross Validation Comparisons
We are unable to compare the from-scratch and sklearn logistic regression algorithms using the packaged paired t-test with 5x2 cross validation, since it takes two sklearn models as parameters for comparison. So, we have settled for comparing the 10-fold cross validation accuracy of each.

For the sake of learning, I have included an example of the paired t-test with 5x2 cross validation that compares LR and LDA.

## K-Fold Cross Validation Comparison

In [13]:
# k_Fold Cross Validation
def cv_split(X, y, start_ind_test, end_ind_test, end_cap):
    # k-1 train segments
    train_x = np.append(X[0:start_ind_test], X[end_ind_test:end_cap], axis=0)
    train_y = np.append(y[0:start_ind_test], y[end_ind_test:end_cap], axis=0)
    # 1 test segment
    test_x = X[start_ind_test:end_ind_test]
    test_y = y[start_ind_test:end_ind_test]

    
    return (train_x, train_y, test_x, test_y)

def kfold_cv_from_scratch(train_func, test_func, X, y, iters, alpha, folds):
    hist = {'loss': [], 'acc': []} # Performance history
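    fold_size = (X.shape[0]) // folds  # integer division: leftover rows are never used
    for i in range(folds):
        start_ind_test = i * fold_size
        end_ind_test = min(len(X) - 1, start_ind_test + fold_size)
        end_cap = fold_size * folds  # rows at or past this index are excluded from every fold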

        train_x, train_y, test_x, test_y = cv_split(X, y, start_ind_test, end_ind_test, end_cap)
        
        _, _, theta, _  = train_func(train_x, train_y, iters, alpha)
        loss, acc = test_func(test_x, test_y, theta)
        
        hist['loss'].append(loss)
        hist['acc'].append(acc)
    
    avg_loss, avg_acc = mean(hist['loss']), mean(hist['acc'])
    
    return (avg_loss, avg_acc)
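
Because `fold_size` uses integer division and `end_cap` is `fold_size * folds`, with 699 rows and 10 folds the last 9 rows are never used for training or testing. A minimal sketch of an index-based alternative that covers every row uses `np.array_split`, which allows folds of slightly different sizes. The generator name `kfold_indices` is an assumption of mine.

```python
import numpy as np

def kfold_indices(n_rows, folds):
    """Yield (train_idx, test_idx) pairs; each row is a test row exactly once."""
    chunks = np.array_split(np.arange(n_rows), folds)
    for i, test_idx in enumerate(chunks):
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(699, 10))
test_sizes = [len(test_idx) for _, test_idx in splits]
print(test_sizes)
```

The index arrays can be used directly as `X[train_idx], y[train_idx]` and `X[test_idx], y[test_idx]` in place of `cv_split`.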

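## Binomial Classification: Logistic Regression From Scratch vs sklearn
We can see that both models perform very similarly in terms of accuracy, with the sklearn model beating the from-scratch model by about half a percentage point. The two estimates use different fold assignments (stratified and shuffled for sklearn, contiguous for the from-scratch version), so the difference is only indicative.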

In [14]:
folds = 10
_, kfold_from_scratch_acc = kfold_cv_from_scratch(train_LR_from_scratch, test_LR_from_scratch, X, y, iters, alpha, folds)

In [15]:
dif = kfold_from_scratch_acc - kfold_sklearn_acc
print(f'Mean Accuracy (From Scratch, SKLearn): {kfold_from_scratch_acc, kfold_sklearn_acc}, Difference: {dif}')

Mean Accuracy (From Scratch, SKLearn): (0.9594202898550724, 0.9642443064182193), Difference: -0.0048240165631469045


## Paired T-Test 5x2 Cross Validation Example

In [16]:
from mlxtend.evaluate import paired_ttest_5x2cv

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
LDA = LinearDiscriminantAnalysis()

t,p = paired_ttest_5x2cv(estimator1=LR, estimator2=LDA, X=X, y=y, scoring='accuracy', random_seed=1)
# summarize
print('P-value: %.3f, t-Statistic: %.3f' % (p, t))
# interpret the result
if p <= 0.05:
    print('Difference between mean performance is probably real')
else:
    print('Algorithms probably have the same performance')

P-value: 0.074, t-Statistic: 2.257
Algorithms probably have the same performance
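
To make the 5x2 procedure less of a black box, here is a sketch of the test statistic as I understand it from Dietterich's description of the 5x2cv paired t-test. 2-fold cross validation is repeated 5 times, the difference in score between the two models is recorded on each fold, and the first difference is divided by the square root of the average per-repetition variance. The result is compared against a t distribution with 5 degrees of freedom. The function name `ttest_5x2cv_statistic` and the numbers in `demo_diffs` are made up for illustration, and mlxtend's internals may differ in detail.

```python
import numpy as np

def ttest_5x2cv_statistic(diffs):
    """diffs has shape (5, 2): diffs[i, j] is the score difference
    (model A minus model B) on fold j of repetition i."""
    diffs = np.asarray(diffs, dtype=float)
    means = diffs.mean(axis=1, keepdims=True)
    variances = ((diffs - means) ** 2).sum(axis=1)  # one variance estimate per repetition
    return diffs[0, 0] / np.sqrt(variances.mean())

# Made-up score differences, for illustration only
demo_diffs = np.array([[0.02, 0.01],
                       [0.03, 0.01],
                       [0.01, 0.02],
                       [0.02, 0.00],
                       [0.01, 0.01]])
t_demo = ttest_5x2cv_statistic(demo_diffs)
print(t_demo)
```

A p-value above the chosen threshold, as in the LR vs LDA output above, means the test did not detect a difference. It does not show that the two models perform identically.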
