Binary logistic regression.
The heart disease data set is described at:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
The course webpage has a file heart.csv that contains a more compact version of this data set with
303 data points, each of which has a 13-dimensional attribute vector x (first 13 columns) and a binary
label y (final column). We’ll work with this smaller data set.

(a) Randomly partition the data into 200 training points and 103 test points. Fit a logistic regression
model to the training data and display the coefficients of the model. If you had to choose the
three features that were most influential in the model, what would they be?

In [133]:
# Randomly partition data into 200 training points and 103 test points 
from pandas import read_csv
import numpy as np 
from sklearn.linear_model import LogisticRegression 
 
d = read_csv('heart.csv')
 
data = d.values
# Shuffle the rows in place (no seed is set, so the split differs on every run)
np.random.shuffle(data)
features = data[:, 0:-1]
labels = data[:,-1]

train_features = features[0:200,:]
train_labels = labels[0:200]

test_features = features[200:,:]
test_labels = labels[200:]
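
As a possible alternative to shuffling and slicing by hand, the same 200/103 partition can be sketched with scikit-learn's `train_test_split`, which also accepts a seed so the split is reproducible. The synthetic `data` array and the choice `random_state=0` are assumptions made only so the example runs on its own; in the notebook, `data` would be `d.values`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in with the same shape as heart.csv (303 rows, 13 features + 1 label).
# Assumption for illustration only; replace with d.values in the notebook.
rng = np.random.default_rng(0)
data = np.column_stack([rng.normal(size=(303, 13)),
                        rng.integers(0, 2, size=303)])

X, y = data[:, :-1], data[:, -1]

# Integer train_size/test_size values are interpreted as absolute sample counts.
# random_state fixes the shuffle so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, test_size=103, shuffle=True, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing the seed makes the coefficients and scores reported further down repeatable, at the cost of tying them to one particular split.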


In [134]:
# Fit a logistic regression model to training data and display the coefficients of the model
model = LogisticRegression(solver='liblinear')
clf = model.fit(train_features, train_labels)
print(clf.coef_)
# Find the three most influential features: argsort orders the indices by increasing
# |coefficient|, so the last three indices printed belong to the largest magnitudes
sorted_coeff_array_indices = np.argsort(np.abs(clf.coef_))
print(sorted_coeff_array_indices)

[[ 0.02854899 -1.33926467  0.6408165  -0.01446384 -0.00155748 -0.52903129
   0.52888458  0.02246758 -1.03397455 -0.33038919  0.57155042 -1.06607593
  -0.85016179]]
[[ 4  3  7  0  9  6  5 10  2 12  8 11  1]]


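Ranked by absolute coefficient value, the three most influential features in the model are sex (index 1, coefficient -1.34), ca (index 11, -1.07), and exang (index 8, -1.03). The features are not standardized, so coefficient magnitude is only a rough measure of influence.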
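
The ranking above can be reproduced with a small helper that pairs each coefficient with a feature name. This is a minimal sketch: the `names` list assumes that heart.csv keeps the usual UCI column order (the notebook itself only confirms indices 1, 8, and 11), and `top_k_features` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# Coefficients copied from the printed output of the fitted model.
coef = np.array([0.02854899, -1.33926467, 0.6408165, -0.01446384, -0.00155748,
                 -0.52903129, 0.52888458, 0.02246758, -1.03397455, -0.33038919,
                 0.57155042, -1.06607593, -0.85016179])

# Assumed column order of heart.csv (only sex, exang, and ca are confirmed above).
names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
         "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def top_k_features(coef, names, k=3):
    """Return the k (name, coefficient) pairs with the largest |coefficient|."""
    order = np.argsort(np.abs(coef))[::-1][:k]  # descending by magnitude
    return [(names[i], float(coef[i])) for i in order]

top3 = top_k_features(coef, names)
print(top3)
```

Standardizing the features before fitting (for example with `sklearn.preprocessing.StandardScaler`) would make the magnitudes more directly comparable, and might change the ranking.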

(b) What is the test error of your model?

In [135]:
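# score() returns the mean accuracy on the test set; the test error is 1 - accuracy
clf.score(test_features, test_labels)

0.8058252427184466

This score is the test accuracy, so the test error is 1 - 0.8058 ≈ 0.194 (20 of the 103 test points are misclassified).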
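
Since `score` reports accuracy, the error has to be derived from it. The sketch below shows two equivalent routes: counting mismatches between predictions and labels, and subtracting the accuracy from 1. The helper name `error_rate` and the toy arrays are illustrative assumptions; in the notebook the predictions would come from `clf.predict(test_features)`.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of points where the predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Toy labels for illustration only: 2 of the 5 predictions are wrong.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
toy_error = error_rate(y_true, y_pred)
print(toy_error)

# Converting the accuracy reported above into an error rate.
test_accuracy = 0.8058252427184466
test_error = 1.0 - test_accuracy
print(round(test_error, 4))
```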

(c) Estimate the error by using 5-fold cross-validation on the training set. How does this compare to
the test error?

In [136]:
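def cross_validation(k, features, labels):
    """Return the mean validation accuracy over k folds (error = 1 - accuracy)."""
    n = len(labels)
    fold_size = n // k  # if k does not divide n, the last n % k points are never validated
    score = 0

    for i in range(k):

        # The first fold_size points form the validation set; the rest are used for training
        validation_features = features[:fold_size]
        validation_labels = labels[:fold_size]
        train_features = features[fold_size:]
        train_labels = labels[fold_size:]

        # Train on the other k-1 folds and score (accuracy) on the held-out fold
        model = LogisticRegression(solver='liblinear')
        clf = model.fit(train_features, train_labels)
        score = score + clf.score(validation_features, validation_labels)

        # Rotate the data so that the next fold moves to the front
        features = np.concatenate((train_features, validation_features))
        labels = np.concatenate((train_labels, validation_labels))

    return score/k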
    
cross_validation(5, train_features, train_labels)

0.85

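The 5-fold cross-validation accuracy on the training set is 0.85, which gives an estimated error of 0.15. This is reasonably close to the test error of about 0.194: the gap of roughly 0.04 is comparable to the standard error of an error estimate based on only 103 test points.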
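
For comparison, a similar estimate can be sketched with scikit-learn's built-in `cross_val_score`, together with a rough binomial standard error for an error rate measured on 103 points. The data here is synthetic (`make_classification` with `random_state=0`), an assumption made only so the example runs on its own; in the notebook, `X` and `y` would be `train_features` and `train_labels`. Note that for classifiers `cv=5` uses stratified folds, so the result need not match the hand-written function exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 200 training points with 13 features (assumption).
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

model = LogisticRegression(solver='liblinear')

# One accuracy value per fold; the default scorer for a classifier is accuracy.
scores = cross_val_score(model, X, y, cv=5)
cv_error = 1.0 - scores.mean()
print(len(scores), cv_error)

# Rough standard error of an error rate p estimated from n independent points:
# sqrt(p * (1 - p) / n), here with the test error and test set size from above.
p, n = 0.194, 103
standard_error = np.sqrt(p * (1 - p) / n)
print(round(standard_error, 3))
```

A gap between the cross-validation and test estimates that is about the size of this standard error is not strong evidence that the two disagree.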