## 1.4 An example of logistic regression with sklearn on real data
Here the data has many more features than before: 30 per sample.

### Loading the breast cancer dataset

In [7]:
# Read the data
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load dataset
dataset = load_breast_cancer()
#print(dataset)

### Feature names and label names

In [8]:
# investigate the dataset

X_names = dataset['feature_names']
X = dataset['data']
y_names = dataset['target_names']
y = dataset['target']
print('Features (ie, predictors) in the data:',X_names,'\n\n')
print('Class labels (ie, outcome values):',y_names)

Features (ie, predictors) in the data: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 


Class labels (ie, outcome values): ['malignant' 'benign']


In [9]:
print(f"The data has a dimension of {X.shape[0]} rows and {X.shape[1]} columns.")
print(f"This means that there are {X.shape[0]} samples and {X.shape[1]} features in the dataset.")

The data has a dimension of 569 rows and 30 columns.
This means that there are 569 samples and 30 features in the dataset.


In [10]:
print('What this looks like:')
print(X)
print('\nAnd what the label data looks like:')
print(y)

What this looks like:
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]

And what the label data looks like:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0
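
The label array above is a sequence of 0s and 1s. As a minimal sketch (the variable name `counts` is ours, not part of the original notebook), the following maps each integer label to its class name from `target_names` and counts how many samples fall in each class:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
y = dataset['target']
y_names = dataset['target_names']

# np.bincount counts how often each integer label (0, 1) occurs in y
counts = np.bincount(y)
for label, (name, count) in enumerate(zip(y_names, counts)):
    print(f"label {label} = {name}: {count} samples")
```

The position of a name in `target_names` is the integer used for it in `target`, so label 0 is 'malignant' and label 1 is 'benign'.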

### Splitting the data into training data and test data

In [11]:
# Split our data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

In [12]:
print('Shapes of the training and test data as (number of samples, number of features):')
print(X_train.shape)
print(X_test.shape)

Shapes of the training and test data as (number of samples, number of features):
(381, 30)
(188, 30)
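
The split above is purely random. One possible variation, shown here only as a sketch and not used in the rest of this section, passes `stratify=y` so that both parts keep roughly the same class proportions as the full dataset. The `_s` suffixed names are hypothetical, chosen to avoid clashing with the variables above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
X, y = dataset['data'], dataset['target']

# Same split size and seed as above, but stratified on the labels
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# The fraction of label 1 should be nearly identical in all three arrays
print('fraction of label 1 in full data: ', y.mean())
print('fraction of label 1 in train part:', y_train_s.mean())
print('fraction of label 1 in test part: ', y_test_s.mean())
```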


In [13]:
print("scaling the data using standardisation, i.e., zero mean, unit variance")
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training set only, so no test-set statistics leak into it.
scaler.fit(X_train)
# Apply the same transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

scaling the data using standardisation, i.e., zero mean, unit variance
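
The scaler is fitted on the training data and then reused on the test data. An alternative way to express the same idea, sketched here under the assumption that we start again from the unscaled split, is to chain the scaler and the classifier with sklearn's `make_pipeline`. The name `pipe` is ours:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.33, random_state=42)

# fit() learns the scaling from X_train only, then trains the classifier
# on the scaled training data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() applies the stored training-set scaling to X_test before predicting
print('pipeline accuracy on the test set:', pipe.score(X_test, y_test))
```

With a pipeline, the test data cannot accidentally be scaled with its own statistics, because the transform and the model are always applied together.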


### Training a logistic regression (LR) model using the default hyper-parameters
See the sklearn API reference for LR: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [14]:
from sklearn.linear_model import LogisticRegression
print('just use the default hyper-parameters')

clf = LogisticRegression().fit(X_train, y_train)

# The default penalty is L2 and the default C is 1, so the call above is equivalent to the following:
#_penalty = 'l2'
#_C = 1
#clf = LogisticRegression(penalty=_penalty, C=_C).fit(X_train, y_train)

print('finished training')

just use the default hyper-parameters
finished training
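
After training, the learned weights are available in `clf.coef_` and the bias in `clf.intercept_`. Because the features were standardised, the magnitudes of the weights are roughly comparable across features. The following sketch (the names `weights` and `order` are ours) repeats the steps above and lists the five features with the largest absolute weight:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# For a binary problem coef_ has shape (1, n_features); take the single row
weights = clf.coef_[0]

# Feature indices sorted by absolute weight, largest first
order = np.argsort(np.abs(weights))[::-1]
for i in order[:5]:
    print(f"{dataset['feature_names'][i]:<25s} {weights[i]:+.3f}")
```

A positive weight pushes the prediction towards label 1 ('benign'), a negative weight towards label 0 ('malignant').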


### Testing the trained LR model, which uses the default hyper-parameters

In [15]:
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
print('the predicted labels are:')
print(y_pred)
print('the ground truth labels are:')
print(y_test)


print('logistic regression classifier accuracy:',accuracy_score(y_test, y_pred))

the predicted labels are:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 0 1 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
 0 0 1]
the ground truth labels are:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 0 1 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]
logistic regression classifier accuracy: 0.9787234042553191
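
Accuracy is a single number and does not say which kind of mistake the classifier makes. As a sketch that repeats the steps above in one self-contained block, `confusion_matrix` and `classification_report` from `sklearn.metrics` break the result down per class:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_test))

# Rows are the true labels, columns the predicted labels (order: 0, 1)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1, labelled with the class names
print(classification_report(y_test, y_pred, target_names=dataset['target_names']))
```

The diagonal of the matrix holds the correctly classified samples, so its sum divided by the total number of test samples equals the accuracy.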


Note: because the data has 30 features, the decision boundary is a hyperplane in 30-dimensional space, so we can't visualize it in a straightforward way.
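
Even without a picture, we can still look at where each sample lies relative to the hyperplane. As a sketch (the names `scores` and `manual` are ours), `clf.decision_function` returns the value w·x + b for every sample; its sign decides the predicted label and its magnitude grows with the distance from the boundary:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = LogisticRegression().fit(X_train_scaled, y_train)

# One score per test sample: positive means label 1, otherwise label 0
scores = clf.decision_function(X_test_scaled)

# The same scores computed by hand from the learned weights and bias
manual = X_test_scaled @ clf.coef_[0] + clf.intercept_[0]

print('first five scores:     ', np.round(scores[:5], 3))
print('first five predictions:', clf.predict(X_test_scaled)[:5])
```

Samples with a score close to zero sit near the decision boundary and are the ones the model is least certain about.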