# Introduction

This notebook continues the discussion of binary classification. In particular, we will apply logistic regression to the scikit-learn breast cancer toy dataset. <br>
Readers would greatly benefit from first brushing up on linear regression.

# Some Theory for Logistic Regression

## Why use scary logistic regression over the powerful and loved linear regression?

This is a reasonable question and merits some discussion before we dive into the mathematics of logistic regression. From a practical perspective, linear regression produces continuous, unbounded output rather than probabilities, so extra effort is needed to turn its prediction into a particular class. Logistic regression instead produces a value between 0 and 1 that can be read as a class probability, which is highly desirable when we want to classify a given sample.
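To make the contrast concrete, here is a minimal sketch (not part of the original notebook) of how a logistic model turns an unbounded linear score into a probability. The weights `w`, the intercept `b`, and the sample values are made-up numbers chosen only for illustration, and the 0.5 decision threshold is an assumed convention.

```python
import math

def linear_score(x, w, b):
    """Unbounded linear combination, the kind of output linear regression gives."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, intercept, and samples (illustrative only).
w = [0.8, -0.5]
b = 0.1
samples = [[3.0, 1.0], [-2.0, 4.0], [0.0, 0.0]]

for x in samples:
    z = linear_score(x, w, b)        # can be any real number
    p = sigmoid(z)                   # always strictly between 0 and 1
    label = 1 if p >= 0.5 else 0     # assumed threshold of 0.5
    print(f"score={z:+.2f}  probability={p:.3f}  class={label}")
```

The linear score can be negative or larger than 1, so it cannot be read as a probability directly, whereas the value after the logistic function can.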

# Building a Logistic Regression Model for Binary Classification

## Exploring the Toy Cancer Data

The medical sciences are becoming increasingly interested in data science and machine learning to build predictive models and find novel insights. We will first perform some preliminary exploration of the scikit-learn breast cancer toy dataset. This dataset is well suited for didactic purposes because the biopsies fall into only two classes: malignant or benign. <br>
According to the documentation, the data "features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass" (https://scikit-learn.org/dev/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset). <br>
We will first load the data and targets into pandas DataFrames and take a brief look at the data.

In [4]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

#setting return_X_y to True returns the data and the targets
#as two separate numpy arrays
data_array, target_array = load_breast_cancer(return_X_y=True)

In [5]:
#covariate names
column_headers = [
    "radius (mean)",
    "texture (mean)",
    "perimeter (mean)",
    "area (mean)",
    "smoothness (mean)",
    "compactness (mean)",
    "concavity (mean)",
    "concave points (mean)",
    "symmetry (mean)",
    "fractal dimension (mean)",
    "radius (standard error)",
    "texture (standard error)",
    "perimeter (standard error)",
    "area (standard error)",
    "smoothness (standard error)",
    "compactness (standard error)",
    "concavity (standard error)",
    "concave points (standard error)",
    "symmetry (standard error)",
    "fractal dimension (standard error)",
    "radius (worst)",
    "texture (worst)",
    "perimeter (worst)",
    "area (worst)",
    "smoothness (worst)",
    "compactness (worst)",
    "concavity (worst)",
    "concave points (worst)",
    "symmetry (worst)",
    "fractal dimension (worst)"
]

data = pd.DataFrame(data_array, columns = column_headers)
targets = pd.DataFrame(target_array, columns = ["cancer class"])
#include the cancer class in the data dataframe for filtering purposes
data = pd.concat([data, targets], axis = 1, sort = False)

In [6]:
#print some metadata
print("Shape of data:\t\t\t", data_array.shape)
print("Shape of target:\t\t", target_array.shape)
print("Number of benign cases:\t\t", data[data['cancer class'] == 1]['cancer class'].count())
print("Number of malignant cases:\t", data[data['cancer class'] == 0]['cancer class'].count())

Shape of data:			 (569, 30)
Shape of target:		 (569,)
Number of benign cases:		 357
Number of malignant cases:	 212
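
The class counts printed above also give a useful reference point: the accuracy of a trivial model that always predicts the majority class. The following is a small sketch using only the counts from the output; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Class counts copied from the output above.
n_benign = 357
n_malignant = 212
n_total = n_benign + n_malignant

# Share of samples labelled benign (class 1).
benign_fraction = n_benign / n_total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = max(n_benign, n_malignant) / n_total

print(f"total samples:     {n_total}")
print(f"fraction benign:   {benign_fraction:.4f}")
print(f"majority baseline: {majority_baseline_accuracy:.4f}")
```

Because benign is coded as 1, the benign fraction should equal the mean of the `cancer class` column. Any classifier we build should be judged against this baseline rather than against zero.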


In [7]:
#set option so that we can see all columns
pd.set_option('display.max_columns', None)
data.head()

Unnamed: 0,radius (mean),texture (mean),perimeter (mean),area (mean),smoothness (mean),compactness (mean),concavity (mean),concave points (mean),symmetry (mean),fractal dimension (mean),radius (standard error),texture (standard error),perimeter (standard error),area (standard error),smoothness (standard error),compactness (standard error),concavity (standard error),concave points (standard error),symmetry (standard error),fractal dimension (standard error),radius (worst),texture (worst),perimeter (worst),area (worst),smoothness (worst),compactness (worst),concavity (worst),concave points (worst),symmetry (worst),fractal dimension (worst),cancer class
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [8]:
#0 is malignant
#1 is benign
targets.head()

Unnamed: 0,cancer class
0,0
1,0
2,0
3,0
4,0


For convenience and sanity, we will call a particular measurement, such as radius or compactness, an "attribute", and call a derived measurement, such as the mean compactness or worst perimeter, a "covariate". <br>
The above data has 10 different attributes and 30 covariates. Each particular attribute for a biopsy image has three related covariates: the mean value for the image, the "worst" value for the image and the standard error for the image. <br>
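
The relationship between attributes and covariates can be sketched in a few lines. This is an illustrative alternative to typing out the 30 column names by hand, not how the notebook builds them; `attributes`, `statistics`, and `generated_headers` are names introduced here for the example.

```python
# The ten attributes measured for each biopsy image.
attributes = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# The three statistics reported per attribute, in the column order used above.
statistics = ["mean", "standard error", "worst"]

# The statistic varies slowest, so all means come first,
# then all standard errors, then all worst values.
generated_headers = [
    f"{attribute} ({statistic})"
    for statistic in statistics
    for attribute in attributes
]

print(len(generated_headers))
print(generated_headers[0], "|", generated_headers[10], "|", generated_headers[-1])
```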
We will continue our initial data exploration by looking at summary statistics for the data overall and class-specific data.

In [9]:
data.describe()

Unnamed: 0,radius (mean),texture (mean),perimeter (mean),area (mean),smoothness (mean),compactness (mean),concavity (mean),concave points (mean),symmetry (mean),fractal dimension (mean),radius (standard error),texture (standard error),perimeter (standard error),area (standard error),smoothness (standard error),compactness (standard error),concavity (standard error),concave points (standard error),symmetry (standard error),fractal dimension (standard error),radius (worst),texture (worst),perimeter (worst),area (worst),smoothness (worst),compactness (worst),concavity (worst),concave points (worst),symmetry (worst),fractal dimension (worst),cancer class
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [10]:
data[data['cancer class'] == 0].describe()

Unnamed: 0,radius (mean),texture (mean),perimeter (mean),area (mean),smoothness (mean),compactness (mean),concavity (mean),concave points (mean),symmetry (mean),fractal dimension (mean),radius (standard error),texture (standard error),perimeter (standard error),area (standard error),smoothness (standard error),compactness (standard error),concavity (standard error),concave points (standard error),symmetry (standard error),fractal dimension (standard error),radius (worst),texture (worst),perimeter (worst),area (worst),smoothness (worst),compactness (worst),concavity (worst),concave points (worst),symmetry (worst),fractal dimension (worst),cancer class
count,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0,212.0
mean,17.46283,21.604906,115.365377,978.376415,0.102898,0.145188,0.160775,0.08799,0.192909,0.06268,0.609083,1.210915,4.323929,72.672406,0.00678,0.032281,0.041824,0.01506,0.020472,0.004062,21.134811,29.318208,141.37033,1422.286321,0.144845,0.374824,0.450606,0.182237,0.323468,0.09153,0.0
std,3.203971,3.77947,21.854653,367.937978,0.012608,0.053987,0.075019,0.034374,0.027638,0.007573,0.345039,0.483178,2.568546,61.355268,0.00289,0.018387,0.021603,0.005517,0.010065,0.002041,4.283569,5.434804,29.457055,597.967743,0.02187,0.170372,0.181507,0.046308,0.074685,0.021553,0.0
min,10.95,10.38,71.9,361.6,0.07371,0.04605,0.02398,0.02031,0.1308,0.04996,0.1938,0.3621,1.334,13.99,0.002667,0.008422,0.01101,0.005174,0.007882,0.001087,12.84,16.67,85.1,508.1,0.08822,0.05131,0.02398,0.02899,0.1565,0.05504,0.0
25%,15.075,19.3275,98.745,705.3,0.09401,0.1096,0.109525,0.06462,0.17405,0.056598,0.390375,0.892825,2.7155,35.7625,0.005085,0.019662,0.026998,0.011415,0.014615,0.002688,17.73,25.7825,119.325,970.3,0.130475,0.244475,0.326425,0.15275,0.2765,0.076302,0.0
50%,17.325,21.46,114.2,932.0,0.1022,0.13235,0.15135,0.08628,0.1899,0.061575,0.5472,1.1025,3.6795,58.455,0.006209,0.02859,0.037125,0.014205,0.0177,0.003739,20.59,28.945,138.0,1303.0,0.14345,0.35635,0.4049,0.182,0.3103,0.0876,0.0
75%,19.59,23.765,129.925,1203.75,0.110925,0.1724,0.20305,0.103175,0.20985,0.067075,0.7573,1.42925,5.20625,94.0,0.007971,0.03891,0.050443,0.017497,0.022132,0.004892,23.8075,32.69,159.8,1712.75,0.155975,0.44785,0.556175,0.210675,0.359225,0.102625,0.0
max,28.11,39.28,188.5,2501.0,0.1447,0.3454,0.4268,0.2012,0.304,0.09744,2.873,3.568,21.98,542.2,0.03113,0.1354,0.1438,0.0409,0.07895,0.01284,36.04,49.54,251.2,4254.0,0.2226,1.058,1.17,0.291,0.6638,0.2075,0.0


In [11]:
data[data['cancer class'] == 1].describe()

Unnamed: 0,radius (mean),texture (mean),perimeter (mean),area (mean),smoothness (mean),compactness (mean),concavity (mean),concave points (mean),symmetry (mean),fractal dimension (mean),radius (standard error),texture (standard error),perimeter (standard error),area (standard error),smoothness (standard error),compactness (standard error),concavity (standard error),concave points (standard error),symmetry (standard error),fractal dimension (standard error),radius (worst),texture (worst),perimeter (worst),area (worst),smoothness (worst),compactness (worst),concavity (worst),concave points (worst),symmetry (worst),fractal dimension (worst),cancer class
count,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0,357.0
mean,12.146524,17.914762,78.075406,462.790196,0.092478,0.080085,0.046058,0.025717,0.174186,0.062867,0.284082,1.22038,2.000321,21.135148,0.007196,0.021438,0.025997,0.009858,0.020584,0.003636,13.379801,23.51507,87.005938,558.89944,0.124959,0.182673,0.166238,0.074444,0.270246,0.079442,1.0
std,1.780512,3.995125,11.807438,134.287118,0.013446,0.03375,0.043442,0.015909,0.024807,0.006747,0.11257,0.58918,0.771169,8.843472,0.003061,0.016352,0.032918,0.005709,0.006999,0.002938,1.981368,5.493955,13.527091,163.601424,0.020013,0.09218,0.140368,0.035797,0.041745,0.013804,0.0
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.05185,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.009539,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1566,0.05521,1.0
25%,11.08,15.15,70.87,378.2,0.08306,0.05562,0.02031,0.01502,0.158,0.05853,0.2073,0.7959,1.445,15.26,0.005212,0.01132,0.01099,0.006433,0.0156,0.002074,12.08,19.58,78.27,447.1,0.1104,0.112,0.07708,0.05104,0.2406,0.07009,1.0
50%,12.2,17.39,78.18,458.4,0.09076,0.07529,0.03709,0.02344,0.1714,0.06154,0.2575,1.108,1.851,19.63,0.00653,0.01631,0.0184,0.009061,0.01909,0.002808,13.35,22.82,86.92,547.4,0.1254,0.1698,0.1412,0.07431,0.2687,0.07712,1.0
75%,13.37,19.76,86.1,551.1,0.1007,0.09755,0.05999,0.03251,0.189,0.06576,0.3416,1.492,2.388,25.03,0.008534,0.02589,0.03056,0.01187,0.02406,0.004174,14.8,26.51,96.59,670.0,0.1376,0.2302,0.2216,0.09749,0.2983,0.08541,1.0
max,17.85,33.81,114.6,992.1,0.1634,0.2239,0.4108,0.08534,0.2743,0.09575,0.8811,4.885,5.118,77.11,0.02177,0.1064,0.396,0.05279,0.06146,0.02984,19.82,41.78,127.1,1210.0,0.2006,0.5849,1.252,0.175,0.4228,0.1486,1.0


In [12]:
#are summary statistics greater for the malignant class
data[data['cancer class'] == 0].describe() > data[data['cancer class'] == 1].describe()

Unnamed: 0,radius (mean),texture (mean),perimeter (mean),area (mean),smoothness (mean),compactness (mean),concavity (mean),concave points (mean),symmetry (mean),fractal dimension (mean),radius (standard error),texture (standard error),perimeter (standard error),area (standard error),smoothness (standard error),compactness (standard error),concavity (standard error),concave points (standard error),symmetry (standard error),fractal dimension (standard error),radius (worst),texture (worst),perimeter (worst),area (worst),smoothness (worst),compactness (worst),concavity (worst),concave points (worst),symmetry (worst),fractal dimension (worst),cancer class
count,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
mean,True,True,True,True,True,True,True,True,True,False,True,False,True,True,False,True,True,True,False,True,True,True,True,True,True,True,True,True,True,True,False
std,True,False,True,True,False,True,True,True,True,True,True,False,True,True,False,True,False,False,True,False,True,False,True,True,True,True,True,True,True,True,False
min,True,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True,False,True,True,True,True,True,True,True,True,True,False,False,False
25%,True,True,True,True,True,True,True,True,True,False,True,True,True,True,False,True,True,True,False,True,True,True,True,True,True,True,True,True,True,True,False
50%,True,True,True,True,True,True,True,True,True,True,True,False,True,True,False,True,True,True,False,True,True,True,True,True,True,True,True,True,True,True,False
75%,True,True,True,True,True,True,True,True,True,True,True,False,True,True,False,True,True,True,False,True,True,True,True,True,True,True,True,True,True,True,False
max,True,True,True,True,False,True,True,True,True,True,True,False,True,True,True,True,False,False,True,False,True,True,True,True,True,True,False,True,True,True,False


We will now examine the ranges of the covariates for each class.

In [13]:
malignant_range = data[data['cancer class'] == 0].max() - data[data['cancer class'] == 0].min()
malignant_range

radius (mean)                           17.160000
texture (mean)                          28.900000
perimeter (mean)                       116.600000
area (mean)                           2139.400000
smoothness (mean)                        0.070990
compactness (mean)                       0.299350
concavity (mean)                         0.402820
concave points (mean)                    0.180890
symmetry (mean)                          0.173200
fractal dimension (mean)                 0.047480
radius (standard error)                  2.679200
texture (standard error)                 3.205900
perimeter (standard error)              20.646000
area (standard error)                  528.210000
smoothness (standard error)              0.028463
compactness (standard error)             0.126978
concavity (standard error)               0.132790
concave points (standard error)          0.035726
symmetry (standard error)                0.071068
fractal dimension (standard error)       0.011753


In [14]:
benign_range = data[data['cancer class'] == 1].max() - data[data['cancer class'] == 1].min()
benign_range

radius (mean)                           10.869000
texture (mean)                          24.100000
perimeter (mean)                        70.810000
area (mean)                            848.600000
smoothness (mean)                        0.110770
compactness (mean)                       0.204520
concavity (mean)                         0.410800
concave points (mean)                    0.085340
symmetry (mean)                          0.168300
fractal dimension (mean)                 0.043900
radius (standard error)                  0.769600
texture (standard error)                 4.524800
perimeter (standard error)               4.361000
area (standard error)                   70.308000
smoothness (standard error)              0.020057
compactness (standard error)             0.104148
concavity (standard error)               0.396000
concave points (standard error)          0.052790
symmetry (standard error)                0.051921
fractal dimension (standard error)       0.028945


In [15]:
malignant_range > benign_range

radius (mean)                          True
texture (mean)                         True
perimeter (mean)                       True
area (mean)                            True
smoothness (mean)                     False
compactness (mean)                     True
concavity (mean)                      False
concave points (mean)                  True
symmetry (mean)                        True
fractal dimension (mean)               True
radius (standard error)                True
texture (standard error)              False
perimeter (standard error)             True
area (standard error)                  True
smoothness (standard error)            True
compactness (standard error)           True
concavity (standard error)            False
concave points (standard error)       False
symmetry (standard error)              True
fractal dimension (standard error)    False
radius (worst)                         True
texture (worst)                        True
perimeter (worst)               

The first thing we may notice is that the mean of every mean covariate except mean fractal dimension is larger for malignant samples than for benign samples. This agrees with the model of cancer as abnormal cells that do not respect the cellular signaling that limits normal somatic cell growth. Moreover, the ranges of mean radius, mean texture, mean perimeter, mean area, and mean compactness are all greater for malignant samples, which is consistent with the expected unregulated growth of cancer cells. <br>
Perhaps unsurprisingly, the mean of every worst covariate is greater for malignant samples than for benign samples. Furthermore, the range of every worst covariate except concavity is greater for malignant samples. This helps build some intuition that the above covariates may provide some discriminatory power for classifying a sample as malignant or benign. <br>
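
The range comparisons described above can be checked by hand from the class-specific summary tables. The sketch below copies the minimum and maximum of three covariates from those tables and recomputes the ranges; the dictionaries and the `value_range` helper are introduced here only for illustration.

```python
# (min, max) pairs copied from the class-specific describe() tables above.
malignant = {
    "radius (mean)": (10.95, 28.11),
    "smoothness (mean)": (0.07371, 0.1447),
    "concavity (worst)": (0.02398, 1.17),
}
benign = {
    "radius (mean)": (6.981, 17.85),
    "smoothness (mean)": (0.05263, 0.1634),
    "concavity (worst)": (0.0, 1.252),
}

def value_range(bounds):
    """Range of a covariate given its (min, max) pair."""
    low, high = bounds
    return high - low

for name in malignant:
    m_range = value_range(malignant[name])
    b_range = value_range(benign[name])
    print(f"{name:20s} malignant={m_range:.5f} benign={b_range:.5f} "
          f"malignant wider: {m_range > b_range}")
```

Mean radius has the wider range in the malignant class, while mean smoothness and worst concavity are the exceptions where the benign range is wider.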
In the next section we will focus on creating our model and measuring its effectiveness with the metrics we outlined in previous sections.