<h1>Logistic Regression With The Breast Cancer Dataset</h1>

<h3>Introducing the Breast Cancer Dataset</h3>

<p>The breast cancer dataset is built into scikit-learn. Each datapoint has measurements taken from an image of a breast mass, along with a label saying whether or not the mass is cancerous. The goal is to use these measurements to predict whether a mass is cancerous.</p>

<p>First we import the dataset from scikit-learn.</p>

In [1]:
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()

<p>Then we view the available keys.</p>

In [2]:
print(cancer_data.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


<p>For example, the key 'DESCR' gives a detailed description of the dataset.</p>

In [3]:
print(cancer_data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - 

<p>From the description above, we can see that our dataset:</p>
<ul>
    <li>Includes 30 numeric features/attributes</li>
    <li>Has 569 datapoints/rows/instances</li>
    <li>Has two classes: malignant (cancerous) and benign (not cancerous)</li>
    <li>For each of the datapoints we have measurements of the breast mass (radius, texture, perimeter, etc.)</li>
    <li>For each of the 10 measurements, multiple values were computed, so we have the mean, standard error and the worst value. This results in 10 * 3 or 30 total features.</li>
</ul>
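
<p>As a rough check of the "10 * 3" structure, the sketch below slices the feature names into three groups of ten. The variable names <code>mean_features</code>, <code>error_features</code> and <code>worst_features</code> are introduced here for illustration only. Note that scikit-learn labels the standard error columns with the word "error".</p>

```python
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
feature_names = cancer_data['feature_names']

# Each block of 10 consecutive names covers the same 10 measurements:
# first the means, then the standard errors, then the "worst" values.
mean_features = feature_names[:10]
error_features = feature_names[10:20]
worst_features = feature_names[20:]

print(len(feature_names))
print(mean_features[0], '|', error_features[0], '|', worst_features[0])
```

<p>This matches the description's note that field 0 is the mean radius, field 10 is the radius standard error, and field 20 is the worst radius.</p>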

<strong>Feature Engineering is the process of figuring out what additional features to calculate.</strong>
<p>In the breast cancer dataset, several features are calculated from other measurements. For example, the description above defines compactness as perimeter^2 / area - 1.0.</p>

<h3>Loading the Data into Pandas</h3>

<p>The feature data is stored under the 'data' key. Its shape shows that it is an array with 569 rows and 30 columns.</p>

In [4]:
print(cancer_data['data'].shape)

(569, 30)


<p>Let’s start by putting the feature data from the 'data' key into a Pandas DataFrame. To make it more human readable, we use the column names stored under the 'feature_names' key.</p>

In [5]:
from IPython.display import display
import pandas as pd

df = pd.DataFrame(cancer_data['data'], columns=cancer_data['feature_names'])
display(df)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


<p>Now we need to put the target data in our DataFrame, which can be found with the 'target' key. We can see that the target is a 1-dimensional numpy array of 1’s and 0’s.</p>

In [6]:
print(cancer_data['target'])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

<p>If we look at the shape of the array, we see that it’s a 1-dimensional array with 569 values (which was how many datapoints we had).</p>

In [7]:
print(cancer_data['target'].shape)

(569,)


<p>In order to interpret these 1’s and 0’s, we need to know which of them means benign and which means malignant. This is given by the 'target_names' key.</p>

In [8]:
print(cancer_data['target_names'])

['malignant' 'benign']


<p>This gives the array ['malignant' 'benign'] which tells us that 0 means malignant and 1 means benign. Let’s add this data to the Pandas DataFrame.</p>

<p>It’s important to double check that you are interpreting boolean columns correctly. In our case a target of <strong>0 means malignant (cancerous)</strong> and <strong>1 means benign (not cancerous)</strong>.</p>
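
<p>One way to double-check the encoding is to map every target value to its name and count the classes. This is a minimal sketch; <code>labels</code> and <code>class_counts</code> are names introduced here for illustration.</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Index target_names with the 0/1 targets to get one label per datapoint.
labels = cancer_data['target_names'][cancer_data['target']]

# Count how many datapoints fall into each class.
names, counts = np.unique(labels, return_counts=True)
class_counts = dict(zip(names.tolist(), counts.tolist()))
print(class_counts)
```

<p>The counts should agree with the class distribution in the full dataset description (212 malignant, 357 benign), which confirms that 0 is the malignant class.</p>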

In [9]:
df['target'] = cancer_data['target']
display(df.head())

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
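
<p>The 'frame' key listed earlier points to a shortcut. As a sketch, assuming a scikit-learn version that supports the <code>as_frame</code> parameter, the loader can return the features and target already combined in one DataFrame. The name <code>frame_df</code> is introduced here for illustration.</p>

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks the loader to build pandas objects for us.
cancer_frame = load_breast_cancer(as_frame=True)

# 'frame' holds the 30 feature columns plus a 'target' column.
frame_df = cancer_frame['frame']
print(frame_df.shape)
print(frame_df.columns[-1])
```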


<h3>Build a Logistic Regression Model</h3>

<p>We start by building our feature matrix X and target array y.</p>

In [10]:
X = df[cancer_data.feature_names].values
y = df['target'].values

<p>Now we create a Logistic Regression object and use the fit method to build the model.</p>

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<p>When we run this code we get a ConvergenceWarning. This means that the solver reached its iteration limit before it found the optimal solution. One option is to increase the number of iterations as follows:</p>

In [12]:
model = LogisticRegression(max_iter=3000)
model.fit(X, y)

<p>Alternatively, you can switch to a different solver, which is what we will do. The solver is the algorithm that the model uses to find the coefficients of the decision boundary.</p>

In [13]:
model = LogisticRegression(solver='liblinear')
model.fit(X, y)
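
<p>The warning message also suggests scaling the data. The sketch below shows that third option using <code>StandardScaler</code> in a pipeline with the default solver. It is an illustration rather than the approach used in the rest of this walkthrough, and <code>scaled_model</code> is a name introduced here so that <code>model</code> is left untouched.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer_data = load_breast_cancer()
X, y = cancer_data['data'], cancer_data['target']

# Standardize every feature to zero mean and unit variance,
# then fit logistic regression with its default settings.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X, y)

scaled_accuracy = scaled_model.score(X, y)
print(scaled_accuracy)
```

<p>Features such as area and smoothness are on very different scales, and putting them on a common scale usually lets the solver converge in far fewer iterations.</p>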

<p>Let’s see what the model predicts for the first datapoint in our dataset. Recall that the predict method takes a 2-dimensional array so we must put the datapoint in a list.</p>

In [14]:
print(model.predict([X[0]]))

[0]


<p>The model predicts that the first datapoint is malignant (0).</p>
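
<p>Wrapping the datapoint in a list is only one way to get a 2-dimensional input. The sketch below shows two equivalent alternatives, and also calls <code>predict_proba</code> to see the probability behind the prediction. The block refits the same liblinear model so that it runs on its own.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer_data = load_breast_cancer()
X, y = cancer_data['data'], cancer_data['target']
model = LogisticRegression(solver='liblinear')
model.fit(X, y)

# X[0] has shape (30,); each input below has shape (1, 30).
from_list = model.predict([X[0]])
from_slice = model.predict(X[:1])
from_reshape = model.predict(X[0].reshape(1, -1))
print(from_list, from_slice, from_reshape)

# One probability per class, in the order given by model.classes_.
probabilities = model.predict_proba(X[:1])
print(model.classes_, probabilities)
```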

<p>To see how well the model performs over the whole dataset, we use the score method to see the accuracy of the model.</p>

In [15]:
model.score(X, y)

0.9595782073813708

<strong>We see that the model gets about 96% of the datapoints correct on the data it was trained on.</strong>
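
<p>For a classifier, the score method reports accuracy: the fraction of datapoints whose predicted label matches the true label. The sketch below computes that fraction by hand and prints it next to the score. The names <code>y_pred</code> and <code>manual_accuracy</code> are introduced here for illustration.</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer_data = load_breast_cancer()
X, y = cancer_data['data'], cancer_data['target']
model = LogisticRegression(solver='liblinear')
model.fit(X, y)

# Compare each prediction with the true label; the mean of the
# resulting True/False array is the fraction predicted correctly.
y_pred = model.predict(X)
manual_accuracy = (y_pred == y).mean()

print(manual_accuracy)
print(model.score(X, y))
```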