# Easily Build a Neural Net for Breast Cancer detection
http://www.laurencemoroney.com/easily-build-a-neural-net-for-breast-cancer-detection/  
In this case, we’ll work with structured data, represented as a CSV file. This has been generated from close inspection of the images of cells that were taken in a biopsy.
  
It’s also possible to work with the images directly, but we chose structured data so that you can adapt this code to a problem you care about. You may not work on cancer detection, but you likely have some structured data of your own, and hopefully the techniques we use here will work for you, too!

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np



In [2]:
tf.__version__

'1.4.0'

In [3]:
# check whether GPU is fine :)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

## Load my data

In [4]:
my_data = pd.read_csv('data/wdbc.csv', delimiter=',')
print(my_data.shape)

(569, 33)


In [5]:
my_data.columns.values

array(['id', 'diagnosis', 'diagnosis_numeric', 'radius', 'texture',
       'perimeter', 'area', 'smoothness', 'compactness', 'concavity',
       'concave_points', 'symmetry', 'fractal_dimension', 'radius_se',
       'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se',
       'symmetry_se', 'fractal_dimension_se', 'radius_worse',
       'texture_worst', 'perimeter_worst', 'area_worst',
       'smoothness_worst', 'compactness_worst', 'concavity_worst',
       'concave_points_worst', 'symmetry_worst',
       'fractal_dimension_worst'], dtype=object)

In [6]:
my_data.dtypes

id                           int64
diagnosis                   object
diagnosis_numeric            int64
radius                     float64
texture                    float64
perimeter                  float64
area                       float64
smoothness                 float64
compactness                float64
concavity                  float64
concave_points             float64
symmetry                   float64
fractal_dimension          float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave_points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worse               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst   

In [7]:
my_data.head()

Unnamed: 0,id,diagnosis,diagnosis_numeric,radius,texture,perimeter,area,smoothness,compactness,concavity,...,radius_worse,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,8510426,B,0,13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,...,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259
1,8510653,B,0,13.08,15.71,85.63,520.0,0.1075,0.127,0.04568,...,14.5,20.49,96.09,630.5,0.1312,0.2776,0.189,0.07283,0.3184,0.08183
2,8510824,B,0,9.504,12.44,60.34,273.9,0.1024,0.06492,0.02956,...,10.23,15.66,65.13,314.9,0.1324,0.1148,0.08867,0.06227,0.245,0.07773
3,854941,B,0,13.03,18.42,82.61,523.8,0.08983,0.03766,0.02562,...,13.3,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169
4,85713702,B,0,8.196,16.84,51.71,201.9,0.086,0.05943,0.01588,...,8.964,21.96,57.26,242.2,0.1297,0.1357,0.0688,0.02564,0.3105,0.07409
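If your own CSV has only a text label (like the diagnosis column here) and no numeric version of it, you would need to derive one before training. Here is a minimal sketch, assuming the B = 0 and M = 1 encoding that the rows in this notebook's outputs suggest. The tiny DataFrame is a stand-in built from three of those rows, not the real file:

```python
import pandas as pd

# Stand-in for the real CSV: ids and diagnoses copied from outputs shown in this notebook.
sample = pd.DataFrame({
    "id": [8510426, 8510653, 91504],
    "diagnosis": ["B", "B", "M"],
})

# Assumed encoding: benign -> 0, malignant -> 1.
label_map = {"B": 0, "M": 1}
sample["diagnosis_numeric"] = sample["diagnosis"].map(label_map)

# Any value missing from label_map becomes NaN, so check before training.
assert sample["diagnosis_numeric"].notnull().all()
print(sample)
```

The check at the end matters because map silently produces NaN for labels it does not recognize, such as a stray lowercase "m".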


## Model configuration

In [8]:
# The data needs to be split into a training set and a test set
# To use 80/20, set the training size to .8
training_set_size_portion = .8

# Set to True to shuffle the data before you split into training and test sets
do_shuffle = True

# Keep track of the accuracy score
accuracy_score = 0

# The DNN has hidden units, set the spec for them here
hidden_units_spec = [10,20,10]
n_classes_spec = 2

# Define the temp directory for keeping the model and checkpoints
tmp_dir_spec = "tmp/model"

# The number of training steps
steps_spec = 2000

# The number of epochs
epochs_spec = 15

# Here's a set of our features. If you look at the CSV, 
# you'll see these are the names of the columns. 
# In this case, we'll just use all of them:
#features = ['radius','texture']
features = my_data.columns.values[3:]

# Here's the label that we want to predict -- it's also a column in the CSV
labels = ['diagnosis_numeric']

In [9]:
features

array(['radius', 'texture', 'perimeter', 'area', 'smoothness',
       'compactness', 'concavity', 'concave_points', 'symmetry',
       'fractal_dimension', 'radius_se', 'texture_se', 'perimeter_se',
       'area_se', 'smoothness_se', 'compactness_se', 'concavity_se',
       'concave_points_se', 'symmetry_se', 'fractal_dimension_se',
       'radius_worse', 'texture_worst', 'perimeter_worst', 'area_worst',
       'smoothness_worst', 'compactness_worst', 'concavity_worst',
       'concave_points_worst', 'symmetry_worst',
       'fractal_dimension_worst'], dtype=object)

## Create Training and Test sets based on our specified columns

Your data might already be ordered in some way, and this could affect both training and testing.   
For example, if the breast cancer records were sorted by size, the items at the beginning would be more likely to be benign and the ones at the end more likely to be malignant. You would then train mostly on benign data and test mostly on malignant data, which isn’t representative.   
It’s always a good idea to shuffle your data before you split it into training and test sets:

In [10]:
# The pandas DataFrame allows you to shuffle with the reindex method
# Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
# If the do_shuffle setting is True, we will shuffle with this
# You really SHOULD shuffle to make sure that trends in data don't affect your learning
# but I make it optional here so you can choose

if do_shuffle:
    randomized_data = my_data.reindex(np.random.permutation(my_data.index))
else:
    randomized_data = my_data
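
Because np.random.permutation draws a different order on every run, the split (and so the accuracy reported later) will vary from run to run. Here is a minimal sketch of a repeatable shuffle, assuming a seed of 42 (an arbitrary choice, not from the original notebook) and a small stand-in DataFrame:

```python
import numpy as np
import pandas as pd

# Stand-in for my_data; any DataFrame works the same way.
demo = pd.DataFrame({"radius": [13.54, 13.08, 9.504, 13.03, 8.196]})

# A seeded generator produces the same permutation on every run.
rng = np.random.RandomState(42)  # 42 is an arbitrary, assumed seed
shuffled = demo.reindex(rng.permutation(demo.index.values))

# Shuffling reorders the rows but keeps every row exactly once.
assert sorted(shuffled.index) == list(demo.index)
print(shuffled)
```

Fixing the seed is mainly useful while debugging or comparing settings; for a final estimate you may prefer to average over several different shuffles.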

In [11]:
randomized_data.head()

Unnamed: 0,id,diagnosis,diagnosis_numeric,radius,texture,perimeter,area,smoothness,compactness,concavity,...,radius_worse,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
242,906290,B,0,11.16,21.41,70.95,380.3,0.1018,0.05978,0.008955,...,12.36,28.92,79.26,458.0,0.1282,0.1108,0.03582,0.04306,0.2976,0.07123
552,91504,M,1,13.82,24.49,92.33,595.9,0.1162,0.1681,0.1357,...,16.01,32.94,106.0,788.0,0.1794,0.3966,0.3381,0.1521,0.3651,0.1183
502,892189,M,1,11.76,18.14,75.0,431.1,0.09968,0.05914,0.02685,...,13.36,23.39,85.1,553.6,0.1137,0.07974,0.0612,0.0716,0.1978,0.06915
347,924342,B,0,9.333,21.94,59.01,264.0,0.0924,0.05605,0.03996,...,9.845,25.05,62.86,295.8,0.1103,0.08298,0.07993,0.02564,0.2435,0.07393
430,866083,M,1,13.61,24.69,87.76,572.6,0.09258,0.07862,0.05285,...,16.89,35.64,113.2,848.7,0.1471,0.2884,0.3796,0.1329,0.347,0.079


In [12]:
total_records = len(randomized_data)
training_set_size = int(total_records * training_set_size_portion)
test_set_size = total_records - training_set_size
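
With the 569 rows reported by my_data.shape and an 80/20 split, the arithmetic in the cell above works out as follows. This is only a worked check of the same formula, using the row count from the earlier output:

```python
total_records = 569               # rows reported by my_data.shape
training_set_size_portion = 0.8

# int() truncates 455.2 down to 455, so the test set receives the remainder.
training_set_size = int(total_records * training_set_size_portion)
test_set_size = total_records - training_set_size

print(training_set_size, test_set_size)  # 455 114
```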

In [13]:
# Build the training features and labels
training_features = randomized_data.head(training_set_size)[features].copy()
training_labels = randomized_data.head(training_set_size)[labels].copy()
print(training_features.head())
print(training_labels.head())

     radius  texture  perimeter   area  smoothness  compactness  concavity  \
242  11.160    21.41      70.95  380.3     0.10180      0.05978   0.008955   
552  13.820    24.49      92.33  595.9     0.11620      0.16810   0.135700   
502  11.760    18.14      75.00  431.1     0.09968      0.05914   0.026850   
347   9.333    21.94      59.01  264.0     0.09240      0.05605   0.039960   
430  13.610    24.69      87.76  572.6     0.09258      0.07862   0.052850   

     concave_points  symmetry  fractal_dimension           ...             \
242         0.01076    0.1615            0.06144           ...              
552         0.06759    0.2275            0.07237           ...              
502         0.03515    0.1619            0.06287           ...              
347         0.01282    0.1692            0.06576           ...              
430         0.03085    0.1761            0.06130           ...              

     radius_worse  texture_worst  perimeter_worst  area_worst  \
242

In [14]:
# Build the testing features and labels
testing_features = randomized_data.tail(test_set_size)[features].copy()
testing_labels = randomized_data.tail(test_set_size)[labels].copy()

## Create TensorFlow Feature Columns
The neural network classifier expects the feature columns to be specified as tf.feature_column types.   
As all of our columns are numeric, we set them to numeric_column types.

In [15]:
feature_columns = [tf.feature_column.numeric_column(key) for key in features]

In [16]:
feature_columns

[_NumericColumn(key='radius', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='texture', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='perimeter', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='area', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='smoothness', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='compactness', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='concavity', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='concave_points', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='symmetry', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='fractal_dimension', shape=(1,), default_value=

## Define the Neural Network used to classify the data
Now that the data is prepared, we can create the neural network object that we’ll train on it. It takes the feature columns that you just created, along with parameters defining the hidden units in the network and the number of classes. As it trains the network, it saves temporary files, checkpoints, and the finished model to the specified model directory.  

The hidden units are a direct specification of what the network looks like. Our default here is [10, 20, 10], which means there’ll be a first hidden layer of 10 neurons, each connected to the 20 neurons of the second layer, each of which is in turn connected to the 10 neurons of the third layer.   

The classes setting is the number of classes we are classifying into. In this case we’re doing breast cancer classification with 2 classes, benign and malignant, so we set it to 2.
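
To make the [10, 20, 10] specification concrete, the sketch below counts the weights and biases in a fully connected network of that shape with 30 input features. It assumes each layer is a plain dense layer with one bias per neuron, and it treats the output width as a parameter (output_units, a name introduced here for illustration), since a two-class classifier can be built with either one or two output units depending on the implementation:

```python
def count_parameters(n_inputs, hidden_units, output_units):
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs] + list(hidden_units) + [output_units]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus one bias per neuron
    return total

# 30 features, hidden layers of 10, 20 and 10 neurons.
print(count_parameters(30, [10, 20, 10], output_units=1))  # 751
print(count_parameters(30, [10, 20, 10], output_units=2))  # 762
```

Either way the network has fewer than a thousand trainable values, which is small enough to train in a couple of seconds on 455 rows.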

In [17]:
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns, hidden_units=hidden_units_spec, 
                                        n_classes=n_classes_spec, model_dir=tmp_dir_spec)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efcc7771a10>, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': 'tmp/model', '_save_summary_steps': 100}


## Train the network
The next step is to train the classifier using the data. To do this, you build an input function that specifies the features (aka ‘x’) and the labels (aka ‘y’). This is done by specifying it as a pandas_input_fn:

In [18]:
# Define the training input function
train_input_fn = tf.estimator.inputs.pandas_input_fn(x=training_features, y=training_labels, num_epochs=epochs_spec, shuffle=True)

And now you can train the neural network by giving it the input function, and the number of steps you want to use to train it.   
Experiment with different step numbers to get different results. In the case of the breast cancer data, with 2000 steps, I usually get 90%+ accuracy against the test set.

In [20]:
%%time
# Train the model using the classifier.
classifier.train(input_fn=train_input_fn, steps=steps_spec)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into tmp/model/model.ckpt.
INFO:tensorflow:loss = 513.96954, step = 1
INFO:tensorflow:Saving checkpoints for 54 into tmp/model/model.ckpt.
INFO:tensorflow:Loss for final step: 14.5733595.
CPU times: user 1.92 s, sys: 72.5 ms, total: 1.99 s
Wall time: 1.49 s


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7efcc46c96d0>
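
The log above stops at step 54 rather than 2000. That is consistent with the input function running out of data first: pandas_input_fn uses a batch size of 128 unless told otherwise, and it was given 15 epochs over the training rows. Here is a quick check of that arithmetic, assuming the 455 training rows from the 80/20 split of 569 records:

```python
import math

training_rows = 455   # int(569 * 0.8), from the split above
epochs = 15           # epochs_spec
batch_size = 128      # pandas_input_fn default, not set explicitly in this notebook

# Training ends when the input function is exhausted or `steps` is reached, whichever comes first.
steps_available = int(math.ceil(training_rows * epochs / float(batch_size)))
print(steps_available)  # 54
```

If you want the steps setting to be the limit that applies, one option is to pass num_epochs=None to the training input function so that it repeats the data indefinitely, or to raise epochs_spec.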

## Test the network
Similar to training a model, we test the model by specifying an input function in exactly the same way, except of course we pass in the testing features and labels:

In [21]:
# Define the test input function
test_input_fn = tf.estimator.inputs.pandas_input_fn(x=testing_features, y=testing_labels, num_epochs=epochs_spec, shuffle=False)

Now we can ask the classifier to evaluate the test input function and tell us its accuracy. It goes through the test set, compares its classifications to the actual values, and uses this to calculate how often it was right, giving us an accuracy score:

In [22]:
# Evaluate accuracy.
accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
print("Accuracy = {}".format(accuracy_score))

INFO:tensorflow:Starting evaluation at 2018-03-10-04:21:02
INFO:tensorflow:Restoring parameters from tmp/model/model.ckpt-54
INFO:tensorflow:Finished evaluation at 2018-03-10-04:21:02
INFO:tensorflow:Saving dict for global step 54: accuracy = 0.877193, accuracy_baseline = 0.5526316, auc = 0.9486461, auc_precision_recall = 0.9511447, average_loss = 0.3304201, global_step = 54, label/mean = 0.4473684, loss = 40.358456, prediction/mean = 0.36682603
Accuracy = 0.877192974091
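
The figures in the evaluation log can be reproduced by hand. With 114 test records, an accuracy of 0.877193 corresponds to 100 correct classifications, and accuracy_baseline (0.5526316) is what you would score by always predicting the more common class. The counts of 100 correct and 63 majority-class records are inferred from the logged ratios; the notebook does not print them:

```python
test_records = 114        # 569 total minus 455 training records
correct = 100             # inferred: 100 / 114 matches the logged accuracy
majority_class = 63       # inferred: 63 / 114 matches accuracy_baseline

accuracy = correct / float(test_records)
baseline = majority_class / float(test_records)

print(round(accuracy, 6))   # 0.877193
print(round(baseline, 7))   # 0.5526316
```

Comparing against the baseline is a useful habit: an accuracy figure only means something relative to what a trivial always-guess-the-majority classifier would achieve on the same split.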


## Use the network
Now that you have a trained and tested network, you likely want to see what it predicts for data that it hasn’t already seen. Let’s take a look at how to do that, and how to read the results.  

First of all, your prediction set must contain the same feature columns that the model was trained on. The configuration above trains on all 30 numeric features, so with that setup a prediction DataFrame needs all 30 columns. The simpler example below applies if you instead train on just two features, by enabling the commented-out features = ['radius','texture'] line in the model configuration. In that case you can create a prediction set of two cells, one with a radius of 14 and a texture of 25, the other with a radius of 13 and a texture of 26.  

In [None]:
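# Create a prediction set -- this is a list of input features that you want to classify.
# Its columns must match the features the classifier was trained on. This two-column
# example only works if you trained with features = ['radius', 'texture'].
prediction_set = pd.DataFrame({'radius':[14, 13], 'texture':[25, 26]})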

predict_input_fn = tf.estimator.inputs.pandas_input_fn(x=prediction_set, num_epochs=1, shuffle=False)

# Get a list of the predictions
predictions = list(classifier.predict(input_fn=predict_input_fn))

predicted_classes = [p["classes"] for p in predictions] 
results=np.concatenate(predicted_classes) 
print(results)
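
The printed results array holds class labels as strings (byte strings under Python 3), one per row of the prediction set. Here is a minimal sketch of turning them into readable diagnoses, assuming the 0 = benign, 1 = malignant encoding of diagnosis_numeric. The sample_results values are hypothetical placeholders, not output from the trained model:

```python
# Hypothetical placeholder for the `results` array printed above.
sample_results = [b"1", b"0"]

# Assumed encoding, matching the diagnosis / diagnosis_numeric columns.
diagnosis_names = {"0": "benign", "1": "malignant"}

def to_text(label):
    """Decode byte-string labels (Python 3) and pass plain strings through."""
    return label.decode("utf-8") if isinstance(label, bytes) else label

for label in sample_results:
    print(diagnosis_names[to_text(label)])
```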

## Summary
You’ve done a lot in a very short time. Not only have you trained a neural network to classify breast cancer data from the Wisconsin database, you have also written code that can be easily adapted to provide classification for any CSV file (within reason). Take it for a spin, and let me know your experience in the comments below!