# Breast Cancer Predicting Linear Classifier

Our goal will be to train a single layer neural network classifier. The model will have 9 inputs and 2 outputs, and will be fed features related to cell attributes. The model will then output a probability distribution over the events of the cell being benign or the cell being malignant.

## collecting our data

We will use a dataset from the UCI Machine Learning Repository. It can be found <a href="https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29">here.</a>

First, our imports.

In [1]:
import tensorflow as tf
import numpy as np
import csv

Couldn't import dot_parser, loading of dot files will not be possible.


After downloading the csv data, we will to parse it and convert it into a numpy matrix. We can delete the first feature of the dataset, which is the sample code. This feature is meaningless to us, we will also convert our code specifying malignancy from 2 and 4 to 0 and 1. This will make it easier to use the cross entropy function to compare our probability distributions and train the model. Also, we will delete any rows with missing features.

In [2]:
def training_data():
    with open('breast_cancer_data.csv') as csv_file:
        csv_reader = csv.reader(csv_file)
        bc_training_data = []
        bc_training_labels = []
        for row in csv_reader:
            try:
                integer_mapped = map(int, row[:-1])
                bc_training_data.append(integer_mapped[:-1])
                if(integer_mapped[-1] == 0):
                    bc_training_labels.append(np.array([0,1]))
                else:
                    bc_training_labels.append(np.array([1,0]))
            except ValueError:
                pass
    return np.array(bc_training_data), np.array(bc_training_labels)

data_x, data_y = training_data()
print 'First feature vector of dataset: {}'.format(data_x[0])
print 'First class label of dataset: {} (not cancerous)'.format(data_y[0])

print 'Sixth feature vector of dataset: {}'.format(data_x[5])
print 'First class label of dataset: {} (cancerous)'.format(data_y[5])

print 'total length of dataset: {}'.format(len(data_x))

First feature vector of dataset: [5 1 1 1 2 1 3 1 1]
First class label of dataset: [0 1] (not cancerous)
Sixth feature vector of dataset: [ 8 10 10  8  7 10  9  7  1]
First class label of dataset: [1 0] (cancerous)
total length of dataset: 121


We will then separate the data into a training and testing set. 

In [3]:
train_x, test_x = data_x[:90], data_x[90:]
train_y, test_y = data_y[:90], data_y[90:]
print 'length of training set: {}'.format(len(train_x))
print 'length of testing set: {}'.format(len(test_x))

length of training set: 90
length of testing set: 31


## setting up our tensorflow model

We will instantiate our two placeholders needed; the placeholder for the matrix of feature vectors and the placeholder for the matrix of class labels. We set the vertical size of the matrices to None because we want to be able to feed an arbitrary amount of feature vectors into the classifier for training.

In [4]:
x = tf.placeholder(dtype = tf.float32, shape = [None,9])
y = tf.placeholder(dtype = tf.float32, shape = [None,2])

We will set up  our tensorflow variables to be initialized. We will only need one weight matrix since this is a linear classifier. Also, a bias vector will be declared. A bias is important to allow our separating plane move about as well. 

We then define the computation to get our matrix of outputs corresponding to each input. We do this by multiplying the input and weight matrix, adding the bias, then applying the softmax function. We apply the softmax function to each row in the output matrix to normalize the values and get probability distributions.

We will then define our cross function as the mean of the cross entropy values between the predicted labels and the correct labels. Also, a gradient descent optimizer will be declared and used for training.

To benchmark our model, we will also compute a vector of True/False values corresponding to correct class predictions, then convert it to a vector of 1's and 0's the compute our score as a percentage. 

In [5]:
weights = tf.Variable(tf.random_normal([9, 2]))
bias = tf.Variable(tf.zeros([1,2]))

y_predictions = tf.nn.softmax(tf.matmul(x, weights) + bias)

y_predicted_classes = tf.argmax(y_predictions, dimension=1)
y_true_class = tf.argmax(y, dimension=1)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_predictions, labels=y))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)
accuracy = tf.reduce_mean(tf.cast(tf.equal(y_predicted_classes, y_true_class), tf.float32))

Instructions for updating:
Use the `axis` argument instead


## training and testing the model

We will defined a training function taking an amount of epochs as input. We will train over the dataset in full batches. A test function to check performance will also be defined.

In [10]:
def train(epochs):
    feed_dict={x:train_x, y:train_y}
    for i in xrange(epochs):
        session.run(optimizer, feed_dict = feed_dict)
        
def test():
    acc = session.run(accuracy, feed_dict={x:test_x, y:test_y})
    print "Accuracy on test-set: {0:.1%}".format(acc)


Let's test our model first without any training, train it for 9000 epochs, then test it again.

In [17]:
session = tf.Session()
session.run(tf.global_variables_initializer())
test()
train(9000)
test()

Accuracy on test-set: 48.4%
Accuracy on test-set: 93.5%


After trying a couple times, we are able to achieve an accuracy at 93.5%. Though our model is fairly accurate, there are definitely avenues for improvement we can take later, perhaps in another notebook.