# UCI MACHINE LEARNING REPOSITORY: Breast Cancer Wisconsin (Diagnostic) Data Set

21 April 2018

Creator: CB-90

Data taken from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 (the site hosts several related data sets, so pay attention to which one you are working on. I used the file wdbc.data.txt.)

The code (in Octave) in this notebook is merely for illustrative purposes. You can find the full code in the folder.

## 1. Description of data set

This data set contains measurements from breast biopsies together with the resulting cancer diagnosis. We want to predict whether the diagnosis is malignant or benign based on the biopsy measurements.

About the data set: We have 569 examples, 30 features, and 2 classes (the file has 32 columns because the first two columns hold the ID and the class).

Class distribution: 357 benign, 212 malignant.

NOTE IF YOU USE OCTAVE: Octave's plain numeric loaders such as `load` and `csvread` do not handle string fields, and the class is given as M (malignant) or B (benign), so you have to replace those letters with digits before loading the file that way. I used 1 for M and 0 for B.
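
If you would rather not edit the file by hand, the replacement can also be done while parsing. The following is a minimal sketch, not the code from the folder: it works on two made-up sample lines with only three feature columns instead of 30, and all variable names are assumptions.

```octave
% Two made-up lines in the wdbc layout: ID, class letter, then feature values
% (only 3 features here instead of 30, to keep the example short).
raw = sprintf("101,M,17.99,10.38,122.8\n102,B,13.54,14.36,87.46\n");

row_strs = strsplit(strtrim(raw), "\n");    % one cell per example
data = zeros(numel(row_strs), 5);           % ID + class + 3 features
for k = 1:numel(row_strs)
  fields = strsplit(row_strs{k}, ",");      % split the line at the commas
  label = strcmp(fields{2}, "M");           % 1 for malignant, 0 for benign
  fields(2) = [];                           % drop the letter, keep the numeric fields
  nums = str2double(fields);                % convert the remaining strings to numbers
  data(k, :) = [nums(1), label, nums(2:end)];
end
```

For the real file, `raw` could presumably come from `fileread("wdbc.data.txt")`, with 32 columns preallocated instead of 5.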

The features in this data set are interesting. There are only ten "real" features, but each feature has three measurements. As the UCI description puts it:

"Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry 

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.  For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius."
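
The column layout described in the quote can be written as simple index arithmetic. This is a small sketch under the assumption that the file keeps the order ID, class, ten means, ten standard errors, ten worst values; `cols_for` is a hypothetical helper name.

```octave
% k is the position of the base feature in the list a) to j):
% 1 = radius, ..., 10 = fractal dimension.
% Columns 1 and 2 hold ID and class, so the mean, SE and worst values of
% feature k sit in columns 2+k, 12+k and 22+k.
cols_for = @(k) 2 + k + [0 10 20];

radius_cols = cols_for(1);     % [3 13 23], matching the fields named in the quote
fractal_cols = cols_for(10);   % [12 22 32], the last column of each block
```

With the data loaded, `data(:, cols_for(4))` would then pick the three area columns.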

## 2. Getting started

After loading the data, we first rearrange it in a random order:

In [None]:
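l = size(data, 1);        % number of examples (rows); length() returns the largest dimension instead
rndIDX = randperm(l);     % random permutation of the row indices
data = data(rndIDX, :);   % reorder the rows, keeping each example intact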

Now we get different training, validation and test sets each time we run the algorithm.
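
As a rough sketch of how the shuffled rows can then be cut into three consecutive blocks: the 60/20/20 proportions, the variable names, and the random placeholder matrix are assumptions for illustration, and the full code may split differently.

```octave
data = rand(569, 32);                 % placeholder with the same shape as the real data
m = size(data, 1);
data = data(randperm(m), :);          % shuffle the rows as above

n_train = floor(0.6 * m);             % assumed 60% for training
n_val = floor(0.2 * m);               % assumed 20% for validation
train_set = data(1:n_train, :);
val_set = data(n_train+1 : n_train+n_val, :);
test_set = data(n_train+n_val+1 : end, :);   % the remaining rows

printf("%d training, %d validation, %d test examples\n", ...
       size(train_set, 1), size(val_set, 1), size(test_set, 1));
```

With 569 examples, these proportions give 341 training, 113 validation, and 115 test rows.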

Then, we want to get an idea about the data, so we visualize the distribution of examples over the different features:

In [None]:
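% X (examples x features), y (0/1 class labels) and num_features are set up in the full code.
idx = 1:l;                       % example indices for the x-axis (the name "range" can shadow a function of the same name)
for i = 1:num_features
  scatter(idx, X(:,i), 50, y);   % one point per example, colored by class
  title("Feature distribution for malignant and benign tumors");
  xlabel("No. examples");
  ylabel("Values");
  pause;                         % wait for a key press before showing the next feature
end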
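
The loop above gives every figure the same title. One way to build a separate title per feature is sketched below; the wording of the titles and the names `base`, `stats` and `titles` are assumptions, and the order assumes the ten means come first, then the ten standard errors, then the ten worst values.

```octave
base = {"Radius", "Texture", "Perimeter", "Area", "Smoothness", ...
        "Compactness", "Concavity", "Concave points", "Symmetry", ...
        "Fractal dimension"};
stats = {"mean", "SE", "worst"};

titles = cell(1, 30);
for s = 1:3
  for k = 1:10
    % feature column (s-1)*10 + k of X holds statistic s of base feature k
    titles{(s - 1) * 10 + k} = sprintf("%s (%s): malignant vs. benign", ...
                                       base{k}, stats{s});
  end
end
```

Inside the plotting loop, `title(titles{i})` would then replace the fixed string.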

Let's take a look at some graphs (I individualized the titles):

<img src="Images/RadiusMean.jpg" width="400" height="400" align="left"/>