# Project 2: Classification of Breast Cancer Data

*Projects are an important part of the learning and assessment process in this course. Plan on committing considerable resources to each Project.*

## Review of the Breast Cancer Dataset

> Breast cancer is cancer that forms in the cells of the breasts. After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States. Breast cancer can occur in both men and women, but it's far more common in women. -- [MayoClinic.org](https://www.mayoclinic.org/diseases-conditions/breast-cancer/symptoms-causes/syc-20352470)

> A tumor can be **benign** (not dangerous to health) or **malignant** (has the potential to be dangerous). Benign tumors are not considered cancerous: their cells are close to normal in appearance, they grow slowly, and they do not invade nearby tissues or spread to other parts of the body. Malignant tumors are cancerous. Left unchecked, malignant cells eventually can spread beyond the original tumor to other parts of the body. -- [BreastCancer.org](https://www.breastcancer.org/symptoms/understand_bc/what_is_bc)

## Overview of the Data

This data was originally obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. It has become a standard machine learning example and can be found both at the [University of California-Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) and in the [Kaggle Data sets](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data). Both of these sites can provide additional details on the data.

The dataset contains over 500 samples of malignant and benign tumor cells.

- The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
- The columns 3-32 contain 30 real-value features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

## Classification Algorithms

You will try out the following traditional classification algorithms on this data set:

- K-Nearest Neighbors (from sklearn.neighbors import KNeighborsClassifier)
- Support Vector Machines, or SVM (from sklearn.svm import SVC)
- Decision Trees (from sklearn.tree import DecisionTreeClassifier)

*Note: in the next unit we will revisit these, adding dimensionality reduction methods like PCA or LDA.*

## Setup








**Setting up Python tools:**

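We'll use four libraries for handling and plotting the data in this tutorial: [pandas](http://pandas.pydata.org/), [NumPy](https://numpy.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/). The classifiers themselves come from sklearn.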




In [0]:
# First, we'll import pandas and numpy, two data processing libraries
import pandas as pd
import numpy as np

# We'll also import seaborn and matplotlib, two Python graphing libraries
import seaborn as sns
import matplotlib.pyplot as plt
#sns.set(style="white", color_codes=True)

# We will turn off some warnings in this notebook to make it easier to read for new students
import warnings
warnings.filterwarnings('ignore')

## Read in the Wisconsin Breast Cancer data
The Breast Cancer data is read in from a file stored on the internet.

It is stored in a pandas DataFrame, which is similar to a spreadsheet in that the data is stored in rows and columns.

In [0]:
# Read in the data from a raw file stored on GitHub
url = 'https://raw.githubusercontent.com/CIS3115-Machine-Learning-Scholastica/CIS3115ML-Units3and4/master/breast-cancer-wisconsin-data.csv'

cancer = pd.read_csv(url)
# Set the id column as the index since it is unique for each patient
cancer.set_index('id', inplace=True)

In [0]:
# Display the first 5 rows at the start, or head, of the dataframe
cancer.head(5)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## Prep the data

- Remove the last column which is blank
- Set up the input cancer features, X, and the output diagnosis, y
- Scale the data to put large features like area_mean on the same footing as small features like smoothness_mean
- Split the data into train and testing sets

In [0]:
# Drop the last column, 'Unnamed: 32', which is full of blank values
cancer.drop(['Unnamed: 32'], axis=1, inplace=True)
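
To see where a blank column like 'Unnamed: 32' can come from, here is a minimal sketch on a made-up three-row CSV (the ids and values are invented for illustration). It assumes the cause is a trailing comma at the end of each line, which is one common way pandas ends up with an empty "Unnamed" column; the real file may differ. It also shows `dropna` with `how="all"` as an alternative that removes any all-blank column without naming it.

```python
import io
import pandas as pd

# Made-up CSV text: every line ends in a trailing comma, so pandas
# reads one extra, empty column and names it "Unnamed: 3".
csv_text = """id,diagnosis,radius_mean,
101,M,17.99,
102,B,11.42,
103,B,12.05,
"""

toy = pd.read_csv(io.StringIO(csv_text))
toy.set_index("id", inplace=True)
print("Columns after loading: ", list(toy.columns))

# Drop every column that contains only missing values, whatever its name
toy = toy.dropna(axis=1, how="all")
print("Columns after cleanup: ", list(toy.columns))
print("Shape after cleanup:   ", toy.shape)
```

Dropping by name, as the cell above does, is more explicit; `dropna` is handy when you are not sure what the blank column is called.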

In [0]:
# List out all the feature column names (skipping diagnosis). We might need this later
columnNames = list(cancer)[1:]
print(columnNames)

['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']


In [0]:
# === Select all the data for input ===
X = cancer.iloc[:, 1:31] 

# === Select only certain important fields ===
# feature_columns = ['radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'perimeter_se', 'concavity_se']
# X = cancer[feature_columns].values 

# The output is the diagnosis where M is Malignant and B is Benign
y = cancer['diagnosis'].values

In [0]:
from sklearn.model_selection import train_test_split
# Split the data into 80% for training and 20% for testing out the models
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), test_size=0.2)

# print the training data to verify it looks correct
X_train

id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
859196,9.173,13.86,59.20,260.9,0.07721,0.08751,0.059880,0.021800,0.2341,0.06963,...,10.010,19.23,65.59,310.1,0.09836,0.16780,0.139700,0.05087,0.3282,0.08490
854941,13.030,18.42,82.61,523.8,0.08983,0.03766,0.025620,0.029230,0.1467,0.05863,...,13.300,22.81,84.46,545.9,0.09701,0.04619,0.048330,0.05013,0.1987,0.06169
857155,12.050,14.63,78.04,449.3,0.10310,0.09092,0.065920,0.027490,0.1675,0.06043,...,13.760,20.70,89.88,582.6,0.14940,0.21560,0.305000,0.06548,0.2747,0.08301
903507,15.490,19.97,102.40,744.7,0.11600,0.15620,0.189100,0.091130,0.1929,0.06744,...,21.200,29.41,142.10,1359.0,0.16810,0.39130,0.555300,0.21210,0.3187,0.10190
905189,16.140,14.86,104.30,800.0,0.09495,0.08501,0.055000,0.045280,0.1735,0.05875,...,17.710,19.58,115.90,947.9,0.12060,0.17220,0.231000,0.11290,0.2778,0.07012
91813701,13.460,18.75,87.44,551.1,0.10750,0.11380,0.042010,0.031520,0.1723,0.06317,...,15.350,25.16,101.90,719.8,0.16240,0.31240,0.265400,0.14270,0.3518,0.08665
9011495,12.210,18.02,78.31,458.4,0.09231,0.07175,0.043920,0.020270,0.1695,0.05916,...,14.290,24.04,93.85,624.6,0.13680,0.21700,0.241300,0.08829,0.3218,0.07470
90401601,13.510,18.89,88.10,558.1,0.10590,0.11470,0.085800,0.053810,0.1806,0.06079,...,14.800,27.20,97.33,675.2,0.14280,0.25700,0.343800,0.14530,0.2666,0.07686
866458,15.100,16.39,99.58,674.5,0.11500,0.18070,0.113800,0.085340,0.2001,0.06467,...,16.110,18.33,105.90,762.6,0.13860,0.28830,0.196000,0.14230,0.2590,0.07779
866203,19.000,18.91,123.40,1138.0,0.08217,0.08028,0.092710,0.056270,0.1946,0.05044,...,22.320,25.73,148.20,1538.0,0.10210,0.22640,0.320700,0.12180,0.2841,0.06541
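
Because `train_test_split` shuffles the rows randomly, every run of the cell above produces a different split and therefore slightly different scores. The sketch below, on made-up toy data rather than the cancer file, shows two optional arguments that can make comparisons steadier: `random_state` fixes the shuffle, and `stratify` keeps the share of each class the same in both subsets. The value 42 and the 30/70 class mix are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the cancer data: 100 samples, 3 features, 30% "M" labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array(["M"] * 30 + ["B"] * 70)

# random_state makes the split repeatable;
# stratify keeps the M/B ratio the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_m_share = np.mean(y_tr == "M")
test_m_share = np.mean(y_te == "M")
print("Malignant share in train:", train_m_share)
print("Malignant share in test: ", test_m_share)
```

If you do fix `random_state` in your own experiments, remember that the score you report then describes one particular split.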


## Task 1: Classifying using K-Nearest Neighbors

The code below runs the KNN algorithm initially with 5 neighbors. Try this code out with different numbers of neighbors and try to find an optimal value. The score is the accuracy on the test set, so the closer it is to 1.0 (100%), the better.

---

Describe the values you tried and the optimal one you found here.

Write up a paragraph analyzing your results. What were the highest and lowest scores you got? Was there a wide range in scores? What are your thoughts on why you got the range of scores you did?

---

Put your analysis here...


In [0]:
from sklearn.neighbors import KNeighborsClassifier
# Set up the K-Nearest Neighbors model. Change the value of n_neighbors
knn_model = KNeighborsClassifier(n_neighbors=5)
# Train the model on the breast cancer data
knn_model.fit(X_train, y_train)
score = knn_model.score(X_test, y_test)
print ("The score for this model is ", score)

The score for this model is  0.9385964912280702
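
One way to organize the search over `n_neighbors` is a simple loop that records the score for each value. This is only a sketch: to stay self-contained it assumes scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`, with labels encoded as 0/1 rather than M/B) instead of the CSV used in this notebook, and the list of k values and the `random_state` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset stands in for the notebook's CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit one model per candidate k and record its test accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11, 15, 21]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)
    print(f"n_neighbors={k:2d}  accuracy={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k on this particular split:", best_k)
```

The best value can change from one random split to the next, which is worth keeping in mind when you describe the range of scores you saw.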


## Task 2: Scaling the data

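In the cancer data some features have large measurements, like area_mean, which can get as high as 2,500. Other features are very small, like smoothness_mean, where 0.07 is a common value.

This wide range distorts many classification algorithms, especially distance-based ones like KNN, because the features with larger values dominate the result. So we generally rescale each feature before training. One option converts the values into the range 0.0 to 1.0 (MinMaxScaler); the StandardScaler used below instead shifts and rescales each feature to have a mean of 0 and a standard deviation of 1.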
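
The two scalers imported in the cell below behave differently: `MinMaxScaler` maps each feature into the range 0.0 to 1.0, while `StandardScaler` (the one the cell actually uses) gives each feature a mean of 0 and a standard deviation of 1, so its output can be negative or larger than 1. A small sketch with made-up numbers, loosely modeled on area_mean and smoothness_mean, shows the difference:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two invented columns on very different scales (illustrative values only)
demo = np.array([
    [400.0, 0.07],
    [650.0, 0.09],
    [1200.0, 0.10],
    [2500.0, 0.12],
])

minmax = MinMaxScaler().fit_transform(demo)
standard = StandardScaler().fit_transform(demo)

print("MinMaxScaler: each column runs from 0.0 to 1.0")
print(minmax.round(3))
print("StandardScaler: each column has mean 0 and standard deviation 1")
print(standard.round(3))
```

Either way, both columns end up on a comparable scale, which is the point of this step.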


---

The following cell scales all the features in X using a standard scaling method built into the sklearn library. 

The two cells after that redo the train-test split using the scaled data and run the KNN algorithm.

Repeat your analysis from above and see if scaling the data improves your results. Describe your findings below...

---

Put your analysis here...

In [0]:
# Scale the data to put large features like area_mean on the same footing as small features like smoothness_mean
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [0]:
from sklearn.model_selection import train_test_split
# Split the data into 80% for training and 20% for testing out the models
X_train, X_test, y_train, y_test = train_test_split(X, y.ravel(), test_size=0.2)

# The scaling method converts the data from a pandas DataFrame into a NumPy array, so it will look different below but should work the same
X_train

array([[ 1.59124868,  0.12341635,  1.59533627, ...,  2.07224515,
        -0.24550745,  2.53550544],
       [ 1.51172471,  0.00939006,  1.42233743, ...,  1.298734  ,
         0.77369434,  0.30778958],
       [ 1.54012613,  0.91229211,  1.52119391, ...,  1.0337912 ,
        -0.52538349, -0.43921564],
       ...,
       [-0.94499809,  0.62606285, -0.95474903, ..., -0.605352  ,
         0.10393316, -0.40596615],
       [-0.47353452, -1.50320357, -0.54119942, ..., -1.33699001,
        -1.00424655, -0.75730243],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528]])

In [0]:
from sklearn.neighbors import KNeighborsClassifier
# Set up the K-Nearest Neighbors model. Change the value of n_neighbors
knn_model = KNeighborsClassifier(n_neighbors=5)
# Train the model on the scaled breast cancer data
knn_model.fit(X_train, y_train)
score = knn_model.score(X_test, y_test)
print ("The score for this model is ", score)

The score for this model is  0.956140350877193
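
The cells above call `scaler.fit_transform` on all of X before the train-test split, so the scaler's mean and standard deviation are computed partly from rows that later land in the test set. That is fine for this exercise, but a common alternative is to fit the scaler on the training rows only. The sketch below does this with `make_pipeline`; it assumes scikit-learn's bundled copy of the data (labels 0/1) in place of the notebook's CSV, and an arbitrary `random_state`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the notebook's CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Baseline: KNN on the raw, unscaled features
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
raw_score = raw_knn.score(X_te, y_te)

# Pipeline: the scaler is fitted on the training rows only, then the
# same transformation is applied to the test rows when scoring.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)
scaled_score = scaled_knn.score(X_te, y_te)

print(f"Unscaled KNN accuracy: {raw_score:.3f}")
print(f"Scaled KNN accuracy:   {scaled_score:.3f}")
```

Because both models see exactly the same split here, any difference between the two scores comes from the scaling rather than from a different shuffle.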


## Task 3: Experiment with SVM and Decision Trees

The cells below implement SVM and Decision Tree classification. Try both of these out--you can use different kernels and values of the C penalty parameter in SVM.

Keep track of your results to include in the writeup below

## Support Vector Machine (SVM)
### Kernels
Support Vector Machines have different ways of defining the lines or hyperplanes separating the data into classes. These are called kernels in our software. You will try out two:
- linear - separates the classes with straight lines (flat hyperplanes)
- rbf - [Radial Basis Function](http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html), which can produce curved boundaries

### C penalty parameter
Besides the kernel, there are a number of other parameters you can set on the SVM algorithm. We will look at only one, called "C", which is the penalty the algorithm pays for misclassifying a training point. As C grows, the algorithm works harder to classify every training point correctly, at the cost of a narrower margin and a greater risk of overfitting. Smaller values of C tolerate more misclassified points in exchange for a wider margin. For a good overview, see the [second answer to this Cross Validated (Stack Exchange) question](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel).

In [0]:
from sklearn.svm import SVC

# Set up the SVM model with a given kernel and C parameter
svm_model = SVC(C=1.0, kernel='linear')          # linear SVM
#svm_model = SVC(C=10.0, kernel='rbf')           # non-linear SVM

# Train the model on the breast cancer data
svm_model.fit(X_train, y_train)
score = svm_model.score(X_test, y_test)
print ("The score for this model is ", score)
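
To keep track of results across kernels and C values, one option is to loop over both and store the scores in a dictionary. The sketch below is an illustration rather than the required approach: it assumes scikit-learn's bundled copy of the data (labels 0/1) instead of the notebook's CSV, uses 5-fold cross-validation via `cross_val_score` in place of a single train-test split, and the grid of C values is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the notebook's CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)

svm_results = {}
for kernel in ["linear", "rbf"]:
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        # Scale inside the pipeline so each fold is scaled on its own training rows
        model = make_pipeline(StandardScaler(), SVC(C=C, kernel=kernel))
        fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
        svm_results[(kernel, C)] = fold_scores.mean()
        print(f"kernel={kernel:6s}  C={C:6.2f}  mean accuracy={fold_scores.mean():.3f}")

best_kernel, best_C = max(svm_results, key=svm_results.get)
print("Best setting in this sweep:", best_kernel, best_C)
```

Averaging over five folds smooths out some of the luck involved in a single random split, which makes small differences between settings easier to interpret.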

## Decision Trees 

In [0]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier()

# Train the model on the breast cancer data
DT_model.fit(X_train, y_train)
score = DT_model.score(X_test, y_test)
print ("The score for this model is ", score)

In [0]:
# Display the data size after all the errors have been removed.
print ("Final data size is: ", cancer.shape)
# While we won't share complete solutions to the projects in the Blackboard discussions
# you are allowed to share hints and information like the final number of good records you had.

Final data size is:  (573, 31)


## Writeup 1: Analyzing the three classification algorithms

In two or three paragraphs, analyze the three classification algorithms you used (KNN, SVM, and DT). How did they perform relative to each other on the breast cancer data? Which do you think is the best one to use on this data set? Why do you think some did not perform as well as others?

---

Put your writeup here...

## Writeup 2: Using all the data vs selecting features to use

Redo some of your analysis above using only selected features rather than all the features. Look at your analysis of the cancer data for Project 1 and try to determine 4-6 of the most important features.

Modify the code in the Prep the Data section at the start of the notebook and uncomment these lines:


```
# === Select only certain important fields ===
# feature_columns = ['radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'perimeter_se', 'concavity_se']
# X = cancer[feature_columns].values 
```

You can replace the feature names, like 'radius_mean',  with the features you want to try. 

Re-run your most promising analysis and see if it improves using only the select features.
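
A side-by-side comparison of all features against a selected subset might look like the sketch below. Several things here are assumptions made to keep it self-contained: it uses scikit-learn's bundled copy of the data, whose columns are named "mean radius" rather than 'radius_mean'; it uses scaled KNN with 5-fold cross-validation as the model; and `mean_cv_accuracy` is a hypothetical helper written for this example. The six features are the same ones listed in the commented-out lines above, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the notebook's CSV;
# as_frame=True returns the features as a pandas DataFrame.
data = load_breast_cancer(as_frame=True)
X_all, y_demo = data.data, data.target

# Same six features as the commented-out example, in sklearn's naming
subset = ["mean radius", "mean texture", "mean perimeter",
          "mean area", "perimeter error", "concavity error"]

def mean_cv_accuracy(features):
    """Hypothetical helper: mean 5-fold accuracy of scaled KNN on the given columns."""
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, features, y_demo, cv=5).mean()

all_score = mean_cv_accuracy(X_all)
subset_score = mean_cv_accuracy(X_all[subset])

print(f"All {X_all.shape[1]} features: {all_score:.3f}")
print(f"Selected {len(subset)} features: {subset_score:.3f}")
```

Swapping different names into `subset` lets you test your own choices from Project 1 under the same conditions.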

Write two paragraphs describing which features you selected and how this affected the score.


---

Put your writeup here...



## Writeup 3: Bias in Data

This breast cancer data was collected in the early 1990s at the University of Wisconsin. You have not been given much information about how the data was collected.

For example, you don't know the ages of the patients, and university studies have a history of data biased toward college-aged students. We also don't know the country of origin or the ethnicity of the patients.

Assume you were creating a machine learning model to diagnose breast cancer for hospitals in South Korea or Peru. Why would the characteristics of the patient population be important in this case? Why is bias in the training data important when creating machine learning models?


---

Put your writeup here...

## Wrapping Up

Remember to share this sheet with your instructor and submit a link to it in Blackboard.