# Model Development Challenge: Project

For this project, use what you have learned in Units 1-4 to execute a data science workflow on the breast cancer dataset from sklearn. You must:

- Load the breast cancer dataset from sklearn
- Examine your data using pandas
- Perform statistical analysis for the numeric features in the dataset
- Visualize each pair of features to look for a correlation; state the type of trend or correlation that you see, if any, and explain why
- Train 3 different classifiers or regressors from sklearn, chosen to suit the target variable, and explain why you chose that type of model
- Compare the 3 models that you trained and tested
- Explain the conclusions that you drew from your comparisons

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

import numpy
import pandas
import toyplot

## Load Dataset

In [2]:
data = load_breast_cancer()

## Examine Dataset

In [3]:
X = pandas.DataFrame(data.data, columns=data.feature_names)
Y = pandas.Series(data.target, name="target")  # 1-D target; a one-column DataFrame makes sklearn warn in fit()
X

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

## Statistical Analysis

In [5]:
mean = X.mean()
print("The mean for each column is:")
print(mean)
print()
median = X.median()
print("The median for each column is:")
print(median)
print()
var = X.var()
print("The variance for each column is:")
print(var)
print()
std = X.std()
print("The standard deviation for each column is:")
print(std)
print()

The mean for each column is:
mean radius                 14.127292
mean texture                19.289649
mean perimeter              91.969033
mean area                  654.889104
mean smoothness              0.096360
mean compactness             0.104341
mean concavity               0.088799
mean concave points          0.048919
mean symmetry                0.181162
mean fractal dimension       0.062798
radius error                 0.405172
texture error                1.216853
perimeter error              2.866059
area error                  40.337079
smoothness error             0.007041
compactness error            0.025478
concavity error              0.031894
concave points error         0.011796
symmetry error               0.020542
fractal dimension error      0.003795
worst radius                16.269190
worst texture               25.677223
worst perimeter            107.261213
worst area                 880.583128
worst smoothness             0.132369
worst compactness    
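
The same statistics can also be collected into one table. The following is a minimal sketch, assuming the same `load_breast_cancer` data as above; the `summary` name and the choice of columns to print are illustrative, not part of the original solution.

```python
import pandas
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pandas.DataFrame(data.data, columns=data.feature_names)

# describe() reports count, mean, std, min, quartiles, and max per column;
# transposing gives one row per feature
summary = X.describe().T

# describe() lists the median as "50%" and omits the variance, so add both explicitly
summary["median"] = X.median()
summary["var"] = X.var()

print(summary[["mean", "median", "var", "std", "min", "max"]].head())
```

Comparing the mean and median columns side by side is a quick way to spot skewed features.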

## Data Visualization
This code may freeze the system: the 30 features yield 870 ordered pairs, and each pair is drawn on its own 1000 x 1000 canvas. Students should run just a subset to show comprehension, or run in batches. For example, in the code below, "X.columns" should be replaced with "X.columns[a:b]", where a is the start index (e.g. 0) and b is the end index (e.g. 5).

In [None]:
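for c0 in X.columns:
    for c1 in X.columns:
        # Compare column names by value; "is not" tests object identity, not equality
        if c0 != c1:
            # One canvas per ordered pair, with c0 on the x-axis and c1 on the y-axis
            canvas = toyplot.Canvas(1000, 1000)
            axes = canvas.cartesian(label=c0+" vs. "+c1, xlabel=c0, ylabel=c1)
            mark = axes.scatterplot(X[c0], X[c1])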
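
One way to decide which pairs are worth plotting is to rank them numerically first. The sketch below is an assumption about how a student might do this, not part of the original solution: it uses the Pearson correlation from `DataFrame.corr()`, and the names `corr`, `pairs`, and `top_pairs` are illustrative.

```python
import numpy
import pandas
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pandas.DataFrame(data.data, columns=data.feature_names)

# Pearson correlation between every pair of features (30 x 30 matrix)
corr = X.corr()

# Keep only the strict upper triangle so each unordered pair appears once
mask = numpy.triu(numpy.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(10)
print(top_pairs)
```

Pearson correlation only captures linear trends, so a scatterplot of the top-ranked pairs is still useful for checking the shape of each relationship.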

## Train Classifiers and Compare

Students should choose classifiers over regressors because the target variable is categorical, not continuous: each sample is labeled 0 (malignant) or 1 (benign). Students can discuss the best and worst classifiers among those they chose and the differences in accuracy.
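
A quick inspection of the target supports the choice of classifiers. This is a minimal sketch, assuming the target is loaded as a pandas Series; the `y` and `label_names` names are illustrative.

```python
import pandas
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = pandas.Series(data.target, name="target")

# The target takes only two integer values, so this is binary classification
print(sorted(y.unique()))

# Map each integer label to the class name stored with the dataset
label_names = dict(enumerate(data.target_names))
print(label_names)

# Class balance: number of samples in each class
print(y.value_counts())
```

The class counts are worth noting when interpreting accuracy, since an imbalanced target makes a high accuracy easier to reach.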

In [12]:
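# Hold out 30% of the samples for testing. No random_state is set,
# so the split and the scores below vary from run to run.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.3)

# Model 0: a single decision tree
clf0 = DecisionTreeClassifier()
clf0.fit(x_train, y_train)
preds0 = clf0.predict(x_test)
acc0 = clf0.score(x_test, y_test)  # mean accuracy on the test set

print(acc0)

# Model 1: a random forest (an ensemble of decision trees)
clf1 = RandomForestClassifier()
clf1.fit(x_train, y_train)
preds1 = clf1.predict(x_test)
acc1 = clf1.score(x_test, y_test)

print(acc1)

# Model 2: a support vector classifier with default settings
clf2 = SVC()
clf2.fit(x_train, y_train)
preds2 = clf2.predict(x_test)
acc2 = clf2.score(x_test, y_test)

print(acc2)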

0.9122807017543859
0.9590643274853801
0.8888888888888888
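
A single train/test split gives a noisy comparison. The sketch below shows one possible alternative and is not part of the original solution: it assumes 5-fold cross-validation via `cross_val_score`, fixes `random_state=0` for the tree-based models, and adds a hypothetical fourth entry that standardizes the features before `SVC`, since the RBF kernel is distance-based and the features differ widely in scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The three model types from the cell above, plus a scaled SVC for comparison
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "SVC (unscaled)": SVC(),
    "SVC (scaled)": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 folds; the pipeline refits the scaler inside each fold
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler inside the pipeline keeps test-fold statistics out of the training step, which a scaler fitted on the full dataset would not.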




## Draw Conclusions

Students can make observations about the different features that were compared and discuss which were and were not correlated. They can also discuss which features may have more predictive power, which classifiers performed best, and whether or not the features were good predictors for the target variable.
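
To ground a discussion of predictive power, a fitted random forest exposes impurity-based feature importances. This is a minimal sketch under stated assumptions: `random_state=0` and the `importances` name are illustrative, and the forest is fitted on the full dataset purely for inspection.

```python
import pandas
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = pandas.DataFrame(data.data, columns=data.feature_names)

# Fit on all samples; this forest is used for inspection, not for scoring
forest = RandomForestClassifier(random_state=0)
forest.fit(X, data.target)

# feature_importances_ holds one non-negative weight per feature, summing to 1
importances = pandas.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```

Impurity-based importances can be split among strongly correlated features, so they are best read alongside the pairwise correlations rather than as a definitive ranking.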