## Training a classification tree with sklearn

We'll work with the [Wisconsin Breast Cancer Dataset from the UCI machine learning repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) and predict whether a tumor is malignant or benign from two features: the mean radius of the tumor (`radius_mean`) and its mean number of concave points (`concave points_mean`).

First, we split the data into 80% train and 20% test. The feature matrices are assigned to `X_train` and `X_test`, and the label arrays to `y_train` and `y_test`, where class 0 corresponds to a benign tumor and class 1 to a malignant tumor. The split is stratified on `y`, so both sets keep roughly the same proportion of malignant cases. To obtain reproducible results, we also define a variable called `SEED`, which is set to 1.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df_wbc = pd.read_csv('./data/wbc.csv')
print(df_wbc.shape)
print(df_wbc.columns)
df_wbc.head()

(569, 32)
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')


| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [2]:
# Keep only the two features used in this example
X = df_wbc.loc[:, ['radius_mean', 'concave points_mean']]
# Encode the labels as integers: malignant ('M') -> 1, benign ('B') -> 0
y = df_wbc['diagnosis'].map({'M': 1, 'B': 0})

In [3]:
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=SEED)

print(X_train.shape)
X_train.head()

(455, 2)


| | radius_mean | concave points_mean |
|---|---|---|
| 195 | 12.91 | 0.02377 |
| 560 | 14.05 | 0.04304 |
| 544 | 13.87 | 0.02369 |
| 495 | 14.87 | 0.04951 |
| 527 | 12.34 | 0.02647 |
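
To see what `stratify=y` does in the split above, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the real features: the labels only mimic the dataset's class counts (357 benign, 212 malignant), and the single feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with the same class counts as the dataset:
# 357 benign (0) and 212 malignant (1), 569 samples in total
y_toy = np.array([0] * 357 + [1] * 212)
# Placeholder feature column; its values do not matter for this check
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1)

# Sizes of the two splits
print(len(y_tr), len(y_te))

# Share of malignant labels overall, in train, and in test;
# with stratification the three values should be close to each other
print(round(y_toy.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify`, the malignant share in a small test set can drift away from the overall share, which makes the test accuracy harder to compare across different splits.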


In [4]:
# Instantiate a DecisionTreeClassifier with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])

[0 0 0 1 0]
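
Once a tree is fitted, it can help to look at what it actually learned. The sketch below is one way to do that, under a stated assumption: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `wbc.csv`, so the column names differ (`mean radius`, `mean concave points`) and the bundled target is encoded the other way around (0 = malignant), which the code flips. The learned rules may not match the tree fitted above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

SEED = 1
# scikit-learn's names for the two columns used in this section
features = ['mean radius', 'mean concave points']

# Assumption: the bundled dataset stands in for ./data/wbc.csv
data = load_breast_cancer(as_frame=True)
X = data.data[features]
# The bundled target uses 0 = malignant, 1 = benign; flip it so that
# 1 = malignant and 0 = benign, as in this section
y = 1 - data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)
dt.fit(X_train, y_train)

# max_depth is an upper bound; the fitted tree may be shallower
print("depth:", dt.get_depth())
print("leaves:", dt.get_n_leaves())

# Text rendering of the top levels of the tree (deeper branches are truncated)
rules = export_text(dt, feature_names=features, max_depth=2)
print(rules)
```

Each line of the printed rules is a threshold test on one of the two features, which is why the decision regions of a tree are axis-aligned rectangles.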


## Evaluate the classification tree

Now that we've fit the tree model, it's time to evaluate its performance on the test set. We'll use accuracy, which is the fraction of test set predictions that are correct.

In [7]:
from sklearn.metrics import accuracy_score

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {acc:.2f}")

Test set accuracy: 0.89
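
To make the definition of accuracy concrete, here is a small sketch that computes it by hand and compares it with `accuracy_score`. The arrays `y_true_toy` and `y_pred_toy` are hypothetical labels made up for illustration, not outputs of the model above. The confusion matrix is included because it shows which kind of error the accuracy figure hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (0 = benign, 1 = malignant)
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_toy = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Accuracy by hand: fraction of positions where prediction equals truth
manual_acc = np.mean(y_true_toy == y_pred_toy)
sk_acc = accuracy_score(y_true_toy, y_pred_toy)
print(manual_acc, sk_acc)  # both are 0.75 (6 of 8 correct)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[4 1]
#  [1 2]]
```

In the matrix, the bottom-left entry counts malignant tumors predicted as benign, which is usually the costlier mistake in this setting and is worth checking alongside accuracy.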
