# Cancer Tumor Detection using Decision Tree Algorithm

Consider the Wisconsin Breast Cancer Database.

As listed below, this dataset consists of 10 attributes (a sample id plus 9 measurements, each scored from 1 to 10) and 1 target class attribute.

The class attribute records the observation result: whether the patient has a benign or a malignant tumor.

Benign tumors do not spread to other parts of the body, while malignant tumors are cancerous.

Breast Cancer Data Set Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 – 10
3. Uniformity of Cell Size: 1 – 10
4. Uniformity of Cell Shape: 1 – 10
5. Marginal Adhesion: 1 – 10
6. Single Epithelial Cell Size: 1 – 10
7. Bare Nuclei: 1 – 10
8. Bland Chromatin: 1 – 10
9. Normal Nucleoli: 1 – 10
10. Mitoses: 1 – 10
11. Class: (2 for benign, 4 for malignant)

# Problem Statement:

# Model a Decision Tree classifier on the Breast Cancer data to predict whether a patient has a benign or a malignant tumor.

Decision Tree Model for Cancerous tumor detection:

To diagnose breast cancer, a doctor draws on experience and analyzes the details provided by:

1. The patient’s past medical history
2. Reports of all the tests performed

The modeled Decision Tree classifier will compare a new patient’s test reports and observation metrics with the records of patients (training data) that were correctly classified as benign or malignant.

Objective: Fill in blank1 and blank2.

# Import necessary libraries

In [2]:
# To enable plotting graphs in Jupyter notebook. 
# This should always be the first step before importing the other libraries
%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
from scipy.stats import zscore
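# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer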
from sklearn.metrics import accuracy_score
import seaborn as sns

# Data Preprocessing

In [4]:
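# Read the Wisconsin Breast Cancer Dataset
# Note: as the shape and dtypes below show, this file has 569 rows and 32 columns
# (an id, a B/M 'diagnosis' label and 30 real-valued features), not the 1-10 scored attributes listed above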
bc_df = pd.read_csv("wisc_bc_data.csv")

In [5]:
# Check the shape of the data
bc_df.shape

(569, 32)

In [6]:
# Check the data types of the data
bc_df.dtypes

id                     int64
diagnosis             object
radius_mean          float64
texture_mean         float64
perimeter_mean       float64
area_mean            float64
smoothness_mean      float64
compactness_mean     float64
concavity_mean       float64
points_mean          float64
symmetry_mean        float64
dimension_mean       float64
radius_se            float64
texture_se           float64
perimeter_se         float64
area_se              float64
smoothness_se        float64
compactness_se       float64
concavity_se         float64
points_se            float64
symmetry_se          float64
dimension_se         float64
radius_worst         float64
texture_worst        float64
perimeter_worst      float64
area_worst           float64
smoothness_worst     float64
compactness_worst    float64
concavity_worst      float64
points_worst         float64
symmetry_worst       float64
dimension_worst      float64
dtype: object

In [7]:
# Convert the 'diagnosis' column to the categorical dtype
bc_df['diagnosis'] = bc_df.diagnosis.astype('category')
# Verify that the column now has the category dtype
bc_df.dtypes

id                      int64
diagnosis            category
radius_mean           float64
texture_mean          float64
perimeter_mean        float64
area_mean             float64
smoothness_mean       float64
compactness_mean      float64
concavity_mean        float64
points_mean           float64
symmetry_mean         float64
dimension_mean        float64
radius_se             float64
texture_se            float64
perimeter_se          float64
area_se               float64
smoothness_se         float64
compactness_se        float64
concavity_se          float64
points_se             float64
symmetry_se           float64
dimension_se          float64
radius_worst          float64
texture_worst         float64
perimeter_worst       float64
area_worst            float64
smoothness_worst      float64
compactness_worst     float64
concavity_worst       float64
points_worst          float64
symmetry_worst        float64
dimension_worst       float64
dtype: object

In [8]:
bc_df.describe().T

feature,count,mean,std,min,25%,50%,75%,max
id,569.0,30371830.0,125020600.0,8670.0,869218.0,906024.0,8813129.0,911320500.0
radius_mean,569.0,14.12729,3.524049,6.981,11.7,13.37,15.78,28.11
texture_mean,569.0,19.28965,4.301036,9.71,16.17,18.84,21.8,39.28
perimeter_mean,569.0,91.96903,24.29898,43.79,75.17,86.24,104.1,188.5
area_mean,569.0,654.8891,351.9141,143.5,420.3,551.1,782.7,2501.0
smoothness_mean,569.0,0.09636028,0.01406413,0.05263,0.08637,0.09587,0.1053,0.1634
compactness_mean,569.0,0.104341,0.05281276,0.01938,0.06492,0.09263,0.1304,0.3454
concavity_mean,569.0,0.08879932,0.07971981,0.0,0.02956,0.06154,0.1307,0.4268
points_mean,569.0,0.04891915,0.03880284,0.0,0.02031,0.0335,0.074,0.2012
symmetry_mean,569.0,0.1811619,0.02741428,0.106,0.1619,0.1792,0.1957,0.304


In [9]:
bc_df.groupby(["diagnosis"]).count()
# Class distribution between Benign (B) and Malignant (M) is roughly 1.7:1 (357 vs. 212).
# Both classes are well represented, so the model has examples of both B and M to learn from.

diagnosis,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,points_worst,symmetry_worst,dimension_worst
B,357,357,357,357,357,357,357,357,357,357,...,357,357,357,357,357,357,357,357,357,357
M,212,212,212,212,212,212,212,212,212,212,...,212,212,212,212,212,212,212,212,212,212
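
The class balance in the output above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts shown (357 benign, 212 malignant); the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the groupby output above.
class_counts = {"B": 357, "M": 212}

total = sum(class_counts.values())

# Ratio of benign to malignant samples.
ratio_b_to_m = class_counts["B"] / class_counts["M"]

# Share of each class in the whole dataset.
class_share = {label: count / total for label, count in class_counts.items()}

print(total)
print(round(ratio_b_to_m, 2))
print({label: round(share, 3) for label, share in class_share.items()})
```

The ratio comes out closer to 1.7:1 than to 2:1, so the classes are only mildly imbalanced.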


In [11]:
bc_df.head(10)

index,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,points_worst,symmetry_worst,dimension_worst
0,87139402,B,12.32,12.39,78.85,464.1,0.1028,0.06981,0.03987,0.037,...,13.5,15.64,86.97,549.1,0.1385,0.1266,0.1242,0.09391,0.2827,0.06771
1,8910251,B,10.6,18.95,69.28,346.4,0.09688,0.1147,0.06387,0.02642,...,11.88,22.94,78.28,424.8,0.1213,0.2515,0.1916,0.07926,0.294,0.07587
2,905520,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
3,868871,B,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
4,9012568,B,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,...,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766
5,906539,B,11.57,19.04,74.2,409.7,0.08546,0.07722,0.05485,0.01428,...,13.07,26.98,86.43,520.5,0.1249,0.1937,0.256,0.06664,0.3035,0.08284
6,925291,B,11.51,23.93,74.52,403.5,0.09261,0.1021,0.1112,0.04105,...,12.48,37.16,82.28,474.2,0.1298,0.2517,0.363,0.09653,0.2112,0.08732
7,87880,M,13.81,23.75,91.56,597.8,0.1323,0.1768,0.1558,0.09176,...,19.2,41.85,128.5,1153.0,0.2226,0.5209,0.4646,0.2013,0.4432,0.1086
8,862989,B,10.49,19.29,67.41,336.1,0.09989,0.08578,0.02995,0.01201,...,11.54,23.31,74.22,402.8,0.1219,0.1486,0.07987,0.03203,0.2826,0.07552
9,89827,B,11.06,14.96,71.49,373.9,0.1033,0.09097,0.05397,0.03341,...,11.92,19.9,79.76,440.0,0.1418,0.221,0.2299,0.1075,0.3301,0.0908


# The first column, id, is the patient id and has nothing to do with the model attributes, so drop it.

In [12]:
bc_df = bc_df.drop(labels = "id", axis = 1)

In [13]:
# Create a separate dataframe consisting only of the features, i.e. the independent attributes
bc_feature_df = bc_df.drop(labels="diagnosis", axis=1)

# Scaling the Data

# Convert the features into z-scores, since we do not know what units or scales were used,
# and store them in a new dataframe.

In [14]:
bc_feature_df_z = bc_feature_df.apply(zscore)  
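
To make the scaling step concrete, here is a minimal pure-Python sketch of what a z-score transform does to a single column: subtract the mean, then divide by the standard deviation. It assumes the population standard deviation (dividing by n), which matches SciPy's default `ddof=0`, and a column that is not constant. The function name and the toy numbers are made up for illustration.

```python
import math

def zscore_column(values):
    """Standardize a list of numbers to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation: divide by n rather than n - 1.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Toy column; these numbers are made up and are not from the dataset.
toy_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
toy_z = zscore_column(toy_column)
print(toy_z)
```

The toy column has mean 5 and standard deviation 2, so each value is expressed as its distance from 5 in units of 2.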

In [15]:
# Capture the class values from the 'diagnosis' column in a pandas Series (array-like)

bc_labels = bc_df["diagnosis"]

In [16]:
# Store the standardized (z-scored) feature data in a NumPy array

X = np.array(bc_feature_df_z)

In [17]:
# store the bc_labels data into a separate np array

Y = np.array(bc_labels)

# Split into Train and Test

In [18]:
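# Break the data into training and test sets: the first 400 rows for training, the rest for testing.
# Note: the test slice starts at 401, so row 400 lands in neither set;
# use X[400:] and Y[400:] to include it (the outputs below were produced with 401).

X_Train = X[:400, :]
X_Test = X[401:, :]
Y_Train = Y[:400]
Y_Test = Y[401:]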

In [19]:
X_Train.shape, Y_Train.shape, X_Test.shape, Y_Test.shape

((400, 30), (400,), (168, 30), (168,))
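
The slice boundaries can be sanity-checked without the data. The sketch below uses a plain list of 569 row positions as a stand-in for the feature array (the names are illustrative) and shows which positions each slice covers.

```python
# Stand-in for the 569 row positions of the feature array.
row_ids = list(range(569))

train_ids = row_ids[:400]   # positions 0..399
test_ids = row_ids[401:]    # positions 401..568

# Positions that ended up in neither subset.
unused_ids = sorted(set(row_ids) - set(train_ids) - set(test_ids))

print(len(train_ids), len(test_ids), unused_ids)
```

This reproduces the 400 and 168 row counts in the shapes above and shows that position 400 is left out of both subsets.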

# Building the Decision Tree Classifier Model

In [20]:
from sklearn.tree import DecisionTreeClassifier

DTClassifier = DecisionTreeClassifier(max_depth=1, min_samples_split=3, random_state=0)
# Your code here
# DTClassifier.fit(# Your code here, # Your code here)
# Hint:
DTClassifier.fit(X_Train, Y_Train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
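
A tree with `max_depth=1` makes a single split, often called a decision stump. The sketch below illustrates the idea in pure Python: try every feature and every midpoint between neighbouring values, keep the split with the lowest weighted Gini impurity, and let each side predict its majority class. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the function names and toy data are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def fit_stump(rows, labels):
    """Find the single (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best["score"]:
                best = {
                    "feature": feature,
                    "threshold": threshold,
                    "score": score,
                    # Each side of the split predicts its majority class.
                    "left": Counter(left).most_common(1)[0][0],
                    "right": Counter(right).most_common(1)[0][0],
                }
    return best

def predict_stump(stump, row):
    """Predict the class of one row by checking which side of the split it falls on."""
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]

# Made-up toy data: two features per row, labels "B" / "M".
toy_rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [7.0, 4.0], [8.0, 6.0], [9.0, 2.0]]
toy_labels = ["B", "B", "B", "M", "M", "M"]

stump = fit_stump(toy_rows, toy_labels)
print(stump["feature"], stump["threshold"])
print(predict_stump(stump, [2.5, 7.0]), predict_stump(stump, [8.5, 1.0]))
```

In the toy data only the first feature separates the two classes cleanly, so the stump splits on it. Raising `max_depth` lets a tree repeat this search inside each side of the split.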

Using the above model, predict on the test dataset.

In [21]:
predicted_labels = DTClassifier.predict(X_Test)

We now need to see how well our Decision Tree classifier actually works. We do this by calculating the accuracy score, i.e. the number of correctly predicted test cases as a fraction of the total number of test cases.

In [22]:
score = accuracy_score(Y_Test, predicted_labels)
print(score)

0.875
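
The accuracy score is simple enough to compute by hand. The sketch below is a minimal pure-Python equivalent of the calculation described above; the function name and the toy labels are made up for illustration and are not taken from the dataset.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Made-up labels for illustration.
toy_true = ["B", "B", "M", "M", "B", "M", "B", "B"]
toy_pred = ["B", "B", "M", "B", "B", "M", "B", "M"]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

By the same arithmetic, a score of 0.875 on 168 test cases corresponds to 147 correct predictions.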


# To improve performance  
# Iteration 2

Let us change max_depth from 1 to 3 and see what happens.

In [41]:
# Call Decision Tree Classifier algorithm again and predict
DTClassifier_new = DecisionTreeClassifier (max_depth=3, random_state = 0)
DTClassifier_new.fit(X_Train, Y_Train)
predicted_labels = DTClassifier_new.predict(X_Test)
accuracy_score_new = accuracy_score(Y_Test, predicted_labels)
print(accuracy_score_new)

0.9166666666666666


In [42]:
# What happens when we change the max_depth to 5?


### Your code here####

In [34]:
# What happens when we keep max_depth at 3 but use the entropy criterion (with min_samples_leaf=5)?

DTClassifier_new = DecisionTreeClassifier(max_depth=3, criterion="entropy", min_samples_leaf=5, random_state=0)
DTClassifier_new.fit(X_Train, Y_Train)
predicted_labels = DTClassifier_new.predict(X_Test)
accuracy_score_new = accuracy_score(Y_Test, predicted_labels)
print(accuracy_score_new)


0.9166666666666666
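
The `criterion` parameter selects the impurity measure used to score candidate splits. As a rough illustration of how the two options differ, the sketch below computes Gini impurity and entropy from class counts at a node. The function names and the example counts are made up; this is the textbook definition, not scikit-learn's internal code.

```python
import math

def gini_impurity(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy_impurity(class_counts):
    """Entropy in bits: sum of p * log2(1 / p) over classes that are present."""
    total = sum(class_counts)
    # Classes with a zero count are skipped, since they contribute nothing.
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)

# Made-up class counts at a node: evenly mixed, mostly one class, and pure.
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, round(gini_impurity(counts), 3), round(entropy_impurity(counts), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two criteria often lead to similar trees.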


# Let's look at what happens when min_samples_leaf=5 with the gini criterion

In [33]:
DTClassifier_new = DecisionTreeClassifier (max_depth=3, criterion = "gini",min_samples_leaf= 5,random_state = 0)
DTClassifier_new.fit(X_Train, Y_Train)
predicted_labels = DTClassifier_new.predict(X_Test)
accuracy_score_new = accuracy_score(Y_Test, predicted_labels)
print(accuracy_score_new)

0.8988095238095238


In [None]:
# What happens when we change the max_depth to 3 and min_samples_leaf =2?


### Your code here####