# Breast Cancer Detection with Decision Trees

Hi Guys 😀

In this notebook, I'll cover the following topics:
- What are decision trees?
- Some advantages of decision trees
- Some disadvantages of decision trees
- Data preprocessing
- Building the model
- Model evaluation
- Hyperparameter tuning with grid search

You can follow us on the [Tirendaz Academy](https://youtube.com/c/tirendazacademy) YouTube channel 👍

Happy learning 🐱‍🏍 

# What are decision trees?

Decision trees are a non-parametric supervised learning method. A tree predicts the target by learning simple decision rules from the features, and the technique is widely used for both classification and regression tasks.

# Some advantages of decision trees
- Decision trees are simple to understand and interpret. 
- You can easily visualize trees.
- Decision trees require little data preprocessing. 
- You can deal with both numerical and categorical data using this technique.
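
To make the first two points concrete, here is a minimal sketch that prints the rules of a small tree as plain text. It assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) in place of the CSV file used later in this notebook. Note that the bundled copy encodes malignant as 0 and benign as 1. The names `data`, `small_tree`, and `rules` are only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled dataset stands in for the CSV file.
data = load_breast_cancer()

# A shallow tree keeps the printed rules short and readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(data.data, data.target)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one threshold test on a feature, and each `class:` line is a leaf, so you can read a prediction by following a single path from top to bottom.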

# Some disadvantages of decision trees

- Decision tree learners can create over-complex trees that do not generalize well to new data (overfitting).

To overcome this problem, you can set the maximum depth of the tree, set the minimum number of samples required at a leaf node, or prune the tree.
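
As a rough sketch, these three options map to the `max_depth`, `min_samples_leaf`, and `ccp_alpha` parameters of scikit-learn's `DecisionTreeClassifier`. The values 3, 10, and 0.01 below are arbitrary and untuned, and the bundled `load_breast_cancer` dataset is assumed as a stand-in for the CSV file. The names `settings` and `trees` are only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative (untuned) values for each way of limiting complexity.
settings = {
    "unrestricted": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "ccp_alpha=0.01": {"ccp_alpha": 0.01},  # cost-complexity pruning
}

trees = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params).fit(X_train, y_train)
    trees[name] = tree
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Comparing the depth and the number of leaves shows how much each setting shrinks the tree relative to the unrestricted one.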

- Decision trees can be unstable: small variations in the data may produce a completely different tree.

To reduce this problem, you can use decision trees within an ensemble.
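
One common ensemble of trees is a random forest. The sketch below is not part of the original notebook: it compares a single tree with scikit-learn's `RandomForestClassifier` using 5-fold cross-validation, again assuming the bundled `load_breast_cancer` dataset. The choice of 100 trees is just the library default written out explicitly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(single_tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"Single tree:   mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Random forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}")
```

The standard deviation across folds gives a rough idea of how much each model's score moves when the training data changes.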

# Loading the dataset

In [1]:
import pandas as pd
df = pd.read_csv("Breast Cancer Wisconsin.csv")
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [2]:
df.shape

(569, 33)

# Data preprocessing

The `diagnosis` column is our target variable. Let's assign this column to the variable `y`.

In [3]:
y = df.loc[:,"diagnosis"].values

When creating the feature variable `X`, I'm going to remove both the target column and the unnecessary columns (`id` and the empty `Unnamed: 32`).

In [4]:
X = df.drop(["diagnosis","id","Unnamed: 32"], axis = 1).values

Note that our target variable has two categories, M (malignant) and B (benign). Let's encode the target variable with a label encoder, which maps the classes to integers in sorted order, so B becomes 0 and M becomes 1.

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
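
To check which integer belongs to which class, you can look at the encoder's `classes_` attribute. Here is a small sketch on made-up toy labels (the names `labels`, `encoder`, `encoded`, and `decoded` are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, made up for illustration.
labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted class names; the position of a name is its code.
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

In this notebook, the same check would be `le.classes_` after cell 5.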

Let's split the dataset into a training set and a test set. The model is built with the training set and evaluated with the test set. The `stratify=y` argument keeps the class proportions the same in both sets.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
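
Here is a small sketch of what stratification does, using made-up toy data with 60% zeros and 40% ones (the `_toy` names are only for illustration). By default, `train_test_split` puts 25% of the samples in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 60 samples of class 0 and 40 of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# The share of class 1 should be the same in both parts.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class shares in the two parts could drift apart by chance, which matters more when one class is rare.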

# Building the model

Let's train our model on the training set.

In [7]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

DecisionTreeClassifier(random_state=42)

# Model evaluation

Let's evaluate our model. First, we predict the labels for both the training set and the test set.

In [8]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

To see the performance of the model, I'm going to use the accuracy score function.

In [9]:
from sklearn.metrics import accuracy_score
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f"Decision tree train/test accuracies:{tree_train:.3f}/{tree_test:.3f}")

Decision tree train/test accuracies:1.000/0.951
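
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this by hand on made-up toy labels and also prints a confusion matrix, which shows in which class the errors fall. The confusion matrix is an addition of mine and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, made up for illustration.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()  # fraction of matching labels

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(acc, manual)
print(cm)
```

For a medical task like this one, it can be worth looking at the row of the malignant class, since a missed malignant case is usually the more costly error.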


The model is overfitting: it classifies the training set perfectly (accuracy 1.000) but scores lower on the test set (0.951). To reduce overfitting, we can control the complexity of the tree. First, let's set the `max_depth` parameter, which limits the maximum number of levels in the tree.

# Building the model with max_depth parameter

In [10]:
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=2, random_state=42)

Let's take a look at the performance of the model on the training and test sets.

In [11]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f"Decision tree train/test accuracies:{tree_train:.3f}/{tree_test:.3f}")

Decision tree train/test accuracies:0.951/0.923


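By making the tree less complex, we narrowed the gap between the training and test accuracies (from about 0.05 to about 0.03), so the tree no longer memorizes the training set. But this tree has another problem: both accuracies dropped, and the test accuracy fell from 0.951 to 0.923, which suggests that a depth of 2 is too restrictive. To do better, we need to tune the model over different parameter values.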
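
To get a feeling for this trade-off, here is a rough sketch that trains one tree per depth and prints both accuracies. It assumes the bundled `load_breast_cancer` dataset instead of the CSV file, so the numbers may differ slightly from the ones above. The depth values and the name `results` are only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

results = {}
# None removes the depth limit, so the tree grows until its leaves are pure.
for depth in [1, 2, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A very shallow tree tends to score low on both sets, while an unlimited tree fits the training set perfectly. A good depth usually lies somewhere in between.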

# Hyperparameter tuning with grid search

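To find the best parameters, I'm going to use the grid search technique, which evaluates every combination in the parameter grid with cross-validation on the training set.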

In [12]:
from sklearn.model_selection import GridSearchCV
dt = DecisionTreeClassifier(random_state=42)
parameters = {"max_depth":[1,2,3,4,5,7,10],
              "min_samples_leaf":[1,3,6,10,20]}
clf = GridSearchCV(dt, parameters, n_jobs=1)
clf.fit(X_train,y_train)
print(clf.best_params_)

{'max_depth': 3, 'min_samples_leaf': 1}


When we execute this cell, we can see the best parameters: 3 for `max_depth` and 1 for `min_samples_leaf`. Now, I'm going to make predictions with the model trained with these parameters.

Note that we don't need to train our model again: by default, after the best parameters are found, `GridSearchCV` refits the model with them on the whole training set.

Let's predict the labels of the training and test sets.

In [13]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print(f"Decision tree train/test accuracies:{tree_train:.3f}/{tree_test:.3f}")

Decision tree train/test accuracies:0.974/0.958


Notice that the accuracy on the training set (0.974) is now close to the accuracy on the test set (0.958), and both are close to 1. The tuned tree also scores slightly higher on the test set than the unrestricted tree (0.958 versus 0.951).
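
If you want to look inside the search, the fitted `GridSearchCV` object exposes a few useful attributes. The sketch below repeats the search on the bundled `load_breast_cancer` dataset (an assumption, so the numbers may differ from the ones above) and uses the hypothetical names `search` and `best_tree`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

parameters = {"max_depth": [1, 2, 3, 4, 5, 7, 10],
              "min_samples_leaf": [1, 3, 6, 10, 20]}

# cv=5 is the default; it is written out here to make the procedure explicit.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), parameters, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)               # best combination in the grid
print(f"{search.best_score_:.3f}")       # its mean cross-validated accuracy
print(len(search.cv_results_["params"])) # number of combinations tried

# With the default refit=True, best_estimator_ is already trained
# on the whole training set with the best parameters.
best_tree = search.best_estimator_
print(f"{best_tree.score(X_test, y_test):.3f}")
```

Calling `search.predict` or `search.score` uses this refitted `best_estimator_` under the hood, which is why cell 13 can call `clf.predict` directly.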

Thanks for reading.

Please don't forget to follow us on [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [GitHub](http://github.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) | [Kaggle](https://www.kaggle.com/tirendazacademy) 😎