# **Decision Trees**

The Wisconsin Breast Cancer Dataset(WBCD) can be found here(https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data)

This dataset describes the characteristics of the cell nuclei of various patients with and without breast cancer. The task is to classify a decision tree to predict if a patient has a benign or a malignant tumour based on these features.

Attribute Information:
```
#  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
```

In [1]:
import pandas as pd
headers = ["ID","CT","UCSize","UCShape","MA","SECSize","BN","BC","NN","Mitoses","Diagnosis"]
data = pd.read_csv('breast-cancer-wisconsin.data', na_values='?',    
         header=None, index_col=['ID'], names = headers) 
data = data.reset_index(drop=True)
data = data.fillna(0)
data.describe()

Unnamed: 0,CT,UCSize,UCShape,MA,SECSize,BN,BC,NN,Mitoses,Diagnosis
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.463519,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,3.640708,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


1. a) Implement a decision tree (you can use decision tree implementation from existing libraries).

In [2]:
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
X = data[["CT","UCSize","UCShape","MA","SECSize","BN","BC","NN","Mitoses"] ]
y = data["Diagnosis" ]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [3]:
len(X_train)

489

1. b) Train a decision tree object of the above class on the WBC dataset using misclassification rate, entropy and Gini as the splitting metrics.

In [15]:
#Create decision tree object
d_gini = tree.DecisionTreeClassifier(criterion = 'gini')
#train the classifier
d_gini  = d_gini.fit(X_train, y_train)

In [16]:
#Create decision tree object
d_entropy = tree.DecisionTreeClassifier(criterion = 'entropy')
#train the classifier
d_entropy  = d_entropy.fit(X_train, y_train)

# #Create decision tree object
# d_gini = tree.DecisionTreeClassifier(criterion = 'gini')
# #train the classifier
# d_gini  = d_gini.fit(X_train, y_train)
# #predict the results
# y_pred = d_gini.predict(X_test)
# # decision tree accuracy, % correct
# print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

1. c) Report the accuracies in each of the above splitting metrics and give the best result. 

In [17]:
# determine best accuracy from the two above

#predict the results
y_pred_gini = d_gini.predict(X_test)
# decision tree accuracy, % correct
gini_accuracy = metrics.accuracy_score(y_test, y_pred_gini)
print("Gini Accuracy:", gini_accuracy )

#predict the results
y_pred_entropy = d_entropy.predict(X_test)
# decision tree accuracy, % correct
entropy_accuracy = metrics.accuracy_score(y_test, y_pred_entropy)
print("Entropy Accuracy:", entropy_accuracy)

if entropy_accuracy > gini_accuracy:
    print("Best accuracy from method Entropy", entropy_accuracy)
else:
    print("Best accuracy from method Gini", gini_accuracy)

Gini Accuracy: 0.9428571428571428
Entropy Accuracy: 0.9476190476190476
Best accuracy from method Entropy 0.9476190476190476


1. d) Experiment with different approaches to decide when to terminate the tree (number of layers, purity measure, etc). Report and give explanations for all approaches. 

In [23]:
best_accuracy = 0
bestj = 0 
bestk = 0
bestmethod = ""
for j in range(3,10):
    for k in range(2,10):
        #Create decision tree object
        d_gini = tree.DecisionTreeClassifier(criterion = 'gini', max_depth=j, min_samples_split= k)
        #train the classifier
        d_gini  = d_gini.fit(X_train, y_train)
        #predict the results
        y_pred_gini = d_gini.predict(X_test)
        # decision tree accuracy, % correct
        gini_accuracy = metrics.accuracy_score(y_test, y_pred_gini)
        # print("Gini Accuracy:", j, k, gini_accuracy)
        if gini_accuracy > best_accuracy:
            bestj = j
            bestk = k
            bestmethod = "gini"
            best_accuracy = gini_accuracy


for j in range(3,10):
    for k in range(2,10):
        #Create decision tree object
        d_entropy = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth=j, min_samples_split= k)
        #train the classifier
        d_entropy  = d_entropy.fit(X_train, y_train)

        #predict the results
        y_pred_entropy = d_entropy.predict(X_test)
        # decision tree accuracy, % correct
        entropy_accuracy = metrics.accuracy_score(y_test, y_pred_entropy)
        # print("Entropy Accuracy:", j, k, entropy_accuracy)

        if entropy_accuracy > best_accuracy:
            bestj = j
            bestk = k
            bestmethod = "entropy"
            best_accuracy = entropy_accuracy
            
print("Best accuracy from method ", best_accuracy, "for best parameters of", bestj, bestk, bestmethod)

Best accuracy from method Entropy 0.9571428571428572 for best parameters of 6 2 entropy


Best accuracy is when criterion='entropy', max_depth=6, min_samples_split=2 i.e. 95.71%

2. What is boosting, bagging and  stacking?
Which class does random forests belong to and why?

Answer: </br> All 3 are ensemble methods with decision tree as the basic unit. </br> Bagging builds models using bootstrap sampling. Bootstrap sampling means drawing random samples from training set with replacement. After training "m" decision trees, apply majority rule. This would result in less complex decision boundary and classifier has less variance. </br> </br> Boosting builds models by iteratively fitting baselearners to model error. It incrementally learns from mistakes. </br> Stacking creates a hierarchy of models using the outputs from previous layers. </br> </br> Random forest resembles bagging the most, as in random forest also we use bootstrap sampling. But, in addition to building trees based on multiple samples of training data, it also constrains the features that can be used to build the trees, forcing trees to be different.

3. Implement random forest algorithm using different decision trees . 

In [4]:
# testing maximum count/mode

# import numpy as np
# from scipy import stats

# a = np.array([[10, 2], [3, 4], [5, 6], [10, 4]])
# xmax, ymax = a.max(axis=0)
# print(xmax, ymax)

# m = stats.mode(a)
# print(m[0])

10 6
[[10  4]]


In [4]:
# Create a random subsample from the dataset with replacement
import numpy as np
def subsample(X_train, y_train, ratio):
	rand_num = np.random.randint(3000)
	X_sample = X_train.sample(n = int(len(X_train) * ratio), random_state = np.random.RandomState(rand_num))
	y_sample = y_train[X_sample.index]
	return X_sample, y_sample

In [12]:
# Random Forest Algorithm
from scipy import stats

def random_forest(X_train, X_test, y_train, y_test, criterion, max_depth, min_samples_split, sample_size, n_trees):
	trees = list()
	predictions = []
	for _ in range(n_trees):
		X_sample, y_sample = subsample(X_train, y_train, sample_size)
		dt = tree.DecisionTreeClassifier(criterion = criterion, max_depth=max_depth, min_samples_split= min_samples_split)
		dt  = dt.fit(X_sample, y_sample)
		predictions.append(list(dt.predict(X_test)))
		trees.append(dt)

	m = stats.mode(predictions)
	# print(np.squeeze(m[0]))
	accuracy = metrics.accuracy_score(y_test, np.squeeze(m[0]))
	print("Accuracy:", accuracy, "max_depth:", max_depth, "criterion:", criterion,
	"min_samples_split:", min_samples_split, "n_trees:", n_trees)

	return(accuracy)

In [13]:
best_accuracy = 0
besti = 0
bestj = 0 
bestk = 0
best_criteria = ""

for i in range(10, 150, 5):
    for j in range(3, 10):
        for k in range(2, 10):
            for criterion in {'entropy', 'gini'}:
                accuracy = random_forest(X_train, X_test, y_train, y_test, 'entropy', j, k, 0.5, i)
                if accuracy>best_accuracy:
                    best_accuracy = accuracy
                    besti = i
                    bestj = j
                    bestk = k
                    best_criteria = criterion

print(best_accuracy, besti, bestj, bestk, best_criteria)

Accuracy: 0.9571428571428572 max_depth: 3 criterion: entropy min_samples_split: 2 n_trees: 10
Accuracy: 0.9428571428571428 max_depth: 3 criterion: entropy min_samples_split: 2 n_trees: 10
Accuracy: 0.9619047619047619 max_depth: 3 criterion: entropy min_samples_split: 3 n_trees: 10
Accuracy: 0.9619047619047619 max_depth: 3 criterion: entropy min_samples_split: 3 n_trees: 10
Accuracy: 0.9714285714285714 max_depth: 3 criterion: entropy min_samples_split: 4 n_trees: 10
Accuracy: 0.9619047619047619 max_depth: 3 criterion: entropy min_samples_split: 4 n_trees: 10
Accuracy: 0.9571428571428572 max_depth: 3 criterion: entropy min_samples_split: 5 n_trees: 10
Accuracy: 0.9523809523809523 max_depth: 3 criterion: entropy min_samples_split: 5 n_trees: 10
Accuracy: 0.9476190476190476 max_depth: 3 criterion: entropy min_samples_split: 6 n_trees: 10
Accuracy: 0.9523809523809523 max_depth: 3 criterion: entropy min_samples_split: 6 n_trees: 10
Accuracy: 0.9666666666666667 max_depth: 3 criterion: entropy

Best accuracy is when criterion='gini', n_trees=15, max_depth=8, min_samples_split=4 i.e. 98.09%

In [None]:
#Done using sklearn

# from sklearn import ensemble
# best_accuracy = 0
# besti = 0
# bestj = 0 
# bestk = 0
# bestmethod = ""
# for i in range(10,150, 5):
#     for j in range(3,10):
#         for k in range(2,10):
#             #Create decision tree object
#             d_gini = ensemble.RandomForestClassifier(n_estimators= i, criterion = 'gini', max_depth=j, min_samples_split= k)
#             #train the classifier
#             _gini  = d_gini.fit(X_train, y_train)
#             #predict the results
#             y_pred_gini = d_gini.predict(X_test)
#             # decision tree accuracy, % correct
#             gini_accuracy = metrics.accuracy_score(y_test, y_pred_gini)
#             print("Gini Accuracy:", i, j, k, gini_accuracy)
#             if gini_accuracy > best_accuracy:
#                 print("Gini Best Accuracy:", i, j, k, gini_accuracy)
#                 besti= i
#                 bestj = j
#                 bestk = k
#                 bestmethod = "gini"
#                 best_accuracy = gini_accuracy

# for i in range(10,150, 5):
#     for j in range(3,10):
#         for k in range(2,10):
#             #Create decision tree object
#             d_entropy = ensemble.RandomForestClassifier(n_estimators= i, criterion = 'entropy', max_depth=j, min_samples_split= k)
#             #train the classifier
#             d_entropy  = d_entropy.fit(X_train, y_train)

#             #predict the results
#             y_pred_entropy = d_entropy.predict(X_test)
#             # decision tree accuracy, % correct
#             entropy_accuracy = metrics.accuracy_score(y_test, y_pred_entropy)
#             print("Entropy Accuracy:",i, j, k, entropy_accuracy)

#             if entropy_accuracy > best_accuracy:
#                 print("Entropy Best Accuracy:",i, j, k, entropy_accuracy)
#                 besti= i
#                 bestj = j
#                 bestk = k
#                 bestmethod = "entropy"
#                 best_accuracy = entropy_accuracy
            
# print("Best accuracy from method Entropy", best_accuracy, "for best parameters of", besti, bestj, bestk, bestmethod)


4. Report the accuracies obtained after using the Random forest algorithm and compare it with the best accuracies obtained with the decision trees. 

Best accuracy from random forest is 98.09% and from decision tree is 95.71%.

For decision tree, Best accuracy is when criterion='entropy', max_depth=6, min_samples_split=2 i.e. 95.71%
For random forest, Best accuracy is when criterion='gini', n_trees=15, max_depth=8, min_samples_split=4 i.e. 98.09%

5. Submit your solution as a separate pdf in the final zip file of your submission


Compute a decision tree with the goal to predict the food review based on its smell, taste and portion size.

(a) Compute the entropy of each rule in the first stage.

(b) Show the final decision tree. Clearly draw it.

Submit a handwritten response. Clearly show all the steps.

