# **Decision Trees**

The Wisconsin Breast Cancer Dataset (WBCD) can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).

This dataset describes the characteristics of the cell nuclei of various patients with and without breast cancer. The task is to train a decision tree that predicts whether a patient has a benign or a malignant tumour based on these features.

Attribute Information:
```
   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)
```

In [1]:
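import pandas as pd
headers = ["ID","CT","UCSize","UCShape","MA","SECSize","BN","BC","NN","Mitoses","Diagnosis"]
# '?' marks missing values in the raw file; read them as NaN
data = pd.read_csv('breast-cancer-wisconsin.data', na_values='?',
                   header=None, index_col=['ID'], names=headers)
# Drop the sample ID, which carries no predictive information
data = data.reset_index(drop=True)
# Replace missing values with 0 (note: 0 lies outside the 1 - 10 attribute domain)
data = data.fillna(0)
data.describe()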

| | CT | UCSize | UCShape | MA | SECSize | BN | BC | NN | Mitoses | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.463519 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.640708 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


1. a) Implement a decision tree (you can use a decision tree implementation from an existing library).

In [2]:
data.head()

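| | CT | UCSize | UCShape | MA | SECSize | BN | BC | NN | Mitoses | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |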


In [3]:
X = data.iloc[:,:-1].to_numpy()
y = data['Diagnosis'].to_numpy()

In [5]:
from sklearn import tree
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1) 

In [7]:
dt_gini = tree.DecisionTreeClassifier()
dt_gini = dt_gini.fit(X_train, y_train)

In [8]:
y_hat_gini = dt_gini.predict(X_test)

In [9]:
dt_gini.score(X_test, y_test)

0.9428571428571428

1. b) Train a decision tree object of the above class on the WBC dataset using misclassification rate, entropy and Gini as the splitting metrics.

In [10]:
dt_entropy = tree.DecisionTreeClassifier(criterion='entropy')
dt_entropy = dt_entropy.fit(X_train, y_train)

In [11]:
dt_entropy.score(X_test, y_test)

0.9428571428571428
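
Part 1(b) also asks for misclassification rate as a splitting metric. To my knowledge, scikit-learn's `DecisionTreeClassifier` does not offer it as a `criterion` (the classifier accepts `'gini'`, `'entropy'` and, in recent versions, `'log_loss'`), which would explain why only Gini and entropy are trained above. As a minimal sketch of what the three metrics measure, the functions below compute each impurity for the labels that reach a single node. The function names and the example node are hypothetical, not library calls.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def misclassification_rate(labels):
    """1 - p_max: error made by always predicting the majority class."""
    return 1.0 - max(class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * log2(1.0 / p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


# Hypothetical node: three benign (2) samples and one malignant (4) sample
node = [2, 2, 2, 4]
for name, fn in [("misclassification", misclassification_rate),
                 ("entropy", entropy),
                 ("gini", gini)]:
    print(f"{name:<18} {fn(node):.4f}")
```

All three measures are zero for a pure node and largest for an even class mix, so they often agree on the best split while differing in how strongly they reward nearly pure children.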

1. c) Report the accuracies in each of the above splitting metrics and give the best result. 

1. d) Experiment with different approaches to decide when to terminate the tree (number of layers, purity measure, etc). Report and give explanations for all approaches. 

In [92]:
from sklearn.model_selection import GridSearchCV

In [101]:
params_grid = {'criterion':['gini','entropy'],
              'max_depth':[4,6,8],
              'min_samples_split':[8,4,2],
              'min_samples_leaf':[4,3,2,1],
              'max_leaf_nodes':[2,None],
              'min_impurity_decrease':[0,.01,.1],
              'ccp_alpha':[0,.1,.5]}

In [102]:
dt = tree.DecisionTreeClassifier()
clf = GridSearchCV(dt, params_grid)
clf.fit(X_train,y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'ccp_alpha': [0, 0.1, 0.5],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 6, 8], 'max_leaf_nodes': [2, None],
                         'min_impurity_decrease': [0, 0.01, 0.1],
                         'min_samples_leaf': [4, 3, 2, 1],
                         'min_samples_split': [8, 4, 2]})
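
`GridSearchCV` fits every combination in `params_grid` once per cross-validation fold, so the size of the grid drives the run time. The sketch below counts the combinations using only the standard library. The five folds are an assumption based on scikit-learn's default `cv` and on the `split0` to `split4` columns that `cv_results_` reports.

```python
from math import prod

# Same grid as in the cell above
params_grid = {'criterion': ['gini', 'entropy'],
               'max_depth': [4, 6, 8],
               'min_samples_split': [8, 4, 2],
               'min_samples_leaf': [4, 3, 2, 1],
               'max_leaf_nodes': [2, None],
               'min_impurity_decrease': [0, .01, .1],
               'ccp_alpha': [0, .1, .5]}

# One candidate per element of the Cartesian product of all value lists
n_candidates = prod(len(values) for values in params_grid.values())

# Assumed number of folds (scikit-learn's default when cv is not given)
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates, {n_fits} fits")
```

Many of these candidates are redundant: `max_leaf_nodes=2` permits only a single split, so `max_depth`, `min_samples_split` and `min_samples_leaf` rarely change the resulting tree for that half of the grid.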

In [106]:
pd.DataFrame(clf.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ccp_alpha | param_criterion | param_max_depth | param_max_leaf_nodes | param_min_impurity_decrease | param_min_samples_leaf | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011177 | 0.011885 | 0.000101 | 0.000202 | 0 | gini | 4 | 2 | 0 | 4 | 8 | {'ccp_alpha': 0, 'criterion': 'gini', 'max_dep... | 0.880952 | 0.928571 | 0.928571 | 0.912698 | 0.920 | 0.914159 | 0.017632 | 289 |
| 1 | 0.001990 | 0.003129 | 0.000000 | 0.000000 | 0 | gini | 4 | 2 | 0 | 4 | 4 | {'ccp_alpha': 0, 'criterion': 'gini', 'max_dep... | 0.880952 | 0.928571 | 0.928571 | 0.912698 | 0.920 | 0.914159 | 0.017632 | 289 |
| 2 | 0.000199 | 0.000399 | 0.000000 | 0.000000 | 0 | gini | 4 | 2 | 0 | 4 | 2 | {'ccp_alpha': 0, 'criterion': 'gini', 'max_dep... | 0.880952 | 0.928571 | 0.928571 | 0.912698 | 0.920 | 0.914159 | 0.017632 | 289 |
| 3 | 0.003399 | 0.006799 | 0.000199 | 0.000398 | 0 | gini | 4 | 2 | 0 | 3 | 8 | {'ccp_alpha': 0, 'criterion': 'gini', 'max_dep... | 0.880952 | 0.928571 | 0.928571 | 0.912698 | 0.920 | 0.914159 | 0.017632 | 289 |
| 4 | 0.004662 | 0.005524 | 0.000736 | 0.000632 | 0 | gini | 4 | 2 | 0 | 3 | 4 | {'ccp_alpha': 0, 'criterion': 'gini', 'max_dep... | 0.880952 | 0.928571 | 0.928571 | 0.912698 | 0.920 | 0.914159 | 0.017632 | 289 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1291 | 0.000000 | 0.000000 | 0.003138 | 0.006276 | 0.5 | entropy | 8 | None | 0.1 | 2 | 4 | {'ccp_alpha': 0.5, 'criterion': 'entropy', 'ma... | 0.857143 | 0.912698 | 0.928571 | 0.912698 | 0.912 | 0.904622 | 0.024547 | 649 |
| 1292 | 0.003124 | 0.006249 | 0.000000 | 0.000000 | 0.5 | entropy | 8 | None | 0.1 | 2 | 2 | {'ccp_alpha': 0.5, 'criterion': 'entropy', 'ma... | 0.857143 | 0.912698 | 0.928571 | 0.912698 | 0.912 | 0.904622 | 0.024547 | 649 |
| 1293 | 0.000000 | 0.000000 | 0.003124 | 0.006248 | 0.5 | entropy | 8 | None | 0.1 | 1 | 8 | {'ccp_alpha': 0.5, 'criterion': 'entropy', 'ma... | 0.857143 | 0.912698 | 0.928571 | 0.912698 | 0.912 | 0.904622 | 0.024547 | 649 |
| 1294 | 0.003125 | 0.006249 | 0.000000 | 0.000000 | 0.5 | entropy | 8 | None | 0.1 | 1 | 4 | {'ccp_alpha': 0.5, 'criterion': 'entropy', 'ma... | 0.857143 | 0.912698 | 0.928571 | 0.912698 | 0.912 | 0.904622 | 0.024547 | 649 |
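
The grid mixes several kinds of stopping rule: `max_depth` caps the number of layers, `min_samples_split` and `min_samples_leaf` require enough data at a node, `max_leaf_nodes` caps the tree size, `min_impurity_decrease` demands a minimum purity improvement per split, and `ccp_alpha` prunes after growing. As a minimal sketch of the purity-based rule, the code below evaluates one candidate split against the three thresholds from the grid. It follows the weighted impurity decrease formula given in the scikit-learn documentation, `N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)`, with Gini impurity. The helper names and the example node are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of the labels at one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the node's share of all samples."""
    n_t = len(parent)
    return (n_t / n_total) * (
        gini(parent)
        - (len(right) / n_t) * gini(right)
        - (len(left) / n_t) * gini(left)
    )


# Hypothetical node holding 10 of 100 training samples
left = [2, 2, 2, 2, 4]    # mostly benign after the split
right = [2, 4, 4, 4, 4]   # mostly malignant after the split
parent = left + right

decrease = weighted_impurity_decrease(parent, left, right, n_total=100)

# A split is kept only if the decrease reaches the threshold
for threshold in (0, 0.01, 0.1):
    print(f"min_impurity_decrease={threshold}: split allowed = {decrease >= threshold}")
```

Because the decrease is scaled by the node's share of the training set, a large threshold such as 0.1 blocks most splits below the first few layers, which keeps the tree very shallow.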


2. What are boosting, bagging and stacking?
Which class do random forests belong to, and why?

Answer:

#### Boosting
[Boosting](https://www.geeksforgeeks.org/boosting-in-machine-learning-boosting-and-adaboost/)

#### Bagging
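[Bagging (bootstrap aggregating)](http://en.wikipedia.org/wiki/Bootstrap_aggregating)

*Random forests belong to bagging*: each tree is trained on a bootstrap sample of the training data (with a random subset of features considered at each split), and the trees' predictions are combined by majority vote.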
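
Bagging trains each base learner on a bootstrap sample, that is, rows drawn with replacement. The standard-library sketch below shows what such a sample looks like: when n rows are drawn from n rows, roughly 1 - 1/e (about 63%) of the distinct rows appear and the rest are left out of that learner's training set. The row count, the seed and the helper name are arbitrary choices for illustration.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_rows = 10_000          # arbitrary size, large enough to show the limit

idx = bootstrap_indices(n_rows, rng)

# Duplicates collapse in the set, leaving only the distinct rows that were drawn
unique_fraction = len(set(idx)) / n_rows
print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
```

Since every learner sees a different subset, their errors are partly decorrelated, and averaging or voting over them reduces variance.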

#### Stacking
[Stacking](https://www.geeksforgeeks.org/stacking-in-machine-learning/)

3. Implement the random forest algorithm using different decision trees.

In [14]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [13]:
rf_gini = RandomForestClassifier(random_state=4)
rf_gini = rf_gini.fit(X_train,y_train)
rf_gini.score(X_test,y_test)

0.9571428571428572

### RF using DT

In [88]:
# Bootstrap the data
def bootstrap_data(X, y):
    '''Draw a bootstrap sample: 66% of the rows, chosen with replacement.'''
    n_samples = int(X.shape[0] * 0.66)
    sample_idx = np.random.randint(0, X.shape[0], size=n_samples)

    X_bootstrap, y_bootstrap = X[sample_idx], y[sample_idx]

    return X_bootstrap, y_bootstrap

# # Select Random Features
# def random_select_feat(feature):
#     '''Select Random set of features. Duplicate is possible'''
#     pass

# Train multiple DTs, each on a bootstrap sample with random feature subsets
def Build_RandomForest(X, y, num_trees=100):
    '''
    Fit num_trees decision trees, each on its own bootstrap sample.

    Returns
    -------
    List of fitted sklearn DTs - a random forest
    '''
    random_forest = []
    for i in range(num_trees):
        X_bootstrap, y_bootstrap = bootstrap_data(X, y)
        # 'sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases no longer accept 'auto'
        tree_in_forest = tree.DecisionTreeClassifier(random_state=i, max_features='sqrt')
        random_forest.append(tree_in_forest.fit(X_bootstrap, y_bootstrap))

    return random_forest
        
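# Testing
def RandomForest_predict(random_forest, X_test):
    '''Predict with every tree, then combine the votes per test sample.'''
    test_res = []
    # Run X_test through each DT
    for dt in random_forest:
        y_hat = dt.predict(X_test)
        # Store this tree's predictions
        test_res.append(y_hat)
    # Shape (num_trees, n_samples); transpose so each row holds one sample's votes
    test_res = np.array(test_res)
    y_hat_aggr = aggregate(test_res.T)

    return y_hat_aggr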
    
def aggregate(test_result):
    y_hat_aggr = []
    # Majority vote: for each sample, keep the label predicted by most trees
    for test in test_result:
        y_hat, counts = np.unique(test, return_counts=True)
        y_hat_aggr.append(y_hat[counts.argmax()])

    # Return the aggregated predictions
    return y_hat_aggr

def test_RandomForest(y_test, y_hat):
    accuracy = (y_test == y_hat).sum()/len(y_test)
    
    return accuracy
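
`aggregate` above relies on two NumPy behaviours: `np.unique` returns the labels in sorted order, and `argmax` returns the first maximum. With an even `num_trees` and two classes, a tied vote therefore resolves to the smaller label (2, benign). The standard-library sketch below mirrors that rule for a single sample so the tie-breaking is explicit. `majority_vote` is a hypothetical helper, not part of the notebook's forest.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(votes)
    top = max(counts.values())
    # Several labels may share the top count; pick the smallest one,
    # matching np.unique (sorted labels) followed by argmax (first maximum)
    return min(label for label, count in counts.items() if count == top)


# Votes of a hypothetical 4-tree forest for three test samples
print(majority_vote([2, 4, 4, 4]))  # clear malignant majority
print(majority_vote([2, 2, 2, 4]))  # clear benign majority
print(majority_vote([2, 2, 4, 4]))  # tie between the two classes
```

Choosing an odd number of trees is one simple way to rule out ties in a two-class problem.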

In [89]:
rf = Build_RandomForest(X_train,y_train)

In [90]:
y_hat = RandomForest_predict(rf, X_test)

In [91]:
test_RandomForest(y_test,y_hat)

0.9571428571428572

4. Report the accuracies obtained with the random forest algorithm and compare them with the best accuracies obtained with the decision trees.
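
The scores printed in the cells above can be put side by side. This is a minimal sketch, assuming the 70-row test set that `test_size=0.1` yields on 699 rows (scikit-learn rounds the test size up). The accuracies are copied from the outputs above, and the dictionary keys are only labels.

```python
import math

# Test-set size implied by test_size=0.1 on 699 rows (rounded up)
n_test = math.ceil(0.1 * 699)

# Accuracies copied from the notebook outputs
scores = {
    "DT (gini)": 0.9428571428571428,
    "DT (entropy)": 0.9428571428571428,
    "RF (sklearn)": 0.9571428571428572,
    "RF (custom, 100 trees)": 0.9571428571428572,
}

# Convert each accuracy back into a count of correctly classified test rows
correct = {name: round(acc * n_test) for name, acc in scores.items()}

for name, acc in scores.items():
    print(f"{name:<24} {acc:.4f}  ({correct[name]}/{n_test} correct)")
```

On this split both forests classify one more test sample correctly than either single tree. Since `train_test_split` was called without a `random_state`, a different split can change that gap, so the difference should be read as indicative only.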

5. Submit your solution as a separate PDF in the final zip file of your submission.


Compute a decision tree that predicts a food review from the food's smell, taste and portion size.

(a) Compute the entropy of each rule in the first stage.

(b) Show the final decision tree. Clearly draw it.

Submit a handwritten response. Clearly show all the steps.
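
A hand computation for part (a) can be checked numerically once the review table is at hand. The table is not reproduced in this section, so the rows below are made-up placeholders that only demonstrate the calculation: the entropy of the labels within each value of a feature, and the resulting information gain of a first-stage split. All names and values are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


def group_labels(rows, feature, target):
    """Collect the target labels under each value of the given feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    return groups


def information_gain(rows, feature, target):
    """Parent entropy minus the size-weighted entropy of the children."""
    parent = [row[target] for row in rows]
    groups = group_labels(rows, feature, target)
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - children


# Placeholder data, NOT the assignment's table
rows = [
    {"smell": "good", "taste": "sweet", "portion": "large", "review": "like"},
    {"smell": "good", "taste": "sour", "portion": "small", "review": "like"},
    {"smell": "bad", "taste": "sweet", "portion": "large", "review": "dislike"},
    {"smell": "bad", "taste": "sour", "portion": "small", "review": "dislike"},
]

for feature in ("smell", "taste", "portion"):
    per_value = {v: entropy(g) for v, g in group_labels(rows, feature, "review").items()}
    print(feature, per_value, f"gain={information_gain(rows, feature, 'review'):.3f}")
```

The feature with the largest gain becomes the root of the tree, and the same computation is repeated on each child that is not yet pure.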

