## Decision Tree with AdaBoost Tutorial: UCI Lymphography Dataset

In this tutorial, the decision tree demonstrated in the previous assignment is enhanced with a boosting algorithm, in this case AdaBoost.

The following will be demonstrated and can be used as a supplement to Assignment 7:
 * Loading a dataset from its URL
 * Splitting a dataset into training and testing sets
 * Creating a decision tree object
 * Creating an AdaBoost classifier object
 * Tuning a decision tree model with AdaBoost using Sklearn's grid search option

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
import matplotlib.pyplot as plt

In the previous tutorials, the dataset was either downloaded to your local machine or loaded from Sklearn's datasets package. Here, we use the read_csv function from Pandas to retrieve the dataset directly from its URL. The file has no header row, so we pass header=None.

In [None]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/lymphography/lymphography.data',
                 sep=',', header=None)
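Because the file has no header row, the columns are labeled 0, 1, 2, and so on. The sketch below shows how column names could be supplied through the names parameter of read_csv. It reads from an in-memory string so that it runs without a network connection; the three rows and the column names are made up for illustration and are not taken from the lymphography file.

```python
import io

import pandas as pd

# Hypothetical toy rows in the same comma-separated, header-less layout.
# These are NOT real records from the lymphography dataset.
toy_csv = "2,4,2,1\n3,3,1,1\n2,3,2,2\n"

# Placeholder column names (assumed for illustration only).
columns = ["target", "attr_1", "attr_2", "attr_3"]

# header=None says the first line is data; names= supplies the labels.
toy_df = pd.read_csv(io.StringIO(toy_csv), sep=",", header=None, names=columns)
print(toy_df)
```

The same names argument can be passed when reading from the URL, provided the list has one entry per column in the file.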

### Visualizing Data

We show the first five entries of this dataset.

In [None]:
df.head()

In [None]:
print("Number of samples:", df.shape[0])
print("Number of features:", df.shape[1] - 1)

### Creating X and y matrices

From the dataframe, the data matrix X and the target vector y may be created as follows. According to the dataset description available online, the first column is the ground truth (target), while the remaining columns are attributes.

In [None]:
X = df.values[:, 1:]
y = df.values[:,0]

In [None]:
X, X.shape

In [None]:
y, y.shape

### Creating Training and Testing Data

To create a training and a testing set, we can use Sklearn's built-in train_test_split function. The test_size parameter sets the fraction of the data held out for testing; here, test_size=0.3 gives a 70:30 train-to-test ratio. Because no random_state is set, the split (and the accuracies reported later) will differ from run to run.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
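When the classes are imbalanced, a purely random split can leave a rare class under-represented in one of the two sets. As a minimal sketch, assuming we want a reproducible split that preserves class proportions, the stratify and random_state parameters of train_test_split can be used. The toy arrays below are made up for illustration and stand in for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 5, size=(100, 4))
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80:20 class ratio in both subsets;
# random_state fixes the shuffle so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Class counts in the training and testing labels
print(np.bincount(y_tr), np.bincount(y_te))
```

Stratification requires at least two samples in every class, so it is worth checking the class counts of y before relying on it.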

### Building Decision Tree

Similar to the previous assignment, the decision tree can be built using DecisionTreeClassifier. After creating it with the desired parameters, we fit it on the training data.

In [None]:
# Create a decision tree object
tree_gini = DecisionTreeClassifier()

# Fit on training data
tree_gini.fit(X_train, y_train)
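Once a tree is fitted, it can help to look at the rules it has learned. The following is a minimal sketch using Sklearn's export_text helper. It uses the iris data bundled with Sklearn as a stand-in for the lymphography data so that it runs offline, and max_depth=2 is an arbitrary choice that keeps the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset (not the lymphography data)
iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

The same call should work on tree_gini, although a fully grown tree will usually produce a much longer listing.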

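### Building AdaBoost Classifier

An AdaBoost classifier can be built with another Sklearn class, AdaBoostClassifier. For the base estimator of the classifier, we pass in the decision tree created previously. AdaBoostClassifier fits its own copies of this estimator, so the tree fitted above serves only as a template. Note that the parameter is named estimator in Sklearn 1.2 and later; older versions call it base_estimator.

In [None]:
# Create AdaBoost classifier object
# (use base_estimator=tree_gini on Sklearn versions older than 1.2)
abc = AdaBoostClassifier(estimator=tree_gini)

# Train AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = model.predict(X_test)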

In [None]:
print("Accuracy:",accuracy_score(y_test, y_pred))
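To give some intuition for what boosting does between rounds, here is a sketch of a single round of the classic binary AdaBoost weight update, written in plain Python. It is not necessarily what AdaBoostClassifier does internally (Sklearn uses a multiclass variant called SAMME), and the labels and predictions below are made up for illustration.

```python
import math

# Toy binary problem with labels in {-1, +1}; the values are made up.
y_true = [1, 1, -1, -1, 1]
y_weak = [1, 1, -1, -1, -1]  # the weak learner misclassifies the last sample

# Start with uniform sample weights
n = len(y_true)
weights = [1.0 / n] * n

# Weighted error: total weight of the misclassified samples
err = sum(w for w, t, p in zip(weights, y_true, y_weak) if t != p)

# Vote of this weak learner: larger when its error is smaller
alpha = 0.5 * math.log((1.0 - err) / err)

# Increase the weight of misclassified samples, decrease the rest,
# then renormalize so the weights sum to 1
new_weights = [
    w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_weak)
]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(round(err, 3), round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. If a base learner makes no training errors at all, there is nothing left to reweight, which is one reason shallow trees are a common choice of base estimator.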

### Tuning Decision Tree with GridSearchCV

We can tune the decision tree either by hand or with Sklearn's grid search option to try to obtain a higher accuracy. In grid search, we provide a list of candidate values for each decision tree parameter. The search then returns the combination of parameters that yields the highest cross-validated accuracy.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
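tree_param = {'criterion': ['gini', 'entropy'],
              'max_depth': [4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20],
              'min_samples_split': [20, 30, 40, 50, 60, 70, 80, 90, 100],
              'min_samples_leaf': [5, 6, 7, 8, 9, 10],
              # A dict keeps only the last value given for a repeated key, so
              # 'max_features' is listed once. 'auto' (same as 'sqrt' for a
              # classification tree) and 'presort' are omitted because recent
              # Sklearn versions no longer accept them.
              'max_features': ['sqrt', 'log2']
             }
# Exhaustive search over all combinations with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(), tree_param, cv=5)
grid.fit(X_train, y_train)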
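Grid search fits one model per parameter combination per fold, so the cost grows quickly as lists are added. As a rough sketch, Sklearn's ParameterGrid can be used to count the combinations before launching a search. The small grid below is a hypothetical example, not the grid used in this tutorial.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical small grid, for illustration only
small_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [5, 10],
}

# Number of parameter combinations: 2 * 3 * 2
n_candidates = len(ParameterGrid(small_grid))

# With 5-fold cross-validation, each combination is fitted 5 times
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 12 60
```

If the count is too large, shortening the lists or switching to a randomized search are common ways to reduce the run time.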


In [None]:
print(grid.best_score_)

print(grid.best_params_)

print(grid.best_estimator_)

Now that we know which of the parameter values provided in tree_param yield the highest cross-validated accuracy, we can use the tuned decision tree as the base estimator and rerun AdaBoost.

In [None]:
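# Use the tuned tree; a new name avoids shadowing the imported `tree` module.
# GridSearchCV has already refit it on the training set (refit=True by default),
# so no separate fit call is needed.
best_tree = grid.best_estimator_

# Create AdaBoost classifier object
# (use base_estimator=best_tree on Sklearn versions older than 1.2)
abc = AdaBoostClassifier(n_estimators=100,
                         learning_rate=1,
                         random_state=8,
                         estimator=best_tree)
# Train AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))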
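With a small dataset, the accuracy from a single 70:30 split can vary noticeably from one split to the next. One way to get a steadier estimate is k-fold cross-validation. The sketch below uses cross_val_score on synthetic data from make_classification, a stand-in chosen so the example runs offline, and leaves AdaBoostClassifier on its default base learner, a depth-1 decision tree. The sample size, feature count, and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the lymphography dataset)
X_syn, y_syn = make_classification(
    n_samples=150, n_features=18, n_informative=6, random_state=0
)

# AdaBoost with its default base learner (a depth-1 decision tree)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(clf, X_syn, y_syn, cv=5)
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

The same call can be applied to the tuned model and the real X and y, keeping in mind that very rare classes may trigger a warning when there are fewer samples than folds.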