# A Walk Through Ensemble Models
*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. Please check the pdf file for more details.*

In this exercise you will:

- get to know a useful package **pandas** for data analysis/preprocessing
- implement **decision tree** and apply it to a Titanic dataset
- implement a whole bunch of **ensemble methods**, including **random forest, and adaboost**, and apply them to a Titanic dataset

Please note that **YOU CANNOT USE ANY MACHINE LEARNING PACKAGE SUCH AS SKLEARN** for any homework, unless you are asked to.

In [1]:
# some basic imports
from scipy import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import re

%matplotlib inline

%load_ext autoreload
%autoreload 2

## Let's first do some data preprocessing

Here we use [pandas](https://pandas.pydata.org/) to do data preprocessing. Pandas is a very popular and handy package for data science or machine learning. You can also refer to this official guide for pandas: [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)

In [2]:
# read titanic train and test data
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

print("train shape: {} test shape: {}".format(train.shape, test.shape))
# Showing overview of the train dataset
train.head(3)

train shape: (1047, 11) test shape: (262, 11)


Unnamed: 0,Pclass,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C
1,3,0,"Holm, Mr. John Fredrik Alexander",male,43.0,0,0,C 7075,6.45,,S
2,3,0,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S


## deal with missing values and transform to discrete variables

In [3]:
# copied from: https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset
full_data = [train, test]

# Feature that tells whether a passenger had a cabin on the Titanic
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Create new feature FamilySize as a combination of SibSp and Parch
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# Remove all NULLS in the Embarked column
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Remove all NULLS in the Fare column
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())

# Remove all NULLS in the Age column
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # Next line has been improved to avoid warning
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Master": 2, "Mrs": 3, "Miss": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Mapping Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

In [4]:
drop_elements = ['Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
test  = test.drop(drop_elements, axis = 1)

In [5]:
train.head()

Unnamed: 0,Pclass,Survived,Sex,Age,Parch,Fare,Embarked,Has_Cabin,FamilySize,IsAlone,Title
0,1,1,0,1,0,3,1,1,1,1,4
1,3,0,1,2,0,0,0,0,1,1,1
2,3,0,1,2,0,1,0,0,3,0,1
3,3,0,1,1,0,0,2,0,1,1,1
4,2,0,1,2,0,1,0,0,1,1,1


In [6]:
test.head()

Unnamed: 0,Pclass,Survived,Sex,Age,Parch,Fare,Embarked,Has_Cabin,FamilySize,IsAlone,Title
0,3,0,1,1,0,0,2,0,1,1,1
1,3,0,1,1,2,2,0,0,4,0,2
2,3,0,1,0,2,3,0,0,8,0,2
3,2,0,1,2,0,1,0,0,1,1,1
4,3,0,1,1,0,1,0,0,1,1,1


One of the good thing of pd.DataFrame is that you can keep the column names along with the data, which can be beneficial for many case.

Another good thing is that pd.DataFrame can be converted to np.array implicitely.

Also, pd provides a lot of useful data manipulating methods for your convenience, though we may not use them in this homework.

In [7]:
X = train.drop(['Survived'], axis=1)
y = train["Survived"]
X_test = test.drop(['Survived'], axis=1)
y_test = test["Survived"]
print("train: {}, test: {}".format(X.shape, X_test.shape))

train: (1047, 10), test: (262, 10)


In [8]:
X.head()

Unnamed: 0,Pclass,Sex,Age,Parch,Fare,Embarked,Has_Cabin,FamilySize,IsAlone,Title
0,1,0,1,0,3,1,1,1,1,4
1,3,1,2,0,0,0,0,1,1,1
2,3,1,2,0,1,0,0,3,0,1
3,3,1,1,0,0,2,0,1,1,1
4,2,1,2,0,1,0,0,1,1,1


In [9]:
def accuracy(y_gt, y_pred):
    return np.sum(y_gt == y_pred) / y_gt.shape[0]

In [10]:
print("Survived: {:.4f}, Not Survivied: {:.4f}".format(y.sum() / len(y), 1 - y.sum() / len(y)))

Survived: 0.3878, Not Survivied: 0.6122


## Decision Tree
Now it's your turn to do some real coding. Please implement the decision tree model in **decision_tree.py**. The PDF file provides some hints for this part.

In [11]:
from decision_tree import DecisionTree
def get_h(dt):
    h = 0
    if type(dt) == dict:
        for t in list(dt.values())[0].values():
            h = max(h, get_h(t))
    return h + 1
criterion = ['entropy', 'infogain_ratio', 'gini']
for i in range(2, 10):
    print("when max_depth = ", i)
    for crit in criterion:
        print("    using ", crit)
        dt = DecisionTree(criterion=crit, max_depth=i, min_samples_leaf=1, sample_feature=False)
        dt.fit(X, y)
        y_train_pred = dt.predict(X)
        y_test_pred = dt.predict(X_test)
        print("        Accuracy on train set: {}".format(accuracy(y, y_train_pred)))
        print("        Accuracy on test set: {}".format(accuracy(y_test, y_test_pred)))

# Plot the decision tree to get an intuition about how it makes decision
# plt.figure(figsize=(20, 10))
# dt.show()

when max_depth =  2
    using  entropy
        Accuracy on train set: 0.8013371537726839
        Accuracy on test set: 0.8129770992366412
    using  infogain_ratio
        Accuracy on train set: 0.7822349570200573
        Accuracy on test set: 0.7709923664122137
    using  gini
        Accuracy on train set: 0.8013371537726839
        Accuracy on test set: 0.8129770992366412
when max_depth =  3
    using  entropy
        Accuracy on train set: 0.8233046800382043
        Accuracy on test set: 0.8053435114503816
    using  infogain_ratio
        Accuracy on train set: 0.8127984718242598
        Accuracy on test set: 0.7900763358778626
    using  gini
        Accuracy on train set: 0.8233046800382043
        Accuracy on test set: 0.8091603053435115
when max_depth =  4
    using  entropy
        Accuracy on train set: 0.8424068767908309
        Accuracy on test set: 0.7977099236641222
    using  infogain_ratio
        Accuracy on train set: 0.828080229226361
        Accuracy on test set: 0

In [12]:
# TODO: Train the best DecisionTree(best val accuracy) that you can. You should choose some 
# hyper-parameters such as critertion, max_depth, and min_samples_in_leaf 
# according to the cross-validation result.
# To reduce difficulty, you can use KFold here.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=2020)

best_method = ""
best_acc, best_min_leaf, best_depth = 0, 0, 0
criterion = ['entropy', 'gini', 'infogain_ratio']
for i in range(2, 11): # min_sample_leaf
    print("When min_sample_leaf = ", i)
    for j in range(4, 11): # max_depth
        print("    when max_depth = ", j)
        for crit in criterion:
            print("        use ", crit, end=':')
            dt = DecisionTree(criterion=crit, max_depth=j, min_samples_leaf=i, sample_feature=False)
            avg_accurancy = 0
            for train_indice, valid_indice in kf.split(X, y):
                X_train_fold, y_train_fold = X.loc[train_indice], y.loc[train_indice]
                X_val_fold, y_val_fold = X.loc[valid_indice], y.loc[valid_indice]
                dt.fit(X_train_fold, y_train_fold)
                y_train_pred = dt.predict(X_train_fold)
                y_valid_pred = dt.predict(X_val_fold)
                avg_accurancy += accuracy(y_train_fold, y_train_pred) * 0.25 + accuracy(y_val_fold, y_valid_pred) * 0.75
            avg_accurancy /= 5
            print("accuracy = ", avg_accurancy)
            if avg_accurancy > best_acc:
                best_acc = avg_accurancy
                best_min_leaf = i
                best_depth = j
                best_method = crit

# begin answer
print("The best decision tree is:")
print("best accurancy = ", best_acc)
print("best min leaf = ", best_min_leaf)
print("best depth = ", best_depth)
print("best criterion = ", best_method)
# end answer

When min_sample_leaf =  2
    when max_depth =  4
        use  entropy:accuracy =  0.7948289503518827
        use  gini:accuracy =  0.7863376260839499
        use  infogain_ratio:accuracy =  0.7985905134292232
    when max_depth =  5
        use  entropy:accuracy =  0.7959612352312044
        use  gini:accuracy =  0.7881295818952175
        use  infogain_ratio:accuracy =  0.7975157863322545
    when max_depth =  6
        use  entropy:accuracy =  0.7913089673498278
        use  gini:accuracy =  0.786404233076316
        use  infogain_ratio:accuracy =  0.7941568044182806
    when max_depth =  7
        use  entropy:accuracy =  0.7932783056804394
        use  gini:accuracy =  0.7888030557348118
        use  infogain_ratio:accuracy =  0.7942867271326979
    when max_depth =  8
        use  entropy:accuracy =  0.7890982525969823
        use  gini:accuracy =  0.7840741865812269
        use  infogain_ratio:accuracy =  0.7902776224945186
    when max_depth =  9
        use  entropy:accuracy =

        use  infogain_ratio:accuracy =  0.8021175699201842
    when max_depth =  6
        use  entropy:accuracy =  0.7960760250855109
        use  gini:accuracy =  0.7890898124360852
        use  infogain_ratio:accuracy =  0.7973192728315686
    when max_depth =  7
        use  entropy:accuracy =  0.7961391798770188
        use  gini:accuracy =  0.7892058691141114
        use  infogain_ratio:accuracy =  0.7913520013211858
    when max_depth =  8
        use  entropy:accuracy =  0.7982888722898691
        use  gini:accuracy =  0.7907572612053123
        use  infogain_ratio:accuracy =  0.7915909499472312
    when max_depth =  9
        use  entropy:accuracy =  0.7961391798770188
        use  gini:accuracy =  0.7886109864274587
        use  infogain_ratio:accuracy =  0.7923581375480351
    when max_depth =  10
        use  entropy:accuracy =  0.7962022633831372
        use  gini:accuracy =  0.7907572612053123
        use  infogain_ratio:accuracy =  0.7923649728180282
When min_sample_leaf

In [13]:
# report the accuracy on test set
# begin answer
dt = DecisionTree(criterion="entropy", max_depth=best_depth, min_samples_leaf=best_min_leaf, sample_feature=False)
# end answer
dt.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, dt.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, dt.predict(X_test))))

Accuracy on train set: 0.8347659980897804
Accuracy on test set: 0.7900763358778626


## Random Forest
Please implement the random forest model in **random_forest.py**. The PDF file provides some hints for this part.

In [14]:
from random_forest import RandomForest

base_learner = DecisionTree(criterion='entropy', max_depth=2, min_samples_leaf=1, sample_feature=True)
rf = RandomForest(base_learner=base_learner, n_estimator=10, seed=2020)
rf.fit(X, y)

y_train_pred = rf.predict(X)

print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))

Accuracy on train set: 0.7984718242597899


In [15]:
# TODO: Train the best RandomForest that you can. You should choose some 
# hyper-parameters such as max_depth, and min_samples_in_leaf 
# according to the cross-validation result.
# begin answer
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=2020)

best_method = ""
best_acc, best_min_leaf, best_depth = 0, 0, 0
criterion = ['entropy', 'gini', 'infogain_ratio']
for i in range(2, 11): # min_sample_leaf
    print("When min_sample_leaf = ", i)
    for j in range(4, 11): # max_depth
        print("    when max_depth = ", j)
        for crit in criterion:
            print("        use ", crit, end=':')
            base_leaner = DecisionTree(criterion=crit, max_depth=j, min_samples_leaf=i, sample_feature=False)
            rf = RandomForest(base_learner=base_learner, n_estimator=10, seed=2020)
            avg_accurancy = 0
            for train_indice, valid_indice in kf.split(X, y):
                X_train_fold, y_train_fold = X.loc[train_indice], y.loc[train_indice]
                X_val_fold, y_val_fold = X.loc[valid_indice], y.loc[valid_indice]
                rf.fit(X_train_fold, y_train_fold)
                y_train_pred = rf.predict(X_train_fold)
                y_valid_pred = rf.predict(X_val_fold)
                avg_accurancy += accuracy(y_train_fold, y_train_pred) * 0.25 + accuracy(y_val_fold, y_valid_pred) * 0.75
            avg_accurancy /= 5
            print("accuracy = ", avg_accurancy)
            if avg_accurancy > best_acc:
                best_acc = avg_accurancy
                best_min_leaf = i
                best_depth = j
                best_method = crit

# begin answer
print("The best decision tree is:")
print("best accurancy = ", best_acc)
print("best min leaf = ", best_min_leaf)
print("best depth = ", best_depth)
print("best criterion = ", best_method)
# end answer

When min_sample_leaf =  2
    when max_depth =  4
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7968788902296
    when max_depth =  5
        use  entropy:accuracy =  0.7954469011660319
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  6
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7954469011660319
        use  infogain_ratio:accuracy =  0.7954469011660319
    when max_depth =  7
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7954469011660319
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  8
        use  entropy:accuracy =  0.7954469011660319
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  9
        use  entropy:accuracy =  

        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  6
        use  entropy:accuracy =  0.7961646045153142
        use  gini:accuracy =  0.7961646045153142
        use  infogain_ratio:accuracy =  0.7954469011660319
    when max_depth =  7
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  8
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  9
        use  entropy:accuracy =  0.7968788902296
        use  gini:accuracy =  0.7961611868803177
        use  infogain_ratio:accuracy =  0.7961611868803177
    when max_depth =  10
        use  entropy:accuracy =  0.7961611868803177
        use  gini:accuracy =  0.7954469011660319
        use  infogain_ratio:accuracy =  0.7961611868803177
When min_sample_leaf = 

In [16]:
# report the accuracy on test set
# begin answer
dt = DecisionTree(criterion=best_method, max_depth=best_depth, min_samples_leaf=best_min_leaf, sample_feature=False)
rf = RandomForest(base_learner=base_learner, n_estimator=10, seed=2020)
# end answer
rf.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, rf.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, rf.predict(X_test))))

Accuracy on train set: 0.7984718242597899
Accuracy on test set: 0.7862595419847328


## Adaboost
Please implement the adaboost model in **adaboost.py**. The PDF file provides some hints for this part.

In [19]:
from adaboost import Adaboost

base_learner = DecisionTree(criterion='entropy', max_depth=1, min_samples_leaf=1, sample_feature=False)
ada = Adaboost(base_learner=base_learner, n_estimator=50, seed=2020)
ada.fit(X, y)

y_train_pred = ada.predict(X)

print("Accuracy on train set: {}".format(accuracy(y, y_train_pred)))

Accuracy on train set: 0.8099331423113658


In [22]:
# TODO: Train the best Adaboost that you can. You should choose some 
# hyper-parameters such as max_depth, and min_samples_in_leaf 
# according to the cross-validation result.
# begin answer
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=2020)

best_method = ""
best_acc, best_min_leaf, best_depth = 0, 0, 0
criterion = ['entropy', 'gini', 'infogain_ratio']
for i in range(2, 11): # min_sample_leaf
    print("When min_sample_leaf = ", i)
    for j in range(4, 11): # max_depth
        print("    when max_depth = ", j)
        for crit in criterion:
            print("        use ", crit)
            base_leaner = DecisionTree(criterion=crit, max_depth=j, min_samples_leaf=i, sample_feature=False)
            ada = Adaboost(base_learner=base_learner, n_estimator=50, seed=2020)
            avg_accurancy, avg_train, avg_test = 0, 0, 0
            for train_indice, valid_indice in kf.split(X, y):
                X_train_fold, y_train_fold = X.loc[train_indice], y.loc[train_indice]
                X_val_fold, y_val_fold = X.loc[valid_indice], y.loc[valid_indice]
                ada.fit(X_train_fold, y_train_fold)
                y_train_pred = ada.predict(X_train_fold)
                y_valid_pred = ada.predict(X_val_fold)
                train_acc, test_acc = accuracy(y_train_fold, y_train_pred), accuracy(y_val_fold, y_valid_pred)
                avg_train += train_acc
                avg_test += test_acc
                avg_accurancy += train_acc * 0.2 + test_acc * 0.8
            avg_accurancy /= 5
            avg_train /= 5
            avg_test /= 5
            print("            train accurancy = ", avg_train)
            print("            test accurancy = ", avg_test)
            print("            total accurancy = ", avg_accurancy)
            if avg_accurancy > best_acc:
                best_acc = avg_accurancy
                best_min_leaf = i
                best_depth = j
                best_method = crit

# begin answer
print("The best decision tree using adaboost is:")
print("best accurancy = ", best_acc)
print("best min leaf = ", best_min_leaf)
print("best depth = ", best_depth)
print("best criterion = ", best_method)
# end answer

When min_sample_leaf =  2
    when max_depth =  4
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  5
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.80969509813146

            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  6
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total a

            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  7
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  8
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.804205969

            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  8
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  9
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
  

            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
    when max_depth =  10
        use  entropy
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  gini
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total accurancy =  0.8053037952015953
        use  infogain_ratio
            train accurancy =  0.8096950981314673
            test accurancy =  0.8042059694691274
            total 

In [23]:
# report the accuracy on test set
# begin answer
base = DecisionTree(criterion=best_method, max_depth=best_depth, min_samples_leaf=best_min_leaf, sample_feature=False)
ada = Adaboost(base_learner=base, n_estimator=50, seed=2020)
# end answer
ada.fit(X, y)
print("Accuracy on train set: {}".format(accuracy(y, ada.predict(X))))
print("Accuracy on test set: {}".format(accuracy(y_test, ada.predict(X_test))))

Accuracy on train set: 0.8739255014326648
Accuracy on test set: 0.7862595419847328
