# Decision Tree for the human dataset
## 1. Data Preprocessing
We are going to train a decision tree model on the dataset and test how well it works.
We will use the scikit-learn library for its powerful functionality. First, import the packages and read the preprocessed data.

In [157]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

# Load the data from the CSV file
data_train = pd.read_csv('data_train.csv')
data_test = pd.read_csv('data_test.csv')


Note that the sklearn.tree library cannot deal with attributes stored as strings, so we first have to convert all the string attributes to integers using LabelEncoder(). 'Education' doesn't need to be converted because the original data already contains it as an integer column called 'educational_num'. Each string value of 'education' corresponds to a unique integer in 'educational_num'. Repeat this process for both the training dataset and the testing dataset.

In [158]:
# Convert a categorical variable to numerical values
encoder = LabelEncoder()
data_train['workclass'] = encoder.fit_transform(data_train['workclass'])
data_train['marital-status'] = encoder.fit_transform(data_train['marital-status'])
data_train['occupation'] = encoder.fit_transform(data_train['occupation'])
data_train['relationship'] = encoder.fit_transform(data_train['relationship'])
data_train['race'] = encoder.fit_transform(data_train['race'])
data_train['gender'] = encoder.fit_transform(data_train['gender'])
data_train['native-country'] = encoder.fit_transform(data_train['native-country'])
data_test['workclass'] = encoder.fit_transform(data_test['workclass'])
data_test['marital-status'] = encoder.fit_transform(data_test['marital-status'])
data_test['occupation'] = encoder.fit_transform(data_test['occupation'])
data_test['relationship'] = encoder.fit_transform(data_test['relationship'])
data_test['race'] = encoder.fit_transform(data_test['race'])
data_test['gender'] = encoder.fit_transform(data_test['gender'])
data_test['native-country'] = encoder.fit_transform(data_test['native-country'])
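
One caveat with the cell above: calling fit_transform() separately on the training and testing data learns two independent mappings, and they agree only if both splits contain exactly the same set of values in a column. The sketch below uses made-up toy data (not the real columns) to show the difference between separate fits and one shared mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical data): the test split is missing the 'Private' label
toy_train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})
toy_test = pd.DataFrame({"workclass": ["State-gov", "Self-emp"]})

# Separate fit: the encoder only sees the test labels, so codes start again at 0
separate_codes = LabelEncoder().fit_transform(toy_test["workclass"])

# Shared fit: learn one mapping from every label seen in either split
shared = LabelEncoder().fit(pd.concat([toy_train["workclass"], toy_test["workclass"]]))
train_codes = shared.transform(toy_train["workclass"])
test_codes = shared.transform(toy_test["workclass"])

print(list(separate_codes))  # [1, 0]
print(list(test_codes))      # [2, 1]
```

Fitting on the union of labels is an assumption made for this sketch. Fitting on the training data only and handling unseen test labels explicitly is an equally reasonable choice.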

We use 13 columns as X, so we create x_train and x_test from all of these columns and set y_train and y_test from the 'income' column. Because decision trees are not sensitive to the scale of the features, standard scaling is skipped.

In [159]:
# Set x_train and x_test to contain all the input columns of the DataFrame

feature_names = ['age', 'workclass', 'fnlwgt', 'educational_num', 'marital-status', 'occupation', 'relationship',
                 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

x_train = data_train[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train['income']
y_test = data_test['income']

print(x_train)
print(y_train)

       age  workclass  fnlwgt  educational_num  marital-status  occupation  \
0       39          6   77516               13               4           0   
1       50          5   83311               13               2           3   
2       38          3  215646                9               0           5   
3       53          3  234721                7               2           5   
4       28          3  338409               13               2           9   
...    ...        ...     ...              ...             ...         ...   
32556   27          3  257302               12               2          12   
32557   40          3  154374                9               2           6   
32558   58          3  151910                9               6           0   
32559   22          3  201490                9               4           0   
32560   52          4  287927                9               2           3   

       relationship  race  gender  capital-gain  capital-loss  

## 2. Get a preliminary result using the default settings of sklearn.tree
Feed the data into DecisionTreeClassifier() and get the result. The accuracy is 0.81.

In [160]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Summarize the result
# precision, recall, f1-score, support, accuracy, macro avg, weighted avg
print(classification_report(y_test,y_pred))
# ROC AUC score (computed here from the hard class predictions, not from probabilities)
print(roc_auc_score(y_test, y_pred))
# Confusion matrix
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.87     12435
           1       0.59      0.61      0.60      3846

    accuracy                           0.81     16281
   macro avg       0.73      0.74      0.74     16281
weighted avg       0.81      0.81      0.81     16281

0.7386096939655632
[[10797  1638]
 [ 1504  2342]]
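
Note that roc_auc_score() receives hard 0/1 predictions here. In that case the ROC curve has a single interior point, and the score reduces to the mean of the true positive rate and the true negative rate. The short check below reproduces the printed value from the confusion matrix alone.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 10797, 1638
fn, tp = 1504, 2342

tpr = tp / (tp + fn)  # recall of class 1
tnr = tn / (tn + fp)  # recall of class 0

# Area under a ROC curve that passes through the single point (1 - tnr, tpr)
auc_from_labels = (tpr + tnr) / 2
print(round(auc_from_labels, 4))  # 0.7386
```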


## 3. Hyperparameter Tuning
### 3.1 Preparation for tuning
The result above was only a first attempt, without any validation. With the default settings the tree keeps growing until its leaves are pure, so the tree we obtain is too complex. To avoid overfitting the training data, validation is necessary. We start with the most basic hyperparameter, the maximum depth of the tree, and loop over values of max_depth to find the one that gives the highest cross-validation score. There are several metrics for evaluating a decision tree model, such as accuracy, precision, recall and AUC, and which one is most useful depends on the specific problem. Since the data is quite imbalanced, we believe the AUC (Area Under the ROC Curve) is the most useful one, but we also list the other metrics when comparing the hyperparameters.

In [161]:
import numpy as np
from sklearn.tree import export_graphviz
from graphviz import Source
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_validate

# Define range of values for maximum depth of tree
max_depth_values = range(1, len(feature_names))

# Create arrays to store the evaluation metrics for each max_depth value
accuracy_scores = []
precision_scores = []
recall_scores = []
auc_scores = []

for max_depth in max_depth_values:

    # Create a new DecisionTreeClassifier with the current max_depth value
    clf = DecisionTreeClassifier(max_depth=max_depth)

    # Use cross_validate to calculate the evaluation metrics with 5-fold cross-validation
    cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=['accuracy', 'precision', 'recall', 'roc_auc'])

    # Store the mean evaluation metrics for the current max_depth value
    accuracy_scores.append(np.mean(cv_results['test_accuracy']))
    precision_scores.append(np.mean(cv_results['test_precision']))
    recall_scores.append(np.mean(cv_results['test_recall']))
    auc_scores.append(np.mean(cv_results['test_roc_auc']))

# Find the index of the max AUC score
best_index = np.argmax(auc_scores)

# Find the best max_depth value based on the index
best_max_depth = max_depth_values[best_index]

# Print the best max_depth value and the corresponding evaluation metrics
print("Best max_depth:", best_max_depth)
print("Accuracy:", accuracy_scores[best_index])
print("Precision:", precision_scores[best_index])
print("Recall:", recall_scores[best_index])
print("AUC:", auc_scores[best_index])

# Convert the per-fold results of the last max_depth tried to a DataFrame
df_cv_results = pd.DataFrame(cv_results, columns=['fit_time', 'score_time', 'test_accuracy','test_precision','test_recall','test_roc_auc'])
print(df_cv_results.to_markdown())


Best max_depth: 8
Accuracy: 0.854887842911795
Precision: 0.7814208412563405
Recall: 0.5518451405418765
AUC: 0.8983571492400083
|    |   fit_time |   score_time |   test_accuracy |   test_precision |   test_recall |   test_roc_auc |
|---:|-----------:|-------------:|----------------:|-----------------:|--------------:|---------------:|
|  0 |  0.104717  |    0.0139616 |        0.847075 |         0.703047 |      0.63225  |       0.87409  |
|  1 |  0.0917549 |    0.0139959 |        0.847359 |         0.703546 |      0.632653 |       0.870982 |
|  2 |  0.0987043 |    0.0149584 |        0.851505 |         0.707099 |      0.654337 |       0.867855 |
|  3 |  0.0937514 |    0.0149603 |        0.85734  |         0.739326 |      0.629464 |       0.881872 |
|  4 |  0.0977356 |    0.0139642 |        0.849355 |         0.729118 |      0.595663 |       0.873141 |


So a maximum depth of 8 gives the best cross-validated AUC.
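
The per-fold table above comes from the last depth tried, because cv_results is overwritten on every pass through the loop. If the folds of the best depth are wanted instead, one option is to keep every result in a dictionary keyed by depth. This is only a sketch on synthetic data; X_demo, y_demo and results_by_depth are names invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)

# Keep the full cross_validate output for every depth instead of overwriting it
results_by_depth = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    results_by_depth[depth] = cross_validate(tree, X_demo, y_demo, cv=5,
                                             scoring=["accuracy", "roc_auc"])

# Pick the depth with the highest mean AUC, then tabulate its own five folds
mean_auc = {d: np.mean(r["test_roc_auc"]) for d, r in results_by_depth.items()}
best_depth = max(mean_auc, key=mean_auc.get)
best_folds = pd.DataFrame(results_by_depth[best_depth])
print(best_depth)
print(best_folds)
```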

### 3.2 Further validation: grid search of hyperparameters
Now that we have taken our first step into validation, the question becomes: is there a need to validate more hyperparameters to further improve the tree? The more parameters we put into validation, the higher the validation score we can reach, but this doesn't necessarily mean the model becomes better, because we may end up overfitting to the validation set. Thankfully, k-fold cross-validation copes with this problem to some extent. However, the result is not always reproducible: no random seed is fixed, and the tree draws its candidate features at random whenever max_features is smaller than the number of features. For this reason, we start with a broad range of parameters, repeat the search several times to narrow down the range we are interested in, and then continue on finer ranges. We start with the range below and run the cross-validation 10 times before moving on.

In [162]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [6, 8, 10, 12],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [2, 4, 8, 10],
    'max_features': [4, 6, 8],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None, 'balanced'],
    'splitter': ['best', 'random']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,11):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)


Let's look at the results of the cross validation:
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9018088535869502
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9023159967690738
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 15, 'splitter': 'best'}
Best score:  0.9020759235497507
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 10, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9018326918331608
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 15, 'splitter': 'best'}
Best score:  0.9026765383844827
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 6, 'min_samples_leaf': 10, 'min_samples_split': 15, 'splitter': 'best'}
Best score:  0.90245374058493
Best hyperparameters:  {'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9030600977787238
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 15, 'splitter': 'best'}
Best score:  0.9015087997913884
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9022699344914443
Best hyperparameters:  {'class_weight': None, 'criterion': 'entropy', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9028009257856118
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9029742293622496
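
The winners above change from repeat to repeat because nothing in the search is seeded. If repeatable output is preferred, passing random_state to the classifier is one way to get it. The sketch below is an assumption-laden illustration on synthetic data with a much smaller grid; run_search is a helper invented for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
small_grid = {"max_depth": [3, 5], "max_features": [4, 6]}

def run_search(seed):
    """Run one grid search with the tree's random generator pinned to `seed`."""
    search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                          param_grid=small_grid, cv=5, scoring="roc_auc")
    search.fit(X_demo, y_demo)
    return search.best_params_, search.best_score_

first = run_search(seed=42)
second = run_search(seed=42)
print(first == second)  # identical winners and scores when the seed is fixed
```

With a fixed seed, repeating the loop no longer adds information, so the repeats would have to use different seeds to keep their role as a stability check.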

Among the hyperparameters, 'class_weight' and 'splitter' return consistent results, so their values can be fixed. 'max_depth', 'max_features' and 'min_samples_leaf' always alternate between two adjacent values of the grid, so their search ranges can be narrowed down. However, 'min_samples_split' doesn't show any pattern, so we keep the original interval for it. Now we continue with the second stage of tuning.

In [164]:
# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [7, 8, 9, 10, 11],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [7, 8, 9, 10, 11],
    'max_features': [5, 6, 7, 8],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None],
    'splitter': ['best']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,11):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output2.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)


Now let's look at the results of the second stage of tuning:
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9030214567701511
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9030683462814768
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 10, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9034754487608545
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 7, 'min_samples_leaf': 9, 'min_samples_split': 10, 'splitter': 'best'}
Best score:  0.9031170104918192
Best hyperparameters:  {'class_weight': None, 'criterion': 'entropy', 'max_depth': 9, 'max_features': 7, 'min_samples_leaf': 7, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.902907821853205
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9040258717842488
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 10, 'splitter': 'best'}
Best score:  0.9033427207923024
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 7, 'min_samples_leaf': 10, 'min_samples_split': 15, 'splitter': 'best'}
Best score:  0.903702840528228
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 7, 'min_samples_leaf': 7, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9029502519913434
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.903802099232124


'criterion' returns 'gini' 9 out of 10 times, so we fix it as 'gini'. 'max_depth' returns 9 or 10, so this will be its next range. 'max_features' returns 7 or 8, so its range can also be narrowed down. 'min_samples_leaf' and 'min_samples_split' have not settled on a value yet, so they become the main focus of the next stage.
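
The counting behind these statements can be done mechanically. The sketch below copies the ten second-stage winners listed above into tuples and tallies each hyperparameter with collections.Counter; the variable names are chosen for the illustration only.

```python
from collections import Counter

# (criterion, max_depth, max_features, min_samples_leaf, min_samples_split)
# for the ten second-stage winners listed above
winners = [
    ("gini", 9, 8, 10, 2), ("gini", 9, 8, 10, 5), ("gini", 10, 8, 11, 2),
    ("gini", 9, 7, 9, 10), ("entropy", 9, 7, 7, 2), ("gini", 9, 8, 10, 2),
    ("gini", 9, 8, 11, 10), ("gini", 9, 7, 10, 15), ("gini", 9, 7, 7, 5),
    ("gini", 9, 8, 11, 5),
]
names = ["criterion", "max_depth", "max_features",
         "min_samples_leaf", "min_samples_split"]

# One Counter per hyperparameter: value -> number of repeats that selected it
tallies = {name: Counter(values) for name, values in zip(names, zip(*winners))}
for name, tally in tallies.items():
    print(name, tally.most_common())
```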

In [165]:
# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [9, 10],
    'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16],
    'min_samples_leaf': [7, 8, 9, 10, 11],
    'max_features': [7, 8],
    'criterion': ['gini'],
    'class_weight': [None],
    'splitter': ['best']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,11):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output3.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)

In the third stage, we got this result:
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9037045168895492
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 10, 'splitter': 'best'}
Best score:  0.9038137829852172
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 16, 'splitter': 'best'}
Best score:  0.903599677396058
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9038987503057883
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 7, 'min_samples_leaf': 10, 'min_samples_split': 4, 'splitter': 'best'}
Best score:  0.9039794703028206
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 16, 'splitter': 'best'}
Best score:  0.9030998559778391
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 4, 'splitter': 'best'}
Best score:  0.9033415242064644
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 14, 'splitter': 'best'}
Best score:  0.9029601854222502
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9036369179988935
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 8, 'splitter': 'best'}
Best score:  0.9043456265969627

Now 'max_depth' is fixed at 9 and 'max_features' at 8. 'min_samples_split' still varies a lot. In the next step, we increase the number of folds to see whether it helps.

In [166]:
# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [9],
    'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16],
    'min_samples_leaf': [7, 8, 9, 10, 11],
    'max_features': [8],
    'criterion': ['gini'],
    'class_weight': [None],
    'splitter': ['best']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,11):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output4.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)

The results look like this:
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 12, 'splitter': 'best'}
Best score:  0.9041449909728595
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 4, 'splitter': 'best'}
Best score:  0.9038166181826603
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 8, 'splitter': 'best'}
Best score:  0.9037586841597687
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 6, 'splitter': 'best'}
Best score:  0.9039576452353714
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9040563125410946
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 8, 'splitter': 'best'}
Best score:  0.9044134180305449
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 8, 'splitter': 'best'}
Best score:  0.9046020616791333
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9049439115861855
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 10, 'splitter': 'best'}
Best score:  0.9043810854454328
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 6, 'splitter': 'best'}
Best score:  0.9047370751625274
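
Raising cv is one way to average each candidate over more splits. Another option, which this notebook does not use, is to keep the fold count and repeat the split with different shuffles inside a single search, so that every candidate is scored on the same set of splits. A minimal sketch with RepeatedStratifiedKFold on synthetic data, with a grid shortened for the illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 5 folds repeated 3 times with different shuffles: 15 validation scores per candidate
repeated_folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(max_depth=4, random_state=0),
    param_grid={"min_samples_split": [2, 8, 16], "min_samples_leaf": [5, 10]},
    cv=repeated_folds,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Whether this settles 'min_samples_split' on the real data is not something the sketch can show; it only illustrates the mechanics.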

The range of min_samples_split seems to narrow down a little to smaller numbers so we continue.

In [167]:
# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [9],
    'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'min_samples_leaf': [7, 8, 9, 10, 11],
    'max_features': [8],
    'criterion': ['gini'],
    'class_weight': [None],
    'splitter': ['best']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,21):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output5.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)

Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9050972547937919
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 10, 'splitter': 'best'}
Best score:  0.9040036813454588
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 9, 'splitter': 'best'}
Best score:  0.9043603160414188
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 11, 'min_samples_split': 4, 'splitter': 'best'}
Best score:  0.9037581529272398
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.904153037

Based on the result, further narrow down the ranges and increase the fold number.

In [168]:
# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [9],
    'min_samples_split': [4, 5, 6, 7, 8, 9],
    'min_samples_leaf': [8, 9, 10],
    'max_features': [8],
    'criterion': ['gini'],
    'class_weight': [None],
    'splitter': ['best']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

for i in range(1,11):

    # Perform grid search with cross-validation
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=20, scoring='roc_auc')
    grid_search.fit(x_train, y_train)

    # Print the best hyperparameters and their score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    print("Best hyperparameters: ", best_params)
    print("Best score: ", best_score)

    # print the result into a file each time
    with open('output6.txt', 'a') as f:
        print("Best hyperparameters: ", best_params, file=f)
        print("Best score: ", best_score, file=f)

Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 8, 'splitter': 'best'}
Best score:  0.9042196046938257
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 9, 'splitter': 'best'}
Best score:  0.9046673402911278
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 9, 'min_samples_split': 6, 'splitter': 'best'}
Best score:  0.9049676922110279
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 10, 'min_samples_split': 5, 'splitter': 'best'}
Best score:  0.9045800591908494
Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 9, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 6, 'splitter': 'best'}
Best score:  0.9042498109

Since the best score is no longer sensitive to the hyperparameters, we choose the modes of 'min_samples_leaf' (10) and 'min_samples_split' (8) as the final values.

## 4. Use the test set to get the final result
Use the hyperparameters obtained above to run the test again.

In [177]:
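# Test the tree using the test set
# NOTE: data_train_val is not defined in the cells shown here; it is assumed to hold
# the full training data, encoded in the same way as data_train
x_train = data_train_val[feature_names]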
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train_val['income']
y_test = data_test['income']

# Initialize a decision tree classifier
clf = DecisionTreeClassifier(class_weight=None, criterion= 'gini', max_depth=9, max_features= 8, min_samples_leaf=10, min_samples_split= 8, splitter= 'best')

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("AUC:", auc)



# Visualize the decision tree
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names, filled=True,
                rounded=True, special_characters=True)

with open("tree.dot") as f:
    dot_graph = f.read()

graph = Source(dot_graph)
graph.format = "png"
graph.render("decision_tree", view=True)



Accuracy: 0.8565812910754868
Precision: 0.7762340036563071
Recall: 0.5520020800832033
AUC: 0.7513930786423254


'decision_tree.png'
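
The AUC printed above is computed from the hard class predictions, whereas the 'roc_auc' scorer used during cross-validation ranks samples by their predicted scores. That difference is probably why the test AUC (0.75) looks much lower than the cross-validated values (about 0.90). To get a comparable test figure, roc_auc_score() can be given the positive-class column of predict_proba(). The sketch below shows both variants on synthetic data, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 labels, as in the cell above
auc_hard = roc_auc_score(y_te, tree.predict(X_te))
# AUC from the predicted probability of the positive class (column 1)
auc_soft = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print("AUC from labels:", auc_hard)
print("AUC from probabilities:", auc_soft)
```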