# Decision Tree for the human dataset
## 1. Data Preprocessing
We are going to train a decision tree model on the dataset and test how well it works.
We will use the scikit-learn library for its powerful functionality. First, import the packages and read the preprocessed data.

In [150]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

# Load the data from the CSV file
data_train = pd.read_csv('data_train.csv')
data_test = pd.read_csv('data_test.csv')


Note that sklearn.tree cannot handle string-valued attributes, so we first have to convert all string attributes to integers using LabelEncoder(). 'education' does not need to be converted because the original data already provides it as an integer column called 'educational_num': each string value of 'education' corresponds to a unique integer in 'educational_num'. The encoding is applied to both the training dataset and the testing dataset.

In [151]:
# Convert the categorical variables to numerical values.
# Each encoder is fitted on the values from both files so that a given string
# maps to the same integer in the training and the testing data.
categorical_columns = ['workclass', 'marital-status', 'occupation', 'relationship',
                       'race', 'gender', 'native-country']
for column in categorical_columns:
    encoder = LabelEncoder()
    encoder.fit(pd.concat([data_train[column], data_test[column]]))
    data_train[column] = encoder.transform(data_train[column])
    data_test[column] = encoder.transform(data_test[column])
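
When two datasets are label-encoded, the same string should end up as the same integer in both. The sketch below uses made-up toy values (the names `toy_train` and `toy_test` are hypothetical, not part of the dataset) to show how the mapping can drift when an encoder is fitted separately on each split, and how reusing one fitted encoder avoids that.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns; the test split happens to lack the value 'Private'.
toy_train = pd.Series(['Private', 'State-gov', 'Self-emp', 'Private'])
toy_test = pd.Series(['State-gov', 'Self-emp'])

# Fitting separately: each encoder builds its own sorted mapping.
separate_train = LabelEncoder().fit_transform(toy_train)
separate_test = LabelEncoder().fit_transform(toy_test)

# Fitting once and reusing the mapping for the other split.
shared = LabelEncoder().fit(toy_train)
shared_train = shared.transform(toy_train)
shared_test = shared.transform(toy_test)

print("separate, train:", dict(zip(toy_train, separate_train)))
print("separate, test: ", dict(zip(toy_test, separate_test)))
print("shared, test:   ", dict(zip(toy_test, shared_test)))
```

With separate fitting, 'State-gov' is 2 in the toy training data but 1 in the toy test data. With the shared encoder it is 2 in both. Note that `transform` raises a `ValueError` for values the encoder has not seen during `fit`, so the encoder has to be fitted on every value that can occur.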

We use 13 columns as the input X, so we create x_train and x_test from these columns and set y_train and y_test from the 'income' column. Because decision trees are not sensitive to the scale of the features, standard scaling is skipped.

In [152]:
# Set x_train and x_test to contain all the input columns of the DataFrame

feature_names = ['age', 'workclass', 'fnlwgt', 'educational_num', 'marital-status', 'occupation', 'relationship',
                 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

x_train = data_train[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train['income']
y_test = data_test['income']

print(x_train)
print(y_train)

       age  workclass  fnlwgt  educational_num  marital-status  occupation  \
0       39          6   77516               13               4           0   
1       50          5   83311               13               2           3   
2       38          3  215646                9               0           5   
3       53          3  234721                7               2           5   
4       28          3  338409               13               2           9   
...    ...        ...     ...              ...             ...         ...   
32556   27          3  257302               12               2          12   
32557   40          3  154374                9               2           6   
32558   58          3  151910                9               6           0   
32559   22          3  201490                9               4           0   
32560   52          4  287927                9               2           3   

       relationship  race  gender  capital-gain  capital-loss  
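
The statement that scaling can be skipped for decision trees can be checked empirically. The following is a minimal sketch on synthetic data from `make_classification` (not the dataset used here): it fits the same tree on raw and on standardized features and measures how often the predictions agree. All names in it are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; one feature is blown up to a very different scale.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 1000.0

# Same tree settings and seed, once on raw and once on standardized features.
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_demo)

# Fraction of samples on which both trees predict the same class.
agreement = np.mean(tree_raw.predict(X_demo) == tree_scaled.predict(X_scaled))
print("Prediction agreement:", agreement)
```

Splits depend only on the ordering of the values within each feature, and standardization preserves that ordering, so the two trees are expected to partition the samples in the same way.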

## 2. Getting a preliminary result using the default settings of sklearn.tree
Feed the data into DecisionTreeClassifier() and get the result. The accuracy is 0.81.

In [153]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Initialize a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

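# Summarize the result:
# precision, recall, f1-score, support, accuracy, macro avg, weighted avg
print(classification_report(y_test, y_pred))
# ROC AUC computed from the hard class predictions (not predicted probabilities)
print(roc_auc_score(y_test, y_pred))
# Confusion matrix (rows: true classes, columns: predicted classes)
print(confusion_matrix(y_test, y_pred))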

              precision    recall  f1-score   support

           0       0.88      0.87      0.87     12435
           1       0.59      0.61      0.60      3846

    accuracy                           0.81     16281
   macro avg       0.73      0.74      0.74     16281
weighted avg       0.81      0.81      0.81     16281

0.7389997095661872
[[10797  1638]
 [ 1501  2345]]
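
The reported numbers can be recomputed by hand from the confusion matrix above. The short sketch below does this in plain Python. It also shows that when `roc_auc_score` receives hard 0/1 predictions, the value it returns is the mean of the recall on the two classes (the balanced accuracy), which is why 0.739 is not directly comparable with an AUC computed from predicted probabilities.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 10797, 1638
fn, tp = 1501, 2345

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)        # of the predicted positives, how many are correct
recall = tp / (tp + fn)           # true positive rate for class 1
specificity = tn / (tn + fp)      # true negative rate (recall for class 0)

# With hard predictions the ROC curve has a single interior point, and the
# area under it reduces to the mean of the two recalls.
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

To obtain a ranking-based AUC on the test set, one would pass the class-1 probabilities, for example `clf.predict_proba(x_test)[:, 1]`, as the second argument of `roc_auc_score` instead of `y_pred`.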


## 3. Hyperparameter Tuning
### 3.1 Preparation for tuning
The result above was only a first attempt without validation, and a tree grown with the default settings is likely too complex because its depth is unlimited. To avoid overfitting the training data, validation is necessary. We start with the most basic hyperparameter, the maximum depth of the tree, and loop over values of max_depth to find the one that gives the highest cross-validation score. There are several metrics for evaluating a decision tree model, such as accuracy, precision, recall and AUC, and each can be useful depending on the specific problem. Since the data is quite imbalanced, we believe the AUC (Area Under the ROC Curve) is the most useful one, but we also list the other metrics when comparing the hyperparameters.

In [154]:
import numpy as np
from sklearn.tree import export_graphviz
from graphviz import Source
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_validate

# Define range of values for maximum depth of tree
max_depth_values = range(1, len(feature_names))

# Create arrays to store the evaluation metrics for each max_depth value
accuracy_scores = []
precision_scores = []
recall_scores = []
auc_scores = []

for max_depth in max_depth_values:

    # Create a new DecisionTreeClassifier with the current max_depth value
    clf = DecisionTreeClassifier(max_depth=max_depth)

    # Use cross_validate to calculate the evaluation metrics with 5-fold cross-validation
    cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=['accuracy', 'precision', 'recall', 'roc_auc'])

    # Store the mean evaluation metrics for the current max_depth value
    accuracy_scores.append(np.mean(cv_results['test_accuracy']))
    precision_scores.append(np.mean(cv_results['test_precision']))
    recall_scores.append(np.mean(cv_results['test_recall']))
    auc_scores.append(np.mean(cv_results['test_roc_auc']))

# Find the index of the max AUC score
best_index = np.argmax(auc_scores)

# Find the best max_depth value based on the index
best_max_depth = max_depth_values[best_index]

# Print the best max_depth value and the corresponding evaluation metrics
print("Best max_depth:", best_max_depth)
print("Accuracy:", accuracy_scores[best_index])
print("Precision:", precision_scores[best_index])
print("Recall:", recall_scores[best_index])
print("AUC:", auc_scores[best_index])

# Convert the per-fold results of the last loop iteration (the largest max_depth) to a DataFrame
df_cv_results = pd.DataFrame(cv_results, columns=['fit_time', 'score_time', 'test_accuracy','test_precision','test_recall','test_roc_auc'])
print(df_cv_results.to_markdown())


Best max_depth: 8
Accuracy: 0.8549492632576465
Precision: 0.7814015794170229
Recall: 0.5522277123086328
AUC: 0.8987066080877936
|    |   fit_time |   score_time |   test_accuracy |   test_precision |   test_recall |   test_roc_auc |
|---:|-----------:|-------------:|----------------:|-----------------:|--------------:|---------------:|
|  0 |  0.0997117 |    0.0159898 |        0.84815  |         0.707439 |      0.630338 |       0.87332  |
|  1 |  0.102692  |    0.0219429 |        0.84613  |         0.699859 |      0.632015 |       0.871872 |
|  2 |  0.105717  |    0.0149922 |        0.851198 |         0.70698  |      0.652423 |       0.866322 |
|  3 |  0.0967505 |    0.0179362 |        0.856572 |         0.737988 |      0.626913 |       0.880026 |
|  4 |  0.099755  |    0.0149343 |        0.849816 |         0.729751 |      0.597577 |       0.874404 |


So a maximum depth of 8 gives the best cross-validated AUC.
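
To compare all depths side by side, the mean metrics of every max_depth value can be collected into one table. The following is a minimal sketch on synthetic, mildly imbalanced data from `make_classification`. The names `X_demo`, `y_demo` and `summary` are illustrative assumptions, and the numbers it prints are unrelated to the dataset above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 75/25 class split.
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)

metrics = ['accuracy', 'precision', 'recall', 'roc_auc']
rows = []
for depth in range(1, 7):
    clf_demo = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metrics)
    # Keep the mean over the 5 folds for each metric.
    row = {'max_depth': depth}
    row.update({m: cv[f'test_{m}'].mean() for m in metrics})
    rows.append(row)

summary = pd.DataFrame(rows).set_index('max_depth')
print(summary.round(3))
print("Best max_depth by AUC:", summary['roc_auc'].idxmax())
```

Keeping one row per depth makes it easy to see whether the AUC peaks clearly at one depth or is nearly flat over a range of depths.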

### 3.3 Further validation: grid search of hyperparameters 
Now that we have taken our first step into validation, the question becomes whether validating more hyperparameters can further improve the tree. The more parameters we include in the search, the higher the best validation score tends to be, but this does not necessarily mean the model becomes better, because we may end up overfitting to the validation set. K-fold cross-validation mitigates this problem to some extent. However, the result is not always reproducible: no random_state is fixed, and the tree draws random feature subsets when max_features is smaller than the number of features (and random thresholds when splitter='random'). We therefore start with a broad range of parameters, repeat the search several times to narrow down the range we are interested in, and then continue on finer ranges. We start with the range below.

In [155]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter space for the grid search
param_grid = {
    'max_depth': [6, 8, 10, 12],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [2, 4, 8, 10],
    'max_features': [4, 6, 8],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None, 'balanced'],
    'splitter': ['best', 'random']
}

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

# Perform grid search with cross-validation
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='roc_auc')
grid_search.fit(x_train, y_train)

# Print the best hyperparameters and their score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best hyperparameters: ", best_params)
print("Best score: ", best_score)

# print the result into a file each time
with open('output.txt', 'a') as f:
    print("Best hyperparameters: ", best_params, file=f)
    print("Best score: ", best_score, file=f)


Best hyperparameters:  {'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 8, 'min_samples_split': 2, 'splitter': 'best'}
Best score:  0.9018088535869502
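
One way to make repeated searches comparable is to fix the random seeds of both the tree and the fold splitter. The sketch below illustrates this on synthetic data with a deliberately small grid. The helper `run_search`, the seed value and the grid are assumptions chosen for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# A small illustrative grid; max_features < 8 and splitter='random' both
# introduce randomness into the fitted tree.
small_grid = {
    'max_depth': [3, 5],
    'max_features': [4, 8],
    'splitter': ['best', 'random'],
}

def run_search(seed):
    """Run one grid search with the tree and the folds seeded identically."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                          param_grid=small_grid, cv=folds, scoring='roc_auc')
    search.fit(X_demo, y_demo)
    return search.best_params_, search.best_score_

first = run_search(seed=42)
second = run_search(seed=42)
print("Identical results:", first == second)
```

Running the search with several different seeds and checking which parameter values keep winning is a seeded alternative to repeating an unseeded search several times.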


Use the hyperparameters obtained to run the test again.

In [156]:
# Retrain on the full training set and evaluate on the test set
x_train = data_train[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train['income']
y_test = data_test['income']

# Initialize a decision tree classifier with the tuned max_depth and min_samples_leaf
# (the other tuned values are left at the scikit-learn defaults here)
clf = DecisionTreeClassifier(max_depth=grid_search.best_params_['max_depth'], min_samples_leaf=grid_search.best_params_['min_samples_leaf'])

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Visualize the decision tree
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names, filled=True,
                rounded=True, special_characters=True)

with open("tree.dot") as f:
    dot_graph = f.read()

graph = Source(dot_graph)
graph.format = "png"
graph.render("decision_tree", view=True)



Accuracy: 0.8552300227258768


'decision_tree.png'
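
If Graphviz is not available, the fitted tree can also be inspected as plain text with `sklearn.tree.export_text`. The following is a minimal sketch on a small synthetic tree. The feature names `f0` to `f3` and the variable names are placeholders, not columns of the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem so that the printed tree stays readable.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)
demo_names = ['f0', 'f1', 'f2', 'f3']

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per node: split conditions for inner nodes, predicted class for leaves.
rules = export_text(tree_demo, feature_names=demo_names)
print(rules)
```

For the tree trained above, the equivalent call would be `export_text(clf, feature_names=feature_names)`. With a depth of 8 the output is long, so the `max_depth` argument of `export_text` can be used to truncate it.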