# Decision Tree for the human dataset
We are going to train a decision tree model on the dataset and test how well it works.
We will use the scikit-learn library for its powerful functionality. First, import the packages and read the preprocessed data.

In [70]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Load the data from the CSV file
data_train = pd.read_csv('data_train.csv')
data_test = pd.read_csv('data_test.csv')


Note that the decision trees in sklearn.tree cannot handle string-valued attributes, so we first have to convert all the string attributes to integers using LabelEncoder(). 'education' does not need to be converted because the original data already contains an integer version of it in the column 'educational_num': each string value of 'education' corresponds to a unique integer in 'educational_num'. Repeat this process for both the training dataset and the testing dataset. Because the encoder is refit on each dataset, the integer codes agree between the two only if every column contains the same set of categories in both files.

In [71]:
# Convert the categorical (string) columns to integer codes
categorical_columns = ['workclass', 'marital-status', 'occupation', 'relationship',
                       'race', 'gender', 'native-country']

encoder = LabelEncoder()
for df in (data_train, data_test):
    for column in categorical_columns:
        # fit_transform refits the encoder on each column of each DataFrame
        df[column] = encoder.fit_transform(df[column])
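The caveat about refitting the encoder on each dataset can be illustrated on toy data. The sketch below is not part of the original notebook: the two small DataFrames and their values are made up, and OrdinalEncoder with `handle_unknown="use_encoded_value"` (available in scikit-learn 0.24 and later) is swapped in as one possible way to fit on the training data only and reuse the same mapping on the test data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy frames with made-up values: the test frame lacks the 'Private' category.
toy_train = pd.DataFrame({"workclass": ["Private", "State-gov", "Self-emp", "Private"]})
toy_test = pd.DataFrame({"workclass": ["State-gov", "Self-emp"]})

# Refitting a LabelEncoder per frame: the same string can get different codes.
train_codes = LabelEncoder().fit_transform(toy_train["workclass"])
test_codes = LabelEncoder().fit_transform(toy_test["workclass"])
print("separate fits:", train_codes, test_codes)

# Fitting once on the training frame and reusing the mapping keeps codes consistent.
# Categories never seen during fit are mapped to -1 instead of raising an error.
shared = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = shared.fit_transform(toy_train[["workclass"]])
test_enc = shared.transform(toy_test[["workclass"]])
print("shared fit:", train_enc.ravel(), test_enc.ravel())
```

If both files contain exactly the same categories in every column, the two approaches produce the same codes, so the results reported in this notebook are not necessarily affected.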

We use 13 columns as the input features X, so create x_train and x_test from these columns. Set y_train and y_test from the 'income' column.

In [72]:
# Set x_train and x_test to contain all the input columns of the DataFrame

feature_names = ['age', 'workclass', 'fnlwgt', 'educational_num', 'marital-status', 'occupation', 'relationship',
                 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']

x_train = data_train[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train['income']
y_test = data_test['income']

Feed the data into DecisionTreeClassifier() and get the result.

In [73]:
# Initialize a decision tree classifier
clf = DecisionTreeClassifier()

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.808549843375714
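One way to check whether a tree grown without any limits overfits is to compare its size and its training accuracy with its accuracy on held-out data. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification instead of the CSV files, and the value 5 for the depth limit is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 features and some label noise
X, y = make_classification(n_samples=2000, n_features=13, n_informative=5,
                           flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

summary = {}
for label, depth in [("unconstrained", None), ("max_depth=5", 5)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    summary[label] = {
        "depth": tree.get_depth(),          # depth of the fitted tree
        "leaves": tree.get_n_leaves(),      # number of leaf nodes
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }
    print(label, summary[label])
```

A large gap between training and held-out accuracy, together with a very deep tree, is the usual sign that the model has memorized noise in the training data.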


This was just a first attempt without validation. The tree we obtain here is too complex: with no depth limit, it keeps splitting until it fits the training data very closely. To avoid overfitting the training data, validation is necessary. We will start with the most basic hyperparameter, the maximum depth of the tree. To do so, we first import more packages for the validation.

In [74]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz
from graphviz import Source

When doing validation, we have to split the original training set into an actual training set and a new validation set, while still keeping the test set unseen. So we reload the training CSV file under the name data_train_val for clarity and encode it in the same way as before.

In [75]:
data_train_val = pd.read_csv('data_train.csv')
data_train_val['workclass'] = encoder.fit_transform(data_train_val['workclass'])
data_train_val['marital-status'] = encoder.fit_transform(data_train_val['marital-status'])
data_train_val['occupation'] = encoder.fit_transform(data_train_val['occupation'])
data_train_val['relationship'] = encoder.fit_transform(data_train_val['relationship'])
data_train_val['race'] = encoder.fit_transform(data_train_val['race'])
data_train_val['gender'] = encoder.fit_transform(data_train_val['gender'])
data_train_val['native-country'] = encoder.fit_transform(data_train_val['native-country'])

Split off a validation set (20% of the data) and define the candidate values for the maximum depth. The 5-fold cross-validation itself runs in the next cell, on the remaining training data.

In [76]:
# Split training data into real training and validation sets
X_train, X_val, y_train, y_val = train_test_split(data_train_val[feature_names], data_train_val['income'], test_size=0.2, random_state=42)

# Define range of values for maximum depth of tree
max_depth_values = range(1, len(feature_names))
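If the two income classes are imbalanced, a purely random split can leave the validation set with a slightly different class ratio than the training set. A minimal sketch of a stratified split follows, assuming made-up labels with a 75/25 class ratio; the `stratify` argument is an addition that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 25% of them in the positive class
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 750 + [1] * 250)

# stratify=y_demo keeps the class ratio (almost) identical in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("positive share, train:     ", y_tr.mean())
print("positive share, validation:", y_va.mean())
```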



Loop over the values of max_depth to find the one that gives the highest mean 5-fold cross-validation score.

In [77]:
# Loop over values of max_depth and calculate cross-validation score for each
cv_scores = []
for max_depth in max_depth_values:
    clf = DecisionTreeClassifier(max_depth=max_depth)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

# Choose value of max_depth that gives best cross-validation score
best_max_depth = max_depth_values[np.argmax(cv_scores)]

print(best_max_depth)

# Train final model using best value of max_depth
final_clf = DecisionTreeClassifier(max_depth=best_max_depth)
final_clf.fit(X_train, y_train)

# Evaluate performance on validation set
val_score = final_clf.score(X_val, y_val)

7
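The loop above keeps only the mean validation score per depth. To see how training and validation scores diverge as the tree gets deeper, scikit-learn's validation_curve helper can compute both in one call. This is a sketch on synthetic data (an assumption, since the CSV files are not available here), so the best depth it reports says nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 features, mild label noise
X, y = make_classification(n_samples=1500, n_features=13, n_informative=5,
                           flip_y=0.1, random_state=0)

depths = list(range(1, 13))
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Each result has one row per depth and one column per fold
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for d, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")

best_depth = depths[int(np.argmax(mean_val))]
print("best max_depth on this synthetic data:", best_depth)
```

Training accuracy typically keeps rising with depth while the validation score levels off or drops, which is the pattern the depth search is meant to detect.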


With the best_max_depth obtained, fit the tree again on the full training data and check the predictions on the test set.

In [78]:
# Test the tree using the test set
x_train = data_train_val[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train_val['income']
y_test = data_test['income']

# Initialize a decision tree classifier
clf = DecisionTreeClassifier(max_depth=best_max_depth)

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8568269762299613


Visualize the tree. The tree should look like this:

![Alt text](decision_tree1.png)

In [79]:
# Visualize the decision tree
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names, filled=True,
                rounded=True, special_characters=True)

with open("tree.dot") as f:
    dot_graph = f.read()

graph = Source(dot_graph)
graph.format = "png"
graph.render("decision_tree", view=True)

'decision_tree.png'
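When Graphviz is not installed, the same tree structure can be inspected as plain text with sklearn.tree.export_text. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption); with the notebook's data, `clf` and `feature_names` would take the place of `small_tree` and `iris.feature_names`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset that ships with scikit-learn
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the split rules as indented text, one line per node
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```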

Now that we have significantly shrunk the tree, the final accuracy on the test set has improved from about 0.809 to about 0.857, which indicates that the cross-validation has worked. The question now is whether validating more hyperparameters would improve the tree further. There are several more parameters to adjust, so we start with the pair max_depth and min_samples_leaf.

In [80]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to search over
param_grid = {
    'max_depth': [6, 7, 8, 9],
    'min_samples_leaf': [1, 2, 3, 4]
}

# Create a decision tree classifier object
tree = DecisionTreeClassifier()

# Create a GridSearchCV object and fit it to the data
grid_search = GridSearchCV(tree, param_grid, cv=5)
grid_search.fit(data_train_val[feature_names], data_train_val['income'])

# Print the best combination of hyperparameters
print("Best parameters:", grid_search.best_params_)

# Print the best mean cross-validation score
print("Best cross-validation score:", grid_search.best_score_)

Best parameters: {'max_depth': 8, 'min_samples_leaf': 1}
Best cross-validation score: 0.8550413961342105
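Besides the single best combination, GridSearchCV stores the scores of every combination in `cv_results_`. Looking at the mean and standard deviation across folds helps judge whether the winner is clearly better than its neighbours or only marginally so. The following is a sketch on synthetic data (an assumption); with the notebook's objects, `grid_search.cv_results_` could be inspected in the same way.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=1500, n_features=13, n_informative=5,
                           flip_y=0.1, random_state=0)

param_grid = {"max_depth": [6, 7, 8, 9], "min_samples_leaf": [1, 2, 3, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# One row per parameter combination, best rank first
columns = ["param_max_depth", "param_min_samples_leaf",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.head())
```

If the top few rows differ by less than their standard deviations, the extra hyperparameter is unlikely to change the test accuracy much.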


Use the hyperparameters obtained to run the test again.

In [82]:
# Test the tree using the test set
x_train = data_train_val[feature_names]
x_test = data_test[feature_names]

# Set y_train and y_test to contain the target variable of the DataFrame
y_train = data_train_val['income']
y_test = data_test['income']

# Initialize a decision tree classifier
clf = DecisionTreeClassifier(max_depth=grid_search.best_params_['max_depth'], min_samples_leaf=grid_search.best_params_['min_samples_leaf'])

# Train the classifier on the training data
clf.fit(x_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(x_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Visualize the decision tree
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names, filled=True,
                rounded=True, special_characters=True)

with open("tree.dot") as f:
    dot_graph = f.read()

graph = Source(dot_graph)
graph.format = "png"
graph.render("decision_tree", view=True)



Accuracy: 0.8544929672624532


'decision_tree.png'