# Decision Tree Classification

We're going to build some decision trees from a few input files that have already been discretized. I want to test the classifier on the following sets of num_bins:

22224
24444
55554
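
As a rough illustration of what a num_bins setting means, here is a minimal sketch of equal-width binning. It assumes each digit above is the number of bins for one column and that the bins are equal-width; the notebook does not say how the input files were actually discretized. The helper names and the `margins` values are made up for the example.

```python
def equal_width_cutoffs(values, num_bins):
    """Return the num_bins - 1 interior cutoffs that split [min, max] into equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    return [low + width * i for i in range(1, num_bins)]


def discretize(values, num_bins):
    """Map each numeric value to a 1-based bin label (hypothetical helper, not part of mysklearn)."""
    cutoffs = equal_width_cutoffs(values, num_bins)
    labels = []
    for value in values:
        # The bin index is the number of cutoffs the value meets or exceeds.
        bin_index = sum(1 for cutoff in cutoffs if value >= cutoff)
        labels.append(bin_index + 1)
    return labels


# Made-up column of scoring margins, binned two ways.
margins = [-12.0, -3.5, 0.0, 5.0, 8.0, 15.0]
two_bins = discretize(margins, 2)
five_bins = discretize(margins, 5)
print(two_bins)
print(five_bins)
```

More bins keep more of the original ordering information, but they also give the tree more branches per split and fewer rows per branch.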

We'll go further in the future, with some Random Forest shenanigans, but for now we'll just use these. Let's load up some data.

In [1]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyDecisionTreeClassifier

import copy
import random

header1, data1 = myutils.load_from_file("input_data/NCAA_Statistics_22224.csv")
header2, data2 = myutils.load_from_file("input_data/NCAA_Statistics_24444.csv")
header3, data3 = myutils.load_from_file("input_data/NCAA_Statistics_55554.csv")

# Now we can create some decision trees. Let's first build trees over the whole dataset, then
# test them with our stratified k-fold splitting method.

# PART 1: Whole datasets
class_col1 = myutils.get_column(data1, header1, "Win Percentage")
class_col2 = myutils.get_column(data2, header2, "Win Percentage")
class_col3 = myutils.get_column(data3, header3, "Win Percentage")

data1 = myutils.drop_column(data1, header1, "Win Percentage")
data2 = myutils.drop_column(data2, header2, "Win Percentage")
data3 = myutils.drop_column(data3, header3, "Win Percentage")

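# Attribute names for the tree visualization.
# Note: this slice assumes "Win Percentage" is the last column in each header.
atts1 = header1[:-1]
atts2 = header2[:-1]
atts3 = header3[:-1]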

# Get some classifiers in here...
my_dt1 = MyDecisionTreeClassifier()
my_dt2 = MyDecisionTreeClassifier()
my_dt3 = MyDecisionTreeClassifier()

# Fitting... they look good at first glance.
my_dt1.fit(data1, class_col1)
my_dt2.fit(data2, class_col2)
my_dt3.fit(data3, class_col3)

# Visualizing...
my_dt1.visualize_tree("tree_vis/22224_tree.dot", "tree_vis/22224_tree.pdf", atts1)
my_dt2.visualize_tree("tree_vis/24444_tree.dot", "tree_vis/24444_tree.pdf", atts2)
my_dt3.visualize_tree("tree_vis/55554_tree.dot", "tree_vis/55554_tree.pdf", atts3)

Looking at the graphs generated below [ADD THESE], it's easy to see the problem: Scoring Margin is such a strong indicator that it dominates the graph. Let's try the same data as above, but with Scoring Margin removed as an attribute.
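
To see why one strong attribute can take over a tree, here is a small sketch of entropy-based attribute selection. It assumes the classifier picks splits by information gain, which the notebook does not confirm for MyDecisionTreeClassifier. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on column."""
    total = len(labels)
    partitions = {}
    for value, label in zip(column, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Made-up toy data: six teams, binned class label and two binned attributes.
win_pct = [1, 1, 1, 2, 2, 2]
scoring_margin = [1, 1, 1, 2, 2, 2]  # tracks the class label exactly
rebound_bin = [1, 2, 1, 2, 1, 2]     # only weakly related to the class label

gain_margin = information_gain(scoring_margin, win_pct)
gain_rebound = information_gain(rebound_bin, win_pct)
print(gain_margin, gain_rebound)
```

In this toy case the attribute that tracks the class wins the root split by a wide margin, and the other attribute barely gets a say. That is the kind of behavior the trees above appear to show with Scoring Margin.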

In [1]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyDecisionTreeClassifier

import copy
import random

header1, data1 = myutils.load_from_file("input_data/NCAA_Statistics_22224.csv")
header2, data2 = myutils.load_from_file("input_data/NCAA_Statistics_24444.csv")
header3, data3 = myutils.load_from_file("input_data/NCAA_Statistics_55554.csv")

# Now we can create some decision trees. Let's first build trees over the whole dataset, then
# test them with our stratified k-fold splitting method.

# PART 1: Whole datasets
class_col1 = myutils.get_column(data1, header1, "Win Percentage")
class_col2 = myutils.get_column(data2, header2, "Win Percentage")
class_col3 = myutils.get_column(data3, header3, "Win Percentage")

data1 = myutils.drop_column(data1, header1, "Win Percentage")
data2 = myutils.drop_column(data2, header2, "Win Percentage")
data3 = myutils.drop_column(data3, header3, "Win Percentage")

data1 = myutils.drop_column(data1, header1, "Scoring Margin")
data2 = myutils.drop_column(data2, header2, "Scoring Margin")
data3 = myutils.drop_column(data3, header3, "Scoring Margin")

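# Attribute names for the tree visualization.
# Note: this slice assumes "Scoring Margin" is the first column and "Win Percentage" the last.
atts1 = header1[1:-1]
atts2 = header2[1:-1]
atts3 = header3[1:-1]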

# Get some classifiers in here...
my_dt1 = MyDecisionTreeClassifier()
my_dt2 = MyDecisionTreeClassifier()
my_dt3 = MyDecisionTreeClassifier()

# Fitting... they look good at first glance.
my_dt1.fit(data1, class_col1)
my_dt2.fit(data2, class_col2)
my_dt3.fit(data3, class_col3)

# Visualizing...
my_dt1.visualize_tree("tree_vis/_2224_tree.dot", "tree_vis/_2224_tree.pdf", atts1)
my_dt2.visualize_tree("tree_vis/_4444_tree.dot", "tree_vis/_4444_tree.pdf", atts2)
my_dt3.visualize_tree("tree_vis/_5554_tree.dot", "tree_vis/_5554_tree.pdf", atts3)

Let's test these new classifiers on a bit less data, using our stratified k-fold method. In addition, I'll use a manually pruned tree, my_dtp.tree, as the pruned version of tree 2 (with no Scoring Margin attribute).
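
For reference, here is a minimal sketch of what a stratified k-fold split does. It is not the mysklearn.myevaluation implementation, whose function names and signatures aren't shown here. The function name `stratified_kfold_indices` is hypothetical, and this version does no shuffling.

```python
def stratified_kfold_indices(labels, n_splits=5):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    # Group row indices by class label, preserving row order.
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)

    # Deal the grouped indices round-robin across the folds so each fold
    # gets roughly the same share of every class.
    folds = [[] for _ in range(n_splits)]
    position = 0
    for indices in by_label.values():
        for index in indices:
            folds[position % n_splits].append(index)
            position += 1

    # Each fold serves once as the test set; the rest is the training set.
    splits = []
    for test_fold in folds:
        test_set = set(test_fold)
        train = [i for i in range(len(labels)) if i not in test_set]
        splits.append((train, sorted(test_fold)))
    return splits


# Made-up class column: four rows of class 1, four of class 2, two of class 3.
labels = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
splits = stratified_kfold_indices(labels, n_splits=2)
for train, test in splits:
    print(train, test)
```

Each classifier would then be fit on the training rows of a fold and scored on its test rows, with the results combined across folds.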