# Normalization and Discretization

With the data parsed, the next step is to normalize it and to build some modular functions for discretization. First things first, we need to load the dataset into a header and a table:

In [1]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import copy
import random

header, data = myutils.load_from_file("input_data/NCAA_Statistics_Parsed.csv")
data = myutils.convert_to_numeric(data)

Now that the data is loaded, we can start transforming it. I'm going to scale each attribute against its own domain in the table, with the exception of win percentage (the class label), which will be scaled against the fixed range 0 to 100 (representing all possible winning percentages). Before any of that, we will also drop the team name column.

In [2]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

header, data = myutils.load_from_file("input_data/NCAA_Statistics_Parsed.csv")
data = myutils.convert_to_numeric(data)

data = myutils.drop_column(data, header, 0)
header = header[1:]

# Now that we have this, we can scale the data appropriately from 0 to 1 for each attribute. Because we want a
# completely winning or losing score to be the min and max, I'm going to grab this column separately.
winp_col = myutils.get_column(data, header, "Win Percentage")
data = myutils.drop_column(data, header, "Win Percentage")
cut_header = header[:-1]

# Okay, now we can move to scaling. Let's start with the X attributes, which will be dead simple:
X_mins, X_maxs = myutils.scale(data)

# Now, let's scale our classification.
myutils.scale_1d(winp_col, 0, 100)

# Finally, we stitch these back together
for i in range(len(data)):
    data[i].append(winp_col[i])

# Now let's save it to a new csv file to confirm our work's efficacy
myutils.save_to_file(header, data, "input_data/NCAA_Statistics_Normalized.csv")

AttributeError: module 'mysklearn.myutils' has no attribute 'scale_1d'
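
The traceback above shows that `myutils` had no `scale_1d` when this cell ran, so execution stopped before the win-percentage column was rescaled or the file was saved. A minimal sketch of what such a helper might look like, assuming it min-max scales a list in place against a fixed range; the signature mirrors the call in the cell, but the body is hypothetical and not the actual `mysklearn` implementation:

```python
def scale_1d(column, low, high):
    """Min-max scale a list of numbers in place to [0, 1] against a fixed range.

    Hypothetical helper: the body is an assumption about the intended behavior.
    """
    span = high - low
    if span == 0:
        raise ValueError("low and high must differ")
    for i, value in enumerate(column):
        column[i] = (value - low) / span
    return column


winp = [0.0, 25.0, 50.0, 100.0]
scale_1d(winp, 0, 100)
print(winp)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling against the fixed bounds 0 and 100, rather than the observed minimum and maximum, keeps a winless or undefeated season at exactly 0 or 1 even if no team in the table reached it.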

And with that, we've completed the normalization process. Now let's define some modular splitting functions in myutils to discretize the data into useful buckets.

Before moving forward, I want to mention my strategy here. I've made a function that takes the number of bins per attribute as a parameter. Nothing special, for certain, but it does let us produce multiple output files from the same normalized input. I'm going to use the num_bins variable to, in part, name our output files, so each file's name records the binning that produced it. Of course this is unnecessary in our classification method, as we can simply call the function, but for testing and display purposes, that's the plan.
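
To make the strategy concrete, here is a rough sketch of how a per-column discretizer with optional cutoff overrides could work. It is an assumption about the behavior, not the real `myutils.discretize`: it assumes values are already scaled to [0, 1], that bins are labeled 1 through `num_bins`, and that a cutoff override lists `num_bins + 1` edges. The helper name `discretize_column` is hypothetical.

```python
def discretize_column(values, num_bins, cutoffs=None):
    """Map each value to a 1-based bin index (hypothetical sketch).

    If cutoffs is None, equal-width edges over [0, 1] are generated;
    otherwise cutoffs must list num_bins + 1 edges, e.g. [0, 0.5, 1.0].
    """
    if cutoffs is None:
        cutoffs = [i / num_bins for i in range(num_bins + 1)]
    if len(cutoffs) != num_bins + 1:
        raise ValueError("expected num_bins + 1 cutoff edges")
    labels = []
    for value in values:
        # Default to the last bin, which is closed on the right so 1.0 is included
        label = num_bins
        for b in range(1, num_bins):
            if value < cutoffs[b]:
                label = b
                break
        labels.append(label)
    return labels


def discretize(table, num_bins, cutoffs=None):
    """Discretize every column of a row-major table (assumed layout)."""
    if cutoffs is None:
        cutoffs = [None] * len(num_bins)
    columns = [
        discretize_column([row[j] for row in table], num_bins[j], cutoffs[j])
        for j in range(len(num_bins))
    ]
    # Transpose the per-column labels back into rows
    return [list(row) for row in zip(*columns)]


rows = [[0.1, 0.9], [0.6, 0.4]]
print(discretize(rows, [2, 4]))  # [[1, 4], [2, 2]]
```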

In [None]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

header, data = myutils.load_from_file("input_data/NCAA_Statistics_Normalized.csv")
data = myutils.convert_to_numeric(data)

# 1) Using the "22222" method (two bins per attribute). Generally too coarse to help.
num_bins = [2, 2, 2, 2, 2]
data = myutils.discretize(data, num_bins)
num_bins = myutils.convert_to_lexical(num_bins, table_dimensions=1)
myutils.save_to_file(header, data, "input_data/NCAA_Statistics_%s.csv" % "".join(num_bins))

header, data = myutils.load_from_file("input_data/NCAA_Statistics_Normalized.csv")
data = myutils.convert_to_numeric(data)

# 2) Using the same, but overriding the winning percentage cutoffs.
num_bins = [2, 2, 2, 2, 2]
data = myutils.discretize(data, num_bins, cutoffs=[None, None, None, None, [0, 0.5, 1.0]])
num_bins = myutils.convert_to_lexical(num_bins, table_dimensions=1)
myutils.save_to_file(header, data, "input_data/NCAA_Statistics_%s_alt.csv" % "".join(num_bins))

header, data = myutils.load_from_file("input_data/NCAA_Statistics_Normalized.csv")
data = myutils.convert_to_numeric(data)

# 3) Using more discretization labels, with 4 classifications.
# NOTE: I don't want to rely on Scoring Margin too much, so it keeps only 2 bins; I think it's an easy-win button.
num_bins = [2, 4, 4, 4, 4]
data = myutils.discretize(data, num_bins)
num_bins = myutils.convert_to_lexical(num_bins, table_dimensions=1)
myutils.save_to_file(header, data, "input_data/NCAA_Statistics_%s.csv" % "".join(num_bins))

By this point, we have an awesome discretization function. Let's wrap it in a helper so we can generate a whole ton of discretized files, one per bin configuration!

In [None]:
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

def print_csv(num_bins):
    header, data = myutils.load_from_file("input_data/NCAA_Statistics_Normalized.csv")
    data = myutils.convert_to_numeric(data)

    data = myutils.discretize(data, num_bins, cutoffs=[None, None, None, None, [0.0, 0.35, 0.50, 0.65, 1.0]])
    num_bins = myutils.convert_to_lexical(num_bins, table_dimensions=1)
    myutils.save_to_file(header, data, "input_data/NCAA_Statistics_%s.csv" % "".join(num_bins))
    
# Now for a big nested loop: 2 to 5 bins for each of the first four attributes (4^4 = 256 files)
for a in range(2, 6):
    for b in range(2, 6):
        for c in range(2, 6):
            for d in range(2, 6):
                print_csv([a, b, c, d, 4])
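
As an aside, the same set of bin configurations can be enumerated without four levels of nesting. The sketch below uses `itertools.product` and only builds the file names rather than writing anything; `bin_configurations` and `output_name` are hypothetical helpers, and the name pattern assumes `convert_to_lexical` simply turns each bin count into its string form.

```python
import itertools


def bin_configurations(low=2, high=5, free_attributes=4, class_bins=4):
    """Yield every num_bins list the nested loop above would produce, in the same order."""
    for combo in itertools.product(range(low, high + 1), repeat=free_attributes):
        yield list(combo) + [class_bins]


def output_name(num_bins):
    """Build the output path from the bin counts, following the naming pattern used above."""
    return "input_data/NCAA_Statistics_%s.csv" % "".join(str(n) for n in num_bins)


configs = list(bin_configurations())
print(len(configs))             # 256
print(output_name(configs[0]))  # input_data/NCAA_Statistics_22224.csv
```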