This week we finish our two-part series on Machine Learning, which is often the goal of the work we did in previous weeks to clean and organize our data.  Machine Learning can be defined simply as the process of formulating a computational model that predicts the likely result for a new data instance by looking at prior data instances.  This model can be very simple, such as a decision tree that uses logical rules, or very complex, such as a Deep Convolutional Neural Network.  Machine Learning methods can be broken down into those that require domain knowledge in the form of labels (supervised methods) and those that do not (unsupervised methods).  This week we focus on methods that require domain knowledge: we will use rows of data that have a label or class to build a model (called a classifier) that predicts the label or class of new unseen data rows.  We will start by loading and cleaning the data, and then labeling and partitioning it.  Labeling is the process of obtaining or adding a categorical value (the label or class) to each row of your dataset.  In the examples this week, we derive the label from a column.

In [6]:
#[Load and Process the Dataset: NHANES-10000]
import numpy as np
import gzip
with gzip.GzipFile('NHANES.csv.gz','rb') as f:
    #go from raw data into filtered and clean data--------------------------------------------------
    raw = [row.decode('utf-8').replace('\n','').replace('\r','').split(',') for row in f.readlines()]
    header,data,D = raw[0],raw[1:],[]
    for row in data: #only work with data rows that have all the data: no missing values
        try: D += [[(1 if row[8]=='CollegeGrad' else 0),abs(int(row[10][-5:])),float(row[12]),int(row[13]),row[6]]]
        except (ValueError, IndexError): pass #skip rows with missing or malformed values
    #process data to be able to do supervised machine learning techniques-------  
    x = sorted(list(set([d[4] for d in D])))[::-1]
    x_idx = {x[i]:i for i in range(len(x))}
    for i in range(len(D)): D[i][4] = x_idx[D[i][4]]
    X0 = np.array([d[:4] for d in D[:2*len(D)//3]],dtype=np.float32)
    Y0 = np.array([d[4] for d in D[:2*len(D)//3]],dtype=np.uint8)
    X1 = np.array([d[:4] for d in D[2*len(D)//3:]],dtype=np.float32)
    Y1 = np.array([d[4] for d in D[2*len(D)//3:]],dtype=np.uint8)
    

Is there disparity among racial backgrounds with regard to Education, Income, Poverty and Housing? Let's use the NHANES data from 2018 along with machine learning methods to explore this complex question.  First let's re-frame the question so that it is more precise: Can we use Education level (College Grad or not), Household Income per year, Poverty Index, and Housing (number of rooms in your housing area) as predictors for the five race categories White, Other, Mexican, Hispanic and Black?
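The labeling and partitioning done in the loading cell above can be illustrated on a handful of rows. This is a minimal sketch using plain Python lists; the rows are made up for illustration and are not taken from NHANES. It builds the label lookup the same way as the loading cell (unique labels, sorted, then reversed) and keeps the first two thirds of the rows for training.

```python
# Made-up rows for illustration: three feature values followed by a text label.
rows = [
    [1, 50000, 2.5, "White"],
    [0, 20000, 0.9, "Black"],
    [0, 30000, 1.4, "Mexican"],
    [1, 75000, 4.0, "White"],
    [0, 25000, 1.1, "Hispanic"],
    [1, 60000, 3.2, "Other"],
]

# Label lookup: unique labels, sorted, then reversed (as in the loading cell).
labels = sorted(set(r[-1] for r in rows))[::-1]
label_idx = {name: i for i, name in enumerate(labels)}

# Replace each text label with its integer code.
encoded = [r[:-1] + [label_idx[r[-1]]] for r in rows]

# Partition: first two thirds for training, the remaining third for testing.
cut = 2 * len(encoded) // 3
X0 = [r[:-1] for r in encoded[:cut]]   # training features
Y0 = [r[-1] for r in encoded[:cut]]    # training labels
X1 = [r[:-1] for r in encoded[cut:]]   # test features
Y1 = [r[-1] for r in encoded[cut:]]    # test labels

print(label_idx)
print(Y0, Y1)
```

A sequential split like this assumes the row order is unrelated to the label; if the file were sorted by label, shuffling the rows before the split would be the safer choice.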

I. Supervised Machine Learning: Decision Trees
When we are given a set of data rows and a label associated with each row, we can build predictive systems that guess the label of a new unseen data row.  To start, we will look at one of the first types of classifiers to gain wide use, Decision Trees (DTs).  These methods produce a tree data structure in which a data row is processed starting from the root and one decision is made at each branch.  The final leaves of the tree are the labels: https://scikit-learn.org/stable/modules/tree.html
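To make the tree structure concrete, here is a minimal hand-written sketch of what a very small fitted tree amounts to: nested if/else rules. The function name, the thresholds and the leaf names are all invented for illustration; a real DecisionTreeClassifier chooses its own columns and thresholds from the training data.

```python
def predict_label(row):
    """Walk a tiny hand-written decision tree from the root to a leaf.

    row = (college_grad, income, poverty_index, rooms).
    All thresholds and leaf labels below are hypothetical.
    """
    college_grad, income, poverty_index, rooms = row
    if income <= 30000:              # root decision
        if poverty_index <= 1.0:     # second-level decision on the left branch
            return "label A"         # leaf
        return "label B"             # leaf
    if college_grad == 1:            # second-level decision on the right branch
        return "label C"             # leaf
    return "label D"                 # leaf

print(predict_label((0, 20000, 0.8, 4)))
print(predict_label((1, 90000, 4.5, 7)))
```

Each call makes one decision per level and stops at a leaf, so prediction cost depends on the depth of the tree rather than on the number of training rows.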

In [11]:
#[Example 1: Decision Tree]
import numpy as np
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X0,Y0)                #fit training data X0 using labels Y0
Y   = clf.predict(X1)               #predict the labels of the unseen data X1
print(np.sum(Y==Y1)/len(Y1))        #show the accuracy of the classification

0.6607785410533202


In this example we first fit the classifier to the labeled training data (X0 with labels Y0) and then use that model as a predictor for the new unseen data contained in the X1 array.  Next we score how many of the labels were predicted correctly against the total number of labels in Y1, which yields around 66% accuracy.  Is this good? The accuracy estimate itself is reliable, since the test was performed on a large number of data rows (>3000).  We achieve about 66% total accuracy across 5 labels, whereas picking a label uniformly at random would give around 20% accuracy.  The gap between random chance and the measured accuracy is evidence of a functional system, although when one label is much more common than the others, the fairer comparison is against always guessing the most common label.
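Two reference points are useful when judging an accuracy score: uniform random guessing and always guessing the most common label. The sketch below computes both on a made-up label list (not the NHANES labels), so the numbers only illustrate the calculation.

```python
from collections import Counter

# Made-up test labels (integer codes 0..4) with one dominant class.
y_true = [0] * 6 + [1, 2, 3, 4]

# Baseline 1: pick one of the classes uniformly at random.
n_classes = len(set(y_true))
uniform_chance = 1 / n_classes

# Baseline 2: always predict the most common label.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
majority_baseline = majority_count / len(y_true)

print(uniform_chance, majority_label, majority_baseline)
```

Applying the same two lines to Y1 from the loading cell would show how far each classifier's score sits above both baselines.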

II. Supervised Machine Learning: Naive Bayes
Naive Bayes (NB) is a popular machine learning method for classification among a small number of choices and is one of many probabilistic classification methods.  It gets its name from its use of Bayes' Rule from probability theory together with the "naive" assumption that the columns are independent of one another given the label; each variant (such as the Gaussian one used below) also assumes a particular distribution for the data in each column.  These assumptions work well in applications such as document classification and spam filtering, but not in many general contexts: https://scikit-learn.org/stable/modules/naive_bayes.html
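The idea can be sketched from scratch in a few lines. This is a simplified illustration of Gaussian Naive Bayes, not scikit-learn's implementation: the function names and the toy data are made up, and a tiny constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {label: (prior, per-column means, per-column variances)}."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(c) / n for c in cols]
        # Small constant keeps each variance strictly positive.
        variances = [sum((v - m) ** 2 for v in c) / n + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the label with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # "Naive" step: per-column log-likelihoods are simply added,
        # which treats the columns as independent given the label.
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: two well-separated clusters with labels 0 and 1.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.1]), predict_gaussian_nb(model, [5.1, 5.9]))
```

The model stores only a prior, a mean and a variance per column for each label, which is why this kind of classifier is quick to fit and why it struggles when a column is far from bell-shaped.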

In [9]:
#[Example 2: Naive Bayes]
import numpy as np
from sklearn import naive_bayes

clf = naive_bayes.GaussianNB()
clf = clf.fit(X0,Y0)                #fit training data X0 using labels Y0
Y   = clf.predict(X1)               #predict the labels of the unseen data X1
print(np.sum(Y==Y1)/len(Y1))        #show the accuracy of the classification

0.5904481517827936


Here we see that NB does not work as well as the DT in Example 1, trailing it by about 7 percentage points (59% versus 66%).  Although a Gaussian distribution is common in many contexts, for example biological factors like height and weight (once normalized by gender), it does not hold for the columns in this data set.  You will find that this method works very well when the distribution assumptions hold!

III. Supervised Machine Learning: Support Vector Machines
Support Vector Machines (SVMs) used in classification treat each row of data as a feature vector and find the boundary that optimally separates the rows in multi-dimensional space.  An easy way to think of them (and another name for them) is as hyper-plane separators.  The family that the separator comes from is called the kernel.  With a linear kernel the separator is a basic line (for data with two columns), a plane (for data with three columns), or more generally a hyper-plane (for data with more than three columns).  Many other kernel types exist, such as polynomial (for example cubic) and radial basis function (RBF) kernels, which result in separators that can curve in space: https://scikit-learn.org/stable/modules/svm.html
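A small sketch may help picture the two ideas: a linear separator decides by which side of a hyper-plane a row falls on, while a kernel such as the RBF measures similarity between rows so that the boundary can curve. The weights, bias and gamma value below are hypothetical and are not learned from any data; an actual SVM finds them during fitting.

```python
import math

def linear_decision(w, b, x):
    """Return +1 or -1 depending on the side of the hyper-plane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def rbf_kernel(x, z, gamma=0.5):
    """Similarity between two rows that decays with their squared distance."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical separator for two-column data: the line x1 = x2.
w, b = [1.0, -1.0], 0.0

print(linear_decision(w, b, [2.0, 1.0]))   # point below the line x1 = x2
print(linear_decision(w, b, [1.0, 2.0]))   # point above the line x1 = x2
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]), rbf_kernel([0.0, 0.0], [1.0, 1.0]))
```

Note that svm.SVC() with no arguments, as used in Example 3, defaults to the RBF kernel rather than a linear one.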

In [13]:
#[Example 3: Support Vector Machines]
import numpy as np
from sklearn import svm

clf = svm.SVC()
clf = clf.fit(X0,Y0)                #fit training data X0 using labels Y0
Y   = clf.predict(X1)               #predict the labels of the unseen data X1
print(np.sum(Y==Y1)/len(Y1))        #show the accuracy of the classification

0.6398429833169774


Here we see that the SVM works well out of the box: it outperforms NB (64% versus 59%) but scores under the DT (66%).  This is mainly because the SVM makes no distribution assumptions and we have far more training examples than data dimensions (thousands of rows of data versus the 4 columns in each row is a good ratio for the SVM!).  If we had more rows of data, the SVM might outperform the DT.

IV. Supervised Machine Learning: Neural Networks
Neural Networks (NNs) are a large family of ML methods that include both some of the oldest and some of the newest methods presented here.  Classic methods include the Multi-layer Perceptron (MLP), while newer methods include Deep Convolutional Neural Networks (DCNNs).  At their core, NNs connect nodes in a graph that simulates (in some ways) how our brain propagates information through connected brain tissue.  Once this flow is restricted (for example, to move forward through layers of nodes), it can be used very effectively to build classifiers.  In general, NNs may need much more data than the other methods to learn effectively: https://scikit-learn.org/stable/modules/neural_networks_supervised.html
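The flow of information through layers of connected nodes can be sketched as a forward pass. This is a minimal illustration with hand-picked, hypothetical weights (2 inputs, 2 hidden nodes, 2 output labels); a real MLP learns its weights during fitting, and the layer sizes here are chosen only to keep the example short.

```python
import math

def dense(x, W, b):
    """One fully connected layer: each node sums its weighted inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Activation: negative values are clipped to zero."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

# Hypothetical weights and biases, not learned from data.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]   # input  -> hidden
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]     # hidden -> output

def forward(x):
    """Propagate one data row through the hidden layer to label probabilities."""
    hidden = relu(dense(x, W1, b1))
    return softmax(dense(hidden, W2, b2))

probs = forward([2.0, 1.0])
predicted = probs.index(max(probs))   # label with the highest probability
print(probs, predicted)
```

Training adjusts the entries of W1, b1, W2 and b2 so that the highest probability lands on the correct label as often as possible, which is where the large data requirement comes from.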

In [14]:
#[Example 4: MLP]
import numpy as np
from sklearn import neural_network

clf = neural_network.MLPClassifier(solver='adam',alpha=1e-5,
                                   hidden_layer_sizes=(5, 2))
clf = clf.fit(X0,Y0)                #fit training data X0 using labels Y0
Y   = clf.predict(X1)               #predict the labels of the unseen data X1
print(np.sum(Y==Y1)/len(Y1))        #show the accuracy of the classification

0.6398429833169774


Using the Adam solver and a small L2 regularization term (alpha) we achieve a good result (the same score as the SVM), but can we do better? This question is very important and gets to the heart of all classification tasks: to what degree can a label be predicted from the values in its data row?  The answer has a name in information science (with applications to data compression): the separability of the data.  This is to say that every dataset has an accuracy limit that lies somewhere between random chance and 100%.  In our case 64% is much better than 20%, but it is far from perfect!
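One simple way to see why such a limit exists: if two rows have identical values but different labels, no classifier that looks only at those values can get both right. The sketch below computes that ceiling on a made-up dataset; it is only an illustration of the idea, not a measurement of the NHANES data.

```python
from collections import Counter, defaultdict

# Made-up (features, label) pairs; some identical feature rows have different labels.
data = [
    ((1, 3), 0), ((1, 3), 0), ((1, 3), 1),
    ((0, 2), 1), ((0, 2), 1),
    ((0, 5), 2), ((0, 5), 0),
]

# Count how often each label occurs for each distinct feature row.
labels_by_features = defaultdict(Counter)
for features, label in data:
    labels_by_features[features][label] += 1

# The best any rule can do is guess the most common label for each feature row.
best_correct = sum(c.most_common(1)[0][1] for c in labels_by_features.values())
ceiling = best_correct / len(data)

print(best_correct, len(data), ceiling)
```

Here at most 5 of the 7 rows can be labeled correctly, so about 71% is the ceiling for this toy dataset no matter which method is used.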

If you enjoyed the last topic, you will find that there is a world of methods to explore with NNs.  The Keras/TensorFlow packages provide the means to design and implement specialized NNs that can use GPUs to accelerate the processing (you may have noticed that all the methods except the MLP executed in the blink of an eye):
https://keras.io/api/ 
https://www.tensorflow.org/tutorials/quickstart/beginner