$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# Assignment 4: Classification with LDA and Logistic Regression

Brenton Grundman 

829460164

## Overview

Compare LDA and linear and nonlinear logistic regression applied to two data sets.

## Code

The `parameters` argument for `trainNN` is a list of the hidden layers structure and the number of SCG iterations, as in the previous assignment. The value of the `parameters` argument for `trainLDA` is not used.

Use the `trainValidateTestKFoldsClassification` function in `mlutils.py` to apply the functions. 
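To make the train/validate/test partitioning concrete, here is a minimal sketch of how the sample indices could be divided when every fold takes a turn as the test set and each remaining fold takes a turn as the validation set. This is not the code in `mlutils.py`; the function name `train_validate_test_folds` is hypothetical, and the sketch assumes unshuffled data and at least three folds.

```python
import numpy as np

def train_validate_test_folds(n_samples, n_folds):
    """Yield (train, validate, test) index arrays for each ordered pair of distinct folds.

    Hypothetical helper for illustration; assumes n_folds >= 3.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    for test_i in range(n_folds):
        for val_i in range(n_folds):
            if val_i == test_i:
                continue
            # Every fold that is neither the test nor the validation fold is used for training
            train = np.concatenate([folds[j] for j in range(n_folds)
                                    if j not in (test_i, val_i)])
            yield train, folds[val_i], folds[test_i]

# 392 samples and 6 folds, matching the MPG example in this notebook
splits = list(train_validate_test_folds(392, 6))
print(len(splits), "train/validate/test splits")
```

Under this scheme, a parameter setting would typically be chosen per test fold by its average validation accuracy, which is consistent with the one-row-per-test-fold tables printed later.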

The `NeuralNetworkClassifier` class in `neuralnetworks.py` allows you to specify 0 hidden units.  This creates a neural network with only an output layer, designed to do classification.  In other words, specify 0 hidden units to apply linear logistic regression.
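As a rough illustration of what a network with no hidden units computes, the sketch below applies a softmax to a linear function of the inputs. It is not the `NeuralNetworkClassifier` code; the function names and the demo weights `W_demo` are made up for the example, and no training is shown.

```python
import numpy as np

def softmax(Y):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    Y = Y - Y.max(axis=1, keepdims=True)
    expY = np.exp(Y)
    return expY / expY.sum(axis=1, keepdims=True)

def linear_logistic_predict(X, W, classes):
    """Predict classes with linear logistic regression.

    W has one row for the bias followed by one row per input, and one column per class.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant 1 for the bias
    probs = softmax(X1 @ W)
    predicted = classes[np.argmax(probs, axis=1)].reshape(-1, 1)
    return predicted, probs

# Hypothetical weights and inputs, chosen only to show the shapes involved
X_demo = np.array([[0.0, 1.0], [2.0, -1.0]])
W_demo = np.array([[0.0, 0.0], [1.0, -1.0], [0.0, 0.5]])
classes_demo = np.array([15, 25])
pred, probs = linear_logistic_predict(X_demo, W_demo, classes_demo)
print(pred.ravel(), probs.sum(axis=1))
```

Because the class scores are linear in `X`, the boundaries between predicted classes are linear as well, which is why this case counts as linear logistic regression.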

In [159]:
import numpy as np
import mlutils as ml
import neuralnetworks as nn
import qdalda as ql

In [160]:
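def trainLDA(X, T, parameters=None):
    """Train an LDA model on X, T. `parameters` is unused."""
    lda = ql.LDA()
    lda.train(X, T)
    return lda

def evaluateLDA(model, X, T):
    """Return the percent of samples in X that the LDA model classifies correctly."""
    predicted, prob, d = model.use(X)
    percentCorrect = 100 * (np.sum(T == predicted) / float(T.shape[0]))
    return percentCorrect

def trainNN(X, T, parameters):
    """Train a neural network classifier; parameters = [hidden layer structure, SCG iterations]."""
    nnet = nn.NeuralNetworkClassifier(X.shape[1], parameters[0], len(np.unique(T)))
    return nnet.train(X, T, nIterations=parameters[1])

def evaluateNN(model, X, T):
    """Return the percent of samples in X that the network classifies correctly."""
    predict = model.use(X)
    percentCorrect = 100 * (np.sum(T == predict) / float(T.shape[0]))
    return percentCorrect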

Here is an example, using our automobile MPG data.  This time, instead of predicting the actual MPG values, we quantize the MPG values into 4 intervals, and classify each sample as being in one of these 4 intervals.
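The quantization can also be written without loops. The sketch below is one possible vectorized version using `np.digitize`; the name `quantize_mpg` is hypothetical, and it assumes every MPG value is above 5, which holds for this data set. Each sample is labeled by the upper edge of its interval, with 50 standing for everything above 35.

```python
import numpy as np

def quantize_mpg(T):
    """Label each MPG value by the upper edge of its interval (15, 25, 35, or 50).

    Assumes all values are greater than 5.
    """
    upper_edges = np.array([15, 25, 35])
    labels = np.array([15, 25, 35, 50])
    # right=True makes each interval closed on the right: edges[i-1] < mpg <= edges[i]
    idx = np.digitize(np.asarray(T, dtype=float), upper_edges, right=True)
    return labels[idx]

# A few hand-picked values, including the interval edges
mpg_demo = np.array([[9.0], [15.0], [15.1], [25.0], [35.0], [46.6]])
demo_classes = quantize_mpg(mpg_demo)
print(demo_classes.ravel())
```

The result keeps the column shape of `T`, so it can be passed on in the same way as the output of a loop-based version.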

## Data

In [161]:
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = c:/progra~1/wget/etc/wgetrc
--2017-04-05 18:44:52--  http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Resolving mlr.cs.umass.edu... 128.119.246.96
Connecting to mlr.cs.umass.edu|128.119.246.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30286 (30K) [text/plain]
Saving to: `auto-mpg.data.3'

     0K .......... .......... .........                       100%  336K=0.09s

2017-04-05 18:44:52 (336 KB/s) - `auto-mpg.data.3' saved [30286/30286]

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = c:/progra~1/wget/etc/wgetrc
--2017-04-05 18:44:53--  http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
Resolving mlr.cs.umass.edu... 128.119.246.96
Connecting to mlr.cs.umass.edu|128.119.246.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1660 (1.6K) [text/plain]
Saving to: `auto-mpg.names.3'

     0K .                           

In [162]:
def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        # Accept '?' as bytes or str, since the converter input type depends on the NumPy version
        return np.nan if s in (b'?', '?') else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read",data.shape[0],"rows and",data.shape[1],"columns from",filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask,:]
    print("After removing rows containing question marks, data has",data.shape[0],"rows and",data.shape[1],"columns.")
    X = data[:,1:]
    T = data[:,0:1]
    Xnames =  ['cylinders','displacement','horsepower','weight','acceleration','year','origin']
    Tname = 'mpg'
    return X,T,Xnames,Tname

In [163]:
def makeMPGClasses(T):
    bounds = np.arange(5,45,10)
    Tclasses = -np.ones(T.shape).astype(int)  # np.int was removed from recent NumPy releases
    for i,mpg in enumerate(T):
        for k in range(len(bounds)-1):
            if bounds[k] < mpg <= bounds[k+1]:
                Tclasses[i] = bounds[k+1]
        if Tclasses[i] == -1:
            Tclasses[i] = 50  # max mpg is 46.6
    return Tclasses

In [164]:
X,T,Xnames,Tname = makeMPGData('auto-mpg.data')
Tclasses = makeMPGClasses(T)
classes,counts = np.unique(Tclasses,return_counts=True)
print('classes',classes)
print('counts',counts)

Read 398 rows and 8 columns from auto-mpg.data
After removing rows containing question marks, data has 392 rows and 8 columns.
classes [15 25 35 50]
counts [ 69 167 123  33]


In [165]:
def printResults(label,results):
    print('{:4s} {:>20s}{:>8s}{:>8s}{:>8s}'.format('Algo','Parameters','TrnAcc','ValAcc','TesAcc'))
    print('-------------------------------------------------')
    for row in results:
        # 20 is expected maximum number of characters in printed parameter value list
        print('{:>4s} {:>20s} {:7.2f} {:7.2f} {:7.2f}'.format(label,str(row[0]),*row[1:]))

In [166]:
resultsLDA = ml.trainValidateTestKFoldsClassification( trainLDA,evaluateLDA, X,Tclasses, [None],
                                                       nFolds=6, shuffle=False,verbose=False)
printResults('LDA:',resultsLDA)


Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
LDA:                 None   79.21   70.53   56.04
LDA:                 None   75.58   66.36   80.23
LDA:                 None   78.46   67.19   82.72
LDA:                 None   77.72   68.07   82.89
LDA:                 None   79.75   69.57   69.01
LDA:                 None   81.23   72.96   51.52


In [167]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,Tclasses, 
                                                     [ [ [0], 10], [[10], 100] ],
                                                     nFolds=6, shuffle=False,verbose=False)
printResults('NN:',resultsNN)


Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
 NN:            [[0], 10]   83.42   74.48   51.65
 NN:          [[10], 100]   92.47   71.47   80.23
 NN:          [[10], 100]   92.82   68.05   71.60
 NN:          [[10], 100]   93.42   73.30   72.37
 NN:            [[0], 10]   81.50   72.81   67.61
 NN:            [[0], 10]   82.96   78.98   48.48


In [168]:
lda = ql.LDA()
lda.train(X,Tclasses)
predictedClasses,_,_ = lda.use(X)
ml.confusionMatrix(Tclasses,predictedClasses,classes); # <- semi-colon prevents printing of returned result

      15   25   35   50
    ------------------------
15 | 91.3  8.7  0    0     (69 / 69)
25 | 11.4 68.9 18.6  1.2   (167 / 167)
35 |  0    8.9 68.3 22.8   (123 / 123)
50 |  0    0   18.2 81.8   (33 / 33)
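For reference, a table like the one above can be computed directly with NumPy. This is a sketch of the idea rather than the `ml.confusionMatrix` implementation; the function name `confusion_percent` is hypothetical, and it assumes every class appears at least once among the actual labels. Each row gives, for one actual class, the percent of its samples predicted as each class.

```python
import numpy as np

def confusion_percent(actual, predicted, classes):
    """Return a (classes x classes) table of row percentages.

    Row i, column j is the percent of samples whose actual class is classes[i]
    that were predicted as classes[j].
    """
    actual = np.asarray(actual).ravel()
    predicted = np.asarray(predicted).ravel()
    table = np.zeros((len(classes), len(classes)))
    for i, true_class in enumerate(classes):
        in_class = actual == true_class
        for j, predicted_class in enumerate(classes):
            table[i, j] = 100 * np.mean(predicted[in_class] == predicted_class)
    return table

# Tiny made-up example with two classes
actual_demo = np.array([15, 15, 25, 25, 25, 25])
predicted_demo = np.array([15, 25, 25, 25, 15, 25])
table_demo = confusion_percent(actual_demo, predicted_demo, np.array([15, 25]))
print(table_demo)
```

The diagonal holds the per-class accuracy, and the off-diagonal entries show which classes are confused with which.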


## Results

`trainNN` and `trainLDA` fit the neural network and the linear discriminant model to the training data they are given, producing models that can then be scored on validation and test data.  Unlike before, where we predicted a continuous target value and measured how close the predictions came, we are now looking at data and hoping to classify it into set categories.  For example, where we would once try to guess someone's next move in chess based on their previous playing style, now we look at their playing style and hope to accurately guess what tier of player they are.

The evaluation functions apply a trained model to a data set and report the percentage of samples it classifies correctly.  LDA assumes that the samples of each class follow a Gaussian distribution; when the classes are separated enough, the input space can be chunked into one region per tier.  Of course, data is rarely so cut-and-dried, and the edge cases are where discrepancies usually arise.  The confusion matrix above shows this: nearly all of LDA's errors fall in a class adjacent to the true one.  This is much like belts in Jiu Jitsu: while blue belts are clearly better than white belts on average, a white belt could be on the border of becoming a blue belt, or a blue belt could have just come from being a white belt.  For this reason, a white belt may consistently beat a blue belt, and the program could mistake their rankings.  The higher the percentage returned by the evaluation functions, the less often the program places people in the wrong rank.
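The Gaussian view of the classes can be made concrete with a small LDA sketch. This is not the `qdalda.py` implementation; the function names are hypothetical, and the sketch assumes each class has at least two samples. It estimates one mean per class and a single shared covariance, then assigns each sample to the class with the largest linear discriminant value.

```python
import numpy as np

def lda_fit(X, T):
    """Estimate class means, class priors, and a shared (pooled) covariance matrix."""
    T = np.asarray(T).ravel()
    classes = np.unique(T)
    means = np.array([X[T == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(T == c) for c in classes])
    # Shared covariance: prior-weighted average of the per-class covariances
    cov = sum(p * np.cov(X[T == c].T) for c, p in zip(classes, priors))
    return classes, means, priors, np.atleast_2d(cov)

def lda_predict(X, classes, means, priors, cov):
    """Pick the class maximizing x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log P(k)."""
    cov_inv = np.linalg.pinv(cov)
    scores = (X @ cov_inv @ means.T
              - 0.5 * np.sum(means @ cov_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)].reshape(-1, 1)

# Two made-up, well-separated clusters
X_demo = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.],
                   [5., 5.], [6., 5.], [5., 6.], [6., 6.]])
T_demo = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
model = lda_fit(X_demo, T_demo)
pred_demo = lda_predict(X_demo, *model)
print(100 * np.mean(pred_demo == T_demo), "percent correct")
```

Because the covariance is shared, the discriminant is linear in `x`, so the boundary between any two classes is a straight line or hyperplane. Samples near that boundary are the edge cases described above.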

In the results above (on data reused from the example), we compare LDA training and NN training for classifying MPG.  The NN reaches higher training and validation accuracy than LDA in every fold, which suggests that LDA's linear division of the MPG classes does not fully capture how the cars divide.  If it did, LDA would fit the training data about as well as the NN.  Note, however, that in three of the six folds the selected network has 0 hidden units, which is itself a linear classifier.

However, this holds only for the training and validation accuracy.  There's a clear flip in effectiveness once moving on to the testing data: LDA's test accuracy matches or exceeds the NN's in every fold.  It could be that the NN was overtrained (and overconfident), especially given its training accuracy above 92% in the folds where 10 hidden units were selected.
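One way to check this comparison is to average each accuracy column over the six test folds. The numbers below are copied from the two results tables above; the variable names are only for this example.

```python
import numpy as np

# Columns: training, validation, and test accuracy; one row per test fold
lda_acc = np.array([[79.21, 70.53, 56.04],
                    [75.58, 66.36, 80.23],
                    [78.46, 67.19, 82.72],
                    [77.72, 68.07, 82.89],
                    [79.75, 69.57, 69.01],
                    [81.23, 72.96, 51.52]])
nn_acc = np.array([[83.42, 74.48, 51.65],
                   [92.47, 71.47, 80.23],
                   [92.82, 68.05, 71.60],
                   [93.42, 73.30, 72.37],
                   [81.50, 72.81, 67.61],
                   [82.96, 78.98, 48.48]])

for name, acc in (("LDA", lda_acc), ("NN", nn_acc)):
    trn, val, tes = acc.mean(axis=0)
    print(f"{name:>4s}: train {trn:6.2f}  validate {val:6.2f}  "
          f"test {tes:6.2f}  train-test gap {trn - tes:6.2f}")
```

A larger gap between training and test accuracy is a common sign of overfitting, although with only six folds and large fold-to-fold variation the averages should be read with caution.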

When it comes down to it, MPG on its own is hardly a good basis for classifying anything.  Ideally, there would be distinct classes, determined after training, that serve some classification purpose, such as identifying a type of animal (or, at a much more compute-intensive level, a specific person) from an image.  This would likely work well with LDA bu

In [169]:
#%run -i A4grader.py