$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# Assignment 4: Classification with LDA, QDA, and Logistic Regression

*Xuehao(David) Hu*

## Overview

Compare LDA, QDA, and linear and nonlinear logistic regression applied to two data sets.  

## Required Code

Download [nn2.tar](http://www.cs.colostate.edu/~anderson/cs480/notebooks/nn2.tar) and extract its contents, which are

* `neuralnetworks.py`
* `scaledconjugategradient.py`
* `mlutils.py`
* `qdalda.py`

as discussed in lecture. 

Write the following functions that train and evaluate LDA, QDA, and neural network logistic regression models.

* `model = trainLDA(X,T,parameters)`
* `percentCorrect = evaluateLDA(model,X,T)`
* `model = trainQDA(X,T,parameters)`
* `percentCorrect = evaluateQDA(model,X,T)`
* `model = trainNN(X,T,parameters)`
* `percentCorrect = evaluateNN(model,X,T)`

The `parameters` argument for `trainNN` is a list containing the hidden-layer structure and the number of SCG iterations, as in the previous assignment. The `parameters` argument for `trainLDA` and `trainQDA` is not used.
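
To make the required interface concrete, here is a minimal sketch of a train/evaluate pair with the same signatures. The model is a hypothetical stand-in (it just remembers the most common class), and the names `trainMajority` and `evaluateMajority` and the sample data are made up for illustration; this is not one of the required functions.

```python
from collections import Counter

def trainMajority(X, T, parameters):
    """Hypothetical stand-in model: remember the most common class in T.

    `parameters` is accepted only to match the required signature and is
    ignored, as it is for trainLDA and trainQDA.
    """
    labels = [row[0] for row in T]   # T holds one column of class labels
    return Counter(labels).most_common(1)[0][0]

def evaluateMajority(model, X, T):
    """Return the percent of samples whose true class equals the stored class."""
    labels = [row[0] for row in T]
    nCorrect = sum(1 for label in labels if label == model)
    return 100.0 * nCorrect / len(labels)

X = [[8, 307.0], [4, 97.0], [4, 120.0], [6, 200.0]]   # made-up samples
T = [[15], [25], [25], [35]]                          # made-up class labels
model = trainMajority(X, T, None)
print(model, evaluateMajority(model, X, T))
```

The real functions follow the same shape: the train function returns whatever object the matching evaluate function needs, and the evaluate function returns a single number between 0 and 100. For `trainNN`, the list would presumably be unpacked first, for example `hiddenLayers, nIterations = parameters`.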

Use the `trainValidateTestKFoldsClassification` function in `mlutils.py` to apply the above functions. 
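
As background, the sketch below shows one common way to arrange train, validate, and test partitions from k folds: each fold takes a turn as the test set, each remaining fold takes a turn as the validation set, and the rest are used for training. This is an assumption made for illustration and is not taken from `mlutils.py`, whose actual partitioning may differ. The helper names `foldIndices` and `trainValidateTestSplits` are hypothetical.

```python
def foldIndices(nSamples, nFolds):
    """Split range(nSamples) into nFolds contiguous, nearly equal index lists."""
    base, extra = divmod(nSamples, nFolds)
    folds, start = [], 0
    for k in range(nFolds):
        size = base + (1 if k < extra else 0)   # first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def trainValidateTestSplits(nSamples, nFolds):
    """Yield (trainRows, validateRows, testRows) for every ordered pair of distinct folds."""
    folds = foldIndices(nSamples, nFolds)
    for testFold in range(nFolds):
        for validateFold in range(nFolds):
            if validateFold == testFold:
                continue
            trainRows = [i for k, fold in enumerate(folds)
                         if k not in (testFold, validateFold) for i in fold]
            yield trainRows, folds[validateFold], folds[testFold]

splits = list(trainValidateTestSplits(nSamples=12, nFolds=6))
print(len(splits))
print(splits[0])
```

Under this arrangement, the validation accuracy is what selects among parameter sets, and the test accuracy is only reported for the selected set.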

In [29]:
import numpy as np
import mlutils as ml
import neuralnetworks as nn
import qdalda as ql  # provides the LDA class used as ql.LDA() later in this notebook

In [30]:
def trainLDA(X,T,parameters):
    
    pass

In [31]:
def evaluateLDA(model,X,T):
    pass

In [32]:
def trainQDA(X,T,parameters):
    pass

In [33]:
def evaluateQDA(model,X,T):
    pass

In [35]:
def trainNN(X,T,parameters):
    pass

In [36]:
def evaluateNN(model,X,T):
    pass

In [37]:
#   To run this notebook, I import my solution
# from A4mysolution import * 
#   You should include all function definitions that you need in this notebook.  They do not
#   all have to be in the same code cell.

Here is an example using our automobile MPG data. This time, instead of predicting the actual MPG values, we quantize the MPG values into four intervals and classify each sample as belonging to one of them.

In [38]:
def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        # loadtxt may pass bytes or str to the converter, depending on the NumPy version
        return np.nan if s in (b'?', '?') else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read",data.shape[0],"rows and",data.shape[1],"columns from",filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask,:]
    print("After removing rows containing question marks, data has",data.shape[0],"rows and",data.shape[1],"columns.")
    X = data[:,1:]
    T = data[:,0:1]
    Xnames =  ['cylinders','displacement','horsepower','weight','acceleration','year','origin']
    Tname = 'mpg'
    return X,T,Xnames,Tname

In [39]:
def makeMPGClasses(T):
    bounds = np.arange(5,45,10)
    Tclasses = -np.ones(T.shape).astype(int)  # np.int was removed in NumPy 1.24
    for i,mpg in enumerate(T):
        for k in range(len(bounds)-1):
            if bounds[k] < mpg <= bounds[k+1]:
                Tclasses[i] = bounds[k+1]
        if Tclasses[i] == -1:
            Tclasses[i] = 50  # max mpg is 46.6
    return Tclasses

In [40]:
X,T,Xnames,Tname = makeMPGData('auto-mpg.data')
Tclasses = makeMPGClasses(T)
classes,counts = np.unique(Tclasses,return_counts=True)
print('classes',classes)
print('counts',counts)

Read 398 rows and 8 columns from auto-mpg.data
After removing rows containing question marks, data has 392 rows and 8 columns.
classes [15 25 35 50]
counts [ 69 167 123  33]
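
The class labels printed above are the upper bounds of the MPG intervals, with 50 used for everything above 35. The quantization rule can be sketched for a single value as follows. The function name `quantizeMPG` and the sample values are made up for illustration, while the bounds and labels mirror `makeMPGClasses`.

```python
def quantizeMPG(mpg, bounds=(5, 15, 25, 35), overflowLabel=50):
    """Label an MPG value with the upper bound of the interval that contains it.

    Intervals are (5, 15], (15, 25], and (25, 35]. Anything outside them
    gets overflowLabel, matching the fallback in makeMPGClasses.
    """
    for lower, upper in zip(bounds, bounds[1:]):
        if lower < mpg <= upper:
            return upper
    return overflowLabel

samples = [9.0, 18.0, 25.0, 26.5, 46.6]   # made-up MPG values
labels = [quantizeMPG(m) for m in samples]
print(labels)   # [15, 25, 25, 35, 50]
```

Note that a value exactly on a boundary, such as 25.0, falls into the lower interval because the upper bound is inclusive.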


In [41]:
def printResults(label,results):
    print('{:4s} {:>20s}{:>8s}{:>8s}{:>8s}'.format('Algo','Parameters','TrnAcc','ValAcc','TesAcc'))
    print('-------------------------------------------------')
    for row in results:
        # 20 is the expected maximum number of characters in the printed parameter value list
        print('{:>4s} {:>20s} {:7.2f} {:7.2f} {:7.2f}'.format(label,str(row[0]),*row[1:]))

In [42]:
resultsLDA = ml.trainValidateTestKFoldsClassification( trainLDA,evaluateLDA, X,Tclasses, [None],
                                                       nFolds=6, shuffle=False,verbose=False)
printResults('LDA:',resultsLDA)

TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

In [None]:
resultsQDA = ml.trainValidateTestKFoldsClassification( trainQDA,evaluateQDA, X,Tclasses, [None],
                                                        nFolds=6, shuffle=False,verbose=False)
printResults('QDA:',resultsQDA)

In [None]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,Tclasses, 
                                                     [ [ [0], 10], [[10], 100] ],
                                                     nFolds=6, shuffle=False,verbose=False)
printResults('NN:',resultsNN)
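
Each entry in the parameter list passed above pairs a hidden-layer structure with a number of SCG iterations. A larger set of candidates can be generated rather than typed by hand. The sketch below is one way to do that. The particular layer sizes and iteration counts are made-up examples, and reading `[0]` as "no hidden layer" (linear logistic regression) is an assumption.

```python
from itertools import product

hiddenLayerChoices = [[0], [5], [10], [10, 10]]   # [0] is assumed to mean no hidden layer
iterationChoices = [10, 100]

# One [hiddenLayers, nIterations] pair per combination, in the same
# shape as the hand-written list [ [[0], 10], [[10], 100] ].
parameterSets = [[hidden, nIterations]
                 for hidden, nIterations in product(hiddenLayerChoices, iterationChoices)]

for p in parameterSets:
    print(p)
```

The resulting `parameterSets` list could then be passed in place of the literal list. Keep in mind that every added combination multiplies the training time across all folds.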

In [None]:
lda = ql.LDA()
lda.train(X,Tclasses)
predictedClasses,_,_ = lda.use(X)
ml.confusionMatrix(Tclasses,predictedClasses,classes); # <- semi-colon prevents printing of returned result
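
For reference, a confusion matrix tabulates, for each true class, how its samples were distributed over the predicted classes. The sketch below computes row percentages in plain Python. It is an illustration only, not the `mlutils` implementation, and whether `ml.confusionMatrix` reports counts or percentages is not assumed here. The function name `confusionPercents` and the labels are made up.

```python
def confusionPercents(trueClasses, predictedClasses, classes):
    """Return {trueClass: {predictedClass: percent of that true class's samples}}."""
    table = {}
    for c in classes:
        # Predictions made for the samples whose true class is c
        predictedForC = [p for t, p in zip(trueClasses, predictedClasses) if t == c]
        table[c] = {k: (100.0 * predictedForC.count(k) / len(predictedForC)
                        if predictedForC else 0.0)
                    for k in classes}
    return table

true = [15, 15, 25, 25, 25, 35]   # made-up true labels
pred = [15, 25, 25, 25, 35, 35]   # made-up predictions
table = confusionPercents(true, pred, classes=[15, 25, 35])
for c, row in table.items():
    print(c, row)
```

Large off-diagonal entries show which pairs of classes a model confuses, which overall percent correct cannot reveal.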

## Data

Pick at least two classification data sets and apply LDA, QDA, Linear Logistic Regression and Nonlinear Logistic Regression to them.

## Results

In this section, we will be looking for

* clear explanations of each function;
* experiments with two different data sets with descriptions of the data;
* and discussion of each result, including
  * accuracies as percent correctly classified,
  * best parameter values,
  * some analysis of each classification algorithm and how it is classifying the data by examining the $\mu$ values for LDA and QDA, and the first layer's weight values for the neural networks;
* and discussion of which algorithm works best for each data set.
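
As a starting point for examining the $\mu$ values, the per-class feature means can be computed directly from the data. This is a minimal sketch in plain Python with made-up samples. The function name `classMeans` is hypothetical, and how `qdalda.py` stores its means is not assumed here.

```python
def classMeans(X, T):
    """Return {classLabel: list of per-feature means over that class's rows}."""
    rowsByClass = {}
    for x, t in zip(X, T):
        rowsByClass.setdefault(t[0], []).append(x)   # T holds one label per row
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in rowsByClass.items()}

X = [[8, 300.0], [8, 340.0], [4, 100.0], [4, 120.0]]   # made-up [cylinders, displacement]
T = [[15], [15], [35], [35]]                           # made-up class labels
means = classMeans(X, T)
for c, mu in means.items():
    print(c, mu)
```

Features whose means differ strongly between classes are candidates for what the classifier relies on. If the inputs are standardized first, the means become comparable across features with different units.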

## Grading

Your notebook will be run and graded automatically. Download [A4grader.tar](http://www.cs.colostate.edu/~anderson/cs480/notebooks/A4grader.tar)  and extract A4grader.py from it. Run the code in the following cell to demonstrate an example grading session. You should see a perfect score of 80/100 if your functions are defined correctly. 

The remaining 20% will be based on your writing.  Be sure to explain each function, and details of the results summarized in the above section. 

## Check-in

Do not include this section in your notebook.

Name your notebook ```Lastname A4.ipynb```.  So, for me it would be ```Anderson A4.ipynb```.  Submit the file using the ```Assignment 4``` link on [Canvas](https://colostate.instructure.com/courses/28803).

Grading will be based on 

  * correct behavior of the required functions,
  * readability of the notebook,
  * effort in making interesting observations, and in formatting your notebook,
  * testing your code on two different classification data sets of your choice.

In [None]:
%run -i A4grader.py