# Assignment 6: Multi-class classification

This assignment has two basic tasks and one open-ended extra task for those who want more of a challenge:
1.  Add the "macro" averaged precision and recall to the **multiPerformance** method from the multiclassv2 exercise (copy all of the relevant code to this module).
2.  Add one more class, letters, to your multi-class classifier, so that it has a total of 11 classes. The letters data samples are here:
    *  Shorter (1000 samples): https://raw.githubusercontent.com/big-data-analytics-physics/data/master/emnist/emnist_letters_shuffled_1k.csv
    *  Longer (7000 samples): https://raw.githubusercontent.com/big-data-analytics-physics/data/master/emnist/emnist_letters_shuffled_7k.csv

    Note that each of these files contains a random sample of the 26 English letters in upper and lower case.
      
3.  Extra stuff: if you have time and want to expand your abilities, try your hand at data augmentation of the MNIST dataset. The idea is to enlarge (augment) your data sample by resampling the data you already have. Here are some ideas:
    * randomly shift the images up/down/left/right by 1-2 pixels
    * apply random rotations of a few degrees

    First verify that your modifications work (by displaying the images). Then make a "test" set from the augmented images only, and check how a classifier trained on the original (non-augmented) data set performs on it.
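
A minimal sketch of the shift idea, assuming each image is stored as a flat row of 784 pixel values (28x28), as in the CSV files used in this notebook. The helper name `shift_image` is hypothetical and uses only NumPy; pixels pushed out of the frame are discarded and the vacated border is filled with zeros. Rotations by a few degrees could be handled in a similar way, for example with `scipy.ndimage.rotate(img, angle, reshape=False)`.

```python
import numpy as np

def shift_image(flat_pixels, dx, dy, size=28):
    """Shift a flattened size x size image by dx columns and dy rows.

    Positive dx moves the image right, positive dy moves it down.
    Pixels pushed outside the frame are discarded; new pixels are 0.
    """
    img = np.asarray(flat_pixels).reshape(size, size)
    shifted = np.zeros_like(img)
    # Matching source and destination slices for rows (dy) and columns (dx)
    src_rows = slice(max(0, -dy), size - max(0, dy))
    dst_rows = slice(max(0, dy), size - max(0, -dy))
    src_cols = slice(max(0, -dx), size - max(0, dx))
    dst_cols = slice(max(0, dx), size - max(0, -dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)

rng = np.random.default_rng(42)
demo = np.zeros(784, dtype=int)
demo[14 * 28 + 14] = 255               # a single bright pixel at row 14, column 14
dx, dy = rng.integers(-2, 3, size=2)   # random shift of up to 2 pixels in each direction
augmented = shift_image(demo, int(dx), int(dy))
```

Displaying `augmented.reshape(28, 28)` next to the original is a quick way to confirm that the shift behaves as intended before building a full augmented test set.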

In [0]:
import pandas as pd
letters = pd.read_csv("https://raw.githubusercontent.com/big-data-analytics-physics/data/master/emnist/emnist_letters_shuffled_1k.csv", header=None)
letters['digit'] = 10    # all letters share a single extra class, labelled 10

short = "short_"

# Read in all of the other digits
dfCombined = pd.DataFrame()
for digit in range(10):
    print("Processing digit ",digit)
    fname = 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch3/digit_' + short + str(digit) + '.csv'
    df = pd.read_csv(fname,header=None)
    df['digit'] = digit
    dfCombined = pd.concat([dfCombined, df])

print("Length of sample:     ",len(dfCombined))

dfCombined = pd.concat([dfCombined, letters])
print("The length of the combined: ", len(dfCombined))

Processing digit  0
Processing digit  1
Processing digit  2
Processing digit  3
Processing digit  4
Processing digit  5
Processing digit  6
Processing digit  7
Processing digit  8
Processing digit  9
Length of sample:      10000
The length of the combined:  11000


The multiclass performance method.

In [0]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined,random_state=42)    # by setting the random state we will get reproducible results
# Split into the image pixels (X) and the labels (y); KFold will index into them later.
X = dfCombinedShuffle.iloc[:, :784].values    # as_matrix() was removed in pandas 1.0
y = dfCombinedShuffle['digit'].values

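The macro accuracy computed below averages, over the classes, the fraction of each true class that is predicted correctly, which is exactly the macro recall. So I only need to add macro precision here.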
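
For reference, the two macro averages can be written out explicitly. The notation is an assumption of this note rather than something taken from the exercise: $C_{ij}$ is the number of samples with true class $i$ that were predicted as class $j$, and $K$ is the number of classes.

```latex
\mathrm{recall}_{\mathrm{macro}}
  = \frac{1}{K}\sum_{c=1}^{K}\frac{C_{cc}}{\sum_{j} C_{cj}},
\qquad
\mathrm{precision}_{\mathrm{macro}}
  = \frac{1}{K}\sum_{c=1}^{K}\frac{C_{cc}}{\sum_{i} C_{ic}}
```

Recall normalizes each diagonal entry by its row sum (all samples that truly belong to class $c$), while precision normalizes by its column sum (all samples predicted as class $c$).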

In [0]:
# Used to implement the multi-dimensional counter we need in the performance class
from collections import defaultdict
def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels-1, final)))

# Determine the performance
def multiPerformance(y,y_pred,y_score,debug=False):
#
# Make our matrix
  confusionMatrix = autovivify(2,int)   # 2D counter indexed as [trueClass][predClass]; 11x11 once the letters class is included
  classes = set()
  totalTrue = defaultdict(int)
  totalPred = defaultdict(int)
  for i in range(len(y_pred)):
    trueClass = y[i]   # the true label of sample i
    classes.add(trueClass)
    predClass = y_pred[i]
    totalTrue[trueClass] += 1   # number of samples whose true label is trueClass
    totalPred[predClass] += 1   # number of times predClass was predicted
    confusionMatrix[trueClass][predClass] += 1

  if debug:
    for trueClass in classes:
      print("True: ",trueClass,end="")
      for predClass in classes:
        print("\t",confusionMatrix[trueClass][predClass],end="")
      print()
    print()
#
#
# Overall accuracy - sum the diagonals and divide by total
  accMicro = 0.0
  accMacro = 0.0
  precMacro = 0.0
  for c1 in classes:   # iterate over every class seen in the true labels
    accMicro += confusionMatrix[c1][c1]
    # Per-class accuracy (recall): fraction of true c1 samples that were predicted as c1
    accMacro += confusionMatrix[c1][c1]/totalTrue[c1]
    # Per-class precision: fraction of the c1 predictions that really are c1
    prediction = 0
    for c2 in classes:
      prediction += confusionMatrix[c2][c1]   # all predictions of c1; only [c1][c1] is correct
    if prediction > 0:   # a class that is never predicted contributes 0 instead of dividing by zero
      precMacro += confusionMatrix[c1][c1] / prediction
    
  accMicro /= len(y)
  accMacro /=  len(classes) #accMacro averaged.
  precMacro /= len(classes)
  results = {"confusionMatrix":confusionMatrix,"accuracyMicro":accMicro,"accuracyMacro":accMacro, "precisionMacro":precMacro}
  return results
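
As a sanity check on the definitions, here is a small self-contained sketch that recomputes macro recall and macro precision from a toy set of labels. The function name `macro_scores` and the toy data are hypothetical. Giving a class that is never predicted a precision of 0 is a convention chosen for this sketch, not something the exercise specifies.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (macro recall, macro precision) over the classes seen in y_true."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total_true = Counter(y_true)   # samples per true class
    total_pred = Counter(y_pred)   # predictions per class
    classes = sorted(total_true)
    recall = sum(correct[c] / total_true[c] for c in classes) / len(classes)
    # Assumed convention: precision is 0 for a class that is never predicted
    precision = sum(
        correct[c] / total_pred[c] if total_pred[c] else 0.0 for c in classes
    ) / len(classes)
    return recall, precision

# Toy example with three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
recall, precision = macro_scores(y_true, y_pred)
print("macro recall:   ", recall)
print("macro precision:", precision)
```

Here the per-class recalls are 3/4, 2/2 and 1/2, and the per-class precisions are 3/4, 2/3 and 1/1, so the two macro averages differ even though both are built from the same diagonal counts.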

In [0]:
def runFitter(estimator,X_train,y_train,X_test,y_test,debug=False):
#
# Now fit to our training set
  estimator.fit(X_train,y_train)
#
# Now predict the classes and get the score for our training set
  y_train_pred = estimator.predict(X_train)
  y_train_score = estimator.decision_function(X_train)   # NOTE: some estimators have a predict_proba method instead of decision_function
#
# Now predict the classes and get the score for our test set
  y_test_pred = estimator.predict(X_test)
  y_test_score = estimator.decision_function(X_test)

#
# Now get the performance
  results_test = multiPerformance(y_test,y_test_pred,y_test_score,debug=debug)
  results_train = multiPerformance(y_train,y_train_pred,y_train_score,debug=debug)
#
# Decide what you want to return: for now, the confusion matrices, micro/macro accuracy, and macro precision for both test and train
  results = {
      'cf_test':results_test['confusionMatrix'],
      'cf_train':results_train['confusionMatrix'],
      'accuracyMicro_test':results_test['accuracyMicro'],
      'accuracyMacro_test':results_test['accuracyMacro'],
      'accuracyMicro_train':results_train['accuracyMicro'],
      'accuracyMacro_train':results_train['accuracyMacro'],
      'precisionMacro_train':results_train['precisionMacro'],
      'precisionMacro_test':results_test['precisionMacro']
}

  return results
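
The note in `runFitter` about `decision_function` versus `predict_proba` can be handled with a small fallback helper. This is only a sketch: `get_scores` and the two stand-in classes are hypothetical names used for illustration, while `decision_function` and `predict_proba` are the method names that scikit-learn estimators expose.

```python
def get_scores(estimator, X):
    """Return per-sample scores from whichever scoring method the estimator has."""
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X)
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X)
    raise AttributeError("estimator has neither decision_function nor predict_proba")

# Stand-in classes (hypothetical) so the sketch runs without scikit-learn
class MarginModel:
    def decision_function(self, X):
        return [0.5 for _ in X]

class ProbaModel:
    def predict_proba(self, X):
        return [[0.2, 0.8] for _ in X]

print(get_scores(MarginModel(), [1, 2]))
print(get_scores(ProbaModel(), [1, 2]))
```

With such a helper, the two `estimator.decision_function(...)` calls could be replaced by `get_scores(estimator, ...)` so that estimators offering only probabilities also work. The two methods return values on different scales, so scores from different estimators should not be compared directly.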
  

In [0]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
kfolds = 5

#skf = StratifiedKFold(n_splits=kfolds)
skf = KFold(n_splits=kfolds)

In [0]:
#
# Get our estimator and predict
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

estimator = LinearSVC(random_state=42,dual=False,max_iter=500,tol=0.01)    # use dual=False when n_samples > n_features, which is what we have
#estimator = SGDClassifier(random_state=42,max_iter=500,tol=0.01)    # alternative: a linear classifier trained with stochastic gradient descent
#
# Create some vars to keep track of everything
avg_accuracyMicro_test = 0.0
avg_accuracyMicro_train = 0.0
avg_accuracyMacro_test = 0.0
avg_accuracyMacro_train = 0.0
avg_precisionMacro_train = 0.0
avg_precisionMacro_test = 0.0
numSplits = 0.0
#
# Also keep track of the confusion matrices from the last split
#
# Now loop
lastCF_train = None
lastCF_test = None
for train_index, test_index in skf.split(X, y):
  print("Training")
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test)
#
# 
  avg_accuracyMicro_test += results['accuracyMicro_test']
  avg_accuracyMicro_train += results['accuracyMicro_train']
  avg_accuracyMacro_test += results['accuracyMacro_test']
  avg_accuracyMacro_train += results['accuracyMacro_train']
  avg_precisionMacro_train += results['precisionMacro_train']
  avg_precisionMacro_test += results['precisionMacro_test']
  lastCF_train = results['cf_train']
  lastCF_test = results['cf_test']
  numSplits += 1.0
  print("   Split ",numSplits,"; accuracyMicro test/train",results['accuracyMicro_test'],results['accuracyMicro_train'],"; accuracyMacro test/train",results['accuracyMacro_test'],results['accuracyMacro_train'])
#
avg_accuracyMicro_test /= numSplits
avg_accuracyMicro_train /= numSplits
avg_accuracyMacro_test /= numSplits
avg_accuracyMacro_train /= numSplits
avg_precisionMacro_train /= numSplits
avg_precisionMacro_test /= numSplits

print("average accuracyMicro test:  ",avg_accuracyMicro_test)
print("average accuracyMicro train: ",avg_accuracyMicro_train)
print("average accuracyMacro test:  ",avg_accuracyMacro_test)
print("average accuracyMacro train: ",avg_accuracyMacro_train)
print("avg_precisionMacro train: ", avg_precisionMacro_train)
print("avg_precisionMacro test: ", avg_precisionMacro_test)

# Print the confusion matrices from the last split in a readable form
numClasses = 11   # 10 digits plus the letters class (label 10)
print("Test confusion matrix")
for trueClass in range(numClasses):
  print("True: ",trueClass,end="")
  for predClass in range(numClasses):
    print("\t",lastCF_test[trueClass][predClass],end="")
  print()
print()

print("Train confusion matrix")
for trueClass in range(numClasses):
  print("True: ",trueClass,end="")
  for predClass in range(numClasses):
    print("\t",lastCF_train[trueClass][predClass],end="")
  print()
print()


Training
   Split  1.0 ; accuracyMicro test/train 0.8781818181818182 0.9584090909090909 ; accuracyMacro test/train 0.8777550525016533 0.9583432290590896
Training
   Split  2.0 ; accuracyMicro test/train 0.8613636363636363 0.96375 ; accuracyMacro test/train 0.8600415288030387 0.9639726662434761
Training
   Split  3.0 ; accuracyMicro test/train 0.8754545454545455 0.9625 ; accuracyMacro test/train 0.8756474043333871 0.9625930754816242
Training
   Split  4.0 ; accuracyMicro test/train 0.865 0.9645454545454546 ; accuracyMacro test/train 0.8672396470767807 0.9642454116468427
Training
   Split  5.0 ; accuracyMicro test/train 0.8631818181818182 0.9609090909090909 ; accuracyMacro test/train 0.8624542280124128 0.9609462727311823
average accuracyMicro test:   0.8686363636363638
average accuracyMicro train:  0.9620227272727273
average accuracyMacro test:   0.8686275721454544
average accuracyMacro train:  0.9620201310324429
avg_precisionMacro train:  0.9619778989728319
avg_precisionMacro test:  0.8