# CAB420 - Troubleshooting Help: Classification Edition
Dr Simon Denman (s.denman@qut.edu.au)
Dr Ethan Goan (e.goan@qut.edu.au)

## Overview

Anyone who has done any amount of software development will know that it's not a matter of if things will go wrong, only a matter of when. This notebook walks through some common errors we have seen students make over the past few years, with hints for diagnosing what's going on and why, and solutions to these errors. It is hoped that this notebook will serve as a first port of call for whatever issues you might encounter throughout CAB420. I will put a bit of narrative around the topics so you can get more of an understanding, but we don't expect you to read this from top to bottom. Rather, you should be able to `Ctrl-f` for what you need and read just the section covering whatever is giving you grief, or find the right section based on the other information on Canvas that led you here.

We will use the term "error" here in a broader sense, where "error" means things simply not working as you intended. This could include Python spitting a red wall of text at you for a programming error, or a model failing to yield any meaningful output when put to the test.

### Classification Edition

This notebook is particularly focussed on classification errors. There are lots of possible errors in CAB420 - putting them all in one notebook just seemed a bit much.

Writing this straight after the regression one, there are some very big errors common to both. Any standardisation error you can make with regression, you can make with classification. I'm not going to cover those again here, as the fix is the same either way, though I will touch on one classification-specific standardisation issue.

In [1]:
import pandas
import numpy
import matplotlib.pyplot as plt
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.svm import SVC, NuSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from scipy.stats import norm
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV


Recent versions of sklearn don't like setting $C$ in an SVM to $\infty$, so I'm creating a big value here that we can pretend is $\infty$.

In [2]:
inf = 1e250

In [3]:
# don't worry about this, this is just so I can add a magic command that will let the notebook
# continue to run if there is an error in a cell.
from IPython import get_ipython
from IPython.core.magic import register_cell_magic
import sys, traceback

@register_cell_magic('handle')
def handle(line, cell):
    try:
        # exec rather than eval, so cells containing statements (e.g. assignments) also run;
        # running in the user namespace keeps any variables the cell creates
        exec(cell, get_ipython().user_ns)
    except Exception:
        # print a shortened traceback rather than halting the notebook
        traceback.print_exc(limit=8, file=sys.stdout)

## Some Data

We're going to use the red wine multi-class data from the third prac here.

I'm going to create two versions of the labels, one with numbers indicating the class, and one with strings. The reason for doing this is that I'll get some variation in errors depending on which one I use. Please note, neither one of these is better than the other. They will both work equally well, and will just change the error text a bit.

In [4]:

# reading in the multiclass data set
redwine_multiclass = pandas.read_csv('../data/wine/redwine-multiclass.csv')
# separating into our covariates/features and our response variable
# get the covariates by dropping the `quality` column (which is our response variable)
X = redwine_multiclass.drop('quality', axis=1).to_numpy()
# now get the response variable by just getting the `quality` column
Y = redwine_multiclass['quality'].to_numpy()
# create a categorical version, with string labels - so we can see other errors
Y_cat = pandas.Categorical(Y)
Y_cat = Y_cat.rename_categories({3:'horrid', 4:'bad', 5:'mediocre', 6:'ok', 7:'good', 8:'great'}).to_numpy()

# split data
X_train, X_other, Y_train, Y_other, Y_cat_train, Y_cat_other = train_test_split(X, Y, Y_cat, train_size=0.7, random_state=42)
X_val, X_test, Y_val, Y_test, Y_cat_val, Y_cat_test = train_test_split(X_other, Y_other, Y_cat_other, train_size=0.5, random_state=42)

## More fun with Standardisation

Let's go through the standardisation checklist quickly:
* Firstly, do you need to standardise? Things to consider here are:
  * Do you have scale variation between dimensions?
  * Do you have a method that's impacted by standardisation?
  * Is it appropriate to apply standardisation to your data?
* If you decide to standardise, you should make sure:
  * **You compute the mean and standard deviation on the training set only** (this is in bold because some of you - probably those not reading this - will forget this)
  * If you compute the mean and standard deviation yourself, make sure that you use `axis=0` so that you compute them for each column
* For classification, **do not standardise your labels**. These are categories, we don't need to standardise them.
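
The two bold points above can be sketched in a few lines. This is only a minimal illustration on made-up data (the arrays `X_tr` and `X_te` are hypothetical, not the wine data): the statistics are computed per column on the training rows only, reused for the other set, and then compared against what `StandardScaler` produces.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# hypothetical toy data: 3 features on very different scales
rng = numpy.random.default_rng(0)
X_tr = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 200.0], size=(100, 3))
X_te = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 200.0], size=(20, 3))

# axis=0 gives one mean and one standard deviation per column (feature)
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

# apply the training statistics to both sets
X_tr_manual = (X_tr - mu) / sigma
X_te_manual = (X_te - mu) / sigma

# a StandardScaler fitted on the training set should give the same result
scaler = StandardScaler().fit(X_tr)
same_train = numpy.allclose(X_tr_manual, scaler.transform(X_tr))
same_test = numpy.allclose(X_te_manual, scaler.transform(X_te))
print(mu.shape, same_train, same_test)
```

Note that the standardised test set will not have exactly zero mean and unit standard deviation. That is expected, as its statistics were never used.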
  
Of our models, SVMs and CKNNs will benefit from standardised data (RFs won't). Our data is tabular with lots of scale variation between columns, and two of our three models are affected by that, so standardisation makes sense here.

In [5]:
# standardisation
# get standardisation parameters on training set
x_scaler = StandardScaler().fit(X_train)
# apply to all sets
X_train_std = x_scaler.transform(X_train)
X_val_std = x_scaler.transform(X_val)
X_test_std = x_scaler.transform(X_test)

Note that above, we did not standardise our y labels. These are categoricals and doing this would not make sense. 

Let's look at the fun errors we get when we make this poor life choice.

In [6]:
# standardise numeric labels
y_scaler = StandardScaler().fit(Y_train.reshape(-1, 1))
Y_train_std_do_not_do_this = y_scaler.transform(Y_train.reshape(-1, 1))

# fit a model using standardised numeric labels
rf = RandomForestClassifier().fit(X_train_std, Y_train_std_do_not_do_this)

ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.

The error message here is pretty telling: `ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.`

Essentially, SKLearn is asking us "what are you doing using a continuous output on a classifier? Should you be doing regression?".
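
If you want to check how sklearn will interpret a label array before fitting, it exposes a helper called `type_of_target`. The snippet below is a small sketch on made-up labels (the array `y` is hypothetical, not the wine data) showing how standardising changes what sklearn thinks the labels are.

```python
import numpy
from sklearn.utils.multiclass import type_of_target

# hypothetical integer class labels, similar in spirit to the wine quality scores
y = numpy.array([3, 4, 5, 5, 6, 7, 8])

# standardising turns the discrete classes into arbitrary floats
y_std = (y - y.mean()) / y.std()

kind_raw = type_of_target(y)      # how sklearn sees the original labels
kind_std = type_of_target(y_std)  # how sklearn sees the standardised labels
print(kind_raw, kind_std)
```

If the second value comes back as `continuous`, a classifier will refuse the labels in the same way as above.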

In [7]:
# standardise string labels
y_scaler_string = StandardScaler().fit(Y_cat_train.reshape(-1, 1))

ValueError: could not convert string to float: 'good'

Trying to standardise our string labels also goes badly, this time falling at the first hurdle when looking at the string "good" and trying to work out what to do with it.

Standardisation needs numeric values, and you should not standardise your labels for a classification task.

## Inputs and Outputs around the Wrong Way

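Passing the labels and the inputs to `fit` in the wrong order is an easy mistake to make. Let's look at what this does to each of the classification methods.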

In [8]:
rf = RandomForestClassifier().fit(Y_train, X_train_std)

ValueError: Expected 2D array, got 1D array instead:
array=[7. 5. 5. ... 5. 7. 6.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

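First up, using the numeric labels, we get the error `Expected 2D array, got 1D array instead`, telling us our data is in the wrong shape. `fit` expects the 2D array of inputs first and the 1D array of labels second. Here we've passed them in the wrong order, so the 1D labels end up where the inputs should be.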

In [9]:
rf = RandomForestClassifier().fit(Y_cat_train, X_train_std)

ValueError: could not convert string to float: 'good'

Now using the categorical labels, we get the model trying to convert 'good' to a number. While we can have strings as our labels, our predictors/inputs need to be numeric.

We get similar errors for the SVM and CKNN.

In [10]:
svm = SVC().fit(Y_train, X_train)

ValueError: Expected 2D array, got 1D array instead:
array=[7. 5. 5. ... 5. 7. 6.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [11]:
cknn = KNeighborsClassifier().fit(Y_train, X_train)

ValueError: Expected 2D array, got 1D array instead:
array=[7 5 5 ... 5 7 6].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
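
All of the errors in this section come down to the shape and type of what was passed to `fit`. As a rough sketch, a quick check like the one below can catch the mix-up before the model does. The helper `check_fit_arguments` and the toy arrays are hypothetical names made up for this illustration; they are not part of sklearn.

```python
import numpy

def check_fit_arguments(X, y):
    """Hypothetical helper: list the problems with an (X, y) pair, empty if none are found."""
    X = numpy.asarray(X)
    y = numpy.asarray(y)
    problems = []
    if X.ndim != 2:
        problems.append(f"X should be 2D (samples, features), got shape {X.shape}")
    if y.ndim != 1:
        problems.append(f"y should be 1D (samples,), got shape {y.shape}")
    if not numpy.issubdtype(X.dtype, numpy.number):
        problems.append(f"X should be numeric, got dtype {X.dtype}")
    if X.ndim >= 1 and y.ndim >= 1 and X.shape[0] != y.shape[0]:
        problems.append(f"X has {X.shape[0]} samples but y has {y.shape[0]}")
    return problems

# toy data: 6 samples, 2 features, string labels
X_demo = numpy.arange(12, dtype=float).reshape(6, 2)
y_demo = numpy.array(['bad', 'ok', 'ok', 'good', 'bad', 'good'])

right_way = check_fit_arguments(X_demo, y_demo)
wrong_way = check_fit_arguments(y_demo, X_demo)
print(right_way)
print(wrong_way)
```

Printing `X.shape` and `y.shape` just before calling `fit` gets you most of the same information with less typing.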

## My SVM Trains for an Ice Age

The default for the SVM is to train until convergence - even if this takes until the end of time itself (I'm not going to demonstrate this here. If you want to do that yourself, just remove the setting of `max_iter` in the below code). For some people, this may not be practical, and this will be a particular problem if:
* You have a very large value of $C$, and/or
* Your data is not linearly separable.

What you can do to deal with this is specify the maximum number of iterations like this:

In [12]:
# setting max iterations in a regular old SVM
svm = SVC(C=inf, kernel='linear', max_iter=1000000).fit(X_train, Y_train)

# doing the same, but as part of a One vs the Rest ensemble
ovr = OneVsRestClassifier(SVC(C=inf, kernel='linear', max_iter=1000000)).fit(X_train, Y_train)



You should be setting `max_iter` to something big, as it will crank through the iterations quickly. If you're in doubt, start with a smaller value (maybe 10,000?) and see just how fast it is. Also keep in mind that if it takes millions of iterations to converge with a particular value of $C$, you might want to try reducing that value a bit anyway.
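
One way to "see just how fast it is" is to time the fit and check whether the solver stopped at the cap. The sketch below uses made-up overlapping data rather than the wine data, and `timed_fit` is a hypothetical helper name. It relies on sklearn raising a `ConvergenceWarning` when the SVM solver stops early, which is the behaviour I'd expect from recent versions.

```python
import time
import warnings
import numpy
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

# hypothetical toy data: two overlapping classes, so not linearly separable
rng = numpy.random.default_rng(0)
X_demo = numpy.vstack((rng.normal(0.0, 1.0, size=(100, 2)),
                       rng.normal(1.0, 1.0, size=(100, 2))))
y_demo = numpy.repeat([0, 1], 100)

def timed_fit(max_iter):
    """Fit a linear SVM with an iteration cap; return (model, seconds, hit_limit)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        start = time.perf_counter()
        model = SVC(C=1000, kernel='linear', max_iter=max_iter).fit(X_demo, y_demo)
        seconds = time.perf_counter() - start
    # True if sklearn warned that the solver stopped before converging
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, seconds, hit_limit

for cap in (10, 10_000):
    model, seconds, hit_limit = timed_fit(cap)
    print(f"max_iter={cap}: {seconds:.4f}s, stopped at the cap: {hit_limit}")
```

If the fit stops at the cap, the model is still usable, but it may be some way from the solution it would have converged to.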

## Holdout Validation Sets with GridSearchCV

When we do a grid search, generally we're going to evaluate each of the options on a holdout validation set. GridSearchCV, and [sklearn's other variants of this](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection), don't make how to do this obvious.

The magic secret is to create a combined training and validation version of the data, and a mask array that goes alongside it to indicate which samples belong to the training set and which belong to the validation set. In the mask, `-1` marks samples that are only ever used for training, and `0` marks the samples in the (single) validation fold. Then, we can simply pass this into the GridSearchCV object and it will use the split that we define.
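
To make the mask concrete, here is a tiny sketch with made-up sizes (3 training samples and 2 validation samples, nothing to do with the wine data) showing the single split that `PredefinedSplit` produces.

```python
import numpy
from sklearn.model_selection import PredefinedSplit

# toy setup: 3 training samples followed by 2 validation samples
n_train, n_val = 3, 2
test_fold = numpy.zeros(n_train + n_val, dtype=int)
test_fold[:n_train] = -1  # -1: never placed in the held-out fold, so always trained on

split = PredefinedSplit(test_fold)
splits = list(split.split())
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

There is only one split, so each parameter combination is trained once on the first block and scored once on the second.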

In [13]:
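# stack the training and validation sets, training samples first
X_train_and_val = numpy.vstack((X_train_std, X_val_std))
Y_train_and_val = numpy.hstack((Y_train, Y_val))
# mask: -1 for training samples (never held out), 0 for the validation fold
mask = numpy.zeros(len(Y_train_and_val))
mask[0:len(Y_train)] = -1
train_val_split = PredefinedSplit(mask)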

param_grid = [
  {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['rbf']}
 ]
svm = SVC(class_weight='balanced')
grid_search = GridSearchCV(svm, param_grid, cv=train_val_split, refit=False)
grid_search.fit(X_train_and_val, Y_train_and_val)
grid_search.cv_results_

{'mean_fit_time': array([0.06373572, 0.07393026, 0.07756829, 0.04518294, 0.05264044,
        0.07116604, 0.04312897, 0.0448544 , 0.05291629, 0.0670712 ,
        0.0532279 , 0.0480237 ]),
 'std_fit_time': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'mean_score_time': array([0.0123024 , 0.01222396, 0.01214933, 0.01136875, 0.01165652,
        0.01208711, 0.01054931, 0.01121402, 0.01215982, 0.00989795,
        0.01076174, 0.01125216]),
 'std_score_time': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'param_C': masked_array(data=[0.1, 0.1, 0.1, 1, 1, 1, 10, 10, 10, 100, 100, 100],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_gamma': masked_array(data=[0.1, 0.01, 0.001, 0.1, 0.01, 0.001, 0.1, 0.01, 0.001,
                    0.1, 0.01, 0.001],
              mask=[False, False, False, False, False, False, False, False,
          

From this, we can see that the third-last model has the best performance.
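
Reading the best model off the raw dictionary is a bit error prone, so you may prefer to pick it out in code. The sketch below does this on made-up data (not the wine data) with a smaller, hypothetical grid: it finds the highest `mean_test_score` in `cv_results_`, looks up the matching entry in `params`, and then fits a final model on the training portion only.

```python
import numpy
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# hypothetical toy data standing in for the standardised train and validation sets
rng = numpy.random.default_rng(0)
X_all = numpy.vstack((rng.normal(0.0, 1.0, size=(60, 2)),
                      rng.normal(2.0, 1.0, size=(60, 2))))
y_all = numpy.repeat([0, 1], 60)
order = rng.permutation(len(y_all))
X_all, y_all = X_all[order], y_all[order]

# first 90 samples are training (-1), the last 30 are the validation fold (0)
test_fold = numpy.zeros(len(y_all), dtype=int)
test_fold[:90] = -1

grid = {'C': [0.1, 1, 10], 'gamma': [0.1, 0.01], 'kernel': ['rbf']}
search = GridSearchCV(SVC(), grid, cv=PredefinedSplit(test_fold), refit=False)
search.fit(X_all, y_all)

# validation accuracy of each combination, and the settings of the best one
scores = search.cv_results_['mean_test_score']
best = int(numpy.argmax(scores))
best_params = search.cv_results_['params'][best]
print(best_params, scores[best])

# fit the chosen settings on the training portion only
final_model = SVC(**best_params).fit(X_all[:90], y_all[:90])
```

`numpy.argmax` returns the first of any tied scores, so if several combinations perform equally well it is worth looking at the full list before settling on one.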