# Data preparation

### General data preparation
The following splits the raw data set into training and testing data sets.

After loading *carabids.csv* as `carabid`, we must do some data cleaning and initialize some objects.

The function `%!in%` is effectively the opposite of `%in%` (i.e. it finds instances in x that are not in y).
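
As a quick illustration, here is a minimal sketch using made-up vectors (not the carabid data). `%!in%` returns `TRUE` for every element of `x` that is absent from `y`:

```r
# Negated version of %in%, defined the same way as in this tutorial
'%!in%' = function(x, y) !('%in%'(x, y))

# Toy vectors for illustration only
x = c("a", "b", "c", "d")
y = c("b", "d")

not_in_y = x %!in% y   # logical vector, one value per element of x
kept = x[not_in_y]     # elements of x that do not occur in y
print(kept)
```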

`sptable` allows species with few specimens (<30) to be easily identified (as `spname`). The "correct" cutoff value may change depending on the composition of your data and the objective of your model.

In [None]:
carabid = read.csv(file.choose())

'%!in%' = function(x,y)!('%in%'(x,y))

sptable = table(carabid$SpeciesName)
spname = names(which(sptable < 30))

#Train:Test ratio is 8:2, so I am initializing those values here
fractionTraining = 0.8
fractionvalid = 0.2

seed = 123

Next, we remove the species in `spname` and set the seed. Setting the seed will allow us to run this same code again and get the exact same split.

When computing the sample size of the training dataset, we use `floor()` to round down to the nearest integer. You could also use `ceiling()` to round up; what matters is that the sample size is a whole number.

We then take a random sample from our dataset and turn that into our training dataset, with the remaining data being our test dataset.

In [None]:
carabid_keep = subset(carabid, SpeciesName %!in% c('Carabidae sp.',spname))

set.seed(seed)

# Compute sample sizes.
sampleSizeTraining = floor(fractionTraining * nrow(carabid_keep))

# Create the randomly-sampled indices for the dataframe. Use setdiff() to
# avoid overlapping subsets of indices.
indicesTraining = sort(sample(seq_len(nrow(carabid_keep)), size=sampleSizeTraining))
indicestest = setdiff(seq_len(nrow(carabid_keep)), indicesTraining)

dfTraining = carabid_keep[indicesTraining, ]
dfTest = carabid_keep[indicestest, ]
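
A quick sanity check on a split like this can be sketched as follows. The toy data frame and its values are hypothetical stand-ins for `carabid_keep`; the point is only to confirm that the two sets of indices do not overlap and together cover every row.

```r
# Toy data frame standing in for carabid_keep (hypothetical values)
toy = data.frame(id = 1:10, SpeciesName = rep(c("A", "B"), each = 5))

set.seed(123)
nTrain = floor(0.8 * nrow(toy))
idxTrain = sort(sample(seq_len(nrow(toy)), size = nTrain))
idxTest = setdiff(seq_len(nrow(toy)), idxTrain)

# The two index sets should be disjoint and together cover every row
no_overlap = length(intersect(idxTrain, idxTest)) == 0
full_cover = setequal(c(idxTrain, idxTest), seq_len(nrow(toy)))
print(c(train = length(idxTrain), test = length(idxTest)))
```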

### Traditional machine learning preparation
Now that we've split our data, we need to do some additional cleaning. First, we will remove unnecessary columns, confirm our predictor variables are numeric, and remove specimens with missing data.

In [None]:
#Remove unnecessary columns
remove = c(1:17,25,29:32,43,47:50,55:57,65,73,81,89,97,105)
dfTraining = dfTraining[,-remove]
dfTest = dfTest[,-remove]

#Make predictor variables numeric
numcols = ncol(dfTraining)
dfTraining[,2:numcols] = as.data.frame(lapply(dfTraining[,2:numcols], as.numeric))
dfTest[,2:numcols] = as.data.frame(lapply(dfTest[,2:numcols], as.numeric))

#Find variables with missing values
sapply(dfTraining, function(x) sum(is.na(x)))
sapply(dfTest, function(x) sum(is.na(x)))

As we can see, some of the columns containing colour data have missing values. We can remove the rows with missing data and re-check our datasets to make sure there is no remaining missing data.

These steps can also be completed before splitting the data, and you may want to ensure that removing specimens did not lower any species abundance below your cutoff.

In [None]:
#Find the name of the first column with missing values
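nacol = names(which(sapply(dfTraining, function(x) sum(is.na(x))) > 0))[1]

#Keep only the rows that have a value in that column. Logical indexing is used
#because -which(...) would drop every row if no values were missing.
dfTraining = dfTraining[!is.na(dfTraining[,nacol]),]
dfTest = dfTest[!is.na(dfTest[,nacol]),]
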
sapply(dfTraining, function(x) sum(is.na(x)))
sapply(dfTest, function(x) sum(is.na(x)))
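
One way to check whether removing specimens pushed any species below the cutoff is sketched below. The toy data, the `colour` column, and the cutoff of 3 are all made up for illustration; with the real data you would use your own cutoff (30 in this tutorial).

```r
# Toy data: species "B" has two specimens with a missing measurement
toy = data.frame(
  SpeciesName = c(rep("A", 4), rep("B", 4)),
  colour = c(1, 2, 3, 4, 5, NA, NA, 8)
)
cutoff = 3  # hypothetical minimum number of specimens per species

# Drop rows with missing values, then recount specimens per species
toy_clean = toy[!is.na(toy$colour), ]
counts = table(toy_clean$SpeciesName)
below_cutoff = names(which(counts < cutoff))
print(below_cutoff)
```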

Finally, we will standardize our data using the `preProcess` function from the `caret` package.

In [None]:
library(caret)

#The standardization parameters are estimated from the training data only
normParam = preProcess(dfTraining)
norm.train = predict(normParam, dfTraining)
norm.test = predict(normParam, dfTest)
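
By default, `preProcess` centers and scales each numeric column, and `predict` then applies the parameters estimated from the training data to whichever dataset it is given. A rough base-R sketch of the same idea, using made-up toy values rather than the carabid data, might look like this:

```r
# Toy training and test sets (hypothetical measurements)
train = data.frame(length = c(10, 12, 14), width = c(2, 4, 6))
test = data.frame(length = c(12, 16), width = c(4, 8))

# Means and standard deviations are estimated from the training data only
mu = sapply(train, mean)
sigma = sapply(train, sd)

# The same training parameters are applied to both datasets
train_std = as.data.frame(scale(train, center = mu, scale = sigma))
test_std = as.data.frame(scale(test, center = mu, scale = sigma))
print(test_std)
```

Because the test data are transformed with the training means and standard deviations, the standardized test columns will generally not have a mean of exactly 0.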

### Deep learning data preparation

To prepare our data for deep learning, we must set up our image directories.

First, we must read in the filenames from our unsorted image directory and compare them to our tabular specimen data.

In [None]:
#Set "image_directory" to the directory where you store your unsorted images
#image_directory = your_image_directory
filenames = list.files(image_directory)

#Create vector of image names based on tabular data
ImageLabels = carabid_keep$ImageLabel
rois = carabid_keep$ROINumber
DorsalLabels = paste0(substr(ImageLabels, 1, 15), rois, ".jpg")
#We only have the dorsal image names so far, so we must add the ventral names
VentralLabels = paste0(substr(ImageLabels,1,8), "VENTRAL.", rois, ".jpg")

#Check to see if all specimens have matching dorsal and ventral images
#If this comes back FALSE, there might be something wrong with your images, image names, or tabular data
all(c(DorsalLabels, VentralLabels) %in% filenames)

Next, we can initialize some functions and objects that will help with our image sorting. `my.file.rename` will copy our files from the unsorted image directory to the sorted image directory. We use `file.copy` so that the original unsorted directory is left intact, but you can use `file.rename` if you don't want duplicate copies of your images. Leaving the original directory intact can be useful if you want to make changes to your training/testing datasets later on, but it will take up more disk space.

`split` is a vector that will inform our code to put the images into either the training or testing directory.

In [None]:
my.file.rename <- function(from, to) {
  todir <- dirname(to)
  if (!isTRUE(file.info(todir)$isdir)) dir.create(todir, recursive=TRUE)
  file.copy(from = from,  to = to)
}

split = vector(mode = 'character', length = length(ImageLabels))
split[indicesTraining] = "training"
split[indicestest] = "testing"

This loop will copy your unsorted images into a sorted directory with the structure "/dataset/SpeciesName/filename" (e.g. "/training/Cyclotrachelus furtivus/BLAN_01_DORSAL.1.jpg").

`sites` and `sitedex` serve no purpose in the loop other than keeping track of progress, as this process can take some time if you have many images.

In [None]:
#Set "sorted_directory" to the directory where you will store your sorted images
#sorted_directory = your_sorted_image_directory

sites = c()
sitedex = 1
for(filename in filenames){
  if(filename %in% DorsalLabels){
    #Row of the specimen that this dorsal image belongs to
    taxadex = which(DorsalLabels == filename)[1]
    taxa = carabid_keep$SpeciesName[taxadex]
    dataset = split[taxadex]
    my.file.rename(from = paste(image_directory,filename, sep = "/"),
                   to = paste(sorted_directory, dataset, taxa, filename, sep = "/"))
  }
  if(filename %in% VentralLabels){
    #Row of the specimen that this ventral image belongs to
    taxadex = which(VentralLabels == filename)[1]
    taxa = carabid_keep$SpeciesName[taxadex]
    dataset = split[taxadex]
    my.file.rename(from = paste(image_directory,filename, sep = "/"),
                   to = paste(sorted_directory, dataset, taxa, filename, sep = "/"))
  }
  if(substr(filename, 1, 4) %!in% sites){
    sites[sitedex] = substr(filename, 1, 4)
    sitedex = sitedex + 1
    print(paste("Starting", substr(filename, 1, 4), Sys.time()))
  }
}
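
To make the target directory layout concrete, the sketch below builds the same "dataset/SpeciesName/filename" structure inside a temporary folder, using empty placeholder files. The file names, species names, and training/testing assignments are invented for illustration.

```r
# Temporary directories standing in for image_directory and sorted_directory
image_directory = file.path(tempdir(), "unsorted")
sorted_directory = file.path(tempdir(), "sorted")
dir.create(image_directory, showWarnings = FALSE)

# Hypothetical specimens: file name, species, and training/testing assignment
toy = data.frame(
  filename = c("BLAN_01_DORSAL.1.jpg", "BLAN_01_DORSAL.2.jpg"),
  SpeciesName = c("Species A", "Species B"),
  dataset = c("training", "testing")
)
# Create empty placeholder files in the unsorted directory
invisible(file.create(file.path(image_directory, toy$filename)))

# Build each destination as sorted_directory/dataset/SpeciesName/filename
dest = file.path(sorted_directory, toy$dataset, toy$SpeciesName, toy$filename)
for (i in seq_len(nrow(toy))) {
  dir.create(dirname(dest[i]), recursive = TRUE, showWarnings = FALSE)
  file.copy(from = file.path(image_directory, toy$filename[i]), to = dest[i])
}
print(list.files(sorted_directory, recursive = TRUE))
```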

# Traditional machine learning

### XGBoost
The following code will train an eXtreme Gradient Boosting (XGBoost) model.

Before we begin, we must load the `xgboost` R package.

In [None]:
library(xgboost)

The XGBoost model requires labels to be numeric starting at 0. The training data must also be reformatted as an `xgb.DMatrix` object to be compatible with the model. You can also format the test data as an `xgb.DMatrix` and include it as part of the `watchlist` object. This will allow you to monitor the performance of the model on the test data while the model trains, which will allow you to determine if/when the model becomes overfit.

In [None]:
trainLabels = norm.train$SpeciesName
testLabels = norm.test$SpeciesName

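#Convert species names to integer codes starting at 0, using the same
#factor levels for the training and test labels
classLevels = levels(as.factor(trainLabels))
trainlab = as.numeric(factor(trainLabels, levels = classLevels)) - 1
testlab = as.numeric(factor(testLabels, levels = classLevels)) - 1

#The first column holds the species name, so it is excluded from the predictors
dtrain <- xgb.DMatrix(label = trainlab, data = as.matrix(norm.train[,2:ncol(norm.train)]))
dtest <- xgb.DMatrix(label = testlab, data = as.matrix(norm.test[,2:ncol(norm.test)]))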
watchlist = list(train = dtrain, test = dtest)
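
The label encoding can be illustrated on its own with a few made-up species names. This is only a sketch of the idea: the set of classes is fixed from the training labels, and both label vectors are encoded against it so that a given code always refers to the same species.

```r
# Toy species labels (hypothetical)
trainLabels = c("Species B", "Species A", "Species C", "Species A")
testLabels = c("Species C", "Species A")

# Fix the set of classes from the training labels
classLevels = levels(as.factor(trainLabels))

# Encode both label vectors against the same levels, starting at 0
trainlab = as.integer(factor(trainLabels, levels = classLevels)) - 1
testlab = as.integer(factor(testLabels, levels = classLevels)) - 1

print(trainlab)                  # position of each label in classLevels, minus one
print(classLevels[testlab + 1])  # decoding recovers the original test labels
```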

Finally, you can train your XGBoost model. A brief description of each of the model's parameters used here is as follows:

`data`: Your `xgb.DMatrix` training data.

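`max.depth`: The maximum depth of each tree. Deeper trees can capture more complex interactions between variables, but are more prone to overfitting.

`eta`: The learning rate of the model, in the range (0, 1]. Smaller values make each boosting round contribute less to the final model.

`nrounds`: The number of boosting rounds. Increase this as you decrease `eta`.

`watchlist`: Named datasets for model evaluation.

`eval.metric`: The metrics reported for each dataset in `watchlist`. Here we track the classification error rate ("merror") and the multiclass log loss ("mlogloss").

`verbose`: The amount of information that will be printed as the model trains. 0 = no information, 1 = evaluation metrics, 2 = evaluation metrics plus some additional information.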

`num_class`: The number of 'classes' (i.e. species) in the model.

`objective`: The objective function. Setting this as "multi:softprob" will give us a classification probability matrix when we classify new data with this model. Using "multi:softmax" would only give us the top-1 classifications without probabilities.

In [None]:
model = xgb.train(data = dtrain, 
                max.depth = 9, 
                eta = 0.1, 
                nrounds = 400, 
                watchlist = watchlist,
                eval.metric = "merror",
                eval.metric = "mlogloss",
                verbose = 1,
                num_class = 25,
                objective = "multi:softprob")

You can then make classifications using your model.

In [None]:
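#reshape = TRUE asks predict() for a matrix with one row per specimen and one
#column per class (without it, older versions of xgboost return a flat vector)
probs = predict(model, as.matrix(norm.test[,2:ncol(norm.test)]), reshape = TRUE)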

# Data evaluation

### Basic metrics
**(The following code also works with outputs from the deep learning tutorial)**

Most basic performance metrics for your model can be easily measured using the `confusionMatrix` function from the `caret` package.

To make our data compatible with the `confusionMatrix` function, we will need to convert our probability matrix to a vector of predicted classes.

In [None]:
library(caret)
colnames(probs) = levels(as.factor(trainLabels))
preds = apply(probs, MARGIN = 1, FUN = function(x){names(which.max(x))})
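#Use the same set of levels for predictions and references, so that classes
#missing from either vector are still represented
classes = colnames(probs)
confmat = confusionMatrix(factor(preds, levels = classes), factor(testLabels, levels = classes))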

Accuracy and related metrics can be found in `confmat$overall`.

Many useful metrics for comparing performance within classes can be found in `confmat$byClass`, such as precision, recall, and F1 score. You can also measure the macro average of these metrics by taking the average across all classes.

In [None]:
accuracy = confmat$overall[1]
#Simply using mean() to measure macro averages can work, but might
#give an incorrect result if values are missing
f1 = sum(confmat$byClass[,"F1"], na.rm = TRUE)/nrow(confmat$byClass)
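
The difference between the two ways of averaging can be seen in a small sketch with hypothetical per-class F1 scores, where one class has an undefined (`NA`) score, as can happen, for instance, if a class is never predicted:

```r
# Hypothetical per-class F1 scores; NA marks a class with an undefined F1
f1_by_class = c(A = 0.9, B = 0.6, C = NA)

# Dropping the NA averages over the two remaining classes only
mean_dropped = mean(f1_by_class, na.rm = TRUE)

# Dividing by the total number of classes counts the NA class as 0
mean_as_zero = sum(f1_by_class, na.rm = TRUE) / length(f1_by_class)

print(c(dropped = mean_dropped, as_zero = mean_as_zero))
```

The second form matches the calculation used above and penalizes the model for classes it fails on entirely.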

Creating a confusion matrix is a great way to visualize how your model is making classifications. Using a confusion matrix, you can easily see which classes are being misclassified and what they are erroneously being classified as. You can create a confusion matrix using `confmat$table` and the `dcast` function from `reshape2`.

In [None]:
library(reshape2)
conftab = as.data.frame(confmat$table)
conftab = dcast(conftab, formula = Prediction~Reference, value.var = "Freq")

### Top x accuracy

If you want to measure your model's accuracy when it is given several attempts to make a classification, you can use these custom functions. `topxacc` will return your model's accuracy when given 'x' attempts, while `topxpreds` returns the 'x' most likely classifications for each of your test specimens (e.g. top 3 accuracy or top 3 predictions).

These functions work by finding the top 'x' highest probability classes for any given classification in your classification probability matrix.

In [None]:
#'prob' must have the class labels as its column names, and 'testlab' must use
#the same labels. 'numclass' is not used in the calculation.
topxacc = function(x, testlab, prob, numclass){
  #Creating a dataframe of top 'x' predictions
  topxpreds = data.frame(matrix(NA, nrow = nrow(prob), ncol = x))
  for(i in seq_len(nrow(prob))){
    #Each row is reordered by descending probability. The names of the top 'x' columns
    #from each row are recorded in topxpreds
    topxpreds[i,] = colnames(prob)[order(as.numeric(prob[i,]), decreasing = TRUE)[1:x]]
  }
  #A specimen is counted as correct if its label matches any of its top 'x' predictions
  topxaccuracy = sum(as.character(testlab) == topxpreds)/length(testlab)
  
  return(topxaccuracy)
}

topxpreds = function(x, testlab, prob, numclass){
  #Creating a dataframe of top 'x' predictions
  topxpreds = data.frame(matrix(NA, nrow = nrow(prob), ncol = x))
  for(i in seq_len(nrow(prob))){
    #Each row is reordered by descending probability. The names of the top 'x' columns
    #from each row are recorded in topxpreds
    topxpreds[i,] = colnames(prob)[order(as.numeric(prob[i,]), decreasing = TRUE)[1:x]]
  }
  
  return(topxpreds)
}

top3acc = topxacc(3, testLabels, probs, 25)
top3preds = topxpreds(3, testLabels, probs, 25)
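
To see how top 'x' accuracy is computed, here is a small worked sketch on a made-up probability matrix with three specimens and three classes. It is a compact illustration of the same idea, not a replacement for the functions above; the class names and true labels are hypothetical.

```r
# Toy probability matrix: one row per specimen, one named column per class
probs_toy = matrix(c(0.6, 0.3, 0.1,
                     0.2, 0.5, 0.3,
                     0.1, 0.2, 0.7),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(NULL, c("A", "B", "C")))
truth = c("B", "A", "C")  # hypothetical true labels
x = 2

# For each row, keep the names of the x most probable classes
top_x = t(apply(probs_toy, 1, function(p) names(sort(p, decreasing = TRUE))[1:x]))

# A specimen counts as correct if its true label is among its top x classes
hits = sapply(seq_along(truth), function(i) truth[i] %in% top_x[i, ])
top_x_accuracy = mean(hits)
print(top_x_accuracy)
```

Here the first and third specimens have their true label among the two most probable classes and the second does not, so the top 2 accuracy is 2/3.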

### Hierarchical classification

Hierarchical classifiers allow you to make classifications at multiple taxonomic levels simultaneously. To set up a hierarchical classifier, you will need to make a taxonomic hierarchy dataframe for your data. First, create a vector called `hierarchylevels` that contains the column names in your dataset of the taxonomic levels you want to include. Then, use this custom `hierarchy` function to create your taxonomic hierarchy dataframe.

In [None]:
hierarchylevels = c("SpeciesName", 
                    "Genus", 
                    "Subtribe", 
                    "Tribe", 
                    "Supertribe", 
                    "Subfamily")

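hierarchy = function(data, ranks, LITLs = rep(ranks[1], nrow(data))){
  #LITLs holds, for each row of data, the name of the rank (an element of ranks)
  #that the specimen was identified to. By default, every specimen is assumed
  #to be identified to the first rank.

  baselabels = data[,ranks[1]]
    
  #Get all unique base labels
  uniquelabels = levels(as.factor(baselabels))
  
  #Create hierarchy DF with one row per unique base label
  hierarchy = data.frame(matrix(NA, nrow = length(uniquelabels), ncol = length(ranks)))
  colnames(hierarchy) = ranks
  #Set first column of DF as the unique baselabels
  hierarchy[,1] = uniquelabels
  #For loop starting with second taxonomic level/column
  for(i in 2:ncol(hierarchy)){
    #Iterate through each unique base label
    for(j in seq_along(uniquelabels)){
      row = which(baselabels == uniquelabels[j])[1]
      if(which(ranks == LITLs[row]) < i){
        #Rank i is above the rank the specimen was identified to,
        #so use the taxon recorded in the data
        hierarchy[j,i] = as.character(data[row,ranks[i]])
      }
      else{
        #Otherwise, carry the base label up to this rank
        hierarchy[j,i] = hierarchy[j,1]
      }
    }
  }
  return(hierarchy)
}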

myhierarchy = hierarchy(carabid, hierarchylevels)

Once you have your hierarchy dataframe, you can use it as a reference to convert your model's classifications into hierarchical classifications using this custom `hpredict` function. You can also use `hpredict` on your test labels for easier comparison with your hierarchical classifications. You can then measure performance metrics at any taxonomic level using `confusionMatrix`.

In [None]:
hpredict = function(preds, hierarchy){
  #Initializing prediction dataframe
  preddf = data.frame(matrix(NA, nrow = length(preds), ncol = ncol(hierarchy)))
  colnames(preddf) = colnames(hierarchy)
  preddf[,1] = as.character(preds)
  
  for(i in 2:ncol(hierarchy)){
    for(j in seq_len(nrow(preddf))){
      #For the current pred (in this case preddf[j,i-1]), find the first match
      #in the hierarchy dataframe and assign the label in the next column up
      #as preddf[j,i].
      preddf[j,i] = hierarchy[which(hierarchy[,i-1] == preddf[j,i-1])[1], i]
    }
  }
  #Return the completed dataframe; without this, the function returns NULL
  return(preddf)
}

hpreds = hpredict(preds, myhierarchy)
htest = hpredict(testLabels, myhierarchy)

genusconfmat = confusionMatrix(as.factor(hpreds$Genus), as.factor(htest$Genus))
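
The idea behind rolling species-level classifications up to a higher rank can be sketched with a tiny, made-up hierarchy table. The taxon names below are hypothetical, and the lookup uses base R's `match` rather than `hpredict`, purely to keep the example short.

```r
# Hypothetical hierarchy table: one row per species
toy_hierarchy = data.frame(
  SpeciesName = c("Genus1 alpha", "Genus1 beta", "Genus2 gamma"),
  Genus = c("Genus1", "Genus1", "Genus2")
)

preds_toy = c("Genus1 alpha", "Genus2 gamma", "Genus1 beta")  # predicted species
truth_toy = c("Genus1 beta", "Genus2 gamma", "Genus2 gamma")  # true species

# Look up the genus of each species-level label in the hierarchy table
pred_genus = toy_hierarchy$Genus[match(preds_toy, toy_hierarchy$SpeciesName)]
true_genus = toy_hierarchy$Genus[match(truth_toy, toy_hierarchy$SpeciesName)]

species_accuracy = mean(preds_toy == truth_toy)
genus_accuracy = mean(pred_genus == true_genus)
print(c(species = species_accuracy, genus = genus_accuracy))
```

The first specimen is misclassified at the species level but still falls in the correct genus, which is why accuracy is higher at the genus level in this toy example.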