# Data
The data consists of fluorescence spectra readings from six different species of bacteria: Bacillus cereus, Listeria monocytogenes, Staphylococcus aureus, Salmonella enterica, Escherichia coli, and Pseudomonas aeruginosa.
For each bacteria sample there are spectra readings for about 1043 different wavelengths of light and the three growth phases: lag, log, and stat (stationary). This means that for each bacteria sample there are 3 * 1304 data points. Furthermore, the spectra readings are generated with two different integration times (time spent gathering the spectra reading), 16ms and 32ms. 

When looking at a single growth phase, the data is used as-is. However, when using all growth phases, the bacteria samples that do not have data for all three growth phases are discarded. There are 47, 41, and 47 bacteria samples for the lag, log, and stationary growth phases, respectively. There are 39 bacteria samples with data for all three growth phases.
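
The filtering step described above can be sketched as a set intersection over sample IDs. This is only an illustration: the `phase_samples` layout, the sample IDs, and the `complete_samples` helper are hypothetical and are not taken from `experiment.py`.

```python
# Hypothetical layout: one set of sample IDs per growth phase.
phase_samples = {
    "lag": {"s01", "s02", "s03", "s04"},
    "log": {"s01", "s02", "s04"},
    "stat": {"s01", "s03", "s04"},
}


def complete_samples(phase_samples):
    """Return the sample IDs that have readings in every growth phase."""
    return set.intersection(*phase_samples.values())


complete = complete_samples(phase_samples)
print(sorted(complete))  # ['s01', 's04']
```

Samples `s02` and `s03` are each missing one phase, so only `s01` and `s04` would be kept when all growth phases are used together.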

There are some large numbers in the dataset (some spectra readings exceed 25,000). This poses a problem when training SVM models that use the linear kernel, as the linear kernel is very slow for large values. For example, an SVM using the rbf kernel would take less than ~0.1 seconds to train, while an SVM using the linear kernel could take up to ~16 minutes. To mitigate this effect I scaled the data into the interval [0.0, 1.0]. However, scaling is done 'globally', as opposed to scaling each feature individually as is done in the sklearn scaling libraries. This retains the relative scale between features, which is important because technically all the features in this dataset are readings of the same feature. Ignoring relative scale and scaling on a per-feature basis worsens classification performance.
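
To make the difference between the two scaling approaches concrete, here is a minimal sketch using NumPy. The function names and the toy array are assumptions for illustration; the actual scaling code lives in `experiment.py` and may differ.

```python
import numpy as np


def scale_globally(X):
    """Min-max scale into [0, 1] using ONE min and ONE max for the whole array."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(), X.max()
    return (X - lo) / (hi - lo)  # assumes hi > lo


def scale_per_feature(X):
    """Min-max scale each column independently (per-feature scaling)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)  # assumes every column has hi > lo


# Toy readings: column 0 is a weak wavelength, column 1 a strong one.
X = np.array([[100.0, 20000.0],
              [200.0, 25000.0],
              [300.0, 15000.0]])

X_global = scale_globally(X)
X_feature = scale_per_feature(X)

print(X_global)   # column 0 stays far smaller than column 1
print(X_feature)  # both columns are stretched to span the full [0, 1] range
```

With global scaling the weak wavelength remains weak relative to the strong one, whereas per-feature scaling stretches every column to the full range and discards that relationship.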

The dataset suffers from two other problems: a low sample size and class imbalance. There are 39 unique bacteria samples, of which 12 belong to the majority class (Bc) and three belong to the minority class (lm). These numbers exclude incomplete bacteria samples that did not have spectra readings for every growth phase.

The labels (or targets) are set to be the species of the given bacteria sample. These are:
- Bc - Bacillus cereus 
- lm - Listeria monocytogenes
- sa - Staphylococcus aureus 
- se - Salmonella enterica
- ec - Escherichia coli
- pa - Pseudomonas aeruginosa

# Models
The classifiers used in the following experiments are:
1. Naive Bayes
2. SVM
3. RandomForest with Decision Stumps
4. RandomForest with Decision Trees
5. AdaBoost with Decision Stumps
6. AdaBoost with Decision Trees

Additionally, the parameters 'C', 'gamma', and 'kernel' are optimised for the SVM model via grid search. The score given for the SVM model is the model initialised with the best parameters found in this parameter search.
The decision stumps/trees used with AdaBoost and RandomForest are tested with a max tree depth of 1 (for decision stumps) and 3. RandomForest models are tested with 512 classifiers and AdaBoost with 256 classifiers.
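
As a rough sketch, the model configurations described in this section could be built with scikit-learn along the following lines. Several details are assumptions: `GaussianNB` as the Naive Bayes variant, the `RANDOM_STATE` value, the `build_models` helper, and the small `param_grid` (which is purely illustrative and is not the grid used in the experiments).

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 0  # assumed seed, shared by every component


def build_models(random_state=RANDOM_STATE):
    """Build the six classifier configurations, keyed by a descriptive name."""
    # Illustrative grid only; the values searched in the experiments may differ.
    param_grid = {
        "C": [0.1, 1, 10],
        "gamma": [0.1, 1],
        "kernel": ["linear", "rbf"],
    }
    models = {
        "Naive Bayes": GaussianNB(),
        "SVM": GridSearchCV(SVC(), param_grid, cv=3),
    }
    for depth in (1, 3):  # depth 1 = decision stumps, depth 3 = shallow trees
        models[f"RandomForest (max_depth={depth})"] = RandomForestClassifier(
            n_estimators=512, max_depth=depth, random_state=random_state
        )
        models[f"AdaBoost (max_depth={depth})"] = AdaBoostClassifier(
            DecisionTreeClassifier(max_depth=depth),
            n_estimators=256,
            random_state=random_state,
        )
    return models


models = build_models()
print(list(models))
```

The base tree is passed to `AdaBoostClassifier` positionally because the keyword name for that argument has changed between scikit-learn releases.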

# Methodology
In the code below an experiment refers to a sequence of tests which evaluate the performance of various models. An experiment is run for the entire dataset and again for each subset of the dataset, where a subset is simply the data from a single growth phase. In each experiment I run the same series of tests twice, once for each integration time. 

Each model is evaluated using both the original, untransformed data and a PCA-transformed version of the data. Models are evaluated using repeated stratified k-fold cross validation where the data is split into three folds (n_splits) 20 times (n_repeats). The scores given for both the untransformed data and the PCA data consist of the mean score over all 60 individual folds +/- two standard deviations.
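
The evaluation scheme can be sketched as follows. This is a minimal example under stated assumptions, not the code from `experiment.py`: the `evaluate` helper, the `RANDOM_STATE` value, the synthetic data from `make_classification`, the choice of `GaussianNB`, and `n_components=5` for PCA are all placeholders. Fitting PCA inside a pipeline, so that it is refit on each training fold, is also an assumption about how the transform is applied.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

RANDOM_STATE = 0  # assumed seed


def evaluate(model, X, y, random_state=RANDOM_STATE):
    """Return (mean accuracy, two standard deviations, number of folds)."""
    cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=20, random_state=random_state)
    scores = cross_val_score(model, X, y, cv=cv)  # 3 folds x 20 repeats = 60 scores
    return scores.mean(), 2 * scores.std(), len(scores)


# Synthetic stand-in for the spectra data.
X, y = make_classification(
    n_samples=60, n_features=20, n_informative=5, n_classes=3,
    random_state=RANDOM_STATE,
)

mean, spread, n_folds = evaluate(GaussianNB(), X, y)
print(f"Accuracy: {mean:.2f} (+/- {spread:.2f}) over {n_folds} folds")

pca_model = make_pipeline(PCA(n_components=5), GaussianNB())
pca_mean, pca_spread, _ = evaluate(pca_model, X, y)
print(f"PCA Accuracy: {pca_mean:.2f} (+/- {pca_spread:.2f})")
```

Passing the same `random_state` to the splitter for both runs means the untransformed and PCA variants are scored on identical folds.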

The random state is set to the same value across different modules (e.g. train_test_split, RandomForest initialisation) to ensure results can be reproduced consistently.

At the end of the notebook I have added brief summaries of the results. Each summary contains a table of the top three configurations (in terms of both data and models) and a bar chart comparing classification scores across each configuration. The black lines on the bars in the bar chart indicate the +/- two standard deviation ranges.

The code for these experiments can be found under the file `experiment.py`.

# Brief Summary of Results
Overall, none of the models are able to produce good results, with the best classification accuracy being ~57% with an SVM.
When using data from all growth phases, 12 of the 39 samples belong to the majority class, so a classifier that consistently guesses the majority class would score about 31%. So while 57% is quite a bit better, it is still too unreliable for practical use.
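
The majority-class baseline is simple arithmetic on the sample counts given earlier. The `majority_baseline` helper below is a hypothetical name used only for this illustration.

```python
def majority_baseline(majority_count, total):
    """Accuracy of a classifier that always predicts the majority class."""
    return majority_count / total


baseline = majority_baseline(12, 39)  # 12 Bc samples out of 39 complete samples
print(f"{baseline:.2f}")  # 0.31
```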

In [1]:
from experiment import Experiment

In [2]:
experiment_lag = Experiment('lag')
experiment_lag.run()

################################################################################
Running tests for 16ms integration time.
################################################################################
**************************
Running Naive Bayes tests.
**************************
Accuracy: 0.41 (+/- 0.18)
PCA Accuracy: 0.29 (+/- 0.17)
Elapsed time: 0.69s.
******************
Running SVM tests.
******************
Fitting 60 folds for each of 220 candidates, totalling 13200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 4973 tasks      | elapsed:   10.4s
[Parallel(n_jobs=-1)]: Done 12973 tasks      | elapsed:   24.1s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   24.4s finished


Best grid search score was 0.48 with the following settings: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.50 (+/- 0.19)
PCA Accuracy: 0.48 (+/- 0.19)
Elapsed time: 24.98s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.47 (+/- 0.10)
PCA Accuracy: 0.46 (+/- 0.09)
Elapsed time: 53.84s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.43 (+/- 0.17)
PCA Accuracy: 0.45 (+/- 0.16)
Elapsed time: 58.33s.
**********************************************************************
Running AdaBoost tests using 256 Decision Trees with a max depth of 1.
**********************************************************************
A

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 509 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 5609 tasks      | elapsed:    7.7s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   15.8s finished


Best grid search score was 0.54 with the following settings: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.54 (+/- 0.19)
PCA Accuracy: 0.54 (+/- 0.20)
Elapsed time: 16.32s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.49 (+/- 0.08)
PCA Accuracy: 0.45 (+/- 0.12)
Elapsed time: 55.10s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.46 (+/- 0.13)
PCA Accuracy: 0.44 (+/- 0.14)
Elapsed time: 57.55s.
**********************************************************************
Running AdaBoost tests using 256 Decision Trees with a max depth of 1.
**********************************************************************
A

In [3]:
experiment_log = Experiment('log')
experiment_log.run()

################################################################################
Running tests for 16ms integration time.
################################################################################
**************************
Running Naive Bayes tests.
**************************
Accuracy: 0.40 (+/- 0.18)
PCA Accuracy: 0.39 (+/- 0.18)
Elapsed time: 0.56s.
******************
Running SVM tests.
******************
Fitting 60 folds for each of 220 candidates, totalling 13200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 235 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 10135 tasks      | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   14.9s finished


Best grid search score was 0.50 with the following settings: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.48 (+/- 0.21)
PCA Accuracy: 0.50 (+/- 0.21)
Elapsed time: 15.47s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.48 (+/- 0.11)
PCA Accuracy: 0.39 (+/- 0.18)
Elapsed time: 54.65s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.48 (+/- 0.18)
PCA Accuracy: 0.50 (+/- 0.17)
Elapsed time: 57.66s.
**********************************************************************
Running AdaBoost tests using 256 Decision Trees with a max depth of 1.
**********************************************************************
A

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2300 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   12.4s finished


Best grid search score was 0.50 with the following settings: {'C': 10, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.48 (+/- 0.23)
PCA Accuracy: 0.50 (+/- 0.25)
Elapsed time: 13.00s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.47 (+/- 0.12)
PCA Accuracy: 0.36 (+/- 0.16)
Elapsed time: 54.99s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.44 (+/- 0.20)
PCA Accuracy: 0.39 (+/- 0.23)
Elapsed time: 59.08s.
**********************************************************************
Running AdaBoost tests using 256 Decision Trees with a max depth of 1.
**********************************************************************


In [None]:
experiment_stat = Experiment('stat')
experiment_stat.run()

################################################################################
Running tests for 16ms integration time.
################################################################################
**************************
Running Naive Bayes tests.
**************************
Accuracy: 0.41 (+/- 0.18)
PCA Accuracy: 0.33 (+/- 0.18)
Elapsed time: 0.61s.
******************
Running SVM tests.
******************
Fitting 60 folds for each of 220 candidates, totalling 13200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 335 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   12.1s finished


Best grid search score was 0.50 with the following settings: {'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.50 (+/- 0.18)
PCA Accuracy: 0.51 (+/- 0.18)
Elapsed time: 12.60s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.50 (+/- 0.15)
PCA Accuracy: 0.34 (+/- 0.13)
Elapsed time: 53.55s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.52 (+/- 0.17)
PCA Accuracy: 0.47 (+/- 0.21)
Elapsed time: 57.44s.
**********************************************************************
Running AdaBoost tests using 256 Decision Trees with a max depth of 1.
**********************************************************************

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 987 tasks      | elapsed:    1.7s
[Parallel(n_jobs=-1)]: Done 11187 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 13200 out of 13200 | elapsed:   10.1s finished


Best grid search score was 0.56 with the following settings: {'C': 10, 'gamma': 1, 'kernel': 'linear'}
Accuracy: 0.59 (+/- 0.19)
PCA Accuracy: 0.56 (+/- 0.19)
Elapsed time: 10.66s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 1.
**************************************************************************
Accuracy: 0.51 (+/- 0.15)
PCA Accuracy: 0.42 (+/- 0.15)
Elapsed time: 52.40s.
**************************************************************************
Running RandomForest tests using 512 Decision Trees with a max depth of 3.
**************************************************************************
Accuracy: 0.52 (+/- 0.16)


In [None]:
experiment_all = Experiment('all')
experiment_all.run()

In [None]:
print('Results for Tests Run on Lag Growth Phase Data')

print('Top Three Configurations:\n', experiment_lag.top_three())
experiment_lag.plot_results();

In [None]:
print('Results for Tests Run on Log Growth Phase Data')

print('Top Three Configurations:\n', experiment_log.top_three())
experiment_log.plot_results();

In [None]:
print('Results for Tests Run on Stationary Growth Phase Data')

print('Top Three Configurations:\n', experiment_stat.top_three())
experiment_stat.plot_results();

In [None]:
print('Results for Tests Run on All Growth Phase Data')

print('Top Three Configurations:\n', experiment_all.top_three())
experiment_all.plot_results();