# Data preprocessing

Here we preprocess the data: we load the image files and save them to disk as numpy arrays.

We can specify the folder structure of how we save the files, which images we import for training and testing, and what type of preprocessing we choose to do on the images.

In [None]:
from basic_preprocessing import DataPreprocessor

### Specify which folders to read from when loading training and testing data

Each training image must be placed in a folder whose name is the label for the image. For example, a Type 1 image called `1.jpg` must have a path ending in `"/Type_1/1.jpg"`.

The testing images, on the other hand, have no labels and should all be placed together in a single folder.

 - `training_folders` should be a list of folder-names we get images from, i.e. a list of strings.
 - `testing_folder` should be a single folder, i.e. a string.
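The folder-as-label convention above can be illustrated with a minimal sketch. The helper `label_from_path` is hypothetical and is not part of `DataPreprocessor`; it only shows how a label could be read off the name of the folder that directly contains an image.

```python
from pathlib import PurePosixPath


def label_from_path(image_path):
    """Return the name of the folder that directly contains the image.

    Hypothetical helper for illustration only; DataPreprocessor may
    determine labels differently.
    """
    return PurePosixPath(image_path).parent.name


print(label_from_path("./Data/train/Type_1/1.jpg"))  # Type_1
```

Under this convention, moving an image to a different folder changes its label, so it is worth checking the folder layout before preprocessing.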

In [None]:
# ===================== USER INPUT =====================
training_folders = ["./Data/train"]
# + ["./Data/Type_1", "./Data/Type_2", "./Data/Type_3"]
testing_folder = "./Data/test"
# ======================================================

### Create a preprocessing object and test whether it reads the images correctly

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc = DataPreprocessor(training_folders=training_folders, testing_folder=testing_folder)
preproc.test_loading()

### Find a good image resizing

The input images are large, typically several thousand pixels along each axis. We want to make them as small as possible while still allowing a human looking at the resized image to tell which Type it is.

Here we can play with the variable `resize_shape` to choose one that does not wash out important information in the images.
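To get a feeling for how much data the resizing removes, the following sketch compares the number of values per image before and after resizing. The original size `(3000, 4000, 3)` is a hypothetical example, since the text above only says "several thousand pixels"; the resized shape is the one used in this notebook.

```python
def values_per_image(shape):
    """Number of array entries in an image of shape (height, width, channels)."""
    height, width, channels = shape
    return height * width * channels


original_shape = (3000, 4000, 3)  # hypothetical original size (assumption)
resized_shape = (150, 150, 3)     # resize_shape used in this notebook

reduction = values_per_image(original_shape) / values_per_image(resized_shape)
print(values_per_image(resized_shape))  # 67500
print(round(reduction))                 # 533
```

A reduction of this magnitude is why it is worth visually checking that the resized image still shows the relevant features.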

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc.test_resizing(resize_shape=(150, 150, 3), index_image=17)

### Specify parameters for preprocessing

 - `data_folder` is the folder in which we save the preprocessed numpy arrays, needed for TensorFlow training.
 - `training_subfolder` is the subfolder inside `data_folder` which contains the training numpy arrays (and their labels).
 - `testing_subfolder` is the subfolder inside `data_folder` which contains the testing numpy arrays (as well as a little auxiliary information about the arrays: which image each row corresponds to, and what order the Type 1, Type 2 and Type 3 appear in the one-hot-encoded labels).
 - `optimal_resize_shape` is the best resizing shape chosen above.
 - There are a lot of training images. We thus preprocess them in batches. `batch_size` controls the number of images in each batch. Note that if some images cannot be preprocessed, some batches will contain fewer than `batch_size` images.
 - Preprocessing can take some time. `parallelize` controls whether we want to parallelize this job over all local cores.
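As a rough illustration of what `batch_size` means, the sketch below splits a list of image names into consecutive batches. The helper `split_into_batches` and the image names are hypothetical; how `DataPreprocessor` actually forms its batches is not shown in this notebook.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


image_names = [f"{i}.jpg" for i in range(300)]  # hypothetical file names
batches = split_into_batches(image_names, 2**7)

print(len(batches))                        # 3
print([len(batch) for batch in batches])   # [128, 128, 44]
```

In this sketch only the last batch is smaller than `batch_size`. In the real preprocessing, any batch can also shrink when some of its images fail to load.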

In [None]:
# ===================== USER INPUT =====================
data_folder = "./TensorFlow_data"
training_subfolder = "/training_data"
testing_subfolder = "/testing_data"
optimal_resize_shape = (150, 150, 3)
batch_size = 2**7
parallelize = True
# ======================================================

### Execute the preprocessing

*N.B. The preprocessing is only recomputed if it cannot find the correct number of files in the expected folders. Therefore, if you change `optimal_resize_shape` and leave all other variables intact, the preprocessing will not happen! If you want to make sure all the numpy arrays get recomputed, you should delete the folders containing the numpy-array data.*
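One way to force a recomputation is to delete the output folders before running the preprocessing. The following is a minimal sketch using only the standard library; `clear_preprocessed_data` is a hypothetical helper, not part of `DataPreprocessor`, and it assumes the subfolders are given with a leading `/` as in the cell above.

```python
import shutil
from pathlib import Path


def clear_preprocessed_data(data_folder, subfolders):
    """Delete the given subfolders of data_folder if they exist.

    Hypothetical helper: permanently removes the saved numpy arrays so
    that the next preprocessing run has to recompute them.
    """
    removed = []
    for subfolder in subfolders:
        target = Path(data_folder) / subfolder.lstrip("/")
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(str(target))
    return removed
```

For the settings in this notebook the call would be `clear_preprocessed_data(data_folder, [training_subfolder, testing_subfolder])`. The deletion is irreversible, so double-check the paths first.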

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc.preprocess_save(data_folder=data_folder,
                        training_subfolder=training_subfolder,
                        testing_subfolder=testing_subfolder,
                        resize_shape=optimal_resize_shape, batch_size=batch_size,
                        parallelize=parallelize)

In [None]:
# TODO: Still need to include image flips!

# Neural Network

We construct a variety of neural networks and study how their behavior depends on the amount and quality of the data.

In [None]:
from convnet import ConvNet

## LeNet-like networks

We will first study simple LeNet-like convolutional neural networks, which consist of a couple of convolutional layers (with max pooling) followed by a few fully-connected layers.

After constructing a simple network, we'll begin by checking how the validation performance of the network is affected by the quantity of training data, to determine whether the network needs to be more complex or whether we need to increase the amount of data we train on.

### Specify parameters

We will train a network with a specific architecture. To speed up training, we will first train with a larger learning rate and then fine-tune with a smaller learning rate.

 - `input_shape` is the dimension-shape of the input image arrays.
 - `output_channels` is the number of output categories (i.e. cervix Types).
 - `convolutional_layers` specifies the parameters for the convolutional + max pooling layers. Each layer has the form  
 ```[int (number of output channels),  
    tuple of length 2 (size of conv filter),  
    tuple of length 2 (step size of conv filter),  
    tuple of length 2 (size of max pooling filter),  
    tuple of length 2 (step size max pooling filter)]```
 - `connected_layers` specifies the parameters for the fully connected layers. It is a list where each element is the number of neurons in the layer.
 - `keep_prob` is the dropout keep_prob (usually recommended to be 0.5).
 - `learning_rate` is the learning rate. We can change this later with the function `set_learning_rate`.
 - `model_name` is the model's name, used when saving and loading trained models.
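To see what a given `convolutional_layers` specification does to the image size, the sketch below computes the feature-map shape after each convolution + max pooling layer. It is not taken from `ConvNet`: the helper names are hypothetical, and because this notebook does not say which padding `ConvNet` uses, the sketch shows both common conventions (`"VALID"`, no padding, and `"SAME"`, zero padding).

```python
import math


def output_size(size, window, stride, padding):
    """Output length along one axis for a sliding window (conv or pooling)."""
    if padding == "VALID":
        return (size - window) // stride + 1
    if padding == "SAME":
        return math.ceil(size / stride)
    raise ValueError(f"Unknown padding: {padding}")


def feature_map_shapes(input_shape, convolutional_layers, padding="VALID"):
    """Shape (height, width, channels) after each conv + max pooling layer."""
    height, width, _ = input_shape
    shapes = []
    for channels, conv_size, conv_stride, pool_size, pool_stride in convolutional_layers:
        # Convolution step
        height = output_size(height, conv_size[0], conv_stride[0], padding)
        width = output_size(width, conv_size[1], conv_stride[1], padding)
        # Max pooling step
        height = output_size(height, pool_size[0], pool_stride[0], padding)
        width = output_size(width, pool_size[1], pool_stride[1], padding)
        shapes.append((height, width, channels))
    return shapes


layers = [[20, (4, 4), (1, 1), (2, 2), (2, 2)],
          [30, (4, 4), (1, 1), (2, 2), (2, 2)]]

print(feature_map_shapes((150, 150, 3), layers, "VALID"))  # [(73, 73, 20), (35, 35, 30)]
print(feature_map_shapes((150, 150, 3), layers, "SAME"))   # [(75, 75, 20), (38, 38, 30)]
```

The product of the final shape gives the number of inputs to the first fully connected layer, which is useful when judging how many neurons `connected_layers` should contain.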

In [None]:
# ===================== USER INPUT =====================
input_shape = (150, 150, 3)
output_channels = 3
convolutional_layers = [[20, (4, 4), (1, 1), (2, 2), (2, 2)],
                        [30, (4, 4), (1, 1), (2, 2), (2, 2)]]
connected_layers = [100, 30]
keep_prob = 0.5
learning_rate = 0.001 # standard value is 0.001
model_name = "testmodel"
# ======================================================

Now we initialize the network with the properties specified above.

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
simple_network = ConvNet(input_shape=input_shape,
                         output_channels=output_channels, 
                         convolutional_layers=convolutional_layers,
                         connected_layers=connected_layers, 
                         keep_prob=keep_prob,
                         learning_rate=learning_rate,
                         model_name=model_name) 
                         # This name means we have 2 convolutional layers and 2 fully connected layers

## Training

We will now train the neural network. We begin by specifying the parameters for training.

 - `data_folder` is the folder in which we save the preprocessed numpy arrays, needed for TensorFlow training.
 - `training_subfolder` is the subfolder inside `data_folder` which contains the training numpy arrays (and their labels).
 - `testing_subfolder` is the subfolder inside `data_folder` which contains the testing numpy arrays (as well as a little auxiliary information about the arrays: which image each row corresponds to, and what order the Type 1, Type 2 and Type 3 appear in the one-hot-encoded labels).
 - `saved_model_subfolder` is the subfolder inside `data_folder` in which we will store our trained models.
 - `size_of_minibatch` is the size of each training-step minibatch, i.e. the number of input images in each training step.

The validation set can be specified in one of two ways. Either we can specify a batch number to be validation set, or we can feed it a specific array of our choice.

 - `validation_batch` is the integer specifying which batch we use for validation.
 - Alternatively, we can specify `validation_inputarray` as the input array (with shape `(number of data points, size of x-axis, size of y-axis, number of color channels)`). We then also need to specify the validation labels with `validation_labels` (which should have shape `(number of data points, number of output channels)`).
 
***N.B. It's important you ONLY set `validation_batch`, or ONLY set `validation_inputarray` and `validation_labels`!***
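The label shape `(number of data points, number of output channels)` mentioned above corresponds to one-hot encoding. The sketch below shows what such an array looks like in plain Python lists. The helper `one_hot` and the class order are assumptions for illustration; the actual order is stored alongside the preprocessed data.

```python
def one_hot(labels, classes):
    """Encode each label as a row with a single 1 in its class column."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return rows


classes = ["Type_1", "Type_2", "Type_3"]  # assumed class order
encoded = one_hot(["Type_2", "Type_1", "Type_3", "Type_2"], classes)

print(encoded)                         # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
print((len(encoded), len(encoded[0])))  # (4, 3)
```

A manually supplied `validation_labels` array should follow the same layout and the same class order as the training labels.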

In [None]:
# ===================== USER INPUT =====================
data_folder = "./TensorFlow_data"
training_subfolder = "/training_data"
testing_subfolder = "/testing_data"
saved_model_subfolder = "/trained_models"
size_of_minibatch = 2**6
validation_batch = simple_network.count_training_batches(data_folder + training_subfolder) - 1
#validation_inputarray = load_training_data(validation_batch)
#validation_labels = load_training_labels(validation_batch)
# ======================================================

Test whether the network correctly loads the images.

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
simple_network.test_loading(batch_and_index=(0, 19), batch_loc=(data_folder+training_subfolder))

### Overfit the network on a single batch

To choose good parameters for training, it helps to first try to overfit the network on a single batch. This also helps us check that everything is working correctly.

Each time we train we specify:

 - `epochs` is the number of epochs we train for.
 - `load_saved_model` is used for loading a saved model. It is the full path to the saved model, including epoch number, e.g. `./TensorFlow_data/trained_models/mysavedmodel-20`. If we do not want to load a model, we set `load_saved_model=""`.
 - `training_batches` is a list specifying the set of batches we use for training, i.e. each element of `training_batches` is an integer.
 - `save_model` is a boolean specifying whether we should save our trained model as we train. The ConvNet saves the model every ten epochs, as well as the final epoch.
 - `model_destination_folder` is the folder in which we want to save our trained model. If we are loading a saved model and further training it, this variable can be set to `model_destination_folder=""`, which will save the further trained models into the same folder as `load_saved_model`. 
 - `printout` is a boolean which sets whether we want printouts with loss and accuracy as we train.
 - We set the learning rate for this training run with the `set_learning_rate` function.
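The saving schedule described above (every ten epochs, plus the final epoch) can be sketched as follows. The helper `checkpoint_epochs` is hypothetical and assumes epochs are counted from 1 within a single training run; how `ConvNet` numbers epochs when continuing from a loaded model is not shown here.

```python
def checkpoint_epochs(epochs, interval=10):
    """Epochs at which a model would be saved: every `interval` epochs plus the last one."""
    saved = list(range(interval, epochs + 1, interval))
    if not saved or saved[-1] != epochs:
        saved.append(epochs)
    return saved


print(checkpoint_epochs(45))  # [10, 20, 30, 40, 45]
print(checkpoint_epochs(20))  # [10, 20]
```

Knowing which epochs are saved helps when choosing the epoch number to put at the end of `load_saved_model`.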

In [None]:
# ===================== USER INPUT =====================
epochs = 45
load_saved_model = "./TensorFlow_data/deleteme_cont/testmodel-90"
training_batches = [1, 2]
save_model = True
model_destination_folder = ""
printout = True
simple_network.learning_rate = 0.0002
simple_network.keep_prob = 0.75
# ======================================================

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
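# Work out whether the validation set is specified by a batch number
# or by manually supplied arrays.
try:
    validation_batch  # raises NameError if no batch number was set
    validation_inputarray = []
    validation_labels = []
except NameError:
    # Fall back to the manually set validation_inputarray and validation_labels
    validation_batch = 0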

# Perform the training
(accuracy_list,
training_losses,
validation_losses) = simple_network.train(epochs=epochs,
                                          load_saved_model=load_saved_model,
                                          training_batches=training_batches,
                                          size_of_minibatch=size_of_minibatch,
                                          validation_inputarray=validation_inputarray,
                                          validation_labels=validation_labels,
                                          validation_batchnum=validation_batch,
                                          printout=printout,
                                          save_model=save_model,
                                          model_destination_folder=model_destination_folder)
    

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(accuracy_list)
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.show()

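# Skip the first two entries so the large initial losses do not dominate the plot
plt.plot(training_losses[2:], label="Training loss")
plt.plot(validation_losses[2:], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()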