# Data preprocessing

Here we preprocess the data: we load the image files and save them to disk as numpy arrays.

We can specify the folder structure used to save the files, which images to import for training and testing, and what kind of preprocessing to apply to the images.

In [None]:
from basic_preprocessing import DataPreprocessor

### Specify which folders to read from when loading training and testing data

Each training image must be placed in a folder whose name is the label for the image. For example, a Type 1 image called `1.jpg` must have a path ending in `"/Type_1/1.jpg"`.

The testing images, on the other hand, have no labels and should therefore all be placed together in a single folder.

 - `training_folders` should be a list of folder-names we get images from, i.e. a list of strings.
 - `testing_folder` should be a single folder, i.e. a string.
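
As a minimal sketch of the folder convention above, the label of a training image can be read off the name of the folder that contains it. The helper `label_from_path` is hypothetical and is not part of `basic_preprocessing`; how `DataPreprocessor` extracts labels internally is not shown in this notebook.

```python
from pathlib import PurePosixPath

def label_from_path(image_path):
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(image_path).parent.name

# Hypothetical example paths that follow the convention described above.
example_paths = ["./Data/train/Type_1/1.jpg", "./Data/train/Type_3/27.jpg"]
labels = [label_from_path(p) for p in example_paths]
print(labels)
```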

In [None]:
# ===================== USER INPUT =====================
training_folders = ["./Data/train"]
# + ["./Data/Type_1", "./Data/Type_2", "./Data/Type_3"]
testing_folder = "./Data/test"
# ======================================================

### Create a preprocessing object and test whether it reads the images correctly

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc = DataPreprocessor(training_folders=training_folders, testing_folder=testing_folder)
preproc.test_loading()

### Find a good image resizing

The input images are large, typically several thousand pixels along each axis. We want to resize them to be as small as possible while still allowing a human looking at the image to tell which Type it is.

Here we can play with the variable `resize_shape` to choose one that does not wash out important information in the images.

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc.test_resizing(resize_shape=(150, 150, 3), index_image=17)
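
To get a feel for why aggressive resizing matters, the following sketch compares the storage needed for one image before and after resizing. The original shape `(3000, 4000, 3)` and the 4 bytes per value (float32) are assumptions chosen for illustration; the actual image sizes and the dtype used by `DataPreprocessor` may differ.

```python
def array_bytes(shape, bytes_per_value=4):
    """Bytes needed to store one array of the given shape (float32 by default)."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total

original_shape = (3000, 4000, 3)   # hypothetical full-size photo
resized_shape = (150, 150, 3)      # the shape tried above

original_mb = array_bytes(original_shape) / 1e6
resized_mb = array_bytes(resized_shape) / 1e6
print(f"original: {original_mb:.1f} MB, resized: {resized_mb:.2f} MB")
print(f"reduction factor: {original_mb / resized_mb:.0f}x")
```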

### Specify parameters for preprocessing

 - `data_folder` is the folder in which we save the preprocessed numpy arrays, needed for TensorFlow training.
 - `training_subfolder` is the subfolder inside `data_folder` which contains the training numpy arrays (and their labels).
 - `testing_subfolder` is the subfolder inside `data_folder` which contains the testing numpy arrays (as well as a little auxiliary information about the arrays: which image each row corresponds to, and what order the Type 1, Type 2 and Type 3 appear in the one-hot-encoded labels).
 - `optimal_resize_shape` is the best resizing shape chosen above.
 - There are a lot of training images. We thus preprocess them in batches. `batch_size` controls the number of images in each batch. Note that if some images cannot be preprocessed, some batches will contain fewer than `batch_size` images.
 - Preprocessing can take some time. `parallelize` controls whether we want to parallelize this job over all local cores.
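
The batching behavior described above can be sketched as follows. This is an illustration only, not the code in `basic_preprocessing`: `make_batches` and `is_readable` are hypothetical helpers, and the "corrupt" file name is a stand-in for an image that fails to load.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def is_readable(name):
    """Stand-in for a real image-loading check (assumption for this sketch)."""
    return not name.startswith("corrupt")

image_names = [f"img_{i}.jpg" for i in range(300)]
image_names[5] = "corrupt_5.jpg"      # pretend one file cannot be loaded

demo_batch_size = 2**7
batches = make_batches(image_names, demo_batch_size)
# Drop unreadable images inside each batch, so some batches may shrink.
clean_batches = [[n for n in batch if is_readable(n)] for batch in batches]
print([len(b) for b in clean_batches])   # [127, 128, 44]
```

The first batch ends up with 127 images instead of 128 because one image was dropped, and the last batch is smaller simply because the images ran out.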

In [None]:
# ===================== USER INPUT =====================
data_folder = "./TensorFlow_data"
training_subfolder = "/training_data"
testing_subfolder = "/testing_data"
optimal_resize_shape = (150, 150, 3)
batch_size = 2**7
parallelize = True
# ======================================================

### Execute the preprocessing

*N.B. The preprocessing is only recomputed if it cannot detect the correct number of files in the expected folders. Therefore, if you change `optimal_resize_shape` and leave all other variables intact, the preprocessing will not be rerun! If you want to make sure all the numpy arrays get recomputed, you should delete the folders containing the numpy-array data.*
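
The note above can be illustrated with a rough sketch of a file-count check. The function `needs_recompute` is hypothetical and only mimics the described behavior; the real check inside `preprocess_save` may look different. Because such a check only counts files, it cannot notice that the arrays were produced with a different resize shape, which is why deleting the folder is the way to force a rerun.

```python
import os
import shutil
import tempfile

def needs_recompute(folder, expected_files):
    """True if the folder is missing or holds a different number of files."""
    if not os.path.isdir(folder):
        return True
    return len(os.listdir(folder)) != expected_files

# Build a throwaway folder with three empty stand-in files.
demo_root = tempfile.mkdtemp()
demo_folder = os.path.join(demo_root, "training_data")
os.makedirs(demo_folder)
for i in range(3):
    open(os.path.join(demo_folder, f"batch_{i}.npy"), "wb").close()

before = needs_recompute(demo_folder, expected_files=3)   # files are present
shutil.rmtree(demo_folder)                                # force a rerun
after = needs_recompute(demo_folder, expected_files=3)    # folder is gone
print(before, after)
shutil.rmtree(demo_root)                                  # clean up
```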

In [None]:
# ============ DOES NOT REQUIRE USER INPUT =============
preproc.preprocess_save(data_folder=data_folder,
                        training_subfolder=training_subfolder,
                        testing_subfolder=testing_subfolder,
                        resize_shape=optimal_resize_shape, batch_size=batch_size,
                        parallelize=parallelize)

In [None]:
# TODO: Still need to include image flips!
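
One possible way to approach the flips mentioned in the TODO is sketched below with plain numpy. This is not yet part of the pipeline: `flipped_copies` is a hypothetical helper, and it assumes images are stored as `(height, width, channels)` arrays and that mirroring an image does not change its Type.

```python
import numpy as np

def flipped_copies(image):
    """Return the image plus its left-right and up-down mirror images."""
    return [image, np.fliplr(image), np.flipud(image)]

# Tiny stand-in array with shape (height, width, channels).
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
augmented = flipped_copies(image)
print(len(augmented), augmented[1].shape)
```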

# Neural Network

We construct a variety of neural networks and study how their behavior depends on the amount and quality of the data.

In [None]:
from convnet import ConvNet

In [None]:
simple_network = ConvNet(input_shape=(150, 150, 3),
                         output_channels=3, 
                         convolutional_layers=[[20, (4, 4), (1, 1), (2, 2), (2, 2)],
                                               [30, (4, 4), (1, 1), (2, 2), (2, 2)]],
                         connected_layers=[100, 30], 
                         keep_prob=1.0,
                         learning_rate=0.002,
                         model_name="testmodel")
                         # The network has 2 convolutional layers and 2 fully connected layers
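
The format of each entry in `convolutional_layers` is not documented in this notebook. The sketch below assumes, purely for illustration, that an entry reads `[filters, kernel size, convolution stride, pooling size, pooling stride]`, and estimates how many features would reach the first fully connected layer under the two common padding modes. Both the layer format and the padding are assumptions; check `convnet.py` for the actual behavior.

```python
import math

def out_size(size, window, stride, padding):
    """Output length along one axis for a convolution or pooling window."""
    if padding == "SAME":
        return math.ceil(size / stride)
    return (size - window) // stride + 1      # "VALID" padding

def flattened_features(input_shape, layers, padding):
    """Assumes each layer is [filters, kernel, conv stride, pool size, pool stride]."""
    height, width, channels = input_shape
    for filters, kernel, conv_stride, pool, pool_stride in layers:
        height = out_size(height, kernel[0], conv_stride[0], padding)
        width = out_size(width, kernel[1], conv_stride[1], padding)
        height = out_size(height, pool[0], pool_stride[0], padding)
        width = out_size(width, pool[1], pool_stride[1], padding)
        channels = filters
    return height * width * channels

layers = [[20, (4, 4), (1, 1), (2, 2), (2, 2)],
          [30, (4, 4), (1, 1), (2, 2), (2, 2)]]
for padding in ("SAME", "VALID"):
    print(padding, flattened_features((150, 150, 3), layers, padding))
```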

In [None]:
simple_network.test_loading(batch_and_index=(0, 19))

In [None]:
# ===================== USER INPUT =====================
data_folder = "./TensorFlow_data"
training_subfolder = "/training_data"
testing_subfolder = "/testing_data"
saved_model_subfolder = "/trained_models"
size_of_minibatch = 2**6
validation_batch = simple_network.count_training_batches(data_folder + training_subfolder) - 1
#validation_inputarray = load_training_data(validation_batch)
#validation_labels = load_training_labels(validation_batch)
epochs = 50
# ======================================================

In [None]:
# FIRST TRAIN ON A SINGLE BATCH
(accuracy_list,
 training_losses,
 validation_losses) = simple_network.train(epochs=epochs, training_batches=[0], size_of_minibatch=size_of_minibatch,
                                           validation_batchnum=validation_batch, printout=True, save_model=False)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(accuracy_list)
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.show()

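# Label the two curves so they can be told apart in the plot
plt.plot(training_losses, label="Training loss")
plt.plot(validation_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()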

In [None]:
#train(self, epochs=10, load_saved_model="", training_batches=[],
#              size_of_minibatch=2**6, validation_inputarray=[],
#              validation_labels=[], validation_batchnum=0, printout=True,
#              save_model=True, model_destination_folder="")