# Deep learning
We must first import the modules required to train our model. If you do not yet have TensorFlow installed, follow the instructions here: https://www.tensorflow.org/install/pip. If you do not yet have pandas installed, follow the instructions here: https://pandas.pydata.org/docs/getting_started/install.html. If you do not yet have NumPy installed, follow the instructions here: https://numpy.org/install/.

In [None]:
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
import pandas as pd
import numpy as np

Here we set the image size, batch size, and image directories. All images will be resized to the specified image size (224×224 in this case). A smaller image size gives the model less information but makes it train more quickly. The opposite is true for larger images.

Batch size determines how many images will be propagated through the model at a time. Large batch sizes are more memory intensive, and can take longer to train. You might need to change the batch size based on your hardware specifications.
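
To get a feel for the memory cost described above, here is a rough estimate of the size of a single input batch. It is only a sketch: it assumes 3 color channels stored as 32-bit floats, and `batch_megabytes` is a hypothetical helper, not part of the notebook. It counts the input images only; the activations inside the model need considerably more memory.

```python
# Rough size of one input batch, assuming float32 pixels and 3 color channels.
img_height, img_width = (224, 224)
batch_size = 128


def batch_megabytes(batch_size, height, width, channels=3, bytes_per_value=4):
    """Return the size of one batch of images in MiB (hypothetical helper)."""
    n_values = batch_size * height * width * channels
    return n_values * bytes_per_value / 2**20


print(batch_megabytes(batch_size, img_height, img_width))  # 73.5
print(batch_megabytes(32, img_height, img_width))          # 18.375
```

Under these assumptions, dropping the batch size from 128 to 32 cuts the input batch to a quarter of its size.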

Set `mytrainingdirectory` and `mytestingdirectory` to the path where your training and testing images are stored, respectively.

In [None]:
img_height, img_width = (224,224)
batch_size = 128

train_data_dir = mytrainingdirectory
test_data_dir = mytestingdirectory
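
`flow_from_directory`, used further down, expects each of these directories to contain one subfolder per class. As a quick sanity check before training, the sketch below counts the image files in each class subfolder. The helper name `count_images_per_class` and the list of file extensions are assumptions made for illustration.

```python
from pathlib import Path


def count_images_per_class(data_dir, extensions=(".jpg", ".jpeg", ".png", ".bmp")):
    """Count image files in each class subfolder of data_dir (hypothetical helper)."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            # Only count regular files whose extension looks like an image.
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in extensions
            )
    return counts
```

Calling `count_images_per_class(train_data_dir)` returns a dictionary that maps each class folder name to its number of images, which makes empty or badly unbalanced classes easy to spot.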

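`ImageDataGenerator` will generate our batches of images, applying ResNet50's `preprocess_input` along with several augmentation methods: random shear, random zoom, and random horizontal and vertical flips.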

In [None]:
train_datagen = ImageDataGenerator(preprocessing_function = preprocess_input,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True,
                                   vertical_flip = True)
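
To illustrate what the two flip options do to an image, here is a small NumPy sketch on a toy array. It applies the flips deterministically, whereas `ImageDataGenerator` decides at random whether to flip each image; shear and zoom are not shown.

```python
import numpy as np

# A tiny 2x3 single-channel "image" so the effect of each flip is easy to read.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

horizontal = np.fliplr(image)  # mirror left-to-right, as horizontal_flip may do
vertical = np.flipud(image)    # mirror top-to-bottom, as vertical_flip may do

print(horizontal)
print(vertical)
```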

Within `ImageDataGenerator`, we will use `flow_from_directory`. This reads batches directly from our image directory, as opposed to `flow`, which takes images already loaded into the Python environment. This makes `flow_from_directory` less memory intensive.

We create `train_generator` and `test_generator` with the same parameters. The only differences are the image directory and that `test_generator` is not shuffled.

In [None]:
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = True,
    class_mode = 'categorical')


test_generator = train_datagen.flow_from_directory(
    test_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    seed = 123,
    shuffle = False,
    class_mode = 'categorical')

This code sets the seed and ensures reproducibility for our model training. The training will work without this code, but the final model will be slightly different each time. This code was adapted from https://stackoverflow.com/a/52897216/18022123.

In [None]:
# Seed value
seed_value= 321

# 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)

# 2. Set the `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)

# 3. Set the `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)

# 4. Set the `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.random.set_seed(seed_value)
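
The effect of fixing a seed can be checked on a small scale. The sketch below covers only the Python and NumPy generators (steps 2 and 3 above) and uses a hypothetical helper, `draw_numbers`; it does not exercise TensorFlow.

```python
import random

import numpy as np


def draw_numbers(seed_value):
    """Seed Python's and NumPy's generators, then draw a few random numbers."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    return random.random(), np.random.rand(3)


first_py, first_np = draw_numbers(321)
second_py, second_np = draw_numbers(321)

print(first_py == second_py)                # True
print(np.array_equal(first_np, second_np))  # True
```

Reseeding with the same value reproduces the same draws, which is the behavior the seeding cell relies on.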

Here we are defining our model architecture. As a base, we are using ResNet50 with weights pretrained on ImageNet, making this a transfer learning convolutional neural network. After this we add a global average pooling layer, one dense layer with ReLU activation, and a softmax classification layer.

We must also set the base layers as non-trainable so their weights are not modified.

The model is then compiled and ready to be fit to the data.

In [None]:
base_model = ResNet50(include_top = False, weights = 'imagenet')
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation = 'relu')(x)
predictions = Dense(train_generator.num_classes, activation = 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
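
To make the layers added on top of the base model more concrete, here is a NumPy sketch of the same three steps: global average pooling, a dense layer with ReLU, and a softmax layer. It is an illustration rather than the Keras implementation. The weights are random stand-ins, `n_classes = 5` is a hypothetical value, and the input shape assumes 224×224 images, for which ResNet50 without its top produces 7×7×2048 feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 5  # hypothetical; the notebook uses train_generator.num_classes

# Stand-in for the ResNet50 output on two 224x224 images: (batch, 7, 7, 2048).
features = rng.random((2, 7, 7, 2048))

# Global average pooling: average over the two spatial axes.
pooled = features.mean(axis=(1, 2))                       # shape (2, 2048)

# Dense layer with ReLU activation (random stand-in weights).
w1, b1 = rng.normal(0, 0.01, (2048, 1024)), np.zeros(1024)
hidden = np.maximum(pooled @ w1 + b1, 0)                  # shape (2, 1024)

# Softmax classification layer.
w2, b2 = rng.normal(0, 0.01, (1024, n_classes)), np.zeros(n_classes)
logits = hidden @ w2 + b2
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # subtract max for stability
probs = exp / exp.sum(axis=1, keepdims=True)              # each row sums to 1

# Number of weights in the two new Dense layers (the only trainable ones here).
n_trainable = w1.size + b1.size + w2.size + b2.size
print(pooled.shape, hidden.shape, probs.shape)
print(n_trainable)  # 2103301
```

With the base layers frozen, only the weights of the two `Dense` layers are updated during training; for five classes that is 2,103,301 parameters.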

This code trains the model using data from `train_generator` over 25 epochs. The model is also evaluated on the test data after every epoch, which allows you to follow its progress.

In [None]:
model.fit(train_generator,
          epochs = 25,
          validation_data = test_generator)
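
The amount of work in this training run depends on the dataset size and the batch size. As a worked example, assuming a hypothetical training set of 10,000 images, the number of batches per epoch and the total number of weight updates can be computed like this:

```python
import math

n_train_images = 10_000  # hypothetical dataset size
batch_size = 128
epochs = 25

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(n_train_images / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 79
print(total_updates)    # 1975
```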

Finally, we can create a prediction probability matrix and export it for further analysis in R.

In [None]:
preds = model.predict(test_generator)
preddf = pd.DataFrame(preds)
preddf.to_csv("DL-Predictions.csv", index = False)
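
The exported matrix has one row per test image and one column per class, but its columns are only numbered. One possible refinement, sketched below with small stand-in values, is to label the columns with class names and add the file names before exporting. In the notebook these values would come from `test_generator.class_indices` and `test_generator.filenames`; the arrays and names used here are made up for illustration. Rows line up with the file names because `test_generator` is not shuffled.

```python
import numpy as np
import pandas as pd

# Stand-ins for values that would come from the notebook (all hypothetical):
preds = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])               # model.predict(test_generator)
class_indices = {"cat": 0, "dog": 1, "fox": 2}    # test_generator.class_indices
filenames = ["cat/img001.jpg", "fox/img002.jpg"]  # test_generator.filenames

# Order the class names by their column index so they line up with preds.
class_names = sorted(class_indices, key=class_indices.get)

labelled = pd.DataFrame(preds, columns=class_names)
labelled.insert(0, "filename", filenames)
# The predicted class is the column with the highest probability in each row.
labelled["predicted"] = labelled[class_names].idxmax(axis=1)

print(labelled)
```

The labelled table can be written out with `to_csv` in the same way as `preddf`.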