In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
from skimage.io import imread

import tensorflow as tf

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout

from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.metrics import BinaryAccuracy
from tensorflow.keras.callbacks import TensorBoard

# Image Classification
## Demo

The data we use is available here:
https://drive.google.com/open?id=1xsi7LtfirtGIh3FqiGUmXoOzQWnZF6WZ
Add the downloaded folder to the project's main folder.

The dataset comes from [Kaggle](https://www.kaggle.com/chetankv/dogs-cats-images). It was originally split into `training` and `testing` sets, but they were merged together to demonstrate how to create our own data splits.

In [None]:
# Constants
DATA_DIR = "data"
SPLITS_DIR = "splits"

CAT_CLASS, DOG_CLASS = "cat", "dog"

TRAIN_PCT, VAL_PCT, TEST_PCT = 0.9, 0.05, 0.05

IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32

### Getting and cleaning data
First, we need to get all filenames. It's impractical to load all images in memory since the dataset can be really large. We can keep track of the filenames and "materialize them" (i.e. read the images) only when we need to do so, one mini-batch at a time. This is one of many examples of *lazy execution* we've seen.
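
To make the idea of "materializing" files lazily more concrete, here is a minimal sketch using a plain Python generator. The names `iterate_in_batches` and `fake_load` are hypothetical and only serve the illustration; in the notebook, the same role is played by the `tensorflow` dataset pipeline.

```python
def iterate_in_batches(filenames, batch_size, load_fn):
    """Yields lists of loaded items, reading only one mini-batch at a time."""
    for start in range(0, len(filenames), batch_size):
        batch_filenames = filenames[start:start + batch_size]
        yield [load_fn(name) for name in batch_filenames]

# Keep track of which files were actually "read"
loaded = []

def fake_load(name):
    """Stand-in for reading an image from disk (an assumption for this sketch)."""
    loaded.append(name)
    return name.upper()

names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
batches = iterate_in_batches(names, batch_size = 2, load_fn = fake_load)

# Nothing is read until we ask for a batch
first_batch = next(batches)
print(first_batch, loaded)
```

After requesting the first batch, only the first two files have been touched; the rest are read when (and if) we iterate further.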

In [None]:
def get_all_filenames(base_dir, class_names_plural = True):
    """
    Returns the filenames and their corresponding classes.
    Assumes the following structure:
    |---base_dir
    | |---class1
    | | |---image1.jpg
    | | |---image2.jpg
    | |---class2
    | | |---image1.jpg
    | | |---image2.jpg
    
    Since in our case we can infer the class from the filename,
    we could skip returning it but we'll do so for clarity.
    """
    
    filenames = {
        "image_filename": [],
        "image_class": []
    }
    
    for image_class in os.listdir(base_dir):
        image_class_dir = os.path.join(base_dir, image_class)
        filenames_in_class = [os.path.join(image_class_dir, file) for file in os.listdir(image_class_dir)]
        filenames["image_filename"].extend(filenames_in_class)
        
        normalized_image_class = image_class[:-1] if class_names_plural else image_class
        filenames["image_class"].extend([normalized_image_class] * len(filenames_in_class))
    return pd.DataFrame(filenames)

In [None]:
filenames = get_all_filenames(DATA_DIR)

In [None]:
filenames.shape

We can see that there are more than 10 000 filenames, which is unexpected. A simple regex on the filenames shows that some of them are duplicates, e.g. `data\cats\cat.4142(1).jpg`. There are exactly 28 of them, and they contain exactly the same information as the originals.
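
The regex check could look roughly like the sketch below. The pattern is an assumption about how the copies are named (a number in parentheses right before the extension), and apart from `cat.4142(1).jpg` the example filenames are made up.

```python
import re

# Matches names such as "cat.4142(1).jpg": a copy number in parentheses before the extension
DUPLICATE_PATTERN = re.compile(r"\(\d+\)\.\w+$")

example_filenames = [
    r"data\cats\cat.4142.jpg",
    r"data\cats\cat.4142(1).jpg",
    r"data\dogs\dog.17.jpg",
]

suspected_duplicates = [name for name in example_filenames if DUPLICATE_PATTERN.search(name)]
print(suspected_duplicates)
```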

We can assume that no more "contaminations" of the dataset exist although it might be useful to check for duplicate images with different filenames, or similar images.
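
One simple way to check for duplicate images with different filenames is to compare content hashes. This is a minimal sketch, not part of the original analysis; `find_identical_files` is a hypothetical helper, and it only detects byte-for-byte identical files (resized or re-encoded copies would need something like perceptual hashing).

```python
import hashlib
from collections import defaultdict

def find_identical_files(paths):
    """Groups paths whose byte content is identical (exact duplicates only)."""
    paths_by_digest = defaultdict(list)
    for path in paths:
        with open(path, "rb") as file:
            digest = hashlib.sha256(file.read()).hexdigest()
        paths_by_digest[digest].append(path)

    # Keep only the groups that contain more than one file
    return [group for group in paths_by_digest.values() if len(group) > 1]
```

Calling `find_identical_files(filenames.image_filename)` would return a list of groups, each containing the paths of files with the same content.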

In [None]:
# Use a raw string so that "\(" is passed to the regex engine unchanged
filenames = filenames[~filenames.image_filename.str.contains(r"\(")]

In [None]:
filenames.shape

### Data exploration
This is not meant to be a comprehensive exploration, just some really simple checks. As usual, we need to see the distribution of classes, image resolution (number of pixels), colors (if there's a mix of grayscale and color images, it can throw off our algorithm and we need to take it into account), etc.

In [None]:
filenames.groupby("image_class").size()

In [None]:
def get_image_dimensions(image_filename):
    """
    Returns the dimensions of the image (height, width, channels) in pixels.
    
    There are better methods which don't involve reading the entire image
    and loading it in memory but this is simple enough.
    """
    return imread(image_filename).shape

In [None]:
dimensions = filenames.image_filename.apply(get_image_dimensions)

In [None]:
dimensions = pd.DataFrame([*dimensions], columns = ["height", "width", "channels"])

There are no grayscale images.

In [None]:
len(dimensions[dimensions.channels != 3])

In [None]:
fig, (h_ax, w_ax, res_ax) = plt.subplots(1, 3, figsize = (18, 6))
h_ax.hist(dimensions.height, bins = 30)
h_ax.set_xlabel("Image height")

w_ax.hist(dimensions.width, bins = 30)
w_ax.set_xlabel("Image width")

res_ax.hist(dimensions.width / dimensions.height, bins = 40)
res_ax.set_xlabel("Aspect ratio")

plt.suptitle("Distributions of image dimensions")
plt.show()

We can see that most of the images are pretty small and they are close to a usual `4:3` ratio.

An in-depth data analysis would try to find specific "harder" examples, e.g. animals behind a cage, in strange positions, too small (or too large) images, images where the animal is not clearly visible, etc. It would also try to find mislabeled data (and if possible, quantify the percentage of wrong labels).

### Training and testing sets
We could split the data in multiple ways. One recommended way is to use a pre-defined split. That is, shuffle the data (use stratified shuffling when necessary), split, and write the results into a file. After that, it gets much easier to load the files, and the splits won't incur additional randomness in the model performance.

A relatively good split seems to be 90% / 5% / 5%. This leaves 250 images from each class for validation and testing while maximizing the amount of training images.

**Note:** We could make the function much more abstract by allowing it to create an arbitrary number of splits. Then, we simply need to iterate over them. This will allow repeated code (dataframe initialization, concatenation, final shuffling) to be replaced by a for-loop. However, it will make the code harder to understand from a scientific point of view. We always need to balance both.
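
For comparison, a more abstract version that supports an arbitrary number of splits might look like the sketch below. The function name `split_stratified` and its `random_state` parameter are assumptions made for the illustration; the demo dataframe is made up.

```python
import pandas as pd

def split_stratified(data, class_column, fractions, random_state = None):
    """
    Splits `data` into len(fractions) parts, keeping the class proportions in each part.
    `fractions` should sum to 1, e.g. (0.9, 0.05, 0.05).
    """
    parts = [[] for _ in fractions]
    for _, class_data in data.groupby(class_column):
        # Randomize the order (shuffle) before splitting
        class_data = class_data.sample(frac = 1, random_state = random_state)
        start, cumulative = 0, 0.0
        for part, fraction in zip(parts, fractions):
            cumulative += fraction
            end = round(cumulative * len(class_data))
            part.append(class_data[start:end])
            start = end

    # Shuffle each part so the classes are not consecutive
    return [pd.concat(part).sample(frac = 1, random_state = random_state) for part in parts]

demo = pd.DataFrame({
    "image_filename": [f"img_{i}.jpg" for i in range(200)],
    "image_class": ["cat"] * 100 + ["dog"] * 100,
})
train, val, test = split_stratified(demo, "image_class", (0.9, 0.05, 0.05), random_state = 42)
print(len(train), len(val), len(test))
```

Passing a fixed `random_state` makes the split reproducible even before it is written to a file. The loop is shorter than the explicit version, but the intent (train / validation / test) is less visible, which is the trade-off described in the note.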

In [None]:
train_data, val_data, test_data = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

for class_name, data in filenames.groupby("image_class"):
    # Randomize the order (shuffle) before splitting
    data = data.sample(frac = 1)
    train_end_index, val_end_index = round(TRAIN_PCT * len(data)), round((TRAIN_PCT + VAL_PCT) * len(data))
    train_data_in_class, val_data_in_class, test_data_in_class = \
        data[:train_end_index], data[train_end_index: val_end_index], data[val_end_index:]
    train_data = pd.concat([train_data, train_data_in_class])
    val_data = pd.concat([val_data, val_data_in_class])
    test_data = pd.concat([test_data, test_data_in_class])
    
# Randomize the sets once again so the classes are not consecutive
train_data = train_data.sample(frac = 1)
val_data = val_data.sample(frac = 1)
test_data = test_data.sample(frac = 1)

Then, we can save the datasets and load them from files later (so we don't have to regenerate them).

In [None]:
if not os.path.exists(SPLITS_DIR):
    os.makedirs(SPLITS_DIR)

In [None]:
for dataset, filename in zip([train_data, val_data, test_data], ["train", "val", "test"]):
    dataset.to_csv(os.path.join(SPLITS_DIR, filename + ".csv"), index = False)

### Preparing images for modelling
Because of the specifics of the model we're going to use (`Dense` layers), we'll need to make sure all images are the same size before passing them to the model. The easiest way to do this is to resize them to a fixed size. This will stretch or squish them, but the model will still be able to learn from them.

In practice, we usually do the image resizing beforehand so we don't have to do it every time we're reading a particular image. This is to say, we're *caching* the results.

`tensorflow` works with many dataset formats. The one we'd like to use is a tuple `(image_data, image_class)` where `image_data` is a 3D tensor, and `image_class` is a number (0 for `dog` / 1 for `cat` in our case).

In [None]:
train_data = pd.read_csv(os.path.join(SPLITS_DIR, "train.csv"))
val_data = pd.read_csv(os.path.join(SPLITS_DIR, "val.csv"))
test_data = pd.read_csv(os.path.join(SPLITS_DIR, "test.csv"))

In [None]:
def read_and_prepare_image(image_filename, image_class):
    # Get image
    image = tf.io.read_file(image_filename)
    image = tf.image.decode_jpeg(image)
    
    # Resize and normalize
    image = tf.image.resize(image, IMAGE_SIZE)
    
    def preprocess_image(x):
        """
        This is a stripped-down version of Keras' own imagenet preprocessing function,
        as the original one is throwing an exception
        """

        backend = tf.keras.backend
    
        # 'RGB'->'BGR'
        x = x[..., ::-1]
        mean = [103.939, 116.779, 123.68]
        std = None

        mean_tensor = backend.constant(-np.array(mean))

        # Zero-center by mean pixel
        if backend.dtype(x) != backend.dtype(mean_tensor):
            x = backend.bias_add(
                x, backend.cast(mean_tensor, backend.dtype(x)))
        else:
            x = backend.bias_add(x, mean_tensor)
        if std is not None:
            x /= std
        return x

    image = preprocess_image(image)
    
    
    # Return the correct class
    image_class_encoded = tf.where(image_class == CAT_CLASS, 1, 0)
    return image, image_class_encoded
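
To see what the preprocessing step does to the pixel values, here is a small NumPy sketch of the same arithmetic (channel flip, then mean subtraction). The helper name `preprocess_like_resnet` is hypothetical; the mean values are the ones used in the function above.

```python
import numpy as np

# Per-channel means in BGR order, the same values as in the function above
BGR_MEAN = np.array([103.939, 116.779, 123.68])

def preprocess_like_resnet(image_rgb):
    """Flips RGB to BGR and subtracts the per-channel mean; no scaling is applied."""
    image_bgr = image_rgb[..., ::-1].astype("float64")
    return image_bgr - BGR_MEAN

# A single pure red pixel: R = 255, G = 0, B = 0
red_pixel = np.array([[[255.0, 0.0, 0.0]]])
print(preprocess_like_resnet(red_pixel))
```

The red value ends up in the last channel (255 - 123.68 = 131.32), while the other two channels become negative, which is expected for zero-centered inputs.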

In [None]:
# Note that we're rewriting variable names in a lot of places. This is not always
# a good practice but we're working in a notebook so: 1) we don't need all variables
# but they're still allocated; 2) we can reuse the same name in a different context
def initialize_tf_dataset(data, should_batch = True, should_repeat = True):
    dataset = tf.data.Dataset.from_tensor_slices((data.image_filename.values, data.image_class.values))
    # Shuffle the (cheap) filenames first, so the shuffle buffer doesn't hold decoded images
    dataset = dataset.shuffle(buffer_size = len(data))
    dataset = dataset.map(read_and_prepare_image)
    
    if should_batch:
        dataset = dataset.batch(BATCH_SIZE)
    else:
        dataset = dataset.batch(len(data))
        
    if should_repeat:
        dataset = dataset.repeat()
    return dataset

train_data = initialize_tf_dataset(train_data)
val_data = initialize_tf_dataset(val_data)
test_data = initialize_tf_dataset(test_data, should_batch = False, should_repeat = False)

In [None]:
# Let's look at an example to see if it was created correctly
for batch in train_data:
    print(batch[0].shape, batch[1].shape)
    break

### Getting a model for transfer learning
We'll use ResNet50 as the model base. We could easily omit the "head", i.e. the `Dense` layers but in order to demonstrate how to get inputs and outputs from an existing model, we'll do this manually.

In [None]:
resnet50 = ResNet50()

In [None]:
resnet50.summary()

In [None]:
# Get the inputs and outputs we want, essentially getting a part of the original model.
# This paradigm is widely used, especially when we want to debug or explain a (part of a) model
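# We use `resnet50.input` because the auto-generated input layer name (e.g. "input_1")
# can change when more than one model is created in the same session
resnet50_conv = Model(inputs = resnet50.input, outputs = resnet50.get_layer("avg_pool").output)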

In [None]:
# Sequential makes the usage a bit simpler. Also, adding the resnet
# separately allows us to see a shorter summary
model = Sequential()
model.add(resnet50_conv)
model.add(Dense(64, activation = "relu"))
model.add(Dense(32, activation = "relu"))
model.add(Dense(1, activation = "sigmoid"))

In [None]:
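# Freeze the pre-trained part. Using the variable avoids relying on the
# auto-generated layer name ("model"), which can differ between runs
resnet50_conv.trainable = False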

In [None]:
model.summary()

In [None]:
# We're looking at the simplest case, when we have a well-defined loss function
# and easy-to-use metrics
model.compile(
    optimizer = Adam(),
    loss = BinaryCrossentropy(),
    metrics = [BinaryAccuracy()])

### Training
First, we need to train the newly added layers. We might want to add additional callbacks, e.g. for early stopping, changing the learning rate as we train, or stopping training entirely if the loss function (or an output) is `NaN`. It's also pretty common to write our own callbacks.

In [None]:
# Note that there might be slightly more, or slightly fewer steps per epoch using this approach.
# Since we're running through the entire dataset multiple times, this is not a problem. We could
# round up or down, without any significant impact on the training.
steps_per_epoch_train = round(len(filenames) * TRAIN_PCT / BATCH_SIZE)
steps_per_epoch_val = round(len(filenames) * VAL_PCT / BATCH_SIZE)

steps_per_epoch_train, steps_per_epoch_val
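
To illustrate the effect of rounding up or down, here is a small sketch. The helper `steps_per_epoch` is hypothetical, and the sample counts assume 10 000 images after cleaning (9 000 for training and 500 for validation).

```python
import math

def steps_per_epoch(num_samples, batch_size, round_up = True):
    """Number of mini-batches needed to see `num_samples` examples (about) once."""
    exact = num_samples / batch_size
    return math.ceil(exact) if round_up else math.floor(exact)

# Training: 9000 / 32 = 281.25 batches
print(steps_per_epoch(9000, 32), steps_per_epoch(9000, 32, round_up = False))

# Validation: 500 / 32 = 15.625 batches
print(steps_per_epoch(500, 32), steps_per_epoch(500, 32, round_up = False))
```

The two choices differ by a single step per epoch, which is why the exact rounding has little impact when the dataset is repeated.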

In [None]:
history = model.fit(
    train_data,
    epochs = 5,
    steps_per_epoch = steps_per_epoch_train,
    validation_data = val_data,
    validation_steps = steps_per_epoch_val,
    callbacks = [TensorBoard()])

We were able to get to ~95% accuracy on the training set, ~91% on the validation data, and ~84% on the testing set. The model is able to learn relatively easily (even after a few epochs) because it has already been trained on images of animals. Our task is really similar.

However, the model tends to overfit the training images too quickly - before being able to learn a lot of useful information (in the `Dense` layers).

### Conclusion
Usually, a single research project goes through multiple iterations until we reach a definitive result or show that it is unattainable (using the given resources).

To move forward, we may try a variety of steps, including:
* Regularizing the model by decreasing the number of parameters in the classification head
* Adding dropout to the classification head to reduce node dependencies
* Using a much smaller learning rate (in order to prevent quick overfitting)
* Using a different pre-trained model as a starting point
* Using one or more of the intermediate outputs of the model: this will allow the classification part to use lower-level features as input. This is not common in the overfitting case but it usually improves generalization performance
* Adding more data, by using data augmentation or by additional sampling

To debug the model(s), we need to:
* Save each (or several of the best) model architectures and weights so we can reuse them
* Evaluate and compare their performances
* Fine-tune hyperparameters (including model architecture)
* Look at right / wrong predictions (in this case, the algorithm didn't seem to struggle with any particular type of image, besides possibly animal posture; it just needed more data to generalize better)
* Select one, or even an ensemble of the best models
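
As a starting point for looking at right / wrong predictions, a sketch like the one below could be used. The helper `summarize_predictions` is hypothetical, and the labels and probabilities are made up for illustration only (1 = cat, 0 = dog); they are not results from the model above.

```python
import numpy as np

def summarize_predictions(true_labels, predicted_probabilities, threshold = 0.5):
    """Returns the accuracy and the indices of the wrongly classified examples."""
    true_labels = np.asarray(true_labels)
    predicted_labels = (np.asarray(predicted_probabilities) >= threshold).astype(int)
    wrong_indices = np.flatnonzero(true_labels != predicted_labels)
    accuracy = float(np.mean(true_labels == predicted_labels))
    return accuracy, wrong_indices

# Made-up values, for illustration only
true_labels = [1, 0, 1, 1, 0, 0]
predicted_probabilities = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1]

accuracy, wrong_indices = summarize_predictions(true_labels, predicted_probabilities)
print(accuracy, wrong_indices)
```

The returned indices can then be used to look up the corresponding filenames and display the images the model got wrong.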