# Planet: Understanding the Amazon from Space

Contributors: Yuem Park

This notebook uses data from this closed Kaggle competition: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space

As this was primarily a learning experience, much of the code was inspired by bits and pieces from the following notebooks:

* https://www.kaggle.com/robinkraft/getting-started-with-the-data-now-with-docs
* https://www.kaggle.com/anokas/data-exploration-analysis
* https://github.com/EKami/planet-amazon-deforestation/blob/master/notebooks/amazon_forest_notebook_preview.ipynb
* https://www.tensorflow.org/tutorials/load_data/images
* https://www.tensorflow.org/tutorials/images/transfer_learning
* https://www.tensorflow.org/guide/data
* https://cs230-stanford.github.io/tensorflow-input-data.html

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook

The input directory structure:

In [None]:
!ls -lha ../input/planet-understanding-the-amazon-from-space

File sizes:

In [None]:
input_path = '../input/planet-understanding-the-amazon-from-space/'

print('File sizes:')
print('')
for f in os.listdir(input_path):
    if not os.path.isdir(input_path + f):
        print(f.ljust(30) + str(round(os.path.getsize(input_path + f) / 1000000, 2)) + 'MB')
    else:
        sizes = [os.path.getsize(input_path+f+'/'+x)/1000000 for x in os.listdir(input_path + f)]
        print(f.ljust(30) + str(round(sum(sizes), 2)) + 'MB' + ' ({} files)'.format(len(sizes)))

There appear to be 40,479 training images, with the jpgs being much smaller than the tifs.

## Training data table

The training data table:

In [None]:
train_df = pd.read_csv(input_path + 'train_v2.csv')
train_df.head()

In [None]:
train_df.info()

Let's binarize this to make it easier to deal with:

In [None]:
# build list with unique labels
label_list = []
for tag_str in train_df['tags'].values:
    labels = tag_str.split(' ')
    for label in labels:
        if label not in label_list:
            label_list.append(label)
            
label_list

In [None]:
# binarize the labels
for label in label_list:
    train_df[label] = train_df['tags'].apply(lambda x : 1 if label in x.split(' ') else 0)
    
train_df.head()
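
As a cross-check on the loop above, pandas can build the same kind of 0/1 columns in one call with `Series.str.get_dummies`. The sketch below uses a small made-up table (`toy_df` is hypothetical, not the competition data) so that it runs on its own:

```python
import pandas as pd

# toy stand-in for train_df - these rows are made up for illustration
toy_df = pd.DataFrame({
    'image_name': ['img_a', 'img_b', 'img_c'],
    'tags': ['haze primary', 'agriculture clear primary water', 'clear primary'],
})

# split each tag string on spaces and build one 0/1 column per unique label
binary_labels = toy_df['tags'].str.get_dummies(sep=' ')

# attach the indicator columns to the original table
toy_binarized_df = toy_df.join(binary_labels)
print(toy_binarized_df)
```

Note that `get_dummies` orders the new columns alphabetically, whereas the loop above keeps the labels in order of first appearance.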

How many of each label are there?

In [None]:
train_df[label_list].sum().sort_values().plot.bar()
plt.show()

How much overlap is there?

In [None]:
# a function to quickly plot the co-occurrences
def show_cooccurence_matrix(labels):
    numeric_df = train_df[labels]
    co_matrix = numeric_df.T.dot(numeric_df)
    
    fig, ax = plt.subplots(figsize=(7,7))
    im = ax.imshow(co_matrix)
    ax.set_xticks(np.arange(len(labels)))
    ax.set_yticks(np.arange(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha='right', rotation_mode='anchor')
    ax.set_yticklabels(labels)
    cbar = fig.colorbar(im, ax=ax) # not sure why this isn't working...
    plt.show()

# compute the co-occurrence matrix
show_cooccurence_matrix(label_list)
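
To see what `numeric_df.T.dot(numeric_df)` actually computes, here is a tiny worked example with a made-up label matrix (the array `X` and its values are hypothetical). The diagonal holds the count of each label, and each off-diagonal entry holds the number of images that carry both labels:

```python
import numpy as np

# hypothetical binarized labels: rows are images, columns are clear, haze, primary
X = np.array([
    [1, 0, 1],   # clear + primary
    [0, 1, 1],   # haze + primary
    [1, 0, 0],   # clear only
    [1, 0, 1],   # clear + primary
])

# entry (i, j) counts the images that carry both label i and label j
co_matrix = X.T @ X
print(co_matrix)
```

Because the diagonal is just the per-label total, it tends to dominate the colour scale in the plot; the off-diagonal entries are the ones that show overlap.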

Each image should have only one weather label:

In [None]:
weather_labels = ['clear', 'partly_cloudy', 'haze', 'cloudy']
show_cooccurence_matrix(weather_labels)
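
The "only one weather label" assumption can also be checked directly by counting weather labels per image. A minimal sketch on a made-up array (the values, including the deliberately invalid last row, are hypothetical):

```python
import numpy as np

# hypothetical 0/1 weather columns (clear, partly_cloudy, haze, cloudy), one row per image
toy_weather = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],   # deliberately invalid: two weather labels
])

# number of weather labels per image
labels_per_image = toy_weather.sum(axis=1)

# indices of images that do not have exactly one weather label
bad_rows = np.flatnonzero(labels_per_image != 1)
print(bad_rows)
```

On the real table, `train_df[weather_labels].sum(axis=1).value_counts()` should give the equivalent summary.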

But land labels can overlap:

In [None]:
land_labels = ['primary', 'agriculture', 'water', 'cultivation', 'habitation']
show_cooccurence_matrix(land_labels)

Given this data format, a plan could be to train two models: the first is a single label classifier that will assign a single weather label to each image, and the second is a multi-label classifier that will assign at least one land label to each image.
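
A rough sketch of how the outputs of the two models might be turned into labels, assuming each model produces one score per class. The scores and the 0.5 threshold below are made up for illustration and are not tuned values:

```python
import numpy as np

weather_labels = ['clear', 'partly_cloudy', 'haze', 'cloudy']
land_labels = ['primary', 'agriculture', 'water', 'cultivation', 'habitation']

# made-up model scores for one image (not real predictions)
weather_scores = np.array([0.10, 0.70, 0.15, 0.05])
land_scores = np.array([0.90, 0.60, 0.20, 0.40, 0.10])

# single-label: pick exactly one weather class
weather_pred = weather_labels[int(np.argmax(weather_scores))]

# multi-label: keep every land class above a threshold (0.5 is an arbitrary choice)
land_pred = [name for name, score in zip(land_labels, land_scores) if score > 0.5]

# fall back to the top-scoring class so that at least one land label is assigned
if not land_pred:
    land_pred = [land_labels[int(np.argmax(land_scores))]]

print(weather_pred, land_pred)
```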

## Training images

First, inspect the jpgs:

In [None]:
import cv2

# read it in unchanged, to make sure we aren't losing any information
img = cv2.imread(input_path + 'train-jpg/{}.jpg'.format(train_df['image_name'][0]), cv2.IMREAD_UNCHANGED)
np.shape(img)

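The shape suggests three colour channels. Note that `cv2.imread` returns them in BGR order rather than RGB (as pointed out in the example notebook), so the channels need to be reordered before plotting. Let's just plot a few: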

In [None]:
jpg_channels = ['B','G','R']

fig, ax = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(20,12))

ax = ax.flatten()

for i in range(6):
    img = cv2.imread(input_path + 'train-jpg/{}.jpg'.format(train_df['image_name'][i]), cv2.IMREAD_UNCHANGED)
    ax[i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    ax[i].set_title('{} - {}'.format(train_df['image_name'][i], train_df['tags'][i]))
    
plt.show()
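
For a 3-channel image, the BGR to RGB conversion amounts to reversing the channel axis. A small numpy-only sketch on a made-up 2x2 image (the array `bgr` is hypothetical) shows the reordering:

```python
import numpy as np

# hypothetical 2x2 image in BGR order: every pixel is pure blue
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[:, :, 0] = 255

# reversing the last axis swaps the B and R channels (BGR -> RGB)
rgb = bgr[:, :, ::-1]

print(bgr[0, 0])  # channel order B, G, R
print(rgb[0, 0])  # channel order R, G, B
```

The slice returns a view rather than a copy; if a later function needs a contiguous array, `np.ascontiguousarray(rgb)` can be used.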

What's the data type of the jpegs?

In [None]:
type(img[0,0,0])

A histogram of values:

In [None]:
jpg_channel_colors = ['b','g','r']

fig, ax = plt.subplots()

for i in range(len(jpg_channels)):
    ax.hist(img[:,:,i].flatten(), bins=100, range=[np.min(img), np.max(img)],
            label=jpg_channels[i], color=jpg_channel_colors[i], alpha=0.5)

# draw the legend once, after all channels have been plotted
ax.legend()
ax.set_xlim(0,255)

plt.show()

Now let's look at the tiffs:

In [None]:
# read it in unchanged, to make sure we aren't losing any information
img = cv2.imread(input_path + 'train-tif-v2/{}.tif'.format(train_df['image_name'][0]), cv2.IMREAD_UNCHANGED)
np.shape(img)

This should be BGR and near infrared (NIR). Let's plot an image to see these 4 channels:

In [None]:
tif_channels = ['B','G','R','NIR']

fig, ax = plt.subplots(nrows=1, ncols=4, sharex=True, sharey=True, figsize=(20,15))

ax = ax.flatten()

img = cv2.imread(input_path + 'train-tif-v2/{}.tif'.format(train_df['image_name'][1]), cv2.IMREAD_UNCHANGED)

for i in range(len(tif_channels)):
    ax[i].imshow(img[:,:,i])
    ax[i].set_title('{} - {}'.format(train_df['image_name'][1], tif_channels[i]))
    
plt.show()

What's the data type of the tiffs?

In [None]:
type(img[0,0,0])

A histogram of values:

In [None]:
tif_channel_colors = ['b','g','r','magenta']

fig, ax = plt.subplots()

for i in range(len(tif_channels)):
    ax.hist(img[:,:,i].flatten(), bins=100, range=[np.min(img), np.max(img)],
            label=tif_channels[i], color=tif_channel_colors[i], alpha=0.5)

# draw the legend once, after all channels have been plotted
ax.legend()

plt.show()

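It's a little unclear how to interpret this histogram at first. The "getting started" notebook posted by the competition hosts says this of a similar histogram:

> Note how the intensities are distributed in a relatively narrow region of the dynamic range.

Here "intensity" is the pixel value recorded in a band (how bright the pixel is in that band), and "dynamic range" is the full range of values that the data type can represent (0 to 65,535 if the tiffs are stored as 16-bit integers, which the data type check above should confirm). Intensity is not a wavelength: each channel (B, G, R, NIR) corresponds to a wavelength band, and the intensity is the brightness measured within that band. The quote is therefore saying that the pixel values occupy only a small part of the range that the data type allows.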
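
One way to make the quoted remark concrete is to measure how much of the available range a band actually uses, and then stretch that part for display. This is a sketch on a synthetic band (the array `band`, its value range, and the 2nd-98th percentile choice are all assumptions for illustration, not properties of the competition data):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 16-bit band whose values sit in a narrow part of the 0-65535 range
band = rng.integers(3000, 5000, size=(64, 64)).astype(np.uint16)

# fraction of the full 16-bit range that the band actually spans
used_fraction = (int(band.max()) - int(band.min())) / 65535
print(round(used_fraction, 3))

# stretch the 2nd-98th percentile range to 0-1 so the image has visible contrast
lo, hi = np.percentile(band, [2, 98])
stretched = np.clip((band.astype(np.float32) - lo) / (hi - lo), 0.0, 1.0)
print(stretched.min(), stretched.max())
```

Passing `stretched` to `imshow` instead of the raw band should give a much less washed-out picture.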

## Transfer learning

Some overview notes on what we're about to do...

In general, there are two types of transfer learning in the context of deep learning:

* via feature extraction
    * treating the network as an arbitrary feature extractor
* via fine-tuning
    * removing the fully-connected layers of an existing network, placing a new set of fully-connected layers on top of the network, and then fine-tuning these weights (and optionally previous layers) to recognize the new object classes
    
**Feature extraction**

When treating networks as a feature extractor, we essentially 'chop off' the network at our pre-specified layer (typically prior to the fully-connected layers when actual classification predictions are made). We propagate some input through this 'shortened' network, get the output array, flatten it, and use that as the feature vector for the original input in another classification algorithm.

The two most common machine learning models for transfer learning via feature extraction are logistic regression and linear SVM. These are preferred because:

* CNNs are non-linear models capable of learning non-linear features, so we assume that the features learned by the CNN are already robust and discriminative, and that a linear model on top of them is sufficient.
* Feature vectors tend to be very large and have high dimensionality. We therefore need a fast model that can be trained on top of the features. Linear models tend to be very fast to train.
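
To get a feel for how large these feature vectors are, here is the flattening step for the VGG16 network used later in this notebook. The array below is a zero-filled placeholder rather than real network output; its shape assumes 256x256x3 inputs, for which VGG16's five 2x2 max-pooling layers shrink 256 to 8 and the last convolutional block has 512 filters:

```python
import numpy as np

# placeholder for the convolutional output of VGG16 on a batch of 4 images
# (assumed shape: 8x8 spatial grid, 512 channels)
batch_features = np.zeros((4, 8, 8, 512), dtype=np.float32)

# flatten each image's feature map into a single feature vector
feature_vectors = batch_features.reshape(len(batch_features), -1)
print(feature_vectors.shape)
```

Each image ends up as a vector with 8 * 8 * 512 = 32,768 entries, which is why a fast linear model is attractive for the final classification step.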

### Data preprocessing

In [None]:
import tensorflow as tf
tf.__version__

We could use `tf.keras.preprocessing` to load the images, but it has 3 downsides:

1. it's slow
2. it lacks fine-grained control
3. it's not well integrated with the rest of TensorFlow

Alternatively, we could use `tf.data.Dataset`. From the documentation:

> The `tf.data` API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.

This API allows the user to only call the data (and apply transformations, etc.) when the data is actually needed, instead of keeping all the images in RAM indefinitely.

Some notes when trying to implement this pipeline:

* `decode_tiff` is currently not implemented in TensorFlow - therefore we need to wrap `cv2.imread()`
* although `decode_jpeg` is implemented in TensorFlow, the function was unable to correctly extract the information stored in the jpegs used in this competition, and just returned a matrix of zeros. Some digging suggested that the jpegs stored on Kaggle's servers use a different encoding from the one expected by `decode_jpeg`.

First, get a list of the paths to the images. We'll use the jpegs because the neural network that we will be using (VGG16) expects 3 channels:

In [None]:
# extract image names from the .csv with the labels
path_to_images = train_df['image_name'].copy().values

# convert to path
for i in range(len(path_to_images)):
    path_to_images[i] = input_path + 'train-jpg/' + path_to_images[i] + '.jpg'

path_to_images[:5]

Then get the labels:

In [None]:
weather_labels_array = train_df[weather_labels].copy().values.astype(bool)

weather_labels_array[:5]

In [None]:
train_df[weather_labels].head()

Merge the two into a Dataset:

In [None]:
weather_ds = tf.data.Dataset.from_tensor_slices((path_to_images, weather_labels_array))

# note that the `numpy()` function is required to grab the actual values from the Dataset
for path, label in weather_ds.take(5):
    print("path  : ", path.numpy().decode('utf-8'))
    print("label : ", label.numpy())

Define functions to get the actual images from the paths, and map these onto the Dataset:

Note that this is where the most issues were experienced. The code here, which seems simple enough, will fail with an opaque error message even if the slightest thing is done incorrectly. Take note of the comments carefully...

In [None]:
# this function wraps `cv2.imread` - we treat it as a 'standalone' function, and therefore can use
# eager execution (i.e. the use of `numpy()`) to get a string of the path.
# note that no tensorflow functions are used here
def cv2_imread(path, label):
    # read in the image, getting the string of the path via eager execution
    img = cv2.imread(path.numpy().decode('utf-8'), cv2.IMREAD_UNCHANGED)
    # change from BGR to RGB
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img, label

# this function assumes that the image has been read in, and does some transformations on it
# note that only tensorflow functions are used here
def tf_cleanup(img, label):
    # convert to Tensor
    img = tf.convert_to_tensor(img)
    # the tensor is uint16 because `Tout` in the `tf.py_function` call below is declared as tf.uint16 (the jpeg itself is 8-bit) - convert to uint8
    img = tf.dtypes.cast(img, tf.uint8)
    # set the shape of the Tensor
    img.set_shape((256, 256, 3))
    # convert to float32, scaling from uint8 (0-255) to float32 (0-1)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # resize the image
    img = tf.image.resize(img, [256, 256])
    return img, label

AUTOTUNE = tf.data.experimental.AUTOTUNE

# map the cv2 wrapper function using `tf.py_function`
weather_ds = weather_ds.map(lambda path, label: tuple(tf.py_function(cv2_imread, [path, label], [tf.uint16, label.dtype])),
                            num_parallel_calls=AUTOTUNE)

# map the TensorFlow transformation function - no need to wrap
weather_ds = weather_ds.map(tf_cleanup, num_parallel_calls=AUTOTUNE)
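
A note on why the cast to uint8 in `tf_cleanup` matters: for integer inputs, `tf.image.convert_image_dtype` scales by the maximum value of the input data type. The numpy sketch below imitates that scaling on made-up pixel values (it is an illustration of the arithmetic, not TensorFlow's own code):

```python
import numpy as np

# hypothetical pixel values as read from an 8-bit jpeg
pixels = np.array([0, 128, 255], dtype=np.uint8)

# scaling by the uint8 maximum maps the values onto 0-1
as_uint8 = pixels.astype(np.float32) / np.iinfo(np.uint8).max

# scaling the same values by the uint16 maximum leaves them close to 0
as_uint16 = pixels.astype(np.uint16).astype(np.float32) / np.iinfo(np.uint16).max

print(as_uint8)
print(as_uint16)
```

So if the image were left as uint16, the converted image would be almost entirely black even though the underlying pixel values are unchanged.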

Check to make sure everything was read in correctly:

In [None]:
for image, label in weather_ds.take(1):
    print("image shape : ", image.numpy().shape)
    print("label       : ", label.numpy())

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=4, sharex=True, sharey=True, figsize=(20,15))

i = 0

for image, label in weather_ds.take(3):
    ax[i,0].imshow(image[:,:,0])
    ax[i,0].set_title('{} - {}'.format(label.numpy(), 'R'))
    ax[i,1].imshow(image[:,:,1])
    ax[i,1].set_title('{} - {}'.format(label.numpy(), 'G'))
    ax[i,2].imshow(image[:,:,2])
    ax[i,2].set_title('{} - {}'.format(label.numpy(), 'B'))
    ax[i,3].imshow(image)
    ax[i,3].set_title('{} - {}'.format(label.numpy(), 'RGB'))
    
    i = i+1

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=3, sharex=True, sharey=True, figsize=(20,15))

ax[0,0].set_xlim(0,1)

i = 0

for image, label in weather_ds.take(3):
    ax[i,0].hist(image[:,:,0].numpy().flatten())
    ax[i,0].set_title('{} - {}'.format(label.numpy(), 'R'))
    ax[i,1].hist(image[:,:,1].numpy().flatten())
    ax[i,1].set_title('{} - {}'.format(label.numpy(), 'G'))
    ax[i,2].hist(image[:,:,2].numpy().flatten())
    ax[i,2].set_title('{} - {}'.format(label.numpy(), 'B'))
    
    i = i+1

Split into training and validation. Note that the buffer size for shuffling defines how random the Dataset becomes - a buffer size that's equal to the number of instances will result in a uniform shuffling over the entire Dataset, and a buffer size equal to 1 will result in no shuffling.

In [None]:
n_all = len(path_to_images)
n_train = int(n_all * 0.8)
n_val = n_all - n_train

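# shuffle the Dataset - `reshuffle_each_iteration=False` keeps the order fixed across iterations,
# so that the `take`/`skip` split below does not mix training and validation images between epochs
SHUFFLE_BUFFER_SIZE = 1000
weather_ds = weather_ds.shuffle(SHUFFLE_BUFFER_SIZE, reshuffle_each_iteration=False)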

# n_train will be used for training, the rest will be used for validation
train_weather_ds = weather_ds.take(n_train)
val_weather_ds = weather_ds.skip(n_train)
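
The effect of the buffer size is easier to see with a small simulation. The function below is a pure-Python imitation of a fixed-size shuffle buffer, written from the description in the `tf.data` documentation (fill a buffer, emit a random element from it, replace that element with the next one from the stream). `buffer_shuffle` is a hypothetical helper, and TensorFlow's actual implementation may differ in its details:

```python
import random

def buffer_shuffle(items, buffer_size, seed=0):
    """Pure-Python imitation of a fixed-size shuffle buffer (not TensorFlow's code)."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = []
    output = []
    # fill the buffer with the first `buffer_size` items
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # repeatedly emit a random buffered item and refill its slot from the stream
    for item in iterator:
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item
    # stream exhausted: emit whatever is left in the buffer in random order
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

print(buffer_shuffle(range(10), buffer_size=1))   # order is unchanged
print(buffer_shuffle(range(10), buffer_size=3))   # only local mixing
print(buffer_shuffle(range(10), buffer_size=10))  # shuffle over the whole sequence
```

With a buffer of size k, an element can never appear more than k - 1 positions earlier than its original position, so a buffer of 1000 on roughly 40,000 images only mixes the data locally.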

Batch the data. There is no single correct batch size, but 32 is a common default, and powers of 2 are usually preferred when training on a GPU.

In [None]:
BATCH_SIZE = 32

train_weather_batches_ds = train_weather_ds.batch(BATCH_SIZE)
val_weather_batches_ds = val_weather_ds.batch(BATCH_SIZE)

Inspect a batch of data:

In [None]:
for image_batch, label_batch in train_weather_batches_ds.take(1):
    print(image_batch.shape)
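
For reference, the split and batch sizes work out as follows, taking the 40,479 training images noted earlier in this notebook and the 80/20 split and batch size of 32 used above. This is just the arithmetic, with no TensorFlow involved:

```python
import math

n_all = 40479          # number of training images noted earlier in this notebook
batch_size = 32

n_train = int(n_all * 0.8)
n_val = n_all - n_train

# without drop_remainder, the last batch of a split may be smaller than batch_size
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(n_train, n_val, train_batches, val_batches, last_train_batch)
```

The final training batch is one image short of a full batch; passing `drop_remainder=True` to `batch` would discard it if a fixed batch shape were needed.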

### Model setup

In [None]:
IMG_SHAPE = (256, 256, 3)

# create the base model from the pre-trained model VGG16
# note that, if using a Kaggle server, internet has to be turned on
weather_VGG16 = tf.keras.applications.VGG16(input_shape=IMG_SHAPE,
                                            include_top=False,
                                            weights='imagenet')

# freeze the convolutional base
weather_VGG16.trainable = False

# look at the model architecture
weather_VGG16.summary()