# Planet: Understanding the Amazon from Space 🌳🦌
***By Nhan Phan, November 2019, as an entry to the competition [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data) by Kaggle.***

![](https://storage.googleapis.com/kaggle-competitions/kaggle/6322/logos/header.png)

Every minute, the world loses an area of forest the size of 48 football fields. And deforestation in the Amazon Basin accounts for the largest share, contributing to reduced biodiversity, habitat loss, climate change, and other devastating effects. But better data about the location of deforestation and human encroachment on forests can help governments and local stakeholders respond more quickly and effectively.

This analysis uses Deep Learning to classify satellite images of the Amazon forest. From that, it hopes to shed light on how the forest has changed, both naturally and through human activity, and thus help prevent deforestation.

The dataset is acquired from the Kaggle competition in 2016: https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data

The dataset contains more than 40,000 images, taken by Planet's satellites.



> Planet, designer and builder of the world’s largest constellation of Earth-imaging satellites, will soon be collecting daily imagery of the entire land surface of the earth at 3-5 meter resolution. While considerable research has been devoted to tracking changes in forests, it typically depends on coarse-resolution imagery from Landsat (30 meter pixels) or MODIS (250 meter pixels). This limits its effectiveness in areas where small-scale deforestation or forest degradation dominate.

<center><img src="https://storage.googleapis.com/kaggle-competitions/kaggle/6322/media/planet.png" width=300></center>



### ☆ **RESULT**
The project achieved an F-Beta score of 0.90 on the official test set.

|  | THIS PROJECT | WINNER |
|:--:|:--:|:--:|
| **Score (F-Beta)** | 0.90 | 0.93 |

Training information:

|  | Loss | F-Beta Score |
|:--:|:--:|:--:|
| **Train** | 0.09 | 0.90 |
| **Validation** | 0.11 | 0.89 |




### **☆ CHALLENGE**
Several key learnings uncovered through the analysis:

1. **Multi-label:** Each image is labeled with multiple tags (at least 2, at most 9). The tags fall into 17 categories, which are the forest landscape types. Since the tags are not mutually exclusive (an image can carry several at once), the task is treated as multiple binary classification problems, one per tag. Thus, `binary cross-entropy` is chosen as the loss function. 

2. **Imbalance:** The dataset is severely imbalanced: tags like Primary or Agriculture appear in 90% of the dataset, while other tags like Blooming or Conventional Mine can be seen in fewer than 500 observations (even fewer than 100 for Burn Down).

  To tackle the imbalance, the evaluation metric has to be chosen carefully. In the first baseline experiment, the model was totally biased toward the major tags: it predicted the major tags for every image and almost never predicted the minor tags. 

  `F2` is chosen as the main metric to evaluate the training. It is a weighted harmonic mean of Precision and Recall that favors Recall. In other words, it is an attempt to reduce the number of False Negatives, where the model fails to identify the presence of a tag. 

3. **Optimization:** 40,000 images, a CNN model, and Google Colab's limited resources do not mix well together. The training was slow at first and often interrupted. Several improvements, mostly on the TensorFlow pipeline, were made to speed up the training:

  - Using **TFRecord** to convert the raw images into byte-like data, reducing the time spent reading images from their individual paths. 
  - Using [**tf.data.Dataset**](https://www.tensorflow.org/guide/data_performance) with `shuffle`, `map`, `batch`, and `prefetch` to optimize data reading by letting loading and preprocessing run concurrently with training, thus avoiding a bottleneck. An attempt to use `cache` was also made but failed due to the limited RAM. 

4. **Processing images with TensorFlow:** The dataset contains images in JPG - RGBA. The built-in decode function `tf.io.decode_jpeg` only works on 1- or 3-channel images. Attempting to decode a JPG RGBA image with it returns an all-black image. We need a TensorFlow decoding function in this part because the pipeline is built entirely on tensors for optimization purposes. 

  To tackle the problem, the raw images were first read with Matplotlib, then converted into byte-like data and written into TFRecords. When reading the data from the TFRecords, instead of using the built-in image decode function, we use `tf.io.parse_tensor` followed by reshaping.
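
To make the recall emphasis of F2 (item 2 above) concrete, here is a minimal sketch of the F-beta formula in plain Python. The helper name `f_beta` and the counts are made up for illustration and are not taken from the competition data.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta computed from true positives, false positives and false negatives."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Two hypothetical models that make the same number of mistakes (10 each):
# one misses tags (false negatives), the other raises false alarms.
f2_misses = f_beta(tp=40, fp=0, fn=10)   # perfect precision, lower recall
f2_alarms = f_beta(tp=40, fp=10, fn=0)   # lower precision, perfect recall
print(round(f2_misses, 3), round(f2_alarms, 3))
```

With `beta=1` both models would score the same; with `beta=2` the model that misses tags is penalized more, which is the behavior wanted for rare labels.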




## 1. PREPROCESS DATA



### 1.1 Import Libraries

In [None]:
!pip install tensorflow-addons

In [None]:
import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
tf.random.set_seed(142)

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import tensorflow_addons as tfa
from tensorflow_addons.metrics import FBetaScore 

import tqdm.notebook as tq
import os
import logging
import warnings
warnings.filterwarnings('ignore')
logging.getLogger("tensorflow").setLevel(logging.ERROR)

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

### 1.2 Explore Dataset

In [None]:
# # Dataset folders are manually downloaded from Kaggle as .tar.7z file
# # Install 7zip if not in the environment yet 
# ! sudo apt-get install p7zip-full

# # Unzip dataset
# ! 7z x -so /content/gdrive/MyDrive/PROJECT/AMAZON/data/train-jpg.tar.7z | tar xf - -C /content/gdrive/MyDrive/PROJECT/AMAZON/data

In [None]:
PROJECT_FOLDER = '/content/gdrive/MyDrive/PROJECT/AMAZON'
DATA_PATH = os.path.join(PROJECT_FOLDER, "data")
TRAIN_JPG_DIR = os.path.join(DATA_PATH, "train-jpg")
TRAIN_CSV_PATH = os.path.join(DATA_PATH, "train_v2.csv")

In [None]:
df = pd.read_csv(TRAIN_CSV_PATH)
df.head(5)

One-hot encode the labels.

In [None]:
dummies = df['tags'].str.get_dummies(' ')
df = pd.concat([df, dummies], axis=1)

labels = dummies.columns.values
N_LABELS = len(labels)
dummies

In [None]:
print(f"There are {N_LABELS} unique labels including {labels}")
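
As a rough illustration of what the one-hot encoding step produces, the sketch below rebuilds the same kind of 0/1 table in plain Python. The function `one_hot_tags` and the three tag strings are hypothetical; the notebook itself relies on `str.get_dummies(' ')`.

```python
def one_hot_tags(tag_strings):
    """Turn space-separated tag strings into a sorted label list and 0/1 rows."""
    label_names = sorted({tag for s in tag_strings for tag in s.split()})
    rows = [[1 if name in s.split() else 0 for name in label_names]
            for s in tag_strings]
    return label_names, rows

# Hypothetical tag strings in the same format as the `tags` column.
example_tags = ["clear primary", "haze primary water", "cloudy"]
labels_demo, rows_demo = one_hot_tags(example_tags)
print(labels_demo)
for row in rows_demo:
    print(row)
```

Each row has one column per label, so an image with several tags simply has several ones in its row.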

Read more about the definition of each label [HERE](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data) 

In [None]:
# Countplot of label distribution
label_count = dummies.sum(axis=0).sort_values()
print(label_count)

In [None]:
label_count.plot(kind='barh', figsize=(15, 10))
for i in range(label_count.shape[0]):
    plt.text(label_count.iloc[i] + 4, i, label_count.iloc[i], va='center')

As we can see, the dataset's labels are not evenly distributed. The `primary` and `clear` tags appear in more than 80% of the dataset while some others, for example `blooming`, `slash_burn`, or `blow_down`, are rarely observed.

Let's take a closer look at what these labels visually depict.

In [None]:
# Select by one-hot column: a substring match on 'cloudy' would also hit 'partly_cloudy'
images_title = [df[df[label] == 1].iloc[i]['image_name'] + '.jpg' for i, label in enumerate(labels)]

In [None]:
_, axs = plt.subplots(5, 4, sharex='col', sharey='row', figsize=(15, 20))
axs = axs.ravel()

for i, (image_name, label) in enumerate(zip(images_title, labels)):
    img_path = os.path.join(TRAIN_JPG_DIR, image_name)
    img = plt.imread(img_path)
    axs[i].imshow(img)
    axs[i].set_title(f'{image_name} - {label}')

### 1.3 Split dataset

- The dataset is divided into 80% train and 20% validation. We also split the train set into 4 folds, which will later be stored in 4 TFRecord shards. 
- `MultilabelStratifiedKFold` is used to maintain the ratio of labels across the shards. 

In [None]:
!pip install iterative_stratification -q
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

In [None]:
y = df[labels].values
X = df['image_name'].values

df['fold'] = np.nan

# random_state only takes effect when shuffle=True
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=104)
for i, (_, test_index) in enumerate(mskf.split(X, y)):
    df.iloc[test_index, -1] = i
   
df['fold'] = df['fold'].astype('int')
df['is_valid'] = df['fold'] == 0  # avoids chained assignment

In [None]:
# Number of observations of each tags in each fold. 
df.groupby('fold')[labels].sum()

In [None]:
TRAIN_SIZE = df.shape[0] - df['is_valid'].sum()
VAL_SIZE = df['is_valid'].sum()

### 1.4. Export raw data to TFRecords

In [None]:
# Converting the values into features

def _image_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()]))

def _int64_feature(value):
    """Returns an int64_list from an int or a list of ints."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))): # if value is a tensor, unwrap it to raw bytes
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array


def image_feature(path, label):
    image = plt.imread(path)

    # image = tf.io.decode_jpeg(image, channels=3)
    feature = {'height': _int64_feature(image.shape[0]),
               'width': _int64_feature(image.shape[1]),
               'channel': _int64_feature(image.shape[2]),
               'image': _bytes_feature(serialize_array(image)),
               'label': _int64_feature(label),}
    return tf.train.Example(features=tf.train.Features(feature=feature))


def create_record(df, folder_path, record_name):
    all_image_paths = df['image_name'].apply(lambda x: os.path.join(TRAIN_JPG_DIR, x+'.jpg')).values
    all_labels = df[labels].values

    record_path = os.path.join(folder_path, f"{record_name}.tfrecords")
    writer = tf.io.TFRecordWriter(record_path) 

    for i in tq.tqdm(range(df.shape[0])):
        path = all_image_paths[i]
        label = all_labels[i].tolist()
        example = image_feature(path, label)
        writer.write(example.SerializeToString())
    
    writer.close()

In [None]:
# for i in range(5): 
#     create_record(df[df['fold'] == i], DATA_PATH, f'fold_{i}')

### 1.5 Load data from TFRecord

In [None]:
# Sort so that RECORDS[i] is always fold_i; glob order is not guaranteed
RECORDS = sorted(tf.io.gfile.glob(os.path.join(DATA_PATH, 'fold_*.tfrecords')))
RECORDS

In [None]:
IMG_WIDTH, IMG_HEIGHT = 192, 192
CHANNELS = 3

def read_tfrecord(example):
    tfrecord_format = {
        "height": tf.io.FixedLenFeature([], tf.int64),
        "width": tf.io.FixedLenFeature([], tf.int64),
        "channel": tf.io.FixedLenFeature([], tf.int64),
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([17], tf.int64, default_value=np.zeros((17,)).astype('int').tolist())
    }
    example = tf.io.parse_single_example(example, tfrecord_format)

    # Extract information
    height = example['height']
    width = example['width']
    channel = example['channel']
    image = example['image']
    label = example['label']

    # Convert raw image back to array
    image = tf.io.parse_tensor(image, out_type=tf.uint8)
    image = tf.reshape(image, shape=[height, width, channel])
    if channel == 4:
        image = image[:,:,:3]

    image = tf.image.resize(image, [IMG_WIDTH, IMG_HEIGHT])

    return (image, label)

In [None]:
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 1024
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [None]:
def augmentation(image, label):
    image = tf.image.random_brightness(image, .1)
    image = tf.image.random_contrast(image, lower=0.0, upper=1.0)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return image, label

def load_dataset(filenames, shuffle=False, augment=False):
    """Load a list of TFRecord paths into a batched, prefetched dataset,
       optionally shuffled (and repeated) and augmented."""

    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTOTUNE)

    # dataset = dataset.cache()

    if shuffle:
        dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER_SIZE).repeat()

    if augment:
        # Dataset.map returns a new dataset, so the result must be reassigned
        dataset = dataset.map(augmentation, num_parallel_calls=AUTOTUNE)
    
    dataset = dataset.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [None]:
RECORDS

In [None]:
train_ds = load_dataset(RECORDS[:4], shuffle=True, augment=True)
val_ds = load_dataset(RECORDS[4], shuffle=False, augment=False)

In [None]:
train_ds

In [None]:
val_ds

Now, our data is ready to flow into the model.

Each batch will be a pair of arrays (one that holds the features and another that holds the labels). 
- The features array will be of shape (BATCH_SIZE, IMG_WIDTH, IMG_HEIGHT, CHANNELS).
- The labels array will be of shape (BATCH_SIZE, N_LABELS).

In [None]:
for i in train_ds.take(1):
    plt.imshow(i[0][1].numpy() / 255.)
    plt.show()

### ❊ Read sample

This optional section takes a small sample of the dataset (20%) and splits it into train and validation sets (80/20). 

In [None]:
# sample = df[df['fold'] == 0]
# sample.head()

In [None]:
# sample[labels].sum() / df[labels].sum()

In [None]:
# X_sample = sample['image_name'].apply(lambda x: os.path.join(DATA_PATH, 'train-jpg', x+".jpg")).values
# y_sample = sample[labels].values 

In [None]:
# sample.drop(columns=['fold', 'is_valid'], inplace=True)

In [None]:
# sample['fold'] = np.nan

# mskf = MultilabelStratifiedKFold(n_splits=5, random_state=104)
# for i, (_, test_index) in enumerate(mskf.split(X_sample, y_sample)):
#     sample.iloc[test_index, -1] = i
   
# sample['fold'] = sample['fold'].astype('int')
# sample['is_valid'] = False
# sample['is_valid'][sample['fold'] == 0] = True

In [None]:
# sample.groupby('is_valid')[labels].sum()

In [None]:
# create_record(sample[sample['is_valid'] == False], DATA_PATH, '1_train_sample')

In [None]:
# create_record(sample[sample['is_valid'] == True], DATA_PATH, '1_val_sample')

In [None]:
# sample['is_valid'].sum()

In [None]:
# TRAIN_SIZE = sample.shape[0] - sample['is_valid'].sum()
# VAL_SIZE = sample['is_valid'].sum()

In [None]:
# train_ds = load_dataset(RECORDS[-2], shuffle=True, augment=True)
# val_ds = load_dataset(RECORDS[-1])

In [None]:
# for i in train_ds.take(1):
#     plt.imshow(i[0][1].numpy() / 255.)
#     plt.show()

## 2. MODELLING

### 2.1. Build model

Instead of building and training a new model from scratch, we will use a pre-trained model in a process called transfer learning. In this case, we use the MobileNetv2 model.

![alt text](https://miro.medium.com/max/1400/1*yT0lWepQ39hrwn5KBaMz_A.png)

In [None]:
def build_model(trainable=False, fine_tune_at=0):
    """Build a model with MobileNetV2 as the base model and additional top layers.
       Part of the MobileNetV2 base can be unfrozen for fine-tuning.
       args:
           trainable: boolean, whether the base model can be trained or not.
           fine_tune_at: int, index of the first trainable base layer;
                         all layers before it stay frozen.
    """
    mobile_net = tf.keras.applications.MobileNetV2(input_shape=(IMG_WIDTH, IMG_HEIGHT, CHANNELS), include_top=False)
    if trainable:
        mobile_net.trainable = True

        for layer in mobile_net.layers[:fine_tune_at]:
            layer.trainable = False
    else:
        mobile_net.trainable = False

    # `inputs`/`outputs` avoid shadowing the built-in `input`
    inputs = tf.keras.Input(shape=(IMG_WIDTH, IMG_HEIGHT, CHANNELS), name='input')
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = mobile_net(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(N_LABELS, activation='sigmoid')(x)

    model = tf.keras.Model(inputs, outputs)
    return model

In [None]:
model = build_model()
model.summary()

Take a look at a prediction. For each image the model returns 17 values, one per label. Each value is the predicted probability that the image carries that label. 

In [None]:
for batch in train_ds: 
    print(model.predict(batch[0]))
    break
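
Because every output is an independent probability, each label can be scored on its own with binary cross-entropy. The sketch below shows that computation for a single image in plain Python. The function name, the clipping constant `eps`, and the four-label example are assumptions for illustration; Keras' `binary_crossentropy` is what the notebook actually uses.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one image."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical 4-label example (the real model outputs 17 probabilities).
target_demo = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]   # mostly right and sure about it
unsure = [0.5, 0.5, 0.5, 0.5]      # no information at all
loss_confident = binary_cross_entropy(target_demo, confident)
loss_unsure = binary_cross_entropy(target_demo, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))
```

An untrained head that outputs 0.5 everywhere sits at a loss of ln 2, which is a useful reference point when reading the first epochs of the training log.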

### 2.2. Train

Using **Checkpoint**, **LearningRateDecay**, and **CSVLogger** to assist the training. 

In [None]:
from datetime import datetime
today = str(datetime.now().date())
# Create today's log folder (and any missing parents); do nothing if it already exists
os.makedirs(os.path.join(PROJECT_FOLDER, 'log', today), exist_ok=True)

In [None]:
# Setting up CheckPoint 
checkpoint_path = os.path.join(PROJECT_FOLDER, 'log', today, f"full_1_decay.h5")
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights whenever val_loss improves
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_best_only=True,
                                                 save_weights_only=True,
                                                 monitor="val_loss",
                                                 verbose=1)

# Learning rate decay
lr_decay = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', 
                                                factor=0.2, 
                                                patience=3, 
                                                verbose=1, 
                                                mode='auto', 
                                                min_delta=0.0001, 
                                                cooldown=0, 
                                                min_lr=0.0000001)

# Logger
log_path = os.path.join(PROJECT_FOLDER, 'log', today, f"full_1_decay.csv")
logger = tf.keras.callbacks.CSVLogger(log_path, separator=',', append=True)
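
To show how the learning rate decay configured above behaves, here is a simplified plain-Python simulation of a reduce-on-plateau rule. It is a sketch that ignores cooldown and is not the Keras implementation; the function name and the loss values are hypothetical, while `factor`, `patience`, the improvement tolerance, and `min_lr` mirror the callback settings.

```python
def simulate_lr_decay(val_losses, lr, factor=0.2, patience=3,
                      min_delta=1e-4, min_lr=1e-7):
    """Return the learning rate in effect after each epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best - min_delta:       # meaningful improvement
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:          # plateau: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses: steady improvement, then a plateau.
demo_losses = [0.30, 0.20, 0.15, 0.15, 0.15, 0.15, 0.14]
lr_history = simulate_lr_decay(demo_losses, lr=1e-5)
print(lr_history)
```

In this run the rate stays at its starting value until the loss has failed to improve for three consecutive epochs, then drops to one fifth of it.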

In [None]:
LR = 1e-5
EPOCHS = 60
# Keras expects plain integers for the step counts
num_steps_train = int(np.ceil(TRAIN_SIZE / BATCH_SIZE))
num_steps_val = int(np.ceil(VAL_SIZE / BATCH_SIZE))

fbeta = FBetaScore(num_classes=N_LABELS,
                   average='weighted',
                   beta=2.0,
                   threshold=0.2,
                   name='fbeta')
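
The step counts above are simply the number of batches needed to see every image once, rounded up so the last partial batch is included. A small arithmetic sketch, using made-up dataset sizes rather than the real `TRAIN_SIZE` and `VAL_SIZE`:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

# Hypothetical sizes for an 80/20 split of roughly 40,000 images.
print(steps_per_epoch(32_000, 64))  # 500
print(steps_per_epoch(8_001, 64))   # 126
```

Setting `steps_per_epoch` explicitly matters here because the shuffled training dataset is repeated indefinitely, so Keras cannot infer where an epoch ends.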

In [None]:
# Compile model with optimizer
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = LR),
               loss = 'binary_crossentropy',
               metrics = [fbeta, tf.keras.metrics.AUC()])

In [None]:
# model.load_weights(checkpoint_path)

In [None]:
# Train model
history = model.fit(train_ds,
                  steps_per_epoch = num_steps_train,
                  epochs = EPOCHS,
                  validation_data = val_ds,
                  validation_steps = num_steps_val,
                  callbacks=[cp_callback, lr_decay, logger])

In [None]:
def plot_stats(training_stats, val_stats, x_label='Training Steps', stats='loss'):
    # Pick the legend position before title-casing the metric name
    legend_loc = 'upper right' if stats == 'loss' else 'lower right'
    stats, x_label = stats.title(), x_label.title()
    training_steps = len(training_stats)
    test_steps = len(val_stats)

    plt.figure()
    plt.ylabel(stats)
    plt.xlabel(x_label)
    plt.plot(training_stats, label='Training ' + stats)
    plt.plot(np.linspace(0, training_steps, test_steps), val_stats, label='Validation ' + stats)
    plt.ylim([0,max(plt.ylim())])
    plt.legend(loc=legend_loc)
    plt.show()

In [None]:
plt.figure(figsize = (15, 10))

plot_stats(history.history['loss'], history.history['val_loss'], x_label='Epochs', stats='loss')
plot_stats(history.history['fbeta'], history.history['val_fbeta'], x_label='Epochs', stats='fbeta');

### 2.3 Export model

In [None]:
SAVE_PATH = os.path.join(PROJECT_FOLDER, 'log', today, '1_full_decay.h5')
model.save(SAVE_PATH)

### 2.4 Predict

In [None]:
SAVE_PATH = '/content/gdrive/MyDrive/PROJECT/AMAZON/data/log/2021-08-23/1_full_decay.h5'
model = tf.keras.models.load_model(SAVE_PATH)

In [None]:
def read_tfrecord_label_only(example):
    tfrecord_format = {
        "height": tf.io.FixedLenFeature([], tf.int64),
        "width": tf.io.FixedLenFeature([], tf.int64),
        "channel": tf.io.FixedLenFeature([], tf.int64),
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([17], tf.int64, default_value=np.zeros((17,)).astype('int').tolist())
    }
    example = tf.io.parse_single_example(example, tfrecord_format)

    # Extract information
    height = example['height']
    width = example['width']
    channel = example['channel']
    image = example['image']
    label = example['label']

    return label

In [None]:
train_label_ds = tf.data.TFRecordDataset(RECORDS[:4])
train_label_ds = train_label_ds.map(read_tfrecord_label_only, num_parallel_calls=AUTOTUNE)

In [None]:
true = list(train_label_ds.as_numpy_iterator())
true = np.array(true)
true.shape

In [None]:
train_image_ds = load_dataset(RECORDS[:4], shuffle=False, augment=False)

In [None]:
predictions = model.predict(train_image_ds)
final_predictions = (predictions > 0.2).astype('int')
final_predictions.shape

In [None]:
from sklearn.metrics import fbeta_score
fbeta_score(true, final_predictions, average='weighted', beta=2)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(true, final_predictions))

In [None]:
for data in train_image_ds.take(1):
    sample_images = data[0].numpy().astype('int')
    sample_labels = data[1].numpy().astype('bool')

sample_images = sample_images[:40]
sample_labels = sample_labels[:40]
sample_predictions = final_predictions[:40]

In [None]:
fig, axes = plt.subplots(10, 4, figsize=(20, 30))
axes = axes.ravel()

for i, (image, label) in enumerate(zip(sample_images, sample_labels)):
    axes[i].imshow(image)
    predict_label = labels[sample_predictions[i] == 1]
    predict_label = ', '.join(predict_label)
    correct = ', '.join(labels[label])
    axes[i].set_title(f"PREDICT: {predict_label} \nCORRECT: {correct}")

plt.subplots_adjust(wspace=1, hspace=1)
plt.show()

## 3. TUNING THRESHOLD

Since the multi-label problem is viewed as multiple binary classification tasks, it is necessary to define an appropriate threshold. Any predicted probability above the threshold yields a positive prediction for that tag. The value of the threshold affects the F score greatly, especially for an imbalanced dataset, where each label has a different frequency.

In the experiment of tuning threshold, we increase the threshold from 0 to 1 by a step of 0.01 for each label and observe the change of Precision, Recall, F1 and FBeta Score.
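
Before the full grid, the idea can be sketched for a single label in plain Python. The function names and the ten example values are hypothetical, and F2 is computed directly from the confusion counts.

```python
def f2_score(y_true, y_pred):
    """F2 of the positive class from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, n_thresh=100):
    """Sweep thresholds 0.00, 0.01, ..., 1.00 and keep the first best F2."""
    best_t, best_f2 = 0.0, -1.0
    for k in range(n_thresh + 1):
        t = k / n_thresh
        preds = [p > t for p in y_prob]
        score = f2_score(y_true, preds)
        if score > best_f2:
            best_t, best_f2 = t, score
    return best_t, best_f2

# Hypothetical ground truth and predicted probabilities for one rare tag.
truth_demo = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
prob_demo = [0.021, 0.053, 0.084, 0.101, 0.127, 0.305, 0.183, 0.152, 0.458, 0.702]
t_demo, f2_demo = best_threshold(truth_demo, prob_demo)
print(t_demo, round(f2_demo, 3))
```

In this toy example the best threshold lands well below 0.5: keeping all three true positives is worth accepting one false alarm, because F2 weighs recall more heavily than precision.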

In [None]:
def perf_grid(y_hat_val, y_val, label_names, n_thresh=100):
    
    # Find label frequencies in the validation set
    label_freq = y_val.sum(axis=0)

    # Define thresholds
    thresholds = np.linspace(0, 1, n_thresh+1).astype(np.float32)
    
    # Compute all metrics for all labels
    ids, labels, freqs, tps, fps, fns, precisions, recalls, f1s, f2s = [], [], [], [], [], [], [], [], [], []
    
    for i in tq.tqdm(range(len(label_names))):
        for thresh in thresholds:   
            ids.append(i)
            labels.append(label_names[i])
            freqs.append(round(label_freq[i]/len(y_val),2))

            y = y_val[:, i]
            y_pred = y_hat_val[:, i] > thresh

            tp = np.count_nonzero(y_pred  * y)
            fp = np.count_nonzero(y_pred * (1-y))
            fn = np.count_nonzero((1-y_pred) * y)
            precision = tp / (tp + fp + 1e-16)
            recall = tp / (tp + fn + 1e-16)
            f1 = 2*tp / (2*tp + fn + fp + 1e-16)
            # F2 of the positive class, from the same counts used for f1
            f2 = 5*tp / (5*tp + 4*fn + fp + 1e-16)
            
            tps.append(tp)
            fps.append(fp)
            fns.append(fn)
            precisions.append(precision)
            recalls.append(recall)
            f1s.append(f1)
            f2s.append(f2)
            
    # Create the performance dataframe
    grid = pd.DataFrame({'id':ids,
                         'label':labels,
                         'freq':freqs,
                         'threshold':list(thresholds)*len(label_names),
                         'tp':tps,
                         'fp':fps,
                         'fn':fns,
                         'precision':precisions,
                         'recall':recalls,
                         'f1':f1s,
                         'f2': f2s})
    
    return grid

In [None]:
predictions.shape

In [None]:
true.shape

In [None]:
# Performance table
grid = perf_grid(predictions, true, labels)

In [None]:
grid[grid['label'].str.contains('primary')].head(20)

In [None]:
# Choose the threshold that maximizes F2 for each label
grid_max = grid.loc[grid.groupby(['id', 'label'])[['f2']].idxmax()['f2'].values]
grid_max

## 4. PREDICT ON TEST 

In [None]:
# # # Dataset folders are manually downloaded from Kaggle as .tar.7z file
# # # Install 7zip if not in the environment yet 
# ! sudo apt-get install p7zip-full

# # # Unzip dataset
# ! 7z x -so /content/gdrive/MyDrive/PROJECT/AMAZON/data/test-jpg-additional.tar.7z | tar xf - -C /content/gdrive/MyDrive/PROJECT/AMAZON/data

In [None]:
test_paths = tf.io.gfile.glob(str(DATA_PATH + '/test-jpg/*.jpg')) 
additional_paths = tf.io.gfile.glob(str(DATA_PATH + '/test-jpg-additional/*.jpg')) 

In [None]:
def create_test_record(data_folder, record_name):
    paths = tf.io.gfile.glob(str(data_folder + f'/{record_name}/*.jpg')) 

    record_path = os.path.join(data_folder, f"{record_name}.tfrecords")
    writer = tf.io.TFRecordWriter(record_path) 

    for i in tq.tqdm(range(len(paths))):
        path = paths[i]
        image = plt.imread(path)
        feature = {'height': _int64_feature(image.shape[0]),
                   'width': _int64_feature(image.shape[1]),
                   'channel': _int64_feature(image.shape[2]),
                   'image': _bytes_feature(serialize_array(image))}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()

In [None]:
create_test_record(DATA_PATH, 'test-jpg')
create_test_record(DATA_PATH, 'test-jpg-additional')

In [None]:
def read_test_record(example): 
    tfrecord_format = {
            "height": tf.io.FixedLenFeature([], tf.int64),
            "width": tf.io.FixedLenFeature([], tf.int64),
            "channel": tf.io.FixedLenFeature([], tf.int64),
            "image": tf.io.FixedLenFeature([], tf.string)
            }

    example = tf.io.parse_single_example(example, tfrecord_format)

    # Extract information
    height = example['height']
    width = example['width']
    channel = example['channel']
    image = example['image']

    # Convert raw image back to array
    image = tf.io.parse_tensor(image, out_type=tf.uint8)
    image = tf.reshape(image, shape=[height, width, channel])
    if channel == 4:
        image = image[:,:,:3]

    image = tf.image.resize(image, [IMG_WIDTH, IMG_HEIGHT])
    return image

In [None]:
def load_test_dataset(filenames, shuffle=False, augment=False):
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(read_test_record, num_parallel_calls=AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [None]:
RECORDS = tf.io.gfile.glob(str(DATA_PATH + '/*.tfrecords'))
RECORDS

In [None]:
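# Select the records by name instead of relying on the order returned by glob
test_ds = load_test_dataset(os.path.join(DATA_PATH, 'test-jpg.tfrecords'))
additional_ds = load_test_dataset(os.path.join(DATA_PATH, 'test-jpg-additional.tfrecords'))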

In [None]:
test_predictions = model.predict(test_ds)
additional_predictions = model.predict(additional_ds)

In [None]:
threshold = { 'agriculture':0.164,
          'artisinal_mine':0.114,
          'bare_ground':0.138,
          'blooming':0.168,
          'blow_down':0.2,
          'clear':0.13,
          'cloudy':0.076,   
          'conventional_mine':0.1,
          'cultivation':0.204,
          'habitation':0.17,
          'haze':0.204,
          'partly_cloudy':0.112,
          'primary':0.204,
          'road':0.156,
          'selective_logging':0.154,
          'slash_burn':0.38,
          'water':0.182
            }
            
# Build the array in the order of `labels` rather than relying on dict order
thresholds = np.array([threshold[label] for label in labels])

In [None]:
def get_tag(prediction):
    return ' '.join(labels[(prediction >= thresholds)])

final_test_predictions = list(map(get_tag, test_predictions))
final_additional_predictions = list(map(get_tag, additional_predictions))
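
The tag-building step can be illustrated without NumPy. In this sketch, the function `tags_from_probs` and the probability rows are hypothetical; the three threshold values are the ones listed above for `clear`, `primary`, and `water`.

```python
def tags_from_probs(probs, label_names, label_thresholds):
    """Join the names of all labels whose probability meets its own threshold."""
    return " ".join(name
                    for name, p, t in zip(label_names, probs, label_thresholds)
                    if p >= t)

# Three-label illustration; the notebook applies the same idea to all 17 labels.
demo_labels = ["clear", "primary", "water"]
demo_thresholds = [0.13, 0.204, 0.182]
tags_a = tags_from_probs([0.90, 0.95, 0.05], demo_labels, demo_thresholds)
tags_b = tags_from_probs([0.10, 0.21, 0.30], demo_labels, demo_thresholds)
print(tags_a)
print(tags_b)
```

The result is the space-separated tag string that goes into the `tags` column of the submission file.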

In [None]:
len(final_test_predictions)

In [None]:
test_filenames = list(map(lambda x: x.split('/')[-1][:-4], test_paths))
additional_filenames = list(map(lambda x: x.split('/')[-1][:-4], additional_paths))

In [None]:
submission = pd.DataFrame({'image_name': test_filenames, 'tags': final_test_predictions})
submission['count'] = submission['image_name'].str.split('_').str[-1].astype('int')  # numeric suffix of 'test_<n>'
submission = submission.sort_values('count', ascending=True).reset_index(drop=True)
submission.drop(columns=['count'], inplace=True)
submission

In [None]:
submission_2 = pd.DataFrame({'image_name': additional_filenames, 'tags': final_additional_predictions})
submission_2['count'] = submission_2['image_name'].str.split('_').str[-1].astype('int')  # numeric suffix of 'file_<n>'
submission_2 = submission_2.sort_values('count', ascending=True).reset_index(drop=True)
submission_2.drop(columns=['count'], inplace=True)
submission_2

In [None]:
final_submission = pd.concat([submission, submission_2], axis=0)
final_submission

In [None]:
final_submission.to_csv(PROJECT_FOLDER + "/final_submission_2_time.csv", index=False)

## 5. SUMMARY

In this project, we tackled a 17-label classification problem and reached an FBeta score of 0.90 on the official test dataset. Yet, there are still many puzzles that have not been solved. 

- The biggest challenge is the performance of the model on the minor classes, which is still very poor.
<img src='https://i.imgur.com/G8ZXOsb.png' width=500>

- Several techniques can be considered for future tuning:
  - Trying XGBoost or other ensemble learning techniques.
  - Dehazing images during preprocessing to highlight obscured features. 
  - Exploring the original `.tiff` dataset instead of the converted JPG.

Otherwise, this is the end of the project 🍺