# Planet: Understanding the Amazon deforestation from Space 

Identifying deforestation is an extremely important task for our planet. In this notebook, we tackle it with machine learning.
We use satellite images from the [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) Kaggle competition [dataset](https://www.kaggle.com/competitions/planet-understanding-the-amazon-from-space/data). We also build on the work of [EKami](https://github.com/EKami/planet-amazon-deforestation), using a VGG16 convolutional model pre-trained on the ImageNet dataset and retrained to predict the type of land cover shown in the satellite images.

# Install necessary packages
We have put the dependencies in a `requirements_ml.txt` file, so we install the necessary packages by running `pip install --user -r requirements_ml.txt`.

> NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline

In [None]:
!pip install --user -r requirements_ml.txt

# Imports
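In this section we import the packages we need. The original notebook uses matplotlib, but we decided it is not necessary for the pipeline run, so those imports are commented out. The plotting cells further down still reference `plt` and `mpimg`; uncomment the matplotlib imports if you want to run them.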

In [None]:
import os
import gc
#import bcolz
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
#import matplotlib.pyplot as plt
#import matplotlib.image as mpimg
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, History
from keras.models import load_model
import vgg16
import data_helper
from data_helper import AmazonPreprocessor
from PIL import Image
#from kaggle_data.downloader import KaggleDataDownloader

#%matplotlib inline
#%config InlineBackend.figure_format = 'retina'

# Project hyper-parameters

In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib.

Print the TensorFlow version for reference (the Keras module is used directly from the TensorFlow framework).

In [None]:
tf.__version__

# Load and preprocess data

In this section, we load and process the dataset into a form the model can use directly. First, let us load the image labels.

In [None]:
train_jpeg_dir, test_jpeg_dir, test_jpeg_additional, train_csv_file = data_helper.get_jpeg_data_files_paths()
labels_df = pd.read_csv(train_csv_file)
labels_df.head()

Each image can carry multiple tags. Let's list all the unique tags.

In [None]:
# Print all unique tags
from itertools import chain
labels_list = list(chain.from_iterable([tags.split(" ") for tags in labels_df['tags'].values]))
labels_set = set(labels_list)
print("There are {} unique labels: {}".format(len(labels_set), labels_set))
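
Because each image carries several tags, a multi-label model is usually trained on a multi-hot vector with one slot per label. The sketch below shows one way to build such vectors from tag strings. The helper name `encode_tags` and the sorted label order are assumptions made for illustration; `data_helper` may encode the labels differently.

```python
def encode_tags(tag_strings):
    """Turn space-separated tag strings into multi-hot vectors (illustrative helper)."""
    # Sort the labels so that each label always gets the same index
    labels = sorted({tag for tags in tag_strings for tag in tags.split(" ")})
    label_to_index = {label: i for i, label in enumerate(labels)}
    vectors = []
    for tags in tag_strings:
        vec = [0] * len(labels)
        for tag in tags.split(" "):
            vec[label_to_index[tag]] = 1
        vectors.append(vec)
    return labels, vectors


# Toy tag strings in the same format as the `tags` column
labels, vectors = encode_tags(["haze primary", "clear primary road", "cloudy"])
print(labels)
print(vectors)
```

Each row has a 1 at the index of every tag present in the image and a 0 elsewhere.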

### Distribution of each label

In [None]:
# Histogram of label instances
labels_s = pd.Series(labels_list).value_counts() # To sort them by count
fig, ax = plt.subplots(figsize=(16, 8))
sns.barplot(x=labels_s, y=labels_s.index, orient='h')

## Images
Visualize some image chips to see what we are dealing with.
Let's visualize one chip for each of the 17 labels to get a sense of their differences.

In [None]:
images_title = [labels_df[labels_df['tags'].str.contains(label)].iloc[i]['image_name'] + '.jpg' 
                for i, label in enumerate(labels_set)]

plt.rc('axes', grid=False)
_, axs = plt.subplots(5, 4, sharex='col', sharey='row', figsize=(15, 20))
axs = axs.ravel()

for i, (image_name, label) in enumerate(zip(images_title, labels_set)):
    img = mpimg.imread(train_jpeg_dir + '/' + image_name)
    axs[i].imshow(img)
    axs[i].set_title('{} - {}'.format(image_name, label))

# Image resize & validation split
Define the dimensions of the images fed to the network. Resizing the images to 32x32, 64x64, or 128x128 is recommended to speed up training.

You can also use `None` to keep the full-sized images.

Be careful: the higher the `validation_split_size`, the more RAM you will consume.

In [None]:
img_resize = (128, 128) # Target size of each image, e.g. (64, 64), or None to keep the original image size
validation_split_size = 0.2
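
To make the role of `validation_split_size` concrete, here is a minimal sketch of a shuffled train/validation split over sample indices. The function `split_indices` and its `seed` parameter are hypothetical. The actual split happens inside `AmazonPreprocessor`, whose implementation is not shown here and may differ.

```python
import random


def split_indices(n_samples, validation_split_size, seed=42):
    """Shuffle sample indices and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(n_samples * validation_split_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(n_samples=1000, validation_split_size=0.2)
print(len(train_idx), len(val_idx))
```

With a split of 0.2, one fifth of the samples is held out for validation and the rest is used for training.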

# Preprocess data 
Because the preprocessed images can take a huge amount of memory, we use a dedicated `AmazonPreprocessor` class whose job is to preprocess the data just in time at specific steps (training/inference), so that our RAM doesn't get completely filled by the preprocessed images.

The only exception is the validation dataset, which we need as-is for the F2 score calculation and for calculating the validation accuracy during training.

In [None]:
preprocessor = AmazonPreprocessor(train_jpeg_dir, train_csv_file, test_jpeg_dir, test_jpeg_additional, 
                                  img_resize, validation_split_size)
preprocessor.init()

In [None]:
print("X_train/y_train length: {}/{}".format(len(preprocessor.X_train), len(preprocessor.y_train)))
print("X_val/y_val length: {}/{}".format(len(preprocessor.X_val), len(preprocessor.y_val)))
print("X_test/X_test_filename length: {}/{}".format(len(preprocessor.X_test), len(preprocessor.X_test_filename)))
preprocessor.y_map
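
The just-in-time preprocessing described above is commonly implemented as a Python generator that loads one batch at a time. The sketch below illustrates the idea. `batch_generator` and `fake_load_image` are hypothetical names, and the real `AmazonPreprocessor.get_train_generator` may work differently (for example, it may also shuffle or augment the data).

```python
import numpy as np


def batch_generator(file_names, labels, batch_size, load_image):
    """Yield (images, labels) batches forever, loading images only when needed."""
    n = len(file_names)
    while True:  # Keras-style generators loop indefinitely
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            images = np.stack([load_image(name) for name in file_names[start:end]])
            yield images, np.asarray(labels[start:end])


def fake_load_image(name):
    """Stand-in loader: returns a blank 128x128 RGB image instead of reading a file."""
    return np.zeros((128, 128, 3), dtype=np.float32)


names = ["img_{}.jpg".format(i) for i in range(10)]
targets = np.zeros((10, 17), dtype=np.uint8)
gen = batch_generator(names, targets, batch_size=4, load_image=fake_load_image)
first_images, first_labels = next(gen)
print(first_images.shape, first_labels.shape)
```

Only one batch of images is held in memory at a time, which is what keeps RAM usage bounded.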

# Define and train the model

We are now ready to train our model. We use a predefined standard VGG16 architecture and fine-tune it.

In [None]:
model = vgg16.create_model(img_dim=(128, 128, 3))
model.summary()

## Fine-tune conv layers
We will now train all layers in the VGG16 model. 

In [None]:
history = History()
callbacks = [history, 
             EarlyStopping(monitor='val_loss', patience=3, verbose=1, min_delta=1e-4),
             ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=1, cooldown=0, min_lr=1e-7, verbose=1),
             ModelCheckpoint(filepath='weights/weights.best.hdf5', verbose=1, save_best_only=True, 
                             save_weights_only=True, mode='auto')]

X_train, y_train = preprocessor.X_train, preprocessor.y_train
X_val, y_val = preprocessor.X_val, preprocessor.y_val

batch_size = 128
train_generator = preprocessor.get_train_generator(batch_size)
steps = int(np.ceil(len(X_train) / batch_size))  # steps per epoch must be a whole number

model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics = ['accuracy'])
#previous epochs=25
history = model.fit_generator(train_generator, steps, epochs=2, verbose=1, 
                    validation_data=(X_val, y_val), callbacks=callbacks)
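
The `ModelCheckpoint` callback above writes to `weights/weights.best.hdf5`. Depending on the Keras version, saving may fail if the `weights` directory does not exist yet, so it can be worth creating it before training starts. A minimal sketch (the helper name `ensure_parent_dir` is illustrative):

```python
import os


def ensure_parent_dir(filepath):
    """Create the directory that will contain `filepath` if it is missing."""
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


checkpoint_path = "weights/weights.best.hdf5"
ensure_parent_dir(checkpoint_path)
```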

## Visualize Loss Curve

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

## Load Best Weights

In [None]:
model.load_weights("weights/weights.best.hdf5")
print("Weights loaded")

In [None]:
model.save('vgg16_trained.h5')
print("Model Saved")

## Evaluate the model

Finally, we are ready to evaluate the model using the two test sets.

## Check Fbeta Score

In [None]:
fbeta_score = vgg16.fbeta(model, X_val, y_val)

fbeta_score
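
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1+\beta^2)\,P\,R / (\beta^2 P + R)$, and with $\beta = 2$ it weights recall more heavily than precision. The implementation of `vgg16.fbeta` is not shown in this notebook, so the sketch below is only an illustration, assuming binary predictions and a score averaged over samples. The function name `fbeta_samples` is hypothetical.

```python
import numpy as np


def fbeta_samples(y_true, y_pred, beta=2.0):
    """Mean F-beta over samples for binary arrays of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=1)   # true positives per sample
    fp = np.sum(~y_true & y_pred, axis=1)  # false positives per sample
    fn = np.sum(y_true & ~y_pred, axis=1)  # false negatives per sample
    b2 = beta ** 2
    # Equivalent form of the formula: (1 + b2) * TP / ((1 + b2) * TP + b2 * FN + FP)
    denom = (1 + b2) * tp + b2 * fn + fp
    # Samples with no true and no predicted labels score 0 here (a convention choice)
    scores = np.where(denom > 0, (1 + b2) * tp / np.maximum(denom, 1), 0.0)
    return float(scores.mean())


y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 1, 1, 0], [0, 1, 0, 0]]
print(fbeta_samples(y_true, y_pred))
```

In the toy data, the first sample has one false positive and scores 10/11, while the second is predicted perfectly and scores 1, so the mean is 21/22.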

## Make predictions

In [None]:
predictions, x_test_filename = vgg16.predict(model, preprocessor, batch_size=128)
print("Predictions shape: {}\nFiles name shape: {}\n1st predictions ({}) entry:\n{}".format(predictions.shape, 
                                                                              x_test_filename.shape,
                                                                              x_test_filename[0], predictions[0]))

Before mapping our predictions to their labels, we need to choose a threshold for each class. Here we use the same value, 0.2, for every class.

In [None]:
thresholds = [0.2] * len(labels_set)
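
A fixed threshold is a simple starting point. One possible refinement, which this notebook does not use, is to tune each class threshold on validation predictions so that the mean F2 score improves. The sketch below is a greedy per-class search on synthetic data. The function names, the candidate grid, and the demo arrays are all assumptions for illustration.

```python
import numpy as np


def mean_f2(y_true, y_pred):
    """Mean F2 over samples for binary arrays of shape (n_samples, n_labels)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=1)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=1)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=1)
    denom = 5 * tp + 4 * fn + fp
    return float(np.mean(np.where(denom > 0, 5 * tp / np.maximum(denom, 1), 0.0)))


def search_thresholds(y_true, probs, start=0.2, candidates=None):
    """Greedily adjust one class threshold at a time, keeping only improvements."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)
    thresholds = np.full(probs.shape[1], start)
    best_score = mean_f2(y_true, (probs > thresholds).astype(int))
    for c in range(probs.shape[1]):
        for t in candidates:
            trial = thresholds.copy()
            trial[c] = t
            score = mean_f2(y_true, (probs > trial).astype(int))
            if score > best_score:  # accept a candidate only if it strictly helps
                best_score = score
                thresholds[c] = t
    return thresholds, best_score


# Synthetic validation labels and noisy "probabilities", for demonstration only
rng = np.random.default_rng(0)
y_val_demo = rng.integers(0, 2, size=(200, 3))
probs_demo = np.clip(y_val_demo * 0.6 + rng.random((200, 3)) * 0.5, 0, 1)

baseline = mean_f2(y_val_demo, (probs_demo > 0.2).astype(int))
thresholds_demo, tuned = search_thresholds(y_val_demo, probs_demo)
print(baseline, tuned, thresholds_demo)
```

Because a candidate is accepted only when it raises the score, the tuned score can never fall below the fixed-threshold baseline on the data used for the search. Thresholds tuned this way should be chosen on validation data, not on the test set.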

Now let's map our predictions to their tags using the thresholds.

In [None]:
predicted_labels = vgg16.map_predictions(preprocessor, predictions, thresholds)
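
Conceptually, this mapping keeps, for every image, the labels whose predicted probability exceeds the threshold of their class. The implementation of `vgg16.map_predictions` is not shown here, so the following is only a sketch of the idea. The function `map_to_tags` and the `index_to_label` dictionary are hypothetical.

```python
import numpy as np


def map_to_tags(probs, thresholds, index_to_label):
    """For each row of `probs`, return the labels whose probability exceeds its threshold."""
    probs = np.asarray(probs)
    thresholds = np.asarray(thresholds)
    return [
        [index_to_label[int(i)] for i in np.flatnonzero(row > thresholds)]
        for row in probs
    ]


# Toy mapping from output index to label name
index_to_label = {0: "clear", 1: "cloudy", 2: "primary"}
probs = [[0.9, 0.05, 0.7], [0.1, 0.8, 0.15]]
tags = map_to_tags(probs, [0.2, 0.2, 0.2], index_to_label)
print(tags)
```

The key point is that the index-to-label mapping must be the same one used to encode the training labels.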

Finally, let's assemble our batch predictions for the test dataset and inspect them.

In [None]:
tags_list = [None] * len(predicted_labels)
for i, tags in enumerate(predicted_labels):
    tags_list[i] = ' '.join(map(str, tags))

final_data = [[filename.split(".")[0], tags] for filename, tags in zip(x_test_filename, tags_list)]

In [None]:
final_df = pd.DataFrame(final_data, columns=['image_name', 'tags'])
print("Predictions rows:", len(final_df))  # len() counts rows; .size would count rows x columns
final_df.head()

## Test 

## Reload Model and Test Single Predictions

In [None]:
new_model = load_model('vgg16_trained.h5')
# Show the model architecture
new_model.summary()

In [None]:
image_name = 'train_1.jpg'
img = mpimg.imread(train_jpeg_dir + '/' + image_name)
plt.imshow(img)

In [None]:
img_reshape = np.expand_dims(img, axis=0) 

In [None]:
import cv2
img_reshape = np.asarray(Image.fromarray(img).convert("RGB"), dtype=np.float32)
img_reshape = cv2.resize(src = img_reshape, dsize=(128, 128))
img_reshape = np.expand_dims(img_reshape, axis=0) 
img_reshape.shape

In [None]:
prediction = new_model.predict(img_reshape)

In [None]:
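# NOTE: a set has no guaranteed order, so this list may not line up with the
# model's output indices; compare it with preprocessor.y_map before trusting the tags.
labels = list(labels_set)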
results = prediction[0] > 0.2
true_index_values = [i for i, x in enumerate(results) if x]
tags_results = [labels[x] for x in true_index_values]

In [None]:
tags_results