# Title:

### Planet: Understanding the Amazon from space -- *(a competition hosted on Kaggle)*
#### Use satellite data to track the human footprint in the Amazon rainforest

# Description:

Every minute, the world loses an area of forest the size of 48 football fields. And deforestation in the Amazon Basin accounts for the largest share, contributing to reduced biodiversity, habitat loss, climate change, and other devastating effects. But better data about the location of deforestation and human encroachment on forests can help governments and local stakeholders respond more quickly and effectively.

Planet, designer and builder of the world’s largest constellation of Earth-imaging satellites, will soon be collecting daily imagery of the entire land surface of the earth at 3-5 meter resolution. While considerable research has been devoted to tracking changes in forests, it typically depends on coarse-resolution imagery from Landsat (30 meter pixels) or MODIS (250 meter pixels). This limits its effectiveness in areas where small-scale deforestation or forest degradation dominate.

Furthermore, these existing methods generally cannot differentiate between human causes of forest loss and natural causes. Higher resolution imagery has already been shown to be exceptionally good at this, but robust methods have not yet been developed for Planet imagery.

In this competition, Planet and its Brazilian partner SCCON are challenging Kagglers to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond.

In [None]:
%matplotlib inline

In [None]:
# importing useful libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from skimage import io
import os
import tensorflow as tf
from tensorflow.keras.preprocessing import image

In [None]:
!pwd # checking current working directory

In [None]:
# loading in the training classes, it is a dataframe mapping of image name to tags
train_classes_df = pd.read_csv("/kaggle/input/planets-dataset/planet/planet/train_classes.csv")
print(train_classes_df.shape)
train_classes_df.head()

In [None]:
# let's check the color channels in a randomly selected image...say the image with image_name 'train_10.jpg'
train_img10 = image.load_img("/kaggle/input/planets-dataset/planet/planet/train-jpg/train_10.jpg")
train_img10.mode # checking the color channels 

In [None]:
!ls "/kaggle/input/planets-dataset/planet/planet/train-jpg/" | wc -l # checking the total number of images in the
                                                                     # training image folder

In [None]:
test1 = !ls "/kaggle/input/planets-dataset/planet/planet/test-jpg/" | wc -l # checking total number of images in 
                                                                            # testing images folder
float(test1[0])

In [None]:
# checking total number of images in the testing images additional folder
test_additional = !ls "/kaggle/input/planets-dataset/test-jpg-additional/test-jpg-additional/" | wc -l
float(test_additional[0])

In [None]:
# loading and checking the sample submission dataframe
sample_submission = pd.read_csv("/kaggle/input/planets-dataset/planet/planet/sample_submission.csv")
print(sample_submission.shape)
sample_submission.head()

In [None]:
# let's confirm that the sum of image files in the testing and testing-additional equals the number of images in
# the sample submission dataframe
assert sample_submission.shape[0] == float(test1[0]) + float(test_additional[0])
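
The same file counts can be obtained without shell commands. The sketch below is a minimal, portable alternative; ```count_images``` is a hypothetical helper (it is not part of this notebook), and it is demonstrated on a temporary directory rather than on the Kaggle folders.

```python
import tempfile
from pathlib import Path


def count_images(directory, suffix=".jpg"):
    """Count the files in `directory` whose extension is `suffix` (hypothetical helper)."""
    return sum(1 for p in Path(directory).iterdir() if p.is_file() and p.suffix == suffix)


# demo on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / "test_{}.jpg".format(i)).touch()
    (Path(tmp) / "notes.txt").touch()  # non-image file, should be ignored
    n_images = count_images(tmp)

print(n_images)
```

Because the helper returns an integer, the counts can be compared with ```sample_submission.shape[0]``` without casting to float.
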

### Let's find the unique tags in ```train_classes``` data
- First, we create a function that adds elements of a list to a Set
- Secondly, we apply this function to the ```tags``` column of ```train_classes``` after splitting its values to a list

In [None]:
unique_labels = set()
def append_labels(tags):
    for tag in tags.split():
        unique_labels.add(tag)

train_classes = train_classes_df.copy()
train_classes['tags'].apply(append_labels)
unique_labels = list(unique_labels) # casting 'unique_labels' as a list because set isn't an
                                    # indexed data structure

In [None]:
print(unique_labels)
len(unique_labels)

### Let's vectorize (one hot encode) the ```tags``` in ```train_classes``` using ```unique_labels``` 

In [None]:
# let's confirm that there is no image_name duplicate in the 'train_classes' dataframe
assert len(train_classes['image_name'].unique()) == train_classes.shape[0]

In [None]:
# let's do one hot encoding (vectorize) the labels in 'train_classes'
for tag in unique_labels:
    train_classes[tag] = train_classes['tags'].apply(lambda x: 1 if tag in x.split() else 0)
    
# adding '.jpg' extension to 'image_name'
train_classes['image_name'] = train_classes['image_name'].apply(lambda x: '{}.jpg'.format(x)) 
train_classes.head()
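
For comparison, pandas can produce the same kind of one hot encoding in a single call. The sketch below uses a made-up three-row dataframe (the image names and tags are illustrative only) and ```Series.str.get_dummies```, which splits on the separator and returns one 0/1 column per tag. Note that it orders the columns alphabetically, which may differ from the order of ```unique_labels```.

```python
import pandas as pd

# toy stand-in for train_classes (illustrative values only)
toy = pd.DataFrame({
    'image_name': ['train_0', 'train_1', 'train_2'],
    'tags': ['haze primary', 'clear primary water', 'cloudy'],
})

# one 0/1 column per distinct space-separated tag, columns sorted alphabetically
one_hot = toy['tags'].str.get_dummies(sep=' ')
toy_encoded = pd.concat([toy, one_hot], axis=1)
print(toy_encoded)
```
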

In [None]:
train_classes[unique_labels].sum().sort_values().plot.bar() # a bar chart of the number of images per tag

In [None]:
# creating a function that generates a co-occurrence ("concurrent") matrix: entry (i, j) is the number of images
# tagged with both tag i and tag j, and the diagonal holds the number of images carrying each tag
def get_concurrent_matrix(tags):
    concur_df = train_classes[tags]
    concur_matrix = concur_df.T.dot(concur_df)
    mask = np.triu(np.ones((len(tags), len(tags)))) # hides the diagonal and the redundant upper triangle
    sns.heatmap(concur_matrix, cmap=sns.cm.rocket_r, mask=mask)
    
    return concur_matrix
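
To see why the transpose-times-itself product counts overlaps, here is a small sketch on a made-up 0/1 table (four images, three tags; the values are illustrative only). Each off-diagonal entry counts the images carrying both tags, and each diagonal entry equals the column sum for that tag.

```python
import numpy as np
import pandas as pd

# toy one hot table: rows are images, columns are tags (illustrative values only)
toy = pd.DataFrame({
    'clear':   [1, 1, 0, 0],
    'haze':    [0, 0, 1, 0],
    'primary': [1, 1, 1, 0],
})

# (tags x images) dot (images x tags) -> (tags x tags) co-occurrence counts
co = toy.T.dot(toy)
print(co)

# the diagonal is the per-tag image count
print(np.diag(co), toy.sum().to_numpy())
```
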

In [None]:
# classifying the tags into the three categories of 'atmospheric condition', 'common land cover' and 
# 'rare land cover'
atm_condition_tags = ['clear', 'partly_cloudy','cloudy', 'haze']
common_land_cover_tags = ['primary', 'water', 'habitation', 'agriculture', 'road', 'cultivation', 'bare_ground']
rare_land_cover_tags = [tag for tag in unique_labels if (tag not in atm_condition_tags) and (tag not in \
                                                                                        common_land_cover_tags)]

In [None]:
# concurrent matrix of atmospheric condition tags
atm_tags_concurrent_matrix = get_concurrent_matrix(atm_condition_tags) 
atm_tags_concurrent_matrix

There is no overlap among the atmospheric condition tags: no image carries more than one of them.

In [None]:
get_concurrent_matrix(common_land_cover_tags) # concurrent matrix of common land cover tags

```'primary'``` and ```'agriculture'``` seem to have the most overlap amongst ```common_land_cover_tags```

In [None]:
get_concurrent_matrix(rare_land_cover_tags) # concurrent matrix of rare land cover

There is some overlap throughout, but ```'selective_logging'``` and ```'blooming'``` seem to have the most overlap amongst ```rare_land_cover_tags```

In [None]:
get_concurrent_matrix(unique_labels) # concurrent matrix of all tags

There is not much overlap amongst all tags, but ```'primary'``` and ```'clear'``` seem to have the most overlap

In [None]:
# let's check if indeed every image must have one atmospheric condition tag
total_atm_tags = np.matmul(np.array(atm_tags_concurrent_matrix), (np.ones((4, 1)))).sum()
print(total_atm_tags)
total_atm_tags == train_classes.shape[0]

In [None]:
# the above cell returned False: it seems exactly one image has no atmospheric condition tag.
# let's check it out
image_atm_tags_df = train_classes.loc[:, ['image_name']+atm_condition_tags] 
image_without_atm_df = image_atm_tags_df.loc[image_atm_tags_df.sum(axis=1) == 0]
image_without_atm_df

In [None]:
# let's view this image without any atmospheric condition
image_without_atm_name = image_without_atm_df.loc[24448, 'image_name']
image_without_atm = io.imread('/kaggle/input/planets-dataset/planet/planet/train-jpg/{}'.format( \
                                                                                    image_without_atm_name))
plt.imshow(image_without_atm)

In [None]:
# let's checkout the tags associated with this image above
train_classes_df[train_classes_df['image_name'] == image_without_atm_name[:-4]]

The data says it is water. Does it look like water? Perhaps it is dirty water, or simply random noise!

In [None]:
# let's view a sample image say 'train_10.jpg' 
image_number = 10
sample_img = io.imread('/kaggle/input/planets-dataset/planet/planet/train-jpg/train_{}.jpg'.format(image_number))
r, g, b = sample_img[:, :, 0], sample_img[:, :, 1], sample_img[:, :, 2]
sample_img.shape

In [None]:
fig = plt.figure()
fig.set_size_inches(12, 4)
for ind, (img, channel) in enumerate(((r, 'r'), (g, 'g'), (b, 'b'))):
    a = fig.add_subplot(1, 3, ind+1) # one row, three columns: one panel per channel
    a.set_title(channel)
    a.imshow(img)
    
# displaying the red, green and blue channels separately

In [None]:
plt.imshow(sample_img) # displaying all channels at once

In [None]:
y_col = list(train_classes.columns[2:]) # storing the tags column names as a variable

# initializing an image generator with some data augmentation
image_gen = tf.keras.preprocessing.image.ImageDataGenerator(rotation_range=45, horizontal_flip=True, \
                                            vertical_flip=True, zoom_range=0.2)

# loading images from dataframe
X = image_gen.flow_from_dataframe(dataframe=train_classes, \
        directory='/kaggle/input/planets-dataset/planet/planet/train-jpg/', x_col='image_name', y_col=y_col, \
       target_size=(128, 128), class_mode='raw', seed=1, batch_size=128)

In [None]:
# X is an iterable of 317 batches: 40479 / 128 is 316 remainder 31, so the first 316 batches hold 128
# images and labels each and the last batch holds 31.
# each image is of shape (128, 128, 3) and each label is of shape (17,)

# let's arbitrarily view an image
x109 = X[0][0][109] # first batch, images, image at index 109
y109 = X[0][1][109] # first batch, labels, label at index 109
print("each image's shape is {}".format(x109.shape))
print("each label's shape is {}".format(y109.shape))
print('we have {} batches'.format(len(X)))
print('each batch has {} images/labels'.format(X[0][0].shape[0]))
print('40479/128 is {:.2F}, so the last batch will have {} images/labels'.format(40479/128, X[316][0].shape[0]))
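
The batch arithmetic quoted in the comment above can be checked without loading any images. This is a small standalone sketch using only the two numbers from this notebook (40479 training images, batch size 128).

```python
import math

n_images, batch_size = 40479, 128

# number of full batches and the size of the final, partial batch
full_batches, remainder = divmod(n_images, batch_size)

# total number of batches, counting the partial one
n_batches = math.ceil(n_images / batch_size)

print(full_batches, remainder, n_batches)
```
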

In [None]:
plt.imshow(x109/255) # divided by 255 so the image can be displayed

In [None]:
# importing useful deep learning libraries

from tensorflow.keras.applications.vgg19 import VGG19
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

In [None]:
# defining a function to calculate fbeta score

def fbeta(ytrue, ypred, beta=2, threshold=0.2, epsilon=1e-7):
    # threshold is set low (0.2) to favour recall, since the F2 score weights recall more heavily than precision
    # epsilon is added to every denominator to avoid NaN values due to zero division
    
    beta_squared = float(beta)**2
    
    ytrue = tf.cast(ytrue, tf.float32) # casts ytrue as a float
    # threshold ypred to bool, then cast to float (1.0 = tag predicted, 0.0 = tag not predicted)
    ypred = tf.cast(tf.greater(tf.cast(ypred, tf.float32), tf.constant(threshold)), tf.float32) 
    
    # 2*ytrue + ypred encodes each outcome: 3 = true positive, 1 = false positive, 2 = false negative
    tp = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(3.0)), tf.float32), axis=1) 
    fp = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(1.0)), tf.float32), axis=1)
    fn = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(2.0)), tf.float32), axis=1)

    precision = tp / (tp + fp + epsilon) # epsilon guards against images with no predicted tags
    recall = tp / (tp + fn + epsilon)
    fb = (beta_squared+1) * precision * recall / (precision*beta_squared + recall + epsilon)
  
    return fb
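
As a sanity check on the metric, the sketch below recomputes the F-beta score for one made-up image in plain NumPy. ```fbeta_numpy``` is a hypothetical helper written for illustration (it is not the function used in training), and adding epsilon to the precision and recall denominators is an assumption of this sketch. With one false positive and no false negatives, F2 comes out higher than F1, which shows the recall bias.

```python
import numpy as np


def fbeta_numpy(ytrue, yprob, beta=2, threshold=0.2, epsilon=1e-7):
    """Per-image F-beta score for 0/1 labels and predicted probabilities (illustrative)."""
    ytrue = np.asarray(ytrue, dtype=float)
    ypred = (np.asarray(yprob, dtype=float) > threshold).astype(float)
    tp = np.sum(ytrue * ypred, axis=1)        # predicted and present
    fp = np.sum((1 - ytrue) * ypred, axis=1)  # predicted but absent
    fn = np.sum(ytrue * (1 - ypred), axis=1)  # present but missed
    precision = tp / (tp + fp + epsilon)
    recall = tp / (tp + fn + epsilon)
    b2 = float(beta) ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall + epsilon)


# one image, four tags: both true tags are found, plus one false positive (0.3 > 0.2)
ytrue = [[1, 1, 0, 0]]
yprob = [[0.9, 0.8, 0.3, 0.1]]

f2 = fbeta_numpy(ytrue, yprob)           # precision 2/3, recall 1 -> about 10/11
f1 = fbeta_numpy(ytrue, yprob, beta=1)   # about 0.8
print(f2, f1)
```
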

In [None]:
# creating a function to calculate multi-label accuracy 

def multi_label_acc(ytrue, ypred, threshold=0.2, epsilon=1e-7):
    # threshold is set to 0.2, the same threshold used in fbeta
    # epsilon is set to 1e-7 to avoid NaN values due to zero division
    
    ytrue = tf.cast(ytrue, tf.float32) # casts ytrue as a float
    # convert ypred to bool, then to float
    ypred = tf.cast(tf.greater(tf.cast(ypred, tf.float32), tf.constant(threshold)), tf.float32) 
    
    tp = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(3.0)), tf.float32), axis=1) 
    fp = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(1.0)), tf.float32), axis=1)
    fn = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(2.0)), tf.float32), axis=1)
    tn = tf.reduce_sum(tf.cast(tf.equal((2.0*ytrue + ypred), tf.constant(0.0)), tf.float32), axis=1)
    
    acc = (tp+tn) / (tp+fp+fn+tn+epsilon)  
    
    return acc
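
Both metric functions rely on the expression ```2.0*ytrue + ypred``` to tell the four outcomes apart. The sketch below illustrates that encoding in NumPy on a made-up image with four tags, one for each outcome.

```python
import numpy as np

ytrue = np.array([1., 1., 0., 0.])  # ground truth (illustrative)
ypred = np.array([1., 0., 1., 0.])  # thresholded predictions (illustrative)

# 3 = true positive, 2 = false negative, 1 = false positive, 0 = true negative
codes = 2.0 * ytrue + ypred

tp = int(np.sum(codes == 3))
fn = int(np.sum(codes == 2))
fp = int(np.sum(codes == 1))
tn = int(np.sum(codes == 0))

acc = (tp + tn) / codes.size
print(codes, (tp, fp, fn, tn), acc)
```
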

In [None]:
# creating a function to build a sequential model

def build_model():
    base_model = VGG19(include_top=False, weights='imagenet', input_shape=(128, 128, 3))
    model = Sequential()
    model.add(BatchNormalization(input_shape=(128, 128, 3)))
    model.add(base_model)
    model.add(Flatten())
    model.add(Dense(17, activation='sigmoid'))
    opt = Adam(learning_rate=1e-4) # 'lr' is a deprecated alias of 'learning_rate'
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[multi_label_acc, fbeta])
    
    return model

In [None]:
# initializing callbacks
#early_stopping = EarlyStopping(monitor='val_fbeta', patience=15, mode='max', verbose=1)
reduced_lr = ReduceLROnPlateau(monitor='val_fbeta', patience=3, cooldown=2, mode='max')
save_best_check_point = ModelCheckpoint(filepath='best_model3.hdf5', monitor='val_fbeta', \
                                        mode='max', save_best_only=True, save_weights_only=True)

In [None]:
# initializing an image data generator object with a validation split of 80:20
train_image_gen = tf.keras.preprocessing.image.ImageDataGenerator(rotation_range=180, horizontal_flip=True, \
                                            vertical_flip=True, validation_split=0.2)

# generating the 80% training image data
train_gen = train_image_gen.flow_from_dataframe(dataframe=train_classes, \
        directory='/kaggle/input/planets-dataset/planet/planet/train-jpg/', x_col='image_name', y_col=y_col, \
       target_size=(128, 128), class_mode='raw', seed=0, batch_size=128, subset='training')

# generating the 20% validation image data
val_gen = train_image_gen.flow_from_dataframe(dataframe=train_classes, \
        directory='/kaggle/input/planets-dataset/planet/planet/train-jpg/', x_col='image_name', y_col=y_col, \
       target_size=(128, 128), class_mode='raw', seed=0, batch_size=128, subset='validation')

In [None]:
# setting step size for training and validation image data
step_train_size = int(np.ceil(train_gen.samples / train_gen.batch_size))
step_val_size = int(np.ceil(val_gen.samples / val_gen.batch_size))

In [None]:
model1 = build_model() # building a sequential model for training

#model1.load_weights('../input/my-best-model2/best_model2.hdf5')
# fitting the model
model1.fit(x=train_gen, steps_per_epoch=step_train_size, validation_data=val_gen, validation_steps=step_val_size,
         epochs=50, callbacks=[reduced_lr, save_best_check_point] )

The model training lasted approximately 2 hours 30 minutes, with a best ```val_fbeta``` score of 0.925. Early stopping was triggered after the 29th epoch.

In [None]:
model2 = build_model() # building a sequential model for testing

#loading in the weights of the trained model
model2.load_weights('best_model3.hdf5')

In [None]:
# adding .jpg extension to 'image_name' in sample_submission data
sample_submission['image_name'] = sample_submission['image_name'].apply(lambda x: '{}.jpg'.format(x))
sample_submission.head()

In [None]:
# selecting the first 40669 'image_name'(s) from the sample submission dataframe to generate image data from 
# the test-jpg folder
test1_df = sample_submission.iloc[:40669]['image_name'].reset_index().drop('index', axis=1)
test1_df.head()

In [None]:
# initializing an image data generator object for the first 40669 images in the sample submission dataframe
test_image_gen1 = tf.keras.preprocessing.image.ImageDataGenerator()

# generating the image data for the first 40669 images in the sample submission dataframe
test_gen1 = test_image_gen1.flow_from_dataframe(dataframe=test1_df, \
            directory='../input/planets-dataset/planet/planet/test-jpg/', x_col='image_name', y_col=None, \
            batch_size=128, shuffle=False, class_mode=None, target_size=(128, 128))

# setting the step size for the testing set for the first 40669 images in the sample submission dataframe
step_test_size1 = int(np.ceil(test_gen1.samples / test_gen1.batch_size))

In [None]:
test_gen1.reset() # resetting the generator so that prediction starts from the first batch
pred1 = model2.predict(test_gen1, steps=step_test_size1, verbose=1) # predicts the first 40669 images in the 
                                                                    # sample submission dataframe

In [None]:
test_file_names1 = test_gen1.filenames # storing the filenames of the first 40669 images in the sample 
                                       # submission dataframe, in the order they were predicted
        
# converting the predictions of the first 40669 to tag names
pred_tags1 = pd.DataFrame(pred1)
pred_tags1 = pred_tags1.apply(lambda x: ' '.join(np.array(unique_labels)[x > 0.2]), axis=1)

# converting the predictions of the first 40669 to a dataframe
result1 = pd.DataFrame({'image_name': test_file_names1, 'tags': pred_tags1})
result1.head()
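
The conversion from predicted probabilities to tag strings relies on boolean indexing into the label array. The sketch below shows the idea on made-up probabilities for three labels; the label order and values are illustrative only, and the 0.2 threshold is the one used in this notebook.

```python
import numpy as np

labels = np.array(['clear', 'primary', 'water'])  # illustrative label order
probs = np.array([
    [0.90, 0.50, 0.10],   # image 0
    [0.10, 0.15, 0.30],   # image 1
])

# keep every label whose probability exceeds the threshold, joined by spaces
tags = [' '.join(labels[row > 0.2]) for row in probs]
print(tags)
```

This only works if the label array is in the same order as the columns the model was trained on.
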

In [None]:
# selecting the remaining 'image_name'(s) from the sample submission dataframe to generate image data from 
# the test-jpg-additional folder
test2_df = sample_submission.iloc[40669:]['image_name'].reset_index().drop('index', axis=1)
test2_df.head()

In [None]:
# initializing an image data generator object for the remaining images in the sample submission dataframe
test_image_gen2 = tf.keras.preprocessing.image.ImageDataGenerator()

# generating the image data for the remaining images in the sample submission dataframe
test_gen2 = test_image_gen2.flow_from_dataframe(dataframe=test2_df, \
            directory='../input/planets-dataset/test-jpg-additional/test-jpg-additional/', x_col='image_name', \
            y_col=None, batch_size=128, shuffle=False, class_mode=None, target_size=(128, 128))

# setting the step size for the testing set for the remaining images in the sample submission dataframe
step_test_size2 = int(np.ceil(test_gen2.samples / test_gen2.batch_size))

In [None]:
test_gen2.reset() # resetting the generator so that prediction starts from the first batch
pred2 = model2.predict(test_gen2, steps=step_test_size2, verbose=1) # predicts the remaining images in the 
                                                                    # sample submission dataframe

In [None]:
test_file_names2 = test_gen2.filenames # storing the filenames of the remaining images in the sample 
                                       # submission dataframe, in the order they were predicted
        
# converting the predictions of the remaining images to tag names
pred_tags2 = pd.DataFrame(pred2)
pred_tags2 = pred_tags2.apply(lambda x: ' '.join(np.array(unique_labels)[x > 0.2]), axis=1)

# converting the predictions of the remaining to a dataframe
result2 = pd.DataFrame({'image_name': test_file_names2, 'tags': pred_tags2})
result2.head()

In [None]:
final_result = pd.concat([result1, result2]) # concatenate the predictions for test-jpg and 
                                             # test-jpg-additional into a single dataframe
    
final_result = final_result.reset_index().drop('index', axis=1) # resetting the index of the dataframe so it 
                                                                # matches that of the sample submission dataframe

print(final_result.shape)
final_result.head()

In [None]:
# confirming that the predicted images are ordered as in sample submission dataframe
assert sum(sample_submission['image_name'] == final_result['image_name']) == 61191
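
The order check can also be written without hard-coding the row count. The sketch below is one possible variant on made-up file names (not the real submission data): it compares the two columns element by element and shows that a reordered column fails the check.

```python
import numpy as np
import pandas as pd

# illustrative stand-ins for the two 'image_name' columns
expected = pd.Series(['test_0.jpg', 'test_1.jpg', 'file_0.jpg'], name='image_name')
predicted = pd.Series(['test_0.jpg', 'test_1.jpg', 'file_0.jpg'], name='image_name')
shuffled = predicted.iloc[::-1].reset_index(drop=True)

# True only if both columns have the same length and the same values in the same order
same_order = np.array_equal(expected.to_numpy(), predicted.to_numpy())
wrong_order = np.array_equal(expected.to_numpy(), shuffled.to_numpy())
print(same_order, wrong_order)
```
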

In [None]:
# removing the .jpg extension from the 'image_name' column
final_result['image_name'] = final_result['image_name'].apply(lambda x: x[:-4])
final_result.head()

In [None]:
final_result.to_csv('sixth_submission.csv', index=False) # saving the predictions