# Experiment 04: Amazon Planet (GPU version)

This experiment uses the data from the Kaggle competition [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/leaderboard). Here we use a pretrained ResNet50 model to generate the features from the dataset.

For details of the virtual machine we used and the versions of LightGBM and XGBoost, please refer to [experiment 1](01_airline.ipynb).

In [28]:
import json  # needed for json.dumps when printing the results
import sys, os
from collections import defaultdict
import numpy as np
import pkg_resources
from libs.loaders import load_planet_kaggle
from libs.planet_kaggle import threshold_prediction
from libs.timer import Timer
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tqdm import tqdm
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session, get_session

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))

System version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2


In [2]:
%env MOUNT_POINT=/datadrive

env: MOUNT_POINT=/datadrive


In [3]:
# Configure TF to use only one GPU; by default TF allocates memory on all GPUs
config = tf.ConfigProto(device_count = {'GPU': 1})
# Limit the fraction of GPU memory TF may use; by default TF takes all of it
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))

The images are loaded and featurised using a pretrained ResNet50 model available from Keras.

In [4]:
X_train, y_train, X_test, y_test = load_planet_kaggle()

Featurising training images: 100%|██████████| 1094/1094.0 [07:08<00:00,  2.56it/s]
Featurising validation images: 100%|██████████| 172/172.0 [01:06<00:00,  2.93it/s]


In [5]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(35000, 2048)
(35000, 17)
(5479, 2048)
(5479, 17)


## XGBoost 

We will use a one-vs-rest approach: we train one binary classifier per tag, and each classifier is responsible for determining whether its assigned tag applies to the image.

In [6]:
def train_and_validate_xgboost(params, train_features, train_labels, validation_features, num_boost_round):
    n_classes = train_labels.shape[1]
    y_val_pred = np.zeros((validation_features.shape[0], n_classes))
    time_results = defaultdict(list)
    for class_i in tqdm(range(n_classes)):
        dtrain = xgb.DMatrix(data=train_features, label=train_labels[:, class_i])
        dtest = xgb.DMatrix(data=validation_features)
        with Timer() as t:
            model = xgb.train(params, dtrain, num_boost_round=num_boost_round)
        time_results['train_time'].append(t.interval)
        
        with Timer() as t:
            y_val_pred[:, class_i] = model.predict(dtest)
        time_results['test_time'].append(t.interval)
        
    return y_val_pred, time_results
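
The training functions rely on `Timer` from `libs.timer`, whose source is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming only what the calls above require: a context manager that exposes the elapsed wall-clock seconds as `.interval` after the `with` block exits.

```python
import time


class Timer:
    """Hypothetical stand-in for libs.timer.Timer (an assumption, not the real source)."""

    def __enter__(self):
        # Record the start time when entering the `with` block.
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Store the elapsed seconds; returning False lets exceptions propagate.
        self.interval = time.perf_counter() - self._start
        return False


with Timer() as t:
    total = sum(range(100000))

print("Elapsed seconds: {:.6f}".format(t.interval))
```

Because the timer wraps only `xgb.train` or `model.predict`, building the `DMatrix` or `Dataset` objects is excluded from the reported times.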

In [7]:
def train_and_validate_lightgbm(params, train_features, train_labels, validation_features, num_boost_round):
    n_classes = train_labels.shape[1]
    y_val_pred = np.zeros((validation_features.shape[0], n_classes))
    time_results = defaultdict(list)
    for class_i in tqdm(range(n_classes)):
        lgb_train = lgb.Dataset(train_features, train_labels[:, class_i], free_raw_data=False)
        with Timer() as t:
            model = lgb.train(params, lgb_train, num_boost_round = num_boost_round)
        time_results['train_time'].append(t.interval)
        
        with Timer() as t:
            y_val_pred[:, class_i] = model.predict(validation_features)
        time_results['test_time'].append(t.interval)
        
    return y_val_pred, time_results

In [8]:
metrics_dict = {
    'Accuracy': accuracy_score,
    'Precision': lambda y_true, y_pred: precision_score(y_true, y_pred, average='samples'),
    'Recall': lambda y_true, y_pred: recall_score(y_true, y_pred, average='samples'),
    'F1': lambda y_true, y_pred: f1_score(y_true, y_pred, average='samples'),
}

def classification_metrics(metrics, y_true, y_pred):
    return {metric_name:metric(y_true, y_pred) for metric_name, metric in metrics.items()}
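
Since the labels form a multi-label indicator matrix, `accuracy_score` computes subset accuracy: a sample counts as correct only if all of its tags match. The sample-averaged F1 instead gives partial credit per image. The NumPy sketch below re-implements both on a toy example to show how far apart they can be; the function names and data are illustrative assumptions, and scoring a row with no true and no predicted tags as 0 is a choice made here.

```python
import numpy as np


def subset_accuracy(y_true, y_pred):
    # A row is correct only when every label in it matches.
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


def samples_f1(y_true, y_pred):
    # Per-row F1 = 2 * TP / (|true| + |pred|), then averaged over rows.
    tp = np.sum(y_true * y_pred, axis=1)
    denom = np.sum(y_true, axis=1) + np.sum(y_pred, axis=1)
    f1 = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 0.0)
    return float(np.mean(f1))


# Toy data: 3 images, 3 tags (illustrative only).
y_true_demo = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

print("Subset accuracy:", subset_accuracy(y_true_demo, y_pred_demo))
print("Sample-averaged F1:", samples_f1(y_true_demo, y_pred_demo))
```

Only the first row matches exactly, so subset accuracy is 1/3, while the other two rows still earn an F1 of 2/3 each. The same effect explains why the Accuracy values reported at the end of this notebook are much lower than the F1 values.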

In [9]:
results_dict = dict()
num_rounds = 50

Now we are going to define the different models.

In [10]:
xgb_params = {'max_depth':2, #'max_depth':6 
              'objective':'binary:logistic', 
              'min_child_weight':1, 
              'eta':0.1, 
              'scale_pos_weight':2, 
              'gamma':0.1, 
              'reg_lambda':1, 
              'subsample':1,
              'tree_method':'exact', 
              'updater':'grow_gpu'
             }

*NOTE: We got an out-of-memory error with the standard XGBoost GPU updater (`grow_gpu`). Please see the comments at the end of the notebook.*

In [None]:
y_pred, timing_results = train_and_validate_xgboost(xgb_params, X_train, y_train, X_test, num_boost_round=num_rounds)

In [None]:
results_dict['xgb']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}



Now let's try XGBoost with the histogram-based tree method.


In [12]:
xgb_hist_params = {'max_depth':0, 
                  'objective':'binary:logistic', 
                  'min_child_weight':1, 
                  'eta':0.1, 
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lambda':1, 
                  'subsample':1,
                  'tree_method':'hist', 
                  'max_leaves':2**6, 
                  'grow_policy':'lossguide',
                  'updater':'grow_gpu_hist'
                 }

In [13]:
y_pred, timing_results = train_and_validate_xgboost(xgb_hist_params, X_train, y_train, X_test, num_boost_round=num_rounds)


  0%|          | 0/17 [00:00<?, ?it/s]
100%|██████████| 17/17 [38:51<00:00, 127.44s/it]


In [14]:
results_dict['xgb_hist']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}
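
The models output a probability per tag, and `threshold_prediction` from `libs.planet_kaggle` turns these into binary tags before the metrics are computed. Its source is not shown here, so the sketch below is an assumption about its behaviour: every probability strictly above the threshold becomes 1. The function name `threshold_prediction_sketch` and the toy probabilities are hypothetical.

```python
import numpy as np


def threshold_prediction_sketch(pred_y, threshold=0.5):
    # Assumed behaviour: mark a tag as present when its probability exceeds the threshold.
    return (np.asarray(pred_y) > threshold).astype(np.uint8)


# Toy probabilities for 2 images and 3 tags (illustrative only).
probs_demo = np.array([[0.05, 0.40, 0.90],
                       [0.10, 0.02, 0.11]])
labels_demo = threshold_prediction_sketch(probs_demo, threshold=0.1)
print(labels_demo)
```

A low threshold such as 0.1 marks a tag as present even at modest confidence, which tends to raise recall at the expense of precision.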

## LightGBM 



In [21]:
lgb_params = {'num_leaves': 2**6,
             'learning_rate': 0.1,
             'scale_pos_weight': 2,
             'min_split_gain': 0.1,
             'min_child_weight': 1,
             'reg_lambda': 1,
             'subsample': 1,
             'objective':'binary',
             'device': 'gpu',
             'gpu_device_id':3,
             'task': 'train'
             }

In [29]:
import keras as K
def limit_mem():
    get_session().close()
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    set_session(tf.Session(config=cfg))

In [30]:
limit_mem()

In [22]:
y_pred, timing_results = train_and_validate_lightgbm(lgb_params, X_train, y_train, X_test, num_boost_round=num_rounds)


  0%|          | 0/17 [00:00<?, ?it/s]
 18%|█▊        | 3/17 [01:13<05:45, 24.69s/it]

KeyboardInterrupt: 

In [17]:
results_dict['lgbm']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}

Finally, we show the results.

In [18]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "Accuracy": 0.3719656871691915,
            "F1": 0.8218938627635469,
            "Precision": 0.7435246989225818,
            "Recall": 0.9733034790846435
        },
        "test_time": 0.2908596449851757,
        "train_time": 401.54383966495516
    },
    "xgb_hist": {
        "performance": {
            "Accuracy": 0.37871874429640445,
            "F1": 0.8220252909027159,
            "Precision": 0.7447899193746976,
            "Recall": 0.9720717197264013
        },
        "test_time": 0.20933848500135355,
        "train_time": 2261.7880187399714
    }
}
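
To make the timing comparison explicit, the short sketch below computes the ratio of XGBoost histogram times to LightGBM times. The numbers are copied from the output above; the variable names are illustrative.

```python
# Timings copied from the results printed above (seconds, summed over the 17 tags).
timings = {
    "lgbm": {"train_time": 401.54383966495516, "test_time": 0.2908596449851757},
    "xgb_hist": {"train_time": 2261.7880187399714, "test_time": 0.20933848500135355},
}

train_ratio = timings["xgb_hist"]["train_time"] / timings["lgbm"]["train_time"]
test_ratio = timings["xgb_hist"]["test_time"] / timings["lgbm"]["test_time"]

print("xgb_hist / lgbm train time: {:.2f}x".format(train_ratio))
print("xgb_hist / lgbm test time:  {:.2f}x".format(test_ratio))
```

In this run, LightGBM trains roughly 5.6 times faster than XGBoost histogram, with nearly identical F1, while XGBoost histogram is somewhat faster at prediction.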


This dataset has a large feature size: 2048. With the standard version of XGBoost (`xgb`), we get an out-of-memory error on an NVIDIA M60 GPU, even when we reduce the maximum tree depth to 2. One solution would be to reduce the feature size. One option is PCA. Another is a different featurizer: instead of ResNet50, whose last hidden layer has 2048 units, we could use VGG16, [also provided by Keras](https://github.com/fchollet/keras/blob/master/keras/applications/vgg16.py), whose convolutional base (without the fully connected top) yields 512 features after global pooling.
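
The PCA option can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not something that was run in this experiment: the helper names `fit_pca` and `apply_pca`, the random stand-in data, and the target of 16 components are all illustrative. `sklearn.decomposition.PCA` offers the same operation through `fit_transform` and `transform`.

```python
import numpy as np


def fit_pca(X, n_components):
    # Center the data and take the top right-singular vectors as principal directions.
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    # Project onto the directions learned from the training set only.
    return (X - mean) @ components.T


# Small random stand-ins for the (35000, 2048) and (5479, 2048) feature matrices.
rng = np.random.RandomState(0)
X_train_demo = rng.rand(200, 64)
X_test_demo = rng.rand(50, 64)

mean, components = fit_pca(X_train_demo, n_components=16)
X_train_small = apply_pca(X_train_demo, mean, components)
X_test_small = apply_pca(X_test_demo, mean, components)

print(X_train_small.shape, X_test_small.shape)
```

The projection is fitted on the training features and then reused for the test features, so no information from the test set leaks into the reduced representation. Whether the reduced features fit in GPU memory and preserve accuracy would need to be checked experimentally.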

