# Experiment 04: Amazon Planet

This experiment uses the data from the Kaggle competition [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/leaderboard). Here we use a pretrained ResNet50 model to generate the features from the dataset.

For details of virtual machine we used and the versions of LightGBM and XGBoost, please refer to [experiment 1](01_airline.ipynb).

In [1]:
import sys
from collections import defaultdict
import numpy as np
import pkg_resources
from libs.loaders import load_planet_kaggle
from libs.planet_kaggle import threshold_prediction
from libs.timer import Timer
from libs.utils import get_number_processors
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tqdm import tqdm

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))


Using TensorFlow backend.


System version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2


In [2]:
%env MOUNT_POINT=/datadrive

env: MOUNT_POINT=/datadrive


The images are loaded and featurised using a pretrained ResNet50 model available from Keras

In [3]:
X_train, y_train, X_test, y_test = load_planet_kaggle()

Featurising training images: 100%|██████████| 1094/1094.0 [36:44<00:00,  1.90s/it]
Featurising validation images: 100%|██████████| 172/172.0 [05:45<00:00,  1.59s/it]


In [11]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(35000, 2048)
(35000, 17)
(5479, 2048)
(5479, 17)


## XGBoost vs LightGBM benchmark

We will compare both libraries on speed and preformance.

In [4]:
number_processors = get_number_processors()
print("Number of processors: ", number_processors)

Number of processors:  24


We will use a one-v-rest. So each classifier will be responsible for determining whether the assigned tag applies to the image

In [5]:
def train_and_validate_xgboost(params, train_features, train_labels, validation_features, num_boost_round):
    n_classes = train_labels.shape[1]
    y_val_pred = np.zeros((validation_features.shape[0], n_classes))
    time_results = defaultdict(list)
    for class_i in tqdm(range(n_classes)):
        dtrain = xgb.DMatrix(data=train_features, label=train_labels[:, class_i])
        dtest = xgb.DMatrix(data=validation_features)
        with Timer() as t:
            model = xgb.train(params, dtrain, num_boost_round=num_boost_round)
        time_results['train_time'].append(t.interval)
        
        with Timer() as t:
            y_val_pred[:, class_i] = model.predict(dtest)
        time_results['test_time'].append(t.interval)
        
    return y_val_pred, time_results

In [28]:
def train_and_validate_lightgbm(params, train_features, train_labels, validation_features, num_boost_round):
    n_classes = train_labels.shape[1]
    y_val_pred = np.zeros((validation_features.shape[0], n_classes))
    time_results = defaultdict(list)
    for class_i in tqdm(range(n_classes)):
        dtrain = lgb.Dataset(train_features, train_labels[:, class_i], free_raw_data=False)
        with Timer() as t:
            model = lgb.train(params, dtrain, num_boost_round=num_boost_round)
        time_results['train_time'].append(t.interval)
        
        with Timer() as t:
            y_val_pred[:, class_i] = model.predict(validation_features)
        time_results['test_time'].append(t.interval)
        
    return y_val_pred, time_results

In [7]:
metrics_dict = {
    'Accuracy': accuracy_score,
    'Precision': lambda y_true, y_pred: precision_score(y_true, y_pred, average='samples'),
    'Recall': lambda y_true, y_pred: recall_score(y_true, y_pred, average='samples'),
    'F1': lambda y_true, y_pred: f1_score(y_true, y_pred, average='samples'),
}

def classification_metrics(metrics, y_true, y_pred):
    return {metric_name:metric(y_true, y_pred) for metric_name, metric in metrics.items()}

In [8]:
results_dict = dict()

In [18]:
xgb_params = {'max_depth':2**3, 
              'objective':'binary:logistic', 
              'min_child_weight':1, 
              'eta':0.1, 
              'colsample_bytree':0.80,
              'scale_pos_weight':2, 
              'gamma':0.1, 
              'reg_lamda':1, 
              'subsample':1,
              'nthread':number_processors
             }

In [19]:
y_pred, timing_results = train_and_validate_xgboost(xgb_params, X_train, y_train, X_test, num_boost_round=50)

100%|██████████| 17/17 [06:29<00:00, 22.55s/it]


In [20]:
results_dict['xgb']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}

In [21]:
xgb_hist_params = {'max_depth':0, 
                  'objective':'binary:logistic', 
                  'min_child_weight':1, 
                  'eta':0.1, 
                  'colsample_bytree':0.80,
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lamda':1, 
                  'subsample':1,
                  'nthread':number_processors,
                  'tree_method':'hist', 
                  'max_leaves':2**3, 
                  'grow_policy':'lossguide',
                  'max_bins': 63
                 }

In [22]:
y_pred, timing_results = train_and_validate_xgboost(xgb_hist_params, X_train, y_train, X_test, num_boost_round=50)

100%|██████████| 17/17 [07:25<00:00, 25.67s/it]


In [23]:
results_dict['xgb_hist']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}

In [25]:
lgbm_params = {'num_leaves': 2**3,
               'learning_rate': 0.1,
               'scale_pos_weight': 1,
               'min_split_gain': 0.1,
               'min_child_weight': 1,
               'reg_lambda': 1,
               'subsample': 1,
               'objective':'binary',
               'task': 'train',
               'max_bin': 63,
               'nthread': number_processors
               }

In [29]:
y_pred, timing_results = train_and_validate_lightgbm(lgbm_params, X_train, y_train, X_test, num_boost_round=50)

100%|██████████| 17/17 [01:31<00:00,  4.95s/it]


In [30]:
results_dict['lgbm']={
    'train_time': np.sum(timing_results['train_time']),
    'test_time': np.sum(timing_results['test_time']),
    'performance': classification_metrics(metrics_dict, 
                                          y_test, 
                                          threshold_prediction(y_pred, threshold=0.1)) 
}

In [31]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "Accuracy": 0.0007300602299689724,
            "F1": 0.6977013289256292,
            "Precision": 0.5577368804463164,
            "Recall": 0.9712049904831435
        },
        "test_time": 0.4870613479288295,
        "train_time": 90.77806165185757
    },
    "xgb": {
        "performance": {
            "Accuracy": 0.3907647380908925,
            "F1": 0.8317469810679177,
            "Precision": 0.7596291178149182,
            "Recall": 0.9687829722142554
        },
        "test_time": 0.14375658286735415,
        "train_time": 372.5387697040569
    },
    "xgb_hist": {
        "performance": {
            "Accuracy": 0.29731702865486404,
            "F1": 0.7734681033989623,
            "Precision": 0.6749294038078488,
            "Recall": 0.9825033243814043
        },
        "test_time": 0.19390580093022436,
        "train_time": 430.5022775421385
    }
}
