# Example usage of current code

In [None]:
import sys
import os
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingRegressor

sys.path.append("../..")
from earthquakes.engineering import sequence_generator, FeatureComputer, create_feature_dataset
from earthquakes.modeling import train_and_predict, cv_with_feature_computer, predict_on_test

Load data

In [None]:
pd.options.display.precision = 15
data_dir = "../data"

train = pd.read_csv(os.path.join(data_dir, "train.csv"),
                    dtype={"acoustic_data": np.int16, "time_to_failure": np.float64})

Replicate the work in the starter notebook with the functions from the `engineering` and `modeling` modules, this time with a slightly different model and a few more quantiles.
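For readers unfamiliar with quantile features, here is a minimal sketch of what computing them on one acoustic segment could look like. This is not the implementation of `FeatureComputer`: the helper name `quantile_features`, the `q_` key format, and the synthetic segment are all assumptions made for illustration.

```python
import numpy as np


def quantile_features(segment, quantiles):
    """Return one feature per requested quantile of a 1-D signal.

    Hypothetical helper for illustration; not part of the `earthquakes` package.
    """
    values = np.quantile(np.asarray(segment, dtype=np.float64), quantiles)
    return {"q_{}".format(q): float(v) for q, v in zip(quantiles, values)}


# Synthetic stand-in for a window of `acoustic_data` (made-up values).
demo_rng = np.random.default_rng(0)
demo_segment = demo_rng.integers(-100, 100, size=1000).astype(np.int16)

demo_features = quantile_features(demo_segment, [0.01, 0.05, 0.5, 0.95, 0.99])
print(demo_features)
```

Quantiles near 0 and 1 summarize the tails of the signal, which is presumably why the list used below is denser at both ends.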

In addition, the cross validation method can now predict on the test set at every fold when called with `predict_test=True` (the keyword used in the call below). In that case, the method returns a dataframe with the per-fold predictions on the test set alongside the cross validation scores. We can use these predictions for blending.

In [None]:
computer = FeatureComputer(quantiles=[0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99])

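# NOTE: `params` is defined here but is not passed to `cv_with_feature_computer`
# in the call below as written.
params = {
    "n_estimators": 1000,
    # "lad" was renamed to "absolute_error" in scikit-learn 1.0 and removed in 1.2.
    "loss": "lad",
    "verbose": 1,
}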

scores, test_predictions = cv_with_feature_computer(train, GradientBoostingRegressor, computer,
                                                    train_samples=5000, val_samples=1000,
                                                    predict_test=True, data_dir=data_dir)

print("Cross validation score: {}".format(np.mean(scores)))
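To make the idea of predicting on the test set at every fold concrete, here is a minimal, self-contained sketch. It is not the code of `cv_with_feature_computer`: the function `cv_with_test_predictions`, the stand-in `LeastSquaresRegressor`, the `fold_i` column names, the unshuffled contiguous folds, the mean absolute error metric, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class LeastSquaresRegressor:
    """Tiny stand-in model with a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])  # add an intercept column
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


def cv_with_test_predictions(X, y, X_test, seg_ids, model_cls, n_splits=5):
    """Return per-fold validation MAE and a frame of per-fold test predictions."""
    folds = np.array_split(np.arange(len(X)), n_splits)  # contiguous, unshuffled
    scores = []
    predictions = pd.DataFrame({"seg_id": seg_ids})
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = model_cls().fit(X[train_idx], y[train_idx])
        # Score this fold's model on the held-out fold (mean absolute error).
        val_error = np.abs(model.predict(X[val_idx]) - y[val_idx])
        scores.append(float(np.mean(val_error)))
        # Predict on the test set with this fold's model, one column per fold.
        predictions["fold_{}".format(i)] = model.predict(X_test)
    return scores, predictions


# Synthetic data, for illustration only.
demo_rng = np.random.default_rng(0)
demo_X = demo_rng.normal(size=(200, 3))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5]) + demo_rng.normal(scale=0.1, size=200)
demo_X_test = demo_rng.normal(size=(10, 3))
demo_seg_ids = ["seg_{}".format(i) for i in range(10)]

demo_scores, demo_predictions = cv_with_test_predictions(
    demo_X, demo_y, demo_X_test, demo_seg_ids, LeastSquaresRegressor)
print("Cross validation score: {}".format(np.mean(demo_scores)))
```

Each fold's model sees a different training subset, so the per-fold columns differ slightly, and that variation is what averaging can smooth out.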

Let's try blending by averaging over the predictions.

In [None]:
submission = test_predictions[["seg_id", "time_to_failure"]].copy()
submission["time_to_failure"] = test_predictions.drop("seg_id", axis=1).mean(axis=1)
submission.head(20)
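The averaging step can be checked by hand on a toy frame. A minimal sketch, assuming one prediction column per fold; the column names `fold_0` to `fold_2` and the values are made up for illustration and may differ from what `cv_with_feature_computer` actually returns.

```python
import pandas as pd

# Toy per-fold predictions; column names and values are illustrative only.
toy_predictions = pd.DataFrame({
    "seg_id": ["seg_a", "seg_b", "seg_c"],
    "fold_0": [1.0, 4.0, 6.0],
    "fold_1": [2.0, 5.0, 8.0],
    "fold_2": [3.0, 6.0, 10.0],
})

toy_submission = toy_predictions[["seg_id"]].copy()
# The row-wise mean over every prediction column gives the blended estimate.
toy_submission["time_to_failure"] = toy_predictions.drop("seg_id", axis=1).mean(axis=1)
print(toy_submission)
```

Every fold gets the same weight here. Weighting folds by their validation score would be another option, though that is not what this notebook does.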

In [None]:
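# Make sure the output directory exists before writing the submission file.
os.makedirs(os.path.join(data_dir, "submissions"), exist_ok=True)
submission.to_csv(os.path.join(data_dir, "submissions", "gradient_boosting_with_blending.csv"), index=False)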

__This notebook achieved 1.592 on the public leaderboard (again, without any tuning whatsoever).__ The only changes were simple blending, Gradient Boosting instead of Random Forest, and more training data in every fold (5000 samples vs 1000).